Hi,

while looking for performance problems, a customer of ours discovered http://software.intel.com/en-us/articles/windows-client-cifs-behavior-can-slow-linux-nas-performance/ and a patch proposal from Jeremy: http://lists.samba.org/archive/samba-technical/2008-December/062310.html

First I did some investigation, double checked with Vista and iTunes, and was able to reproduce the facts from the first article. Please note that fragmentation is a performance issue for both reading and writing; we should try to avoid it. I double checked with the client: the files on the server are heavily fragmented. A randomly picked file used over 650 fragments (extents).

BTW: you can check the fragmentation of a file on AIX with "fileplacej2" and on Linux ext3 with "filefrag".

Then I wrote a small Windows client program (attached) which reproduces the network pattern that leads to the fragmentation. I adapted Jeremy's patch to 3.4.2 -- added the async I/O case -- and did some microbenchmarking. The runtimes below are for the client against Samba on a slow AIX machine, as printed by the client program:

Synchronous I/O:
  strict allocate = off             3.5 sec   ~800 fragments
  strict allocate = on or partial   1.3 sec   2-5 fragments

Async I/O (aio read size = 1, aio write size = 1):
  strict allocate = off             4.2 sec   ~800 fragments
  strict allocate = on or partial   1.4 sec   ~5 fragments (depends on disk layout)

I assume that read performance would benefit drastically if the file is not in core, but I do not have the resources to set up a benchmark scenario.

I think this should be enough to justify a special patch handling the patterns created by a crazy client networking stack. I am not too sure about how to handle the async I/O case: I am quite confident that there should be no other aio request on the same fsp that could enlarge the file... Perhaps this can be done better.
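For reference, the share settings for the async runs looked roughly like this (share name and path are placeholders; "partial" is not a stock smb.conf value for strict allocate, so I assume it comes from the adapted patch):

  [export]
      path = /export/data
      # only set for the async I/O runs
      aio read size = 1
      aio write size = 1
      # toggled per run: no / yes (or the "partial" mode from the patch)
      strict allocate = yes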
Created attachment 5027 [details] Sample client program leading to fragmented files
Created attachment 5028 [details] proposed patch (originally from ja, added aio part)
Can you also try "write cache size" adapted to your write pattern? Volker
Have to try it out... Olaf
Would it be possible that you submit a patch after looking at README.Coding? No lines >= 80 cols, no trailing whitespace, etc. For example, in schedule_aio_write_and_X() there are also non-tab indents. Also, would it be possible to provide a patch for master/3.5, not only for 3.4? I would certainly do all these things, but it would be great to get rid of some work :-) Regarding the logic: What blocks doing the strict allocate thing asynchronously? Can't you allocate the free space, fill it with zeros and do the aio write on it? Volker
(In reply to comment #3)
> Can you also try "write cache size" adapted to your write pattern?
>
> Volker

"write cache size" (even though it is deprecated) gave me only a marginal performance boost, if any at all: 3.3 sec. (I used a 1 MB cache size; I cannot afford much more considering the number of concurrent connections to be handled ;-)
Hmmm. I'd have to look deeper, but I think I remember that "write cache size" and "aio write size" somehow conflict. Volker
(In reply to comment #5)
> Would it be possible that you submit a patch after looking at README.Coding?

Will rewrite the patch.

> in schedule_aio_write_and_X() there are also non-tab indents. Also, would it be
> possible to also provide a patch for master/3.5, not only for 3.4?

Yep.

> I would certainly do all these things, but it would be great to get rid of some
> work :-)
>
> Regarding the logic: What blocks doing the strict allocate thing
> asynchronously? Can't you allocate the free space, fill it with zeros and do
> the aio write on it?

The point is that I would have to fire up a couple of write requests: I do not want to allocate one very large zero-filled buffer for the fill, so I have to split it into chunks. That means firing up several aio_writes and waiting for all of them to finish; only when the last of them has finished can Samba send the reply. If I remember correctly, right now every aio_write is bound to exactly one CIFS write_andx, so I would have to change the infrastructure to handle the completion of more than one aio_write per request.

Since I was not sure it had any chance of being included at all, I took the "easy way" to get some performance figures. After all, there are already a couple of async code paths that are handled synchronously, so I decided it would not hurt if this special code path is handled synchronously too.

BTW: the customer does not need async I/O for performance reasons; it is needed because Samba operates on an HSM filesystem.

What do you think: should I rewrite it using a batch of aio_writes? It can be done, but it is not easy to handle.
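To illustrate what I mean by splitting the fill into chunks, here is a rough sketch of the idea (not the attached patch; the 64 KB chunk size and the helper name are made up):

  #include <sys/types.h>
  #include <unistd.h>
  #include <errno.h>

  #define FILL_CHUNK (64 * 1024)    /* arbitrary chunk size for this sketch */

  /* Zero-fill the gap between the current end of file and the offset of the
   * incoming write, in chunks, using plain synchronous pwrite(). */
  static int fill_gap_sync(int fd, off_t old_eof, off_t write_offset)
  {
          static const char zeros[FILL_CHUNK];    /* zero-initialised */
          off_t pos = old_eof;

          while (pos < write_offset) {
                  size_t n = (size_t)(write_offset - pos);
                  if (n > sizeof(zeros)) {
                          n = sizeof(zeros);
                  }
                  ssize_t w = pwrite(fd, zeros, n, pos);
                  if (w == -1) {
                          return errno;
                  }
                  pos += w;
          }
          return 0;
  }

Doing the same asynchronously would mean queuing one aio_write per chunk and only replying to the client once every one of them has completed.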
(In reply to comment #7)
> Hmmm. I'd have to look deeper, but I think I remember that "write cache size"
> and "aio write size" somehow conflict.

Yep, it does conflict, so it won't help me in production. The timing above was taken with synchronous I/O anyway, just for the benchmark.
Created attachment 5032 [details] Patch relative to 3-5-test
Created attachment 5033 [details] Patch relative to v3-4-test
You might also want to give commit 95c18626107484d5d1d475e34fc4dde03cfe6ff5 a try on the AIX box.
Björn, thanks for pointing to the posix_fallocate() call; that is definitely what I was looking for. I just double checked that it works with AIX 5.3 and JFS2 -- almost as fast as the speed of light ;-) On my Linux box it did a quite good emulation by itself (writing single bytes with a 4 KB stride). I still have to test it within Samba.
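For anyone who wants to repeat the quick check outside of Samba, a stand-alone sketch of how the call can be exercised (file name and size are placeholders, not what I actually used):

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("testfile", O_CREAT | O_RDWR, 0644);
          if (fd == -1) {
                  perror("open");
                  return 1;
          }
          /* Allocate 1 GiB up front.  posix_fallocate() returns an error
           * number instead of setting errno; 0 means the blocks have been
           * allocated.  On filesystems without native support, glibc falls
           * back to an emulation that writes into every block. */
          int ret = posix_fallocate(fd, 0, (off_t)1 << 30);
          if (ret != 0) {
                  fprintf(stderr, "posix_fallocate: %s\n", strerror(ret));
          }
          close(fd);
          return ret != 0;
  }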
Hi, while looking at Björn's patch: I have to propose adding a VFS function fallocate. When using partial allocation I cannot use the vfs_truncate interface, since it would also allocate the sparse parts of the file. I would like to support appending to sparse files:

  X ....... XXXXX Y... Y...Y...Y
    <- sparse ->  <- write ahead ->

Calling vfs_truncate would allocate the whole file.

Volker and Björn: is there any chance to get a modification to the VFS layer into 3.4?

Olaf
I'm not sure I fully understand what you mean. You mean you already have a (partly) sparse file and you only want to allocate space after the currently existing end of file (whether the existing part is sparse or not)? In that case we could call posix_fallocate() with the offset equal to the current stat file size. Effectively that would allocate exactly the space that is allocated by the manual zeroing.
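A minimal sketch of that idea (illustrative only; the helper name is made up):

  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <errno.h>

  /* Allocate only the range beyond the current end of file, leaving any
   * existing sparse regions untouched. */
  static int allocate_beyond_eof(int fd, off_t new_eof)
  {
          struct stat st;

          if (fstat(fd, &st) == -1) {
                  return errno;
          }
          if (new_eof <= st.st_size) {
                  return 0;       /* nothing new to allocate */
          }
          /* posix_fallocate() returns an error number, it does not set errno */
          return posix_fallocate(fd, st.st_size, new_eof - st.st_size);
  }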
Yep, POSIX_FALLOCATE needs adding to the VFS. Jeremy.
Olaf: http://pastie.org/726473 is what I mean...
Björn: Surely, but the semantics of strict_allocate is to allocate the whole file (and it is only called for strict allocate = yes). This would break the strict semantics -- just in case anyone relies on them. If we want to support the semantics of both partial and strict allocation, we would have to introduce a new function, or rename this function and introduce a new parameter. But I vote for a clean VFS interface to posix_fallocate used in the smbd code, rather than a tweak to ftruncate. I will make a proposal using a new VFS function.
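Roughly the shape I have in mind for the new hook (a sketch only; the name, signature and configure guard here are my assumptions, not a final interface):

  /* New entry in the VFS operations table, next to ftruncate: */
  int (*fallocate)(struct vfs_handle_struct *handle, struct files_struct *fsp,
                   SMB_OFF_T offset, SMB_OFF_T len);

  /* Default POSIX backend implementation: */
  static int vfswrap_fallocate(vfs_handle_struct *handle, files_struct *fsp,
                               SMB_OFF_T offset, SMB_OFF_T len)
  {
  #ifdef HAVE_POSIX_FALLOCATE
          int ret = posix_fallocate(fsp->fh->fd, offset, len);
          if (ret != 0) {
                  errno = ret;    /* posix_fallocate() returns an error number */
                  return -1;
          }
          return 0;
  #else
          errno = ENOSYS;
          return -1;
  #endif
  }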
Olaf: we actually only want to allocate "new" space; we don't care about existing space. Imagine you have a huge file and you want to enlarge it by some kilobytes: allocating the whole file would put a lot of work on the allocation routines, as all the "old" blocks would have to be checked for whether they are already allocated. The change from comment #17 is now in master (c8615b6a0). Maybe you want to test with that ...
Björn: I am actually doing fallocate only for "new" blocks.
Created attachment 5065 [details] Patch relative to 3-5-test
Created attachment 5066 [details] Patch relative to v3-4-test
Created attachment 5312 [details] Patch relative to 3-4-test cherry-pick 1
Created attachment 5313 [details] cherrypick for 3-4-test 2
Created attachment 5314 [details] cherrypick for 3-4-test 3
Created attachment 5315 [details] cherrypick for 3-4-test 4
Created attachment 5317 [details] 3-4-test part5 new vfs call
Quick question -- did you try compiling the vfs_full_audit module with this change? I don't see the definition of the SMB_VFS_NEXT_FALLOCATE macro that is used there. Also, in the last patch there are a few whitespace changes that could be removed if possible. And if you use "git format-patch --stdout > filename" you can attach those patches as a single file here, which is easier to git am :-) Thanks, Volker
The new patches do not show the expected performance ... still searching for the bug. The patches are _not_ ready to be committed.
Created attachment 5359 [details] Patch relative to v3-4-test
After I did all the work to actually use fallocate() for filling up the preallocation requests of the CIFS client, I found out that it is not suitable for production performance-wise compared to the simple "strict allocate = on" strategy. The essential parts for fallocate() are in place in 3.5.x (thanks to Björn); there is no point in merging them into 3.4.x.

Now I need "strict allocate = on" even for async I/O, so I added a fallback to synchronous writes. Please commit the patch to v3-4, because the performance implications are tremendous.
Created attachment 5360 [details] Patch relative to 3-5-test Same for 3.5.x
The fallocate is also, in the non-async case, only done for ftruncates to a given size, not for any write beyond the current end of file. AFAIK ftruncates aren't handled differently in the aio case, so I think falling back to a sync write whenever a write enlarges a file is not what we want to do. Volker, can you please have a look and comment on this, too?
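To spell out the distinction with a trivial stand-alone example (not Samba code; file name and sizes are arbitrary) -- per the point above, the allocation logic hooks into the explicit size change, while a write past the end of file grows the file without going through that path:

  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
          char buf[4096] = {0};
          int fd = open("demo", O_CREAT | O_RDWR, 0644);
          if (fd == -1) {
                  return 1;
          }

          /* 1) Explicit size change: the client sets the end of file, the
           *    server ends up in ftruncate(), and that is where the
           *    fallocate-based allocation is applied in the non-async case. */
          if (ftruncate(fd, 1024 * 1024) == -1) {
                  return 1;
          }

          /* 2) Implicit growth: a write beyond the current end of file
           *    extends the file without any ftruncate() involved, so it is
           *    not covered by the ftruncate-based allocation. */
          if (pwrite(fd, buf, sizeof(buf), 16 * 1024 * 1024) == -1) {
                  return 1;
          }

          close(fd);
          return 0;
  }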
I think we can consider this to be fixed.