Bug 6942 - Avoid fragmentation
Summary: Avoid fragmentation
Status: RESOLVED FIXED
Alias: None
Product: Samba 3.4
Classification: Unclassified
Component: File services
Version: 3.4.3
Hardware: Other Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Volker Lendecke
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-11-30 10:49 UTC by Olaf Flebbe
Modified: 2016-10-21 14:50 UTC
CC List: 5 users

See Also:


Attachments

Sample client program leading to fragmented files (730 bytes, text/x-csrc), 2009-11-30 10:50 UTC, Olaf Flebbe
proposed patch (originally from ja, added aio part) (9.28 KB, text/x-patch), 2009-11-30 10:52 UTC, Olaf Flebbe
Patch relative to 3-5-test (9.51 KB, patch), 2009-12-01 09:01 UTC, Olaf Flebbe
Patch relative to v3-4-test (9.46 KB, patch), 2009-12-01 09:02 UTC, Olaf Flebbe
Patch relative to 3-5-test (20.28 KB, patch), 2009-12-08 07:04 UTC, Olaf Flebbe
Patch relative to v3-4-test (20.27 KB, patch), 2009-12-08 09:01 UTC, Olaf Flebbe
Patch relative to 3-4-test cherry-pick 1 (5.42 KB, patch), 2010-02-10 04:18 UTC, Olaf Flebbe
cherrypick for 3-4-test 2 (2.12 KB, text/x-patch), 2010-02-10 04:19 UTC, Olaf Flebbe
cherrypick for 3-4-test 3 (1.75 KB, text/x-patch), 2010-02-10 04:19 UTC, Olaf Flebbe
cherrypick for 3-4-test 4 (3.21 KB, patch), 2010-02-10 04:19 UTC, Olaf Flebbe
3-4-test part5 new vfs call (10.75 KB, text/x-patch), 2010-02-10 09:11 UTC, Olaf Flebbe
Patch relative to v3-4-test (1.96 KB, text/plain), 2010-02-17 02:30 UTC, Olaf Flebbe
Patch relative to 3-5-test (1.96 KB, patch), 2010-02-17 03:14 UTC, Olaf Flebbe

Description Olaf Flebbe 2009-11-30 10:49:51 UTC
Hi,

While looking for performance problems, a customer of ours discovered

http://software.intel.com/en-us/articles/windows-client-cifs-behavior-can-slow-linux-nas-performance/

and a patch proposal from Jeremy:

http://lists.samba.org/archive/samba-technical/2008-December/062310.html

First I did some investigation, double-checked with Vista and iTunes,
and was able to reproduce the behavior described in the first article.
Please note that fragmentation is a performance issue for both reading
and writing; we should try to avoid it.

I double-checked with the client, and the files on the server are
heavily fragmented: a randomly picked file used over 650 fragments
(extents). BTW: you can check the fragmentation of a file on AIX with
"fileplacej2" and on Linux ext3 with "filefrag".

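For example (paths made up; plain invocations of the tools named above):

    # Linux, ext3:
    filefrag /srv/share/testfile.bin

    # AIX, JFS2:
    fileplacej2 /srv/share/testfile.bin
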
Now I wrote a small Windows client program (attached) that reproduces
the network pattern which leads to fragmentation.
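
For illustration, a minimal sketch of this kind of extend-then-backfill
pattern, assuming plain Win32 APIs (the attached program is the
authoritative reproducer; path, sizes and chunk ordering here are made
up, and error checking is omitted):

    #include <windows.h>

    int main(void)
    {
        /* One 64 KB chunk of payload; contents don't matter. */
        static char buf[64 * 1024];
        const LONGLONG total = 64 * 1024 * 1024;
        HANDLE h = CreateFileA("\\\\server\\share\\frag.bin",
                               GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                               FILE_ATTRIBUTE_NORMAL, NULL);
        LARGE_INTEGER off;
        DWORD written;
        LONGLONG o;
        int pass;

        /* 1. Extend the file to its final size without writing any
         *    data.  Without "strict allocate" the server creates a
         *    sparse file at this point. */
        off.QuadPart = total;
        SetFilePointerEx(h, off, NULL, FILE_BEGIN);
        SetEndOfFile(h);

        /* 2. Back-fill the data out of order: even chunks first,
         *    then odd ones.  The server-side filesystem allocates
         *    blocks in arrival order, so the file ends up heavily
         *    fragmented unless the space was preallocated. */
        for (pass = 0; pass < 2; pass++) {
            for (o = (LONGLONG)pass * sizeof(buf); o < total;
                 o += 2 * (LONGLONG)sizeof(buf)) {
                off.QuadPart = o;
                SetFilePointerEx(h, off, NULL, FILE_BEGIN);
                WriteFile(h, buf, sizeof(buf), &written, NULL);
            }
        }
        CloseHandle(h);
        return 0;
    }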

I adapted Jeremy's patch to 3.4.2 -- added the async I/O case -- and
did some microbenchmarking.

Runtime of the client against Samba on a slow AIX machine, as printed
by the client program:

Synchronous I/O:

    strict allocate = off              3.5 sec   ~800 fragments
    strict allocate = on (or partial)  1.3 sec   2-5 fragments

Async I/O (aio read size = 1, aio write size = 1):

    strict allocate = off              4.2 sec   ~800 fragments
    strict allocate = on (or partial)  1.4 sec   ~5 fragments
                                                 (depends on disk layout)
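
For reference, the fast configuration corresponds roughly to these
smb.conf settings (share name and path made up; the "partial" value
only exists with the proposed patch applied):

    [data]
        path = /export/data
        strict allocate = yes
        # force the async code paths even for small requests
        aio read size = 1
        aio write size = 1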

I assume that read performance would benefit drastically if the file is
not in core, but I do not have the resources to set up a benchmark
scenario.

I think this should be enough to justify a special patch handling the
patterns created by a crazy client networking stack.

I am not too sure about how to handle the async I/O case: I am quite
confident that there should be no other aio request on the same fsp,
possibly enlarging the file... Perhaps this can be done better.
Comment 1 Olaf Flebbe 2009-11-30 10:50:58 UTC
Created attachment 5027 [details]
Sample client program leading to fragmented files
Comment 2 Olaf Flebbe 2009-11-30 10:52:33 UTC
Created attachment 5028 [details]
proposed patch (originally from ja, added aio part)
Comment 3 Volker Lendecke 2009-11-30 10:54:01 UTC
Can you also try "write cache size" adapted to your write pattern?

Volker
Comment 4 Olaf Flebbe 2009-11-30 10:55:53 UTC
Have to try it out...

Olaf
Comment 5 Volker Lendecke 2009-11-30 14:06:41 UTC
Would it be possible for you to submit a patch after looking at README.Coding? No lines >= 80 columns, no trailing whitespace, etc. For example, in schedule_aio_write_and_X() there are also non-tab indents. Also, would it be possible to provide a patch for master/3.5 as well, not only for 3.4?

I would certainly do all these things myself, but it would be great to get rid of some work :-)

Regarding the logic: what blocks doing the strict allocate thing asynchronously? Can't you allocate the free space, fill it with zeros, and do the aio write on it?

Volker

Comment 6 Olaf Flebbe 2009-12-01 07:15:11 UTC
(In reply to comment #3)
> Can you also try "write cache size" adapted to your write pattern?
> 
> Volker
> 

"write cache size" (even though it is deprecated) gave me only a
marginal performance boost, if any at all. (I used a 1 MB cache size; I
cannot afford much more considering the number of concurrent
connections to be handled ;-)

3.3 sec
Comment 7 Volker Lendecke 2009-12-01 07:17:33 UTC
Hmmm. I'd have to look deeper, but I think I remember that "write cache size" and "aio write size" somehow conflict. 

Volker
Comment 8 Olaf Flebbe 2009-12-01 07:31:19 UTC
(In reply to comment #5)
> Would it be possible for you to submit a patch after looking at README.Coding?

Will rewrite the patch.

> in schedule_aio_write_and_X() there are also non-tab indents. Also, would it
> be possible to provide a patch for master/3.5 as well, not only for 3.4?

Yep.

> I would certainly do all these things myself, but it would be great to get
> rid of some work :-)
>
> Regarding the logic: what blocks doing the strict allocate thing
> asynchronously? Can't you allocate the free space, fill it with zeros, and
> do the aio write on it?

The point is that I have to fire off a couple of write requests:

I do not want to allocate one very large zero-filled buffer for writing
the fill, so I have to split it into chunks. That means firing off
several aio_writes and waiting for all of them to finish.

Only when the last of these writes has finished can samba send the
reply. If I remember correctly, right now every single aio_write is
bound to exactly one CIFS write_andx. I would have to change the
infrastructure to handle the termination of more than one aio_write.
Since I was not sure it had any chance of being included at all, I took
the "easy way" to get some performance figures.
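
A rough sketch of that batched variant, just to make the bookkeeping
concrete (function name and chunk size are made up, allocation checks
are trimmed; this is not the attached patch):

    #include <aio.h>
    #include <errno.h>
    #include <signal.h>
    #include <stdlib.h>

    #define FILL_CHUNK (64 * 1024)

    /* Zero-fill [start, end) with one aio_write per chunk, then
     * block until every submitted request has completed.  Only
     * after that could the CIFS reply be sent. */
    static int zero_fill_batched(int fd, off_t start, off_t end)
    {
        static char zeros[FILL_CHUNK];  /* shared all-zero buffer */
        size_t n = (end - start + FILL_CHUNK - 1) / FILL_CHUNK;
        struct aiocb *cb = calloc(n, sizeof(*cb));
        const struct aiocb **list = calloc(n, sizeof(*list));
        size_t i, submitted = 0;
        int ret = 0;

        for (i = 0; i < n; i++) {
            off_t off = start + (off_t)i * FILL_CHUNK;
            size_t len = FILL_CHUNK;

            if (off + (off_t)len > end)
                len = end - off;
            cb[i].aio_fildes = fd;
            cb[i].aio_buf = zeros;
            cb[i].aio_nbytes = len;
            cb[i].aio_offset = off;
            cb[i].aio_sigevent.sigev_notify = SIGEV_NONE;
            if (aio_write(&cb[i]) != 0) {
                ret = -1;
                break;
            }
            list[i] = &cb[i];
            submitted++;
        }

        /* Wait for everything that was actually submitted. */
        for (i = 0; i < submitted; i++) {
            while (aio_error(&cb[i]) == EINPROGRESS)
                aio_suspend(list, (int)submitted, NULL);
            if (aio_return(&cb[i]) < 0)
                ret = -1;
        }
        free(cb);
        free(list);
        return ret;
    }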

After all, there are a couple of async code paths that are handled
synchronously, so I decided it won't hurt if this special code path is
handled synchronously too.

BTW: the customer does not need async I/O for performance reasons. It
is needed because samba operates on an HSM filesystem.

What do you think: should I rewrite it using a batch of aio_writes? It
can be done, but it is not too easy to handle.
Comment 9 Olaf Flebbe 2009-12-01 07:34:49 UTC
(In reply to comment #7)
> Hmmm. I'd have to look deeper, but I think I remember that "write cache size"
> and "aio write size" somehow conflict. 
> 

Yep, it does conflict. It won't help me in production.
The timing above was taken with synchronous I/O for benchmarking.
Comment 10 Olaf Flebbe 2009-12-01 09:01:21 UTC
Created attachment 5032 [details]
Patch relative to 3-5-test
Comment 11 Olaf Flebbe 2009-12-01 09:02:03 UTC
Created attachment 5033 [details]
Patch relative to v3-4-test
Comment 12 Björn Jacke 2009-12-02 14:39:18 UTC
You might also want to give 95c18626107484d5d1d475e34fc4dde03cfe6ff5 a try on the AIX box.
Comment 13 Olaf Flebbe 2009-12-03 09:26:33 UTC
Björn,

thanks for pointing me to the posix_fallocate() call. I was definitely
looking for something like that.

I just double-checked that it works with AIX 5.3 and JFS2. Almost as
fast as the speed of light ;-)

On my Linux box it did quite a good emulation by itself (writing single
bytes with a 4 KB stride).

I do have to test it within samba.

Comment 14 Olaf Flebbe 2009-12-03 10:33:17 UTC
Hi,

while looking at Björn's patch: I want to propose adding a VFS function
fallocate.

When using partial allocation I cannot use the vfs_truncate interface,
since it would allocate the sparse regions as well.

I would like to support appending to sparse files:

X ....... XXXXX Y... Y...Y...Y
<- sparse ->    <- write ahead ->

Calling vfs_truncate would allocate the whole file. 
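
Something along these lines, as a sketch only (the hook name and its
wiring are assumptions, modeled on the existing ftruncate hook; the
attached patches are the authoritative version):

    /* Allocate only [offset, offset + len), leaving sparse regions
     * before 'offset' untouched. */
    static int vfswrap_fallocate(vfs_handle_struct *handle,
                                 files_struct *fsp,
                                 SMB_OFF_T offset, SMB_OFF_T len)
    {
    #ifdef HAVE_POSIX_FALLOCATE
        return posix_fallocate(fsp->fh->fd, offset, len);
    #else
        errno = ENOSYS;
        return -1;
    #endif
    }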

Volker and Björn: is there any chance to get a modification to the VFS layer into 3.4?

Olaf
Comment 15 Björn Jacke 2009-12-03 11:21:56 UTC
I'm not sure I fully understand what you mean. You mean you already have a (partly) sparse file and you only want to allocate space after the currently existing file (whether it's sparse or not)? In that case we might call posix_fallocate with offset equal to the current stat file size. Effectively that would allocate exactly the space that is allocated by the manual zeroing.
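
In code, that suggestion amounts to something like this (a minimal
sketch; the helper name is made up):

    #include <errno.h>
    #include <fcntl.h>
    #include <sys/stat.h>

    /* Allocate only the region beyond the current end of file.
     * Returns 0 on success, an error number otherwise (following
     * the posix_fallocate convention). */
    static int allocate_tail(int fd, off_t new_size)
    {
        struct stat st;

        if (fstat(fd, &st) != 0)
            return errno;
        if (new_size <= st.st_size)
            return 0;   /* nothing new to allocate */
        /* offset == current size: sparse regions before the old
         * EOF stay sparse */
        return posix_fallocate(fd, st.st_size, new_size - st.st_size);
    }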
Comment 16 Jeremy Allison 2009-12-03 11:38:20 UTC
Yep, POSIX_FALLOCATE needs adding to the VFS.

Jeremy.
Comment 17 Björn Jacke 2009-12-03 16:31:19 UTC
Olaf: http://pastie.org/726473 is what I mean...
Comment 18 Olaf Flebbe 2009-12-04 01:16:20 UTC
Björn: Surely, but the semantics of strict allocate is to allocate the whole file. (And it is only called for strict allocate = yes.) This would break the strict semantics -- just in case anyone needs them.

If we want to support the semantics of both partial and strict allocation, we would have to introduce a new function, or rename this function and introduce a new parameter. But I vote for a clean VFS interface to posix_fallocate used in the smbd code, rather than a tweak to ftruncate.

I will make a proposal using a new VFS function.
Comment 19 Björn Jacke 2009-12-08 03:47:55 UTC
Olaf: we actually only want to allocate "new" space; we don't care about existing space. Imagine you have a huge file and you want to enlarge it by some kilobytes: this would put a lot of work on the allocation routines, as all the "old" blocks would have to be checked for whether they are already allocated. The change from comment #17 is now in master (c8615b6a0). Maybe you want to test with that ...
Comment 20 Olaf Flebbe 2009-12-08 07:03:13 UTC
Björn: I am actually doing fallocate only for "new" blocks. 
Comment 21 Olaf Flebbe 2009-12-08 07:04:14 UTC
Created attachment 5065 [details]
Patch relative to 3-5-test
Comment 22 Olaf Flebbe 2009-12-08 09:01:10 UTC
Created attachment 5066 [details]
Patch relative to v3-4-test
Comment 23 Olaf Flebbe 2010-02-10 04:18:46 UTC
Created attachment 5312 [details]
Patch relative to 3-4-test cherry-pick 1
Comment 24 Olaf Flebbe 2010-02-10 04:19:16 UTC
Created attachment 5313 [details]
cherrypick for 3-4-test 2
Comment 25 Olaf Flebbe 2010-02-10 04:19:32 UTC
Created attachment 5314 [details]
cherrypick for 3-4-test 3
Comment 26 Olaf Flebbe 2010-02-10 04:19:47 UTC
Created attachment 5315 [details]
cherrypick for 3-4-test 4
Comment 27 Olaf Flebbe 2010-02-10 09:11:06 UTC
Created attachment 5317 [details]
3-4-test part5 new vfs call
Comment 28 Volker Lendecke 2010-02-13 10:12:51 UTC
Quick question -- did you try compiling the vfs_full_audit module with this change? I don't see the definition of the SMB_VFS_NEXT_FALLOCATE macro that is used there. Also, in the last patch there are a few whitespace changes that could be removed if possible. And if you use "git format-patch --stdout > filename" you can put all the patches in a single file here; easier to git am :-)
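
For reference, the suggested workflow would be something like (branch
name made up):

    git format-patch --stdout v3-4-test..my-fallocate-work > bug6942.patches.txt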

Thanks,

Volker
Comment 29 Olaf Flebbe 2010-02-15 07:43:35 UTC
New patches do not show the expected performance ... searching for the bug. The patches are _not_ ready to be committed.
Comment 30 Olaf Flebbe 2010-02-17 02:30:27 UTC
Created attachment 5359 [details]
Patch relative to v3-4-test
Comment 31 Olaf Flebbe 2010-02-17 02:38:55 UTC
After I did all the work to actually use fallocate() for filling up the preallocation requests of the CIFS client, I found out that it is not suitable for production performance-wise, compared to the simple "strict allocate = on" strategy. The essential parts for fallocate() are in place in 3.5.x (thanks to Björn). There is no point in merging it into 3.4.x.

Now I need "strict allocate = on" even for async I/O, so I added a fallback to synchronous writes.

Please commit the patch to v3-4, because the performance implications are tremendous.

Comment 32 Olaf Flebbe 2010-02-17 03:14:00 UTC
Created attachment 5360 [details]
Patch relative to 3-5-test

Same for 3.5.x
Comment 33 Björn Jacke 2010-02-23 07:41:38 UTC
The fallocate is, in the non-async case too, only done for ftruncates to a given size, not for any write beyond the current end of file. AFAIK ftruncates aren't handled differently in the aio case. So I think falling back to a sync write whenever a write enlarges a file is not what we want to do. Volker, can you please have a look and comment on this, too?
Comment 34 Björn Jacke 2016-10-21 14:50:01 UTC
I think we can consider this to be fixed.