We've come across incidents where we see something like this on the console of our NAS devices:

smbd: page allocation failure: order:4, mode:0xc0d0
Pid: 6443, comm: smbd Tainted: P C O 3.2.54 #1
Call Trace:
 [<ffffffff810b6345>] warn_alloc_failed+0xf5/0x140
 [<ffffffff810b7822>] ? drain_pages+0x32/0x90
 [<ffffffff810b78d0>] ? page_alloc_cpu_notify+0x50/0x50
 [<ffffffff810b78e1>] ? drain_local_pages+0x11/0x20
 [<ffffffff810b70c1>] __alloc_pages_nodemask+0x481/0x6d0
 [<ffffffff810e707f>] alloc_pages_current+0x7f/0xe0
 [<ffffffff810b5a99>] __get_free_pages+0x9/0x40
 [<ffffffff810ee3a8>] __kmalloc+0xe8/0xf0
 [<ffffffff811154e8>] getxattr+0x98/0x130
 [<ffffffff8110371d>] ? do_path_lookup+0x2d/0xc0
 [<ffffffff810eeb15>] ? kmem_cache_free+0x15/0x90
 [<ffffffff810ff92e>] ? putname+0x2e/0x40
 [<ffffffff8110453f>] ? user_path_at_empty+0x5f/0xa0
 [<ffffffff810a0ec0>] ? call_rcu_sched+0x10/0x20
 [<ffffffff8106d25a>] ? __put_cred+0x3a/0x50
 [<ffffffff81115654>] sys_getxattr+0x54/0x80
 [<ffffffff8105db54>] ? sys_setresgid+0x84/0x120
 [<ffffffff8169ced2>] system_call_fastpath+0x16/0x1b

...and a bunch of memory stats.

Our analysis:

1. Samba's vfs_acl_xattr algorithm for fetching an extended attribute is "try with a 1K buffer, and if that doesn't work, try again with 64K". This is presumably an optimistic strategy to save system calls.

2. In older Linux kernels (prior to 3.4), getxattr() required a physically contiguous (kmalloc'd) buffer whose size equals the maximum size requested by the user. So even though no Linux in-tree file system supports EAs larger than 4K, the getxattr() system call would try to allocate 64K if Samba passed that number.

3. On busy servers, memory gets fragmented over time, and that's the failure we've been seeing (note the "order:4" - it tried to allocate 2^4 pages, i.e. 64K, of physically contiguous memory). That causes the getxattr() to fail with ENOMEM.

4. In newer kernels (since commit 44c82498), the kmalloc is still tried first, but there is a fallback to vmalloc. I'm not sure why kmalloc first; maybe that's also an optimistic strategy.

5. So in newer kernels, the 64K request still incurs some extra allocation overhead, but at least it doesn't fail.

6. Since the initial kmalloc is done with GFP_KERNEL, I think there's a middle case where the kernel would try to evict pages, and that might cause a performance hit - all for memory that's not really required by any in-tree file system (only ZFS supports 64K EAs on Linux, AFAIK).

Bottom line: Samba works well with recent Linux kernels, but we might be taking a performance hit by asking for memory we probably don't need. We might want to do the initial getxattr() with 4K, or add a 4K step between the 1K and the 64K.
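The two-step fetch described in point 1 can be sketched as follows. This is illustrative Python, not Samba's actual C code; fake_getxattr is a made-up stand-in for the getxattr(2) system call, which fails with ERANGE when the caller's buffer is too small:

```python
import errno

def fake_getxattr(value, buf_size):
    """Stand-in for getxattr(2): return the attribute value if the
    caller's buffer is large enough, else fail with ERANGE (the real
    syscall's behavior for an undersized buffer)."""
    if buf_size < len(value):
        raise OSError(errno.ERANGE, "buffer too small")
    return value

def get_ea_two_step(value):
    """Samba-style optimistic fetch: try a 1K buffer first, and retry
    with 64K only if the attribute turns out to be larger.  The 64K
    retry is what pushed pre-3.4 kernels into an order-4 kmalloc."""
    for buf_size in (1024, 65536):
        try:
            return fake_getxattr(value, buf_size)
        except OSError as e:
            if e.errno != errno.ERANGE:
                raise
    raise OSError(errno.ERANGE, "attribute larger than 64K")
```

A small ACL blob is satisfied by the 1K attempt; anything between 1K and 64K costs exactly one extra call with the large buffer - and it is that large buffer, not the attribute's actual size, that old kernels tried to kmalloc.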
Some corrections:

1. Many file systems support EAs larger than 4K, even on Linux. XFS is one of them.

2. The Linux fallback to vmalloc for getxattr() came around kernel 3.6, in commit 779302e6.
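The proposed mitigation (an intermediate 4K attempt before jumping to 64K) only changes the list of buffer sizes tried. A minimal sketch, again using a hypothetical fake_getxattr stand-in rather than the real syscall:

```python
import errno

def fake_getxattr(value, buf_size):
    """Stand-in for getxattr(2): succeed only if the buffer fits,
    otherwise fail with ERANGE."""
    if buf_size < len(value):
        raise OSError(errno.ERANGE, "buffer too small")
    return value

def get_ea_stepped(value):
    """Proposed escalation: 1K -> 4K -> 64K.  Attributes up to 4K
    (the common case for NT ACL blobs on most file systems) are
    satisfied without ever asking the kernel for a 64K buffer."""
    for buf_size in (1024, 4096, 65536):
        try:
            return fake_getxattr(value, buf_size)
        except OSError as e:
            if e.errno != errno.ERANGE:
                raise
    raise OSError(errno.ERANGE, "attribute larger than 64K")
```

The trade-off is one extra ERANGE round trip for attributes between 1K and 4K, in exchange for never requesting the 64K buffer unless the attribute genuinely needs it.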
Created attachment 13157 [details] git-am fix for 4.6.next
Created attachment 13158 [details] git-am fix for 4.5.next The 4.5.next fix is identical to the master / 4.6 fix, except that the inclusion of the test in tests.py had to be tweaked.
Comment on attachment 13157 [details] git-am fix for 4.6.next LGTM.
Reassigning to Karolin for inclusion in 4.6.next, 4.5.next.
(In reply to Jeremy Allison from comment #5) Pushed to autobuild-v4-{6,5}-test.
(In reply to Karolin Seeger from comment #6) Pushed to both branches. Closing out bug report. Thanks!