This is the dmesg from the kernel: Oct 10 08:35:22 vs-01 kernel: [85795.105056] BUG: unable to handle kernel paging request at 0000000700000007 Oct 10 08:35:22 vs-01 kernel: [85795.105096] IP: [<ffffffff8032d6ad>] strnlen+0x11/0x1a Oct 10 08:35:22 vs-01 kernel: [85795.105128] PGD 410148067 PUD 0 Oct 10 08:35:22 vs-01 kernel: [85795.105154] Oops: 0000 [1] SMP Oct 10 08:35:22 vs-01 kernel: [85795.105180] CPU 1 Oct 10 08:35:22 vs-01 kernel: [85795.105202] Modules linked in: fuse bnx2 nls_utf8 cifs nls_base nfs nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ac bat Oct 10 08:35:22 vs-01 kernel: [85795.105524] Pid: 22828, comm: nautilus Not tainted 2.6.26-bpo.1-vserver-amd64 #1 Oct 10 08:35:22 vs-01 kernel: [85795.105569] RIP: 0010:[<ffffffff8032d6ad>] [<ffffffff8032d6ad>] strnlen+0x11/0x1a Oct 10 08:35:22 vs-01 kernel: [85795.105617] RSP: 0018:ffff8104409e9be0 EFLAGS: 00010297 Oct 10 08:35:22 vs-01 kernel: [85795.105644] RAX: 0000000700000007 RBX: 0000000000000000 RCX: 00000000ffffffff Oct 10 08:35:22 vs-01 kernel: [85795.105674] RDX: 0000000700000007 RSI: fffffffffffffffe RDI: 0000000700000007 Oct 10 08:35:22 vs-01 kernel: [85795.105705] RBP: ffff81044795f28d R08: 00000000ffffffff R09: 0000000000000000 Oct 10 08:35:22 vs-01 kernel: [85795.105735] R10: ffffffffffffffff R11: ffff8104b8dde8c0 R12: 0000000700000007 Oct 10 08:35:22 vs-01 kernel: [85795.105766] R13: 00000000ffffffff R14: ffff810447960000 R15: ffff810447960000 Oct 10 08:35:22 vs-01 kernel: [85795.105797] FS: 00007f473c0b66d0(0000) GS:ffff8104bcc9d840(0000) knlGS:0000000000000000 Oct 10 08:35:22 vs-01 kernel: [85795.105843] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Oct 10 08:35:22 vs-01 kernel: [85795.105870] CR2: 0000000700000007 CR3: 00000004409e3000 CR4: 00000000000006e0 Oct 10 08:35:22 vs-01 kernel: [85795.105901] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Oct 10 08:35:22 vs-01 kernel: [85795.105931] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Oct 10 08:35:22 vs-01 kernel: [85795.105962] Process nautilus (pid: 22828, threadinfo ffff8104409e8000, task ffff81033808ebb0) Oct 10 08:35:22 vs-01 kernel: [85795.106008] Stack: ffffffff8032e602 ffff81036e0aa6c0 ffff8104409e9d88 ffff81044795f28d Oct 10 08:35:22 vs-01 kernel: [85795.106064] 00000000ffffffff 0000000000000000 ffffffff8032e97d ffff810000000002 Oct 10 08:35:22 vs-01 kernel: [85795.106116] 00000000001280d2 ffff810000000000 ffff81000001dc00 0000000000000000 Oct 10 08:35:22 vs-01 kernel: [85795.106153] Call Trace: Oct 10 08:35:22 vs-01 kernel: [85795.106192] [<ffffffff8032e602>] ? string+0x34/0x93 Oct 10 08:35:22 vs-01 kernel: [85795.106221] [<ffffffff8032e97d>] ? vsnprintf+0x31c/0x5be Oct 10 08:35:22 vs-01 kernel: [85795.106257] [<ffffffff8028c1f3>] ? handle_mm_fault+0x449/0xa85 Oct 10 08:35:22 vs-01 kernel: [85795.106293] [<ffffffff802c0b0b>] ? seq_printf+0x67/0x8f Oct 10 08:35:22 vs-01 kernel: [85795.106324] [<ffffffff802bc2f5>] ? mntput_no_expire+0x20/0x117 Oct 10 08:35:22 vs-01 kernel: [85795.106354] [<ffffffff802b83b2>] ? d_path+0xd4/0xe2 Oct 10 08:35:22 vs-01 kernel: [85795.106381] [<ffffffff802c0a26>] ? seq_escape+0x3a/0xb8 Oct 10 08:35:22 vs-01 kernel: [85795.106420] [<ffffffffa02d1525>] ? :cifs:cifs_show_options+0x84/0x27b Oct 10 08:35:22 vs-01 kernel: [85795.106451] [<ffffffff802be5fe>] ? show_vfsmnt+0xcb/0xe6 Oct 10 08:35:22 vs-01 kernel: [85795.106481] [<ffffffff802c1082>] ? seq_read+0x1c7/0x2a5 Oct 10 08:35:22 vs-01 kernel: [85795.106512] [<ffffffff802a6ed4>] ? vfs_read+0xaa/0x152 Oct 10 08:35:22 vs-01 kernel: [85795.106541] [<ffffffff802a72b5>] ? sys_read+0x45/0x6e Oct 10 08:35:22 vs-01 kernel: [85795.106570] [<ffffffff8020be9a>] ? system_call_after_swapgs+0x8a/0x8f Oct 10 08:35:22 vs-01 kernel: [85795.106603] Oct 10 08:35:22 vs-01 kernel: [85795.106622] Oct 10 08:35:22 vs-01 kernel: [85795.106641] Code: f2 ae 48 f7 d1 48 8d 44 11 ff 40 38 30 74 0a 48 ff c8 48 39 d0 73 f3 31 c0 c3 48 89 f8 eb 03 48 ff c0 4 Oct 10 08:35:22 vs-01 kernel: [85795.106850] RIP [<ffffffff8032d6ad>] strnlen+0x11/0x1a Oct 10 08:35:22 vs-01 kernel: [85795.106879] RSP <ffff8104409e9be0> Oct 10 08:35:22 vs-01 kernel: [85795.106902] CR2: 0000000700000007 It repeats in context of other processes but OOPS is always at the same place inside cifs_show_options. This is the kernel version string we are using: [ 0.000000] Linux version 2.6.26-bpo.1-vserver-amd64 (Debian 2.6.26-5~bpo40+1) (nobse@debian.org) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 SMP Sat Sep 20 11:49:02 UTC 2008 We are using vserver machines and mount 30-40 cifs shares inside it. After some time the kernel OOPSes and remains usable until some period and than prevents launching any processes and we need to reboot.
This bug appears all the time at the same place. And pointers in registers where the fault happens always look the same. I've recompiled the kernel to include SLAB poisoning and some additional debugging code. This happens again after some time. I made a dirty and quick change in the cifsfs.c and it now return immediately without printing any details when /proc/mounts requested. We'll see what are the changes... Oct 13 11:35:02 vs-01 kernel: [104810.140974] BUG: unable to handle kernel paging request at 0000002100000021 Oct 13 11:35:02 vs-01 kernel: [104810.141015] IP: [<ffffffff8032d6ad>] strnlen+0x11/0x1a Oct 13 11:35:02 vs-01 kernel: [104810.141048] PGD 39d9b6067 PUD 0 Oct 13 11:35:02 vs-01 kernel: [104810.141074] Oops: 0000 [1] SMP Oct 13 11:35:02 vs-01 kernel: [104810.142695] CPU 6 Oct 13 11:35:02 vs-01 kernel: [104810.142717] Modules linked in: fuse nfs nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs bnx 2 nls_utf8 cifs nls_base ac battery dm_snapshot dm_mirror dm_log dm_mod loop iTCO_wdt snd_pcm sg snd_timer snd serio_raw i2c _i801 soundcore sr_mod rng_core psmouse snd_page_alloc pcspkr ses enclosure i2c_core i5000_edac cdrom shpchp pci_hotplug but ton edac_core evdev ext3 jbd mbcache ata_piix usbhid hid ff_memless sd_mod ata_generic libata dock ehci_hcd firmware_class u hci_hcd aacraid scsi_mod thermal processor fan thermal_sys [last unloaded: bnx2] Oct 13 11:35:02 vs-01 kernel: [104810.143047] Pid: 10105, comm: nautilus Not tainted 2.6.26-bpo.1-vserver-amd64 #1 Oct 13 11:35:02 vs-01 kernel: [104810.143092] RIP: 0010:[<ffffffff8032d6ad>] [<ffffffff8032d6ad>] strnlen+0x11/0x1a Oct 13 11:35:02 vs-01 kernel: [104810.144120] RSP: 0018:ffff8103ef0ddbe0 EFLAGS: 00010297 Oct 13 11:35:02 vs-01 kernel: [104810.144120] RAX: 0000002100000021 RBX: 0000000000000000 RCX: 00000000ffffffff Oct 13 11:35:02 vs-01 kernel: [104810.144120] RDX: 0000002100000021 RSI: fffffffffffffffe RDI: 0000002100000021 Oct 13 11:35:02 vs-01 kernel: [104810.144120] RBP: ffff8102f3c5b28d R08: 00000000ffffffff R09: 0000000000000000 Oct 13 11:35:02 vs-01 kernel: [104810.144120] R10: ffffffffffffffff R11: ffff8104ba0f1680 R12: 0000002100000021 Oct 13 11:35:02 vs-01 kernel: [104810.144120] R13: 00000000ffffffff R14: ffff8102f3c5c000 R15: ffff8102f3c5c000 Oct 13 11:35:02 vs-01 kernel: [104810.144120] FS: 00007fe267cfb6d0(0000) GS:ffff8104bcd9cd40(0000) knlGS:0000000000000000 Oct 13 11:35:02 vs-01 kernel: [104810.144120] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Oct 13 11:35:02 vs-01 kernel: [104810.144120] CR2: 0000002100000021 CR3: 00000003f59da000 CR4: 00000000000006e0 Oct 13 11:35:02 vs-01 kernel: [104810.144120] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Oct 13 11:35:02 vs-01 kernel: [104810.144120] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Oct 13 11:35:02 vs-01 kernel: [104810.144120] Process nautilus (pid: 10105, threadinfo ffff8103ef0dc000, task ffff8103f59d24 c0) Oct 13 11:35:02 vs-01 kernel: [104810.144120] Stack: ffffffff8032e602 ffff81039d937640 ffff8103ef0ddd88 ffff8102f3c5b28d Oct 13 11:35:02 vs-01 kernel: [104810.144120] 00000000ffffffff 0000000000000000 ffffffff8032e97d ffff810000000002 Oct 13 11:35:02 vs-01 kernel: [104810.144120] 00000000001280d2 ffff810000000000 ffff81000001dc00 0000000000000000 Oct 13 11:35:02 vs-01 kernel: [104810.144120] Call Trace: Oct 13 11:35:02 vs-01 kernel: [104810.144120] [<ffffffff8032e602>] ? string+0x34/0x93 Oct 13 11:35:02 vs-01 kernel: [104810.144120] [<ffffffff8032e97d>] ? vsnprintf+0x31c/0x5be Oct 13 11:35:02 vs-01 kernel: [104810.144120] [<ffffffff8028c1f3>] ? handle_mm_fault+0x449/0xa85 Oct 13 11:35:02 vs-01 kernel: [104810.144120] [<ffffffff802c0b0b>] ? seq_printf+0x67/0x8f Oct 13 11:35:02 vs-01 kernel: [104810.144120] [<ffffffff802bc2f5>] ? mntput_no_expire+0x20/0x117 Oct 13 11:35:02 vs-01 kernel: [104810.148121] [<ffffffff802b83b2>] ? d_path+0xd4/0xe2 Oct 13 11:35:02 vs-01 kernel: [104810.148121] [<ffffffff802c0a26>] ? seq_escape+0x3a/0xb8 Oct 13 11:35:02 vs-01 kernel: [104810.148121] [<ffffffffa020a525>] ? :cifs:cifs_show_options+0x84/0x27b Oct 13 11:35:02 vs-01 kernel: [104810.148121] [<ffffffff802be5fe>] ? show_vfsmnt+0xcb/0xe6 Oct 13 11:35:02 vs-01 kernel: [104810.148121] [<ffffffff802c1082>] ? seq_read+0x1c7/0x2a5 Oct 13 11:35:02 vs-01 kernel: [104810.148121] [<ffffffff802a6ed4>] ? vfs_read+0xaa/0x152 Oct 13 11:35:02 vs-01 kernel: [104810.148121] [<ffffffff802a72b5>] ? sys_read+0x45/0x6e Oct 13 11:35:02 vs-01 kernel: [104810.148121] [<ffffffff8020be9a>] ? system_call_after_swapgs+0x8a/0x8f Oct 13 11:35:02 vs-01 kernel: [104810.148121] Oct 13 11:35:02 vs-01 kernel: [104810.148121] Oct 13 11:35:02 vs-01 kernel: [104810.148121] Code: f2 ae 48 f7 d1 48 8d 44 11 ff 40 38 30 74 0a 48 ff c8 48 39 d0 73 f3 31 c0 c3 48 89 f8 eb 03 48 ff c0 48 ff ce 48 83 fe ff 74 05 <80> 38 00 75 ef 48 29 f8 c3 31 c0 eb 12 41 38 c8 74 0a 48 ff c2 Oct 13 11:35:02 vs-01 kernel: [104810.148121] RIP [<ffffffff8032d6ad>] strnlen+0x11/0x1a Oct 13 11:35:02 vs-01 kernel: [104810.148121] RSP <ffff8103ef0ddbe0> Oct 13 11:35:02 vs-01 kernel: [104810.148121] CR2: 0000002100000021 Oct 13 11:35:02 vs-01 kernel: [104810.154695] ---[ end trace 084faf644649cd60 ]--- Some point are still not very clear. According to the stack frame tracing seq_escape() is called from cifs_show_options() but in fact there is no call to that function inside. I've disassembled the code of the module to be completely shure that there is no callq instruction to seq_escape(). Does it means that the stack is corrupted somehow?
Ok, after commenting out cifs_show_options we got the following: [ 3529.455763] general protection fault: 0000 [1] SMP [ 3529.455800] CPU 4 [ 3529.455822] Modules linked in: fuse bnx2 nls_utf8 cifs nls_base ac battery dm_snapshot dm_mirror dm_log dm_mod loop iTCO_wdt snd_pcm snd_timer snd sg rng_core pcspkr i2c_i801 soundcore serio_raw i2c_core button i5000_edac snd_page_alloc sr_mod cdrom ses psmouse edac_core shpchp pci_hotplug enclosure evdev ext3 jbd mbcache ata_piix usbhid hid ff_memless sd_mod ata_generic libata dock ehci_hcd firmware_class uhci_hcd aacraid scsi_mod thermal processor fan thermal_sys [last unloaded: bnx2] [ 3529.456148] Pid: 14595, comm: mount.cifs Not tainted 2.6.26-bpo.1-vserver-amd64 #1 [ 3529.456195] RIP: 0010:[<ffffffffa0221be5>] [<ffffffffa0221be5>] :cifs:find_unc+0x6e/0x136 [ 3529.456259] RSP: 0018:ffff81036f449b88 EFLAGS: 00010246 [ 3529.456287] RAX: 5a5a5a5a5a5a5a5a RBX: ffff8104b990f508 RCX: 000000000000006b [ 3529.462468] RDX: 0000000000000046 RSI: ffff810365561a15 RDI: ffff8104b990f515 [ 3529.462468] RBP: ffff81036f449bb8 R08: 0000000000000000 R09: ffff810367de19d0 [ 3529.462468] R10: ffff8104b785a188 R11: ffff8104bcd85ef8 R12: ffff8104bb032548 [ 3529.462468] R13: ffff8104bb032548 R14: ffff810365561a08 R15: 0000000014011dac [ 3529.462468] FS: 00007fe7bbc556d0(0000) GS:ffff8104bccf9e88(0000) knlGS:0000000000000000 [ 3529.462468] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 3529.462468] CR2: 00007fffc3c67000 CR3: 0000000416021000 CR4: 00000000000006e0 [ 3529.462468] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3529.462468] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 3529.462468] Process mount.cifs (pid: 14595, threadinfo ffff81036f448000, task ffff8103af362de0) [ 3529.462468] Stack: ffff8104b341d032 ffff8104b785a198 ffff8104b785a188 0000000000000000 [ 3529.462468] 0000000000000000 ffff81038203ea20 ffff81036f449d08 ffffffffa022452b [ 3529.462468] 0000000000000000 ffff810370c96ac8 00015c2a08000000 0000000000000000 [ 3529.462468] Call Trace: [ 3529.462468] [<ffffffffa022452b>] :cifs:cifs_mount+0x1d96/0x2236 [ 3529.462468] [<ffffffffa02165a1>] ? :cifs:cifs_get_sb+0xb9/0x26e [ 3529.462468] [<ffffffff802a6b68>] ? poison_obj+0x27/0x32 [ 3529.462468] [<ffffffffa02165f3>] :cifs:cifs_get_sb+0x10b/0x26e [ 3529.462468] [<ffffffff802b01b4>] vfs_kern_mount+0xe1/0x1dd [ 3529.462468] [<ffffffff802b030d>] do_kern_mount+0x47/0xe2 [ 3529.462468] [<ffffffff802c6dbe>] do_new_mount+0x9c/0xce [ 3529.462468] [<ffffffff802c7032>] do_mount+0x242/0x271 [ 3529.462468] [<ffffffff80285e45>] ? __alloc_pages_nodemask+0x9/0xb [ 3529.462468] [<ffffffff802a1c2e>] ? alloc_pages_current+0xb9/0xc2 [ 3529.462468] [<ffffffff802c70ea>] sys_mount+0x89/0xce [ 3529.462468] [<ffffffff8020c05a>] system_call_after_swapgs+0x8a/0x8f [ 3529.462468] [ 3529.462468] [ 3529.462468] Code: 31 c0 e8 af 57 01 e0 49 8b 45 38 4d 89 ec 48 85 c0 0f 84 ab 00 00 00 48 8b 40 28 48 85 c0 0f 84 9e 00 00 00 f6 05 9b de 02 00 01 <8b> 58 44 74 13 44 89 fa 89 de 48 c7 c7 4e e1 23 a0 31 c0 e8 73 [ 3529.462468] RIP [<ffffffffa0221be5>] :cifs:find_unc+0x6e/0x136 [ 3529.462468] RSP <ffff81036f449b88> [ 3529.462891] ---[ end trace da7bdb0ec9933e29 ]--- RAX contains poisoned pointer. It means that we are referencing to freed memory somewhere...
What I could investigate is the following. The fault is located in cifs/connect.c: list_for_each(tmp, &GlobalTreeConnectionList) { cFYI(1, ("Next tcon")); tcon = list_entry(tmp, struct cifsTconInfo, cifsConnectionList); if (!tcon->ses || !tcon->ses->server) continue; old_ip = tcon->ses->server->addr.sockAddr.sin_addr.s_addr; <== here is the fault Disassemled code looks like that: fs/cifs/connect.c:1393 22d4: 48 85 c0 test %rax,%rax 22d7: 0f 84 ab 00 00 00 je 2388 <find_unc+0x105> 22dd: 48 8b 40 28 mov 0x28(%rax),%rax 22e1: 48 85 c0 test %rax,%rax 22e4: 0f 84 9e 00 00 00 je 2388 <find_unc+0x105> fs/cifs/connect.c:1397 22ea: f6 05 00 00 00 00 01 testb $0x1,0(%rip) # 22f1 <find_unc+0x6e> 22ec: R_X86_64_PC32 cifsFYI+0xfffffffffffffffb fs/cifs/connect.c:1396 22f1: 8b 58 44 mov 0x44(%rax),%ebx fs/cifs/connect.c:1397 It is clear that the fault occures during dereference of tcon->ses->server->addr.sockAddr.sin_addr.s_addr, which is at offset 0x44 in struct TCP_Server_Info. Register %rax contains poisoned value 0x5a5a5a5a5a5a5a5a (which is a value of tcon->ses->server pointer) it means that structure cifsSesInfo is already invalid. tcon->ses->server is mistakenly freed somewhere or not set to NULL?
Last night we moved the server out of vserver environment and we have near 40 cifs mounts inside it. No faults, no errors, no oops'es. I think this corruption is related somehow to vserver patch. We must close this bug because it is not CifsVFS issue but vserver's.
We moved to the non-vserver kernel and we have near 40 cifs mounts. After one day we got kernel OOPS with the same message and trace. This is not vserver issue and it looks like reference counting/locking problem. So, reopening the bug.
At last here are the steps to reproduce the failure. Tested on kernels 2.6.26-bpo.1-amd64, 2.6.26-bpo.1-vserver-amd64 and 2.6.24-bpo.1-amd64 and 2.4.18-amd64 from backports and Debian. We also tested CentOS kernel kernel-2.6.18-92.el5 for AMD64 with the same result. 1. Mount 20-30 cifs shares to the SAME SAMBA/cifs server using mount.cifs(8). 2. Start to unmount the shares using umount(8). 3. After third or fifth umount the kernel will oops and continue to oops on cifs:cifs_show_options() untill completely hangs.
Is this reproducible on current mainline kernels (from either Steve F's or Linus git tree?). I think you may be running across some refcounting bugs that were in earlier code. I believe them to be fixed in mainline kernels now.
This should be fixed in current 2.6.27.stable and later kernels. There were refcounting problems with the cifs mount code which are better now.