Bug 5821 - kernel oops in strnlen inside cifs_show_options
kernel oops in strnlen inside cifs_show_options
Status: RESOLVED FIXED
Product: CifsVFS
Classification: Unclassified
Component: kernel fs
2.6
x64 Linux
: P3 normal
: ---
Assigned To: Steve French
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-10-10 01:09 UTC by Grigoriy A. Sitkarev
Modified: 2009-02-18 23:23 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Grigoriy A. Sitkarev 2008-10-10 01:09:00 UTC
This is the dmesg from the kernel:

Oct 10 08:35:22 vs-01 kernel: [85795.105056] BUG: unable to handle kernel paging request at 0000000700000007
Oct 10 08:35:22 vs-01 kernel: [85795.105096] IP: [<ffffffff8032d6ad>] strnlen+0x11/0x1a
Oct 10 08:35:22 vs-01 kernel: [85795.105128] PGD 410148067 PUD 0
Oct 10 08:35:22 vs-01 kernel: [85795.105154] Oops: 0000 [1] SMP
Oct 10 08:35:22 vs-01 kernel: [85795.105180] CPU 1
Oct 10 08:35:22 vs-01 kernel: [85795.105202] Modules linked in: fuse bnx2 nls_utf8 cifs nls_base nfs nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ac bat
Oct 10 08:35:22 vs-01 kernel: [85795.105524] Pid: 22828, comm: nautilus Not tainted 2.6.26-bpo.1-vserver-amd64 #1
Oct 10 08:35:22 vs-01 kernel: [85795.105569] RIP: 0010:[<ffffffff8032d6ad>]  [<ffffffff8032d6ad>] strnlen+0x11/0x1a
Oct 10 08:35:22 vs-01 kernel: [85795.105617] RSP: 0018:ffff8104409e9be0  EFLAGS: 00010297
Oct 10 08:35:22 vs-01 kernel: [85795.105644] RAX: 0000000700000007 RBX: 0000000000000000 RCX: 00000000ffffffff
Oct 10 08:35:22 vs-01 kernel: [85795.105674] RDX: 0000000700000007 RSI: fffffffffffffffe RDI: 0000000700000007
Oct 10 08:35:22 vs-01 kernel: [85795.105705] RBP: ffff81044795f28d R08: 00000000ffffffff R09: 0000000000000000
Oct 10 08:35:22 vs-01 kernel: [85795.105735] R10: ffffffffffffffff R11: ffff8104b8dde8c0 R12: 0000000700000007
Oct 10 08:35:22 vs-01 kernel: [85795.105766] R13: 00000000ffffffff R14: ffff810447960000 R15: ffff810447960000
Oct 10 08:35:22 vs-01 kernel: [85795.105797] FS:  00007f473c0b66d0(0000) GS:ffff8104bcc9d840(0000) knlGS:0000000000000000
Oct 10 08:35:22 vs-01 kernel: [85795.105843] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 10 08:35:22 vs-01 kernel: [85795.105870] CR2: 0000000700000007 CR3: 00000004409e3000 CR4: 00000000000006e0
Oct 10 08:35:22 vs-01 kernel: [85795.105901] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 10 08:35:22 vs-01 kernel: [85795.105931] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 10 08:35:22 vs-01 kernel: [85795.105962] Process nautilus (pid: 22828, threadinfo ffff8104409e8000, task ffff81033808ebb0)
Oct 10 08:35:22 vs-01 kernel: [85795.106008] Stack:  ffffffff8032e602 ffff81036e0aa6c0 ffff8104409e9d88 ffff81044795f28d
Oct 10 08:35:22 vs-01 kernel: [85795.106064]  00000000ffffffff 0000000000000000 ffffffff8032e97d ffff810000000002
Oct 10 08:35:22 vs-01 kernel: [85795.106116]  00000000001280d2 ffff810000000000 ffff81000001dc00 0000000000000000
Oct 10 08:35:22 vs-01 kernel: [85795.106153] Call Trace:
Oct 10 08:35:22 vs-01 kernel: [85795.106192]  [<ffffffff8032e602>] ? string+0x34/0x93
Oct 10 08:35:22 vs-01 kernel: [85795.106221]  [<ffffffff8032e97d>] ? vsnprintf+0x31c/0x5be
Oct 10 08:35:22 vs-01 kernel: [85795.106257]  [<ffffffff8028c1f3>] ? handle_mm_fault+0x449/0xa85
Oct 10 08:35:22 vs-01 kernel: [85795.106293]  [<ffffffff802c0b0b>] ? seq_printf+0x67/0x8f
Oct 10 08:35:22 vs-01 kernel: [85795.106324]  [<ffffffff802bc2f5>] ? mntput_no_expire+0x20/0x117
Oct 10 08:35:22 vs-01 kernel: [85795.106354]  [<ffffffff802b83b2>] ? d_path+0xd4/0xe2
Oct 10 08:35:22 vs-01 kernel: [85795.106381]  [<ffffffff802c0a26>] ? seq_escape+0x3a/0xb8
Oct 10 08:35:22 vs-01 kernel: [85795.106420]  [<ffffffffa02d1525>] ? :cifs:cifs_show_options+0x84/0x27b
Oct 10 08:35:22 vs-01 kernel: [85795.106451]  [<ffffffff802be5fe>] ? show_vfsmnt+0xcb/0xe6
Oct 10 08:35:22 vs-01 kernel: [85795.106481]  [<ffffffff802c1082>] ? seq_read+0x1c7/0x2a5
Oct 10 08:35:22 vs-01 kernel: [85795.106512]  [<ffffffff802a6ed4>] ? vfs_read+0xaa/0x152
Oct 10 08:35:22 vs-01 kernel: [85795.106541]  [<ffffffff802a72b5>] ? sys_read+0x45/0x6e
Oct 10 08:35:22 vs-01 kernel: [85795.106570]  [<ffffffff8020be9a>] ? system_call_after_swapgs+0x8a/0x8f
Oct 10 08:35:22 vs-01 kernel: [85795.106603]
Oct 10 08:35:22 vs-01 kernel: [85795.106622]
Oct 10 08:35:22 vs-01 kernel: [85795.106641] Code: f2 ae 48 f7 d1 48 8d 44 11 ff 40 38 30 74 0a 48 ff c8 48 39 d0 73 f3 31 c0 c3 48 89 f8 eb 03 48 ff c0 4
Oct 10 08:35:22 vs-01 kernel: [85795.106850] RIP  [<ffffffff8032d6ad>] strnlen+0x11/0x1a
Oct 10 08:35:22 vs-01 kernel: [85795.106879]  RSP <ffff8104409e9be0>
Oct 10 08:35:22 vs-01 kernel: [85795.106902] CR2: 0000000700000007

It repeats in context of other processes but OOPS is always at the same place inside cifs_show_options.

This is the kernel version string we are using:
[    0.000000] Linux version 2.6.26-bpo.1-vserver-amd64 (Debian 2.6.26-5~bpo40+1) (nobse@debian.org) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 SMP Sat Sep 20 11:49:02 UTC 2008

We are using vserver machines and mount 30-40 cifs shares inside it. After some time the kernel OOPSes and remains usable until some period and than prevents launching any processes and we need to reboot.
Comment 1 Grigoriy A. Sitkarev 2008-10-13 06:30:56 UTC
This bug appears all the time at the same place. And pointers in registers where the fault happens always look the same.

I've recompiled the kernel to include SLAB poisoning and some additional debugging code. This happens again after some time. I made a dirty and quick change in the cifsfs.c and it now return immediately without printing any details when /proc/mounts requested. We'll see what are the changes...

Oct 13 11:35:02 vs-01 kernel: [104810.140974] BUG: unable to handle kernel paging request at 0000002100000021
Oct 13 11:35:02 vs-01 kernel: [104810.141015] IP: [<ffffffff8032d6ad>] strnlen+0x11/0x1a
Oct 13 11:35:02 vs-01 kernel: [104810.141048] PGD 39d9b6067 PUD 0
Oct 13 11:35:02 vs-01 kernel: [104810.141074] Oops: 0000 [1] SMP
Oct 13 11:35:02 vs-01 kernel: [104810.142695] CPU 6
Oct 13 11:35:02 vs-01 kernel: [104810.142717] Modules linked in: fuse nfs nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs bnx
2 nls_utf8 cifs nls_base ac battery dm_snapshot dm_mirror dm_log dm_mod loop iTCO_wdt snd_pcm sg snd_timer snd serio_raw i2c
_i801 soundcore sr_mod rng_core psmouse snd_page_alloc pcspkr ses enclosure i2c_core i5000_edac cdrom shpchp pci_hotplug but
ton edac_core evdev ext3 jbd mbcache ata_piix usbhid hid ff_memless sd_mod ata_generic libata dock ehci_hcd firmware_class u
hci_hcd aacraid scsi_mod thermal processor fan thermal_sys [last unloaded: bnx2]
Oct 13 11:35:02 vs-01 kernel: [104810.143047] Pid: 10105, comm: nautilus Not tainted 2.6.26-bpo.1-vserver-amd64 #1
Oct 13 11:35:02 vs-01 kernel: [104810.143092] RIP: 0010:[<ffffffff8032d6ad>]  [<ffffffff8032d6ad>] strnlen+0x11/0x1a
Oct 13 11:35:02 vs-01 kernel: [104810.144120] RSP: 0018:ffff8103ef0ddbe0  EFLAGS: 00010297
Oct 13 11:35:02 vs-01 kernel: [104810.144120] RAX: 0000002100000021 RBX: 0000000000000000 RCX: 00000000ffffffff
Oct 13 11:35:02 vs-01 kernel: [104810.144120] RDX: 0000002100000021 RSI: fffffffffffffffe RDI: 0000002100000021
Oct 13 11:35:02 vs-01 kernel: [104810.144120] RBP: ffff8102f3c5b28d R08: 00000000ffffffff R09: 0000000000000000
Oct 13 11:35:02 vs-01 kernel: [104810.144120] R10: ffffffffffffffff R11: ffff8104ba0f1680 R12: 0000002100000021
Oct 13 11:35:02 vs-01 kernel: [104810.144120] R13: 00000000ffffffff R14: ffff8102f3c5c000 R15: ffff8102f3c5c000
Oct 13 11:35:02 vs-01 kernel: [104810.144120] FS:  00007fe267cfb6d0(0000) GS:ffff8104bcd9cd40(0000) knlGS:0000000000000000
Oct 13 11:35:02 vs-01 kernel: [104810.144120] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct 13 11:35:02 vs-01 kernel: [104810.144120] CR2: 0000002100000021 CR3: 00000003f59da000 CR4: 00000000000006e0
Oct 13 11:35:02 vs-01 kernel: [104810.144120] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 13 11:35:02 vs-01 kernel: [104810.144120] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 13 11:35:02 vs-01 kernel: [104810.144120] Process nautilus (pid: 10105, threadinfo ffff8103ef0dc000, task ffff8103f59d24
c0)
Oct 13 11:35:02 vs-01 kernel: [104810.144120] Stack:  ffffffff8032e602 ffff81039d937640 ffff8103ef0ddd88 ffff8102f3c5b28d
Oct 13 11:35:02 vs-01 kernel: [104810.144120]  00000000ffffffff 0000000000000000 ffffffff8032e97d ffff810000000002
Oct 13 11:35:02 vs-01 kernel: [104810.144120]  00000000001280d2 ffff810000000000 ffff81000001dc00 0000000000000000
Oct 13 11:35:02 vs-01 kernel: [104810.144120] Call Trace:
Oct 13 11:35:02 vs-01 kernel: [104810.144120]  [<ffffffff8032e602>] ? string+0x34/0x93
Oct 13 11:35:02 vs-01 kernel: [104810.144120]  [<ffffffff8032e97d>] ? vsnprintf+0x31c/0x5be
Oct 13 11:35:02 vs-01 kernel: [104810.144120]  [<ffffffff8028c1f3>] ? handle_mm_fault+0x449/0xa85
Oct 13 11:35:02 vs-01 kernel: [104810.144120]  [<ffffffff802c0b0b>] ? seq_printf+0x67/0x8f
Oct 13 11:35:02 vs-01 kernel: [104810.144120]  [<ffffffff802bc2f5>] ? mntput_no_expire+0x20/0x117
Oct 13 11:35:02 vs-01 kernel: [104810.148121]  [<ffffffff802b83b2>] ? d_path+0xd4/0xe2
Oct 13 11:35:02 vs-01 kernel: [104810.148121]  [<ffffffff802c0a26>] ? seq_escape+0x3a/0xb8
Oct 13 11:35:02 vs-01 kernel: [104810.148121]  [<ffffffffa020a525>] ? :cifs:cifs_show_options+0x84/0x27b
Oct 13 11:35:02 vs-01 kernel: [104810.148121]  [<ffffffff802be5fe>] ? show_vfsmnt+0xcb/0xe6
Oct 13 11:35:02 vs-01 kernel: [104810.148121]  [<ffffffff802c1082>] ? seq_read+0x1c7/0x2a5
Oct 13 11:35:02 vs-01 kernel: [104810.148121]  [<ffffffff802a6ed4>] ? vfs_read+0xaa/0x152
Oct 13 11:35:02 vs-01 kernel: [104810.148121]  [<ffffffff802a72b5>] ? sys_read+0x45/0x6e
Oct 13 11:35:02 vs-01 kernel: [104810.148121]  [<ffffffff8020be9a>] ? system_call_after_swapgs+0x8a/0x8f
Oct 13 11:35:02 vs-01 kernel: [104810.148121]
Oct 13 11:35:02 vs-01 kernel: [104810.148121]
Oct 13 11:35:02 vs-01 kernel: [104810.148121] Code: f2 ae 48 f7 d1 48 8d 44 11 ff 40 38 30 74 0a 48 ff c8 48 39 d0 73 f3 31
c0 c3 48 89 f8 eb 03 48 ff c0 48 ff ce 48 83 fe ff 74 05 <80> 38 00 75 ef 48 29 f8 c3 31 c0 eb 12 41 38 c8 74 0a 48 ff c2
Oct 13 11:35:02 vs-01 kernel: [104810.148121] RIP  [<ffffffff8032d6ad>] strnlen+0x11/0x1a
Oct 13 11:35:02 vs-01 kernel: [104810.148121]  RSP <ffff8103ef0ddbe0>
Oct 13 11:35:02 vs-01 kernel: [104810.148121] CR2: 0000002100000021
Oct 13 11:35:02 vs-01 kernel: [104810.154695] ---[ end trace 084faf644649cd60 ]---

Some point are still not very clear. According to the stack frame tracing seq_escape() is called from cifs_show_options() but in fact there is no call to that function inside. I've disassembled the code of the module to be completely shure that there is no callq instruction to seq_escape().

Does it means that the stack is corrupted somehow?
Comment 2 Grigoriy A. Sitkarev 2008-10-13 06:47:38 UTC
Ok, after commenting out cifs_show_options we got the following:

[ 3529.455763] general protection fault: 0000 [1] SMP
[ 3529.455800] CPU 4
[ 3529.455822] Modules linked in: fuse bnx2 nls_utf8 cifs nls_base ac battery dm_snapshot dm_mirror dm_log dm_mod loop iTCO_wdt snd_pcm snd_timer snd sg rng_core pcspkr i2c_i801 soundcore serio_raw i2c_core button i5000_edac snd_page_alloc sr_mod cdrom ses psmouse edac_core shpchp pci_hotplug enclosure evdev ext3 jbd mbcache ata_piix usbhid hid ff_memless sd_mod ata_generic libata dock ehci_hcd firmware_class uhci_hcd aacraid scsi_mod thermal processor fan thermal_sys [last unloaded: bnx2]
[ 3529.456148] Pid: 14595, comm: mount.cifs Not tainted 2.6.26-bpo.1-vserver-amd64 #1
[ 3529.456195] RIP: 0010:[<ffffffffa0221be5>]  [<ffffffffa0221be5>] :cifs:find_unc+0x6e/0x136
[ 3529.456259] RSP: 0018:ffff81036f449b88  EFLAGS: 00010246
[ 3529.456287] RAX: 5a5a5a5a5a5a5a5a RBX: ffff8104b990f508 RCX: 000000000000006b
[ 3529.462468] RDX: 0000000000000046 RSI: ffff810365561a15 RDI: ffff8104b990f515
[ 3529.462468] RBP: ffff81036f449bb8 R08: 0000000000000000 R09: ffff810367de19d0
[ 3529.462468] R10: ffff8104b785a188 R11: ffff8104bcd85ef8 R12: ffff8104bb032548
[ 3529.462468] R13: ffff8104bb032548 R14: ffff810365561a08 R15: 0000000014011dac
[ 3529.462468] FS:  00007fe7bbc556d0(0000) GS:ffff8104bccf9e88(0000) knlGS:0000000000000000
[ 3529.462468] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3529.462468] CR2: 00007fffc3c67000 CR3: 0000000416021000 CR4: 00000000000006e0
[ 3529.462468] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3529.462468] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3529.462468] Process mount.cifs (pid: 14595, threadinfo ffff81036f448000, task ffff8103af362de0)
[ 3529.462468] Stack:  ffff8104b341d032 ffff8104b785a198 ffff8104b785a188 0000000000000000
[ 3529.462468]  0000000000000000 ffff81038203ea20 ffff81036f449d08 ffffffffa022452b
[ 3529.462468]  0000000000000000 ffff810370c96ac8 00015c2a08000000 0000000000000000
[ 3529.462468] Call Trace:
[ 3529.462468]  [<ffffffffa022452b>] :cifs:cifs_mount+0x1d96/0x2236
[ 3529.462468]  [<ffffffffa02165a1>] ? :cifs:cifs_get_sb+0xb9/0x26e
[ 3529.462468]  [<ffffffff802a6b68>] ? poison_obj+0x27/0x32
[ 3529.462468]  [<ffffffffa02165f3>] :cifs:cifs_get_sb+0x10b/0x26e
[ 3529.462468]  [<ffffffff802b01b4>] vfs_kern_mount+0xe1/0x1dd
[ 3529.462468]  [<ffffffff802b030d>] do_kern_mount+0x47/0xe2
[ 3529.462468]  [<ffffffff802c6dbe>] do_new_mount+0x9c/0xce
[ 3529.462468]  [<ffffffff802c7032>] do_mount+0x242/0x271
[ 3529.462468]  [<ffffffff80285e45>] ? __alloc_pages_nodemask+0x9/0xb
[ 3529.462468]  [<ffffffff802a1c2e>] ? alloc_pages_current+0xb9/0xc2
[ 3529.462468]  [<ffffffff802c70ea>] sys_mount+0x89/0xce
[ 3529.462468]  [<ffffffff8020c05a>] system_call_after_swapgs+0x8a/0x8f
[ 3529.462468]
[ 3529.462468]
[ 3529.462468] Code: 31 c0 e8 af 57 01 e0 49 8b 45 38 4d 89 ec 48 85 c0 0f 84 ab 00 00 00 48 8b 40 28 48 85 c0 0f 84 9e 00 00 00 f6 05 9b de 02 00 01 <8b> 58 44 74 13 44 89 fa 89 de 48 c7 c7 4e e1 23 a0 31 c0 e8 73
[ 3529.462468] RIP  [<ffffffffa0221be5>] :cifs:find_unc+0x6e/0x136
[ 3529.462468]  RSP <ffff81036f449b88>
[ 3529.462891] ---[ end trace da7bdb0ec9933e29 ]---

RAX contains poisoned pointer. It means that we are referencing to freed memory somewhere...
Comment 3 Grigoriy A. Sitkarev 2008-10-13 07:23:51 UTC
What I could investigate is the following.

The fault is located in cifs/connect.c:

	list_for_each(tmp, &GlobalTreeConnectionList) {
		cFYI(1, ("Next tcon"));
		tcon = list_entry(tmp, struct cifsTconInfo, cifsConnectionList);
		if (!tcon->ses || !tcon->ses->server)
			continue;

		old_ip = tcon->ses->server->addr.sockAddr.sin_addr.s_addr; <== here is the fault

Disassemled code looks like that:

fs/cifs/connect.c:1393
    22d4:       48 85 c0                test   %rax,%rax
    22d7:       0f 84 ab 00 00 00       je     2388 <find_unc+0x105>
    22dd:       48 8b 40 28             mov    0x28(%rax),%rax
    22e1:       48 85 c0                test   %rax,%rax
    22e4:       0f 84 9e 00 00 00       je     2388 <find_unc+0x105>
fs/cifs/connect.c:1397
    22ea:       f6 05 00 00 00 00 01    testb  $0x1,0(%rip)        # 22f1 <find_unc+0x6e>
                        22ec: R_X86_64_PC32     cifsFYI+0xfffffffffffffffb
fs/cifs/connect.c:1396
    22f1:       8b 58 44                mov    0x44(%rax),%ebx
fs/cifs/connect.c:1397

It is clear that the fault occures during dereference of tcon->ses->server->addr.sockAddr.sin_addr.s_addr, which is at offset 0x44 in struct TCP_Server_Info. Register %rax contains poisoned value 0x5a5a5a5a5a5a5a5a (which is a value of tcon->ses->server pointer) it means that structure cifsSesInfo is already invalid.

tcon->ses->server is mistakenly freed somewhere or not set to NULL?
Comment 4 Grigoriy A. Sitkarev 2008-10-14 15:05:02 UTC
Last night we moved the server out of vserver environment and we have near 40 cifs mounts inside it. No faults, no errors, no oops'es.

I think this corruption is related somehow to vserver patch.

We must close this bug because it is not CifsVFS issue but vserver's.
Comment 5 Grigoriy A. Sitkarev 2008-10-16 03:36:57 UTC
We moved to the non-vserver kernel and we have near 40 cifs mounts. After one day we got kernel OOPS with the same message and trace.

This is not vserver issue and it looks like reference counting/locking problem.

So, reopening the bug.
Comment 6 Grigoriy A. Sitkarev 2008-10-16 16:05:54 UTC
At last here are the steps to reproduce the failure. Tested on kernels 2.6.26-bpo.1-amd64, 2.6.26-bpo.1-vserver-amd64 and 2.6.24-bpo.1-amd64 and 2.4.18-amd64 from backports and Debian. We also tested CentOS kernel kernel-2.6.18-92.el5 for AMD64 with the same result.

1. Mount 20-30 cifs shares to the SAME SAMBA/cifs server using mount.cifs(8).
2. Start to unmount the shares using umount(8).
3. After third or fifth umount the kernel will oops and continue to oops on cifs:cifs_show_options() untill completely hangs.
Comment 7 Jeff Layton 2008-11-23 08:53:14 UTC
Is this reproducible on current mainline kernels (from either Steve F's or Linus git tree?). I think you may be running across some refcounting bugs that were in earlier code. I believe them to be fixed in mainline kernels now.
Comment 8 Steve French 2009-02-18 23:23:39 UTC
This should  be fixed in current 2.6.27.stable and later kernels.  There were refcounting problems with the cifs mount code which are better now.