Bug 5234 - smbd causes cluster node to crash
Summary: smbd causes cluster node to crash
Status: RESOLVED INVALID
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: File Services
Version: 3.0.26a
Hardware: x64 Linux
Importance: P3 major
Target Milestone: none
Assignee: Samba Bugzilla Account
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-01-31 02:18 UTC by Reiner Rottmann
Modified: 2008-01-31 02:27 UTC

See Also:


Description Reiner Rottmann 2008-01-31 02:18:21 UTC
Samba is configured as an rgmanager cluster service on a cluster based on RHEL4 U5 with GFS 6.1 filesystem(s).

Samba is part of a W2K domain. Authentication is done against the W2K password server (winbind).

The servers are HP ProLiant DL585 machines, but from different hardware generations.

ServerX: G1
ServerY: G2 

Apart from this issue, the servers have been stable for a long time.

The system has not been changed for at least 60 days.

The number of Samba users has increased (by roughly 15%).
 

Problem Description:
 
Only one node of the cluster (serverX) crashes, with the following information. The crash seems to occur in interaction with an smbd process.

---%----------------------------------------------------------------------
Nov 26 15:16:55 serverX  Unable to handle kernel paging request at 0000000000100108 RIP:
Nov 26 15:16:55 serverX  <ffffffff80141147>{free_uid+45}
Nov 26 15:16:55 serverX  PML4 8ad74067 PGD 0
Nov 26 15:16:55 serverX  Oops: 0002 [1] SMP
Nov 26 15:16:55 serverX  CPU 1
Nov 26 15:16:55 serverX  Modules linked in: netconsole netdump i2c_dev i2c_core sunrpc ext3 jbd dm_round_robin dm_multipath button battery ac ohci_hcd cciss hw_random k8_edac edac_mc floppy md5 ipv6 bonding(U) lock_dlm(U) dlm(U) gfs(U) lock_harness(U) cman(U) mppVhba(U) qla2300 qla2xxx scsi_transport_fc mppUpper(U) sg sd_mod scsi_mod dm_snapshot dm_mirror dm_mod tg3 e1000
Nov 26 15:16:55 serverX  Pid: 3262, comm: smbd Not tainted 2.6.9-55.0.2.ELsmp
Nov 26 15:16:55 serverX  RIP: 0010:[<ffffffff80141147>] <ffffffff80141147>{free_uid+45}
Nov 26 15:16:55 serverX  RSP: 0018:00000100b2fffd98  EFLAGS: 00010002
Nov 26 15:16:55 serverX  RAX: 0000000000100100 RBX: 0000010056476a80 RCX: 0000010056476aa0
Nov 26 15:16:55 serverX  RDX: 0000000000200200 RSI: ffffffff803e72e8 RDI: ffffffff803e72e8
Nov 26 15:16:55 serverX  RBP: 00000100c02c2bd0 R08: 00000100b2ffe000 R09: 0000000300000000
Nov 26 15:16:55 serverX  R10: 0000000300000000 R11: 0000000000000246 R12: 00000100b2fffe78
Nov 26 15:16:55 serverX  R13: 000000000000000a R14: 0000000000000000 R15: 00000100ef145ec8
Nov 26 15:16:55 serverX  FS:  0000002a96fdc080(0000) GS:ffffffff804edc00(0000) knlGS:000000000810f400
Nov 26 15:16:55 serverX  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 26 15:16:55 serverX  CR2: 0000000000100108 CR3: 000000007e392000 CR4: 00000000000006e0
Nov 26 15:16:55 serverX  Process smbd (pid: 3262, threadinfo 00000100b2ffe000, task 00000100ef1457f0)
Nov 26 15:16:55 serverX  Stack: 0000000000000000 0000000000000002 000001009c704e20 ffffffff80141919
Nov 26 15:16:55 serverX         0000000000000000 00000100b2fffe78 00000100ef1457f0 00000100ef145ec8
Nov 26 15:16:55 serverX         00000100b2ffff58 ffffffff801419ce
Nov 26 15:16:55 serverX  Call Trace:<ffffffff80141919>{__dequeue_signal+347} <ffffffff801419ce>{dequeue_signal+58}
Nov 26 15:16:55 serverX         <ffffffff8014351a>{get_signal_to_deliver+338} <ffffffff8010f6fb>{do_signal+131}
Nov 26 15:16:55 serverX         <ffffffff8030c0f1>{thread_return+88} <ffffffff801102f3>{sysret_signal+28}
Nov 26 15:16:55 serverX         <ffffffff801105df>{ptregscall_common+103}
Nov 26 15:16:55 serverX
Nov 26 15:16:55 serverX  Code: 48 89 50 08 48 89 02 48 c7 41 08 00 02 20 00 48 8b 7b 38 48
Nov 26 15:16:55 serverX  RIP <ffffffff80141147>{free_uid+45} RSP <00000100b2fffd98>
Nov 26 15:16:55 serverX  CR2: 0000000000100108
---%----------------------------------------------------------------------


---%----------------------------------------------------------------------
Jan 28 13:54:10 serverX  Unable to handle kernel paging request at 0000000000100108 RIP:
Jan 28 13:54:10 serverX  <ffffffff80141147>{free_uid+45}
Jan 28 13:54:10 serverX  PML4 3f0b5067 PGD 0
Jan 28 13:54:10 serverX  Oops: 0002 [1] SMP
Jan 28 13:54:10 serverX  CPU 0
Jan 28 13:54:10 serverX  Modules linked in: netconsole netdump i2c_dev i2c_core sunrpc ide_dump cciss_dump scsi_dump diskdump zlib_deflate dm_round_robin dm_multipath button battery ac ohci_hcd hw_random k8_edac edac_mc floppy md5 ipv6 ext3 jbd lock_dlm(U) dlm(U) gfs(U) lock_harness(U) cman(U) mppVhba(U) cciss qla2300 qla2xxx scsi_transport_fc mppUpper(U) sg sd_mod scsi_mod dm_snapshot dm_mirror dm_mod bonding(U) tg3 e1000
Jan 28 13:54:10 serverX  Pid: 8271, comm: smbd Not tainted 2.6.9-55.0.2.ELsmp
Jan 28 13:54:10 serverX  RIP: 0010:[<ffffffff80141147>] <ffffffff80141147>{free_uid+45}
Jan 28 13:54:10 serverX  RSP: 0018:0000010075129d98  EFLAGS: 00010002
Jan 28 13:54:10 serverX  RAX: 0000000000100100 RBX: 00000100d12df700 RCX: 00000100d12df720
Jan 28 13:54:10 serverX  RDX: 0000000000200200 RSI: ffffffff803e72e8 RDI: ffffffff803e72e8
Jan 28 13:54:10 serverX  RBP: 0000010047184dd0 R08: 0000010075128000 R09: 0000000300000000
Jan 28 13:54:10 serverX  R10: 0000000300000000 R11: 0000000000000246 R12: 0000010075129e78
Jan 28 13:54:10 serverX  R13: 000000000000000a R14: 0000000000000000 R15: 0000010023afcec8
Jan 28 13:54:10 serverX  FS:  0000002a96fdc080(0000) GS:ffffffff804edb80(0000) knlGS:000000000820d8c0
Jan 28 13:54:10 serverX  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 28 13:54:10 serverX  CR2: 0000000000100108 CR3: 0000000000101000 CR4: 00000000000006e0
Jan 28 13:54:10 serverX  Process smbd (pid: 8271, threadinfo 0000010075128000, task 0000010023afc7f0)
Jan 28 13:54:10 serverX  Stack: 0000000000000000 0000000000000006 00000100f17d7838 ffffffff80141919
Jan 28 13:54:10 serverX         0000000000000000 0000010075129e78 0000010023afc7f0 0000010023afcec8
Jan 28 13:54:10 serverX         0000010075129f58 ffffffff801419ce
Jan 28 13:54:10 serverX  Call Trace:<ffffffff80141919>{__dequeue_signal+347} <ffffffff801419ce>{dequeue_signal+58}
Jan 28 13:54:10 serverX         <ffffffff8014351a>{get_signal_to_deliver+338} <ffffffff8010f6fb>{do_signal+131}
Jan 28 13:54:10 serverX         <ffffffff8030c0f1>{thread_return+88} <ffffffff801102f3>{sysret_signal+28}
Jan 28 13:54:10 serverX         <ffffffff801105df>{ptregscall_common+103}
Jan 28 13:54:10 serverX
Jan 28 13:54:10 serverX  Code: 48 89 50 08 48 89 02 48 c7 41 08 00 02 20 00 48 8b 7b 38 48
Jan 28 13:54:10 serverX  RIP <ffffffff80141147>{free_uid+45} RSP <0000010075129d98>
Jan 28 13:54:10 serverX  CR2: 0000000000100108
Jan 28 13:54:26 serverY winbindd[14364]: [2008/01/28 14:08:05, 0] nsswitch/idmap_ldap.c:idmap_ldap_allocate_id(469)
Jan 28 13:54:26 serverY winbindd[14364]:   Cannot allocate gid above 30000!
Jan 28 13:54:26 serverY smbd[27838]: [2008/01/28 14:08:05, 0] auth/auth_util.c:create_builtin_administrators(792)
Jan 28 13:54:26 serverY smbd[27838]:   create_builtin_administrators: Failed to create Administrators
Jan 28 13:54:26 serverY winbindd[14364]: [2008/01/28 14:08:05, 0] nsswitch/idmap_ldap.c:idmap_ldap_allocate_id(469)
Jan 28 13:54:26 serverY winbindd[14364]:   Cannot allocate gid above 30000!
Jan 28 13:54:26 serverY smbd[27838]: [2008/01/28 14:08:05, 0] auth/auth_util.c:create_builtin_users(758)
Jan 28 13:54:26 serverY smbd[27838]:   create_builtin_users: Failed to create Users
Jan 28 13:54:30 serverY  CMAN: removing node serverX from the cluster : Missed too many heartbeats
Jan 28 13:54:30 serverY fenced[1805]: serverX_local not a cluster member after 0 sec post_fail_delay
Jan 28 13:54:30 serverY fenced[1805]: fencing node "serverX"
Jan 28 13:54:30 serverY fence_manual: Node serverX needs to be reset before recovery can procede.  Waiting for serverX to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n serverX)
---%----------------------------------------------------------------------

We have a vmcore saved from the last system crash.
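
One possible reading of the two traces above, offered only as a sketch and not confirmed anywhere in this report: in 2.6-era kernels, list_del() overwrites the removed entry's next and prev pointers with LIST_POISON1 (0x00100100) and LIST_POISON2 (0x00200200). Both oopses show RAX = 0x100100, RDX = 0x200200 and a fault address of 0x100108, which would be consistent with a list entry being unlinked a second time inside free_uid. The Python below only decodes the values printed in the log; the helper names parse_oops and describe_fault are hypothetical, and the sample text is copied from the Nov 26 trace.

```python
import re

# List-poison constants as defined in include/linux/list.h of 2.6-era kernels.
LIST_POISON1 = 0x00100100
LIST_POISON2 = 0x00200200

# A few lines copied from the Nov 26 oops, syslog prefix removed.
SAMPLE = """\
Oops: 0002 [1] SMP
RAX: 0000000000100100 RBX: 0000010056476a80 RCX: 0000010056476aa0
RDX: 0000000000200200 RSI: ffffffff803e72e8 RDI: ffffffff803e72e8
CR2: 0000000000100108 CR3: 000000007e392000 CR4: 00000000000006e0
"""

def parse_oops(text):
    """Collect 64-bit register values and the oops error code from oops text."""
    pairs = re.findall(r"\b(R[A-Z0-9]{2}|CR\d):\s*([0-9a-f]{16})", text)
    regs = {name: int(value, 16) for name, value in pairs}
    match = re.search(r"Oops:\s*([0-9a-f]{4})", text)
    code = int(match.group(1), 16) if match else None
    return regs, code

def describe_fault(regs, code):
    """Return short notes about the page-fault error code and register contents."""
    notes = []
    if code is not None:
        # x86 page-fault error code: bit 0 = protection, bit 1 = write, bit 2 = user.
        notes.append("protection violation" if code & 1 else "page not present")
        notes.append("write" if code & 2 else "read")
        notes.append("user mode" if code & 4 else "kernel mode")
    if regs.get("RAX") == LIST_POISON1 and regs.get("RDX") == LIST_POISON2:
        notes.append("RAX/RDX hold LIST_POISON1/LIST_POISON2")
    if regs.get("CR2") == LIST_POISON1 + 8:
        notes.append("fault address is LIST_POISON1 + 8")
    return notes

regs, code = parse_oops(SAMPLE)
for note in describe_fault(regs, code):
    print(note)
```

The vmcore mentioned above would be the place to check whether this reading actually holds; the script cannot establish which code path unlinked the entry first.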
Comment 1 Volker Lendecke 2008-01-31 02:27:05 UTC
Kernel crashes are not a Samba bug.

It is possible that there is a memory leak, but that should show up much earlier in "top" or as excessive swapping activity.

Marking this as invalid. If we have a memleak, please re-open with more info, such as which process is consuming memory.

Volker
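
As a rough way to collect the per-process memory information requested above, a sketch like the following could be run periodically on a node. It assumes a Linux /proc filesystem whose /proc/<pid>/status files carry "Name:" and "VmRSS:" lines, and a Python 3 interpreter on the node; the function names parse_status and rss_by_process are hypothetical and not part of Samba.

```python
import os

def parse_status(text):
    """Return (process name, resident set size in kB) from /proc/<pid>/status text."""
    name, rss_kb = None, 0
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            name = value.strip()
        elif key == "VmRSS":
            # The value looks like "    5120 kB"; keep the number only.
            rss_kb = int(value.split()[0])
    return name, rss_kb

def rss_by_process(names=("smbd", "winbindd"), proc="/proc"):
    """List (rss_kb, pid, name) for processes with a matching name, largest first."""
    rows = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc, entry, "status")) as handle:
                name, rss_kb = parse_status(handle.read())
        except OSError:
            continue  # process exited between listdir() and open()
        if name in names:
            rows.append((rss_kb, int(entry), name))
    return sorted(rows, reverse=True)
```

Calling rss_by_process() from cron and appending the result to a file would show whether any smbd or winbindd process grows steadily over days, which is the kind of evidence a memory-leak report would need.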