On a cluster based on RHEL4 U5 with GFS 6.1 as the filesystem(s), Samba is configured as an rgmanager cluster service. Samba is part of a W2K-driven domain. Authentication is done against the W2K password server (winbind).

The servers are HP ProLiant DL585 machines, but from different hardware generations.
ServerX: G1
ServerY: G2

Apart from this issue, the servers have been stable for a long time. The system has not been changed for at least 60 days. The number of Samba users has increased (about +15%).

Problem Description:
Only one node of the cluster (serverX) crashes, with the following information. The crash seems to occur in interaction with an smbd process.

---%----------------------------------------------------------------------
Nov 26 15:16:55 serverX Unable to handle kernel paging request at 0000000000100108 RIP:
Nov 26 15:16:55 serverX <ffffffff80141147>{free_uid+45}
Nov 26 15:16:55 serverX PML4 8ad74067 PGD 0
Nov 26 15:16:55 serverX Oops: 0002 [1] SMP
Nov 26 15:16:55 serverX CPU 1
Nov 26 15:16:55 serverX Modules linked in: netconsole netdump i2c_dev i2c_core sunrpc ext3 jbd dm_round_robin dm_multipath button battery ac ohci_hcd cciss hw_random k8_edac edac_mc floppy md5 ipv6 bonding(U) lock_dlm(U) dlm(U) gfs(U) lock_harness(U) cman(U) mppVhba(U) qla2300 qla2xxx scsi_transport_fc mppUpper(U) sg sd_mod scsi_mod dm_snapshot dm_mirror dm_mod tg3 e1000
Nov 26 15:16:55 serverX Pid: 3262, comm: smbd Not tainted 2.6.9-55.0.2.ELsmp
Nov 26 15:16:55 serverX RIP: 0010:[<ffffffff80141147>] <ffffffff80141147>{free_uid+45}
Nov 26 15:16:55 serverX RSP: 0018:00000100b2fffd98 EFLAGS: 00010002
Nov 26 15:16:55 serverX RAX: 0000000000100100 RBX: 0000010056476a80 RCX: 0000010056476aa0
Nov 26 15:16:55 serverX RDX: 0000000000200200 RSI: ffffffff803e72e8 RDI: ffffffff803e72e8
Nov 26 15:16:55 serverX RBP: 00000100c02c2bd0 R08: 00000100b2ffe000 R09: 0000000300000000
Nov 26 15:16:55 serverX R10: 0000000300000000 R11: 0000000000000246 R12: 00000100b2fffe78
Nov 26 15:16:55 serverX R13: 000000000000000a R14: 0000000000000000 R15: 00000100ef145ec8
Nov 26 15:16:55 serverX FS: 0000002a96fdc080(0000) GS:ffffffff804edc00(0000) knlGS:000000000810f400
Nov 26 15:16:55 serverX CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 26 15:16:55 serverX CR2: 0000000000100108 CR3: 000000007e392000 CR4: 00000000000006e0
Nov 26 15:16:55 serverX Process smbd (pid: 3262, threadinfo 00000100b2ffe000, task 00000100ef1457f0)
Nov 26 15:16:55 serverX Stack: 0000000000000000 0000000000000002 000001009c704e20 ffffffff80141919
Nov 26 15:16:55 serverX 0000000000000000 00000100b2fffe78 00000100ef1457f0 00000100ef145ec8
Nov 26 15:16:55 serverX 00000100b2ffff58 ffffffff801419ce
Nov 26 15:16:55 serverX Call Trace:<ffffffff80141919>{__dequeue_signal+347} <ffffffff801419ce>{dequeue_signal+58}
Nov 26 15:16:55 serverX <ffffffff8014351a>{get_signal_to_deliver+338} <ffffffff8010f6fb>{do_signal+131}
Nov 26 15:16:55 serverX <ffffffff8030c0f1>{thread_return+88} <ffffffff801102f3>{sysret_signal+28}
Nov 26 15:16:55 serverX <ffffffff801105df>{ptregscall_common+103}
Nov 26 15:16:55 serverX
Nov 26 15:16:55 serverX Code: 48 89 50 08 48 89 02 48 c7 41 08 00 02 20 00 48 8b 7b 38 48
Nov 26 15:16:55 serverX RIP <ffffffff80141147>{free_uid+45} RSP <00000100b2fffd98>
Nov 26 15:16:55 serverX CR2: 0000000000100108
---%----------------------------------------------------------------------

---%----------------------------------------------------------------------
Jan 28 13:54:10 serverX Unable to handle kernel paging request at 0000000000100108 RIP:
Jan 28 13:54:10 serverX <ffffffff80141147>{free_uid+45}
Jan 28 13:54:10 serverX PML4 3f0b5067 PGD 0
Jan 28 13:54:10 serverX Oops: 0002 [1] SMP
Jan 28 13:54:10 serverX CPU 0
Jan 28 13:54:10 serverX Modules linked in: netconsole netdump i2c_dev i2c_core sunrpc ide_dump cciss_dump scsi_dump diskdump zlib_deflate dm_round_robin dm_multipath button battery ac ohci_hcd hw_random k8_edac edac_mc floppy md5 ipv6 ext3 jbd lock_dlm(U) dlm(U) gfs(U) lock_harness(U) cman(U) mppVhba(U) cciss qla2300 qla2xxx scsi_transport_fc mppUpper(U) sg sd_mod scsi_mod dm_snapshot dm_mirror dm_mod bonding(U) tg3 e1000
Jan 28 13:54:10 serverX Pid: 8271, comm: smbd Not tainted 2.6.9-55.0.2.ELsmp
Jan 28 13:54:10 serverX RIP: 0010:[<ffffffff80141147>] <ffffffff80141147>{free_uid+45}
Jan 28 13:54:10 serverX RSP: 0018:0000010075129d98 EFLAGS: 00010002
Jan 28 13:54:10 serverX RAX: 0000000000100100 RBX: 00000100d12df700 RCX: 00000100d12df720
Jan 28 13:54:10 serverX RDX: 0000000000200200 RSI: ffffffff803e72e8 RDI: ffffffff803e72e8
Jan 28 13:54:10 serverX RBP: 0000010047184dd0 R08: 0000010075128000 R09: 0000000300000000
Jan 28 13:54:10 serverX R10: 0000000300000000 R11: 0000000000000246 R12: 0000010075129e78
Jan 28 13:54:10 serverX R13: 000000000000000a R14: 0000000000000000 R15: 0000010023afcec8
Jan 28 13:54:10 serverX FS: 0000002a96fdc080(0000) GS:ffffffff804edb80(0000) knlGS:000000000820d8c0
Jan 28 13:54:10 serverX CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 28 13:54:10 serverX CR2: 0000000000100108 CR3: 0000000000101000 CR4: 00000000000006e0
Jan 28 13:54:10 serverX Process smbd (pid: 8271, threadinfo 0000010075128000, task 0000010023afc7f0)
Jan 28 13:54:10 serverX Stack: 0000000000000000 0000000000000006 00000100f17d7838 ffffffff80141919
Jan 28 13:54:10 serverX 0000000000000000 0000010075129e78 0000010023afc7f0 0000010023afcec8
Jan 28 13:54:10 serverX 0000010075129f58 ffffffff801419ce
Jan 28 13:54:10 serverX Call Trace:<ffffffff80141919>{__dequeue_signal+347} <ffffffff801419ce>{dequeue_signal+58}
Jan 28 13:54:10 serverX <ffffffff8014351a>{get_signal_to_deliver+338} <ffffffff8010f6fb>{do_signal+131}
Jan 28 13:54:10 serverX <ffffffff8030c0f1>{thread_return+88} <ffffffff801102f3>{sysret_signal+28}
Jan 28 13:54:10 serverX <ffffffff801105df>{ptregscall_common+103}
Jan 28 13:54:10 serverX
Jan 28 13:54:10 serverX Code: 48 89 50 08 48 89 02 48 c7 41 08 00 02 20 00 48 8b 7b 38 48
Jan 28 13:54:10 serverX RIP <ffffffff80141147>{free_uid+45} RSP <0000010075129d98>
Jan 28 13:54:10 serverX CR2: 0000000000100108
Jan 28 13:54:26 serverY winbindd[14364]: [2008/01/28 14:08:05, 0] nsswitch/idmap_ldap.c:idmap_ldap_allocate_id(469)
Jan 28 13:54:26 serverY winbindd[14364]: Cannot allocate gid above 30000!
Jan 28 13:54:26 serverY smbd[27838]: [2008/01/28 14:08:05, 0] auth/auth_util.c:create_builtin_administrators(792)
Jan 28 13:54:26 serverY smbd[27838]: create_builtin_administrators: Failed to create Administrators
Jan 28 13:54:26 serverY winbindd[14364]: [2008/01/28 14:08:05, 0] nsswitch/idmap_ldap.c:idmap_ldap_allocate_id(469)
Jan 28 13:54:26 serverY winbindd[14364]: Cannot allocate gid above 30000!
Jan 28 13:54:26 serverY smbd[27838]: [2008/01/28 14:08:05, 0] auth/auth_util.c:create_builtin_users(758)
Jan 28 13:54:26 serverY smbd[27838]: create_builtin_users: Failed to create Users
Jan 28 13:54:30 serverY CMAN: removing node serverX from the cluster : Missed too many heartbeats
Jan 28 13:54:30 serverY fenced[1805]: serverX_local not a cluster member after 0 sec post_fail_delay
Jan 28 13:54:30 serverY fenced[1805]: fencing node "serverX"
Jan 28 13:54:30 serverY fence_manual: Node serverX needs to be reset before recovery can procede. Waiting for serverX to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n serverX)
---%----------------------------------------------------------------------

We have a vmcore saved from the last system crash.
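
As a reading aid for the two traces, the sketch below decodes the page-fault error code ("Oops: 0002") and compares the faulting address with the kernel's list-poison values. It is not part of the original report. It assumes the x86 page-fault error-code bit layout and the LIST_POISON1/LIST_POISON2 constants from include/linux/list.h in stock 2.6 kernels (0x00100100 and 0x00200200); the function names are made up for illustration.

```python
# Values copied from the oops output in this report (identical in both crashes).
OOPS_ERROR_CODE = 0x0002
CR2 = 0x0000000000100108
RAX = 0x0000000000100100
RDX = 0x0000000000200200

# Assumption: constants as defined in include/linux/list.h of stock 2.6 kernels.
LIST_POISON1 = 0x00100100
LIST_POISON2 = 0x00200200
# Assumption: on x86_64, list_head.prev sits 8 bytes after list_head.next.
PREV_OFFSET = 8


def decode_page_fault(code):
    """Split an x86 page-fault error code into its three low bits."""
    return {
        "page_present": bool(code & 0x1),  # 0 means the page was not mapped
        "write": bool(code & 0x2),         # 1 means the access was a write
        "user_mode": bool(code & 0x4),     # 0 means the fault hit in kernel mode
    }


def looks_like_poisoned_list_del(cr2, next_ptr, prev_ptr):
    """True if the fault matches a write to next->prev with poisoned pointers."""
    return (
        next_ptr == LIST_POISON1
        and prev_ptr == LIST_POISON2
        and cr2 == next_ptr + PREV_OFFSET
    )


if __name__ == "__main__":
    print(decode_page_fault(OOPS_ERROR_CODE))
    print(looks_like_poisoned_list_del(CR2, RAX, RDX))
```

If those assumptions hold, both crashes are a kernel-mode write to an unmapped address that equals LIST_POISON1 + 8, which would be consistent with list_del() running on an entry that had already been removed from its list. This is only a reading of the register dump, not a confirmed root cause; the vmcore would be needed to verify it.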
Kernel crashes are not a Samba bug. There might be a memory leak, but that should show up much earlier in "top" or as excessive swapping activity. Marking this as invalid. If there is a memory leak, please re-open with more info, such as which process is chewing memory.

Volker
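
For collecting the "which process chews memory" information, a minimal sketch along the following lines could be run periodically and the outputs compared over time. It assumes a Linux /proc filesystem that exposes Name and VmRSS in /proc/<pid>/status and a Python 3 interpreter; the function names parse_status and top_memory_processes are hypothetical and not part of Samba or any existing tool.

```python
import os


def parse_status(status_text):
    """Extract (process name, resident set size in kB) from /proc/<pid>/status text."""
    name, rss_kb = None, None
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            name = value.strip()
        elif key == "VmRSS":
            fields = value.split()
            if fields and fields[0].isdigit():
                rss_kb = int(fields[0])
    return name, rss_kb


def top_memory_processes(proc_root="/proc", limit=10):
    """Return up to `limit` (rss_kb, pid, name) tuples, largest RSS first."""
    results = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return results  # no /proc available on this system
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                name, rss_kb = parse_status(fh.read())
        except OSError:
            continue  # the process exited or is not readable
        if rss_kb is not None:  # kernel threads have no VmRSS line
            results.append((rss_kb, int(entry), name))
    results.sort(reverse=True)
    return results[:limit]


if __name__ == "__main__":
    for rss_kb, pid, name in top_memory_processes():
        print("%8d kB  pid %-6d %s" % (rss_kb, pid, name))
```

A process whose RSS keeps growing between runs (for example an smbd or winbindd instance) would be the one to name when re-opening the report.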