Bug 5737 - Winbind dead. Winbind works fine during 1 month and this night the process died. There is bellow the error message in /var/log/messages
Summary: Winbind dead. Winbind works fine during 1 month and this night the process di...
Status: RESOLVED FIXED
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: winbind (show other bugs)
Version: 3.0.28
Hardware: x86 Linux
: P3 critical
Target Milestone: none
Assignee: Jeremy Allison
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-09-05 08:11 UTC by Olivier de Corlieu
Modified: 2008-09-24 11:07 UTC (History)
2 users (show)

See Also:


Attachments
The winbindd core dump seen in case of "fd" exhaustions (551 bytes, patch)
2008-09-20 04:55 UTC, hargagan
no flags Details
fixing the core dump in the case of file descritor exhaustion (821 bytes, patch)
2008-09-20 06:38 UTC, hargagan
no flags Details
Patch I've applied. (1.24 KB, patch)
2008-09-23 18:37 UTC, Jeremy Allison
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Olivier de Corlieu 2008-09-05 08:11:04 UTC
Sep  5 04:56:16 valencay winbindd[5300]: [2008/09/05 04:56:16, 0] lib/fault.c:fault_report(41)

Sep  5 04:56:16 valencay winbindd[5300]:   ===============================================================

Sep  5 04:56:16 valencay winbindd[5300]: [2008/09/05 04:56:16, 0] lib/fault.c:fault_report(42)

Sep  5 04:56:16 valencay winbindd[5300]:   INTERNAL ERROR: Signal 6 in pid 5300 (3.0.28-1.el5_2.1)

Sep  5 04:56:16 valencay winbindd[5300]:   Please read the Trouble-Shooting section of the Samba3-HOWTO

Sep  5 04:56:16 valencay winbindd[5300]: [2008/09/05 04:56:16, 0] lib/fault.c:fault_report(44)

Sep  5 04:56:16 valencay winbindd[5300]:

Sep  5 04:56:16 valencay winbindd[5300]:   From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf

Sep  5 04:56:16 valencay winbindd[5300]: [2008/09/05 04:56:16, 0] lib/fault.c:fault_report(45)

Sep  5 04:56:16 valencay winbindd[5300]:   ===============================================================

Sep  5 04:56:16 valencay winbindd[5300]: [2008/09/05 04:56:16, 0] lib/util.c:smb_panic(1655)

Sep  5 04:56:16 valencay winbindd[5300]:   PANIC (pid 5300): internal error

Sep  5 04:56:16 valencay winbindd[5300]: [2008/09/05 04:56:16, 0] lib/util.c:log_stack_trace(1759)

Sep  5 04:56:16 valencay winbindd[5300]:   BACKTRACE: 15 stack frames:

Sep  5 04:56:16 valencay winbindd[5300]:    #0 winbindd(log_stack_trace+0x2d) [0x683bad]

Sep  5 04:56:16 valencay winbindd[5300]:    #1 winbindd(smb_panic+0x5d) [0x683cdd]

Sep  5 04:56:16 valencay winbindd[5300]:    #2 winbindd [0x66e86a]

Sep  5 04:56:16 valencay winbindd[5300]:    #3 [0x110420]

Sep  5 04:56:16 valencay winbindd[5300]:    #4 [0x110402]

Sep  5 04:56:16 valencay winbindd[5300]:    #5 /lib/libc.so.6(gsignal+0x50) [0xa9fd10]

Sep  5 04:56:16 valencay winbindd[5300]:    #6 /lib/libc.so.6(abort+0x101) [0xaa1621]

Sep  5 04:56:16 valencay winbindd[5300]:    #7 winbindd(talloc_get_name+0x3e) [0x66715e]

Sep  5 04:56:16 valencay winbindd[5300]:    #8 winbindd(talloc_check_name+0x2c) [0x6673cc]

Sep  5 04:56:16 valencay winbindd[5300]:    #9 winbindd(talloc_check_name_abort+0x2c) [0x683e6c]

Sep  5 04:56:16 valencay winbindd[5300]:    #10 winbindd [0x61a5d8]

Sep  5 04:56:16 valencay winbindd[5300]:    #11 winbindd [0x5eb508]

Sep  5 04:56:16 valencay winbindd[5300]:    #12 winbindd(main+0x94d) [0x5ebf5d]

Sep  5 04:56:16 valencay winbindd[5300]:    #13 /lib/libc.so.6(__libc_start_main+0xdc) [0xa8cdec]

Sep  5 04:56:16 valencay winbindd[5300]:    #14 winbindd [0x5ea201]

Sep  5 04:56:16 valencay winbindd[5300]: [2008/09/05 04:56:16, 0] lib/fault.c:dump_core(181)

Sep  5 04:56:16 valencay winbindd[5300]:   dumping core in /var/log/samba/cores/winbindd

Sep  5 04:56:16 valencay winbindd[5300]:

Sep  5 04:56:16 valencay winbindd[8934]: [2008/09/05 04:56:16, 0] lib/util_sock.c:read_data(536)

Sep  5 04:56:16 valencay winbindd[8934]:   read_data: read failure for 2088. Error = Connection reset by peer

Sep  5 04:56:16 valencay winbindd[4967]: [2008/09/05 04:56:16, 0] lib/fault.c:fault_report(41)

Sep  5 04:56:16 valencay winbindd[4967]:   ===============================================================
Comment 1 Olivier de Corlieu 2008-09-05 08:15:02 UTC
Winbind works fine during 1 month and this night the process died. There is bellow the error message in /var/log/messages. I'm using Redhat Enterprise 5.1 on a DELL Poweredge 1950. The samba RPM package is 3.0.28-1.el5_2.1 and was installed RHN site. The samba server connected to AD domain ( windows 2003 server ) and we have 290 users which connect to it.
Comment 2 Volker Lendecke 2008-09-05 08:28:09 UTC
Can you try the patch from http://git.samba.org/?p=samba.git;a=commitdiff;h=c93d4296945

Thanks,

Volker
Comment 3 Tukaram 2008-09-15 00:42:24 UTC
In our setup too we are facing this issue. We are using Samba-3.0.28. We did apply the patch given comment #2,
http://git.samba.org/?p=samba.git;a=commitdiff;h=c93d4296945
but still we are seeing the issue. We are seeing this issue when touch 200+ simultaneous connections. Copying the log file entries below.

 winbindd: Exceeding 200 client connections, no idle connection found
[2008/09/13 17:25:10, 0] nsswitch/winbindd.c:process_loop(916)
  winbindd: Exceeding 200 client connections, no idle connection found
[2008/09/13 17:25:10, 0] lib/fault.c:fault_report(41)
  ===============================================================
[2008/09/13 17:25:10, 0] lib/fault.c:fault_report(42)
  INTERNAL ERROR: Signal 6 in pid 25570 (3.0.28-0-1902-SUSE-CODE10)
  Please read the Trouble-Shooting section of the Samba3-HOWTO
[2008/09/13 17:25:10, 0] lib/fault.c:fault_report(44)

  From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
[2008/09/13 17:25:10, 0] lib/fault.c:fault_report(45)
  ===============================================================
[2008/09/13 17:25:10, 0] lib/util.c:smb_panic(1633)
  PANIC (pid 25570): internal error
[2008/09/13 17:25:10, 0] lib/util.c:log_stack_trace(1737)
  BACKTRACE: 13 stack frames:
   #0 /usr/sbin/winbindd(log_stack_trace+0x2d) [0x800d5a0d]
   #1 /usr/sbin/winbindd(smb_panic+0x5d) [0x800d5b3d]
   #2 /usr/sbin/winbindd [0x800c0b9a]
   #3 [0xffffe420]
   #4 /lib/libc.so.6(abort+0x103) [0xb7c91ff3]
   #5 /usr/sbin/winbindd(talloc_get_name+0x3e) [0x800b963e]
   #6 /usr/sbin/winbindd(talloc_check_name+0x2c) [0x800b98ac]
   #7 /usr/sbin/winbindd(talloc_check_name_abort+0x2c) [0x800d5ccc]
   #8 /usr/sbin/winbindd [0x8006f1d2]
   #9 /usr/sbin/winbindd [0x80040298]
   #10 /usr/sbin/winbindd(main+0x92d) [0x80040ccd]
   #11 /lib/libc.so.6(__libc_start_main+0xdc) [0xb7c7d8ac]
   #12 /usr/sbin/winbindd [0x8003ed41]
[2008/09/13 17:25:10, 0] lib/fault.c:dump_core(181)
  dumping core in /var/log/samba/cores/winbindd

Please help us in resolving the issue.
Comment 4 Gerald (Jerry) Carter (dead mail address) 2008-09-17 09:39:51 UTC
Did you manually patch you RPM?  The backtrace still says 3.0.28-0-1902-SUSE-CODE10.
Comment 5 hargagan 2008-09-18 00:28:46 UTC
We are able to locate the root cause of the problem. Please correct us if we are wrong here. My analysis is as follows:

In our setup, the winbindd connections are reaching 1000+ due to some entity not responding (or responding slowly), as this is a stress environment. (We are working on finding the culprit). Now in this case if we get one more request, we are forking a child (fork_domain_child() in schedule_async_request()), where we are calling socketpair(). This call fails with an error "Could not open child pipe: Too many open files". Thus we call  the continuation function for all the requests present in the child->requests in the error path. This set the "state->finished" flag to "True" and comes out. Now in the next iteration of "process_loop()" we are freeing all the states but the "child->request" is not removed from the queue.

Now if async_main_request_sent() is called, the "talloc_check_name_abort()" does "smb_panic()". Or in case of a new request, the addition to child->requests using  DLIST_ADD_END() in "async_request()" is causing SIGSEGV.

As a fix, can we call the async_request_fail() in this failure path of schedule_async_request() after calling fork_domain_child(), similar to  Volker's patch in above comment ? Also let us know should we call this for all of the child->requests or only the current one is enough ?
Comment 6 hargagan 2008-09-20 04:55:33 UTC
Created attachment 3604 [details]
The winbindd core dump seen in case of "fd" exhaustions

The file descriptor are exhausted by the too many socket connections of winbindd (1000+). If a request times out now, the fork_domain_child() fails to create the socketpair() with error "Too many open files". Now a call to the continuation function doesn't free the child->requests.
In the fix calling the async_request_fail() in the failure path of fork_domain_child() in schedule_async_request(). This call will free the "state" from child->request and fd_event.
Comment 7 hargagan 2008-09-20 06:36:30 UTC
please ignore the above patch.
Comment 8 hargagan 2008-09-20 06:38:20 UTC
Created attachment 3605 [details]
fixing the core dump in the case of file descritor exhaustion
Comment 9 Jeremy Allison 2008-09-20 17:42:53 UTC
I will evaluate this for inclusion early next week - thanks !
Jeremy.
Comment 10 Jeremy Allison 2008-09-23 18:37:52 UTC
Created attachment 3619 [details]
Patch I've applied.

Your first patch was closer, but not quite correct. This is the patch I've applied to all branches. Please test !
Thanks,
Jeremy.
Comment 11 hargagan 2008-09-24 02:43:13 UTC
Jeremy,
Thanks for the reply. I tested your patch. It is working fine.
Comment 12 hargagan 2008-09-24 06:59:22 UTC
We are using 3.0.28 version of samba. Did you checkin the patch in this branch ? How do I get this patch ?
Comment 13 Jeremy Allison 2008-09-24 11:07:03 UTC
I checked this into all git trees, including 3_0. Talk to Jim McDonnough for details on getting this patch.
Jeremy.