This is winbindd from the Debian Lenny package version 2:3.2.5-4lenny2. It is interfacing with a Windows 2003 Active Directory, authenticating users for the ntlm_auth helper.

/etc/samba/smb.conf:

[global]
        workgroup = ALSOP-HIGH
        netbios name = CENSORNET
        realm = ALSOP-HIGH.LOCALDOMAIN
        security = ads
        encrypt passwords = yes
        password server = ahsserver3.alsop-high.localdomain
        winbind separator = /
        idmap uid = 10000-40000
        idmap gid = 10000-40000
        winbind enum users = yes
        winbind enum groups = yes

/var/log/samba/log.winbind:

[2009/05/19 11:12:17, 0] winbindd/winbindd.c:process_loop(955)
  winbindd: Exceeding 200 client connections, no idle connection found
[2009/05/19 11:12:17, 0] winbindd/winbindd.c:process_loop(955)
  winbindd: Exceeding 200 client connections, no idle connection found
[2009/05/19 11:12:17, 0] winbindd/winbindd.c:process_loop(955)
  winbindd: Exceeding 200 client connections, no idle connection found
[2009/05/19 11:12:17, 0] winbindd/winbindd.c:process_loop(955)
  winbindd: Exceeding 200 client connections, no idle connection found
[2009/05/19 11:12:17, 0] winbindd/winbindd.c:process_loop(974)
  winbindd: Exceeding 200 client connections, no idle connection found
[2009/05/19 11:12:17, 0] lib/fault.c:fault_report(40)
  ===============================================================
[2009/05/19 11:12:17, 0] lib/fault.c:fault_report(41)
  INTERNAL ERROR: Signal 11 in pid 2635 (3.2.5)
  Please read the Trouble-Shooting section of the Samba3-HOWTO
[2009/05/19 11:12:17, 0] lib/fault.c:fault_report(43)
  From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
[2009/05/19 11:12:17, 0] lib/fault.c:fault_report(44)
  ===============================================================
[2009/05/19 11:12:17, 0] lib/util.c:smb_panic(1663)
  PANIC (pid 2635): internal error
[2009/05/19 11:12:17, 0] lib/util.c:log_stack_trace(1767)
  BACKTRACE: 13 stack frames:
   #0 /usr/sbin/winbindd(log_stack_trace+0x2d) [0x812b264]
   #1 /usr/sbin/winbindd(smb_panic+0x80) [0x812b3c1]
   #2 /usr/sbin/winbindd [0x8118bf3]
   #3 [0xb7f50400]
   #4 /usr/sbin/winbindd(async_domain_request+0x57) [0x80bcd72]
   #5 /usr/sbin/winbindd(sendto_domain+0x46) [0x80bce91]
   #6 /usr/sbin/winbindd(winbindd_pam_auth_crap+0x1ee) [0x80a8f8e]
   #7 /usr/sbin/winbindd [0x808c52d]
   #8 /usr/sbin/winbindd [0x808c63b]
   #9 /usr/sbin/winbindd [0x808ceeb]
   #10 /usr/sbin/winbindd(main+0xfa2) [0x808df59]
   #11 /lib/i686/cmov/libc.so.6(__libc_start_main+0xe5) [0xb7c51455]
   #12 /usr/sbin/winbindd [0x808b841]
[2009/05/19 11:12:17, 0] lib/fault.c:dump_core(201)
  dumping core in /var/log/samba/cores/winbindd

I have the core file if anyone is interested. This is a serious bug and a show-stopper for us; can you please address it ASAP.

Regards,
Dan...
As a first measure, I would like to ask you to set "winbind enum users" and "winbind enum groups" to no. For a pure squid proxy helper I don't really see why you would need them.

Then, can you please install debugging symbols somehow, so that we get the line number of the crash?

Also, since you are using this for a proxy, you might have more than one proxy, and you might be able to run one of them with severely degraded performance, namely under valgrind --tool=memcheck. This would help tremendously in finding that crash. Alternatively, a debug level 10 log leading up to the crash would also help greatly, although this too degrades performance significantly.

Thanks,
Volker
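For reference, the change Volker asks for amounts to flipping the two enumeration options in the [global] section of the smb.conf posted above; a minimal sketch (the rest of the section stays as it is):

```ini
[global]
        ; ... existing settings unchanged ...
        ; disable full user/group enumeration -- ntlm_auth does not need it,
        ; and enumerating a large AD domain puts extra load on winbindd
        winbind enum users = no
        winbind enum groups = no
```

After editing, winbindd has to be restarted (e.g. /etc/init.d/winbind restart on Lenny) for the change to take effect.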
Firstly, I have switched off user/group enumeration. Not sure if this is related to the crash, however: after about 60 seconds of normal operation, the winbindd daemon starts chewing 100% CPU. I started winbindd with -d 10 debugging enabled; see the attached log files for details. Dan...
Created attachment 4169 [details]
winbind debug logs
Hmmm. What I would need is the logfile *before* a crash, not after. And, how many ntlm_auth processes do you have beating winbind? Is there a way to reduce that number to something way below 200? There's no point in having so many ntlm_auth processes waiting, it must all go through a single winbind child process anyway. Volker
Those debug logs show what happens before and after the winbindd process starts chewing 100% CPU; they are not related to the crash as such, which I have not managed to re-create yet. I was just curious whether you could spot anything in them that would indicate why winbindd started chewing 100% CPU. Also, I only have 50 ntlm_auth processes running, so I have no idea why it's showing 200+ client connections; this tells me there's something strange going on.
Please reduce those 50 to something like 10 or so. There is just no point in feeding more to an already overloaded winbind. In the future winbind will connect to more than one DC, or with more than one connection to a single DC, but until then I would even say that more than 5 ntlm_auth processes won't gain you anything. This does not solve the crash, which we have to diagnose separately, but it might be a workaround for you at this moment. Volker
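If the helpers are spawned by Squid's NTLM authenticator, their count is controlled by the auth_param directive in squid.conf. A minimal sketch along the lines Volker suggests (the helper path and protocol flag are assumptions for a typical Debian Lenny / Squid 2.x setup; check them against your install):

```ini
# squid.conf -- cap the number of ntlm_auth helpers winbindd has to serve
auth_param ntlm program /usr/bin/ntlm_auth --helper-protocol=squid-2.5-ntlmssp
auth_param ntlm children 5
auth_param ntlm keep_alive on
```

Reconfiguring Squid (squid -k reconfigure) applies the new helper count without a full restart.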
Created attachment 4197 [details]
patch for a race condition

I've tried to reproduce your crash. I've come across a different race condition that in my case led to a different panic. But as it also happened when I artificially created an overload situation, I would not be entirely surprised if the patch also fixes your crash. Can you give it a try? Thanks a lot, Volker
Any info here? Is this fixed for you? Volker
Apologies, I have not had the chance to test this. We've moved to basic Kerberos auth instead, and have another solution in the pipeline to avoid explicit user auth altogether. Many thanks for your assistance.
Damn, I would really have liked to fix this bug. But over the course of several days I was not able to reproduce it. I'm closing this as fixed on the assumption that my patch also fixes your bug. Volker