Bug 6373 - winbindd INTERNAL ERROR, core dump
Summary: winbindd INTERNAL ERROR, core dump
Status: RESOLVED FIXED
Alias: None
Product: Samba 3.2
Classification: Unclassified
Component: Winbind
Version: 3.2.5
Hardware: x86 Linux
Importance: P3 critical
Target Milestone: ---
Assignee: Volker Lendecke
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-05-19 07:20 UTC by Dan Searle
Modified: 2009-06-05 04:14 UTC
CC List: 1 user

See Also:


Attachments
winbind debug logs (550.53 KB, application/octet-stream)
2009-05-19 08:09 UTC, Dan Searle
patch for a race condition (1.74 KB, patch)
2009-05-24 12:38 UTC, Volker Lendecke

Description Dan Searle 2009-05-19 07:20:32 UTC
This is winbindd from the Debian Lenny package version 2:3.2.5-4lenny2.

It is interfacing with a Windows 2003 Active Directory, authenticating users for the ntlm_auth helper.

/etc/samba/smb.conf:
[global]
workgroup = ALSOP-HIGH
netbios name = CENSORNET
realm = ALSOP-HIGH.LOCALDOMAIN
security = ads
encrypt passwords = yes
password server = ahsserver3.alsop-high.localdomain
winbind separator = /
idmap uid = 10000-40000
idmap gid = 10000-40000
winbind enum users = yes
winbind enum groups = yes

/var/log/samba/log.winbind:
[2009/05/19 11:12:17,  0] winbindd/winbindd.c:process_loop(955)
  winbindd: Exceeding 200 client connections, no idle connection found
[2009/05/19 11:12:17,  0] winbindd/winbindd.c:process_loop(955)
  winbindd: Exceeding 200 client connections, no idle connection found
[2009/05/19 11:12:17,  0] winbindd/winbindd.c:process_loop(955)
  winbindd: Exceeding 200 client connections, no idle connection found
[2009/05/19 11:12:17,  0] winbindd/winbindd.c:process_loop(955)
  winbindd: Exceeding 200 client connections, no idle connection found
[2009/05/19 11:12:17,  0] winbindd/winbindd.c:process_loop(974)
  winbindd: Exceeding 200 client connections, no idle connection found
[2009/05/19 11:12:17,  0] lib/fault.c:fault_report(40)
  ===============================================================
[2009/05/19 11:12:17,  0] lib/fault.c:fault_report(41)
  INTERNAL ERROR: Signal 11 in pid 2635 (3.2.5)
  Please read the Trouble-Shooting section of the Samba3-HOWTO
[2009/05/19 11:12:17,  0] lib/fault.c:fault_report(43)

  From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
[2009/05/19 11:12:17,  0] lib/fault.c:fault_report(44)
  ===============================================================
[2009/05/19 11:12:17,  0] lib/util.c:smb_panic(1663)
  PANIC (pid 2635): internal error
[2009/05/19 11:12:17,  0] lib/util.c:log_stack_trace(1767)
  BACKTRACE: 13 stack frames:
   #0 /usr/sbin/winbindd(log_stack_trace+0x2d) [0x812b264]
   #1 /usr/sbin/winbindd(smb_panic+0x80) [0x812b3c1]
   #2 /usr/sbin/winbindd [0x8118bf3]
   #3 [0xb7f50400]
   #4 /usr/sbin/winbindd(async_domain_request+0x57) [0x80bcd72]
   #5 /usr/sbin/winbindd(sendto_domain+0x46) [0x80bce91]
   #6 /usr/sbin/winbindd(winbindd_pam_auth_crap+0x1ee) [0x80a8f8e]
   #7 /usr/sbin/winbindd [0x808c52d]
   #8 /usr/sbin/winbindd [0x808c63b]
   #9 /usr/sbin/winbindd [0x808ceeb]
   #10 /usr/sbin/winbindd(main+0xfa2) [0x808df59]
   #11 /lib/i686/cmov/libc.so.6(__libc_start_main+0xe5) [0xb7c51455]
   #12 /usr/sbin/winbindd [0x808b841]
[2009/05/19 11:12:17,  0] lib/fault.c:dump_core(201)
  dumping core in /var/log/samba/cores/winbindd

I have the core file if anyone is interested.
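For reference, a backtrace with symbols could be pulled from that core roughly like this (the exact core file name and the presence of debugging symbols are assumptions on my part, not verified here):

  gdb /usr/sbin/winbindd /var/log/samba/cores/winbindd/core
  (gdb) bt full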

This is a serious bug that's a show-stopper for us; can you please address it ASAP?

Regards, Dan...
Comment 1 Volker Lendecke 2009-05-19 07:38:44 UTC
As a first measure I would like to ask you to set "winbind enum users" and "winbind enum groups" to no. For a pure squid proxy helper I don't really see why you would need that.
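In smb.conf terms that would simply be:

  winbind enum users = no
  winbind enum groups = no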

Then, can you please somehow install debugging symbols so that we get the line number of the crash?

And, since you are using it for a proxy, you might have more than one proxy server and be able to run one of them with severely degraded performance, namely under valgrind --tool=memcheck. That would help tremendously in finding the crash.

Alternatively, having a debug level 10 log leading to that crash would also greatly help, although this also degrades performance significantly.
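To sketch what I mean (package name and paths are from memory, so treat them as assumptions and adapt them to your install): install the debugging symbols, stop the packaged daemon, then run winbindd interactively at debug level 10, optionally under valgrind:

  apt-get install samba-dbg                # Debian debugging symbols, name assumed
  /etc/init.d/winbind stop
  valgrind --tool=memcheck /usr/sbin/winbindd -i -d 10 > /tmp/winbindd-valgrind.log 2>&1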

Thanks,

Volker
Comment 2 Dan Searle 2009-05-19 08:07:15 UTC
Firstly, I have switched off user/group enumeration.

I'm not sure if this is related to the crash, but after about 60 seconds of normal operation the winbindd daemon starts chewing 100% CPU. I started winbindd with -d 10 debugging enabled; see the attached log files for details.

Dan...
Comment 3 Dan Searle 2009-05-19 08:09:50 UTC
Created attachment 4169 [details]
winbind debug logs

winbind debug logs
Comment 4 Volker Lendecke 2009-05-19 08:14:44 UTC
Hmmm. What I would need is the logfile *before* a crash, not after.

And how many ntlm_auth processes do you have beating on winbind? Is there a way to reduce that number to something way below 200? There's no point in having so many ntlm_auth processes waiting; it all has to go through a single winbind child process anyway.

Volker
Comment 5 Dan Searle 2009-05-19 08:19:52 UTC
Those debug logs show what happens before and after the winbindd process starts chewing 100% CPU; they are not related to the crash as such, which I have not managed to re-create yet. I was just curious whether you could spot anything in them that indicates why winbindd started chewing 100% CPU.

Also, I only have 50 ntlm_auth processes running, so I have no idea why it's showing 200+ client connections; this tells me there's something strange going on.

Comment 6 Volker Lendecke 2009-05-19 08:29:50 UTC
Please reduce those 50 to something like 10 or so. There is just no point in feeding more to an already overloaded winbind. In the future we will have winbind connect to more than one DC or with more than one connection to a single DC, but until then I would even say more than 5 ntlm_auth processes won't gain you anything.
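If the helpers are started by Squid, that would be something along these lines in squid.conf (the ntlm_auth path is just an example, adapt it to your install):

  auth_param ntlm program /usr/bin/ntlm_auth --helper-protocol=squid-2.5-ntlmssp
  auth_param ntlm children 5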

This does not solve the crash, which we have to diagnose separately, but it might be a workaround for you at this moment.

Volker
Comment 7 Volker Lendecke 2009-05-24 12:38:28 UTC
Created attachment 4197 [details]
patch for a race condition

I've tried to reproduce your crash. I've come across a different race condition that in my case led to a different panic. But as it also happened when I artificially created an overload situation, I would not be entirely surprised if the patch fixes your crash as well.

Can you give it a try?

Thanks a lot,

Volker
Comment 8 Volker Lendecke 2009-06-05 03:18:49 UTC
Any info here? Is this fixed for you?

Volker
Comment 9 Dan Searle 2009-06-05 03:58:20 UTC
Apologies, I have not had the chance to test this. We've moved to basic Kerberos auth instead, and have another solution in the pipeline to avoid explicit user auth altogether.

Many thanks for your assistance.
Comment 10 Volker Lendecke 2009-06-05 04:14:35 UTC
Damn, I would really have liked to fix this bug, but over the course of several days I was not able to reproduce it. I'm closing this as fixed on the assumption that my patch also fixes your bug.

Volker