Bug 3753 - smbd may hang in initgroups() call (when libnss_ldap is used)
Summary: smbd may hang in initgroups() call (when libnss_ldap is used)
Status: RESOLVED WORKSFORME
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: File Services (show other bugs)
Version: 3.0.22
Hardware: x86 Linux
: P3 major
Target Milestone: none
Assignee: Samba Bugzilla Account
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-05-04 16:17 UTC by Petter Osterlund
Modified: 2006-05-16 15:43 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Petter Osterlund 2006-05-04 16:17:47 UTC
(Note: this may be same as #3545 but I am not sure.
 I have more information however).

Problem
=======

I run a stable Debian Sarge installation but using
Samba 3.0.22. Using LDAP as backend. I have recently
migrated from NT PDC to Samba PDC. But I am
experiencing problems with samba hangs which causes
the entire network (NT) to hang. The problem is
intermittent. But I can now reproduce it by issuing
smbclient commands more or less simultaneously from
two processes.

The problem is causing severe problems when it
occures.

How I reproduce
===============

By doing the following commands I can reproduce the fault
every time.

   /etc/init.d/samba restart
   
   # Test that all is ok
   smbclient -L presul1a -Uid%pw

   # In separate shell start a contious loop
   while true do; smbclient -L presul1a -Uid%pw; done

   # Trigg the hanging by doing one or a few concurrently
   # Doing this will hang smbd
   smbclient -L presul1a -Uid%pw

   # smbd not completely dead it spawns new smbd processes
   # and can work annonumusly like this
   smbclient -L presul1a -U%


How to get it up running again
==============================

* Do a kill -6 on all PIDS except for the two initially spawned
  processes.
* /etc/init.d/slapd restart

The kill -6 does not work all the time, I think it is due to that
there is a lot of connections tried all the time and the same fault
occurs again directly once the fault situation has been cleared.

Strange side effect of slapd restart
------------------------------------

The restart of ldap server has a very interresting side effect;
I cannot reproduce the fault after having done this! In fact,
this procedure seems to prevent the fault in my tests:

   /etc/init.d/samba restart
   smbclient -L presul1a -Uid%pw
   /etc/init.d/slapd restart

   # In separate shell start a contious loop
   while true do; smbclient -L presul1a -Uid%pw; done

   # Seems to not cause the fault this time
   smbclient -L presul1a -Uid%pw

More in depth logs and analysis
===============================


When doing a single command like this

 smbclient -L presul1a -Uid%pw

then I get a trace that looks like this (part of it):

-------------
[2006/05/04 19:00:46, 3] smbd/sec_ctx.c:set_sec_ctx(288)
  setting sec ctx (0, 0) - sec_ctx_stack_ndx = 1
[2006/05/04 19:00:46, 5] auth/auth_util.c:debug_nt_user_token(433)
  NT user token: (NULL)
[2006/05/04 19:00:46, 5] auth/auth_util.c:debug_unix_user_token(454)
  UNIX token of user 0
  Primary group is 0 and contains 0 supplementary groups
[2006/05/04 19:00:46, 10] lib/system_smbd.c:sys_getgrouplist(167)
  sys_getgrouplist: user [petos]
[2006/05/04 19:00:46, 10] lib/system_smbd.c:sys_getgrouplist(176)
  sys_getgrouplist(): disabled winbindd for group lookup [user == petos]
[2006/05/04 19:00:46, 3] smbd/sec_ctx.c:push_sec_ctx(256)
  push_sec_ctx(0, 0) : sec_ctx_stack_ndx = 2
[2006/05/04 19:00:46, 3] smbd/uid.c:push_conn_ctx(393)
  push_conn_ctx(0) : conn_ctx_stack_ndx = 1
[2006/05/04 19:00:46, 3] smbd/sec_ctx.c:set_sec_ctx(288)
  setting sec ctx (0, 0) - sec_ctx_stack_ndx = 2
[2006/05/04 19:00:46, 5] auth/auth_util.c:debug_nt_user_token(433)
  NT user token: (NULL)
[2006/05/04 19:00:46, 5] auth/auth_util.c:debug_unix_user_token(454)
  UNIX token of user 0
  Primary group is 0 and contains 0 supplementary groups
[2006/05/04 19:00:46, 8] lib/system_smbd.c:remove_duplicate_gids(49)
  remove_duplicate_gids: Enter 15 gids
[2006/05/04 19:00:46, 8] lib/system_smbd.c:remove_duplicate_gids(67)
  remove_duplicate_gids: Exit 14 gids
[2006/05/04 19:00:46, 3] smbd/sec_ctx.c:pop_sec_ctx(386)
  pop_sec_ctx (0, 0) - sec_ctx_stack_ndx = 1
[2006/05/04 19:00:46, 3] smbd/sec_ctx.c:push_sec_ctx(256)
  push_sec_ctx(0, 0) : sec_ctx_stack_ndx = 2
[2006/05/04 19:00:46, 3] smbd/uid.c:push_conn_ctx(393)
-------------

When the fault occures (hanging) I get this once I also do
a pkill -6 smbd:

-------------
[2006/05/04 18:20:28, 3] smbd/sec_ctx.c:set_sec_ctx(288)
  setting sec ctx (0, 0) - sec_ctx_stack_ndx = 1
[2006/05/04 18:20:28, 5] auth/auth_util.c:debug_nt_user_token(433)
  NT user token: (NULL)
[2006/05/04 18:20:28, 5] auth/auth_util.c:debug_unix_user_token(454)
  UNIX token of user 0
  Primary group is 0 and contains 0 supplementary groups
[2006/05/04 18:20:28, 10] lib/system_smbd.c:sys_getgrouplist(167)
  sys_getgrouplist: user [petos]
[2006/05/04 18:20:28, 10] lib/system_smbd.c:sys_getgrouplist(176)
  sys_getgrouplist(): disabled winbindd for group lookup [user == petos]
[2006/05/04 18:20:28, 3] smbd/sec_ctx.c:push_sec_ctx(256)
  push_sec_ctx(0, 0) : sec_ctx_stack_ndx = 2
[2006/05/04 18:20:28, 3] smbd/uid.c:push_conn_ctx(393)
  push_conn_ctx(0) : conn_ctx_stack_ndx = 1
[2006/05/04 18:20:28, 3] smbd/sec_ctx.c:set_sec_ctx(288)
  setting sec ctx (0, 0) - sec_ctx_stack_ndx = 2
[2006/05/04 18:20:28, 5] auth/auth_util.c:debug_nt_user_token(433)
  NT user token: (NULL)
[2006/05/04 18:20:28, 5] auth/auth_util.c:debug_unix_user_token(454)
  UNIX token of user 0
  Primary group is 0 and contains 0 supplementary groups
[2006/05/04 18:20:39, 0] lib/fault.c:fault_report(36)
  ===============================================================
[2006/05/04 18:20:39, 0] lib/fault.c:fault_report(37)
  INTERNAL ERROR: Signal 6 in pid 15115 (3.0.22)
  Please read the Trouble-Shooting section of the Samba3-HOWTO
[2006/05/04 18:20:39, 0] lib/fault.c:fault_report(39)

  From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
[2006/05/04 18:20:39, 0] lib/fault.c:fault_report(40)
  ===============================================================
[2006/05/04 18:20:39, 0] lib/util.c:smb_panic2(1554)
  PANIC: internal error
[2006/05/04 18:20:39, 0] lib/util.c:smb_panic2(1562)
  BACKTRACE: 42 stack frames:
   #0 /usr/sbin/smbd(smb_panic2+0x111) [0x8212181]
   #1 /usr/sbin/smbd(smb_panic+0x1a) [0x821206a]
   #2 /usr/sbin/smbd [0x81fea14]
   #3 [0xffffe420]
   #4 /usr/lib/libldap_r.so.2 [0x4002acef]
   #5 /lib/libnss_ldap.so.2 [0x40745f8a]
   #6 /lib/libnss_ldap.so.2 [0x407465b2]
   #7 /lib/libnss_ldap.so.2 [0x40747093]
   #8 /lib/libnss_ldap.so.2 [0x407494f5]
   #9 /lib/libnss_ldap.so.2 [0x4074936a]
   #10 /lib/libnss_ldap.so.2 [0x40746552]
   #11 /lib/libnss_ldap.so.2 [0x40747093]
   #12 /lib/libnss_ldap.so.2(_nss_ldap_initgroups_dyn+0x14c) [0x407496cc]
   #13 /lib/tls/libc.so.6 [0x4022eb54]
   #14 /lib/tls/libc.so.6(initgroups+0x6b) [0x4022ea4b]
   #15 /usr/sbin/smbd [0x81f0bb4]
   #16 /usr/sbin/smbd [0x81f0d84]
   #17 /usr/sbin/smbd [0x81f0ed6]
   #18 /usr/sbin/smbd(pdb_default_enum_group_memberships+0x29) [0x81f1039]
   #19 /usr/sbin/smbd [0x81e1cf5]
   #20 /usr/sbin/smbd [0x81d4134]
   #21 /usr/sbin/smbd(pdb_enum_group_memberships+0x4e) [0x81d5e7e]
   #22 /usr/sbin/smbd [0x824f202]
   #23 /usr/sbin/smbd [0x824f35e]
   #24 /usr/sbin/smbd(make_server_info_sam+0x140) [0x824f650]
   #25 /usr/sbin/smbd [0x824a5f4]
   #26 /usr/sbin/smbd [0x824a8f1]
   #27 /usr/sbin/smbd [0x8247d9f]
   #28 /usr/sbin/smbd [0x8251a30]
   #29 /usr/sbin/smbd [0x8119d81]
   #30 /usr/sbin/smbd(ntlmssp_update+0x19b) [0x81192bb]
   #31 /usr/sbin/smbd(auth_ntlmssp_update+0x3d) [0x8251d5d]
   #32 /usr/sbin/smbd [0x80b7446]
   #33 /usr/sbin/smbd [0x80b78e6]
   #34 /usr/sbin/smbd(reply_sesssetup_and_X+0xd8e) [0x80b89fe]
   #35 /usr/sbin/smbd [0x80e0d18]
   #36 /usr/sbin/smbd [0x80e0f60]
   #37 /usr/sbin/smbd(process_smb+0x8c) [0x80e116c]
   #38 /usr/sbin/smbd(smbd_process+0x155) [0x80e2055]
   #39 /usr/sbin/smbd(main+0x563) [0x82a03d3]
   #40 /lib/tls/libc.so.6(__libc_start_main+0xf4) [0x4019d974]
   #41 /usr/sbin/smbd [0x807ee91]
-----------------


Comparing the two logs shows that they are identical up to the
point of panic (caused by the kill -6). The log from the fully
functioning invocation continues like this:

[2006/05/04 19:00:46, 8] lib/system_smbd.c:remove_duplicate_gids(49)
  remove_duplicate_gids: Enter 15 gids

Together with surrounding the log entries and study of the code
one may thus conclude the following functions is involved:

 sys_getgrouplist()
   getgrouplist_internals()
     initgroups()
     remove_duplicate_gids()

The call to initgroups() is apperently never returning for some
reason - the process hangs. The backtrace when issuing a kill -6
shows that we indeed are involved in the initgroups() call which
takes the path via libnss_ldap.so to do its work.

As we know, doing a slapd restart resolves the problem for us,
but then also killing the smbd processes involved in the hanging.

Also the peculiar side effect that the problem is not (or very hard)
to reproduce once slapd is restarted and samba is kept running.

Can it for this reason be that different forked smbd processes
shares a common file handle to slapd and after the slapd restart
each process gets its own handle....???

regards Petter
Comment 1 Petter Osterlund 2006-05-09 12:06:40 UTC
I think I have located the bug to the nss_ldap package (file libnss_ldap-2.3.2.so).

I have registered in BugZilla at www.padl.com with #271 for this issue.
I do not know if the fault is limited to debian/sarge.

The bug is causes the smbd child processes to share the parents (and
thus each others) nss_ldap socket to the LDAP server. And thus when two
smbd child processes is forked close to each other the fault is likely to
happen, making the clients to mess up the connection to the ldap server.

The bug is that ldap-nss.c should have registered fork hooks via the
pthread_atfork() function and close the child's socket after the fork.
The reason for this function is not called is due to a week reference to
the function, __pthread_atfork(), is used instead. But at least on debian/sarge
this weak reference is NULL unless some code actually made a reference to the
pthread_atfork() symbol. And thus no fork hooks is registered.

This fault also explans the numerous (every smbd fork) log entries:
  smbd[18358]: nss_ldap: reconnecting to LDAP server...
  smbd[18358]: nss_ldap: reconnected to LDAP server after 1 attempt(s)
Since the message implies exactly that confusion has occured, a search has
been attempted but failed. A new reopen of conection succeeds.

Possible workarounds:
* Restart slapd after samba has been started. This closes the ldap connection
in the master smbd process so that clients cannot share the socket,
* or rebuild the libnss_ldap source package, like this (e.g:)
  LIBS=-lpthread apt-src build libnss-ldap
  This makes the macro HAVE_PTHREAD_ATFORK to be defined and then
  ldap-nss.c will explicitly call pthread_atfork(). This fix also
  takes care of the nss-ldap log entries!

Comment 2 Gerald (Jerry) Carter (dead mail address) 2006-05-16 15:43:53 UTC
closing since this is PADL's bug.