Bug 2387 - smbd hung processes
Summary: smbd hung processes
Status: RESOLVED INVALID
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: File Services
Version: 3.0.11
Hardware: Sparc Solaris
Importance: P3 normal
Target Milestone: none
Assignee: Samba Bugzilla Account
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-02-23 06:05 UTC by Roman Berjoza
Modified: 2005-02-25 04:06 UTC (History)
Description Roman Berjoza 2005-02-23 06:05:35 UTC
The problem is that the number of smbd processes is unusually high.
We serve an office of about 200 workstations with a samba-3.0.11 PDC, but the
number of smbd processes grows to 500 over a period of two days (with deadtime=20).
The symptoms look like those described in
http://lists.samba.org/archive/samba/2004-December/096674.html:
"smbstatus reports approximately the right number of clients, but ps 
shows a much larger number of smbd processes active.  Smbstatus 
reports a list of active smbd processes ... but there is a block of smbd 
processes ... that are not in the smbstatus report.  
The hung processes need to be kill -9'ed."
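The mismatch described in the quote (smbd processes that ps shows but smbstatus does not list) can be found by comparing the two PID sets. The following is only a minimal sketch: the sample text is a simplified, hypothetical stand-in for `ps -e -o pid,comm` and smbstatus output (the real column layout may differ), and the function names are made up for illustration. The PIDs are the ones that appear later in this report.

```python
import os


def smbd_pids_from_ps(ps_text):
    """Collect PIDs of smbd processes from 'PID COMMAND' style ps output."""
    pids = set()
    for line in ps_text.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # header or malformed line
        if os.path.basename(fields[1].strip()) == "smbd":
            pids.add(int(fields[0]))
    return pids


def pids_from_smbstatus(status_text):
    """Collect PIDs from lines whose first column is a number (assumed layout)."""
    pids = set()
    for line in status_text.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            pids.add(int(fields[0]))
    return pids


def unlisted_smbd_pids(ps_text, status_text):
    """Return smbd PIDs that ps reports but smbstatus does not, sorted."""
    return sorted(smbd_pids_from_ps(ps_text) - pids_from_smbstatus(status_text))


# Hypothetical, simplified sample data (not captured from the real system).
PS_SAMPLE = """\
  PID COMMAND
10110 /usr/local/samba3/sbin/smbd
13198 /usr/local/samba3/sbin/smbd
13201 /usr/local/samba3/sbin/smbd
21267 /usr/sbin/nscd
"""

STATUS_SAMPLE = """\
PID     Username   Group   Machine
-----------------------------------
13198   user1      staff   ws17
13201   user2      staff   ws42
"""

print(unlisted_smbd_pids(PS_SAMPLE, STATUS_SAMPLE))
```

The PIDs this prints are the candidates to inspect with gdb, truss, or pstack before resorting to kill -9.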
Fortunately, we have no file access, domain logon, or memory shortage problems.
netstat -an reports a large number of sockets in the CLOSE_WAIT state:
samba_pdc.63688  ldap_server.389   49640   0 49640      0 CLOSE_WAIT
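To see how many of those CLOSE_WAIT sockets point at the LDAP server, the netstat output can be tallied per remote endpoint. This is a rough sketch that assumes the seven-column Solaris layout of the line above (local address, remote address, two window/queue pairs, state). Only the first sample row comes from this report; the other two rows and the function name are hypothetical.

```python
from collections import Counter


def count_close_wait(netstat_text):
    """Count CLOSE_WAIT sockets per remote endpoint in `netstat -an` output."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed row layout: local, remote, swind, send-q, rwind, recv-q, state
        if len(fields) >= 7 and fields[-1] == "CLOSE_WAIT":
            counts[fields[1]] += 1
    return counts


# First row is from the report; the remaining rows are invented for illustration.
SAMPLE = """\
samba_pdc.63688  ldap_server.389   49640   0 49640      0 CLOSE_WAIT
samba_pdc.63689  ldap_server.389   49640   0 49640      0 CLOSE_WAIT
samba_pdc.139    ws17.1025         64240   0 49640      0 ESTABLISHED
"""

for remote, n in count_close_wait(SAMPLE).items():
    print(remote, n)
```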
(We use Solaris pam_unix with ldap'ed nsswitch.conf)
samba-3.0.11, configured "--with-pam", smbpasswd backend
SunOS samba_pdc 5.9 Generic_117171-17 sun4u sparc SUNW,UltraAX-i2
gcc 3.4.2
openldap 2.2.17
We have this situation only in the main office. Other, smaller offices (20
workstations) with the same hardware/software (except ldapsam) work fine.
Comment 1 Jeremy Allison 2005-02-23 11:16:34 UTC
For the processes that are hung and not in the status list can you
attach to them with gdb and get a backtrace of where they are?
Also attach with strace -p <pid> and see if they're doing any system
calls.
Please post the results here.
 
Thanks,

Jeremy.
Comment 2 Roman Berjoza 2005-02-24 02:51:21 UTC
Here is gdb output:
# gdb
GNU gdb 6.0
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.9".
(gdb) attach 10110
Attaching to process 10110
Reading symbols from /usr/local/samba3/sbin/smbd...done.
Reading symbols from /usr/lib/libthread.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libthread.so.1
Reading symbols from /usr/local/lib/libldap-2.2.so.7...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/local/lib/libldap-2.2.so.7
Reading symbols from /usr/local/lib/liblber-2.2.so.7...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/local/lib/liblber-2.2.so.7
Reading symbols from /usr/lib/libpam.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libpam.so.1
Reading symbols from /usr/lib/libsendfile.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libsendfile.so.1
Reading symbols from /usr/lib/libsec.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libsec.so.1
Reading symbols from /usr/lib/libgen.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libgen.so.1
Reading symbols from /usr/lib/libresolv.so.2...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libresolv.so.2
Reading symbols from /usr/lib/libsocket.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libsocket.so.1
Reading symbols from /usr/lib/libnsl.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libnsl.so.1
Reading symbols from /usr/lib/libdl.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libdl.so.1
Reading symbols from /usr/local/lib/libiconv.so.2...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/local/lib/libiconv.so.2
Reading symbols from /usr/local/lib/libpopt.so.0...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/local/lib/libpopt.so.0
Reading symbols from /usr/lib/libc.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libc.so.1
Reading symbols from /usr/lib/libcmd.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libcmd.so.1
Reading symbols from /usr/lib/libmp.so.2...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libmp.so.2
Reading symbols from /usr/platform/SUNW,UltraAX-i2/lib/libc_psr.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/platform/SUNW,UltraAX-i2/lib/libc_psr.so.1
Reading symbols from /usr/lib/nss_files.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/nss_files.so.1
Reading symbols from /usr/lib/nss_ldap.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/nss_ldap.so.1
Reading symbols from /usr/lib/libsldap.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libsldap.so.1
Reading symbols from /usr/lib/libldap.so.5...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libldap.so.5
Reading symbols from /usr/lib/libdoor.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libdoor.so.1
Reading symbols from /usr/lib/librt.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/librt.so.1
Reading symbols from /usr/lib/libmd5.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libmd5.so.1
Reading symbols from /usr/lib/libaio.so.1...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/lib/libaio.so.1
Reading symbols from /usr/local/samba3/lib/vfs/extd_audit.so...done.
warning: sol_thread_new_objfile: td_ta_new: Debugger service failed
Loaded symbols for /usr/local/samba3/lib/vfs/extd_audit.so
Retry #1:
Retry #2:
Retry #3:
Retry #4:
[New LWP 1]
Symbols already loaded for /usr/lib/libthread.so.1
Symbols already loaded for /usr/local/lib/libldap-2.2.so.7
Symbols already loaded for /usr/local/lib/liblber-2.2.so.7
Symbols already loaded for /usr/lib/libpam.so.1
Symbols already loaded for /usr/lib/libsendfile.so.1
Symbols already loaded for /usr/lib/libsec.so.1
Symbols already loaded for /usr/lib/libgen.so.1
Symbols already loaded for /usr/lib/libresolv.so.2
Symbols already loaded for /usr/lib/libsocket.so.1
Symbols already loaded for /usr/lib/libnsl.so.1
Symbols already loaded for /usr/lib/libdl.so.1
Symbols already loaded for /usr/local/lib/libiconv.so.2
Symbols already loaded for /usr/local/lib/libpopt.so.0
Symbols already loaded for /usr/lib/libc.so.1
Symbols already loaded for /usr/lib/libcmd.so.1
Symbols already loaded for /usr/lib/libmp.so.2
Symbols already loaded for /usr/platform/SUNW,UltraAX-i2/lib/libc_psr.so.1
Symbols already loaded for /usr/lib/nss_files.so.1
Symbols already loaded for /usr/lib/nss_ldap.so.1
Symbols already loaded for /usr/lib/libsldap.so.1
Symbols already loaded for /usr/lib/libldap.so.5
Symbols already loaded for /usr/lib/libdoor.so.1
Symbols already loaded for /usr/lib/librt.so.1
Symbols already loaded for /usr/lib/libmd5.so.1
Symbols already loaded for /usr/lib/libaio.so.1
Symbols already loaded for /usr/local/samba3/lib/vfs/extd_audit.so
0xff375e88 in __lwp_park () from /usr/lib/libthread.so.1
(gdb) bt
#0  0xff375e88 in __lwp_park () from /usr/lib/libthread.so.1
#1  0xff371c10 in mutex_lock_queue () from /usr/lib/libthread.so.1
#2  0xff372610 in slow_lock () from /usr/lib/libthread.so.1
#3  0xfef46d00 in malloc () from /usr/lib/libc.so.1

Here is truss output for a long period:
#truss -fp 10110
10110:      *** SUID: ruid/euid/suid = 0 / 60001 / 60001  ***
10110:      *** SGID: rgid/egid/sgid = 0 / 60001 / 60001  ***
10110:  lwp_park(0x00000000, 0)         (sleeping...)

Here is output of p* tools:
# pfiles  10110
10110:  /usr/local/samba3/sbin/smbd -D -s/usr/local/samba3/lib/smb.conf
  Current rlimit: 10020 file descriptors
   0: S_IFCHR mode:0666 dev:32,0 ino:21557 uid:0 gid:3 rdev:13,2
      O_RDWR|O_LARGEFILE
   1: S_IFCHR mode:0666 dev:32,0 ino:21557 uid:0 gid:3 rdev:13,2
      O_RDWR|O_LARGEFILE
   2: S_IFREG mode:0644 dev:32,0 ino:715682 uid:0 gid:0 size:2263066
      O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
   3: S_IFCHR mode:0644 dev:32,0 ino:35196 uid:0 gid:3 rdev:190,1
      O_RDONLY|O_LARGEFILE
   4: S_IFREG mode:0600 dev:32,0 ino:3621724 uid:0 gid:0 size:8192
      O_RDWR|O_LARGEFILE
   5: S_IFSOCK mode:0666 dev:239,0 ino:42453 uid:0 gid:0 size:0
      O_RDWR
        sockname: AF_INET 0.0.0.0  port: 0
   6: S_IFDOOR mode:0444 dev:245,0 ino:44 uid:0 gid:0 size:0
      O_RDONLY|O_LARGEFILE FD_CLOEXEC  door to nscd[21267]
   7: S_IFDOOR mode:0444 dev:245,0 ino:45 uid:0 gid:0 size:0
      O_RDONLY FD_CLOEXEC  door to ldap_cachemgr[20284]
   8: S_IFREG mode:0600 dev:32,0 ino:3621727 uid:0 gid:0 size:8192
      O_RDWR|O_LARGEFILE
      advisory write lock set by process 29362
   9: S_IFREG mode:0600 dev:32,0 ino:3621728 uid:0 gid:0 size:8192
      O_RDWR|O_LARGEFILE
  10: S_IFREG mode:0644 dev:32,0 ino:3621744 uid:0 gid:0 size:6
      O_WRONLY|O_NONBLOCK|O_CREAT|O_EXCL|O_LARGEFILE
      advisory write lock set by process 13198
  11: S_IFREG mode:0600 dev:32,0 ino:3621730 uid:0 gid:0 size:8192
      O_RDWR|O_LARGEFILE
      advisory read lock set by process 13201
  12: S_IFREG mode:0644 dev:32,0 ino:3621731 uid:0 gid:0 size:253952
      O_RDWR|O_LARGEFILE
      advisory read lock set by process 13198
  13: S_IFREG mode:0644 dev:32,0 ino:3621732 uid:0 gid:0 size:614400
      O_RDWR|O_LARGEFILE
      advisory read lock set by process 13198
  14: S_IFREG mode:0644 dev:32,0 ino:3621733 uid:0 gid:0 size:8192
      O_RDWR|O_LARGEFILE
      advisory read lock set by process 13198
  15: S_IFREG mode:0644 dev:32,0 ino:3621734 uid:0 gid:0 size:65536
      O_RDWR|O_LARGEFILE
      advisory read lock set by process 13198
  16: S_IFREG mode:0600 dev:32,0 ino:3621735 uid:0 gid:0 size:8192
      O_RDWR|O_LARGEFILE
  17: S_IFREG mode:0644 dev:32,0 ino:3621736 uid:0 gid:0 size:8192
      O_RDWR|O_LARGEFILE
  18: S_IFREG mode:0600 dev:32,0 ino:3621737 uid:0 gid:0 size:8192
      O_RDWR|O_LARGEFILE
  19: S_IFREG mode:0600 dev:32,0 ino:3621740 uid:0 gid:0 size:8192
      O_RDWR|O_LARGEFILE
  20: S_IFREG mode:0600 dev:32,0 ino:3621741 uid:0 gid:0 size:16384
      O_RDWR|O_LARGEFILE
  21: S_IFREG mode:0600 dev:32,0 ino:3621742 uid:0 gid:0 size:696
      O_RDWR|O_LARGEFILE
  22: S_IFREG mode:0644 dev:32,0 ino:715682 uid:0 gid:0 size:2263066
      O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
  23: S_IFSOCK mode:0666 dev:239,0 ino:33411 uid:0 gid:0 size:0
      O_RDWR
        sockname: AF_INET 127.0.0.1  port: 45860
  24: S_IFIFO mode:0000 dev:240,0 ino:808983 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK
  25: S_IFIFO mode:0000 dev:240,0 ino:808983 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK
  26: S_IFIFO mode:0000 dev:240,0 ino:818715 uid:0 gid:0 size:1
      O_RDWR|O_NONBLOCK
  27: S_IFIFO mode:0000 dev:240,0 ino:818715 uid:0 gid:0 size:0
      O_RDWR|O_NONBLOCK
  28: S_IFCHR mode:0666 dev:32,0 ino:21553 uid:0 gid:3 rdev:21,0
      O_WRONLY FD_CLOEXEC
  29: S_IFSOCK mode:0666 dev:239,0 ino:27063 uid:0 gid:0 size:0
      O_RDWR
        sockname: AF_INET our_samba_pdc  port: 59434
        peername: AF_INET our_ldap_server  port: 389

# pflags  10110
10110:  /usr/local/samba3/sbin/smbd -D -s/usr/local/samba3/lib/smb.conf
        data model = _ILP32  flags = PR_ORPHAN
  /1:   flags = PR_PCINVAL|PR_ASLEEP [ lwp_park(0x0,0x0,0x0) ]
  sigmask = 0x00011280,0x00000000

# pstack  10110
10110:  /usr/local/samba3/sbin/smbd -D -s/usr/local/samba3/lib/smb.conf
 ff375e88 lwp_park (0, 0, 0)
 ff371c08 mutex_lock_queue (ff388b44, 0, fefc0590, ff388000, 35, 0) + 104
 ff372608 slow_lock (fefc0590, fefd0000, 252e3e, fefbc000, 0, 0) + 58
 fef46cf8 malloc   (37, 0, 252e28, ffbfdc28, 0, 0) + 18
 0018d93c vasprintf (36, 252e28, ffbfdc28, fef97ec4, 13, 296bab) + 2c
 0018df0c x_vfprintf (2a7670, 252e28, ffbfdc28, fefc27b0, 1, ff00) + c
 001837d8 Debug1   (0, 0, 28e000, 252e28, 252e40, 252e50) + ec
 00183acc dbghdr   (1, 252e40, 252e50, 24, 0, 0) + 128
 00183b84 fault_report (a, 0, 0, 0, 0, 0) + 58
 00183d00 sig_fault (a, 0, ffbfdfc0, 0, 0, 0) + 4
 ff3760a0 __sighndlr (a, 0, ffbfdfc0, 183cfc, 0, 0) + c
 ff36fdd8 call_user_handler (a, 0, ffbfdfc0, 0, 0, 0) + 234
 ff36ff88 sigacthandler (a, 0, ffbfdfc0, 1, 1, ffbfe2f4) + 64
 --- called from signal handler with signal 10 (SIGBUS) ---
 fef47c08 _free_unlocked (185, 0, fec8c9a0, fefbc000, 1, ff00) + 40
 fef47bb8 free     (185, 0, 370888, 0, 0, 0) + 20
 fec95a10 ldap_set_lderrno (33ddc8, 0, 0, 0, ffbfe480, 0) + ec
 fecb0098 ldap_create_virtuallist_control (33ddc8, ffbfe460, ffbfe480, ffbfe484, 652c, ff00) + 198
 feda9b38 setup_vlv_params (33dc38, fedaa4d0, 3719c0, 8, fedaa4d0, ff00) + 110
 fedaa808 search_state_machine (5, 1, fedc0000, 8, d, e) + 2e8
 fedab5d8 __ns_ldap_firstEntry (33dc1c, 0, feddb3e4, 3, 0, 0) + 258
 feddaf34 _nss_ldap_getent (33dc00, ffbfeee0, feddaeb4, 0, 0, 0) + 80
 fef4ee14 nss_getent_u (fefc0608, 2f9798, fefc0628, ffbfeee0, 0, 0) + c8
 fef4e948 nss_getent (fefc0608, fef98cc0, fefc0628, ffbfeee0, 62, ff00) + 34
 fef99200 getpwent_r (2e7204, 2e7228, 400, 18a5a4, 2000, ffbfe9d0) + 4c
 0018a6a4 getpwent_list (2a7da0, 33fd8c, 33fd9c, 1c80, ff, ffbff1b8) + 168
 0010bd34 get_memberuids (0, ffbff054, ffbff050, ffbff05c, 33ecc0, 400) + 30
 0010c00c _samr_query_groupmem (33b038, ffbff2a0, ffbff278, 1, 14, 2e1510) + 20c
 001016c4 api_samr_query_groupmem (33b038, 1015e0, 28dcf0, 33b046, 0, 33fd78) + e4
 001196a8 api_rpcTNP (33b038, 33b046, 28dc84, 0, 0, 33dbe4) + 2c4
 0011932c api_pipe_request (33b038, 1, 33cae0, 14, 65, 2f74a0) + e8
 00113d18 process_request_pdu (33b038, 0, 1c, 0, 0, 33ba20) + 564
 00113f74 process_complete_pdu (33b038, 1c, 0, 2, 0, 33ba20) + 218
 001142dc process_incoming_data (1c, 2f74b0, 1c, fefbc000, ffbfef9e, 3fa) + 210
 00114540 write_to_internal_pipe (ffffffff, 2f74b0, 2c, 1144b0, ffbfef98, 400) + 90
 001144a0 write_to_pipe (2f7360, 2f74a0, 2c, 400, 45, 45) + 128
 00050dd4 api_fd_reply (2f7360, 65, 31abe8, 26, 2f74a0, 0) + 2e4
 00051044 named_pipe (2f8eb0, 65, 31abe8, ffbff9b6, 2edc60, 2f74a0) + 1c8
 00051aa4 reply_trans (400, 2fa798, 0, 2, 65, 2f74a0) + 9f0
 000984f8 switch_message (2f8eb0, 2fa798, 31abe8, 84, 20000, 0) + 5c0
 00098584 construct_reply (2fa798, 31abe8, 84, 20000, ffbffb88, 2a3c00) + 5c
 000988ec process_smb (2fa798, 31abe8, 31abe8, 20441, 0, 0) + 1d8
 0009958c smbd_process (bba, 4ecf4, 16, 0, 3, ffbffd20) + 178
 001fdd00 main     (ffffffff, ffbffe94, ffbffea4, 2a6198, 0, 0) + 824
 0003c0c4 _start   (0, 0, 0, 0, 0, 0) + 5c
Comment 3 Volker Lendecke 2005-02-24 04:17:38 UTC
This looks like a problem in the nss_ldap libraries. How large is your LDAP
tree? I'm currently working on improving the Samba->LDAP connection. In
particular the _samr_query_groupmem that seems to be problematic for you has
been vastly improved recently. This is post-3.0.11 however, and you need to
activate the option 'ldapsam:trusted = yes'. On the other hand you should try to
reproduce the problem by issuing several 'getent group' calls and see whether
you can get one of them into a stall. If you can, ask your friendly solaris
support for fixing the problem.
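
The reproduction attempt suggested here can be automated by running the lookup repeatedly under a timeout and counting the runs that do not finish. A minimal sketch follows; the helper names, attempt count, and timeout values are assumptions chosen for illustration, and the demo at the end uses short Python child processes as stand-ins so that it runs anywhere.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Return True if cmd exits within timeout_s seconds, False if it stalls."""
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so no process is left over.
        return False


def count_stalls(cmd, attempts, timeout_s):
    """Run cmd `attempts` times and count the runs that hit the timeout."""
    return sum(0 if run_with_timeout(cmd, timeout_s) else 1 for _ in range(attempts))


# Stand-in commands: one that exits at once and one that sleeps past the timeout.
QUICK = [sys.executable, "-c", "pass"]
SLOW = [sys.executable, "-c", "import time; time.sleep(5)"]

print("quick command stalls:", count_stalls(QUICK, 3, 10.0))
print("slow command stalls:", count_stalls(SLOW, 1, 0.5))
```

On the affected server the call would look like `count_stalls(["getent", "group"], 50, 30.0)`, and since the stack trace goes through getpwent, `["getent", "passwd"]` may be worth trying as well.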

Volker
Comment 4 Roman Berjoza 2005-02-25 02:28:58 UTC
(In reply to comment #3)
> This looks like a problem in the nss_ldap libraries. How large is your LDAP
> tree? I'm currently working on improving the Samba->LDAP connection. In
> particular the _samr_query_groupmem that seems to be problematic for you has
> been vastly improved recently. This is post-3.0.11 however, and you need to
> activate the option 'ldapsam:trusted = yes'. On the other hand you should try to
> reproduce the problem by issuing several 'getent group' calls and see whether
> you can get one of them into a stall. If you can, ask your friendly solaris
> support for fixing the problem.
> 
> Volker
We have an LDAP tree of ~1000 user entries in Sun One DS 5.2. The PDC does not
use ldapsam, only smbpasswd, so I think "ldapsam:trusted = yes" is useless here.
Unix accounts are resolved through pam_unix->nss_ldap without problems; groups
are resolved through files only. I cannot get getent group to stall.
Comment 5 Volker Lendecke 2005-02-25 04:06:30 UTC
getpwent_list is the last function in the call chain in Samba. From there we
call getpwent(), which gives the next entry from /etc/passwd and the
corresponding nss equivalent, probably nss_ldap. This could possibly be slow,
but it should never hang indefinitely. My reading of your backtrace is that the
process gets a SIGBUS signal from within either nss_ldap or the ldap libraries.
It seems to use an, AFAIK, quite novel LDAP feature called virtual list view
that might have bugs in its implementation. Getting a SIGBUS from within free()
really sounds like an nss_ldap and/or ldap library bug.
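
The reading above can be checked mechanically by splitting the pstack output at the "called from signal handler" marker. The sketch below works on an abridged excerpt of the pstack output from comment 2; the function name and regular expressions are illustrative assumptions. One further inference, not confirmed anywhere in this thread, is that the process never returns because the fault handler calls malloc() while the interrupted free() presumably still holds libc's allocator lock.

```python
import re

FRAME_RE = re.compile(r"^\s*[0-9a-f]{8}\s+(\w+)\s*\(")
SIGNAL_RE = re.compile(r"called from signal handler with signal (\d+) \((\w+)\)")


def split_at_signal(pstack_text):
    """Return (handler_frames, signal_name, interrupted_frames), innermost first."""
    handler, interrupted, signal_name = [], [], None
    for line in pstack_text.splitlines():
        marker = SIGNAL_RE.search(line)
        if marker:
            signal_name = marker.group(2)
            continue
        frame = FRAME_RE.match(line)
        if frame:
            # Frames printed before the marker ran inside the signal handler.
            target = handler if signal_name is None else interrupted
            target.append(frame.group(1))
    return handler, signal_name, interrupted


# Abridged excerpt of the pstack output posted in comment 2.
PSTACK = """\
 ff375e88 lwp_park (0, 0, 0)
 ff371c08 mutex_lock_queue (ff388b44, 0, fefc0590, ff388000, 35, 0) + 104
 ff372608 slow_lock (fefc0590, fefd0000, 252e3e, fefbc000, 0, 0) + 58
 fef46cf8 malloc   (37, 0, 252e28, ffbfdc28, 0, 0) + 18
 0018d93c vasprintf (36, 252e28, ffbfdc28, fef97ec4, 13, 296bab) + 2c
 00183b84 fault_report (a, 0, 0, 0, 0, 0) + 58
 00183d00 sig_fault (a, 0, ffbfdfc0, 0, 0, 0) + 4
 --- called from signal handler with signal 10 (SIGBUS) ---
 fef47c08 _free_unlocked (185, 0, fec8c9a0, fefbc000, 1, ff00) + 40
 fef47bb8 free     (185, 0, 370888, 0, 0, 0) + 20
 fec95a10 ldap_set_lderrno (33ddc8, 0, 0, 0, ffbfe480, 0) + ec
"""

handler, signal_name, interrupted = split_at_signal(PSTACK)
print("signal:", signal_name)
print("interrupted in:", interrupted[0])
print("handler reached malloc:", "malloc" in handler)
```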

Closing this bug; close inspection of the relevant samba code does not show any
bugs. The only way this could be a samba bug is a general memory corruption. To
trace this there is not enough info in your bug report, and I'm afraid that I
don't know a way to really diagnose this.

Again: I *really* don't believe it's a samba bug; this smells too much like
nss_ldap and/or ldap libs.

Volker