Bug 8916 - smbd stops accepting new connections after a few hours and requires a kill -9
Summary: smbd stops accepting new connections after a few hours and requires a kill -9
Status: RESOLVED FIXED
Alias: None
Product: Samba 3.5
Classification: Unclassified
Component: SMB2
Version: 3.5.6
Hardware: Other Linux
Importance: P5 major
Target Milestone: ---
Assignee: Jeremy Allison
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-05-06 16:43 UTC by Leo
Modified: 2012-05-09 05:52 UTC
CC List: 1 user

See Also:


Attachments
strace -ttT -f -p 610 -o /tmp/smbd.strace (85 bytes, text/plain)
2012-05-07 11:17 UTC, Leo
strace -ttT -o /tmp/smbclient.strace /usr/bin/smbclient -L 127.0.0.1 -Uuser%password (34.57 KB, text/plain)
2012-05-07 11:20 UTC, Leo

Description Leo 2012-05-06 16:43:00 UTC
I am running Debian 6 (Linux 2.6.32-5-orion5x) on a QNAP NAS 109 (armv5tel).

After a few hours of use, smbd reliably fails: it stops accepting any new connections, although it continues to serve existing ones.

Restarting the daemon with

 service samba restart

has no effect, but killing the process with kill -9 (using the first entry shown by ps -e | grep smb) and then running service samba start reliably solves the problem.

The output of ps -e | grep smb sometimes shows multiple smbd <defunct> (zombie) entries. I have only seen these after a failure.

The failure seems to occur even when no Samba connections have been made (typically within 24 hours), but I can make it happen more quickly by either:

* running rsync on the Samba drives (connected locally, not over Samba), which typically causes the failure within minutes to an hour
* running dozens of concurrent "smbclient -L" requests, which reliably causes the failure within minutes.
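The second reproduction method can be scripted. The following is a minimal bash sketch, assuming smbclient is on the PATH and credentials are passed as user%password as in this report; the function name stress_smbclient and the SMBCLIENT override variable are hypothetical, added so the script can be exercised with a stub. It starts every request in the background so they overlap, then reports how many exited with an error.

```shell
#!/usr/bin/env bash
# Sketch: fire N concurrent "smbclient -L" requests and count the failures.
# stress_smbclient and SMBCLIENT are hypothetical names, not part of Samba.

SMBCLIENT=${SMBCLIENT:-smbclient}

stress_smbclient() {
    local count=$1 host=$2 creds=$3
    local pids=() failures=0 i pid

    # Launch every request in the background so they run concurrently.
    for ((i = 0; i < count; i++)); do
        "$SMBCLIENT" -L "$host" -U"$creds" >/dev/null 2>&1 &
        pids+=("$!")
    done

    # Reap each child and count non-zero exit statuses.
    for pid in "${pids[@]}"; do
        wait "$pid" || failures=$((failures + 1))
    done

    echo "failures: $failures/$count"
}
```

For example, stress_smbclient 50 127.0.0.1 user%password; 50 is an arbitrary stand-in for "dozens".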
 
I have reinstalled Debian twice and observed the same problems. I have also run with an empty smb.conf and observed the same problems.
Comment 1 Leo 2012-05-06 16:44:20 UTC
(In reply to comment #0)

Sorry, I should add that the log files show nothing unusual (even at log level 10). I have also checked the log files of every other service and tried restarting every other service.
Comment 2 Leo 2012-05-07 10:35:25 UTC
I should add that, in addition to normal use, I check whether connections fail with

smbclient -L 127.0.0.1 -Uuser%pass

This works normally until the failure starts, at which point I get:

protocol negotiation failed: NT_STATUS_IO_TIMEOUT

Once it stops working, it never works again until I reboot the server or issue a kill -9.
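To record when the server enters the failed state, the smbclient check can be run in a loop. This is a minimal bash sketch, assuming coreutils timeout(1) is installed; the 60-second per-probe limit, the function name poll_until_failure and the SMBCLIENT variable are illustrative choices, not anything from Samba. It stops at the first failed probe and prints the last line of smbclient's output, which in the failure described here should be the NT_STATUS_IO_TIMEOUT message, provided smbclient gives up before the 60-second limit.

```shell
#!/usr/bin/env bash
# Sketch: probe the server until "smbclient -L" fails, then report when.
# poll_until_failure and SMBCLIENT are hypothetical names.

SMBCLIENT=${SMBCLIENT:-smbclient}

poll_until_failure() {
    local host=$1 creds=$2 interval=$3 max_checks=$4
    local n out
    for ((n = 1; n <= max_checks; n++)); do
        # Bound each probe so a hung negotiation cannot stall the loop.
        if ! out=$(timeout 60 "$SMBCLIENT" -L "$host" -U"$creds" 2>&1); then
            echo "check $n failed at $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
            # Show the final line of output (e.g. the NT_STATUS error).
            printf '%s\n' "$out" | tail -n 1
            return 1
        fi
        sleep "$interval"
    done
    echo "no failure after $max_checks checks"
}
```

For example, poll_until_failure 127.0.0.1 user%pass 300 1000 probes every five minutes.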
Comment 3 Volker Lendecke 2012-05-07 10:39:50 UTC
When it's in that state, can you do a

strace -ttT -f -p <pid> -o /tmp/smbd.strace

with <pid> being the PID of the parent smbd, and watch while a client attempts to reconnect? Please upload /tmp/smbd.strace (compressed with bzip2 -9 if it is large).

Thanks,

Volker
Comment 4 Leo 2012-05-07 11:17:17 UTC
Created attachment 7529 [details]
strace -ttT -f -p 610 -o /tmp/smbd.strace

generated by running:

strace -ttT -f -p 610 -o /tmp/smbd.strace

PID determined using both ps -e and more /var/run/samba/smbd.pid.

I then connected to Samba from a remote device, which timed out. The log, however, appears to be nearly empty.
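The procedure in this comment (read the parent PID from the pid file, then attach strace with the options Volker requested) can be wrapped in a small bash sketch. The pid file path is the one mentioned above; the function names and the PIDFILE override are hypothetical. While trace_parent_smbd is running, attempt a new connection from a client, then stop strace with Ctrl-C.

```shell
#!/usr/bin/env bash
# Sketch: attach strace to the parent smbd found via its pid file.
# parent_smbd_pid, trace_parent_smbd and PIDFILE are hypothetical names.

PIDFILE=${PIDFILE:-/var/run/samba/smbd.pid}

parent_smbd_pid() {
    local pid
    # Fails if the pid file is missing or unreadable.
    pid=$(head -n 1 "$PIDFILE" 2>/dev/null) || return 1
    pid=${pid//[[:space:]]/}
    # Reject anything that is not a plain number (stale or corrupt file).
    [[ $pid =~ ^[0-9]+$ ]] || return 1
    echo "$pid"
}

trace_parent_smbd() {
    local pid
    pid=$(parent_smbd_pid) || { echo "cannot read $PIDFILE" >&2; return 1; }
    # -f follows forked children, -tt adds timestamps, -T adds syscall durations.
    strace -ttT -f -p "$pid" -o /tmp/smbd.strace
}
```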
Comment 5 Leo 2012-05-07 11:20:47 UTC
Created attachment 7530 [details]
strace -ttT -o /tmp/smbclient.strace /usr/bin/smbclient -L 127.0.0.1 -Uuser%password

As an alternative, I ran strace on the client while trying to connect locally using /usr/bin/smbclient -L 127.0.0.1 -Uuser%password

I edited the file to remove my password.
Comment 6 Leo 2012-05-07 11:21:56 UTC
Hi.

Thank you for your quick reply. I ran the command as requested, but it produced very little, so I tried an alternative that I have seen suggested elsewhere. Please let me know if you would like me to retry your original request.

Leo
Comment 7 Volker Lendecke 2012-05-07 11:33:43 UTC
It hangs in the futex call. This strongly points to a threading problem. Samba itself does not use threads (unless you are using the aio_pthread module), so the problem must come from some library. Next step:

Can you get us a gdb backtrace from the parent?

gdb /usr/sbin/smbd --pid=<parent-smbd>

At the prompt, please run "bt full" and send us the output. It might also be necessary to have debugging symbols installed.

Volker
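The same backtrace can be captured without an interactive session using gdb's -batch and -ex options. This is a minimal bash sketch; the function name, the default output path /tmp/smbd.gdb.txt and the GDB override variable are hypothetical. It adds "thread apply all bt full" on the assumption that, given the futex hang, threads created by a library may be relevant. It still needs debugging symbols (and a gdb able to handle PIE binaries) to produce useful frames.

```shell
#!/usr/bin/env bash
# Sketch: dump full backtraces of the parent smbd non-interactively.
# backtrace_smbd and GDB are hypothetical names; GDB can point at a stub.

GDB=${GDB:-gdb}

backtrace_smbd() {
    local pid=$1 out=${2:-/tmp/smbd.gdb.txt}
    if [[ ! $pid =~ ^[0-9]+$ ]]; then
        echo "usage: backtrace_smbd <pid> [outfile]" >&2
        return 2
    fi
    # -batch exits after the -ex commands; "detach" lets smbd keep running.
    "$GDB" -batch \
        -ex "bt full" \
        -ex "thread apply all bt full" \
        -ex "detach" \
        /usr/sbin/smbd --pid="$pid" > "$out" 2>&1
}
```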
Comment 8 Leo 2012-05-07 11:44:43 UTC
I get the following info:

warning: The current binary is a PIE (Position Independent Executable), which
GDB does NOT currently support.  Most debugger features will fail if used
in this session.

Reading symbols from /usr/sbin/smbd...(no debugging symbols found)...done.
Attaching to program: /usr/sbin/smbd, process 610
0x4039bd58 in ?? ()
(gdb) bt full
#0  0x4039bd58 in ?? ()
No symbol table info available.
#1  0x4039bd40 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I therefore suspect I need to install debugging symbols. Is there anything you could point me at so I know how to do that?

Incidentally, Samba was installed from the Debian repository.
Comment 9 Volker Lendecke 2012-05-07 11:49:55 UTC
No idea, sorry. But we definitely need that correct backtrace information.
Comment 10 Leo 2012-05-08 21:11:37 UTC
To get gdb to handle the PIE (Position Independent Executable) file, I had to upgrade gdb. I did this by updating to Debian unstable (sid). Installing gdb forced me to resolve a dependency on Python, which in turn seemed to upgrade a whole bunch of other packages (I am not sure why).

Since that upgrade I have thrown hundreds of concurrent connections at it and run an rsync for 24 hours, and I have still not managed to make it stop accepting new connections (assumed fixed).

For information, the complete list of upgraded packages is shown below:

Leo

gdb 7.4.1-1 newer than version in archive
krb5-multidev 1.10+dfsg~beta1-2.1 newer than version in archive
libacl1 2.2.51-5 newer than version in archive
libattr1 1:2.4.46-5 newer than version in archive
libexpat1 2.1.0-1 newer than version in archive
libgssapi-krb5-2 1.10+dfsg~beta1-2.1 newer than version in archive
libgssrpc4 1.10+dfsg~beta1-2.1 newer than version in archive
libk5crypto3 1.10+dfsg~beta1-2.1 newer than version in archive
libkrb5-3 1.10+dfsg~beta1-2.1 newer than version in archive
libkrb5-dev 1.10+dfsg~beta1-2.1 newer than version in archive
libkrb5support0 1.10+dfsg~beta1-2.1 newer than version in archive
libtalloc2 2.0.7+git20120207-1 newer than version in archive
libtdb1 1.2.10-1 newer than version in archive
libwbclient0 2:3.6.5-1 newer than version in archive
openssh-client 1:5.9p1-5 newer than version in archive
openssh-server 1:5.9p1-5 newer than version in archive
samba 2:3.6.5-1 newer than version in archive
samba-common 2:3.6.5-1 newer than version in archive
samba-dbg 2:3.6.5-1 newer than version in archive
smbclient 2:3.6.5-1 newer than version in archive
tdb-tools 1.2.10-1 newer than version in archive
winbind 2:3.6.5-1 newer than version in archive
Comment 11 Volker Lendecke 2012-05-09 05:52:16 UTC
Closing as fixed so that it does not stay around in case you cannot reproduce it. Please reopen this bug if you get the backtrace.

Thanks,

Volker