Bug 6648 - smbd/ctdb infinite loop on unlink of non-existent file (stat()=ENOENT), reproduce with smbtorture BASE-BENCH-TORTURE
Status: NEW
Product: Samba 3.3
Classification: Unclassified
Component: File services
Version: 3.3.7
Hardware: x64 FreeBSD
Importance: P3 normal
Target Milestone: ---
Assigned To: Volker Lendecke
QA Contact: Samba QA Contact
Depends on:
Blocks:
Reported: 2009-08-19 16:29 UTC by Andrew Klosterman
Modified: 2009-09-29 08:16 UTC (History)


Attachments
GDB session stepping through smbd that is exhibiting the "file isn't there, but I'm trying to delete it" behavior (91.20 KB, text/plain)
2009-08-19 16:30 UTC, Andrew Klosterman
Another GDB session stepping through smbd that is exhibiting the "file isn't there, but I'm trying to delete it" behavior (88.71 KB, text/plain)
2009-08-19 16:30 UTC, Andrew Klosterman
Samba log at debug level 5 for an smbd showing the buggy behavior. (63.78 KB, text/plain)
2009-08-19 16:31 UTC, Andrew Klosterman
Samba log at debug level 10 for an smbd showing the buggy behavior. (202.99 KB, text/plain)
2009-08-19 16:32 UTC, Andrew Klosterman
Output of "truss" for the loop that smbd is stuck in. (2.80 KB, text/plain)
2009-08-20 07:41 UTC, Andrew Klosterman

Description Andrew Klosterman 2009-08-19 16:29:18 UTC
Summary:
=======
Running a two-node Samba cluster, I can get smbd into an infinite loop in which it keeps trying to delete a non-existent file (stat( "torture.lck" ) returns ENOENT).  This occurs *after* the smbtorture program has ended: one or two smbd processes can remain stuck in the "deleting the file" loop described here.

Samba is 3.3.7 with CTDB (downloaded as "rsync -avz samba.org::ftp/unpacked/ctdb ."), using a hacked "vfs objects = fileid"/vfs_fileid.c to work on FreeBSD, which uses different calls to get info on mounted file systems.  CTDB was "hacked" to take out the IP-takeover calls/functionality, which do not work on FreeBSD.

This might be more appropriately labelled as a CTDB bug, but the problem seems to be in smbd rather than in any ctdb code.  It could be an interaction problem between the two.

Testing done:
=========
Works fine against single-node Windows.
Works fine on single-node CentOS/Linux (Linux meddy-centos-1 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux) with Samba (3.3.4).
Does not work when run against Samba/CTDB running on a single node CTDB cluster.
Does not work when run against a single node of a two-node CTDB cluster with Samba 3.3.7 on FreeBSD 7.0, amd64.

Reproduce command:
==============
/opt/samba-4.0.0alpha7/source4/bin/smbtorture //10.0.10.10/smbtorture --failures 0 --user XXXXXX --password XXXXXXXX --num-ops 100 --debuglevel 5 BASE-BENCH-TORTURE

Only seems to happen at higher values of "--num-ops".  Works at the default of "10", gets into trouble with "100".
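
To narrow down where the failure starts, the reproduce command can be generated for a range of "--num-ops" values. The following is a minimal sketch, assuming Python 3.8+ is available on the client. It only builds and prints the command lines, reusing the binary path, share and flags from the command above. The helper name build_smbtorture_cmd and the intermediate values 25 and 50 are illustrative and not part of the original report.

```python
import shlex

# Binary path and share are copied from the reproduce command in this report.
SMBTORTURE = "/opt/samba-4.0.0alpha7/source4/bin/smbtorture"
SHARE = "//10.0.10.10/smbtorture"


def build_smbtorture_cmd(num_ops, user="XXXXXX", password="XXXXXXXX"):
    """Return the argv list for one BASE-BENCH-TORTURE run (hypothetical helper)."""
    return [
        SMBTORTURE, SHARE,
        "--failures", "0",
        "--user", user,
        "--password", password,
        "--num-ops", str(num_ops),
        "--debuglevel", "5",
        "BASE-BENCH-TORTURE",
    ]


# 10 (works) and 100 (fails) come from the report; 25 and 50 are assumed probes.
for n in (10, 25, 50, 100):
    print(shlex.join(build_smbtorture_cmd(n)))
```

Each printed line can be pasted into a shell, or the argv list can be handed to subprocess.run() once real credentials are filled in.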

"Amateur" analysis:
============
In the process of servicing the delete request, all of the stat() calls set errno=2 (ENOENT), and the situation percolates back to smbd/reply.c:2638, where the code sees that "open_was_deferred()" == True and so returns no error to the caller.
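
The suspected control flow can be modelled in a few lines. This is a toy sketch in Python, not Samba's code: the function names, the "requeued"/"done"/"error" strings and the assumption that the deferred flag stays set for every retry are all made up for illustration. It shows why a request that is requeued whenever the open was deferred never completes while the file is missing, and why it completes as soon as the file exists again.

```python
import errno


def try_unlink(files, name):
    """Stand-in for the stat()/unlink path: return 0 or an errno value."""
    if name not in files:
        return errno.ENOENT
    files.discard(name)
    return 0


def handle_request(files, name, open_was_deferred):
    """Model of the suspected behaviour: a deferred open swallows the error."""
    err = try_unlink(files, name)
    if err != 0 and open_was_deferred:
        return "requeued"  # no error goes back to the caller
    return "done" if err == 0 else "error"


def serve(files, name, max_rounds=1000, recreate_at=None):
    """Retry the request; return the round it finished in, or None at the cap."""
    for round_no in range(1, max_rounds + 1):
        if round_no == recreate_at:
            files.add(name)  # someone re-creates the file
        # Assumption: the deferred flag stays set for every retry.
        if handle_request(files, name, open_was_deferred=True) != "requeued":
            return round_no
    return None


print(serve(set(), "torture.lck"))                 # None: still spinning at the cap
print(serve(set(), "torture.lck", recreate_at=5))  # 5: finishes once the file exists
```

The cap of 1000 rounds exists only so the demonstration terminates. In this model, nothing bounds the retries when the file is missing.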

I have text from a couple of GDB sessions, plus debug level 10 and 5 log files of the situation, to attach to this bug.

Recovering from the situation:
===================
The problem stems from trying to delete a file that is not there (it used to exist, but has been deleted).  If I re-create that file, then the spinning smbd process recovers.
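
The recovery step can be scripted. A minimal sketch in Python, assuming it is run on a node that has the share's backing directory mounted: the helper name recreate_if_missing is hypothetical, and the demo uses a temporary directory in place of the real share path, which is not given in this report.

```python
import os
import tempfile


def recreate_if_missing(path):
    """Create an empty file at path if stat() reports ENOENT; return True if created."""
    try:
        os.stat(path)
        return False
    except FileNotFoundError:  # raised for errno 2 (ENOENT)
        # Mode "x" fails if the file appeared in the meantime, so an
        # existing file is never truncated.
        with open(path, "x"):
            pass
        return True


# Demo against a temporary directory standing in for the share root.
with tempfile.TemporaryDirectory() as share_root:
    lock_path = os.path.join(share_root, "torture.lck")
    print(recreate_if_missing(lock_path))  # True: file was missing and got created
    print(recreate_if_missing(lock_path))  # False: file already exists
```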
Comment 1 Andrew Klosterman 2009-08-19 16:30:15 UTC
Created attachment 4573 [details]
GDB session stepping through smbd that is exhibiting the "file isn't there, but I'm trying to delete it" behavior
Comment 2 Andrew Klosterman 2009-08-19 16:30:50 UTC
Created attachment 4574 [details]
Another GDB session stepping through smbd that is exhibiting the "file isn't there, but I'm trying to delete it" behavior
Comment 3 Andrew Klosterman 2009-08-19 16:31:39 UTC
Created attachment 4575 [details]
Samba log at debug level 5 for an smbd showing the buggy behavior.
Comment 4 Andrew Klosterman 2009-08-19 16:32:10 UTC
Created attachment 4576 [details]
Samba log at debug level 10 for an smbd showing the buggy behavior.
Comment 5 Andrew Klosterman 2009-08-20 07:41:13 UTC
Created attachment 4578 [details]
Output of "truss" for the loop that smbd is stuck in.
Comment 6 Andrew Klosterman 2009-09-28 13:52:03 UTC
Any traction on this one?

Should it be submitted to CTDB?

Is there any more information that I could provide?
Comment 7 Volker Lendecke 2009-09-28 14:46:53 UTC
Well, it's just that we're all too busy, and this bug seems like at least a few hours of reproducing. Without some external pressure (like money for example :-)) I pick the more low-hanging fruit first :-))

Volker
Comment 8 Andrew Klosterman 2009-09-28 14:54:58 UTC
Understandable.  I was just checking.  :-)

I'll see about building a patch from my CTDB changes (that get it partially working on FreeBSD) to help speed things along.  (It just occurred to me that those patches would be particularly useful for reproducing the situation!  Although any underlying/latent logic error is probably still there.)
Comment 9 Volker Lendecke 2009-09-29 06:54:26 UTC
What clustered file system are you using on FreeBSD?

Volker
Comment 10 Andrew Klosterman 2009-09-29 08:16:44 UTC
Using a clustered (i.e., it does locking correctly between participating nodes, passes SPEC08) NFSv3 file server as backend storage.