On Samba 3.4.1, I ran the following test: 1. Mount a CIFS share via Samba to a Linux client. 2. On the system exporting the CIFS share, I ran a program that grabs a write byte lock in blocking mode and then sleeps. This lock is grabbed on the file that resides on the ext3 file system and is not being grabbed via CIFS. 3. I then ran the same program over CIFS to grab a write byte lock on the file that already has the lock held from step 2. My expectation is that this would block because the lock was already held in step #2. This works as expected. 4. I then killed the process in step #2 and expected that the process blocked on the lock in step 3 would wake up. I waited multiple minutes and it never did. The following output from the Samba debug log seems to imply it is waiting forever: fcntl_lock: fcntl lock gave errno 11 (Resource temporarily unavailable) [2009/10/19 19:11:35, 3] ../lib/util/util.c:273(fcntl_lock) fcntl_lock: lock failed at offset 0 count 9223372036854775808 op 6 type 1 (Res ource temporarily unavailable) [2009/10/19 19:11:35, 8] locking/posix.c:219(posix_fcntl_lock) posix_fcntl_lock: Lock call failed [2009/10/19 19:11:35, 5] locking/posix.c:1215(set_posix_lock_posix_flavour) set_posix_lock_posix_flavour: Lock fail !: Type = WRITE: offset = 0, count = 9 223372036854775808. Errno = Resource temporarily unavailable [2009/10/19 19:11:35, 10] locking/locking.c:282(do_lock) do_lock: returning status=NT_STATUS_FILE_LOCK_CONFLICT [2009/10/19 19:11:35, 10] smbd/blocking.c:89(recalc_brl_timeout) Next timeout = Infinite. [2009/10/19 19:11:35, 3] smbd/blocking.c:205(push_blocking_lock_request) push_blocking_lock_request: lock request blocked with expiry time (0 sec. 0 us ec) (+-1 msec) for fnum = 5116, name = hi [2009/10/19 19:11:35, 10] libsmb/smb_signing.c:67(store_sequence_for_reply) Samba should retry the lock (poll for it) in some increment and not wait forever. This bug was seen with Samba 3.4.1 but I believe it may exist in older versions of Samba as well. I wonder if the following is the culprit: The routine recalc_brl_timeout() has code that looks to me that if it had only 1 blocking lock on the blocking_lock_queue would evaluate next_timeout to be 0, which means infinite. The function then returns.
Ok, this used to work as it was part of my Samba lock tests. I'll look at this immediately and see if I can work out a test we can use to make sure we don't regress again. Jeremy.
Created attachment 4866 [details] Quick test fix. Ok, here's a (completely untested) fix I'll test at work tomorrow. Compiles, but nothing more. Most of this fix will be writing the smbtorture test case I think.... Jeremy.
Ok, the fix works (and is now regression tested in the build farm as LOCK9). I've committed to master and 3.4.3. I think this is an important enough regression from 3.0.x that it needs to be in 3.3.10 and 3.4.3. Back-ports to follow. Jeremy
Created attachment 4872 [details] git am format patch for 3.4.3.
Created attachment 4873 [details] git-am format patch for 3.3.10.
Volker and Metze, the git-am format patches for 3.4.3 and 3.3.10 are identical to the patch I've put into master and 3.5.0 that passes the new LOCK9 smbtorture regression test. Please review and then re-assign to Karolin for inclusion. Thanks ! Jeremy.
Comment on attachment 4873 [details] git-am format patch for 3.3.10. Looks fine for 3.4.x and 3.3.x
Looks good. This means 3.4.3 will be deferred a week and be released not before October 28. Volker
I thought 6829 was a blocker anyway (maybe you should look at this to see if you agree). Jeremy.
Pushed to v3-3-test and v3-4-test. Closing out bug report. Thanks!
Hi Jeremy, I applied the patch above for Samba 3.4.3 to Samba 3.4.1 and tried our test program below, and it still seems to have a problem. I have attached the smbd.log with debug level 10 to this defect. It looks like the lock request is retried once 10 seconds after the initial request failed because of a lock conflict, but then it is never retried a second or additional time. So basically if the lock is dropped within 10 seconds, it works fine but if the lock conflict lasts more than 10 seconds it doesn't seem to work. If you believe I have misapplied this patch to Samba 3.4.1, please close this defect again. I have attached the smbd.log file excerpt and the test program that I used. To repro: 1. Run the ctdb_test program from 1 window, but don't use CIFS, on a file. It should say "File locked". It is waiting for input before it drops the lock. 2. Then, run the ctdb_test program from a second window on the same file but do this through CIFS. The program should say "locking file" but won't say "File locked" because it conflicts with the lock above. Now wait 15-20 seconds and then kill the program in step #1 above. You will see that the program in #2 never grabs the lock. If it did, you should see the output that says "File locked". Turning on debugs with level 10 debug will show the output that I have attached above.
Created attachment 4889 [details] Test program to grab lock and not release it. Run this program as: ctdb_test <path to file> It will display when it tries to grab the byte lock and when it has successfully grabbed the byte lock.
Created attachment 4890 [details] Output of debug level 10 when I applied patch to Samba 3.4.1 and ran my program has described in defect.
It looks like Samba decides to close the share as being idle after the lock retries 10 seconds later and is never retried again. If you run my program as desribed above but kill the first locking process within 10 seconds, it looks like the fix works. But if the lock conflict lasts more than 10 seconds (i.e. 1 retry still reports the lock as being grabbed), then the lock is never retried again. If you can reproduce this same failure with your fix on Samba 3.4.1 or Samba 3.4.3, then this defect is valid. If you find that you can't reproduce it, then I will re-check how I applied the patch. Thanks. Barry
Created attachment 4891 [details] Follow on patch for master, 3.5.0 and v3-4-test. Here is an (untested) fix that should do 2 things. a). Update the "used" flag on connection structs so they don't get marked idle when a deferred lock is on them. b). Ensure we always recalculate lock timeouts on every pass though process_blocking_lock_queue() - this lock timeout recalculation code already copes with empty lock queues. Please let me know if this fixes the problem for you. If it does this fix is too late to make it into 3.4.3 (which is due on 28th) but will be in 3.5.0 and 3.4.4, 3.3.10 and beyond. Jeremy.
If this patch works I'll update the torture tester to check for this specific case. Jeremy.
Fix works with the test case given so I've checked the fix into master and 3.5.0. I'll update smbtorture to test this case tomorrow. Jeremy.
I just tried running some tests and it looks like the fix works for me too. Some of my co-workers may try some more tests tomorrow. Thanks for fixing this.
Thanks for testing. Please keep kicking this code and finding bugs. It's *much* appreciated ! Jeremy.
Created attachment 4899 [details] git-am format follow on patch for 3.4.4. Follow on patch for 3.4.4.
Created attachment 4900 [details] git-am format follow on patch for 3.3.10
Karolin please pull these patches into 3.3.10 and 3.4.4 (once 3.4.3 has shipped). Thanks ! Jeremy.
Trying to get some more reviewers so Karolin can get this pushed :-). Jeremy.
Comment on attachment 4900 [details] git-am format follow on patch for 3.3.10 looks good
Comment on attachment 4899 [details] git-am format follow on patch for 3.4.4. looks good
Pushed follow on patches to v3-3-test and v3-4-test. Closing out bug report. Thanks!