Bug 6828 - infinite timeout occurs when byte lock held outside of samba
infinite timeout occurs when byte lock held outside of samba
Status: RESOLVED FIXED
Product: Samba 3.4
Classification: Unclassified
Component: File services
3.4.1
Other Linux
: P3 regression
: ---
Assigned To: Karolin Seeger
Samba QA Contact
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2009-10-19 21:38 UTC by Barry Sabsevitz
Modified: 2012-05-23 20:05 UTC (History)
0 users

See Also:
jra: review? (metze)


Attachments
Quick test fix. (1.57 KB, patch)
2009-10-19 22:43 UTC, Jeremy Allison
no flags Details
git am format patch for 3.4.3. (1.94 KB, patch)
2009-10-20 20:12 UTC, Jeremy Allison
no flags Details
git-am format patch for 3.3.10. (1.93 KB, patch)
2009-10-20 20:19 UTC, Jeremy Allison
metze: review+
Details
Test program to grab lock and not release it. (962 bytes, text/plain)
2009-10-26 19:19 UTC, Barry Sabsevitz
no flags Details
Output of debug level 10 when I applied patch to Samba 3.4.1 and ran my program has described in defect. (11.73 KB, text/plain)
2009-10-26 19:20 UTC, Barry Sabsevitz
no flags Details
Follow on patch for master, 3.5.0 and v3-4-test. (1.38 KB, patch)
2009-10-26 19:59 UTC, Jeremy Allison
no flags Details
git-am format follow on patch for 3.4.4. (1.89 KB, patch)
2009-10-27 13:44 UTC, Jeremy Allison
metze: review+
Details
git-am format follow on patch for 3.3.10 (2.05 KB, patch)
2009-10-27 13:57 UTC, Jeremy Allison
metze: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Barry Sabsevitz 2009-10-19 21:38:08 UTC
On Samba 3.4.1, I ran the following test:

1. Mount a CIFS share via Samba to a Linux client.

2. On the system exporting the CIFS share, I ran a program that grabs a write byte lock in blocking mode and then sleeps. This lock is grabbed on the file that resides on the ext3 file system and is not being grabbed via CIFS.

3. I then ran the same program over CIFS to grab a write byte lock on the file that already has the lock held from step 2. My expectation is that this would block because the lock was already held in step #2. This works as expected.

4. I then killed the process in step #2 and expected that the process blocked on the lock in step 3 would wake up. I waited multiple minutes and it never did.

The following output from the Samba debug log seems to imply it is waiting forever:

  fcntl_lock: fcntl lock gave errno 11 (Resource temporarily unavailable)
[2009/10/19 19:11:35,  3] ../lib/util/util.c:273(fcntl_lock)
  fcntl_lock: lock failed at offset 0 count 9223372036854775808 op 6 type 1 (Res
ource temporarily unavailable)
[2009/10/19 19:11:35,  8] locking/posix.c:219(posix_fcntl_lock)
  posix_fcntl_lock: Lock call failed
[2009/10/19 19:11:35,  5] locking/posix.c:1215(set_posix_lock_posix_flavour)
  set_posix_lock_posix_flavour: Lock fail !: Type = WRITE: offset = 0, count = 9
223372036854775808. Errno = Resource temporarily unavailable
[2009/10/19 19:11:35, 10] locking/locking.c:282(do_lock)
  do_lock: returning status=NT_STATUS_FILE_LOCK_CONFLICT
[2009/10/19 19:11:35, 10] smbd/blocking.c:89(recalc_brl_timeout)
  Next timeout = Infinite.
[2009/10/19 19:11:35,  3] smbd/blocking.c:205(push_blocking_lock_request)
  push_blocking_lock_request: lock request blocked with expiry time (0 sec. 0 us
ec) (+-1 msec) for fnum = 5116, name = hi
[2009/10/19 19:11:35, 10] libsmb/smb_signing.c:67(store_sequence_for_reply)

Samba should retry the lock (poll for it) in some increment and not wait forever. This bug was seen with Samba 3.4.1 but I believe it may exist in older versions of Samba as well. 

I wonder if the following is the culprit:

The routine recalc_brl_timeout() has code that looks to me that if it had only 1 blocking lock on the blocking_lock_queue would evaluate next_timeout to be 0, which means infinite. The function then returns.
Comment 1 Jeremy Allison 2009-10-19 22:15:40 UTC
Ok, this used to work as it was part of my Samba lock tests. I'll look at this immediately and see if I can work out a test we can use to make sure we don't regress again.
Jeremy.
Comment 2 Jeremy Allison 2009-10-19 22:43:10 UTC
Created attachment 4866 [details]
Quick test fix.

Ok, here's a (completely untested) fix I'll test at work tomorrow. Compiles, but nothing more. Most of this fix will be writing the smbtorture test case I think....
Jeremy.
Comment 3 Jeremy Allison 2009-10-20 19:55:58 UTC
Ok, the fix works (and is now regression tested in the build farm as LOCK9).
I've committed to master and 3.4.3. I think this is an important enough regression from 3.0.x that it needs to be in 3.3.10 and 3.4.3. Back-ports to follow.
Jeremy
Comment 4 Jeremy Allison 2009-10-20 20:12:03 UTC
Created attachment 4872 [details]
git am format patch for 3.4.3.
Comment 5 Jeremy Allison 2009-10-20 20:19:09 UTC
Created attachment 4873 [details]
git-am format patch for 3.3.10.
Comment 6 Jeremy Allison 2009-10-20 20:20:13 UTC
Volker and Metze, the git-am format patches for 3.4.3 and 3.3.10 are identical to the patch I've put into master and 3.5.0 that passes the new LOCK9 smbtorture regression test. Please review and then re-assign to Karolin for inclusion.

Thanks !

Jeremy.
Comment 7 Stefan Metzmacher 2009-10-21 02:39:11 UTC
Comment on attachment 4873 [details]
git-am format patch for 3.3.10.

Looks fine for 3.4.x and 3.3.x
Comment 8 Volker Lendecke 2009-10-21 04:25:36 UTC
Looks good.

This means 3.4.3 will be deferred a week and be released not before October 28.

Volker
Comment 9 Jeremy Allison 2009-10-21 11:23:56 UTC
I thought 6829 was a blocker anyway (maybe you should look at this to see if you agree).
Jeremy.
Comment 10 Karolin Seeger 2009-10-22 09:33:42 UTC
Pushed to v3-3-test and v3-4-test.
Closing out bug report.

Thanks!
Comment 11 Barry Sabsevitz 2009-10-26 19:17:43 UTC
Hi Jeremy, I applied the patch above for Samba 3.4.3 to Samba 3.4.1 and tried our test program below, and it still seems to have a problem. I have attached the smbd.log with debug level 10 to this defect. It looks like the lock request is retried once 10 seconds after the initial request failed because of a lock conflict, but then it is never retried a second or additional time. So basically if the lock is dropped within 10 seconds, it works fine but if the lock conflict lasts more than 10 seconds it doesn't seem to work. If you believe I have misapplied this patch to Samba 3.4.1, please close this defect again. I have attached the smbd.log file excerpt and the test program that I used. To repro:

1. Run the ctdb_test program from 1 window, but don't use CIFS, on a file. It should say "File locked". It is waiting for input before it drops the lock.

2. Then, run the ctdb_test program from a second window on the same file but do this through CIFS. The program should say "locking file" but won't say "File locked" because it conflicts with the lock above.

Now wait 15-20 seconds and then kill the program in step #1 above. You will see that the program in #2 never grabs the lock. If it did, you should see the output that says "File locked".

Turning on debugs with level 10 debug will show the output that I have attached above. 
Comment 12 Barry Sabsevitz 2009-10-26 19:19:20 UTC
Created attachment 4889 [details]
Test program to grab lock and not release it.

Run this program as: ctdb_test <path to file>

It will display when it tries to grab the byte lock and when it has successfully grabbed the byte lock.
Comment 13 Barry Sabsevitz 2009-10-26 19:20:20 UTC
Created attachment 4890 [details]
Output of debug level 10 when I applied patch to Samba 3.4.1 and ran my program has described in defect.
Comment 14 Barry Sabsevitz 2009-10-26 19:23:13 UTC
It looks like Samba decides to close the share as being idle after the lock retries 10 seconds later and is never retried again. If you run my program as desribed above but kill the first locking process within 10 seconds, it looks like the fix works. But if the lock conflict lasts more than 10 seconds (i.e. 1 retry still reports the lock as being grabbed), then the lock is never retried again.

If you can reproduce this same failure with your fix on Samba 3.4.1 or Samba 3.4.3, then this defect is valid. If you find that you can't reproduce it, then I will re-check how I applied the patch.

Thanks.
Barry
Comment 15 Jeremy Allison 2009-10-26 19:59:03 UTC
Created attachment 4891 [details]
Follow on patch for master, 3.5.0 and v3-4-test.

Here is an (untested) fix that should do 2 things.

a). Update the "used" flag on connection structs so they don't get marked idle when a deferred lock is on them.
b). Ensure we always recalculate lock timeouts on every pass though process_blocking_lock_queue() - this lock timeout recalculation code already copes with empty lock queues.

Please let me know if this fixes the problem for you.

If it does this fix is too late to make it into 3.4.3 (which is due on 28th) but will be in 3.5.0 and 3.4.4, 3.3.10 and beyond.

Jeremy.
Comment 16 Jeremy Allison 2009-10-26 19:59:40 UTC
If this patch works I'll update the torture tester to check for this specific case.
Jeremy.
Comment 17 Jeremy Allison 2009-10-26 23:24:57 UTC
Fix works with the test case given so I've checked the fix into master and 3.5.0. I'll update smbtorture to test this case tomorrow.
Jeremy.
Comment 18 Barry Sabsevitz 2009-10-26 23:27:33 UTC
I just tried running some tests and it looks like the fix works for me too. Some of my co-workers may try some more tests tomorrow. Thanks for fixing this.
Comment 19 Jeremy Allison 2009-10-26 23:36:50 UTC
Thanks for testing. Please keep kicking this code and finding bugs. It's *much* appreciated !

Jeremy.
Comment 20 Jeremy Allison 2009-10-27 13:44:51 UTC
Created attachment 4899 [details]
git-am format follow on patch for 3.4.4.

Follow on patch for 3.4.4.
Comment 21 Jeremy Allison 2009-10-27 13:57:00 UTC
Created attachment 4900 [details]
git-am format follow on patch for 3.3.10
Comment 22 Jeremy Allison 2009-10-27 17:24:24 UTC
Karolin please pull these patches into 3.3.10 and 3.4.4 (once 3.4.3 has shipped).
Thanks !
Jeremy.
Comment 23 Jeremy Allison 2009-11-09 15:06:51 UTC
Trying to get some more reviewers so Karolin can get this pushed :-).
Jeremy.
Comment 24 Stefan Metzmacher 2009-11-10 04:53:49 UTC
Comment on attachment 4900 [details]
git-am format follow on patch for 3.3.10

looks good
Comment 25 Stefan Metzmacher 2009-11-10 04:54:23 UTC
Comment on attachment 4899 [details]
git-am format follow on patch for 3.4.4.

looks good
Comment 26 Karolin Seeger 2009-11-10 05:29:40 UTC
Pushed follow on patches to v3-3-test and v3-4-test.
Closing out bug report.

Thanks!