6828 – infinite timeout occurs when byte lock held outside of samba

Bug 6828 - infinite timeout occurs when byte lock held outside of samba

Summary: infinite timeout occurs when byte lock held outside of samba

Status:	RESOLVED FIXED

Alias:	None

Product:	Samba 3.4
Classification:	Unclassified
Component:	File services (show other bugs)
Version:	3.4.1
Hardware:	Other Linux

Importance:	P3 regression
Target Milestone:	---
Assignee:	Karolin Seeger
QA Contact:	Samba QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2009-10-19 21:38 UTC by Barry Sabsevitz (mail address dead)
Modified:	2020-12-11 07:14 UTC (History)
CC List:	0 users

See Also:

Attachments
Quick test fix. (1.57 KB, patch) 2009-10-19 22:43 UTC, Jeremy Allison	no flags	Details
git am format patch for 3.4.3. (1.94 KB, patch) 2009-10-20 20:12 UTC, Jeremy Allison	no flags	Details
git-am format patch for 3.3.10. (1.93 KB, patch) 2009-10-20 20:19 UTC, Jeremy Allison	metze: review+	Details
Test program to grab lock and not release it. (962 bytes, text/plain) 2009-10-26 19:19 UTC, Barry Sabsevitz (mail address dead)	no flags	Details
Output of debug level 10 when I applied patch to Samba 3.4.1 and ran my program has described in defect. (11.73 KB, text/plain) 2009-10-26 19:20 UTC, Barry Sabsevitz (mail address dead)	no flags	Details
Follow on patch for master, 3.5.0 and v3-4-test. (1.38 KB, patch) 2009-10-26 19:59 UTC, Jeremy Allison	no flags	Details
git-am format follow on patch for 3.4.4. (1.89 KB, patch) 2009-10-27 13:44 UTC, Jeremy Allison	metze: review+	Details
git-am format follow on patch for 3.3.10 (2.05 KB, patch) 2009-10-27 13:57 UTC, Jeremy Allison	metze: review+	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Barry Sabsevitz (mail address dead) 2009-10-19 21:38:08 UTC

On Samba 3.4.1, I ran the following test:

1. Mount a CIFS share via Samba to a Linux client.

2. On the system exporting the CIFS share, I ran a program that grabs a write byte lock in blocking mode and then sleeps. This lock is grabbed on the file that resides on the ext3 file system and is not being grabbed via CIFS.

3. I then ran the same program over CIFS to grab a write byte lock on the file that already has the lock held from step 2. My expectation is that this would block because the lock was already held in step #2. This works as expected.

4. I then killed the process in step #2 and expected that the process blocked on the lock in step 3 would wake up. I waited multiple minutes and it never did.

The following output from the Samba debug log seems to imply it is waiting forever:

  fcntl_lock: fcntl lock gave errno 11 (Resource temporarily unavailable)
[2009/10/19 19:11:35,  3] ../lib/util/util.c:273(fcntl_lock)
  fcntl_lock: lock failed at offset 0 count 9223372036854775808 op 6 type 1 (Res
ource temporarily unavailable)
[2009/10/19 19:11:35,  8] locking/posix.c:219(posix_fcntl_lock)
  posix_fcntl_lock: Lock call failed
[2009/10/19 19:11:35,  5] locking/posix.c:1215(set_posix_lock_posix_flavour)
  set_posix_lock_posix_flavour: Lock fail !: Type = WRITE: offset = 0, count = 9
223372036854775808. Errno = Resource temporarily unavailable
[2009/10/19 19:11:35, 10] locking/locking.c:282(do_lock)
  do_lock: returning status=NT_STATUS_FILE_LOCK_CONFLICT
[2009/10/19 19:11:35, 10] smbd/blocking.c:89(recalc_brl_timeout)
  Next timeout = Infinite.
[2009/10/19 19:11:35,  3] smbd/blocking.c:205(push_blocking_lock_request)
  push_blocking_lock_request: lock request blocked with expiry time (0 sec. 0 us
ec) (+-1 msec) for fnum = 5116, name = hi
[2009/10/19 19:11:35, 10] libsmb/smb_signing.c:67(store_sequence_for_reply)

Samba should retry the lock (poll for it) in some increment and not wait forever. This bug was seen with Samba 3.4.1 but I believe it may exist in older versions of Samba as well. 

I wonder if the following is the culprit:

The routine recalc_brl_timeout() has code that looks to me that if it had only 1 blocking lock on the blocking_lock_queue would evaluate next_timeout to be 0, which means infinite. The function then returns.

Comment 1 Jeremy Allison 2009-10-19 22:15:40 UTC

Ok, this used to work as it was part of my Samba lock tests. I'll look at this immediately and see if I can work out a test we can use to make sure we don't regress again.
Jeremy.

Comment 2 Jeremy Allison 2009-10-19 22:43:10 UTC

Created attachment 4866 [details]
Quick test fix.

Ok, here's a (completely untested) fix I'll test at work tomorrow. Compiles, but nothing more. Most of this fix will be writing the smbtorture test case I think....
Jeremy.

Comment 3 Jeremy Allison 2009-10-20 19:55:58 UTC

Ok, the fix works (and is now regression tested in the build farm as LOCK9).
I've committed to master and 3.4.3. I think this is an important enough regression from 3.0.x that it needs to be in 3.3.10 and 3.4.3. Back-ports to follow.
Jeremy

Comment 4 Jeremy Allison 2009-10-20 20:12:03 UTC

Created attachment 4872 [details]
git am format patch for 3.4.3.

Comment 5 Jeremy Allison 2009-10-20 20:19:09 UTC

Created attachment 4873 [details]
git-am format patch for 3.3.10.

Comment 6 Jeremy Allison 2009-10-20 20:20:13 UTC

Volker and Metze, the git-am format patches for 3.4.3 and 3.3.10 are identical to the patch I've put into master and 3.5.0 that passes the new LOCK9 smbtorture regression test. Please review and then re-assign to Karolin for inclusion.

Thanks !

Jeremy.

Comment 7 Stefan Metzmacher 2009-10-21 02:39:11 UTC

Comment on attachment 4873 [details]
git-am format patch for 3.3.10.

Looks fine for 3.4.x and 3.3.x

Comment 8 Volker Lendecke 2009-10-21 04:25:36 UTC

Looks good.

This means 3.4.3 will be deferred a week and be released not before October 28.

Volker

Comment 9 Jeremy Allison 2009-10-21 11:23:56 UTC

I thought 6829 was a blocker anyway (maybe you should look at this to see if you agree).
Jeremy.

Comment 10 Karolin Seeger 2009-10-22 09:33:42 UTC

Pushed to v3-3-test and v3-4-test.
Closing out bug report.

Thanks!

Comment 11 Barry Sabsevitz (mail address dead) 2009-10-26 19:17:43 UTC

Hi Jeremy, I applied the patch above for Samba 3.4.3 to Samba 3.4.1 and tried our test program below, and it still seems to have a problem. I have attached the smbd.log with debug level 10 to this defect. It looks like the lock request is retried once 10 seconds after the initial request failed because of a lock conflict, but then it is never retried a second or additional time. So basically if the lock is dropped within 10 seconds, it works fine but if the lock conflict lasts more than 10 seconds it doesn't seem to work. If you believe I have misapplied this patch to Samba 3.4.1, please close this defect again. I have attached the smbd.log file excerpt and the test program that I used. To repro:

1. Run the ctdb_test program from 1 window, but don't use CIFS, on a file. It should say "File locked". It is waiting for input before it drops the lock.

2. Then, run the ctdb_test program from a second window on the same file but do this through CIFS. The program should say "locking file" but won't say "File locked" because it conflicts with the lock above.

Now wait 15-20 seconds and then kill the program in step #1 above. You will see that the program in #2 never grabs the lock. If it did, you should see the output that says "File locked".

Turning on debugs with level 10 debug will show the output that I have attached above.

Comment 12 Barry Sabsevitz (mail address dead) 2009-10-26 19:19:20 UTC

Created attachment 4889 [details]
Test program to grab lock and not release it.

Run this program as: ctdb_test <path to file>

It will display when it tries to grab the byte lock and when it has successfully grabbed the byte lock.

Comment 13 Barry Sabsevitz (mail address dead) 2009-10-26 19:20:20 UTC

Created attachment 4890 [details]
Output of debug level 10 when I applied patch to Samba 3.4.1 and ran my program has described in defect.

Comment 14 Barry Sabsevitz (mail address dead) 2009-10-26 19:23:13 UTC

It looks like Samba decides to close the share as being idle after the lock retries 10 seconds later and is never retried again. If you run my program as desribed above but kill the first locking process within 10 seconds, it looks like the fix works. But if the lock conflict lasts more than 10 seconds (i.e. 1 retry still reports the lock as being grabbed), then the lock is never retried again.

If you can reproduce this same failure with your fix on Samba 3.4.1 or Samba 3.4.3, then this defect is valid. If you find that you can't reproduce it, then I will re-check how I applied the patch.

Thanks.
Barry

Comment 15 Jeremy Allison 2009-10-26 19:59:03 UTC

Created attachment 4891 [details]
Follow on patch for master, 3.5.0 and v3-4-test.

Here is an (untested) fix that should do 2 things.

a). Update the "used" flag on connection structs so they don't get marked idle when a deferred lock is on them.
b). Ensure we always recalculate lock timeouts on every pass though process_blocking_lock_queue() - this lock timeout recalculation code already copes with empty lock queues.

Please let me know if this fixes the problem for you.

If it does this fix is too late to make it into 3.4.3 (which is due on 28th) but will be in 3.5.0 and 3.4.4, 3.3.10 and beyond.

Jeremy.

Comment 16 Jeremy Allison 2009-10-26 19:59:40 UTC

If this patch works I'll update the torture tester to check for this specific case.
Jeremy.

Comment 17 Jeremy Allison 2009-10-26 23:24:57 UTC

Fix works with the test case given so I've checked the fix into master and 3.5.0. I'll update smbtorture to test this case tomorrow.
Jeremy.

Comment 18 Barry Sabsevitz (mail address dead) 2009-10-26 23:27:33 UTC

I just tried running some tests and it looks like the fix works for me too. Some of my co-workers may try some more tests tomorrow. Thanks for fixing this.

Comment 19 Jeremy Allison 2009-10-26 23:36:50 UTC

Thanks for testing. Please keep kicking this code and finding bugs. It's *much* appreciated !

Jeremy.

Comment 20 Jeremy Allison 2009-10-27 13:44:51 UTC

Created attachment 4899 [details]
git-am format follow on patch for 3.4.4.

Follow on patch for 3.4.4.

Comment 21 Jeremy Allison 2009-10-27 13:57:00 UTC

Created attachment 4900 [details]
git-am format follow on patch for 3.3.10

Comment 22 Jeremy Allison 2009-10-27 17:24:24 UTC

Karolin please pull these patches into 3.3.10 and 3.4.4 (once 3.4.3 has shipped).
Thanks !
Jeremy.

Comment 23 Jeremy Allison 2009-11-09 15:06:51 UTC

Trying to get some more reviewers so Karolin can get this pushed :-).
Jeremy.

Comment 24 Stefan Metzmacher 2009-11-10 04:53:49 UTC

Comment on attachment 4900 [details]
git-am format follow on patch for 3.3.10

looks good

Comment 25 Stefan Metzmacher 2009-11-10 04:54:23 UTC

Comment on attachment 4899 [details]
git-am format follow on patch for 3.4.4.

looks good

Comment 26 Karolin Seeger 2009-11-10 05:29:40 UTC

Pushed follow on patches to v3-3-test and v3-4-test.
Closing out bug report.

Thanks!