Bug 9196 - defer_open is triggered multiple times on the same request
defer_open is triggered multiple times on the same request
Status: RESOLVED FIXED
Product: Samba 4.0
Classification: Unclassified
Component: File services
4.0.0rc1
All All
: P5 normal
: ---
Assigned To: Karolin Seeger
Samba QA Contact
:
Depends on: 9191
Blocks: 9306
  Show dependency treegraph
 
Reported: 2012-09-23 18:07 UTC by Stefan Metzmacher
Modified: 2013-01-17 12:56 UTC (History)
7 users (show)

See Also:


Attachments
Backtrace from a crashed smbd process (same as https://bugzilla.samba.org/attachment.cgi?id=7924&action=edit) (27.00 KB, text/plain)
2012-09-23 18:07 UTC, Stefan Metzmacher
no flags Details
debug level 10 log of a crash (819.68 KB, text/x-log)
2012-09-25 23:01 UTC, Christian Ambach
no flags Details
capture leading to panic (34.11 KB, application/vnd.tcpdump.pcap)
2012-09-25 23:01 UTC, Christian Ambach
no flags Details
backtrace after panic (40.98 KB, text/plain)
2012-09-25 23:02 UTC, Christian Ambach
no flags Details
Patch for 4.0.0rc from metze. (2.04 KB, patch)
2012-09-26 22:11 UTC, Jeremy Allison
ambi: review+
metze: review-
Details
backported patch for 3.6.next. (2.09 KB, patch)
2012-09-26 22:40 UTC, Jeremy Allison
no flags Details
Patches for v4-0-test (4.95 KB, patch)
2012-09-29 07:55 UTC, Stefan Metzmacher
ambi: review+
metze: review? (jra)
Details
Patches for v3-6-test (6.93 KB, patch)
2012-09-29 08:46 UTC, Stefan Metzmacher
ambi: review+
metze: review? (jra)
Details
Bactrace from the crashed process (27.82 KB, text/plain)
2012-10-04 14:22 UTC, Lukasz Zalewski
no flags Details
Hack to reproduce the problem (1.67 KB, patch)
2012-10-04 22:23 UTC, Stefan Metzmacher
no flags Details
Hack to reproduce. (446 bytes, patch)
2012-12-14 16:43 UTC, Jeremy Allison
no flags Details
git-am fix for master and 4.0.1 (1.44 KB, patch)
2012-12-14 16:45 UTC, Jeremy Allison
metze: review+
jra: review? (vl)
jra: review? (asn)
ddiss: review+
Details
git-am fix for 3.6.next (1.42 KB, patch)
2012-12-14 16:58 UTC, Jeremy Allison
metze: review+
jra: review? (vl)
jra: review? (asn)
ddiss: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Stefan Metzmacher 2012-09-23 18:07:59 UTC
Created attachment 7925 [details]
Backtrace from a crashed smbd process (same as https://bugzilla.samba.org/attachment.cgi?id=7924&action=edit)

defer_open is triggered multiple times on the same request,
which triggers exit_server() which then triggers an assert
in release_file_oplock().
Comment 1 Stefan Metzmacher 2012-09-23 18:13:03 UTC
Lukasz, are you able to trigger this at will?

It would be very good to get a level 10 log file and a network capture
of a client triggering this.
Comment 2 Lukasz Zalewski 2012-09-23 19:49:46 UTC
Hi Metze,
Unfortunately i can't triger it at will. I will try to find a way to reliably replicate the crash.
Comment 3 Christian Ambach 2012-09-25 23:01:07 UTC
Created attachment 7937 [details]
debug level 10 log of a crash
Comment 4 Christian Ambach 2012-09-25 23:01:41 UTC
Created attachment 7938 [details]
capture leading to panic
Comment 5 Christian Ambach 2012-09-25 23:02:02 UTC
Created attachment 7939 [details]
backtrace after panic
Comment 6 Christian Ambach 2012-09-26 00:02:33 UTC
same crash can be initiated by running the smb2.oplock.batch22 test (which by default is skipped in make test).

Problem seems to be caused by client not replying to oplock break message
Comment 7 Jeremy Allison 2012-09-26 00:07:56 UTC
Can you check this doesn't occur against 3.6.x ? I'm wondering if my atomic open changes might have broken something (didn't think so, but it's possible).

Jeremy.
Comment 8 Jeremy Allison 2012-09-26 22:11:27 UTC
Created attachment 7946 [details]
Patch for 4.0.0rc from metze.

This is metze's patch that went into master.
Comment 9 Jeremy Allison 2012-09-26 22:40:13 UTC
Created attachment 7947 [details]
backported patch for 3.6.next.

I think this needs to be in 3.6.next also.
Jeremy.
Comment 10 Christian Ambach 2012-09-28 14:53:10 UTC
Comment on attachment 7947 [details]
backported patch for 3.6.next.

v3-6-test (w/o the patch) does not crash when running smb2.oplock.batch22, so I cannot make an easy judgement if this patch helps or not.

I'll leave it up to Metze to decide
Comment 11 Jeremy Allison 2012-09-28 16:21:54 UTC
Comment on attachment 7947 [details]
backported patch for 3.6.next.

Ok metze, I'm trusting in your judgement here :-). But it looks really suspicious to me and we *should* have crashed without this patch :-).

Thanks Christian !

Jeremy.
Comment 12 Stefan Metzmacher 2012-09-29 07:30:59 UTC
Comment on attachment 7947 [details]
backported patch for 3.6.next.

make test doesn't run the smb2.oplock test.

I'm currently testing patchset which also backports the changed tests.
Comment 13 Stefan Metzmacher 2012-09-29 07:52:56 UTC
Comment on attachment 7946 [details]
Patch for 4.0.0rc from metze.

We should also backport the changed tests...
Comment 14 Stefan Metzmacher 2012-09-29 07:55:54 UTC
Created attachment 7958 [details]
Patches for v4-0-test

This patchset also includes the updated tests
Comment 15 Stefan Metzmacher 2012-09-29 08:46:30 UTC
Created attachment 7959 [details]
Patches for v3-6-test

I think we should backport this fix to 3.6, as some bug could still pop up,
even if we don't trigger this easily.

This also backports the changed tests and activates smb2.oplock in
make test.

Karolin, this also includes a backport of
ef8e228a5b4952b51d9db6b28017142759f40d10
(s3:selftest: mark the driver_info_winreg tests knownfail),
which removes the last error from 'make test' in v3-6-test.
Comment 16 Stefan Metzmacher 2012-09-29 19:33:54 UTC
Karolin, please pick for the next releases
Comment 17 Karolin Seeger 2012-10-01 07:16:00 UTC
Pushed to v3-6-test and autobuild-v4-0-test.
Closing out bug report.

Thanks!
Comment 18 Lukasz Zalewski 2012-10-04 14:22:10 UTC
Created attachment 7990 [details]
Bactrace from the crashed process

Hi all,
Although the bug has been closed i'm still experiencing crashes - not as many though but still more-less daily. Please see attached backtrace from a crashed process. 
Shall i open a new bug for it?
Comment 19 Volker Lendecke 2012-10-04 16:23:03 UTC
(In reply to comment #18)
> Created attachment 7990 [details]
> Bactrace from the crashed process
> 
> Hi all,
> Although the bug has been closed i'm still experiencing crashes - not as many
> though but still more-less daily. Please see attached backtrace from a crashed
> process. 
> Shall i open a new bug for it?

Probably yes. If you somehow can get a debug level 10 log leading to this it would be great.
Comment 20 Stefan Metzmacher 2012-10-04 19:46:30 UTC
(In reply to comment #18)
> Created attachment 7990 [details]
> Bactrace from the crashed process
> 
> Hi all,
> Although the bug has been closed i'm still experiencing crashes - not as many
> though but still more-less daily. Please see attached backtrace from a crashed
> process. 
> Shall i open a new bug for it?

Are you testing with rc2? or from git?

The related patches were missing in rc2 due to a mistake,
but they're not in the v4-0-test branch.

To me it seems you don't have the patch.
Comment 21 Lukasz Zalewski 2012-10-04 20:37:04 UTC
Hi,
From git. I'm running Version 4.1.0pre1-GIT-aad669b (probably week old)
but it does contain commits 4604219ceba96955b3c4bf6ab31aa70c11442d61,
7a14f09f9cc14cfb6234175add1841faf751d51a and 98d117a5424d62804b7cb3d8a9ad35e703fc158a.
I will update and re-test
Comment 22 Stefan Metzmacher 2012-10-04 22:23:25 UTC
Created attachment 7994 [details]
Hack to reproduce the problem

Ok, I seems it can be triggered if the smbd that handles the oplock break,
doesn't response to the requester.
Comment 23 Stefan Metzmacher 2012-10-17 09:13:07 UTC
Jeremy can you have a look at this to me it's not clear why
we assert in the case where another smbd doesn't send an oplock
break response. I think in that case we should return INTERNAL_ERROR
or so to the client instead of disconnecting the requesting client,
which might cause data corruption for him.

The 2nd question is why this is triggered by Lukasz's workload,
it seems the real problem is in the other smbd (not the one that gives the backtrace).
Comment 24 Jeremy Allison 2012-10-17 09:38:44 UTC
Ok Metze I'll examine this. It will have to wait until I get back to the office though (Wed 24th next week).

Cheers,

Jeremy.
Comment 25 Stefan Metzmacher 2012-11-30 13:39:00 UTC
Ping Jeremy...
Comment 26 Jeremy Allison 2012-12-04 01:02:15 UTC
Looking at this again now.

Jeremy.
Comment 27 Jeremy Allison 2012-12-08 01:07:20 UTC
Ok, looking at this more closely we're asserting as we've got an incorrect count of the number of exclusive oplocks we've granted. This should never be negative, if it is it's a logic error.

We should be able to cope easily with the target smbd not responding, we have a timeout set up waiting for this to expire, set up in smb2_create:push_deferred_open_message_smb2().

I'm going to need to reproduce this myself to understand exactly.

Metze, with your sleep patch what did you do on the client side to reproduce ?

Jeremy.
Comment 28 Stefan Metzmacher 2012-12-08 07:05:20 UTC
(In reply to comment #27)
> Ok, looking at this more closely we're asserting as we've got an incorrect
> count of the number of exclusive oplocks we've granted. This should never be
> negative, if it is it's a logic error.

I thought it complains that it was already deferred.

> We should be able to cope easily with the target smbd not responding, we have a
> timeout set up waiting for this to expire, set up in
> smb2_create:push_deferred_open_message_smb2().
> 
> I'm going to need to reproduce this myself to understand exactly.
> 
> Metze, with your sleep patch what did you do on the client side to reproduce ?

If I remember correctly I just run any torture test that asked for an oplock,
maybe smb2.oplock
Comment 29 Jeremy Allison 2012-12-08 17:14:51 UTC
That's not where it panics, but it would have the same effect - if we defer twice then when the second defer timeout fires it'll try and decrement the oplock count again (when it's probably already zero).

I'll try and look at this on Monday, but if I can't get to it then it'll have to wait until I get back from the UK.

Jeremy.
Comment 30 Jeremy Allison 2012-12-11 00:29:52 UTC
Ok, it's not crashing for me with a sleep in that point in the code and running the oplock tests. I'll do some more investigating.
Jeremy.
Comment 31 Jeremy Allison 2012-12-11 01:06:11 UTC
Ok, got it to reproduce. Will investigate asap.

Jeremy.
Comment 32 Volker Lendecke 2012-12-11 05:54:22 UTC
Please tell me how you reproduced this. I tried as well but failed.
Comment 33 Jeremy Allison 2012-12-14 16:07:37 UTC
Ok, I have a fix for this. Testing now. I will upload the reproducing code and the fix for master/4.0.x.

Jeremy.
Comment 34 Jeremy Allison 2012-12-14 16:43:33 UTC
Created attachment 8348 [details]
Hack to reproduce.

As from metze, apply this patch then run the test:

smbtorture //localhost/tmp -Uuser%password smb2.oplock

and WAIT for 120 seconds. Auxiliary smbd processes will panic.

Jeremy.
Comment 35 Jeremy Allison 2012-12-14 16:45:22 UTC
Created attachment 8349 [details]
git-am fix for master  and 4.0.1

From the commit log:

    get_deferred_open_message_state_smb2() is buggy in that it is checking
    the wrong things to determine if an open is in the deferred state.
    
    It checks if (smb2req->async_te == NULL) which is incorrect,
    as we're not always async in a deferred open - remove this.
    
    It should check instead state->open_was_deferred as this
    is explicity set to 'true' when an open is going deferred,
    so add this check.

Running with the reproducer patch and this patch causes the smbtorture smb2.oplock test to complete (with errors, but no crashes) as you would expect if we're not sending oplock breaks to clients.

Jeremy.
Comment 36 Jeremy Allison 2012-12-14 16:58:17 UTC
Created attachment 8350 [details]
git-am fix for 3.6.next

Same fix back-ported to 3.6.x.

Jeremy.
Comment 37 Stefan Metzmacher 2012-12-19 07:19:00 UTC
Comment on attachment 8349 [details]
git-am fix for master  and 4.0.1

Looks good, Karolin, please
git cherry-pick -x 2148d86c7a2facd6e128b753aef98722843af3e1
Comment 38 Stefan Metzmacher 2012-12-19 07:34:27 UTC
Comment on attachment 8350 [details]
git-am fix for 3.6.next

Looks ok
Comment 39 David Disseldorp 2012-12-20 01:45:34 UTC
Comment on attachment 8350 [details]
git-am fix for 3.6.next


"smb2req->async_te" in the commit message should be "smb2req->async" for 3.6, but looks fine as is.
Comment 40 Karolin Seeger 2013-01-02 09:35:11 UTC
(In reply to comment #39)
> Comment on attachment 8350 [details]
> git-am fix for 3.6.next
> 
> 
> "smb2req->async_te" in the commit message should be "smb2req->async" for 3.6,
> but looks fine as is.

Pushed to v3-6-test after correcting the commit message.
Pushed to autobuild-v4-0-test.
Comment 41 Karolin Seeger 2013-01-02 11:17:25 UTC
Pushed to v4-0-test.
Closing out bug report.

Thanks!