Bug 9191 - Lock order violation in s3fs
Summary: Lock order violation in s3fs
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.0
Classification: Unclassified
Component: File services (show other bugs)
Version: 4.0.0rc1
Hardware: x64 Linux
: P5 critical (vote)
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks: 9196
  Show dependency treegraph
 
Reported: 2012-09-21 15:28 UTC by Lukasz Zalewski
Modified: 2012-09-27 10:37 UTC (History)
1 user (show)

See Also:


Attachments
Patch for v4-0-test (1.17 KB, patch)
2012-09-22 23:39 UTC, Stefan Metzmacher
vl: review+
Details
Backtrace from a crashed smbd process (27.00 KB, text/plain)
2012-09-23 10:46 UTC, Lukasz Zalewski
no flags Details
git-am fix for 3.6.next (1.13 KB, patch)
2012-09-24 19:28 UTC, Jeremy Allison
vl: review+
Details
git-am fix for 3.5.next (1.15 KB, patch)
2012-09-24 19:37 UTC, Jeremy Allison
vl: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Lukasz Zalewski 2012-09-21 15:28:27 UTC
Hi,
Often (once a day) the following crash occurs in the smbd (4.1.0pre1-GIT-ea96d79)
[2012/09/21 12:23:24.576002,  0] ../lib/dbwrap/dbwrap.c:193(dbwrap_check_lock_order)
  Lock order violation: Trying /usr/local/samba/var/lock/smbXsrv_session_global.tdb at 1 while /usr/local/samba/var/lock/locking.tdb at 1 is locked
[2012/09/21 12:23:24.576148,  0] ../lib/dbwrap/dbwrap.c:133(debug_lock_order)
  lock order:  1:/usr/local/samba/var/lock/locking.tdb 2:<none> 3:<none>
[2012/09/21 12:23:24.576236,  0] ../source3/lib/util.c:810(smb_panic_s3)
  PANIC (pid 25048): invalid lock_order
[2012/09/21 12:23:24.577829,  0] ../source3/lib/util.c:921(log_stack_trace)
  BACKTRACE: 37 stack frames:
   #0 /usr/local/samba/lib/libsmbconf.so.0(log_stack_trace+0x1f) [0x7f5a4b211007]
   #1 /usr/local/samba/lib/libsmbconf.so.0(smb_panic_s3+0x6d) [0x7f5a4b210e76]
   #2 /usr/local/samba/lib/libsamba-util.so.0(smb_panic+0x28) [0x7f5a4d058c6a]
   #3 /usr/local/samba/lib/private/libdbwrap.so(+0x308d) [0x7f5a4758708d]
   #4 /usr/local/samba/lib/private/libdbwrap.so(+0x31cf) [0x7f5a475871cf]
   #5 /usr/local/samba/lib/private/libdbwrap.so(dbwrap_fetch_locked+0x41) [0x7f5a475872d9]
   #6 /usr/local/samba/lib/private/libsmbd_base.so(smbXsrv_session_logoff+0x126) [0x7f5a4c8442c9]
   #7 /usr/local/samba/lib/private/libsmbd_base.so(+0x194a78) [0x7f5a4c844a78]
   #8 /usr/local/samba/lib/private/libdbwrap.so(+0x59ff) [0x7f5a475899ff]
   #9 /usr/local/samba/lib/private/libdbwrap.so(+0x5ae5) [0x7f5a47589ae5]
   #10 /usr/local/samba/lib/private/libdbwrap.so(dbwrap_traverse+0x35) [0x7f5a475877b8]
   #11 /usr/local/samba/lib/private/libsmbd_base.so(smbXsrv_session_logoff_all+0xb5) [0x7f5a4c84489f]
   #12 /usr/local/samba/lib/private/libsmbd_base.so(+0x19b08c) [0x7f5a4c84b08c]
   #13 /usr/local/samba/lib/private/libsmbd_base.so(exit_server_cleanly+0) [0x7f5a4c84b45c]
   #14 /usr/local/samba/lib/private/libsmbd_base.so(+0x12af4d) [0x7f5a4c7daf4d]
   #15 /usr/local/samba/lib/private/libsmbd_base.so(+0x12b626) [0x7f5a4c7db626]
   #16 /usr/local/samba/lib/private/libsmbd_base.so(+0x12ca2a) [0x7f5a4c7dca2a]
   #17 /usr/local/samba/lib/private/libsmbd_base.so(+0x13041d) [0x7f5a4c7e041d]
   #18 /usr/local/samba/lib/private/libsmbd_base.so(create_file_default+0x2f8) [0x7f5a4c7e0f33]
   #19 /usr/local/samba/lib/private/libsmbd_base.so(+0x24966a) [0x7f5a4c8f966a]
   #20 /usr/local/samba/lib/private/libsmbd_base.so(smb_vfs_call_create_file+0xcb) [0x7f5a4c7ec77a]
   #21 /usr/local/samba/lib/private/libsmbd_base.so(+0x17e3e1) [0x7f5a4c82e3e1]
   #22 /usr/local/samba/lib/private/libsmbd_base.so(smbd_smb2_request_process_create+0x7ae) [0x7f5a4c82b862]
   #23 /usr/local/samba/lib/private/libsmbd_base.so(smbd_smb2_request_dispatch+0xf3a) [0x7f5a4c823493]
   #24 /usr/local/samba/lib/private/libsmbd_base.so(+0x17fd0e) [0x7f5a4c82fd0e]
   #25 /usr/local/samba/lib/libsmbconf.so.0(run_events_poll+0x262) [0x7f5a4b22dbb2]
   #26 /usr/local/samba/lib/libsmbconf.so.0(+0x4330a) [0x7f5a4b22e30a]
   #27 /usr/local/samba/lib/private/libtevent.so.0(_tevent_loop_once+0xe8) [0x7f5a4b47bee4]
   #28 /usr/local/samba/lib/private/libsmbd_base.so(smbd_process+0x117e) [0x7f5a4c807e51]
   #29 /usr/local/samba/sbin/smbd() [0x409554]
   #30 /usr/local/samba/lib/libsmbconf.so.0(run_events_poll+0x71a) [0x7f5a4b22e06a]
   #31 /usr/local/samba/lib/libsmbconf.so.0(+0x4330a) [0x7f5a4b22e30a]
   #32 /usr/local/samba/lib/private/libtevent.so.0(_tevent_loop_once+0xe8) [0x7f5a4b47bee4]
   #33 /usr/local/samba/sbin/smbd() [0x40a0e4]
   #34 /usr/local/samba/sbin/smbd(main+0x15a1) [0x40b7d6]
   #35 /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f5a49ae4cdd]
   #36 /usr/local/samba/sbin/smbd() [0x405209]
[2012/09/21 12:23:24.581183,  0] ../source3/lib/util.c:822(smb_panic_s3)
  smb_panic(): calling panic action [/bin/sleep 999999999]
Comment 1 Volker Lendecke 2012-09-21 15:52:45 UTC
Please recompile with debugging symbols, so that we get line numbers in the stack trace.
Comment 2 Lukasz Zalewski 2012-09-21 16:10:41 UTC
Hi Volker,
The build was created using configure.developer so I'm assuming it has been built with the debug symbols. The trace below was taken from the log file. When the problem occurs again i will try to get the full trace
Comment 3 Stefan Metzmacher 2012-09-22 23:39:35 UTC
Created attachment 7919 [details]
Patch for v4-0-test
Comment 4 Lukasz Zalewski 2012-09-23 10:46:00 UTC
Created attachment 7924 [details]
Backtrace from a crashed smbd process

Hi Metze,
After applying your fix I'm still experiencing crashes. Please see attached backtrace from the crashed smbd process
Comment 5 Volker Lendecke 2012-09-23 17:02:22 UTC
Comment on attachment 7919 [details]
Patch for v4-0-test

This bug is not fixed yet, but this patch here is right and should go into 4.0
Comment 6 Stefan Metzmacher 2012-09-23 17:52:49 UTC
(In reply to comment #4)
> Created attachment 7924 [details]
> Backtrace from a crashed smbd process
> 
> Hi Metze,
> After applying your fix I'm still experiencing crashes. Please see attached
> backtrace from the crashed smbd process

From reading the backtrace it seems it is a different more complex bug.
Can you file a new bug report for it?
Comment 7 Stefan Metzmacher 2012-09-23 18:08:47 UTC
(In reply to comment #6)
> (In reply to comment #4)
> > Created attachment 7924 [details] [details]
> > Backtrace from a crashed smbd process
> > 
> > Hi Metze,
> > After applying your fix I'm still experiencing crashes. Please see attached
> > backtrace from the crashed smbd process
> 
> From reading the backtrace it seems it is a different more complex bug.
> Can you file a new bug report for it?

Ok, I already created
https://bugzilla.samba.org/show_bug.cgi?id=9196
Comment 8 Stefan Metzmacher 2012-09-23 18:09:36 UTC
Karolin please pick the patch for v4-0-test.
Comment 9 Jeremy Allison 2012-09-24 19:28:01 UTC
Created attachment 7931 [details]
git-am fix for 3.6.next

This is also needed for 3.6.next. Fix attached.
Jeremy.
Comment 10 Jeremy Allison 2012-09-24 19:37:21 UTC
Created attachment 7932 [details]
git-am fix for 3.5.next

And for 3.5.next also...

Jeremy.
Comment 11 Volker Lendecke 2012-09-24 19:48:41 UTC
I'm not sure these need to go into 3.5 and 3.6. First, we do not have the lock order checks there. Second, in master we are doing a lot more to take down the connections and files. I am not sure we will ever hit a deadlock with the 3.5 and 3.6 code, at least I have not seen any. Agreed, it is a hygiene thing to make this a bit cleaner, but without at least some slight hint what this really fixes I am a bit reluctant to + it for 3.6 and 3.5.
Comment 12 Stefan Metzmacher 2012-09-24 20:12:59 UTC
Comment on attachment 7931 [details]
git-am fix for 3.6.next

looks ok
Comment 13 Stefan Metzmacher 2012-09-24 20:13:29 UTC
Comment on attachment 7932 [details]
git-am fix for 3.5.next

looks good
Comment 14 Stefan Metzmacher 2012-09-24 20:15:59 UTC
(In reply to comment #13)
> Comment on attachment 7932 [details]
> git-am fix for 3.5.next
> 
> looks good

Volker, sorry I didn't saw your comment.

I agree that it might not be strictly needed in 3.5 and 3.6,
but I also don't think it would harm.

But I led Volker and Jeremy decide if we backport it to 3.5 and 3.6.
Comment 15 Jeremy Allison 2012-09-24 20:44:26 UTC
That's my opinion too (it won't hurt). Every other error code path out of that function defer_open() and callers in 3.5.x and 3.6.x explicitly does a TALLOC_FREE on lck, so as a hygiene thing it's certainly a good fix to do.

Volker, you can make the final call for 3.5.x and 3.6.x, I'm not going to force it.

Jeremy.
Comment 16 Volker Lendecke 2012-09-24 20:53:31 UTC
(In reply to comment #15)
> That's my opinion too (it won't hurt). Every other error code path out of that
> function defer_open() and callers in 3.5.x and 3.6.x explicitly does a
> TALLOC_FREE on lck, so as a hygiene thing it's certainly a good fix to do.
> 
> Volker, you can make the final call for 3.5.x and 3.6.x, I'm not going to force
> it.

Now it's all my fault, right? I am just in general reluctant to modify a stable release without a good reason. *Probably* it will not hurt, but I have never ever seen a bug in this. Even with ctdb around I have not seen a deadlock coming from this, and large clusters are running with this, finding all sorts of weird rundown problems. I think our rundown code is no exception: It is not as well tested as the "good" code paths, and I worry that we break something with no good reason.

Metze +'ed this, so it is good to go in, right? I am out of this game now, no more ack needed. I will not - it, I just do not see a good reason to mess with things that have never caused us trouble so far.

Jeremy, if you can point me at a code path that will be affected by this, then please tell me.
Comment 17 Jeremy Allison 2012-09-24 21:28:05 UTC
Sigh... and I thought I was being courteous and polite by giving you the final decision :-). Should have learned my lesson, never  try and do a good turn for anyone, they won't appreciate it :-).

What seems to be behind this is the fear our code is so fragile that a pretty obvious general clean-up fix will break something. I don't think we're that bad.

I disagree (obviously), but as this problem wasn't reported on 3.6.x I'm not going to push it and let's let Karolin just pick for 4.0.x only. If it is reported on earlier versions we at least have these fixes available.

So Karolin, please ignore the 3.6.next and 3.5.next fixes and just pick the v4-0-test fix.

Cheers,

Jeremy.
Comment 18 Volker Lendecke 2012-09-24 21:36:15 UTC
Comment on attachment 7931 [details]
git-am fix for 3.6.next

Sorry for being so disruptive. If we are in this code path, we already have a severe bug. So it does not matter if we break more. So, following your advice, let's put this into the stable releases.
Comment 19 Volker Lendecke 2012-09-24 21:37:25 UTC
Comment on attachment 7932 [details]
git-am fix for 3.5.next

Same comment as the last ack: Getting here we already have a very severe bug, so it does not hurt further. And it seems we need this patch in a stable release.
Comment 20 Volker Lendecke 2012-09-24 21:39:04 UTC
BTW, what is our general policy on 3.5 and 3.6? Is the new policy to always backport everything to those? I thought that at least 3.5 is phasing out slowly, but probably we can not ever phase out anything anymore, given that we are many, many months late on 4.0? Can we get some more general discussion about this topic?
Comment 21 Jeremy Allison 2012-09-24 21:41:48 UTC
Personally if it looks like code that affects these earlier releases I back-port (as I did here, as the code is relatively identical) as 3.5.x and 3.6.x are our active release streams. However, once we finally ship 4.0.0 I'll stop any back-ports to 3.5.x as we only have bandwidth for 2 active release streams (IMHO).

Jeremy.
Comment 22 Volker Lendecke 2012-09-24 21:49:58 UTC
The problem I have with this particular piece is that we are fixing a code path that is only called if we have a severe bug somewhere already (duplicate defer). I just doubt that we have this kind of bug in 3.5 and 3.6. So I just do not see a good reason to modify it. If you can show me how we can end up in this code path or which piece of code this TALLOC_FREE actually fixes, I will be much more happy.

I know this discussion will only lead to me not having to ack patches again, but maybe I am just too old and conservative regarding production code.
Comment 23 Jeremy Allison 2012-09-24 22:10:28 UTC
No problem. I can't show you what you request, and to me this is more of a error path cleanup issue which is not critical.

I'm happy to leave it with the instruction to Karolin in comment 17 to push for 4.0.0 only.

Cheers,

Jeremy.
Comment 24 Karolin Seeger 2012-09-27 10:37:25 UTC
Pushed to v4-0-test.
Will be included in 4.0.0rc2.
Closing out bug report.

Thanks!