Hi,

Often (once a day) the following crash occurs in the smbd (4.1.0pre1-GIT-ea96d79)

[2012/09/21 12:23:24.576002, 0] ../lib/dbwrap/dbwrap.c:193(dbwrap_check_lock_order)
  Lock order violation: Trying /usr/local/samba/var/lock/smbXsrv_session_global.tdb at 1 while /usr/local/samba/var/lock/locking.tdb at 1 is locked
[2012/09/21 12:23:24.576148, 0] ../lib/dbwrap/dbwrap.c:133(debug_lock_order)
  lock order: 1:/usr/local/samba/var/lock/locking.tdb 2:<none> 3:<none>
[2012/09/21 12:23:24.576236, 0] ../source3/lib/util.c:810(smb_panic_s3)
  PANIC (pid 25048): invalid lock_order
[2012/09/21 12:23:24.577829, 0] ../source3/lib/util.c:921(log_stack_trace)
  BACKTRACE: 37 stack frames:
   #0 /usr/local/samba/lib/libsmbconf.so.0(log_stack_trace+0x1f) [0x7f5a4b211007]
   #1 /usr/local/samba/lib/libsmbconf.so.0(smb_panic_s3+0x6d) [0x7f5a4b210e76]
   #2 /usr/local/samba/lib/libsamba-util.so.0(smb_panic+0x28) [0x7f5a4d058c6a]
   #3 /usr/local/samba/lib/private/libdbwrap.so(+0x308d) [0x7f5a4758708d]
   #4 /usr/local/samba/lib/private/libdbwrap.so(+0x31cf) [0x7f5a475871cf]
   #5 /usr/local/samba/lib/private/libdbwrap.so(dbwrap_fetch_locked+0x41) [0x7f5a475872d9]
   #6 /usr/local/samba/lib/private/libsmbd_base.so(smbXsrv_session_logoff+0x126) [0x7f5a4c8442c9]
   #7 /usr/local/samba/lib/private/libsmbd_base.so(+0x194a78) [0x7f5a4c844a78]
   #8 /usr/local/samba/lib/private/libdbwrap.so(+0x59ff) [0x7f5a475899ff]
   #9 /usr/local/samba/lib/private/libdbwrap.so(+0x5ae5) [0x7f5a47589ae5]
   #10 /usr/local/samba/lib/private/libdbwrap.so(dbwrap_traverse+0x35) [0x7f5a475877b8]
   #11 /usr/local/samba/lib/private/libsmbd_base.so(smbXsrv_session_logoff_all+0xb5) [0x7f5a4c84489f]
   #12 /usr/local/samba/lib/private/libsmbd_base.so(+0x19b08c) [0x7f5a4c84b08c]
   #13 /usr/local/samba/lib/private/libsmbd_base.so(exit_server_cleanly+0) [0x7f5a4c84b45c]
   #14 /usr/local/samba/lib/private/libsmbd_base.so(+0x12af4d) [0x7f5a4c7daf4d]
   #15 /usr/local/samba/lib/private/libsmbd_base.so(+0x12b626) [0x7f5a4c7db626]
   #16 /usr/local/samba/lib/private/libsmbd_base.so(+0x12ca2a) [0x7f5a4c7dca2a]
   #17 /usr/local/samba/lib/private/libsmbd_base.so(+0x13041d) [0x7f5a4c7e041d]
   #18 /usr/local/samba/lib/private/libsmbd_base.so(create_file_default+0x2f8) [0x7f5a4c7e0f33]
   #19 /usr/local/samba/lib/private/libsmbd_base.so(+0x24966a) [0x7f5a4c8f966a]
   #20 /usr/local/samba/lib/private/libsmbd_base.so(smb_vfs_call_create_file+0xcb) [0x7f5a4c7ec77a]
   #21 /usr/local/samba/lib/private/libsmbd_base.so(+0x17e3e1) [0x7f5a4c82e3e1]
   #22 /usr/local/samba/lib/private/libsmbd_base.so(smbd_smb2_request_process_create+0x7ae) [0x7f5a4c82b862]
   #23 /usr/local/samba/lib/private/libsmbd_base.so(smbd_smb2_request_dispatch+0xf3a) [0x7f5a4c823493]
   #24 /usr/local/samba/lib/private/libsmbd_base.so(+0x17fd0e) [0x7f5a4c82fd0e]
   #25 /usr/local/samba/lib/libsmbconf.so.0(run_events_poll+0x262) [0x7f5a4b22dbb2]
   #26 /usr/local/samba/lib/libsmbconf.so.0(+0x4330a) [0x7f5a4b22e30a]
   #27 /usr/local/samba/lib/private/libtevent.so.0(_tevent_loop_once+0xe8) [0x7f5a4b47bee4]
   #28 /usr/local/samba/lib/private/libsmbd_base.so(smbd_process+0x117e) [0x7f5a4c807e51]
   #29 /usr/local/samba/sbin/smbd() [0x409554]
   #30 /usr/local/samba/lib/libsmbconf.so.0(run_events_poll+0x71a) [0x7f5a4b22e06a]
   #31 /usr/local/samba/lib/libsmbconf.so.0(+0x4330a) [0x7f5a4b22e30a]
   #32 /usr/local/samba/lib/private/libtevent.so.0(_tevent_loop_once+0xe8) [0x7f5a4b47bee4]
   #33 /usr/local/samba/sbin/smbd() [0x40a0e4]
   #34 /usr/local/samba/sbin/smbd(main+0x15a1) [0x40b7d6]
   #35 /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f5a49ae4cdd]
   #36 /usr/local/samba/sbin/smbd() [0x405209]
[2012/09/21 12:23:24.581183, 0] ../source3/lib/util.c:822(smb_panic_s3)
  smb_panic(): calling panic action [/bin/sleep 999999999]
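For readers unfamiliar with the check that fires in the log above, the following is a minimal sketch of a lock-order rule that would produce a message of that shape. It is not the dbwrap implementation. The names `order_lock`, `order_unlock`, and the `held` table are hypothetical, and the sketch assumes three order levels (as the "lock order: 1:... 2:... 3:..." line suggests) and that a new lock needs a strictly higher order than every lock already held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model, not Samba's dbwrap code: three lock-order levels. */
#define MAX_LOCK_ORDER 3

/* held[i] names the database currently locked at order i + 1, or NULL. */
static const char *held[MAX_LOCK_ORDER];

/*
 * Try to lock `name` at `order`. Assumed rule: a new lock must have a
 * strictly higher order than every lock already held, so any held lock
 * at `order` or above is reported as a violation.
 */
bool order_lock(const char *name, int order)
{
    assert(order >= 1 && order <= MAX_LOCK_ORDER);

    for (int i = order - 1; i < MAX_LOCK_ORDER; i++) {
        if (held[i] != NULL) {
            fprintf(stderr,
                    "Lock order violation: Trying %s at %d while %s at %d is locked\n",
                    name, order, held[i], i + 1);
            return false;
        }
    }
    held[order - 1] = name;
    return true;
}

/* Release whatever is held at `order`. */
void order_unlock(int order)
{
    assert(order >= 1 && order <= MAX_LOCK_ORDER);
    held[order - 1] = NULL;
}
```

Under this model, holding locking.tdb at order 1 and then asking for smbXsrv_session_global.tdb, also at order 1, is rejected, which matches the situation in the log: the exit path tries to take a second order-1 lock while the first is still held.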
Please recompile with debugging symbols, so that we get line numbers in the stack trace.
Hi Volker, The build was created using configure.developer so I'm assuming it has been built with the debug symbols. The trace below was taken from the log file. When the problem occurs again I will try to get the full trace.
Created attachment 7919 [details] Patch for v4-0-test
Created attachment 7924 [details]
Backtrace from a crashed smbd process

Hi Metze,
After applying your fix I'm still experiencing crashes. Please see the attached backtrace from the crashed smbd process.
Comment on attachment 7919 [details] Patch for v4-0-test This bug is not fixed yet, but this patch here is right and should go into 4.0
(In reply to comment #4)
> Created attachment 7924 [details]
> Backtrace from a crashed smbd process
>
> Hi Metze,
> After applying your fix I'm still experiencing crashes. Please see attached
> backtrace from the crashed smbd process

From reading the backtrace it seems it is a different more complex bug.
Can you file a new bug report for it?
(In reply to comment #6)
> (In reply to comment #4)
> > Created attachment 7924 [details]
> > Backtrace from a crashed smbd process
> >
> > Hi Metze,
> > After applying your fix I'm still experiencing crashes. Please see attached
> > backtrace from the crashed smbd process
>
> From reading the backtrace it seems it is a different more complex bug.
> Can you file a new bug report for it?

Ok, I already created https://bugzilla.samba.org/show_bug.cgi?id=9196
Karolin please pick the patch for v4-0-test.
Created attachment 7931 [details] git-am fix for 3.6.next This is also needed for 3.6.next. Fix attached. Jeremy.
Created attachment 7932 [details] git-am fix for 3.5.next And for 3.5.next also... Jeremy.
I'm not sure these need to go into 3.5 and 3.6. First, we do not have the lock order checks there. Second, in master we are doing a lot more to take down the connections and files. I am not sure we will ever hit a deadlock with the 3.5 and 3.6 code, at least I have not seen any. Agreed, it is a hygiene thing to make this a bit cleaner, but without at least some slight hint what this really fixes I am a bit reluctant to + it for 3.6 and 3.5.
Comment on attachment 7931 [details] git-am fix for 3.6.next looks ok
Comment on attachment 7932 [details] git-am fix for 3.5.next looks good
(In reply to comment #13)
> Comment on attachment 7932 [details]
> git-am fix for 3.5.next
>
> looks good

Volker, sorry, I didn't see your comment. I agree that it might not be strictly needed in 3.5 and 3.6, but I also don't think it would do any harm. I'll let Volker and Jeremy decide whether we backport it to 3.5 and 3.6.
That's my opinion too (it won't hurt). Every other error code path out of that function defer_open() and callers in 3.5.x and 3.6.x explicitly does a TALLOC_FREE on lck, so as a hygiene thing it's certainly a good fix to do. Volker, you can make the final call for 3.5.x and 3.6.x, I'm not going to force it. Jeremy.
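The hygiene point can be illustrated with a small sketch. This is not the Samba code or the attached patch. `lock_rec`, `rec_lock`, `rec_unlock`, `rundown_is_safe`, and `fail_path` are hypothetical names, and the assumption is simply that the rundown routine needs to take other locks and is therefore only safe when the caller has already released its own. `rec_unlock` resets the caller's pointer to NULL after freeing, in the way TALLOC_FREE does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical lock record; held_count tracks how many locks are live. */
struct lock_rec {
    int *held_count;
};

/* Take the lock and bump the counter. Returns NULL on allocation failure. */
struct lock_rec *rec_lock(int *held_count)
{
    struct lock_rec *rec = malloc(sizeof(*rec));
    if (rec == NULL) {
        return NULL;
    }
    rec->held_count = held_count;
    (*held_count)++;
    return rec;
}

/* Release the lock and NULL the caller's pointer; a second call is a no-op. */
void rec_unlock(struct lock_rec **rec)
{
    assert(rec != NULL);
    if (*rec == NULL) {
        return;
    }
    (*(*rec)->held_count)--;
    free(*rec);
    *rec = NULL;
}

/* Stand-in for a rundown routine that must take other locks itself. */
bool rundown_is_safe(const int *held_count)
{
    return *held_count == 0;
}

/*
 * Fatal error path. With release_first set, the lock is dropped before
 * rundown starts; without it, rundown runs while the lock is still held.
 * Returns whether rundown started from a lock-free state.
 */
bool fail_path(int *held_count, bool release_first)
{
    struct lock_rec *lck = rec_lock(held_count);
    if (lck == NULL) {
        return false;
    }
    if (release_first) {
        rec_unlock(&lck);
    }
    bool safe = rundown_is_safe(held_count);
    rec_unlock(&lck); /* no-op if already released above */
    return safe;
}
```

Calling `fail_path(&held, true)` reports a safe rundown, while `fail_path(&held, false)` does not. That is the difference between an error path that frees its lock before exiting and one that does not.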
(In reply to comment #15)
> That's my opinion too (it won't hurt). Every other error code path out of that
> function defer_open() and callers in 3.5.x and 3.6.x explicitly does a
> TALLOC_FREE on lck, so as a hygiene thing it's certainly a good fix to do.
>
> Volker, you can make the final call for 3.5.x and 3.6.x, I'm not going to force
> it.

Now it's all my fault, right? I am just in general reluctant to modify a stable release without a good reason. *Probably* it will not hurt, but I have never ever seen a bug in this. Even with ctdb around I have not seen a deadlock coming from this, and large clusters are running with this, finding all sorts of weird rundown problems. I think our rundown code is no exception: It is not as well tested as the "good" code paths, and I worry that we break something with no good reason.

Metze +'ed this, so it is good to go in, right? I am out of this game now, no more ack needed. I will not - it, I just do not see a good reason to mess with things that have never caused us trouble so far.

Jeremy, if you can point me at a code path that will be affected by this, then please tell me.
Sigh... and I thought I was being courteous and polite by giving you the final decision :-). Should have learned my lesson, never try and do a good turn for anyone, they won't appreciate it :-). What seems to be behind this is the fear our code is so fragile that a pretty obvious general clean-up fix will break something. I don't think we're that bad. I disagree (obviously), but as this problem wasn't reported on 3.6.x I'm not going to push it and let's let Karolin just pick for 4.0.x only. If it is reported on earlier versions we at least have these fixes available. So Karolin, please ignore the 3.6.next and 3.5.next fixes and just pick the v4-0-test fix. Cheers, Jeremy.
Comment on attachment 7931 [details] git-am fix for 3.6.next Sorry for being so disruptive. If we are in this code path, we already have a severe bug. So it does not matter if we break more. So, following your advice, let's put this into the stable releases.
Comment on attachment 7932 [details] git-am fix for 3.5.next Same comment as the last ack: Getting here we already have a very severe bug, so it does not hurt further. And it seems we need this patch in a stable release.
BTW, what is our general policy on 3.5 and 3.6? Is the new policy to always backport everything to those? I thought that at least 3.5 is phasing out slowly, but probably we can not ever phase out anything anymore, given that we are many, many months late on 4.0? Can we get some more general discussion about this topic?
Personally if it looks like code that affects these earlier releases I back-port (as I did here, as the code is relatively identical) as 3.5.x and 3.6.x are our active release streams. However, once we finally ship 4.0.0 I'll stop any back-ports to 3.5.x as we only have bandwidth for 2 active release streams (IMHO). Jeremy.
The problem I have with this particular piece is that we are fixing a code path that is only called if we have a severe bug somewhere already (duplicate defer). I just doubt that we have this kind of bug in 3.5 and 3.6. So I just do not see a good reason to modify it. If you can show me how we can end up in this code path or which piece of code this TALLOC_FREE actually fixes, I will be much more happy. I know this discussion will only lead to me not having to ack patches again, but maybe I am just too old and conservative regarding production code.
No problem. I can't show you what you request, and to me this is more of an error path cleanup issue which is not critical. I'm happy to leave it with the instruction to Karolin in comment 17 to push for 4.0.0 only. Cheers, Jeremy.
Pushed to v4-0-test. Will be included in 4.0.0rc2. Closing out bug report. Thanks!