Bug 11844 - With clustering get update_num_read_oplocks failed and PANIC: num_share_modes == 1 assertion failure
With clustering get update_num_read_oplocks failed and PANIC: num_share_modes...
Status: RESOLVED FIXED
Product: Samba 4.1 and newer
Classification: Unclassified
Component: Clustering
4.4.2
All All
: P5 major
: ---
Assigned To: Karolin Seeger
Samba QA Contact
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2016-04-15 19:54 UTC by pspencer
Modified: 2016-06-09 10:53 UTC (History)
4 users (show)

See Also:


Attachments
Patch (1.17 KB, patch)
2016-04-20 11:30 UTC, Volker Lendecke
no flags Details
Patch for 4.2 (824 bytes, patch)
2016-04-28 17:06 UTC, Volker Lendecke
jra: review+
Details
git-am fix for 4.4.next. (1.42 KB, patch)
2016-04-29 00:36 UTC, Jeremy Allison
vl: review+
Details
git-am fix for 4.4.next, 4.3.next. (1003 bytes, patch)
2016-05-19 17:06 UTC, Jeremy Allison
vl: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description pspencer 2016-04-15 19:54:21 UTC
Since upgrading our clustered samba configuration from 4.2.1 to 4.4.2, Windows clients started experiencing random losses of connections, application crashes, and other strange behaviour.

Samba was logging many occurrences of

../source3/locking/brlock.c:2087 (brl_get_locks_readonly) 
  Could not parse byte range lock record: NT_STATUS_OBJECT_NAME_NOT_FOUND

../source3/smbd/oplock.c:260 (remove_oplock)
  remove_oplock: update_num_read_oplocks failed for file ....

PANIC: assert failed at ../source3/smbd/oplock.c(193): d->num_share_modes == 1
(
The issues are originating in this section of code, beginning at line 2066 of source3/locking/brlock.c:

       status = dbwrap_parse_record(
                brlock_db,
                make_tdb_data((uint8_t *)&fsp->file_id,
                              sizeof(fsp->file_id)),
                brl_get_locks_readonly_parser, &state);

        if (NT_STATUS_EQUAL(status,NT_STATUS_NOT_FOUND)) {
                /*
                 * No locks on this file. Return an empty br_lock.
                 */
  
In our case, the call to dbwrap_parse_record is returning NT_STATUS_OBJECT_NAME_NOT_FOUND rather than NT_STATUS_NOT_FOUND. I am assuming this is due to changes in how ctdb works; in the ctdb code I see many more lines of the form "return ENOENT;" than were in the 4.2 version, and I see that
ctdb maps unix return codes to NT_STATUS codes by means of a mapping that sends ENOENT to NT_STATUS_OBJECT_NAME_NOT_FOUND rather than NT_STATUS_NOT_FOUND.

My suspicion/hope is that this ENOENT->NT_STATUS_OBJECT_NAME_NOT_FOUND return from ctdb is a normal thing to happen when there are no locks on the file, so I have changed the code in brlock.c from

if (NT_STATUS_EQUAL(status,NT_STATUS_NOT_FOUND)) 

to

if (NT_STATUS_EQUAL(status,NT_STATUS_NOT_FOUND) || NT_STATUS_EQUAL(status, NT_STATUS_OBJECT_NAME_NOT_FOUND))

and so far all our Windows clients have become happy and stopped experiencing all the random problems they were having (and there have been no more PANIC assertion failures from smbd).

Questions:
 
  * Is my assessment correct, that ctdb now returns NT_STATUS_OBJECT_NAME_NOT_FOUND where in 4.2 and previous versions NT_STATUS_NOT_FOUND would have been returned?

  * If so, is there anywhere else in the code, beside that section of brlock.c, that will have been broken by the change in behaviour and need fixing?

  * And if I am not right in this assessment, what other problem could be
causing NT_STATUS_OBJECT_NAME_NOT_FOUND to be returned in the context where
brl_get_locks_readonly was expecting either success or NT_STATUS_NOT_FOUND?
Comment 1 Stefan Metzmacher 2016-04-19 12:23:36 UTC
Backtrace from http://samba.2283325.n4.nabble.com/PANIC-on-update-num-read-oplocks-td4700707.html (with 4.3.3)

> Mar  7 15:18:54 xxxxxx smbd_audit[51710]: [2016/03/07 15:18:54.364808,  0] ../source3/smbd/oplock.c:192(update_num_read_oplocks)
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:   PANIC: assert failed at ../source3/smbd/oplock.c(192): d->num_share_modes == 1
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]: [2016/03/07 15:18:54.364846,  0] ../source3/lib/util.c:789(smb_panic_s3)
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:   PANIC (pid 51710): assert failed: d->num_share_modes == 1
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]: [2016/03/07 15:18:54.366169,  0] ../source3/lib/util.c:900(log_stack_trace)
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:   BACKTRACE: 28 stack frames:
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #0 0x8039a9b41 <log_stack_trace+0x21> at /usr/local/lib/libsmbconf.so.0
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #1 0x8039a9928 <smb_panic_s3+0x98> at /usr/local/lib/libsmbconf.so.0
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #2 0x8016914d5 <smb_panic+0x35> at /usr/local/lib/libsamba-util.so.0
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #3 0x801b1a03e <update_num_read_oplocks+0xee> at /usr/local/lib/samba/libsmbd-base-samba4.so
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #4 0x801a7c970 <create_file_default+0xa800> at /usr/local/lib/samba/libsmbd-base-samba4.so
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #5 0x801a7836c <create_file_default+0x61fc> at /usr/local/lib/samba/libsmbd-base-samba4.so
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #6 0x801a73531 <create_file_default+0x13c1> at /usr/local/lib/samba/libsmbd-base-samba4.so
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #7 0x801a72738 <create_file_default+0x5c8> at /usr/local/lib/samba/libsmbd-base-samba4.so
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #8 0x801c0e20c <vfs_default_init+0x147c> at /usr/local/lib/samba/libsmbd-base-samba4.so
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #9 0x801a8ebec <smb_vfs_call_create_file+0x22c> at /usr/local/lib/samba/libsmbd-base-samba4.so
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #10 0x801ae2988 <smbd_smb2_request_process_create+0x3218> at /usr/local/lib/samba/libsmbd-base-samba4.so
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #11 0x801adffae <smbd_smb2_request_process_create+0x83e> at /usr/local/lib/samba/libsmbd-base-samba4.so
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #12 0x801acdc8d <smbd_smb2_request_dispatch+0x137d> at /usr/local/lib/samba/libsmbd-base-samba4.so
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #13 0x801ad5a49 <smbd_smb2_first_negprot+0x4159> at /usr/local/lib/samba/libsmbd-base-samba4.so
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #14 0x801ad5066 <smbd_smb2_first_negprot+0x3776> at /usr/local/lib/samba/libsmbd-base-samba4.so
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #15 0x8039cd976 <run_events_poll+0x6b6> at /usr/local/lib/libsmbconf.so.0
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #16 0x8039ce7c4 <event_add_idle+0x994> at /usr/local/lib/libsmbconf.so.0
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #17 0x80528d842 <_tevent_loop_once+0x72> at /usr/local/lib/libtevent.so.0
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #18 0x80528da7b <tevent_common_loop_wait+0x3b> at /usr/local/lib/libtevent.so.0
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #19 0x801aad68f <smbd_process+0xe2f> at /usr/local/lib/samba/libsmbd-base-samba4.so
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #20 0x1032d68 <main+0x4318> at /usr/local/sbin/smbd
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #21 0x8039cd976 <run_events_poll+0x6b6> at /usr/local/lib/libsmbconf.so.0
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #22 0x8039ce7c4 <event_add_idle+0x994> at /usr/local/lib/libsmbconf.so.0
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #23 0x80528d842 <_tevent_loop_once+0x72> at /usr/local/lib/libtevent.so.0
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #24 0x80528da7b <tevent_common_loop_wait+0x3b> at /usr/local/lib/libtevent.so.0
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #25 0x103184f <main+0x2dff> at /usr/local/sbin/smbd
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #26 0x1030ab5 <main+0x2065> at /usr/local/sbin/smbd
> Mar  7 15:18:54 xxxxxx smbd_audit[51710]:    #27 0x1027a51 <_start+0x1a1> at /usr/local/sbin/smbd
Comment 2 Volker Lendecke 2016-04-20 11:30:34 UTC
Created attachment 12010 [details]
Patch

Thanks for the great analysis! As this might hit us elsewhere too, I'd rather fix this in a lower layer. Does the attached patch also help you?

Thanks,

Volker
Comment 3 pspencer 2016-04-21 15:01:03 UTC
You're welcome; nothing like having one user fail to open an InDesign document, another lose their Firefox bookmarks, another have their mail program suddenly stop fetching mail, to motivate one to spend a few hours tracing logs and source code!

I have replaced our patch with yours and so far so good (as I would expect, since your patch changes NT_STATUS_OBJECT_NAME_NOT_FOUND to NT_STATUS_NOT_FOUND for the cases we were seeing). 

If anything changes during the next few days I will post it here. But I think your patch solves the problem, and prevents similar problems elsewhere in the higher level layers -- as long as nothing in those layers is actually EXPECTING to see NT_STATUS_OBJECT_NAME_NOT_FOUND, of course.

One problem is I am having trouble clearing a few stray read-only locks that were left over from the problems last week and are now re-appearing. E.g., on node 0, smbstatus show

1:25644      6071       DENY_NONE  0x100080    RDONLY     NONE            /home   .   Thu Apr 14 13:07:30 2016

This is missing from the output of smbstatus on node 1, as there is no such lock, and no such smbd process running on node 1.

HOWEVER, on node 1 there is a THREAD with id 25644 belonging to a different process, unrelated to samba, that I can't or don't want to kill (I have a half-dozen of these situations; most are glusterfsd threads and and a few are kernel threads).

So, "ctdb process-exists 1:25644" returns true even though there is no smbd process with that id on node 1.

(Aside: should process-exists be changed to check only for processes? As it stands it uses "kill -0 pid" which returns true on linux if pid is the id of a thread but not a process).

Restarting samba on both nodes does not clear the lock, and I think there are more of these stale locks that will re-appear as soon as a matching thread or process gets started.

Which database should I look in (tried ctdb catdb but that shows raw binary data that is hard to parse)? Is there a tool I can use to clear the stale records? (Or should I ask this question on the mailing list -- and if so, should it be the user list or the technical list?)
Comment 4 pspencer 2016-04-21 21:50:06 UTC
Never mind the last part of the last comment ... I realized that the stale locks were being carried over because I was restarting the smb processes sequentially and each time I restarted one process the live one preserved the lock records.

I brought the whole cluster down by stopping smb on both nodes (first telling ctdb temporarily to stop managing samba so it didn't bring other services down!) and then I could wipe the locking database.

Still so far so good on the patch.
Comment 5 Jeremy Allison 2016-04-25 19:24:48 UTC
Comment on attachment 12010 [details]
Patch

Volker, are you ready for this to be pushed to master ? It's had several good results from users I think.
Comment 6 pspencer 2016-04-25 19:34:26 UTC
It is still working fine for us.
Comment 7 u.sibiller 2016-04-28 15:49:56 UTC
I think I am seeing the same bug on 4.2.11-20, too. Will there be a backport?

#0  0x00007f05293ab625 in raise () from /lib64/libc.so.6
#1  0x00007f05293ace05 in abort () from /lib64/libc.so.6
#2  0x00007f052a8abcea in dump_core () at ../source3/lib/dumpcore.c:337
#3  0x00007f052a89f249 in smb_panic_s3 (why=<value optimized out>) at ../source3/lib/util.c:811
#4  0x00007f052cb4070e in smb_panic (why=0x7f052c880198 "assert failed: d->num_share_modes == 1") at ../lib/util/fault.c:166
#5  0x00007f052c76aa5e in update_num_read_oplocks (fsp=0x7f052e007110, lck=<value optimized out>) at ../source3/smbd/oplock.c:193
#6  0x00007f052c714f8d in grant_fsp_oplock_type (conn=<value optimized out>, req=<value optimized out>, access_mask=1179785, share_access=7, create_disposition=<value optimized out>, create_options=64,
    new_dos_attributes=0, oplock_request=2, lease=0x0, private_flags=0, pinfo=0x7ffd1d5ce4ac, fsp=0x7f052e007110) at ../source3/smbd/open.c:1866
#7  open_file_ntcreate (conn=<value optimized out>, req=<value optimized out>, access_mask=1179785, share_access=7, create_disposition=<value optimized out>, create_options=64, new_dos_attributes=0, oplock_request=2,
    lease=0x0, private_flags=0, pinfo=0x7ffd1d5ce4ac, fsp=0x7f052e007110) at ../source3/smbd/open.c:3144
#8  0x00007f052c716a39 in create_file_unixpath (conn=<value optimized out>, req=<value optimized out>, smb_fname=0x7f052e001ad0, access_mask=1179785, share_access=7, create_disposition=1, create_options=64,
    file_attributes=0, oplock_request=2, lease=0x0, allocation_size=0, private_flags=0, sd=0x0, ea_list=0x0, result=0x7ffd1d5ce550, pinfo=0x7ffd1d5ce55c) at ../source3/smbd/open.c:4641
#9  0x00007f052c7176a0 in create_file_default (conn=0x7f052dff4790, req=0x7f052e016b20, root_dir_fid=<value optimized out>, smb_fname=0x7f052e001ad0, access_mask=1179785, share_access=7, create_disposition=1,
    create_options=64, file_attributes=0, oplock_request=2, lease=0x0, allocation_size=0, private_flags=0, sd=0x0, ea_list=0x0, result=0x7ffd1d5ce8a8, pinfo=0x7ffd1d5ce8cc, in_context_blobs=0x7ffd1d5ce860,
    out_context_blobs=0x7f052e0078e0) at ../source3/smbd/open.c:5048
#10 0x00007f052c7ea10b in vfswrap_create_file (handle=<value optimized out>, req=<value optimized out>, root_dir_fid=<value optimized out>, smb_fname=<value optimized out>, access_mask=<value optimized out>,
    share_access=<value optimized out>, create_disposition=1, create_options=64, file_attributes=0, oplock_request=2, lease=0x0, allocation_size=0, private_flags=0, sd=0x0, ea_list=0x0, result=0x7ffd1d5ce8a8,
    pinfo=0x7ffd1d5ce8cc, in_context_blobs=0x7ffd1d5ce860, out_context_blobs=0x7f052e0078e0) at ../source3/modules/vfs_default.c:538
#11 0x00007f052c71d347 in smb_vfs_call_create_file (handle=<value optimized out>, req=<value optimized out>, root_dir_fid=<value optimized out>, smb_fname=<value optimized out>, access_mask=<value optimized out>,
    share_access=<value optimized out>, create_disposition=1, create_options=64, file_attributes=0, oplock_request=2, lease=0x0, allocation_size=0, private_flags=0, sd=0x0, ea_list=0x0, result=0x7ffd1d5ce8a8,
    pinfo=0x7ffd1d5ce8cc, in_context_blobs=0x7ffd1d5ce860, out_context_blobs=0x7f052e0078e0) at ../source3/smbd/vfs.c:1628
#12 0x00007f052c74c5ca in smbd_smb2_create_send (smb2req=0x7f052e004590) at ../source3/smbd/smb2_create.c:1064
#13 smbd_smb2_request_process_create (smb2req=0x7f052e004590) at ../source3/smbd/smb2_create.c:227
#14 0x00007f052c743cad in smbd_smb2_request_dispatch (req=0x7f052e004590) at ../source3/smbd/smb2_server.c:2241
#15 0x00007f052c7453b2 in smbd_smb2_io_handler (ev=<value optimized out>, fde=<value optimized out>, flags=<value optimized out>, private_data=<value optimized out>) at ../source3/smbd/smb2_server.c:3476
#16 smbd_smb2_connection_handler (ev=<value optimized out>, fde=<value optimized out>, flags=<value optimized out>, private_data=<value optimized out>) at ../source3/smbd/smb2_server.c:3514
#17 0x00007f052a8b48ef in run_events_poll (ev=0x7f052de82ce0, pollrtn=<value optimized out>, pfds=0x7f052dfd3360, num_pfds=5) at ../source3/lib/events.c:257
#18 0x00007f052a8b4d37 in s3_event_loop_once (ev=0x7f052de82ce0, location=<value optimized out>) at ../source3/lib/events.c:326
#19 0x00007f052bdf0917 in _tevent_loop_once (ev=0x7f052de82ce0, location=0x7f052c8635f8 "../source3/smbd/process.c:3997") at ../lib/tevent/tevent.c:533
#20 0x00007f052bdf0982 in tevent_common_loop_wait (ev=0x7f052de82ce0, location=0x7f052c8635f8 "../source3/smbd/process.c:3997") at ../lib/tevent/tevent.c:637
#21 0x00007f052bdf071b in _tevent_loop_wait (ev=<value optimized out>, location=<value optimized out>) at ../lib/tevent/tevent.c:656
#22 0x00007f052c731032 in smbd_process (ev_ctx=0x7f052de82ce0, msg_ctx=<value optimized out>, sock_fd=11, interactive=false) at ../source3/smbd/process.c:3997
#23 0x00007f052d1983d9 in smbd_accept_connection (ev=0x7f052de82ce0, fde=<value optimized out>, flags=<value optimized out>, private_data=<value optimized out>) at ../source3/smbd/server.c:627
#24 0x00007f052a8b48ef in run_events_poll (ev=0x7f052de82ce0, pollrtn=<value optimized out>, pfds=0x7f052dfd3360, num_pfds=7) at ../source3/lib/events.c:257
#25 0x00007f052a8b4d37 in s3_event_loop_once (ev=0x7f052de82ce0, location=<value optimized out>) at ../source3/lib/events.c:326
#26 0x00007f052bdf0917 in _tevent_loop_once (ev=0x7f052de82ce0, location=0x7f052d19c9d0 "../source3/smbd/server.c:985") at ../lib/tevent/tevent.c:533
#27 0x00007f052bdf0982 in tevent_common_loop_wait (ev=0x7f052de82ce0, location=0x7f052d19c9d0 "../source3/smbd/server.c:985") at ../lib/tevent/tevent.c:637
#28 0x00007f052bdf071b in _tevent_loop_wait (ev=<value optimized out>, location=<value optimized out>) at ../lib/tevent/tevent.c:656
#29 0x00007f052d19a1d1 in smbd_parent_loop (argc=<value optimized out>, argv=<value optimized out>) at ../source3/smbd/server.c:985
#30 main (argc=<value optimized out>, argv=<value optimized out>) at ../source3/smbd/server.c:1626
Comment 8 Volker Lendecke 2016-04-28 17:06:56 UTC
Created attachment 12042 [details]
Patch for 4.2

Can you try the attached patch? 4.2.11 should not suffer from the already patched problem
Comment 9 Jeremy Allison 2016-04-29 00:36:08 UTC
Created attachment 12046 [details]
git-am fix for 4.4.next.

Cherry-pick from master.
Comment 10 Jeremy Allison 2016-04-29 02:06:46 UTC
Just checked 4.3.x source code and I don't think the bug occurs there. Please correct me if I'm wrong.
Comment 11 Volker Lendecke 2016-04-29 13:16:48 UTC
Karo, please add https://bugzilla.samba.org/attachment.cgi?id=12046 to 4.4. There seems to be another problem leading to a similar backtrace. I'm waiting for feedback on https://bugzilla.samba.org/attachment.cgi?id=12042
Comment 12 Karolin Seeger 2016-05-09 08:22:29 UTC
(In reply to Volker Lendecke from comment #11)
Pushed to autobuild-v4-4-test.
Comment 13 Karolin Seeger 2016-05-11 11:37:31 UTC
(In reply to Karolin Seeger from comment #12)
Pushed to v4-4-test.
Re-assigning to Volker.
Comment 14 Andrea Venturoli 2016-05-12 16:22:09 UTC
(In reply to Jeremy Allison from comment #10)

Not sure if I should open another report or reply here, since in my case I'm not clustering, so I'm not sure this is the same bug,

However I still see this panic when two users attempt to open the same file (which should not happen due to locking).

I'm using 4.3 on FreeBSD 9.3/amd64.

I've also seen this on another box (FreeBSD 10.1/amd64), but I haven't verified if it comes from the locking issue: there it will keep crashing until I restart Samba as a whole.

Does a patch exist for 4.3?
Can I use the one for 4.2 (with the obvious changes)?
Should I try 4.4?

Thanks.
Comment 15 Adi Kriegisch 2016-05-13 11:50:07 UTC
(In reply to Volker Lendecke from comment #8)
The patch for v4.2 works fine for us; thank you very much!

Plattform: Debian/Jessie with Samba 4.2.10 + the patch.
Comment 16 Michael Adam 2016-05-15 22:10:07 UTC
(In reply to Volker Lendecke from comment #8)
> Created attachment 12042 [details]
> Patch for 4.2
> 
> Can you try the attached patch? 4.2.11 should not suffer from the already
> patched problem

Volker, is this something that we would also need to apply to
master and other versions?
Comment 17 Jeremy Allison 2016-05-19 17:06:49 UTC
Created attachment 12118 [details]
git-am fix for 4.4.next, 4.3.next.

Additional fix that went into master. Cherry-pick applies cleanly to 4.4.next, 4.3.next.
Comment 18 Jeremy Allison 2016-05-19 19:15:14 UTC
Re-assigning to Karolin for inclusion in 4.4.next, 4.3.next.
Comment 19 Karolin Seeger 2016-05-30 08:41:15 UTC
(In reply to Jeremy Allison from comment #18)
Pushed to autobuild-v4-[4|3]-test.
Comment 20 Karolin Seeger 2016-06-01 07:28:11 UTC
(In reply to Karolin Seeger from comment #19)
Pushed to both branches.
Comment 21 Karolin Seeger 2016-06-01 07:29:37 UTC
Thinking about another 4.2 maintenance release, I asked for review for the 4.2 patch. Could be included then...
Comment 22 Jeremy Allison 2016-06-01 17:59:53 UTC
Comment on attachment 12042 [details]
Patch for 4.2

LGTM.
Comment 23 Karolin Seeger 2016-06-02 10:18:45 UTC
(In reply to Jeremy Allison from comment #22)
Thanks, Jeremy! :-)

Pushed to autobuild-v4-2-test.
Comment 24 Karolin Seeger 2016-06-09 10:53:52 UTC
(In reply to Karolin Seeger from comment #23)
Pushed to v4-2-test.
Closing out bug report.

Thanks!