Bug 13439 - After some hours smbd seems to be stuck in smbXsrv_open_global_fetch_locked causing clients using SMB2.0 and newer to be unable to connect to shares
Summary: After some hours smbd seems to be stuck in smbXsrv_open_global_fetch_locked c...
Status: NEW
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services (show other bugs)
Version: 4.8.2
Hardware: x64 Linux
: P5 major (vote)
Target Milestone: ---
Assignee: Samba QA Contact
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-05-17 07:29 UTC by Sysadmin HTL-Leonding
Modified: 2020-01-05 10:02 UTC (History)
5 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Sysadmin HTL-Leonding 2018-05-17 07:29:54 UTC
It seems that since Samba 4.7 smbd daemon seems to deadlock after a while (backtrace below).

I had issues with samba 4.7.7 (as far as I remember I had issues with earlier 4.7.x releases before) and 4.8.1 and 4.8.2. Source packages from samba.org.

The following backtrace is from samba 4.8.2. mount.cifs from another client (other samba release) stuck and I connected with gdb to the smbd which probably managed the connection (according to netstat -a -n -p) and forced a backtrace.

When forcing a SMB1 connection the connection was successful (probably because samba didn't need to access the lock table).

tdb version 1.3.15 (unfortunately built without debug information)
ldb version was 1.3.3
tevent version was 0.9.36

glibc was 2.25-r10 (Gentoo)

The samba server is providing file shares to clients and is member server of an Windows 2016 Active Directory domain.

Only in case this could be causing the issues: 
Socket options have been changed from defaults, because Windows 10 clients didn't inform the server when they were shutdown, causing roaming profiles to fail when being accessed from another server (fast startup is NOT enabled, so they are not in hibernation).

socket options = TCP_NODELAY TCP_KEEPIDLE=60 TCP_KEEPCNT=2 TCP_KEEPINTVL=10

#0  0x00007fc7a516e2a2 in __pthread_mutex_lock_full ()
   from /lib64/libpthread.so.0
#1  0x00007fc79ba36b3e in ?? () from /usr/lib64/libtdb.so.1
#2  0x00007fc79ba36e3b in ?? () from /usr/lib64/libtdb.so.1
#3  0x00007fc79ba2f261 in ?? () from /usr/lib64/libtdb.so.1
#4  0x00007fc79ba2f889 in ?? () from /usr/lib64/libtdb.so.1
#5  0x00007fc79ba2fb92 in ?? () from /usr/lib64/libtdb.so.1
#6  0x00007fc79ba2fc2b in ?? () from /usr/lib64/libtdb.so.1
#7  0x00007fc79ba2fe3e in tdb_chainlock () from /usr/lib64/libtdb.so.1
#8  0x00007fc79eee948a in db_tdb_fetch_locked (db=db@entry=0x55871ddf1790,
    mem_ctx=mem_ctx@entry=0x55871de093d0, key=...)
    at ../lib/dbwrap/dbwrap_tdb.c:157
#9  0x00007fc79eee64b4 in dbwrap_fetch_locked_internal (
    db=db@entry=0x55871ddf1790, mem_ctx=mem_ctx@entry=0x55871de093d0, key=...,
    db_fn=0x7fc79eee9430 <db_tdb_fetch_locked>) at ../lib/dbwrap/dbwrap.c:221
#10 0x00007fc79eee6628 in dbwrap_fetch_locked (db=db@entry=0x55871ddf1790,
    mem_ctx=mem_ctx@entry=0x55871de093d0, key=...)
    at ../lib/dbwrap/dbwrap.c:235
#11 0x00007fc7a49502a7 in smbXsrv_open_global_fetch_locked (
    db=db@entry=0x55871ddf1790, id=id@entry=2573108159,
    mem_ctx=mem_ctx@entry=0x55871de093d0) at ../source3/smbd/smbXsrv_open.c:167
#12 0x00007fc7a49514ff in smbXsrv_open_global_allocate (
    _global=<synthetic pointer>, mem_ctx=0x55871de093d0, db=0x55871ddf1790)
---Type <return> to continue, or q <return> to quit---
   ./source3/smbd/smbXsrv_open.c:552
#13 smbXsrv_open_create (conn=conn@entry=0x55871dd3a8c0,
    session_info=<optimized out>, now=131709791013423550,
    _open=_open@entry=0x7ffe65cd7948) at ../source3/smbd/smbXsrv_open.c:852
#14 0x00007fc7a48af524 in file_new (req=req@entry=0x55871ddfa240,
    conn=conn@entry=0x55871ddf7d20, result=result@entry=0x7ffe65cd7a08)
    at ../source3/smbd/files.c:100
#15 0x00007fc7a49044f6 in create_file_unixpath (
    conn=conn@entry=0x55871ddf7d20, req=req@entry=0x55871ddfa240,
    smb_fname=smb_fname@entry=0x55871de12150,
    access_mask=access_mask@entry=128, share_access=share_access@entry=7,
    create_disposition=create_disposition@entry=1, create_options=0,
    file_attributes=0, oplock_request=0, lease=0x0, allocation_size=0,
    private_flags=0, sd=0x0, ea_list=0x0, result=0x7ffe65cd7b30,
    pinfo=0x7ffe65cd7b2c) at ../source3/smbd/open.c:5199
#16 0x00007fc7a4905f6c in create_file_default (conn=0x55871ddf7d20,
    req=0x55871ddfa240, root_dir_fid=<optimized out>,
    smb_fname=0x55871de12150, access_mask=128, share_access=7,
    create_disposition=1, create_options=0, file_attributes=0,
    oplock_request=0, lease=0x0, allocation_size=0, private_flags=0, sd=0x0,
    ea_list=0x0, result=0x55871ddf9f48, pinfo=0x55871ddf9f5c,
    in_context_blobs=0x7ffe65cd7d70, out_context_blobs=0x55871ddfa1c0)
    at ../source3/smbd/open.c:5630
#17 0x00007fc7a486d129 in vfswrap_create_file (handle=<optimized out>,
    req=<optimized out>, root_dir_fid=<optimized out>,
    smb_fname=<optimized out>, access_mask=<optimized out>,
    share_access=<optimized out>, create_disposition=1, create_options=0,
    file_attributes=0, oplock_request=0, lease=0x0, allocation_size=0,
    private_flags=0, sd=0x0, ea_list=0x0, result=0x55871ddf9f48,
    pinfo=0x55871ddf9f5c, in_context_blobs=0x7ffe65cd7d70,
    out_context_blobs=0x55871ddfa1c0) at ../source3/modules/vfs_default.c:598
#18 0x00007fc7a490c9a7 in smb_vfs_call_create_file (handle=<optimized out>,
    req=req@entry=0x55871ddfa240, root_dir_fid=root_dir_fid@entry=0,
    smb_fname=<optimized out>, access_mask=access_mask@entry=128,
    share_access=share_access@entry=7, create_disposition=1, create_options=0,
    file_attributes=0, oplock_request=0, lease=0x0, allocation_size=0,
---Type <return> to continue, or q <return> to quit---
    private_flags=0, sd=0x0, ea_list=0x0, result=0x55871ddf9f48,
    pinfo=0x55871ddf9f5c, in_context_blobs=0x7ffe65cd7d70,
    out_context_blobs=0x55871ddfa1c0) at ../source3/smbd/vfs.c:1717
#19 0x00007fc7a493ab09 in smbd_smb2_create_send (in_context_blobs=...,
    in_name=<optimized out>, in_create_options=<optimized out>,
    in_create_disposition=<optimized out>, in_share_access=<optimized out>,
    in_file_attributes=<optimized out>, in_desired_access=<optimized out>,
    in_impersonation_level=<optimized out>, in_oplock_level=<optimized out>,
    smb2req=0x55871ddf9930, ev=<optimized out>, mem_ctx=0x55871ddf9930)
    at ../source3/smbd/smb2_create.c:921
#20 smbd_smb2_request_process_create (smb2req=smb2req@entry=0x55871ddf9930)
    at ../source3/smbd/smb2_create.c:230
#21 0x00007fc7a49324b3 in smbd_smb2_request_dispatch (
    req=req@entry=0x55871ddf9930) at ../source3/smbd/smb2_server.c:2650
#22 0x00007fc7a493323c in smbd_smb2_io_handler (fde_flags=<optimized out>,
    xconn=0x55871dd3a8c0) at ../source3/smbd/smb2_server.c:3946
#23 smbd_smb2_connection_handler (ev=<optimized out>, fde=<optimized out>,
    flags=<optimized out>, private_data=<optimized out>)
    at ../source3/smbd/smb2_server.c:3984
#24 0x00007fc7a161d673 in ?? () from /usr/lib64/libtevent.so.0
#25 0x00007fc7a161ba47 in ?? () from /usr/lib64/libtevent.so.0
#26 0x00007fc7a1617e3d in _tevent_loop_once () from /usr/lib64/libtevent.so.0
#27 0x00007fc7a161806b in tevent_common_loop_wait ()
   from /usr/lib64/libtevent.so.0
#28 0x00007fc7a161b9e7 in ?? () from /usr/lib64/libtevent.so.0
#29 0x00007fc7a4922549 in smbd_process (ev_ctx=0x55871dd3a410,
    msg_ctx=<optimized out>, sock_fd=32, interactive=<optimized out>)
    at ../source3/smbd/process.c:4120
#30 0x000055871d1e84e0 in smbd_accept_connection (ev=0x55871dd3a410,
    fde=<optimized out>, flags=<optimized out>, private_data=<optimized out>)
    at ../source3/smbd/server.c:1031
#31 0x00007fc7a161d673 in ?? () from /usr/lib64/libtevent.so.0
#32 0x00007fc7a161ba47 in ?? () from /usr/lib64/libtevent.so.0
#33 0x00007fc7a1617e3d in _tevent_loop_once () from /usr/lib64/libtevent.so.0
#34 0x00007fc7a161806b in tevent_common_loop_wait ()
   from /usr/lib64/libtevent.so.0
---Type <return> to continue, or q <return> to quit---
#35 0x00007fc7a161b9e7 in ?? () from /usr/lib64/libtevent.so.0
#36 0x000055871d1e3466 in smbd_parent_loop (parent=0x55871dd3aa10,
    ev_ctx=0x55871dd3a410) at ../source3/smbd/server.c:1383
#37 main (argc=<optimized out>, argv=<optimized out>)
    at ../source3/smbd/server.c:2146
Comment 1 Sysadmin HTL-Leonding 2018-05-17 07:37:42 UTC
I reverted back to samba 4.6.x as this branch didn't show this issues.

Samba 4.6.x causes regular core dumps of child pids, but seems to be working fine in general (except that winbindd hangs about once a week, unfortunately I can't tell whether this has been fixed in newer Samba versions, because of the original bug):

#0  0x00007f42fb2b4f60 in raise () from /lib64/libc.so.6
#1  0x00007f42fb2b6c2a in abort () from /lib64/libc.so.6
#2  0x00007f42fcc27d42 in dump_core () at ../source3/lib/dumpcore.c:338
#3  0x00007f42fcc192fb in smb_panic_s3 (why=<optimized out>)
    at ../source3/lib/util.c:814
#4  0x00007f42fecf3e9f in smb_panic (
    why=why@entry=0x7f42fea3a818 "assert failed: lock_ref_count >= 0")
    at ../lib/util/fault.c:166
#5  0x00007f42fe92ea20 in decrement_lock_ref_count (
    fsp=fsp@entry=0x55e380b58bb0) at ../source3/locking/posix.c:455
#6  0x00007f42fe92ec1f in decrement_posix_lock_count (fsp=0x55e380b58bb0, 
    smblctx=<optimized out>) at ../source3/locking/posix.c:1237
#7  0x00007f42fe930dbc in release_posix_lock_posix_flavour (
    fsp=0x55e380b58bb0, u_offset=<optimized out>, u_count=<optimized out>, 
    lock_ctx=lock_ctx@entry=0x7ffdcbce9290, 
    plocks=plocks@entry=0x55e380b45e10, num_locks=num_locks@entry=0)
    at ../source3/locking/posix.c:1408
#8  0x00007f42fe92ce96 in brl_unlock_posix (plock=0x7ffdcbce9290, 
    br_lck=0x55e380b45b50, msg_ctx=0x55e380a366d0)
    at ../source3/locking/brlock.c:1286
#9  brl_unlock (msg_ctx=msg_ctx@entry=0x55e380a366d0, 
    br_lck=br_lck@entry=0x55e380b45b50, smblctx=smblctx@entry=3245055878, 
    pid=..., start=start@entry=0, size=size@entry=9223372036854775808, 
---Type <return> to continue, or q <return> to quit---
    lock_flav=lock_flav@entry=POSIX_LOCK) at ../source3/locking/brlock.c:1380
#10 0x00007f42fe928cae in do_unlock (msg_ctx=0x55e380a366d0, 
    fsp=fsp@entry=0x55e380b58bb0, smblctx=smblctx@entry=3245055878, 
    count=count@entry=9223372036854775808, offset=offset@entry=0, 
    lock_flav=lock_flav@entry=POSIX_LOCK) at ../source3/locking/locking.c:328
#11 0x00007f42fe88af80 in smb_set_posix_lock (fsp=0x55e380b58bb0, 
    total_data=24, pdata=<optimized out>, req=0x55e380b45760, conn=0x0)
    at ../source3/smbd/trans2.c:7209
#12 smbd_do_setfilepathinfo (conn=conn@entry=0x55e380afbb10, 
    req=req@entry=0x55e380b45760, mem_ctx=mem_ctx@entry=0x55e380b45760, 
    info_level=info_level@entry=520, fsp=fsp@entry=0x55e380b58bb0, 
    smb_fname=<optimized out>, ppdata=0x55e380b3e640, total_data=24, 
    ret_data_size=0x7ffdcbce95a0) at ../source3/smbd/trans2.c:8607
#13 0x00007f42fe88ca9b in call_trans2setfilepathinfo (
    max_data_bytes=<optimized out>, total_data=24, ppdata=0x55e380b3e640, 
    total_params=<optimized out>, pparams=0x55e380b3e630, 
    tran_call=<optimized out>, req=0x55e380b45760, conn=0x55e380afbb10)
    at ../source3/smbd/trans2.c:8850
#14 handle_trans2 (conn=conn@entry=0x55e380afbb10, 
    req=req@entry=0x55e380b45760, state=state@entry=0x55e380b3e5d0)
    at ../source3/smbd/trans2.c:9391
#15 0x00007f42fe88f375 in reply_trans2 (req=0x55e380b45760)
    at ../source3/smbd/trans2.c:9620
---Type <return> to continue, or q <return> to quit---
#16 0x00007f42fe8b6b36 in switch_message (type=<optimized out>, 
    req=req@entry=0x55e380b45760) at ../source3/smbd/process.c:1726
#17 0x00007f42fe8b8f8a in construct_reply (deferred_pcd=0x0, 
    encrypted=<optimized out>, seqnum=0, unread_bytes=0, size=102, 
    inbuf=<optimized out>, xconn=0x55e380af1480)
    at ../source3/smbd/process.c:1762
#18 process_smb (xconn=xconn@entry=0x55e380af1480, inbuf=<optimized out>, 
    nread=102, unread_bytes=0, seqnum=0, encrypted=<optimized out>, 
    deferred_pcd=0x0) at ../source3/smbd/process.c:2009
#19 0x00007f42fe8b9f7c in smbd_server_connection_read_handler (
    xconn=0x55e380af1480, fd=33) at ../source3/smbd/process.c:2609
#20 0x00007f42fb63b673 in ?? () from /usr/lib64/libtevent.so.0
#21 0x00007f42fb639a47 in ?? () from /usr/lib64/libtevent.so.0
#22 0x00007f42fb635e3d in _tevent_loop_once () from /usr/lib64/libtevent.so.0
#23 0x00007f42fb63606b in tevent_common_loop_wait ()
   from /usr/lib64/libtevent.so.0
#24 0x00007f42fb6399e7 in ?? () from /usr/lib64/libtevent.so.0
#25 0x00007f42fe8bb2b9 in smbd_process (ev_ctx=0x55e380a361e0, 
    msg_ctx=<optimized out>, sock_fd=33, interactive=<optimized out>)
    at ../source3/smbd/process.c:4126
#26 0x000055e37ef5b134 in smbd_accept_connection (ev=0x55e380a361e0, 
    fde=<optimized out>, flags=<optimized out>, private_data=<optimized out>)
    at ../source3/smbd/server.c:1027
---Type <return> to continue, or q <return> to quit---
#27 0x00007f42fb63b673 in ?? () from /usr/lib64/libtevent.so.0
#28 0x00007f42fb639a47 in ?? () from /usr/lib64/libtevent.so.0
#29 0x00007f42fb635e3d in _tevent_loop_once () from /usr/lib64/libtevent.so.0
#30 0x00007f42fb63606b in tevent_common_loop_wait ()
   from /usr/lib64/libtevent.so.0
#31 0x00007f42fb6399e7 in ?? () from /usr/lib64/libtevent.so.0
#32 0x000055e37ef5634e in smbd_parent_loop (parent=0x55e380a390b0, 
    ev_ctx=0x55e380a361e0) at ../source3/smbd/server.c:1394
#33 main (argc=<optimized out>, argv=<optimized out>)
    at ../source3/smbd/server.c:2048
Comment 2 Sysadmin HTL-Leonding 2018-05-18 13:02:52 UTC
Maybe there the tdb files are corrupted or there is some bug in tdb?
Winbindd of Samba 3.6.15 also locks up during tdb access - see bug id 13443
Comment 4 Volker Lendecke 2018-06-12 08:43:11 UTC
What does this tell us? We can't use mutexes? We can easily disable them again, no problem.
Comment 5 Andreas Schneider 2018-06-12 14:08:52 UTC
It is not fixed at least in the RHEL7 Kernel yet :-( It looked like similar backtraces which were related to robust mutexes not fixed yet.
Comment 6 Jeremy Allison 2018-06-12 17:38:02 UTC
(In reply to Volker Lendecke from comment #4)

> What does this tell us? We can't use mutexes? We can easily disable them again, 
> no problem.

It tells us this is a Red Hat problem and we shouldn't change our code :-).
Comment 7 Volker Lendecke 2018-06-12 18:54:52 UTC
(In reply to Jeremy Allison from comment #6)
> It tells us this is a Red Hat problem and we shouldn't change our code :-).

Are you sure other distros don't have the same problem? For the better or worse, regarding this aspect of kernel and glibc, probably most other distros are just RH downstream.
Comment 8 Jeremy Allison 2018-06-12 19:02:55 UTC
At the worst this needs a change inside tdb_mutex_locking_supported() disabling mutexes for affected kernel versions. IMHO of course. Either that or a config variable to allow people to turn it off.
Comment 9 Volker Lendecke 2018-06-12 19:25:03 UTC
(In reply to Jeremy Allison from comment #8)
> At the worst this needs a change inside tdb_mutex_locking_supported()
> disabling mutexes for affected kernel versions. IMHO of course. Either that
> or a config variable to allow people to turn it off.

I was told by RH people that it's not that simple. It's a combination of bugs between glibc and the kernel. You can't upgrade one without the other to fix these bugs. But the authorities are of course the kernel/glibc developers working on this.
Comment 10 Jeremy Allison 2018-06-12 19:49:57 UTC
Can't we just add an smb.conf variable to turn off mutexes which we will advise affected users to set ? I don't want to rip out our working code for temporary glibc/kernel bugs.
Comment 11 Sysadmin HTL-Leonding 2018-06-12 21:35:50 UTC
(In reply to Jeremy Allison from comment #10)
I don't know if anyone of you has read Mr. Metzmacher's comment #1 on Samba bug id 13443:

Can you try "dbwrap_tdb_mutexes:*=no" in the [global] section.

This might be the undocumented (at least it is missing from man smb.conf) solution ... (at least it seems to have fixed the winbind issue)
Comment 12 Sysadmin HTL-Leonding 2018-06-12 21:47:04 UTC
(In reply to Jeremy Allison from comment #6)
Sorry - if this should have been a joke I didn't understand it as other distros seems to be affected as well (maybe because they are using glibc?!).

If I executed the test cases correctly, robust mutexes are still broken in other distros (i.e. Ubuntu 18.04 ... - see my comment on bug id 13443), so I fear when you are using Linux with glibc it currently is always broken in one or another way.

See comment #3 on Samba bug id 13443

Maybe people running Linux not suffering from this issue are using ctdb instead of tdb (maybe it is disabling mutexes in its default configuration).
Comment 13 Sysadmin HTL-Leonding 2018-06-12 21:49:10 UTC
(In reply to Sysadmin HTL-Leonding from comment #11)
Link is wrong (redirecting to this bug instead of 13443). Maybe this link works 13443#1 - if not please open the other bug manually.
Comment 14 Jeremy Allison 2018-06-12 21:59:48 UTC
Ah, thought I remembered we had a way to turn it off, but couldn't find it in my quick search (in my defence I am still horribly jet-lagged :-).

Probably for us this is a documentation fix then.
Comment 15 Volker Lendecke 2018-06-13 11:43:27 UTC
(In reply to Jeremy Allison from comment #14)
> Ah, thought I remembered we had a way to turn it off, but couldn't find it
> in my quick search (in my defence I am still horribly jet-lagged :-).
> 
> Probably for us this is a documentation fix then.

The dbwrap option does not help for gencache_notrans.tdb, FWIW
Comment 16 Sysadmin HTL-Leonding 2018-06-13 17:26:31 UTC
(In reply to Jeremy Allison from comment #14)
Documentation for this feature seems to be a big problem as I couldn't find any (at least man smb.conf is silent about this option).

The release notes for Samba 4.2.0 states that you can enable this feature by setting

"dbwrap_tdb_mutexes:* = yes"

If you read the "fine print" of the Samba 4.7.0 release notes you can read

Samba 4.7.0 (changes since 4.7.0rc2)
BUG 12891: Enable TDB mutexes in dbwrap and ctdb.

So the default values were changed (nearly silently) (which wouldn't be an issue if this feature wouldn't be broken on Linux (or do you know a distro on which Mr. Amitay's test code doesn't lock up?)).

It seems that your web servers are also still delivering old information

https://ctdb.samba.org/manpages/ctdb-tunables.7.html

states

TDBMutexEnabled
Default: 0

while the patch in BUG 12891 changed exactly this default value to 1 (the dbwrap default value was also changed with this commit)...
Comment 17 Sysadmin HTL-Leonding 2018-06-13 17:55:45 UTC
(In reply to Volker Lendecke from comment #15)

Would the use of "--disable-tdb-mutex-locking" as mentioned in
samba/blob/master/lib/tdb/wscript fix this? If the tdb library has no mutex locking built-in Samba won't be able to use it anymore (don't know if that really makes a difference to a runtime option, but maybe the runtime option is not being queried in special cases)?

Or will this cause other stability issues?
Comment 18 Volker Lendecke 2018-06-13 19:07:10 UTC
(In reply to Sysadmin HTL-Leonding from comment #17)
> (In reply to Volker Lendecke from comment #15)
> 
> Would the use of "--disable-tdb-mutex-locking" as mentioned in
> samba/blob/master/lib/tdb/wscript fix this? If the tdb library has no mutex
> locking built-in Samba won't be able to use it anymore (don't know if that
> really makes a difference to a runtime option, but maybe the runtime option
> is not being queried in special cases)?
> 
> Or will this cause other stability issues?

This will probably help you. Sorry for relying on a documented feature of glibc and the kernel that turns out to be broken so badly.
Comment 19 Anoop C S 2018-07-26 08:53:15 UTC
(In reply to Sysadmin HTL-Leonding from comment #1)
Do you have a reproducer that results in the other back trace mentioned?
Comment 20 Anoop C S 2018-08-03 07:08:24 UTC
(In reply to Sysadmin HTL-Leonding from comment #1)
Opened bug 13558 to track this smbd panic.