Bug 15370 - smbd deadlocks in tdb_nest_lock -> fcntl_lock with TCP connection in CLOSED state
Summary: smbd deadlocks in tdb_nest_lock -> fcntl_lock with TCP connection in CLOSED s...
Status: NEW
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services (show other bugs)
Version: 4.18.0
Hardware: x64 FreeBSD
: P5 normal (vote)
Target Milestone: ---
Assignee: Samba QA Contact
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-05-08 18:17 UTC by Peter Eriksson
Modified: 2023-05-19 19:41 UTC (History)
1 user (show)

See Also:


Attachments
Backtrace of process 48592 (9.92 KB, text/plain)
2023-05-19 19:36 UTC, Peter Eriksson
no flags Details
Backtrace of process 51187 (13.01 KB, text/plain)
2023-05-19 19:37 UTC, Peter Eriksson
no flags Details
Backtrace of process 75528 (8.43 KB, text/plain)
2023-05-19 19:37 UTC, Peter Eriksson
no flags Details
tar archive of lsof and smbstatus output for the running processes (3.49 KB, application/x-compressed-tar)
2023-05-19 19:41 UTC, Peter Eriksson
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Peter Eriksson 2023-05-08 18:17:57 UTC
After upgrading to Samba 4.18.0 a couple of weeks ago I've started seeing some smbd processes that seems to have deadlocked with a closed TCP session (according to "netstat" - state is "CLOSED") and waiting in:

  tdb_nest_lock -> tdb_brlock -> fcntl_lock -> fcntl

This is a bit of a problem if this happens to a client that have locks held on some files at the moment it occurs. 

Manually killing the problematic smbd processes "fixes" the issue and I have a monitoring script running that now watches for when this happens and alerts me.


The strange thing is that it seems to happen not all the time but after some 4-10 days, and it doesn't happen on all our servers (very busy ones) but just two (so far).

I can't currently upgrade to 4.18.2 on the servers where this occurs, but I can't anyway see anything in the release notes for 4.18.1 or 4.18.2 that looks like it be related to a fix. I've downgraded one of them to 4.17 now for the moment.


Here is a part of a GDB backtrace from a running deadlocked smbd process when
this has happened.

0x00000008049affc8 in _fcntl () from /lib/libc.so.7
#0  0x00000008049affc8 in _fcntl () from /lib/libc.so.7
#1  0x0000000807aab8d6 in ?? () from /lib/libthr.so.3
#2  0x00000008048845b7 in fcntl () from /lib/libc.so.7
#3  0x0000000804bf0e5c in fcntl_lock (waitflag=true, len=1, off=7608, rw=1, tdb=0x80f32ed20) at ../../lib/tdb/common/lock.c:58
#4  tdb_brlock (tdb=tdb@entry=0x80f32ed20, rw_type=rw_type@entry=1, offset=offset@entry=7608, len=len@entry=1, flags=flags@entry=TDB_LOCK_WAIT) at ../../lib/tdb/common/lock.c:200
#5  0x0000000804bf142d in tdb_nest_lock (tdb=tdb@entry=0x80f32ed20, offset=offset@entry=7608, ltype=ltype@entry=1, flags=flags@entry=TDB_LOCK_WAIT) at ../../lib/tdb/common/lock.c:390
#6  0x0000000804bf1659 in tdb_lock_list (tdb=tdb@entry=0x80f32ed20, list=list@entry=1860, ltype=ltype@entry=1, waitflag=waitflag@entry=TDB_LOCK_WAIT) at ../../lib/tdb/common/lock.c:482
#7  0x0000000804bf1720 in tdb_lock (tdb=tdb@entry=0x80f32ed20, list=1860, ltype=ltype@entry=1) at ../../lib/tdb/common/lock.c:500
#8  0x0000000804beeb82 in tdb_find_lock_hash (tdb=tdb@entry=0x80f32ed20, key=..., hash=2009077229, locktype=locktype@entry=1, rec=rec@entry=0x7fffffffd960) at ../../lib/tdb/common/tdb.c:165
#9  0x0000000804beed65 in tdb_parse_record (tdb=0x80f32ed20, key=..., parser=parser@entry=0x80589df7d <db_tdb_parser>, private_data=private_data@entry=0x7fffffffd9b0) at ../../lib/tdb/common/tdb.c:329
#10 0x000000080589e2e2 in db_tdb_parse (db=<optimized out>, key=..., parser=0x802108824 <dbwrap_watched_parse_record_parser>, private_data=0x7fffffffda20) at ../../lib/dbwrap/dbwrap_tdb.c:273
#11 0x000000080589ba5f in dbwrap_parse_record (db=<optimized out>, key=..., parser=parser@entry=0x802108824 <dbwrap_watched_parse_record_parser>, private_data=private_data@entry=0x7fffffffda20) at ../../lib/dbwrap/dbwrap.c:425
#12 0x000000080210893b in dbwrap_watched_parse_record (db=0x80f3ffb60, key=..., parser=0x80210daf8 <g_lock_dump_fn>, private_data=0x7fffffffdaa0) at ../../source3/lib/dbwrap/dbwrap_watch.c:783
#13 0x000000080589ba5f in dbwrap_parse_record (db=<optimized out>, key=..., parser=parser@entry=0x80210daf8 <g_lock_dump_fn>, private_data=private_data@entry=0x7fffffffdaa0) at ../../lib/dbwrap/dbwrap.c:425
#14 0x000000080210f129 in g_lock_dump (ctx=<optimized out>, key=..., fn=fn@entry=0x8016d0496 <fetch_share_mode_unlocked_parser>, private_data=private_data@entry=0x7fffffffdb00) at ../../source3/lib/g_lock.c:1653
#15 0x00000008016d1799 in fetch_share_mode_unlocked (mem_ctx=<optimized out>, id=...) at ../../source3/locking/share_mode_lock.c:1567
#16 0x00000008016c6b0e in get_file_infos (id=..., name_hash=name_hash@entry=0, delete_on_close=delete_on_close@entry=0x0, write_time=write_time@entry=0x7fffffffdc80) at ../../source3/locking/locking.c:652
#17 0x00000008016ea7d5 in smbd_dirptr_get_entry (ctx=ctx@entry=0x819095050, dirptr=dirptr@entry=0x80f3a12e0, mask=mask@entry=0x819094e00 "*", dirtype=dirtype@entry=22, dont_descend=dont_descend@entry=false, ask_sharemode=ask_sharemode@entry=true, get_dosmode_in=true, match_fn=0x8016f36d3 <smbd_dirptr_lanman2_match_fn>, mode_fn=0x8016f336b <smbd_dirptr_lanman2_mode_fn>, private_data=0x7fffffffdf80, _fname=0x7fffffffdf50, _smb_fname=0x7fffffffdf58, _mode=0x7fffffffdf44, _prev_offset=0x7fffffffdf48) at ../../source3/smbd/dir.c:1050
Comment 1 Andrew Bartlett 2023-05-09 01:41:34 UTC
The important trace will not be of the process 'stuck' in fcntl, but the process that holds the lock it wants, which won't be in fcntl (it will be somewhere else, busy doing something).
Comment 2 Peter Eriksson 2023-05-09 07:14:14 UTC
Ok, I'll try to catch backtraces of all smbd's (some 400 or so) next time this occurs (which should be in a couple of days if it happens with the same frequency) and we'll see if I can find something interresting :-)
Comment 3 Peter Eriksson 2023-05-19 17:24:21 UTC
Ok, I'm now got a new deadlocked pair of smbd's on another server.

I "gcore" all running smbd processes and got a "gdb bt" backtrace from them all.


root@filur02:/var/tmp/smbd # /liu/sbin/smbreaper
SUSPECT: pid=48592 user=mihha273 locks=1 from=2001:6b0:17:ac00:1000::ed11.63854 started=Fri May 19 17:22:41 2023 CEST
SUSPECT: pid=51187 user=axema002 locks=1 from=2001:6b0:17:ac00:2000::58fb.57873 started=Fri May 19 14:11:27 2023 CEST

Those two and a third one is the only ones sitting at fcntl_lock, 
root@filur02:/var/tmp/smbd # grep '^#[23] ' *.bt | egrep -v poll_event_loop
core.48592.bt:#2  0x00000008048845b7 in fcntl () from /lib/libc.so.7
core.48592.bt:#3  0x0000000804bf0e5c in fcntl_lock (waitflag=true, len=1, off=38356, rw=1, 
core.51187.bt:#2  0x00000008048845b7 in fcntl () from /lib/libc.so.7
core.51187.bt:#3  0x0000000804bf0e5c in fcntl_lock (waitflag=true, len=1, off=38356, rw=1, 
core.75528.bt:#2  0x00000008048845b7 in fcntl () from /lib/libc.so.7
core.75528.bt:#3  0x0000000804bf0e5c in fcntl_lock (waitflag=true, len=1, off=38356, rw=1, 

root@filur02:/var/tmp/smbd # smbstatus|egrep 75528
75528   nikha067     student-liu.se 2001:6b0:17:ac00:1000::5317 (ipv6:2001:6b0:17:ac00:1000::5317:50595) SMB3_11           -                    partial(AES-128-CMAC)
IPC$         75528   2001:6b0:17:ac00:1000::5317 Fri May 19 17:35:43 2023 CEST    -            AES-128-CMAC
students     75528   2001:6b0:17:ac00:1000::5317 Fri May 19 11:14:39 2023 CEST    -            AES-128-CMAC
75528        1000095530  DENY_NONE  0x100081    RDONLY     NONE             /export/students   nikha067/SettingsPackages/Microsoft-Windows-Powershell   Fri May 19 17:35:48 2023


The other two have a lot of files open according to "smbstatus". I can wait a little with restarting the smbd processes for now if you have some ideas on what to test. 

Btw, on the other server where I was experiencing this I backed down to 4.17.6 and there I haven't seen this issue again.


lsof output for this process:

oot@filur02:/var/tmp/smbd # lsof -p 75528
lsof: WARNING: compiled for FreeBSD release 12.3-RELEASE-p7; this is 12.3-RELEASE-p6.
lsof: WARNING: device cache mismatch: /dev/ttyu1
COMMAND   PID     USER   FD   TYPE                DEVICE  SIZE/OFF   NODE NAME
smbd    75528 nikha067  txt   VREG 1441510162,2280381956    212224 297721 /liu/pkg/samba/4.18.0-liu/sbin/smbd (zroot/ROOT/default)
smbd    75528 nikha067  cwd   VDIR  620489695,2065293036     34939      4 /export/students (DATA/students)
smbd    75528 nikha067  rtd   VDIR 1441510162,2280381956        32      4 / (zroot/ROOT/default)
smbd    75528 nikha067    0u  VCHR                  0,14       0t0     14 /dev/null (devfs)
smbd    75528 nikha067    1u  VCHR                  0,14       0t0     14 /dev/null (devfs)
smbd    75528 nikha067    2w  VREG 1441510162,2280381956   3576637 457153 /liu/var/samba/logs/log.smbd (zroot/ROOT/default)
smbd    75528 nikha067    3u  unix    0xfffff85aa6856368       0t0        ->0xfffff8427762a6d0
smbd    75528 nikha067    4u  VREG 1441510162,2280381956    430080 210017 /liu/etc/samba/private/secrets.tdb (zroot/ROOT/default)
smbd    75528 nikha067    5uW VREG 1441510162,2280381956        20 412038 /liu/var/samba/locks/msg.lock/75528 (zroot/ROOT/default)
smbd    75528 nikha067    6u  unix    0xfffff80562305a38       0t0        /liu/etc/samba/private/msg.sock/75528
smbd    75528 nikha067    7u  VREG 1441510162,2280381956     12288 114723 /liu/var/samba/locks/names.tdb (zroot/ROOT/default)
smbd    75528 nikha067    8r  VDIR 1355234787,1513152036        99     50 /export/students/nikha067/SettingsPackages/Microsoft-Windows-Powershell (DATA/students/nikha067)
smbd    75528 nikha067    9u  VCHR                  0,14       0t0     14 /dev/null (devfs)
smbd    75528 nikha067   10u  VCHR                  0,14       0t0     14 /dev/null (devfs)
smbd    75528 nikha067   11w  VREG 1441510162,2280381956         6 451482 /liu/var/samba/run/smbd.pid (zroot/ROOT/default)
smbd    75528 nikha067   12u  PIPE    0xfffff85228333000     16384        ->0xfffff85228333168
smbd    75528 nikha067   13u  PIPE    0xfffff83dc4835000     16384        ->0xfffff83dc4835168
smbd    75528 nikha067   14u  PIPE    0xfffff83dc4835168      4096        ->0xfffff83dc4835000
smbd    75528 nikha067   15w  VREG 1441510162,2280381956   3576637 457153 /liu/var/samba/logs/log.smbd (zroot/ROOT/default)
smbd    75528 nikha067   16u  VREG 1441510162,2280381956     16384 114728 /liu/var/samba/locks/smbXsrv_version_global.tdb (zroot/ROOT/default)
smbd    75528 nikha067   17u  VREG 1441510162,2280381956    253952 114769 /liu/var/samba/locks/smbXsrv_client_global.tdb (zroot/ROOT/default)
smbd    75528 nikha067   18u  VREG 1441510162,2280381956 115605504 321458 /liu/var/samba/locks/smbXsrv_session_global.tdb (zroot/ROOT/default)
smbd    75528 nikha067   19u  VREG 1441510162,2280381956   2306048 114730 /liu/var/samba/locks/smbXsrv_tcon_global.tdb (zroot/ROOT/default)
smbd    75528 nikha067   20u  VREG 1441510162,2280381956   1474560 114731 /liu/var/samba/locks/brlock.tdb (zroot/ROOT/default)
smbd    75528 nikha067   21u  VREG 1441510162,2280381956 149245952 451485 /liu/var/samba/locks/locking.tdb (zroot/ROOT/default)
smbd    75528 nikha067   22u  VREG 1441510162,2280381956     40200 114736 /liu/var/samba/locks/leases.tdb (zroot/ROOT/default)
smbd    75528 nikha067   23u  VREG 1441510162,2280381956    421888 114751 /liu/var/samba/state/share_info.tdb (zroot/ROOT/default)
smbd    75528 nikha067   24u  VREG 1441510162,2280381956  53071872 269463 /liu/var/samba/locks/gencache.tdb (zroot/ROOT/default)
smbd    75528 nikha067   25u  unix    0xfffff8054a0a6000       0t0        ->(none)
smbd    75528 nikha067   26u  VREG 1441510162,2280381956    425984 114755 /liu/var/samba/state/group_mapping.tdb (zroot/ROOT/default)
smbd    75528 nikha067   27u  VREG 1441510162,2280381956    421888 114758 /liu/var/samba/state/account_policy.tdb (zroot/ROOT/default)
smbd    75528 nikha067   28u  VREG 1441510162,2280381956    421888 210015 /liu/etc/samba/private/passdb.tdb (zroot/ROOT/default)
smbd    75528 nikha067   29u  VREG 1441510162,2280381956  11005952 114759 /liu/var/samba/locks/smbXsrv_open_global.tdb (zroot/ROOT/default)
smbd    75528 nikha067   30r  VREG 1355234787,1513152036     27208  69865 /export/students/nikha067/SettingsPackages/Microsoft-Windows-Powershell/12CA888A-96C3-4D71-B309-0311B54A61BE.pkgdat (DATA/students/nikha067)
smbd    75528 nikha067   31u  unix    0xfffff85aa68566d0       0t0        ->0xfffff8730445c368
smbd    75528 nikha067   34u  IPv6    0xfffff83f621ccb88       0t0    TCP filur02.it.liu.se:microsoft-ds->gimle-405.ad.liu.se:50595 (CLOSED)
smbd    75528 nikha067   36u  VREG 1441510162,2280381956     12288 114778 /liu/var/samba/locks/smbd_cleanupd.tdb (zroot/ROOT/default)


And a more detailed backtrace:

(gdb) bt
#0  0x00000008049affc8 in _fcntl () from /lib/libc.so.7
#1  0x0000000807aab8d6 in ?? () from /lib/libthr.so.3
#2  0x00000008048845b7 in fcntl () from /lib/libc.so.7
#3  0x0000000804bf0e5c in fcntl_lock (waitflag=true, len=1, off=38356, rw=1, tdb=0x80f32ed20) at ../../lib/tdb/common/lock.c:58
#4  tdb_brlock (tdb=tdb@entry=0x80f32ed20, rw_type=rw_type@entry=1, offset=offset@entry=38356, len=len@entry=1, 
    flags=flags@entry=TDB_LOCK_WAIT) at ../../lib/tdb/common/lock.c:200
#5  0x0000000804bf142d in tdb_nest_lock (tdb=tdb@entry=0x80f32ed20, offset=offset@entry=38356, ltype=ltype@entry=1, 
    flags=flags@entry=TDB_LOCK_WAIT) at ../../lib/tdb/common/lock.c:390
#6  0x0000000804bf1659 in tdb_lock_list (tdb=tdb@entry=0x80f32ed20, list=list@entry=9547, ltype=ltype@entry=1, 
    waitflag=waitflag@entry=TDB_LOCK_WAIT) at ../../lib/tdb/common/lock.c:482
#7  0x0000000804bf1720 in tdb_lock (tdb=tdb@entry=0x80f32ed20, list=9547, ltype=ltype@entry=1) at ../../lib/tdb/common/lock.c:500
#8  0x0000000804beeb82 in tdb_find_lock_hash (tdb=tdb@entry=0x80f32ed20, key=..., hash=3111656176, locktype=locktype@entry=1, 
    rec=rec@entry=0x7fffffffd960) at ../../lib/tdb/common/tdb.c:165
#9  0x0000000804beed65 in tdb_parse_record (tdb=0x80f32ed20, key=..., parser=parser@entry=0x80589df7d <db_tdb_parser>, 
    private_data=private_data@entry=0x7fffffffd9b0) at ../../lib/tdb/common/tdb.c:329
#10 0x000000080589e2e2 in db_tdb_parse (db=<optimized out>, key=..., parser=0x802108824 <dbwrap_watched_parse_record_parser>, 
    private_data=0x7fffffffda20) at ../../lib/dbwrap/dbwrap_tdb.c:273
#11 0x000000080589ba5f in dbwrap_parse_record (db=<optimized out>, key=..., 
    parser=parser@entry=0x802108824 <dbwrap_watched_parse_record_parser>, private_data=private_data@entry=0x7fffffffda20)
    at ../../lib/dbwrap/dbwrap.c:425
#12 0x000000080210893b in dbwrap_watched_parse_record (db=0x80f3ffb60, key=..., parser=0x80210daf8 <g_lock_dump_fn>, 
    private_data=0x7fffffffdaa0) at ../../source3/lib/dbwrap/dbwrap_watch.c:783
#13 0x000000080589ba5f in dbwrap_parse_record (db=<optimized out>, key=..., parser=parser@entry=0x80210daf8 <g_lock_dump_fn>, 
    private_data=private_data@entry=0x7fffffffdaa0) at ../../lib/dbwrap/dbwrap.c:425
#14 0x000000080210f129 in g_lock_dump (ctx=<optimized out>, key=..., fn=fn@entry=0x8016d0496 <fetch_share_mode_unlocked_parser>, 
    private_data=private_data@entry=0x7fffffffdb00) at ../../source3/lib/g_lock.c:1653
#15 0x00000008016d1799 in fetch_share_mode_unlocked (mem_ctx=<optimized out>, id=...)
    at ../../source3/locking/share_mode_lock.c:1567
#16 0x00000008016c6b0e in get_file_infos (id=..., name_hash=name_hash@entry=0, delete_on_close=delete_on_close@entry=0x0, 
    write_time=write_time@entry=0x7fffffffdc80) at ../../source3/locking/locking.c:652
#17 0x00000008016ea7d5 in smbd_dirptr_get_entry (ctx=ctx@entry=0x80f3b7640, dirptr=dirptr@entry=0x80f3a0da0, 
    mask=mask@entry=0x80f32d0a0 "*", dirtype=dirtype@entry=22, dont_descend=dont_descend@entry=false, 
    ask_sharemode=ask_sharemode@entry=true, get_dosmode_in=true, match_fn=0x8016f36d3 <smbd_dirptr_lanman2_match_fn>, 
    mode_fn=0x8016f336b <smbd_dirptr_lanman2_mode_fn>, private_data=0x7fffffffdf80, _fname=0x7fffffffdf50, 
    _smb_fname=0x7fffffffdf58, _mode=0x7fffffffdf44, _prev_offset=0x7fffffffdf48) at ../../source3/smbd/dir.c:1050
#18 0x00000008016f75ae in smbd_dirptr_lanman2_entry (ctx=ctx@entry=0x80f3b7640, conn=0x80f3a8c60, dirptr=0x80f3a0da0, 
    flags2=<optimized out>, path_mask=<optimized out>, dirtype=22, info_level=260, requires_resume_key=0, dont_descend=false, 
    ask_sharemode=true, get_dosmode=true, align=8 '\b', do_pad=false, ppdata=0x80f3b76a0, base_data=0x80f37c060 "\270", 
    end_data=0x80f37e057 "", space_remaining=2988, _smb_fname=0x7fffffffe0f0, got_exact_match=0x7fffffffe0ee, 
    _last_entry_off=0x80f3b76c4, name_list=0x0, file_id=0x7fffffffe100) at ../../source3/smbd/smb2_trans2.c:1958
#19 0x0000000801754d03 in smb2_query_directory_next_entry (req=req@entry=0x80f3b7480)
    at ../../source3/smbd/smb2_query_directory.c:561
#20 0x000000080175639b in smbd_smb2_query_directory_send (in_file_name=<optimized out>, in_output_buffer_length=<optimized out>, 
    in_file_index=<optimized out>, in_flags=0 '\000', in_file_info_class=3 '\003', fsp=0x80f376720, smb2req=0x813118060, 
    ev=0x80f3a8060, mem_ctx=0x813118060) at ../../source3/smbd/smb2_query_directory.c:529
Comment 4 Ralph Böhme 2023-05-19 19:33:55 UTC
Please attach the backtraces of all three processes.
Comment 5 Peter Eriksson 2023-05-19 19:36:34 UTC
Created attachment 17892 [details]
Backtrace of process 48592
Comment 6 Peter Eriksson 2023-05-19 19:37:00 UTC
Created attachment 17893 [details]
Backtrace of process 51187
Comment 7 Peter Eriksson 2023-05-19 19:37:29 UTC
Created attachment 17894 [details]
Backtrace of process 75528
Comment 8 Peter Eriksson 2023-05-19 19:41:06 UTC
Created attachment 17895 [details]
tar archive of lsof and smbstatus output for the running processes