Bug 13267 - smbd deadlock/endless wait or endless loop
Status: RESOLVED WORKSFORME
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services
Version: 4.1.15
Hardware: x86 Linux
Importance: P5 critical
Target Milestone: 4.7
Assigned To: Samba QA Contact
QA Contact: Samba QA Contact
Reported: 2018-02-13 12:32 UTC by Dieter Ferdinand
Modified: 2018-04-12 11:06 UTC

Description Dieter Ferdinand 2018-02-13 12:32:32 UTC
Hello,
I have a big problem with smbd. In some situations smbd hangs at a wait point (futex wait), and the only thing I can do is kill all processes (smbd and nmbd) with killall -9 smbd nmbd to restart the services.

My server is an AMD system with 8 cores.

Yesterday I had to do this twice.

I think the trigger for this bug has something to do with how the server is used.

Yesterday I transferred files to a Linux system with a 2.6 kernel and a Windows XP system, both at the maximum transfer rate the systems can handle, about 30 MB/s in total.
The moment I also opened some files from the XP machine, the server hung for the first time, and later a second time.

If I remember correctly, I tried Samba up to version 4.7.3 or 4.7.4 with the same problems.

I have had this problem with Samba since my last update in September 2017, after more than a year without updates.

Most of the time only one system receives or sends data and the server works fine. But every time I transfer data to my archive system, or verify and delete some backups on that system over Samba (normally I use rsync to update the archive), and read files from the XP machine at the same time, my Samba server hangs.

I don't know why this happens. It is possible that some functions used in parallel cause the problem in combination with the multiprocessor system, or that a network packet or signal goes missing in a place where no timeout is implemented.

I have no such problems with versions below 4.x! On my old systems I use Samba 3.6.25.

I don't remember the last version used on the newer systems, but I think it was below 4.x, probably 3.6.23 or another 3.6.2x.

If it is impossible to fix this bug, I will have to install the old version 3.6.25 or the latest 3.6.x on my new server instead of the 4.x version.

Goodbye
Comment 1 Volker Lendecke 2018-02-13 13:21:06 UTC
Please upload a gstack of the hung process. Also, your kernel seems to be pretty vintage. Please try setting

dbwrap_tdb_mutexes:* = false

in the [global] section in your smb.conf
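
For reference, a rough sketch of both suggestions; the example PID and the smb.conf path are assumptions and vary per system:

    # capture the user-space stack of one hung smbd (replace 12345 with the real PID)
    gstack 12345 > /tmp/smbd-12345.stack

    # /etc/samba/smb.conf (path varies by distribution): disable robust mutexes for TDB
    [global]
        dbwrap_tdb_mutexes:* = false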
Comment 2 Dieter Ferdinand 2018-02-16 15:03:46 UTC
Hello,
I tried it, but today the server hung again.

This config line doesn't solve the problem.

Goodbye
Comment 3 Volker Lendecke 2018-02-16 15:30:01 UTC
Well, what can I say here. I don't think with direct access to the system in that state we can solve this. You should get someone from https://samba.org/samba/support, have them sign an NDA, and give them root access to your system. This can have a *LOT* of reasons, from hardware problems to kernel bugs to Samba itself.
Comment 4 Volker Lendecke 2018-02-16 15:58:37 UTC
typo in my last comment: I don't think we'll solve this *without* direct system access
Comment 5 Hemanth 2018-03-22 06:43:07 UTC
We have also come across a similar issue at one of our customers. We did enable robust mutexes for TDB access.

We are currently running Samba version 4.3.11 (+ security patches).

(gdb) bt
#0  0x00007fabe7996594 in __lll_robust_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fabe79915a2 in _L_robust_lock_261 () from /lib64/libpthread.so.0
#2  0x00007fabe79910ff in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#3  0x00007fabe1a9b094 in chain_mutex_lock (m=0x7fabd19e6a88, waitflag=true) at ../lib/tdb/common/mutex.c:182
#4  0x00007fabe1a9b1cd in tdb_mutex_lock (tdb=0x556c385f6d40, rw=0, off=412, len=1, waitflag=true, pret=0x7fff8b25cf18)
   at ../lib/tdb/common/mutex.c:234
#5  0x00007fabe1a8fc52 in fcntl_lock (tdb=0x556c385f6d40, rw=0, off=412, len=1, waitflag=true) at ../lib/tdb/common/lock.c:44
#6  0x00007fabe1a8fdec in tdb_brlock (tdb=0x556c385f6d40, rw_type=0, offset=412, len=1, flags=TDB_LOCK_WAIT) at ../lib/tdb/common/lock.c:174
#7  0x00007fabe1a90349 in tdb_nest_lock (tdb=0x556c385f6d40, offset=412, ltype=0, flags=TDB_LOCK_WAIT) at ../lib/tdb/common/lock.c:346
#8  0x00007fabe1a90593 in tdb_lock_list (tdb=0x556c385f6d40, list=61, ltype=0, waitflag=TDB_LOCK_WAIT) at ../lib/tdb/common/lock.c:438
#9  0x00007fabe1a9063b in tdb_lock (tdb=0x556c385f6d40, list=61, ltype=0) at ../lib/tdb/common/lock.c:456
#10 0x00007fabe1a8d285 in tdb_find_lock_hash (tdb=0x556c385f6d40, key=..., hash=1922915160, locktype=0, rec=0x7fff8b25d120)
   at ../lib/tdb/common/tdb.c:118
#11 0x00007fabe1a8d669 in tdb_parse_record (tdb=0x556c385f6d40, key=..., parser=0x7fabe13ddce1 <db_tdb_parser>, private_data=0x7fff8b25d1a0)
   at ../lib/tdb/common/tdb.c:245
#12 0x00007fabe13dddc6 in db_tdb_parse (db=0x556c385f6a80, key=..., parser=0x7fabe7346b58 <fetch_share_mode_unlocked_parser>,
   private_data=0x556c3867a950) at ../lib/dbwrap/dbwrap_tdb.c:231
#13 0x00007fabe13d9d03 in dbwrap_parse_record (db=0x556c385f6a80, key=..., parser=0x7fabe7346b58 <fetch_share_mode_unlocked_parser>,
   private_data=0x556c3867a950) at ../lib/dbwrap/dbwrap.c:387
#14 0x00007fabe7346ca1 in fetch_share_mode_unlocked (mem_ctx=0x556c38678980, id=...) at ../source3/locking/share_mode_lock.c:650
#15 0x00007fabe733ad8a in get_file_infos (id=..., name_hash=0, delete_on_close=0x0, write_time=0x7fff8b25d4e0)
   at ../source3/locking/locking.c:615
#16 0x00007fabe71dcf8e in smbd_dirptr_get_entry (ctx=0x556c386e8820, dirptr=0x556c3879ab90, mask=0x556c3875bf10 "*", dirtype=22,
   dont_descend=false, ask_sharemode=true, match_fn=0x7fabe7233c58 <smbd_dirptr_lanman2_match_fn>,
   mode_fn=0x7fabe7233faa <smbd_dirptr_lanman2_mode_fn>, private_data=0x7fff8b25d600, _fname=0x7fff8b25d620, _smb_fname=0x7fff8b25d618,
   _mode=0x7fff8b25d668, _prev_offset=0x7fff8b25d628) at ../source3/smbd/dir.c:1194
#17 0x00007fabe7237a53 in smbd_dirptr_lanman2_entry (ctx=0x556c386e8820, conn=0x556c3864b8e0, dirptr=0x556c3879ab90, flags2=53313,

We actually missed checking the current mutex owner and getting that owner process's user-level stack (to identify which operation it was blocked in while holding the mutex lock). There were many smbds in the hung state (everything pointing to futex wait).

Looking at the db_tdb_fetch_locked() code, it doesn't seem to make any system calls or perform other blocking operations after obtaining the mutex. Wondering what caused the smbds to go into that state.
We have collected a couple of offline cores, but without the shared memory page dumps that hold the actual mutex state. Not sure if those cores will be useful for debugging this issue.
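
On a live hung smbd, a rough sketch of how the mutex owner could be identified; the frame number and the variable name m come from the backtrace above, while the example PID and the reliance on glibc's internal pthread_mutex_t layout are assumptions:

    # attach to one of the hung smbds (12345 is an example PID)
    gdb -p 12345
    (gdb) frame 3                 # chain_mutex_lock (m=0x7fabd19e6a88, waitflag=true)
    (gdb) print *m                # dump the robust mutex; __data.__owner / __data.__lock encode the holder's TID
    (gdb) detach
    # then take a gstack of that owning TID/PID to see what it is blocked on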
Comment 6 Dieter Ferdinand 2018-04-11 18:23:38 UTC
Hello,
The same problem with 4.8.0.

Goodbye
Comment 7 Volker Lendecke 2018-04-12 11:06:37 UTC
(In reply to Dieter Ferdinand from comment #6)
> Hello,
> The same problem with 4.8.0.

My guess is that something with the robust mutexes is broken on your platform. We run robust mutexes under high load in a lot of situations without problems. From talking to Red Hat employees, I learned that robust mutexes in glibc and the kernel have a lot of problems, and you might be hitting them.

The remedy here is to run without them. Set

dbwrap_tdb_mutexes:* = false

Closing this bug; I think we have to wait a few years until the fixes have trickled down into the available distros.

If you can reproduce the issue without mutexes, please re-open.
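
A minimal sketch of applying the workaround, assuming a stock smb.conf location and systemd unit names (both vary by distribution):

    # after adding "dbwrap_tdb_mutexes:* = false" to the [global] section:
    testparm -s /etc/samba/smb.conf    # confirm the configuration still parses
    systemctl restart smb nmb          # on Debian-based systems the units are smbd/nmbd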