Bug 13267 - smbd deadlock/endless wait or endless loop
Summary: smbd deadlock/endless wait or endless loop
Status: REOPENED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services
Version: 4.1.15
Hardware: x86 Linux
Importance: P5 critical
Target Milestone: 4.7
Assignee: Samba QA Contact
QA Contact: Samba QA Contact
Reported: 2018-02-13 12:32 UTC by Dieter Ferdinand
Modified: 2019-02-23 14:24 UTC

Description Dieter Ferdinand 2018-02-13 12:32:32 UTC
hello,
i have a big problem with smbd. in some situations, smbd hangs at a wait point (futex wait) and the only thing i can do is kill all processes (certainly smbd and nmbd) with killall -9 smbd nmbd and restart the services.

my server is an amd system with 8 cores.

yesterday i had to do this twice.

i think the trigger for this bug has something to do with the usage of the server.

yesterday i transferred files to a linux system with a 2.6 kernel and a windows xp system, both at the maximum transfer rate the systems can handle, about 30 MB/s in total.
at the moment i also opened some files on xp, the server hung for the first time, and later a second time.

if i remember right, i tried samba up to version 4.7.3 or 4.7.4 with the same problems.

i have had this problem with samba since my last update in september 2017, after more than one year without updates.

most of the time, only one system gets or sends data and the server works fine. but every time i transfer data to my archive system, or verify and delete some backups on that system over samba (normally i use rsync to update the archive), and read files from xp at the same time, my samba server hangs.

i don't know why this happens. it is possible that some functions used in parallel cause the problem in combination with the multiprocessor system, or a missing network packet or signal where no timeout is programmed.

i have no such problems with versions below 4.x! on my old systems, i use samba 3.6.25.

i don't remember the last version used on the newer systems, but i think it was a version below 4.x. i think it must have been 3.6.23 or another 3.6.2x.

if it is impossible to fix this bug, i must install the old version 3.6.25 or the latest 3.6.x on my new server instead of the 4.x version.

goodbye
Comment 1 Volker Lendecke 2018-02-13 13:21:06 UTC
Please upload a gstack of the hung process. Also, your kernel seems to be pretty vintage. Please try setting

dbwrap_tdb_mutexes:* = false

in the [global] section in your smb.conf
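Since the option only has the intended effect when it sits under [global], a quick check of the file can rule out a misplaced line. The following is a minimal sketch, not a full smb.conf parser: get_global_param is a hypothetical helper, the sample workgroup and share are made up, and line continuations and include directives are not handled.

```python
def get_global_param(conf_text, name):
    """Return the value of `name` inside [global], or None if it is not set there."""
    section = None
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not line or line[0] in "#;":
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
            continue
        if section == "global" and "=" in line:
            # Only the first '=' separates the name from the value.
            key, _, val = line.partition("=")
            if key.strip().lower() == name.lower():
                value = val.strip()  # a later duplicate overrides an earlier one
    return value


# Hypothetical sample configuration, for illustration only.
SAMPLE = """\
[global]
    workgroup = EXAMPLE
    dbwrap_tdb_mutexes:* = false

[share]
    path = /srv/share
"""

print(get_global_param(SAMPLE, "dbwrap_tdb_mutexes:*"))
```

If the function returns None for the real configuration file, the line is probably missing or sits in a share section instead of [global].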
Comment 2 Dieter Ferdinand 2018-02-16 15:03:46 UTC
hello,
i tried it, but today the server hung again.

this config line doesn't solve the problem.

goodbye
Comment 3 Volker Lendecke 2018-02-16 15:30:01 UTC
Well, what can I say here. I don't think with direct access to the system in that state we can solve this. You should get someone from https://samba.org/samba/support, have them sign an NDA and give them root access to your system. This can have a *LOT* of reasons, from hardware problems to kernel bugs to Samba itself.
Comment 4 Volker Lendecke 2018-02-16 15:58:37 UTC
typo in my last comment: I don't think we'll solve this *without* direct system access
Comment 5 Hemanth 2018-03-22 06:43:07 UTC
We have also come across similar issue at one of our customers. We did enable the robust mutex for TDB access.

We are currently running samba version 4.3.11 (+ security patches)

(gdb) bt
#0  0x00007fabe7996594 in __lll_robust_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fabe79915a2 in _L_robust_lock_261 () from /lib64/libpthread.so.0
#2  0x00007fabe79910ff in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#3  0x00007fabe1a9b094 in chain_mutex_lock (m=0x7fabd19e6a88, waitflag=true) at ../lib/tdb/common/mutex.c:182
#4  0x00007fabe1a9b1cd in tdb_mutex_lock (tdb=0x556c385f6d40, rw=0, off=412, len=1, waitflag=true, pret=0x7fff8b25cf18)
   at ../lib/tdb/common/mutex.c:234
#5  0x00007fabe1a8fc52 in fcntl_lock (tdb=0x556c385f6d40, rw=0, off=412, len=1, waitflag=true) at ../lib/tdb/common/lock.c:44
#6  0x00007fabe1a8fdec in tdb_brlock (tdb=0x556c385f6d40, rw_type=0, offset=412, len=1, flags=TDB_LOCK_WAIT) at ../lib/tdb/common/lock.c:174
#7  0x00007fabe1a90349 in tdb_nest_lock (tdb=0x556c385f6d40, offset=412, ltype=0, flags=TDB_LOCK_WAIT) at ../lib/tdb/common/lock.c:346
#8  0x00007fabe1a90593 in tdb_lock_list (tdb=0x556c385f6d40, list=61, ltype=0, waitflag=TDB_LOCK_WAIT) at ../lib/tdb/common/lock.c:438
#9  0x00007fabe1a9063b in tdb_lock (tdb=0x556c385f6d40, list=61, ltype=0) at ../lib/tdb/common/lock.c:456
#10 0x00007fabe1a8d285 in tdb_find_lock_hash (tdb=0x556c385f6d40, key=..., hash=1922915160, locktype=0, rec=0x7fff8b25d120)
   at ../lib/tdb/common/tdb.c:118
#11 0x00007fabe1a8d669 in tdb_parse_record (tdb=0x556c385f6d40, key=..., parser=0x7fabe13ddce1 <db_tdb_parser>, private_data=0x7fff8b25d1a0)
   at ../lib/tdb/common/tdb.c:245
#12 0x00007fabe13dddc6 in db_tdb_parse (db=0x556c385f6a80, key=..., parser=0x7fabe7346b58 <fetch_share_mode_unlocked_parser>,
   private_data=0x556c3867a950) at ../lib/dbwrap/dbwrap_tdb.c:231
#13 0x00007fabe13d9d03 in dbwrap_parse_record (db=0x556c385f6a80, key=..., parser=0x7fabe7346b58 <fetch_share_mode_unlocked_parser>,
   private_data=0x556c3867a950) at ../lib/dbwrap/dbwrap.c:387
#14 0x00007fabe7346ca1 in fetch_share_mode_unlocked (mem_ctx=0x556c38678980, id=...) at ../source3/locking/share_mode_lock.c:650
#15 0x00007fabe733ad8a in get_file_infos (id=..., name_hash=0, delete_on_close=0x0, write_time=0x7fff8b25d4e0)
   at ../source3/locking/locking.c:615
#16 0x00007fabe71dcf8e in smbd_dirptr_get_entry (ctx=0x556c386e8820, dirptr=0x556c3879ab90, mask=0x556c3875bf10 "*", dirtype=22,
   dont_descend=false, ask_sharemode=true, match_fn=0x7fabe7233c58 <smbd_dirptr_lanman2_match_fn>,
   mode_fn=0x7fabe7233faa <smbd_dirptr_lanman2_mode_fn>, private_data=0x7fff8b25d600, _fname=0x7fff8b25d620, _smb_fname=0x7fff8b25d618,
   _mode=0x7fff8b25d668, _prev_offset=0x7fff8b25d628) at ../source3/smbd/dir.c:1194
#17 0x00007fabe7237a53 in smbd_dirptr_lanman2_entry (ctx=0x556c386e8820, conn=0x556c3864b8e0, dirptr=0x556c3879ab90, flags2=53313,

We missed checking the current mutex owner and that process's user-space stack (to identify which operation it was blocked in while holding the mutex). There were many smbd processes in the hung state (all of them pointing to futex wait).

Looking at the db_tdb_fetch_locked() code, it doesn't seem to make any system calls or other blocking operations after obtaining the mutex. We are wondering what caused the smbds to go into that state.
We have collected a couple of offline cores, but without the shared memory page dumps that hold the actual mutex state. Not sure if those cores will be useful to debug this issue.
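For a future occurrence, the owner of a stuck robust mutex can be read from its lock word. The sketch below assumes glibc on Linux, where the lock word of a robust mutex holds the owner's thread ID combined with the FUTEX_WAITERS and FUTEX_OWNER_DIED flags from linux/futex.h. decode_robust_lock_word is a hypothetical helper, and the sample value is made up. The real word would have to be read from a live process or from a core that includes the shared tdb mapping, for example from the pthread_mutex_t at the address passed to chain_mutex_lock.

```python
# Bit layout of the futex word, as defined in linux/futex.h.
FUTEX_WAITERS = 0x80000000      # at least one thread is blocked on the lock
FUTEX_OWNER_DIED = 0x40000000   # the previous owner exited while holding it
FUTEX_TID_MASK = 0x3FFFFFFF     # thread ID of the current owner (0 = unlocked)


def decode_robust_lock_word(word):
    """Split a 32-bit robust-mutex lock word into owner TID and flag bits."""
    word &= 0xFFFFFFFF  # tolerate values printed as negative signed ints
    return {
        "owner_tid": word & FUTEX_TID_MASK,
        "waiters": bool(word & FUTEX_WAITERS),
        "owner_died": bool(word & FUTEX_OWNER_DIED),
    }


# Made-up example value: owner TID 0x1234 with waiters queued.
print(decode_robust_lock_word(0x80001234))
```

If owner_tid names a process that is still alive, its stack shows what it is doing while holding the lock. If no such process exists and owner_died is not set, that would point toward the robust-mutex machinery rather than Samba.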
Comment 6 Dieter Ferdinand 2018-04-11 18:23:38 UTC
hello,
the same problem with 4.8.0.

goodbye
Comment 7 Volker Lendecke 2018-04-12 11:06:37 UTC
(In reply to Dieter Ferdinand from comment #6)
> hello,
> the same problem with 4.8.0.

My guess is that something with the robust mutexes is broken on your platform. We're running robust mutexes under high load in a lot of situations without problems. Talking to RedHat employees I got the information that robust mutexes in glibc and the kernel have a lot of problems, and you might be hitting them.

The remedy here is to run without them. Set

dbwrap_tdb_mutexes:* = false

Closing this bug, I think we have to wait a few years until the fixes have trickled down into available distros.

If you can reproduce the issue without mutexes, please re-open.
Comment 8 Dieter Ferdinand 2019-01-13 16:54:28 UTC
hello,
it seems that this problem only happens if more than one system accesses the same directory (or share ???).

if i access the same directory from my linux nas and my xp pc to delete or move files, my system hangs after some minutes.

if i access this directory (share) to move or delete files and another directory (share) to copy files to it from my win10 system, i have no problems.

i haven't tested access to different directories on the same share, but i think that access to the same directory (i use the same share) triggers this bug.

i have always deleted files in this directory from one or two clients.

this is my temp directory for my video files before i move them to the nas or my pc to convert or watch the videos.

i hope this will help to find the bug and remove it.

goodbye
Comment 9 Dieter Ferdinand 2019-01-22 18:24:03 UTC
hello,
the two clients which hang both use the nt1 protocol.

clients with other protocols work fine at the moment.

goodbye
Comment 10 Dieter Ferdinand 2019-01-22 22:32:25 UTC
hello,
i made some tests today and saw that only my linux and xp systems hang. two win 10 systems have no problems accessing the server, but these two systems access other directories and shares on the server.

after samba hung today, i got access to the server again, but the share is blocked. i can't access this share from either of the two systems again. a new share to the same directory is also inaccessible.
but only from the xp system; linux can't access another share.


i didn't try to access this share from one of my win 10 systems because i don't want to risk interrupting their work.

after a complete restart of samba everything works again.

i don't know why the share is blocked, but if i access this share again from my xp system, i can't access any share of the server.

goodbye
Comment 11 Volker Lendecke 2019-01-23 06:40:11 UTC
What file system are you running on?

Can you install the "gstack" program and do a gstack on the smbds that are affected? If gstack hangs, please post the output of "cat /proc/<pid>/stack" of the affected processes.
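Collecting this for every smbd at once can be scripted. The following is a minimal sketch assuming a Linux /proc; find_pids, read_proc_file and report are hypothetical helper names. Reading /proc/<pid>/stack normally requires root, and according to proc(5) the file only exists when the kernel was built with CONFIG_STACKTRACE, so the sketch reports a missing file instead of failing.

```python
import os


def find_pids(name, proc_root="/proc"):
    """Return the PIDs whose command name (the comm file) equals `name`."""
    pids = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return pids  # no procfs available at this path
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited meanwhile, or is not readable
    return sorted(pids)


def read_proc_file(pid, fname, proc_root="/proc"):
    """Read one per-process file, returning a marker string if it is unavailable."""
    try:
        with open(os.path.join(proc_root, str(pid), fname)) as f:
            return f.read().strip()
    except OSError as exc:
        return "<unavailable: %s>" % exc.strerror


def report(name="smbd", proc_root="/proc"):
    """Build a text report with the wait channel and kernel stack per process."""
    lines = []
    for pid in find_pids(name, proc_root):
        lines.append("== pid %d ==" % pid)
        lines.append("wchan: " + read_proc_file(pid, "wchan", proc_root))
        lines.append("stack:")
        lines.append(read_proc_file(pid, "stack", proc_root))
    return "\n".join(lines)


if __name__ == "__main__":
    print(report())
```

Run as root while the hang is in progress, the output can be attached to the bug as a plain text file.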
Comment 12 Dieter Ferdinand 2019-01-23 23:29:52 UTC
hello,
i use ext3 on most volumes, but the volume of the share is reiserfs.

i will change it to another file system when i have enough free space and time to do that, because the reiserfs driver has a bug which makes the system hang if there is an error in the reiserfs structure.

in which package is gstack? i don't know anything about this program. where can i find it?

to check something, the last time the system hung i added a second share to find out whether this problem only happens if the two systems access the same share on the same volume, or whether the system hangs again if i use two different shares on different volumes.

but the directory is the same. i made symlinks to it on some volumes.

if the server hangs next time, i will get the requested information, but i hope that my changes solve the problem for me.

goodbye
Comment 13 Dieter Ferdinand 2019-02-02 12:22:47 UTC
hello,
there is no stack file in /proc for this process.

goodbye
Comment 14 Volker Lendecke 2019-02-02 17:13:30 UTC
Then something is severely wrong on your box? Is this Linux after all? If so, what version?
Comment 15 Dieter Ferdinand 2019-02-14 18:42:10 UTC
hello,
i have installed gentoo linux with kernel 3.16.47.

it is possible that i have not enabled the needed kernel feature, because i only enable the options which i need.

i can enable this option (if i find it) with the next kernel update.

goodbye
Comment 16 Dieter Ferdinand 2019-02-23 14:24:52 UTC
hello,
today i lost access to my server and it was impossible to get access again.

i killed all daemons and restarted them --> no success

i updated to 4.8.6-r2 --> no access

i tried to set a new root pw --> no success, endless futex wait

i deleted all samba databases in /var/samba/lock and set all passwords again --> server can be accessed again.

i hope that this will solve the futex wait problem from the past. i will see in the next month.

i have saved the old database files. if you want to analyse them, please tell me and i will upload them.
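Keeping the old files for later analysis is easier when they are moved aside rather than deleted. The following is a minimal sketch of that step, assuming all Samba daemons are stopped first; move_tdbs_aside and the backup location are hypothetical, and the thread does not establish which of these databases are safe to discard.

```python
import os
import shutil
import time


def move_tdbs_aside(lock_dir, backup_root):
    """Move every top-level *.tdb file from lock_dir into a new timestamped
    directory under backup_root and return (destination, moved file names)."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = os.path.join(backup_root, "samba-tdb-" + stamp)
    os.makedirs(dest)
    moved = []
    for name in sorted(os.listdir(lock_dir)):
        src = os.path.join(lock_dir, name)
        # Subdirectories and non-tdb files are left untouched.
        if name.endswith(".tdb") and os.path.isfile(src):
            shutil.move(src, os.path.join(dest, name))
            moved.append(name)
    return dest, moved


# Example call with the directory named in this comment (backup path is made up):
# dest, moved = move_tdbs_aside("/var/samba/lock", "/root/samba-backup")
```

As in the comment above, databases that hold passwords are moved too, so the passwords have to be set again afterwards.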

goodbye