I have a big problem with smbd. In some situations smbd hangs at a wait point (futex wait), and the only thing I can do is kill all the processes (smbd and nmbd) with killall -9 smbd nmbd and restart the services.
My server is an AMD system with 8 cores.
Yesterday I had to do this twice.
I think the trigger for this bug has something to do with how the server is being used.
Yesterday I was transferring files to a Linux system with a 2.6 kernel and to a Windows XP system, both at the maximum transfer rate the systems can handle, about 30 MB/s in total.
The moment I also opened some files on XP, the server hung for the first time, and later a second time.
If I remember right, I tried Samba up to version 4.7.3 or 4.7.4 with the same problems.
I have had this problem with Samba since my last update in September 2017, which came after more than a year without updates.
Most of the time only one system is receiving or sending data, and the server works fine. But every time I transfer data to my archive system, or verify and delete some backups on that system over Samba (normally I use rsync to update the archive), and read files from XP at the same time, my Samba server hangs.
I don't know why this happens. It is possible that some functions running in parallel cause the problem in combination with the multiprocessor system, or that a network packet or signal goes missing and no timeout is programmed for that case.
I have no such problems with versions below 4.x! On my old systems I use Samba 3.6.25.
I don't remember the last version used on the newer systems, but I think it was below 4.x. I think it must have been 3.6.23 or another 3.6.2x.
If it is impossible to fix this bug, I will have to install the old version 3.6.25 or the latest 3.6.x on my new server instead of the 4.x version.
Please upload a gstack of the hung process. Also, your kernel seems to be pretty vintage. Please try setting
dbwrap_tdb_mutexes:* = false
in the [global] section in your smb.conf
I tried it, but today the server hung again.
This config line doesn't solve the problem.
Well, what can I say here. I don't think with direct access to the system in that state we can solve this. You should get someone from https://samba.org/samba/support, have them sign an NDA, and give them root access to your system. This can have a *LOT* of reasons, from hardware problems to kernel bugs to Samba itself.
typo in my last comment: I don't think we'll solve this *without* direct system access
We have also come across a similar issue at one of our customers. We did enable robust mutexes for TDB access.
We are currently running Samba version 4.3.11 (+ security patches).
#0 0x00007fabe7996594 in __lll_robust_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fabe79915a2 in _L_robust_lock_261 () from /lib64/libpthread.so.0
#2 0x00007fabe79910ff in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#3 0x00007fabe1a9b094 in chain_mutex_lock (m=0x7fabd19e6a88, waitflag=true) at ../lib/tdb/common/mutex.c:182
#4 0x00007fabe1a9b1cd in tdb_mutex_lock (tdb=0x556c385f6d40, rw=0, off=412, len=1, waitflag=true, pret=0x7fff8b25cf18)
#5 0x00007fabe1a8fc52 in fcntl_lock (tdb=0x556c385f6d40, rw=0, off=412, len=1, waitflag=true) at ../lib/tdb/common/lock.c:44
#6 0x00007fabe1a8fdec in tdb_brlock (tdb=0x556c385f6d40, rw_type=0, offset=412, len=1, flags=TDB_LOCK_WAIT) at ../lib/tdb/common/lock.c:174
#7 0x00007fabe1a90349 in tdb_nest_lock (tdb=0x556c385f6d40, offset=412, ltype=0, flags=TDB_LOCK_WAIT) at ../lib/tdb/common/lock.c:346
#8 0x00007fabe1a90593 in tdb_lock_list (tdb=0x556c385f6d40, list=61, ltype=0, waitflag=TDB_LOCK_WAIT) at ../lib/tdb/common/lock.c:438
#9 0x00007fabe1a9063b in tdb_lock (tdb=0x556c385f6d40, list=61, ltype=0) at ../lib/tdb/common/lock.c:456
#10 0x00007fabe1a8d285 in tdb_find_lock_hash (tdb=0x556c385f6d40, key=..., hash=1922915160, locktype=0, rec=0x7fff8b25d120)
#11 0x00007fabe1a8d669 in tdb_parse_record (tdb=0x556c385f6d40, key=..., parser=0x7fabe13ddce1 <db_tdb_parser>, private_data=0x7fff8b25d1a0)
#12 0x00007fabe13dddc6 in db_tdb_parse (db=0x556c385f6a80, key=..., parser=0x7fabe7346b58 <fetch_share_mode_unlocked_parser>,
private_data=0x556c3867a950) at ../lib/dbwrap/dbwrap_tdb.c:231
#13 0x00007fabe13d9d03 in dbwrap_parse_record (db=0x556c385f6a80, key=..., parser=0x7fabe7346b58 <fetch_share_mode_unlocked_parser>,
private_data=0x556c3867a950) at ../lib/dbwrap/dbwrap.c:387
#14 0x00007fabe7346ca1 in fetch_share_mode_unlocked (mem_ctx=0x556c38678980, id=...) at ../source3/locking/share_mode_lock.c:650
#15 0x00007fabe733ad8a in get_file_infos (id=..., name_hash=0, delete_on_close=0x0, write_time=0x7fff8b25d4e0)
#16 0x00007fabe71dcf8e in smbd_dirptr_get_entry (ctx=0x556c386e8820, dirptr=0x556c3879ab90, mask=0x556c3875bf10 "*", dirtype=22,
dont_descend=false, ask_sharemode=true, match_fn=0x7fabe7233c58 <smbd_dirptr_lanman2_match_fn>,
mode_fn=0x7fabe7233faa <smbd_dirptr_lanman2_mode_fn>, private_data=0x7fff8b25d600, _fname=0x7fff8b25d620, _smb_fname=0x7fff8b25d618,
_mode=0x7fff8b25d668, _prev_offset=0x7fff8b25d628) at ../source3/smbd/dir.c:1194
#17 0x00007fabe7237a53 in smbd_dirptr_lanman2_entry (ctx=0x556c386e8820, conn=0x556c3864b8e0, dirptr=0x556c3879ab90, flags2=53313,
We actually missed checking the current mutex owner and its user-space stack (to identify which operation it was blocked in while holding the mutex). There were many smbd processes in the hung state (everything pointing to futex wait).
Looking at the db_tdb_fetch_locked() code, there don't seem to be any system calls or other blocking operations after obtaining the mutex. We are wondering what caused the smbds to go into that state.
We have collected a couple of offline cores, but without the shared memory page dumps that hold the actual mutex state. Not sure if those cores will be useful to debug this issue.
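For anyone who does capture the mutex state next time, here is a minimal sketch of how the lock word of a robust mutex could be decoded to find the owner. It relies on the robust futex convention in the Linux kernel headers (owner thread ID in the low 30 bits, plus the FUTEX_WAITERS and FUTEX_OWNER_DIED flag bits). The helper name decode_futex_word is made up for illustration, and it is an assumption that the lock word is the first 32-bit field of the pthread_mutex_t at the address shown as m= in frame #3; that should be checked against the glibc build in use.

```python
# Flag bits and TID mask as defined for robust futexes in <linux/futex.h>.
FUTEX_WAITERS = 0x80000000     # at least one thread is blocked on the lock
FUTEX_OWNER_DIED = 0x40000000  # previous owner exited while holding the lock
FUTEX_TID_MASK = 0x3FFFFFFF    # low 30 bits: thread ID of the current owner


def decode_futex_word(word):
    """Split a 32-bit robust futex lock word into owner TID and flags.

    Hypothetical helper: `word` is the raw value read from the mutex,
    for example from a debugger or a dump of the shared TDB mapping.
    """
    word &= 0xFFFFFFFF  # tolerate values that were read as wider integers
    return {
        "owner_tid": word & FUTEX_TID_MASK,
        "waiters": bool(word & FUTEX_WAITERS),
        "owner_died": bool(word & FUTEX_OWNER_DIED),
    }
```

An owner_tid of 0 would mean the mutex is currently unlocked. A non-zero owner_tid names the thread whose own stack is worth looking at, which is the information that was missed here.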
the same problem with 4.8.0.
(In reply to Dieter Ferdinand from comment #6)
> the same problem with 4.8.0.
My guess is that something with the robust mutexes is broken on your platform. We're running robust mutexes under high load in a lot of situations without problems. Talking to RedHat employees I got the information that robust mutexes in glibc and the kernel have a lot of problems, and you might be hitting them.
The remedy here is to run without them. Set
dbwrap_tdb_mutexes:* = false
Closing this bug, I think we have to wait a few years until the fixes have trickled down into available distros.
If you can reproduce the issue without mutexes, please re-open.
It seems that this problem only happens if more than one system accesses the same directory (or share?).
If I access the same directory from my Linux NAS and my XP PC to delete or move files, my system hangs after a few minutes.
If I access this directory (share) to move or delete files, and a different directory (share) to copy files to it from my Win10 system, I have no problems.
I have not tested access to different directories on the same share, but I think it is access to the same directory (I use the same share) that triggers this bug.
Each time, I was deleting files in this directory from one or two clients.
This is my temp directory for my video files before I move them to the NAS or to my PC to convert or watch them.
I hope this helps to find the bug and fix it.
The two clients that hang both use the NT1 protocol.
Clients using other protocols work fine at the moment.
I ran some tests today and saw that only my Linux and XP systems hang. Two Win 10 systems have no problems accessing the server, but those two systems access other directories and shares on the server.
After Samba hung today, I got access to the server again, but the share was blocked. I could not access this share from either of the two systems again. A new share pointing to the same directory was also inaccessible.
But only from the XP system; Linux can't access another share.
I did not try to access this share from one of my Win 10 systems because I don't want to risk interrupting their work.
After a complete restart of Samba everything works again.
I don't know why the share is blocked, but if I access this share again from my XP system, I can't access any share on the server.
What file system are you running on?
Can you install the "gstack" program and do a gstack on the smbds that are affected? If gstack hangs, please post the output of "cat /proc/<pid>/stack" of the affected processes.
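If gstack is not available, the /proc data asked for above can be gathered for every smbd in one go. This is a minimal sketch, assuming a Linux /proc layout, that it is run as root, and that the kernel exposes /proc/<pid>/stack (as far as I know that file only exists when the kernel was built with CONFIG_STACKTRACE). The function names are made up for illustration.

```python
from pathlib import Path


def read_text(path):
    """Return the file's text, or a short note if it cannot be read."""
    try:
        return Path(path).read_text(errors="replace").strip()
    except OSError as exc:
        return f"<unreadable: {exc.strerror}>"


def find_pids(name, proc_root="/proc"):
    """Yield PIDs whose /proc/<pid>/comm equals `name`."""
    for entry in sorted(Path(proc_root).iterdir(), key=lambda p: p.name):
        if entry.name.isdigit() and read_text(entry / "comm") == name:
            yield int(entry.name)


def collect(name="smbd", proc_root="/proc"):
    """Map each matching PID to its wait channel and kernel stack."""
    report = {}
    for pid in find_pids(name, proc_root):
        base = Path(proc_root) / str(pid)
        report[pid] = {
            "wchan": read_text(base / "wchan"),
            "stack": read_text(base / "stack"),
        }
    return report


if __name__ == "__main__":
    for pid, info in collect().items():
        print(f"== smbd pid {pid} (wchan: {info['wchan']})")
        print(info["stack"])
```

A missing or unreadable file is reported as "<unreadable: ...>" instead of aborting the run, so the output still shows which processes exist even when the stack file is absent.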
I use ext3 on most volumes, but the volume of the share is ReiserFS.
I will change it to another file system when I have enough free space and time to do that, because the ReiserFS driver has a bug that makes the system hang if there is an error in the ReiserFS structure.
Which package contains gstack? I don't know anything about this program. Where can I find it?
To check something, the last time the system hung I added a second share to find out whether this problem only happens if the two systems access the same share on the same volume, or whether the system hangs again if I use two different shares on different volumes.
But the directory is the same. I made symlinks to it on some volumes.
If the server hangs next time, I will get the requested information, but I hope that my changes solve the problem for me.
I have no file named stack in /proc for this process.
Then something is severely wrong on your box? Is this Linux after all? If so, what version?
I have installed Gentoo Linux with kernel 3.16.47.
It is possible that I have not enabled the required kernel feature, because I only enable the options I need.
I can enable this option (if I find it) with the next kernel update.
Today I lost access to my server and it was impossible to get access again.
I killed all daemons and restarted them --> no success
I updated to 4.8.6-r2 --> no access
I tried to set a new root password --> no success, endless futex wait
I deleted all Samba databases in /var/samba/lock and set all passwords again --> server can be accessed again.
I hope that this will solve the futex wait problem from the past. I will see over the next month.
I have saved the old database files. If you want to analyse them, please tell me and I will upload them.
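For others who end up clearing the databases the same way, a minimal sketch of keeping a copy of the TDB files first, so they stay available for analysis. It only copies and never deletes. The function name is made up for illustration, the lock directory is the one from this comment, the backup location in the usage note is a hypothetical path, and it assumes smbd and nmbd are stopped while the copy runs.

```python
import shutil
import time
from pathlib import Path


def backup_tdb_files(lock_dir, backup_root):
    """Copy every *.tdb file under lock_dir into a new timestamped directory.

    Returns the list of copied files. Nothing in lock_dir is modified.
    """
    lock_dir = Path(lock_dir)
    dest = Path(backup_root) / time.strftime("samba-tdb-%Y%m%d-%H%M%S")
    dest.mkdir(parents=True)
    copied = []
    # sorted() collects all matches before the first copy is made
    for src in sorted(lock_dir.rglob("*.tdb")):
        target = dest / src.relative_to(lock_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target)  # copy2 also keeps the timestamps
        copied.append(target)
    return copied
```

It could be called as backup_tdb_files("/var/samba/lock", "/root/samba-backup") before the databases are removed.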