Created attachment 13799
Every now and then the winbindd processes on our six fairly busy Samba fileservers (around 200-400 users per server) seem to stop responding, causing Samba to refuse new SMB connections. Today we saw it happen on two different servers at 12:09, and then on a third one at 14:47...
The freezes seem to coincide with the following errors in the "log.smbd" file:
> # egrep -A5 'signing' /var/samba/logs/log.smbd
> [2017/11/21 14:47:22.282388, 0] ../libcli/smb/smb2_signing.c:171(smb2_signing_check_pdu)
> Bad SMB2 signature for message
> [2017/11/21 14:47:22.282480, 0] ../lib/util/util.c:515(dump_data)
>  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ........ ........
> [2017/11/21 14:47:22.282522, 0] ../lib/util/util.c:515(dump_data)
>  C8 AA BD 98 1C CD D4 F7 47 B8 79 B6 EF 90 6D AF ........ G.y...m.
Killing and restarting winbindd seems to let the smbd processes accept new connections again. While winbindd is frozen, both username+password and Kerberos-authenticated connections are refused.
Dell PowerEdge 730xd with 256GB RAM, 10Gbps Ethernet and ~140TB of ZFS storage
Joined to a Windows 2012 AD domain (6 AD servers) with around 100k users and many groups. winbind user and group enumeration is disabled.
Attaching our smb.conf file. Had a quick look into the libcli/smb/smb2_signing.c file but couldn't really see anything obviously wrong...
Is this still an issue?
Well, sort of. We are now running the Samba 4.11 series, and while we don't see that exact error message anymore, we still experience regular winbindd freezes on our busy servers.
Every tenth hour after we restart the winbindd processes, the busy servers ("busy" as in "many users connected", not necessarily doing a lot) tend to freeze up. I suspect something goes wrong at the point where the AD Kerberos service ticket expires and is supposed to be renewed. It normally doesn't happen on less busy servers.
We have a workaround though: a cron job that, around the time we know this is going to happen, runs a series of quick tests and, if they fail, simply restarts winbindd (kill, and if that doesn't help, which is sometimes the case, then "kill -9").
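For anyone wanting to try something similar, here is a rough sketch of that kind of watchdog. It is not the actual cron job described above. It assumes "wbinfo -p" (the winbindd ping) is an acceptable quick test, and the timeout and grace values are made up. Finding the winbindd PID and starting it again depend on the platform, so they are left to the caller.

```python
#!/usr/bin/env python3
"""Hypothetical winbindd watchdog sketch (probe, then kill, then kill -9)."""
import os
import signal
import subprocess
import time

PROBE = ["wbinfo", "-p"]  # assumed quick test: ping winbindd
PROBE_TIMEOUT = 10        # seconds before the probe counts as hung (assumed)
GRACE = 15                # seconds to wait after SIGTERM (assumed)


def winbindd_responds(run=subprocess.run):
    """Return True if the probe exits 0 within PROBE_TIMEOUT seconds."""
    try:
        result = run(PROBE, timeout=PROBE_TIMEOUT,
                     stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except (subprocess.TimeoutExpired, OSError):
        # A hung or missing probe is treated as "not responding".
        return False
    return result.returncode == 0


def is_alive(pid):
    """Return True if a process with this PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def stop_process(pid, kill=os.kill, alive=is_alive, sleep=time.sleep,
                 grace=GRACE):
    """Send SIGTERM, wait up to `grace` seconds, then SIGKILL if needed.

    Returns the list of signals that were sent.
    """
    sent = [signal.SIGTERM]
    kill(pid, signal.SIGTERM)
    remaining = grace
    while remaining > 0 and alive(pid):
        sleep(1)
        remaining -= 1
    if alive(pid):
        try:
            kill(pid, signal.SIGKILL)
            sent.append(signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited between the check and the kill
    return sent
```

A cron entry would call winbindd_responds() and, if it returns False, pass the winbindd PID to stop_process() and then start the daemon again with whatever the local init system uses.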
We restart winbindd at 07:00 every morning so the freeze won't affect our users until 17:00, which lets us avoid disrupting them during prime office hours. (It then happens again at 03:00, but that is not really a problem :-)
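The timing above is just the 10-hour interval counted from the restart. A small sketch of that arithmetic, where the function name and the example date are only for illustration (10 hours is, as far as I know, also the default maximum ticket lifetime in AD Kerberos policy, which would fit the ticket-renewal suspicion):

```python
from datetime import datetime, timedelta


def expected_freezes(restart, interval_hours=10, count=2):
    """Return the times at which a freeze would be expected after a restart."""
    return [restart + timedelta(hours=interval_hours * n)
            for n in range(1, count + 1)]


# Arbitrary example date; only the 07:00 restart time comes from the report.
restart = datetime(2020, 1, 1, 7, 0)
for t in expected_freezes(restart):
    print(t.strftime("%H:%M"))
# prints 17:00, then 03:00
```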