Created attachment 8782 [details] backtrace of core If samba service is run with server services = +smb and is a domain member, and the kerberos is not responding, then after a period of connection attempts by a windows client, the winbind service will segfault and not be restarted. All further session attempts fail and the server becomes largely unavailable. In re-producing this error, I deny kerberos access like this: iptables -t nat -A OUTPUT -m udp -p udp --dport 88 -j DNAT --to-destination 172.16.0.1 (it is only important that the destination is non-local and does not respond/exist).
Not crashing is an optimistic strategy As there is a pipe to join the task server children to the main server, I think the main server could refork the task service to recover.
Created attachment 8791 [details] Patch against samba 4.0.5 This patch runs on 4.0.5 and does remedy the stuck situation the crash brings about.
I'm also investigating related RPC process hangs that occur in networks with traffic loss
With the patch http://cpaste.org/1895/ to preserve gensec contexts, and also from #9832 crashes are less common. What I now find is that: - after periods of no or poor connectivity (such as reboot of sole DC) - when a client is constantly authenticating (e.g. map and unmap a share) - that some port 445 connections to the dc used to process logon connections will linger around - these 445 connections used to process logon connections be reset but winbind does not know this and will not create new connections but re-use old ones -- find_domain_from_sid() Also, if winbind is restarted (by means of the patches here) existing samba server sessions will lose an irpc pipe (or something like it) and not realise and will not create new connections. I also observe that some lingered connections to the dc will always return NT_STATUS_ACCESS_DENIED. Attaching gdb to close these (or reboot dc) results in a timeout for the client instead, no new pipe is made. Such situations result in timeout errors and loss of service ONLY for non-domain members using NTLM (in my case for S4U2Proxy) I think there is no means to handle or re-establish dropped transport connections.
I now have a very simple failure case (not using cifs proxy) which I will attempt to re-produce on current head: 1. perform NTLM auth of a client connecting to share (net use) 2. check using lsof that winbind service task has a connection to dc 3. unmap share from client 4. check using lsof that winbind service task still has a connection to dc 5. reboot dc 6. check using lsof that winbind service task does not have a connection to dc 7. reboot client, or disconnect IPC$ share (perhaps by kill the smb per-client process) 8. watch all NTLM auth attempts fail
This problem DOES occur for latest samba head 55add52f42f389f396ab9a08f08ada203fffee14 After an auth attempt, I reboot the DC and now samba -d2 gives: auth_check_password_recv: winbind authentication for user [WANOPTDOGFOOD\administrator] FAILED with error NT_STATUS_CONNECTION_DISCONNECTED SPNEGO login failed: NT_STATUS_CONNECTION_DISCONNECTED standard_terminate: reason[NT_STATUS_END_OF_FILE] the client reports: session setup failed: NT_STATUS_CONNECTION_DISCONNECTED ..repeatedly, no attempt is made to create a new connection to the dc. I'll look at fixing this in the samba head
I find this iptables applied briefly during an auth attempt is a substitute for rebooting the DC. iptables -A INPUT -m tcp -p tcp --sport 445 -j REJECT --reject-with tcp-reset ... iptables -D INPUT -m tcp -p tcp --sport 445 -j REJECT --reject-with tcp-reset it will be enough to reset the remote end, the rule is then removed to allow the remote RST back to reset the winbind socket. Resetting the samba side first like this: iptables -A OUTPUT -m tcp -p tcp --dport 445 -j REJECT --reject-with tcp-reset instead caused: auth_check_password_recv: winbind authentication for user SPNEGO login failed: NT_STATUS_IO_TIMEOUT session setup failed: NT_STATUS_IO_TIMEOUT standard_terminate: reason[NT_STATUS_END_OF_FILE] =============================================================== INTERNAL ERROR: Signal 11 in pid 31267 (4.1.0pre1-GIT-bbee6e4) Please read the Trouble-Shooting section of the Samba HOWTO =============================================================== PANIC: internal error Restart after 134 from child 31267 winbind waiting for child 31535 winbind (I have the restart winbind patch in)
Created attachment 8946 [details] Fix for transport-gone-away problem Patch that fixes the issue for me by reacting to a NT_STATUS_CONNECTION_RESET reset message. Maybe another fix would be needed for NT_STATUS_CONNECTION_DISCONNECTED and NT_STATUS_IO_TIMEOUT which might occur if the reset error was somehow missed. The action there would be to close the pipe and then destroy it at the point the error was found, rather than merely checking in the find_domain_from_sid() function
Ok, I have one (big) issue with the patch. That's the use of talloc_reference() inside the _wb_sid2domain_send() to keep state->domain around. Please use reference counting instead. talloc_reference leads to memory leaks and completely unmaintainable code, and any additional use of this API should be considered as introducing a bug into Samba. Sorry :-(. Jeremy.
And by reference counting, I mean *explicit* reference counting (I know talloc_reference is also reference counting, but it's nasty and hidden :-). Jeremy.
I'm just thinking how I would do that. It would be quite intrusive. it certainly is a pain to have to recover the pointer to release the counter. I'm not sure what the current style is for reference counting is, I would probably introduce a reference-counting wrapper struct that was just a pointer to something and also a pointer to it's reference count. In the wrappers talloc destructor it would reduce the reference count and then free it if it was zero. Which means that the domain struct would also need a talloc destructor to prevent destruction if the reference count was not zero. But did decrement of course. But then we get the risk of surrogate talloc parents (when the destructor refused destruction) re-calling the destructor and reducing the count AGAIN. I guess the original owner need not directly own the domain struct, but via another reference count wrapper. On balancem I would probably remove the talloc_reference from that patch and open another bug # to deal with better reference counting. Would that be OK?
Yes, thanks - that would be better. Jeremy.
Before you go too far with doing hand-written refcounting: I have really strong reservations against talloc_reference myself, but I have used it myself in core code. Take a look at tdb_wrap_open(). There the user of this API can not mis-use the object that was talloc_referenced, and there is a clear separation of the code that must be aware of the talloc_reference() and all the rest. Jeremy might disagree, but I would be willing to discuss this particular use over one or two beers :-)