Bug 9806 - unreachable krb server crashes winbind service
Summary: unreachable krb server crashes winbind service
Status: NEW
Alias: None
Product: Samba 4.0
Classification: Unclassified
Component: Winbind (show other bugs)
Version: 4.0.5
Hardware: x86 Linux
: P5 major (vote)
Target Milestone: ---
Assignee: Samba QA Contact
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-04-17 15:10 UTC by Samjam
Modified: 2013-06-11 16:14 UTC (History)
1 user (show)

See Also:


Attachments
backtrace of core (9.24 KB, text/plain)
2013-04-17 15:10 UTC, Samjam
no flags Details
Patch against samba 4.0.5 (2.45 KB, patch)
2013-04-19 10:36 UTC, Samjam
no flags Details
Fix for transport-gone-away problem (3.14 KB, patch)
2013-06-05 15:59 UTC, Samjam
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Samjam 2013-04-17 15:10:38 UTC
Created attachment 8782 [details]
backtrace of core

If samba service is run with

server services = +smb

and is a domain member, and the kerberos is not responding, then after a period of connection attempts by a windows client, the winbind service will segfault and not be restarted.

All further session attempts fail and the server becomes largely unavailable.

In re-producing this error, I deny kerberos access like this:

iptables -t nat -A OUTPUT -m udp -p udp --dport 88 -j DNAT --to-destination 172.16.0.1

(it is only important that the destination is non-local and does not respond/exist).
Comment 1 Samjam 2013-04-17 15:12:04 UTC
Not crashing is an optimistic strategy

As there is a pipe to join the task server children to the main server, I think the main server could refork the task service to recover.
Comment 2 Samjam 2013-04-19 10:36:16 UTC
Created attachment 8791 [details]
Patch against samba 4.0.5

This patch runs on 4.0.5 and does remedy the stuck situation the crash brings about.
Comment 3 Samjam 2013-05-24 15:15:10 UTC
I'm also investigating related RPC process hangs that occur in networks with traffic loss
Comment 4 Samjam 2013-05-31 15:04:08 UTC
With the patch http://cpaste.org/1895/ to preserve gensec contexts, and also from #9832 crashes are less common.

What I now find is that:

- after periods of no or poor connectivity (such as reboot of sole DC)

- when a client is constantly authenticating (e.g. map and unmap a share)

- that some port 445 connections to the dc used to process logon connections will linger around

- these 445 connections used to process logon connections be reset but winbind does not know this and will not create new connections but re-use old ones -- find_domain_from_sid()

Also, if winbind is restarted (by means of the patches here) existing samba server sessions will lose an irpc pipe (or something like it) and not realise and will not create new connections.

I also observe that some lingered connections to the dc will always return NT_STATUS_ACCESS_DENIED. Attaching gdb to close these (or reboot dc) results in a timeout for the client instead, no new pipe is made.


Such situations result in timeout errors and loss of service ONLY for non-domain members using NTLM (in my case for S4U2Proxy)

I think there is no means to handle or re-establish dropped transport connections.
Comment 5 Samjam 2013-06-04 08:54:43 UTC
I now have a very simple failure case (not using cifs proxy) which I will attempt to re-produce on current head:

1. perform NTLM auth of a client connecting to share (net use)
2. check using lsof that winbind service task has a connection to dc
3. unmap share from client
4. check using lsof that winbind service task still has a connection to dc
5. reboot dc
6. check using lsof that winbind service task does not have a connection to dc
7. reboot client, or disconnect IPC$ share (perhaps by kill the smb per-client process)
8. watch all NTLM auth attempts fail
Comment 6 Samjam 2013-06-05 10:49:48 UTC
This problem DOES occur for latest samba head 55add52f42f389f396ab9a08f08ada203fffee14

After an auth attempt, I reboot the DC and now samba -d2 gives:

auth_check_password_recv: winbind authentication for user [WANOPTDOGFOOD\administrator] FAILED with error NT_STATUS_CONNECTION_DISCONNECTED
SPNEGO login failed: NT_STATUS_CONNECTION_DISCONNECTED
standard_terminate: reason[NT_STATUS_END_OF_FILE]

the client reports:
session setup failed: NT_STATUS_CONNECTION_DISCONNECTED

..repeatedly, no attempt is made to create a new connection to the dc.

I'll look at fixing this in the samba head
Comment 7 Samjam 2013-06-05 12:48:07 UTC
I find this iptables applied briefly during an auth attempt is a substitute for rebooting the DC.

iptables -A INPUT -m tcp -p tcp --sport 445 -j REJECT --reject-with tcp-reset
...
iptables -D INPUT -m tcp -p tcp --sport 445 -j REJECT --reject-with tcp-reset

it will be enough to reset the remote end, the rule is then removed to allow the remote RST back to reset the winbind socket.

Resetting the samba side first like this:
iptables -A OUTPUT -m tcp -p tcp --dport 445 -j REJECT --reject-with tcp-reset

instead caused:

auth_check_password_recv: winbind authentication for user SPNEGO login failed: NT_STATUS_IO_TIMEOUT
session setup failed: NT_STATUS_IO_TIMEOUT
standard_terminate: reason[NT_STATUS_END_OF_FILE]
===============================================================
INTERNAL ERROR: Signal 11 in pid 31267 (4.1.0pre1-GIT-bbee6e4)
Please read the Trouble-Shooting section of the Samba HOWTO
===============================================================
PANIC: internal error
Restart after 134 from child 31267 winbind
waiting for child 31535 winbind

(I have the restart winbind patch in)
Comment 8 Samjam 2013-06-05 15:59:26 UTC
Created attachment 8946 [details]
Fix for transport-gone-away problem

Patch that fixes the issue for me by reacting to a NT_STATUS_CONNECTION_RESET reset message.

Maybe another fix would be needed for NT_STATUS_CONNECTION_DISCONNECTED and NT_STATUS_IO_TIMEOUT which might occur if the reset error was somehow missed. The action there would be to close the pipe and then destroy it at the point the error was found, rather than merely checking in the find_domain_from_sid() function
Comment 9 Jeremy Allison 2013-06-06 18:07:08 UTC
Ok, I have one (big) issue with the patch. That's the use of talloc_reference() inside the _wb_sid2domain_send() to keep state->domain around.

Please use reference counting instead. talloc_reference leads to memory leaks and completely unmaintainable code, and any additional use of this API should be considered as introducing a bug into Samba.

Sorry :-(.

Jeremy.
Comment 10 Jeremy Allison 2013-06-06 18:07:57 UTC
And by reference counting, I mean *explicit* reference counting (I know talloc_reference is also reference counting, but it's nasty and hidden :-).

Jeremy.
Comment 11 Samjam 2013-06-06 21:04:11 UTC
I'm just thinking how I would do that.

It would be quite intrusive.  it certainly is a pain to have to recover the pointer to release the counter. 

I'm not sure what the current style is for reference counting is, I would probably introduce a reference-counting wrapper struct that was just a pointer to something and also a pointer to it's reference count. In the wrappers talloc destructor it would reduce the reference count and then free it if it was zero. Which means that the domain struct would also need a talloc destructor to prevent destruction if the reference count was not zero. But did decrement of course. But then we get the risk of surrogate talloc parents (when the destructor refused destruction) re-calling the destructor and reducing the count AGAIN. I guess the original owner need not directly own the domain struct, but via another reference count wrapper.

On balancem I would probably remove the talloc_reference from that patch and open another bug # to deal with better reference counting. Would that be OK?
Comment 12 Jeremy Allison 2013-06-06 23:44:07 UTC
Yes, thanks - that would be better.

Jeremy.
Comment 13 Volker Lendecke 2013-06-11 16:14:23 UTC
Before you go too far with doing hand-written refcounting: I have really strong reservations against talloc_reference myself, but I have used it myself in core code. Take a look at tdb_wrap_open(). There the user of this API can not mis-use the object that was talloc_referenced, and there is a clear separation of the code that must be aware of the talloc_reference() and all the rest. Jeremy might disagree, but I would be willing to discuss this particular use over one or two beers :-)