7259 – Winbind eventually locks "forever" if one of ActiveDirectory refuses all connections

Bug 7259 - Winbind eventually locks "forever" if one of ActiveDirectory refuses all connections

Summary: Winbind eventually locks "forever" if one of ActiveDirectory refuses all conn...

Status:	RESOLVED FIXED

Alias:	None

Product:	Samba 3.4
Classification:	Unclassified
Component:	Winbind (show other bugs)
Version:	3.4.7
Hardware:	x86 Linux

Importance:	P3 normal
Target Milestone:	---
Assignee:	Stefan Metzmacher
QA Contact:	Samba QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2010-03-17 10:55 UTC by Eric Castan
Modified:	2010-05-11 10:08 UTC (History)
CC List:	2 users (show)

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Eric Castan 2010-03-17 10:55:41 UTC

Originally created as http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=574144

Due to a misconfiguration, we resulted with the following situation:
- 3 active directory servers working perfectly
- 1 active directory server up, but all ports being blocked by the firewall
all active directory are serving the same domain.
It appears that this makes winbind unstable. The daemon eventually does not answer to incoming requests on /tmp/.winbindd/pipe

I report this bug as it may interest others and as it may also occur if the active directory service crashes badly.

The problem did not exist in winbind 3.4.2; I still have clients on this version that behave correctly in this erroneous AD configuration.


The steps to reproduce are easy:
- refuse incoming connections on active directory (connection closed) as below


eca@xxxxxxxxc ~> nmap 172.16.uuu.vv

Starting Nmap 4.62 ( http://nmap.org ) at 2010-03-16 17:19 CET
All 1715 scanned ports on hostname.fqdn.com (172.16.uuu.vv) are closed

Nmap done: 1 IP address (1 host up) scanned in 0.392 seconds


After a while, it will eventually block

The symptoms are:
- if you strace libnss programs, you will see that winbindd refuses to process requests on the /tmp/.winbindd/pipe (EAGAIN in connect() syscall)
- You will see such messages:

head /var/log/samba/log.winbindd-dc-connect
[2010/03/16 03:28:49,  1] libads/cldap.c:166(recv_cldap_netlogon)
  no reply received to cldap netlogon (ret = -1: Error = Connection refused)

/var/log/samba/log.winbindd (at the time it will not work anymore)
[2010/03/16 08:52:58,  1] winbindd/winbindd_ads.c:1137(lookup_groupmem)
  lsa_lookupsids call failed with NT_STATUS_IO_TIMEOUT - retrying...

- if you strace winbind processes you will see them sleeping in select() and nothing ever happens



-- System Information:
Debian Release: 5.0.4
  APT prefers stable
  APT policy: (800, 'stable')
Architecture: i386 (i686)

Kernel: Linux 2.6.26-2-xen-686 (SMP w/2 CPU cores)
Locale: LANG=en_US, LC_CTYPE=en_US (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/bash

Versions of packages winbind depends on:
ii  adduser         3.110                    add and remove users and groups
ii  libc6           2.7-18lenny2             GNU C Library: Shared libraries
ii  libcap2         2.11-2                   support for getting/setting POSIX.
ii  libcomerr2      1.41.3-1                 common error description library
ii  libkrb53        1.6.dfsg.4~beta1-5lenny2 MIT Kerberos runtime libraries
ii  libldap-2.4-2   2.4.11-1+lenny1          OpenLDAP libraries
ii  libpam0g        1.0.1-5+lenny1           Pluggable Authentication Modules l
ii  libpopt0        1.14-4                   lib for parsing cmdline parameters
ii  libtalloc2      2.0.1-1~bpo50+1          hierarchical pool based memory all
hi  libwbclient0    2:3.4.7~dfsg-1~bpo50+1   Samba winbind client library
ii  lsb-base        3.2-20                   Linux Standard Base 3.2 init scrip
ii  samba-common    2:3.4.7~dfsg-1~bpo50+1   common files used by both the Samb
ii  zlib1g          1:1.2.3.3.dfsg-12        compression library - runtime

winbind recommends no packages.

winbind suggests no packages.

-- no debconf information

Comment 1 Andrew Tranquada 2010-03-24 08:12:49 UTC

We are seeing the same problem with samba 3.4.5 on RHEL.
We have two AD domain controllers, if any of the two goes down, winbind hangs until it is restarted, at the exact place, it is unable to read /tmp/.winbindd/pipe.
with an lsof/strace, it looks like none of the winbind processes are looking at the right FD.

Comment 2 Stefan Metzmacher 2010-03-31 08:49:46 UTC

Can you try the patches attached to bug 7295?
https://bugzilla.samba.org/show_bug.cgi?id=7295

Or use the current v3-5-test branch

Comment 3 Eric Castan 2010-04-01 07:30:55 UTC

I just tested the build from branch v3-5-test (3.5.2-GIT-9620f5f-test) and reproduced on my lab.

It is a little bit better, though not perfect. That is:
- under the same test conditions, winbind remains stable and eventually produces an answer
- however, the answer is not correct because is does not try to connect to the other ActiveDirectory servers.

Here is the detail of lab
cat /var/run/samba/smb_krb5/krb5.conf.PREPROD
[libdefaults]
        default_realm = PREPROD.xxxx.xx
        default_tgs_enctypes = RC4-HMAC DES-CBC-CRC DES-CBC-MD5
        default_tkt_enctypes = RC4-HMAC DES-CBC-CRC DES-CBC-MD5
        preferred_enctypes = RC4-HMAC DES-CBC-CRC DES-CBC-MD5

[realms]
        PREPROD.xxxx.xx = {
                kdc = 172.29.129.22
        kdc = 172.29.129.21
        kdc = 172.29.129.21
        }

The test conditions are reproduced by

iptables -D OUTPUT -p tcp -d 172.29.129.22 -j REJECT

And in such conditions, winbind does not switch to the other available ActiveDirectory node and tries to use the already established connections

Comment 4 Stefan Metzmacher 2010-04-01 10:30:54 UTC

Can you rerun the test also blocking UDP?

The dc discovery uses UDP to detect if a dc is reachable.

We still should fallback if only TCP is blocked, but it should only cause delays
or temporary failures. If it doesn't recovery at all in this case we have an
additional bug to fix.

Comment 5 Andrew Tranquada 2010-04-02 12:37:12 UTC

I checked out 3.5.2-GIT-UNKNOWN-test yesterday and installed it on two servers in my dev environment. I have two AD servers in my lab. I made the linux servers leave the domain, and then re-join the domain. I did not login to the servers for an hour after doing so to make sure nothing got cached. I then rebooted the one AD server, waited 15 minutes after it came up, then rebooted the other AD server. I then waited another hour before logging in to the linux servers.
the one server with 3.5.2-GIT has been working like a champ. no issues at all.
The other gives this from the logs:
[2010/04/02 17:59:08.331357, 1] winbindd/winbindd_ads.c:126(ads_cached_connection)
ads_connect for domain AWESOME failed: No logon servers
[2010/04/02 17:59:14.411718, 1] winbindd/winbindd_ads.c:126(ads_cached_connection)
ads_connect for domain AWESOME failed: No logon servers
[2010/04/02 18:01:08.946911, 1] winbindd/winbindd_ads.c:126(ads_cached_connection)
ads_connect for domain AWESOME failed: No logon servers

On further investigation, it appeared the ip it was trying to contact the AD server on was unreachable (I accidentally had the secondary AD server register a "private" ip on its other NIC into DNS, which neither linux server can hit). However it seemed as if winbind would not try to look at any other ips for that domain, like it was stuck on checking that ip until it was reachable, or winbind got restarted (which fixed the problem). I let that server sit for about an hour before I tried logging in, so it was trying to reach that IP for about an hour. The other ips advertised for the domain were reachable the entire time.
Is this default behavior or a bug? Should winbind keep trying the ips listed from dns for the domain until it can talk to one instead of zoning in on one ip, or am I misunderstanding how its supposed to work?

Comment 6 Andrew Tranquada 2010-04-02 14:03:11 UTC

After fixing the dns issue I decided to try this again, with all ips returned for the domain being reachable. 
one server worked great (this time the one that had the problem below is working great and the other one that worked fine before is having weird problems).
Here is the problem:
ssh logins hang, as it is trying to do group lookups. getent groups/passwd hangs. However, if I were to strace -p <pid> of the main winbind process, group lookups begin to work so long as I am attached that pid (I thought I was crazy at first but after 15 minutes of nothing but that I have completely verified that is the behavior I am seeing). I would be more than happy to take a video capture to show this. I wanted to get an strace to see where it was hanging, and everytime I attach to the processes, it immediately returns correct values. 
If I do not attach to the process via strace, getent group/passwd only returns local users. 

strace here: http://pastebin.com/dg08uzdp

logs:

  winbindd_getgrent_done: done with 17 groups
[2010/04/02 19:54:17.120958, 10] winbindd/winbindd.c:662(wb_request_done)
  wb_request_done[20109:GETGRENT]: NT_STATUS_OK
[2010/04/02 19:54:17.122203, 10] winbindd/winbindd.c:723(winbind_client_response_written)
  winbind_client_response_written[20109:GETGRENT]: deliverd response to client
[2010/04/02 19:54:17.124385, 10] winbindd/winbindd.c:600(process_request)
  process_request: Handling async request 20109:GETGRENT
[2010/04/02 19:54:17.125035,  3] winbindd/winbindd_getgrent.c:51(winbindd_getgrent_send)
  [20109]: getgrent
[2010/04/02 19:54:17.125669,  5] winbindd/winbindd_getgrent.c:149(winbindd_getgrent_recv)
  getgrent failed: NT_STATUS_NO_MORE_ENTRIES
[2010/04/02 19:54:17.126308, 10] winbindd/winbindd.c:662(wb_request_done)
  wb_request_done[20109:GETGRENT]: NT_STATUS_NO_MORE_ENTRIES
[2010/04/02 19:54:17.127456, 10] winbindd/winbindd.c:723(winbind_client_response_written)
  winbind_client_response_written[20109:GETGRENT]: deliverd response to client
[2010/04/02 19:54:17.129106, 10] winbindd/winbindd.c:600(process_request)
  process_request: Handling async request 20109:ENDGRENT
[2010/04/02 19:54:17.129893, 10] winbindd/winbindd.c:662(wb_request_done)
  wb_request_done[20109:ENDGRENT]: NT_STATUS_OK
[2010/04/02 19:54:17.131047, 10] winbindd/winbindd.c:723(winbind_client_response_written)
  winbind_client_response_written[20109:ENDGRENT]: deliverd response to client
[2010/04/02 19:54:17.132178,  6] winbindd/winbindd.c:823(winbind_client_request_read)
  closing socket 32, client exited

Comment 7 Andrew Tranquada 2010-04-02 14:04:24 UTC

oh a restart of winbind fixes the issue

Comment 8 Andrew Tranquada 2010-04-03 22:34:00 UTC

sorry for the many comments. that last issue seems to be a fluke. things have been working great using 3.5.2-test.
However the issue before that can be reproduced (if one of those ips is unreachable winbind seems to hone in on just that ip and not try any other ips), I would love to know if that is a bug or not. Again sorry for the many comments, we have planning on using this for a large (1200+ server) deployment and this is the only bug holding us back. Thanks!

Comment 9 Eric Castan 2010-04-12 04:52:31 UTC

I confirm that on version 3.5.2-GIT-9620f5f-test, and blocking udp as well as tcp (as it were accidentally on my production site), it works as expected (switching to other active directory).

Comment 10 Stefan Metzmacher 2010-05-11 07:16:15 UTC

So is this one fixed in 3.5.2? I'd like to close this bug report.

Comment 11 Eric Castan 2010-05-11 10:08:52 UTC

I confirm it is ok in 3.5.2