Originally created as http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=574144 Due to a misconfiguration, we resulted with the following situation: - 3 active directory servers working perfectly - 1 active directory server up, but all ports being blocked by the firewall all active directory are serving the same domain. It appears that this makes winbind unstable. The daemon eventually does not answer to incoming requests on /tmp/.winbindd/pipe I report this bug as it may interest others and as it may also occur if the active directory service crashes badly. The problem did not exist in winbind 3.4.2; I still have clients on this version that behave correctly in this erroneous AD configuration. The steps to reproduce are easy: - refuse incoming connections on active directory (connection closed) as below eca@xxxxxxxxc ~> nmap 172.16.uuu.vv Starting Nmap 4.62 ( http://nmap.org ) at 2010-03-16 17:19 CET All 1715 scanned ports on hostname.fqdn.com (172.16.uuu.vv) are closed Nmap done: 1 IP address (1 host up) scanned in 0.392 seconds After a while, it will eventually block The symptoms are: - if you strace libnss programs, you will see that winbindd refuses to process requests on the /tmp/.winbindd/pipe (EAGAIN in connect() syscall) - You will see such messages: head /var/log/samba/log.winbindd-dc-connect [2010/03/16 03:28:49, 1] libads/cldap.c:166(recv_cldap_netlogon) no reply received to cldap netlogon (ret = -1: Error = Connection refused) /var/log/samba/log.winbindd (at the time it will not work anymore) [2010/03/16 08:52:58, 1] winbindd/winbindd_ads.c:1137(lookup_groupmem) lsa_lookupsids call failed with NT_STATUS_IO_TIMEOUT - retrying... - if you strace winbind processes you will see them sleeping in select() and nothing ever happens -- System Information: Debian Release: 5.0.4 APT prefers stable APT policy: (800, 'stable') Architecture: i386 (i686) Kernel: Linux 2.6.26-2-xen-686 (SMP w/2 CPU cores) Locale: LANG=en_US, LC_CTYPE=en_US (charmap=ISO-8859-1) Shell: /bin/sh linked to /bin/bash Versions of packages winbind depends on: ii adduser 3.110 add and remove users and groups ii libc6 2.7-18lenny2 GNU C Library: Shared libraries ii libcap2 2.11-2 support for getting/setting POSIX. ii libcomerr2 1.41.3-1 common error description library ii libkrb53 1.6.dfsg.4~beta1-5lenny2 MIT Kerberos runtime libraries ii libldap-2.4-2 2.4.11-1+lenny1 OpenLDAP libraries ii libpam0g 1.0.1-5+lenny1 Pluggable Authentication Modules l ii libpopt0 1.14-4 lib for parsing cmdline parameters ii libtalloc2 2.0.1-1~bpo50+1 hierarchical pool based memory all hi libwbclient0 2:3.4.7~dfsg-1~bpo50+1 Samba winbind client library ii lsb-base 3.2-20 Linux Standard Base 3.2 init scrip ii samba-common 2:3.4.7~dfsg-1~bpo50+1 common files used by both the Samb ii zlib1g 1:1.2.3.3.dfsg-12 compression library - runtime winbind recommends no packages. winbind suggests no packages. -- no debconf information
We are seeing the same problem with samba 3.4.5 on RHEL. We have two AD domain controllers, if any of the two goes down, winbind hangs until it is restarted, at the exact place, it is unable to read /tmp/.winbindd/pipe. with an lsof/strace, it looks like none of the winbind processes are looking at the right FD.
Can you try the patches attached to bug 7295? https://bugzilla.samba.org/show_bug.cgi?id=7295 Or use the current v3-5-test branch
I just tested the build from branch v3-5-test (3.5.2-GIT-9620f5f-test) and reproduced on my lab. It is a little bit better, though not perfect. That is: - under the same test conditions, winbind remains stable and eventually produces an answer - however, the answer is not correct because is does not try to connect to the other ActiveDirectory servers. Here is the detail of lab cat /var/run/samba/smb_krb5/krb5.conf.PREPROD [libdefaults] default_realm = PREPROD.xxxx.xx default_tgs_enctypes = RC4-HMAC DES-CBC-CRC DES-CBC-MD5 default_tkt_enctypes = RC4-HMAC DES-CBC-CRC DES-CBC-MD5 preferred_enctypes = RC4-HMAC DES-CBC-CRC DES-CBC-MD5 [realms] PREPROD.xxxx.xx = { kdc = 172.29.129.22 kdc = 172.29.129.21 kdc = 172.29.129.21 } The test conditions are reproduced by iptables -D OUTPUT -p tcp -d 172.29.129.22 -j REJECT And in such conditions, winbind does not switch to the other available ActiveDirectory node and tries to use the already established connections
Can you rerun the test also blocking UDP? The dc discovery uses UDP to detect if a dc is reachable. We still should fallback if only TCP is blocked, but it should only cause delays or temporary failures. If it doesn't recovery at all in this case we have an additional bug to fix.
I checked out 3.5.2-GIT-UNKNOWN-test yesterday and installed it on two servers in my dev environment. I have two AD servers in my lab. I made the linux servers leave the domain, and then re-join the domain. I did not login to the servers for an hour after doing so to make sure nothing got cached. I then rebooted the one AD server, waited 15 minutes after it came up, then rebooted the other AD server. I then waited another hour before logging in to the linux servers. the one server with 3.5.2-GIT has been working like a champ. no issues at all. The other gives this from the logs: [2010/04/02 17:59:08.331357, 1] winbindd/winbindd_ads.c:126(ads_cached_connection) ads_connect for domain AWESOME failed: No logon servers [2010/04/02 17:59:14.411718, 1] winbindd/winbindd_ads.c:126(ads_cached_connection) ads_connect for domain AWESOME failed: No logon servers [2010/04/02 18:01:08.946911, 1] winbindd/winbindd_ads.c:126(ads_cached_connection) ads_connect for domain AWESOME failed: No logon servers On further investigation, it appeared the ip it was trying to contact the AD server on was unreachable (I accidentally had the secondary AD server register a "private" ip on its other NIC into DNS, which neither linux server can hit). However it seemed as if winbind would not try to look at any other ips for that domain, like it was stuck on checking that ip until it was reachable, or winbind got restarted (which fixed the problem). I let that server sit for about an hour before I tried logging in, so it was trying to reach that IP for about an hour. The other ips advertised for the domain were reachable the entire time. Is this default behavior or a bug? Should winbind keep trying the ips listed from dns for the domain until it can talk to one instead of zoning in on one ip, or am I misunderstanding how its supposed to work?
After fixing the dns issue I decided to try this again, with all ips returned for the domain being reachable. one server worked great (this time the one that had the problem below is working great and the other one that worked fine before is having weird problems). Here is the problem: ssh logins hang, as it is trying to do group lookups. getent groups/passwd hangs. However, if I were to strace -p <pid> of the main winbind process, group lookups begin to work so long as I am attached that pid (I thought I was crazy at first but after 15 minutes of nothing but that I have completely verified that is the behavior I am seeing). I would be more than happy to take a video capture to show this. I wanted to get an strace to see where it was hanging, and everytime I attach to the processes, it immediately returns correct values. If I do not attach to the process via strace, getent group/passwd only returns local users. strace here: http://pastebin.com/dg08uzdp logs: winbindd_getgrent_done: done with 17 groups [2010/04/02 19:54:17.120958, 10] winbindd/winbindd.c:662(wb_request_done) wb_request_done[20109:GETGRENT]: NT_STATUS_OK [2010/04/02 19:54:17.122203, 10] winbindd/winbindd.c:723(winbind_client_response_written) winbind_client_response_written[20109:GETGRENT]: deliverd response to client [2010/04/02 19:54:17.124385, 10] winbindd/winbindd.c:600(process_request) process_request: Handling async request 20109:GETGRENT [2010/04/02 19:54:17.125035, 3] winbindd/winbindd_getgrent.c:51(winbindd_getgrent_send) [20109]: getgrent [2010/04/02 19:54:17.125669, 5] winbindd/winbindd_getgrent.c:149(winbindd_getgrent_recv) getgrent failed: NT_STATUS_NO_MORE_ENTRIES [2010/04/02 19:54:17.126308, 10] winbindd/winbindd.c:662(wb_request_done) wb_request_done[20109:GETGRENT]: NT_STATUS_NO_MORE_ENTRIES [2010/04/02 19:54:17.127456, 10] winbindd/winbindd.c:723(winbind_client_response_written) winbind_client_response_written[20109:GETGRENT]: deliverd response to client [2010/04/02 19:54:17.129106, 10] winbindd/winbindd.c:600(process_request) process_request: Handling async request 20109:ENDGRENT [2010/04/02 19:54:17.129893, 10] winbindd/winbindd.c:662(wb_request_done) wb_request_done[20109:ENDGRENT]: NT_STATUS_OK [2010/04/02 19:54:17.131047, 10] winbindd/winbindd.c:723(winbind_client_response_written) winbind_client_response_written[20109:ENDGRENT]: deliverd response to client [2010/04/02 19:54:17.132178, 6] winbindd/winbindd.c:823(winbind_client_request_read) closing socket 32, client exited
oh a restart of winbind fixes the issue
sorry for the many comments. that last issue seems to be a fluke. things have been working great using 3.5.2-test. However the issue before that can be reproduced (if one of those ips is unreachable winbind seems to hone in on just that ip and not try any other ips), I would love to know if that is a bug or not. Again sorry for the many comments, we have planning on using this for a large (1200+ server) deployment and this is the only bug holding us back. Thanks!
I confirm that on version 3.5.2-GIT-9620f5f-test, and blocking udp as well as tcp (as it were accidentally on my production site), it works as expected (switching to other active directory).
So is this one fixed in 3.5.2? I'd like to close this bug report.
I confirm it is ok in 3.5.2