From one of our Debian users: Samba 3.5.6~dfsg-1 on Debian Squeeze. Updates are current. Ever since upgrading from 3.4.x to 3.5.x, winbind_cache.tdb is repeatedly corrupted, rendering authentication impossible. I have a script that checks every 15 minutes for connectivity to the AD server and restarts samba and winbind if the connection is corrupted. This happens on every winbind system, so it is not system specific. ls of this file in /var/cache/samba is as follows: -rw------- 1 root root 139264 Nov 16 11:21 winbindd_cache.tdb -rw------- 1 root root 946176 Nov 16 11:20 winbindd_cache.tdb.bak -rw------- 1 root root 946176 Nov 16 09:16 winbindd_cache.tdb.bak.old The corrupted tdb's are almost 7 times as large as the working version. In 3.3.5, a winbind_cache.corrupt was generated at each restart after a corruption. In 3.5.6, 2 backups are maintained instead. This is the only difference I have noted between the two versions. I notice this is repeated in the logs at the time of the corruption: [2010/11/16 11:19:40.933366, 10] winbindd/winbindd_cache.c:4674(wcache_fetch_ndr) Entry has wrong sequence number: 1573542 [2010/11/16 11:19:40.933684, 1] winbindd/winbindd_util.c:289(trustdom_recv) Could not receive trustdoms A level 10 log during a corruption is attached.
Created attachment 6085 [details] Level 10 log at the moment the corruption happens
These "entry has wrong sequence number" messages are expected if something changes on the AD DC, they are normal, they are not a failure. We need debug level 10 logs of the actually failed request. log.winbindd* and log.wb* (there's more than one winbind log file since version 3.2). Volker
Created attachment 6093 [details] Winbind level 10 logs at the moment the corruption happens
What analysis has led you to the fact that the winbindd_cache.tdb is corrupt? Is it just the size of that file? If it is that, I would like to ask you to dig deeper with the tdbdump and tdbtool to really see where in what way the binary format of the tdb is corrupt. From the logs I do not see that the cache file is corrupt, please point out which lines show the winbindd_cache.tdb corruption. What I do see is messages with NT_STATUS_NO_LOGON_SERVERS. This to me is a much more likely reason that authentication does not work anymore. Please point out in a more detailed way how you came to the conclusion that winbindd_cache.tdb is corrupt and how this would lead to NT_STATUS_NO_LOGON_SERVERS. With best regards, Volker
Volker, Dale (our user) added the following: A corruption occured between 2:15 PM and 2:30 PM 11/28/2010. Also, a correction to original post - it should have stated: In 3.5.5, a winbind_cache.corrupt was generated at each restart after a corruption, instead of saying 3.3.5. -->as Dale is seeing a winbind_cache.corrupt, maybe this is why he concludes that the file was corrupted.
Created attachment 6102 [details] Another set of logs Some more info from Dale, our user: Attaching a 2nd set of logs. Previous was at log level 10 for winbind only. This set is full level 10 and from a different system. As stated previously, this problem affects all of my winbind systems and started after upgrading from 3.4.x to 3.5.x. Failure occurred between 14:30 and 14:45 on 12/02/2010.
To be honest, I'm pretty lost. You're saying that something stopped working between 14:30 and 14:45. [2010/12/02 14:44:00.671845, 5] libads/ldap.c:226(ads_try_connect) ads_try_connect: sending CLDAP request to 67.215.65.132 (realm: delsolw2k.com) [2010/12/02 14:44:02.674222, 10] lib/events.c:123(run_events) Running timed event "tevent_req_timedout" 0xb7b12608 [2010/12/02 14:44:04.676434, 10] lib/events.c:123(run_events) Running timed event "tevent_req_timedout" 0xb7b125a8 [2010/12/02 14:44:06.674126, 10] lib/events.c:123(run_events) Running timed event "tevent_req_timedout" 0xb7add410 [2010/12/02 14:44:06.674242, 2] libads/cldap.c:97(ads_cldap_netlogon) cldap_netlogon() failed: NT_STATUS_IO_TIMEOUT [2010/12/02 14:44:06.674312, 3] libads/ldap.c:231(ads_try_connect) ads_try_connect: CLDAP request 67.215.65.132 failed. This looks wrong. We should not time out within a few milliseconds... But it is definitely a different problem from winbind corrupting the cache. I don't have any indication at all why winbind decided to call its cache corrupt. Do you have any indication in the logs? Could you upload winbindd_cache.tdb right after you decided to restart winbind? Volker
Reassigning to Stefan Metzmacher for his opinion on the cldap timeout. Please re-assign when you've had the time to take a look. Volker
The timeout for the whole operation is 6 seconds: 14:44:00.671845 => 14:44:06.674242 We just resend the packet every 2 seconds (with 2 retries), that's why you see 3 timer events I guess the dc was just not reachable.
Crap, I had only looked at the milliseconds... Volker
(In reply to comment #10) > Crap, I had only looked at the milliseconds... > > Volker > What I see is this: [2010/12/02 14:44:00.671845, 5] libads/ldap.c:226(ads_try_connect) ads_try_connect: sending CLDAP request to 67.215.65.132 (realm: delsolw2k.com) whois says that 67.215.65.132 belongs to OpenDNS. It is not any ip of the domain. We do use OpenDNS as a forwarder, but this is not an ip of their dns servers. I have no idea why this address is being queried for local domain info.
Samba relies on the normal system DNS resolving routines to look up the IP address of the Active Directory Domain Controller. This is because we do not want to invent that as well, Samba is already a very large project. So if the system DNS resolving routines tell us the Domain Controller is at IP Address 67.215.65.132, we try to connect to that. If there is no DC at that address, the natural consequence is that your authentication ceases to work. Please make sure that the system DNS routines resolve the correct IP addresses for your Active Directory Domain Controllers. One way to do this is to put the IP address of a Active Directory Domain Controller that also carries a DNS server into the configuration line nameserver <ip-address> in the file /etc/resolv.conf. The exact way to configure this will depend on the exact version of Unix you are using. Please also make sure that no DHCP or BOOTP client program will change the settings in the file /etc/resolv.conf. One way to do this on GNU/Linux is to set this file immutable by issuing chattr +i /etc/resolv.conf if it is not possible to disable the dhcp client from attempting to change the /etc/resolv.conf file. I'm closing this bug as WORKSFORME. Please re-open if you still have that issue after making sure that your DNS configuration is stable. With best regards, Volker Lendecke
(In reply to comment #12) > Samba relies on the normal system DNS resolving routines to look up the IP > address of the Active Directory Domain Controller. This is because we do not > want to invent that as well, Samba is already a very large project. So if the > system DNS resolving routines tell us the Domain Controller is at IP Address > 67.215.65.132, we try to connect to that. If there is no DC at that address, > the natural consequence is that your authentication ceases to work. > > Please make sure that the system DNS routines resolve the correct IP addresses > for your Active Directory Domain Controllers. One way to do this is to put the > IP address of a Active Directory Domain Controller that also carries a DNS > server into the configuration line > > nameserver <ip-address> > > in the file /etc/resolv.conf. The exact way to configure this will depend on > the exact version of Unix you are using. Please also make sure that no DHCP or > BOOTP client program will change the settings in the file /etc/resolv.conf. One > way to do this on GNU/Linux is to set this file immutable by issuing > > chattr +i /etc/resolv.conf > > if it is not possible to disable the dhcp client from attempting to change the > /etc/resolv.conf file. > > I'm closing this bug as WORKSFORME. Please re-open if you still have that issue > after making sure that your DNS configuration is stable. > > With best regards, > > Volker Lendecke > The problem was not /etc/resolve.conf. Adding the DC to /etc/resolv.conf had no effect, and there was no DHCP or BOOTP process changing the settings in resolv.conf. The problem turned out to be a sync problem between the master and slave DNS servers. The serial number of the reverse DNS zone (PTR records) in the DC was not the same as that in the slave servers. After forcing the master/slaves to resync serial numbers, there have been no more winbind hangs. Note that this problem has not affected any Windows systems nor Samba systems prior to 3.5.x. Earlier versions of Samba had no problem with this DNS error. A Lenny server running 3.2.5 ran flawlessly the entire time. A look at archived logs show that this problem has existed for some time, long before the 3.5 series was released, but caused no visible problem until upgrading to 3.5.x. I have to assume something has changed in the way that winbind works relative to DNS queries. Thanks for leading me in the right direction. Dale