7818 – Repeated winbind_cache.tdb corruption

Bug 7818 - Repeated winbind_cache.tdb corruption

Summary: Repeated winbind_cache.tdb corruption

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	Samba 3.5
Classification:	Unclassified
Component:	Winbind (show other bugs)
Version:	3.5.6
Hardware:	Other Linux

Importance:	P3 normal
Target Milestone:	---
Assignee:	Volker Lendecke
QA Contact:	Samba QA Contact

URL:	http://bugs.debian.org/cgi-bin/bugrep...
Keywords:

Depends on:
Blocks:

Reported:	2010-11-24 12:57 UTC by Debian samba package maintainers (PUBLIC MAILING LIST)
Modified:	2010-12-16 13:54 UTC (History)
CC List:	2 users (show)

See Also:

Attachments
Level 10 log at the moment the corruption happens (85.83 KB, text/plain) 2010-11-24 12:58 UTC, Debian samba package maintainers (PUBLIC MAILING LIST)	no flags	Details
Winbind level 10 logs at the moment the corruption happens (177.67 KB, application/x-compressed-tar) 2010-11-29 01:23 UTC, Debian samba package maintainers (PUBLIC MAILING LIST)	no flags	Details
Another set of logs (269.09 KB, application/x-gtar-compressed) 2010-12-03 10:59 UTC, Debian samba package maintainers (PUBLIC MAILING LIST)	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Debian samba package maintainers (PUBLIC MAILING LIST) 2010-11-24 12:57:50 UTC

From one of our Debian users:
Samba 3.5.6~dfsg-1 on Debian Squeeze.  Updates are current.

Ever since upgrading from 3.4.x to 3.5.x, winbind_cache.tdb is 
repeatedly corrupted, rendering authentication impossible.

I have a script that checks every 15 minutes for connectivity to the AD 
server and restarts samba and winbind if the connection is corrupted. 
This happens on every winbind system, so it is not system specific.

ls of this file in /var/cache/samba is as follows:
-rw-------  1 root root 139264 Nov 16 11:21 winbindd_cache.tdb
-rw-------  1 root root 946176 Nov 16 11:20 winbindd_cache.tdb.bak
-rw-------  1 root root 946176 Nov 16 09:16 winbindd_cache.tdb.bak.old

The corrupted tdb's are almost 7 times as large as the working version. 
 In 3.3.5, a winbind_cache.corrupt was generated at each restart after 
a corruption.  In 3.5.6, 2 backups are maintained instead.  This is the 
only difference I have noted between the two versions.

I notice this is repeated in the logs at the time of the corruption:

[2010/11/16 11:19:40.933366, 10] 
winbindd/winbindd_cache.c:4674(wcache_fetch_ndr)
   Entry has wrong sequence number: 1573542
[2010/11/16 11:19:40.933684,  1] winbindd/winbindd_util.c:289(trustdom_recv)
   Could not receive trustdoms

A level 10 log during a corruption is attached.

Comment 1 Debian samba package maintainers (PUBLIC MAILING LIST) 2010-11-24 12:58:54 UTC

Created attachment 6085 [details]
Level 10 log at the moment the corruption happens

Comment 2 Volker Lendecke 2010-11-24 14:44:50 UTC

These "entry has wrong sequence number" messages are expected if something changes on the AD DC, they are normal, they are not a failure. We need debug level 10 logs of the actually failed request. log.winbindd* and log.wb* (there's more than one winbind log file since version 3.2).

Volker

Comment 3 Debian samba package maintainers (PUBLIC MAILING LIST) 2010-11-29 01:23:51 UTC

Created attachment 6093 [details]
Winbind level 10 logs at the moment the corruption happens

Comment 4 Volker Lendecke 2010-11-30 02:20:11 UTC

What analysis has led you to the fact that the winbindd_cache.tdb is corrupt? Is it just the size of that file? If it is that, I would like to ask you to dig deeper with the tdbdump and tdbtool to really see where in what way the binary format of the tdb is corrupt. From the logs I do not see that the cache file is corrupt, please point out which lines show the winbindd_cache.tdb corruption.

What I do see is messages with NT_STATUS_NO_LOGON_SERVERS. This to me is a much more likely reason that authentication does not work anymore.

Please point out in a more detailed way how you came to the conclusion that winbindd_cache.tdb is corrupt and how this would lead to NT_STATUS_NO_LOGON_SERVERS.

With best regards,

Volker

Comment 5 Debian samba package maintainers (PUBLIC MAILING LIST) 2010-12-01 12:31:58 UTC

Volker, Dale (our user) added the following:

A corruption occured between 2:15 PM and 2:30 PM 11/28/2010.                                                                     

Also, a correction to original post - it should have stated:                                                                                               
In 3.5.5, a winbind_cache.corrupt was generated at each restart after a corruption, instead of saying 3.3.5.                                                                       


-->as Dale is seeing a winbind_cache.corrupt, maybe this is why he concludes that the file was corrupted.

Comment 6 Debian samba package maintainers (PUBLIC MAILING LIST) 2010-12-03 10:59:00 UTC

Created attachment 6102 [details]
Another set of logs

Some more info from Dale, our user:


Attaching a 2nd set of logs.  Previous was at log level 10 for winbind only.                                                                                                       
This set is full level 10 and from a different system.  As stated                                                                                                                  
previously, this problem affects all of my winbind systems and started                                                                                                             
after upgrading from 3.4.x to 3.5.x.                                                                                                                                               
Failure occurred between 14:30 and 14:45 on 12/02/2010.

Comment 7 Volker Lendecke 2010-12-03 14:15:41 UTC

To be honest, I'm pretty lost. You're saying that something stopped working between 14:30 and 14:45.

[2010/12/02 14:44:00.671845,  5] libads/ldap.c:226(ads_try_connect)
  ads_try_connect: sending CLDAP request to 67.215.65.132 (realm: delsolw2k.com)
[2010/12/02 14:44:02.674222, 10] lib/events.c:123(run_events)
  Running timed event "tevent_req_timedout" 0xb7b12608
[2010/12/02 14:44:04.676434, 10] lib/events.c:123(run_events)
  Running timed event "tevent_req_timedout" 0xb7b125a8
[2010/12/02 14:44:06.674126, 10] lib/events.c:123(run_events)
  Running timed event "tevent_req_timedout" 0xb7add410
[2010/12/02 14:44:06.674242,  2] libads/cldap.c:97(ads_cldap_netlogon)
  cldap_netlogon() failed: NT_STATUS_IO_TIMEOUT
[2010/12/02 14:44:06.674312,  3] libads/ldap.c:231(ads_try_connect)
  ads_try_connect: CLDAP request 67.215.65.132 failed.

This looks wrong. We should not time out within a few milliseconds...

But it is definitely a different problem from winbind corrupting the cache. I don't have any indication at all why winbind decided to call its cache corrupt. Do you have any indication in the logs? Could you upload winbindd_cache.tdb right after you decided to restart winbind?

Volker

Comment 8 Volker Lendecke 2010-12-06 02:40:18 UTC

Reassigning to Stefan Metzmacher for his opinion on the cldap timeout. Please re-assign when you've had the time to take a look.

Volker

Comment 9 Stefan Metzmacher 2010-12-06 03:13:55 UTC

The timeout for the whole operation is 6 seconds:

14:44:00.671845
=> 
14:44:06.674242

We just resend the packet every 2 seconds (with 2 retries),
that's why you see 3 timer events

I guess the dc was just not reachable.

Comment 10 Volker Lendecke 2010-12-06 03:17:50 UTC

Crap, I had only looked at the milliseconds...

Volker

Comment 11 Dale Schroeder 2010-12-07 13:58:22 UTC

(In reply to comment #10)
> Crap, I had only looked at the milliseconds...
> 
> Volker
> 
What I see is this:

[2010/12/02 14:44:00.671845,  5] libads/ldap.c:226(ads_try_connect)
  ads_try_connect: sending CLDAP request to 67.215.65.132 (realm:
delsolw2k.com)

whois says that 67.215.65.132 belongs to OpenDNS.  It is not any ip of the domain.  We do use OpenDNS as a forwarder, but this is not an ip of their dns servers.  I have no idea why this address is being queried for local domain info.

Comment 12 Volker Lendecke 2010-12-09 09:15:45 UTC

Samba relies on the normal system DNS resolving routines to look up the IP address of the Active Directory Domain Controller. This is because we do not want to invent that as well, Samba is already a very large project. So if the system DNS resolving routines tell us the Domain Controller is at IP Address 67.215.65.132, we try to connect to that. If there is no DC at that address, the natural consequence is that your authentication ceases to work.

Please make sure that the system DNS routines resolve the correct IP addresses for your Active Directory Domain Controllers. One way to do this is to put the IP address of a Active Directory Domain Controller that also carries a DNS server into the configuration line

nameserver <ip-address>

in the file /etc/resolv.conf. The exact way to configure this will depend on the exact version of Unix you are using. Please also make sure that no DHCP or BOOTP client program will change the settings in the file /etc/resolv.conf. One way to do this on GNU/Linux is to set this file immutable by issuing

chattr +i /etc/resolv.conf

if it is not possible to disable the dhcp client from attempting to change the /etc/resolv.conf file.

I'm closing this bug as WORKSFORME. Please re-open if you still have that issue after making sure that your DNS configuration is stable.

With best regards,

Volker Lendecke

Comment 13 Dale Schroeder 2010-12-16 13:54:45 UTC

(In reply to comment #12)
> Samba relies on the normal system DNS resolving routines to look up the IP
> address of the Active Directory Domain Controller. This is because we do not
> want to invent that as well, Samba is already a very large project. So if the
> system DNS resolving routines tell us the Domain Controller is at IP Address
> 67.215.65.132, we try to connect to that. If there is no DC at that address,
> the natural consequence is that your authentication ceases to work.
> 
> Please make sure that the system DNS routines resolve the correct IP addresses
> for your Active Directory Domain Controllers. One way to do this is to put the
> IP address of a Active Directory Domain Controller that also carries a DNS
> server into the configuration line
> 
> nameserver <ip-address>
> 
> in the file /etc/resolv.conf. The exact way to configure this will depend on
> the exact version of Unix you are using. Please also make sure that no DHCP or
> BOOTP client program will change the settings in the file /etc/resolv.conf. One
> way to do this on GNU/Linux is to set this file immutable by issuing
> 
> chattr +i /etc/resolv.conf
> 
> if it is not possible to disable the dhcp client from attempting to change the
> /etc/resolv.conf file.
> 
> I'm closing this bug as WORKSFORME. Please re-open if you still have that issue
> after making sure that your DNS configuration is stable.
> 
> With best regards,
> 
> Volker Lendecke
> 

The problem was not /etc/resolve.conf.  Adding the DC to /etc/resolv.conf had no effect, and there was no DHCP or BOOTP process changing the settings in resolv.conf.  The problem turned out to be a sync problem between the master and slave DNS servers.  The serial number of the reverse DNS zone (PTR records) in the DC was not the same as that in the slave servers.  After forcing the master/slaves to resync serial numbers, there have been no more winbind hangs.

Note that this problem has not affected any Windows systems nor Samba systems prior to 3.5.x.  Earlier versions of Samba had no problem with this DNS error.  A Lenny server running 3.2.5 ran flawlessly the entire time.  A look at archived logs show that this problem has existed for some time, long before the 3.5 series was released, but caused no visible problem until upgrading to 3.5.x.  I have to assume something has changed in the way that winbind works relative to DNS queries.

Thanks for leading me in the right direction.

Dale