6103 – network outage recovery for winbind

Status:	NEW

Alias:	None

Product:	Samba 3.2
Classification:	Unclassified
Component:	Winbind
Version:	3.2.8
Hardware:	x86 Linux

Importance:	P3 normal
Target Milestone:	---
Assignee:	Jeremy Allison
QA Contact:	Samba QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2009-02-11 19:15 UTC by Jason Haar
Modified:	2009-10-02 04:02 UTC
CC List:	2 users

See Also:

Description Jason Haar 2009-02-11 19:15:50 UTC

Hi there

It appears I'm being an adventurous soul by running a 100% winbind-based account system on my Fedora 10 laptop (samba-3.2.8). i.e. I log in using Active Directory credentials, and a /home/DOMAIN/username/ homedir is automagically created, etc. I have cached credentials enabled both via nscd and inside winbind so my creds work when I'm offline. End result: true WinXP laptop experience (that's a compliment BTW ;-)

Works really well - until I suspend and take it home. After unsuspending, sometimes I can unlock my screensaver - and sometimes I can't. Looks to me like winbind gets confused by hanging TCP sessions to domain controllers (i.e. they were working before the suspend) and can't recover. If I log in on the TTY console as root (i.e. the only non-AD based account) and take a look, I can see winbind has these TIME_WAIT and even ESTABLISHED TCP connections to DCs - but they aren't real, as the work network and DCs aren't actually available.

So far this isn't really a Samba problem - these are standard side effects of suspending. Anyway, the issue is that winbind doesn't recover. If I actually restart winbind it very quickly realizes the remote DCs aren't there and I can log in using my AD creds again (using cached creds).

So my request is that I shouldn't need to restart winbind for this to happen. Perhaps some timer needs to be set in winbind that makes it recheck its connectivity and go into offline mode more effectively?

Thanks

Jason

Comment 1 Jeremy Allison 2009-02-11 19:25:15 UTC

winbindd should really notice these connections are out, and then go into offline mode as soon as it tries to write any data to the DC down them.

Can you get me a debug level 10 log of this situation ?

Jeremy.

Comment 2 Jason Haar 2009-02-23 22:47:39 UTC

(In reply to comment #1)
> winbindd should really notice these connections are out, and then go into
> offline mode as soon as it tries to write any data to the DC down them.
> 
> Can you get me a debug level 10 log of this situation ?
> 
> Jeremy.
> 
Hi Jeremy

I reset the debug level the day you sent this email - and of course the problem hasn't occurred since - typical.

Just now I also realized I'd put in compensatory "restart winbind" scripts all over the place that have probably been covering up the issue, so I've disabled them and we should see a result - well - sometime :-)

Thanks

Jason

Comment 3 Eric Kerin 2009-09-28 10:35:21 UTC

I ran into a problem with similar circumstances with Fedora 11 and Samba 3.3.2 (as well as 3.4.1) -

I had problems with requests that returned no result (getent passwd <username> returned no values) as well as what looked like infinite loops when trying to go online (getent passwd calls would just hang, seemingly forever). These were transient problems: after 2 calls to getent passwd, it would start returning the correct values.

I tracked it down to two major issues:
1. Going Offline->Online, and Online->Offline - When you suspend and change networks, samba never instructs the resolver to reload its config (/etc/resolv.conf) upon coming back to life, so it's still sending DNS requests to the servers from the old network. No matter how much it tries, it can't seem to recover when all its DNS queries fail.

2. When going Online->Offline (and after #1 was fixed) - Requests are sent, it realizes it needs to go offline, and sets the appropriate status. Result of the call is still a failure, since the online status check is done before all this.
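The ordering problem in item 2 can be illustrated with a small simulation. This is only a sketch: the struct and function names below are hypothetical and do not come from the Samba source. It contrasts a lookup that checks the online flag once, up front, with one that re-checks the flag after a failed request and falls back to the cached entry.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a lookup backend. None of these
 * names are taken from winbindd. */
struct backend {
    bool online;         /* current online/offline flag */
    bool dc_reachable;   /* whether the DC actually answers */
    const char *cached;  /* cached (possibly expired) entry, or NULL */
};

/* Stand-in for a request to the DC. A failed request flips the
 * backend to offline, as described in item 2. */
static const char *query_dc(struct backend *b)
{
    if (!b->dc_reachable) {
        b->online = false;
        return NULL;
    }
    return "fresh";
}

/* Status is checked only before the request, so the first call after
 * an outage fails even though a cached entry exists. */
const char *lookup_naive(struct backend *b)
{
    if (!b->online) {
        return b->cached;
    }
    return query_dc(b);
}

/* Same flow, plus a post-failure check: if the request failed and the
 * backend has gone offline in the meantime, use the cache. */
const char *lookup_with_recheck(struct backend *b)
{
    const char *result;

    if (!b->online) {
        return b->cached;
    }
    result = query_dc(b);
    if (result == NULL && !b->online) {
        return b->cached;
    }
    return result;
}
```

With the naive version, the first call after the outage returns nothing and only the next call sees the offline flag, which would be consistent with lookups starting to work after a retry.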

I was able to work around these two issues with a few res_init() calls before getaddrinfo(), and by adding a post-failure check to one method to tell whether the status has gone to offline, then re-checking for a cached entry (since the cache lookup will now return the cached value, even though it has expired).
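For reference, the res_init() part of that workaround might look roughly like the following. This is a minimal sketch, not the actual patch: the helper name resolve_with_fresh_config is made up for illustration, and where such a call would belong in winbindd is not shown in this report.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <netdb.h>
#include <resolv.h>

/* Hypothetical helper: re-read the resolver configuration
 * (/etc/resolv.conf), then resolve host and service.
 * Returns 0 on success, otherwise the getaddrinfo() error code.
 * On success the caller must release *result with freeaddrinfo(). */
int resolve_with_fresh_config(const char *host, const char *service,
                              struct addrinfo **result)
{
    struct addrinfo hints;

    /* Reinitialize resolver state so nameservers remembered from the
     * previous network are dropped. A failure here is only reported;
     * the lookup is still attempted. */
    if (res_init() != 0) {
        fprintf(stderr, "res_init() failed\n");
    }

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP, e.g. LDAP on port 389 */

    return getaddrinfo(host, service, &hints, result);
}
```

A caller would pass the DC name and a port such as "389", check for a zero return value, and translate any other value with gai_strerror().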

I have a proof of concept patch that I can send if interested (it has only fixed the problems in code paths I've run into while testing). I'll be working on a cleaner patch though that fixes it in the rest of the cases.

Comment 4 Jason Haar 2009-10-02 04:02:59 UTC

(In reply to comment #2)

> Hi Jeremy
> 
> I reset the debug level the day you sent this email - and of course the problem
> hasn't occurred since - typical.
> 

Hi Jeremy

It just happened again. I was at work on Ethernet, suspended the laptop and brought it home. Upon opening the lid, it connected to my home wireless network (i.e. an interface change occurred), but the screensaver password prompt didn't even appear! In the end I had to Ctrl-Alt-F2 and log into a console as root. syslog showed the following:

Oct  2 21:26:15 tnz-jhaar-dell gnome-screensaver-dialog: pam_unix(gnome-screensaver:auth): authentication failure; logname= uid=16777216 euid=16777216 tty=:0.0 ruser= rhost=  user=jhaar
Oct  2 21:26:15 tnz-jhaar-dell gnome-screensaver-dialog: pam_winbind(gnome-screensaver:auth): getting password (0x00000210)
Oct  2 21:26:15 tnz-jhaar-dell gnome-screensaver-dialog: pam_winbind(gnome-screensaver:auth): pam_get_item returned a password
Oct  2 21:26:15 tnz-jhaar-dell gnome-screensaver-dialog: pam_winbind(gnome-screensaver:auth): request wbcLogonUser failed: WBC_ERR_AUTH_ERROR, PAM error: PAM_AUTH_ERR (7), NTSTATUS: NT_STATUS_LOGON_FAILURE, Error message was: Logon failure
Oct  2 21:26:15 tnz-jhaar-dell gnome-screensaver-dialog: pam_winbind(gnome-screensaver:auth): user 'jhaar' denied access (incorrect password or invalid membership)
Oct  2 21:26:15 tnz-jhaar-dell gnome-screensaver-dialog: gkr-pam: unlocked 'login' keyring

Sounds to me like winbind totally lost the plot and lost my username - therefore gnome-screensaver couldn't show me an "enter password" prompt as it didn't know who I was? Even slogin wouldn't work for me, and "id" showed 16777216 and not "jhaar".

Anyway, I shut down winbind and copied the /var/log/samba tree away for your perusal. Then I restarted winbind and hit another bug - which perhaps is related...

I had a VPN back to work in place so winbind should have connected to the DCs, but it appears one of our Dell DRAC cards on a DC is misconfigured in such a way that the DC is broadcasting an invalid 169.254.x.y address as being "it". Even though our domain has 8+ DCs, for some mad reason winbind kept insisting it would only connect to the one address that was unavailable - the "fake" 169.254 address. (I ran strace on "winbind -F -S -d 9" and could see in the strace logs it only tried connecting to 169.254.x.y). 

15021 write(1, "check_negative_conn_cache returni"..., 97) = 97
15021 write(1, "get_dc_list: returning 1 ip addre"..., 57) = 57
15021 write(1, "get_dc_list: 169.254.172.37:389 \n"..., 33) = 33

That "get_dc_list" is strange. A simple nslookup for the AD domain name returns 8+ IP addresses - I know "get_dc_list" isn't necessarily the same thing - but it is greater than 1...
 

"wbinfo -p" would work, but "wbinfo -t" always returned trust errors. In the end I added a "reject" route for that whole network and restarted winbind. That time it tried yet again to connect to the 169.254 address - but this time it received a "network is unreachable" error and immediately tried a working DC instead. Problem fixed.
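As an aside, the address class involved here (IPv4 link-local, 169.254.0.0/16) is easy to recognise programmatically. The following is only an illustrative sketch with a hypothetical function name; nothing in this report says winbind performs or should perform such a check.

```c
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper: returns 1 if addr is an IPv4 link-local address
 * (169.254.0.0/16), 0 if it is some other IPv4 address, and -1 if it
 * cannot be parsed as dotted-quad IPv4. */
int is_ipv4_link_local(const char *addr)
{
    struct in_addr in;
    uint32_t host_order;

    if (inet_pton(AF_INET, addr, &in) != 1) {
        return -1;
    }

    /* Compare the top 16 bits against 169.254 (0xA9FE). */
    host_order = ntohl(in.s_addr);
    return (host_order & 0xFFFF0000u) == 0xA9FE0000u;
}
```

Applied to the address from the strace output above, is_ipv4_link_local("169.254.172.37") returns 1.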

Let me know which /var/log/samba logs you want - you can have the lot if you like (if I can mark them private? Don't want to give away internal details too widely)

Thanks

Jason