Bug 1208 - winbind in ads-security fails to reconnect (Invalid credentials)
Summary: winbind in ads-security fails to reconnect (Invalid credentials)
Status: RESOLVED FIXED
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: winbind (show other bugs)
Version: 3.0.2a
Hardware: All Linux
: P3 major
Target Milestone: none
Assignee: Guenther Deschner
QA Contact:
URL:
Keywords:
: 2558 (view as bug list)
Depends on:
Blocks:
 
Reported: 2004-03-22 16:06 UTC by Guenther Deschner
Modified: 2006-04-08 11:43 UTC (History)
8 users (show)

See Also:


Attachments
saves expire time at ticket acquisition time, and checks before using cached connection in winbind (4.10 KB, patch)
2004-03-23 13:16 UTC, Jim McDonough
no flags Details
log level 10 winbind-log (19.04 KB, application/octet-stream)
2004-03-26 02:56 UTC, Guenther Deschner
no flags Details
svn diff against 3_0 for checking and renewal of service tickets in cli_krb5_get_ticket (ads_krb5_mk_req) (1.28 KB, patch)
2004-04-08 07:47 UTC, Guenther Deschner
no flags Details
New version of re-obtaining service-tickets-patch (1.76 KB, patch)
2004-04-14 08:44 UTC, Guenther Deschner
no flags Details
Patch for heimdal-0.6.1 to implement krb5_cc_remove_cred for files-ccaches (fcc_remove_cred) - provided by Sumit Bose (1.18 KB, patch)
2004-04-14 08:47 UTC, Guenther Deschner
no flags Details
svn diff of libsmb/clikrb5.c (982 bytes, patch)
2004-05-03 08:13 UTC, Guenther Deschner
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Guenther Deschner 2004-03-22 16:06:43 UTC
After a certain period of idle-time (from what I have seen: after the 10hours of
default ticket lifetime), winbindd does not reconnect successfully to ADS and
winbind's kinit fails continously with "Invalid credentials". After winbindd is
restarted, operation continues as expected.

This might or might not be related to #1185, although please note, that winbindd
does not crash, it just cannot respond to queries until restarted. 

The recent fixes for #1185 (and #1195) have not yet been tested.

Environment:
SuSE Linux Enterprise Server (SP3),
samba3-rpms from SerNet (including static heimdal-0.6-20031214),
Windows 2003 in native mode

for now I can only provide log level 3 from winbind:
----------------------------------------------------

[2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)]
nsswitch/winbindd_ads.c:trusted_domains(852)
  ads: trusted_domains
[2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)]
libads/ldap.c:ads_connect(218)
  Connected to LDAP server 172.25.65.236
[2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)]
libads/ldap.c:ads_server_info(2030)
  got ldap server name dc1@MY.REALM, using bind path: dc=MY,dc=REALM
[2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)]
nsswitch/winbindd_cm.c:cm_get_ipc_userpass(107)
  IPC$ connections done anonymously
[2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)]
libsmb/cliconnect.c:cli_start_connection(1337)
  Connecting to host=DC1
[2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)]
lib/util_sock.c:open_socket_out(710)
  Connecting to 172.25.65.236 at port 445
[2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)]
libsmb/cliconnect.c:cli_session_setup_spnego(676)
  Doing spnego session setup (blob length=118)
[2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)]
libsmb/cliconnect.c:cli_session_setup_spnego(701)
  got OID=1 2 840 48018 1 2 2
[2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)]
libsmb/cliconnect.c:cli_session_setup_spnego(701)
  got OID=1 2 840 113554 1 2 2
[2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)]
libsmb/cliconnect.c:cli_session_setup_spnego(701)
  got OID=1 2 840 113554 1 2 2 3
[2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)]
libsmb/cliconnect.c:cli_session_setup_spnego(701)
  got OID=1 3 6 1 4 1 311 2 2 10
[2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)]
libsmb/cliconnect.c:cli_session_setup_spnego(708)
  got principal=dc1$@MY.REALM
[2004/03/19 06:05:31, 2, pid=4032, effective(0, 0), real(0, 0)]
libsmb/cliconnect.c:cli_session_setup_kerberos(510)
[2004/03/19 06:05:31, 2, pid=4032, effective(0, 0), real(0, 0)]
libsmb/cliconnect.c:cli_session_setup_kerberos(510)
  Doing kerberos session setup
[2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)]
nsswitch/winbindd_misc.c:winbindd_interface_version(261)
  [ 5841]: request interface version
[2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)]
nsswitch/winbindd_misc.c:winbindd_priv_pipe_dir(297)
  [ 5841]: request location of privileged pipe
[2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)]
nsswitch/winbindd_misc.c:winbindd_domain_info(210)
  [ 5841]: domain_info [MY.REALM]
[2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)]
nsswitch/winbindd_user.c:winbindd_getpwnam(122)
  [ 5841]: getpwnam my_realm+gf0047$
[2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)]
nsswitch/winbindd_ads.c:sequence_number(812)
  ads: fetch sequence_number for MY_REALM
[2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)]
libads/ldap.c:ads_do_paged_search(451)
  ldap_search_ext_s((objectclass=*)) -> Can't contact LDAP server
[2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)]
libads/ldap_utils.c:ads_do_search_retry(66)
  Reopening ads connection to realm 'MY.REALM' after error Can't contact LDAP server
[2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)]
libads/ldap.c:ads_connect(218)
  Connected to LDAP server 172.25.65.236
[2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)]
libads/ldap.c:ads_server_info(2030)
  got ldap server name dc1@MY.REALM, using bind path: dc=MY,dc=REALM
[2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)]
libads/sasl.c:ads_sasl_spnego_bind(187)
  got OID=1 2 840 48018 1 2 2
[2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)]
libads/sasl.c:ads_sasl_spnego_bind(187)
  got OID=1 2 840 113554 1 2 2
[2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)]
libads/sasl.c:ads_sasl_spnego_bind(187)
  got OID=1 2 840 113554 1 2 2 3
[2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)]
libads/sasl.c:ads_sasl_spnego_bind(187)
  got OID=1 3 6 1 4 1 311 2 2 10
[2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)]
libads/sasl.c:ads_sasl_spnego_bind(194)
  got principal=dc1$@MY.REALM
[2004/03/19 06:08:52, 1, pid=4032, effective(0, 0), real(0, 0)]
libads/ldap_utils.c:ads_do_search_retry(77)
  ads_search_retry: failed to reconnect (Invalid credentials)
[2004/03/19 06:08:52, 1, pid=4032, effective(0, 0), real(0, 0)]
nsswitch/winbindd_user.c:winbindd_getpwnam(157)
  user 'gf0047$' does not exist
Comment 1 Gerald (Jerry) Carter (dead mail address) 2004-03-23 05:12:56 UTC
Jim,  you're already working on this right ?
I thought a report about this might have already been 
filed, but I can't find it now.
Comment 2 Jim McDonough 2004-03-23 13:16:16 UTC
Created attachment 452 [details]
saves expire time at ticket acquisition time, and checks before using cached connection in winbind

Needs testing overnight.
Comment 3 Jim McDonough 2004-03-24 09:34:16 UTC
In addition to the above patch, I also added code to destroy the old tickets,
because it wouldn't acquire new ones without it.  Ticket renewal wasn't really
an option, because we would need to 1) be awake before they expired to make the
request 2) acquire new ones every week anyway, since that's the AD limit.
Comment 4 Guenther Deschner 2004-03-26 02:54:42 UTC
Even with all latest fixes from 3_0 winbindd in ads-mode dies for me after a
certain period. 

pretty much a basic smb.conf:
[global]
        workgroup = BERSMBLAB
        realm = BER.SMB-LAB.NET
        password server = ads
        security = ads
        unix charset = LOCALE
        idmap uid = 1000-999999
        idmap gid = 1000-999999
        log level = 10
        panic action = /root/backtrace.sh %d

At 00:35 suddently signature errors appear, signing worked fine at 00:30 (not in
the attachment).
Comment 5 Guenther Deschner 2004-03-26 02:56:05 UTC
Created attachment 457 [details]
log level 10 winbind-log
Comment 6 Jim McDonough 2004-03-27 05:57:35 UTC
Ok, I'm not conviced that this failure you're seeing is related.  I don't really
know about the signing errors, but the segfault is definitely not from this. 
However, you should make clean and re-make, just to verify that's not just from
some fn() that changed parms along the way, and if it persists, open a separate
bug on the segfault.

The signing error might have been fixed already, per Jerry, but I don't know for
sure.  Unfortunately I can't feasably build a system that has only specific
fixes, so I have to ask if you can recreate it on a straight cvs system.

For the others that have added themselves to the cc: list on this bug, please
test it out.  I'm convinced that I at least fixed some part of it, but not
100% sure I got all cases.  Report back your findings here, good or bad.

Thanks,
Jim
Comment 7 Jon Noack 2004-04-01 04:17:43 UTC
This didn't fix it for me (FreeBSD 5.2.1, Samba 3.0.2a + patches for bug 1195
and 1208).  I have a cron job running "wbinfo -t" every 5 minutes.  It works
fine but still fails after exactly 10 hours.  Note that I modified the patches
attached to the bugs to represent what was actually committed.  To view the patches:
http://www.noacks.org/winbind/

If there are problems with the patches or other fixes needed for these to work,
please let me know (I'm hesitant to use a snapshot as this is in production).
Comment 8 Jon Noack 2004-04-06 11:59:49 UTC
3.0.3pre2 was just committed to FreeBSD ports.  I'll give it a shot and report back.
Comment 9 Jon Noack 2004-04-07 07:14:15 UTC
Nope, I have the same problem with 3.0.3pre2.  Is my testing technique flawed
(checking 'wbinfo -t' every 5 minutes)?  I still dies after exactly 10 hours...

In the process of setting this up I added these to krb5.conf:
    ticket_lifetime = 36000
    renew_lifetime = 604800

Could the issue be that the ticket is renewable but we're not trying to renew? 
In other words, should I take out the "renew_lifetime" entry?  It's a simple
change so I'll give it a shot and report back after 9:15pm CDT.
Comment 10 Guenther Deschner 2004-04-08 07:45:51 UTC
The main problem while fixing this is that we do only renew
ticket-granting-tickets (tgts) and ignore service-tickets completely. 

While the reaquisition of a tgt might work now (although I never tested winbindd
getting a new tgt), service tickets most commonly expire before the end of
validity of a tgt (my test server uses a tgt-lifetime of 10 hours and service
tickets of 10 min.). So while ads_cached_connection still thinks it has a valid
ticket (which is only true for the tgt), an expired service-ticket is used that
cannot be renewed.

The attached patch basically adds expiry-checking (and renewal) for
service-tickets. It works for me after a few tests. It still needs to honour
clock-skew, I guess (this is missing for renewal of tgts as well, I suppose).

Levering severity to "major" because quite some people are hit by this (see main
samba-mailinglist).

This needs further testing.
Comment 11 Guenther Deschner 2004-04-08 07:47:35 UTC
Created attachment 465 [details]
svn diff against 3_0 for checking and renewal of service tickets in cli_krb5_get_ticket (ads_krb5_mk_req)
Comment 12 Jon Noack 2004-04-08 08:47:23 UTC
Removing "renew_lifetime" had no effect (as I expected).  I'll try Guenther's
patch and report back...
Comment 13 Jon Noack 2004-04-09 01:10:41 UTC
With Guenther's patch everything has been working successfully for over 16 hours
now.  My testing first involved 'wbinfo -t' every 5 minutes via cron and then I
just enabled Squid proxy auth (after 15 hours and no restart of winbindd). 
Everything looks good so far!  We have a long weekend for Easter, so I won't be
able to load test this until Tuesday.  However, I will leave Squid proxy auth
enabled and see how it fairs in the mean time.  I am highly excited by this....
Comment 14 Jon Noack 2004-04-13 13:34:27 UTC
winbindd has been up for over 5 days now and we're still authenticating.  It
handled my 400 users today without breaking a sweat -- I think Guenther's patch
is a winner...
Comment 15 Jim McDonough 2004-04-13 13:54:47 UTC
I've got 2 concerns here:
1) I couldn't reproduce the problem once my patch was on.  I can see logically
how it might occur, but on my system it doesn't
2) there is still definitely a window between when we check the validity of the
ticket and when we use it.

#1 isn't such a big deal.  I trust that since multiple people have found that it
fixed it, the patch is helping.  It just makes it harder for me to truly test it.

#2 has a couple of options:
- We can check for ticket expiration errors after using the ticket, which means
everyone that calls the stuff we've fixed has to be changed.
- We can introduce a fudge factor, say, a minute, or something similar, that
says if the ticket will expire in x time, go ahead and get new tickets.

The latter is easier to do, though not as elegant, and would just add a few
bytes to the existing 2 patches.
Comment 16 Jim McDonough 2004-04-13 13:56:00 UTC
Oh, one more thing.  Let me know in two more days....assuming you have the
default renewal time of 7 days set in your domain security policy.
Comment 17 Jon Noack 2004-04-13 15:24:45 UTC
The test I did was the every 5 minute 'wbinfo -t'.  Without Guenther's patch,
this always failed after 10 hours with an access denied message.  With
Guenther's patch, things appear to be OK.

I also thought about the 7 day issue on my way home from work -- I'll be sure to
report back on Thursday.  Note that I am still running without a
"renew_lifetime" entry in krb5.conf, so tickets shouldn't be renewable anyway
(confirmed with a 'klist -v' after a 'kinit').  As of now, none of the systems
involved have been rebooted.  Barring any critical security patches, I'll keep
it that way.
Comment 18 Jon Noack 2004-04-13 17:19:29 UTC
Sorry folks (and MS04-12, MS04-14, etc.):
http://www.microsoft.com/technet/security/bulletin/ms04-011.mspx

I can't wait on this one -- I'll let you know if it survives a week when (if?) I
can leave the servers up that long.
Comment 19 Jon Noack 2004-04-13 23:06:58 UTC
For what it's worth, everything still works after rebooting the Windows domain
controllers -- it was not necessary to restart Samba.  Note that I have 2 'kdc'
entries in krb5.conf and at least one of them stayed up the entire time
(staggered reboot).  I'll keep reporting back if you like...
Comment 20 Guenther Deschner 2004-04-14 08:43:19 UTC
Attached is a newer version of my patch for reobtaining service-tickets that 

a) supports another expiry-offset

=> Remove tickets from ccache before n seconds *before* expiry, n is set to 10
seconds now, this could of course be set to a little larger value. (As pointed
out by Luke Howard and discussed with Jim McDonough).

b) avoids credential-removal from file-based ccaches.

=> Heimdal Kerberos (up to 0.6.1) simply does nothing in krb5_cc_remove_creds
and just returns 0. The old patch was leading to an infinite ccache-refresh-loop
 while doing smbclient -k with an expired service ticket :( 
Heimdal should either remove fcc_remove_cred from krb5_fcc_ops or - better -
implement a real file-based krb5_cc_remove_creds-function (attached patch is
provided by Sumit Bose <sbose-at-suse-dot-de> and has been sent upstream to the
heimdal-folks.). 

I find it hard to detect a working krb5_cc_remove_creds for file-ccaches, so
reobtaining tickets in file-ccaches is currently disabled now - but can easily
be fixed with Sumit's patch. 
Comment 21 Guenther Deschner 2004-04-14 08:44:45 UTC
Created attachment 466 [details]
New version of re-obtaining service-tickets-patch
Comment 22 Guenther Deschner 2004-04-14 08:47:47 UTC
Created attachment 467 [details]
Patch for heimdal-0.6.1 to implement krb5_cc_remove_cred for files-ccaches (fcc_remove_cred) - provided by Sumit Bose
Comment 23 Jim McDonough 2004-04-14 12:10:23 UTC
Ok, I've checked fixes in based on the last work you did.  I tried to make the
format a bit more readable by breaking the credential removal code out into
another function, and removing the goto.  I've left the 10-second window in, but
we may need to adjust it.

I think we really don't want to touch file-based ccache's, because they will
usually be from a user's own kinit, and we shouldn't touch those credentials.
Comment 24 Jon Noack 2004-04-14 13:11:29 UTC
Guenther and Jim,
Thank you for work in resolving this issue!
Comment 25 Guenther Deschner 2004-05-03 08:12:56 UTC
There is another (new) problem that has to do with clock-skew-handling.

If your Samba-Server's time is advancing your ADS/KDC-Time, the
kerberos-routines may hang in a loop (aquiering new tickets and removing them
from the credential-cache immediatly afterwards).

AFAIK, the existing time-adjustment routine should just handle both cases (kdc
advance and delay).

I didn't had much time to test it. Can someone else please test this, too?
Comment 26 Guenther Deschner 2004-05-03 08:13:17 UTC
Created attachment 484 [details]
svn diff of libsmb/clikrb5.c

Not very well tested...
Comment 27 Guenther Deschner 2004-05-03 08:21:36 UTC
Sorry, forgot to describe test-case.

To reproduce the new problem try the following:

1) KDC has time(NULL);
2) client that uses clikrb5.c has time(NULL) + 1 hour 
3) start a command-line tool that uses clikrb5.c-routines to aquire a ticket
(e.g. net ads search cn=administrator -U administrator%password 

then krb5_get_credentials gets a ticket.
this ticket is immediatly removed by ads_cleanup_expired_creds because it's
expired. ads_cleanup_expired_creds returns True (thus leaving creds_ready
False), then refetches a ticket, reremoves ticket immediatly, and on and on...

clock_skew is only adapted in ads_krb5_mk_req if crdesp->times.starttime >
time(NULL).

I admit, that you'll see this problem in real setups rarely (you have to set
your clock) - but we better handle it instead of looping forever.
Comment 28 Guenther Deschner 2004-06-12 08:31:20 UTC
Any news on the clock-skew-fix? 

If I understand Volker Lendecke correctly, he recently was hit badly by this
remaining clock-skew-bug at a customer site, too.
Comment 29 Jim McDonough 2004-08-20 16:24:11 UTC
Reassigning to Guethner, as he has produced the patches, and can now even check
them in! :-)  I was trying to test them but had other problems prevent me from
even recreating the bug.
Comment 30 Guenther Deschner 2005-03-30 14:22:58 UTC
*** Bug 2558 has been marked as a duplicate of this bug. ***
Comment 31 Gerald (Jerry) Carter (dead mail address) 2006-04-08 11:43:15 UTC
closing old bugs.  Guenther if you still think we have an issue and want to work on it, please feel free to reopen.