After a certain period of idle-time (from what I have seen: after the 10hours of default ticket lifetime), winbindd does not reconnect successfully to ADS and winbind's kinit fails continously with "Invalid credentials". After winbindd is restarted, operation continues as expected. This might or might not be related to #1185, although please note, that winbindd does not crash, it just cannot respond to queries until restarted. The recent fixes for #1185 (and #1195) have not yet been tested. Environment: SuSE Linux Enterprise Server (SP3), samba3-rpms from SerNet (including static heimdal-0.6-20031214), Windows 2003 in native mode for now I can only provide log level 3 from winbind: ---------------------------------------------------- [2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)] nsswitch/winbindd_ads.c:trusted_domains(852) ads: trusted_domains [2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)] libads/ldap.c:ads_connect(218) Connected to LDAP server 172.25.65.236 [2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)] libads/ldap.c:ads_server_info(2030) got ldap server name dc1@MY.REALM, using bind path: dc=MY,dc=REALM [2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)] nsswitch/winbindd_cm.c:cm_get_ipc_userpass(107) IPC$ connections done anonymously [2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)] libsmb/cliconnect.c:cli_start_connection(1337) Connecting to host=DC1 [2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)] lib/util_sock.c:open_socket_out(710) Connecting to 172.25.65.236 at port 445 [2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)] libsmb/cliconnect.c:cli_session_setup_spnego(676) Doing spnego session setup (blob length=118) [2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)] libsmb/cliconnect.c:cli_session_setup_spnego(701) got OID=1 2 840 48018 1 2 2 [2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)] libsmb/cliconnect.c:cli_session_setup_spnego(701) got OID=1 2 840 113554 1 2 2 [2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)] libsmb/cliconnect.c:cli_session_setup_spnego(701) got OID=1 2 840 113554 1 2 2 3 [2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)] libsmb/cliconnect.c:cli_session_setup_spnego(701) got OID=1 3 6 1 4 1 311 2 2 10 [2004/03/19 06:05:31, 3, pid=4032, effective(0, 0), real(0, 0)] libsmb/cliconnect.c:cli_session_setup_spnego(708) got principal=dc1$@MY.REALM [2004/03/19 06:05:31, 2, pid=4032, effective(0, 0), real(0, 0)] libsmb/cliconnect.c:cli_session_setup_kerberos(510) [2004/03/19 06:05:31, 2, pid=4032, effective(0, 0), real(0, 0)] libsmb/cliconnect.c:cli_session_setup_kerberos(510) Doing kerberos session setup [2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)] nsswitch/winbindd_misc.c:winbindd_interface_version(261) [ 5841]: request interface version [2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)] nsswitch/winbindd_misc.c:winbindd_priv_pipe_dir(297) [ 5841]: request location of privileged pipe [2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)] nsswitch/winbindd_misc.c:winbindd_domain_info(210) [ 5841]: domain_info [MY.REALM] [2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)] nsswitch/winbindd_user.c:winbindd_getpwnam(122) [ 5841]: getpwnam my_realm+gf0047$ [2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)] nsswitch/winbindd_ads.c:sequence_number(812) ads: fetch sequence_number for MY_REALM [2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)] libads/ldap.c:ads_do_paged_search(451) ldap_search_ext_s((objectclass=*)) -> Can't contact LDAP server [2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)] libads/ldap_utils.c:ads_do_search_retry(66) Reopening ads connection to realm 'MY.REALM' after error Can't contact LDAP server [2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)] libads/ldap.c:ads_connect(218) Connected to LDAP server 172.25.65.236 [2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)] libads/ldap.c:ads_server_info(2030) got ldap server name dc1@MY.REALM, using bind path: dc=MY,dc=REALM [2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)] libads/sasl.c:ads_sasl_spnego_bind(187) got OID=1 2 840 48018 1 2 2 [2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)] libads/sasl.c:ads_sasl_spnego_bind(187) got OID=1 2 840 113554 1 2 2 [2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)] libads/sasl.c:ads_sasl_spnego_bind(187) got OID=1 2 840 113554 1 2 2 3 [2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)] libads/sasl.c:ads_sasl_spnego_bind(187) got OID=1 3 6 1 4 1 311 2 2 10 [2004/03/19 06:08:50, 3, pid=4032, effective(0, 0), real(0, 0)] libads/sasl.c:ads_sasl_spnego_bind(194) got principal=dc1$@MY.REALM [2004/03/19 06:08:52, 1, pid=4032, effective(0, 0), real(0, 0)] libads/ldap_utils.c:ads_do_search_retry(77) ads_search_retry: failed to reconnect (Invalid credentials) [2004/03/19 06:08:52, 1, pid=4032, effective(0, 0), real(0, 0)] nsswitch/winbindd_user.c:winbindd_getpwnam(157) user 'gf0047$' does not exist
Jim, you're already working on this right ? I thought a report about this might have already been filed, but I can't find it now.
Created attachment 452 [details] saves expire time at ticket acquisition time, and checks before using cached connection in winbind Needs testing overnight.
In addition to the above patch, I also added code to destroy the old tickets, because it wouldn't acquire new ones without it. Ticket renewal wasn't really an option, because we would need to 1) be awake before they expired to make the request 2) acquire new ones every week anyway, since that's the AD limit.
Even with all latest fixes from 3_0 winbindd in ads-mode dies for me after a certain period. pretty much a basic smb.conf: [global] workgroup = BERSMBLAB realm = BER.SMB-LAB.NET password server = ads security = ads unix charset = LOCALE idmap uid = 1000-999999 idmap gid = 1000-999999 log level = 10 panic action = /root/backtrace.sh %d At 00:35 suddently signature errors appear, signing worked fine at 00:30 (not in the attachment).
Created attachment 457 [details] log level 10 winbind-log
Ok, I'm not conviced that this failure you're seeing is related. I don't really know about the signing errors, but the segfault is definitely not from this. However, you should make clean and re-make, just to verify that's not just from some fn() that changed parms along the way, and if it persists, open a separate bug on the segfault. The signing error might have been fixed already, per Jerry, but I don't know for sure. Unfortunately I can't feasably build a system that has only specific fixes, so I have to ask if you can recreate it on a straight cvs system. For the others that have added themselves to the cc: list on this bug, please test it out. I'm convinced that I at least fixed some part of it, but not 100% sure I got all cases. Report back your findings here, good or bad. Thanks, Jim
This didn't fix it for me (FreeBSD 5.2.1, Samba 3.0.2a + patches for bug 1195 and 1208). I have a cron job running "wbinfo -t" every 5 minutes. It works fine but still fails after exactly 10 hours. Note that I modified the patches attached to the bugs to represent what was actually committed. To view the patches: http://www.noacks.org/winbind/ If there are problems with the patches or other fixes needed for these to work, please let me know (I'm hesitant to use a snapshot as this is in production).
3.0.3pre2 was just committed to FreeBSD ports. I'll give it a shot and report back.
Nope, I have the same problem with 3.0.3pre2. Is my testing technique flawed (checking 'wbinfo -t' every 5 minutes)? I still dies after exactly 10 hours... In the process of setting this up I added these to krb5.conf: ticket_lifetime = 36000 renew_lifetime = 604800 Could the issue be that the ticket is renewable but we're not trying to renew? In other words, should I take out the "renew_lifetime" entry? It's a simple change so I'll give it a shot and report back after 9:15pm CDT.
The main problem while fixing this is that we do only renew ticket-granting-tickets (tgts) and ignore service-tickets completely. While the reaquisition of a tgt might work now (although I never tested winbindd getting a new tgt), service tickets most commonly expire before the end of validity of a tgt (my test server uses a tgt-lifetime of 10 hours and service tickets of 10 min.). So while ads_cached_connection still thinks it has a valid ticket (which is only true for the tgt), an expired service-ticket is used that cannot be renewed. The attached patch basically adds expiry-checking (and renewal) for service-tickets. It works for me after a few tests. It still needs to honour clock-skew, I guess (this is missing for renewal of tgts as well, I suppose). Levering severity to "major" because quite some people are hit by this (see main samba-mailinglist). This needs further testing.
Created attachment 465 [details] svn diff against 3_0 for checking and renewal of service tickets in cli_krb5_get_ticket (ads_krb5_mk_req)
Removing "renew_lifetime" had no effect (as I expected). I'll try Guenther's patch and report back...
With Guenther's patch everything has been working successfully for over 16 hours now. My testing first involved 'wbinfo -t' every 5 minutes via cron and then I just enabled Squid proxy auth (after 15 hours and no restart of winbindd). Everything looks good so far! We have a long weekend for Easter, so I won't be able to load test this until Tuesday. However, I will leave Squid proxy auth enabled and see how it fairs in the mean time. I am highly excited by this....
winbindd has been up for over 5 days now and we're still authenticating. It handled my 400 users today without breaking a sweat -- I think Guenther's patch is a winner...
I've got 2 concerns here: 1) I couldn't reproduce the problem once my patch was on. I can see logically how it might occur, but on my system it doesn't 2) there is still definitely a window between when we check the validity of the ticket and when we use it. #1 isn't such a big deal. I trust that since multiple people have found that it fixed it, the patch is helping. It just makes it harder for me to truly test it. #2 has a couple of options: - We can check for ticket expiration errors after using the ticket, which means everyone that calls the stuff we've fixed has to be changed. - We can introduce a fudge factor, say, a minute, or something similar, that says if the ticket will expire in x time, go ahead and get new tickets. The latter is easier to do, though not as elegant, and would just add a few bytes to the existing 2 patches.
Oh, one more thing. Let me know in two more days....assuming you have the default renewal time of 7 days set in your domain security policy.
The test I did was the every 5 minute 'wbinfo -t'. Without Guenther's patch, this always failed after 10 hours with an access denied message. With Guenther's patch, things appear to be OK. I also thought about the 7 day issue on my way home from work -- I'll be sure to report back on Thursday. Note that I am still running without a "renew_lifetime" entry in krb5.conf, so tickets shouldn't be renewable anyway (confirmed with a 'klist -v' after a 'kinit'). As of now, none of the systems involved have been rebooted. Barring any critical security patches, I'll keep it that way.
Sorry folks (and MS04-12, MS04-14, etc.): http://www.microsoft.com/technet/security/bulletin/ms04-011.mspx I can't wait on this one -- I'll let you know if it survives a week when (if?) I can leave the servers up that long.
For what it's worth, everything still works after rebooting the Windows domain controllers -- it was not necessary to restart Samba. Note that I have 2 'kdc' entries in krb5.conf and at least one of them stayed up the entire time (staggered reboot). I'll keep reporting back if you like...
Attached is a newer version of my patch for reobtaining service-tickets that a) supports another expiry-offset => Remove tickets from ccache before n seconds *before* expiry, n is set to 10 seconds now, this could of course be set to a little larger value. (As pointed out by Luke Howard and discussed with Jim McDonough). b) avoids credential-removal from file-based ccaches. => Heimdal Kerberos (up to 0.6.1) simply does nothing in krb5_cc_remove_creds and just returns 0. The old patch was leading to an infinite ccache-refresh-loop while doing smbclient -k with an expired service ticket :( Heimdal should either remove fcc_remove_cred from krb5_fcc_ops or - better - implement a real file-based krb5_cc_remove_creds-function (attached patch is provided by Sumit Bose <sbose-at-suse-dot-de> and has been sent upstream to the heimdal-folks.). I find it hard to detect a working krb5_cc_remove_creds for file-ccaches, so reobtaining tickets in file-ccaches is currently disabled now - but can easily be fixed with Sumit's patch.
Created attachment 466 [details] New version of re-obtaining service-tickets-patch
Created attachment 467 [details] Patch for heimdal-0.6.1 to implement krb5_cc_remove_cred for files-ccaches (fcc_remove_cred) - provided by Sumit Bose
Ok, I've checked fixes in based on the last work you did. I tried to make the format a bit more readable by breaking the credential removal code out into another function, and removing the goto. I've left the 10-second window in, but we may need to adjust it. I think we really don't want to touch file-based ccache's, because they will usually be from a user's own kinit, and we shouldn't touch those credentials.
Guenther and Jim, Thank you for work in resolving this issue!
There is another (new) problem that has to do with clock-skew-handling. If your Samba-Server's time is advancing your ADS/KDC-Time, the kerberos-routines may hang in a loop (aquiering new tickets and removing them from the credential-cache immediatly afterwards). AFAIK, the existing time-adjustment routine should just handle both cases (kdc advance and delay). I didn't had much time to test it. Can someone else please test this, too?
Created attachment 484 [details] svn diff of libsmb/clikrb5.c Not very well tested...
Sorry, forgot to describe test-case. To reproduce the new problem try the following: 1) KDC has time(NULL); 2) client that uses clikrb5.c has time(NULL) + 1 hour 3) start a command-line tool that uses clikrb5.c-routines to aquire a ticket (e.g. net ads search cn=administrator -U administrator%password then krb5_get_credentials gets a ticket. this ticket is immediatly removed by ads_cleanup_expired_creds because it's expired. ads_cleanup_expired_creds returns True (thus leaving creds_ready False), then refetches a ticket, reremoves ticket immediatly, and on and on... clock_skew is only adapted in ads_krb5_mk_req if crdesp->times.starttime > time(NULL). I admit, that you'll see this problem in real setups rarely (you have to set your clock) - but we better handle it instead of looping forever.
Any news on the clock-skew-fix? If I understand Volker Lendecke correctly, he recently was hit badly by this remaining clock-skew-bug at a customer site, too.
Reassigning to Guethner, as he has produced the patches, and can now even check them in! :-) I was trying to test them but had other problems prevent me from even recreating the bug.
*** Bug 2558 has been marked as a duplicate of this bug. ***
closing old bugs. Guenther if you still think we have an issue and want to work on it, please feel free to reopen.