Bug 11198 - offline PAM auth deletes valid Kerberos credentials
Summary: offline PAM auth deletes valid Kerberos credentials
Status: ASSIGNED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: Winbind
Version: 4.2.2
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assignee: Jeremy Allison
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-04-07 11:36 UTC by David Woodhouse
Modified: 2016-02-09 10:21 UTC (History)
CC List: 3 users

See Also:


Attachments
git-am possible patch for master. (1.51 KB, patch)
2015-09-03 00:15 UTC, Jeremy Allison
ab: review+
winbind log (823.26 KB, text/plain)
2015-09-04 09:31 UTC, David Woodhouse

Description David Woodhouse 2015-04-07 11:36:22 UTC
I have no idea why winbind thinks it's offline. That will have to be the subject of a different bug. But why is it *deleting* my valid Kerberos credentials...?

[dwoodhou@i7 f22]$ kinit dwoodhou
Password for dwoodhou@GER.CORP.INTEL.COM: 
Warning: Your password will expire in 6 days on Mon 13 Apr 2015 17:49:56 BST
[dwoodhou@i7 f22]$ ls -l /tmp/krb5cc_500 
-rw-------. 1 dwoodhou dwoodhou 3804 Apr  7 12:35 /tmp/krb5cc_500
[dwoodhou@i7 f22]$ wbinfo -K dwoodhou
Enter dwoodhou's password: 
plaintext kerberos password authentication for [dwoodhou] succeeded (requesting cctype: FILE)
user_flgs: NETLOGON_CACHED_ACCOUNT
credentials were put in: FILE:/tmp/krb5cc_500
[dwoodhou@i7 f22]$ ls -l /tmp/krb5cc_500 
ls: cannot access /tmp/krb5cc_500: No such file or directory

ISTR it also does this if I attempt 'su' and get my password wrong, even when it *is* correctly online.
Comment 1 David Woodhouse 2015-09-01 13:00:09 UTC
I can no longer trigger this with su or sudo and an incorrect password, but it is still happening when I drop off the VPN briefly and have to authenticate to something while I'm offline. The Kerberos TGT, which *should* have remained valid, is being deleted.

This is painful because applications like Evolution will get notified as soon as the VPN comes up again and will try to communicate... while they don't have a valid TGT because winbind hasn't *quite* managed to replace it in time.

child_process_request: request fn PAM_AUTH
[17087]: dual pam auth GER\dwoodhou
winbindd_dual_pam_auth: domain: GER last was online
winbindd_dual_pam_auth_kerberos
is_myname("GER") returns 0
using ccache: FILE:/tmp/krb5cc_500
winbindd_raw_kerberos_login: uid is 500
kerberos_kinit_password: as dwoodhou@GER.CORP.INTEL.COM using [FILE:/tmp/krb5cc_500] as ccache and config [(null)]
no krb5_error
kinit failed for 'dwoodhou@GER.CORP.INTEL.COM' with: Cannot contact any KDC for requested realm (-1765328228)
winbindd_dual_pam_auth_kerberos failed: NT_STATUS_NO_LOGON_SERVERS
winbindd_dual_pam_auth_kerberos setting domain to offline
set_domain_offline: called for domain GER
set_domain_offline: added event handler for domain GER
messaging_dgm_send: Sending message to 17087
winbindd_dual_pam_auth_cached
get_cache: Setting ADS methods for domain GER
centry_expired: Key NS/GER/DWOODHOU for domain GER valid as domain is offline.
wcache_fetch: returning entry NS/GER/DWOODHOU for domain GER
name_to_sid: [Cached] - cached name for domain GER status: NT_STATUS_OK
messaging_recv_cb: Received message 0x40c len 4 (num_fds:0) from 17089
centry_expired: Key CRED/S-1-5-21-2052111302-1275210071-1644491937-279532 for domain GER valid as domain is offline.
wcache_fetch: returning entry CRED/S-1-5-21-2052111302-1275210071-1644491937-279532 for domain GER
Domain GER is marked as offline now.
wcache_get_creds: [Cached] - cached creds for user S-1-5-21-2052111302-1275210071-1644491937-279532 status: NT_STATUS_OK
...
wcache_tdc_fetch_domain: Searching for domain GER
wcache_tdc_fetch_domain: Found domain GER
using ccache: FILE:/tmp/krb5cc_500
add_ccache_to_list: successfully destroyed krb5 ccache FILE:/tmp/krb5cc_500 for user GER\dwoodhou
add_ccache_to_list: ref count on entry GER\dwoodhou is now 2
winbindd_add_memory_creds_internal: ref count for user GER\dwoodhou is now 2
winbindd_add_memory_creds returned: NT_STATUS_OK
wcache_save_creds: S-1-5-21-2052111302-1275210071-1644491937-279532
Comment 2 Jeremy Allison 2015-09-01 16:58:54 UTC
This code is doing it:

source3/winbindd/winbindd_cred_cache.c:add_ccache_to_list()

        /* If it is cached login, destroy krb5 ticket
         * to avoid surprise. */
#ifdef HAVE_KRB5
        if (postponed_request) {
                /* ignore KRB5_FCC_NOFILE error here */
                ret = ads_kdestroy(ccname);
                if (ret == KRB5_FCC_NOFILE) {
                        ret = 0;
                }
                if (ret) {
                        DEBUG(0, ("add_ccache_to_list: failed to destroy "
                                  "user krb5 ccache %s with %s\n", ccname,
                                  error_message(ret)));
                        return krb5_to_nt_status(ret);
                }
                DEBUG(10, ("add_ccache_to_list: successfully destroyed "
                           "krb5 ccache %s for user %s\n", ccname,
                           username));
        }
#endif

This commit shows the details.

git show f389b97c6
Comment 3 David Woodhouse 2015-09-01 21:23:03 UTC
(In reply to Jeremy Allison from comment #2)
>         /* If it is cached login, destroy krb5 ticket
>          * to avoid surprise. */

That's a... rather opaque comment. It's not entirely clear what form this "surprise" would take. One might normally expect such things to be expounded in the commit comment... but no, that's somewhat taciturn too.

What would the failure mode be if we *didn't* destroy the existing krb5 ticket? And why is there no better workaround, like actually inspecting it to see what its renew/refresh times are and setting our timers accordingly?
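
To make the "inspect it and set our timers accordingly" idea concrete, here is a minimal sketch. It is not winbindd code: `struct tgt_times`, `enum refresh_action` and `next_refresh()` are hypothetical names, and the "act halfway through the remaining lifetime" policy is only an assumption chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical summary of an existing TGT's lifetimes (not a winbindd type). */
struct tgt_times {
	time_t endtime;    /* when the ticket expires */
	time_t renew_till; /* last moment at which it may still be renewed */
};

enum refresh_action {
	REFRESH_RENEW,  /* renew the existing ticket */
	REFRESH_REKINIT /* renewal is not possible: obtain a fresh ticket */
};

/*
 * Decide what to do with a TGT found in the ccache and when to do it,
 * instead of destroying the ticket up front.
 */
enum refresh_action next_refresh(const struct tgt_times *t, time_t now,
				 time_t *when)
{
	assert(t != NULL && when != NULL);

	if (t->endtime <= now) {
		/* Already expired: renewal would fail, so rekinit right away. */
		*when = now;
		return REFRESH_REKINIT;
	}

	/* Still valid: act halfway through the remaining lifetime
	 * (illustrative policy, assumed for this sketch). */
	*when = now + (t->endtime - now) / 2;

	/* Renewal only works while the renewable window is still open. */
	return (t->renew_till > *when) ? REFRESH_RENEW : REFRESH_REKINIT;
}
```

With something along these lines the existing ticket stays in place until there is a reason to touch it, and an already expired ticket simply leads to an immediate rekinit attempt.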
Comment 4 Jeremy Allison 2015-09-01 22:32:32 UTC
(In reply to David Woodhouse from comment #3)

Yeah, I'm not sure I understand precisely the logic here.

From the f389b97c6 commit there is a comment:

+               /* This is evil, if the ticket was already expired.
+                * renew ticket function returns KRB5KRB_AP_ERR_TKT_EXPIRED.
+                * But there is still a chance that we can rekinit it. 
+                *
+                * This happens when user login in online mode, and then network
+                * down or something cause winbind goes offline for a very long time,
+                * and then goes online again. ticket expired, renew failed.
+                * This happens when machine are put to sleep for a long time,
+                * but shorter than entry->renew_util.
+                * NB
+                * Looks like the KDC is reachable, we want to rekinit as soon as
+                * possible instead of waiting some time later. */

which I'm not sure I follow.

Can you explain exactly the logic you want here ?
Comment 5 David Woodhouse 2015-09-01 23:10:07 UTC
(In reply to Jeremy Allison from comment #4)
> Can you explain exactly the logic you want here ?

That's simple: Never delete a valid TGT and leave me with nothing.

If you can get a *new* one, fine. Please do it atomically.

But if you temporarily can't communicate with the server and you delete my valid TGT in a fit of pique, that's bad.

This was *really* painful a few weeks ago when we had some infrastructure problems. Anyone with an existing TGT was OK, but once it expired you couldn't get a new one and everything stopped working. And then I *really* hated this bug, and vowed to chase it up :)
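
For a FILE: ccache, "do it atomically" could look roughly like the sketch below: write the new contents to a temporary file in the same directory and rename() it over the old one, so a reader sees either the old cache or the new one, never a missing file. This is a generic POSIX sketch under my own assumptions, not what winbindd or libkrb5 actually do; `replace_file_atomically()` and `write_all()` are hypothetical helpers, and ownership handling (winbindd runs as root, the ccache belongs to the user) is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);
		if (n < 0) {
			if (errno == EINTR) {
				continue;
			}
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Replace the file at `path` with `len` bytes from `data`.
 * Returns 0 on success, -1 on failure; on failure the old file is untouched.
 */
int replace_file_atomically(const char *path, const char *data, size_t len)
{
	char tmp[4096];
	int fd;
	int n;

	assert(path != NULL && data != NULL);

	/* The temporary file must live on the same filesystem as `path`. */
	n = snprintf(tmp, sizeof(tmp), "%s.XXXXXX", path);
	if (n < 0 || (size_t)n >= sizeof(tmp)) {
		return -1;
	}

	fd = mkstemp(tmp); /* created with mode 0600 */
	if (fd == -1) {
		return -1;
	}
	if (write_all(fd, data, len) != 0 || fsync(fd) != 0) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	if (close(fd) != 0 || rename(tmp, path) != 0) {
		unlink(tmp);
		return -1;
	}
	return 0;
}
```

The point is only the ordering: the old cache is never removed before its replacement is completely on disk.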
Comment 6 Jeremy Allison 2015-09-01 23:57:19 UTC
OK, but this is the key "a valid TGT"..

How do we know if it's valid and not expired when we're offline ? What logic should we use here ?
Comment 7 David Woodhouse 2015-09-02 07:20:45 UTC
I'm perfectly happy to remove the word 'valid'. If there's already a TGT when you're authenticating in offline mode, don't delete it at all. Who cares if it's valid or not?

Later on when you come back online, you can get a shiny new TGT, which might replace the one that already existed.

If your renew/refresh logic is "surprised" by the existence of a TGT which you didn't expect, let's fix that. Although like you, I still don't quite see what the problem was there.
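
Spelled out as code, the rule I'm asking for is tiny. This is only a sketch of the policy, with hypothetical names (`struct login_result`, `ccache_action_for_login()`); it is not taken from any patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of one PAM authentication attempt. */
struct login_result {
	bool domain_online; /* the KDC could be contacted */
	bool got_new_tgt;   /* kinit succeeded and produced a fresh TGT */
};

enum ccache_action {
	CCACHE_KEEP,   /* leave whatever is in the ccache alone */
	CCACHE_REPLACE /* overwrite it with the freshly obtained TGT */
};

enum ccache_action ccache_action_for_login(const struct login_result *r)
{
	assert(r != NULL);

	/* Only a fresh TGT from the KDC justifies touching the ccache. */
	if (r->domain_online && r->got_new_tgt) {
		return CCACHE_REPLACE;
	}

	/* Offline (cached) login or failed kinit: there is nothing better
	 * to offer, so keep the existing ccache, valid or not. */
	return CCACHE_KEEP;
}
```

Note that there is deliberately no "destroy" outcome at all.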
Comment 8 Jeremy Allison 2015-09-03 00:15:14 UTC
Created attachment 11395 [details]
git-am possible patch for master.

David can you check if this does what you want ? Alexander, can you take a look and see if this looks ok to you ?
Comment 9 Alexander Bokovoy 2015-09-03 06:20:59 UTC
Comment on attachment 11395 [details]
git-am possible patch for master.

We discussed this with Jakub, and keeping the existing ccache is the behavior SSSD has as well. In offline mode it injects a placeholder TGT (expired at the start of the Unix epoch) because the ccache path is exposed to the environment via KRB5CCNAME.

My only worry would be if we have other places that depend on a valid ticket in the user's ccache. If that code is not expecting an expired ticket, it might fail.
Comment 10 Alexander Bokovoy 2015-09-03 06:21:49 UTC
Comment on attachment 11395 [details]
git-am possible patch for master.

Forgot my RB+
Comment 11 Sumit Bose 2015-09-03 07:18:50 UTC
Please note that the ccache might contain not only TGTs but service tickets as well. Although the KDC might not be reachable, which triggers a transition into offline mode, there might still be valid service tickets in the ccache for services which are still reachable; think e.g. of NFS.
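
As a rough illustration of the point about service tickets, the sketch below counts the entries in a ccache that are still usable at a given time, whether or not the KDC is reachable. `struct ccache_entry` and `count_usable_tickets()` are hypothetical and only model the idea; a real implementation would walk the cache through the Kerberos library instead.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified view of one credential in a ccache. */
struct ccache_entry {
	const char *server; /* e.g. a krbtgt/... or nfs/... principal */
	time_t endtime;     /* when this ticket expires */
};

/* Count the tickets that have not expired at time `now`. */
size_t count_usable_tickets(const struct ccache_entry *entries, size_t n,
			    time_t now)
{
	size_t i;
	size_t usable = 0;

	assert(entries != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (entries[i].endtime > now) {
			usable++;
		}
	}
	return usable;
}
```

If this returns anything other than zero, destroying the whole ccache on an offline login throws away tickets that would still have worked.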
Comment 12 David Woodhouse 2015-09-04 09:31:51 UTC
Created attachment 11410 [details]
winbind log

(In reply to Jeremy Allison from comment #8)
> David can you check if this does what you want ? 

It doesn't delete the valid TGT when I do an offline login, certainly.

It also didn't immediately get me a *new* one when I subsequently went online, though. That wasn't what I expected. Further testing shows that even if I have no existing TGT, or if I have an expired TGT, it never gets me a new one.

After I go online, I still see...

[dwoodhou@i7 1.1.fc22]$ wbinfo --online-status
BUILTIN : online
DWOODHOU-LINUX : online
GER : online
[dwoodhou@i7 1.1.fc22]$ wbinfo -K dwoodhou
Enter dwoodhou's password: 
plaintext kerberos password authentication for [dwoodhou] succeeded (requesting cctype: FILE)
user_flgs: NETLOGON_CACHED_ACCOUNT
credentials were put in: FILE:/tmp/krb5cc_500


But it lies. The credentials *weren't* put in FILE:/tmp/krb5cc_500.

Winbind log attached. I start it offline, run 'wbinfo -K dwoodhou' while offline, then join the VPN (which prods it to go online) and then do the above.
Comment 13 David Woodhouse 2015-09-04 10:23:51 UTC
Hm, I think you can disregard comment 12; I cannot repeat it.

I didn't change anything — instead of running the packaged build (Fedora's 4.2.2-1 with the patch applied), I tried running the *same* build from its build directory, as a prelude to reverting the patch and double-checking the 'gain TGT after going online' behaviour. It worked fine. At some point I disabled and re-enabled SELinux, and double-checked yet again that it was enabled. And the packaged build that I originally tested is no longer showing the same behaviour.

The only thing that's changed between then and now is that I had another cup of tea. Therefore I have to blame comment #12 on the fact that I was insufficiently caffeinated. Will continue to run with this build and report any issues that arise.
Comment 14 Jeremy Allison 2015-09-04 18:01:41 UTC
(In reply to David Woodhouse from comment #13)

Thanks. Fix has gone into master. Once you confirm it's good I'll back-port for 4.2.next, 4.3.0 and 4.1.next.