12532 – cache_name2sid as called in winbindd_dual_pam_auth can cause multiple refresh of sequence number simultaneously

Bug 12532 - cache_name2sid as called in winbindd_dual_pam_auth can cause multiple refresh of sequence number simultaneously

Summary: cache_name2sid as called in winbindd_dual_pam_auth can cause multiple refresh...

Status:	NEW

Alias:	None

Product:	Samba 4.1 and newer
Classification:	Unclassified
Component:	Winbind (show other bugs)
Version:	4.4.2
Hardware:	All All

Importance:	P5 normal (vote)
Target Milestone:	---
Assignee:	Samba QA Contact
QA Contact:	Samba QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2017-01-23 08:02 UTC by Matthieu Patou
Modified:	2020-02-10 15:30 UTC (History)
CC List:	4 users (show)

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Matthieu Patou 2017-01-23 08:02:36 UTC

Each call to cache_name2sid can end up calling refresh_sequence_number if another call is in process that's because last_seq_check is only update when the fetch is finished if the fetch takes time for any reason (massive number of DCs, slow DNS server) then while the refresh is in process subsequent calls to winbindd_dual_pam_auth will start fetching the same information again.

Also as this is in the authentication flow all calls will have a to wait for the refresh to be finish before the function winbindd_dual_pam_auth can return.

Comment 1 Matthieu Patou 2017-01-23 08:38:19 UTC

Wouldn't be a correct solution to avoid this snowbowl effect that we store some information in either the cache or the running state of the process to indicate that the refresh is in process and hence no need to start another refresh. In the mean time the cache function could use the current sequence number worse come to worse we will end up caching some information that will be obsolete right away as the USN change.

Comment 2 Andrew Bartlett 2017-02-01 10:14:02 UTC

I don't think we should be looking for the sequence number during the winbindd_dual_pam_auth call at all.

I think we should store the last sequence number we knew about, and then let the timeout do the work.  Then *if* we need the name -> sid later, then we do it the slow way, it will be just as slow as opening up connections to get the sequence number here, but won't be blocking the auth traffic, which can be very busy.

There are use cases for NTLM auth that will never do a name -> sid lookup.

Comment 3 Matthieu Patou 2017-02-02 03:05:48 UTC

(In reply to Andrew Bartlett from comment #2)
>Then *if* we need the name -> sid later, then we do it the slow way, it will be >just as slow as opening up connections to get the sequence number here, but >won't be blocking the auth traffic, which can be very busy.

I'm not clear what do you mean here, if we store in the cache name->sid (but without refreshing the USN) why would we have to do it the slow way ? It should be just as fast as we will be reading from the cache.

Also I'm not sure right now that we have a timer that is refreshing the USN so if we decide to remove the refresh (and I'm not saying it's a bad idea) then we need to have some sort of timer that refresh this.

Finally I disagree that fetching name->sid will as slow as doing the LDAP reconnection + fetch, in scenario where nsswitch is using winbindd for user and group I think we can see a ton of requests for translation. Not 100% sure, most probably Guenther can chime in.

Comment 4 Andrew Bartlett 2017-02-02 03:13:01 UTC

When we read the cache, after the timeout, if we find the USN (sequence number) has actually changed, then the cache won't be valid.

The cost to do the name -> sid call later (and save it with the current USN) would be, in my view, about the same as the cost to find the USN, so we are no worse off. 

The gain is moving this unrelated call our of the NTLM auth path, which can be quite hot.

Comment 5 Matthieu Patou 2017-02-02 06:23:33 UTC

I'm not quite sure I get this still. But my understanding from what you say seems to be: let's save the translation sid -> name in the cache with the current USN.
And whenever the USN is refreshed (and updated) then the entries in the cache will be tossed, sure we won't be able to keep the entries that just wrote but in the same time we won't have to wait for the request to finish.

Am I getting it right ?

Comment 6 Stefan Metzmacher 2017-02-02 06:51:25 UTC

It seems that cache_name2sid is only called in the winbindd_dual_pam_auth()
code path but not in winbindd_dual_pam_auth_crap(). I guess in the end
they should better do similar things.

In the end I think we should avoid any LDAP traffic at all and get rid
of the highestCommittedUSN usage as I guess it's totally useless
or do we flush the cache each time we connect to a different dc?

The goal is to only to only use netlogon and lsa via tcp and schannel
and only for direct trusts. And as a DC maybe drsuapi via tcp and krb5.

Comment 7 Matthieu Patou 2017-02-02 09:25:32 UTC

>In the end I think we should avoid any LDAP traffic at all and get rid
>of the highestCommittedUSN usage as I guess it's totally useless
>or do we flush the cache each time we connect to a different dc?

I don't think we really flush the cache but we might just start ignoring the entries that are with a different USN 

I'm not 100% that winbindd_dual_pam_auth_crap is not call refresh_usn, because my logs seems to indicate it is and also I can see that lots of functions in      
source3/winbindd/winbindd_cache.c are actually calling it, for instance                                                                                            wcache_fetch is calling it and winbindd_dual_pam_auth_crap is calling wcache_query_user_fullname which in turn is calling wcache_query_user which in turn is calling wcache_fetch

Comment 8 Stefan Metzmacher 2017-02-02 09:47:38 UTC

(In reply to Matthieu Patou from comment #7)

I mean the highestCommittedUSN is relative to the dc (the current invocationId of the dc).

If we add a value to the cache with sequence_number 100000 because that's the value on dc1, it won't be discarded when we connect to dc2 which has a value
of just 5000.

Using highestCommittedUSN as sequence number doesn't provide a strictly increasing number!