Each call to cache_name2sid can end up calling refresh_sequence_number if another call is in process that's because last_seq_check is only update when the fetch is finished if the fetch takes time for any reason (massive number of DCs, slow DNS server) then while the refresh is in process subsequent calls to winbindd_dual_pam_auth will start fetching the same information again. Also as this is in the authentication flow all calls will have a to wait for the refresh to be finish before the function winbindd_dual_pam_auth can return.
Wouldn't be a correct solution to avoid this snowbowl effect that we store some information in either the cache or the running state of the process to indicate that the refresh is in process and hence no need to start another refresh. In the mean time the cache function could use the current sequence number worse come to worse we will end up caching some information that will be obsolete right away as the USN change.
I don't think we should be looking for the sequence number during the winbindd_dual_pam_auth call at all. I think we should store the last sequence number we knew about, and then let the timeout do the work. Then *if* we need the name -> sid later, then we do it the slow way, it will be just as slow as opening up connections to get the sequence number here, but won't be blocking the auth traffic, which can be very busy. There are use cases for NTLM auth that will never do a name -> sid lookup.
(In reply to Andrew Bartlett from comment #2) >Then *if* we need the name -> sid later, then we do it the slow way, it will be >just as slow as opening up connections to get the sequence number here, but >won't be blocking the auth traffic, which can be very busy. I'm not clear what do you mean here, if we store in the cache name->sid (but without refreshing the USN) why would we have to do it the slow way ? It should be just as fast as we will be reading from the cache. Also I'm not sure right now that we have a timer that is refreshing the USN so if we decide to remove the refresh (and I'm not saying it's a bad idea) then we need to have some sort of timer that refresh this. Finally I disagree that fetching name->sid will as slow as doing the LDAP reconnection + fetch, in scenario where nsswitch is using winbindd for user and group I think we can see a ton of requests for translation. Not 100% sure, most probably Guenther can chime in.
When we read the cache, after the timeout, if we find the USN (sequence number) has actually changed, then the cache won't be valid. The cost to do the name -> sid call later (and save it with the current USN) would be, in my view, about the same as the cost to find the USN, so we are no worse off. The gain is moving this unrelated call our of the NTLM auth path, which can be quite hot.
I'm not quite sure I get this still. But my understanding from what you say seems to be: let's save the translation sid -> name in the cache with the current USN. And whenever the USN is refreshed (and updated) then the entries in the cache will be tossed, sure we won't be able to keep the entries that just wrote but in the same time we won't have to wait for the request to finish. Am I getting it right ?
It seems that cache_name2sid is only called in the winbindd_dual_pam_auth() code path but not in winbindd_dual_pam_auth_crap(). I guess in the end they should better do similar things. In the end I think we should avoid any LDAP traffic at all and get rid of the highestCommittedUSN usage as I guess it's totally useless or do we flush the cache each time we connect to a different dc? The goal is to only to only use netlogon and lsa via tcp and schannel and only for direct trusts. And as a DC maybe drsuapi via tcp and krb5.
>In the end I think we should avoid any LDAP traffic at all and get rid >of the highestCommittedUSN usage as I guess it's totally useless >or do we flush the cache each time we connect to a different dc? I don't think we really flush the cache but we might just start ignoring the entries that are with a different USN I'm not 100% that winbindd_dual_pam_auth_crap is not call refresh_usn, because my logs seems to indicate it is and also I can see that lots of functions in source3/winbindd/winbindd_cache.c are actually calling it, for instance wcache_fetch is calling it and winbindd_dual_pam_auth_crap is calling wcache_query_user_fullname which in turn is calling wcache_query_user which in turn is calling wcache_fetch
(In reply to Matthieu Patou from comment #7) I mean the highestCommittedUSN is relative to the dc (the current invocationId of the dc). If we add a value to the cache with sequence_number 100000 because that's the value on dc1, it won't be discarded when we connect to dc2 which has a value of just 5000. Using highestCommittedUSN as sequence number doesn't provide a strictly increasing number!