Hi Team,

When DCs go down, I see that PingDc moves to a new DC comfortably, but the getpwuid/getgrgid type of calls that try the failing DC do not shift to the new DC. This sometimes rectifies itself once the DCINFO cache entry expires. Details are below, with a sample scenario.

1. Two nodes are in a Samba cluster (smbd, winbindd, ctdbd), joined to a domain that has two Active Directory DCs (in my experiment the AD DCs are Windows based, not Samba source4 based).
2. winbindd on one node is currently using windc1, the first DC. Both DCs, windc1 and windc2, are currently up. While windc1 is in the internal cache or the cache tdb files, this node continues to use it.
3. We now shut down the first DC, windc1.
4. wbinfo -P might fail or hang for some time:

   checking the NETLOGON for domain[ADPROTOCOLX] dc connection to "" failed
   failed to call wbcPingDc: WBC_ERR_WINBIND_NOT_AVAILABLE

5. The key in gencache.tdb changes to the new DC, windc2, as below:

   Key: CURRENT_DCNAME/ADPROTOCOLX
   Timeout: Mon Jan 18 22:14:07 2038
   Value: 10.XX.1.YY windc2.adprotocolx.com

6. After this, wbinfo -P starts working. But getpwuid based calls, such as a simple C program or getent passwd commands, still use the older DC and continue failing indefinitely.
7. wbint_UnixIDs2Sids: struct wbint_UnixIDs2Sids has a result of NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND, or sometimes NULL, in the winbind logs.
8. We see DsGetDcName issued. wbint_DsGetDcName: struct wbint_DsGetDcName prints dc_info as NULL or sometimes as the older DC name.

Possible reasons that I saw are as below:
====================================
1. DsGetDcName is sometimes issued too early, and once it gets NULL or the older DC, it is never tried again.
2. If there is a DCINFO/ADPROTOCOLX.COM key whose cache entry has not yet expired, no fresh request is issued.
3. The PingDc request seems to use CURRENT_DCNAME and functions correctly, but getpwuid and other calls use the dcname variable or the DCINFO entry of gencache.tdb.

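Reason 2 above can be illustrated with a toy model of a cache entry that carries an expiry time. This is only a sketch of the behaviour I suspect, not Samba's gencache implementation; struct toy_entry, toy_lookup and toy_pick_dc are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for one cache entry; NOT Samba's real gencache layout. */
struct toy_entry {
    const char *key;    /* e.g. "DCINFO/ADPROTOCOLX.COM" */
    const char *value;  /* DC name stored under that key */
    time_t expires;     /* absolute expiry time */
    bool present;       /* false once the entry has been deleted */
};

/* Return the cached DC name, or NULL if the entry is missing or expired. */
static const char *toy_lookup(const struct toy_entry *e, time_t now)
{
    assert(e != NULL);
    if (!e->present || now >= e->expires) {
        return NULL;
    }
    return e->value;
}

/*
 * Pick the DC a lookup would use: the cached one while the entry is
 * still valid, otherwise whatever a fresh discovery returned.
 */
static const char *toy_pick_dc(const struct toy_entry *e, time_t now,
                               const char *rediscovered)
{
    const char *cached = toy_lookup(e, now);
    return (cached != NULL) ? cached : rediscovered;
}
```

In this model a lookup keeps returning the stale DC (windc1) until the entry expires or is removed, which matches the observation that the problem sometimes clears once the DCINFO entry has expired.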
Note:
======
In the 4.17 code, when a DC is down, deleting the DCINFO entry in gencache.tdb right after the CURRENT_DCNAME write rectified the problem by refreshing the info. But are there cases where DCINFO can be NULL and some internal cache is being used?

The code changes that made it work in 4.17 are as below.

source3/winbindd/winbindd_cm.c

	/*
	 * Much as I hate global state, this seems to be the point
	 * where we can be certain that we have a proper connection to
	 * a DC. wbinfo --dc-info needs that information, store it in
	 * gencache with a looong timeout. This will need revisiting
	 * once we start to connect to multiple DCs, wbcDcInfo is
	 * already prepared for that.
	 */
	store_current_dc_in_gencache(domain->name, domain->dcname,
				     new_conn->cli);

	/* Construct the cache key for DCINFO and delete it */
	cache_key = talloc_asprintf_strupper_m(mem_ctx, "DCINFO/%s",
					       domain->name);
	if (cache_key == NULL) {
		talloc_destroy(mem_ctx);
		return NT_STATUS_NO_MEMORY;
	}
	D_DEBUG("Deleting DCINFO/%s key\n", domain->name);
	gencache_del(cache_key);
	talloc_destroy(cache_key);

But I do not know whether other cases exist.

Reproduction steps
==================
On a Samba node that is joined as a domain member, we can run the loop below.

	while true; do
		sleep 5
		echo "Clearing ids in cache"
		for i in $(net cache list | grep -E "IDMAP|SID2NAME|NAME2SID" | awk '{print $2}'); do
			net cache del "$i"
		done
		echo ""
		date
		echo "Displaying net cache list"
		net cache list
		tdbdump gencache.tdb
		echo "Pinging using wbinfo -P"
		time wbinfo -P
		echo "Issuing getent passwd"
		time getent passwd 10001
		echo ""
	done

In the middle of this, we can shut down a DC or otherwise make it problematic.

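For readers not familiar with the talloc helper used in the change above, the key it builds is simply the upper-cased string "DCINFO/<domain>". A minimal stand-alone sketch using only the C standard library follows; toy_dcinfo_key is a hypothetical helper written for illustration, handles ASCII only, and is not part of Samba.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write the upper-cased key "DCINFO/<domain>"
 * into buf. Returns 0 on success, -1 if buf is too small.
 * ASCII only; the real Samba helper is multibyte aware.
 */
static int toy_dcinfo_key(const char *domain, char *buf, size_t buflen)
{
    assert(domain != NULL && buf != NULL);

    int n = snprintf(buf, buflen, "DCINFO/%s", domain);
    if (n < 0 || (size_t)n >= buflen) {
        return -1; /* formatting error or truncated output */
    }

    for (char *p = buf; *p != '\0'; p++) {
        *p = (char)toupper((unsigned char)*p);
    }
    return 0;
}
```

Called with "adprotocolx.com", this produces "DCINFO/ADPROTOCOLX.COM", the form of the key seen in gencache.tdb.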
This is the sequence found on further checking.

1. Two DCs are up (initial setup).
2. The DC that is being used by a Samba node goes down.
3. On the node that uses this DC, wbinfo -P invokes wbint_PingDc on the parent winbindd.
4. The child winbindd invokes PingDc. It detects that the current DC is not available and invokes cm_open_connection, which looks for a new DC, checks that the result is non-NULL, writes CURRENT_DCNAME into gencache.tdb and returns. Subsequent invocations in the child winbindd seem to use the stored variables.
5. A getent type command invokes GETPWUID on the winbindd parent.
6. That in turn issues UnixIDs2Sids; the winbindd child tries the current (down) DC and returns NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND, NT_STATUS_HOST_UNREACHABLE or NT_STATUS_IO_TIMEOUT.
7. This invokes the rediscovery routine in the code, as below:

	if (NT_STATUS_EQUAL(result, NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND) &&
	    !state->tried_dclookup) {
		...
		subreq = wb_dsgetdcname_send(state, state->ev,
					     state->dom_map->name, NULL, NULL,
					     DS_RETURN_DNS_NAME);
		...
	}

8. But rediscovery is attempted only if the returned status was NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND. I also see that the flag passed is DS_RETURN_DNS_NAME, so is it quite possible that no rediscovery is done because DS_FORCE_REDISCOVERY is missing? "flags" in the returned wbint_DsGetDcName structure showed 0x40000000. Also, a wrong DC can be momentarily available while it is going down or free. In both cases no rediscovery happens and a wrong DC can be found. Once DCINFO is written to gencache.tdb (more relevant to 4.17.12), subsequent invocations no longer do rediscovery.
9. In my case I also received NT_STATUS_IO_TIMEOUT, and hence rediscovery was not attempted.

Based on the above, the following changes get both the 4.17.12 and the 4.20.8 code past this phase.

diff --git a/source3/winbindd/wb_sids2xids.c b/source3/winbindd/wb_sids2xids.c
-	if (NT_STATUS_EQUAL(result, NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND) &&
+	if ((NT_STATUS_EQUAL(result, NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND) ||
+	     NT_STATUS_EQUAL(result, NT_STATUS_HOST_UNREACHABLE) ||
+	     NT_STATUS_EQUAL(status, NT_STATUS_IO_TIMEOUT)) &&
...
 		subreq = wb_dsgetdcname_send(
 			state, state->ev, d->name.string, NULL, NULL,
-			DS_RETURN_DNS_NAME);
+			DS_RETURN_DNS_NAME | DS_FORCE_REDISCOVERY);
diff --git a/source3/winbindd/wb_xids2sids.c b/source3/winbindd/wb_xids2sids.c
-	if (NT_STATUS_EQUAL(result, NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND) &&
+	if ((NT_STATUS_EQUAL(result, NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND) ||
+	     NT_STATUS_EQUAL(result, NT_STATUS_HOST_UNREACHABLE) ||
+	     NT_STATUS_EQUAL(status, NT_STATUS_IO_TIMEOUT)) &&
...
 		subreq = wb_dsgetdcname_send(
 			state, state->ev, state->dom_map->name, NULL, NULL,
-			DS_RETURN_DNS_NAME);
+			DS_RETURN_DNS_NAME | DS_FORCE_REDISCOVERY);

With these changes, 4.17.12 works fine: it rediscovers a new DC and uses it. In 4.20.8, something else still causes DsGetDcName to return NT_STATUS_NO_LOGON_SERVERS continuously. I am checking whether that is related to a cache or to some other problem.

Any suggestions on the above changes, or help with fixing this problem, would be appreciated.

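The effect of the two changes in the diff can be sketched in isolation as below. This is a stand-alone model, not Samba code: enum toy_status and the retry_before, retry_after, forces_rediscovery and with_forced_rediscovery helpers are hypothetical names. The two DS_* values are the ones I understand the DsGetDcName flag definitions to use (0x40000000 matches the "flags" value seen in the logs); please verify them against the Samba headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as I understand them; verify against the Samba headers. */
#define DS_FORCE_REDISCOVERY 0x00000001u
#define DS_RETURN_DNS_NAME   0x40000000u

/* Hypothetical stand-ins for the NTSTATUS codes discussed in this bug. */
enum toy_status {
    TOY_OK,
    TOY_DOMAIN_CONTROLLER_NOT_FOUND,
    TOY_HOST_UNREACHABLE,
    TOY_IO_TIMEOUT,
    TOY_OTHER_ERROR
};

/* Condition before the change: only one status triggers a DC lookup. */
static bool retry_before(enum toy_status result, bool tried_dclookup)
{
    return result == TOY_DOMAIN_CONTROLLER_NOT_FOUND && !tried_dclookup;
}

/* Condition after the change: unreachable host and timeout also trigger it. */
static bool retry_after(enum toy_status result, bool tried_dclookup)
{
    bool dc_gone = result == TOY_DOMAIN_CONTROLLER_NOT_FOUND ||
                   result == TOY_HOST_UNREACHABLE ||
                   result == TOY_IO_TIMEOUT;
    return dc_gone && !tried_dclookup;
}

/* True if the flag word asks the locator to bypass its cache. */
static bool forces_rediscovery(uint32_t flags)
{
    return (flags & DS_FORCE_REDISCOVERY) != 0;
}

/* Add the force-rediscovery bit to an existing flag word. */
static uint32_t with_forced_rediscovery(uint32_t flags)
{
    uint32_t out = flags | DS_FORCE_REDISCOVERY;
    assert(forces_rediscovery(out));
    return out;
}
```

For example, retry_before(TOY_IO_TIMEOUT, false) is false while retry_after(TOY_IO_TIMEOUT, false) is true, which is the case I hit in step 9. Likewise, forces_rediscovery(0x40000000u) is false, and with_forced_rediscovery(DS_RETURN_DNS_NAME) gives 0x40000001.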
Created attachment 18631
Terminal output and three scenario winbind logs

Based on offline discussion, I am attaching complete logs for three scenarios.

Note: This is not related to the Windows 2025 DC issues. It occurs with older DCs and with any codebase. It is about winbindd operations being able to shift to a new DC when the current DC is down and another DC is available.

In the attachment, search for "Scenario 1" to find scenario 1, then for log.winbindd.scenario1 and log.wb-ADPROTOCOLX.scenario1 to find its logs. Similarly, search for "Scenario 2" and "Scenario 3".

Scenario 1: 4.19.9 based code. When the DC is down, wbinfo -P shifts to the new DC, but the getpwuid operation keeps trying the old DC and fails.

Scenario 2: 4.19.9 based code + fixes. [Typo in the previous update: in the NT_STATUS_IO_TIMEOUT check, read "status" as "result".] The fixes applied are listed under this scenario in the file. With these fixes, the getpwuid operation invokes rediscovery, shifts to the new DC and works fine.

Scenario 3: 4.20.8 based code. Rediscovery is invoked in getpwuid, but GetDcName keeps failing with NT_STATUS_NO_LOGON_SERVERS and hence never recovers.

Note: 4.17.12 based code also works fine with the fixes.

I need help confirming the changes and with the Scenario 3 failure. I expect the same behavior with 4.21 and 4.22 as well.