Bug 15844 - getpwuid does not shift to new DC when current DC is down
Summary: getpwuid does not shift to new DC when current DC is down
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: Winbind
Version: 4.20.8
Hardware: All All
Importance: P5 major
Target Milestone: ---
Assignee: Jule Anger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-04-04 07:32 UTC by Subba Ramanna Bodda
Modified: 2025-09-09 15:40 UTC
CC List: 9 users

See Also:


Attachments
Terminal output and three scenario winbind logs (3.66 MB, text/plain)
2025-04-09 12:12 UTC, Subba Ramanna Bodda
no flags
patch from master for v4-23-test (21.75 KB, patch)
2025-08-14 14:56 UTC, Guenther Deschner
slow: review+
gd: review? (metze)
patch from master for v4-22-test (21.75 KB, patch)
2025-08-14 15:02 UTC, Guenther Deschner
slow: review+
gd: review? (metze)
patch from master for v4-21-test (12.59 KB, patch)
2025-08-19 13:17 UTC, Guenther Deschner
slow: review+

Description Subba Ramanna Bodda 2025-04-04 07:32:46 UTC
Hi Team, when a DC goes down, I see that PingDc moves to a new DC without trouble, but getpwuid, getgrgid and similar calls keep trying the failing DC and do not shift to the new DC. This sometimes rectifies itself once the DCINFO cache entry expires. Details are below, with a sample scenario.

1. Two nodes are in a Samba cluster (smbd, winbindd, ctdbd), joined to a domain with two Active Directory DCs (in my experiment the AD DCs are Windows based, not Samba source4 based).

2. winbindd on one node is currently using windc1, the first DC. Both DCs, windc1 and windc2, are currently up. As long as windc1 is in the internal cache or the cache tdb files, this node continues to use it.

3. We shut down the first DC, windc1, now.

4. wbinfo -P might fail or hang for some time.
checking the NETLOGON for domain[ADPROTOCOLX] dc connection to "" failed
failed to call wbcPingDc: WBC_ERR_WINBIND_NOT_AVAILABLE

5. The key in gencache.tdb changes to the new DC windc2, as below.
Key: CURRENT_DCNAME/ADPROTOCOLX  Timeout: Mon Jan 18 22:14:07 2038       Value: 10.XX.1.YY windc2.adprotocolx.com

6. After this, wbinfo -P starts working. But getpwuid-based calls, such as a simple C program or getent passwd, still use the older DC and keep failing indefinitely.

7. In the winbind logs, wbint_UnixIDs2Sids: struct wbint_UnixIDs2Sids has NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND, or sometimes NULL, as its result.

8. We see DsGetDcName issued. wbint_DsGetDcName: struct wbint_DsGetDcName prints dc_info as NULL or sometimes the older DC name.

Possible reasons I saw are as below:
====================================
1. DsGetDcName is sometimes issued too early; once it gets NULL or the older DC, it is never tried again.
2. If the key DCINFO/ADPROTOCOLX.COM exists and its cache entry has not yet expired, no fresh request is issued.
3. The PingDc request seems to use CURRENT_DCNAME and functions correctly, but getpwuid and other calls use the dcname variable or the DCINFO entry of gencache.tdb.
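
To make the two gencache keys in the reasons above concrete, here is a minimal sketch of how such keys could be formed. build_cache_key is a hypothetical helper written for illustration, not a Samba function; it only mimics the upper-casing that the key names in this report show, and does not touch gencache.tdb.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Samba): write "<PREFIX>/<DOMAIN>" into
 * out and upper-case the whole key. Returns 0 on success, -1 if the key
 * does not fit into out.
 */
int build_cache_key(char *out, size_t outlen,
                    const char *prefix, const char *domain)
{
        int n = snprintf(out, outlen, "%s/%s", prefix, domain);

        if (n < 0 || (size_t)n >= outlen) {
                return -1;
        }
        for (char *p = out; *p != '\0'; p++) {
                *p = (char)toupper((unsigned char)*p);
        }
        return 0;
}
```

With the names from this report, build_cache_key(key, sizeof(key), "CURRENT_DCNAME", "ADPROTOCOLX") and build_cache_key(key, sizeof(key), "DCINFO", "adprotocolx.com") give the two separate keys. Note that in this report the first key carries the short domain name and the second the DNS domain name, so updating one entry does not touch the other.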

Note: 
======
In 4.17 code, when a DC is down, deleting the DCINFO entry in gencache.tdb after the CURRENT_DCNAME write rectified the problem by refreshing the info. But there are cases where DCINFO can be NULL; is some internal cache being used then?

Code changes that made it work in 4.17 are as below.

source3/winbindd/winbindd_cm.c

        /*
         * Much as I hate global state, this seems to be the point
         * where we can be certain that we have a proper connection to
         * a DC. wbinfo --dc-info needs that information, store it in
         * gencache with a looong timeout. This will need revisiting
         * once we start to connect to multiple DCs, wbcDcInfo is
         * already prepared for that.
         */
        store_current_dc_in_gencache(domain->name, domain->dcname,
                                     new_conn->cli);

        /* Construct the cache key for DCINFO and delete it */
        cache_key = talloc_asprintf_strupper_m(mem_ctx, "DCINFO/%s", domain->name);
        if (cache_key == NULL) {
                talloc_destroy(mem_ctx);
                return NT_STATUS_NO_MEMORY;
        }

        D_DEBUG("Deleting DCINFO/%s key\n", domain->name);
        gencache_del(cache_key); 
        talloc_destroy(cache_key);       

But I do not know whether other cases exist.

Reproduction steps
==================

On a Samba node that is joined as a domain member, we can run the loop below.

while true; 
      do sleep 5; 
      echo "Clearing ids in cache"; 
      for i in `net cache list | grep -E "IDMAP|SID2NAME|NAME2SID" | awk '{print $2}'`; do net cache del $i; done; 

       echo ""; date; 

       echo "Displaying net cache list"; net cache list; tdbdump gencache.tdb; 

       echo "Pinging using wbinfo -P"; time wbinfo -P; 

       echo "Issuing getent passwd"; time getent passwd 10001; echo ""; 
done

In the middle of this, we can shut down a DC or make it problematic.
Comment 1 Subba Ramanna Bodda 2025-04-06 23:38:03 UTC
This is the sequence found based on further checking.

1. Two DCs up - initial setup

2. One DC, the one being used by a Samba node, goes down.

3. On the node that uses this DC, wbinfo -P invokes wbint_PingDc on the parent winbindd.

4. The child winbindd invokes PingDc. It detects that the current DC is not available and invokes cm_open_connection, which looks for a new DC, checks that it is non-NULL, writes CURRENT_DCNAME into gencache.tdb and returns. Subsequent invocations in the child winbindd seem to use the stored variables.

5. A getent-type command invokes GETPWUID on the winbindd parent.

6. That in turn issues UnixIDs2Sids; the winbindd child tries the current, down DC and returns NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND, NT_STATUS_HOST_UNREACHABLE or NT_STATUS_IO_TIMEOUT.

7. This invokes the rediscovery routine in the code, as below.
if (NT_STATUS_EQUAL(result, NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND) && !state->tried_dclookup) {
...
subreq = wb_dsgetdcname_send(state, state->ev, state->dom_map->name, NULL, NULL, DS_RETURN_DNS_NAME);
...
}

8. But rediscovery is attempted only if the returned status was NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND. I also see that the only flag passed is DS_RETURN_DNS_NAME, so quite possibly no rediscovery is done because DS_FORCE_REDISCOVERY is missing? "flags" in the returned wbint_DsGetDcName structure showed 0x40000000. Also, a wrong DC can be momentarily available while going down or becoming free. In both cases no rediscovery happens and a wrong DC can be found. Once DCINFO is written to gencache.tdb (more relevant to 4.17.12), subsequent invocations no longer do rediscovery.

9. In my case I also received NT_STATUS_IO_TIMEOUT, and hence rediscovery was not attempted.
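
The flag observation in step 8 can be checked with a small bit test. This is only a sketch: the two constants are redefined locally with a SKETCH_ prefix, using the values documented for the DsGetDcName flags as far as I know them (0x00000001 for DS_FORCE_REDISCOVERY, 0x40000000 for DS_RETURN_DNS_NAME), and forces_rediscovery is a hypothetical helper, not a Samba function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the DsGetDcName flag bits (assumed values). */
#define SKETCH_DS_FORCE_REDISCOVERY 0x00000001u
#define SKETCH_DS_RETURN_DNS_NAME   0x40000000u

/* Hypothetical helper: true if the force-rediscovery bit is set in flags. */
bool forces_rediscovery(uint32_t flags)
{
        return (flags & SKETCH_DS_FORCE_REDISCOVERY) != 0;
}
```

Under these assumptions, forces_rediscovery(0x40000000) is false, which matches the suspicion that the logged request asked only for a DNS name, while forces_rediscovery(SKETCH_DS_RETURN_DNS_NAME | SKETCH_DS_FORCE_REDISCOVERY) is true.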

Based on the above, the following changes get both the 4.17.12 and the 4.20.8 code past this phase.

diff --git a/source3/winbindd/wb_sids2xids.c b/source3/winbindd/wb_sids2xids.c
-       if (NT_STATUS_EQUAL(result, NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND) &&
+       if ((NT_STATUS_EQUAL(result, NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND) ||
+           NT_STATUS_EQUAL(result, NT_STATUS_HOST_UNREACHABLE) ||
+           NT_STATUS_EQUAL(status, NT_STATUS_IO_TIMEOUT)) &&
...
                subreq = wb_dsgetdcname_send(
                        state, state->ev, d->name.string, NULL, NULL,
-                       DS_RETURN_DNS_NAME);
+                       DS_RETURN_DNS_NAME | DS_FORCE_REDISCOVERY);


diff --git a/source3/winbindd/wb_xids2sids.c b/source3/winbindd/wb_xids2sids.c
-       if (NT_STATUS_EQUAL(result, NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND) &&
+       if ((NT_STATUS_EQUAL(result, NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND) ||
+           NT_STATUS_EQUAL(result, NT_STATUS_HOST_UNREACHABLE) ||
+           NT_STATUS_EQUAL(status, NT_STATUS_IO_TIMEOUT)) &&
...
                subreq = wb_dsgetdcname_send(
                        state, state->ev, state->dom_map->name, NULL, NULL,
-                       DS_RETURN_DNS_NAME);
+                       DS_RETURN_DNS_NAME | DS_FORCE_REDISCOVERY);


With these changes, 4.17.12 works fine in rediscovering a new DC and using it. In 4.20.8, something else still causes DsGetDcName to get NT_STATUS_NO_LOGON_SERVERS continuously. I am checking whether it is related to a cache or to other problems.

Any suggestions on the above changes, or help with fixing this problem, would be appreciated.
Comment 2 Subba Ramanna Bodda 2025-04-09 12:12:54 UTC
Created attachment 18631 [details]
Terminal output and three scenario winbind logs

Based on offline discussion, attaching complete logs for three scenarios. Note: this is not related to the Windows 2025 DC issues. It occurs with older DCs and any codebase. It is about winbindd operations being able to shift to a new DC when the current DC is down and another DC is available.

Search for terms such as "Scenario 1" for scenario 1, then log.winbindd.scenario1 and log.wb-ADPROTOCOLX.scenario1 for the logs for that. Similarly, we can search for "Scenario 2" and "Scenario 3".

Scenario 1: 4.19.9 based code - When a DC goes down, wbinfo -P shifts to the new DC, but the getpwuid operation keeps trying the old DC and fails.

Scenario 2: 4.19.9 based code + fixes - [Typo in previous update - read status as result in the NT_STATUS_IO_TIMEOUT check]. The fixes applied are listed under this scenario in the file. With these fixes, the getpwuid operation invokes rediscovery, shifts to the new DC and works fine.

Scenario 3: 4.20.8 based code - Rediscovery is invoked in getpwuid, but GetDcName keeps failing with NT_STATUS_NO_LOGON_SERVERS, so it never recovers.

Note: 4.17.12 based code also works fine with fixes.

Need help with confirming the changes and with the Scenario 3 failure. I expect the same behavior with 4.21 and 4.22 as well.
Comment 3 Ralph Böhme 2025-07-20 10:00:34 UTC
Is this using the idmap "ad" backend? Does it affect the primary domain, trusted domains or both? Please share your smb.conf.

Maybe the fixes for bug 15881 also fix this one? Can you check?
Comment 4 Ralph Böhme 2025-07-21 05:33:55 UTC
(In reply to Ralph Böhme from comment #3)
..plus the recent fix for bug 15876.
Comment 5 Subba Ramanna Bodda 2025-07-21 05:54:36 UTC
No, those two bugs might not fix this. The problem here is that source3/winbindd/wb_queryuser.c, source3/winbindd/wb_sids2xids.c and source3/winbindd/wb_xids2sids.c check only for NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND from the DC before doing the rediscovery. The DC does not always return that status. We should also add NT_STATUS_NO_LOGON_SERVERS, NT_STATUS_HOST_UNREACHABLE and NT_STATUS_IO_TIMEOUT, so that rediscovery can happen when a DC is down or extremely slow.
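
The proposed check can be sketched as a single predicate shared by all three call sites. This is an illustration only: enum sketch_status is a stand-in with made-up values, not the real NTSTATUS type, and both function names are hypothetical rather than taken from the Samba tree.

```c
#include <stdbool.h>

/* Stand-in status codes for illustration; not real NTSTATUS values. */
enum sketch_status {
        SKETCH_OK,
        SKETCH_DOMAIN_CONTROLLER_NOT_FOUND,
        SKETCH_NO_LOGON_SERVERS,
        SKETCH_HOST_UNREACHABLE,
        SKETCH_IO_TIMEOUT,
        SKETCH_ACCESS_DENIED
};

/* True for the statuses that suggest the current DC is down or too slow. */
bool dc_looks_unreachable(enum sketch_status status)
{
        switch (status) {
        case SKETCH_DOMAIN_CONTROLLER_NOT_FOUND:
        case SKETCH_NO_LOGON_SERVERS:
        case SKETCH_HOST_UNREACHABLE:
        case SKETCH_IO_TIMEOUT:
                return true;
        default:
                return false;
        }
}

/* Retry with a DC lookup at most once per request. */
bool should_retry_with_dclookup(enum sketch_status status, bool tried_dclookup)
{
        return dc_looks_unreachable(status) && !tried_dclookup;
}
```

Keeping the status list in one place would avoid the three files drifting apart, and the tried_dclookup guard keeps the retry from looping when the lookup itself keeps failing.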
Comment 6 Ralph Böhme 2025-07-24 14:54:16 UTC
(In reply to Subba Ramanna Bodda from comment #5)
There are several ways to skin a cat, but we should likely at least also fill the negative connection cache and also handle the case of a stale TCP connection, where the kernel only notifies us of a dead peer if we have unacked data, and then only after up to multiple minutes. For this we must set a timer on the tldap search request.
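
The two ideas above, a negative connection cache and a deadline on an outstanding request, can be sketched generically as follows. This is not Samba's negative connection cache and not the tldap API; every name, the fixed slot count and the explicit "now" parameter are assumptions made to keep the example small and deterministic.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define NEG_CACHE_SLOTS 8
#define NEG_CACHE_NAME_LEN 64

struct neg_entry {
        char dc[NEG_CACHE_NAME_LEN];
        time_t expires;         /* entry is ignored once now >= expires */
};

struct neg_cache {
        struct neg_entry slots[NEG_CACHE_SLOTS];
};

/* Remember that dc failed; reuse its slot or evict the oldest entry. */
void neg_cache_add(struct neg_cache *c, const char *dc, time_t now, time_t ttl)
{
        struct neg_entry *victim = &c->slots[0];

        for (int i = 0; i < NEG_CACHE_SLOTS; i++) {
                struct neg_entry *e = &c->slots[i];
                if (strcmp(e->dc, dc) == 0) {
                        victim = e;
                        break;
                }
                if (e->expires < victim->expires) {
                        victim = e;
                }
        }
        snprintf(victim->dc, sizeof(victim->dc), "%s", dc);
        victim->expires = now + ttl;
}

/* True if dc failed recently and should be skipped during DC selection. */
bool neg_cache_is_blocked(const struct neg_cache *c, const char *dc, time_t now)
{
        for (int i = 0; i < NEG_CACHE_SLOTS; i++) {
                const struct neg_entry *e = &c->slots[i];
                if (e->expires > now && strcmp(e->dc, dc) == 0) {
                        return true;
                }
        }
        return false;
}

/* True once a request started at "started" has run for "limit" seconds. */
bool request_timed_out(time_t started, time_t now, time_t limit)
{
        return now - started >= limit;
}
```

In this sketch a caller would zero the cache once, call neg_cache_add for a DC whose request hit the deadline, and consult neg_cache_is_blocked before picking that DC again, so a dead peer on a stale connection is skipped without waiting for the kernel to report it.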
Comment 7 Ralph Böhme 2025-07-25 17:07:03 UTC
I have WIP patches here: https://gitlab.com/samba-team/devel/samba/-/commits/slow/idmap_ad_ldap_timeout
Comment 8 Samba QA Contact 2025-08-13 19:32:03 UTC
This bug was referenced in samba master:

4e79fe13325385ef4fe37baeec8656c9b332de19
4d69ec473b7be763399c9787eda8e659a1582184
6643d1fb3375903e2857e5bff33b39a4562c5a4d
85dd55a5fef0049660126bdcd48abfa1c48da259
Comment 9 Guenther Deschner 2025-08-14 14:56:06 UTC
Created attachment 18686 [details]
patch from master for v4-23-test
Comment 10 Guenther Deschner 2025-08-14 15:02:15 UTC
Created attachment 18687 [details]
patch from master for v4-22-test
Comment 11 Ralph Böhme 2025-08-18 17:46:32 UTC
Same here: we also need a patch for 4.21. Does the patch for 4.22 apply?
Comment 12 Guenther Deschner 2025-08-19 13:17:40 UTC
Created attachment 18690 [details]
patch from master for v4-21-test

Slightly modified backport, please review carefully.
Comment 13 Guenther Deschner 2025-08-20 16:17:34 UTC
Jule, can you please pick the patches for v4-22-test and v4-23-test (until Ralph has had a chance to review the backport to v4-21-test)?
Comment 14 Ralph Böhme 2025-08-20 17:03:47 UTC
Comment on attachment 18690 [details]
patch from master for v4-21-test

Reassigning to Jule for inclusion in 4.21, 4.22 and 4.23
Comment 15 Jule Anger 2025-08-21 09:14:24 UTC
Pushed to autobuild-v4-{23,22,21}-test.
Comment 16 Samba QA Contact 2025-08-21 15:09:21 UTC
This bug was referenced in samba v4-22-test:

e4420f35c6732c3ab7d59fe10391715e00ee5170
5e685641fcc0b8aaaa2cb3acd4945dcaed9412d3
0a1f0d014175ba659af65018bf03d0eb16963e69
4725af8a4c3367e648708a8f4c50ddabfe6f4fa3
Comment 17 Samba QA Contact 2025-08-21 15:25:29 UTC
This bug was referenced in samba v4-22-stable (Release samba-4.22.4):

e4420f35c6732c3ab7d59fe10391715e00ee5170
5e685641fcc0b8aaaa2cb3acd4945dcaed9412d3
0a1f0d014175ba659af65018bf03d0eb16963e69
4725af8a4c3367e648708a8f4c50ddabfe6f4fa3
Comment 18 Samba QA Contact 2025-08-22 13:12:03 UTC
This bug was referenced in samba v4-23-test:

e03f233e92054a2f86017b2bdaed58a8466092cf
e412ceaa8e9087d7afa486aae8b8426d28debc62
0dc0860f3a615b45371889abeeee663d23f27ed0
8d50eb1938a26e7d8a81e56acc64365473b0e9fc
Comment 19 Samba QA Contact 2025-08-22 15:46:49 UTC
This bug was referenced in samba v4-23-stable (Release samba-4.23.0rc2):

e03f233e92054a2f86017b2bdaed58a8466092cf
e412ceaa8e9087d7afa486aae8b8426d28debc62
0dc0860f3a615b45371889abeeee663d23f27ed0
8d50eb1938a26e7d8a81e56acc64365473b0e9fc
Comment 20 Samba QA Contact 2025-08-22 17:04:11 UTC
This bug was referenced in samba v4-21-test:

236672028c1551395b26aa760db6830cbe320209
8910ba21bab66be6aa200b7b80fc888a34f65dbc
Comment 21 Jule Anger 2025-08-28 08:11:33 UTC
Closing out bug report.

Thanks!
Comment 22 Samba QA Contact 2025-09-09 15:40:15 UTC
This bug was referenced in samba v4-21-stable (Release samba-4.21.8):

236672028c1551395b26aa760db6830cbe320209
8910ba21bab66be6aa200b7b80fc888a34f65dbc