Bug 15884 - winbindd does not work if a DC in remote site is unreachable
Summary: winbindd does not work if a DC in remote site is unreachable
Status: NEW
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: Winbind
Version: 4.22.3
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assignee: Samba QA Contact
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-07-09 03:06 UTC by Michael Tokarev
Modified: 2025-07-20 10:04 UTC
CC List: 3 users

See Also:


Attachments
winbindd -d10 logs (80.29 KB, application/gzip)
2025-07-09 03:06 UTC, Michael Tokarev
log.wb-TLS (266.34 KB, text/plain)
2025-07-09 03:15 UTC, Michael Tokarev
log.wb-TSRV (6.72 KB, text/plain)
2025-07-09 03:16 UTC, Michael Tokarev
log.winbindd (383.33 KB, text/plain)
2025-07-09 03:16 UTC, Michael Tokarev
log.winbindd-idmap (420.10 KB, text/plain)
2025-07-09 03:17 UTC, Michael Tokarev
smb.conf (1.25 KB, text/plain)
2025-07-09 09:20 UTC, Michael Tokarev
winbindd-logs-with-88572cc8f6.tar.gz (49.76 KB, application/gzip)
2025-07-09 20:37 UTC, Michael Tokarev
winbindd-logs-with-88572cc8f6-2.tar.gz (52.09 KB, application/gzip)
2025-07-09 20:57 UTC, Michael Tokarev

Description Michael Tokarev 2025-07-09 03:06:14 UTC
Created attachment 18662
winbindd -d10 logs

There's a long(ish) description of the problem we're seeing at https://lists.samba.org/archive/samba/2025-June/251711.html .

In short: with two sites and two Samba DCs (one at each site), winbindd stops working if the *remote* DC becomes unreachable. It ignores replies from the local DC and keeps waiting for the first reply from the remote one; only after that reply arrives does it finally start working.
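
As an illustration of the behavior expected here (this is not winbindd's actual DC-selection code), the sketch below pings several DCs concurrently and proceeds with whichever answers first, so an unreachable remote DC cannot block a reachable local one. The DC names come from this report; `ping_dc`, `first_responder`, `pick_dc` and the simulated delays are hypothetical.

```python
import asyncio


async def ping_dc(name: str, delay: float, reachable: bool) -> str:
    """Simulate a DC ping: answer after `delay` seconds, or never answer."""
    if not reachable:
        await asyncio.Event().wait()  # an unreachable DC never replies
    await asyncio.sleep(delay)
    return name


async def first_responder(dcs, timeout: float = 1.0) -> str:
    """Return the name of the first DC to answer and cancel the other pings."""
    tasks = [asyncio.ensure_future(ping_dc(*dc)) for dc in dcs]
    try:
        done, _pending = await asyncio.wait(
            tasks, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
        )
        if not done:
            raise TimeoutError("no DC answered in time")
        return next(iter(done)).result()
    finally:
        # Stop waiting on the DCs that have not answered.
        for task in tasks:
            task.cancel()


def pick_dc(dcs, timeout: float = 1.0) -> str:
    """Synchronous wrapper around first_responder()."""
    return asyncio.run(first_responder(dcs, timeout))
```

With `[("svdcm", 0.01, True), ("svdcp", 0.0, False)]` as input, `pick_dc` returns as soon as the simulated local DC answers, regardless of the silent remote one.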

I'm attaching -d10 debug logs.  Here, there are two sites: MoscowOffice (local), with DC svdcm[192.168.177.8], and PereslavlOffice (remote), with DC svdcp[192.168.19.6].  The Samba AD domain is "TLS", and TSRV is the server where all of this is happening.

In the logs, I restarted winbindd with -d10, while the remote DC was unreachable, and made 2 queries for my user (`id mjt`).  There, each of the two queries took about a minute, and finally succeeded.

Next, I made the remote DC reachable again, and made another query, which succeeded almost immediately.

Note that winbindd does not cache a lot of information that (I think, anyway) it should cache. For example, it does not cache DNS records for the DCs despite their TTL, and it does not cache the results of user lookups: the second `id mjt` query took as long as the first one.  I think this is another bug (or several), but it stays hidden under the hood while everything is working.
Comment 1 Michael Tokarev 2025-07-09 03:13:53 UTC
There's another issue here: requests to the remote DC aren't simply timing out. Instead, I'm adding an unreachable route for that IP, so any packet destined for it errors out immediately.  Yet winbindd ignores these errors and waits for a reply (which, obviously, never comes).  I understand this is UDP, where error handling is difficult, but I'm mentioning it here for completeness.
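
For reference, a minimal sketch of how such errors can surface on UDP, assuming a Linux-like socket stack: when a UDP socket is connect()ed, the kernel reports route-lookup failures and ICMP errors on that socket (typically as EHOSTUNREACH, ENETUNREACH or ECONNREFUSED) instead of leaving the caller to wait for a timeout. The helper name `probe_udp` and its parameters are hypothetical, and this says nothing about how winbindd's own UDP code is structured.

```python
import errno
import socket


def probe_udp(host: str, port: int, payload: bytes = b"ping",
              timeout: float = 1.0) -> str:
    """Send one datagram on a connected UDP socket and classify the outcome."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            # connect() on UDP sends nothing, but it binds the socket to one
            # peer so that route failures and ICMP errors are reported here.
            sock.connect((host, port))
            sock.send(payload)
            sock.recv(4096)
            return "reply"
        except socket.timeout:
            return "timeout"
        except OSError as exc:
            if exc.errno in (errno.ECONNREFUSED, errno.EHOSTUNREACH,
                             errno.ENETUNREACH):
                return "unreachable"
            raise
```

A caller that gets "unreachable" back can move on to the next DC immediately, while "timeout" is the only case that has to wait out the full interval.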
Comment 2 Michael Tokarev 2025-07-09 03:15:48 UTC
Created attachment 18663
log.wb-TLS
Comment 3 Michael Tokarev 2025-07-09 03:16:10 UTC
Created attachment 18664
log.wb-TSRV
Comment 4 Michael Tokarev 2025-07-09 03:16:42 UTC
Created attachment 18665
log.winbindd
Comment 5 Michael Tokarev 2025-07-09 03:17:15 UTC
Created attachment 18666
log.winbindd-idmap
Comment 6 Ralph Böhme 2025-07-09 08:26:23 UTC
Can you please also attach your smb.conf? Thanks!
Comment 7 Ralph Böhme 2025-07-09 08:27:51 UTC
This might be <https://bugzilla.samba.org/show_bug.cgi?id=15881>. Can you try the fix from the PR?
Comment 8 Michael Tokarev 2025-07-09 09:20:39 UTC
Created attachment 18667
smb.conf
Comment 9 Michael Tokarev 2025-07-09 09:26:12 UTC
The smb.conf is rather simple, attached.

Meanwhile, I tried 4.22.3, and on top of that, the mentioned change (commit 88572cc8f629a737a1d5b33d5800f3692895233f from master).  Neither 4.22.3 nor 88572cc8f6 changed anything in the context of this bug report - winbindd is still non-operational without the DC from the remote site.

I did some more experiments by creating another test Linux server and joining it to the same domain with a similar config.  This one works a bit better: it does not show the same huge initial delay when resolving names.  But the absence of the other site is still very much visible.  I don't know what makes the difference here, or why our main server basically stops working while a test server with a similar config is slow but still works.
Comment 10 Guenther Deschner 2025-07-09 16:01:43 UTC
Can you please upload new logs from the issue with 4.22.3 and commit 88572cc8f629a737a1d5b33d5800f3692895233f included?
Comment 11 Michael Tokarev 2025-07-09 20:37:15 UTC
Created attachment 18668
winbindd-logs-with-88572cc8f6.tar.gz
Comment 12 Michael Tokarev 2025-07-09 20:39:59 UTC
I was wrong to say that 88572cc8f6 changed nothing.  It changed quite a lot; I was just a bit too impatient.  Now, with 88572cc8f6 applied, the first lookup after a (re)start still takes a somewhat long time, but not as long as before, and the second and subsequent lookups work instantly.  So the change actually made a huge difference; I just needed some more patience to see it.
Comment 13 Ralph Böhme 2025-07-09 20:50:38 UTC
(In reply to Michael Tokarev from comment #12)
Have you run `net cache flush` between any changes? Otherwise you will have all sorts of negative cache entries that will pollute the test environment for subsequent tests.
Comment 14 Michael Tokarev 2025-07-09 20:56:55 UTC
(In reply to Ralph Böhme from comment #13)
`net cache flush` does not seem to change the visible behavior, at least in my current testing/setup.  I'm uploading another log taken after `net cache flush`:

tsrv# echo WINBINDOPTIONS=-d10 > /etc/default/samba
tsrv# net cache flush;rm /var/log/samba/log.w*; systemctl restart winbind
tsrv:/var/log/samba# time id mjt
uid=1000(mjt) gid=1000(mjt) groups=1000(mjt),50(staff),100(users),...

real	0m9.397s
user	0m0.000s
sys	0m0.003s
tsrv:# time id mjt
uid=1000(mjt) gid=1000(mjt) groups=1000(mjt),50(staff),100(users),...

real	0m0.018s
user	0m0.000s
sys	0m0.003s
Comment 15 Michael Tokarev 2025-07-09 20:57:36 UTC
Created attachment 18669
winbindd-logs-with-88572cc8f6-2.tar.gz
Comment 16 Michael Tokarev 2025-07-15 05:03:25 UTC
So, is 88572cc8f629a737 the proper fix for this issue, or is some additional fixing needed?
Comment 17 Ralph Böhme 2025-07-20 10:04:03 UTC
(In reply to Michael Tokarev from comment #16)
There might be more problems, see bug 15844.