Created attachment 18662 [details] winbindd -d10 logs

There's a long(ish) description of the problem we're seeing at https://lists.samba.org/archive/samba/2025-June/251711.html .

In short: with two sites and two Samba DCs (one on each site), winbindd stops working if the *remote* DC becomes unreachable. It ignores replies from the local DC while waiting for the first reply from the remote one, and only after that does it finally start working.

I'm attaching -d10 debug logs. There are two sites here: MoscowOffice (local), with DC svdcm[192.168.177.8], and PereslavlOffice (remote), with DC svdcp[192.168.19.6]. The Samba AD domain is "TLS", and TSRV is the server where all this is happening.

For the logs, I restarted winbindd with -d10 while the remote DC was unreachable, and made 2 queries for my user (`id mjt`). Each of the two queries took about a minute and finally succeeded. Next, I made the remote DC reachable again and made another query, which succeeded almost immediately.

Note that winbindd does not cache a lot of information which it should cache (I think, anyway). For example, it does not cache DNS records for the DCs despite the TTL, and it does not cache results of user lookups, as shown by the second `id mjt` query, which took as much time as the first one. I think this is another bug (or several), but it is hidden under the hood while everything is working.
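To make the expected behaviour concrete, here is a minimal sketch of "use the first reply from any DC". It is only an illustration and not how winbindd is implemented: `query_dc` and `first_reply` are hypothetical names, and the DCs are simulated with timers instead of real network queries.

```python
import asyncio


async def query_dc(name, delay, reachable):
    """Simulated DC query (hypothetical helper, no real network traffic)."""
    if not reachable:
        # Model a reply that never arrives: wait on an event nobody sets.
        await asyncio.Event().wait()
    await asyncio.sleep(delay)
    return name


async def first_reply(dcs, timeout):
    """Query all DCs in parallel; return the name of the first to answer."""
    tasks = [asyncio.ensure_future(query_dc(*dc)) for dc in dcs]
    try:
        done, _pending = await asyncio.wait(
            tasks, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
        )
        # None means no DC answered within the timeout.
        return done.pop().result() if done else None
    finally:
        # Do not keep waiting for the DCs that have not answered.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    dcs = [("svdcm", 0.01, True), ("svdcp", 0.01, False)]
    print(asyncio.run(first_reply(dcs, timeout=1.0)))
```

With this approach the unreachable remote DC costs nothing as long as the local one answers; the timeout only matters when no DC replies at all.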
There's another issue here: requests to the remote DC aren't simply timing out. To make it unreachable, I'm adding an unreachable route for that IP, so any packet destined for it errors out immediately. Yet winbindd ignores these errors and waits for the reply (which, obviously, does not come). I understand this is UDP, where error handling is difficult, but I'm mentioning it here for completeness.
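For reference, a small sketch of how such errors can be observed from user space. It assumes Linux-like behaviour, where a connected UDP socket reports routing and ICMP errors as `OSError` on `connect()`, `send()`, or a later `recv()`; `udp_probe` is a hypothetical helper, not anything from Samba, and other platforms may just time out.

```python
import socket


def udp_probe(host, port, payload=b"ping", timeout=1.0):
    """Send one datagram and report 'reply', 'timeout', or 'error: ...'."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            # connect() on UDP sends nothing, but it lets the kernel
            # deliver asynchronous errors for this peer to this socket.
            sock.connect((host, port))
            sock.send(payload)
            sock.recv(4096)
            return "reply"
        except socket.timeout:
            # Must be caught before OSError, which it subclasses.
            return "timeout"
        except OSError as exc:
            # E.g. no route to the peer, or an ICMP error came back.
            return "error: " + (exc.strerror or str(exc))


if __name__ == "__main__":
    # Address of the remote DC from this report; the port is an assumption.
    print(udp_probe("192.168.19.6", 389))
```

A caller that gets an `error:` result can move on to the next DC immediately instead of waiting out the full timeout.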
Created attachment 18663 [details] log.wb-TLS
Created attachment 18664 [details] log.wb-TSRV
Created attachment 18665 [details] log.winbindd
Created attachment 18666 [details] log.winbindd-idmap
Can you please also attach your smb.conf? Thanks!
This might be <https://bugzilla.samba.org/show_bug.cgi?id=15881>. Can you try the fix from the PR?
Created attachment 18667 [details] smb.conf
The smb.conf is rather simple, attached.

Meanwhile, I tried 4.22.3, and on top of that, the mentioned change (commit 88572cc8f629a737a1d5b33d5800f3692895233f from master). Neither 4.22.3 nor 88572cc8f6 changed anything in the context of this bug report: winbindd is still non-operational without the DC from the remote site.

I did some more experiments by creating another test Linux server and joining it to the same domain, with a similar config. This one works a bit better: it does not show the same huge initial delay when resolving names. But the absence of the other site is still very much visible. I don't know what makes the difference here, or why our main server basically stops working while a test server with a similar config, although slow, still works somehow.
Can you please upload new logs of the issue with 4.22.3 and commit 88572cc8f629a737a1d5b33d5800f3692895233f included?
Created attachment 18668 [details] winbindd-logs-with-88572cc8f6.tar.gz
I was wrong in saying 88572cc8f6 changed nothing. It changed quite a lot; I was just a bit too impatient. Now, with 88572cc8f6 applied, the first lookup after a (re)start takes a somewhat long time, but not as long as before, and the second and subsequent lookups work instantly. So the change actually made a huge difference; I just needed some more patience to see it.
(In reply to Michael Tokarev from comment #12) Have you run `net cache flush` between any changes? Otherwise you will have all sorts of negative cache entries that will pollute the test environment for subsequent tests.
(In reply to Ralph Böhme from comment #13)
It looks like `net cache flush` does not change the visible behavior, at least in my current testing/setup. I'm uploading another log after `net cache flush`:

tsrv# echo WINBINDOPTIONS=-d10 > /etc/default/samba
tsrv# net cache flush;rm /var/log/samba/log.w*; systemctl restart winbind
tsrv:/var/log/samba# time id mjt
uid=1000(mjt) gid=1000(mjt) groups=1000(mjt),50(staff),100(users),...
real 0m9.397s
user 0m0.000s
sys 0m0.003s
tsrv:# time id mjt
uid=1000(mjt) gid=1000(mjt) groups=1000(mjt),50(staff),100(users),...
real 0m0.018s
user 0m0.000s
sys 0m0.003s
Created attachment 18669 [details] winbindd-logs-with-88572cc8f6-2.tar.gz
So, is 88572cc8f629a737 the proper fix for this issue, or is some additional fixing needed?
(In reply to Michael Tokarev from comment #16) There might be more problems, see bug 15844.