Bug 10614 - a few times per day nobody can login: Windows cannot obtain the domain controller name
Status: NEW
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services
Hardware: All Linux
Importance: P5 critical
Target Milestone: ---
Assigned To: Samba QA Contact
QA Contact: Samba QA Contact
Depends on:
Reported: 2014-05-17 08:38 UTC by Bram Matthys
Modified: 2017-11-21 16:36 UTC

See Also:

smb.conf (8.40 KB, text/plain)
2014-05-17 09:01 UTC, Bram Matthys

Description Bram Matthys 2014-05-17 08:38:24 UTC
A few times per day - seemingly when a lot of users try to log in - we are having problems where tens of people/computers suddenly can't log in. The issue seems to resolve itself after 5-20 minutes without intervention. Existing connections (other computers where users are already logged on) are seemingly unaffected.

This problem started when we migrated from Samba 3.x to Samba 4.1 two weeks ago. It happens every day at roughly the same time when a lot of people turn on their computer or log in, and also at a few other moments during the day which may or may not vary.

These are all Windows XP clients.

Event log shows two variants of the error:
Windows cannot obtain the domain controller name for your computer network. (The specified domain either does not exist or could not be contacted. ). Group Policy processing aborted. 
Windows cannot copy file \\Green\profiles\gebruiker\Templates to location C:\Documents and Settings\llxxxxyyyy\Templates. Possible causes of this error include network problems or insufficient security rights. If this problem persists, contact your network administrator.

DETAIL - The specified network name is no longer available. 
^ Naturally the exact file/directory differs, it just seems the DC has suddenly 'gone away'.

I have ICINGA running which connects to this server every two minutes by IP (\\192.168.2.X\sharename), it has never raised an alarm. Same for the LDAP check. I also have a DNS check (although it resolves just google.com), never raised an alarm either. So the problem seems to be related to 'finding the DC'.

Is this a known issue or do you have any suggestions as to how to debug (or even fix) this?

How can I emulate this 'finding the DC' process? Is that done by WINS? DNS?
Is there a samba command line option to do this? So I can add a script to check for this e.g. every 30 seconds.

I have loglevel 3 logs available.

The load when all these users try to log in isn't very high. I currently have two CPUs and it seems neither one is at 100%. The only thing I noticed in the Munin graphs is that there's a spike of UDP connections when this error happens... which may be related to this problem (a cause or a result).

Any help would be appreciated as this is naturally a major problem for us.
Comment 1 Bram Matthys 2014-05-17 09:01:58 UTC
Created attachment 9947

Samba configuration file attached (smb.conf + included file).

I couldn't find a way to make attachments 'private' here so did not attach my loglevel 3 logs. I can e-mail the files on request (ask here, or mail syzop@vulnscan.org).
Comment 2 Bram Matthys 2014-05-20 15:28:41 UTC
My co-worker said that, on a computer where he couldn't log in due to this problem, he could log in locally as administrator and then browse the network (files) just fine.

I asked previously for a (Linux) command line tool to emulate finding the PDC so I could create an alert system / see when it happens. I have searched in 'net' myself but couldn't find one.

I just launched the following two commands in a batch script that loops every 5 seconds, and will check the results in 24 hours or so:
netdom query pdc /DOMAIN:XXXXXXX
netdom query pdc /DOMAIN:jnet.xxxxxxxxxxxx.nl

I also fired up a script on the Linux side to do "host -t srv _ldap._tcp.dc._msdcs.jnet.hermanjordan.nl" every X number of seconds.

I'll let you know the results.

Additionally, I started a thread on the samba mailing list called "Samba 4 + Windows XP very slow - especially noticeable with many files". http://marc.info/?l=samba&m=140059905627497&w=2
The thread describes slow performance with XP when I copy/read 1000 files of 10 KB each... which takes about half a minute on XP but only 3 seconds on Win7.
So far I've been dealing with it as a separate issue, but it may also very well be related.
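
For anyone who wants to reproduce the small-file measurement, here is a minimal sketch in Python. It is not the test used in the mailing list thread; the function name, the `root` parameter and the file naming are all hypothetical. Point `root` at a directory on the share (a mapped drive or mount point) to time 1000 files of 10 KB. Note that the read pass follows the write pass immediately, so client-side caching may flatter the read numbers.

```python
import os
import tempfile
import time


def small_file_benchmark(root, count=1000, size=10 * 1024):
    """Write, then read, `count` files of `size` bytes under `root`.

    Returns (write_seconds, read_seconds). `root` is a hypothetical
    parameter: any existing directory, e.g. a mounted share.
    """
    payload = b"x" * size
    workdir = tempfile.mkdtemp(prefix="smallfiles-", dir=root)
    paths = [os.path.join(workdir, f"f{i:04d}.bin") for i in range(count)]

    # Time the write pass.
    start = time.monotonic()
    for path in paths:
        with open(path, "wb") as fh:
            fh.write(payload)
    write_s = time.monotonic() - start

    # Time the read pass.
    start = time.monotonic()
    for path in paths:
        with open(path, "rb") as fh:
            fh.read()
    read_s = time.monotonic() - start

    # Clean up so repeated runs start from an empty directory.
    for path in paths:
        os.remove(path)
    os.rmdir(workdir)
    return write_s, read_s
```

Running it against the same share from an XP client and a Win7 client would give comparable per-pass timings.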
Comment 3 Bram Matthys 2014-05-21 09:36:22 UTC
Still seeing many of these on clients: "No Domain Controller is available for domain JORDANET due to the following:  There are currently no logon servers available to service the logon request"

On my two machines the logs indicate no problem finding the domain controller. These scripts run every 5 seconds so they should have caught something if it's a general problem.

Script 1: NETDOM QUERY PDC (every 5s). All is good:
Primary domain controller for the domain:
The command completed successfully.
Primary domain controller for the domain:
The command completed successfully.

Script 2: Similarly for DNS:
host -t srv _ldap._tcp.dc._msdcs.jnet.xxxxxxxxxxx.nl
host -t srv _kerberos._tcp.dc._msdcs.jnet.xxxxxxxxxxx.nl
host -t srv _gc._tcp.jnet.xxxxxxxxxxx.nl
host -t srv _kerberos._tcp.jnet.xxxxxxxxxxx.nl
host -t srv _kpasswd._tcp.jnet.xxxxxxxxxxx.nl
host -t srv _ldap._tcp.jnet.xxxxxxxxxxx.nl
All returned correct results (ran every 5 seconds for the past XX hours)
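
A minimal sketch of how such a DNS check could be scripted, in Python. This is not the script that was actually run; the function names are hypothetical, and it assumes the `host` utility is on the PATH and prints "has SRV record" for each answer. The lookup function is injectable so the logic can be exercised without a DNS server.

```python
import subprocess
import time
from datetime import datetime, timezone

# The six SRV labels queried above, without the domain suffix.
SRV_LABELS = [
    "_ldap._tcp.dc._msdcs",
    "_kerberos._tcp.dc._msdcs",
    "_gc._tcp",
    "_kerberos._tcp",
    "_kpasswd._tcp",
    "_ldap._tcp",
]


def srv_names(domain):
    """Return the fully qualified SRV names for a DNS domain."""
    return [f"{label}.{domain}" for label in SRV_LABELS]


def run_host(name):
    """Run `host -t srv NAME`; return (exit_code, stdout)."""
    try:
        proc = subprocess.run(
            ["host", "-t", "srv", name],
            capture_output=True, text=True, timeout=10,
        )
    except subprocess.TimeoutExpired:
        return 1, ""
    return proc.returncode, proc.stdout


def check_once(domain, lookup=run_host):
    """Look up every SRV name once; return the names that failed."""
    failed = []
    for name in srv_names(domain):
        code, output = lookup(name)
        if code != 0 or "has SRV record" not in output:
            failed.append(name)
    return failed


def poll(domain, interval=5.0, lookup=run_host):
    """Check forever; print a timestamped line for every failed round."""
    while True:
        failed = check_once(domain, lookup)
        if failed:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            print(f"{stamp} FAILED: {', '.join(failed)}", flush=True)
        time.sleep(interval)
```

Calling `poll("jnet.example.test")` (a placeholder domain) would log only the rounds in which at least one record could not be resolved, which makes a 5-20 minute outage easy to spot in the output.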

I'm out of ideas now..
Comment 4 Peter Eriksson 2017-11-21 16:02:49 UTC
This sounds suspiciously like a similar problem we are seeing. I call it the "10 hour problem" since it occurs at 10 hour intervals counted from when the smbd/winbindd daemons were started.

My _guess_ is that it is due to the (default) 10 hour lifetime of the Kerberos service tickets for the CIFS/<HOSTNAME> principal, and that for some reason all service stops for a minute or so when a ticket expires and has to be renewed.

We noticed that because we restart all our smbd processes at a fixed hour every day, and exactly 10 hours later we see this issue, and then again 10 hours after that. So we had to adjust the reboot time so that we at least avoid this issue during prime daytime hours...
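
The timing can be worked out with a few lines of Python. This is only an illustration of the arithmetic, assuming the 10 hour lifetime guessed above, a daily restart, and a hypothetical 04:00 restart time; it reads the second occurrence as falling 10 hours after the first.

```python
from datetime import datetime, timedelta

# Assumed ticket lifetime (the 10 hour default guessed above).
TICKET_LIFETIME = timedelta(hours=10)


def expiry_times(restart, horizon=timedelta(hours=24), lifetime=TICKET_LIFETIME):
    """Return every restart + n * lifetime (n >= 1) before restart + horizon."""
    times = []
    t = restart + lifetime
    while t < restart + horizon:
        times.append(t)
        t += lifetime
    return times


restart = datetime(2017, 11, 21, 4, 0)  # hypothetical 04:00 daily restart
for t in expiry_times(restart):
    print(t.strftime("%H:%M"))  # prints 14:00, then 00:00
```

With a 04:00 restart the first stall would land at 14:00, in the middle of the day; moving the restart to around 21:00 would push the two stalls to 07:00 and 17:00 instead.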

We set up a pretty aggressive testing system that measures the login time for connecting to our file servers (smbclient).
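
A rough sketch of what one such timing probe could look like in Python. This is not our actual test system; the function names, the server/share names and the authentication file path are hypothetical. It assumes `smbclient` is installed and uses its `-A` (authentication file) and `-c` (command string) options.

```python
import subprocess
import time


def smbclient_cmd(server, share, authfile):
    """Build an smbclient command that connects, lists the share root and exits."""
    return ["smbclient", f"//{server}/{share}", "-A", authfile, "-c", "ls"]


def timed_connect(cmd, run=subprocess.run, clock=time.monotonic, timeout=120):
    """Run `cmd` once; return (seconds_elapsed, succeeded).

    `run` and `clock` are injectable so the timing logic can be tested
    without a real server.
    """
    start = clock()
    try:
        proc = run(cmd, capture_output=True, text=True, timeout=timeout)
        ok = proc.returncode == 0
    except subprocess.TimeoutExpired:
        # A connection that hangs past the timeout counts as a failure.
        ok = False
    return clock() - start, ok
```

For example, `timed_connect(smbclient_cmd("fs1", "home", "/etc/samba/probe.auth"))` returns the wall-clock seconds for one connect-and-list cycle; logging that value every minute makes the multi-minute stalls stand out.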

The problem got worse when we set "winbind max domain connections" to a large number (20-100), so now we are trying a smaller number (5) and the problem seems to be smaller (but we still see it from time to time).

Our Samba servers have around 200-400 clients, with a pretty bursty connection pattern (students all arrive at the same time and login).
Comment 5 Peter Eriksson 2017-11-21 16:06:23 UTC
Another related problem we noticed was that a large number of clients had a tendency to all connect at the same time (timed wake-ups for antivirus scans cause the clients to attempt to reconnect many shares). This caused a burst of connections to the DNS servers, which in turn caused the firewall state tables to run out. We solved the DNS issue by increasing the firewall state table limits and also by getting the Windows client people to _not_ initiate antivirus scans during prime daytime hours :-)
Comment 6 Bram Matthys 2017-11-21 16:14:48 UTC
I didn't realize this bug report was still open. We only solved this by migrating our clients from Windows XP to Windows 7 a month later. Since then we have no XP machines anymore.

Peter: is this on Windows XP?
Comment 7 Peter Eriksson 2017-11-21 16:36:07 UTC
Sorry, no Windows XP clients here. All Windows 7 or newer on the PC side.

Probably not the same issue then. Our issue is on the Samba side, but from the users' perspective it looks like their logins take forever, since it takes minutes (if the client doesn't give up first) for the client to map the shares. We see the problem not only from the PC clients but also from "smbclient" test connections from Linux systems (and Macs).