When joining a second DC with a large user database (100k users), we get an NT_STATUS_CONNECTION_DISCONNECTED message during the join process. Joining a 75k user domain is OK. This bug has been reproduced in the latest RC5. I have dumped the error message below.

RAM is at 4GiB. I will try to increase it to see if something changes, although I did not see any OOM killer activity during the test. I might try to increase the log level to get more info, but it would need to be increased only at the end, otherwise it produces tons and tons of log output during the initial replication of those 100k users...

How to reproduce: create an AD, add 100k user objects, join a second DC.

Partition[DC=ad,DC=test,DC=fr] objects[97784/100214] linked_values[0/23]
Partition[DC=ad,DC=test,DC=fr] objects[98186/100214] linked_values[0/23]
Partition[DC=ad,DC=test,DC=fr] objects[98588/100214] linked_values[0/23]
Partition[DC=ad,DC=test,DC=fr] objects[98990/100214] linked_values[0/23]
Partition[DC=ad,DC=test,DC=fr] objects[99392/100214] linked_values[0/23]
Partition[DC=ad,DC=test,DC=fr] objects[99794/100214] linked_values[0/23]
Partition[DC=ad,DC=test,DC=fr] objects[100196/100214] linked_values[0/23]
Partition[DC=ad,DC=test,DC=fr] objects[100311/100214] linked_values[23/23]
Done with always replicated NC (base, config, schema)
Replicating DC=DomainDnsZones,DC=ad,DC=test,DC=fr
Partition[DC=DomainDnsZones,DC=ad,DC=test,DC=fr] objects[40/40] linked_values[0/0]
Replicating DC=ForestDnsZones,DC=ad,DC=test,DC=fr
Partition[DC=ForestDnsZones,DC=ad,DC=test,DC=fr] objects[18/18] linked_values[0/0]
Exop on[CN=RID Manager$,CN=System,DC=ad,DC=test,DC=fr] objects[3] linked_values[0]
Committing SAM database
Adding 1 remote DNS records for SRVADS2.ad.test.fr
Join failed - cleaning up
ERROR(ldb): uncaught exception - LDAP client internal error: NT_STATUS_CONNECTION_DISCONNECTED
  File "/usr/local/samba/lib64/python2.7/site-packages/samba/netcmd/__init__.py", line 177, in _run
    return self.run(*args, **kwargs)
  File "/usr/local/samba/lib64/python2.7/site-packages/samba/netcmd/domain.py", line 716, in run
    backend_store=backend_store)
  File "/usr/local/samba/lib64/python2.7/site-packages/samba/join.py", line 1500, in join_DC
    ctx.do_join()
  File "/usr/local/samba/lib64/python2.7/site-packages/samba/join.py", line 1414, in do_join
    ctx.cleanup_old_join()
  File "/usr/local/samba/lib64/python2.7/site-packages/samba/join.py", line 272, in cleanup_old_join
    ctx.cleanup_old_accounts(force=force)
  File "/usr/local/samba/lib64/python2.7/site-packages/samba/join.py", line 218, in cleanup_old_accounts
    attrs=["msDS-krbTgtLink", "objectSID"])
Ironically, the new DC is probably functional. Anyway, what is happening is that the replication is failing to complete within the 15 minute (900 second) idle timeout on the DC. Then, after all that work, Samba tries to reuse the same LDAP connection it opened at the start but has not touched in the meantime. That connection has been dropped and just needs to be reconnected. This will not be hard to fix :-)
As a workaround, a chunk like:

    ctx.samdb = SamDB(url="ldap://%s" % ctx.server,
                      session_info=system_session(),
                      credentials=ctx.creds, lp=ctx.lp)

needs to be run before ctx.join_add_dns_records() in do_join(). This is all in python/samba/join.py. That will set up the LDAP connection again.
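For illustration only, the reconnect-before-reuse idea can be sketched generically in plain Python. This is not the code in join.py or the eventual fix: ReconnectingHandle, its connect factory and the clock parameter are hypothetical names invented for this sketch, and the 900 second default simply mirrors the idle timeout described in this report.

```python
import time


class ReconnectingHandle:
    """Hand out a connection, rebuilding it once it has sat idle too long.

    Hypothetical helper for illustration; not part of Samba.
    """

    def __init__(self, connect, idle_limit=900.0, clock=time.monotonic):
        self._connect = connect        # zero-argument factory (assumption)
        self._idle_limit = idle_limit  # seconds of idleness tolerated
        self._clock = clock            # injectable so the logic can be tested
        self._conn = connect()
        self._last_used = clock()
        self.reconnects = 0

    def get(self):
        """Return a connection, reconnecting first if it idled past the limit."""
        now = self._clock()
        if now - self._last_used >= self._idle_limit:
            # The server has probably dropped the old connection by now,
            # so build a fresh one instead of reusing the stale handle.
            self._conn = self._connect()
            self.reconnects += 1
        self._last_used = now
        return self._conn
```

In the join code the factory would presumably wrap a SamDB(...) call like the one in the workaround. A timer-based check is only a heuristic, since the server can drop a connection for other reasons; catching the disconnect error and retrying once would be a complementary approach.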
The other relevant option here is: --domain-critical-only. Its exact purpose is to replicate only the essential parts of the domain in the first pass and to replicate the rest later, and so avoid similar timeouts.
Hi Andrew,

With your patch, one can properly join the domain with 100k users, but it hits another timeout later with more users / groups. With --domain-critical-only, the join goes through properly with 100k users and 50k small groups.

Do you think the documentation should be modified to recommend --domain-critical-only for joining (whatever the size), or should samba-tool domain join hint at that parameter when the join fails with a timeout?

Thanks for your input on this issue,
Kévin and Denis
(In reply to Kevin Guerineau from comment #4) Do you have any additional logs of the second timeout? There might be another simple resolution. I think it would be preferable to get it working.
The timeout problem should now be fixed in latest master: https://git.samba.org/?p=samba.git;a=commit;h=d8ea16a3fb6d8ad9b738e8e71adc07079a292079 Do you want this fix backported to v4.9 (or any other releases)? Please could you add more details of the other timeout problem you mentioned.
Can we close this issue now? Joining a 100K user database works reliably for us now. Let me know if you are still seeing problems.
looks like it was fixed.