Bug 13612 - error NT_STATUS_CONNECTION_DISCONNECTED when joining DC on large database
Status: ASSIGNED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: AD: LDB/DSDB/SAMDB
Version: 4.9.0rc5
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assignee: Andrew Bartlett
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-09-12 15:42 UTC by Kevin Guerineau
Modified: 2019-06-11 23:05 UTC

Description Kevin Guerineau 2018-09-12 15:42:44 UTC
When joining a second DC to a domain with a large user database (100k users), we get an NT_STATUS_CONNECTION_DISCONNECTED error during the join process. Joining a 75k-user domain works fine.

This bug has been reproduced in latest RC5.

I have dumped the error message below. RAM is at 4 GiB. I will try increasing it to see if something changes, although I did not see any OOM-killer activity during the test.

I might try increasing the log level to get more information, but it would need to be raised only at the end; otherwise it produces tons and tons of log output during the initial replication of those 100k users.

How to reproduce: create an AD, add 100k user objects, join a second DC.

Partition[DC=ad,DC=test,DC=fr] objects[97784/100214] linked_values[0/23]
Partition[DC=ad,DC=test,DC=fr] objects[98186/100214] linked_values[0/23]
Partition[DC=ad,DC=test,DC=fr] objects[98588/100214] linked_values[0/23]
Partition[DC=ad,DC=test,DC=fr] objects[98990/100214] linked_values[0/23]
Partition[DC=ad,DC=test,DC=fr] objects[99392/100214] linked_values[0/23]
Partition[DC=ad,DC=test,DC=fr] objects[99794/100214] linked_values[0/23]
Partition[DC=ad,DC=test,DC=fr] objects[100196/100214] linked_values[0/23]
Partition[DC=ad,DC=test,DC=fr] objects[100311/100214] linked_values[23/23]
Done with always replicated NC (base, config, schema)
Replicating DC=DomainDnsZones,DC=ad,DC=test,DC=fr
Partition[DC=DomainDnsZones,DC=ad,DC=test,DC=fr] objects[40/40] linked_values[0/0]
Replicating DC=ForestDnsZones,DC=ad,DC=test,DC=fr
Partition[DC=ForestDnsZones,DC=ad,DC=test,DC=fr] objects[18/18] linked_values[0/0]
Exop on[CN=RID Manager$,CN=System,DC=ad,DC=test,DC=fr] objects[3] linked_values[0]
Committing SAM database
Adding 1 remote DNS records for SRVADS2.ad.test.fr
Join failed - cleaning up
ERROR(ldb): uncaught exception - LDAP client internal error: NT_STATUS_CONNECTION_DISCONNECTED
  File "/usr/local/samba/lib64/python2.7/site-packages/samba/netcmd/__init__.py", line 177, in _run
    return self.run(*args, **kwargs)
  File "/usr/local/samba/lib64/python2.7/site-packages/samba/netcmd/domain.py", line 716, in run
    backend_store=backend_store)
  File "/usr/local/samba/lib64/python2.7/site-packages/samba/join.py", line 1500, in join_DC
    ctx.do_join()
  File "/usr/local/samba/lib64/python2.7/site-packages/samba/join.py", line 1414, in do_join
    ctx.cleanup_old_join()
  File "/usr/local/samba/lib64/python2.7/site-packages/samba/join.py", line 272, in cleanup_old_join
    ctx.cleanup_old_accounts(force=force)
  File "/usr/local/samba/lib64/python2.7/site-packages/samba/join.py", line 218, in cleanup_old_accounts
    attrs=["msDS-krbTgtLink", "objectSID"])
Comment 1 Andrew Bartlett 2018-09-12 16:24:02 UTC
Ironically, the new DC is probably functional.

Anyway, what is happening is that the replication is failing to complete within the 15-minute (900-second) idle timeout on the DC.

Then, after all that work, Samba tries to reuse the same LDAP connection it opened at the start but hasn't touched in the meantime. That connection has been dropped and just needs to be re-established.

This will not be hard to fix :-)
Comment 2 Andrew Bartlett 2018-09-14 04:01:13 UTC
As a workaround, a chunk like:

        ctx.samdb = SamDB(url="ldap://%s" % ctx.server,
                          session_info=system_session(),
                          credentials=ctx.creds, lp=ctx.lp)


needs to be run before ctx.join_add_dns_records() in do_join().

This is all in python/samba/join.py.

That will set up the LDAP connection again.
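
For illustration only, the general idea (re-open a connection that has sat idle past the server's timeout before reusing it) can be sketched outside of Samba roughly as follows. This is a minimal sketch, not the code in join.py and not the eventual fix: the ReconnectingHandle class, its connect factory argument, and the injectable clock are all hypothetical names invented for this example, and the 900-second default simply mirrors the idle timeout mentioned in comment 1.

```python
import time


class ReconnectingHandle:
    """Hypothetical helper: hand out a connection, re-opening it if it idled too long."""

    def __init__(self, connect, idle_timeout=900.0, clock=time.monotonic):
        # connect is a zero-argument factory supplied by the caller
        # (assumption: in join.py it would wrap the SamDB(...) call above).
        self._connect = connect
        self._idle_timeout = idle_timeout
        self._clock = clock
        self._conn = connect()
        self._last_used = clock()

    def get(self):
        """Return a connection, reconnecting first if the idle timeout has elapsed."""
        now = self._clock()
        if now - self._last_used >= self._idle_timeout:
            # The server has most likely dropped the old connection by now.
            self._conn = self._connect()
        self._last_used = now
        return self._conn
```

With something like this, a long-running step would call get() each time it needs the connection instead of holding on to the object it obtained at the start. A time-based check is only a heuristic; catching the disconnect error and retrying once would be another reasonable design.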
Comment 3 Andrew Bartlett 2018-09-15 12:58:29 UTC
The other relevant option here is:

--domain-critical-only

The exact purpose of this is to replicate only the essential parts of the domain in the first pass and replicate the rest later, and so avoid similar timeouts.
Comment 4 Kevin Guerineau 2018-09-18 07:45:58 UTC
Hi Andrew,

With your patch, one can properly join the domain with 100k users, but it hits another timeout later with more users/groups.

With --domain-critical-only, the join completes properly with 100k users and 50k small groups.

Do you think the documentation should be modified to recommend --domain-critical-only for joining (whatever the size), or that samba-tool domain join should suggest that parameter when the join fails because of a timeout?

Thanks for your input about this issue

Kévin and Denis
Comment 5 Garming Sam 2018-10-17 02:43:03 UTC
(In reply to Kevin Guerineau from comment #4)

Do you have any additional logs of the second timeout? There might be another simple resolution. I think it would be preferable to get it working.
Comment 6 Tim Beale 2018-10-22 20:27:34 UTC
The timeout problem should now be fixed in the latest master:
https://git.samba.org/?p=samba.git;a=commit;h=d8ea16a3fb6d8ad9b738e8e71adc07079a292079

Do you want this fix backported to v4.9 (or any other releases)? 

Please could you add more details of the other timeout problem you mentioned.
Comment 7 Tim Beale 2019-06-11 23:05:01 UTC
Can we close this issue now? Joining a 100K user database works reliably for us now. Let me know if you are still seeing problems.