Bug 11917 - Replication Issue
Summary: Replication Issue
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: AD: LDB/DSDB/SAMDB
Version: 4.4.2
Hardware: x64 Linux
Importance: P5 major
Target Milestone: ---
Assignee: Andrew Bartlett
QA Contact: Samba QA Contact
 
Reported: 2016-05-12 19:29 UTC by Rolando
Modified: 2016-07-29 08:01 UTC

Description Rolando 2016-05-12 19:29:02 UTC
Hi,

We have 3 DCs running Samba 4 (4.4.2) on Ubuntu 14.04.4, with approximately 20000 users and 30000 computers, and we are having trouble replicating objects between the DCs.
We use policies for both users and PCs.

The replication problems happen randomly. These are some of the errors from the prompt:

  ../source4/rpc_server/drsuapi/getncchanges.c:1657: DsGetNCChanges 2nd replication on different DN CN=Schema,CN=Configuration,DC=dafip,DC=gov,DC=ar DC=dafip,DC=gov,DC=ar (last_dn CN=ws35001001,OU=OU_WS35001001,OU=3,OU=N350,OU=DAFIP_REDES,DC=dafip,DC=gov,DC=ar)

[2016/05/12 15:45:18.788586,  0] ../source4/rpc_server/drsuapi/getncchanges.c:1657(dcesrv_drsuapi_DsGetNCChanges)
  ../source4/rpc_server/drsuapi/getncchanges.c:1657: DsGetNCChanges 2nd replication on different DN CN=Schema,CN=Configuration,DC=dafip,DC=gov,DC=ar DC=dafip,DC=gov,DC=ar (last_dn CN=ws35001001,OU=OU_WS35001001,OU=3,OU=N350,OU=DAFIP_REDES,DC=dafip,DC=gov,DC=ar)

[2016/05/12 15:47:43.223426,  0] ../source4/rpc_server/drsuapi/getncchanges.c:1657(dcesrv_drsuapi_DsGetNCChanges)
  ../source4/rpc_server/drsuapi/getncchanges.c:1657: DsGetNCChanges 2nd replication on different DN CN=Schema,CN=Configuration,DC=dafip,DC=gov,DC=ar DC=dafip,DC=gov,DC=ar (last_dn CN=ws35001001,OU=OU_WS35001001,OU=3,OU=N350,OU=DAFIP_REDES,DC=dafip,DC=gov,DC=ar)

[2016/05/12 15:47:46.051517,  0] ../source4/rpc_server/drsuapi/getncchanges.c:1657(dcesrv_drsuapi_DsGetNCChanges)
  ../source4/rpc_server/drsuapi/getncchanges.c:1657: DsGetNCChanges 2nd replication on different DN CN=Schema,CN=Configuration,DC=dafip,DC=gov,DC=ar DC=dafip,DC=gov,DC=ar (last_dn CN=ws35001001,OU=OU_WS35001001,OU=3,OU=N350,OU=DAFIP_REDES,DC=dafip,DC=gov,DC=ar)

The thing is, when any of these errors occur, the rest of the replication stops.

If you need more information please ask.

Looking forward to your answer.
Comment 1 Andrew Bartlett 2016-05-14 08:33:15 UTC
Thanks for the bug report.

This probably belongs on the mailing list more than here, for reasons that I hope will become clear shortly.

So, I have good news and bad news, and a possible way forward.

The first thing to note is that the particular error you are seeing is a red herring - not an actual error, but a debugging message that really shouldn't be at level 0.  It isn't a failure case, just something strange (despite the fact that we do it Samba to Samba).

There may, if you turn the logs up enough, be a real error message printed at a higher log level.  Without data it is hard, but I would speculate that you are hitting timeouts at either the network or LDB layer.
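
As a rough aid for that kind of triage, the following sketch groups a Samba log into entries and separates the known-benign "DsGetNCChanges 2nd replication on different DN" message from everything else, so that any real error stands out once the log level has been raised (normally via the `log level` parameter in smb.conf). It assumes the header format seen in the excerpts in this report; the function names `parse_entries` and `split_benign` are hypothetical and not part of Samba.

```python
import re

# Header line format, as seen in the log excerpts in this report:
# [2016/05/12 15:45:18.788586,  0] ../source4/.../getncchanges.c:1657(function_name)
HEADER = re.compile(
    r"^\[(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+),\s*(?P<level>\d+)\]\s+"
    r"(?P<location>\S+)\((?P<func>\w+)\)\s*$"
)

# The debug message described in this bug as a red herring.
BENIGN = "DsGetNCChanges 2nd replication on different DN"


def parse_entries(lines):
    """Group log lines into entries: one header plus its message lines."""
    entries = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = HEADER.match(line)
        if match:
            current = {
                "ts": match.group("ts"),
                "level": int(match.group("level")),
                "location": match.group("location"),
                "func": match.group("func"),
                "lines": [],
            }
            entries.append(current)
        elif current is not None and line.strip():
            # Non-empty lines after a header belong to that entry.
            current["lines"].append(line.strip())
    for entry in entries:
        entry["message"] = " ".join(entry.pop("lines"))
    return entries


def split_benign(entries):
    """Return (benign_entries, other_entries) based on the BENIGN marker."""
    benign = [e for e in entries if BENIGN in e["message"]]
    other = [e for e in entries if BENIGN not in e["message"]]
    return benign, other
```

Feeding it the lines of a log file (for example `parse_entries(open(path))`, where `path` is wherever your DC writes its log) and then printing the `other` list from `split_benign` gives a short list of the messages that are worth a closer look.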

The second thing is that, by numerous reports, you are at the outer edge of Samba's current practical scale.  We greatly admire the large installations that run Samba at the very edge of its capabilities, but we also know from reports that things like this can and do come up for them.  

The good news is that you are not alone: as an example, in my work on very similar customer Samba bugs for my employer Catalyst, we have isolated performance issues in both our DRS client and server, as well as some inappropriate timeouts that could simply be extended.   

The bad news is that while some small fixes are in git master, there is still much, much more to do.  I'm confident that it is practical to make Samba scale to the size you need, with the application of some appropriate development resource.

Finally, at your scale, particularly if this is in production, I would strongly suggest engaging a Samba commercial support provider to assist in isolating what exactly is happening on your network, and to propose the work required to fix it.

Thanks,

Andrew Bartlett
Comment 2 Andrew Bartlett 2016-07-29 02:33:24 UTC
Without further feedback, I'm going to say this is fixed in Samba 4.5rc1, as we addressed a large number of performance issues that will hurt domains of this size.  Those could certainly lock up a DC for long enough to cause this kind of behaviour.