Bug 9500 - Samba4 not checking ms-DS-replicationepoch before attempting replication
Summary: Samba4 not checking ms-DS-replicationepoch before attempting replication
Status: NEW
Alias: None
Product: Samba 4.0
Classification: Unclassified
Component: AD: LDB/DSDB/SAMDB
Version: unspecified
Hardware: x86 Linux
Importance: P5 normal
Target Milestone: ---
Assignee: Andrew Bartlett
QA Contact: Samba QA Contact
Reported: 2012-12-13 21:40 UTC by dave@lavidamassage.com (mail address dead)
Modified: 2019-04-30 13:34 UTC

Description dave@lavidamassage.com (mail address dead) 2012-12-13 21:40:28 UTC
Our previous administrators renamed our domain, and we are aware that it can cause issues, but it appears to me that samba-tool is not checking and setting the replication epoch prior to attempting to replicate with our Windows Server 2003 R2 Domain Controller. The output of running samba-tool is below. We are running this compiled from your source on CentOS 6.3. I imagine that this issue is likely something that is as rare as unicorn feathers.

/usr/local/samba/bin/samba-tool domain join corporate.lavidamassage.com DC -Ucorporate/dave
Finding a writeable DC for domain 'corporate.lavidamassage.com'
Found DC beelzebub.corporate.lavidamassage.com
Password for [CORPORATE\dave]:
workgroup is CORPORATE
realm is corporate.lavidamassage.com
checking sAMAccountName
Adding CN=CHARON,OU=Domain Controllers,DC=corporate,DC=lavidamassage,DC=com
Adding CN=CHARON,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=corporate,DC=lavidamassage,DC=com
Adding CN=NTDS Settings,CN=CHARON,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=corporate,DC=lavidamassage,DC=com
Adding SPNs to CN=CHARON,OU=Domain Controllers,DC=corporate,DC=lavidamassage,DC=com
Setting account password for CHARON$
Enabling account
Calling bare provision
No IPv6 address will be assigned
Provision OK for domain DN DC=corporate,DC=lavidamassage,DC=com
Starting replication
Join failed - cleaning up
checking sAMAccountName
Deleted CN=CHARON,OU=Domain Controllers,DC=corporate,DC=lavidamassage,DC=com
Deleted CN=NTDS Settings,CN=CHARON,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=corporate,DC=lavidamassage,DC=com
Deleted CN=CHARON,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=corporate,DC=lavidamassage,DC=com
ERROR(runtime): uncaught exception - (8593, 'WERR_DS_DIFFERENT_REPL_EPOCHS')
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/__init__.py", line 175, in _run
    return self.run(*args, **kwargs)
  File "/usr/local/samba/lib/python2.6/site-packages/samba/netcmd/domain.py", line 552, in run
    machinepass=machinepass, use_ntvfs=use_ntvfs, dns_backend=dns_backend)
  File "/usr/local/samba/lib/python2.6/site-packages/samba/join.py", line 1104, in join_DC
    ctx.do_join()
  File "/usr/local/samba/lib/python2.6/site-packages/samba/join.py", line 1009, in do_join
    ctx.join_replicate()
  File "/usr/local/samba/lib/python2.6/site-packages/samba/join.py", line 731, in join_replicate
    replica_flags=ctx.replica_flags)
  File "/usr/local/samba/lib/python2.6/site-packages/samba/drs_utils.py", line 248, in replicate
    (level, ctr) = self.drs.DsGetNCChanges(self.drs_handle, req_level, req)
Comment 1 Tim 2012-12-13 21:55:27 UTC
The following warning accompanies this issue in the windows event log:
Event Type:	Warning
Event Source:	NTDS Replication
Event Category:	DS RPC Server 
Event ID:	1876
Date:		12/13/2012
Time:		10:21:28 AM
User:		CORPORATE\dave
Computer:	SERVER3
Description:
The local domain controller cannot replicate with the following remote domain controller because of a mismatched replication epoch (msDS-ReplicationEpoch). This typically occurs as part of the domain rename process. 
 
Remote domain controller: 
CN=NTDS Settings,CN=CHARON,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=corporate,DC=lavidamassage,DC=com 
Remote domain controller replication epoch: 
0 
Local domain controller replication epoch: 
1 
 
Domain controllers undergoing a domain rename are not allowed to communicate with those domain controllers that have not yet undergone the domain rename. When all domain controllers have completed the domain rename, replication will once again be allowed.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
-----------------------------------------------
It seems to be an issue with Line 168 of /source4/rpc_server/drsuapi/dcesrv_drsuapi.c where the epoch is statically assigned a value of 0.
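
The comparison that event 1876 reports can be illustrated with a minimal Python sketch. This is not Samba or Windows code: `check_repl_epoch` and `ReplEpochMismatch` are hypothetical names chosen for illustration, and only the error code 8593 / WERR_DS_DIFFERENT_REPL_EPOCHS is taken from the traceback in the description.

```python
# Illustrative only: models the epoch comparison reported in event 1876.
# check_repl_epoch and ReplEpochMismatch are hypothetical names, not Samba APIs.

WERR_DS_DIFFERENT_REPL_EPOCHS = 8593  # error code seen in the samba-tool traceback


class ReplEpochMismatch(Exception):
    """Raised when the two DCs advertise different replication epochs."""

    def __init__(self, local_epoch, remote_epoch):
        super().__init__(WERR_DS_DIFFERENT_REPL_EPOCHS,
                         "WERR_DS_DIFFERENT_REPL_EPOCHS")
        self.local_epoch = local_epoch
        self.remote_epoch = remote_epoch


def check_repl_epoch(local_epoch, remote_epoch):
    """Refuse replication unless both sides report the same epoch."""
    if local_epoch != remote_epoch:
        raise ReplEpochMismatch(local_epoch, remote_epoch)


# Values from the event log: the Windows DC is at epoch 1, the joining DC sends 0.
try:
    check_repl_epoch(local_epoch=1, remote_epoch=0)
except ReplEpochMismatch as exc:
    mismatch = exc.args
```

With a hard-coded 0 on the joining side, any DC whose epoch has been bumped by a domain rename would fail this kind of check.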
Comment 2 d tucny 2013-05-06 05:37:21 UTC
I have the same problem as the original report, using 4.0.5 and attempting to join a domain that was previously renamed.

There appear to be a number of places where repl_epoch is set to 0, though I'm not sure at this point which one is relevant here.
Comment 3 satadru 2013-05-28 18:09:37 UTC
I'm seeing the same issue with v4-0-stable, trying to add a DC to a w2k3 domain.

sudo /usr/local/samba/bin/samba-tool  domain join ny.clientdomain.com DC -Uadministrator --realm=ny.clientdomain.com
[sudo] password for localadmin: 
Finding a writeable DC for domain 'ny.clientdomain.com'
Found DC rsa-dc-one.ny.clientdomain.com
Password for [WORKGROUP\administrator]:
workgroup is NY
realm is ny.clientdomain.com
checking sAMAccountName
Adding CN=smbdc,OU=Domain Controllers,DC=ny,DC=clientdomain,DC=com
Adding CN=smbdc,CN=Servers,CN=RSA,CN=Sites,CN=Configuration,DC=ny,DC=clientdomain,DC=com
Adding CN=NTDS Settings,CN=smbdc,CN=Servers,CN=RSA,CN=Sites,CN=Configuration,DC=ny,DC=clientdomain,DC=com
Adding SPNs to CN=smbdc,OU=Domain Controllers,DC=ny,DC=clientdomain,DC=com
Setting account password for smbdc$
Enabling account
Calling bare provision
No IPv6 address will be assigned
Provision OK for domain DN DC=ny,DC=clientdomain,DC=com
Starting replication
Join failed - cleaning up
checking sAMAccountName
Deleted CN=smbdc,OU=Domain Controllers,DC=ny,DC=clientdomain,DC=com
Deleted CN=NTDS Settings,CN=smbdc,CN=Servers,CN=RSA,CN=Sites,CN=Configuration,DC=ny,DC=clientdomain,DC=com
Deleted CN=smbdc,CN=Servers,CN=RSA,CN=Sites,CN=Configuration,DC=ny,DC=clientdomain,DC=com
ERROR(runtime): uncaught exception - (8593, 'WERR_DS_DIFFERENT_REPL_EPOCHS')
  File "/usr/local/samba/lib/python2.7/site-packages/samba/netcmd/__init__.py", line 175, in _run
    return self.run(*args, **kwargs)
  File "/usr/local/samba/lib/python2.7/site-packages/samba/netcmd/domain.py", line 552, in run
    machinepass=machinepass, use_ntvfs=use_ntvfs, dns_backend=dns_backend)
  File "/usr/local/samba/lib/python2.7/site-packages/samba/join.py", line 1104, in join_DC
    ctx.do_join()
  File "/usr/local/samba/lib/python2.7/site-packages/samba/join.py", line 1009, in do_join
    ctx.join_replicate()
  File "/usr/local/samba/lib/python2.7/site-packages/samba/join.py", line 731, in join_replicate
    replica_flags=ctx.replica_flags)
  File "/usr/local/samba/lib/python2.7/site-packages/samba/drs_utils.py", line 248, in replicate
    (level, ctr) = self.drs.DsGetNCChanges(self.drs_handle, req_level, req)
Comment 4 baf 2013-06-24 17:18:46 UTC
Having the same problem as the original report. Samba 4.0.6 on CentOS 6.4, joining as a DC to an existing domain. Domain was renamed previously; was originally a Windows 2003 domain and is now Windows 2008 R2.
Comment 5 Matthieu Patou 2013-07-06 05:46:35 UTC
The issue is not trivial to fix, because we need to modify a couple of places where the value is hard-coded to 0.
Comment 6 user 2013-08-09 06:59:02 UTC
Samba version 4.0.8.

Matthieu Patou,

how can this bug be fixed?
Comment 7 user 2013-08-09 07:15:27 UTC
Samba version 4.0.8.

Matthieu Patou,

how can this bug be fixed?
Comment 8 user 2013-08-12 10:48:38 UTC
When will this be fixed?
Comment 9 Andrew Bartlett 2013-08-12 22:30:19 UTC
Posting on the bug report won't get this fixed any faster.  There are a limited number of developers working on this area at the moment. 

Accordingly, I'm resetting the blocker bug to 4.2, because it seems quite unlikely we are going to fix it at this late stage, but if we do (or someone provides a working and tested patch) then of course it is likely to be backported. 

Sorry, 

Andrew Bartlett
Comment 10 Matthieu Patou 2013-08-13 06:06:14 UTC
>Hello!

>Can you send me some instruction how to fix bug 9500? So I could do the testing.

>Best regards,
>Andrey.

First, you need to query the remote DC for its value of ms-DS-replicationepoch and store it in a variable so that it can be used in the join code.

As a separate effort, you need to bump the attribute ms-DS-replicationepoch on a Windows DC (a test one).
Then you have to modify the Python function drs_DsBind so that it takes an optional parameter, which will be the replicationepoch. You also need to make the calling function drsuapi_connect accept a parameter, and the caller (it seems to be in join_add_ntdsdsa()) needs to set this epoch.

To make things easier, you might at first want to hard-code the value in join_add_ntdsdsa. Alternatively, since it is called just after a samdb call, you might want to pass in the contents of the variable that you set at the very beginning.
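
A rough Python sketch of that plumbing follows. It only shows the shape of the change: `BindInfo`, `make_bind_info`, `connect` and the host name are hypothetical stand-ins for illustration, not the real drs_DsBind, drsuapi_connect or Samba's drsuapi structures.

```python
# Illustrative only: threads an optional epoch through two bind helpers.
# BindInfo, make_bind_info and connect are hypothetical stand-ins, not Samba APIs.
from dataclasses import dataclass


@dataclass
class BindInfo:
    supported_extensions: int = 0
    repl_epoch: int = 0  # the value that is currently fixed to 0


def make_bind_info(repl_epoch=0):
    """Build the bind structure; the default keeps today's behaviour (0)."""
    return BindInfo(repl_epoch=repl_epoch)


def connect(server, repl_epoch=0):
    """Stand-in for the connect step: forwards the epoch to the bind helper."""
    return server, make_bind_info(repl_epoch=repl_epoch)


# Hard-coded for a first experiment; later this would be the value
# read from the remote DC at the start of the join.
remote_epoch = 1
_, info = connect("dc.example.com", repl_epoch=remote_epoch)
```

Keeping the new parameter optional with a default of 0 means existing callers would behave exactly as before until they are updated to pass the epoch.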

At that point it might replicate a bit, *but* it's far from finished. First you need to add the correct value of ms-DS-replicationepoch to the NTDS object that we create for the Samba AD DC.
Then you need to modify these files:
source4/dsdb/repl/drepl_out_helpers.c
source4/dsdb/repl/drepl_service.c
source4/libnet/libnet_become_dc.c
source4/libnet/libnet_unbecome_dc.c

These need to change so that we read the replicationepoch from the local database or from the remote database. For libnet_become_dc, I suggest that you add a field to the struct becomeDC_drsuapi to store this value, which you will have to read from the remote database in a function called from becomeDC_connect_ldap1. You can take inspiration from becomeDC_ldap1_crossref_behavior_version for how to read from the remote database.

For the other functions it will be simpler, as you "just" have to read from the local database.

I expect that you will need more details, but just getting the initial replication working will already require a bit of coding, and you will most probably have questions.
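
As a minimal sketch of the "read the epoch" step, the helper below extracts the value from an LDAP-style attribute map such as a search on the NTDS Settings object might return. `epoch_from_attrs` is a hypothetical name, and treating a missing attribute as epoch 0 is an assumption (it is consistent with the event log in comment 1, where the Samba side reports 0); the actual database search is deliberately left out.

```python
# Illustrative only: epoch_from_attrs is a hypothetical helper, not a Samba API.
# Assumption: an unset msDS-ReplicationEpoch is equivalent to epoch 0.

def epoch_from_attrs(attrs):
    """Return the replication epoch from an attribute map, or 0 if unset."""
    # LDAP attribute names are case-insensitive, so normalise the keys.
    lowered = {name.lower(): values for name, values in attrs.items()}
    values = lowered.get("msds-replicationepoch")
    if not values:
        return 0
    first = values[0]
    if isinstance(first, bytes):  # LDAP values usually arrive as bytes
        first = first.decode("ascii")
    return int(first)


renamed_dc = {"objectClass": [b"nTDSDSA"], "msDS-ReplicationEpoch": [b"1"]}
fresh_dc = {"objectClass": [b"nTDSDSA"]}

renamed_epoch = epoch_from_attrs(renamed_dc)
fresh_epoch = epoch_from_attrs(fresh_dc)
```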
Comment 11 user 2013-08-20 09:36:12 UTC
How long do we have to wait for a bug fix? The whole plant, with 5,000 workers, cannot move to Linux until this bug is fixed.
Comment 12 Matthieu Patou 2013-08-21 04:27:03 UTC
(In reply to comment #11)
> How long do we have to wait for a bug fix? The whole plant, with 5,000
> workers, cannot move to Linux until this bug is fixed.

Where is your patch?
I have a day job that is not related to Samba, so I'm doing this in my spare time, and that is the case for quite a lot of people.
Don't expect something very soon; if you do something, it will go faster. If you can't do it yourself, think about contracting a company to make the changes. If you can't do that either, then learn to be patient.
Comment 13 user 2013-08-21 05:55:37 UTC
(In reply to comment #12)
> (In reply to comment #11)
> > How long do we have to wait for a bug fix? The whole plant, with 5,000
> > workers, cannot move to Linux until this bug is fixed.
> 
> Where is your patch?
> I have a day job that is not related to Samba, so I'm doing this in my spare
> time, and that is the case for quite a lot of people.
> Don't expect something very soon; if you do something, it will go faster. If
> you can't do it yourself, think about contracting a company to make the
> changes. If you can't do that either, then learn to be patient.

I need more details to write a patch. Can you give me more detailed steps for patching?
Comment 14 JME 2013-09-26 09:01:07 UTC
Just wanted to add to the discussion that I am obviously experiencing the same problem with Samba4 cloned from git (Version 4.2.0pre1-GIT-58cb40d) on current Debian (Kernel 3.2.46-1+deb7u1) and Fedora 19 (Kernel 3.11x) installations.
No chance to join an existing domain that has been previously renamed due to the mismatch of the epoch attribute. I found a workaround though I suppose.

Changing the epoch on the existing DC using "dssite.msc" and resetting the value of msDS-ReplicationEpoch to "not-set" (it can be found under Default-First-Site-Name/Servers/"DC-Name"/NTDS Settings -> Properties -> Attribute Editor) allowed me to join Samba4 to the existing domain. Currently it is replicating the database!
This is however only of use if you do not have a lot of DCs to be reset I am afraid.

Looking forward to the patch fixing the issue - any idea when this will arrive?
Comment 15 user 2013-09-26 09:08:09 UTC
(In reply to comment #14)
> Just wanted to add to the discussion that I am obviously experiencing the same
> problem with Samba4 cloned from git (Version 4.2.0pre1-GIT-58cb40d) on
> current Debian (Kernel 3.2.46-1+deb7u1) and Fedora 19 (Kernel 3.11x)
> installations.
> No chance to join an existing domain that has been previously renamed due to
> the mismatch of the epoch attribute. I found a workaround though I suppose.
> 
> Changing the epoch on the existing DC using "dssite.msc" and resetting the
> value of msDS-ReplicationEpoch to "not-set" (it can be found under
> Default-First-Site-Name/Servers/"DC-Name"/NTDS Settings -> Properties ->
> Attribute Editor) allowed me to join Samba4 to the existing domain. Currently
> it is replicating the database!
> This is however only of use if you do not have a lot of DCs to be reset I am
> afraid.
> 
> Looking forward to the patch fixing the issue - any idea when this will arrive?

Thank you! I'm testing "dssite.msc" now!
Comment 16 Andrew Bartlett 2013-09-27 01:04:40 UTC
The initial reaction of my contacts at Microsoft is that removing this value would be "catastrophic".  I strongly urge users of Samba and of Microsoft AD not to do this.  The reason is that this is a fundamental part of the replication state. 

I'm asking for further clarification as to whether there are any circumstances under which changing this might be valid, or safe, but I do want to make my grave concerns known regarding this workaround. 

On the original issue, I do not have a timeframe for a fix at this point.
Comment 17 Karolin Seeger 2013-12-10 15:43:37 UTC
Any news on this one?
Comment 18 user 2013-12-20 19:11:19 UTC
when?
Comment 19 Andrew Bartlett 2013-12-20 19:19:36 UTC
Sadly no progress on this one.

We need patches to:
 - check the replication epoch
 - add the replication epoch into all the usn calculations
 - send the new replication epoch
 - test all the above

Preferably these patches would include a tool to rename the domain, so as to be both useful and set the replication epoch in the test environment. 

This is not a small task, somewhere between two and four weeks of developer time, so I would ask for patience from our users who understandably are keen to see this feature implemented, or patches to at least get us started (every bit helps).
Comment 20 Stefan Metzmacher 2014-07-08 10:22:14 UTC
(In reply to comment #18)
> when?

If it's so urgent for you, you might want to get some commercial support
to speed this up.

See https://www.samba.org/samba/support/globalsupport.html
Comment 21 Taner Tas 2018-11-29 11:26:36 UTC
We can't migrate to Samba because of this bug, so joining the existing domain is not an option at the moment.

But what about some other solution, like migrating users with their passwords to a newly provisioned Samba DC from scratch?

Are we out of options?
Comment 22 Dario B. 2019-04-30 13:33:41 UTC
(In reply to JME from comment #14)
@JME Hoping you are still reading this thread. Have you run into any problems after setting msDS-ReplicationEpoch to "not-set", even over these 6 years?
Thanks!
P.S.: Has anyone else tried setting this value to "not-set"?
(Obviously I am experiencing the same problem after renaming a domain and trying to join it with a new Samba AD DC.)