Bug 12754 - DRS replication can miss entries (locking up replication)
Summary: DRS replication can miss entries (locking up replication)
Status: RESOLVED DUPLICATE of bug 12858
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: AD: LDB/DSDB/SAMDB
Version: 4.6.2
Hardware: All, OS: All
Importance: P5 normal
Target Milestone: ---
Assignee: Andrew Bartlett
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-04-21 21:15 UTC by Andrew Bartlett
Modified: 2017-06-29 06:44 UTC
CC List: 2 users

See Also:


Attachments

Description Andrew Bartlett 2017-04-21 21:15:59 UTC
Due to the way the up-to-dateness vector is calculated, if more than one 'page' of objects is replicated during a cycle, objects can be missed when earlier pages contain objects that were updated after the cycle started.

This is one (possible) cause of WRITE_FAULT errors during 'make test' as well as real-world replication issues on larger data sets.

A WIP patch by Garming and myself is attached, but it needs tests.
Comment 1 Andrew Bartlett 2017-04-21 21:22:11 UTC
I wrote up the patch with Garming earlier this week, based on his analysis of our flapping tests over Easter.

If we calculate the up-to-dateness vector from an object's USN at the time we fetch the full object, we risk ignoring objects that should appear later in the replication cycle.

This can happen if objects A, B and C have USNs:

 A 100
 B 200
 C 300

but, during replication of the 3 pages of results, B is modified and gets USN 400.

Then we send:
 A 100
 B 400

(and ignore)
 C 300

This is because the server sets an up-to-dateness vector of 400 when it sends B, and the client sends it back, causing the server to ignore C at 300 even though the USN check alone would have sent it.
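A toy Python model of the scenario above may make the ordering problem easier to see. It assumes one object per page, a single integer `utdv` standing in for the up-to-dateness vector, and a callback that simulates a concurrent write; `replicate_buggy` and `modify_b_after_first_page` are hypothetical names, not Samba's code. It models the mechanism as this comment describes it, which Comment 2 later says is not the real mechanism.

```python
# Toy model of the paging problem in Comment 1 of bug 12754.
# Hypothetical names; this is not Samba's DRS server code.


def replicate_buggy(usns, on_page_sent):
    """Run one cycle, one object per page, setting the vector from each
    object's *current* USN.

    usns:         dict of object name -> USN (mutated by on_page_sent)
    on_page_sent: called after each page to simulate concurrent writes
    Returns the (name, usn) pairs the client received.
    """
    # The send order is fixed once, from the USNs at cycle start.
    order = sorted(usns, key=usns.get)
    utdv = 0  # value the client echoes back with every request
    sent = []
    for name in order:
        usn = usns[name]  # full object is fetched now, with its current USN
        if usn <= utdv:
            continue  # server assumes the client already has this object
        sent.append((name, usn))
        utdv = usn  # BUG: can jump past objects that have not been sent
        on_page_sent(len(sent), usns)
    return sent


def modify_b_after_first_page(pages_sent, usns):
    # B is modified after page 1 has gone out, as in the example.
    if pages_sent == 1:
        usns["B"] = 400


received = replicate_buggy({"A": 100, "B": 200, "C": 300},
                           modify_b_after_first_page)
# C (USN 300) is never sent:
print(received)  # [('A', 100), ('B', 400)]
```

Without the concurrent modification, the same function sends all three objects, so the loss only appears when an object's USN moves during the cycle.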

Instead, only send an up-to-dateness vector matching the USN seen at the time the cycle starts, and re-send the object later for the higher USN.
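Under toy assumptions (one object per page, a single integer in place of the up-to-dateness vector, hypothetical names such as `replicate_fixed` that are not Samba's code), the proposed fix can be sketched by advancing the vector only to the USN each object had when the cycle started. B is then sent with its new USN 400 without hiding C, and is picked up again by the next cycle because 400 is above the final vector.

```python
# Toy model of the fix proposed in Comment 1 of bug 12754.
# Hypothetical names; this is not Samba's DRS server code.


def replicate_fixed(usns, on_page_sent, utdv=0):
    """Run one cycle, one object per page, advancing the vector only to the
    USN each object had when the cycle started.

    Returns (pairs the client received, vector to echo next cycle).
    """
    start_usns = dict(usns)  # snapshot taken when the cycle starts
    order = sorted((n for n in usns if usns[n] > utdv), key=usns.get)
    sent = []
    for name in order:
        usn = usns[name]  # current USN, possibly newer than the snapshot
        if usn <= utdv:
            continue
        sent.append((name, usn))
        # FIX: the vector never passes what was seen at cycle start.
        utdv = max(utdv, start_usns[name])
        on_page_sent(len(sent), usns)
    return sent, utdv


def modify_b_after_first_page(pages_sent, usns):
    if pages_sent == 1:
        usns["B"] = 400


usns = {"A": 100, "B": 200, "C": 300}
first, utdv = replicate_fixed(usns, modify_b_after_first_page)
print(first, utdv)   # [('A', 100), ('B', 400), ('C', 300)] 300
# The next cycle re-sends B for its higher USN.
second, utdv = replicate_fixed(usns, lambda pages_sent, u: None, utdv)
print(second, utdv)  # [('B', 400)] 400
```

The cost of this choice is that B is transferred twice; the benefit is that the vector never claims more than the snapshot the cycle was ordered by.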
Comment 2 Andrew Bartlett 2017-06-14 10:37:27 UTC
The conclusion of this bug is correct, but the mechanism is not.

Entries are missed if they are modified (for example, by a rename or delete) during replication.

Patches are under development.
Comment 3 Andrew Bartlett 2017-06-23 02:46:01 UTC
Closing in favour of a new bug that is much clearer.

*** This bug has been marked as a duplicate of bug 12858 ***