Bug 13713 - Schema replication fails if link crosses chunk boundary backwards
Summary: Schema replication fails if link crosses chunk boundary backwards
Status: NEW
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: AD: LDB/DSDB/SAMDB
Version: 4.8.0
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assignee: Andrew Bartlett
QA Contact: Samba QA Contact
URL:
Keywords:
Duplicates: 13899
Depends on: 12204 12889
Blocks:
Reported: 2018-12-14 01:58 UTC by Aaron Haslett (dead mail address)
Modified: 2019-04-17 21:26 UTC
CC: 3 users

See Also:


Attachments

Description Aaron Haslett (dead mail address) 2018-12-14 01:58:09 UTC
Schema replicates in chunks of 133 objects. Instead of fetching the whole schema before resolving, for some reason we resolve each chunk independently. If a schema object has a link of any kind referring to an object that will be sent in a later chunk, rather than in the current or a previous chunk, we get this error:

Can't continue Schema load: didn't manage to convert any objects: all 42 remaining of 133 objects failed to convert
../../source4/dsdb/repl/replicated_objects.c:362: dsdb_repl_resolve_working_schema() failed: WERR_INTERNAL_ERROR
Failed to create working schema: WERR_INTERNAL_ERROR

So, if a line of schema parentage crosses a chunk boundary backwards, we cannot replicate.
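To illustrate the mechanism, here is a hypothetical model (not Samba code, and not the actual dsdb_repl_resolve_working_schema() implementation): when each chunk is resolved independently, conversion deadlocks as soon as an object's subClassOf target lives in a later chunk, while resolving over the full object set succeeds.

```python
def resolve(objects, chunk_size):
    """Toy model of per-chunk schema resolution: an object converts
    only once its subClassOf parent is already known."""
    known = set()
    for start in range(0, len(objects), chunk_size):
        pending = list(objects[start:start + chunk_size])
        while pending:
            still = [(n, p) for n, p in pending
                     if p is not None and p not in known]
            if len(still) == len(pending):
                # Mirrors "didn't manage to convert any objects"
                raise RuntimeError("all %d remaining of %d objects "
                                   "failed to convert"
                                   % (len(still), len(pending)))
            known.update(n for n, p in pending if (n, p) not in still)
            pending = still
    return known

# Classes sent newest-first (USN order), each subclassing the next one:
objs = [("c3", "c2"), ("c2", "c1"), ("c1", None)]
resolve(objs, chunk_size=3)      # whole set at once: resolves fine
try:
    resolve(objs, chunk_size=2)  # c3 and c2 depend on a later chunk
except RuntimeError as e:
    print(e)                     # all 2 remaining of 2 objects failed to convert
```

Resolving over the full set works because each pass converts whatever became resolvable; per-chunk resolution has no such second chance once the parent falls outside the chunk.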

We caused this by adding 200 schema objects, each with subClassOf pointing to the previous object. We then modified the objects in the opposite order to the one we added them in: replication sends objects in USN order, so this makes the newest class go first and the oldest one (the parent of all the others) go last.

Our first thought was to work around the issue by raising the max_objects value the client sends in the getncchanges request. But since the base schema is over 1,000 objects, this only shifts the chunk boundary without solving the problem. Here's the same error at 400 objects:

Can't continue Schema load: didn't manage to convert any objects: all 49 remaining of 400 objects failed to convert
../../source4/dsdb/repl/replicated_objects.c:362: dsdb_repl_resolve_working_schema() failed: WERR_INTERNAL_ERROR
Failed to create working schema: WERR_INTERNAL_ERROR

The same behaviour can be reproduced using any other link field such as possSuperiors.
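A back-of-the-envelope sketch of why raising max_objects only moves the boundary (the 1,200-object chain length here is a hypothetical figure standing in for "base schema plus additions"): with a parent chain sent newest-first, nothing in the first chunk can resolve unless the chunk spans the entire chain.

```python
def first_chunk_resolved(chain_len, chunk_size):
    """Objects 0..chain_len-1 are sent newest-first; object i's
    subClassOf points at object i+1, and the last object is the root.
    Returns how many objects in the first chunk can be resolved."""
    chunk = range(min(chunk_size, chain_len))
    known, changed = set(), True
    while changed:
        changed = False
        for i in reversed(chunk):  # resolve from the root end inward
            parent_known = (i + 1 >= chain_len) or (i + 1 in known)
            if i not in known and parent_known:
                known.add(i)
                changed = True
    return len(known)

print(first_chunk_resolved(1200, 133))   # 0 -- boundary at 133 blocks everything
print(first_chunk_resolved(1200, 400))   # 0 -- larger chunk, same failure
print(first_chunk_resolved(1200, 1200))  # 1200 -- only the full set resolves
```

Every object in the first chunk transitively depends on the root, which arrives in the final chunk, so any chunk size smaller than the chain leaves the first chunk entirely unresolvable.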
Comment 1 Garming Sam 2018-12-19 03:40:01 UTC
While this test case is synthetic, all it actually takes is a single relation sent in the wrong order via DRS. This could happen if you modify a base schema element that has a dependent which hasn't been modified.

Normally this doesn't cause problems in joins because of a pre-loaded schema. Nor does it cause problems with ongoing replication, because such modifications are rare and the chunk-to-full-partition ratio makes it even less likely. Full syncs could trigger it due to bad offsets (and GetNCChanges having unpredictable behaviour due to being timing based), but again, it requires a number of unlikely events, including never having received the object through normal replication.
Comment 2 Garming Sam 2019-01-28 04:17:29 UTC
Another example I just noticed is probably auxiliaryClass.
Comment 3 Stefan Metzmacher 2019-04-17 20:44:41 UTC
*** Bug 13899 has been marked as a duplicate of this bug. ***
Comment 4 Stefan Metzmacher 2019-04-17 20:46:29 UTC
https://bugzilla.samba.org/attachment.cgi?id=15077 is the minimal fix
on bug #12204
Comment 5 Stefan Metzmacher 2019-04-17 21:26:44 UTC
(In reply to Stefan Metzmacher from comment #4)

You should use https://bugzilla.samba.org/attachment.cgi?id=15081 instead
(only the commit message is slightly different).