Schema replicates in chunks of 133 objects. Instead of fetching the whole schema before resolving, we (for some reason) resolve each chunk independently. If a schema object has any kind of link referring to an object that will only arrive in a later chunk, rather than in the current or a previous chunk, we get this error:

Can't continue Schema load: didn't manage to convert any objects: all 42 remaining of 133 objects failed to convert
../../source4/dsdb/repl/replicated_objects.c:362: dsdb_repl_resolve_working_schema() failed: WERR_INTERNAL_ERROR
Failed to create working schema: WERR_INTERNAL_ERROR

So if a line of schema parentage crosses a chunk boundary backwards, we cannot replicate.

We triggered this by adding 200 schema objects, each with subClassOf pointing to the previous object, and then modifying the objects in the opposite order to that in which we added them. Replication sends objects in USN order, so this ensures the newest class is sent first and the oldest one (the parent of all the others) is sent last.

Our first thought was to work around the issue by having the client send a larger max_objects value in the GetNCChanges request. But since the base schema alone is over 1000 objects, this only shifts the chunk boundary without solving the problem. Here's the same error with 400-object chunks:

Can't continue Schema load: didn't manage to convert any objects: all 49 remaining of 400 objects failed to convert
../../source4/dsdb/repl/replicated_objects.c:362: dsdb_repl_resolve_working_schema() failed: WERR_INTERNAL_ERROR
Failed to create working schema: WERR_INTERNAL_ERROR

The same behaviour can be reproduced using any other link attribute, such as possSuperiors.
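For reference, the reproduction can be sketched in Python with Samba's SamDB/ldb bindings roughly as follows. This is a minimal sketch, not the exact test we ran: the DC URL, the bugTestClass names and the 1.3.6.1.4.1.99999 OID arc are all placeholders.

    # Sketch: build a 200-deep subClassOf chain, then touch the classes
    # newest-first so USN order sends children before their parents.
    import ldb
    from samba.auth import system_session
    from samba.param import LoadParm
    from samba.samdb import SamDB

    lp = LoadParm()
    samdb = SamDB(url="ldap://dc1.example.com",   # placeholder DC
                  session_info=system_session(), lp=lp)
    schema_dn = samdb.get_schema_basedn()

    names = ["bugTestClass%d" % i for i in range(200)]

    # Each class is subClassOf the previous one, giving the longest
    # possible parentage chain.
    parent = "top"
    for i, name in enumerate(names):
        samdb.add({
            "dn": "CN=%s,%s" % (name, schema_dn),
            "objectClass": "classSchema",
            "lDAPDisplayName": name,
            "governsId": "1.3.6.1.4.1.99999.1.%d" % i,  # placeholder OID arc
            "subClassOf": parent,
        })
        parent = name

    # Modify in the opposite order: the deepest subclass gets the lowest
    # post-modification USN and is replicated first; the root parent of
    # the chain is replicated last.
    for name in reversed(names):
        msg = ldb.Message()
        msg.dn = ldb.Dn(samdb, "CN=%s,%s" % (name, schema_dn))
        msg["adminDescription"] = ldb.MessageElement(
            "force resend", ldb.FLAG_MOD_REPLACE, "adminDescription")
        samdb.modify(msg)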
While this testcase is synthetic, in practice you only need a single link sent in the wrong order via DRS, as sketched below. That could happen if you modify a base schema element that has a dependent which hasn't been modified. Normally this doesn't cause problems in joins, because the schema is pre-loaded. Nor does it usually affect ongoing replication, because such modifications are rare and the small chunk-to-full-partition ratio makes a bad split even less likely. A full sync could trigger it via unfortunate chunk boundaries (GetNCChanges behaviour is somewhat unpredictable because it is timing-based), but again that requires a chain of unlikely events, including never having received the referenced object through normal replication in the first place.
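Continuing the hedged sketch above (same placeholder names), the non-synthetic case is just one modify:

    # One backwards link is enough: bump the USN of a class that existing
    # classes derive from, without touching the children.  If the parent
    # then lands in a later chunk than a child during a full sync, the
    # chunk-local resolve fails as shown above.
    msg = ldb.Message()
    msg.dn = ldb.Dn(samdb, "CN=bugTestClass0,%s" % schema_dn)
    msg["adminDescription"] = ldb.MessageElement(
        "parent modified after child", ldb.FLAG_MOD_REPLACE,
        "adminDescription")
    samdb.modify(msg)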
Another link attribute that can probably trigger this, which I just noticed, is auxiliaryClass.
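The same shape should apply there: a class whose auxiliaryClass points at a class with a higher USN. A hedged variant of the earlier sketch (placeholder names and OIDs again):

    # auxiliaryClass variant: the link target must be an auxiliary class
    # (objectClassCategory 3).
    samdb.add({
        "dn": "CN=bugTestAux,%s" % schema_dn,
        "objectClass": "classSchema",
        "lDAPDisplayName": "bugTestAux",
        "governsId": "1.3.6.1.4.1.99999.2.1",
        "objectClassCategory": "3",
        "subClassOf": "top",
    })
    samdb.add({
        "dn": "CN=bugTestMain,%s" % schema_dn,
        "objectClass": "classSchema",
        "lDAPDisplayName": "bugTestMain",
        "governsId": "1.3.6.1.4.1.99999.2.2",
        "subClassOf": "top",
        "auxiliaryClass": "bugTestAux",
    })
    # Then touch bugTestAux so its USN exceeds bugTestMain's, as in the
    # modify loop above.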
*** Bug 13899 has been marked as a duplicate of this bug. ***
https://bugzilla.samba.org/attachment.cgi?id=15077 is the minimal fix on bug #12204
(In reply to Stefan Metzmacher from comment #4) You should use https://bugzilla.samba.org/attachment.cgi?id=15081 instead (only the commit message is slightly different).