There appears to be a race condition in GetNCChanges where the prefixMap is fetched before the list of object GUIDs. The server will send a new schema object with an OID which cannot be mapped on the client side. In replicated_objects_commit, the uptodateness vector and the highwatermark are both incremented and commited to disk before the schema is updated. This means that the
schema object is never properly initialized and any new calls to dsdb_get_schema cannot succeed.
This is non-trivial to fix if it occurs, and if the server is restarted, practically impossible without doing anything special (as it appears that you can no longer open the database).
Upon further inspection, the race doesn't seem to be with the prefix map being before the GUID list, but being afterwards. The client bumping the UTV is definitely wrong, but it's not clear that there is a proper bug on the server.
As I see it now, the client seems to insist that it must know the mapping for unknown schema before it has even arrived, which prevents schema being reloaded. So this might be a completely client-side issue.
Looking at the logs more closely, this gets more strange.
The invalid prefix map exists on the PDC, which should also be the schema FSMO master. There are some clear issues in the 3-DC environment, but only with large batches of schema changes. Unless the PDC has its prefix map clobbered by any secondary DC (with an external DC race), I can't see how this could happen.
I'm not sure if this is the race shown in the logs, but it appears that there is a race in dsdb_schema_refresh in dsdb_schema_from_db. It fetches the prefixMap in a separate search to the schema attributes.
Between these two searches, any adds can occur leading to an invalid schema with missing prefixMap entries for some of its attributes. This seems like it can propagate across the network as dsdb_get_schema will opt not to read the database again because the sequence number has not changed. I'm not sure how well the client checks these errors but it looks like this could cause the failure in replicated_objects_commit.
Found (what appears to be) another more obvious race condition:
dsdb_replicated_objects_commit writes the prefixMap, however the prefixMap appears to be acquired before the transaction begins.
This means the following sequence could occur:
1) Read of schema + prefix map by DREPL
2) Write of new schema introducing a new prefix
3) DREPL writes the old read of the prefix map
4) Reload of the database fails due to schema missing a prefix
What appears in the log seems to be a slight variation:
3.5) Write of another new schema to introduce yet another new prefix
(In reply to Garming Sam from comment #3)
8ac1646e92d707705c94bc286c17f1cd0d5505fb fixed any possible issues related to the separation of the fetch into two searches.
A more current log of schema issues:
Created attachment 14661 [details]
script to reproduce bug
This script will reproduce the bug in any testenv with a PDC separate from the target, like 'vampiredc'. As of master e650c40e11401f5ff8477292a46f14dfd06a2a29 you can get gdb on the bug like so:
1. SELFTEST_TESTENV=vampire_dc:local make testenv
2. ./bin/samba-tool -s ./st/ad_dc_ntvfs/etc/smb.conf
3. copy the pid of the drepl server
4. "gdb --pid" on the pid you just copied in a new terminal
5. break schema_init.c:667
6. (in the testenv terminal) ./break_repl.sh (run the script attached to this comment)
This was fixed in master with 1fd4cdfafaa6a41c824d1b3d76635bf3e446de0f (and is in 4.10). We should no longer see these errors in autobuild anymore.
We should probably use this extended operation more widely, but it's definitely needed here.
Note that this fix means that replication can sometimes fail with schema, and a retry will eventually fix it.