12858 – Read corruption in the AD DC: Objects renamed can vanish from LDAP and the replication state due to lack of read locks

Status:	RESOLVED FIXED

Alias:	None

Product:	Samba 4.1 and newer
Classification:	Unclassified
Component:	AD: LDB/DSDB/SAMDB
Version:	4.6.5
Hardware:	All All

Importance:	P5 regression
Target Milestone:	4.7
Assignee:	Andrew Bartlett
QA Contact:	Samba QA Contact

URL:
Keywords:

Duplicates (1):	12754
Depends on:	12904
Blocks:

Reported:	2017-06-23 02:39 UTC by Andrew Bartlett
Modified:	2018-03-13 17:49 UTC
CC List:	3 users

See Also:	12859

Description Andrew Bartlett 2017-06-23 02:39:51 UTC

The symptoms of this issue include:

Replication failures with this error showing in the client side logs:
 error during DRS repl ADD: No objectClass found in replPropertyMetaData for
 Failed to commit objects:
 WERR_GEN_FAILURE/NT_STATUS_INVALID_NETWORK_RESPONSE

A crash of the server, in particular the rpc_server process with
 INTERNAL ERROR: Signal 11

The bug most commonly manifests when an object is created and then deleted or renamed at any time during the server-side search in which it would be replicated out for the first time.

However, any delete or rename may trigger the issue. In those cases the consequences are less obvious: instead of a clear failure, some change to the object is simply not replicated.

Finally, a client reading LDAP while a rename or delete is being processed may not be returned the object being renamed or deleted, but would be returned that object if it asked again.

The root cause is a lack of read locking in ldb_tdb, due to a missing decrement of a reference counter. As a result, an fcntl() lock was not held, and so the connection between the index and the main DB record was not enforced.
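
To illustrate how a missed decrement can leave a lock silently un-held, here is a minimal sketch of a generic reference-counted read lock. It is not the ldb_tdb code: the class CountedReadLock, the decrement_on_release switch and the helper lock_is_held_on_second_read are hypothetical names, and the boolean "held" only stands in for a real fcntl() lock.

```python
class CountedReadLock:
    """Hypothetical reference-counted wrapper around a whole-database read lock."""

    def __init__(self, decrement_on_release=True):
        self.count = 0        # nesting depth of read locks
        self.held = False     # stands in for the real fcntl() lock
        self.decrement_on_release = decrement_on_release

    def lock(self):
        if self.count == 0:
            self.held = True  # outermost lock: take the real lock
        self.count += 1

    def unlock(self):
        if self.count == 1:
            self.held = False  # outermost unlock: drop the real lock
            if not self.decrement_on_release:
                return         # modelled bug: counter is left at 1
        self.count -= 1


def lock_is_held_on_second_read(lock):
    """Do one complete read, then report whether a second read is protected."""
    lock.lock()
    lock.unlock()
    lock.lock()               # second read
    held = lock.held
    lock.unlock()
    return held
```

In this model, once the counter is stuck above zero, every later lock() call believes an outer lock is already in place and never takes the real one, so later reads run unprotected.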

Additionally, it was noticed that a read lock is required over the entire ldb_search() operation, including the subsequent searches in the module stack.  This has required that new lock and unlock operations be added to ldb.
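
The need for a lock spanning the whole search can be shown with a toy model. This is only a sketch under stated assumptions: TinyDB, read_lock, rename, search and the between_steps hook are hypothetical and are not part of ldb. A rename that arrives between the index lookup and the record fetch is applied immediately when no lock is held, and deferred until the lock is released otherwise (a real writer would block instead).

```python
from contextlib import contextmanager


class TinyDB:
    """Toy database: an index of DNs plus the records themselves."""

    def __init__(self):
        self.records = {"cn=old": {"name": "old"}}
        self.index = {"objectClass=user": ["cn=old"]}
        self.read_locked = False
        self.pending = []  # writes deferred while a read lock is held

    @contextmanager
    def read_lock(self):
        self.read_locked = True
        try:
            yield
        finally:
            self.read_locked = False
            while self.pending:
                self.pending.pop(0)()  # let waiting writers proceed

    def rename(self, old, new):
        def apply():
            self.records[new] = self.records.pop(old)
            for dns in self.index.values():
                dns[:] = [new if dn == old else dn for dn in dns]

        if self.read_locked:
            self.pending.append(apply)  # a real writer would block here
        else:
            apply()


def search(db, key, between_steps, whole_search_lock):
    """Two-step search: index lookup, then record fetch."""

    def run():
        dns = list(db.index.get(key, []))  # step 1: consult the index
        between_steps()                    # another process may write here
        return [db.records[dn] for dn in dns if dn in db.records]  # step 2

    if whole_search_lock:
        with db.read_lock():
            return run()
    return run()
```

Without the lock, the search finds the old DN in the index, the rename lands, and the record fetch comes back empty, so the object appears to vanish. With the lock held across both steps, the search sees a consistent snapshot and the rename is applied afterwards.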

This issue will be fixed in ldb 1.2.0 and Samba 4.7.

Comment 1 Andrew Bartlett 2017-06-23 02:46:01 UTC

*** Bug 12754 has been marked as a duplicate of this bug. ***

Comment 2 Garming Sam 2017-07-13 10:42:46 UTC

There appears to be a regression in failure cases, see https://bugzilla.samba.org/show_bug.cgi?id=12904

Comment 3 Stefan Metzmacher 2017-08-11 08:35:46 UTC

Andrew, can this be closed?

Comment 4 Andrew Bartlett 2017-08-28 07:06:52 UTC

Fixed in master with 9063669a05a261657d5b9a60254bd1b9065e6423 for Samba 4.7

Comment 5 Justin Foreman 2018-03-13 16:55:21 UTC

I'm running 4.6.7 and I believe that I'm hitting this bug when trying to join a 4.7 or 4.8 DC.

The wiki says "See BUG #12858 for more details and updated advise on database recovery for affected installations." but there are no details on how to fix/workaround this issue here. Please advise.

https://wiki.samba.org/index.php/Samba_4.7_Features_added/changed#Whole_DB_read_locks:_Improved_LDAP_and_replication_consistency

Comment 6 Andrew Bartlett 2018-03-13 17:49:16 UTC

If you are hitting an issue consistently on a static database, then it won't be this issue, as you don't need read locks on a DB that isn't changing.

However you certainly could be seeing the impact of one of many DRS bugs we have had over the years, possibly leaving your existing DB in a not-happy state.

It is probably best to describe your full issue on the samba mailing list and I'll pick it up from there.