Bug 15643 - Samba 4.20.0 DLZ module crashes BIND on startup
Status: RESOLVED INVALID
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: AD: LDB/DSDB/SAMDB
Version: 4.20.0
Hardware: x64 Linux
Importance: P5 regression with 15 votes
Target Milestone: ---
Assignee: Samba QA Contact
QA Contact: Samba QA Contact
Reported: 2024-05-08 14:58 UTC by Rob Foehl
Modified: 2024-07-28 11:25 UTC

Attachments
GDB-captured backtrace of failed named startup (6.25 KB, text/plain)
2024-05-08 14:58 UTC, Rob Foehl

Description Rob Foehl 2024-05-08 14:58:28 UTC
Created attachment 18300
GDB-captured backtrace of failed named startup

A Samba AD DC using DLZ DNS on Fedora 39 (Samba 4.19.6 / BIND 9.18.24 packages) was upgraded to Fedora 40 (Samba 4.20.0 / BIND 9.18.26).  named now consistently crashes on startup when trying to load the Samba-provided DLZ module; stack trace attached.  There are no other indications of problems, including from samba-tool dbcheck.  Samba itself seems to operate fine otherwise, including answering domain-related queries via LDAP.
Comment 1 Rob Foehl 2024-05-08 15:07:52 UTC
Cloned from previous report for Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=2278016

Component and severity are a bit of a guess -- it's crashing deep in the LDB code, the DLZ plugin doesn't appear to have been modified at all recently, and the particular sam.ldb it's failing to load hasn't been touched since it was written out during DC provisioning on a Fedora 37 install about 18 months ago, likely by 4.17.4.
Comment 2 Douglas Bagnall 2024-06-06 21:29:05 UTC
https://bugzilla.samba.org/show_bug.cgi?id=15652 looks similar but perhaps isn't.
Comment 3 Michael Saxl 2024-06-30 12:58:37 UTC
This seems to happen on Debian unstable too:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1074378

Here are my findings:
The crash is triggered by a call to free in schema_metadata_get_uint64:

		talloc_free(tmp_ctx);
		return ldb_module_error(data->module, LDB_ERR_OPERATIONS_ERROR,
					"Failed to convert value");
	}

	*value = data->schema_seq_num_cache;

	SAFE_FREE(tdb_data.dptr); /* <-- this crashes BIND */
	talloc_free(tmp_ctx);

	return LDB_SUCCESS;

This memory should have been allocated by

tdb_data = tdb_fetch(tdb, tdb_key);

This seems to call tdb_alloc_read, which in turn uses malloc to get the memory.

So far so good. Looking at the memory addresses, they differ between BIND versions (with the working BIND they start with 0x55.., with the broken one they start with 0x7..). Ondřej Surý mentioned that bind9 uses jemalloc. Could that be the issue? What I don't really get is why the malloc call in tdb_alloc_read would use jemalloc while SAFE_FREE (which simply calls free) would use glibc's free.

Rewriting this function to use tdb_fetch_talloc would probably solve this issue.
Comment 4 Michael Saxl 2024-06-30 15:48:37 UTC
Sorry for the noise today.
I did some testing and found that in the "working" BIND, malloc and free point to glibc, whereas in newer ones they point to jemalloc.

Changelog in bind9:
6328.   [func]          Add workaround to enforce dynamic linker to pull
                        jemalloc earlier than libc to ensure all memory
                        allocations are done via jemalloc. [GL #4404]

This explains the issue: tdb_fetch seems to use jemalloc, but somehow SAFE_FREE does not, although it is simply

#define SAFE_FREE(x) do { if ((x) != NULL) {free(x); (x)=NULL;} } while(0)

It makes no sense to me why free would not resolve to jemalloc in this case (gdb confirms that the symbol points to jemalloc, not to glibc). Is it possible that the compiler already replaces the call with glibc's? I don't think so, since I don't see an explicit ___GL symbol in the binary file.
Comment 5 Michael Saxl 2024-06-30 16:36:18 UTC
(In reply to Michael Saxl from comment #4)
Replying to myself:
I think I have traced the issue (and have a workaround that works without a recompile):
When BIND loads Samba's DLZ module, some Samba dependencies are loaded with it (most importantly the one that holds tdb_fetch). Afterwards some ldb modules are loaded with the dlopen flag RTLD_DEEPBIND. That means an ldb module prefers symbols from the .so files it is itself linked against, which here means libc. As a result, everything except the ldb modules uses jemalloc, whereas the ldb modules use malloc and free from libc. Since these two implementations are not compatible (you cannot free memory allocated by the other), we have an issue.

It turns out you can set an environment variable named LDB_MODULES_DISABLE_DEEPBIND that prevents ldb modules from being loaded with that flag. Now everything uses jemalloc and BIND does not crash anymore.
Comment 6 Rowland Penny 2024-07-01 16:18:42 UTC
(In reply to Michael Saxl from comment #5)
I can confirm that setting the environment variable worked for me (once I worked out where to put it) ;-)
Comment 7 Rowland Penny 2024-07-28 11:25:04 UTC
(In reply to Rowland Penny from comment #6)
Closing this; the isc-bind code has been fixed (on Debian at least).

See here:

https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1982200.html