Created attachment 18300
GDB-captured backtrace of failed named startup

A Samba AD DC using DLZ DNS on Fedora 39, with the then-current Samba 4.19.6 / BIND 9.18.24 packages, was upgraded to Fedora 40 with Samba 4.20.0 / BIND 9.18.26. named now consistently crashes on startup when trying to load the Samba-provided DLZ module; stack trace attached. There are no other indications of issues, including from samba-tool dbcheck. Samba itself seems to operate fine otherwise, including answering domain-related queries via LDAP.
Cloned from previous report for Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=2278016

Component and severity are a bit of a guess -- it's crashing deep in the LDB code, the DLZ plugin doesn't appear to have been modified at all recently, and the particular sam.ldb it's failing to load hasn't been touched since it was written out during DC provisioning on a Fedora 37 install about 18 months ago, likely by 4.17.4.
https://bugzilla.samba.org/show_bug.cgi?id=15652 looks similar but perhaps isn't.
This seems to happen on Debian unstable too: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1074378

Here are my findings: the crash is triggered by a call to free in schema_metadata_get_uint64:

            talloc_free(tmp_ctx);
            return ldb_module_error(data->module, LDB_ERR_OPERATIONS_ERROR,
                                    "Failed to convert value");
        }

        *value = data->schema_seq_num_cache;

        SAFE_FREE(tdb_data.dptr);    <<---- this crashes bind
        talloc_free(tmp_ctx);

        return LDB_SUCCESS;

This memory should be allocated by

        tdb_data = tdb_fetch(tdb, tdb_key);

which seems to call tdb_alloc_read, which in turn uses malloc to get this memory. So far so good.

Looking at the memory addresses, they look "different" between bind versions (with the working bind they start with 0x55.., with the broken one they start with 0x7..). Ondřej Surý mentioned that bind9 uses jemalloc. Could that be an issue? What I don't really get is why the malloc call in tdb_alloc_read would use jemalloc while SAFE_FREE (which simply uses free) would use glibc's free.

Rewriting this function to use tdb_fetch_talloc would probably solve this issue.
Sorry for the noise today. I did some testing and found that in the "working" bind, malloc and free point to glibc, whereas in the newer ones they point to jemalloc.

Changelog in bind9:

        6328. [func] Add workaround to enforce dynamic linker to pull jemalloc earlier than libc to ensure all memory allocations are done via jemalloc. [GL #4404]

This explains the issue: tdb_fetch seems to use jemalloc, but somehow SAFE_FREE does not, even though it should simply be

        #define SAFE_FREE(x) do { if ((x) != NULL) {free(x); (x)=NULL;} } while(0)

It makes no sense to me why free would not resolve to jemalloc in this case (gdb confirms that the symbol points to jemalloc, NOT to glibc). Is it possible that the compiler already replaces that with glibc? I don't think so, since I don't see an explicit ___GL symbol in the binary file.
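The same kind of check can be done from inside a process, without gdb, by asking the dynamic linker which shared object provides malloc and free. The following is a minimal sketch, assuming glibc and its dladdr extension (on glibc older than 2.34, link with -ldl). The helper names object_of, same_object and report_allocator are hypothetical and are not part of Samba, tdb or BIND. The answer is only valid for the object the code is compiled into, so to investigate a particular module it would have to run inside that module.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the path of the object that contains addr,
 * or NULL if the dynamic linker cannot attribute the address. */
static const char *object_of(void *addr)
{
    Dl_info info;

    if (dladdr(addr, &info) == 0 || info.dli_fname == NULL) {
        return NULL;
    }
    return info.dli_fname;
}

/* Hypothetical helper: 1 if both addresses live in the same object,
 * 0 if they live in different objects, -1 if either is unknown. */
static int same_object(void *a, void *b)
{
    const char *fa = object_of(a);
    const char *fb = object_of(b);

    if (fa == NULL || fb == NULL) {
        return -1;
    }
    return strcmp(fa, fb) == 0;
}

/* Print where malloc and free resolve to, as seen from this object. */
static void report_allocator(FILE *out)
{
    const char *m = object_of((void *)malloc);
    const char *f = object_of((void *)free);

    fprintf(out, "malloc: %s\n", m != NULL ? m : "(unknown)");
    fprintf(out, "free:   %s\n", f != NULL ? f : "(unknown)");
    if (same_object((void *)malloc, (void *)free) == 0) {
        fprintf(out, "warning: malloc and free come from different objects\n");
    }
}
```

Calling report_allocator(stderr) would print one path for each symbol. Seeing a jemalloc library for one caller and libc for another would match the mismatch described above.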
(In reply to Michael Saxl from comment #4)
Replying to myself: I think I have traced the issue (and have a workaround that works without a recompile).

When bind loads Samba's dlz, some Samba dependencies are loaded (most importantly the one that holds tdb_fetch). Afterwards some ldb modules are loaded, with the dlopen flag RTLD_DEEPBIND. This means that an ldb module prefers symbols from the .so files in the list of libraries it is linked against, and therefore libc. The result is that everything except the ldb modules uses jemalloc, whereas the ldb modules use malloc from libc. Since these two implementations are not compatible (you cannot free memory allocated by the other), we have an issue.

It turns out you can set an environment variable named LDB_MODULES_DISABLE_DEEPBIND that prevents ldb modules from being loaded with that flag. Now everything uses jemalloc and bind does not crash anymore.
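The effect of the environment variable can be sketched as follows. This is an illustration under stated assumptions, not the actual ldb loader code: the function name module_dlopen_flags is hypothetical, and the choice of RTLD_NOW as the base flag is an assumption. Only the variable name LDB_MODULES_DISABLE_DEEPBIND and the flag RTLD_DEEPBIND come from the discussion above.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>

/* Hypothetical helper: choose the flags a module loader would pass to
 * dlopen(). RTLD_DEEPBIND makes the loaded module prefer symbols from its
 * own dependencies (for example libc's malloc) over symbols that are
 * already in the global scope (for example jemalloc's malloc). */
static int module_dlopen_flags(void)
{
    int flags = RTLD_NOW; /* assumed base flag for this sketch */

#ifdef RTLD_DEEPBIND
    /* Deep binding is skipped when the variable is set to any value. */
    if (getenv("LDB_MODULES_DISABLE_DEEPBIND") == NULL) {
        flags |= RTLD_DEEPBIND;
    }
#endif
    return flags;
}
```

A loader written this way would call dlopen(path, module_dlopen_flags()). With the variable set, the module resolves malloc and free through the global scope like the rest of the process, so allocation and release end up in the same allocator.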
(In reply to Michael Saxl from comment #5)
I can confirm that setting the environment variable worked for me (once I worked out where to put it) ;-)
(In reply to Rowland Penny from comment #6)
Closing this; the isc-bind code has been fixed (on Debian at least). See here: https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1982200.html