Created attachment 18300 [details] GDB-captured backtrace of failed named startup

A Samba AD DC using DLZ DNS on Fedora 39 (Samba 4.19.6 / BIND 9.18.24) was upgraded to Fedora 40 (Samba 4.20.0 / BIND 9.18.26). named now consistently crashes on startup when trying to load the Samba-provided DLZ module; stack trace attached. There are no other indications of problems, and samba-tool dbcheck reports none either. Samba itself seems to operate fine otherwise, including answering domain-related queries via LDAP.
Cloned from previous report for Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=2278016

Component and severity are a bit of a guess -- it's crashing deep in the LDB code, the DLZ plugin doesn't appear to have been modified at all recently, and the particular sam.ldb it's failing to load hasn't been touched since it was written out during DC provisioning on a Fedora 37 install about 18 months ago, likely by 4.17.4.
https://bugzilla.samba.org/show_bug.cgi?id=15652 looks similar but perhaps isn't.
This seems to happen on Debian unstable too: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1074378

Here are my findings: the crash is triggered by a call to free in schema_metadata_get_uint64:

            talloc_free(tmp_ctx);
            return ldb_module_error(data->module, LDB_ERR_OPERATIONS_ERROR,
                                    "Failed to convert value");
        }

        *value = data->schema_seq_num_cache;

        SAFE_FREE(tdb_data.dptr);    <<---- this crashes bind
        talloc_free(tmp_ctx);

        return LDB_SUCCESS;

This memory should be allocated by

        tdb_data = tdb_fetch(tdb, tdb_key);

which seems to call tdb_alloc_read, which in turn uses malloc to get this memory. So far so good.

Looking at the memory addresses, they look "different" between bind versions (with a working bind they start with 0x55.., with a broken one they start with 0x7..). Ondřej Surý mentioned that bind9 uses jemalloc. Could that be an issue? What I don't really get is why the malloc call in tdb_alloc_read would use jemalloc while SAFE_FREE (which simply uses free) would use glibc's free.

Rewriting this function to use tdb_fetch_talloc would probably solve this issue.
Sorry for the noise today. I did some testing and found that in a "working" bind, malloc and free point to glibc, whereas in newer ones they point to jemalloc.

Changelog in bind9:

    6328. [func] Add workaround to enforce dynamic linker to pull jemalloc earlier than libc to ensure all memory allocations are done via jemalloc. [GL #4404]

This explains the issue: tdb_fetch seems to use jemalloc, but somehow SAFE_FREE does not, even though it should simply be

    #define SAFE_FREE(x) do { if ((x) != NULL) {free(x); (x)=NULL;} } while(0)

It makes no sense to me why free does not resolve to jemalloc in this case (gdb confirms that the symbol does NOT point to glibc but indeed to jemalloc). Is it possible that the compiler already replaces that with glibc? I don't think so, since I don't see an explicit ___GL symbol in the binary file.
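For anyone who wants to repeat the "which library does malloc/free point to" check without gdb, here is a minimal sketch, assuming glibc's dlsym and dladdr. The helper names symbol_provider and same_provider are made up for illustration and are not part of Samba, tdb, or BIND. Note that dlsym(RTLD_DEFAULT, ...) uses the default global search order, so on its own it will not show what a module loaded with RTLD_DEEPBIND sees; the code would have to run from inside such a module for that.

```c
#define _GNU_SOURCE          /* RTLD_DEFAULT and dladdr() need this on glibc */
#include <assert.h>
#include <dlfcn.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the path of the shared object that provides
 * 'symbol' in the default (global) lookup order, or NULL if it cannot be
 * resolved.
 */
const char *symbol_provider(const char *symbol)
{
	Dl_info info;
	void *addr;

	assert(symbol != NULL);

	addr = dlsym(RTLD_DEFAULT, symbol);
	if (addr == NULL) {
		return NULL;
	}
	/* dladdr() returns 0 when no loaded object contains the address. */
	if (dladdr(addr, &info) == 0 || info.dli_fname == NULL) {
		return NULL;
	}
	return info.dli_fname;
}

/* Hypothetical helper: true if both symbols come from the same object. */
bool same_provider(const char *a, const char *b)
{
	const char *pa = symbol_provider(a);
	const char *pb = symbol_provider(b);

	if (pa == NULL || pb == NULL) {
		return false;
	}
	return strcmp(pa, pb) == 0;
}
```

Printing symbol_provider("malloc") and symbol_provider("free") from the process in question should show either the libc or the jemalloc shared object; older glibc versions may need -ldl when linking.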
(In reply to Michael Saxl from comment #4)
Replying to myself: I think I have traced the issue (and have a workaround that works without a recompile).

When bind loads Samba's dlz, some Samba dependencies are loaded (most importantly the one that holds tdb_fetch). Afterwards some ldb modules are loaded (with the dlopen flag RTLD_DEEPBIND). That means that ldb modules prefer symbols from the .so files in the list of libraries the ldb module is linked against -> libc.

What we have now is that everything except the ldb modules uses jemalloc, whereas the ldb modules use malloc from libc. Since these two implementations are not compatible (you cannot free memory allocated by the other), we have an issue.

It turns out you can set an environment variable named LDB_MODULES_DISABLE_DEEPBIND that prevents ldb modules from being loaded with that flag. Now everything uses jemalloc and bind does not crash anymore.
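To make the mechanism concrete, here is a minimal sketch of how a loader could pick its dlopen flags depending on that environment variable. This is modelled on the behaviour described in this comment and is not copied from the ldb source. The function names are hypothetical, and two details are assumptions: that the mere presence of LDB_MODULES_DISABLE_DEEPBIND (any value) is what counts, and that RTLD_NOW is the base flag.

```c
#define _GNU_SOURCE          /* RTLD_DEEPBIND is a glibc extension */
#include <assert.h>
#include <dlfcn.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper: true if the workaround variable is present in the
 * environment. Assumption: any value, even an empty one, counts.
 */
bool deepbind_disabled_by_env(void)
{
	return getenv("LDB_MODULES_DISABLE_DEEPBIND") != NULL;
}

/*
 * Hypothetical helper: compute the flags a module loader would pass to
 * dlopen(). With RTLD_DEEPBIND the module prefers symbols from its own
 * dependencies (for example malloc/free from libc) over symbols already
 * in the global scope (for example jemalloc pulled in by the application).
 */
int module_dlopen_flags(void)
{
	int flags = RTLD_NOW;   /* assumed base flag */

#ifdef RTLD_DEEPBIND
	if (!deepbind_disabled_by_env()) {
		flags |= RTLD_DEEPBIND;
	}
#endif
	assert((flags & RTLD_NOW) != 0);
	return flags;
}
```

With the variable set in named's environment, a loader written like this would fall back to plain RTLD_NOW, so modules and application resolve malloc and free to the same allocator.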
(In reply to Michael Saxl from comment #5)
I can confirm that setting the environment variable worked for me (once I worked out where to put it) ;-)
(In reply to Rowland Penny from comment #6)
Closing this; the isc-bind code has been fixed (on Debian at least). See here: https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1982200.html
*** Bug 15722 has been marked as a duplicate of this bug. ***
The source code had the following comment:

     * So in future we may remove this completely
     * or at least invert the default behavior.

We should invert the default behavior. MR opened.
This bug was referenced in samba master:

8d6b5183770895fef002b6cce84902d1874fa502
dc6927fdca2ad77dbcf212ef4d3ba0d118ec7bdf
d6ff05cb5708fb6746176821bee5f713195efa54
20a3a94e06a2294206ec233ccc7f873d6ef2aca0
Created attachment 18448 [details] patch for 4.20
Created attachment 18449 [details] patch for 4.21
Comment on attachment 18448 [details] patch for 4.20 I don't think we need to fix this in 4.20, as it would mean we need a new ldb release...
Pushed to autobuild-v4-21-test.
This bug was referenced in samba v4-21-test:

a4cc81cc2f2173d78730c6abb06f7959547d7433
c9463d6dc98bee7c90439f00c9ab94f611f6eaf1
a56ce559eb181f3e050cba1e436df5517f4af68a
aabaf6aaf55103d59d98e49d3c632bc5a65186b4
Closing out bug report. Thanks!
This bug was referenced in samba v4-21-stable (Release samba-4.21.1):

a4cc81cc2f2173d78730c6abb06f7959547d7433
c9463d6dc98bee7c90439f00c9ab94f611f6eaf1
a56ce559eb181f3e050cba1e436df5517f4af68a
aabaf6aaf55103d59d98e49d3c632bc5a65186b4