Bug 15643 - Samba 4.20.0 DLZ module crashes BIND on startup
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: AD: LDB/DSDB/SAMDB
Version: 4.20.0
Hardware: x64 Linux
Importance: P5 regression with 15 votes
Target Milestone: ---
Assignee: Jule Anger
QA Contact: Samba QA Contact
Duplicates: 15722
Reported: 2024-05-08 14:58 UTC by Rob Foehl
Modified: 2024-10-14 11:33 UTC

Attachments
GDB-captured backtrace of failed named startup (6.25 KB, text/plain)
2024-05-08 14:58 UTC, Rob Foehl
patch for 4.20 (10.08 KB, patch)
2024-09-27 12:15 UTC, Andreas Schneider
patch for 4.21 (10.08 KB, patch)
2024-09-27 12:15 UTC, Andreas Schneider
metze: review+

Description Rob Foehl 2024-05-08 14:58:28 UTC
Created attachment 18300 [details]
GDB-captured backtrace of failed named startup

A Samba AD DC using DLZ DNS on Fedora 39 (current Samba 4.19.6 / BIND 9.18.24 packages) was upgraded to Fedora 40 (Samba 4.20.0 / BIND 9.18.26). Since the upgrade, named consistently crashes on startup when it tries to load the Samba-provided DLZ module; a stack trace is attached. There are no other indications of issues, including from samba-tool dbcheck. Samba itself otherwise seems to operate fine, including answering domain-related queries via LDAP.
Comment 1 Rob Foehl 2024-05-08 15:07:52 UTC
Cloned from previous report for Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=2278016

Component and severity are a bit of a guess -- it's crashing deep in the LDB code, the DLZ plugin doesn't appear to have been modified at all recently, and the particular sam.ldb it's failing to load hasn't been touched since it was written out during DC provisioning on a Fedora 37 install about 18 months ago, likely by 4.17.4.
Comment 2 Douglas Bagnall 2024-06-06 21:29:05 UTC
https://bugzilla.samba.org/show_bug.cgi?id=15652 looks similar but perhaps isn't.
Comment 3 Michael Saxl 2024-06-30 12:58:37 UTC
This seems to happen on Debian unstable too:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1074378

Here are my findings:
The crash is triggered by a call to free in schema_metadata_get_uint64:

		talloc_free(tmp_ctx);
		return ldb_module_error(data->module, LDB_ERR_OPERATIONS_ERROR,
					"Failed to convert value");
	}

	*value = data->schema_seq_num_cache;

	SAFE_FREE(tdb_data.dptr); /* <-- this crashes BIND */
	talloc_free(tmp_ctx);

	return LDB_SUCCESS;

This memory should have been allocated by

tdb_data = tdb_fetch(tdb, tdb_key);

which seems to call tdb_alloc_read, which in turn uses malloc to get the memory.

So far so good. Looking at the memory addresses, they differ between BIND versions (0x55.. with the working BIND, 0x7.. with the broken one). Ondřej Surý mentioned that bind9 uses jemalloc. Could that be the issue? What I don't really get is why the malloc call in tdb_alloc_read would use jemalloc while SAFE_FREE (which simply calls free) would use glibc's free.

Rewriting this function to use tdb_fetch_talloc would probably solve the issue.
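
To make the suspected failure mode concrete, here is a toy sketch of what goes wrong when a buffer is released through a different allocator than the one that produced it. None of this is Samba, tdb, or jemalloc code: the "owners", tags, and function names are hypothetical stand-ins, and the ownership check exists only so the mismatch can be shown safely.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical owner tags standing in for two incompatible allocators. */
#define OWNER_A 0xA110CA7Eu
#define OWNER_B 0xB10CB10Cu

/* Header placed in front of every payload so ownership can be checked. */
struct block_hdr {
	uint32_t owner;
	size_t len;
};

/* Allocate len bytes on behalf of the given owner; NULL on failure. */
void *owner_alloc(uint32_t owner, size_t len)
{
	struct block_hdr *h = malloc(sizeof(*h) + len);

	if (h == NULL) {
		return NULL;
	}
	h->owner = owner;
	h->len = len;
	return h + 1;
}

/*
 * Release p through the given owner. Returns 0 on success and -1 if p
 * was allocated by a different owner, in which case nothing is freed.
 * Real allocators generally cannot make this check reliably, so a
 * foreign pointer tends to end in heap corruption or an abort.
 */
int owner_free(uint32_t owner, void *p)
{
	struct block_hdr *h;

	if (p == NULL) {
		return 0;
	}
	h = (struct block_hdr *)p - 1;
	if (h->owner != owner) {
		return -1;
	}
	h->owner = 0;
	free(h);
	return 0;
}

/* Usage: a buffer has to go back to the owner that produced it. */
void owner_demo(void)
{
	void *buf = owner_alloc(OWNER_A, 32);

	assert(buf != NULL);
	assert(owner_free(OWNER_B, buf) == -1); /* the mismatch */
	assert(owner_free(OWNER_A, buf) == 0);  /* the matching pair */
}
```

The point of switching to a talloc-based fetch would be the same as the matching pair above: allocation and release would then go through one code path instead of relying on whichever malloc and free each library happens to bind to.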
Comment 4 Michael Saxl 2024-06-30 15:48:37 UTC
Sorry for the noise today.
I did some testing and found that in the "working" BIND, malloc and free point to glibc, whereas in newer versions they point to jemalloc.

Changelog in bind9:
6328.   [func]          Add workaround to enforce dynamic linker to pull
                        jemalloc earlier than libc to ensure all memory
                        allocations are done via jemalloc. [GL #4404]

This explains the issue: tdb_fetch seems to use jemalloc, but somehow SAFE_FREE does not, even though it should simply be

#define SAFE_FREE(x) do { if ((x) != NULL) {free(x); (x)=NULL;} } while(0)

To me it makes no sense that free would not resolve to jemalloc in this case (gdb confirms that the symbol does NOT point to glibc but indeed to jemalloc). Is it possible that the compiler already replaces the call with glibc's free? I don't think so, since I don't see an explicit ___GL symbol in the binary file.
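
For anyone who wants to check this kind of thing without gdb, a small sketch along the following lines can report which shared object a symbol such as free resolves to. It uses the glibc extensions dlsym(RTLD_DEFAULT, ...) and dladdr; the helper name symbol_origin is made up for this example. Note that it looks the symbol up in the global scope of the calling process, so it would not by itself show what a module loaded with different dlopen flags ends up binding to.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/*
 * Return the path of the shared object that provides the named symbol
 * in the global lookup scope, or a placeholder string if the symbol or
 * its object cannot be determined. Hypothetical helper for illustration.
 */
const char *symbol_origin(const char *name)
{
	Dl_info info;
	void *addr;

	assert(name != NULL);

	addr = dlsym(RTLD_DEFAULT, name);
	if (addr == NULL) {
		return "(unresolved)";
	}
	if (dladdr(addr, &info) == 0 || info.dli_fname == NULL) {
		return "(unknown object)";
	}
	return info.dli_fname;
}
```

Calling symbol_origin("malloc") and symbol_origin("free") from inside the affected process would show whether both come from the same library. On glibc older than 2.34 this needs to be linked with -ldl.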
Comment 5 Michael Saxl 2024-06-30 16:36:18 UTC
(In reply to Michael Saxl from comment #4)
Replying to myself:
I think I have traced the issue (and have a workaround that works without a recompile).
When BIND loads Samba's DLZ module, some Samba dependencies are loaded with it, most importantly the library that provides tdb_fetch. Afterwards some ldb modules are loaded with the dlopen flag RTLD_DEEPBIND. With that flag, an ldb module prefers symbols from the .so files it is itself linked against, which means libc. The result is that everything except the ldb modules uses jemalloc, whereas the ldb modules use malloc and free from libc. Since these two implementations are not compatible (you cannot free memory with one that was allocated by the other), we have an issue.

It turns out you can set an environment variable named LDB_MODULES_DISABLE_DEEPBIND that prevents ldb modules from being loaded with that flag. Then everything uses jemalloc and BIND no longer crashes.
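
As a rough sketch of the behaviour described above (not the actual ldb source), a module loader could pick its dlopen flags like this. The environment variable name is the one mentioned in this comment; the function name module_dlopen_flags and its parameter are invented for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stdlib.h>

/*
 * Return the flags to pass to dlopen() for a module: the caller's base
 * flags, plus RTLD_DEEPBIND unless LDB_MODULES_DISABLE_DEEPBIND is set
 * in the environment. Hypothetical helper, assumed for illustration.
 */
int module_dlopen_flags(int base_flags)
{
	/* The caller must pick a binding mode, as dlopen() requires. */
	assert((base_flags & (RTLD_LAZY | RTLD_NOW)) != 0);

#ifdef RTLD_DEEPBIND
	if (getenv("LDB_MODULES_DISABLE_DEEPBIND") == NULL) {
		base_flags |= RTLD_DEEPBIND;
	}
#endif
	return base_flags;
}
```

With the variable set in named's environment, the module would be opened without RTLD_DEEPBIND, so its malloc and free would resolve in the same global scope as the rest of the process.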
Comment 6 Rowland Penny 2024-07-01 16:18:42 UTC
(In reply to Michael Saxl from comment #5)
I can confirm that setting the environment variable worked for me (once I worked out where to put it) ;-)
Comment 7 Rowland Penny 2024-07-28 11:25:04 UTC
(In reply to Rowland Penny from comment #6)
Closing this; the isc-bind code has been fixed (on Debian at least).

See here:

https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1982200.html
Comment 8 Andreas Schneider 2024-09-25 07:46:15 UTC
*** Bug 15722 has been marked as a duplicate of this bug. ***
Comment 9 Andreas Schneider 2024-09-25 07:53:39 UTC
The source code had the following comment:

	 * So in future we may remove this completely
	 * or at least invert the default behavior.

We should invert the default behavior. I have opened an MR.
Comment 10 Samba QA Contact 2024-09-27 09:07:03 UTC
This bug was referenced in samba master:

8d6b5183770895fef002b6cce84902d1874fa502
dc6927fdca2ad77dbcf212ef4d3ba0d118ec7bdf
d6ff05cb5708fb6746176821bee5f713195efa54
20a3a94e06a2294206ec233ccc7f873d6ef2aca0
Comment 11 Andreas Schneider 2024-09-27 12:15:29 UTC
Created attachment 18448 [details]
patch for 4.20
Comment 12 Andreas Schneider 2024-09-27 12:15:55 UTC
Created attachment 18449 [details]
patch for 4.21
Comment 13 Stefan Metzmacher 2024-10-01 12:12:07 UTC
Comment on attachment 18448 [details]
patch for 4.20

I don't think we need to fix this in 4.20, as it would mean we need a new ldb release...
Comment 14 Jule Anger 2024-10-02 08:12:02 UTC
Pushed to autobuild-v4-21-test.
Comment 15 Samba QA Contact 2024-10-02 09:29:19 UTC
This bug was referenced in samba v4-21-test:

a4cc81cc2f2173d78730c6abb06f7959547d7433
c9463d6dc98bee7c90439f00c9ab94f611f6eaf1
a56ce559eb181f3e050cba1e436df5517f4af68a
aabaf6aaf55103d59d98e49d3c632bc5a65186b4
Comment 16 Jule Anger 2024-10-02 14:29:59 UTC
Closing out bug report.

Thanks!
Comment 17 Samba QA Contact 2024-10-14 11:33:24 UTC
This bug was referenced in samba v4-21-stable (Release samba-4.21.1):

a4cc81cc2f2173d78730c6abb06f7959547d7433
c9463d6dc98bee7c90439f00c9ab94f611f6eaf1
a56ce559eb181f3e050cba1e436df5517f4af68a
aabaf6aaf55103d59d98e49d3c632bc5a65186b4