This happens intermittently when we fall off the corporate network and thus go offline. We have a NetworkManager dispatcher script which will invoke 'smbcontrol winbind offline' when that happens.

https://bugzilla.redhat.com/show_bug.cgi?id=1033595

#0 0x00007fe0bd05ac39 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fe0bd05c348 in __GI_abort () at abort.c:89
#2 0x00007fe0bf3e5a1b in dump_core () at ../source3/lib/dumpcore.c:336
#3 0x00007fe0bf3d00d7 in smb_panic_s3 (why=<optimized out>) at ../source3/lib/util.c:808
#4 0x00007fe0c36c044f in smb_panic (why=why@entry=0x7fe0c36cdcd4 "internal error") at ../lib/util/fault.c:159
#5 0x00007fe0c36c0666 in fault_report (sig=<optimized out>) at ../lib/util/fault.c:77
#6 sig_fault (sig=<optimized out>) at ../lib/util/fault.c:88
#7 <signal handler called>
#8 0x00007fe0bd1392b4 in inet_pton4 (dst=0x7fffb021a580 "ȥ!\260\377\177", src=0x1 <error: Cannot access memory at address 0x1>) at inet_pton.c:93
#9 __GI_inet_pton (af=af@entry=2, src=src@entry=0x0, dst=dst@entry=0x7fffb021a580) at inet_pton.c:59
#10 0x00007fe0c36bcab4 in is_ipaddress_v4 (str=str@entry=0x0) at ../lib/util/util_net.c:316
#11 0x00007fe0c36bcd59 in is_ipaddress (str=str@entry=0x0) at ../lib/util/util_net.c:366
#12 0x00007fe0c0a6f9a3 in internal_resolve_name (name=name@entry=0x0, name_type=name_type@entry=28, sitename=sitename@entry=0x7fe0c5eaa4a0 "IR-Ireland", return_iplist=return_iplist@entry=0x7fffb021a768, return_count=return_count@entry=0x7fffb021a754, resolve_order=resolve_order@entry=0x7fe0c0c83020 <ads_order>) at ../source3/libsmb/namequery.c:2600
#13 0x00007fe0c0a70b09 in get_dc_list (domain=domain@entry=0x0, sitename=sitename@entry=0x7fe0c5eaa4a0 "IR-Ireland", ip_list=ip_list@entry=0x7fffb021a920, count=count@entry=0x7fffb021a91c, lookup_type=lookup_type@entry=DC_ADS_ONLY, ordered=ordered@entry=0x7fffb021a88f) at ../source3/libsmb/namequery.c:3114
#14 0x00007fe0c0a71a3f in get_sorted_dc_list (domain=0x0, sitename=sitename@entry=0x7fe0c5eaa4a0 "IR-Ireland", ip_list=ip_list@entry=0x7fffb021a920, count=count@entry=0x7fffb021a91c, ads_only=ads_only@entry=true) at ../source3/libsmb/namequery.c:3295
#15 0x00007fe0c3f657cd in get_dcs (mem_ctx=0x7fe0c5ec0150, domain=domain@entry=0x7fe0c5ebf850, dcs=dcs@entry=0x7fffb021ab70, num_dcs=num_dcs@entry=0x7fffb021ab6c) at ../source3/winbindd/winbindd_cm.c:1348
#16 0x00007fe0c3f65e50 in fork_child_dc_connect (domain=0x7fe0c5ebf850) at ../source3/winbindd/winbindd_cm.c:264
#17 check_domain_online_handler (ctx=<optimized out>, te=<optimized out>, now=..., private_data=0x7fe0c5ebf850) at ../source3/winbindd/winbindd_cm.c:325
#18 0x00007fe0bd63dfbf in tevent_common_loop_timer_delay () from /lib64/libtevent.so.0
#19 0x00007fe0bd63efca in epoll_event_loop_once () from /lib64/libtevent.so.0
#20 0x00007fe0bd63d6b7 in std_event_loop_once () from /lib64/libtevent.so.0
#21 0x00007fe0bd639f2d in _tevent_loop_once () from /lib64/libtevent.so.0
#22 0x00007fe0c3f4632a in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../source3/winbindd/winbindd.c:1588

(gdb) up 15
#15 0x00007fe0c3f657cd in get_dcs (mem_ctx=0x7fe0c5ec0150, domain=domain@entry=0x7fe0c5ebf850, dcs=dcs@entry=0x7fffb021ab70, num_dcs=num_dcs@entry=0x7fffb021ab6c) at ../source3/winbindd/winbindd_cm.c:1348
1348 get_sorted_dc_list(domain->alt_name, sitename, &ip_list,
(gdb) p domain
$1 = (struct winbindd_domain *) 0x7fe0c5ebf850
(gdb) p domain->alt_name
$3 = 0x0
(gdb) p *domain
$2 = {name = 0x7fe0c5ebf3b0 "IRRDM01", alt_name = 0x0, forest_name = 0x0,
  sid = {sid_rev_num = 1 '\001', num_auths = 4 '\004', id_auth = "\000\000\000\000\000\005", sub_auths = {21, 984154414, 1598771514, 316617838, 0 <repeats 11 times>}},
  domain_flags = 0, domain_type = 0, domain_trust_attribs = 0, initialized = false, native_mode = false, active_directory = false, primary = false, internal = false, online = false,
  startup_time = 413, startup = true, can_do_samlogon_ex = false, can_do_ncacn_ip_tcp = false, can_do_validation6 = false,
  methods = 0x7fe0c42376c0 <cache_methods>, backend = 0x0, private_data = 0x0, have_idmap_config = false, id_range_low = 0, id_range_high = 0, dc_probe_pid = 0, dcname = 0x0,
  dcaddr = {ss_family = 0, __ss_align = 0, __ss_padding = '\000' <repeats 111 times>},
  last_seq_check = 0, sequence_number = 4294967295, last_status = {v = 0},
  conn = {cli = 0x0, samr_pipe = 0x0,
    sam_connect_handle = {handle_type = 0, uuid = {time_low = 0, time_mid = 0, time_hi_and_version = 0, clock_seq = "\000", node = "\000\000\000\000\000"}},
    sam_domain_handle = {handle_type = 0, uuid = {time_low = 0, time_mid = 0, time_hi_and_version = 0, clock_seq = "\000", node = "\000\000\000\000\000"}},
    lsa_pipe = 0x0, lsa_pipe_tcp = 0x0,
    lsa_policy = {handle_type = 0, uuid = {time_low = 0, time_mid = 0, time_hi_and_version = 0, clock_seq = "\000", node = "\000\000\000\000\000"}},
    netlogon_pipe = 0x0},
  children = 0x7fe0c5ebfa90, check_online_timeout = 0, check_online_event = 0x0, prev = 0x7fe0c5ebe4d0, next = 0x7fe0c5ebfbf0}
Hmmm. How can we get in a state where domain->alt_name == NULL... Investigating.
Bleeeegghhh. Looks like the ability for alt_name == NULL is built into much of the winbindd/AD contacting code. I think this is a mistake. However, this is a bigger patch than I can do for this bug right now. In the meantime, I'm going through all the places where alt_name is referenced and passed into functions, and fixing them up to be safe when passed a NULL. Ugly, ugly, ugly :-(.
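For readers following along, here is a minimal sketch of the kind of NULL guard meant here. It is not the attached patch and not the Samba source: is_ipaddress_v4_safe() is a hypothetical stand-in for a helper like the is_ipaddress_v4() seen in frame #10, and it only assumes the POSIX inet_pton() call that appears in frame #9 of the backtrace.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical illustration only, not the Samba implementation.
 * Returns true if str is a dotted-quad IPv4 address, false otherwise.
 */
static bool is_ipaddress_v4_safe(const char *str)
{
	struct in_addr dest;

	if (str == NULL) {
		/*
		 * A NULL name (e.g. domain->alt_name == NULL) is simply
		 * "not an address". Passing it on to inet_pton() is what
		 * faulted in the backtrace above.
		 */
		return false;
	}

	/* inet_pton() returns 1 only when str parses as a valid address. */
	return inet_pton(AF_INET, str, &dest) == 1;
}
```

With a guard like this, a caller that hands in a NULL name gets a plain "false" back instead of a segfault, at the cost of hiding the fact that the name was never set in the first place.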
Created attachment 10115 [details]
git-am fix for master and 4.1.next

David, can you test this for me ? I think this will fix the problem and stop us getting into the bad name lookup code paths. Once you've confirmed I'll get it into master and all released branches.

Thanks !

Jeremy.
Will not be instant; I was unable to reliably reproduce this on demand so I'll leave it running for a while.
Thanks David, much appreciated !
I've added this to my test build so at least I know when it *would* have happened, and if this triggers without any other adverse effects then we'll have a reasonable amount of confidence in the efficacy of your patch.

--- a/source3/winbindd/winbindd_cm.c
+++ b/source3/winbindd/winbindd_cm.c
@@ -1331,7 +1331,8 @@ static bool get_dcs(TALLOC_CTX *mem_ctx, struct winbindd_domain *domain,
 		return True;
 	}
 
-	if ((sec == SEC_ADS) && (domain->alt_name != NULL)) {
+	if ((sec == SEC_ADS)) {
+	    if (domain->alt_name != NULL) {
 		char *sitename = NULL;
 
 		/* We need to make sure we know the local site before
@@ -1391,6 +1392,9 @@ static bool get_dcs(TALLOC_CTX *mem_ctx, struct winbindd_domain *domain,
 		SAFE_FREE(ip_list);
 		iplist_size = 0;
+	    } else {
+		DEBUG(1, ("get_dcs: alt_name is NULL for domain %s", domain->name));
+	    }
 	}
 
 	/* Try standard netbios queries if no ADS and fall back to DNS queries
Good call, thanks !
Ping. Any updates on this one ? I'd love to get this fixed in a real release..
I haven't seen the printf I added. Starting to wonder if this only happened the first time, after joining the domain on a new machine. Will wipe /var/lib/samba and run my assimilation scripts again, then briefly join and leave the VPN, and see if I can get it to happen.
Gr. I finally managed to reproduce this again on my laptop (yay for crappy hotel networks) but Fedora had shipped a Samba package update since I'd made my test build, so I was no longer running with the fix and the canary; I just got the crash again. Installing the patched version again, and maybe by the end of the week I'll see it again...
Finally! Apologies for the delay. Let me know if you want other logs which might help shed light on how it happened.

[2014/08/27 04:58:25.380326, 3, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:1867(connection_ok)
  connection_ok: Connection to IRSGER203.ger.corp.intel.com for domain GER is not connected
[2014/08/27 04:58:25.456960, 10, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:1642(cm_open_connection)
  cm_open_connection: saf_servername is 'IRSGER201.ger.corp.intel.com' for domain GER
[2014/08/27 04:58:25.457083, 10, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:1684(cm_open_connection)
  cm_open_connection: dcname is 'IRSGER201.ger.corp.intel.com' for domain GER
[2014/08/27 04:58:25.514715, 10, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:882(cm_prepare_connection)
  cm_prepare_connection: connecting to DC IRSGER201.ger.corp.intel.com for domain GER
[2014/08/27 04:58:25.572076, 5, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:941(cm_prepare_connection)
  connecting to IRSGER201.ger.corp.intel.com from DWMW2-SHINYBOOK with kerberos principal [DWMW2-SHINYBOOK$@GER.CORP.INTEL.COM] and realm [ger.corp.intel.com]
[2014/08/27 04:58:25.809053, 4, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:955(cm_prepare_connection)
  failed kerberos session setup with NT_STATUS_UNSUCCESSFUL
[2014/08/27 04:58:25.809232, 5, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:973(cm_prepare_connection)
  connecting to IRSGER201.ger.corp.intel.com from DWMW2-SHINYBOOK with username [GER]\[DWMW2-SHINYBOOK$]
[2014/08/27 04:58:26.001816, 10, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:475(set_domain_online)
  set_domain_online: called for domain GER
[2014/08/27 04:58:28.988898, 10, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:769(get_dc_name_via_netlogon)
  dcerpc_netr_GetAnyDCName failed: WERR_NO_SUCH_DOMAIN
[2014/08/27 04:58:28.989128, 1, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:1397(get_dcs)
  get_dcs: alt_name is NULL for domain IRRDM01
[2014/08/27 04:58:58.994931, 3, pid=29074, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:1867(connection_ok)
  connection_ok: Connection to IRSGER203.ger.corp.intel.com for domain GER is not connected
Ok, so this is confirmation that this patch actually fixes the crash bug - yeah ? If so I'd like to get it into master. Still not sure exactly why it happened, but the patch certainly seems to stop the crash. Jeremy.
(In reply to comment #12)
> Ok, so this is confirmation that this patch actually fixes the crash bug - yeah
> ?

Yes. It hit that canary I added in comment 6 and it *didn't* crash.
OK, requested this go into master, once it's in I'll get it back-ported and into 4.1.next, 4.0.next. Jeremy.
Created attachment 10288 [details]
git-am fix that went into master.

Applies cleanly to 4.1.next. Fix for 4.1.x.
Created attachment 10296 [details]
Patch for v4-1-test, with cherry-pick-info

Updated patch with cherry-pick-info. Jeremy, please re-ack and then assign to Karolin.
Re-assigning to Karolin for inclusion in 4.1.next.
Pushed to autobuild-v4-1-test.
Pushed to v4-1-test. Closing out bug report. Thanks!