Bug 10717 - Winbind crash on losing VPN connection
Summary: Winbind crash on losing VPN connection
Status: RESOLVED FIXED
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: Winbind
Version: 4.1.9
Hardware: All All
Importance: P5 normal
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-07-16 08:46 UTC by David Woodhouse
Modified: 2014-09-29 18:02 UTC
CC List: 1 user

See Also:


Attachments
git-am fix for master and 4.1.next (2.57 KB, patch)
2014-07-16 19:45 UTC, Jeremy Allison
no flags
git-am fix that went into master. Applies cleanly to 4.1.next. (2.72 KB, patch)
2014-09-16 00:46 UTC, Jeremy Allison
no flags
Patch for v4-1-test, with cherry-pick-info (2.79 KB, patch)
2014-09-19 19:50 UTC, Michael Adam
obnox: review+
jra: review+

Description David Woodhouse 2014-07-16 08:46:16 UTC
This happens intermittently when we fall off the corporate network and thus go offline. We have a NetworkManager dispatcher script which will invoke 'smbcontrol winbind offline' when that happens.

https://bugzilla.redhat.com/show_bug.cgi?id=1033595

#0  0x00007fe0bd05ac39 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fe0bd05c348 in __GI_abort () at abort.c:89
#2  0x00007fe0bf3e5a1b in dump_core () at ../source3/lib/dumpcore.c:336
#3  0x00007fe0bf3d00d7 in smb_panic_s3 (why=<optimized out>) at ../source3/lib/util.c:808
#4  0x00007fe0c36c044f in smb_panic (why=why@entry=0x7fe0c36cdcd4 "internal error") at ../lib/util/fault.c:159
#5  0x00007fe0c36c0666 in fault_report (sig=<optimized out>) at ../lib/util/fault.c:77
#6  sig_fault (sig=<optimized out>) at ../lib/util/fault.c:88
#7  <signal handler called>
#8  0x00007fe0bd1392b4 in inet_pton4 (dst=0x7fffb021a580 "ȥ!\260\377\177", src=0x1 <error: Cannot access memory at address 0x1>) at inet_pton.c:93
#9  __GI_inet_pton (af=af@entry=2, src=src@entry=0x0, dst=dst@entry=0x7fffb021a580) at inet_pton.c:59
#10 0x00007fe0c36bcab4 in is_ipaddress_v4 (str=str@entry=0x0) at ../lib/util/util_net.c:316
#11 0x00007fe0c36bcd59 in is_ipaddress (str=str@entry=0x0) at ../lib/util/util_net.c:366
#12 0x00007fe0c0a6f9a3 in internal_resolve_name (name=name@entry=0x0, name_type=name_type@entry=28, sitename=sitename@entry=0x7fe0c5eaa4a0 "IR-Ireland", return_iplist=return_iplist@entry=0x7fffb021a768, return_count=return_count@entry=0x7fffb021a754, resolve_order=resolve_order@entry=0x7fe0c0c83020 <ads_order>) at ../source3/libsmb/namequery.c:2600
#13 0x00007fe0c0a70b09 in get_dc_list (domain=domain@entry=0x0, sitename=sitename@entry=0x7fe0c5eaa4a0 "IR-Ireland", ip_list=ip_list@entry=0x7fffb021a920, count=count@entry=0x7fffb021a91c, lookup_type=lookup_type@entry=DC_ADS_ONLY, ordered=ordered@entry=0x7fffb021a88f) at ../source3/libsmb/namequery.c:3114
#14 0x00007fe0c0a71a3f in get_sorted_dc_list (domain=0x0, sitename=sitename@entry=0x7fe0c5eaa4a0 "IR-Ireland", ip_list=ip_list@entry=0x7fffb021a920, count=count@entry=0x7fffb021a91c, ads_only=ads_only@entry=true) at ../source3/libsmb/namequery.c:3295
#15 0x00007fe0c3f657cd in get_dcs (mem_ctx=0x7fe0c5ec0150, domain=domain@entry=0x7fe0c5ebf850, dcs=dcs@entry=0x7fffb021ab70, num_dcs=num_dcs@entry=0x7fffb021ab6c) at ../source3/winbindd/winbindd_cm.c:1348
#16 0x00007fe0c3f65e50 in fork_child_dc_connect (domain=0x7fe0c5ebf850) at ../source3/winbindd/winbindd_cm.c:264
#17 check_domain_online_handler (ctx=<optimized out>, te=<optimized out>, now=..., private_data=0x7fe0c5ebf850) at ../source3/winbindd/winbindd_cm.c:325
#18 0x00007fe0bd63dfbf in tevent_common_loop_timer_delay () from /lib64/libtevent.so.0
#19 0x00007fe0bd63efca in epoll_event_loop_once () from /lib64/libtevent.so.0
#20 0x00007fe0bd63d6b7 in std_event_loop_once () from /lib64/libtevent.so.0
#21 0x00007fe0bd639f2d in _tevent_loop_once () from /lib64/libtevent.so.0
#22 0x00007fe0c3f4632a in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../source3/winbindd/winbindd.c:1588
(gdb) up 15
#15 0x00007fe0c3f657cd in get_dcs (mem_ctx=0x7fe0c5ec0150, domain=domain@entry=0x7fe0c5ebf850, 
    dcs=dcs@entry=0x7fffb021ab70, num_dcs=num_dcs@entry=0x7fffb021ab6c)
    at ../source3/winbindd/winbindd_cm.c:1348
1348				get_sorted_dc_list(domain->alt_name, sitename, &ip_list,
(gdb) p domain
$1 = (struct winbindd_domain *) 0x7fe0c5ebf850
(gdb) p domain->alt_name
$3 = 0x0
(gdb) p *domain
$2 = {name = 0x7fe0c5ebf3b0 "IRRDM01", alt_name = 0x0, forest_name = 0x0, sid = {sid_rev_num = 1 '\001', 
    num_auths = 4 '\004', id_auth = "\000\000\000\000\000\005", sub_auths = {21, 984154414, 1598771514, 
      316617838, 0 <repeats 11 times>}}, domain_flags = 0, domain_type = 0, domain_trust_attribs = 0, 
  initialized = false, native_mode = false, active_directory = false, primary = false, internal = false, 
  online = false, startup_time = 413, startup = true, can_do_samlogon_ex = false, 
  can_do_ncacn_ip_tcp = false, can_do_validation6 = false, methods = 0x7fe0c42376c0 <cache_methods>, 
  backend = 0x0, private_data = 0x0, have_idmap_config = false, id_range_low = 0, id_range_high = 0, 
  dc_probe_pid = 0, dcname = 0x0, dcaddr = {ss_family = 0, __ss_align = 0, 
    __ss_padding = '\000' <repeats 111 times>}, last_seq_check = 0, sequence_number = 4294967295, 
  last_status = {v = 0}, conn = {cli = 0x0, samr_pipe = 0x0, sam_connect_handle = {handle_type = 0, uuid = {
        time_low = 0, time_mid = 0, time_hi_and_version = 0, clock_seq = "\000", 
        node = "\000\000\000\000\000"}}, sam_domain_handle = {handle_type = 0, uuid = {time_low = 0, 
        time_mid = 0, time_hi_and_version = 0, clock_seq = "\000", node = "\000\000\000\000\000"}}, 
    lsa_pipe = 0x0, lsa_pipe_tcp = 0x0, lsa_policy = {handle_type = 0, uuid = {time_low = 0, time_mid = 0, 
        time_hi_and_version = 0, clock_seq = "\000", node = "\000\000\000\000\000"}}, netlogon_pipe = 0x0}, 
  children = 0x7fe0c5ebfa90, check_online_timeout = 0, check_online_event = 0x0, prev = 0x7fe0c5ebe4d0, 
  next = 0x7fe0c5ebfbf0}
Comment 1 Jeremy Allison 2014-07-16 17:57:07 UTC
Hmmm. How can we get in a state where domain->alt_name == NULL...

Investigating.
Comment 2 Jeremy Allison 2014-07-16 18:58:03 UTC
Bleeeegghhh. It looks like the possibility of alt_name == NULL is built into much of the winbindd code that contacts AD. I think this is a mistake, but fixing it properly is a bigger patch than I can do for this bug right now.

In the meantime, I'm going through all the places where alt_name is referenced and passed into functions, and making them safe when passed a NULL. Ugly, ugly, ugly :-(.
Comment 3 Jeremy Allison 2014-07-16 19:45:26 UTC
Created attachment 10115
git-am fix for master and 4.1.next

David, can you test this for me? I think this will fix the problem and stop us getting into the bad name lookup code paths.

Once you've confirmed, I'll get it into master and all released branches.

Thanks !

Jeremy.
Comment 4 David Woodhouse 2014-07-16 19:54:39 UTC
Will not be instant; I was unable to reproduce this reliably on demand, so I'll leave it running for a while.
Comment 5 Jeremy Allison 2014-07-16 20:16:59 UTC
Thanks David, much appreciated !
Comment 6 David Woodhouse 2014-07-17 09:21:32 UTC
I've added this to my test build so at least I know when it *would* have happened, and if this triggers without any other adverse effects then we'll have a reasonable amount of confidence in the efficacy of your patch.

--- a/source3/winbindd/winbindd_cm.c
+++ b/source3/winbindd/winbindd_cm.c
@@ -1331,7 +1331,8 @@ static bool get_dcs(TALLOC_CTX *mem_ctx, struct winbindd_domain *domain,
 		return True;
 	}
 
-	if ((sec == SEC_ADS) && (domain->alt_name != NULL)) {
+	if ((sec == SEC_ADS)) {
+	    if (domain->alt_name != NULL) {
 		char *sitename = NULL;
 
 		/* We need to make sure we know the local site before
@@ -1391,6 +1392,9 @@ static bool get_dcs(TALLOC_CTX *mem_ctx, struct winbindd_domain *domain,
 
 		SAFE_FREE(ip_list);
 		iplist_size = 0;
+	    } else {
+		DEBUG(1, ("get_dcs: alt_name is NULL for domain %s", domain->name));
+	    }
         }
 
 	/* Try standard netbios queries if no ADS and fall back to DNS queries
Comment 7 Jeremy Allison 2014-07-17 17:12:54 UTC
Good call, thanks !
Comment 8 Jeremy Allison 2014-07-25 22:40:25 UTC
Ping. Any updates on this one? I'd love to get this fixed in a real release.
Comment 9 David Woodhouse 2014-07-25 22:43:09 UTC
I haven't seen the printf I added. Starting to wonder if this only happened the first time, after joining the domain on a new machine. Will wipe /var/lib/samba and run my assimilation scripts again, then briefly join and leave the VPN, and see if I can get it to happen.
Comment 10 David Woodhouse 2014-08-19 22:52:34 UTC
Gr. I finally managed to reproduce this again on my laptop (yay for crappy hotel networks) but Fedora had shipped a Samba package update since I'd made my test build, so I was no longer running with the fix and the canary; I just got the crash again.

Installing the patched version again, and maybe by the end of the week I'll see it again...
Comment 11 David Woodhouse 2014-09-01 12:54:16 UTC
Finally! Apologies for the delay. Let me know if you want other logs which might help shed light on how it happened.

[2014/08/27 04:58:25.380326,  3, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:1867(connection_ok)
  connection_ok: Connection to IRSGER203.ger.corp.intel.com for domain GER is not connected
[2014/08/27 04:58:25.456960, 10, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:1642(cm_open_connection)
  cm_open_connection: saf_servername is 'IRSGER201.ger.corp.intel.com' for domain GER
[2014/08/27 04:58:25.457083, 10, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:1684(cm_open_connection)
  cm_open_connection: dcname is 'IRSGER201.ger.corp.intel.com' for domain GER
[2014/08/27 04:58:25.514715, 10, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:882(cm_prepare_connection)
  cm_prepare_connection: connecting to DC IRSGER201.ger.corp.intel.com for domain GER
[2014/08/27 04:58:25.572076,  5, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:941(cm_prepare_connection)
  connecting to IRSGER201.ger.corp.intel.com from DWMW2-SHINYBOOK with kerberos principal [DWMW2-SHINYBOOK$@GER.CORP.INTEL.COM] and realm [ger.corp.intel.com]
[2014/08/27 04:58:25.809053,  4, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:955(cm_prepare_connection)
  failed kerberos session setup with NT_STATUS_UNSUCCESSFUL
[2014/08/27 04:58:25.809232,  5, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:973(cm_prepare_connection)
  connecting to IRSGER201.ger.corp.intel.com from DWMW2-SHINYBOOK with username [GER]\[DWMW2-SHINYBOOK$]
[2014/08/27 04:58:26.001816, 10, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:475(set_domain_online)
  set_domain_online: called for domain GER
[2014/08/27 04:58:28.988898, 10, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:769(get_dc_name_via_netlogon)
  dcerpc_netr_GetAnyDCName failed: WERR_NO_SUCH_DOMAIN
[2014/08/27 04:58:28.989128,  1, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:1397(get_dcs)
  get_dcs: alt_name is NULL for domain IRRDM01
[2014/08/27 04:58:58.994931,  3, pid=29074, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:1867(connection_ok)
  connection_ok: Connection to IRSGER203.ger.corp.intel.com for domain GER is not connected
Comment 12 Jeremy Allison 2014-09-03 15:09:16 UTC
Ok, so this is confirmation that this patch actually fixes the crash bug, yeah?

If so I'd like to get it into master.

Still not sure exactly why it happened, but the patch certainly seems to stop the crash.

Jeremy.
Comment 13 David Woodhouse 2014-09-03 15:54:45 UTC
(In reply to comment #12)
> Ok, so this is confirmation that this patch actually fixes the crash bug, yeah?

Yes. It hit that canary I added in comment 6 and it *didn't* crash.
Comment 14 Jeremy Allison 2014-09-12 23:36:24 UTC
OK, I've requested that this go into master. Once it's in, I'll get it back-ported into 4.1.next and 4.0.next.

Jeremy.
Comment 15 Jeremy Allison 2014-09-16 00:46:09 UTC
Created attachment 10288
git-am fix that went into master. Applies cleanly to 4.1.next.

Fix for 4.1.x.
Comment 16 Michael Adam 2014-09-19 19:50:30 UTC
Created attachment 10296
Patch for v4-1-test, with cherry-pick-info

Updated patch with cherry-pick-info.
Jeremy, please re-ack and then assign to Karolin.
Comment 17 Jeremy Allison 2014-09-19 20:40:50 UTC
Re-assigning to Karolin for inclusion in 4.1.next.
Comment 18 Karolin Seeger 2014-09-27 18:01:41 UTC
Pushed to autobuild-v4-1-test.
Comment 19 Karolin Seeger 2014-09-29 18:02:16 UTC
Pushed to v4-1-test.
Closing out bug report.

Thanks!