Bug 10717 - Winbind crash on losing VPN connection

Status:	RESOLVED FIXED

Alias:	None

Product:	Samba 4.1 and newer
Classification:	Unclassified
Component:	Winbind
Version:	4.1.9
Hardware:	All All

Importance:	P5 normal
Target Milestone:	---
Assignee:	Karolin Seeger
QA Contact:	Samba QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2014-07-16 08:46 UTC by David Woodhouse
Modified:	2014-09-29 18:02 UTC
CC List:	1 user

See Also:

Attachments
git-am fix for master and 4.1.next (2.57 KB, patch) 2014-07-16 19:45 UTC, Jeremy Allison	no flags
git-am fix that went into master. Applies cleanly to 4.1.next. (2.72 KB, patch) 2014-09-16 00:46 UTC, Jeremy Allison	no flags
Patch for v4-1-test, with cherry-pick-info (2.79 KB, patch) 2014-09-19 19:50 UTC, Michael Adam	obnox: review+ jra: review+

Description David Woodhouse 2014-07-16 08:46:16 UTC

This happens intermittently when we fall off the corporate network and thus go offline. We have a NetworkManager dispatcher script which will invoke 'smbcontrol winbind offline' when that happens.

https://bugzilla.redhat.com/show_bug.cgi?id=1033595

#0  0x00007fe0bd05ac39 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fe0bd05c348 in __GI_abort () at abort.c:89
#2  0x00007fe0bf3e5a1b in dump_core () at ../source3/lib/dumpcore.c:336
#3  0x00007fe0bf3d00d7 in smb_panic_s3 (why=<optimized out>) at ../source3/lib/util.c:808
#4  0x00007fe0c36c044f in smb_panic (why=why@entry=0x7fe0c36cdcd4 "internal error") at ../lib/util/fault.c:159
#5  0x00007fe0c36c0666 in fault_report (sig=<optimized out>) at ../lib/util/fault.c:77
#6  sig_fault (sig=<optimized out>) at ../lib/util/fault.c:88
#7  <signal handler called>
#8  0x00007fe0bd1392b4 in inet_pton4 (dst=0x7fffb021a580 "ȥ!\260\377\177", src=0x1 <error: Cannot access memory at address 0x1>) at inet_pton.c:93
#9  __GI_inet_pton (af=af@entry=2, src=src@entry=0x0, dst=dst@entry=0x7fffb021a580) at inet_pton.c:59
#10 0x00007fe0c36bcab4 in is_ipaddress_v4 (str=str@entry=0x0) at ../lib/util/util_net.c:316
#11 0x00007fe0c36bcd59 in is_ipaddress (str=str@entry=0x0) at ../lib/util/util_net.c:366
#12 0x00007fe0c0a6f9a3 in internal_resolve_name (name=name@entry=0x0, name_type=name_type@entry=28, sitename=sitename@entry=0x7fe0c5eaa4a0 "IR-Ireland", return_iplist=return_iplist@entry=0x7fffb021a768, return_count=return_count@entry=0x7fffb021a754, resolve_order=resolve_order@entry=0x7fe0c0c83020 <ads_order>) at ../source3/libsmb/namequery.c:2600
#13 0x00007fe0c0a70b09 in get_dc_list (domain=domain@entry=0x0, sitename=sitename@entry=0x7fe0c5eaa4a0 "IR-Ireland", ip_list=ip_list@entry=0x7fffb021a920, count=count@entry=0x7fffb021a91c, lookup_type=lookup_type@entry=DC_ADS_ONLY, ordered=ordered@entry=0x7fffb021a88f) at ../source3/libsmb/namequery.c:3114
#14 0x00007fe0c0a71a3f in get_sorted_dc_list (domain=0x0, sitename=sitename@entry=0x7fe0c5eaa4a0 "IR-Ireland", ip_list=ip_list@entry=0x7fffb021a920, count=count@entry=0x7fffb021a91c, ads_only=ads_only@entry=true) at ../source3/libsmb/namequery.c:3295
#15 0x00007fe0c3f657cd in get_dcs (mem_ctx=0x7fe0c5ec0150, domain=domain@entry=0x7fe0c5ebf850, dcs=dcs@entry=0x7fffb021ab70, num_dcs=num_dcs@entry=0x7fffb021ab6c) at ../source3/winbindd/winbindd_cm.c:1348
#16 0x00007fe0c3f65e50 in fork_child_dc_connect (domain=0x7fe0c5ebf850) at ../source3/winbindd/winbindd_cm.c:264
#17 check_domain_online_handler (ctx=<optimized out>, te=<optimized out>, now=..., private_data=0x7fe0c5ebf850) at ../source3/winbindd/winbindd_cm.c:325
#18 0x00007fe0bd63dfbf in tevent_common_loop_timer_delay () from /lib64/libtevent.so.0
#19 0x00007fe0bd63efca in epoll_event_loop_once () from /lib64/libtevent.so.0
#20 0x00007fe0bd63d6b7 in std_event_loop_once () from /lib64/libtevent.so.0
#21 0x00007fe0bd639f2d in _tevent_loop_once () from /lib64/libtevent.so.0
#22 0x00007fe0c3f4632a in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../source3/winbindd/winbindd.c:1588
(gdb) up 15
#15 0x00007fe0c3f657cd in get_dcs (mem_ctx=0x7fe0c5ec0150, domain=domain@entry=0x7fe0c5ebf850, 
    dcs=dcs@entry=0x7fffb021ab70, num_dcs=num_dcs@entry=0x7fffb021ab6c)
    at ../source3/winbindd/winbindd_cm.c:1348
1348				get_sorted_dc_list(domain->alt_name, sitename, &ip_list,
(gdb) p domain
$1 = (struct winbindd_domain *) 0x7fe0c5ebf850
(gdb) p domain->alt_name
$3 = 0x0
(gdb) p *domain
$2 = {name = 0x7fe0c5ebf3b0 "IRRDM01", alt_name = 0x0, forest_name = 0x0, sid = {sid_rev_num = 1 '\001', 
    num_auths = 4 '\004', id_auth = "\000\000\000\000\000\005", sub_auths = {21, 984154414, 1598771514, 
      316617838, 0 <repeats 11 times>}}, domain_flags = 0, domain_type = 0, domain_trust_attribs = 0, 
  initialized = false, native_mode = false, active_directory = false, primary = false, internal = false, 
  online = false, startup_time = 413, startup = true, can_do_samlogon_ex = false, 
  can_do_ncacn_ip_tcp = false, can_do_validation6 = false, methods = 0x7fe0c42376c0 <cache_methods>, 
  backend = 0x0, private_data = 0x0, have_idmap_config = false, id_range_low = 0, id_range_high = 0, 
  dc_probe_pid = 0, dcname = 0x0, dcaddr = {ss_family = 0, __ss_align = 0, 
    __ss_padding = '\000' <repeats 111 times>}, last_seq_check = 0, sequence_number = 4294967295, 
  last_status = {v = 0}, conn = {cli = 0x0, samr_pipe = 0x0, sam_connect_handle = {handle_type = 0, uuid = {
        time_low = 0, time_mid = 0, time_hi_and_version = 0, clock_seq = "\000", 
        node = "\000\000\000\000\000"}}, sam_domain_handle = {handle_type = 0, uuid = {time_low = 0, 
        time_mid = 0, time_hi_and_version = 0, clock_seq = "\000", node = "\000\000\000\000\000"}}, 
    lsa_pipe = 0x0, lsa_pipe_tcp = 0x0, lsa_policy = {handle_type = 0, uuid = {time_low = 0, time_mid = 0, 
        time_hi_and_version = 0, clock_seq = "\000", node = "\000\000\000\000\000"}}, netlogon_pipe = 0x0}, 
  children = 0x7fe0c5ebfa90, check_online_timeout = 0, check_online_event = 0x0, prev = 0x7fe0c5ebe4d0, 
  next = 0x7fe0c5ebfbf0}
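
Frames #8 to #12 of the backtrace show a NULL name travelling from internal_resolve_name() through is_ipaddress() into inet_pton(), which dereferences it. The following is a minimal sketch of a guarded address check, not Samba's actual code; the function names is_ipaddress_v4_safe() and example_usage_ipcheck() are hypothetical and exist only for this illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Hypothetical guarded variant of an is_ipaddress_v4()-style helper.
 * inet_pton() must not be handed a NULL source string, so the NULL
 * case is answered here ("not an IPv4 address") before calling it.
 */
bool is_ipaddress_v4_safe(const char *str)
{
	struct in_addr dest;

	if (str == NULL) {
		return false;
	}
	/* inet_pton() returns 1 only when str is a valid dotted quad. */
	return inet_pton(AF_INET, str, &dest) == 1;
}

/* Small usage example covering the three interesting inputs. */
void example_usage_ipcheck(void)
{
	assert(is_ipaddress_v4_safe("192.0.2.1"));
	assert(!is_ipaddress_v4_safe("IRRDM01"));
	assert(!is_ipaddress_v4_safe(NULL));
}
```

With a guard like this the NULL seen in frame #10 would be reported as "not an address" instead of faulting, although it does not explain why the name was NULL in the first place.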

Comment 1 Jeremy Allison 2014-07-16 17:57:07 UTC

Hmmm. How can we get in a state where domain->alt_name == NULL...

Investigating.

Comment 2 Jeremy Allison 2014-07-16 18:58:03 UTC

Bleeeegghhh. Looks like the ability for alt_name == NULL is built into much of the winbindd/AD contacting code. I think this is a mistake. However, this is a bigger patch than I can do for this bug right now.

In the meantime, I'm going through all the places where alt_name is referenced and passed into functions, and fixing them up to be safe when passed a NULL. Ugly, ugly, ugly :-(.
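
As a rough illustration of what "safe when passed a NULL" can look like at a function boundary, here is a sketch under stated assumptions. It is not taken from the attached patch: the enum lookup_status, the function resolve_name_sketch() and the name "example.test" are all hypothetical and only stand in for a real name-resolution entry point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes for this sketch; not Samba's NTSTATUS values. */
enum lookup_status {
	LOOKUP_OK = 0,
	LOOKUP_INVALID_PARAMETER,
	LOOKUP_NOT_FOUND
};

/*
 * Hypothetical stand-in for a name-resolution entry point. It rejects
 * a NULL or empty name before doing any work, so helpers further down
 * the call chain never see a NULL pointer.
 */
enum lookup_status resolve_name_sketch(const char *name, size_t *count)
{
	if (count == NULL) {
		return LOOKUP_INVALID_PARAMETER;
	}
	*count = 0;

	if (name == NULL || name[0] == '\0') {
		return LOOKUP_INVALID_PARAMETER;
	}

	/* Placeholder for a real lookup: only one made-up name resolves. */
	if (strcmp(name, "example.test") == 0) {
		*count = 1;
		return LOOKUP_OK;
	}
	return LOOKUP_NOT_FOUND;
}

/* Small usage example: a NULL name is an error, not a crash. */
void example_usage_resolve(void)
{
	size_t count = 99;

	assert(resolve_name_sketch(NULL, &count) == LOOKUP_INVALID_PARAMETER);
	assert(count == 0);
	assert(resolve_name_sketch("example.test", &count) == LOOKUP_OK);
	assert(count == 1);
}
```

Checking at the entry point keeps the guard in one place per call chain, which is less invasive than the larger change of guaranteeing that alt_name is never NULL.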

Comment 3 Jeremy Allison 2014-07-16 19:45:26 UTC

Created attachment 10115
git-am fix for master and 4.1.next

David, can you test this for me ? I think this will fix the problem and stop us getting into the bad name lookup code paths.

Once you've confirmed, I'll get it into master and all released branches.

Thanks !

Jeremy.

Comment 4 David Woodhouse 2014-07-16 19:54:39 UTC

Will not be instant; I was unable to reliably reproduce this on demand so I'll leave it running for a while.

Comment 5 Jeremy Allison 2014-07-16 20:16:59 UTC

Thanks David, much appreciated !

Comment 6 David Woodhouse 2014-07-17 09:21:32 UTC

I've added this to my test build so at least I know when it *would* have happened, and if this triggers without any other adverse effects then we'll have a reasonable amount of confidence in the efficacy of your patch.

--- a/source3/winbindd/winbindd_cm.c
+++ b/source3/winbindd/winbindd_cm.c
@@ -1331,7 +1331,8 @@ static bool get_dcs(TALLOC_CTX *mem_ctx, struct winbindd_domain *domain,
 		return True;
 	}
 
-	if ((sec == SEC_ADS) && (domain->alt_name != NULL)) {
+	if ((sec == SEC_ADS)) {
+	    if (domain->alt_name != NULL) {
 		char *sitename = NULL;
 
 		/* We need to make sure we know the local site before
@@ -1391,6 +1392,9 @@ static bool get_dcs(TALLOC_CTX *mem_ctx, struct winbindd_domain *domain,
 
 		SAFE_FREE(ip_list);
 		iplist_size = 0;
+	    } else {
+		DEBUG(1, ("get_dcs: alt_name is NULL for domain %s", domain->name));
+	    }
         }
 
 	/* Try standard netbios queries if no ADS and fall back to DNS queries
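
The control flow of the canary above can be shown in isolation. The sketch below is a standalone approximation, assuming a cut-down struct domain_sketch and a function try_ads_lookup_sketch() that are both hypothetical; it uses fprintf() where the real code uses Samba's DEBUG() macro, and it only models the branch structure, not the actual DC lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, cut-down stand-in for struct winbindd_domain. */
struct domain_sketch {
	const char *name;     /* NetBIOS-style name, e.g. "IRRDM01" */
	const char *alt_name; /* DNS name; NULL in the crashing case */
};

/*
 * Mirrors the shape of the canary: when security is ADS, take the
 * site-aware lookup path only if alt_name is set, otherwise log that
 * the NULL case was hit and skip the lookup. Returns true if the ADS
 * lookup path would have been taken.
 */
bool try_ads_lookup_sketch(const struct domain_sketch *domain,
			   bool sec_is_ads)
{
	if (!sec_is_ads) {
		return false;
	}
	if (domain->alt_name == NULL) {
		fprintf(stderr, "get_dcs: alt_name is NULL for domain %s\n",
			domain->name);
		return false;
	}
	/* The site-aware DNS lookup on alt_name would run here. */
	return true;
}

/* Small usage example with one healthy and one broken domain entry. */
void example_usage_canary(void)
{
	struct domain_sketch good = { "GER", "ger.example.test" };
	struct domain_sketch bad = { "IRRDM01", NULL };

	assert(try_ads_lookup_sketch(&good, true));
	assert(!try_ads_lookup_sketch(&bad, true));
}
```

Logging in the else branch is what makes this a canary: the message proves the NULL case occurred in a build that would otherwise have crashed.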

Comment 7 Jeremy Allison 2014-07-17 17:12:54 UTC

Good call, thanks !

Comment 8 Jeremy Allison 2014-07-25 22:40:25 UTC

Ping. Any updates on this one ? I'd love to get this fixed in a real release.

Comment 9 David Woodhouse 2014-07-25 22:43:09 UTC

I haven't seen the printf I added. Starting to wonder if this only happened the first time, after joining the domain on a new machine. Will wipe /var/lib/samba and run my assimilation scripts again, then briefly join and leave the VPN, and see if I can get it to happen.

Comment 10 David Woodhouse 2014-08-19 22:52:34 UTC

Gr. I finally managed to reproduce this again on my laptop (yay for crappy hotel networks) but Fedora had shipped a Samba package update since I'd made my test build, so I was no longer running with the fix and the canary; I just got the crash again.

Installing the patched version again, and maybe by the end of the week I'll see it again...

Comment 11 David Woodhouse 2014-09-01 12:54:16 UTC

Finally! Apologies for the delay. Let me know if you want other logs which might help shed light on how it happened.

[2014/08/27 04:58:25.380326,  3, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:1867(connection_ok)
  connection_ok: Connection to IRSGER203.ger.corp.intel.com for domain GER is not connected
[2014/08/27 04:58:25.456960, 10, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:1642(cm_open_connection)
  cm_open_connection: saf_servername is 'IRSGER201.ger.corp.intel.com' for domain GER
[2014/08/27 04:58:25.457083, 10, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:1684(cm_open_connection)
  cm_open_connection: dcname is 'IRSGER201.ger.corp.intel.com' for domain GER
[2014/08/27 04:58:25.514715, 10, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:882(cm_prepare_connection)
  cm_prepare_connection: connecting to DC IRSGER201.ger.corp.intel.com for domain GER
[2014/08/27 04:58:25.572076,  5, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:941(cm_prepare_connection)
  connecting to IRSGER201.ger.corp.intel.com from DWMW2-SHINYBOOK with kerberos principal [DWMW2-SHINYBOOK$@GER.CORP.INTEL.COM] and realm [ger.corp.intel.com]
[2014/08/27 04:58:25.809053,  4, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:955(cm_prepare_connection)
  failed kerberos session setup with NT_STATUS_UNSUCCESSFUL
[2014/08/27 04:58:25.809232,  5, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:973(cm_prepare_connection)
  connecting to IRSGER201.ger.corp.intel.com from DWMW2-SHINYBOOK with username [GER]\[DWMW2-SHINYBOOK$]
[2014/08/27 04:58:26.001816, 10, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:475(set_domain_online)
  set_domain_online: called for domain GER
[2014/08/27 04:58:28.988898, 10, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:769(get_dc_name_via_netlogon)
  dcerpc_netr_GetAnyDCName failed: WERR_NO_SUCH_DOMAIN
[2014/08/27 04:58:28.989128,  1, pid=29037, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:1397(get_dcs)
  get_dcs: alt_name is NULL for domain IRRDM01
[2014/08/27 04:58:58.994931,  3, pid=29074, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:1867(connection_ok)
  connection_ok: Connection to IRSGER203.ger.corp.intel.com for domain GER is not connected

Comment 12 Jeremy Allison 2014-09-03 15:09:16 UTC

Ok, so this is confirmation that this patch actually fixes the crash bug - yeah ?

If so I'd like to get it into master.

Still not sure exactly why it happened, but the patch certainly seems to stop the crash.

Jeremy.

Comment 13 David Woodhouse 2014-09-03 15:54:45 UTC

(In reply to comment #12)
> Ok, so this is confirmation that this patch actually fixes the crash bug - yeah
> ?

Yes. It hit that canary I added in comment 6 and it *didn't* crash.

Comment 14 Jeremy Allison 2014-09-12 23:36:24 UTC

OK, requested this go into master. Once it's in I'll get it back-ported into 4.1.next and 4.0.next.

Jeremy.

Comment 15 Jeremy Allison 2014-09-16 00:46:09 UTC

Created attachment 10288
git-am fix that went into master. Applies cleanly to 4.1.next.

Fix for 4.1.x.

Comment 16 Michael Adam 2014-09-19 19:50:30 UTC

Created attachment 10296
Patch for v4-1-test, with cherry-pick-info

Updated patch with cherry-pick-info.
Jeremy, please re-ack and then assign to Karolin.

Comment 17 Jeremy Allison 2014-09-19 20:40:50 UTC

Re-assigning to Karolin for inclusion in 4.1.next.

Comment 18 Karolin Seeger 2014-09-27 18:01:41 UTC

Pushed to autobuild-v4-1-test.

Comment 19 Karolin Seeger 2014-09-29 18:02:16 UTC

Pushed to v4-1-test.
Closing out bug report.

Thanks!