Bug 7636 - winbind internal error, backtrace
Summary: winbind internal error, backtrace
Status: RESOLVED FIXED
Alias: None
Product: Samba 3.5
Classification: Unclassified
Component: Winbind (show other bugs)
Version: 3.5.4
Hardware: Other Linux
: P3 normal
Target Milestone: ---
Assignee: Karolin Seeger
QA Contact: Samba QA Contact
URL:
Keywords:
: 8062 (view as bug list)
Depends on:
Blocks:
 
Reported: 2010-08-20 05:10 UTC by David Woodhouse
Modified: 2011-04-05 20:03 UTC (History)
3 users (show)

See Also:


Attachments
Plain patch fro 3.5.x. (2.84 KB, patch)
2010-08-23 15:02 UTC, Jeremy Allison
no flags Details
Guenther's patch for 3.5.x to fix the allocation problem. (3.96 KB, patch)
2010-08-23 15:09 UTC, Jeremy Allison
no flags Details
gt-am format patch for 3.5.x. (3.26 KB, patch)
2010-08-23 15:10 UTC, Jeremy Allison
no flags Details
git am format patch for 3.4.x. (746 bytes, patch)
2010-08-23 18:32 UTC, Jeremy Allison
gd: review+
Details
git am fix for 3.5.x - part1 (3.33 KB, patch)
2010-08-23 19:12 UTC, Jeremy Allison
gd: review+
Details
git am fix for 3.5.x - part2 (5.09 KB, patch)
2010-08-23 19:13 UTC, Jeremy Allison
gd: review+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description David Woodhouse 2010-08-20 05:10:48 UTC
From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
===============================================================
PANIC (pid 2643): internal error
BACKTRACE: 22 stack frames:
 #0 /usr/sbin/winbindd(log_stack_trace+0x1a) [0x7ffff7a4d3ca]
 #1 /usr/sbin/winbindd(smb_panic+0x1f) [0x7ffff7a4d48f]
 #2 /usr/sbin/winbindd(+0x193dcd) [0x7ffff7a3cdcd]
 #3 /lib64/libc.so.6(+0x359a832a20) [0x7ffff5342a20]
 #4 /usr/sbin/winbindd(rpccli_set_timeout+0x4) [0x7ffff7b45874]
 #5 /usr/sbin/winbindd(winbindd_lookup_names+0xca) [0x7ffff7998c1a]
 #6 /usr/sbin/winbindd(+0xefdbe) [0x7ffff7998dbe]
 #7 /usr/sbin/winbindd(+0xf1698) [0x7ffff799a698]
 #8 /usr/sbin/winbindd(+0xf18a7) [0x7ffff799a8a7]
 #9 /usr/sbin/winbindd(+0xde20c) [0x7ffff798720c]
 #10 /usr/sbin/winbindd(_wbint_LookupName+0x47) [0x7ffff79a4bd7]
 #11 /usr/sbin/winbindd(+0x105491) [0x7ffff79ae491]
 #12 /usr/sbin/winbindd(winbindd_dual_ndrcmd+0xdc) [0x7ffff79a47bc]
 #13 /usr/sbin/winbindd(+0xfa85b) [0x7ffff79a385b]
 #14 /usr/sbin/winbindd(+0xfafeb) [0x7ffff79a3feb]
 #15 /usr/sbin/winbindd(tevent_common_loop_immediate+0xd7) [0x7ffff7a5e5b7]
 #16 /usr/sbin/winbindd(run_events+0x4b) [0x7ffff7a5cd8b]
 #17 /usr/sbin/winbindd(+0x1b40e7) [0x7ffff7a5d0e7]
 #18 /usr/sbin/winbindd(_tevent_loop_once+0x90) [0x7ffff7a5d580]
 #19 /usr/sbin/winbindd(main+0x92b) [0x7ffff797a5eb]
 #20 /lib64/libc.so.6(__libc_start_main+0xfd) [0x7ffff532ec5d]
 #21 /usr/sbin/winbindd(+0xcf079) [0x7ffff7978079]
dumping core in /var/log/samba/cores/winbindd

Not entirely sure how I triggered this; I installed the latest Fedora test package (with bug #7589 and bug #7590 fixed) and was playing with re-joining the domain with different host names and watching whether the DNS got updated.

I was running in gdb to try to catch it, but it still escaped. I'll have to play with the fork-follows-child setting, presumably. Will debug more later if it's not immediately obvious to someone who knows the code...
Comment 1 Volker Lendecke 2010-08-20 05:18:42 UTC
To catch these things with gdb, set

panic action = /bin/sleep 999999

You can then remotely attach to the parent of the sleep process and do the "bt full".

Volker
Comment 2 David Woodhouse 2010-08-20 07:59:53 UTC
process_request: Handling async request 18811:GETPWNAM
getpwnam root
Entry has wrong sequence number: 29970033
Need to read 40 extra bytes
child daemon request 63
child_process_request: request fn NDRCMD
winbindd_dual_ndrcmd: Running command WBINT_LOOKUPNAME (CORP)
     wbint_LookupName: struct wbint_LookupName
        in: struct wbint_LookupName
            domain                   : *
                domain                   : 'GER'
            name                     : *
                name                     : 'ROOT'
            flags                    : 0x00000008 (8)
refresh_sequence_number: CORP time ok
refresh_sequence_number: CORP seq number is now 30032694
centry_expired: Key NS/GER/ROOT for domain CORP expired
wcache_fetch: entry NS/GER/ROOT expired for domain CORP
name_to_sid: [Cached] - doing backend query for name for domain CORP
rpc: name_to_sid name=GER\ROOT
name_to_sid [rpc] GER\ROOT for domain GER
cm_connect_lsa_tcp
wcache_tdc_fetch_domain: Searching for domain CORP
wcache_tdc_fetch_domain: Found domain CORP
wcache_tdc_fetch_domain: Searching for domain CORP
wcache_tdc_fetch_domain: Found domain CORP
===============================================================
INTERNAL ERROR: Signal 11 in pid 2643 (3.5.4-65.fc13)
Comment 3 David Woodhouse 2010-08-20 09:18:35 UTC
Can't reproduce at will. But stupid me, it gave me a core file to look at so maybe I don't need to.



(gdb) bt
#0  0x00007ffff53429a5 in raise (sig=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007ffff5344185 in abort () at abort.c:92
#2  0x00007ffff7a3c9d0 in dump_core () at lib/fault.c:337
#3  0x00007ffff7a4d4cd in smb_panic (why=<value optimized out>)
    at lib/util.c:1481
#4  0x00007ffff7a3cdcd in fault_report (sig=11) at lib/fault.c:52
#5  sig_fault (sig=11) at lib/fault.c:75
#6  <signal handler called>
#7  rpccli_set_timeout (rpc_cli=0x0, timeout=35000)
    at rpc_client/cli_pipe.c:3041
#8  0x00007ffff7998c1a in winbindd_lookup_names (mem_ctx=0x7ffff82328d0, 
    domain=<value optimized out>, num_names=1, names=0x7fffffffbfe8, 
    domains=0x0, sids=0x7fffffffbff8, types=0x7fffffffbff0)
    at winbindd/winbindd_rpc.c:1285
#9  0x00007ffff7998dbe in msrpc_name_to_sid (domain=0x7ffff82a7740, 
    mem_ctx=0x7ffff82328d0, domain_name=0x7ffff82c8f70 "GER", 
    name=<value optimized out>, flags=<value optimized out>, 
    sid=0x7ffff82d8030, type=0x7ffff82c6eb0) at winbindd/winbindd_rpc.c:314
#10 0x00007ffff799a698 in name_to_sid (domain=0x7ffff82a7740, 
    mem_ctx=0x7ffff82328d0, domain_name=0x7ffff82c8f70 "GER", 
    name=0x7ffff829ecb0 "ROOT", flags=8, sid=0x7ffff82d8030, 
    type=0x7ffff82c6eb0) at winbindd/winbindd_reconnect.c:139
---Type <return> to continue, or q <return> to quit---
#11 0x00007ffff799a8a7 in name_to_sid (domain=<value optimized out>, 
    mem_ctx=<value optimized out>, domain_name=<value optimized out>, 
    name=<value optimized out>, flags=<value optimized out>, 
    sid=<value optimized out>, type=0x7ffff82c6eb0)
    at winbindd/winbindd_ads.c:413
#12 0x00007ffff798720c in name_to_sid (domain=0x7ffff82a7740, 
    mem_ctx=0x7ffff82328d0, domain_name=0x7ffff82c8f70 "GER", 
    name=0x7ffff829ecb0 "ROOT", flags=8, sid=0x7ffff82d8030, 
    type=0x7ffff82c6eb0) at winbindd/winbindd_cache.c:1798
#13 0x00007ffff79a4bd7 in _wbint_LookupName (p=0x7fffffffc1a0, 
    r=0x7ffff82d82f0) at winbindd/winbindd_dual_srv.c:66
#14 0x00007ffff79ae491 in api_wbint_LookupName (p=0x7fffffffc1a0)
    at librpc/gen_ndr/srv_wbint.c:232
#15 0x00007ffff79a47bc in winbindd_dual_ndrcmd (domain=<value optimized out>, 
    state=0x7fffffffc480) at winbindd/winbindd_dual_ndr.c:271
#16 0x00007ffff79a385b in child_process_request (child=0x7ffff82a7cc8)
    at winbindd/winbindd_dual.c:480
#17 fork_domain_child (child=0x7ffff82a7cc8) at winbindd/winbindd_dual.c:1522
#18 0x00007ffff79a3feb in wb_child_request_trigger (req=0x7ffff8232160, 
    private_data=<value optimized out>) at winbindd/winbindd_dual.c:137
#19 0x00007ffff7a5e5b7 in tevent_common_loop_immediate (ev=0x7ffff8295850)
    at ../lib/tevent/tevent_immediate.c:135
#20 0x00007ffff7a5cd8b in run_events (ev=0x7ffff8295850, selrtn=0, 
---Type <return> to continue, or q <return> to quit---
    read_fds=0x0, write_fds=0x0) at lib/events.c:99
#21 0x00007ffff7a5d0e7 in s3_event_loop_once (ev=0x7ffff8295850, 
    location=<value optimized out>) at lib/events.c:192
#22 0x00007ffff7a5d580 in _tevent_loop_once (ev=0x7ffff8295850, 
    location=0x7ffff7dcff97 "winbindd/winbindd.c:1275")
    at ../lib/tevent/tevent.c:497
#23 0x00007ffff797a5eb in main (argc=<value optimized out>, 
    argv=<value optimized out>, envp=<value optimized out>)
    at winbindd/winbindd.c:1275
(gdb) up 7
#7  rpccli_set_timeout (rpc_cli=0x0, timeout=35000)
    at rpc_client/cli_pipe.c:3041
3041		if (rpc_cli->transport == NULL) {
(gdb) p rpc_cli
$1 = (struct rpc_pipe_client *) 0x0
(gdb) up
#8  0x00007ffff7998c1a in winbindd_lookup_names (mem_ctx=0x7ffff82328d0, 
    domain=<value optimized out>, num_names=1, names=0x7fffffffbfe8, 
    domains=0x0, sids=0x7fffffffbff8, types=0x7fffffffbff0)
    at winbindd/winbindd_rpc.c:1285
1285		orig_timeout = rpccli_set_timeout(cli, 35000);

Can either cm_connect_lsa() or cm_connect_lsa_tcp() be returning an OK status without setting cli?
Comment 4 Jeremy Allison 2010-08-23 14:20:11 UTC
cm_connect_lsa_tcp() can potentially return NT_STATUS_OK without filling in *cli if cm_get_schannel_creds() fails. I'll attach a patch for this for you to test.

Jeremy.
Comment 5 Guenther Deschner 2010-08-23 14:40:18 UTC
Jeremy, please look at the two commits in 
http://git.samba.org/?p=gd/samba/.git;a=shortlog;h=master-schannel
before digging deeper.
Comment 6 Jeremy Allison 2010-08-23 14:44:53 UTC
Not the right fix (although I appreciate the error message you used :-).
My patch changes cm_get_schannel_creds() to return NTSTATUS, which is what we should be doing here.

Sorry, you'll have to rebase :-).

Jeremy.
Comment 7 Guenther Deschner 2010-08-23 14:47:06 UTC
Ok, ok :) But please look at the 2nd commit, this is really a huge problem for multiple schanneld connections.
Comment 8 Jeremy Allison 2010-08-23 15:02:16 UTC
Created attachment 5916 [details]
Plain patch fro 3.5.x.
Comment 9 Jeremy Allison 2010-08-23 15:09:10 UTC
Created attachment 5917 [details]
Guenther's patch for 3.5.x to fix the allocation problem.
Comment 10 Jeremy Allison 2010-08-23 15:10:19 UTC
Created attachment 5918 [details]
gt-am format patch for 3.5.x.
Comment 11 Jeremy Allison 2010-08-23 15:31:23 UTC
Unfortunately, gd's patch breaks the s4 WAF build due to :

../../libcli/auth/credentials.c: In function ‘netlogon_creds_copy’:
../../libcli/auth/credentials.c:485: error: implicit declaration of function ‘sid_dup_talloc’
../../libcli/auth/credentials.c:485: warning: assignment makes pointer from integer without a cast

Will post a fix shortly.

Jeremy.
Comment 12 Jeremy Allison 2010-08-23 18:32:14 UTC
Created attachment 5920 [details]
git am format patch for 3.4.x.
Comment 13 Jeremy Allison 2010-08-23 19:12:29 UTC
Created attachment 5921 [details]
git am fix for 3.5.x - part1
Comment 14 Jeremy Allison 2010-08-23 19:13:06 UTC
Created attachment 5922 [details]
git am fix for 3.5.x - part2
Comment 15 Jeremy Allison 2010-08-23 19:15:45 UTC
David Woodhouse <dwmw2@infradead.org> - please apply the two attachments marked as "git am fix for 3.5.x - part1", and "part2". These should fix the issue for you. Please report back that they have fixed the problem.

Guenther, if you're satisfied with the patches for 3.5.x and 3.4.x please re-assign to Karolin for inclusion.

Jeremy
Comment 16 David Woodhouse 2010-08-23 20:12:41 UTC
Thanks. Given that I didn't really know how to reliably reproduce it, is there something I should be looking for in the logs which tells me it *would* have happened?
Comment 17 Jeremy Allison 2010-08-23 21:23:49 UTC
Errr - no, sorry :-). It's the absence of a crash. Rather like one hand clapping :-).

Jeremy.
Comment 18 David Woodhouse 2010-08-24 04:28:04 UTC
Hm, then I'll tell you in a week or so if it hasn't crashed :)
Comment 19 Guenther Deschner 2010-08-24 10:05:41 UTC
Karolin, please add to 3.4 (that one patch) and to 3.5 (the two patches)
Comment 20 Karolin Seeger 2010-09-06 13:43:52 UTC
Pushed to both branches.
Closing out bug report.
Please feel free to reopen if it's still an issue.

Thanks!
Comment 21 Guenther Deschner 2011-04-05 20:03:12 UTC
*** Bug 8062 has been marked as a duplicate of this bug. ***