Bug 5814 - Winbindd dumping core in a strange manner while doing "rescan_trusted_domain"
Summary: Winbindd dumping core in a strange manner while doing "rescan_trusted_domain"
Status: CLOSED FIXED
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: winbind (show other bugs)
Version: 3.0.28
Hardware: Other Linux
: P3 critical
Target Milestone: none
Assignee: Jeremy Allison
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-10-07 04:07 UTC by hargagan
Modified: 2008-10-09 23:51 UTC (History)
2 users (show)

See Also:


Attachments
log file for winbindd (98.65 KB, application/octet-stream)
2008-10-07 04:08 UTC, hargagan
no flags Details
Patch for 3.0.x. (1.17 KB, patch)
2008-10-08 13:36 UTC, Jeremy Allison
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description hargagan 2008-10-07 04:07:16 UTC
We are using the samba version 3.0.28. Recently we have started seeing a core dump in winbindd. This is happening in a stressed setup. The stack trace is as follows :

#0  0xffffe410 in __kernel_vsyscall ()
#1  0xb7c848d0 in raise () from /lib/libc.so.6
#2  0xb7c85ff3 in abort () from /lib/libc.so.6
#3  0x800c05f9 in dump_core () at lib/fault.c:189
#4  0x800d5b55 in smb_panic (why=0x802281d8 "error (core file administratively disabled)\n") at lib/util.c:1645
#5  0x800c0afa in sig_fault (sig=6) at lib/fault.c:45
#6  <signal handler called>
#7  0xffffe410 in __kernel_vsyscall ()
#8  0xb7c848d0 in raise () from /lib/libc.so.6
#9  0xb7c85ff3 in abort () from /lib/libc.so.6
#10 0x800b959e in talloc_get_name (ptr=0x8029e998) at lib/talloc/talloc.c:620
#11 0x800b980c in talloc_check_name (ptr=0x8029e998, name=0x80217cd9 "ync_request_timeout") at lib/talloc/talloc.c:635
#12 0x800d5c2c in talloc_check_name_abort (ptr=0x8029e998, name=0x80217cd9 "ync_request_timeout") at lib/util.c:2972
#13 0x8006f46f in async_request_timeout_handler (ctx=0x8030b638, te=0x80280408, now=0xbf9132f4, private_data=0x8029e998)
    at nsswitch/winbindd_dual.c:177
#14 0x800f44f2 in run_events (event_ctx=0x8030b638, selrtn=0, read_fds=0x0, write_fds=0x0) at lib/events.c:229
#15 0x80040ae8 in main (argc=1627389952, argv=0x0, envp=0x0) at nsswitch/winbindd.c:796
(gdb) f 13
#13 0x8006f46f in async_request_timeout_handler (ctx=0x8030b638, te=0x80280408, now=0xbf9132f4, private_data=0x8029e998)
    at nsswitch/winbindd_dual.c:177
177                     talloc_get_type_abort(private_data, struct winbindd_async_request);
(gdb) p *(struct winbindd_async_request *)private_data
$2 = {next = 0x0, prev = 0x0, mem_ctx = 0x8030b430, child = 0x8030bbe8, request = 0x8030c9e8, response = 0x8030d248,
  continuation = 0x8004ac50 <rescan_trusted_domains+512>, reply_timeout_event = 0x0, child_pid = 28347, private_data = 0x8030df28}
(gdb) p *(struct talloc_chunk *)(private_data - 48)
$3 = {next = 0xb7d851f0, prev = 0xb7d851f0, parent = 0x8030b400, child = 0x0, refs = 0x0, destructor = 0,
  name = 0x80217cd9 "ync_request_timeout", size = 40, flags = 3893685363}
(gdb) p /x *(struct talloc_chunk *)(private_data - 48)
$4 = {next = 0xb7d851f0, prev = 0xb7d851f0, parent = 0x8030b400, child = 0x0, refs = 0x0, destructor = 0x0, name = 0x80217cd9,
  size = 0x28, flags = 0xe814ec73}

**************************************************************************


My analysis of the problem is as follows:

1. The dump is happening while the timeout handler is called for the rescan_trusted_domain() request.
2. There it is doing "smb_panic()" because it finds that "private_data" is already freed.
3. This free could happen from the "trustdom_recv()" which is the continuation function of this request.
4. I am continuously getting the error from the "trustdom_recv()" function, 
"[2008/10/04 19:06:23, 1] nsswitch/winbindd_util.c:trustdom_recv(237)
  Could not receive trustdoms". Please see the log file for more logs.

Now my analysis halts at the puzzle which is as follows :

1. The free of the "state->mem_ctx" (or private_data) would only happen in the continuation function which is "trustdom_recv()" by the "talloc_destroy()", which is called in the error path of the "trustdom_recv()" function.

2. This continuation function would be called only when the response would have come for this request. But when the response comes, the timeout_handler would be destroyed by the destructor function. This would happen by doing "TALLOC_FREE(state->reply_timeout_event)" in the async_reply_recv().

3. So going by this above analysis, the timeout handler (async_request_timeout_handler()) shouldn't been called for this request as the continuation function has already been called for it from the "async_reply_recv()"

Please correct me if I am wrong my analysis. I would be attaching the log.winbindd shortly. But it is only with debug level 0, not sure how much it could help.

I am really stuck because of this core dump. Somebody please help me in this.
Comment 1 hargagan 2008-10-07 04:08:59 UTC
Created attachment 3663 [details]
log file for winbindd

Log file containing the error messages at level 1.
Comment 2 Gerald (Jerry) Carter (dead mail address) 2008-10-07 08:00:46 UTC
The current theory is that this crash is fixed by Volker's change in 3.0.32:

  o   Volker Lendecke <vl@samba.org>
      * Fix a race condition in winbind leading to a crash.

Please test that version and reopen if the bug crash still occurs.
Comment 3 hargagan 2008-10-07 08:31:10 UTC
Hi Gerald,
(In reply to comment #2)
> The current theory is that this crash is fixed by Volker's change in 3.0.32:
> 
>   o   Volker Lendecke <vl@samba.org>
>       * Fix a race condition in winbind leading to a crash.

Are these the git-hashes you are talking about :

http://gitweb.samba.org/?p=samba.git;a=commit;h=c70e2b6476d2d99c79624e15a4a3cfcdc850fc7c
and 
http://gitweb.samba.org/?p=samba.git;a=commit;h=c93d42969451949566327e7fdbf29bfcee2c8319

If so, we have already back-ported this fix to 3.0.28. With this fix also we are seeing this problem. This seems to be a different issue.

I am reopening the bug as with the above fix also the dump is happening.
Comment 4 hargagan 2008-10-07 09:43:20 UTC
These are my findings on the issue:

1. sys_waitpid() is detecting the child has died. It calls the winbindd_child_died().
2. In winbindd_child_died() we does some cleanup for the child (like close(), making zeros etc) and calls the schedule_async_request().
3. This call to schedule_async_request() add one more request by forking the child.

Now we have two request with two timeout handler for the same request. Now if the first one times out, then the timeout handler cleans the "private_data" for the request. Now the timeout handler for the second request is also fired and it finds that the "private_data" is already freed (using talloc_get_type_abort()) and does smb_panic().

So the problem is happening coz of timeout_handler not cleaned in the winbindd_child_died().

I will test based on this findings and put the patch.
Comment 5 Jeremy Allison 2008-10-07 12:19:21 UTC
This looks like it's already fixed in the current 3_0_X git tree.
winbind_child_died() is only called from async_request_fail(), and the current code in async_request_fail() looks like :

static void async_request_fail(struct winbindd_async_request *state)
{
        DLIST_REMOVE(state->child->requests, state);

        TALLOC_FREE(state->reply_timeout_event);

        /* If child exists and is not already reaped,
           send kill signal to child. */

        if ((state->child->pid != (pid_t)0) &&
                        (state->child->pid != (pid_t)-1) &&
                        (state->child->pid == state->child_pid)) {
                kill(state->child_pid, SIGTERM);

                /* 
                 * Close the socket to the child.
                 */
                winbind_child_died(state->child_pid);
        }

        state->response->length = sizeof(struct winbindd_response);
        state->response->result = WINBINDD_ERROR;
        state->continuation(state->private_data, False);
}

Note that it explicitly frees the state->reply_timeout_event event.

Jeremy.
Comment 6 hargagan 2008-10-08 01:51:08 UTC
(In reply to comment #5)
> This looks like it's already fixed in the current 3_0_X git tree.
> winbind_child_died() is only called from async_request_fail(), and the current
> code in async_request_fail() looks like :

Hi Jeremy,

Thanks for your reply. 

The freeing of "reply_timeout_event" as pointed by you is already present in our code base. But unfortunately we are still getting the core dump.

The analysis from my side which can cause the issue is as follows :

The winbindd_child_died() is also getting called from process_loop() in case of SIGCHLD signal. In this case it doesn't make the timeout_handler to NULL for the first request. It then initiate a new request using schedule_async_request() which installs a new timeout handler for the same request. In such a case, for a badly unresponsive system both the timeout handler can be called. For the first call the "private_data" will be cleared and for another call the timeout handler will be detecting the double free. So, for such a case as well, the winbindd_child_died() should make the timeout_handler to NULL.

I would verify the changes based on the above assumption and update accordingly.

Comment 7 hargagan 2008-10-08 02:38:18 UTC
This piece of log message with debug level 10 clearly shows that the timeout handler is getting invoked twice. In the second call it is detecting a double free and doing smb_panic().

<snip>

[2008/10/08 09:09:28, 10] lib/events.c:run_events(240)
  Running event "async_request_timeout" 555555938d60
[2008/10/08 09:09:28, 0] nsswitch/winbindd_dual.c:async_request_timeout_handler(181)
  async_request_timeout_handler: child pid 6501 is not responding. Closing connection to it.
[2008/10/08 09:09:28, 10] lib/events.c:timed_event_destructor(66)
  Destroying timed event 555555983030 "async_request_timeout"
[2008/10/08 09:09:28, 5] nsswitch/winbindd_dual.c:async_reply_recv(265)
  Could not receive async reply from child pid 6501
[2008/10/08 09:09:28, 1] nsswitch/winbindd_util.c:trustdom_recv(237)
  Could not receive trustdoms
[2008/10/08 09:09:28, 10] lib/events.c:run_events(240)
  Running event "async_request_timeout" 555555938d60
[2008/10/08 09:09:28, 0] lib/fault.c:fault_report(41)
  ===============================================================
[2008/10/08 09:09:28, 0] lib/fault.c:fault_report(42)
  INTERNAL ERROR: Signal 6 in pid 4506 (3.0.28-0.8-1918-SUSE-CODE10)
  Please read the Trouble-Shooting section of the Samba3-HOWTO
[2008/10/08 09:09:28, 0] lib/fault.c:fault_report(44)

  From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
[2008/10/08 09:09:28, 0] lib/fault.c:fault_report(45)
  ===============================================================
[2008/10/08 09:09:28, 0] lib/util.c:smb_panic(1633)
  PANIC (pid 4506): internal error
[2008/10/08 09:09:28, 0] lib/util.c:log_stack_trace(1737)
  BACKTRACE: 14 stack frames:
   #0 /usr/sbin/winbindd(log_stack_trace+0x1c) [0x55555563080c]
   #1 /usr/sbin/winbindd(smb_panic+0x43) [0x5555556308f3]
   #2 /usr/sbin/winbindd [0x55555561e7b2]
   #3 /lib64/libc.so.6 [0x2add8dca7c30]
   #4 /lib64/libc.so.6(gsignal+0x35) [0x2add8dca7bb5]
   #5 /lib64/libc.so.6(abort+0x110) [0x2add8dca8fb0]
   #6 /usr/sbin/winbindd(talloc_get_name+0x38) [0x555555617578]
   #7 /usr/sbin/winbindd(talloc_check_name+0x1e) [0x5555556177be]
   #8 /usr/sbin/winbindd(talloc_check_name_abort+0x19) [0x555555630a29]
   #9 /usr/sbin/winbindd [0x5555555d5368]
   #10 /usr/sbin/winbindd(run_events+0x4f) [0x55555564b61f]
   #11 /usr/sbin/winbindd(main+0x67b) [0x5555555ab8fb]
   #12 /lib64/libc.so.6(__libc_start_main+0xf4) [0x2add8dc95184]
   #13 /usr/sbin/winbindd [0x5555555a9f69]
[2008/10/08 09:09:28, 0] lib/fault.c:dump_core(181)
  dumping core in /var/log/samba/cores/winbindd

<snap>
Comment 8 Jeremy Allison 2008-10-08 13:12:26 UTC
Ah, I missed the call from process_loop(). Yes, that could cause it. Give me a short time to create a new patch.
Jeremy.
Comment 9 Jeremy Allison 2008-10-08 13:36:59 UTC
Created attachment 3671 [details]
Patch for 3.0.x.

This should fix it. It clears the timeout even both in winbind_child_died() and also in schedule_async_request() (which should never have an existing timeout event when scheduling a new request).
I'll push to all branches - let me know if it fixes your test case.
Thanks a *lot* for the analysis.
Jeremy.
Comment 10 Tukaram 2008-10-08 14:10:53 UTC
Hi Jeremy,
Thanks a lot for the patch.
Gagan made the patch by calling TALLOC_FREE(state->reply_timeout_event) in winbind_child_died() but did did not make changes to schedule_async_request().
We are testing with above code, but we will take your patch and test the changes in out lab. Also it was consistently reproducible in when we reboot the server.
Comment 11 hargagan 2008-10-09 23:42:30 UTC
Hi Jeremy,

Thanks for looking into the issue. In my patch I had only cleared the timeout_handler in the winbindd_child_died(), but your patch is much safer. The good news is the test has completed 24 hr duration without a core, which used to make winbindd dump randomly.

Thanks again. 

I am closing this bug as the patch fixes the issue.