Bug 437 - winbind dies if DC is not available
Status: CLOSED FIXED
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: winbind
Version: 3.0.0preX
Hardware: Other Linux
Importance: P3 major
Target Milestone: 3.0.1
Assignee: Gerald (Jerry) Carter (dead mail address)
Reported: 2003-09-11 07:58 UTC by Guenther Deschner
Modified: 2005-08-24 10:19 UTC (History)

Attachments
Backtrace handler for samba (233 bytes, text/plain)
2003-09-11 17:39 UTC, Tim Potter

Description Guenther Deschner 2003-09-11 07:58:15 UTC
While doing some stress tests on today's winbind (Thu Sep 11 16:42:49 CEST 2003),
I noticed two things that happen to winbind when the DC suddenly disappears
(e.g. on reboot):

My setup: SuSE Linux 8.2, security = ads, one Win2k DC


- winbindd segfaults on getpwnam operations while the DC is unavailable:

libads/ldap_utils.c:ads_do_search_retry(68)
  ads_search_retry: failed to reconnect (Transport endpoint is not connected)
libads/ads_ldap.c:ads_name_to_sid(58)
  name_to_sid ads_search: Transport endpoint is not connected
nsswitch/winbindd_cache.c:wcache_save_name_to_sid(602)
  wcache_save_name_to_sid: ADMINISTRATOR -> S-0-0
nsswitch/winbindd_user.c:winbindd_getpwnam(147)
  user 'administrator' does not exist
...
nsswitch/winbindd.c:process_request(305)
  process_request: request fn GETPWNAM
nsswitch/winbindd_user.c:winbindd_getpwnam(112)
  [ 2370]: getpwnam my.domain\administrator
nsswitch/winbindd_cache.c:refresh_sequence_number(342)
  refresh_sequence_number: MYDOMAIN time ok
nsswitch/winbindd_cache.c:refresh_sequence_number(367)
  refresh_sequence_number: MYDOMAIN seq number is now 18510
nsswitch/winbindd_cache.c:name_to_sid(958)
  name_to_sid: [Cached] - doing backend query for name for domain MYDOMAIN
nsswitch/winbindd_ads.c:name_to_sid(312)
  ads: name_to_sid
lib/fault.c:fault_report(36)
  ===============================================================
lib/fault.c:fault_report(37)
  INTERNAL ERROR: Signal 11 in pid 6762 (CVS 3.0.1pre1-SuSE)
  Please read the appendix Bugs of the Samba HOWTO collection
lib/fault.c:fault_report(39)
  ===============================================================
lib/util.c:smb_panic(1400)
  PANIC: internal error
lib/util.c:smb_panic(1407)
  BACKTRACE: 15 stack frames:
   #0 /usr/sbin/winbindd(smb_panic+0x1ab) [0x80b9bbb]
   #1 /usr/sbin/winbindd [0x80a89cd]
   #2 /usr/sbin/winbindd [0x80a8a2e]
   #3 /lib/libc.so.6 [0x4023f5c8]
   #4 /usr/sbin/winbindd(ads_name_to_sid+0x45) [0x8157ff9]
   #5 /usr/sbin/winbindd [0x807f44b]
   #6 /usr/sbin/winbindd [0x8076a28]
   #7 /usr/sbin/winbindd(winbindd_lookup_sid_by_name+0x73) [0x8073122]
   #8 /usr/sbin/winbindd(winbindd_getpwnam+0x298) [0x806e536]
   #9 /usr/sbin/winbindd(strftime+0x1340) [0x806cdf4]
   #10 /usr/sbin/winbindd(winbind_process_packet+0x1f) [0x806d0e0]
   #11 /usr/sbin/winbindd(strftime+0x1ec4) [0x806d978]
   #12 /usr/sbin/winbindd(main+0x51b) [0x806df76]
   #13 /lib/libc.so.6(__libc_start_main+0xce) [0x4022b8ae]
   #14 /usr/sbin/winbindd(ldap_msgfree+0x7d) [0x806c711]
nsswitch/winbindd.c:winbind_client_read(455)
  client_read: read 0 bytes. Need 1568 more for a full request.
nsswitch/winbindd.c:winbind_client_read(462)
  read failed on sock 9, pid 6762: EOF




- and sometimes winbindd dies silently while the DC is doing a reboot:


nsswitch/winbindd_ads.c:sequence_number(778)
  ads: fetch sequence_number for MYDOMAIN
libads/ldap.c:ads_do_paged_search(451)
  ldap_search_ext_s((objectclass=*)) -> Can't contact LDAP server
libads/ldap_utils.c:ads_do_search_retry(60)
  Reopening ads connection to realm 'MY.DOMAIN' after error Can't contact LDAP server
libads/ldap.c:ads_find_dc(147)
  ads_find_dc: looking for realm 'MY.DOMAIN'
libsmb/namequery.c:get_sorted_dc_list(1215)
  get_sorted_dc_list: attempting lookup using [hosts]
libsmb/namequery.c:internal_resolve_name(989)
  internal_resolve_name: looking up ads#20
lib/gencache.c:gencache_get(264)
  Returning valid cache entry: key = NBT/ADS#20, value = 10.60.4.2:0, timeout = Thu Sep 11 16:47:27 2003
libsmb/namecache.c:namecache_fetch(201)
  name ads#20 found.
libsmb/namequery.c:remove_duplicate_addrs2(312)
  remove_duplicate_addrs2: looking for duplicate address/port pairs
libsmb/namequery.c:get_dc_list(1350)
  get_dc_list: returning 1 ip addresses in an ordered list
libsmb/namequery.c:get_dc_list(1351)
  get_dc_list: 10.60.4.2:389 
libsmb/conncache.c:check_negative_conn_cache(83)
  check_negative_conn_cache: returning negative entry for MY.DOMAIN, 10.60.4.2
libads/ldap_utils.c:ads_do_search_retry(68)
  ads_search_retry: failed to reconnect (Transport endpoint is not connected)
winbindd: unbind.c:40: ldap_unbind_ext: Assertion `( (ld)->ld_options.ldo_valid == 0x2 )' failed.


Nothing more. winbind is simply dead after that.

Although disappearing DCs are not very common, I think winbind should handle
this better and survive until the DC is available again.
Comment 1 Tim Potter 2003-09-11 17:39:54 UTC
Created attachment 137
Backtrace handler for samba

Guenther, have you considered installing a panic action handler for your system?
You are coming up with some really good bugs and it would certainly make
things faster if we could get a stack backtrace from you early on.

Tridge has written a neat little script that should produce a stack trace in
the log file when a Samba program crashes.  If you copy it to /usr/local/bin
and add the following to your smb.conf file you will be able to provide a stack
backtrace in your initial reports:

[global]
    panic action = /usr/local/bin/backtrace %d
Comment 2 Tim Potter 2003-09-15 20:56:11 UTC
cc me
Comment 3 Gerald (Jerry) Carter (dead mail address) 2003-10-03 12:22:40 UTC
Guenther, I can't reproduce this using the latest SAMBA_3_0 cvs.
Is it still a problem for you?
Comment 4 Gerald (Jerry) Carter (dead mail address) 2003-10-03 12:31:03 UTC
Never mind.  I found it.  I have a crash in winbindd_ads.c: 
sequence_number() where we are trying to destroy an LDAP 
connection.  Probably a double free or something.
Comment 5 Gerald (Jerry) Carter (dead mail address) 2003-10-03 14:43:49 UTC
Found it.  We were calling ads_destroy twice 
on the same structure.  Fix checked into 
SAMBA_3_0.
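
For readers unfamiliar with this class of bug, here is a minimal sketch of one common way to make a destroy routine safe to call twice on the same handle. It is an illustration only and not the change that was checked into SAMBA_3_0: `struct conn`, `conn_new`, `conn_destroy`, `demo_double_destroy`, and the server name are hypothetical stand-ins, and none of Samba's real ADS structures or functions are used. The idea is that the destroy function takes the address of the caller's pointer and resets it to NULL, so a second call from another cleanup path becomes a no-op.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a directory connection handle.
 * This is NOT Samba's ADS_STRUCT. */
struct conn {
    char *server;
};

/* Counts real teardowns so the demo can show the second call is a no-op. */
static int teardown_count = 0;

/* Allocate a handle; returns NULL on allocation failure. */
static struct conn *conn_new(const char *server)
{
    struct conn *c = malloc(sizeof(*c));
    if (c == NULL) {
        return NULL;
    }
    c->server = malloc(strlen(server) + 1);
    if (c->server == NULL) {
        free(c);
        return NULL;
    }
    strcpy(c->server, server);
    return c;
}

/* Free the handle and clear the caller's pointer.
 * Calling this again on the same variable does nothing. */
static void conn_destroy(struct conn **pc)
{
    if (pc == NULL || *pc == NULL) {
        return; /* already destroyed, or never created */
    }
    free((*pc)->server);
    free(*pc);
    *pc = NULL;
    teardown_count++;
}

/* Simulates an error path and a generic cleanup path that both try to
 * destroy the same handle. Returns the number of real teardowns, or -1
 * if the allocation failed. */
static int demo_double_destroy(void)
{
    struct conn *c = conn_new("dc.example");
    if (c == NULL) {
        return -1;
    }
    teardown_count = 0;
    conn_destroy(&c); /* error path, e.g. a failed reconnect */
    conn_destroy(&c); /* later cleanup path: now a harmless no-op */
    assert(c == NULL);
    return teardown_count;
}
```

Calling `demo_double_destroy()` from a `main` function exercises both paths. The sketch only protects callers that share the same pointer variable; if two separate copies of the pointer exist, the second copy still dangles, so clear ownership of the handle matters as well.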
Comment 6 Lars Müller 2004-03-24 07:19:58 UTC
Sorry for the off-topic question, but who is the author of attachment 137, a simple
script to call gdb? We want to include it in the example/scripts/ of SuSE's package,
and as usual I want to add an author and copyright.
Comment 7 Gerald (Jerry) Carter (dead mail address) 2004-03-24 08:38:56 UTC
It's already in ./testsuite/build_farm/backtrace
Tridge wrote it IIRC.
Comment 8 Gerald (Jerry) Carter (dead mail address) 2005-02-07 09:05:59 UTC
originally reported against one of the 3.0.0rc[1-4] releases.
Cleaning up non-production versions.
Comment 9 Gerald (Jerry) Carter (dead mail address) 2005-08-24 10:19:37 UTC
sorry for the same, cleaning up the database to prevent unnecessary reopens of bugs.