Bug 2179 - W2k3 Domain controller failover doesn't work properly
W2k3 Domain controller failover doesn't work properly
Status: CLOSED FIXED
Product: Samba 3.0
Classification: Unclassified
Component: winbind
3.0.9
x86 Linux
: P3 major
: none
Assigned To: Samba Bugzilla Account
Samba QA Contact
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2004-12-21 07:58 UTC by Alex de Vaal
Modified: 2005-08-24 10:25 UTC (History)
0 users

See Also:


Attachments
level 10 log winbind (270.92 KB, application/octet-stream)
2004-12-21 08:09 UTC, Alex de Vaal
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Alex de Vaal 2004-12-21 07:58:40 UTC
3.0.9 is configured as W2k3 domain member. Winbind and Kerberos are used for
user validation against the AD.

In the W2k3 realm 2 domain controllers exist; 192.168.100.100 and 192.168.100.101

SMB.CONF (Global Section) looks like this:
# Global parameters
[global]
	workgroup = NH-HOTELES
	realm = NH-HOTELES.COM
	server string = %h server (Samba %v)
	security = ADS
	password server = 192.168.100.100 192.168.100.101
	ldap timeout = 3
	log file = /var/log/samba/%m.log
	socket options = TCP_NODELAY SO_RCVBUF=8192 SO_SNDBUF=8192
	printcap cache time = 600
	domain master = No
	idmap uid = 10000-20000
	idmap gid = 10000-20000
	template homedir = /data/hom/%U
	template shell = /bin/bash
	printer admin = root, "@NH-HOTELES.COM\Domain Admins",
@NH-HOTELES.COM\DEP_ADMIN_GERMANY
	oplocks = No
	level2 oplocks = No

When contact with DC1 is lost (192.168.100.100); within 5 minutes "wbinfo -u"
report "Error looking up domain users" and XP clients loose their shares with
the samba server.

In a level 10 log of winbind I was able to determine the problem:
[2004/12/21 12:33:13, 3] nsswitch/winbindd_user.c:winbindd_list_users(592)
  [  974]: list users
[2004/12/21 12:33:13, 10] nsswitch/winbindd_cache.c:fetch_cache_seqnum(287)
  fetch_cache_seqnum: timeout [NH-HOTELES][18495042 @ 1103628090]
[2004/12/21 12:33:13, 3] nsswitch/winbindd_ads.c:sequence_number(792)
  ads: fetch sequence_number for NH-HOTELES
[2004/12/21 12:33:13, 7] nsswitch/winbindd_ads.c:ads_cached_connection(48)
  Current tickets expire at 1103663094, time is now 1103628793
[2004/12/21 12:49:21, 3] libads/ldap.c:ads_do_paged_search(477)
  ldap_search_ext_s((objectclass=*)) -> Can't contact LDAP server
[2004/12/21 12:49:21, 3] libads/ldap_utils.c:ads_do_search_retry(66)
  Reopening ads connection to realm 'NH-HOTELES.COM' after error Can't contact
LDAP server
[2004/12/21 12:49:21, 6] libads/ldap.c:ads_find_dc(176)
  ads_find_dc: looking for realm 'NH-HOTELES.COM'
[2004/12/21 12:49:21, 8] libsmb/namequery.c:get_sorted_dc_list(1434)
  get_sorted_dc_list: attempting lookup using [ads]
[2004/12/21 12:49:21, 10] libsmb/conncache.c:check_negative_conn_cache(72)
  check_negative_conn_cache: cache entry expired for NH-HOTELES.COM, 192.168.100.100
[2004/12/21 12:49:21, 10] libsmb/namequery.c:remove_duplicate_addrs2(320)
  remove_duplicate_addrs2: looking for duplicate address/port pairs
[2004/12/21 12:49:21, 4] libsmb/namequery.c:get_dc_list(1407)
  get_dc_list: returning 2 ip addresses in an ordered list
[2004/12/21 12:49:21, 4] libsmb/namequery.c:get_dc_list(1408)
  get_dc_list: 192.168.100.100:389 192.168.100.101:389 
[2004/12/21 12:49:21, 5] libads/ldap.c:ads_try_connect(85)
  ads_try_connect: trying ldap server '192.168.100.100' port 389
[2004/12/21 12:49:24, 10] libsmb/conncache.c:add_failed_connection_entry(132)
  add_failed_connection_entry: added domain NH-HOTELES.COM (192.168.100.100) to
failed conn cache
[2004/12/21 12:49:24, 5] libads/ldap.c:ads_try_connect(85)
  ads_try_connect: trying ldap server '192.168.100.101' port 389
[2004/12/21 12:49:24, 3] libads/ldap.c:ads_connect(247)
  Connected to LDAP server 192.168.100.101

This part is causing the problem according to me:
[2004/12/21 12:33:13, 7] nsswitch/winbindd_ads.c:ads_cached_connection(48)
  Current tickets expire at 1103663094, time is now 1103628793
[2004/12/21 12:49:21, 3] libads/ldap.c:ads_do_paged_search(477)
  ldap_search_ext_s((objectclass=*)) -> Can't contact LDAP server
[2004/12/21 12:49:21, 3] libads/ldap_utils.c:ads_do_search_retry(66)
  Reopening ads connection to realm 'NH-HOTELES.COM' after error Can't contact
LDAP server

It seems that a timeout of 16 minutes must be expired before winbind can
determine that the primary LDAP server is down! 
Once this timeout has expired "wbinfo -u" gives the appropriate output.
In a next message I'll upload a zipped level 10 log of the complete test I made.
Comment 1 Alex de Vaal 2004-12-21 08:09:14 UTC
Created attachment 853 [details]
level 10 log winbind

In attached ZIP you can find a log level 10 output of winbind of the subjoined
test I performed:

12:05 startup winbindd
12:12 wbinfo -u; output Ok.
12:21 wbinfo -u; output Ok.
12:23 connection with DC1 (192.168.100.100) removed (removed UTP plug from NIC)

12:33 wbinfo -u; error looking up domain users
13:06 wbinfo -u; output Ok.
12:11 wbinfo -u; output Ok.
13:13 re-established connection with DC1 (192.168.100.100)

Problem resides here:

[2004/12/21 12:33:13, 7] nsswitch/winbindd_ads.c:ads_cached_connection(48)
  Current tickets expire at 1103663094, time is now 1103628793
[2004/12/21 12:49:21, 3] libads/ldap.c:ads_do_paged_search(477)
  ldap_search_ext_s((objectclass=*)) -> Can't contact LDAP server
[2004/12/21 12:49:21, 3] libads/ldap_utils.c:ads_do_search_retry(66)
  Reopening ads connection to realm 'NH-HOTELES.COM' after error Can't contact
LDAP server 

Can reproduce:

[2004/12/20 17:22:44, 3] nsswitch/winbindd_user.c:winbindd_list_users(592)
  [14142]: list users
[2004/12/20 17:22:44, 10] nsswitch/winbindd_cache.c:fetch_cache_seqnum(287)
  fetch_cache_seqnum: timeout [NH-HOTELES][18492523 @ 1103559273]
[2004/12/20 17:22:44, 3] nsswitch/winbindd_ads.c:sequence_number(792)
  ads: fetch sequence_number for NH-HOTELES
[2004/12/20 17:22:44, 7] nsswitch/winbindd_ads.c:ads_cached_connection(48)
  Current tickets expire at 1103595267, time is now 1103559764
[2004/12/20 17:38:29, 3] libads/ldap.c:ads_do_paged_search(477)
  ldap_search_ext_s((objectclass=*)) -> Can't contact LDAP server
[2004/12/20 17:38:29, 3] libads/ldap_utils.c:ads_do_search_retry(66)
  Reopening ads connection to realm 'NH-HOTELES.COM' after error Can't contact
LDAP server
[2004/12/20 17:38:29, 6] libads/ldap.c:ads_find_dc(176)
  ads_find_dc: looking for realm 'NH-HOTELES.COM'
[2004/12/20 17:38:29, 8] libsmb/namequery.c:get_sorted_dc_list(1434)
  get_sorted_dc_list: attempting lookup using [ads]
[2004/12/20 17:38:29, 10] libsmb/conncache.c:check_negative_conn_cache(72)
  check_negative_conn_cache: cache entry expired for NH-HOTELES.COM,
192.168.100.100
[2004/12/20 17:38:29, 10] libsmb/namequery.c:remove_duplicate_addrs2(320)
  remove_duplicate_addrs2: looking for duplicate address/port pairs
[2004/12/20 17:38:29, 4] libsmb/namequery.c:get_dc_list(1407)
  get_dc_list: returning 2 ip addresses in an ordered list
[2004/12/20 17:38:29, 4] libsmb/namequery.c:get_dc_list(1408)
  get_dc_list: 192.168.100.100:389 192.168.100.101:389 
[2004/12/20 17:38:29, 5] libads/ldap.c:ads_try_connect(85)
  ads_try_connect: trying ldap server '192.168.100.100' port 389
[2004/12/20 17:38:31, 10] libsmb/conncache.c:add_failed_connection_entry(132)
  add_failed_connection_entry: added domain NH-HOTELES.COM (192.168.100.100) to
failed conn cache
[2004/12/20 17:38:31, 5] libads/ldap.c:ads_try_connect(85)
  ads_try_connect: trying ldap server '192.168.100.101' port 389
[2004/12/20 17:38:31, 3] libads/ldap.c:ads_connect(247)
  Connected to LDAP server 192.168.100.101
Comment 2 Alex de Vaal 2005-01-18 07:40:20 UTC
DC failover didn't work because of LDAP timeouts in winbind.
This is fixed with rev. 4655 of ads.h and ldap.c
Comment 3 Gerald (Jerry) Carter 2005-08-24 10:25:16 UTC
sorry for the same, cleaning up the database to prevent unecessary reopens of bugs.