Bug 3115 - Calling Wbinfo too soon after winbindd start causes terminally confused winbind
Summary: Calling Wbinfo too soon after winbindd start causes terminally confused winbind
Status: RESOLVED LATER
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: winbind (show other bugs)
Version: 3.0.20
Hardware: All Windows XP
: P3 enhancement
Target Milestone: none
Assignee: Volker Lendecke
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-09-23 09:26 UTC by Brian Moran
Modified: 2007-11-25 06:31 UTC (History)
0 users

See Also:


Attachments
Tar of tarfiles of winbind logs, some fail, some pass (600.00 KB, application/octet-stream)
2005-10-12 11:17 UTC, Brian Moran
no flags Details
Script to join domain, verify join, start winbind, perform operation (2.70 KB, text/plain)
2005-10-12 18:23 UTC, Brian Moran
no flags Details
D10 winbind logs of failure of wbinfo to look up information. (16.91 KB, application/octet-stream)
2005-10-12 18:29 UTC, Brian Moran
no flags Details
Proposed patch... (501 bytes, patch)
2005-10-14 02:29 UTC, Volker Lendecke
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Brian Moran 2005-09-23 09:26:52 UTC
Environment: Security=ads, joined to win2k3 (non sp1) with two DCs. 

If one joins a domain, then starts or restarts winbind, and then doesn't wait 
before issuing a wbinfo request, winbind is unable to service ANY subsequent 
requests correctly. The correct amount of time to wait is variable. 8 seconds 
(as shown in the script) is sometimes enough, but often NOT ENOUGH time to 
wait).

this script (foo.sh) is able to repro this here:

while `/bin/true` ; do
killall winbindd

/etc/init.d/smb stop
net ads leave
rm /var/log/samba/*
echo "note that rndnum.pl generates a hostname like host45379"
hostname `perl rndnum.pl`
echo Hostname is `hostname`
net ads join -U join%join
/etc/init.d/smb start
echo "Starting winbindd" && winbindd -d 10
sleep 8
for i in `wbinfo -u` ; do wbinfo -n $i ; done
for i in `wbinfo -g` ; do wbinfo -n $i ; done
/etc/init.d/winbind restart

done ;
Comment 1 Volker Lendecke 2005-09-23 09:42:54 UTC
A bug like that has been fixed in current svn. Under
http://samba.org/samba/patches/winbindd_v1.patch you find a patch that might fix
your problem. Could you verify it does it for you?

Thanks,

Volker
Comment 2 Brian Moran 2005-09-23 14:28:16 UTC
Hi, the patch you describe was already applied to winbindd_util.c at the time 
the bug was found.

Comment 3 Gerald (Jerry) Carter (dead mail address) 2005-10-11 07:52:39 UTC
Brian,  could you retest trunk again ?  We fixed another crash bug
that will be in 3.0.20b (later today).
Comment 4 Gerald (Jerry) Carter (dead mail address) 2005-10-12 06:11:54 UTC
still reprodicible says Brian.
Comment 5 Gerald (Jerry) Carter (dead mail address) 2005-10-12 06:25:22 UTC
Brian, i tried to reproduce this against the SAMBA_3_0 tree
and could not.  So I think it probably has something to do with 
Volker's additional group code in trunk.

Can you check to see if you can reproduce it against SAMBA_3_0
to verify ?
Comment 6 Brian Moran 2005-10-12 09:54:50 UTC
I can repro with the following frequency:

Sleep 1 : Consistently

Sleep 4 : Fails 9  out of 23 times

Will be posting a more automated script for testing this in a bit...
Comment 7 Volker Lendecke 2005-10-12 11:00:15 UTC
Could you provide a debug level 10 log of all winbind daemons?

Thanks,

Volker
Comment 8 Brian Moran 2005-10-12 11:03:12 UTC
Here's a nice script (all Bash) that may be able to demonstrate this issue:
---------------------------------------------------------------------------
#! /bin/bash
for (( SECS=1 ; SECS<10; SECS+=2 )) ; do
NTRIES=10
rm -rf /tmp/result${SECS}
for (( tries=${NTRIES} ; tries>0; tries-- )) ; do
# while `/bin/true` ; do
killall winbindd
#
killall smbd
# /etc/init.d/smb stop
bin/net ads leave
rm /var/log/samba/*
HOSTNAME=tsthost${RANDOM}
hostname ${HOSTNAME}
echo Hostname is ${HOSTNAME}
bin/net ads join -U join%join
bin/smbd -d 10
# /etc/init.d/smb start
echo "Starting winbindd" && bin/winbindd -d 10
 sleep ${SECS}
FOO=`bin/wbinfo -u | wc -l`
echo $FOO
if [[ ${FOO} == '1' ]] ; then
    echo "fail" ${HOSTNAME} >>/tmp/results${SECS} ;
 else
    echo "pass" ${FOO} ${HOSTNAME} >>/tmp/results${SECS} ;
fi
# for i in `bin/wbinfo -u` ; do  bin/wbinfo -n $i ; done
#for i in `bin/wbinfo -g` ; do  bin/wbinfo -n $i ; done
killall winbindd
bin/winbindd -d 10

done ;
PASSRAW=`grep pass /tmp/results${SECS} | wc -l`
PASSRATE=`dc -e "6 k ${PASSRAW} ${NTRIES} / 100 * p" `
echo "Sleep of ${SECS} seconds: Succeeded " ${PASSRATE} %  ${PASSRAW} / 
${NTRIES}
echo "Sleep of ${SECS} seconds: Succeeded " ${PASSRATE} %  ${PASSRAW} / 
${NTRIES} >>/tmp/allresults
done ;

On A 3.2 GHz OpenSuse 10 RC box, samba_3_0 build 10925, I get the following 
results:
---------------------------------------------------------------
Sleep of 1 seconds: Succeeded  0 % 0 / 10
Sleep of 3 seconds: Succeeded  30.000000 % 3 / 10
Sleep of 5 seconds: Succeeded  50.000000 % 5 / 10
Sleep of 7 seconds: Succeeded  40.000000 % 4 / 10
Sleep of 9 seconds: Succeeded  70.000000 % 7 / 10


Comment 9 Brian Moran 2005-10-12 11:17:36 UTC
Created attachment 1502 [details]
Tar of tarfiles of winbind logs, some fail, some pass
Comment 10 Brian Moran 2005-10-12 11:21:34 UTC
Here's the modified script that generated the Tar of Tarfiles attachment...
--------------------------------------
#! /bin/bash
for (( SECS=1 ; SECS<10; SECS+=2 )) ; do
NTRIES=10
rm -rf /tmp/result${SECS}
for (( tries=${NTRIES} ; tries>0; tries-- )) ; do
# while `/bin/true` ; do
killall winbindd
#
killall smbd
# /etc/init.d/smb stop
bin/net ads leave
rm /var/log/samba/*
HOSTNAME=tsthost${RANDOM}
hostname ${HOSTNAME}
echo Hostname is ${HOSTNAME}
bin/net ads join -U join%join
bin/smbd -d 10
# /etc/init.d/smb start
echo "Starting winbindd" && bin/winbindd -d 10
 sleep ${SECS}
FOO=`bin/wbinfo -u | wc -l`
echo $FOO
if [[ ${FOO} == '1' ]] ; then
    echo "fail" ${HOSTNAME} >>/tmp/results${SECS} ;
    STT="fail"
 else
    echo "pass" ${FOO} ${HOSTNAME} >>/tmp/results${SECS} ;
    STT="pass"
fi
# for i in `bin/wbinfo -u` ; do  bin/wbinfo -n $i ; done
#for i in `bin/wbinfo -g` ; do  bin/wbinfo -n $i ; done
killall winbindd
# save all of the logs...
tar -j -cvf /tmp/winbind-log[${HOSTNAME}]-${STT}.tar.bz2 /var/log/samba/log.w*
bin/winbindd -d 10

done ;
PASSRAW=`grep pass /tmp/results${SECS} | wc -l`
PASSRATE=`dc -e "6 k ${PASSRAW} ${NTRIES} / 100 * p" `
echo "Sleep of ${SECS} seconds: Succeeded " ${PASSRATE} %  ${PASSRAW} / 
${NTRIES}
echo "Sleep of ${SECS} seconds: Succeeded " ${PASSRATE} %  ${PASSRAW} / 
${NTRIES} >>/tmp/allresults
done ;

Comment 11 Volker Lendecke 2005-10-12 11:44:50 UTC
When looking at your logs I don't see any particular bug in winbind, I much more
assume that the problem is that you join the domain and *directly* after joining
you try to connect to the domain controller again. In the logfiles you see

[2005/10/12 11:15:27, 5] nsswitch/winbindd_ads.c:trusted_domains(838)
  trusted_domains: Could not open a connection to CORPQA for PIPE_NETLOGON
(NT_STATUS_NO_TRUST_SAM_ACCOUNT)

To me this gives the impression that something on the DC is not yet syncronized
between the join and the attempt to connect.

I did not really test it, but I would really assume that this is expected
behaviour. Could you put the sleep directly after the join and retry?

Volker
Comment 12 Brian Moran 2005-10-12 13:36:40 UTC
Adding the sleep after the join yields the following results, which are not 
much different from previously. If this is "*by design*" what would be the 
best strategy to mitigate this?

-------------------------------------------------------------
Sleep of 1 seconds: Succeeded  40.000000% 8 / 20 /tmp/results1
Sleep of 3 seconds: Succeeded  50.000000% 10 / 20 /tmp/results3
Sleep of 5 seconds: Succeeded  50.000000% 10 / 20 /tmp/results5
Sleep of 7 seconds: Succeeded  55.000000% 11 / 20 /tmp/results7
Sleep of 9 seconds: Succeeded  100.000000% 20 / 20 /tmp/results9

---------------------------------------------------------------------------
Here's the script that generated this -- 
Have you run this on your systems to compare results?
#! /bin/bash
for (( SECS=1 ; SECS<10; SECS+=2 )) ; do
NTRIES=20
rm -rf /tmp/results${SECS}
for (( tries=${NTRIES} ; tries>0; tries-- )) ; do
killall winbindd
killall smbd
# /etc/init.d/smb stop
bin/net ads leave
rm /var/log/samba/*
HOSTNAME=tsthost${RANDOM}
hostname ${HOSTNAME}
echo Hostname is ${HOSTNAME}
bin/net ads join -U join%join
sleep ${SECS}
bin/smbd -d 10
# /etc/init.d/smb start
echo "Starting winbindd" && bin/winbindd -d 10
FOO=`bin/wbinfo -u | wc -l`
echo $FOO
if [[ ${FOO} == '1' ]] ; then
    echo "fail" ${HOSTNAME} >>/tmp/results${SECS} ;
    STT="fail"
 else
    echo "pass" ${FOO} ${HOSTNAME} >>/tmp/results${SECS} ;
    STT="pass"
fi
# for i in `bin/wbinfo -u` ; do  bin/wbinfo -n $i ; done
# for i in `bin/wbinfo -g` ; do  bin/wbinfo -n $i ; done
killall winbindd
# save all of the logs...
tar -j -cvf /tmp/winbind-log[${HOSTNAME}]-${STT}.tar.bz2 /var/log/samba/log.w*
bin/winbindd -d 10
done ;
PASSRAW=`grep pass /tmp/results${SECS} | wc -l`
DENOM=`wc -l /tmp/results${SECS}`
PASSRATE=`dc -e "6 k ${PASSRAW} ${NTRIES} / 100 * p" `
echo "Sleep of ${SECS} seconds: Succeeded " ${PASSRATE}%  ${PASSRAW} / ${DENOM}
echo "Sleep of ${SECS} seconds: Succeeded " ${PASSRATE}%  ${PASSRAW} / 
${DENOM} >>/tmp/allresults
done ;

Comment 13 Volker Lendecke 2005-10-12 13:50:52 UTC
So we successfully delegated the problem to the DC not finding the wks quickly
enough after joining the domain. What we have to do is to program a time out and
retry into winbind if we get this specific error message, there's not much we
can do to speed up the DC.

You might even face a much longer time out if have more domain controllers. The
join might be done against one DC, the later winbind connect might use another
one. If the replication has not yet been fully completed, then there's not much
winbind can do. We would have to save the name of the DC the net join used, and
I very much doubt that is worth the effort.

Is this a real requirement for you, or is this just an artificial test? Can't
you live with a minute between the join and later winbind startup? Normal
windows workstations have the big advantage here that they have to reboot, so
domain controllers have some time to catch up, time that you don't give the DC.

I'm changing this bug to "enhancement", I don't see a real bug here.
 
Lets get the policy discussion about wait time and fallbacks to
samba-technical@samba.org please.

Thanks for testing this BTW.

Volker
Comment 14 Brian Moran 2005-10-12 18:23:35 UTC
Created attachment 1504 [details]
Script to join domain, verify join, start winbind, perform operation
Comment 15 Brian Moran 2005-10-12 18:26:42 UTC
Attached is an adjusted script which demonstrates that even though the domain 
join has succeeded (net ads join, followed by net ads testjoin until we 
achieve success), there can be an occasional error. 

Next to be attached are the logs from a failure that we had after we 
successfully joined and verified. One error in 80 tries, but still an error.
Comment 16 Brian Moran 2005-10-12 18:29:58 UTC
Created attachment 1505 [details]
D10 winbind logs of failure of wbinfo to look up information.
Comment 17 Volker Lendecke 2005-10-12 21:30:01 UTC
The log you just appended has a NT_STATUS_ACCESS_DENIED on the serverauth2
request. This is an indication that the workstation and the domain controllers
disagree about the password, and as long as you re-join it will stay this way.

Is it possible that the replication among your DCs is just a bit too slow and
that your continouus leave and join leaves the DC's confused? If you have a
failure in 80 attempts I'd very much say to just retry the join itself, not the
lookup.

Could you please comment on the question why you do this? I've never seen the
requirement that someone needs to leave and join a domain at such a high rate,
I'd very much assume that Windows is just not made for it. If you don't give me
a good  need from your production environment I'll change this bug to "wontfix".

Volker
Comment 18 Gerald (Jerry) Carter (dead mail address) 2005-10-13 04:30:24 UTC
Brian, is this just a stress test for QA ?  If so, it 
might be pushing outside of the realm of realistic use.
And if so, then this is probably a no fix.

btw...I used you original script and cannot reproduce you
error in SAMBA_3_0 (even removing the sleep call).
Comment 19 Brian Moran 2005-10-13 10:28:31 UTC
We have some automated scripts which configure samba and join to AD; The 
operation needs to join the domain, get the SID of Domain Admins (for some 
additional security checking later), put this away, and proceed. We were 
getting failures after the net ads join, winbind start, and then calling 
wbinfo.  We were seeing these very frequently at customer sites, especially in 
a corporate setting where DCs were separated by a slow link (e.g. Seattle, New 
Jersey). The scripts were written to repro and characterize the behavior as 
well as examine ways to defensively work around the issues.

We think for the most part, VL's recent fixes are excellent and have worked -- 
with one exception --if winbind is started before the DCs are synchronized, or 
the "join" has taken, winbind is unable to provide any information (via 
wbinfo) during the subsequent duration of it's process (we actually think it's 
the query from wbinfo that causes winbind to fail, and then forever fail).  

One could argue that it's a case of removing the spoon before one drinks the 
hot cocoa, or "don't do that". So, our procedure is that we're net ads 
join'ing then repeatedly calling net ads testjoin to see if the join "took" 
*before* we start winbind.  Our scripts changed to reflect that.

We did find that in three instances (out of 900, we let this run overnight) 
that even with this precaution, winbind got into a state where it was 
unrecoverable without killing the process and restarting. To reiterate -- 
starting winbind, then running wbinfo, while the DCs have not "settled out" 
causes the winbind process to *never* (not during a reasonable testing 
interval) be able to do a query and return valid results, despite the 
DCs "settling" during the time winbind is running. The only way to recover 
from this situation is to kill the winbind process, and start it again.

The most desirable behavior would be for winbind to re-establish itself with 
the DC if it detects that there was some problem when it started, and try to 
do so at some periodic interval. 

Comment 20 Volker Lendecke 2005-10-14 02:29:19 UTC
Created attachment 1510 [details]
Proposed patch...

So if getting the domain sid is your real, deeper problem, does the attached
(trivial, but untested) patch help you in any way?

Volker
Comment 21 Volker Lendecke 2007-11-25 06:31:16 UTC
Closing bug, no feedback from reporter.

Volker