Environment: Security=ads, joined to win2k3 (non sp1) with two DCs. If one joins a domain, then starts or restarts winbind, and then doesn't wait before issuing a wbinfo request, winbind is unable to service ANY subsequent requests correctly. The correct amount of time to wait is variable. 8 seconds (as shown in the script) is sometimes enough, but often NOT ENOUGH time to wait). this script (foo.sh) is able to repro this here: while `/bin/true` ; do killall winbindd /etc/init.d/smb stop net ads leave rm /var/log/samba/* echo "note that rndnum.pl generates a hostname like host45379" hostname `perl rndnum.pl` echo Hostname is `hostname` net ads join -U join%join /etc/init.d/smb start echo "Starting winbindd" && winbindd -d 10 sleep 8 for i in `wbinfo -u` ; do wbinfo -n $i ; done for i in `wbinfo -g` ; do wbinfo -n $i ; done /etc/init.d/winbind restart done ;
A bug like that has been fixed in current svn. Under http://samba.org/samba/patches/winbindd_v1.patch you find a patch that might fix your problem. Could you verify it does it for you? Thanks, Volker
Hi, the patch you describe was already applied to winbindd_util.c at the time the bug was found.
Brian, could you retest trunk again ? We fixed another crash bug that will be in 3.0.20b (later today).
still reprodicible says Brian.
Brian, i tried to reproduce this against the SAMBA_3_0 tree and could not. So I think it probably has something to do with Volker's additional group code in trunk. Can you check to see if you can reproduce it against SAMBA_3_0 to verify ?
I can repro with the following frequency: Sleep 1 : Consistently Sleep 4 : Fails 9 out of 23 times Will be posting a more automated script for testing this in a bit...
Could you provide a debug level 10 log of all winbind daemons? Thanks, Volker
Here's a nice script (all Bash) that may be able to demonstrate this issue: --------------------------------------------------------------------------- #! /bin/bash for (( SECS=1 ; SECS<10; SECS+=2 )) ; do NTRIES=10 rm -rf /tmp/result${SECS} for (( tries=${NTRIES} ; tries>0; tries-- )) ; do # while `/bin/true` ; do killall winbindd # killall smbd # /etc/init.d/smb stop bin/net ads leave rm /var/log/samba/* HOSTNAME=tsthost${RANDOM} hostname ${HOSTNAME} echo Hostname is ${HOSTNAME} bin/net ads join -U join%join bin/smbd -d 10 # /etc/init.d/smb start echo "Starting winbindd" && bin/winbindd -d 10 sleep ${SECS} FOO=`bin/wbinfo -u | wc -l` echo $FOO if [[ ${FOO} == '1' ]] ; then echo "fail" ${HOSTNAME} >>/tmp/results${SECS} ; else echo "pass" ${FOO} ${HOSTNAME} >>/tmp/results${SECS} ; fi # for i in `bin/wbinfo -u` ; do bin/wbinfo -n $i ; done #for i in `bin/wbinfo -g` ; do bin/wbinfo -n $i ; done killall winbindd bin/winbindd -d 10 done ; PASSRAW=`grep pass /tmp/results${SECS} | wc -l` PASSRATE=`dc -e "6 k ${PASSRAW} ${NTRIES} / 100 * p" ` echo "Sleep of ${SECS} seconds: Succeeded " ${PASSRATE} % ${PASSRAW} / ${NTRIES} echo "Sleep of ${SECS} seconds: Succeeded " ${PASSRATE} % ${PASSRAW} / ${NTRIES} >>/tmp/allresults done ; On A 3.2 GHz OpenSuse 10 RC box, samba_3_0 build 10925, I get the following results: --------------------------------------------------------------- Sleep of 1 seconds: Succeeded 0 % 0 / 10 Sleep of 3 seconds: Succeeded 30.000000 % 3 / 10 Sleep of 5 seconds: Succeeded 50.000000 % 5 / 10 Sleep of 7 seconds: Succeeded 40.000000 % 4 / 10 Sleep of 9 seconds: Succeeded 70.000000 % 7 / 10
Created attachment 1502 [details] Tar of tarfiles of winbind logs, some fail, some pass
Here's the modified script that generated the Tar of Tarfiles attachment... -------------------------------------- #! /bin/bash for (( SECS=1 ; SECS<10; SECS+=2 )) ; do NTRIES=10 rm -rf /tmp/result${SECS} for (( tries=${NTRIES} ; tries>0; tries-- )) ; do # while `/bin/true` ; do killall winbindd # killall smbd # /etc/init.d/smb stop bin/net ads leave rm /var/log/samba/* HOSTNAME=tsthost${RANDOM} hostname ${HOSTNAME} echo Hostname is ${HOSTNAME} bin/net ads join -U join%join bin/smbd -d 10 # /etc/init.d/smb start echo "Starting winbindd" && bin/winbindd -d 10 sleep ${SECS} FOO=`bin/wbinfo -u | wc -l` echo $FOO if [[ ${FOO} == '1' ]] ; then echo "fail" ${HOSTNAME} >>/tmp/results${SECS} ; STT="fail" else echo "pass" ${FOO} ${HOSTNAME} >>/tmp/results${SECS} ; STT="pass" fi # for i in `bin/wbinfo -u` ; do bin/wbinfo -n $i ; done #for i in `bin/wbinfo -g` ; do bin/wbinfo -n $i ; done killall winbindd # save all of the logs... tar -j -cvf /tmp/winbind-log[${HOSTNAME}]-${STT}.tar.bz2 /var/log/samba/log.w* bin/winbindd -d 10 done ; PASSRAW=`grep pass /tmp/results${SECS} | wc -l` PASSRATE=`dc -e "6 k ${PASSRAW} ${NTRIES} / 100 * p" ` echo "Sleep of ${SECS} seconds: Succeeded " ${PASSRATE} % ${PASSRAW} / ${NTRIES} echo "Sleep of ${SECS} seconds: Succeeded " ${PASSRATE} % ${PASSRAW} / ${NTRIES} >>/tmp/allresults done ;
When looking at your logs I don't see any particular bug in winbind, I much more assume that the problem is that you join the domain and *directly* after joining you try to connect to the domain controller again. In the logfiles you see [2005/10/12 11:15:27, 5] nsswitch/winbindd_ads.c:trusted_domains(838) trusted_domains: Could not open a connection to CORPQA for PIPE_NETLOGON (NT_STATUS_NO_TRUST_SAM_ACCOUNT) To me this gives the impression that something on the DC is not yet syncronized between the join and the attempt to connect. I did not really test it, but I would really assume that this is expected behaviour. Could you put the sleep directly after the join and retry? Volker
Adding the sleep after the join yields the following results, which are not much different from previously. If this is "*by design*" what would be the best strategy to mitigate this? ------------------------------------------------------------- Sleep of 1 seconds: Succeeded 40.000000% 8 / 20 /tmp/results1 Sleep of 3 seconds: Succeeded 50.000000% 10 / 20 /tmp/results3 Sleep of 5 seconds: Succeeded 50.000000% 10 / 20 /tmp/results5 Sleep of 7 seconds: Succeeded 55.000000% 11 / 20 /tmp/results7 Sleep of 9 seconds: Succeeded 100.000000% 20 / 20 /tmp/results9 --------------------------------------------------------------------------- Here's the script that generated this -- Have you run this on your systems to compare results? #! /bin/bash for (( SECS=1 ; SECS<10; SECS+=2 )) ; do NTRIES=20 rm -rf /tmp/results${SECS} for (( tries=${NTRIES} ; tries>0; tries-- )) ; do killall winbindd killall smbd # /etc/init.d/smb stop bin/net ads leave rm /var/log/samba/* HOSTNAME=tsthost${RANDOM} hostname ${HOSTNAME} echo Hostname is ${HOSTNAME} bin/net ads join -U join%join sleep ${SECS} bin/smbd -d 10 # /etc/init.d/smb start echo "Starting winbindd" && bin/winbindd -d 10 FOO=`bin/wbinfo -u | wc -l` echo $FOO if [[ ${FOO} == '1' ]] ; then echo "fail" ${HOSTNAME} >>/tmp/results${SECS} ; STT="fail" else echo "pass" ${FOO} ${HOSTNAME} >>/tmp/results${SECS} ; STT="pass" fi # for i in `bin/wbinfo -u` ; do bin/wbinfo -n $i ; done # for i in `bin/wbinfo -g` ; do bin/wbinfo -n $i ; done killall winbindd # save all of the logs... tar -j -cvf /tmp/winbind-log[${HOSTNAME}]-${STT}.tar.bz2 /var/log/samba/log.w* bin/winbindd -d 10 done ; PASSRAW=`grep pass /tmp/results${SECS} | wc -l` DENOM=`wc -l /tmp/results${SECS}` PASSRATE=`dc -e "6 k ${PASSRAW} ${NTRIES} / 100 * p" ` echo "Sleep of ${SECS} seconds: Succeeded " ${PASSRATE}% ${PASSRAW} / ${DENOM} echo "Sleep of ${SECS} seconds: Succeeded " ${PASSRATE}% ${PASSRAW} / ${DENOM} >>/tmp/allresults done ;
So we successfully delegated the problem to the DC not finding the wks quickly enough after joining the domain. What we have to do is to program a time out and retry into winbind if we get this specific error message, there's not much we can do to speed up the DC. You might even face a much longer time out if have more domain controllers. The join might be done against one DC, the later winbind connect might use another one. If the replication has not yet been fully completed, then there's not much winbind can do. We would have to save the name of the DC the net join used, and I very much doubt that is worth the effort. Is this a real requirement for you, or is this just an artificial test? Can't you live with a minute between the join and later winbind startup? Normal windows workstations have the big advantage here that they have to reboot, so domain controllers have some time to catch up, time that you don't give the DC. I'm changing this bug to "enhancement", I don't see a real bug here. Lets get the policy discussion about wait time and fallbacks to samba-technical@samba.org please. Thanks for testing this BTW. Volker
Created attachment 1504 [details] Script to join domain, verify join, start winbind, perform operation
Attached is an adjusted script which demonstrates that even though the domain join has succeeded (net ads join, followed by net ads testjoin until we achieve success), there can be an occasional error. Next to be attached are the logs from a failure that we had after we successfully joined and verified. One error in 80 tries, but still an error.
Created attachment 1505 [details] D10 winbind logs of failure of wbinfo to look up information.
The log you just appended has a NT_STATUS_ACCESS_DENIED on the serverauth2 request. This is an indication that the workstation and the domain controllers disagree about the password, and as long as you re-join it will stay this way. Is it possible that the replication among your DCs is just a bit too slow and that your continouus leave and join leaves the DC's confused? If you have a failure in 80 attempts I'd very much say to just retry the join itself, not the lookup. Could you please comment on the question why you do this? I've never seen the requirement that someone needs to leave and join a domain at such a high rate, I'd very much assume that Windows is just not made for it. If you don't give me a good need from your production environment I'll change this bug to "wontfix". Volker
Brian, is this just a stress test for QA ? If so, it might be pushing outside of the realm of realistic use. And if so, then this is probably a no fix. btw...I used you original script and cannot reproduce you error in SAMBA_3_0 (even removing the sleep call).
We have some automated scripts which configure samba and join to AD; The operation needs to join the domain, get the SID of Domain Admins (for some additional security checking later), put this away, and proceed. We were getting failures after the net ads join, winbind start, and then calling wbinfo. We were seeing these very frequently at customer sites, especially in a corporate setting where DCs were separated by a slow link (e.g. Seattle, New Jersey). The scripts were written to repro and characterize the behavior as well as examine ways to defensively work around the issues. We think for the most part, VL's recent fixes are excellent and have worked -- with one exception --if winbind is started before the DCs are synchronized, or the "join" has taken, winbind is unable to provide any information (via wbinfo) during the subsequent duration of it's process (we actually think it's the query from wbinfo that causes winbind to fail, and then forever fail). One could argue that it's a case of removing the spoon before one drinks the hot cocoa, or "don't do that". So, our procedure is that we're net ads join'ing then repeatedly calling net ads testjoin to see if the join "took" *before* we start winbind. Our scripts changed to reflect that. We did find that in three instances (out of 900, we let this run overnight) that even with this precaution, winbind got into a state where it was unrecoverable without killing the process and restarting. To reiterate -- starting winbind, then running wbinfo, while the DCs have not "settled out" causes the winbind process to *never* (not during a reasonable testing interval) be able to do a query and return valid results, despite the DCs "settling" during the time winbind is running. The only way to recover from this situation is to kill the winbind process, and start it again. The most desirable behavior would be for winbind to re-establish itself with the DC if it detects that there was some problem when it started, and try to do so at some periodic interval.
Created attachment 1510 [details] Proposed patch... So if getting the domain sid is your real, deeper problem, does the attached (trivial, but untested) patch help you in any way? Volker
Closing bug, no feedback from reporter. Volker