Created attachment 6775 [details] Samba debug level 10 logs from the member server After upgrading from 3.5.11 to 3.6.0 I discovered, that when I restart the services (smbd, nmbd, winbindd) all come up, but winbind doesn't get any information from the Samba PDC (3.5.11). Also "wbinfo" fails and user can't access files,... because the uid/gid can't be resolved: # wbinfo -u Error looking up domain users If you restart several times, once it will work. If you got a moment when winbind successful everything is fine and the users can work. Restarting the services in such a successfull moment again doesnt break it. But if you restart it after some hours, it's the same issue and you need a lot of tries and luck until it comes up again. Every time, when this happens, I get on the PDC in nmbd.log: [2011/08/11 17:01:24.194765, 0] nmbd/nmbd_processlogon.c:615(process_logon_packet) process_logon_packet: bad packet I have this problem on all members I have migrated to 3.6.0.
Please upload full debug level 10 logs (log.wb*, log.winbind*), your smb.conf and a network trace. Thanks, Volker
Created attachment 6780 [details] Tcpdump Network Trace
Created attachment 6781 [details] smb.conf
Created attachment 6782 [details] full debug level 10 logs (log.wb*, log.winbind*)
Hi Marc, Thanks for the report. I looked a bit into your logs. I have no explanation yet, but there are a few strange things: The only attempted idmap lookup I see in log.winbindd-idmap is for a sid S-1-5-21-1205634850-3549583380-2535093466-546 which is not of the domain (MUC) but local to your member server (STORAGE-01). This is strange in itself because I would not assume these sids popping up in logs from idmap_tdb but rather from the passdb idmap backend. And then this attempted mapping fails with NT_STATUS_INVAILD_PARAMETER because the sid is handed in with id mapping type ID_TYPE_NOT_SPECIFIED, while only ID_TYPE_UID and ID_TYPE_GID are allowed. But I do rather suspect that this is not the main problem. You should do atomic tests with id mapping first. I am not so sure that the problem is id mapping itself, it could well be located in the nsswitch layer (libnss_winbindd). Firstly, for debugging, could you switch off "winbind use default domain" in smb.conf? Then please perform the following tests: 1. check the trust: $ wbinfo -t 2. check name->sid resolution for some domain user "username" $ wbinfo -n MUC\\username call the resulting sid "SID" for now. 3. check sid->name resolution for the sid from step #2: $ wbinfo -s SID this should give you back the name you entered in step #2 plus the type 4. check sid->uid resoultion: $ wbinfo -S SID this should give you a Unix User ID from the configured range. 5. Repeat the steps 2-4 for other users, and also for groups (in this case use wbinfo -Y instead of wbinfo -S in step 4). If these succeed, then id mapping is working in principle. If any of these fail, please provide full level 10 logs of the failing action like this: * stop winbindd (and the other daemons) * remove the old log files * make sure the smb.conf contains "log level = 10" * start winbindd * run the failing wbinfo command * attach all the winbind related log files to this bug If these succeed, we need to move towards the nss layer. First you can test what winbindd would report back to nss by calling: 6. wbinfo --user-info=MUC\\username 7. wbinfo --user-sidinfo=USER_SID 8. wbinfo --group-info=MUC\\groupname these should print output like you would expect from "getent passwd MUC\\username" and "getent group MUC\\groupname". (But without going through the nss layer). Again, if any of this fails, provide complete lvl10 logs of the failing operation. Then, the complete layer including the nss library, is to be tested. "getent passwd MUC\\username" and "getent group MUC\\groupname". Provide full level 10 logs including daemon start of the failing call. You have to make sure you have winbind in /etc/nsswitch.conf You should also switch off nscd if it is running. Furthermore, it might generally be usefull to prune sambas caches before performing the tests: the command is "net cache flush". These tests should give us some indication as to where to debug further. Cheers - Michael
I am also experiencing the same/similar problems. I am running 3.6 on AIX 6.1. I do have a 3.5.8 installation running without problem (I understand some major changes have happened.) I took the smb.conf from my 3.5.8 install and changed appropriately for 3.6 (At least as far as I catell). I followed the steps Michael posted on the bug report. /usr/local/samba/bin/wbinfo -t checking the trust secret for domain MYDOMAIN via RPC calls succeeded /usr/local/samba/bin/wbinfo -n MYDOMAIN+peacocjo S-1-5-21-4260745004-1716061493-1944009462-4325 SID_USER (1) /usr/local/samba/bin/wbinfo -s S-1-5-21-4260745004-1716061493-1944009462-4325 MYDOMAIN+peacocjo 1 /usr/local/samba/bin/wbinfo -S S-1-5-21-4260745004-1716061493-1944009462-4325 failed to call wbcSidToUid: WBC_ERR_WINBIND_NOT_AVAILABLE Could not convert sid S-1-5-21-4260745004-1716061493-1944009462-4325 to uid At times I have seen the failed to call wbcSidToUid: WBC_ERR_WINBIND_NOT_AVAILABLE error also. I am using rid as my backend for MYDOMAIN. Should I open a new bug?
I wanted to do the debuggings for you and upgraded one machine to 3.6.0 again. Everything worked now. Even restarting doesn't break it again. But then I run "net cache flush" and the problem is back. > Firstly, for debugging, could you switch off "winbind use default domain" > in smb.conf? Done. # wbinfo -t could not obtain winbind interface details: WBC_ERR_WINBIND_NOT_AVAILABLE could not obtain winbind domain name! checking the trust secret for domain (null) via RPC calls failed failed to call wbcCheckTrustCredentials: WBC_ERR_WINBIND_NOT_AVAILABLE Could not check secret # wbinfo -n MUC\\muehlfeld could not obtain winbind interface details: WBC_ERR_WINBIND_NOT_AVAILABLE could not obtain winbind separator! failed to call wbcLookupName: WBC_ERR_WINBIND_NOT_AVAILABLE Could not lookup name MUC\muehlfeld # wbinfo -s S-1-5-21-1362721961-1801182073-732966438-3122 failed to call wbcLookupSid: WBC_ERR_WINBIND_NOT_AVAILABLE Could not lookup sid S-1-5-21-1362721961-1801182073-732966438-3122 # wbinfo -S S-1-5-21-1362721961-1801182073-732966438-3122 failed to call wbcSidToUid: WBC_ERR_WINBIND_NOT_AVAILABLE Could not convert sid S-1-5-21-1362721961-1801182073-732966438-3122 to uid Because of the "could not obtain winbind separator" message, I set winbind separator = ´\´ (the default for this parameter as written in the manpage) and restarted the services. Now I get: # wbinfo -t checking the trust secret for domain MUC via RPC calls succeeded # wbinfo -n MUC\\muehlfeld failed to call wbcLookupName: WBC_ERR_DOMAIN_NOT_FOUND Could not lookup name MUC\muehlfeld # wbinfo -s S-1-5-21-1362721961-1801182073-732966438-3122 MUC�muehlfeld 1 # wbinfo -S S-1-5-21-1362721961-1801182073-732966438-3122 30000 Also "wbinfo -u/-g" work now. Because of this new information, with which settings I should provide you a new level 10 debug log?
This is what I get: ======================= wbinfo -t could not obtain winbind interface details: WBC_ERR_WINBIND_NOT_AVAILABLE could not obtain winbind domain name! checking the trust secret for domain (null) via RPC calls failed failed to call wbcCheckTrustCredentials: WBC_ERR_WINBIND_NOT_AVAILABLE Could not check secret ======================= Chris
Created attachment 6828 [details] level 10 logs
(In reply to comment #8) > This is what I get: > ======================= > wbinfo -t > could not obtain winbind interface details: WBC_ERR_WINBIND_NOT_AVAILABLE > could not obtain winbind domain name! > checking the trust secret for domain (null) via RPC calls failed > failed to call wbcCheckTrustCredentials: WBC_ERR_WINBIND_NOT_AVAILABLE > Could not check secret > ======================= > > Chris This very much looks like you did not update your libwbclient with the one you compiled with 3.6.0
At least for me I can say that I replaced the libnss_winbind.so with the one from 3.6.0 and run ldconfig. Can I provide any logfiles based on my last postings findings?
(In reply to comment #11) > At least for me I can say that I replaced the libnss_winbind.so with the one > from 3.6.0 and run ldconfig. > > Can I provide any logfiles based on my last postings findings? strace -ttT of the wbinfo process when it fails together with the winbind log files with 'debug hires timestamp = yes' might help
(In reply to comment #10) > This very much looks like you did not update your libwbclient with the one you > compiled with 3.6.0 Not the case. Even totally uninstalled, checked to make sure that libwbclient* and libnss_win* were gone. Re-installed, the files are there and the timestamps are correct. However, sometimes winbind does work. The times it doesn't I'll see this: ========================== # ps ax |grep winb 5612 ? Ss 0:00 /usr/sbin/winbindd 5615 ? S 0:00 /usr/sbin/winbindd 5622 ? Z 0:00 [winbindd] <defunct> ========================== This would be just after starting samba (no need to run wbinfo) and there will be that extra defunct winbindd process. If samba-3.6.0 (smbd, nmbd, winbindd) manages to start without creating the defunct process the wbinfo tests work. Chris
This smells like a duplicate of Bug 8409. Can you try https://bugzilla.samba.org/attachment.cgi?id=6813 please?
(In reply to comment #14) > This smells like a duplicate of Bug 8409. Can you try > > https://bugzilla.samba.org/attachment.cgi?id=6813 No difference. Got the defunct winbindd on first startup: ================================ # /etc/init.d/samba start * samba -> start: smbd ... [ ok ] * samba -> start: nmbd ... [ ok ] * samba -> start: winbind ... [ ok ] sartre samba # ps ax |grep winb 1799 ? Ss 0:00 /usr/sbin/winbindd 1802 ? S 0:00 /usr/sbin/winbindd 1806 pts/4 S+ 0:00 grep --colour=auto winb sartre samba # ps ax |grep winb 1799 ? Ss 0:00 /usr/sbin/winbindd 1802 ? S 0:00 /usr/sbin/winbindd 1807 ? Z 0:00 [winbindd] <defunct> 1809 pts/4 S+ 0:00 grep --colour=auto winb ================================ Above commands about 1-2 seconds apart.
Created attachment 6835 [details] strace of wbinfo -u
Created attachment 6836 [details] winbind logs (level 10) of wbinfo -u
The defunct winbind process directly after startup, I can confirm, too. # ps xf |grep winb 32438 ? Ss 0:00 /usr/sbin/winbindd 32439 ? S 0:00 \_ /usr/sbin/winbindd 32440 ? S 0:00 \_ /usr/sbin/winbindd 32442 ? Z 0:00 \_ [winbindd] <defunct>
(In reply to comment #18) > The defunct winbind process directly after startup, I can confirm, too. > > # ps xf |grep winb > 32438 ? Ss 0:00 /usr/sbin/winbindd > 32439 ? S 0:00 \_ /usr/sbin/winbindd > 32440 ? S 0:00 \_ /usr/sbin/winbindd > 32442 ? Z 0:00 \_ [winbindd] <defunct> I don't get it.... I don't see which winbind exited and where the parent hangs. If you can directly reproduce this, can you do another test: Run winbind under strace: strace -f -o /tmp/winbind.strace -ttT /usr/sbin/winbindd and watch it until you see the zombie. Please also upload the simultaneous log files again, so that I can match up the strace with log entries. Thanks, Volker
Created attachment 6837 [details] 2nd winbind strace Find attached the requested logs and strace.
Created attachment 6839 [details] strace and logs as requested Oddly enough when run under strace the defunct process disappears and wbinfo works. Also note that the patch @ https://bugzilla.samba.org/attachment.cgi?id=6813 has been applied in this case. ================================================= sartre samba # ps xf |grep winb 21069 pts/4 S+ 0:00 | \_ strace -f -o /tmp/winbind.strace -ttT /usr/sbin/winbindd 21082 pts/2 S+ 0:00 \_ grep --colour=auto winb 21074 ? Ss 0:00 /usr/sbin/winbindd 21075 ? S 0:00 \_ /usr/sbin/winbindd 21080 ? Z 0:00 \_ [winbindd] <defunct> sartre samba # ps xf |grep winb 21069 pts/4 S+ 0:00 | \_ strace -f -o /tmp/winbind.strace -ttT /usr/sbin/winbindd 21084 pts/2 S+ 0:00 \_ grep --colour=auto winb 21074 ? Ss 0:00 /usr/sbin/winbindd 21075 ? S 0:00 \_ /usr/sbin/winbindd sartre samba # wbinfo -t checking the trust secret for domain COMPUTERGUY via RPC calls succeeded sartre samba # ps xf |grep winb 21069 pts/4 S+ 0:00 | \_ strace -f -o /tmp/winbind.strace -ttT /usr/sbin/winbindd 21087 pts/2 S+ 0:00 \_ grep --colour=auto winb 21074 ? Ss 0:00 /usr/sbin/winbindd 21075 ? S 0:00 \_ /usr/sbin/winbindd =================================================
Created attachment 6840 [details] strace and logs as requested - proper failure this time This time the defunct process did not disappear and "wbinfo -t" failed. A couple of notes: It appears that if I don't start smbd and nmbd, winbind does not fail - the defunct process never occurs. I can more regularly recreate the failure if I first remove the gencache* and winbind* tdb files before restart samba.
Anything else we can do to help getting this issue fixed in the next version?
Is this one a showstopper for 3.6.1?
(In reply to comment #24) > Is this one a showstopper for 3.6.1? Prevents its use as a member server for several of my accounts. Would be nice to be able to use it for the supposed printing improvements.
For us it's definitely a showstopper. This problem is on all of our member servers and prevents us from upgrading to 3.6. And it seems it is almost half done (see what I found in in Comment 7). Would be really nice, if it could finally be fixed in 3.6.1
(In reply to comment #26) > For us it's definitely a showstopper. This problem is on all of our member > servers and prevents us from upgrading to 3.6. > > And it seems it is almost half done (see what I found in in Comment 7). Would > be really nice, if it could finally be fixed in 3.6.1 If we only had a reproducer or a way to see this live..
Beside logfiles and traces I can create for you, I can offer you to have a remote session (e. g. teamviewer is also available for Linux), where you can have a look at it and see the problem live. Just send me a mail. I'm back in office on tuesday.
Okay, we do need to delay the release anyway. So I would like to wait another 2 days to see if there is a chance to get this one fixed also. Marking as blocker to be aware of it.
(In reply to comment #7) > I wanted to do the debuggings for you and upgraded one machine to 3.6.0 again. > Everything worked now. Even restarting doesn't break it again. But then I run > "net cache flush" and the problem is back. Now that is really strange. Is that reliably reproducible? > # wbinfo -t > could not obtain winbind interface details: WBC_ERR_WINBIND_NOT_AVAILABLE > could not obtain winbind domain name! > checking the trust secret for domain (null) via RPC calls failed > failed to call wbcCheckTrustCredentials: WBC_ERR_WINBIND_NOT_AVAILABLE > Could not check secret > > > # wbinfo -n MUC\\muehlfeld > could not obtain winbind interface details: WBC_ERR_WINBIND_NOT_AVAILABLE > could not obtain winbind separator! > failed to call wbcLookupName: WBC_ERR_WINBIND_NOT_AVAILABLE > Could not lookup name MUC\muehlfeld > > > # wbinfo -s S-1-5-21-1362721961-1801182073-732966438-3122 > failed to call wbcLookupSid: WBC_ERR_WINBIND_NOT_AVAILABLE > Could not lookup sid S-1-5-21-1362721961-1801182073-732966438-3122 > > > # wbinfo -S S-1-5-21-1362721961-1801182073-732966438-3122 > failed to call wbcSidToUid: WBC_ERR_WINBIND_NOT_AVAILABLE > Could not convert sid S-1-5-21-1362721961-1801182073-732966438-3122 to uid > > > Because of the "could not obtain winbind separator" message, I set > winbind separator = ´\´ > (the default for this parameter as written in the manpage) and restarted the > services. Well, this is actually not the reason. And it must be coincidence that the next set of commands somewhat succeeded: wbinfo does not get the separator from the config but from winbindd itself. Now WBC_ERR_WINBIND_NOT_AVAILABLE says that winbindd could not be contacted. Hence in particular the separator could not be obtained. The ´/´ is even invalid, because the ´ chars are taken into the value, not used as quoting. > Now I get: > > # wbinfo -t > checking the trust secret for domain MUC via RPC calls succeeded > > > # wbinfo -n MUC\\muehlfeld > failed to call wbcLookupName: WBC_ERR_DOMAIN_NOT_FOUND > Could not lookup name MUC\muehlfeld Here you already see that the separator did not work: It could not split the domain from the name. > # wbinfo -s S-1-5-21-1362721961-1801182073-732966438-3122 > MUC�muehlfeld 1 And here you see that the separator that winbindd delivers is a non-ascii character. > # wbinfo -S S-1-5-21-1362721961-1801182073-732966438-3122 > 30000 > > > > Also "wbinfo -u/-g" work now. > > > > Because of this new information, with which settings I should provide you a new > level 10 debug log? So the winbind separator = ´/´ is not the solution, not even a partial one. This is more a question of winbindd not starting up properly. First thing is now to clarify whether the net cache flush thing is reproducible reliably. Cheers - Michael
I did 10 restarts of the services (smbd, nmbd, winbindd) and everytime it worked, I tried "net cache flush". This are my results: /etc/init.d/rc.smb restart wbinfo -u # works net cache flush wbinfo -u # works /etc/init.d/rc.smb restart wbinfo -u # failed /etc/init.d/rc.smb restart wbinfo -u # works net cache flush wbinfo -u # works /etc/init.d/rc.smb restart wbinfo -u # failed /etc/init.d/rc.smb restart wbinfo -u # failed /etc/init.d/rc.smb restart wbinfo -u # failed /etc/init.d/rc.smb restart wbinfo -u # works net cache flush wbinfo -u # works /etc/init.d/rc.smb restart wbinfo -u # failed /etc/init.d/rc.smb restart wbinfo -u # works net cache flush wbinfo -u # works /etc/init.d/rc.smb restart wbinfo -u # failed I just tried because I was curious: I deleted the content of the lock directory and started the services again. But I failed again directly.
Hi Michael, please comment if there is a chance to get a fix for this one soon and if it's really a blocker for 3.6.1. Thanks! Cheers, Karo
There must be some difference in winbind (or somewhere) between 3.5.11 (which always works) and 3.6.0 (which fails sporadically) that's causing this.
Michael, please comment if there is a chance to get a fix for this one soon and if it's really a blocker for 3.6.1. Thanks!
Hi Karo, I don't know whether we have ah chance to fix it soon, since I don't know what circumstances trigger this behaviour. I.e. I did not reproduce it yet. For those who suffer from it, it seems to be very bad indeed. Not sure whether this makes it a blocker for 3.6.1. (I was arguing once that blocker should only exist for .0 releases and not for .X releases with X > 1. I don't know how this discussion was resolved, if at all...) :-) Cheers - Michael (In reply to comment #34) > Michael, > > please comment if there is a chance to get a fix for this one soon and if it's > really a blocker for 3.6.1. > > Thanks!
I compiled the new 3.6.1. Problem is still there.
Disappointing. Samba-3.6.1 does not fix this issue and is unusable as a member server. Back to 3.5.11.
(In reply to comment #37) > Disappointing. Samba-3.6.1 does not fix this issue and is unusable as a member > server. Back to 3.5.11. Want to add that it even seems worse - so far have not seen one working start of winbind at all with 3.6.1 and need to use "kill -9" to stop the processes. However I have upgraded the system I'm testing with to a a faster one (i7 2600k) so not everything is the same.
Decided to fiddle with this again this morning and made some possibly helpful discoveries. I noticed that winbind had no issues if the PDC was not running. After stopping Samba on the PDC winbind started normally with no <defunct> process. So it wasn't some sort of standalone race condition with winbind itself. After stating the PDC trouble was again apparent. Suspecting now that the issue was communication dependent between the PDC and the Member server I decided to test upgrading the PDC from its working version of 3.5.8 (worked with all member servers through 3.5.11) to 3.6.1. After the PDC upgrade to 3.6.1 I have not (as yet) been able to reproduce the winbind problem with the 3.6.1 member server (which was recently failing 100% of the time before the PDC upgrade). Which still leaves me really scared about an account that still has an NT4 PDC... is it really safe to even consider a 3.6.1 upgrade on its member servers ??
This is a showstopper for 3.6.2. Could someone investigate, please? Thanks!
(In reply to comment #39) > Which still leaves me really scared about an account that still has an NT4 > PDC... is it really safe to even consider a 3.6.1 upgrade on its member servers > ?? Finally tested this - samba-3.6.1 fails as member server to an NT4 PDC. The same winbind defunct process occurs.
I can confirm that, if both (PDC and meber) are running 3.6.1, then the problem doesn't appear. In my test environment I created a script that contains all steps from comment #5 and let it run every even minute (it also contains a restart of the services and a "net cache flush"). On the PDC I additionally let the services restart every uneven minute. This script I let run for over 1h now. And everytime all steps from comment #5 succeed. So this bug really only seems to appear when the PDC is on 3.5.x and the member on 3.6.x.
(In reply to comment #42) > So this bug really only seems to appear when the PDC is on 3.5.x and the member > on 3.6.x. Possibly accurate to state that winbind from 3.6.[0,1] fails against a PDC running <3.6.0 including a real NT4 PDC. Although I have only tested against PDC's of 3.5.[8,11-12] and NT4. Yesterday I installed 3.6.1 on a new system that had never run Samba previously and it failed immediately against an NT4 PDC and although many subsequent restarts would run OK the failure would eventually occur again. The only change in the smb.conf between 3.5.x and 3.6.x versions were the removal of the deprecated parameters: =============================== idmap backend = tdb idmap uid = 300000-400000 idmap gid = 300000-400000 idmap config WARGAMES : backend = rid idmap config WARGAMES : range = 2000-299999 =============================== was shortened to: =============================== idmap config WARGAMES : backend = rid idmap config WARGAMES : range = 2000-299999 ===============================
Created attachment 7095 [details] set of level 10 logs from failure with actual NT4 PDC Can also add a set from a non-failed start up for comparison if desired. Also note that the failure rate is (so far) 100% of Samba starts if I do a "net cache flush" after stopping Samba.
Looked at all log files again. None show any failure. winbind logs are more than just log.winbindd. They are also log.wb-* and other stuff. This is terminally unfixable with proper logon to such a box. Closing as INVALID, it seems completely impossible to get the required information.
(In reply to comment #45) > Looked at all log files again. None show any failure. winbind logs are more > than just log.winbindd. They are also log.wb-* and other stuff. This is > terminally unfixable with proper logon to such a box. Closing as INVALID, it s/with/without/, but you guessed that already. Volker
(In reply to comment #45) > Looked at all log files again. None show any failure. winbind logs are more > than just log.winbindd. They are also log.wb-* and other stuff. This is > terminally unfixable with proper logon to such a box. Closing as INVALID, it > seems completely impossible to get the required information. Don't jump to conclusions! Those were _all_ of the logs created - a full set! Maybe that's a clue of some sort. When it doesn't fail there are log.wb-* files created, for the domain, the host and builtin. I just reran the scenario a few times and one time did get a log.wb-<DOMAINNAME> but none of the others (<HOST> and BUILTIN). Marc and I both can recreate this identical problem at will and it is not distro specific as we use different one. The problem is 100% repeatable when using Samba-3.6.1 as a member server against an NT4 PDC when using "net flush cache" before starting Samba for me. I will be happy to furnish anything you ask for to the best of my ability. I will upload the newest set of fail logs that adds the log.wb-<DOMAINNAME> file, and I let this one run through an attempt to get user and group info. Here is the console output: darwin samba # /etc/init.d/samba start * samba -> start: smbd ... [ ok ] * samba -> start: nmbd ... [ ok ] * samba -> start: winbind ... [ ok ] darwin samba # ps ax |grep win 10148 ? Ss 0:00 /usr/sbin/winbindd 10151 ? S 0:00 /usr/sbin/winbindd 10153 ? Z 0:00 [winbindd] <defunct> darwin samba # wbinfo -u Error looking up domain users darwin samba # wbinfo -g failed to call wbcListGroups: WBC_ERR_WINBIND_NOT_AVAILABLE Error looking up domain groups darwin samba # ls /var/log/samba/ cores log.nmbd log.smbd log.wb-WARGAMES log.winbindd log.winbindd-dc-connect smbd.log winbindd.log
I already offered you to have a teamviewer session, so you can have a look at the machines. Just send me an mail during work hours. I can also call you back. I'm also from germany.
Created attachment 7096 [details] _complete_ set of logs from failure with NT4 PDC the set has an additional log file - the log-wb-<DOMAINNAME>
(In reply to comment #49) > Created attachment 7096 [details] > _complete_ set of logs from failure with NT4 PDC > > the set has an additional log file - the log-wb-<DOMAINNAME> vlendec@delphin:/tmp/wb$ grep process_request * log.wb-WARGAMES:[2011/11/14 16:17:04.840630, 10] winbindd/winbindd_dual.c:439(child_process_request) log.wb-WARGAMES: child_process_request: request fn INIT_CONNECTION log.wb-WARGAMES:[2011/11/14 16:18:59.416551, 10] winbindd/winbindd_dual.c:439(child_process_request) log.wb-WARGAMES: child_process_request: request fn LIST_TRUSTDOM winbindd.log:[2011/11/14 16:18:59.416567, 10] winbindd/winbindd.c:642(process_request) winbindd.log: process_request: request fn INTERFACE_VERSION winbindd.log:[2011/11/14 16:18:59.416656, 10] winbindd/winbindd.c:642(process_request) winbindd.log: process_request: request fn INTERFACE_VERSION There is no attempt at all to list users, so I am at a complete loss which winbind your wbinfo command asked. One thing is 100% certain: It was NOT the winbind that you sent logfiles from.
Created attachment 7097 [details] logfiles and smb.conf from a 3.6.1 member server and 3.5.12 PDC I cleared the log directories on my PDC and member, set on both log level to 10 and started the services. Then I run "wbinfo -u" on the member and wait until it timed out. The logs are created in my test environment, so no other machines should appear in the logs. GENOME = PDC TEST_Member = Member server MUC = Domain name The logs of both servers (log level 10) and both smb.conf are included. PDC is 3.5.12 and member server is 3.6.1
Created attachment 7098 [details] Patch for 3.6 Can you try the attached (completely untested) patch? Thanks, Volker
Could you see any unusual in my logs? I compiled 3.6.1 with your patch. I run "net cache flush" and started the services. But "wbinfo -u" still time out.
Well, I saw a NT_STATUS_NO_MEMORY which is the wrong error message to an empty local domain. The patch should have fixed that, but apparently it did not. As I said, to me this seems an unfixable defect. Way too many roundtrips for a pro-bono volunteer project.
Probably impossible, but the same set of log files as in together with two sets of network traces on both the client and the server might give some additional clues. For detailed information how to create useful network traces, please see http://wiki.samba.org/index.php/Capture_Packets.
Created attachment 7099 [details] Logs and network traces of 3.5.12 PDC and 3.6.1 member Find attached a .tar.bz2 that contains (each from PDC and member): - level 10 debug log - smb.con - network trace Included are: Start of the services on both machines, "wbinfo -u" on the client until timeout.
Trace_PDC contains -- NOTHING. I would have liked to see whether frame 65 from Trace_Member arrives on the PDC. But to me it seems that you have traced the wrong network interface.
Created attachment 7102 [details] Logs and network traces of 3.5.12 PDC and 3.6.1 member You are right. I had the wrong interface on the PDC. Sorry. Find attached the same as before. Just with new created traces and logs.
(In reply to comment #50) > There is no attempt at all to list users, so I am at a complete loss which > winbind your wbinfo command asked. One thing is 100% certain: It was NOT the > winbind that you sent logfiles from. All of the logs were deleted before the run, and I uploaded every single one created. My guess is that wbinfo is somehow attempting to gather data via the <defunct> winbind process and this is possibly preventing the log information you're expecting to see. Maybe we're looking in the wrong spot. The thing to keep in mind is that after starting Samba, if there is a <defunct> winbind process (100% against an NT4 PDC if net cache flush is run before starting) the error has already occurred - there is no need to run wbinfo as it is a 100% foregone conclusion that it will fail.
Marc, got your fault reproduced.
Created attachment 7109 [details] samlogon packet from 3.5
Created attachment 7110 [details] samlogon packet from 3.6 This packet has a problem: The alignment after the SID is not correct. That's why the 3.5 DC discards this packet from 3.6, which leads to the initial timeouts. When winbind was able to figure out the DC name in a different way then everything is fine. Need to dig deeper why ndr_push_NETLOGON_SAM_LOGON_REQUEST in librpc/ndr/ndr_nbt.c (hand-marshalled) gets the alignment wrong.
BTW, I have not been able to reproduce this so far because I have run winbind without nmbd all the time. Without nmbd that code path is never executed.
(In reply to comment #63) > BTW, I have not been able to reproduce this so far because I have run winbind > without nmbd all the time. Without nmbd that code path is never executed. Verified that without running nmbd the winbind bug does not occur against an NT4 PDC.
Created attachment 7114 [details] Patch for 3.6 This is what I've pushed to master
Created attachment 7115 [details] draft patch I think ndr_pull_NETLOGON_SAM_LOGON_REQUEST() has the same problem. Maybe can fix this more genericly in ndr_set_flags(), but I'd need check all callers. If this takes to long we can add manual fixes to ndr_push_NETLOGON_SAM_LOGON_REQUEST and ndr_pull_NETLOGON_SAM_LOGON_REQUEST.
Volker, I think for now we should add your fix also to ndr_pull_NETLOGON_SAM_LOGON_REQUEST()
(In reply to comment #65) > Created attachment 7114 [details] > Patch for 3.6 This does resolve the winbind bug with 3.6.1 as a member server to an NT4 PDC.
Comment on attachment 7114 [details] Patch for 3.6 Ok, I stand to my comment: This is not fixable. My patch does not survive autobuild. I really don't know anymore what to do.
If I just apply the patch from Stefan, it don't work. But if I also apply Volkers patch, wbinfo seems to work like it should. Is our setup so special, that we are the only two who are having this problem? Do others with a 3.5 PDC and 3.6 Member don't have this problem? At the moment I can't switch the PDC also to 3.6, because I discovered a bug with printing drivers which is a showstopper for me.
Created attachment 7117 [details] Patch for 3.6 This is the equivalent of what ended up in master.
Comment on attachment 7117 [details] Patch for 3.6 Looks good for 3.6
(In reply to comment #72) > Comment on attachment 7117 [details] > Patch for 3.6 > > Looks good for 3.6 Karolin, please pick this for the next release
Created attachment 7118 [details] new draft patch Marc, can you also test if this patch would fix it too? A working capture against a real NT4 DC would also be very useful.
Looks good, Stefan. Thanks. Maybe Chris can create a capture against a NT4 PDC (we fortunately just have a samba PDC).
(In reply to comment #75) > Looks good, Stefan. Thanks. Ok, thanks. > Maybe Chris can create a capture against a NT4 PDC (we fortunately just have a > samba PDC). Chris, could do create a capture please. See https://wiki.samba.org/index.php/Capture_Packets
(In reply to comment #76) > Chris, could do create a capture please. > See https://wiki.samba.org/index.php/Capture_Packets Which patch(es)? Should both 7117 and 7118 be applied?
Created attachment 7124 [details] tcpdump capture against NT4 PDC (In reply to comment #76) > Chris, could do create a capture please. > See https://wiki.samba.org/index.php/Capture_Packets Hopefully this is useful - if not I'll head back to the drawing board. This is the capture against an NT4 PDC after starting Samba-3.6.1 with all the (non-obsolete) patches available in the bugs listed in #8595 except for bugs #8600 & #8592 as they do not properly apply (maybe against master but not against 3.6.1) included, of course patches 7117 and 7118 from this bug. Started the capture on the member server, started Samba (smbd, nmbd, winbind), ran wbinfo -u and -g. I can easily adjust which patches get applied if necessary.
Re-assigning to metze for further investigation. Please note that I did not push the patch yet. I think it makes sense to wait until the bug can be closed. Please let me know if I should push the patch right now. Thanks!
Comment on attachment 7117 [details] Patch for 3.6 Karolin, please pick this patch for the next 3.6 release, it's a correct fix for this bug report.
Thanks for the final fix. Good work. I'm happy to hear, that it will be included in the next 3.6 version. I hope the other printer related bugs are also fixed soon (like #8612) - then I can finally switch all my samba servers to 3.6 :-)
(In reply to comment #80) > Comment on attachment 7117 [details] > Patch for 3.6 > > Karolin, please pick this patch for the next 3.6 release, it's a correct fix > for this bug report. Hi Metze, done. Thanks for clarification! Closing out bug report. Please feel free to re-open if it's still an issue. Thanks!