Bug 8371 - Winbind can't receive any user/group information
Winbind can't receive any user/group information
Status: RESOLVED FIXED
Product: Samba 3.6
Classification: Unclassified
Component: Winbind
3.6.1
x64 Linux
: P5 regression
: ---
Assigned To: Karolin Seeger
Samba QA Contact
:
Depends on:
Blocks: 8373 8595
  Show dependency treegraph
 
Reported: 2011-08-11 15:11 UTC by Marc Muehlfeld
Modified: 2011-12-11 18:51 UTC (History)
6 users (show)

See Also:


Attachments
Samba debug level 10 logs from the member server (29.52 KB, application/octet-stream)
2011-08-11 15:11 UTC, Marc Muehlfeld
no flags Details
Tcpdump Network Trace (34.90 KB, application/octet-stream)
2011-08-12 16:09 UTC, Marc Muehlfeld
no flags Details
smb.conf (2.73 KB, text/plain)
2011-08-12 16:10 UTC, Marc Muehlfeld
no flags Details
full debug level 10 logs (log.wb*, log.winbind*) (26.37 KB, application/octet-stream)
2011-08-12 16:11 UTC, Marc Muehlfeld
no flags Details
level 10 logs (37.03 KB, application/x-bzip)
2011-08-29 19:39 UTC, Chris Smith
no flags Details
strace of wbinfo -u (12.17 KB, text/plain)
2011-08-30 06:28 UTC, Marc Muehlfeld
no flags Details
winbind logs (level 10) of wbinfo -u (30.50 KB, application/octet-stream)
2011-08-30 06:28 UTC, Marc Muehlfeld
no flags Details
2nd winbind strace (77.02 KB, application/octet-stream)
2011-08-30 07:43 UTC, Marc Muehlfeld
no flags Details
strace and logs as requested (234.85 KB, application/x-bzip)
2011-08-30 13:57 UTC, Chris Smith
no flags Details
strace and logs as requested - proper failure this time (93.94 KB, application/x-bzip)
2011-08-30 14:31 UTC, Chris Smith
no flags Details
set of level 10 logs from failure with actual NT4 PDC (29.50 KB, application/x-bzip)
2011-11-14 16:49 UTC, Chris Smith
no flags Details
_complete_ set of logs from failure with NT4 PDC (35.66 KB, application/x-bzip)
2011-11-14 21:48 UTC, Chris Smith
no flags Details
logfiles and smb.conf from a 3.6.1 member server and 3.5.12 PDC (59.20 KB, application/octet-stream)
2011-11-15 13:56 UTC, Marc Muehlfeld
no flags Details
Patch for 3.6 (732 bytes, patch)
2011-11-15 14:52 UTC, Volker Lendecke
no flags Details
Logs and network traces of 3.5.12 PDC and 3.6.1 member (106.49 KB, application/octet-stream)
2011-11-15 15:41 UTC, Marc Muehlfeld
no flags Details
Logs and network traces of 3.5.12 PDC and 3.6.1 member (78.49 KB, application/octet-stream)
2011-11-15 16:14 UTC, Marc Muehlfeld
no flags Details
samlogon packet from 3.5 (355 bytes, application/octet-stream)
2011-11-16 20:30 UTC, Volker Lendecke
no flags Details
samlogon packet from 3.6 (376 bytes, application/cap)
2011-11-16 20:33 UTC, Volker Lendecke
no flags Details
Patch for 3.6 (1.07 KB, patch)
2011-11-17 21:42 UTC, Volker Lendecke
no flags Details
draft patch (716 bytes, patch)
2011-11-17 22:19 UTC, Stefan Metzmacher
no flags Details
Patch for 3.6 (1.46 KB, patch)
2011-11-18 09:46 UTC, Volker Lendecke
metze: review+
Details
new draft patch (846 bytes, patch)
2011-11-18 10:05 UTC, Stefan Metzmacher
no flags Details
tcpdump capture against NT4 PDC (28.20 KB, application/octet-stream)
2011-11-18 17:08 UTC, Chris Smith
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Marc Muehlfeld 2011-08-11 15:11:02 UTC
Created attachment 6775 [details]
Samba debug level 10 logs from the member server

After upgrading from 3.5.11 to 3.6.0 I discovered, that when I restart the services (smbd, nmbd, winbindd) all come up, but winbind doesn't get any information from the Samba PDC (3.5.11). Also "wbinfo" fails and user can't access files,... because the uid/gid can't be resolved:

# wbinfo -u
Error looking up domain users

If you restart several times, once it will work. If you got a moment when winbind successful everything is fine and the users can work. Restarting the services in such a successfull moment again doesnt break it. But if you restart it after some hours, it's the same issue and you need a lot of tries and luck until it comes up again.

Every time, when this happens, I get on the PDC in nmbd.log:
[2011/08/11 17:01:24.194765,  0] nmbd/nmbd_processlogon.c:615(process_logon_packet)
  process_logon_packet: bad packet

I have this problem on all members I have migrated to 3.6.0.
Comment 1 Volker Lendecke 2011-08-12 14:31:24 UTC
Please upload full debug level 10 logs (log.wb*, log.winbind*), your smb.conf and a network trace.

Thanks,

Volker
Comment 2 Marc Muehlfeld 2011-08-12 16:09:53 UTC
Created attachment 6780 [details]
Tcpdump Network Trace
Comment 3 Marc Muehlfeld 2011-08-12 16:10:05 UTC
Created attachment 6781 [details]
smb.conf
Comment 4 Marc Muehlfeld 2011-08-12 16:11:04 UTC
Created attachment 6782 [details]
full debug level 10 logs (log.wb*, log.winbind*)
Comment 5 Michael Adam 2011-08-15 10:14:23 UTC
Hi Marc,

Thanks for the report.
I looked a bit into your logs.

I have no explanation yet, but there are a few strange things:
The only attempted idmap lookup I see in log.winbindd-idmap is for a sid
S-1-5-21-1205634850-3549583380-2535093466-546 which is not of the domain (MUC) but local to your member server (STORAGE-01). This is strange in itself because I would not assume these sids popping up in logs from idmap_tdb but rather from the passdb idmap backend. And then this attempted mapping fails with NT_STATUS_INVAILD_PARAMETER because the sid is handed in with id mapping type ID_TYPE_NOT_SPECIFIED, while only ID_TYPE_UID and ID_TYPE_GID are allowed.

But I do rather suspect that this is not the main problem.
You should do atomic tests with id mapping first. I am not so sure
that the problem is id mapping itself, it could well be located in the nsswitch layer (libnss_winbindd).

Firstly, for debugging, could you switch off "winbind use default domain"
in smb.conf?

Then please perform the following tests:

1. check the trust:
   $ wbinfo -t 

2. check name->sid resolution for some domain user "username"
   $ wbinfo -n MUC\\username
   call the resulting sid "SID" for now.

3. check sid->name resolution for the sid from step #2:
   $ wbinfo -s SID
   this should give you back the name you entered in step #2 plus
   the type

4. check sid->uid resoultion:
   $ wbinfo -S SID
   this should give you a Unix User ID from the configured range.

5. Repeat the steps 2-4 for other users, and also for groups
   (in this case use wbinfo -Y instead of wbinfo -S in step 4).


If these succeed, then id mapping is working in principle.
If any of these fail, please provide full level 10 logs of the
failing action like this:

* stop winbindd (and the other daemons)
* remove the old log files
* make sure the smb.conf contains "log level = 10"
* start winbindd
* run the failing wbinfo command
* attach all the winbind related log files to this bug

If these succeed, we need to move towards the nss layer.
First you can test what winbindd would report back to nss
by calling:

6. wbinfo --user-info=MUC\\username
7. wbinfo --user-sidinfo=USER_SID
8. wbinfo --group-info=MUC\\groupname

these should print output like you would expect from "getent passwd MUC\\username" and "getent group MUC\\groupname". (But without going through the nss layer).

Again, if any of this fails, provide complete lvl10 logs of the failing operation.

Then, the complete layer including the nss library, is to be tested.
"getent passwd MUC\\username" and "getent group MUC\\groupname".
Provide full level 10 logs including daemon start of the failing call.

You have to make sure you have winbind in /etc/nsswitch.conf
You should also switch off nscd if it is running.

Furthermore, it might generally be usefull to prune sambas caches
before performing the tests: the command is "net cache flush".

These tests should give us some indication as to where to debug further.

Cheers - Michael
Comment 6 Josh Peacock 2011-08-15 15:26:49 UTC

I am also experiencing the same/similar problems.  I am running 3.6 on AIX 6.1.  I do have a 3.5.8 installation running without problem (I understand some major changes have happened.)  I took the smb.conf from my 3.5.8 install and changed appropriately for 3.6 (At least as far as I catell).

I followed the steps Michael posted on the bug report.

/usr/local/samba/bin/wbinfo -t
checking the trust secret for domain MYDOMAIN via RPC calls succeeded

/usr/local/samba/bin/wbinfo -n MYDOMAIN+peacocjo
S-1-5-21-4260745004-1716061493-1944009462-4325 SID_USER (1)

/usr/local/samba/bin/wbinfo -s S-1-5-21-4260745004-1716061493-1944009462-4325
MYDOMAIN+peacocjo 1

/usr/local/samba/bin/wbinfo -S S-1-5-21-4260745004-1716061493-1944009462-4325
failed to call wbcSidToUid: WBC_ERR_WINBIND_NOT_AVAILABLE Could not convert sid S-1-5-21-4260745004-1716061493-1944009462-4325 to uid

At times I have seen the failed to call wbcSidToUid: WBC_ERR_WINBIND_NOT_AVAILABLE
 error also.  

I am using rid as my backend for MYDOMAIN.

Should I open a new bug?
Comment 7 Marc Muehlfeld 2011-08-15 17:09:27 UTC
I wanted to do the debuggings for you and upgraded one machine to 3.6.0 again. Everything worked now. Even restarting doesn't break it again. But then I run "net cache flush" and the problem is back.


> Firstly, for debugging, could you switch off "winbind use default domain"
> in smb.conf?

Done.



# wbinfo -t
could not obtain winbind interface details: WBC_ERR_WINBIND_NOT_AVAILABLE
could not obtain winbind domain name!
checking the trust secret for domain (null) via RPC calls failed
failed to call wbcCheckTrustCredentials: WBC_ERR_WINBIND_NOT_AVAILABLE
Could not check secret


# wbinfo -n MUC\\muehlfeld
could not obtain winbind interface details: WBC_ERR_WINBIND_NOT_AVAILABLE
could not obtain winbind separator!
failed to call wbcLookupName: WBC_ERR_WINBIND_NOT_AVAILABLE
Could not lookup name MUC\muehlfeld


# wbinfo -s S-1-5-21-1362721961-1801182073-732966438-3122
failed to call wbcLookupSid: WBC_ERR_WINBIND_NOT_AVAILABLE
Could not lookup sid S-1-5-21-1362721961-1801182073-732966438-3122


# wbinfo -S S-1-5-21-1362721961-1801182073-732966438-3122
failed to call wbcSidToUid: WBC_ERR_WINBIND_NOT_AVAILABLE
Could not convert sid S-1-5-21-1362721961-1801182073-732966438-3122 to uid


Because of the "could not obtain winbind separator" message, I set
winbind separator = ´\´
(the default for this parameter as written in the manpage) and restarted the services. Now I get:


# wbinfo -t
checking the trust secret for domain MUC via RPC calls succeeded


# wbinfo -n MUC\\muehlfeld
failed to call wbcLookupName: WBC_ERR_DOMAIN_NOT_FOUND
Could not lookup name MUC\muehlfeld


# wbinfo -s S-1-5-21-1362721961-1801182073-732966438-3122
MUC�muehlfeld 1


# wbinfo -S S-1-5-21-1362721961-1801182073-732966438-3122
30000



Also "wbinfo -u/-g" work now.



Because of this new information, with which settings I should provide you a new level 10 debug log?
Comment 8 Chris Smith 2011-08-29 19:37:43 UTC
This is what I get:
=======================
wbinfo -t
could not obtain winbind interface details: WBC_ERR_WINBIND_NOT_AVAILABLE
could not obtain winbind domain name!
checking the trust secret for domain (null) via RPC calls failed
failed to call wbcCheckTrustCredentials: WBC_ERR_WINBIND_NOT_AVAILABLE
Could not check secret
=======================

Chris
Comment 9 Chris Smith 2011-08-29 19:39:04 UTC
Created attachment 6828 [details]
level 10 logs
Comment 10 Volker Lendecke 2011-08-29 19:43:26 UTC
(In reply to comment #8)
> This is what I get:
> =======================
> wbinfo -t
> could not obtain winbind interface details: WBC_ERR_WINBIND_NOT_AVAILABLE
> could not obtain winbind domain name!
> checking the trust secret for domain (null) via RPC calls failed
> failed to call wbcCheckTrustCredentials: WBC_ERR_WINBIND_NOT_AVAILABLE
> Could not check secret
> =======================
> 
> Chris

This very much looks like you did not update your libwbclient with the one you compiled with 3.6.0
Comment 11 Marc Muehlfeld 2011-08-29 19:54:48 UTC
At least for me I can say that I replaced the libnss_winbind.so with the one from 3.6.0 and run ldconfig.

Can I provide any logfiles based on my last postings findings?
Comment 12 Volker Lendecke 2011-08-29 20:06:29 UTC
(In reply to comment #11)
> At least for me I can say that I replaced the libnss_winbind.so with the one
> from 3.6.0 and run ldconfig.
> 
> Can I provide any logfiles based on my last postings findings?

strace -ttT of the wbinfo process when it fails together with the winbind log files with 'debug hires timestamp = yes' might help
Comment 13 Chris Smith 2011-08-29 21:58:23 UTC
(In reply to comment #10)
> This very much looks like you did not update your libwbclient with the one you
> compiled with 3.6.0

Not the case. Even totally uninstalled, checked to make sure that libwbclient* and libnss_win* were gone. Re-installed, the files are there and the timestamps are correct.

However, sometimes winbind does work. The times it doesn't I'll see this:
==========================
 # ps ax |grep winb
 5612 ?        Ss     0:00 /usr/sbin/winbindd
 5615 ?        S      0:00 /usr/sbin/winbindd
 5622 ?        Z      0:00 [winbindd] <defunct>
==========================
This would be just after starting samba (no need to run wbinfo) and there will be that extra defunct winbindd process. If samba-3.6.0 (smbd, nmbd, winbindd) manages to start without creating the defunct process the wbinfo tests work.

Chris
Comment 14 Volker Lendecke 2011-08-29 22:03:01 UTC
This smells like a duplicate of Bug 8409. Can you try

https://bugzilla.samba.org/attachment.cgi?id=6813

please?
Comment 15 Chris Smith 2011-08-29 22:50:10 UTC
(In reply to comment #14)
> This smells like a duplicate of Bug 8409. Can you try
> 
> https://bugzilla.samba.org/attachment.cgi?id=6813

No difference. Got the defunct winbindd on first startup:
================================
# /etc/init.d/samba start
 * samba -> start: smbd ...                                                                             [ ok ]
 * samba -> start: nmbd ...                                                                             [ ok ]
 * samba -> start: winbind ...                                                                          [ ok ]
sartre samba # ps ax |grep winb
 1799 ?        Ss     0:00 /usr/sbin/winbindd
 1802 ?        S      0:00 /usr/sbin/winbindd
 1806 pts/4    S+     0:00 grep --colour=auto winb
sartre samba # ps ax |grep winb
 1799 ?        Ss     0:00 /usr/sbin/winbindd
 1802 ?        S      0:00 /usr/sbin/winbindd
 1807 ?        Z      0:00 [winbindd] <defunct>
 1809 pts/4    S+     0:00 grep --colour=auto winb
================================
Above commands about 1-2 seconds apart.
Comment 16 Marc Muehlfeld 2011-08-30 06:28:15 UTC
Created attachment 6835 [details]
strace of wbinfo -u
Comment 17 Marc Muehlfeld 2011-08-30 06:28:39 UTC
Created attachment 6836 [details]
winbind logs (level 10) of wbinfo -u
Comment 18 Marc Muehlfeld 2011-08-30 06:30:10 UTC
The defunct winbind process directly after startup, I can confirm, too.

# ps xf |grep winb
32438 ?        Ss     0:00 /usr/sbin/winbindd
32439 ?        S      0:00  \_ /usr/sbin/winbindd
32440 ?        S      0:00  \_ /usr/sbin/winbindd
32442 ?        Z      0:00  \_ [winbindd] <defunct>
Comment 19 Volker Lendecke 2011-08-30 06:43:37 UTC
(In reply to comment #18)
> The defunct winbind process directly after startup, I can confirm, too.
> 
> # ps xf |grep winb
> 32438 ?        Ss     0:00 /usr/sbin/winbindd
> 32439 ?        S      0:00  \_ /usr/sbin/winbindd
> 32440 ?        S      0:00  \_ /usr/sbin/winbindd
> 32442 ?        Z      0:00  \_ [winbindd] <defunct>

I don't get it.... I don't see which winbind exited and where the parent hangs. If you can directly reproduce this, can you do another test: Run winbind under strace:

strace -f -o /tmp/winbind.strace -ttT /usr/sbin/winbindd

and watch it until you see the zombie. Please also upload the simultaneous log files again, so that I can match up the strace with log entries.

Thanks,

Volker
Comment 20 Marc Muehlfeld 2011-08-30 07:43:44 UTC
Created attachment 6837 [details]
2nd winbind strace

Find attached the requested logs and strace.
Comment 21 Chris Smith 2011-08-30 13:57:06 UTC
Created attachment 6839 [details]
strace and logs as requested

Oddly enough when run under strace the defunct process disappears and wbinfo works.

Also note that the patch @ https://bugzilla.samba.org/attachment.cgi?id=6813 has been applied in this case.
=================================================
sartre samba # ps xf |grep winb
21069 pts/4    S+     0:00  |   \_ strace -f -o /tmp/winbind.strace -ttT /usr/sbin/winbindd
21082 pts/2    S+     0:00      \_ grep --colour=auto winb
21074 ?        Ss     0:00 /usr/sbin/winbindd
21075 ?        S      0:00  \_ /usr/sbin/winbindd
21080 ?        Z      0:00  \_ [winbindd] <defunct>
sartre samba # ps xf |grep winb
21069 pts/4    S+     0:00  |   \_ strace -f -o /tmp/winbind.strace -ttT /usr/sbin/winbindd
21084 pts/2    S+     0:00      \_ grep --colour=auto winb
21074 ?        Ss     0:00 /usr/sbin/winbindd
21075 ?        S      0:00  \_ /usr/sbin/winbindd
sartre samba # wbinfo -t                                                                                      
checking the trust secret for domain COMPUTERGUY via RPC calls succeeded
sartre samba # ps xf |grep winb
21069 pts/4    S+     0:00  |   \_ strace -f -o /tmp/winbind.strace -ttT /usr/sbin/winbindd
21087 pts/2    S+     0:00      \_ grep --colour=auto winb
21074 ?        Ss     0:00 /usr/sbin/winbindd
21075 ?        S      0:00  \_ /usr/sbin/winbindd
=================================================
Comment 22 Chris Smith 2011-08-30 14:31:02 UTC
Created attachment 6840 [details]
strace and logs as requested - proper failure this time

This time the defunct process did not disappear and "wbinfo -t" failed.

A couple of notes:

It appears that if I don't start smbd and nmbd, winbind does not fail - the defunct process never occurs.

I can more regularly recreate the failure if I first remove the gencache* and winbind* tdb files before restart samba.
Comment 23 Marc Muehlfeld 2011-09-14 07:05:16 UTC
Anything else we can do to help getting this issue fixed in the next version?
Comment 24 Karolin Seeger 2011-09-22 20:05:20 UTC
Is this one a showstopper for 3.6.1?
Comment 25 Chris Smith 2011-09-22 20:31:06 UTC
(In reply to comment #24)
> Is this one a showstopper for 3.6.1?

Prevents its use as a member server for several of my accounts. Would be nice to be able to use it for the supposed printing improvements.
Comment 26 Marc Muehlfeld 2011-09-22 20:57:15 UTC
For us it's definitely a showstopper. This problem is on all of our member servers and prevents us from upgrading to 3.6.

And it seems it is almost half done (see what I found in in Comment 7). Would be really nice, if it could finally be fixed in 3.6.1
Comment 27 Volker Lendecke 2011-09-23 16:10:21 UTC
(In reply to comment #26)
> For us it's definitely a showstopper. This problem is on all of our member
> servers and prevents us from upgrading to 3.6.
> 
> And it seems it is almost half done (see what I found in in Comment 7). Would
> be really nice, if it could finally be fixed in 3.6.1

If we only had a reproducer or a way to see this live..
Comment 28 Marc Muehlfeld 2011-09-25 17:24:12 UTC
Beside logfiles and traces I can create for you, I can offer you to have a remote session (e. g. teamviewer is also available for Linux), where you can have a look at it and see the problem live. Just send me a mail. I'm back in office on tuesday.
Comment 29 Karolin Seeger 2011-09-26 18:42:25 UTC
Okay, we do need to delay the release anyway. So I would like to wait another 2 days to see if there is a chance to get this one fixed also. Marking as blocker to be aware of it.
Comment 30 Michael Adam 2011-09-29 20:14:12 UTC
(In reply to comment #7)
> I wanted to do the debuggings for you and upgraded one machine to 3.6.0 again.
> Everything worked now. Even restarting doesn't break it again. But then I run
> "net cache flush" and the problem is back.

Now that is really strange.
Is that reliably reproducible?

> # wbinfo -t
> could not obtain winbind interface details: WBC_ERR_WINBIND_NOT_AVAILABLE
> could not obtain winbind domain name!
> checking the trust secret for domain (null) via RPC calls failed
> failed to call wbcCheckTrustCredentials: WBC_ERR_WINBIND_NOT_AVAILABLE
> Could not check secret
> 
> 
> # wbinfo -n MUC\\muehlfeld
> could not obtain winbind interface details: WBC_ERR_WINBIND_NOT_AVAILABLE
> could not obtain winbind separator!
> failed to call wbcLookupName: WBC_ERR_WINBIND_NOT_AVAILABLE
> Could not lookup name MUC\muehlfeld
> 
> 
> # wbinfo -s S-1-5-21-1362721961-1801182073-732966438-3122
> failed to call wbcLookupSid: WBC_ERR_WINBIND_NOT_AVAILABLE
> Could not lookup sid S-1-5-21-1362721961-1801182073-732966438-3122
> 
> 
> # wbinfo -S S-1-5-21-1362721961-1801182073-732966438-3122
> failed to call wbcSidToUid: WBC_ERR_WINBIND_NOT_AVAILABLE
> Could not convert sid S-1-5-21-1362721961-1801182073-732966438-3122 to uid
> 
> 
> Because of the "could not obtain winbind separator" message, I set
> winbind separator = ´\´
> (the default for this parameter as written in the manpage) and restarted the
> services.

Well, this is actually not the reason.
And it must be coincidence that the next set of commands
somewhat succeeded:
wbinfo does not get the separator from the config but from
winbindd itself. Now WBC_ERR_WINBIND_NOT_AVAILABLE says that
winbindd could not be contacted. Hence in particular the
separator could not be obtained.

The ´/´ is even invalid, because the ´ chars are
taken into the value, not used as quoting.

> Now I get:
> 
> # wbinfo -t
> checking the trust secret for domain MUC via RPC calls succeeded
> 
> 
> # wbinfo -n MUC\\muehlfeld
> failed to call wbcLookupName: WBC_ERR_DOMAIN_NOT_FOUND
> Could not lookup name MUC\muehlfeld

Here you already see that the separator did not work:
It could not split the domain from the name.

> # wbinfo -s S-1-5-21-1362721961-1801182073-732966438-3122
> MUC�muehlfeld 1

And here you see that the separator that winbindd delivers
is a non-ascii character.

> # wbinfo -S S-1-5-21-1362721961-1801182073-732966438-3122
> 30000
> 
> 
> 
> Also "wbinfo -u/-g" work now.
> 
> 
> 
> Because of this new information, with which settings I should provide you a new
> level 10 debug log?

So the winbind separator = ´/´ is not the solution, not even
a partial one. This is more a question of winbindd not starting
up properly.

First thing is now to clarify whether the net cache flush
thing is reproducible reliably.

Cheers - Michael
Comment 31 Marc Muehlfeld 2011-09-30 06:21:19 UTC
I did 10 restarts of the services (smbd, nmbd, winbindd) and everytime it worked, I tried "net cache flush". This are my results:


/etc/init.d/rc.smb restart
wbinfo -u         # works
net cache flush
wbinfo -u         # works


/etc/init.d/rc.smb restart
wbinfo -u         # failed


/etc/init.d/rc.smb restart
wbinfo -u         # works
net cache flush
wbinfo -u         # works


/etc/init.d/rc.smb restart
wbinfo -u         # failed


/etc/init.d/rc.smb restart
wbinfo -u         # failed


/etc/init.d/rc.smb restart
wbinfo -u         # failed


/etc/init.d/rc.smb restart
wbinfo -u         # works
net cache flush
wbinfo -u         # works


/etc/init.d/rc.smb restart
wbinfo -u         # failed


/etc/init.d/rc.smb restart
wbinfo -u         # works
net cache flush
wbinfo -u         # works


/etc/init.d/rc.smb restart
wbinfo -u         # failed




I just tried because I was curious: I deleted the content of the lock directory and started the services again. But I failed again directly.
Comment 32 Karolin Seeger 2011-10-03 17:46:58 UTC
Hi Michael,

please comment if there is a chance to get a fix for this one soon and if it's really a blocker for 3.6.1.

Thanks!

Cheers,
Karo
Comment 33 Chris Smith 2011-10-03 20:49:49 UTC
There must be some difference in winbind (or somewhere) between 3.5.11 (which always works) and 3.6.0 (which fails sporadically) that's causing this.
Comment 34 Karolin Seeger 2011-10-08 18:44:45 UTC
Michael,

please comment if there is a chance to get a fix for this one soon and if it's
really a blocker for 3.6.1.

Thanks!
Comment 35 Michael Adam 2011-10-09 20:09:06 UTC
Hi Karo,

I don't know whether we have ah chance to fix it soon,
since I don't know what circumstances trigger this behaviour.
I.e. I did not reproduce it yet.

For those who suffer from it, it seems to be very bad indeed.
Not sure whether this makes it a blocker for 3.6.1.
(I was arguing once that blocker should only exist
for .0 releases and not for .X releases with X > 1.
I don't know how this discussion was resolved, if at all...)

:-)

Cheers - Michael


(In reply to comment #34)
> Michael,
> 
> please comment if there is a chance to get a fix for this one soon and if it's
> really a blocker for 3.6.1.
> 
> Thanks!
Comment 36 Marc Muehlfeld 2011-10-21 07:33:02 UTC
I compiled the new 3.6.1. Problem is still there.
Comment 37 Chris Smith 2011-10-23 22:55:41 UTC
Disappointing. Samba-3.6.1 does not fix this issue and is unusable as a member server. Back to 3.5.11.
Comment 38 Chris Smith 2011-10-24 01:51:01 UTC
(In reply to comment #37)
> Disappointing. Samba-3.6.1 does not fix this issue and is unusable as a member
> server. Back to 3.5.11.

Want to add that it even seems worse - so far have not seen one working start of winbind at all with 3.6.1 and need to use "kill -9" to stop the processes. However I have upgraded the system I'm testing with to a a faster one (i7 2600k) so not everything is the same.
Comment 39 Chris Smith 2011-10-24 15:58:10 UTC
Decided to fiddle with this again this morning and made some possibly helpful discoveries.

I noticed that winbind had no issues if the PDC was not running. After stopping Samba on the PDC winbind started normally with no <defunct> process. So it wasn't some sort of standalone race condition with winbind itself. After stating the PDC trouble was again apparent.

Suspecting now that the issue was communication dependent between the PDC and the Member server I decided to test upgrading the PDC from its working version of 3.5.8 (worked with all member servers through 3.5.11) to 3.6.1.

After the PDC upgrade to 3.6.1 I have not (as yet) been able to reproduce the winbind problem with the 3.6.1 member server (which was recently failing 100% of the time before the PDC upgrade).

Which still leaves me really scared about an account that still has an NT4 PDC... is it really safe to even consider a 3.6.1 upgrade on its member servers ??
Comment 40 Karolin Seeger 2011-11-03 19:47:25 UTC
This is a showstopper for 3.6.2.
Could someone investigate, please?

Thanks!
Comment 41 Chris Smith 2011-11-04 16:54:48 UTC
(In reply to comment #39)
> Which still leaves me really scared about an account that still has an NT4
> PDC... is it really safe to even consider a 3.6.1 upgrade on its member servers
> ??

Finally tested this - samba-3.6.1 fails as member server to an NT4 PDC. The same winbind defunct process occurs.
Comment 42 Marc Muehlfeld 2011-11-14 08:46:17 UTC
I can confirm that, if both (PDC and meber) are running 3.6.1, then the problem doesn't appear.

In my test environment I created a script that contains all steps from comment #5 and let it run every even minute (it also contains a restart of the services and a "net cache flush"). On the PDC I additionally let the services restart every uneven minute. This script I let run for over 1h now. And everytime all steps from comment #5 succeed.

So this bug really only seems to appear when the PDC is on 3.5.x and the member on 3.6.x.
Comment 43 Chris Smith 2011-11-14 15:16:52 UTC
(In reply to comment #42)
> So this bug really only seems to appear when the PDC is on 3.5.x and the member
> on 3.6.x.

Possibly accurate to state that winbind from 3.6.[0,1] fails against a PDC running <3.6.0 including a real NT4 PDC. Although I have only tested against PDC's of 3.5.[8,11-12] and NT4.

Yesterday I installed 3.6.1 on a new system that had never run Samba previously and it failed immediately against an NT4 PDC and although many subsequent restarts would run OK the failure would eventually occur again.

The only change in the smb.conf between 3.5.x and 3.6.x versions were the removal of the deprecated parameters:
===============================
idmap backend = tdb
idmap uid = 300000-400000
idmap gid = 300000-400000
idmap config WARGAMES : backend = rid
idmap config WARGAMES : range = 2000-299999
===============================
was shortened to:
===============================
idmap config WARGAMES : backend = rid
idmap config WARGAMES : range = 2000-299999
===============================
Comment 44 Chris Smith 2011-11-14 16:49:10 UTC
Created attachment 7095 [details]
set of level 10 logs from failure with actual NT4 PDC

Can also add a set from a non-failed start up for comparison if desired.

Also note that the failure rate is (so far) 100% of Samba starts if I do a "net cache flush" after stopping Samba.
Comment 45 Volker Lendecke 2011-11-14 20:00:36 UTC
Looked at all log files again. None show any failure. winbind logs are more than just log.winbindd. They are also log.wb-* and other stuff. This is terminally unfixable with proper logon to such a box. Closing as INVALID, it seems completely impossible to get the required information.
Comment 46 Volker Lendecke 2011-11-14 20:02:13 UTC
(In reply to comment #45)
> Looked at all log files again. None show any failure. winbind logs are more
> than just log.winbindd. They are also log.wb-* and other stuff. This is
> terminally unfixable with proper logon to such a box. Closing as INVALID, it

s/with/without/, but you guessed that already.

Volker
Comment 47 Chris Smith 2011-11-14 21:46:11 UTC
(In reply to comment #45)
> Looked at all log files again. None show any failure. winbind logs are more
> than just log.winbindd. They are also log.wb-* and other stuff. This is
> terminally unfixable with proper logon to such a box. Closing as INVALID, it
> seems completely impossible to get the required information.

Don't jump to conclusions! Those were _all_ of the logs created - a full set! Maybe that's a clue of some sort. When it doesn't fail there are log.wb-* files created, for the domain, the host and builtin.

I just reran the scenario a few times and one time did get a log.wb-<DOMAINNAME> but none of  the others (<HOST> and BUILTIN).

Marc and I both can recreate this identical problem at will and it is not distro specific as we use different one. The problem is 100% repeatable when using Samba-3.6.1 as a member server against an NT4 PDC when using "net flush cache" before starting Samba for me.

I will be happy to furnish anything you ask for to the best of my ability. I will upload the newest set of fail logs that adds the log.wb-<DOMAINNAME> file, and I let this one run through an attempt to get user and group info.

Here is the console output:

darwin samba # /etc/init.d/samba start
 * samba -> start: smbd ...                                                                                   [ ok ]
 * samba -> start: nmbd ...                                                                                   [ ok ]
 * samba -> start: winbind ...                                                                                [ ok ]

darwin samba # ps ax |grep win
10148 ?        Ss     0:00 /usr/sbin/winbindd
10151 ?        S      0:00 /usr/sbin/winbindd
10153 ?        Z      0:00 [winbindd] <defunct>

darwin samba # wbinfo -u
Error looking up domain users

darwin samba # wbinfo -g
failed to call wbcListGroups: WBC_ERR_WINBIND_NOT_AVAILABLE
Error looking up domain groups

darwin samba # ls /var/log/samba/
cores  log.nmbd  log.smbd  log.wb-WARGAMES  log.winbindd  log.winbindd-dc-connect  smbd.log  winbindd.log
Comment 48 Marc Muehlfeld 2011-11-14 21:47:58 UTC
I already offered you to have a teamviewer session, so you can have a look at the machines. Just send me an mail during work hours. I can also call you back. I'm also from germany.
Comment 49 Chris Smith 2011-11-14 21:48:00 UTC
Created attachment 7096 [details]
_complete_ set of logs from failure with NT4 PDC

the set has an additional log file - the log-wb-<DOMAINNAME>
Comment 50 Volker Lendecke 2011-11-15 06:59:19 UTC
(In reply to comment #49)
> Created attachment 7096 [details]
> _complete_ set of logs from failure with NT4 PDC
> 
> the set has an additional log file - the log-wb-<DOMAINNAME>

vlendec@delphin:/tmp/wb$ grep process_request *
log.wb-WARGAMES:[2011/11/14 16:17:04.840630, 10] winbindd/winbindd_dual.c:439(child_process_request)
log.wb-WARGAMES:  child_process_request: request fn INIT_CONNECTION
log.wb-WARGAMES:[2011/11/14 16:18:59.416551, 10] winbindd/winbindd_dual.c:439(child_process_request)
log.wb-WARGAMES:  child_process_request: request fn LIST_TRUSTDOM
winbindd.log:[2011/11/14 16:18:59.416567, 10] winbindd/winbindd.c:642(process_request)
winbindd.log:  process_request: request fn INTERFACE_VERSION
winbindd.log:[2011/11/14 16:18:59.416656, 10] winbindd/winbindd.c:642(process_request)
winbindd.log:  process_request: request fn INTERFACE_VERSION

There is no attempt at all to list users, so I am at a complete loss which winbind your wbinfo command asked. One thing is 100% certain: It was NOT the winbind that you sent logfiles from.
Comment 51 Marc Muehlfeld 2011-11-15 13:56:01 UTC
Created attachment 7097 [details]
logfiles and smb.conf from a 3.6.1 member server and 3.5.12 PDC

I cleared the log directories on my PDC and member, set on both log level to 10 and started the services. Then I run "wbinfo -u" on the member and wait until it timed out.

The logs are created in my test environment, so no other machines should appear in the logs.

GENOME = PDC
TEST_Member = Member server
MUC = Domain name

The logs of both servers (log level 10) and both smb.conf are included.

PDC is 3.5.12 and member server is 3.6.1
Comment 52 Volker Lendecke 2011-11-15 14:52:38 UTC
Created attachment 7098 [details]
Patch for 3.6

Can you try the attached (completely untested) patch?

Thanks,

Volker
Comment 53 Marc Muehlfeld 2011-11-15 15:03:11 UTC
Could you see any unusual in my logs?

I compiled 3.6.1 with your patch. I run "net cache flush" and started the services. But "wbinfo -u" still time out.
Comment 54 Volker Lendecke 2011-11-15 15:25:46 UTC
Well, I saw a NT_STATUS_NO_MEMORY which is the wrong error message to an empty local domain. The patch should have fixed that, but apparently it did not. As I said, to me this seems an unfixable defect. Way too many roundtrips for a pro-bono volunteer project.
Comment 55 Volker Lendecke 2011-11-15 15:29:23 UTC
Probably impossible, but the same set of log files as in together with two sets of network traces on both the client and the server might give some additional clues. For detailed information how to create useful network traces, please see http://wiki.samba.org/index.php/Capture_Packets.
Comment 56 Marc Muehlfeld 2011-11-15 15:41:26 UTC
Created attachment 7099 [details]
Logs and network traces of 3.5.12 PDC and 3.6.1 member

Find attached a .tar.bz2 that contains (each from PDC and member):
- level 10 debug log
- smb.con
- network trace


Included are: Start of the services on both machines, "wbinfo -u" on the client until timeout.
Comment 57 Volker Lendecke 2011-11-15 15:59:06 UTC
Trace_PDC contains -- NOTHING. I would have liked to see whether frame 65 from Trace_Member arrives on the PDC. But to me it seems that you have traced the wrong network interface.
Comment 58 Marc Muehlfeld 2011-11-15 16:14:14 UTC
Created attachment 7102 [details]
Logs and network traces of 3.5.12 PDC and 3.6.1 member

You are right. I had the wrong interface on the PDC. Sorry.

Find attached the same as before. Just with new created traces and logs.
Comment 59 Chris Smith 2011-11-16 18:00:53 UTC
(In reply to comment #50)
> There is no attempt at all to list users, so I am at a complete loss which
> winbind your wbinfo command asked. One thing is 100% certain: It was NOT the
> winbind that you sent logfiles from.

All of the logs were deleted before the run, and I uploaded every single one created. My guess is that wbinfo is somehow attempting to gather data via the <defunct> winbind process and this is possibly preventing the log information you're expecting to see.

Maybe we're looking in the wrong spot. The thing to keep in mind is that after starting Samba, if there is a <defunct> winbind process (100% against an NT4 PDC if net cache flush is run before starting) the error has already occurred - there is no need to run wbinfo as it is a 100% foregone conclusion that it will fail.
Comment 60 Volker Lendecke 2011-11-16 18:59:32 UTC
Marc, got your fault reproduced.
Comment 61 Volker Lendecke 2011-11-16 20:30:43 UTC
Created attachment 7109 [details]
samlogon packet from 3.5
Comment 62 Volker Lendecke 2011-11-16 20:33:03 UTC
Created attachment 7110 [details]
samlogon packet from 3.6

This packet has a problem: The alignment after the SID is not correct. That's why the 3.5 DC discards this packet from 3.6, which leads to the initial timeouts. When winbind was able to figure out the DC name in a different way then everything is fine. Need to dig deeper why ndr_push_NETLOGON_SAM_LOGON_REQUEST in librpc/ndr/ndr_nbt.c (hand-marshalled) gets the alignment wrong.
Comment 63 Volker Lendecke 2011-11-16 20:33:42 UTC
BTW, I have not been able to reproduce this so far because I have run winbind without nmbd all the time. Without nmbd that code path is never executed.
Comment 64 Chris Smith 2011-11-16 20:56:21 UTC
(In reply to comment #63)
> BTW, I have not been able to reproduce this so far because I have run winbind
> without nmbd all the time. Without nmbd that code path is never executed.

Verified that without running nmbd the winbind bug does not occur against an NT4 PDC.
Comment 65 Volker Lendecke 2011-11-17 21:42:07 UTC
Created attachment 7114 [details]
Patch for 3.6

This is what I've pushed to master
Comment 66 Stefan Metzmacher 2011-11-17 22:19:29 UTC
Created attachment 7115 [details]
draft patch

I think ndr_pull_NETLOGON_SAM_LOGON_REQUEST() has the same problem.

Maybe can fix this more genericly in ndr_set_flags(), but I'd need check all callers.

If this takes to long we can add manual fixes to
ndr_push_NETLOGON_SAM_LOGON_REQUEST and ndr_pull_NETLOGON_SAM_LOGON_REQUEST.
Comment 67 Stefan Metzmacher 2011-11-17 22:22:00 UTC
Volker, I think for now we should add your fix also to ndr_pull_NETLOGON_SAM_LOGON_REQUEST()
Comment 68 Chris Smith 2011-11-17 22:24:04 UTC
(In reply to comment #65)
> Created attachment 7114 [details]
> Patch for 3.6

This does resolve the winbind bug with 3.6.1 as a member server to an NT4 PDC.
Comment 69 Volker Lendecke 2011-11-18 06:48:00 UTC
Comment on attachment 7114 [details]
Patch for 3.6

Ok, I stand to my comment: This is not fixable. My patch does not survive autobuild. I really don't know anymore what to do.
Comment 70 Marc Muehlfeld 2011-11-18 08:12:09 UTC
If I just apply the patch from Stefan, it don't work. But if I also apply Volkers patch, wbinfo seems to work like it should.


Is our setup so special, that we are the only two who are having this problem?
Do others with a 3.5 PDC and 3.6 Member don't have this problem?

At the moment I can't switch the PDC also to 3.6, because I discovered a bug with printing drivers which is a showstopper for me.
Comment 71 Volker Lendecke 2011-11-18 09:46:06 UTC
Created attachment 7117 [details]
Patch for 3.6

This is the equivalent of what ended up in master.
Comment 72 Stefan Metzmacher 2011-11-18 10:01:41 UTC
Comment on attachment 7117 [details]
Patch for 3.6

Looks good for 3.6
Comment 73 Stefan Metzmacher 2011-11-18 10:03:06 UTC
(In reply to comment #72)
> Comment on attachment 7117 [details]
> Patch for 3.6
> 
> Looks good for 3.6

Karolin, please pick this for the next release
Comment 74 Stefan Metzmacher 2011-11-18 10:05:32 UTC
Created attachment 7118 [details]
new draft patch

Marc, can you also test if this patch would fix it too?

A working capture against a real NT4 DC would also be very useful.
Comment 75 Marc Muehlfeld 2011-11-18 10:33:14 UTC
Looks good, Stefan. Thanks.


Maybe Chris can create a capture against a NT4 PDC (we fortunately just have a samba PDC).
Comment 76 Stefan Metzmacher 2011-11-18 12:23:47 UTC
(In reply to comment #75)
> Looks good, Stefan. Thanks.

Ok, thanks.

> Maybe Chris can create a capture against a NT4 PDC (we fortunately just have a
> samba PDC).

Chris, could do create a capture please.
See https://wiki.samba.org/index.php/Capture_Packets
Comment 77 Chris Smith 2011-11-18 14:45:23 UTC
(In reply to comment #76)
> Chris, could do create a capture please.
> See https://wiki.samba.org/index.php/Capture_Packets

Which patch(es)? Should both 7117 and 7118 be applied?
Comment 78 Chris Smith 2011-11-18 17:08:42 UTC
Created attachment 7124 [details]
tcpdump capture against NT4 PDC

(In reply to comment #76)
> Chris, could do create a capture please.
> See https://wiki.samba.org/index.php/Capture_Packets

Hopefully this is useful - if not I'll head back to the drawing board. This is the capture against an NT4 PDC after starting Samba-3.6.1 with all the (non-obsolete) patches available in the bugs listed in #8595 except for bugs #8600 & #8592 as they do not properly apply (maybe against master but not against 3.6.1) included, of course patches 7117 and 7118 from this bug.

Started the capture on the member server, started Samba (smbd, nmbd, winbind), ran wbinfo -u and -g.

I can easily adjust which patches get applied if necessary.
Comment 79 Karolin Seeger 2011-12-06 18:46:55 UTC
Re-assigning to metze for further investigation.
Please note that I did not push the patch yet.
I think it makes sense to wait until the bug can be closed.
Please let me know if I should push the patch right now.

Thanks!
Comment 80 Stefan Metzmacher 2011-12-07 11:07:44 UTC
Comment on attachment 7117 [details]
Patch for 3.6

Karolin, please pick this patch for the next 3.6 release, it's a correct fix for this bug report.
Comment 81 Marc Muehlfeld 2011-12-07 11:12:40 UTC
Thanks for the final fix. Good work.

I'm happy to hear, that it will be included in the next 3.6 version. I hope the other printer related bugs are also fixed soon (like #8612) - then I can finally switch all my samba servers to 3.6 :-)
Comment 82 Karolin Seeger 2011-12-11 18:51:38 UTC
(In reply to comment #80)
> Comment on attachment 7117 [details]
> Patch for 3.6
> 
> Karolin, please pick this patch for the next 3.6 release, it's a correct fix
> for this bug report.

Hi Metze,

done. Thanks for clarification!

Closing out bug report.
Please feel free to re-open if it's still an issue.

Thanks!