Bug 8373 - Can't join XP Pro workstations to 3.6.1 DC
Status: RESOLVED FIXED
Product: Samba 3.6
Classification: Unclassified
Component: Domain Control
Version: 3.6.1
Hardware: x64 Linux
Importance: P3 regression
Target Milestone: ---
Assigned To: Jeremy Allison
QA Contact: Samba QA Contact
Depends on: 8371
Blocks: 8595
Reported: 2011-08-12 09:15 UTC by Guillaume MENGUY
Modified: 2012-07-05 11:37 UTC (History)

Attachments
clientside_domain_join_failure_3.6.0.pcap (59.41 KB, application/octet-stream), 2011-08-13 10:04 UTC, Guillaume MENGUY
clientside_domain_join_success_3.6.0pre1.pcap (1.57 MB, application/octet-stream), 2011-08-13 10:04 UTC, Guillaume MENGUY
serverside_domain_join_failure_3.6.0.pcap (16.88 KB, application/octet-stream), 2011-08-13 10:05 UTC, Guillaume MENGUY
serverside_domain_join_success_3.6.0pre1.pcap (50.32 KB, application/octet-stream), 2011-08-13 10:05 UTC, Guillaume MENGUY
nmbd logging of error (113.04 KB, text/plain), 2012-01-06 18:34 UTC, Max Senft
wireshark logging of domain join error (9.88 KB, application/octet-stream), 2012-01-06 18:36 UTC, Max Senft
Server side 3.6.2 PDC domain join failure (41.26 KB, application/octet-stream), 2012-02-08 13:28 UTC, Guillaume MENGUY
Client side Windows XP domain join failure on 3.6.2 PDC (30.01 KB, application/octet-stream), 2012-02-08 13:29 UTC, Guillaume MENGUY
patch for a workaround (297 bytes, patch), 2012-05-23 14:06 UTC, Torsten
Test patch to understand the issue. (427 bytes, patch), 2012-05-23 23:55 UTC, Jeremy Allison
WARNING ! Experimental patch - also to investigate the problem.. (1.46 KB, patch), 2012-05-24 04:38 UTC, Jeremy Allison
WARNING ! Experimental code ! - More elegant experimental patch. (1.06 KB, patch), 2012-05-24 05:23 UTC, Jeremy Allison
Experimental patch for master and 3.6 (1.65 KB, patch), 2012-05-25 12:36 UTC, Stefan Metzmacher
Fix for 3.6.next (1.99 KB, patch), 2012-05-25 22:31 UTC, Jeremy Allison, abartlet: review+

Description Guillaume MENGUY 2011-08-12 09:15:23 UTC
My production DC is running Samba 3.6.0pre1 just fine (CentOS 5.3, OpenLDAP backend). I can join any Windows XP or 7 workstations.
When I compile and use a newer Samba version (3.6.0pre2, pre3, rc1, rc2, final), I can't join new workstations to the domain. I compile it using the same command on the same server.
I get no information in any logfile, and it doesn't create a log.hostname file.
When trying to join the domain, Windows XP asks for the domain administrator user name and password. Then, after a minute, I get a message saying that it couldn't find a domain controller for that domain.
Thanks for your help.

Guillaume
Comment 1 Volker Lendecke 2011-08-12 13:44:54 UTC
Please provide a network trace of the successful and the failed attempt to join the domain, taken on the DC. See http://wiki.samba.org/index.php/Capture_Packets.

Volker
Comment 2 Guillaume MENGUY 2011-08-13 10:04:22 UTC
Created attachment 6783
clientside_domain_join_failure_3.6.0.pcap
Comment 3 Guillaume MENGUY 2011-08-13 10:04:37 UTC
Created attachment 6784
clientside_domain_join_success_3.6.0pre1.pcap
Comment 4 Guillaume MENGUY 2011-08-13 10:05:03 UTC
Created attachment 6785
serverside_domain_join_failure_3.6.0.pcap
Comment 5 Guillaume MENGUY 2011-08-13 10:05:22 UTC
Created attachment 6786
serverside_domain_join_success_3.6.0pre1.pcap
Comment 6 Guillaume MENGUY 2011-08-13 10:07:01 UTC
Attached 4 captures (both client and server side, success on 3.6.0pre1 and failure on 3.6.0).

Client name = tg1320442
Client IP = 172.20.33.100
domain=TG
DC = 172.20.0.2 TRANSGENE (transgene.transgene.fr)


tg1320442$ account exists in LDAP.

Guillaume
Comment 7 Guillaume MENGUY 2011-09-05 16:32:20 UTC
(In reply to comment #1)
> Please provide a network trace of the successful and the failed attempt to join
> the domain, taken on the DC. See
> http://wiki.samba.org/index.php/Capture_Packets.
> 
> Volker

Trace posted.
Thx.

Guillaume
Comment 8 Guillaume MENGUY 2011-10-25 13:04:07 UTC
Still not working in 3.6.1.
Comment 9 Stefan Metzmacher 2011-11-18 14:02:00 UTC
Guillaume, can you please test if this fix
https://attachments.samba.org/attachment.cgi?id=7117
from
https://bugzilla.samba.org/show_bug.cgi?id=8371
fixes your problem?
Comment 10 Guillaume MENGUY 2011-11-22 14:05:09 UTC
(In reply to comment #9)
> Guillaume, can you please test if this fix
> https://attachments.samba.org/attachment.cgi?id=7117
> from
> https://bugzilla.samba.org/show_bug.cgi?id=8371
> fixes your problem?

Hi,

Unfortunately it doesn't.
I also tried this one in addition and did not get better results:
https://attachments.samba.org/attachment.cgi?id=7118
Comment 11 Volker Lendecke 2011-12-07 13:12:25 UTC
To me it seems that in the serverside domain join failure frame 70 is not correctly being responded to. This would be the task of nmbd. Can you do a new network trace and post a debug level 10 log of nmbd for this failure?

Thanks,

Volker Lendecke
Comment 12 Max Senft 2012-01-06 18:34:42 UTC
Created attachment 7232
nmbd logging of error

See line 711 for the first occurrence of the error.
Comment 13 Max Senft 2012-01-06 18:36:38 UTC
Created attachment 7233
wireshark logging of domain join error

This is the wireshark capture for the logfile just submitted.
Comment 14 Max Senft 2012-01-06 18:37:18 UTC
Hi,

I hope my current problem fits this bug. But I guess it does. Have a look at the added log.nmbd and the wireshark capture that I have uploaded a few moments ago. I hope this problem can be solved soon...

Max
Comment 15 Volker Lendecke 2012-01-06 21:01:07 UTC
I'm lost, sorry. Looked at the traces, I do see the error. I installed a fresh XP and a 3.6.1 based domain with your data (domain name and computer name) and it just works. Do you have weird locale settings on your DC?

Volker
Comment 16 Max Senft 2012-01-06 22:15:58 UTC
Hi. What is your definition of weird? ;) If I understand the error message "Conversion error: Illegal multibyte sequence()" correctly, there should be *some* character between the brackets?! But there is nothing?

Max
Comment 17 Max Senft 2012-01-06 22:48:55 UTC
Hi again ... I don't know if this also is connected to this bug. But there is a problem when trying to log in to *another* XP machine which still is registered inside the domain: I get a message box indicating that the domain controller is not available or the computer account cannot be found.

There is no indication of any type of error inside the log.nmbd, but inside syslog there are the following entries:

Jan  6 23:27:28 server smbd[31419]: [2012/01/06 23:27:28.514480,  0] rpc_server/netlogon/srv_netlog_nt.c:976(_netr_ServerAuthenticate3)
Jan  6 23:27:28 server smbd[31419]:   _netr_ServerAuthenticate3: netlogon_creds_server_check failed. Rejecting auth request from client XPDATEV machine account XPDATEV$
Jan  6 23:27:28 server smbd[31419]: [2012/01/06 23:27:28.535395,  0] rpc_server/netlogon/srv_netlog_nt.c:976(_netr_ServerAuthenticate3)
Jan  6 23:27:28 server smbd[31419]:   _netr_ServerAuthenticate3: netlogon_creds_server_check failed. Rejecting auth request from client XPDATEV machine account XPDATEV$

Also, log.xpdatev is full of entries for this timestamp. Do you think this might help?

Max
Comment 18 zil 2012-01-27 16:49:01 UTC
I get the same error ("Illegal multibyte sequence") with locale reporting en_US.utf8 for all variables on the samba server machine. smb.conf has "dos charset = 850" and "unix charset = ISO8859-1", but I don't know if those matter in this context. Downgraded to 3.5.11 and it works fine.
Comment 19 Guillaume MENGUY 2012-02-08 13:26:58 UTC
(In reply to comment #11)
> To me it seems that in the serverside domain join failure frame 70 is not
> correctly being responded to. This would be the task of nmbd. Can you do a new
> network trace and post a debug level 10 log of nmbd for this failure?
> 
> Thanks,
> 
> Volker Lendecke

3.6.2 doesn't fix the problem. Posting traces...
Comment 20 Guillaume MENGUY 2012-02-08 13:28:14 UTC
Created attachment 7304
Server side 3.6.2 PDC domain join failure
Comment 21 Guillaume MENGUY 2012-02-08 13:29:13 UTC
Created attachment 7305
Client side Windows XP domain join failure on 3.6.2 PDC
Comment 22 Torsten 2012-05-23 14:06:32 UTC
(In reply to comment #16)
> Hi. What is your definition of weird? ;) If I understand the error message
> "Conversion error: Illegal multibyte sequence()" correctly, there should be *some*
> character between the brackets?! But there is nothing?
> 
> Max

Hello,

I have come across this problem with a W2008 Server. In fact I was unable to change my user password. This would also explain why I could not (re)join the domain: the password gets changed during joining, and if this fails, all fails.

After digging a while (as just this one server shows this problem, all other Windows computers I tried did work fine) I have tried to create a patch / workaround for charcnv.c. I did not test very much so far, but my Vista is working again. I will attach a diff file with my code.

-- Torsten.
Comment 23 Torsten 2012-05-23 14:06:58 UTC
Created attachment 7584
patch for a workaround
Comment 24 Karolin Seeger 2012-05-23 18:56:03 UTC
Making this one a blocker for 3.6.6 as several people hit this issue.
Comment 25 Jeremy Allison 2012-05-23 23:55:13 UTC
Created attachment 7585
Test patch to understand the issue.

Can you test the attached patch please ? Note that you'll have to apply this *AFTER* the autogenerated pidl step, as it modifies an autogenerated file.

This will not be a final patch; I'm just trying to understand the problem and get more data on what is wrong.

Please report back to me asap as this is a blocker for the next 3.6.x.

Thanks !

Jeremy.
Comment 26 Jeremy Allison 2012-05-24 04:37:50 UTC
It looks like a bug in parsing the NBT netlogon
packet, inside the function: ndr_pull_nbt_netlogon_packet().

I looked closely, and I found an interesting thing.

These functions are auto-generated in 3.6.x via pidl,
but have been removed from auto-generation in master
with a note that:

        /* These responses are all handled manually, as they cannot be encoded in IDL fully
           See push_nbt_netlogon_response()
        */

Which was commit b782b5ed by Andrew Bartlett.
Curiouser and curiouser :-). This fix wasn't
back-ported into v3-6-test btw. (Actually this
fix was a simple comment addition; the actual
fix was 2f5a1d2b1cfdbfc3d4c7c1e96d1ed061e7970f88:

     Manually handle the NETLOGON_SAM_LOGON_REQUEST too.

    With the sid structure being both optional and aligned, it was too
    hard to do this in just IDL.

also not back-ported into 3.6.x.)

Looking at bug #8373 *really* closely what it looks
like is that when parsing nbt_netlogon_query_for_pdc,
whose idl looks like this:

        /* query for pdc request */
        typedef struct {
                astring              computer_name;
                astring              mailslot_name;
                [flag(NDR_ALIGN2)]   DATA_BLOB _pad;
                nstring              unicode_name;
                netlogon_nt_version_flags               nt_version;
                uint16               lmnt_token;
                uint16               lm20_token;
        } nbt_netlogon_query_for_pdc;

we mess up on parsing out the :

        nstring              unicode_name;

field, as in the associated capture file from the
bug it shows that the mailslot_name ends on an odd
boundary and is then aligned with a zero byte (to
match the [flag(NDR_ALIGN2)]   DATA_BLOB _pad; field).

We get an iconv error which I believe is due to the
offset being stuck at the padding zero.
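
To illustrate the offset arithmetic described above, here is a minimal sketch in C. It assumes the two astring fields are NUL-terminated ASCII strings and that the nstring has to start on an even offset; the helper names (pad_bytes, skip_astring, unicode_name_offset) are hypothetical and are not Samba functions. When the alignment is ignored, the computed offset lands on the padding zero instead of the first UTF-16 byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of padding bytes needed to bring `offset` up to a multiple of
 * `align` (align must be a power of two). */
static size_t pad_bytes(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (align - (offset & (align - 1))) & (align - 1);
}

/* Skip one NUL-terminated ASCII string starting at `offset` and return the
 * offset just past its terminator, or (size_t)-1 if none is found. */
static size_t skip_astring(const uint8_t *buf, size_t len, size_t offset)
{
    const uint8_t *nul;

    if (offset >= len) {
        return (size_t)-1;
    }
    nul = memchr(buf + offset, 0, len - offset);
    if (nul == NULL) {
        return (size_t)-1;
    }
    return (size_t)(nul - buf) + 1;
}

/* Hypothetical helper: locate the UTF-16 name that follows two ASCII
 * strings. With honor_align == 0 the pad byte is not skipped, which
 * models the faulty behaviour described in the comment above. */
static size_t unicode_name_offset(const uint8_t *buf, size_t len,
                                  size_t start, int honor_align)
{
    size_t off = skip_astring(buf, len, start);

    if (off == (size_t)-1) {
        return off;
    }
    off = skip_astring(buf, len, off);
    if (off == (size_t)-1) {
        return off;
    }
    if (honor_align) {
        off += pad_bytes(off, 2); /* step over the single zero pad byte */
    }
    return off;
}
```

For a buffer holding "AB\0", "M\0", one pad byte and then UTF-16LE data, the second string ends at offset 5. Honoring the alignment gives offset 6 for the Unicode name; ignoring it gives offset 5, which is the pad byte itself.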

So, why isn't the :

        [flag(NDR_ALIGN2)]   DATA_BLOB _pad;

having the desired effect ? The chain of IDL looks
like:

nbt_netlogon_query_for_pdc is enclosed by :

        typedef [nodiscriminant] union {
                [case(LOGON_REQUEST)]  NETLOGON_LOGON_REQUEST logon0;
                [case(LOGON_SAM_LOGON_REQUEST)]       NETLOGON_SAM_LOGON_REQUEST logon;
                [case(LOGON_PRIMARY_QUERY)] nbt_netlogon_query_for_pdc pdc;
                [case(NETLOGON_ANNOUNCE_UAS)] NETLOGON_DB_CHANGE uas;
        } nbt_netlogon_request;

which itself is enclosed by :

        typedef [flag(NDR_NOALIGN),public] struct {
                netlogon_command command;
                [switch_is(command)] nbt_netlogon_request req;
        } nbt_netlogon_packet;

Note the flag(NDR_NOALIGN) assignment to the
nbt_netlogon_packet struct. It turns out that
setting flag(NDR_NOALIGN) on a structure affects
*all* enclosed sub-marshalling/unmarshalling calls
when called from code marshalling/unmarshalling
this struct.

Looking carefully into our NBT functions I found code
to hand-marshall similar structures such as :

ndr_push_NETLOGON_SAM_LOGON_REQUEST()

where we have :

                        uint32_t _flags_save_DATA_BLOB = ndr->flags;
                        ndr->flags &= ~LIBNDR_FLAG_NOALIGN;
                        ndr_set_flags(&ndr->flags, LIBNDR_FLAG_ALIGN4);
                        NDR_CHECK(ndr_push_DATA_BLOB(ndr, NDR_SCALARS, r->_pad));
                        ndr->flags = _flags_save_DATA_BLOB;

We're hand-unsetting LIBNDR_FLAG_NOALIGN here, as it
turns out that ndr_set_flags() only ever OR's given
flags into the ndr->flags field (with some complex
rules).

So why do we have to do this ? Right now it turns out
that when we set flag(NDR_NOALIGN) in the definition
of the nbt_netlogon_packet struct, this is recursive
and means that we don't align all the way down when
marshalling or unmarshalling - which is what we want.

Until, that is, we hit the flag(NDR_ALIGN2) on the
DATA_BLOB _pad definition. The NDR_ALIGN2 bit is
set in the generated code via ndr_set_flags(), but
as the LIBNDR_FLAG_NOALIGN is already set from the
calling code, and setting this bit does not reset
the LIBNDR_FLAG_NOALIGN it means it is completely
ignored when evaluating the alignment. Thus the
_pad blob alignment generation has no effect, and
we end up being stuck on the offset of the padding
zero.
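
The interaction can be modelled in a few lines of C. This is a simplified sketch based only on the description above: the DEMO_* bit values are illustrative and are not Samba's definitions, and demo_set_flags_or_only() and demo_pad_length() are hypothetical stand-ins, not the real ndr_set_flags() or DATA_BLOB code. It assumes that the padding calculation lets the no-align bit take precedence over any align bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit values only; these are NOT Samba's real definitions. */
#define DEMO_FLAG_NOALIGN (1u << 0)
#define DEMO_FLAG_ALIGN2  (1u << 1)
#define DEMO_FLAG_ALIGN4  (1u << 2)
#define DEMO_FLAG_ALIGN8  (1u << 3)

/* Hypothetical setter that only ORs new bits in and never clears old ones. */
static void demo_set_flags_or_only(uint32_t *pflags, uint32_t new_flags)
{
    assert(pflags != NULL);
    *pflags |= new_flags;
}

/* Hypothetical padding calculation in which the no-align bit takes
 * precedence over any ALIGNn bit (an assumption of this model). */
static size_t demo_pad_length(uint32_t flags, size_t offset)
{
    size_t align = 1;

    if (flags & DEMO_FLAG_NOALIGN) {
        return 0; /* the alignment request is ignored */
    }
    if (flags & DEMO_FLAG_ALIGN8) {
        align = 8;
    } else if (flags & DEMO_FLAG_ALIGN4) {
        align = 4;
    } else if (flags & DEMO_FLAG_ALIGN2) {
        align = 2;
    }
    return (align - (offset & (align - 1))) & (align - 1);
}
```

Starting from flags that already contain the no-align bit, OR-ing in the 2-byte align bit leaves both bits set, so the model returns 0 bytes of padding at odd offset 5. Clearing the no-align bit by hand, as the hand-written marshalling code does, makes the same call return 1.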

Sorry for this being so long, but the upshot of all
this is I think that the flags

LIBNDR_FLAG_ALIGN2|LIBNDR_FLAG_ALIGN4|LIBNDR_FLAG_ALIGN8

which are collectively defined as LIBNDR_ALIGN_FLAGS,
should be mutually exclusive with LIBNDR_FLAG_NOALIGN,
in that when you set LIBNDR_FLAG_NOALIGN, the bits
of LIBNDR_ALIGN_FLAGS should be removed, and when you
set any of the LIBNDR_ALIGN_FLAGS bits, LIBNDR_FLAG_NOALIGN
should be removed.

If we do this correctly I think it then allows the
nbt_netlogon_query_for_pdc struct to be correctly
marshalled/unmarshalled by the gen_ndr generated
code. I'm wondering if it also may remove the
need for some of the hand-generation of this code
that got put into master ?

I will attach the *VERY PRELIMINARY* patch for evaluation,
not that I think it's a valid one - yet !
Comment 27 Jeremy Allison 2012-05-24 04:38:43 UTC
Created attachment 7588
WARNING ! Experimental patch - also to investigate the problem..
Comment 28 Jeremy Allison 2012-05-24 05:23:51 UTC
Created attachment 7589
WARNING ! Experimental code ! - More elegant experimental patch.
Comment 29 Stefan Metzmacher 2012-05-24 06:46:04 UTC
(In reply to comment #28)
> Created attachment 7589
> WARNING ! Experimental code ! - More elegant experimental patch.

See comment # 10, 

https://attachments.samba.org/attachment.cgi?id=7118
doesn't seem to fix it. (maybe the tester didn't regenerate the pidl output...)

The only difference to your patch is that your patch
resets LIBNDR_FLAG_NOALIGN when LIBNDR_FLAG_REMAINING is set.

        if (new_flags & LIBNDR_FLAG_REMAINING) {
                (*pflags) &= ~LIBNDR_ALIGN_FLAGS;
        }

I don't know which one is the better patch (maybe yours),
but the important thing is that we may need to review a lot of code.
Comment 30 Torsten 2012-05-24 07:26:04 UTC
(In reply to comment #28)
> Created attachment 7589
> WARNING ! Experimental code ! - More elegant experimental patch.

Hello, I have tested the patch, and it seems to fix the problem.

I have also tested the patch for comment #25, and this worked, too.

-- Torsten
Comment 31 Stefan Metzmacher 2012-05-24 09:18:07 UTC
(In reply to comment #30)
> (In reply to comment #28)
> > Created attachment 7589
> > WARNING ! Experimental code ! - More elegant experimental patch.
> 
> Hello, I have tested the patch, and it seems to fix the problem.
> 
> I have also tested the patch for comment #25, and this worked, too.

Good, could you also check if
https://attachments.samba.org/attachment.cgi?id=7117
and/or
https://attachments.samba.org/attachment.cgi?id=7118
also fix this (which would mean they were not tested correctly).

Thanks!
Comment 32 Torsten 2012-05-24 11:03:14 UTC
(In reply to comment #31)

> Good, could you also check if
> https://attachments.samba.org/attachment.cgi?id=7117
> and/or
> https://attachments.samba.org/attachment.cgi?id=7118
> also fix this (which would mean they were not tested correctly).
> 
> Thanks!

Hi again,

I tried attachment 7117, but could not apply it (I am actually using the 3.6.5 release source).
The other one, 7118, did help with the problem.

-- Torsten.
Comment 33 c986124 2012-05-24 11:16:36 UTC
I'm not sure if this helps at all, but I ran into this problem too and found that if the NetBIOS name exceeds 8 characters, you are unable to join the domain.
By the way, this is without any patch applied: currently I'm only using the precompiled Debian testing amd64 packages (3.6.5).
Comment 34 Guillaume MENGUY 2012-05-24 11:25:59 UTC
(In reply to comment #28)
> Created attachment 7589
> WARNING ! Experimental code ! - More elegant experimental patch.

Hi,
This patch is working on our installation and fixes the problem, both for Windows XP & 7 joining the domain.

Hope it will be slotted into the next production release.
Thanks !

Guillaume
Comment 35 Stefan Metzmacher 2012-05-24 11:54:56 UTC
(In reply to comment #32)
> (In reply to comment #31)
> 
> > Good, could you also check if
> > https://attachments.samba.org/attachment.cgi?id=7117
> > and/or
> > https://attachments.samba.org/attachment.cgi?id=7118
> > also fix this (which would mean they were not tested correctly).
> > 
> > Thanks!
> 
> Hi again,
> 
> I tried attachment 7117, but could not apply it (I am actually using the
> 3.6.5 release source).
> The other one, 7118, did help with the problem.

Ok, thanks for testing!
Comment 36 Stefan Metzmacher 2012-05-25 12:36:01 UTC
Created attachment 7596
Experimental patch for master and 3.6

I've discussed the problem with Günther.
It seems the solution is to make all alignment related flags
mutually exclusive (also the NDR_REMAINING flag).

This needs QA testing to make sure that it doesn't break
any unrelated code paths.
Comment 37 Stefan Metzmacher 2012-05-25 12:37:05 UTC
Comment on attachment 7596
Experimental patch for master and 3.6

Also applies to v3-6-test
Comment 38 Jeremy Allison 2012-05-25 16:06:56 UTC
I've been doing a lot of investigation of this (I spent the entire day yesterday on it) and I can't see any breakage. I'm going to modify your change to add my comment and then push to master and re-upload here for 3.6.next.

(I think my comment is needed - look how long it took to track this down - I don't want to have to do that again :-).

Jeremy.
Comment 39 Jeremy Allison 2012-05-25 22:31:30 UTC
Created attachment 7599
Fix for 3.6.next

Patch that went into master. Applies cleanly to 3.6.x.
Comment 40 Stefan Metzmacher 2012-05-26 09:19:34 UTC
Comment on attachment 7599
Fix for 3.6.next

Andrew, can you run wintest with the current master?
Comment 41 Andrew Bartlett 2012-05-30 00:35:21 UTC
Comment on attachment 7599
Fix for 3.6.next

The wintest run on master was successful for test-s3.py and ran as usual (failure in DNS update) for test-s4-howto.py
Comment 42 Karolin Seeger 2012-05-31 19:02:12 UTC
Pushed to v3-6-test.
Closing out bug report.

Thanks!