Created attachment 11148 [details] log.wb-MUC (level 10 debug log) Since we've upgraded our member servers from 4.1.17 to 4.2.2, the log files on these machines get flooded with the following errors (6000-7000 per day on each server): [2015/06/12 06:17:31.230322, 3, pid=70786, effective(0, 0), real(0, 0), class=rpc_cli] ../source3/rpc_client/cli_pipe.c:1803(rpc_pipe_bind_step_one_done) rpc_pipe_bind: host allel.muc.medizinische-genetik.de bind request returned NT_STATUS_BUFFER_TOO_SMALL [2015/06/12 06:17:31.230350, 0, pid=70786, effective(0, 0), real(0, 0), class=rpc_cli] ../source3/rpc_client/cli_pipe.c:3065(cli_rpc_pipe_open_schannel_with_key) cli_rpc_pipe_open_schannel_with_key: rpc_pipe_bind failed with error NT_STATUS_BUFFER_TOO_SMALL ... [2015/06/12 06:17:31.230982, 3, pid=70786, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_cm.c:3015(cm_connect_netlogon) Could not open schannel'ed NETLOGON pipe. Error was NT_STATUS_BUFFER_TOO_SMALL [2015/06/12 06:17:31.231091, 3, pid=70786, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_pam.c:1322(winbind_samlogon_retry_loop) Could not open handle to NETLOGON pipe (error: NT_STATUS_BUFFER_TOO_SMALL, attempts: 0) [2015/06/12 06:17:31.231131, 3, pid=70786, effective(0, 0), real(0, 0), class=winbind] ../source3/winbindd/winbindd_pam.c:1352(winbind_samlogon_retry_loop) The connection to netlogon failed, retrying Find attached a level 10 debug log
This bug has a serious side effect: Users can't login any more e. g. via ssh, when this error comes up (and it does often here). Winbindd is configured in /etc/nsswitch.conf to retrieve the users from AD. SSH enabled users can login to the machine. But when this problem appears, ssh logins are denied and every try is logged in /var/log/messages with: Jun 15 16:03:31 storage-03 sshd[70478]: error: Received disconnect from 10.1.0.254: 14: No supported authentication methods available [preauth] Jun 15 16:03:32 storage-03 winbindd[60705]: [2015/06/15 16:03:32.836993, 0, pid=60705] ../source3/rpc_client/cli_pipe.c:3065(cli_rpc_pipe_open_schannel_with_key) Jun 15 16:03:32 storage-03 winbindd[60705]: cli_rpc_pipe_open_schannel_with_key: rpc_pipe_bind failed with error NT_STATUS_BUFFER_TOO_SMALL If I login local as root, the username resolving (e. g. id username) works fine. It seems the password isn't validated anymore. To temporary workaround this issue, I must login local as root and restart winbindd: # pkill winbindd # winbindd
Can you also get a wireshark trace when this is going on ?
Created attachment 11159 [details] network trace, messages and log.wb-DOMAIN > This bug has a serious side effect: Users can't login any more e. g. via ssh, > when this error comes up (and it does often here). I must correct this: Today I saw this messages comming up, and could login at the same time with an AD user. Also I saw on a different machine with 4.2.2, where the ssh login didn't work suddenly, that login start working again, if I waited a few minutes (without restarting winbindd). > Can you also get a wireshark trace when this is going on? Find attached a network trace, /var/log/messages and log.wb-DOMAIN captured while the cli_rpc_pipe_open_schannel_with_key: rpc_pipe_bind failed with error NT_STATUS_BUFFER_TOO_SMALL messages were logged. The log.wb-DOMAIN is this times just a level 1 log. If I turn up the log level and restart/reload winbindd, the NT_STATUS_BUFFER_TOO_SMALL is gone for a while. I hope the previous attached level 10 debug log is fine, even if it doesn't match with this network trace.
Created attachment 11167 [details] Packet capture, messages, log.wb-DOMAIN Re-attaching packet capture, /var/log/messages and log.wb-DOMAIN. This time the log.wb-DOMAIN logfile was captured with level 10 enabled and done at the same time than the other two files.
Because of this problem in 4.2 I downgraded a while ago to 4.1 again and the error was gone. Yesterday I updated to 4.3.1 and it starts flooding my logs again: [2015/10/21 09:10:39.990940, 0, pid=67178] ../source3/rpc_client/cli_pipe.c:3170(cli_rpc_pipe_open_schannel_with_creds) cli_rpc_pipe_open_schannel_with_creds: rpc_pipe_bind failed with error NT_STATUS_BUFFER_TOO_SMALL This occurs more than 1150 times only during the last 5 hours!
(In reply to Marc Muehlfeld from comment #5) The attached capture doesn't contain the traffic between winbindd and the domain controller. What Kind of DC is exon.muc.medizinische-genetik.de?
Created attachment 11517 [details] new packet capture (In reply to Stefan Metzmacher from comment #6) > The attached capture doesn't contain the traffic between winbindd and > the domain controller. I did a new capture with the following filter, that should have everything included: # tcpdump -p -s 0 -w trace.pcap host 192.168.29.9 or host 192.168.29.2 The IPs are the ones of the DCs on that AD site (allel + exon) The capture was done, while the following eight errors entries occure in log.wb-MUC: [2015/10/21 10:01:40.618009, 1, pid=67178] ../librpc/ndr/ndr.c:578(ndr_pull_error) ndr_pull_error(11): Pull bytes 1 (../librpc/ndr/ndr_basic.c:79) [2015/10/21 10:01:40.618072, 0, pid=67178] ../source3/rpc_client/cli_pipe.c:3170(cli_rpc_pipe_open_schannel_with_creds) cli_rpc_pipe_open_schannel_with_creds: rpc_pipe_bind failed with error NT_STATUS_BUFFER_TOO_SMALL [2015/10/21 10:01:40.695364, 1, pid=67178] ../librpc/ndr/ndr.c:578(ndr_pull_error) ndr_pull_error(11): Pull bytes 1 (../librpc/ndr/ndr_basic.c:79) [2015/10/21 10:01:40.695429, 0, pid=67178] ../source3/rpc_client/cli_pipe.c:3170(cli_rpc_pipe_open_schannel_with_creds) cli_rpc_pipe_open_schannel_with_creds: rpc_pipe_bind failed with error NT_STATUS_BUFFER_TOO_SMALL [2015/10/21 10:01:40.763428, 1, pid=67178] ../librpc/ndr/ndr.c:578(ndr_pull_error) ndr_pull_error(11): Pull bytes 1 (../librpc/ndr/ndr_basic.c:79) [2015/10/21 10:01:40.763484, 0, pid=67178] ../source3/rpc_client/cli_pipe.c:3170(cli_rpc_pipe_open_schannel_with_creds) cli_rpc_pipe_open_schannel_with_creds: rpc_pipe_bind failed with error NT_STATUS_BUFFER_TOO_SMALL [2015/10/21 10:01:40.866210, 1, pid=67178] ../librpc/ndr/ndr.c:578(ndr_pull_error) ndr_pull_error(11): Pull bytes 1 (../librpc/ndr/ndr_basic.c:79) [2015/10/21 10:01:40.866267, 0, pid=67178] ../source3/rpc_client/cli_pipe.c:3170(cli_rpc_pipe_open_schannel_with_creds) cli_rpc_pipe_open_schannel_with_creds: rpc_pipe_bind failed with error NT_STATUS_BUFFER_TOO_SMALL > What Kind of DC is exon.muc.medizinische-genetik.de? All DCs are Samba 4.1.19. Two (exon + allel) of them are in the same AD site as this domain member server.
I think I partly understand the problem. We got the ndr error because we don't have f73ef3028c4f4583c81b611a9714608eae79360c in v4-1. It means we can't parse a BIND_NAK response. The reason for the BIND_NAK might appear in the logs of the DC. It seems that the member has a valid netlogon_creds_cli.tdb but the doesn't know about the session key in its schannel_store.tdb. Can you provide logs and captures of all related servers from a fresh winbindd restart. Once winbindd is stopped check if someone else has netlogon_creds_cli.tdb still open. and then remove that file. If that works retry the same without removing the file.
(In reply to Stefan Metzmacher from comment #8) > We got the ndr error because we don't have > f73ef3028c4f4583c81b611a9714608eae79360c in v4-1. > It means we can't parse a BIND_NAK response. Does this mean, if the DC is running v4-2 or later, the problem won't exist? In that case we should close this bug as WONTFIX, because v4-1 is already in security-only mode. For me this won't be a problem, because it is anyway planned to update the DCs to 4.3.1 during the next days. Can you please confirm or deny?
(In reply to Marc Muehlfeld from comment #9) I'd guess so, but I haven't tested it myself. I think we should fix v4-2 and v4-3 in order to cope with 4.1/4.0 dcs.
(In reply to Stefan Metzmacher from comment #10) > I think we should fix v4-2 and v4-3 in order to cope with 4.1/4.0 dcs. Ok. Then I'll wait with the DC updates and come back with the requested debug stuff asap.
Created attachment 11518 [details] Possible patch for master
Your patch works. The error flood is gone. Thanks. The only message I saw - but just once - was: [2015/10/22 04:40:50.921599, 1, pid=22529] ../source3/rpc_client/cli_pipe.c:470(cli_pipe_validate_current_pdu) ../source3/rpc_client/cli_pipe.c:470: Bind NACK received from host allel.muc.medizinische-genetik.de!
(In reply to Marc Muehlfeld from comment #13) Yes, this means server and client got out of sync with the negotiated session key and the clients needs to recover, which failed before because of the unexpected NT_STATUS_BUFFER_TOO_SMALL error code.
Created attachment 11527 [details] Patch for v4-3-test
Created attachment 11528 [details] Patches for v4-2-test
Pushed to autobuild-v4-[2|3]-test.
(In reply to Karolin Seeger from comment #17) Pushed to both branches. Closing out bug report. Thanks!