I run a group of RHEL6-based RADIUS servers which use samba's ntlm_auth to do back-end authentication (and authorization, via group membership checking) against Active Directory. There are multiple Radiator processes running in parallel, each with its own invocation of e.g. /usr/bin/ntlm_auth --helper-protocol=ntlm-server-1 --require-membership-of='UOFI\CITES Radius Probe User' I have several automated monitoring probes which periodically log in as user cites-radius-probe (a member of the above group) to check that everything is working. Recently, these attempts have been sporadically failing even though the correct password is being supplied! I was able to capture debug-level output from one Radiator process during this time, showing the attributes passed to and received from its ntlm_auth process: Sun Dec 28 10:25:03 2014 894727: DEBUG: Radius::AuthNTLM looks for match with cites-radius-probe [cites-radius-probe] Sun Dec 28 10:25:03 2014 895599: DEBUG: Passing attribute Password:: XXX Sun Dec 28 10:25:03 2014 896364: DEBUG: Passing attribute NT-Domain:: VU9GSQ== Sun Dec 28 10:25:03 2014 897030: DEBUG: Passing attribute Username:: Y2l0ZXMtcmFkaXVzLXByb2Jl Sun Dec 28 10:25:03 2014 911745: DEBUG: Received attribute: Authenticated: Yes Sun Dec 28 10:25:03 2014 912507: DEBUG: Received attribute: . Sun Dec 28 10:25:03 2014 913165: DEBUG: Radius::AuthNTLM ACCEPT: : cites-radius-probe [cites-radius-probe] and a few seconds later, Sun Dec 28 10:26:03 2014 880835: DEBUG: Radius::AuthNTLM looks for match with cites-radius-probe [cites-radius-probe] Sun Dec 28 10:26:03 2014 881739: DEBUG: Passing attribute Password:: XXX Sun Dec 28 10:26:03 2014 882458: DEBUG: Passing attribute NT-Domain:: VU9GSQ== Sun Dec 28 10:26:03 2014 883105: DEBUG: Passing attribute Username:: Y2l0ZXMtcmFkaXVzLXByb2Jl Sun Dec 28 10:26:03 2014 900488: DEBUG: Received attribute: Authenticated: No Sun Dec 28 10:26:03 2014 901212: DEBUG: Received attribute: . Sun Dec 28 10:26:03 2014 901825: WARNING: NTLM Could not authenticate user: Unknown Sun Dec 28 10:26:03 2014 902632: DEBUG: Radius::AuthNTLM REJECT: AuthBy NTLM Password check failed: cites-radius-probe [cites-radius-probe] where the Password string from the original output (redacted here to "XXX") is the same for both attempts, but the result is different. The Active Directory domain controller logs these failures as e.g. Microsoft-Windows-Security-Audit 4776: The computer attempted to validate the credentials for an account. Authentication Package: MICROSOFT_AUTHENTICATION_PACKAGE_V1_0 Logon Account: cites-radius-probe Source Workstation: \\CITES-RADDEV1 Error Code: 0xC000006A where Error Code: 0xC000006A is Microsoftese for "wrong password", so there is no question of the _response_ being misinterpreted. These leaves three broad possibilities: 1. ntlm_auth/winbind is sporadically sending the wrong request to the AD domain controller, even though it receives the right input from Radiator. 2. The AD domain controller is sporadically returning Authenticated: No instead of Authenticated: Yes even though the request is valid. 3. Radiator's debug output is lying about the attributes which are being passed to ntlm_auth (I consider this possibility to be vanishingly unlikely, but include it for completeness). Here's the thing: a search of the log history from our AD domain controllers reveals that the problem began quite distinctly on the exact day when we first upgraded the RADIUS servers to Red Hat EL 6.6, which included an update of the samba packages from 3.6.9-169.el6_5 to 3.6.23-12.el6. (Prior to this day there were no occurrences, and ever since then we have seen many thousand per day among our several RADIUS servers). The Radiator software did not change, nor were any specific changes made to our Active Directory on this day, so I am fairly convinced that the problem was introduced in Samba. I can't reproduce the problem at will, but it happens with fair frequency in my environment, and yesterday I was able to capture some occurrences on one of my dev servers with smb.conf set to log level = 10. At least during that time period, every failure I saw happened when there were two (or three) requests simultaneously "in flight". However, at other times in the same log there were multiple requests which all succeeded. log.winbindd examples: [2015/02/05 14:31:04.515311, 10] winbindd/winbindd.c:645(process_request) process_request: Handling async request 12678:PAM_AUTH [2015/02/05 14:31:04.515433, 3] winbindd/winbindd_pam_auth.c:54(winbindd_pam_auth_send) [12678]: pam auth UOFI\cites-radius-probe [2015/02/05 14:31:04.520357, 10] winbindd/winbindd.c:645(process_request) process_request: Handling async request 12677:PAM_AUTH [2015/02/05 14:31:04.520417, 3] winbindd/winbindd_pam_auth.c:54(winbindd_pam_auth_send) [12677]: pam auth UOFI\cites-radius-probe [2015/02/05 14:31:04.536381, 10] winbindd/winbindd.c:707(wb_request_done) wb_request_done[12677:PAM_AUTH]: NT_STATUS_OK [2015/02/05 14:31:04.536509, 10] winbindd/winbindd.c:768(winbind_client_response_written) winbind_client_response_written[12677:PAM_AUTH]: delivered response to client [2015/02/05 14:31:04.542861, 10] winbindd/winbindd.c:707(wb_request_done) wb_request_done[12678:PAM_AUTH]: NT_STATUS_WRONG_PASSWORD [2015/02/05 14:31:04.542947, 10] winbindd/winbindd.c:768(winbind_client_response_written) winbind_client_response_written[12678:PAM_AUTH]: delivered response to client ... [2015/02/05 16:16:11.690383, 10] winbindd/winbindd.c:645(process_request) process_request: Handling async request 12924:PAM_AUTH [2015/02/05 16:16:11.690539, 3] winbindd/winbindd_pam_auth.c:54(winbindd_pam_auth_send) [12924]: pam auth UOFI\cites-radius-probe [2015/02/05 16:16:11.695125, 10] winbindd/winbindd.c:645(process_request) process_request: Handling async request 12904:PAM_AUTH [2015/02/05 16:16:11.695185, 3] winbindd/winbindd_pam_auth.c:54(winbindd_pam_auth_send) [12904]: pam auth UOFI\cites-radius-probe [2015/02/05 16:16:11.708018, 10] winbindd/winbindd.c:645(process_request) process_request: Handling async request 12923:PAM_AUTH [2015/02/05 16:16:11.708136, 3] winbindd/winbindd_pam_auth.c:54(winbindd_pam_auth_send) [12923]: pam auth UOFI\cites-radius-probe [2015/02/05 16:16:11.712165, 10] winbindd/winbindd.c:707(wb_request_done) wb_request_done[12923:PAM_AUTH]: NT_STATUS_OK [2015/02/05 16:16:11.712247, 10] winbindd/winbindd.c:768(winbind_client_response_written) winbind_client_response_written[12923:PAM_AUTH]: delivered response to client [2015/02/05 16:16:11.723760, 10] winbindd/winbindd.c:707(wb_request_done) wb_request_done[12904:PAM_AUTH]: NT_STATUS_WRONG_PASSWORD [2015/02/05 16:16:11.723895, 10] winbindd/winbindd.c:768(winbind_client_response_written) winbind_client_response_written[12904:PAM_AUTH]: delivered response to client [2015/02/05 16:16:11.730934, 10] winbindd/winbindd.c:707(wb_request_done) wb_request_done[12924:PAM_AUTH]: NT_STATUS_OK [2015/02/05 16:16:11.731048, 10] winbindd/winbindd.c:768(winbind_client_response_written) winbind_client_response_written[12924:PAM_AUTH]: delivered response to client [2015/02/05 16:16:11.736317, 10] winbindd/winbindd.c:645(process_request) process_request: Handling async request 12924:PAM_AUTH [2015/02/05 16:16:11.736395, 3] winbindd/winbindd_pam_auth.c:54(winbindd_pam_auth_send) [12924]: pam auth UOFI\cites-radius-probe [2015/02/05 16:16:11.739058, 10] winbindd/winbindd.c:707(wb_request_done) wb_request_done[12924:PAM_AUTH]: NT_STATUS_OK [2015/02/05 16:16:11.739137, 10] winbindd/winbindd.c:768(winbind_client_response_written) winbind_client_response_written[12924:PAM_AUTH]: delivered response to client My full smb.conf consists of: [global] security = ads netbios name = CITES-RADDEV2 realm = AD.UILLINOIS.EDU workgroup = UOFI password server = radiusdc02.ad.uillinois.edu, radiusdc01.ad.uillinois.edu log level = 10 syslog = 0 # allow multiple simultaneous connections to domain controller winbind max domain connections = 50 # do not establish connections to DCs of other trusted domains allow trusted domains = no # never change machine account password machine password timeout = 0
There is another dimension to this problem which was not reflected in my original report. The first failure mode (above) is visible to the client, resulting in NT_STATUS_WRONG_PASSWORD and Authenticated: No. The second failure mode produces a wrong password error code in the domain controller logs, followed by a successful authentication in the domain controller logs, but manifests only as a single successful authentication in the Samba logs (and the client gets back Authenticated: Yes). For example, compare these logs from the AD domain controller: 2015-02-09T11:46:41-06:00 Microsoft-Windows-Security-Audit 4776: The computer attempted to validate the credentials for an account. Authentication Package: MICROSOFT_AUTHENTICATION_PACKAGE_V1_0 Logon Account: cites-radius-probe Source Workstation: \\CITES-RADDEV2 Error Code: 0xC000006A 2015-02-09T11:46:41-06:00 Microsoft-Windows-Security-Audit 4776: The computer attempted to validate the credentials for an account. Authentication Package: MICROSOFT_AUTHENTICATION_PACKAGE_V1_0 Logon Account: cites-radius-probe Source Workstation: \\CITES-RADDEV2 Error Code: 0x0 with the corresponding output in log.winbindd (still with log level = 10): [2015/02/09 11:46:41.747114, 10] winbindd/winbindd.c:645(process_request) process_request: Handling async request 12924:PAM_AUTH [2015/02/09 11:46:41.747245, 3] winbindd/winbindd_pam_auth.c:54(winbindd_pam_auth_send) [12924]: pam auth UOFI\cites-radius-probe [2015/02/09 11:46:41.761732, 10] winbindd/winbindd.c:707(wb_request_done) wb_request_done[12924:PAM_AUTH]: NT_STATUS_OK [2015/02/09 11:46:41.761835, 10] winbindd/winbindd.c:768(winbind_client_response_written) winbind_client_response_written[12924:PAM_AUTH]: delivered response to client The AD log gives the impression of a failure followed by a success (as if Samba were transparently retrying the attempt). But the Samba log shows nothing whatsoever to distinguish this case from a single completely error-free authentication! Note also that in this case there were _not_ multiple requests simultaneously "in flight", only one. Finally: last week (Feb 5) I temporarily reverted one of my two dev servers to the 3.6.9-169.el6_5 samba packages. During Feb 6 through Feb 8, the reverted server generated zero occurrences of this problem in the domain controller logs, while the server running 3.6.23-12.el6 generated 33,702 occurrences. I am now 100% convinced that the Samba upgrade is the cause.
This should be fixed in 4.2.0, please reopen if it happens there.