Bug 14611 (CVE-2021-20251) - CVE-2021-20251 [SECURITY] Bad password count not incremented atomically
Summary: CVE-2021-20251 [SECURITY] Bad password count not incremented atomically
Status: RESOLVED FIXED
Alias: CVE-2021-20251
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services (show other bugs)
Version: 4.12.2
Hardware: All All
: P1 critical (vote)
Target Milestone: ---
Assignee: Jule Anger
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-01-08 18:44 UTC by Jeremy Allison
Modified: 2022-09-25 09:19 UTC (History)
7 users (show)

See Also:


Attachments
Initial patch (2.54 KB, patch)
2021-01-08 18:45 UTC, Jeremy Allison
no flags Details
git-am fix for master. (2.66 KB, patch)
2021-01-11 20:13 UTC, Jeremy Allison
no flags Details
git-am fix for master. (2.93 KB, patch)
2021-01-11 20:20 UTC, Jeremy Allison
no flags Details
git-am fix for master. (3.00 KB, patch)
2021-01-11 21:31 UTC, Jeremy Allison
no flags Details
git-am fix for master. (3.28 KB, patch)
2021-01-12 04:51 UTC, Jeremy Allison
no flags Details
Proposed patch for bad password path (4.53 KB, patch)
2021-01-28 02:30 UTC, Gary Lockyer
gary: ci-passed+
Details
s3/s4 combined patch, v1 (71.02 KB, patch)
2021-02-26 03:42 UTC, Douglas Bagnall
no flags Details
patch for master (v1) (71.37 KB, patch)
2021-03-02 02:40 UTC, Andrew Bartlett
abartlet: review-
Details
patch for 4.14 (v1) (71.37 KB, patch)
2021-03-02 02:41 UTC, Andrew Bartlett
no flags Details
patch for 4.13 (v1) (71.37 KB, patch)
2021-03-02 02:41 UTC, Andrew Bartlett
no flags Details
patch for 4.12 (v1) (71.25 KB, patch)
2021-03-02 02:42 UTC, Andrew Bartlett
no flags Details
patch for 4.11 (v1) (71.29 KB, patch)
2021-03-02 02:43 UTC, Andrew Bartlett
no flags Details
WIP patch for master (115.23 KB, patch)
2021-03-30 09:57 UTC, Andrew Bartlett
no flags Details
patch for master v2 (170.11 KB, patch)
2022-07-06 00:15 UTC, Joseph Sutton
no flags Details
patch for master v3 (174.35 KB, patch)
2022-07-06 03:46 UTC, Joseph Sutton
no flags Details
patch for master v4 (198.33 KB, patch)
2022-07-12 08:05 UTC, Joseph Sutton
jsutton: ci-passed-
Details
patch for master v5 (203.06 KB, patch)
2022-07-21 06:07 UTC, Joseph Sutton
jsutton: ci-passed+
Details
patch for master v7 (245.87 KB, patch)
2022-08-02 05:08 UTC, Joseph Sutton
jsutton: ci-passed-
Details
patch for master v8 (245.87 KB, patch)
2022-08-02 08:24 UTC, Joseph Sutton
no flags Details
patch for master v9 (249.55 KB, patch)
2022-08-03 02:47 UTC, Joseph Sutton
jsutton: ci-passed+
Details
patch for 4.17 v9 (255.05 KB, patch)
2022-09-13 08:01 UTC, Joseph Sutton
jsutton: review? (abartlet)
asn: review+
jsutton: ci-passed+
Details
proposed patch for 4.16 v9 (269.73 KB, patch)
2022-09-16 00:57 UTC, Joseph Sutton
abartlet: review+
jsutton: ci-passed+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jeremy Allison 2021-01-08 18:44:06 UTC
Date: Thu, 7 Jan 2021 19:08:59 +0000
From: Nathaniel Turner <nturner@exagrid.com>
To: "security@samba.org" <security@samba.org>
Subject: Bad password count not incremented atomically

[-- Attachment #1 [details] --]
[-- Type: multipart/alternative, Encoding: 7bit, Size: 14K --]

Hi folks,

I was doing a little "defense in depth" work, with the goal of limiting an attacker's ability to brute force a samba u
ser's password (should they get past other defenses, such as firewalls and "hosts allow" configurations).

In the process of validating my new configuration, I found that samba's bad password count logic appears to have a sim
ple race condition: Each connection appears to get separate process, and each process loads, increments, and saves the
 bad password count without any coordination. So with many clients attacking, the "bad lockout attempt" limit can be e
xceeded by a significant margin.

I did my simple tests with these settings, but the exact numbers don't really matter:


pdbedit -P "bad lockout attempt" -C 10

pdbedit -P "lockout duration" -C 1

pdbedit -P "reset count minutes" -C 1

The attached "test-smb-auth-rate-limits" bash script is a trimmed version of a larger test suite, but hopefully gets the idea across. When run against a pretty low powered VM, I see the lockout attempt limit exceeded by about 10,000% (we get 1066 auth attempts within a minute, instead of 10):

$ test-smb-auth-rate-limits nturner vm-000-4698.maas.local smb-0
2021-01-07 18:15:41+00:00 Test: 'attack_smb nturner vm-000-4698.maas.local smb-0 20 240'
2021-01-07 18:16:54+00:00     199 do_connect: Connection to vm-000-4698.maas.local failed (Error NT_STATUS_IO_DEVICE_E
RROR)
2021-01-07 18:16:54+00:00    9592 do_connect: Connection to vm-000-4698.maas.local failed (Error NT_STATUS_IO_TIMEOUT)
2021-01-07 18:16:54+00:00     255 protocol negotiation failed: NT_STATUS_CONNECTION_RESET
2021-01-07 18:16:54+00:00    3665 protocol negotiation failed: NT_STATUS_IO_TIMEOUT
2021-01-07 18:16:54+00:00    5223 session setup failed: NT_STATUS_ACCOUNT_LOCKED_OUT
2021-01-07 18:16:54+00:00    1066 session setup failed: NT_STATUS_LOGON_FAILURE
2021-01-07 18:16:54+00:00 landed 1066 attempts out of 20000 (running total is now 1066/20)
2021-01-07 18:16:54+00:00 FAIL: 'attack_smb nturner vm-000-4698.maas.local smb-0 20 240' took 73 seconds (must take at
 least 120)
2021-01-07 18:16:54+00:00 FAIL: 1 tests failed

When run against a much beefier (32 processor, fast RAID) machine, it's not as bad, but the limit is still exceeded by more than 100%:
$ test-smb-auth-rate-limits nturner 063ts17-avta181002305.maas.local smb-0
2021-01-07 18:19:21+00:00 Test: 'attack_smb nturner 063ts17-avta181002305.maas.local smb-0 20 240'
2021-01-07 18:20:11+00:00   19977 session setup failed: NT_STATUS_ACCOUNT_LOCKED_OUT
2021-01-07 18:20:11+00:00      23 session setup failed: NT_STATUS_LOGON_FAILURE
2021-01-07 18:20:11+00:00 landed 23 attempts out of 20000 (running total is now 23/20)
2021-01-07 18:20:11+00:00 FAIL: 'attack_smb nturner 063ts17-avta181002305.maas.local smb-0 20 240' took 50 seconds (mu
st take at least 120)
2021-01-07 18:20:11+00:00 FAIL: 1 tests failed

The attached patch, which hacks check_sam_security to serialize all child processes via a lock file, while very silly,
 does prevent overshooting the configured limits.

A couple questions:

  1.  My test hack just open codes a call to lockf. What's the "right" way to do interprocess coordination like this in samba?
  2.  Instead of serializing the entire authentication path, could we perhaps reduce the critical section to everything after the call to sam_password_ok(), and add a call to reload the sampass structure from the live DB as the first thing in that critical section? (I can try that, but I'm wondering if there are any known reasons why say, reloading sampass here would be a bad idea.)


Nathaniel Turner
Comment 1 Jeremy Allison 2021-01-08 18:44:51 UTC
Created attachment 16385 [details]
Test script
Comment 2 Jeremy Allison 2021-01-08 18:45:17 UTC
Created attachment 16386 [details]
Initial patch
Comment 3 Jeremy Allison 2021-01-11 20:13:01 UTC
Created attachment 16390 [details]
git-am fix for master.

Nate, can you test this fix against your setup ? I'd like to see if I'm on the right track here.

Thanks !

Jeremy.
Comment 4 Jeremy Allison 2021-01-11 20:20:36 UTC
Created attachment 16391 [details]
git-am fix for master.

Slightly better version that checks the error after the account reload.
Comment 5 Jeremy Allison 2021-01-11 21:31:31 UTC
Created attachment 16392 [details]
git-am fix for master.

Only re-load the user info if we successfully got the mutex.
Comment 6 Jeremy Allison 2021-01-12 04:51:12 UTC
Created attachment 16393 [details]
git-am fix for master.

Version #3. Improve the granularity of the mutex. We can make the mutex specific to the user name itself, not a global mutex for the sam account database.
Comment 7 Nathaniel W. Turner 2021-01-13 03:38:34 UTC
Jeremy,

I tested with the latest version (v3) and this shows much better results than without. I'm seeing far fewer NT_STATUS_LOGON_FAILURE returns from smbclient per minute. I am still seeing more than 10, but I suspect that part of my test is too simplistic. I would not be surprised to find that this status is returned in paths where the authentication did not make it as far as the password check.

$ test-smb-auth-rate-limits nturner vm-000-4698.maas.local smb-0
2021-01-12 21:58:50-05:00 Test: 'attack_smb nturner vm-000-4698.maas.local smb-0 20 240'
2021-01-12 22:00:32-05:00     230 Connection to vm-000-4698.maas.local failed (Error NT_STATUS_IO_DEVICE_ERROR)
2021-01-12 22:00:32-05:00   11819 Connection to vm-000-4698.maas.local failed (Error NT_STATUS_IO_TIMEOUT)
2021-01-12 22:00:32-05:00     759 protocol negotiation failed: NT_STATUS_CONNECTION_RESET
2021-01-12 22:00:32-05:00    1163 protocol negotiation failed: NT_STATUS_IO_TIMEOUT
2021-01-12 22:00:32-05:00    3794 session setup failed: NT_STATUS_ACCOUNT_LOCKED_OUT
2021-01-12 22:00:32-05:00    1912 session setup failed: NT_STATUS_IO_TIMEOUT
2021-01-12 22:00:32-05:00     323 session setup failed: NT_STATUS_LOGON_FAILURE
2021-01-12 22:00:32-05:00 landed 323 attempts out of 20000 (running total is now 323/20)
2021-01-12 22:00:32-05:00 FAIL: 'attack_smb nturner vm-000-4698.maas.local smb-0 20 240' took 102 seconds (must take at least 120)
2021-01-12 22:00:32-05:00 FAIL: 1 tests failed

$ test-smb-auth-rate-limits nturner 063ts17-avta181002305.maas.local smb-0
2021-01-12 22:03:44-05:00 Test: 'attack_smb nturner 063ts17-avta181002305.maas.local smb-0 20 240'
2021-01-12 22:04:48-05:00   19980 session setup failed: NT_STATUS_ACCOUNT_LOCKED_OUT
2021-01-12 22:04:48-05:00      20 session setup failed: NT_STATUS_LOGON_FAILURE
2021-01-12 22:04:48-05:00 landed 20 attempts out of 20000 (running total is now 20/20)
2021-01-12 22:04:48-05:00 FAIL: 'attack_smb nturner 063ts17-avta181002305.maas.local smb-0 20 240' took 64 seconds (must take at least 120)
2021-01-12 22:04:48-05:00 FAIL: 1 tests failed
Comment 8 Nathaniel W. Turner 2021-01-13 03:49:53 UTC
Looking closer at the results with the patch on the faster machine, I think 20 attempts in 64 seconds is correct. If the first 10 failures occur within say, 1 second, the clock for ending the lockout starts then, which leaves a few seconds for 10 more attempts at the end of that time window. (I think I'd initially assumed the lockout clock didn't start until later, which in retrospect is silly.)

So my test is now passing (with the proposed patch) on my faster machine.

I'm still not sure why the slower machine is seeing 323 session setup failed: NT_STATUS_LOGON_FAILURE results within 102 seconds. (I'll see if I can hack smbd to return a more unique status when an actual password check has failed to convince myself that these are not in fact all landed password check attempts.)
Comment 9 Nathaniel W. Turner 2021-01-13 14:23:52 UTC
OK, hacking nt_status_squash to be a no-op in smbd (and hacking the test to look for NT_STATUS_WRONG_PASSWORD instead of NT_STATUS_LOGON_FAILURE) shows that with the patch, account lockout appears to be working exactly as designed now:

Test with nt_status_squash hack (but without fix):

$ test-smb-auth-rate-limits nturner vm-000-4698.maas.local smb-0
2021-01-13 14:20:56+00:00 Test: 'attack_smb nturner vm-000-4698.maas.local smb-0 30 240'
2021-01-13 14:22:11+00:00     186 do_connect: Connection to vm-000-4698.maas.local failed (Error NT_STATUS_IO_DEVICE_ERROR)
2021-01-13 14:22:11+00:00   10040 do_connect: Connection to vm-000-4698.maas.local failed (Error NT_STATUS_IO_TIMEOUT)
2021-01-13 14:22:11+00:00     329 protocol negotiation failed: NT_STATUS_CONNECTION_RESET
2021-01-13 14:22:11+00:00    2555 protocol negotiation failed: NT_STATUS_IO_TIMEOUT
2021-01-13 14:22:11+00:00    5927 session setup failed: NT_STATUS_ACCOUNT_LOCKED_OUT
2021-01-13 14:22:11+00:00     371 session setup failed: NT_STATUS_LOGON_FAILURE
2021-01-13 14:22:11+00:00     592 session setup failed: NT_STATUS_WRONG_PASSWORD
2021-01-13 14:22:11+00:00 landed 592 attempts out of 20000 (running total is now 592/30)
2021-01-13 14:22:11+00:00 FAIL: 'attack_smb nturner vm-000-4698.maas.local smb-0 30 240' took 75 seconds (must take at least 120)
2021-01-13 14:22:11+00:00 FAIL: 1 tests failed


Test with nt_status_squash hack AND fix:

$ test-smb-auth-rate-limits nturner vm-000-4698.maas.local smb-0
2021-01-13 14:09:33+00:00 Test: 'attack_smb nturner vm-000-4698.maas.local smb-0 30 240'
2021-01-13 14:10:46+00:00     215 do_connect: Connection to vm-000-4698.maas.local failed (Error NT_STATUS_IO_DEVICE_ERROR)
2021-01-13 14:10:46+00:00   10801 do_connect: Connection to vm-000-4698.maas.local failed (Error NT_STATUS_IO_TIMEOUT)
2021-01-13 14:10:46+00:00     331 protocol negotiation failed: NT_STATUS_CONNECTION_RESET
2021-01-13 14:10:46+00:00    2482 protocol negotiation failed: NT_STATUS_IO_TIMEOUT
2021-01-13 14:10:46+00:00    4712 session setup failed: NT_STATUS_ACCOUNT_LOCKED_OUT
2021-01-13 14:10:46+00:00    1449 session setup failed: NT_STATUS_LOGON_FAILURE
2021-01-13 14:10:46+00:00      10 session setup failed: NT_STATUS_WRONG_PASSWORD
2021-01-13 14:10:47+00:00 landed 10 attempts out of 20000 (running total is now 10/30)
2021-01-13 14:12:04+00:00     189 do_connect: Connection to vm-000-4698.maas.local failed (Error NT_STATUS_IO_DEVICE_ERROR)
2021-01-13 14:12:04+00:00   10568 do_connect: Connection to vm-000-4698.maas.local failed (Error NT_STATUS_IO_TIMEOUT)
2021-01-13 14:12:04+00:00     298 protocol negotiation failed: NT_STATUS_CONNECTION_RESET
2021-01-13 14:12:04+00:00    1743 protocol negotiation failed: NT_STATUS_IO_TIMEOUT
2021-01-13 14:12:04+00:00    6817 session setup failed: NT_STATUS_ACCOUNT_LOCKED_OUT
2021-01-13 14:12:04+00:00     375 session setup failed: NT_STATUS_LOGON_FAILURE
2021-01-13 14:12:04+00:00      10 session setup failed: NT_STATUS_WRONG_PASSWORD
2021-01-13 14:12:04+00:00 landed 10 attempts out of 20000 (running total is now 20/30)
2021-01-13 14:13:19+00:00     197 do_connect: Connection to vm-000-4698.maas.local failed (Error NT_STATUS_IO_DEVICE_ERROR)
2021-01-13 14:13:19+00:00   10792 do_connect: Connection to vm-000-4698.maas.local failed (Error NT_STATUS_IO_TIMEOUT)
2021-01-13 14:13:19+00:00     466 protocol negotiation failed: NT_STATUS_CONNECTION_RESET
2021-01-13 14:13:19+00:00    1546 protocol negotiation failed: NT_STATUS_IO_TIMEOUT
2021-01-13 14:13:19+00:00    6748 session setup failed: NT_STATUS_ACCOUNT_LOCKED_OUT
2021-01-13 14:13:19+00:00     224 session setup failed: NT_STATUS_LOGON_FAILURE
2021-01-13 14:13:19+00:00      27 session setup failed: NT_STATUS_WRONG_PASSWORD
2021-01-13 14:13:19+00:00 landed 27 attempts out of 20000 (running total is now 47/30)
2021-01-13 14:13:19+00:00 PASS: 'attack_smb nturner vm-000-4698.maas.local smb-0 30 240' took 226 seconds (more than 120; good)
2021-01-13 14:13:19+00:00 SUCCESS: all tests passed
Comment 10 Jeremy Allison 2021-01-13 18:41:15 UTC
Great ! Thanks for the update. Now I think I need a simple test I can add to our regression test suite so we can ensure this stays fixed. It'll be a little difficult as it's a race condition issue, which can mean flaky reproduction but I'm guessing if I set the limit low enough I should be able to get a reasonably reliable reproducer.

As far as I can see this only affects smbd/Samba3/local pdb accounts, not Samba4-AD accounts as that correctly does the atomic updates on bad-password counts. Do you concur ?
Comment 11 Nathaniel W. Turner 2021-01-13 23:40:58 UTC
By Samba4-AD, I think you are referring to the way samba manages its user DB when running as an AD domain controller. I'm not at all familiar with that code, but a quick look at source4/auth/ntlm/auth_sam.c makes me wonder if it has a similar problem:

If the password is bad (and not a prior good password), we do this in authsam_password_check_and_record():

    nt_status = authsam_update_bad_pwd_count(auth_context->sam_ctx,
                         msg, domain_dn);

This eventually calls into dsdb_update_bad_pwd_count(). It's not obvious to me how this is atomic, but even if it is, the code back out in the end of authsam_password_check_and_record() looks like it is going to return NT_STATUS_WRONG_PASSWORD even if we find we have exceeded the bad password count threshold. So if we have 1000 processes running through here at the same time, it looks like many of them could get past the samdb_result_passwords check at the beginning of that function before the lockout flag gets set. At that point, the caller will be able to tell if the password was good or not, because they'll see NT_STATUS_WRONG_PASSWORD (mapped to NT_STATUS_LOGON_FAILURE) instead of NT_STATUS_ACCOUNT_LOCKED_OUT (which would mask the fact that the password was checked and didn't match). Am I reading this right?

As for a test case for the current patch, yeah, this is a tricky thing to test reliably. It will probably vary based on the hardware used, but for what it's worth, my test VM has 2 cores and a virtual disk carved out of a larger RAID array and shared with a couple dozen other VMs. Without the fix, I can easily reproduce the problem with the xargs and smbclient approach in the attached test script. However, per_attempt=20000 is too high, as it results in many NT_STATUS_LOGON_FAILURE returns that are not due to NT_STATUS_WRONG_PASSWORD. If I set per_attempt=2000, the defect still reproduces, and all NT_STATUS_LOGON_FAILURE returns I see are due to NT_STATUS_WRONG_PASSWORD. (Confirmed by temporarily removing the mapping logic in nt_status_squash and repeating the test several times with each code variant.)
Comment 12 Jeremy Allison 2021-01-14 03:01:50 UTC
Gary, any chance you could take a look at this issue for the Samba-AD/DC code ?

I'm not as familiar with how it manages the password lockouts. Also, this could only happen in the prefork process model as it requires multiple simultaneous process access.
Comment 13 Gary Lockyer 2021-01-14 04:11:39 UTC
I'll need to take a more detailed look tomorrow, but it does look like it's an issue in the AD code.
Comment 14 Gary Lockyer 2021-01-14 22:04:40 UTC
Definitely looks to be an issue in the AD code.  I'll spend some time looking into this in more detail.
Comment 15 Gary Lockyer 2021-01-28 02:30:32 UTC
Created attachment 16411 [details]
Proposed patch for bad password path

Patch covers that bad password count update.

Still need to:

Write a test

Handle the good password case.
Comment 16 Andrew Bartlett 2021-01-29 03:08:52 UTC
Comment on attachment 16411 [details]
Proposed patch for bad password path

This looks OK.  It would be good if the new code was more explicit that it is operating on the DN by <GUID= and not the string DN, which might be in a race.

With the explicit transaction the dsdb_autotransaction_request() should be ldb_request() to avoid a nested transaction.
Comment 17 Gary Lockyer 2021-02-04 20:49:07 UTC
Are we going to treat this as a security bug, and get a CVE assigned? I feel that it needs to be fixed but does not warrant a CVE.
Comment 18 Jeremy Allison 2021-02-04 21:38:06 UTC
Maybe we should ask Red Hat product security to see if they think this needs a CVE ?
Comment 19 Andrew Bartlett 2021-02-05 09:43:59 UTC
I get a CVSSv3.1 of AV:N/AC:H/PR:N/UI:N/S:U/C:L/I:L/A:N (4.8)

That assumes that a password bad enough to be brute forced can only provide limited access, eg some confidentiality or integrity breach. 

It also rates such an attack as hard.

If so, we would stop here per:

4a.) Stop and ask for advise on security@samba.org if the CVSSv3
     score is less than 5.0.  This is a heavy-weight process with
     costs to many parties.

I'll ask Red Hat for some advice.
Comment 30 Douglas Bagnall 2021-02-17 23:55:31 UTC
Red Hat have assigned CVE-2021-20251 for this.

I assume we need to roll Gary's and Jeremy's patches together.
Comment 31 Andrew Bartlett 2021-02-18 21:35:10 UTC
It needs to be recognised that in a multi-DC situation (LDAP backend) for the 'classic' NT4-like DC the reload will still be racy, as different servers won't see each others locks.

This is of course limited to the number of servers, so the race won't be as fast (eg 4x parallel, with some stepping on toes meaning the count may not increase in the way expected).

We also don't have any integration tests in a format that we could actually include in our testsuite, or any unit tests of the source3 side.
Comment 32 Jeremy Allison 2021-02-23 02:13:21 UTC
Yep, putting them together would seem to be the thing to do. As this is a race condition it's difficult to test, and as the problem is worst (and tested :-) in the AD case (and Gary has a fix for that :-), I think we may be OK with the code we have (tested by the submitter).
Comment 33 Douglas Bagnall 2021-02-23 20:35:00 UTC
Jeremy, your commit message says "WIP" and "prototype". Is it ready?
Comment 34 Jeremy Allison 2021-02-23 21:00:21 UTC
I only put that on it until the reporter confirmed it worked :-).
Comment 35 Jeremy Allison 2021-02-23 21:00:29 UTC
It's ready.
Comment 36 Andrew Bartlett 2021-02-25 01:32:15 UTC
Douglas will drive this from here, but all assistance most valued.
Comment 37 Douglas Bagnall 2021-02-26 03:17:49 UTC
Nate: how [if at all] do you want to be credited in the advisory and patches? Nathaniel Turner of Exagrid? houseofnate?

Also, is there anything in the comments here that you would rather not have made public when the patch goes out? We can hide individual messages forever.
Comment 38 Douglas Bagnall 2021-02-26 03:42:45 UTC
Created attachment 16474 [details]
s3/s4 combined patch, v1
Comment 39 Nathaniel W. Turner 2021-02-26 17:26:50 UTC
Douglas, you can credit me as Nathaniel Turner of ExaGrid. I don't see anything in the discussion history that needs to be redacted. Thank you.
Comment 40 Andrew Bartlett 2021-03-02 02:40:18 UTC
Created attachment 16483 [details]
patch for master (v1)
Comment 41 Andrew Bartlett 2021-03-02 02:41:17 UTC
Created attachment 16484 [details]
patch for 4.14 (v1)
Comment 42 Andrew Bartlett 2021-03-02 02:41:54 UTC
Created attachment 16485 [details]
patch for 4.13 (v1)
Comment 43 Andrew Bartlett 2021-03-02 02:42:29 UTC
Created attachment 16486 [details]
patch for 4.12 (v1)
Comment 44 Andrew Bartlett 2021-03-02 02:43:34 UTC
Created attachment 16487 [details]
patch for 4.11 (v1)

I've backported the patches to all supported versions and have the result under CI.
Comment 45 Andrew Bartlett 2021-03-02 03:34:46 UTC
Comment on attachment 16483 [details]
patch for master (v1)

At the very least we need to put "objectGUID" into  dcesrv_samr_OemChangePasswordUser2(), but I'm also quite worried that we have not eliminated the race here.  

Also the KDC does not check the error returns from authsam_logon_success_accounting() so the attack just moves from the NTLM to the KDC codepath.  Thankfully the KDC is not often multi-process, bit might be in the future.

Likewise, the SAMR password change routines don't call authsam_logon_success_accounting().

The same will apply in source3, where the bad password mutex isn't taken out on the SAMR server.

I've stopped the test pipelines for now.

Sorry!
Comment 46 Andrew Bartlett 2021-03-02 03:39:45 UTC
For the AD DC, a valid integration test would:
 - take out a transaction
 - lock out the account
 - connect to the server with a valid password
 - sleep() to ensure the server hits the transaction lock
 - commit the local transaction

If the valid password is returned as LOCKED_OUT then the accounting is working. 

The same can be done with a bad password, set the value to X in the transaction and confirm it becomes X + 1 after.

We can then do this for the 4 methods by which passwords can be checked:
 - KDC
 - NTLM
 - LDAP password change
 - SAMR
Comment 47 Jeremy Allison 2021-03-02 18:39:17 UTC
(In reply to Andrew Bartlett from comment #45)

Good catch on the source3/rpc_server/samr_XXX.c code !

I can extend the source3/ part of the patch to do the same fix for
source3/rpc_server/samr/srv_samr_chgpasswd.c:pass_oem_change()

The code doing set_user_info_XXX() inside source3/rpc_server/samr/srv_samr_nt.c I think is Administrator level password changes, so I don't think we need the fix there, yes ? Or am I wrong ? :-) :-).
Comment 48 Andrew Bartlett 2021-03-02 18:59:57 UTC
(In reply to Jeremy Allison from comment #47)
Yes, the set_ routines should be the administrator password set.
Comment 49 Andrew Bartlett 2021-03-03 06:14:04 UTC
Looking at the AD DC side of things, and the code that still needs to be changed:

 * The SAMR server change need careful work to ensure that we match Windows behaviour.  Just plugging into the success accounting code might diverge behaviour.  We might call that a Windows bug, but this isn't just a 'fix a race'.

 * The KDC is much harder to restructure to support this.  

   In the MIT KDC, samba_bad_password_count() and the KDB functions kdb_samba_db_audit_as_req() return void.

Thankfully only the Heimdal KDC is supported, and the HDB hook auth_status() -> hdb_samba4_auth_status() returns an error code.

However the kerberos5.c code does not check that error code.  Also, the success accounting isn't called.

So while this patch will at least mean the counter will increment, if we have parallel KDC processing (a prefork KDC) it will be possible to check more passwords until the next DB read after the transaction finishes for a failure (and potential lockout).  But fixing it properly isn't easy.

I propose that we ignore the KDC for now, sadly, and focus on the NTLM case because due to the forking model of smbd much more parallel behaviour is possible.  The KDC is preforked by default, so this might allow a few more guesses while bad passwords are waiting on the DB transaction lock. 

I also propose no behaviour change in the SAMR server to call the success accounting, but instead open a transaction around the read.  Instead *if* we are preforking we might well allow extra guesses. 

I need to think as to if in these situations taking out a transaction around the initial read might provide some protection. 

Stepping back, all in all this is a lot of work to reduce the danger for which the standard mitigation is 'choose good passwords'.
Comment 50 Andrew Bartlett 2021-03-30 09:52:31 UTC
Gary and I have done significant work to address a major concern of mine, which is that putting every good password behind the transaction lock would be a bad idea - these can be quite long-lived.

We will only take out a transaction if a hint is found in a runtime tdb.

Gary made a start and I've then refined those patches with an eye to being reasonable to review (carefully adapting changes so that they don't touch as much of the code).  What remains is the adaption of the unit tests.

We still also need the source3 code updated.
Comment 51 Andrew Bartlett 2021-03-30 09:57:04 UTC
Created attachment 16573 [details]
WIP patch for master

The tests need to be updated as we no longer call dsdb_search_by_guid().
Comment 52 Andrew Bartlett 2021-04-13 22:28:20 UTC
The status here is that while this is an important and reproducible issue, we have not been able to put the resources into finishing the AD patches once the need to avoid the transaction in the good password case was identified.

I'll chip away at this when I get time, but there are higher priorities right now that can't be solved by simply not having really bad passwords.

Sorry,
Comment 53 Joseph Sutton 2022-07-06 00:15:01 UTC
Created attachment 17410 [details]
patch for master v2

An updated patchset that includes Python integration tests. One caveat is that the Python tests require an additional library (python3-crypto on Debian/Ubuntu), that the bootstrap system doesn't install at present, so we can use single-DES for testing SAMR password changes. The alternative is writing new bindings in C.
Comment 54 Andrew Bartlett 2022-07-06 01:48:48 UTC
Comment on attachment 17410 [details]
patch for master v2

You may be able to trigger the password change with py_net_change_password() (extended to specify SAMR).

But the minimal change from here would be to expose single DES from GnuTLS rather than bringing back python-crypto, as it causes dependency issues. 

Some similar things are in pyglue.c, eg py_generate_random_bytes()
Comment 55 Andrew Bartlett 2022-07-06 03:30:31 UTC
(In reply to Andrew Bartlett from comment #54)
lib/crypto/py_crypto.c would be the right spot, we already do the same for RC4 for similar reasons.
Comment 56 Joseph Sutton 2022-07-06 03:46:10 UTC
Created attachment 17411 [details]
patch for master v3

Thanks. The new patchset uses C bindings from py_crypto.c, so the dependency on python3-crypto is removed.
Comment 57 Andrew Bartlett 2022-07-06 08:01:18 UTC
Comment on attachment 17411 [details]
patch for master v3

Rather than wrap E_old_pw_hash(), which has a name which extends from days very long ago when we tried to pretend we didn't have full DES in Samba (pre 2000era crypto export rules), please wrap the equivilant des_crypt112_16() or des_crypt56_gnutls() and give a more expressive name.   This will make it easier for others later.

Down in the tests, it looks like there is some leftover from the previous modes where the GUID was put into the user message.  I don't think we need add_object_guid() and maybe also add_sid() any more, eg in this repeated block:

+	msg = create_message(ctx);
+	guid = GUID_random();
+	add_object_guid(msg, &guid);
+	add_sid(msg, "S-1-5-21-2470180966-3899876309-2637894779-1000");

We should however have a test that renames the user (eg in that transaction) and confirms the re-read still works (implicitly confirms the GUID in the DN is being used). 

This was part of the simplification I did between Gary's patch and my rework, as by removing the GUID handling we could simplify the cases and should be able to simplify the tests.

This is looking good.
Comment 60 Andrew Bartlett 2022-07-06 21:27:05 UTC
(In reply to Joseph Sutton from comment #56)
I don't see any tests that the bad password indicator is set or used.  The tests prepare for lockout by editing the DB directly, which is OK, but I wonder if we need some other test?

Or is this covered implicitly somehow?

I wonder if we should have a test that just sets up the race, eg 4 bad passwords all stuck behind the transaction lock with a threshold of 3.  That would be the simple proof that we have fixed the issue, and should be able to be knownfailed fairly reliably. 

We also never, until restart, clean up an old bad password indicator if the password is never used again successfully, but I guess that's OK.
Comment 61 Andrew Bartlett 2022-07-08 03:37:10 UTC
A note for later as I can't make this idea public until this is public.

The bad password indicator DB could be used to implement a properly secure version of the SMB authentication rate limiter that MS is doing on their next windows version. 

If we do that we could just write in that TDB the time the password should be able to be checked (2sec more each failure), and initially just sleep - even better properly async wait - till that time, up to (say) 30sec.
Comment 62 Andrew Bartlett 2022-07-11 02:36:13 UTC
(In reply to Andrew Bartlett from comment #60)
In particular, a test that a successful correct login is possible on an account without a bad password indicator, but despite an outstanding transaction.

Also a test that a successful correct login is possible on an account with a bad password indicator (but which is wrong - eg appears erroneously, forcing the re-read)
Comment 63 Joseph Sutton 2022-07-12 08:05:27 UTC
Created attachment 17413 [details]
patch for master v4

This patchset should address the new comments. It also now ensures that we return a LOCKED_OUT error code whether the password is bad or not, and makes the bad password count update for an LDAP password change atomic (previously, there was a small gap between ending a transaction and immediately starting a new one, during which other processes could acquire the transaction lock and sneak an update in).
Comment 64 Andrew Bartlett 2022-07-14 05:07:17 UTC
Comment on attachment 17413 [details]
patch for master v4

My only comment is that the other tests seem to suggest that the logonCount is not updated for a simple bind (except to set to 0 on the first logon). 

(I was checking on this because simple binds, which follow the same internal pattern as NTLM authentication, are the high-volume auth path we don't want stuck behind a transaction queue).
Comment 65 Andrew Bartlett 2022-07-19 20:36:15 UTC
Comment on attachment 17413 [details]
patch for master v4

The hold-up on this patch is a race in socket_wrapper being triggered by the forked processes.  There is a small stat()/[libc_]bind() race looking for a free 'port' in a loop within swrap_auto_bind().  Once this is worked around we should be almost ready to publish this.
Comment 66 Andreas Schneider 2022-07-20 07:49:00 UTC
I think we should get rid of the stat() in swrap_auto_bind().

 src/socket_wrapper.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/socket_wrapper.c b/src/socket_wrapper.c
index 8141b8b..68adcae 100644
--- a/src/socket_wrapper.c
+++ b/src/socket_wrapper.c
@@ -3916,10 +3916,12 @@ static int swrap_auto_bind(int fd, struct socket_info *si, int family)
                              type,
                              socket_wrapper_default_iface(),
                              port);
-               if (stat(un_addr.sa.un.sun_path, &st) == 0) continue;
 
                ret = libc_bind(fd, &un_addr.sa.s, un_addr.sa_socklen);
                if (ret == -1) {
+                       if (errno == EALREADY || errno == EADDRINUSE) {
+                               continue;
+                       }
                        goto done;
                }


Can you test if the above patch fixes it?
Comment 67 Joseph Sutton 2022-07-20 09:32:45 UTC
(In reply to Andreas Schneider from comment #66)
I've tested your patch, and it appears to fix the 'Address already in use' error I was getting -- thanks! However, I had to make the following additional change:

@@ -3815,7 +3815,6 @@ static int swrap_auto_bind(int fd, struct socket_info *si, int family)
 	char type;
 	int ret;
 	int port;
-	struct stat st;
 	char *swrap_dir = NULL;
 
 	swrap_mutex_lock(&autobind_start_mutex);

to avoid an 'unused variable' warning.
Comment 68 Joseph Sutton 2022-07-21 06:07:24 UTC
Created attachment 17436 [details]
patch for master v5
Comment 69 Andreas Schneider 2022-07-21 12:57:05 UTC
socket_wrapper update to version 1.3.4 addressing the TOCTOU issue:

https://gitlab.com/samba-team/samba/-/merge_requests/2627
Comment 70 Andreas Schneider 2022-07-28 11:38:00 UTC
I've tried to apply the patch on my asn-samr-aes branch. It fails to apply because of changes in master. There are no conflicts with the changes in my samr branch.




commit 63b0861eaf1ef6852783127fb4a1803b8108ddd4
Author:     Joseph Sutton <josephsutton@catalyst.net.nz>
AuthorDate: Tue Jul 5 20:18:01 2022 +1200
Commit:     Andreas Schneider <asn@samba.org>
CommitDate: Thu Jul 28 13:26:03 2022 +0200

    CVE-2021-20251 s4-rpc_server: Remove transaction logic from SAMR password change
    
    samdb_set_password() also starts a transaction, so doing it here is
    unecessary and only results in a nested transaction.
    
    BUG: https://bugzilla.samba.org/show_bug.cgi?id=14611
    
    Signed-off-by: Joseph Sutton <josephsutton@catalyst.net.nz>


I think it should be the other way around. We need to remove the transaction from samdb_set_password().


3.1.5.10.3 SamrUnicodeChangePasswordUser2 (Opnum 55)

[..]
1. On a DC configuration, if Active Directory is not running, the server MUST abort the request and return an error status.
2. All database operations MUST occur in a single transaction.
[..]

3.1.5.10.4 SamrUnicodeChangePasswordUser4 (Opnum 73)
Same as above ...


So we need to make sure we wrap all of them in a transaction.
Comment 71 Joseph Sutton 2022-07-28 21:52:54 UTC
(In reply to Andreas Schneider from comment #70)
Ah, OK -- I'll see if everything can be wrapped in a transaction. But I'm not entirely sure how that can work, given that the search is performed with a system samdb context, and the actual password change is performed with a new context, having only user privileges. Unless DSDB_SESSION_INFO can be modified on the context to lower our privileges?

I noticed that our ChangePasswordUser4 implementation seems to operate as system, without requiring any rights on the part of the user, unlike other password change methods. Is that really what Windows is doing?
Comment 72 Andrew Bartlett 2022-07-31 20:05:52 UTC
(In reply to Joseph Sutton from comment #71)
In the past, before the patches for NTLM removal, some measure of the 'transaction' was provided by the fact that the first SYSTEM read got the hashes, which the second USER operation had to have match byte-wise.

But I changed that to just a flag structure, to avoid relying on the NT hash.

The only way we can put this in a transaction (cleanly, without making assumptions about how TDB/LMDB work under the hood) is by changing the opaque, sort of like become_root() in the fileserver.
Comment 73 Joseph Sutton 2022-08-02 05:08:59 UTC
Created attachment 17452 [details]
patch for master v7

New patch rebased on master, currently under CI.
Comment 74 Joseph Sutton 2022-08-02 08:24:01 UTC
Created attachment 17453 [details]
patch for master v8
Comment 75 Andreas Schneider 2022-08-02 12:33:03 UTC
(In reply to Joseph Sutton from comment #71)
Yes, the password should be changed as the user. Thanks for fixing it correctly in all functions.
Comment 76 Joseph Sutton 2022-08-03 02:47:24 UTC
Created attachment 17456 [details]
patch for master v9

Well, almost correctly.

This time I think the patch is basically ready, if there are no more major changes needed.
Comment 77 Joseph Sutton 2022-09-06 00:23:09 UTC
Removing embargo.
Comment 78 Samba QA Contact 2022-09-13 00:09:08 UTC
This bug was referenced in samba master:

b27a67af0216811d330d8a4c52390cf4fc04b5fd
17b8d164f69a5ed79d9b7b7fc2f3f84f8ea534c8
4bb9d85fed8498566bdb87baa71a3147806baafc
91e2e5616ccd507fcaf097533c5fc25974119c1e
439f96a2cfe77f6cbf331d965a387512c2db91c6
2087b0cd986b8959b2a402b9a1891472e47ca0b0
408717242aad8adf4551f2394eee2d80a06c7e63
7b8e32efc336fb728e0c7e3dd6fbe2ed54122124
d6cf245b96fb02edb3bcc52733d040d5f03fb918
336e303cf1962b56b64c0d9d2b05ac15d00e8692
de4cc0a3dae89f3e51a099282615cf80c8539e11
4a9e0fdccfa218fbb2c3eb87e1a955ade0364b98
b954acfde258a1909ed60c1c3e1015701582719f
55147335aec8194b6439169b040556a96db22e95
712181032a47318576ef35f6a6cf0f958aa538fb
b5f78b7b895a6b92cfdc9221b18d67ab18bc2a24
d8a862cb811489abb67d4cf3a7fbd83d05c7e5cb
a65147a9e98ead70869cdfa20ffcc9c167dbf535
96479747bdb5bc5f33d903085f5f69793f369e3a
2b593c34c4f5cb82440b940766e53626c1cbec5b
b1e740896ebae14ba64250da2f718e1d707e9eed
bdfc9d96f8fe5070ab8a189bbf42ccb7e77afb73
a268a1a0e304d0702469e4ac146d8af5e7384c39
268ea7bef5af4b9c8a02f4f5856113ff0664d9e8
8587734bf989aeaafa9d09d78d0f381caf52d285
65c473d4a53fc8a22a0d531aff45203ea3a4d99b
fabbea25310a31c0409b1c11eaced39bd8cde8dd
f74f92aea164af40d9177b332778a76d7ecabcbd
fcabcb326d385c1e1daaa8dae9820e33a3868f56
7981cba87e3a7256b12bfc5fdd89b136c12979ff
1d869a2a666cfada1495d891021de6c2b8567a96
8ae0c38d54f065915e927bbfe1b656400a79eb13
Comment 79 Jeremy Allison 2022-09-13 00:13:59 UTC
Congratulations on landing this ! Lots of hard work.
Comment 80 Joseph Sutton 2022-09-13 08:01:17 UTC
Created attachment 17513 [details]
patch for 4.17 v9
Comment 81 Joseph Sutton 2022-09-16 00:57:04 UTC
Created attachment 17521 [details]
proposed patch for 4.16 v9
Comment 82 Andrew Bartlett 2022-09-16 02:31:12 UTC
Assigning to Jule for the next available releases.
Comment 83 Jule Anger 2022-09-18 16:45:42 UTC
Pushed to autobuild-v4-{17,16}-test.
Comment 84 Samba QA Contact 2022-09-18 17:47:12 UTC
This bug was referenced in samba v4-16-test:

f0c44d9e53dbeb9a8ebc387635f6146cb292882f
3542483de3f07806d78ea018852092516582c71d
b7351888e82c5d2ae6222aede762609c02da2905
63020bf13c04f7a8604b8f4a02dba48b8aeed769
0e3ac110df7e08823407006aa82e6300524212ee
bc30ca2117c75aa57c8a8b11420678cc8eea294b
740c4c2b95367793c557085e3cc5b4e367e2e0bb
831335aaaad6eead7ebd7dc3ed89c5f33c2a3e35
9dcf447d82238816f6056d64c68e85ddd8808660
8580b90a87b0a18dad8d9808e6d85ccbe5ecf107
a1a440c10148d19d3b9ac47ade2fa36ab0e39e33
79f791ff0ebe0a8a6787c74ad20870ce41907844
a9aae34d5a97081dff9126328167678cfc4601c7
6b826a375a13f44a0486024ee09564cf3b1528ca
0d6da5250be44888b3eaeb94aad7cee77f44d095
dd38fae8c8d7d47c75f03b6bc94bed8d47012cb6
6a70d006917486e4f1bf9333ab8a4961e79ef51a
2fe2485b93d589620684f79b61d3da222accfb1e
f725f2f2442b3cef0fb51ad2af5248147524873b
f58d7e42009180dba67832149123a19237d388b2
29b31129fd372513ad24e56ec4caab6844e2ed72
5eb5daaa1521d424ebde5a4d06ad05c9cdfc7996
74d8c3d5843f2ba285bc1a6ace1eba5080c0e99c
96c24b58b8cc2301007f0bf586184fd2d264b028
05447dfb2016d32fdfffb42fdb02cc4db68b18a9
69abe0c2b0aeeb8ac1fb93baae41f1452ac4a5fc
ae3b615236c4cc773be30f34ec9b794a25253449
f78ff75c51f6cd15b476de373f11b1c87f962970
6a0280d9553db535601c71c1be475f85949d5b83
Comment 85 Samba QA Contact 2022-09-19 05:04:06 UTC
This bug was referenced in samba v4-17-test:

d4ae8610ea37bdd0ac2e6dcb59f858d9921f949b
518818b3c104b7d314265a6c631742384de30d76
0b3604e6e0de7c03eeebb81ef8611dd4b6905c6f
b82543978d112fb31fba593bde16161eefbf8dbe
276d81368ecf057d462fca40c28cd68deb20b390
2dc965ad1d85b3f77e734a74b52ce9049b5df536
d57c4ea959927bf340c6326029db5a11941538a6
674dbeaca0710e4493d53b427903b2ef4ec59788
2e4c6196d88fe22d3666bcc07091d8b00f015a0b
180784c49b34a1cca30d2bfbef7397e9b6826d20
d07f34ec394c75f0530366fb010e3804c0570468
e0fdfce1327c8c68da99689fe3a3d6fc34f0dfca
fa22c9bf2bee98245fdfecbeb360bc2f11d9de89
ffe43511bb9ed9e5bb91273d13b63e1655ee1240
1167352291264d5e1c4b1617603f77e902beb1c8
446cfe34523cbdd18b1407f0ed219996b7c0e1b1
3a96ccbb8414b2642836cf88ea5531040fc658b4
254e94892cd3b561f5d17e7008b792fa7948c68b
4d0cba69c8f0d2f4426f3889b40f2f6b161aef25
5f1bafdd3f042a42f0cb9f5dec3345d24c3586d9
4adcada4104294d2b7e30617fee2f959e93e908e
5befe31c651419cfae9ff31e25290e2952441670
b3f48fae13ed7538b5997b90682615f0cafb1b49
13efa6261883edefd80267db7063d138aa8b5554
5c8bbe3e74c04d44c8c118bf3a81306ead02ddb5
3e54aabd9e38a0f2020dd0359613244cf578c719
b8c123d02d0d0968bfc8548e57931342dd77e068
7b28bd1080383fba3abb54b3cbf03134cbefd7db
7fe10442b7670b35101218ee3e9a303cc573618c
619ffc2a2fb50d87d771ea316818c65139c79e5a
9aabf78216f91aee6abcd401b10f1ca01f544be0
bb86d2f3a10edbe27aa2edeafce0475b9cd79feb
Comment 86 Jule Anger 2022-09-25 09:19:17 UTC
Closing out bug report.

Thanks!