Bug 13815 - winbind causing huge timeouts/delays since 4.8
Summary: winbind causing huge timeouts/delays since 4.8
Status: RESOLVED INVALID
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: Winbind (show other bugs)
Version: unspecified
Hardware: x64 Linux
: P5 normal (vote)
Target Milestone: ---
Assignee: Andreas Schneider
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on: 13503
Blocks:
  Show dependency treegraph
 
Reported: 2019-02-28 17:35 UTC by Alexander Spannagel
Modified: 2020-12-17 00:55 UTC (History)
4 users (show)

See Also:


Attachments
log.winbind from the test (16.09 KB, text/plain)
2019-02-28 17:35 UTC, Alexander Spannagel
no flags Details
log.wb-CENTOS7DEV64 from the test (26.12 KB, text/plain)
2019-02-28 17:35 UTC, Alexander Spannagel
no flags Details
log.winbind-idmap from the test (7.31 KB, text/plain)
2019-02-28 17:39 UTC, Alexander Spannagel
no flags Details
re-add return of false within parse_domain_user (459 bytes, patch)
2019-02-28 17:43 UTC, Alexander Spannagel
no flags Details
return true and set domain same than namespace within parse_domain_user (473 bytes, patch)
2019-02-28 17:46 UTC, Alexander Spannagel
no flags Details
more dedicated patch for functions causing hangs (1.40 KB, patch)
2019-03-02 01:15 UTC, Alexander Spannagel
no flags Details
preferred patch, brings back logic for parse_domain_user return value (1.40 KB, patch)
2019-03-04 13:03 UTC, Alexander Spannagel
no flags Details
preferred patch, brings back logic for parse_domain_user return value (597 bytes, patch)
2019-03-04 15:00 UTC, Alexander Spannagel
no flags Details
patch as it went into the merge request (11.15 KB, patch)
2019-03-07 23:07 UTC, Alexander Spannagel
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Alexander Spannagel 2019-02-28 17:35:07 UTC
Created attachment 14879 [details]
log.winbind from the test

Hi,

this is a bug report as follow up of a thread in the samba list started at 22.02.19 with same subject.

Here a short summary of the bug:
It was coming up with OS update from RHEL/CentOS 7.5 to 7.6 including also a samba upgrade from 4.7 to 4.8 with no config changes. 
We have the following in nsswitch.conf:
passwd:     files sss winbindfind
shadow:     files sss
group:      files sss winbind

The winbind is the connection to our Winodows-AD (multiple-domains, using autorid), while sss connects against our unix LDAP server. Both coexists since years.

Samba is configured as domain member with option "winbind use default domain" set to it's default of "no". When calling winbind with an unknown "local" user (e.g. no domain delivered) it takes approx 60secs before an error returns. 
Here an example:
[root@centos7dev64 samba]# systemctl restart smb winbind sssd ; sss_cache -E ; net cache flush ; date ; smbcontrol winbind debug 10 ; sleep 60 ; date  ; touch start ; wbinfo -i foo ; touch stop ; smbcontrol winbind debug 3 ; date ; sleep 5 ; sync ; cp -a st* log.winbindd* log.wb* BAK/
Thu Feb 28 10:31:52 EST 2019
Thu Feb 28 10:32:52 EST 2019
failed to call wbcGetpwnam: WBC_ERR_DOMAIN_NOT_FOUND
Could not get info for user foo
Thu Feb 28 10:33:52 EST 2019

This behavior is 100% re-producabel on different servers and causing other processes using the getpwnam that are not locally resolving hanging for a while.
As a workaround we have set "winbind use default domain" to "yes" where possible and on the others and "winbind max domain connections" to 10.

After taking some time (and no patches fixed the issue) i started look in the code and the problem could be pinned down to the patch resolving Bug #13503. With our setup the call with a local name (e.g. none domain and not local account) sets the variable namespace to hostname while keeping domain empty in the function parse_domain_user from source3/winbindd/winbindd_util.c.
Removing the patch and so a return of false fixed so problem.
Also expanding the mentioned patch to also set domain to the same than namespace  fixed it for us.

Attached are the relevant logs from the example above and a patch against latest 4.8 tree, that should also work against the develoment branch.

Would be much appreciated if the patch would be reviewed and as such (or a corrected version) been added to the code base.

Thanks

Alex
Comment 1 Alexander Spannagel 2019-02-28 17:35:51 UTC
Created attachment 14880 [details]
log.wb-CENTOS7DEV64 from the test
Comment 2 Alexander Spannagel 2019-02-28 17:39:17 UTC
Created attachment 14881 [details]
log.winbind-idmap from the test
Comment 3 Alexander Spannagel 2019-02-28 17:43:03 UTC
Created attachment 14882 [details]
re-add return of false within parse_domain_user

This patch returns false like it does in code base for samba 4.7.
Comment 4 Alexander Spannagel 2019-02-28 17:46:17 UTC
Created attachment 14883 [details]
return true and set domain same than namespace within parse_domain_user

return true as patch from Bug #13503, but not certain if setting domain to hostname breaks the intention of #13503.
Comment 5 Alexander Spannagel 2019-03-01 21:40:25 UTC
Did some more testing and found that "wbinfo --group-info foo" didn't show the delay also it calls same function parse_domain_user() from  source3/winbindd/winbindd_util.c

Reason looks to be that within source3/winbindd/winbindd_getgrnam.c after calling the function the name_domain is set to get_global_sam_name() and so like the patch i added for in the function directly:

        /* if no domain or our local domain and no local tdb group, default to
         * our local domain for aliases */

        if ( !*(state->name_domain) || strequal(state->name_domain,
                                                get_global_sam_name()) ) {
                fstrcpy(state->name_domain, get_global_sam_name());
        }

Adding something similar to source3/winbindd/winbindd_getpwnam.c should also fix the issue.
Comment 6 Alexander Spannagel 2019-03-01 22:55:25 UTC
(In reply to Alexander Spannagel from comment #5)

Query winbind for users groups also show a delay:
[root@centos7dev64 samba-4.11.0]# time wbinfo --user-groups=foo
failed to call wbcGetGroups: WBC_ERR_DOMAIN_NOT_FOUND
Could not get groups for user foo

real    0m39.735s
user    0m0.053s
sys     0m0.009s

In this case not fixing empty domain while namespace is set to hostname...

The function call that looks not return in time looks to be: 
        subreq = wb_lookupname_send(state, ev,
                                    state->namespace,
                                    state->domname,
                                    state->username,
                                    LOOKUP_NAME_NO_NSS);

It is called from winbindd_getpwnam.c and winbindd_getgroups.c with empty domain and namespace set to hostname in our setup. I'll try an alternative patch that  patches these files and not parse_domain_user() as a whole.
Comment 7 Alexander Spannagel 2019-03-02 01:12:14 UTC
Ok, adding an alernative patch that looks usable to me as it fix it in the files source3/winbindd/getpwent.c/getgroups.c where it struggles when account with none/empty domain is provided.

The patch has been tested against dev branch, but also applies to 4.8.

Here the result from test:

[root@centos7dev64 SOURCES]# wbinfo -V
Version 4.11.0pre1-UNKNOWN
[root@centos7dev64 SOURCES]# time wbinfo --user-groups=foo
failed to call wbcGetGroups: WBC_ERR_DOMAIN_NOT_FOUND
Could not get groups for user foo

real    0m0.063s
user    0m0.052s
sys     0m0.008s
[root@centos7dev64 INSTALL]# time wbinfo -i foo
failed to call wbcGetpwnam: WBC_ERR_DOMAIN_NOT_FOUND
Could not get info for user foo

real    0m0.062s
user    0m0.055s
sys     0m0.004s
Comment 8 Alexander Spannagel 2019-03-02 01:15:06 UTC
Created attachment 14891 [details]
more dedicated patch for functions causing hangs

tested against development branch, but applies down to 4.8 tree too
Comment 9 Andreas Schneider 2019-03-02 09:47:21 UTC
Please post the output of:

testparm -s

If you remove 'sss' from the nsswitch.conf, do you still see the delay?

More debugging hints are here: https://hackmd.io/s/SkHk8rXBz
Comment 10 Alexander Spannagel 2019-03-02 10:35:25 UTC
Removing sss fix it also like removing winbind (both are needed and so not an option), so each on it's own is fine. As stated and discussed in mail-thread it worked over years and with update from 4.7 to 4.8 of samba it breaks. Removing patch from refered Bug fixes the issue to.

Here the requested output:
x[root@centos7dev64 ~]# testparm -s
Load smb config files from /etc/samba/smb.conf
Loaded services file OK.
'winbind separator = +' might cause problems with group membership.

Server role: ROLE_DOMAIN_MEMBER

# Global parameters
[global]
        dedicated keytab file = /etc/krb5.keytab
        disable spoolss = Yes
        domain master = No
        kerberos method = secrets and keytab
        load printers = No
        local master = No
        log file = /var/log/samba/log.%m
        max log size = 0
        os level = 0
        printcap name = /dev/null
        realm = OPS.GLOBAL.AD
        security = ADS
        server signing = required
        server string = FTP Samba Server
        show add printer wizard = No
        template shell = /bin/bash
        username map = /etc/samba/user.map
        winbind refresh tickets = Yes
        winbind separator = +
        workgroup = OPS
        idmap config na : range = 3000000-3999999
        idmap config na : backend = rid
        idmap config eu : range = 2000000-2999999
        idmap config eu : backend = rid
        idmap config ops : range = 1000000-1999999
        idmap config ops : backend = rid
        idmap config * : range = 100000-199999
        idmap config * : backend = tdb
        map acl inherit = Yes
        printing = bsd
        vfs objects = acl_xattr full_audit recycle extd_audit
Comment 11 Alexander Spannagel 2019-03-04 13:03:27 UTC
Created attachment 14895 [details]
preferred patch, brings back logic for parse_domain_user return value

Adding this patch as it brings back the logic of the function parse_domain_user from samba 4.7 and earlier:
There function returns false when these two matches:
- user has no domain provided (e.g. local account) [match our setup] 
- assume_domain call returns false, which happens, when one of these match:
  -ROLE_STANDALONE
  -ROLE_DOMAIN_MEMBER and winbind_use_default_domain set to No [match our setup]
  -Not a domain controller

With introduction of 4.8 parse_domain_user only returns false, when no user name was given, and so makes most of the following checks after calling the function obsolete:
        if (!ok) {
                DEBUG(5, ("Could not parse domain user: %s\n", domuser));
                tevent_req_nterror(req, NT_STATUS_INVALID_PARAMETER);
                return tevent_req_post(req, ev);
        }

Also the function canonicalize_username from winbindd_util.c is impacted as it will start return something like "+user" with winbind seperator "+" in the case for local user and winbind_use_default_domain set to No on a domain member.
Not certain which impact this has to the rest of the code, but maybe it also the root cause of Bug 13503.

I tested the patch successfully against developer branch and also against 4.8 an would be appreciated, if it get's into the code base.

Alex
Comment 12 Alexander Spannagel 2019-03-04 15:00:43 UTC
Created attachment 14896 [details]
preferred patch, brings back logic for parse_domain_user return value

sorry uploaded wrong pathc, this is the one i asked to get into the code ....
Comment 13 Alexander Spannagel 2019-03-04 15:04:56 UTC
(In reply to Alexander Spannagel from comment #10)
short correction - just removing patch from 13503 didn't fixed the problem.
Comment 14 Andreas Schneider 2019-03-04 17:12:00 UTC
Could you submit patches via our gitlab CI so that we see that the patch doesn't break anything?

You can find details here:
https://wiki.samba.org/index.php/Samba_CI_on_gitlab
Comment 15 Alexander Spannagel 2019-03-04 21:02:33 UTC
(In reply to Andreas Schneider from comment #14)
Shure. I'm on gitlab now, so would need required access to Samba CI repo.
Comment 16 Alexander Spannagel 2019-03-07 23:06:23 UTC
It was somewhat a back and forward, but finally my latest patch passed CI successfully and i created merge-request #281.
It is a mix of my first patches here, keeps patch for bug 13503 untouched and also applies to current samba 4.8.3-4 on RHEL/CentOS7.
Comment 17 Alexander Spannagel 2019-03-07 23:07:46 UTC
Created attachment 14910 [details]
patch as it went into the merge request
Comment 18 Andreas Schneider 2019-04-02 12:46:06 UTC
Ralph and I looked into the issue together.

We followed the code path when a user is looked up. Also we've taken into account that there are scenarios where a local user is configured even if we are a domain member, then we still need to be able to call getpwnam().

Winbind has measures to avoid recursion, so that a getpwnam() call will not call winbind again and we recurse. However we time out when it comes to a getpwnam() call which goes to sss and then we run into timeouts.

We think that the correct solution is to change the nsswitch.conf to call winbind before sss which seem to run into issues.

passwd: files winbind sss
group: files winbind sss

Could you please verify that this works as expected?
Comment 19 Ralph Böhme 2019-04-02 13:17:41 UTC
(In reply to Andreas Schneider from comment #18)
...and please also file a bug on sssd that it seemingly doesn't deal with recursion.
Comment 20 Andreas Schneider 2019-04-04 06:55:37 UTC
This is not a winbind issue at all.

The issue has been fixed in sssd: https://bugzilla.redhat.com/show_bug.cgi?id=1666819