Created attachment 14879 [details] log.winbind from the test Hi, this is a bug report as follow up of a thread in the samba list started at 22.02.19 with same subject. Here a short summary of the bug: It was coming up with OS update from RHEL/CentOS 7.5 to 7.6 including also a samba upgrade from 4.7 to 4.8 with no config changes. We have the following in nsswitch.conf: passwd: files sss winbindfind shadow: files sss group: files sss winbind The winbind is the connection to our Winodows-AD (multiple-domains, using autorid), while sss connects against our unix LDAP server. Both coexists since years. Samba is configured as domain member with option "winbind use default domain" set to it's default of "no". When calling winbind with an unknown "local" user (e.g. no domain delivered) it takes approx 60secs before an error returns. Here an example: [root@centos7dev64 samba]# systemctl restart smb winbind sssd ; sss_cache -E ; net cache flush ; date ; smbcontrol winbind debug 10 ; sleep 60 ; date ; touch start ; wbinfo -i foo ; touch stop ; smbcontrol winbind debug 3 ; date ; sleep 5 ; sync ; cp -a st* log.winbindd* log.wb* BAK/ Thu Feb 28 10:31:52 EST 2019 Thu Feb 28 10:32:52 EST 2019 failed to call wbcGetpwnam: WBC_ERR_DOMAIN_NOT_FOUND Could not get info for user foo Thu Feb 28 10:33:52 EST 2019 This behavior is 100% re-producabel on different servers and causing other processes using the getpwnam that are not locally resolving hanging for a while. As a workaround we have set "winbind use default domain" to "yes" where possible and on the others and "winbind max domain connections" to 10. After taking some time (and no patches fixed the issue) i started look in the code and the problem could be pinned down to the patch resolving Bug #13503. With our setup the call with a local name (e.g. none domain and not local account) sets the variable namespace to hostname while keeping domain empty in the function parse_domain_user from source3/winbindd/winbindd_util.c. Removing the patch and so a return of false fixed so problem. Also expanding the mentioned patch to also set domain to the same than namespace fixed it for us. Attached are the relevant logs from the example above and a patch against latest 4.8 tree, that should also work against the develoment branch. Would be much appreciated if the patch would be reviewed and as such (or a corrected version) been added to the code base. Thanks Alex
Created attachment 14880 [details] log.wb-CENTOS7DEV64 from the test
Created attachment 14881 [details] log.winbind-idmap from the test
Created attachment 14882 [details] re-add return of false within parse_domain_user This patch returns false like it does in code base for samba 4.7.
Created attachment 14883 [details] return true and set domain same than namespace within parse_domain_user return true as patch from Bug #13503, but not certain if setting domain to hostname breaks the intention of #13503.
Did some more testing and found that "wbinfo --group-info foo" didn't show the delay also it calls same function parse_domain_user() from source3/winbindd/winbindd_util.c Reason looks to be that within source3/winbindd/winbindd_getgrnam.c after calling the function the name_domain is set to get_global_sam_name() and so like the patch i added for in the function directly: /* if no domain or our local domain and no local tdb group, default to * our local domain for aliases */ if ( !*(state->name_domain) || strequal(state->name_domain, get_global_sam_name()) ) { fstrcpy(state->name_domain, get_global_sam_name()); } Adding something similar to source3/winbindd/winbindd_getpwnam.c should also fix the issue.
(In reply to Alexander Spannagel from comment #5) Query winbind for users groups also show a delay: [root@centos7dev64 samba-4.11.0]# time wbinfo --user-groups=foo failed to call wbcGetGroups: WBC_ERR_DOMAIN_NOT_FOUND Could not get groups for user foo real 0m39.735s user 0m0.053s sys 0m0.009s In this case not fixing empty domain while namespace is set to hostname... The function call that looks not return in time looks to be: subreq = wb_lookupname_send(state, ev, state->namespace, state->domname, state->username, LOOKUP_NAME_NO_NSS); It is called from winbindd_getpwnam.c and winbindd_getgroups.c with empty domain and namespace set to hostname in our setup. I'll try an alternative patch that patches these files and not parse_domain_user() as a whole.
Ok, adding an alernative patch that looks usable to me as it fix it in the files source3/winbindd/getpwent.c/getgroups.c where it struggles when account with none/empty domain is provided. The patch has been tested against dev branch, but also applies to 4.8. Here the result from test: [root@centos7dev64 SOURCES]# wbinfo -V Version 4.11.0pre1-UNKNOWN [root@centos7dev64 SOURCES]# time wbinfo --user-groups=foo failed to call wbcGetGroups: WBC_ERR_DOMAIN_NOT_FOUND Could not get groups for user foo real 0m0.063s user 0m0.052s sys 0m0.008s [root@centos7dev64 INSTALL]# time wbinfo -i foo failed to call wbcGetpwnam: WBC_ERR_DOMAIN_NOT_FOUND Could not get info for user foo real 0m0.062s user 0m0.055s sys 0m0.004s
Created attachment 14891 [details] more dedicated patch for functions causing hangs tested against development branch, but applies down to 4.8 tree too
Please post the output of: testparm -s If you remove 'sss' from the nsswitch.conf, do you still see the delay? More debugging hints are here: https://hackmd.io/s/SkHk8rXBz
Removing sss fix it also like removing winbind (both are needed and so not an option), so each on it's own is fine. As stated and discussed in mail-thread it worked over years and with update from 4.7 to 4.8 of samba it breaks. Removing patch from refered Bug fixes the issue to. Here the requested output: x[root@centos7dev64 ~]# testparm -s Load smb config files from /etc/samba/smb.conf Loaded services file OK. 'winbind separator = +' might cause problems with group membership. Server role: ROLE_DOMAIN_MEMBER # Global parameters [global] dedicated keytab file = /etc/krb5.keytab disable spoolss = Yes domain master = No kerberos method = secrets and keytab load printers = No local master = No log file = /var/log/samba/log.%m max log size = 0 os level = 0 printcap name = /dev/null realm = OPS.GLOBAL.AD security = ADS server signing = required server string = FTP Samba Server show add printer wizard = No template shell = /bin/bash username map = /etc/samba/user.map winbind refresh tickets = Yes winbind separator = + workgroup = OPS idmap config na : range = 3000000-3999999 idmap config na : backend = rid idmap config eu : range = 2000000-2999999 idmap config eu : backend = rid idmap config ops : range = 1000000-1999999 idmap config ops : backend = rid idmap config * : range = 100000-199999 idmap config * : backend = tdb map acl inherit = Yes printing = bsd vfs objects = acl_xattr full_audit recycle extd_audit
Created attachment 14895 [details] preferred patch, brings back logic for parse_domain_user return value Adding this patch as it brings back the logic of the function parse_domain_user from samba 4.7 and earlier: There function returns false when these two matches: - user has no domain provided (e.g. local account) [match our setup] - assume_domain call returns false, which happens, when one of these match: -ROLE_STANDALONE -ROLE_DOMAIN_MEMBER and winbind_use_default_domain set to No [match our setup] -Not a domain controller With introduction of 4.8 parse_domain_user only returns false, when no user name was given, and so makes most of the following checks after calling the function obsolete: if (!ok) { DEBUG(5, ("Could not parse domain user: %s\n", domuser)); tevent_req_nterror(req, NT_STATUS_INVALID_PARAMETER); return tevent_req_post(req, ev); } Also the function canonicalize_username from winbindd_util.c is impacted as it will start return something like "+user" with winbind seperator "+" in the case for local user and winbind_use_default_domain set to No on a domain member. Not certain which impact this has to the rest of the code, but maybe it also the root cause of Bug 13503. I tested the patch successfully against developer branch and also against 4.8 an would be appreciated, if it get's into the code base. Alex
Created attachment 14896 [details] preferred patch, brings back logic for parse_domain_user return value sorry uploaded wrong pathc, this is the one i asked to get into the code ....
(In reply to Alexander Spannagel from comment #10) short correction - just removing patch from 13503 didn't fixed the problem.
Could you submit patches via our gitlab CI so that we see that the patch doesn't break anything? You can find details here: https://wiki.samba.org/index.php/Samba_CI_on_gitlab
(In reply to Andreas Schneider from comment #14) Shure. I'm on gitlab now, so would need required access to Samba CI repo.
It was somewhat a back and forward, but finally my latest patch passed CI successfully and i created merge-request #281. It is a mix of my first patches here, keeps patch for bug 13503 untouched and also applies to current samba 4.8.3-4 on RHEL/CentOS7.
Created attachment 14910 [details] patch as it went into the merge request
Ralph and I looked into the issue together. We followed the code path when a user is looked up. Also we've taken into account that there are scenarios where a local user is configured even if we are a domain member, then we still need to be able to call getpwnam(). Winbind has measures to avoid recursion, so that a getpwnam() call will not call winbind again and we recurse. However we time out when it comes to a getpwnam() call which goes to sss and then we run into timeouts. We think that the correct solution is to change the nsswitch.conf to call winbind before sss which seem to run into issues. passwd: files winbind sss group: files winbind sss Could you please verify that this works as expected?
(In reply to Andreas Schneider from comment #18) ...and please also file a bug on sssd that it seemingly doesn't deal with recursion.
This is not a winbind issue at all. The issue has been fixed in sssd: https://bugzilla.redhat.com/show_bug.cgi?id=1666819