Dear Samba team,

It is becoming very stressful for me to find out why users are having disrupted connectivity to Samba server file shares when it is set up as a member server to 2 PDCs running Microsoft W2K3 Server. I have read Google postings about 'winbindd: Exceeding 200 client connections, no idle connection found'. Upgrading to the latest version doesn't solve the problem. Could this be the cause of all the problems I am facing, or is it something else? Why is there a limit of 200 client connections, when I have approximately 1000 concurrent users? I am happy to go through the current setup and configuration to get to the root of the problem, which is now causing so much frustration to users.
You can change the #define for WINBINDD_MAX_SIMULTANEOUS_CLIENTS in include/local.h to bump up the limit. The restriction was put into place to prevent winbindd from running out of fds.
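For illustration, the change amounts to editing one constant in include/local.h and recompiling; the exact value and surrounding context shown here are an assumption, as they vary between Samba versions:

/* include/local.h -- hypothetical excerpt; check your source tree */
/* Maximum number of simultaneous winbindd client connections. */
#define WINBINDD_MAX_SIMULTANEOUS_CLIENTS 200   /* raise as needed */

After rebuilding, restart winbindd. If you raise this significantly, also check the process fd limit (ulimit -n), since each client connection consumes a file descriptor.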
I have the same problem. When I changed WINBINDD_MAX_SIMULTANEOUS_CLIENTS to 500 I received the message "winbindd: Exceeding 500 client connections, no idle connection found". After this, I tried changing WINBINDD_MAX_SIMULTANEOUS_CLIENTS to 1500, and this time winbindd hung and logged the message "PANIC: assert failed at nsswitch/winbindd.c(394)". Then I increased the number of fds (ulimit -n 4096) to discover if this was the problem, but it wasn't.
So obviously killing old clients does not work for you. I'm afraid I'm busy this week; I'll see if I can look over that code during the evenings, at the latest on the weekend.

Volker
Just tested with 3.0.20b and with current code: for me, killing idle connections works fine. It must be something in your environment, I think. What do your processes do? Do you have domain controllers behind a slow network? Do you frequently enumerate users and groups, for example?

Volker
I do not know how to delete idle connections. Anyway, the error shows there are no idle connections for me to delete. A few questions, though, to help me understand this problem much better:

1) How do I know how many connections exist at any instant, so I can monitor the overloading problem? There are only 3 winbindd processes when I did 'ps waux | grep winbindd'.

2) Why 200 connections, why not 300 or more? I have a powerful server serving 2000 users in a school. Shouldn't it be configured automatically based on the hardware spec, and also be manually configurable in smb.conf?

3) AD support with the winbind technique is only new in version 3.x. How well does it cope with a busy network like my school, where each user has 3-4 simultaneous mapped drives to the Samba system and runs heavy network applications (movies, design tools, databases, office applications, etc.)?

I am facing multiple problems: network connections dropping, applications running slowly, Samba crashing. There is definitely an overloading problem, and I am clueless how to resolve it. There is a lot of frustration among teachers and students, as it is causing major disruption to lessons.
I am using Debian Sarge (stable) and have 3 Samba/LDAP domains. The other 67 domains are Windows NT for the moment, as we have a project to migrate them all to Samba/LDAP. Some domains have low-speed connections - about 20% of them. The remaining domains are connected via optical fiber. The problem takes place on the proxy servers, where we also use Debian Sarge and Squid (2.5.9-10sarge2). All users (about 1000 simultaneously) are authenticated via NTLM, using winbind, and only the users belonging to the group 'domain\lib_internet' may access the Internet. The script we use to check if the user belongs to the group 'domain\lib_internet' is:

---------------------------------
#!/usr/bin/perl -w
#
# External_acl helper for Squid to verify NT domain group
# membership using wbinfo.
#
# This script verifies a user's membership in groups from
# the respective domain, enabling localized access control.
#
# external_acl uses shell-style lines in its protocol
require 'shellwords.pl';

# Disable output buffering
$|=1;

# enable debug
$DEBUG = 1;
$LOG = "/var/log/samba/log_wbinfo_domain_group";

# open the log
open(LOG,">>".$LOG) if ($DEBUG);

# write debug output to the log
sub debug {
    print LOG "@_" if ($DEBUG);
}

#
# Check if a user belongs to a group
#
sub check {
    local($domain, $user, $group) = @_;
    &debug("check data domain($domain) user($user) group($group)\n");

    # qualify user and group with their respective domains
    if ($domain ne "default domain"){
        $user = $domain."\\".$user;
        $group = $domain."\\".$group;
    }

    &debug("running \`wbinfo -n \"$group\"\`\n");
    $groupSID = `wbinfo -n "$group"`;
    chop $groupSID;

    &debug("running \`wbinfo -Y \"$groupSID\"`\n");
    $groupGID = `wbinfo -Y "$groupSID"`;
    chop $groupGID;

    &debug("user: $user\ngroup: $group\nSID: $groupSID\nGID: $groupGID\n");

    &debug("running \`wbinfo -r \"$user\"`\n");
    return 'OK' if(`wbinfo -r "$user"` =~ /^$groupGID$/m);
    return 'ERR';
}

#
# Main loop
#
while (<STDIN>) {
    chop;
    &debug("Got $_ from squid\n");

    # split user and domain
    @data = split(/[\\\/\|]+/,$_);

    # if a \ was found, the first field is a domain
    $domain = "default domain";
    $domain = shift(@data) if (@data>1);

    @data = split(/\ +/,$data[0]);
    $user = $data[0];
    $group = $data[1];
    for ($i=2;$i<@data;$i++){
        $group .= " ".$data[$i];
    }

    # verify user in group
    $ans = &check($domain, $user, $group);
    &debug("Sending $ans to squid\n");
    print "$ans\n";
}
close(LOG);
---------------------------------

The script checks the groups the user belongs to with the command wbinfo -r 'domain\user'.

PS: This message corresponds to a bug that was not opened by me. Do I carry on in the same bug, or do I have to open another?
A bit more info to answer Volker's question. All servers (2 PDCs + Samba) are on a gigabit network. On enumerating users and groups: I initially used it for the proftpd package, but later moved to MySQL authentication and set enum users and groups to no, hoping this would help speed things up, but still no good.
I have the same problem too. Samba 3.0.20b, FreeBSD 5.4, Windows 2003. I rolled back to the old version.
Maybe the problem is in using mod_ntlm (0.4).
Hi, I have the same problem. My network has 2 Samba PDCs with an interdomain trust over an OpenVPN connection, and only 30 client computers! And when it occurs, the interdomain trust fails! My logwatch says:

nsswitch/winbindd.c:process_loop(844)
winbindd: Exceeding 200 client connections, no idle connection found : 983 Time(s)
nsswitch/winbindd.c:process_loop(863)
winbindd: Exceeding 200 client connections, no idle connection found : 959 Time(s)

Thanks and regards
Severity should be determined by the developers, not the reporter.
Can anyone reproduce this on something other than FreeBSD? I am not seeing any problems here.
Last night my logwatch reported the log below. I'm using Fedora 4 with two Samba PDC-LDAP trusting domains over VPN. The whole network has 40 computers!

nsswitch/winbindd.c:process_loop(813)
winbindd: Exceeding 200 client connections, no idle connection found : 1507 Time(s)
nsswitch/winbindd.c:process_loop(832)
winbindd: Exceeding 200 client connections, no idle connection found : 1445 Time(s)
Happens for me on Debian unstable with samba 3.0.23b as well. The problem did not exist with 3.0.22. When the error comes up I only have one winbindd running; normally it's four. Since I was tweaking the winbind configuration a bit, it might have come up because I set

winbind nested groups = yes

I will set it back to winbind nested groups = no and report back if that changes anything.

Regards, Harald
winbindd also dies with winbind nested groups = yes
Got the same problem on 3 proxy servers (Squid + Samba):

winbindd[5910]: [2006/08/22 10:56:00, 0] nsswitch/winbindd.c:process_loop(863)
winbindd[5910]: winbindd: Exceeding 200 client connections, no idle connection found

several times...

Samba version: 3.0.22-1
OS: Red Hat ES 4

The servers have joined a domain with 2 PDCs on Windows 2003. When I get these errors, "net ads testjoin" is OK, but "wbinfo -t" doesn't answer anything. Errors happen almost every day, on one or all servers; it depends, I think, on the load of the servers. Because of this, Squid doesn't authenticate users, and users time out. I need to restart Squid and winbindd to resolve the issue.
I can confirm the bug on FreeBSD 6.1 with samba 3.0.23c, compiled using the FreeBSD ports system. I had to roll back my installation to the pre-compiled version from the installation CD (3.0.21) for it to work.
OS: Red Hat Enterprise Linux AS 4 Update 3
Arch: x86_64
Samba version: 3.0.23c-4

The system is configured as a Squid proxy using AD membership. It shows the same error as listed here when the system is under high load. I was able to resolve the issue by reverting to the Red Hat included version (3.0.10-1.4E.2).
While staring at this problem, I noticed a small opportunity for winbind to leak fds:

--- samba/source/nsswitch/winbindd.c	(revision 176)
+++ samba/source/nsswitch/winbindd.c	(working copy)
@@ -602,8 +602,10 @@
 
 	/* Create new connection structure */
 
-	if ((state = TALLOC_ZERO_P(NULL, struct winbindd_cli_state)) == NULL)
+	if ((state = TALLOC_ZERO_P(NULL, struct winbindd_cli_state)) == NULL) {
+		close(sock);
 		return;
+	}
 
 	state->sock = sock;
Applied - thanks ! Jeremy.
There is an FD leakage issue in winbindd, fixed with this patch:

Index: nsswitch/winbindd.c
===================================================================
--- nsswitch/winbindd.c	(revision 183)
+++ nsswitch/winbindd.c	(working copy)
@@ -870,6 +870,8 @@
 			winbind_child_died(pid);
 		}
 	}
+
+	close_winbindd_socket();
 }
 
 /* Main function */

(Thanks to Brent Snow for helping me track this down)
Great catch - thanks ! But I don't see why we're opening these sockets in this function at all... Surely we should be doing this once before calling it. I bet we were doing that in the past and it got moved. I'll check into this. Jeremy.
Wow - this has been broken a loooong looong time. Traced back to 3.0.14a and it's still there.... Jeremy.
Created attachment 2351 [details] Proposed patch. Can you try this patch instead please. I think this is correct as it prevents these sockets from being continually closed and reopened. Jeremy.
Ah - ok, I'm not sure this is the bug you think it is. Look *carefully* at open_winbindd_socket():

static int _winbindd_socket = -1;
static int _winbindd_priv_socket = -1;

int open_winbindd_socket(void)
{
	if (_winbindd_socket == -1) {
		_winbindd_socket = create_pipe_sock(
			WINBINDD_SOCKET_DIR, WINBINDD_SOCKET_NAME, 0755);
		DEBUG(10, ("open_winbindd_socket: opened socket fd %d\n",
			   _winbindd_socket));
	}
	return _winbindd_socket;
}

Note that '_winbindd_socket' is a *static* int which is returned without modification if it's not already -1. This means that open_winbindd_socket() creates the socket only on the first call, and on all subsequent calls just returns the existing socket. So your patch causes a race condition: there is a window during which the socket is closed and not accepting connections, and a client connection would fail. My patch doesn't suffer from that, but is in fact unneeded, as I think the original code works well as designed (it's just a little unclear).

How did you track this bug down, and what did you use to confirm that this indeed fixed the fd leak?

Jeremy.
(In reply to comment #25)
> How did you track this bug down, and what did you use to confirm that this
> indeed fixed the fd leak?

Good questions. Here's the story. We distribute a slightly modified Samba to customers - one that uses our krb5 product - and customers have noticed the 'Exceeding 200 client connections' message. One used lsof on Linux RHEL4 i386, showing that just one of the 4 winbindd processes had an enormous number of fds open. I'll attach an excerpt of the lsof output in the next comment... I haven't been able to reproduce the cause myself. It seems elusive.
Created attachment 2353 [details] lsof on rhel4 showing a winbindd with fd leak (3.0.23c)
(In reply to comment #21)
> There is an FD leakage issue in winbindd, fixed with this patch:

Doh.. I meant to qualify that as 'possibly' fixed.
What was the position of the fd-leaking winbindd in the process tree? Was it the parent or one of the children? This might be important.

I'm assuming you haven't given your patch to the customer? Or did you give it to them and it fixed the problem?

Jeremy.
(In reply to comment #29)
> What was the position of the fd-leaking winbindd in the process tree? Was it
> the parent or one of the children? This might be important.

I didn't ask for a ps, so I don't have ppid info, sorry. I'll ask.

> I'm assuming you haven't given your patch to the customer? Or did you give it
> to them and it fixed the problem?

They've probably gone home for the evening - I sent them a patched winbindd binary to try, but I have doubts it will work because of your correct analysis that re-opening an already-open priv socket is a no-op. But it could be interesting.

I doubt it is close-on-exec leakage, because the PIDs look too low for that to have happened a thousand times (assuming Linux is allocating sequential pids):

nmbd 25044
smbd 25051
winbindd 25058

Although, now I look carefully at the processes listed in the lsof output, it seems that smbd/nmbd/winbindd have been started multiple times - i.e. the init script didn't shut down a previous process group.

So.. how about I come back to you with decent information later, instead of a half-baked patch :(
Created attachment 2355 [details]
Wild guess :-)

So here's a completely insane guess, which I don't think is right based on my understanding of UNIX socket semantics, but here you go. If the fd leak is in one of the domain children, there is just a very outside chance that, because it's inheriting its parent's sockets in the listen state, some bug is causing clients to have sockets created in that child. I know, I know - where's the accept() call, I hear you ask.... But just to close off that possibility, as I'm stuck for ideas here, here's a patch that would stop that happening. If it could. Which it can't :-).

Jeremy.
The do loop in winbindd's new_connection() is probably not necessary, although I have no idea how it could be triggered into an EINTR spin.
Actually it is necessary. winbindd uses tdb messaging, which is triggered by a SIGUSR1 that can arrive at any time. All "slow" system calls must be wrapped in an EINTR loop, and accept is certainly one of those (look at the EINTR loop wrappers for most system calls in lib/system.c).

Jeremy.
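For readers unfamiliar with the idiom, here is a minimal sketch of the kind of EINTR-retry wrapper described above; it is an illustration, not the actual lib/system.c code:

#include <errno.h>
#include <sys/socket.h>

/* Retry accept() until it completes or fails with a real error.
 * A signal (e.g. the SIGUSR1 used for tdb messaging) interrupting
 * the call just causes another loop iteration. */
static int sys_accept_sketch(int fd, struct sockaddr *addr, socklen_t *len)
{
	int ret;

	do {
		ret = accept(fd, addr, len);
	} while (ret == -1 && errno == EINTR);

	return ret;
}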
I've looked through the SAMBA_3_0_25 tree and I don't see how this can happen either, unless you really have a large number of clients. Possibly a high number of long-lived smbd processes could trigger it, but then that is not a leak. I'm lowering the priority until someone can supply a reproducible test case for us.
In case it can help, more info, including maybe a test case, is in Debian bug #410663: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=410663 Apparently, our user there is able to reproduce the bug in 3.0.24.
I looked at the Debian bug report. It's essentially the same as this one - no real reproducible test case. Jeremy.
Is there any news on this? It's just that one of our servers is hitting this bug at this very moment, and I'm not sure what I can do about it. I already increased the limit WINBINDD_MAX_SIMULTANEOUS_CLIENTS to 500, but winbind had no problem reaching it within 15 minutes. Has anybody tried one of the suggested patches?

This server is running with an LDAP backend, integrated into a Win2003 ADS domain with about 20000 users and groups (and counting...). The version is 3.0.22. The problem started when we removed winbind's .tdb files and restarted it, because it could not resolve some SIDs (they were missing in winbindd_idmap.tdb).

Peter
No one can give us any reproducible test case, so we are kind of blocked on it. If you can help us figure out what conditions trigger the problem, we'll be glad to fix it.
Would it help if I started winbindd with a higher log level and sent you the logfile?
The problem on my server persists after I installed 3.0.25. CPU usage goes to 100%! How can I help find the solution in this case? What can I send you?
Same problem, using Samba in "Domain" security mode. The network has 2 Win2003 domain controllers.

security = DOMAIN
idmap uid = 15000-20000
idmap gid = 15000-20000
winbind use default domain = Yes

[2007/05/25 15:23:54, 1] nsswitch/winbindd.c:main(953)
  winbindd version 3.0.23c started.
  Copyright The Samba Team 2000-2004
[2007/05/25 15:39:14, 0] nsswitch/winbindd.c:process_loop(813)
  winbindd: Exceeding 200 client connections, no idle connection found

After these log events start, "getent passwd <username>" hangs with no result; before this log event, getent is fine.
(In reply to comment #41)
> Same problem, using Samba in "Domain" security mode. The network has 2 Win2003
> domain controllers.
>
> security = DOMAIN
> idmap uid = 15000-20000
> idmap gid = 15000-20000
> winbind use default domain = Yes
>
> [2007/05/25 15:23:54, 1] nsswitch/winbindd.c:main(953)
>   winbindd version 3.0.23c started.
>   Copyright The Samba Team 2000-2004
> [2007/05/25 15:39:14, 0] nsswitch/winbindd.c:process_loop(813)
>   winbindd: Exceeding 200 client connections, no idle connection found
>
> After these log events start, "getent passwd <username>" hangs with no result;
> before this log event, getent is fine.

And it should be noted this is on FreeBSD 6.2 STABLE-RELEASE.
I wonder if there's some process not closing a getpwent or getgrent loop. Can you attach the output of "ps ax" when this happens? Maybe it gives a hint. And for the "lsof" output, we need the other end of that socket, not just the winbind end of it. Volker
I'm pretty sure the repro case is to send a SIGSTOP to the child process for our domain and let the async request states build up in the parent winbindd.
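For anyone wanting to try this, a sketch of the reproduction idea, with placeholder PIDs (the exact commands are an assumption, not taken from this report):

# list the winbindd parent and its children
ps -o pid,ppid,args -C winbindd
# freeze a domain child so async request states pile up in the parent
kill -STOP <child-pid>
# ... generate lookups (getent passwd, wbinfo queries, ...) and watch the
# parent's log for "Exceeding 200 client connections" ...
kill -CONT <child-pid>    # resume the child afterwards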
Created attachment 2747 [details] Patch Jerry got this reproducible. Here is a patch I created that fixes it in his testing. This was a complex logic case :-). Jeremy.
*** Bug 4089 has been marked as a duplicate of this bug. ***
Finally fixed for 3.0.25c
Hi,

I've just built Samba 3.0.25c but am still finding that one of the winbindd processes becomes very large (in terms of memory usage):

# ps -elfy | grep winbindd
S root 4189    1 0 75 0   4396   2805      - 07:37 ? 00:02:31 /usr/sbin/winbindd -D
S root 4191 4189 0 85 0   3604   2614 429496 07:37 ? 00:00:00 /usr/sbin/winbindd -D
S root 4194 4189 0 75 0 426908 108348 429496 07:38 ? 00:03:59 /usr/sbin/winbindd -D
S root 4195 4189 0 75 0   2640   2355 429496 07:38 ? 00:00:00 /usr/sbin/winbindd -D
S root 4222 4189 0 84 0   3348   2439 429496 07:38 ? 00:00:01 /usr/sbin/winbindd -D
S root 4380 4189 0 75 0   1896   1928 429496 08:33 ? 00:00:00 /usr/sbin/winbindd -D

i.e. over 400 MB of memory. Eventually the process becomes so large that the oom-killer kicks in and starts killing random processes. I thought that this bug fix might have resolved this problem (if they are related). Is winbindd supposed to get this large? If so, why? I am seeing this on both the BDC and PDC. The Samba domain is in an NT trust relationship with a Win2k3 domain.

Regards,
Patrick
(In reply to comment #48)
> Hi,
>
> I've just built Samba 3.0.25c but am still finding that
> one of the winbindd processes becomes very large
> (in terms of memory usage):

This has nothing to do with the original bug report. Please open a new one. Thanks.
Hello,

we are experiencing the same bug (Exceeding 200 client connections, no idle connection found) on Samba 3.0.26a, although it should have been solved in 3.0.25c. We are using the SerNet-provided RPMs for RedHat RHEL4u5:

samba3-3.0.26a-35
samba3-utils-3.0.26a-35
samba3-client-3.0.26a-35
samba3-winbind-3.0.26a-35
(Version 3.0.26a-SerNet-RedHat)

From the times of occurrence of the errors (outside and inside business hours), it seems unlikely that the maximum number of connections is really being reached by user activity. Is there any possibility that the patch is not working or is not included in 3.0.26a?

Regards,
Tom
(In reply to comment #50)
> Is there any possibility that the patch is not working or is not included in
> 3.0.26a?

Unless you can provide some more details, it's hard to comment. The bug fix is in 3.0.26a and appears to be working well in my environments.
I'm still experiencing this exact same problem (Exceeding 200 client connections, no idle connection found) on Fedora 7 with Samba version 3.0.27a and kernel 2.6.23. The server is an Intel Core 2 Duo based machine with an Asus server motherboard, 4GB of memory and a 750GB RAID5.

My server is used by 10-15 engineers who primarily use it as a public file store, for CAD data storage, Outlook PST file storage, and for launching Pro/Engineer Wildfire 2.0.

This issue popped up for me about 3 months ago. At that time winbind would crash on me about once a week and I would have to restart it manually. Now it has become so bad that I'm spending most of my day babysitting winbind to make sure my engineers can get some work done, because it is crashing every 15 minutes or so.

I'm willing to provide whatever information I can to help resolve this bug, because it is currently costing our company thousands of dollars in lost time each week, and my boss is breathing down my neck about it.
(In reply to comment #52)
> I'm still experiencing this exact same problem (Exceeding 200 client
> connections, no idle connection found) on Fedora 7 with Samba version 3.0.27a
> and kernel 2.6.23. The server is an Intel Core 2 Duo based machine with an
> Asus server motherboard, 4GB of memory and a 750GB RAID5.
>
> My server is used by 10-15 engineers who primarily use it as a public file
> store, for CAD data storage, Outlook PST file storage, and for launching
> Pro/Engineer Wildfire 2.0.

That doesn't help me to understand which file descriptors are open. I need details from /proc/, ps, truss, etc....

> This issue popped up for me about 3 months ago. At that time winbind would
> crash on me about once a week and I would have to restart it manually. Now it
> has become so bad that I'm spending most of my day babysitting winbind to
> make sure my engineers can get some work done, because it is crashing every
> 15 minutes or so.
>
> I'm willing to provide whatever information I can to help resolve this bug,
> because it is currently costing our company thousands of dollars in lost time
> each week, and my boss is breathing down my neck about it.

This bug report has nothing to do with crashes. Please file that as a separate bug and attach a gzipped tarball of any log files and configuration files you have that are relevant. Thanks.
(In reply to comment #53)
> That doesn't help me to understand which file descriptors are open. I need
> details from /proc/, ps, truss, etc....

What exactly do you need? I'm not really a system administrator (I'm actually a mechanical engineer), so I'm not really sure what you are talking about.

> This bug report has nothing to do with crashes. Please file that as a
> separate bug and attach a gzipped tarball of any log files and configuration
> files you have that are relevant. Thanks.

I really shouldn't have said that it crashed. It just hangs when it starts spitting out the "winbindd: Exceeding 200 client connections, no idle connection found" messages.
(In reply to comment #54)
> (In reply to comment #53)
> > That doesn't help me to understand which file descriptors are open. I need
> > details from /proc/, ps, truss, etc....
>
> What exactly do you need? I'm not really a system administrator (I'm actually
> a mechanical engineer), so I'm not really sure what you are talking about.

It has been two weeks and I still haven't heard what is needed from me. Our Samba server has reached the point of being completely unusable. Because of this, all of our files have been moved to a Windows 2003 server. I would like to help get this bug squashed!
Probably you need to get local support from someone able to log into the box to get the necessary information. We need to know why winbind is hanging and where. Doing this remotely is very difficult; there can be a million reasons why a process might get stuck. Someone with experience in analyzing this kind of problem might very quickly find something that we will never find via sending each other exact explanations of what to do next.

Sorry,

Volker
(In reply to comment #56)
> Probably you need to get local support from someone able to log into the box
> to get the necessary information. We need to know why winbind is hanging and
> where. Doing this remotely is very difficult; there can be a million reasons
> why a process might get stuck. Someone with experience in analyzing this kind
> of problem might very quickly find something that we will never find via
> sending each other exact explanations of what to do next.
>
> Sorry,
>
> Volker

Volker,

I have full access to this machine. I can provide any logs, configuration files, command outputs, or whatever else is required to figure out what is going on. I have checked everything I know to check, and the only thing I can come up with is the fact that I get "winbindd: Exceeding 200 client connections, no idle connection found" when winbindd stops answering authentication requests. Do I need to dump the contents of /proc/winbind_process_id into a zip file whenever things start acting up?
You need to find out which of the winbind child processes hang, and where. For this you need to "strace -p" all the winbind processes while winbind is stuck. You will very likely see that winbind sits in a select() system call. If it is something else, for example fcntl64(), that would be very suspicious and worth further investigation.

If it does sit in fcntl, you will see the file descriptor as the first argument of that syscall; from there you can see which of the tdb files this is via "ls -l /proc/<winbind-pid>/fd".

If it is a select system call, it is very likely not the faulty process. What could also be is that it sits in a read() syscall. From there some detective work is necessary to see which kind of file descriptor this is. It might be interesting at this point to attach to the suspect process with gdb and do a backtrace with "bt". This output might be very valuable.

If you find out that your winbind waits for a domain controller in a select or read system call, you might want to do a subsequent tcpdump, filtering out just this domain controller, to find out what sequence of requests winbind is doing and which ones lead to the lock-up.

It might also be something I just overlooked: it might very well be possible that one of your tdb files is corrupt and winbind enters a 100% CPU loop, thus not being able to reply at all anymore.

Volker
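Condensed into commands, the triage described above looks roughly like this; <winbind-pid> and <dc-ip> are placeholders:

pgrep winbindd                    # list all winbindd processes
strace -p <winbind-pid>           # which syscall is each one sitting in?
ls -l /proc/<winbind-pid>/fd      # map a suspect fd (e.g. from fcntl64) to its tdb file
gdb -p <winbind-pid>              # then 'bt' for a backtrace, 'detach' to leave
tcpdump -w dc.pcap host <dc-ip>   # capture the traffic to the suspect domain controller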
Ok, I'm seeing this bug now, and here's what I see:

[root@docs ~]# pgrep winbind
677
915
1124
1286
1409
1547
1858
2285
...
[root@docs ~]# strace -p 677
Process 677 attached - interrupt to quit
futex(0x2a96601640, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 677 detached
[root@docs ~]# strace -p 915
Process 915 attached - interrupt to quit
futex(0x2a96601640, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 915 detached
[root@docs ~]# strace -p 1124
Process 1124 attached - interrupt to quit
futex(0x2a96601640, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 1124 detached
[root@docs ~]# strace -p 1286
Process 1286 attached - interrupt to quit
futex(0x2a96601640, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 1286 detached
[root@docs ~]# strace -p 1409
Process 1409 attached - interrupt to quit
futex(0x2a96601640, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 1409 detached
[root@docs ~]# strace -p 1547
Process 1547 attached - interrupt to quit
futex(0x2a96601640, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 1547 detached
[root@docs ~]# strace -p 1858
Process 1858 attached - interrupt to quit
futex(0x2a96601640, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 1858 detached
[root@docs ~]# strace -p 2285
Process 2285 attached - interrupt to quit
select(390, [12 13 14 389], [], NULL, {20, 346000} <unfinished ...>

Also, these winbindd's have lots of open fds:

[root@docs ~]# pgrep winbind | while read WBPROC; do echo -n "$WBPROC: "; ls -Fla /proc/$WBPROC/fd | grep socket | wc -l; done
677: 245
915: 286
1124: 313
1286: 338
1409: 354
1547: 364
1858: 371
2285: 375
2293: 6
22198: 8
22199: 7
23400: 43
23442: 103
23986: 126
24033: 164
24078: 207
24135: 207
32644: 225

Looking at the lsof output, most of these fds are winbind->winbind. Like another user reported, I'm also using the ldap backend against an AD server. This server doesn't see much use; just a handful of users.
We have seen this problem as well with a couple of customers, but only once each. Environment is Linux 2.6.14 and Samba 3.0.25, but it could also be present with 3.0.31. Now that I have stumbled across this bug, I have more information that I can use to figure out the root cause.
Ahhhh, here, I suspect, is the problem. Here is a log entry:

lib/util_tdb.c:tdb_chainlock_with_timeout_internal(84)
  tdb_chainlock_with_timeout_internal: alarm (40) timed out for key ranger.msbdomain.lan in tdb /etc/samba/secrets.tdb
[2008/12/21 10:33:51.959971, 0, pid=17551/winbindd]
  nsswitch/winbindd_cm.c:cm_prepare_connection(644)
  cm_prepare_connection: mutex grab failed for <dc name redacted>

What has happened here is twofold:

1. The code in 3.0.25 (up to and including possibly 3.0.30) had a bug in it, because we were not properly handling timeouts in the brlock code in tdb/common/lock.c. The timeout handler would be called in tdb_chainlock_with_timeout_internal, but the loop in tdb/common/lock.c:tdb_brlock did this:

	do {
		ret = fcntl(tdb->fd,lck_type,&fl);
	} while (ret == -1 && errno == EINTR);

which took us straight back into the fcntl. If the problem was simply that some other process (winbindd?) had held the lock for an extraordinary period of time (longer than the 40 second timeout), the timeout handler would fire, but then we would go back to waiting on the lock.

2. When we finally got the lock, we would return to tdb_chainlock_with_timeout_internal, which had a bug. It just looked at the timeout count and, if non-zero, returned an error. Now the process that was waiting for the lock has the lock/mutex but does not know it, and is unlikely to release the lock. This would be more likely if multiple processes were waiting for the mutex ...

I alerted Jeremy to a race in the 3.0.31-and-above code and he has fixed that in the latest release, so I think this problem will be fixed by really upgrading to the latest release, or by backporting the single-line change. The change is roughly this:

diff --git a/source3/lib/util_tdb.c b/source3/lib/util_tdb.c
index bb568bc..8ceaa46 100644
--- a/source3/lib/util_tdb.c
+++ b/source3/lib/util_tdb.c
@@ -64,7 +64,7 @@ static int tdb_chainlock_with_timeout_internal( TDB_CONTEXT *tdb, TDB_DATA key,
 	alarm(0);
 	tdb_setalarm_sigptr(tdb, NULL);
 	CatchSignal(SIGALRM, SIGNAL_CAST SIG_IGN);
-	if (gotalarm) {
+	if (gotalarm && (ret == -1)) {
 		DEBUG(0,("tdb_chainlock_with_timeout_internal: alarm (%u) timed out for key %s in tdb %s\n",
 			timeout, key.dptr, tdb_name(tdb)));
 		/* TODO: If we time out waiting for a lock, it might
Adding me to the CC list.
Hey Jerry,

Can we close this bug now?

I believe that I have explained what happens and that it is definitely fixed in 3.0.34 (or whatever is next) and, modulo the race, which is much harder to hit, is fixed in at least 3.0.31.
(In reply to comment #63)
> Hey Jerry,
>
> Can we close this bug now?
>
> I believe that I have explained what happens and that it is definitely fixed
> in 3.0.34 (or whatever is next) and, modulo the race, which is much harder to
> hit, is fixed in at least 3.0.31.

Agreed.
Was fixed in 3.2.8 and above (3.3.0+). Jeremy.
The same problem with samba-3.6.14; OK with samba-3.6.13.

After upgrading from samba-3.6.13 to samba-3.6.14, in the winbind log:

[2013/05/06 07:07:43.830181, 0] winbindd/winbindd.c:947(winbindd_listen_fde_handler)
  winbindd: Exceeding 500 client connections, no idle connection found

From smb.conf:

winbind cache time = 1200
winbind use default domain = yes
winbind refresh tickets = yes
winbind offline logon = yes
winbind enum users = yes
winbind enum groups = yes
winbind nss info = template
winbind nested groups = yes
winbind max clients = 500
idmap uid = 10000-100000
idmap gid = 10000-100000
idmap cache time = 1200
Got the same problem with the samba 3.6.12+ stack. In fact we got this issue twice for one of our customers.

[2014/07/17 11:23:02.223054, 0] winbindd/winbindd.c:947(winbindd_listen_fde_handler)
  winbindd: Exceeding 400 client connections, no idle connection found
[2014/07/17 11:23:02.224055, 0] winbindd/winbindd.c:947(winbindd_listen_fde_handler)
  winbindd: Exceeding 400 client connections, no idle connection found

winbindd went unresponsive, and we found that there were a lot (~10K) of open file handles for this stuck winbindd process. We had to kill this process to restore user access/connectivity.

....
....
3378 winbindd 10294 s - rw------ 1 0 UDS /usr/local/var/locks/winbindd_privileged/pipe
3378 winbindd 10295 s - rw------ 1 0 UDS /usr/local/var/locks/winbindd_privileged/pipe
3378 winbindd 10296 s - rw------ 1 0 UDS /usr/local/var/locks/winbindd_privileged/pipe

Here is the gdb stack info of the stuck winbindd process:

=== Dumping Process winbindd (3380) ===
[Switching to Thread 8030021c0 (LWP 101095)]
0x00000008026cd5ec in poll () from /lib/libc.so.7

Thread 1 (Thread 8030021c0 (LWP 101095)):
#0  0x00000008026cd5ec in poll () from /lib/libc.so.7
#1  0x00000008010a27fe in poll () from /lib/libthr.so.3
#2  0x0000000801ea0ee9 in wait4msg (result=<optimized out>, timeout=<optimized out>, all=<optimized out>, msgid=<optimized out>, ld=<optimized out>) at result.c:312
#3  ldap_result (ld=0x80302eac0, msgid=5, all=1, timeout=<optimized out>, result=0x7fffffffc258) at result.c:117
#4  0x0000000801ea8348 in ldap_sasl_bind_s (ld=0x80302eac0, dn=0x0, mechanism=0xa00760 "GSS-SPNEGO", cred=0x7fffffffc370, sctrls=0x0, cctrls=<optimized out>, servercredp=0x7fffffffc400) at sasl.c:194
#5  0x0000000000821a1d in ads_sasl_spnego_rawkrb5_bind (principal=<optimized out>, ads=<optimized out>) at libads/sasl.c:795
#6  ads_sasl_spnego_krb5_bind (ads=0x803002a80, p=<optimized out>) at libads/sasl.c:823
#7  0x0000000000822545 in ads_sasl_spnego_bind (ads=0x803002a80) at libads/sasl.c:904
#8  0x000000000082055d in ads_sasl_bind (ads=0x803002a80) at libads/sasl.c:1213
#9  0x000000000081f8e4 in ads_connect (ads=0x803002a80) at libads/ldap.c:730
#10 0x00000000004a78b0 in ads_cached_connection (domain=0x80305f200) at winbindd/winbindd_ads.c:131
#11 0x00000000004a7ba0 in sequence_number (domain=0x80305f200, seq=0x80305f718) at winbindd/winbindd_ads.c:1262
#12 0x0000000000492b31 in refresh_sequence_number (domain=0x80305f200, force=<optimized out>) at winbindd/winbindd_cache.c:558
#13 0x0000000000492df4 in wcache_fetch (cache=<optimized out>, domain=0x80305f200, format=0x8b9580 "GM/%s") at winbindd/winbindd_cache.c:711
#14 0x0000000000493ba6 in wcache_lookup_groupmem (domain=0x80305f200, mem_ctx=0x803009290, group_sid=<optimized out>, num_names=0x7fffffffcf84, sid_mem=0x7fffffffcf78, names=0x7fffffffcf70, name_types=0x7fffffffcf68) at winbindd/winbindd_cache.c:2615
#15 0x0000000000493e58 in lookup_groupmem (domain=0x8030af004, mem_ctx=0x1, group_sid=0x80308c3d0, type=SID_NAME_DOM_GRP, num_names=0x1, sid_mem=0x7fffffffcf78, names=0x7fffffffcf70, name_types=0x7fffffffcf68) at winbindd/winbindd_cache.c:2673
#16 0x00000000004b1be4 in _wbint_LookupGroupMembers (p=0x7fffffffd000, r=0x803044140) at winbindd/winbindd_dual_srv.c:347
#17 0x00000000004bad16 in api_wbint_LookupGroupMembers (p=0x7fffffffd000) at librpc/gen_ndr/srv_wbint.c:1271
#18 0x00000000004b04a2 in winbindd_dual_ndrcmd (domain=0x80305f200, state=0x7fffffffe800) at winbindd/winbindd_dual_ndr.c:322
#19 0x00000000004aecfd in child_process_request (state=<optimized out>, child=<optimized out>) at winbindd/winbindd_dual.c:495
#20 fork_domain_child (child=<optimized out>) at winbindd/winbindd_dual.c:1609
#21 wb_child_request_trigger (req=<optimized out>, private_data=<optimized out>) at winbindd/winbindd_dual.c:200
#22 0x0000000000569970 in tevent_common_loop_immediate (ev=0x80301e110) at ../lib/tevent/tevent_immediate.c:139
#23 0x0000000000567c35 in run_events_poll (ev=0x80301e110, pollrtn=0, pfds=0x0, num_pfds=0) at lib/events.c:197
#24 0x0000000000568359 in s3_event_loop_once (ev=0x80301e110, location=<optimized out>) at lib/events.c:331
#25 0x0000000000568771 in _tevent_loop_once (ev=0x80301e110, location=0x8b41c9 "winbindd/winbindd.c:1456") at ../lib/tevent/tevent.c:494
#26 0x00000000004898a2 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at winbindd/winbindd.c:1456

Would like to know if someone has an idea on where exactly the FDs are leaking.
(In reply to comment #67)
> Would like to know if someone has an idea on where exactly the FDs are
> leaking.

If winbind children pile up waiting for something, this will inevitably happen. Maybe we should add an emergency "watchdog": if a winbind child does not reply within, say, 1 minute, call it dead. Start a new one, even if this would exceed the max winbind domain children. If it does not come back for an hour, just kill it. No winbind request at all should take more than an hour; to me it would be reasonable to really kill that helper process after an hour of busy time.

Comments?
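To make the proposal concrete, here is a minimal self-contained sketch of the policy described above; all names, types and helper structure are illustrative, not actual Samba code (only the 1-minute and 1-hour thresholds come from the comment):

#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <time.h>

struct wb_child_sketch {
	pid_t  pid;
	time_t last_reply;      /* last time the child answered a request */
	time_t busy_since;      /* when the current request started       */
	bool   presumed_dead;
};

/* Returns true if a replacement child should be forked, even if that
 * temporarily exceeds the maximum number of winbind domain children. */
static bool watchdog_check(struct wb_child_sketch *c, time_t now)
{
	if (!c->presumed_dead && (now - c->last_reply) > 60) {
		c->presumed_dead = true;        /* 1 minute: call it dead */
		return true;
	}
	if (c->presumed_dead && (now - c->busy_since) > 3600) {
		kill(c->pid, SIGKILL);          /* 1 hour busy: kill it   */
	}
	return false;
}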
(In reply to comment #68)
> If winbind children pile up waiting for something, this will inevitably
> happen. Maybe we should add an emergency "watchdog": if a winbind child does
> not reply within, say, 1 minute, call it dead. Start a new one, even if this
> would exceed the max winbind domain children. If it does not come back for an
> hour, just kill it. No winbind request at all should take more than an hour;
> to me it would be reasonable to really kill that helper process after an hour
> of busy time.
>
> Comments?

Actually, in our case we did not see too many child winbindd processes. In fact there are only 3 winbindd processes. The parent/main winbindd was holding too many (>10K) FDs to the unix domain socket. I found the following piece of code in winbindd_listen_fde_handler():

{
	struct winbindd_listen_state *s = talloc_get_type_abort(private_data,
					  struct winbindd_listen_state);

	while (winbindd_num_clients() > lp_winbind_max_clients() - 1) {
		DEBUG(5,("winbindd: Exceeding %d client "
			 "connections, removing idle "
			 "connection.\n", lp_winbind_max_clients()));
		if (!remove_idle_client()) {
			DEBUG(0,("winbindd: Exceeding %d "
				 "client connections, no idle "
				 "connection found\n",
				 lp_winbind_max_clients()));
			break;
		}
	}
	new_connection(s->fd, s->privileged);
}

Here we seem to be allowing more connections than the configured "max clients", even if we don't find any idle/dead client connections. Because of this, the connection list keeps growing, and it reached 10K in our case. Also, on every new socket connection request, we iterate through this list to find an idle client connection. I think this keeps winbindd busy, making it almost unresponsive to clients. I would also like to understand why we allow new connections without limit when we exceed the max clients limit and no idle connections are found.

Also, I am trying to understand the reason for these winbindd client connections being held for such a long time. In my test setup, I can see the asynchronous requests getting processed quickly, closing the socket connection. But in this case, these connections are held for a long time. As Volker points out, it would be a good idea to make the sessions stale on exceeding a 60 second timeout. In this case, I assume no such request should take more than 60 seconds to be processed by winbindd.

Thanks,
Hemanth.
Created attachment 10127 [details]
Prototype patch for 3.6.x.

Ok Hemanth, can you give this (prototype) patch for 3.6.x a try? It adds a new [global] parameter:

winbind request timeout

with a default value of 60 (seconds). What it does is terminate every client connection that has either remained idle for 60 seconds or has not received a reply within 60 seconds. Initially I worried this was a little aggressive, but I don't think so - if a request has taken > 60 seconds it's almost certainly dead, and pruning idle clients after 60 seconds is also probably OK. Also, it's tuneable :-).

If this works for you I can forward-port it to 4.1.next and 4.0.next.

Cheers,

Jeremy.
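For reference, the tunable described above would be set in smb.conf along these lines; the parameter name and default come from the patch description, and the value shown is just an example:

[global]
	winbind request timeout = 120   # seconds; prune idle/unanswered clients after this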
(In reply to comment #70)
> Created attachment 10127 [details]
> Prototype patch for 3.6.x.
>
> Ok Hemanth, can you give this (prototype) patch for 3.6.x a try? It adds a
> new [global] parameter:
>
> winbind request timeout
>
> with a default value of 60 (seconds). What it does is terminate every client
> connection that has either remained idle for 60 seconds or has not received a
> reply within 60 seconds. Initially I worried this was a little aggressive,
> but I don't think so - if a request has taken > 60 seconds it's almost
> certainly dead, and pruning idle clients after 60 seconds is also probably
> OK. Also, it's tuneable :-).
>
> If this works for you I can forward-port it to 4.1.next and 4.0.next.
>
> Cheers,
>
> Jeremy.

Sure. Thanks, Jeremy. I will incorporate these changes and try to provide the patch to the customer. Since this issue is not reproducible in-house, we can expect some delay in getting feedback, but I will follow up with the customer and update soon.

Thanks,
Hemanth.
Comment on attachment 10127 [details]
Prototype patch for 3.6.x.

Ach - don't use this patch. The time logic is wrong (reversed). New patch shortly.
Created attachment 10128 [details]
Fixed patch for 3.6.x

Correctly tests the expiry time for requests in remove_timed_out_clients(). Sorry for the earlier error.
*** Bug 6825 has been marked as a duplicate of this bug. ***
*** Bug 6087 has been marked as a duplicate of this bug. ***
(In reply to comment #73)
> Created attachment 10128 [details]
> Fixed patch for 3.6.x

We couldn't reproduce the issue in-house. I have tweaked the code to introduce an intentional FD leak and verified that older connections are getting cleaned up. I have now ported this patch and pushed it to our QA. Will let you know if we hear anything internally or from customers.

Thanks,
Hemanth.
*** Bug 10573 has been marked as a duplicate of this bug. ***
*** Bug 9127 has been marked as a duplicate of this bug. ***
Created attachment 10160 [details]
git-am fix for 4.1.next and 4.0.next.

Back-ported from the fix that went into master.
We gave this patch as early access to one of our customers a few (4-5) days back. We have been monitoring the FDs used by winbindd and haven't seen the FD list growing so far (only 2 to 3 active UDS connections at any point in time). With this, I can say that the patch is working fine.

Also, I added a debug level zero message to see if we are cleaning up any idle or timed-out sessions. Every time, I could see only idle client connections getting cleaned up. Based on this, I assume that we did not have any timed-out/stuck client sessions; instead there were idle connections whose FDs were leaking. This is just my observation in our customer's setup.
Re-assigning to Karolin for inclusion in 4.1.next, 4.0.next.
*** Bug 6423 has been marked as a duplicate of this bug. ***
Pushed to autobuild-v4-[0|1]-test.
Pushed to both branches. Closing out bug report. Thanks!