Bug 3204 - winbindd: Exceeding 200 client connections, no idle connection found
Status: RESOLVED FIXED
Product: Samba 3.0
Classification: Unclassified
Component: winbind
Version: 3.0.20b
Hardware: x86 Linux
Importance: P2 normal
Target Milestone: none
Assigned To: Karolin Seeger
QA Contact: Samba QA Contact
Duplicates: 4089 6087 6423 6825 9127 10573
Depends on:
Blocks: 6423

Reported: 2005-10-23 23:44 UTC by Daniel
Modified: 2014-09-03 07:31 UTC
CC: 26 users

Attachments
Proposed patch. (1.73 KB, patch)
2007-03-28 18:05 UTC, Jeremy Allison
lsof on rhel4 showing a winbindd with fd leak (3.0.23c) (123.07 KB, text/plain)
2007-03-28 19:32 UTC, David Leonard
Wild guess :-) (404 bytes, patch)
2007-03-31 16:43 UTC, Jeremy Allison
Patch (2.67 KB, patch)
2007-06-11 17:29 UTC, Jeremy Allison
Prototype patch for 3.6.x. (3.51 KB, patch)
2014-07-18 19:01 UTC, Jeremy Allison
Fixed patch for 3.6.x (3.57 KB, patch)
2014-07-19 05:41 UTC, Jeremy Allison
git-am fix for 4.1.next and 4.0.next. (6.47 KB, patch)
2014-07-29 22:48 UTC, Jeremy Allison
ira: review+

Description Daniel 2005-10-23 23:44:36 UTC
Dear Samba,

It is becoming very stressful for me to find out why users are having
disrupted connectivity to Samba server file shares when it is set up
as a member server to 2 PDCs running Microsoft W2K3 Server.

I have read a Google posting about 'winbindd: Exceeding 200 client
connections, no idle connection found'. Upgrading to the latest version
doesn't solve the problem. Could this be the cause of all the problems I am
facing, or something else? Why is there a limit of 200 client connections,
when I have approximately 1000 concurrent users?

I am happy to go through the current setup and configuration to get to the
root of the problem, which is causing so much frustration for users.
Comment 1 Gerald (Jerry) Carter 2005-10-24 07:07:03 UTC
You can change the #define for WINBINDD_MAX_SIMULTANEOUS_CLIENTS
in include/local.h to bump up the limit.  The restriction was
put into place to prevent winbindd from running out of fds.
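For reference, the constant Jerry mentions is a compile-time limit of roughly this form (an assumed reconstruction; check include/local.h in your Samba release for the exact line and default):

```c
/* include/local.h (reconstruction -- verify against your release).
 * Raising this also requires a matching process fd limit (ulimit -n),
 * since each client connection consumes a file descriptor. */
#define WINBINDD_MAX_SIMULTANEOUS_CLIENTS 200
```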
Comment 2 Sergio Roberto Claser 2005-10-26 10:59:24 UTC
I have the same problem. When I changed WINBINDD_MAX_SIMULTANEOUS_CLIENTS to 500
I received the message "winbindd: Exceeding 500 client connections, no idle
connection found".
After this, I tried to change WINBINDD_MAX_SIMULTANEOUS_CLIENTS to 1500, and
this time winbindd hung and logged the message "PANIC: assert failed at
nsswitch/winbindd.c(394)".
Then I increased the number of fds (ulimit -n 4096) to see if that was the
problem, but it wasn't.
Comment 3 Volker Lendecke 2005-10-26 23:33:29 UTC
So obviously killing old clients does not work for you. I'm afraid I'm busy this
week, I'll see if I can look over that code during the evening, latest on the
weekend.

Volker
Comment 4 Volker Lendecke 2005-10-28 02:33:21 UTC
Just tested with 3.0.20b and with current code: For me killing idle connections
works fine. It must be something with your environment I think. What do your
processes do? Do you have domain controllers behind a slow network? Do you
frequently enumerate users and groups for example?

Volker
Comment 5 Daniel 2005-10-28 07:06:00 UTC
I do not know how to delete idle connections. Anyway, the error shows there are
no idle connections for me to delete. A few questions, though, to help me
understand this problem better:

1) How can I monitor the number of connections at any instant to spot an
overloading problem? There are only 3 winbindd processes when I did 'ps waux | grep winbindd'.
2) Why 200 connections, why not 300 or more? I have a powerful server serving
2000 users in a school. Shouldn't it be configured automatically based on the
hardware spec, and also be manually configurable in smb.conf?
3) AD support via winbind is new in version 3.x. How well does it cope with a
busy network like my school, where each user has 3-4 simultaneous mapped
drives to the Samba system, running heavy network applications like movies,
design tools, databases, office applications, etc.?

I am facing multiple problems: network connections dropping, applications
running slowly, Samba crashing. There is definitely an overloading problem
and I am clueless how to resolve it. Lots of frustration for teachers
and students, as it is causing major disruption to lessons.
Comment 6 Sergio Roberto Claser 2005-10-28 10:49:07 UTC
I am using Debian Sarge (Stable) and have 3 Samba/LDAP domains.

The other 67 domains are Windows NT, for the moment, as we have a project to
migrate them all to Samba/LDAP.

Some domains have low-speed connections - about 20% of them. The remaining
domains are connected via optical fiber. The problem takes place with Proxy
Servers, where we also use Debian Sarge and Squid (2.5.9-10sarge2). All users
(about 1000 simultaneously) are authenticated via NTLM, using winbind, and only
the users belonging to the group 'domain\lib_internet' may access the Internet.

The script we use to check if the user belongs to the group
'domain\lib_internet' is:

---------------------------------
#!/usr/bin/perl -w
#
# External_acl helper for Squid to verify NT domain group
# membership using wbinfo.
#
# This script verifies a user's membership in groups from
# the respective domain, enabling localized access control.
#

# external_acl uses shell-style lines in its protocol
require 'shellwords.pl';

# Disable output buffering
$|=1;

# enable debug
$DEBUG = 1;
$LOG = "/var/log/samba/log_wbinfo_domain_group";

# open the log
open(LOG,">>".$LOG) if ($DEBUG);

# write debug output to the log
sub debug {
        print LOG "@_" if ($DEBUG);
}

#
# Check if a user belongs to a group
#
sub check {
        local($domain, $user, $group) = @_;
        &debug("check data domain($domain) user($user) group($group)\n");

        # qualify user and group with their respective domains
        if ($domain ne "default domain"){
                $user = $domain."\\".$user;
                $group = $domain."\\".$group;
        }

        &debug("executing \`wbinfo -n \"$group\"\`\n");
        $groupSID = `wbinfo -n "$group"`;
        chop  $groupSID;

        &debug("executing \`wbinfo -Y \"$groupSID\"`\n");
        $groupGID = `wbinfo -Y "$groupSID"`;
        chop $groupGID;

        &debug( "user: $user\ngroup: $group\nSID: $groupSID\nGID: $groupGID\n");

        &debug("executing \`wbinfo -r \"$user\"`\n");
        return 'OK' if(`wbinfo -r "$user"` =~ /^$groupGID$/m);
        return 'ERR';
}

#
# Main loop
#
while (<STDIN>) {
        chop;
        &debug ("Got $_ from squid\n");

        # split user and domain
        @data = split(/[\\\/\|]+/,$_);

        # if a \ was found, the first element is a domain
        $domain = "default domain";
        $domain = shift(@data) if (@data>1);

        @data = split(/\ +/,$data[0]);

        $user = $data[0];
        $group = $data[1];
        for ($i=2;$i<@data;$i++){ $group .= " ".$data[$i]; }

        # verify the user is in the group
        $ans = &check($domain, $user, $group);

        &debug ("Sending $ans to squid\n");

        print "$ans\n";
}

close(LOG);

---------------------------------

The script checks which groups the user belongs to with the command wbinfo -r
'domain\user'.

PS: This message corresponds to a bug that was not opened by me. Do I carry on
with the same bug, or do I have to open another?
Comment 7 Daniel 2005-10-29 02:09:15 UTC
A bit more info to answer Volker's question. All servers (2 PDCs + SAMBA) are 
on gigabit network. 

On enumerating users and groups: I initially used it for the proftpd package,
but later moved to MySQL authentication and set 'enum users' and 'enum groups'
to no, hoping this would help speed things up, but still no good.
Comment 8 Andrew Kalinov 2005-11-22 23:42:44 UTC
I have the same problem too.

Samba 3.0.20b
FreeBSD 5.4
Windows 2003

I rolled back to the old version.
Comment 9 Andrew Kalinov 2005-12-06 00:56:40 UTC
Maybe the problem is in using mod_ntlm (0.4).
Comment 10 Frederico Gendorf 2006-03-08 05:24:00 UTC
Hi, I have the same problem. My network has 2 Samba PDCs with an interdomain trust over an OpenVPN connection, and only 30 client computers!
When it occurs, the interdomain trust fails!

My logwatch says:
nsswitch/winbindd.c:process_loop(844)  winbindd: Exceeding 200 client connections, no idle connection found : 983 Time(s)
 nsswitch/winbindd.c:process_loop(863)  winbindd: Exceeding 200 client connections, no idle connection found : 959 Time(s)

Regards
Comment 11 Gerald (Jerry) Carter 2006-04-20 08:03:36 UTC
Severity should be determined by the developers and not the reporter.
Comment 12 Gerald (Jerry) Carter 2006-08-11 07:35:55 UTC
Can anyone reproduce this on something other than FreeBSD?
I am not seeing any problems here.
Comment 13 Frederico Gendorf 2006-08-11 07:43:04 UTC
Last night my logwatch reported the log below.
I'm using Fedora 4 with two Samba PDC-LDAP trusting domains over VPN.
The whole network has 40 computers!

nsswitch/winbindd.c:process_loop(813)  winbindd: Exceeding 200 client connections, no idle connection found : 1507 Time(s)
 nsswitch/winbindd.c:process_loop(832)  winbindd: Exceeding 200 client connections, no idle connection found : 1445 Time(s)
Comment 14 Harald Wagener 2006-08-22 01:38:05 UTC
Happens for me on debian/unstable with

samba 3.0.23b

as well.

The problem did not exist with 3.0.22. 

When the error comes up I only have one winbindd running; normally there are four.

Since I was tweaking the winbind configuration a bit, it might have come up because I set

winbind nested groups = yes

I will set it back to

winbind nested groups = no


and report back if that changes anything. 

Regards,
    Harald
Comment 15 Harald Wagener 2006-08-22 03:50:44 UTC
winbindd also dies with 

winbind nested groups = yes
Comment 16 Yann D. 2006-08-22 04:20:34 UTC
got the same problem on 3 proxy servers (Squid + Samba):

winbindd[5910]: [2006/08/22 10:56:00, 0] nsswitch/winbindd.c:process_loop(863)
winbindd[5910]:   winbindd: Exceeding 200 client connections, no idle connection found

several times...
Samba version: 3.0.22-1
OS: Red Hat ES 4

The servers have joined a domain, with 2 PDCs on Windows 2003.
When I have these errors, "net ads testjoin" is OK, but "wbinfo -t" doesn't answer anything.

The errors happen almost every day, on one or all servers; it depends, I think, on the load of the servers.

Because of that, Squid doesn't authenticate users, and users time out.
I need to restart Squid and winbind to resolve this issue.
Comment 17 tsm 2006-10-18 20:30:37 UTC
I can confirm the bug on FreeBSD 6.1 with Samba 3.0.23c, compiled using the FreeBSD ports system.
I had to roll back my installation to the pre-compiled version from the installation CD (3.0.21) for it to work.
Comment 18 Marc D. 2006-12-15 10:12:19 UTC
OS: Red Hat Enterprise Linux AS 4 Update 3
arch: x86_64
Samba version: 3.0.23c-4

System is configured as a squid proxy using AD membership.

Showing same error as listed here when system is under high load. Was able to resolve issue by reverting to the Red Hat included version (3.0.10-1.4E.2).
Comment 19 David Leonard 2007-02-06 23:01:22 UTC
While staring at this problem, I noticed a small opportunity for winbind to leak fds:

--- samba/source/nsswitch/winbindd.c    (revision 176)
+++ samba/source/nsswitch/winbindd.c    (working copy)
@@ -602,8 +602,10 @@
        
        /* Create new connection structure */
        
-       if ((state = TALLOC_ZERO_P(NULL, struct winbindd_cli_state)) == NULL)
+       if ((state = TALLOC_ZERO_P(NULL, struct winbindd_cli_state)) == NULL) {
+               close(sock);
                return;
+       }
        
        state->sock = sock;
 
Comment 20 Jeremy Allison 2007-02-07 18:28:48 UTC
Applied - thanks !
Jeremy.
Comment 21 David Leonard 2007-03-28 17:37:32 UTC
There is an FD leakage issue in winbindd fixed with this patch:

Index: nsswitch/winbindd.c
===================================================================
--- nsswitch/winbindd.c (revision 183)
+++ nsswitch/winbindd.c (working copy)
@@ -870,6 +870,8 @@
                        winbind_child_died(pid);
                }
        }
+
+       close_winbindd_socket();
 }

 /* Main function */

(Thanks to Brent Snow for helping me track this down)
Comment 22 Jeremy Allison 2007-03-28 17:49:42 UTC
Great catch - thanks ! But I don't see why we're opening these sockets in this function at all... Surely we should be doing this once before calling it. I bet we were doing that in the past and it got moved. I'll check into this.
Jeremy.
Comment 23 Jeremy Allison 2007-03-28 18:00:03 UTC
Wow - this has been broken a loooong looong time. Traced back to 3.0.14a and it's still there....
Jeremy.
Comment 24 Jeremy Allison 2007-03-28 18:05:12 UTC
Created attachment 2351 [details]
Proposed patch.

Can you try this patch instead, please? I think it is correct, as it prevents these sockets from being continually closed and reopened.
Jeremy.
Comment 25 Jeremy Allison 2007-03-28 18:16:51 UTC
Ah - ok, I'm not sure this is the bug you think it is.

Look *carefully* at open_winbindd_socket() :

static int _winbindd_socket = -1;
static int _winbindd_priv_socket = -1;

int open_winbindd_socket(void)
{
        if (_winbindd_socket == -1) {
                _winbindd_socket = create_pipe_sock(
                        WINBINDD_SOCKET_DIR, WINBINDD_SOCKET_NAME, 0755);
                DEBUG(10, ("open_winbindd_socket: opened socket fd %d\n",
                           _winbindd_socket));
        }

        return _winbindd_socket;
}

Note that '_winbindd_socket' is a *static* int which
is returned without modification if it's not already -1.
Which means that open_winbindd_socket() creates the socket
only on the first call, and on all subsequent calls just
returns the existing socket.

So your patch introduces a race condition: a window during
which the socket is closed and not accepting connections,
so a client connection attempt would fail.

My patch doesn't suffer from that, but is in fact unneeded,
as I think the original code works as designed (it's
just a little unclear).

How did you track this bug down, and what did you use to
confirm that this indeed fixed the fd leak ?

Jeremy.
Comment 26 David Leonard 2007-03-28 19:29:12 UTC
(In reply to comment #25)

> How did you track this bug down, and what did you use to
> confirm that this indeed fixed the fd leak ?

Good questions. Here's the story. We distribute a slightly modified samba to customers - one that uses our krb5 product - and customers have noticed the 'Exceeding 200 client connections.'

One used lsof on Linux RHEL4 i386, showing that just one of the 4 winbindd processes had an enormous number of fds open. I'll attach an excerpt of the lsof in the next comment...

I haven't been able to reproduce the cause, myself. It seems elusive.
Comment 27 David Leonard 2007-03-28 19:32:51 UTC
Created attachment 2353 [details]
lsof on rhel4 showing a winbindd with fd leak (3.0.23c)
Comment 28 David Leonard 2007-03-28 19:34:50 UTC
(In reply to comment #21)
> There is an FD leakage issue in winbindd fixed with this patch:
Doh.. I meant to qualify that as 'possibly' fixed.
Comment 29 Jeremy Allison 2007-03-28 19:38:07 UTC
What was the position of the fd-leaking winbindd in the process tree ? Was it the parent or one of the children ? This might be important.

I'm assuming you haven't given your patch to the customer ? Or did you give it them and it fixed the problem ?

Jeremy.
Comment 30 David Leonard 2007-03-28 20:05:20 UTC
(In reply to comment #29)
> What was the position of the fd-leaking winbindd in the process tree ? Was it
> the parent or one of the children ? This might be important.

I didn't ask for a ps.. so I don't have ppid info, sorry. I'll ask.

> I'm assuming you haven't given your patch to the customer ? Or did you give it
> them and it fixed the problem ?

They've probably gone home for the evening - I sent them a patched winbindd binary to try, but I have doubts it will work, because of your correct analysis that re-opening an already-open priv socket is a no-op. But it could be interesting.

I doubt it is close-on-exec leakage because the PIDs look too low for that to have happened a thousand times (assuming linux is allocating sequential pids)

nmbd      25044
smbd      25051
winbindd  25058

although now I look carefully at the processes listed in the lsof, it seems that smbd/nmbd/winbindd have been started multiple times - i.e. the init script didn't shut down a previous process group..

So.. how about I come back to you with decent information later, instead of a half-baked patch :(
Comment 31 Jeremy Allison 2007-03-31 16:43:45 UTC
Created attachment 2355 [details]
Wild guess :-)

So here's a completely insane guess, which I don't think is
right based on my understanding of UNIX socket semantics, but
here you go.

If the fd leak is in one of the domain children, there is just a very outside chance that, because it inherits its parent's sockets in the listen state, some bug is causing client sockets to be created in that child. I know, I know - where's the accept() call, I hear you ask....

But just to close that possibility, as I'm stuck for ideas here, here's a patch that would stop that happening. If it could. Which it can't :-).

Jeremy.
Comment 32 David Leonard 2007-03-31 19:38:37 UTC
The do loop in winbindd's new_connection() is probably not necessary.. 
although I have no idea how it could be triggered into an EINTR spin
Comment 33 Jeremy Allison 2007-03-31 19:43:45 UTC
Actually it is necessary. winbindd uses tdb messaging, which is triggered by a SIGUSR1 that can arrive at any time. All "slow" system calls must be wrapped in an EINTR loop, and accept is certainly one of those (look at the EINTR loop wrappers for most system calls in lib/system.c).

Jeremy.
Comment 34 Gerald (Jerry) Carter 2007-04-06 14:30:39 UTC
I've looked through the SAMBA_3_0_25 tree and I don't see how this
can happen either, unless you really have a large number of clients.
Possibly a high number of long-lived smbd processes could trigger it,
but then that is not a leak.

I'm lowering the priority until someone can supply a reproducible
test case for us.
Comment 35 Christian Perrier 2007-04-24 02:55:24 UTC
In case it can help: more info, possibly including a test case, is in Debian bug #410663: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=410663

Apparently, our user there is able to reproduce the bug in 3.0.24
Comment 36 Jeremy Allison 2007-04-24 03:21:13 UTC
I looked at the debian bug report. It's essentially the same as this one - no real reproducible test case.
Jeremy.
Comment 37 Peter Kruse 2007-05-10 08:20:06 UTC
Is there any news on this?  One of our servers is hitting this bug
at this very moment, and I'm not sure what I can do about it.  I already
increased the limit WINBINDD_MAX_SIMULTANEOUS_CLIENTS to 500, but winbind
had no problem reaching that within 15 minutes.  Has anybody tried one of the
suggested patches?
This server is running with an LDAP backend, integrated in a Win2003 ADS domain
with about 20000 users and groups (and counting...).  Version is 3.0.22.
The problem started when we removed winbind's .tdb files and restarted it,
because it could not resolve some SIDs (they were missing in winbindd_idmap.tdb).

      Peter
Comment 38 Gerald (Jerry) Carter 2007-05-10 08:26:30 UTC
No one can give us a reproducible test case, so we are kind of
blocked on it.  If you can help us figure out what conditions trigger
the problem, we'll be glad to fix it.
Comment 39 Peter Kruse 2007-05-10 08:29:35 UTC
Would it help if I start winbindd with a higher log level and send you the logfile?
Comment 40 Frederico Gendorf 2007-05-16 18:36:35 UTC
The problem on my server persists after I installed 3.0.25.
CPU usage goes to 100%!
How can I help find the solution for this case?
What can I send you?
Comment 41 Chris Stefanetti 2007-05-26 02:09:37 UTC
Same problem, using samba in "Domain" security mode. Network has 2 Win2003 domain controllers.

security = DOMAIN
idmap uid = 15000-20000
idmap gid = 15000-20000
winbind use default domain = Yes

[2007/05/25 15:23:54, 1] nsswitch/winbindd.c:main(953)
  winbindd version 3.0.23c started.
  Copyright The Samba Team 2000-2004
[2007/05/25 15:39:14, 0] nsswitch/winbindd.c:process_loop(813)
  winbindd: Exceeding 200 client connections, no idle connection found

After these log events start, "getent passwd <username>" hangs with no result.
Before this log event, getent is fine.
Comment 42 Chris Stefanetti 2007-05-26 02:14:09 UTC
(In reply to comment #41)
> Same problem, using samba in "Domain" security mode. Network has 2 Win2003
> domain controllers.
> 
> security = DOMAIN
> idmap uid = 15000-20000
> idmap gid = 15000-20000
> winbind use default domain = Yes
> 
> [2007/05/25 15:23:54, 1] nsswitch/winbindd.c:main(953)
>   winbindd version 3.0.23c started.
>   Copyright The Samba Team 2000-2004
> [2007/05/25 15:39:14, 0] nsswitch/winbindd.c:process_loop(813)
>   winbindd: Exceeding 200 client connections, no idle connection found
> 
> after these log events start, "getent passwd <username>" hangs with no result
> before this log event getent is fine..
> 

and it should be known this is on FreeBSD 6.2 STABLE-RELEASE
Comment 43 Volker Lendecke 2007-05-26 02:50:30 UTC
I wonder if there's some process not closing a getpwent or getgrent loop. Can you attach the output of "ps ax" when this happens? Maybe it gives a hint.

And for the "lsof" output, we need the other end of that socket, not just the winbind end of it.

Volker
Comment 44 Gerald (Jerry) Carter 2007-06-11 14:11:42 UTC
I'm pretty sure the repro case is to send a SIGSTOP to the child 
process for our domain and let the async request states build
up in the parent winbindd.
Comment 45 Jeremy Allison 2007-06-11 17:29:56 UTC
Created attachment 2747 [details]
Patch

Jerry got this reproducible. Here is a patch I created that fixes it in his testing. This was a complex logic case :-).
Jeremy.
Comment 46 Gerald (Jerry) Carter 2007-08-17 15:32:11 UTC
*** Bug 4089 has been marked as a duplicate of this bug. ***
Comment 47 Gerald (Jerry) Carter 2007-08-17 15:35:34 UTC
Finally fixed for 3.0.25c
Comment 48 Patrick Rynhart 2007-08-23 01:53:08 UTC
Hi,

I've just built Samba 3.0.25c but am still finding that one of the winbind processes becomes very large (in terms of memory usage):

# ps -elfy | grep winbindd
S root      4189     1  0  75   0  4396  2805 -      07:37 ?        00:02:31 /usr/sbin/winbindd -D
S root      4191  4189  0  85   0  3604  2614 429496 07:37 ?        00:00:00 /usr/sbin/winbindd -D
S root      4194  4189  0  75   0 426908 108348 429496 07:38 ?      00:03:59 /usr/sbin/winbindd -D
S root      4195  4189  0  75   0  2640  2355 429496 07:38 ?        00:00:00 /usr/sbin/winbindd -D
S root      4222  4189  0  84   0  3348  2439 429496 07:38 ?        00:00:01 /usr/sbin/winbindd -D
S root      4380  4189  0  75   0  1896  1928 429496 08:33 ?        00:00:00 /usr/sbin/winbindd -D

i.e. over 400 MB of memory.  Eventually the process becomes so large that the oom-killer kicks in and starts killing random processes.  I had thought this bug fix might have resolved the problem.  Is winbindd supposed to get this large?  If so, why?

I am seeing this on both the BDC and PDC.  The samba domain is in an NT trust relationship with a Win 2k3 Domain.

Regards,

Patrick
Comment 49 Gerald (Jerry) Carter 2007-08-23 06:39:36 UTC
(In reply to comment #48)
> Hi,
> 
> I've just built Samba 3.0.25c but am finding still that 
> one of the winbind processes is again becoming very large 
> (in terms of memory usage):

This has nothing to do with the original bug report.  Please open 
a new one.  Thanks.
Comment 50 Thomas Merz 2007-11-28 06:29:03 UTC
Hello,

we are experiencing the same bug (Exceeding 200 client connections, no idle connection found) on Samba 3.0.26a, although it should have been fixed in 3.0.25c.

We are using the Sernet provided RPMs for RedHat RHEL4u5:

samba3-3.0.26a-35
samba3-utils-3.0.26a-35
samba3-client-3.0.26a-35
samba3-winbind-3.0.26a-35 (Version 3.0.26a-SerNet-RedHat)

From the times of occurrence of the errors (outside and inside business hours), it seems unlikely that the maximum number of connections is really being reached by user activity.

Is there any possibility that the patch is not working/is not included in 3.0.26a?

Regards,
Tom
Comment 51 Gerald (Jerry) Carter 2007-11-28 08:07:42 UTC
(In reply to comment #50)

> Is there any possibility that the patch is not working/is 
> not included in 3.0.26a?

Unless you can provide some more details, it's hard to comment.
The bug fix is in 3.0.26a and appears to be working well in my 
environments.

Comment 52 Jarrod Hyder 2007-12-03 09:16:35 UTC
I'm still experiencing this exact same problem (Exceeding 200 client connections, no idle connection found) on Fedora 7 with Samba 3.0.27a and kernel 2.6.23. The server is an Intel Core 2 Duo based machine with an Asus server mobo, 4GB of memory and a 750GB RAID5.

My server is used by 10-15 engineers who primarily use it for a public file store, CAD data storage, Outlook PST file storage, and for launching Pro/Engineer Wildfire 2.0.

This issue popped up for me about 3 months ago. At this time winbind would crash on me about once a week and I would have to manually restart it. Now it has become so bad that I'm spending most of my day babysitting winbind to make sure my engineers can get some work done because it is crashing every 15 minutes or so.

I'm willing to provide whatever information I can to help resolve this bug because it is currently costing our company thousands of dollars in lost time each week and my boss is breathing down my neck about it.
Comment 53 Gerald (Jerry) Carter 2007-12-03 09:38:31 UTC
(In reply to comment #52)
> I'm still experiencing this exact same problem (Exceeding 200 client
> connections, no idle connection found) on Fedora 7 with samba version 3.0.27a
> and kernel 2.6.23. The server is an Intel Core 2 Duo based machine with an 
> Asus server mobo, 4GB of memory and a 750GB RAID5.
> 
> My server is used by 10-15 engineers who primarily use it for a public file
> store, CAD data storage, Outlook PST file storage, and for launching
> Pro/Engineer Wildfire 2.0.

That doesn't help me to understand which file descriptors are open.  I need
details from /proc/, ps, truss, etc.

> 
> This issue popped up for me about 3 months ago. At this time winbind would
> crash on me about once a week and I would have to manually restart it. Now it
> has become so bad that I'm spending most of my day babysitting winbind to make
> sure my engineers can get some work done because it is crashing every 15
> minutes or so.
> 
> I'm willing to provide whatever information I can to help resolve this bug
> because it is currently costing our company thousands of dollars in lost time
> each week and my boss is breathing down my neck about it.

This bug report has nothing to do with crashes.  Please file that as a separate bug and attach a gzipped tarball of any relevant log files and configuration files you have.  Thanks.

Comment 54 Jarrod Hyder 2007-12-04 16:00:37 UTC
(In reply to comment #53)
> 
> That doesn't help me to understand which file descriptors are open.  I need
> details from /proc/, ps , truss, etc....
> 
What exactly do you need? I'm not really a system administrator (I'm actually a mechanical engineer) so I'm not really sure what you are talking about.

> 
> This bug report has nothing to do with crashes.  Please file that as a separate
> bug and attach a gzipped tarbal of any log files and configuration files you
> have that are relevant.  Thanks. 
> 

I really shouldn't have said that it crashed. It just hangs when it starts spitting out the "winbindd: Exceeding 200 client connections, no idle connection found" messages.
Comment 55 Jarrod Hyder 2007-12-17 07:45:55 UTC
(In reply to comment #54)
> (In reply to comment #53)
> > 
> > That doesn't help me to understand which file descriptors are open.  I need
> > details from /proc/, ps , truss, etc....
> > 
> What exactly do you need? I'm not really a system administrator (I'm actually a
> mechanical engineer) so I'm not really sure what you are talking about.
> 

It has been two weeks and I still haven't heard what is needed from me.

Our Samba server is to the point that it is completely unusable. Because of this all of our files have been moved to a Windows 2003 server.

I would like to help get this bug squashed!
Comment 56 Volker Lendecke 2007-12-17 07:53:01 UTC
You probably need to get local support from someone who can log into the box to get the necessary information. We need to know why winbind is hanging, and where. Doing this remotely is very difficult; there can be a million reasons why a process might get stuck. Someone with experience in analyzing this kind of problem might very quickly find something that we would never find by sending each other exact explanations of what to do next.

Sorry,

Volker
Comment 57 Jarrod Hyder 2007-12-17 09:10:16 UTC
(In reply to comment #56)
> Probably you need to get local support from someone being able to log into the
> box to get the necessary information. We need to know why winbind is hanging
> and where. Doing this remotely is very difficult, there can be a million
> reasons why a process might get stuck. Someone with experience in analyzing
> this kind of problem might very quickly find something that we will never find
> via sending each other exact explanations what to do next.
> 
> Sorry,
> 
> Volker
> 

Volker,

I have full access to this machine. I can provide any logs, configuration files, command outputs, or whatever else is required to figure out what is going on. I have checked everything I know to check and the only thing I can come up with is the fact that I get "winbindd: Exceeding 200 client connections, no idle connection found" when winbindd stops answering authentication requests.

Do I need to dump the contents of /proc/winbind_process_id into a zip file whenever things start acting up?
Comment 58 Volker Lendecke 2007-12-17 09:23:28 UTC
You need to find out which of the winbind child processes hangs, and where. For this you need to "strace -p" all the winbind processes while winbind is stuck. You will very likely see that winbind sits in a select() system call.

If it is something else, for example fcntl64(), that would be very suspicious and worth further investigation. If it sits in fcntl, you will see the file descriptor as the first argument of that syscall; from there you can see which of the tdb files it is via "ls -l /proc/<winbind-pid>/fd". If it is a select system call, it is very likely not the faulty process.

It could also be that it sits in a read() syscall. From there some detective work is necessary to see what kind of file descriptor it is. It might be interesting at this point to attach to the suspect process with gdb and get a backtrace with "bt"; that output might be very valuable. If you find that your winbind waits for a domain controller in a select or read system call, you might want to do a subsequent tcpdump, filtering on just that domain controller, to find out what sequence of requests winbind is doing and which ones lead to the lock-up.

It might also be something I just overlooked: it might very well be possible that one of your tdb files is corrupt and winbind enters a 100% CPU loop, thus not being able to reply at all anymore.

Volker
Comment 59 David L. Parsley 2008-01-03 10:25:33 UTC
Ok, I'm seeing this bug now, and here's what I see:
[root@docs ~]# pgrep winbind
677
915
1124
1286
1409
1547
1858
2285
...
[root@docs ~]# strace -p 677
Process 677 attached - interrupt to quit
futex(0x2a96601640, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 677 detached
[root@docs ~]# strace -p 915
Process 915 attached - interrupt to quit
futex(0x2a96601640, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 915 detached
[root@docs ~]# strace -p 1124
Process 1124 attached - interrupt to quit
futex(0x2a96601640, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 1124 detached
[root@docs ~]# strace -p 1286
Process 1286 attached - interrupt to quit
futex(0x2a96601640, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 1286 detached
[root@docs ~]# strace -p 1409
Process 1409 attached - interrupt to quit
futex(0x2a96601640, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 1409 detached
[root@docs ~]# strace -p 1547
Process 1547 attached - interrupt to quit
futex(0x2a96601640, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 1547 detached
[root@docs ~]# strace -p 1858
Process 1858 attached - interrupt to quit
futex(0x2a96601640, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 1858 detached
[root@docs ~]# strace -p 2285
Process 2285 attached - interrupt to quit
select(390, [12 13 14 389], [], NULL, {20, 346000} <unfinished ...>

Also, these winbindd's have lots of open fds:
[root@docs ~]# pgrep winbind | while read WBPROC; do echo -n "$WBPROC: "; ls -Fla /proc/$WBPROC/fd | grep socket | wc -l; done
677: 245
915: 286
1124: 313
1286: 338
1409: 354
1547: 364
1858: 371
2285: 375
2293: 6
22198: 8
22199: 7
23400: 43
23442: 103
23986: 126
24033: 164
24078: 207
24135: 207
32644: 225

Looking at the lsof output, most of these fd's are winbind->winbind.  Like another user reported, I'm also using the ldap backend against an AD server.  This server doesn't see much use; just a handful of users.
Comment 60 Richard Sharpe 2009-01-13 13:14:01 UTC
We have seen this problem as well with a couple of customers, but only once each.

Environment is Linux 2.6.14 and Samba 3.0.25, but it could also be present with 3.0.31.

Now that I have stumbled across this bug, I have more information that I can use to figure out the root cause.
Comment 61 Richard Sharpe 2009-01-13 14:37:40 UTC
Ahhhh, here, I suspect is the problem. Here is a log entry:

lib/util_tdb.c:tdb_chainlock_with_timeout_internal(84) 
tdb_chainlock_with_timeout_internal: alarm (40) timed out for key
ranger.msbdomain.lan in tdb /etc/samba/secrets.tdb
[2008/12/21 10:33:51.959971, 0, pid=17551/winbindd]
nsswitch/winbindd_cm.c:cm_prepare_connection(644)  cm_prepare_connection: mutex
grab failed for <dc name redacted>

What has happened here is two fold:

1. The code in 3.0.25 (up to and including possibly 3.0.30) had a bug because we were not properly handling timeouts in the brlock code in tdb/common/lock.c. The timeout handler would be called in tdb_chainlock_with_timeout_internal, but the loop in tdb/common/lock.c:tdb_brlock did this:

        do {
                ret = fcntl(tdb->fd,lck_type,&fl);
        } while (ret == -1 && errno == EINTR);

which took us straight back into the fcntl. If the problem was simply that some other process (winbindd?) held the lock for an extraordinary period of time (longer than the 40-second timeout), the timeout handler would be called, but then we would go back to waiting on the lock.

2. When we finally got the lock, we would return to tdb_chainlock_with_timeout_internal, which had a bug of its own: it just looked at the timeout flag and, if non-zero, returned an error. Now the process that was waiting for the lock holds the lock/mutex but does not know it, and is unlikely to ever release it. This would be more likely if multiple processes were waiting for the mutex ... 

I alerted Jeremy to a race in the 3.0.31 and above code and he has fixed that in the latest release, so I think this problem will be fixed by upgrading to the latest release, or by backporting the single-line change. The change is roughly this:

diff --git a/source3/lib/util_tdb.c b/source3/lib/util_tdb.c
index bb568bc..8ceaa46 100644
--- a/source3/lib/util_tdb.c
+++ b/source3/lib/util_tdb.c
@@ -64,7 +64,7 @@ static int tdb_chainlock_with_timeout_internal( TDB_CONTEXT
*tdb, TDB_DATA key,
 		alarm(0);
 		tdb_setalarm_sigptr(tdb, NULL);
 		CatchSignal(SIGALRM, SIGNAL_CAST SIG_IGN);
-		if (gotalarm) {
+		if (gotalarm && (ret == -1)) {
 			DEBUG(0,("tdb_chainlock_with_timeout_internal: alarm (%u) timed out for key
%s in tdb %s\n",
 				timeout, key.dptr, tdb_name(tdb)));
 			/* TODO: If we time out waiting for a lock, it might

Comment 62 Richard Sharpe 2009-01-13 14:38:13 UTC
Adding me to the CC list.
Comment 63 Richard Sharpe 2009-01-14 12:28:48 UTC
Hey Jerry,

Can we close this bug now? 

I believe that I have explained what happens and that it is definitely fixed in 3.0.34 (or whatever is next); modulo the race, which is much harder to hit, it is fixed in at least 3.0.31.
Comment 64 Gerald (Jerry) Carter 2009-01-14 13:37:54 UTC
(In reply to comment #63)
> Hey Jerry,
> 
> Can we close this bug now? 
> 
> I believe that I have explained what happens and that it is definitely fixed in
> 3.0.34 (or whatever is next) and, modulo the race, which is much harder to hit,
> is fixed in at least 3.0.31.
> 

Agreed.

Comment 65 Jeremy Allison 2009-07-13 13:26:18 UTC
Was fixed in 3.2.8 and above (3.3.0+).
Jeremy.
Comment 66 Hodur 2013-05-06 07:31:41 UTC
The same problem occurs with samba-3.6.14; samba-3.6.13 is OK.
After upgrading from samba-3.6.13 to samba-3.6.14, the winbind log shows:

[2013/05/06 07:07:43.830181,  0] winbindd/winbindd.c:947(winbindd_listen_fde_handler)
  winbindd: Exceeding 500 client connections, no idle connection found

From smb.conf

    winbind cache time = 1200
    winbind use default domain = yes
    winbind refresh tickets = yes
    winbind offline logon = yes
    winbind enum users = yes
    winbind enum groups = yes
    winbind nss info = template
    winbind nested groups = yes
    winbind max clients = 500
    idmap uid = 10000-100000
    idmap gid = 10000-100000
    idmap cache time = 1200
Comment 67 Hemanth 2014-07-17 21:09:29 UTC
Got the same problem with the samba 3.6.12+ stack. In fact we hit this issue twice for one of our customers.

[2014/07/17 11:23:02.223054,  0] winbindd/winbindd.c:947(winbindd_listen_fde_handler)
  winbindd: Exceeding 400 client connections, no idle connection found
[2014/07/17 11:23:02.224055,  0] winbindd/winbindd.c:947(winbindd_listen_fde_handler)
  winbindd: Exceeding 400 client connections, no idle connection found

Winbindd went unresponsive, and we found a lot (~10K) of open file handles for the stuck winbindd process. We had to kill this process to restore user access/connectivity.

 ....
 ....
 3378 winbindd         10294 s - rw------   1       0 UDS /usr/local/var/locks/winbindd_privileged/pipe
 3378 winbindd         10295 s - rw------   1       0 UDS /usr/local/var/locks/winbindd_privileged/pipe
 3378 winbindd         10296 s - rw------   1       0 UDS /usr/local/var/locks/winbindd_privileged/pipe

Here is the gdb stack info of stuck winbindd process.
=== Dumping Process winbindd (3380) ===

[Switching to Thread 8030021c0 (LWP 101095)]
0x00000008026cd5ec in poll () from /lib/libc.so.7

Thread 1 (Thread 8030021c0 (LWP 101095)):
#0  0x00000008026cd5ec in poll () from /lib/libc.so.7
#1  0x00000008010a27fe in poll () from /lib/libthr.so.3
#2  0x0000000801ea0ee9 in wait4msg (result=<optimized out>, timeout=<optimized out>, all=<optimized out>, msgid=<optimized out>, ld=<optimized out>) at result.c:312
#3  ldap_result (ld=0x80302eac0, msgid=5, all=1, timeout=<optimized out>, result=0x7fffffffc258) at result.c:117
#4  0x0000000801ea8348 in ldap_sasl_bind_s (ld=0x80302eac0, dn=0x0, mechanism=0xa00760 "GSS-SPNEGO", cred=0x7fffffffc370, sctrls=0x0, cctrls=<optimized out>, servercredp=0x7fffffffc400) at sasl.c:194
#5  0x0000000000821a1d in ads_sasl_spnego_rawkrb5_bind (principal=<optimized out>, ads=<optimized out>) at libads/sasl.c:795
#6  ads_sasl_spnego_krb5_bind (ads=0x803002a80, p=<optimized out>) at libads/sasl.c:823
#7  0x0000000000822545 in ads_sasl_spnego_bind (ads=0x803002a80) at libads/sasl.c:904
#8  0x000000000082055d in ads_sasl_bind (ads=0x803002a80) at libads/sasl.c:1213
#9  0x000000000081f8e4 in ads_connect (ads=0x803002a80) at libads/ldap.c:730
#10 0x00000000004a78b0 in ads_cached_connection (domain=0x80305f200) at winbindd/winbindd_ads.c:131
#11 0x00000000004a7ba0 in sequence_number (domain=0x80305f200, seq=0x80305f718) at winbindd/winbindd_ads.c:1262
#12 0x0000000000492b31 in refresh_sequence_number (domain=0x80305f200, force=<optimized out>) at winbindd/winbindd_cache.c:558
#13 0x0000000000492df4 in wcache_fetch (cache=<optimized out>, domain=0x80305f200, format=0x8b9580 "GM/%s") at winbindd/winbindd_cache.c:711
#14 0x0000000000493ba6 in wcache_lookup_groupmem (domain=0x80305f200, mem_ctx=0x803009290, group_sid=<optimized out>, num_names=0x7fffffffcf84, sid_mem=0x7fffffffcf78, names=0x7fffffffcf70, name_types=0x7fffffffcf68) at winbindd/winbindd_cache.c:2615
#15 0x0000000000493e58 in lookup_groupmem (domain=0x8030af004, mem_ctx=0x1, group_sid=0x80308c3d0, type=SID_NAME_DOM_GRP, num_names=0x1, sid_mem=0x7fffffffcf78, names=0x7fffffffcf70, name_types=0x7fffffffcf68) at winbindd/winbindd_cache.c:2673
#16 0x00000000004b1be4 in _wbint_LookupGroupMembers (p=0x7fffffffd000, r=0x803044140) at winbindd/winbindd_dual_srv.c:347
#17 0x00000000004bad16 in api_wbint_LookupGroupMembers (p=0x7fffffffd000) at librpc/gen_ndr/srv_wbint.c:1271
#18 0x00000000004b04a2 in winbindd_dual_ndrcmd (domain=0x80305f200, state=0x7fffffffe800) at winbindd/winbindd_dual_ndr.c:322
#19 0x00000000004aecfd in child_process_request (state=<optimized out>, child=<optimized out>) at winbindd/winbindd_dual.c:495
#20 fork_domain_child (child=<optimized out>) at winbindd/winbindd_dual.c:1609
#21 wb_child_request_trigger (req=<optimized out>, private_data=<optimized out>) at winbindd/winbindd_dual.c:200
#22 0x0000000000569970 in tevent_common_loop_immediate (ev=0x80301e110) at ../lib/tevent/tevent_immediate.c:139
#23 0x0000000000567c35 in run_events_poll (ev=0x80301e110, pollrtn=0, pfds=0x0, num_pfds=0) at lib/events.c:197
#24 0x0000000000568359 in s3_event_loop_once (ev=0x80301e110, location=<optimized out>) at lib/events.c:331
#25 0x0000000000568771 in _tevent_loop_once (ev=0x80301e110, location=0x8b41c9 "winbindd/winbindd.c:1456") at ../lib/tevent/tevent.c:494
#26 0x00000000004898a2 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at winbindd/winbindd.c:1456


Would like to know if someone has an idea of where exactly the FDs are leaking.
Comment 68 Volker Lendecke 2014-07-18 08:51:50 UTC
(In reply to comment #67)
> Would like to know if someone has an idea on where exactly FDs are leaking.

If winbind children pile up waiting for something, this will inevitably happen. Maybe we should add an emergency "watchdog": if a winbind child does not reply within, say, 1 minute, call it dead. Start a new one even if this would exceed the max winbind domain children. If it does not come back for an hour, just kill it. No winbind request at all should take more than an hour; to me it would be reasonable to really kill that helper process after an hour of busy time.

Comments?
Comment 69 Hemanth 2014-07-18 16:45:47 UTC
(In reply to comment #68)
> If child winbind's pile up waiting for something, this will inevitably happen.
> Maybe we should add an emergency "watchdog": If a winbind child does not reply
> within, say, 1 minute, call it dead. Start a new one even if this would exceed
> the max winbind domain children. If it does not come back for an hour, just
> kill it. No winbind request at all should take more than an hour, to me it
> would be reasonable to really kill that helper process after an hour busy time.
> 
> Comments?

Actually, in our case we did not see too many child winbindd processes; in fact there are only 3 winbindd processes. The parent/main winbindd was holding too many (>10K) FDs to the Unix domain socket.

I found the following piece of code in winbindd_listen_fde_handler():

{
	struct winbindd_listen_state *s = talloc_get_type_abort(private_data,
					  struct winbindd_listen_state);

	while (winbindd_num_clients() > lp_winbind_max_clients() - 1) {
		DEBUG(5,("winbindd: Exceeding %d client "
			 "connections, removing idle "
			 "connection.\n", lp_winbind_max_clients()));
		if (!remove_idle_client()) {
			DEBUG(0,("winbindd: Exceeding %d "
				 "client connections, no idle "
				 "connection found\n",
				 lp_winbind_max_clients()));
			break;
		}
	}
	new_connection(s->fd, s->privileged);
}

Here we seem to allow more connections than the configured "max clients" even if we don't find any idle/dead client connections. Because of this, the connection list keeps growing, and it reached 10K in our case. Also, on every new socket connection request, we iterate through this list to find an idle client connection. I think this is what keeps winbindd busy and makes it almost unresponsive for clients. I would also like to understand why we allow new connections without limit when the max clients limit is exceeded and no idle connections are found.

Also, I am trying to understand why these winbindd client connections stay open for so long. In my test setup, I could see that asynchronous requests get processed quickly and the socket connection is closed. But in this case, these connections are being held open much longer. 

As Volker points out, it would be a good idea to mark sessions stale once they exceed a 60-second timeout. I assume no such request should take winbindd more than 60 seconds to process. 

Thanks,
Hemanth.
Comment 70 Jeremy Allison 2014-07-18 19:01:15 UTC
Created attachment 10127 [details]
Prototype patch for 3.6.x.

Ok Hemanth, can you give this (prototype) patch for 3.6.x a try ?

It adds a new [global] parameter:

winbind request timeout

with a default value of 60 (seconds). What it does is terminate every client connection that has either remained idle for 60 seconds or has not replied within 60 seconds. Initially I worried this was a little aggressive, but I don't think so - if a request has taken > 60 seconds it's almost certainly dead, and pruning idle clients after 60 seconds is also probably ok. Also, it's tuneable :-).

If this works for you I can forward port to 4.1.next and 4.0.next.

Cheers,

Jeremy.
Comment 71 Hemanth 2014-07-18 20:24:24 UTC
(In reply to comment #70)
> Created attachment 10127 [details]
> Prototype patch for 3.6.x.
> 
> Ok Hemanth, can you give this (prototype) patch for 3.6.x a try ?
> 
> It adds a new [global] parameter:
> 
> winbind request timeout
> 
> default value of 60 (seconds). What it does is terminate every client
> connection that has either remained idle for 60 seconds, or has not replied
> within 60 seconds. Initially I worried this was a little aggressive, but I
> don't think so - if a request has take > 60 seconds it's almost certainly dead,
> and pruning idle clients after 60 seconds is also probably ok. Also it's
> tuneable :-).
> 
> If this works for you I can forward port to 4.1.next and 4.0.next.
> 
> Cheers,
> 
> Jeremy.

Sure. Thanks Jeremy. I will incorporate these changes and try to provide the patch to the customer. Since this issue is not reproducible in-house, we can expect some delay in getting feedback. But I will follow up with the customer and update soon.

Thanks,
Hemanth.
Comment 72 Jeremy Allison 2014-07-19 05:11:20 UTC
Comment on attachment 10127 [details]
Prototype patch for 3.6.x.

Ach - don't use this patch. Time logic is wrong (reversed). New patch shortly.
Comment 73 Jeremy Allison 2014-07-19 05:41:33 UTC
Created attachment 10128 [details]
Fixed patch for 3.6.x

Correctly tests expiry time for request in remove_timed_out_clients().

Sorry for the earlier error.
Comment 74 Björn Jacke 2014-07-23 18:21:43 UTC
*** Bug 6825 has been marked as a duplicate of this bug. ***
Comment 75 Björn Jacke 2014-07-24 20:23:07 UTC
*** Bug 6087 has been marked as a duplicate of this bug. ***
Comment 76 Hemanth 2014-07-24 23:58:32 UTC
(In reply to comment #73)
> Created attachment 10128 [details]
> Fixed patch for 3.6.x
> 
We couldn't reproduce the issue in-house. I have tweaked the code to introduce an intentional FD leak and verified that older connections are getting cleaned up. Right now I have ported this patch and pushed it to our QA. Will let you know if we hear anything internally or from customers.

Thanks,
Hemanth.
Comment 77 Björn Jacke 2014-07-25 07:14:10 UTC
*** Bug 10573 has been marked as a duplicate of this bug. ***
Comment 78 Björn Jacke 2014-07-25 09:13:25 UTC
*** Bug 9127 has been marked as a duplicate of this bug. ***
Comment 79 Jeremy Allison 2014-07-29 22:48:44 UTC
Created attachment 10160 [details]
git-am fix for 4.1.next and 4.0.next.

Back ported from fix that went into master.
Comment 80 Hemanth 2014-07-29 23:02:00 UTC
We gave this patch as early access to one of our customers a few (4-5) days back. We have been monitoring the FDs used by winbindd and haven't seen the FD list growing so far (only 2 to 3 active UDS connections at any point in time). With this I can say that the patch is working fine.

Also, I added debug level zero messages to see whether we are cleaning up any idle or timed-out sessions. Every time, I could see that only idle client connections were getting cleaned up. Based on this, I assume that we do not have any timed-out/stuck client sessions; instead there have been idle connections whose FDs were leaking. This is just my observation in our customer's setup.
Comment 81 Jeremy Allison 2014-08-07 18:42:08 UTC
Re-assigning to Karolin for inclusion in 4.1.next, 4.0.next.
Comment 82 Stefan Metzmacher 2014-08-28 07:07:04 UTC
*** Bug 6423 has been marked as a duplicate of this bug. ***
Comment 83 Karolin Seeger 2014-09-01 19:32:23 UTC
Pushed to autobuild-v4-[0|1]-test.
Comment 84 Karolin Seeger 2014-09-03 07:31:48 UTC
Pushed to both branches.
Closing out bug report.

Thanks!