Bug 2736 - Winbindd idle connection close code fails under high number of lookups
Product: Samba 3.0
Classification: Unclassified
Component: winbind
Hardware: All Linux
Importance: P3 normal
Target Milestone: none
Assigned To: Jim McDonough
Samba QA Contact
Reported: 2005-05-24 08:29 UTC by John Janosik
Modified: 2005-08-24 10:19 UTC (History)

Retries up to 3 times if the winbind daemon returns NSS_STATUS_UNAVAIL (855 bytes, patch)
2005-05-26 11:07 UTC, Jim McDonough

Description John Janosik 2005-05-24 08:29:10 UTC
We have an environment that is being used to simulate a large file server
consolidation.  The customer wants to see 12000 concurrent users on a single
file server that is a domain member server running winbindd.  The symptom of
this problem is login failures while there are high numbers of nsswitch lookups.
The smbd log for the failed connection will have the following:

[2005/05/18 12:12:03, 2, pid=26222] auth/auth.c:check_ntlm_password(312)
  check_ntlm_password:  Authentication for user [client826] -> [client826]

But the user client826 does exist and is found via nsswitch upon retry.

In a level 10 winbind log there will be no mention of this lookup, but a lot of
messages like the following:

[2005/05/18 12:10:37, 5, pid=25342] nsswitch/winbindd.c:process_loop(704)
  winbindd: Exceeding 200 client connections, removing idle connection.
[2005/05/18 12:10:37, 5, pid=25342] nsswitch/winbindd.c:remove_idle_client(426)
  Found 198 idle client connections, shutting down sock 22, pid 25760

The logic in remove_idle_client in winbindd.c looks ok but since
state->read_buf_len can't be updated once the idle client removal is started,
there is a window where a client can write a request and then get its socket
closed with no response.
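
To illustrate the window from the client's side, here is a minimal sketch using a plain socketpair. It is not winbindd code: the function name and the request string are hypothetical, and the "server" is just the other end of the pair, which closes on a stale idle decision without reading the pending request.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo, not winbindd code.
 * Returns the client's read() result after the "server" end is closed
 * without ever reading the pending request. A result <= 0 means the
 * client got no response (EOF or a connection error).
 * Returns 1 if the demo could not be set up.
 */
static ssize_t request_lost_to_idle_close(void)
{
    int sv[2];
    char reply[64];
    const char *req = "lookup client826"; /* placeholder payload */
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 1;

    /* Step 1 (implicit): the server has already scanned its clients and
     * decided this one is idle, because nothing had been read from it. */

    /* Step 2: the client writes a request inside the window. */
    if (write(sv[0], req, strlen(req)) < 0) {
        close(sv[0]);
        close(sv[1]);
        return 1;
    }

    /* Step 3: the server acts on its stale decision and closes the
     * socket without looking at the request. */
    close(sv[1]);

    /* Step 4: the client waits for a response that never comes. */
    n = read(sv[0], reply, sizeof(reply));
    close(sv[0]);
    return n;
}
```

The write in step 2 succeeds, so the client has no indication that anything went wrong until the read in step 4 comes back empty or with an error.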

I worked around the issue by adding retry code to winbindd_request in
Comment 1 Jim McDonough 2005-05-24 09:22:53 UTC
Here is John's patch (IBM rules, I have to post it for him).  The main comment I
have is I'd like to see a timeout or total count limit on number of retries, so
we don't get into an infinite loop.  Any thoughts?

--- nsswitch/wb_common.c.orig   2005-05-19 16:48:11.000000000 -0500
+++ nsswitch/wb_common.c        2005-05-19 13:56:27.000000000 -0500
@@ -588,12 +588,16 @@
                            struct winbindd_request *request,
                            struct winbindd_response *response)
-       NSS_STATUS status;
-       status = winbindd_send_request(req_type, request);
-       if (status != NSS_STATUS_SUCCESS)
-               return(status);
-       return winbindd_get_response(response);
+       while( status == NSS_STATUS_UNAVAIL ) {
+               status = winbindd_send_request(req_type, request);
+               if (status != NSS_STATUS_SUCCESS)
+                       return(status);
+               status = winbindd_get_response(response);
+       }
+       return status;
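
The retry limit asked for above could be sketched roughly as follows. This is an illustration only, not the attached patch: the status enum, the callback type, and all function names are stand-ins invented for the example, and the real code would call winbindd_send_request and winbindd_get_response instead of a callback.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the NSS status codes used by the real code. */
typedef enum {
    STATUS_SUCCESS,
    STATUS_NOTFOUND,
    STATUS_UNAVAIL
} status_t;

/* Hypothetical callback: performs one send/receive round trip. */
typedef status_t (*round_trip_fn)(void *ctx);

/*
 * Run the round trip at most max_tries times in total, retrying only
 * while the result is STATUS_UNAVAIL. Any other status ends the loop.
 */
static status_t request_with_retry(round_trip_fn fn, void *ctx, int max_tries)
{
    status_t status = STATUS_UNAVAIL;
    int tries;

    assert(max_tries > 0);
    for (tries = 0; tries < max_tries && status == STATUS_UNAVAIL; tries++) {
        status = fn(ctx);
    }
    return status;
}

/*
 * Helper for trying the loop out: returns STATUS_UNAVAIL until the
 * counter pointed to by ctx reaches zero, then STATUS_SUCCESS.
 */
static status_t flaky_round_trip(void *ctx)
{
    int *remaining_failures = ctx;

    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return STATUS_UNAVAIL;
    }
    return STATUS_SUCCESS;
}
```

Counting total tries rather than retries keeps the bound easy to state: with max_tries set to 10, a daemon that never answers costs the caller exactly ten round trips before STATUS_UNAVAIL is returned.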
Comment 2 Jim McDonough 2005-05-26 11:07:54 UTC
Created attachment 1241 [details]
Retries up to 3 times if the winbind daemon returns NSS_STATUS_UNAVAIL
Comment 3 Volker Lendecke 2005-05-27 03:09:54 UTC
Question: Why do we need the timeout in the library at all? If winbind hangs for
some reason, it hangs. Why should application programs depending on it not also
hang? Wouldn't it be better to add a timeout mechanism to winbind itself? It
knows best what is a real problem and what is just a slow DC? We're not there
yet though.

Comment 4 Jim McDonough 2005-05-31 09:34:41 UTC
Reassigning to me, will check in patch with 10 retries (well, 10 tries total).
Comment 5 Jim McDonough 2005-05-31 11:51:03 UTC
Checked in attached patch, with retries set to 10
Comment 6 Gerald (Jerry) Carter 2005-08-24 10:19:45 UTC
sorry for the spam, cleaning up the database to prevent unnecessary reopens of bugs.