We have an environment that is being used to simulate a large file server consolidation. The customer wants to see 12000 concurrent users on a single file server that is a domain member server running winbindd.

The symptom of this problem is login failures while there are high numbers of nsswitch lookups. The smbd log for the failed connection will have the following:

[2005/05/18 12:12:03, 2, pid=26222] auth/auth.c:check_ntlm_password(312)
  check_ntlm_password: Authentication for user [client826] -> [client826] FAILED with error NT_STATUS_NO_SUCH_USER

But the user client826 does exist and is found via nsswitch upon retry. In a level 10 winbind log there will be no mention of this lookup, but a lot of messages like the following:

[2005/05/18 12:10:37, 5, pid=25342] nsswitch/winbindd.c:process_loop(704)
  winbindd: Exceeding 200 client connections, removing idle connection.
[2005/05/18 12:10:37, 5, pid=25342] nsswitch/winbindd.c:remove_idle_client(426)
  Found 198 idle client connections, shutting down sock 22, pid 25760

The logic in remove_idle_client in winbindd.c looks OK, but since state->read_buf_len can't be updated once idle client removal has started, there is a window where a client can write a request and then get its socket closed with no response.

I worked around the issue by adding retry code to winbindd_request in nsswitch/wb_common.c
Here is John's patch (IBM rules, I have to post it for him). The main comment I have is that I'd like to see a timeout or a total count limit on the number of retries, so we don't get into an infinite loop. Any thoughts?

--- nsswitch/wb_common.c.orig 2005-05-19 16:48:11.000000000 -0500
+++ nsswitch/wb_common.c 2005-05-19 13:56:27.000000000 -0500
@@ -588,12 +588,16 @@
 			    struct winbindd_request *request,
 			    struct winbindd_response *response)
 {
-	NSS_STATUS status;
+	NSS_STATUS status = NSS_STATUS_UNAVAIL;
 
-	status = winbindd_send_request(req_type, request);
-	if (status != NSS_STATUS_SUCCESS)
-		return(status);
-	return winbindd_get_response(response);
+	while( status == NSS_STATUS_UNAVAIL ) {
+		status = winbindd_send_request(req_type, request);
+		if (status != NSS_STATUS_SUCCESS)
+			return(status);
+		status = winbindd_get_response(response);
+	}
+
+	return status;
 }
 
 /*************************************************************************
Created attachment 1241
Retries up to 3 times if the winbind daemon returns NSS_STATUS_UNAVAIL
Question: Why do we need the timeout in the library at all? If winbind hangs for some reason, it hangs. Why should application programs depending on it not also hang? Wouldn't it be better to add a timeout mechanism to winbind itself? It knows best what is a real problem and what is just a slow DC. We're not there yet though.

Volker
Reassigning to me, will check in patch with 10 retries (well, 10 tries total).
Checked in attached patch, with retries set to 10
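
For readers following along, here is a minimal sketch of what a bounded version of the retry loop could look like. It is not the code that was checked in. The NSS_STATUS enum is a stand-in for the real one, and WB_MAX_TRIES, request_with_retry and the fake_* helpers are hypothetical names used only to simulate winbindd dropping a response after the request has been written. As in the posted patch, only an NSS_STATUS_UNAVAIL from the response side triggers another attempt; a failed send returns immediately.

```c
#include <assert.h>

/* Stand-in for the NSS status codes used in nsswitch/wb_common.c
 * (the real values differ; only the names matter for this sketch). */
typedef enum {
	NSS_STATUS_SUCCESS,
	NSS_STATUS_NOTFOUND,
	NSS_STATUS_UNAVAIL
} NSS_STATUS;

/* Hypothetical cap: the thread mentions 3 tries in the attachment
 * and 10 tries total in the check-in. */
#define WB_MAX_TRIES 10

/* Hypothetical test doubles. fake_dropped_responses is the number of
 * upcoming responses that are lost because the daemon closed the
 * socket after the request was written. */
static int fake_dropped_responses = 0;
static int fake_send_calls = 0;

static NSS_STATUS fake_send_request(void)
{
	fake_send_calls++;
	return NSS_STATUS_SUCCESS;
}

static NSS_STATUS fake_get_response(void)
{
	if (fake_dropped_responses > 0) {
		fake_dropped_responses--;
		return NSS_STATUS_UNAVAIL;
	}
	return NSS_STATUS_SUCCESS;
}

/* Send the request and read the response, retrying while the response
 * side reports NSS_STATUS_UNAVAIL, but never more than max_tries times. */
static NSS_STATUS request_with_retry(int max_tries)
{
	NSS_STATUS status = NSS_STATUS_UNAVAIL;
	int tries;

	assert(max_tries > 0);

	for (tries = 0;
	     tries < max_tries && status == NSS_STATUS_UNAVAIL;
	     tries++) {
		status = fake_send_request();
		if (status != NSS_STATUS_SUCCESS)
			return status;	/* send failed: do not retry */
		status = fake_get_response();
	}

	return status;
}
```

With two dropped responses, request_with_retry(WB_MAX_TRIES) succeeds on the third attempt. If every response is dropped, it gives up after WB_MAX_TRIES attempts and returns NSS_STATUS_UNAVAIL instead of looping forever.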
Sorry for the spam, cleaning up the database to prevent unnecessary reopens of bugs.