Frequently, one of the winbindd processes gets pinned at 100% CPU (see PID 27926):

# ps auxf |grep winbindd
root      6640  0.0  0.0   3608   748 pts/1 S+ 11:06    0:00 \_ grep winbindd
root     27487  0.0  0.3 163692 77884 ?     Ss Aug19   20:11 /usr/sbin/winbindd
root     27488  0.0  0.2 134076 53320 ?     S  Aug19   14:42 \_ /usr/sbin/winbindd
root     27600  0.0  0.1 137284 31768 ?     S  Aug19   11:11 \_ /usr/sbin/winbindd
root     27601  0.0  0.0 137284 21608 ?     S  Aug19   11:00 \_ /usr/sbin/winbindd
root     27926 98.3  0.0  97644 10464 ?     R  Aug19 7385:39 \_ /usr/sbin/winbindd
root     28024  0.0  0.0 140280 12620 ?     S  Aug19    0:17 \_ /usr/sbin/winbindd
root      8106  0.0  0.0  38160 10384 ?     S  Aug20    0:00 \_ /usr/sbin/winbindd
root     18562  0.0  0.0  38160 10664 ?     S  Aug20    0:00 \_ /usr/sbin/winbindd
root     17520  0.0  0.0  70520 10492 ?     S  Aug22    0:00 \_ /usr/sbin/winbindd
root     19734  0.0  0.0 116392 10524 ?     S  Aug22    0:00 \_ /usr/sbin/winbindd

# strace -p 27926 -ff
Process 27926 attached
read(30, "", 4) = 0
gettimeofday({1410444433, 337148}, NULL) = 0
poll([{fd=30, events=POLLIN}], 1, -443701286) = 1 ([{fd=30, revents=POLLIN}])
read(30, "", 4) = 0
gettimeofday({1410444433, 337680}, NULL) = 0
poll([{fd=30, events=POLLIN}], 1, -443701287) = 1 ([{fd=30, revents=POLLIN}])
read(30, "", 4) = 0
gettimeofday({1410444433, 337915}, NULL) = 0
poll([{fd=30, events=POLLIN}], 1, -443701287) = 1 ([{fd=30, revents=POLLIN}])
read(30, "", 4) = 0
gettimeofday({1410444433, 338024}, NULL) = 0
poll([{fd=30, events=POLLIN}], 1, -443701287) = 1 ([{fd=30, revents=POLLIN}])
...

# lsof -p 27926
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
...
winbindd 27926 root 30u IPv4 208445530 0t0 TCP some-local-ip:43965->some-remote-ip:kerberos (CLOSE_WAIT)
...

It doesn't always happen on a Kerberos connection. I have seen the same thing happen on a DNS connection as well:

-----
winbindd 23114 root 30u IPv4 299792617 0t0 TCP 127.0.0.1:dns->127.0.0.1:34905 (CLOSE_WAIT)
-----

What's odd about the two occurrences I have looked at is that both times the FD (file descriptor) is 30.
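For anyone reading along: the strace pattern above (poll() reporting POLLIN, then read() returning 0, repeated) is what a loop looks like when the peer has closed the connection and the local side keeps polling the descriptor instead of closing it. The following is only a minimal Python sketch of that general mechanism, not winbindd code. It assumes a Unix-domain socketpair behaves like the TCP socket in CLOSE_WAIT for this purpose, and the function name and parameters are made up for illustration.

```python
import select
import socket


def count_spurious_wakeups(close_on_eof, max_iterations=5):
    """Count how often poll() reports a peer-closed socket as readable.

    Hypothetical helper for illustration only; it is not taken from Samba.
    """
    local, peer = socket.socketpair()
    peer.close()  # the remote end goes away; our end stays open

    poller = select.poll()
    poller.register(local.fileno(), select.POLLIN)

    wakeups = 0
    for _ in range(max_iterations):
        events = poller.poll(100)  # timeout in milliseconds
        if not events:
            break  # nothing readable, so the loop would sleep here
        wakeups += 1
        data = local.recv(4)  # returns b"" once the peer has closed
        if data == b"" and close_on_eof:
            # Treat a zero-byte read as end of file: stop polling and close.
            poller.unregister(local.fileno())
            local.close()
            break

    if local.fileno() != -1:
        local.close()  # clean up when the loop never closed the socket
    return wakeups


if __name__ == "__main__":
    print("ignoring EOF:", count_spurious_wakeups(close_on_eof=False))
    print("closing on EOF:", count_spurious_wakeups(close_on_eof=True))
```

When the zero-byte read is ignored, every poll() call returns immediately, so the loop only stops because of the iteration cap; without the cap it would spin at 100% CPU. When the zero-byte read is treated as end of file, the loop wakes up once and stops. Whether winbindd hits exactly this path is something the GDB backtrace would have to confirm.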
This happened again on the second system. This time it was on a different fd (31), and the connection was to the Active Directory server. This bug may go unnoticed on many systems, since there is no symptom other than 100% CPU utilization. I wonder how prevalent it is.
3.6 is quite old and gets only security fixes. It would be helpful if you could update to 4.1 and see whether the problem still shows up there.
Are you able to attach with GDB and get a traceback? It would be useful to know where this is occurring. Also, did you have a chance to try Samba 4.1.x?
I will attach GDB and get a stack trace the next time this happens. Do you need anything other than a stack trace from all running threads?