Bug 10836 - Winbindd running on 100% cpu while poll on socket in CLOSE_WAIT state
Status: NEW
Product: Samba 3.6
Classification: Unclassified
Component: Winbind
Version: 3.6.22
Hardware: x86 Linux
Importance: P3 major
Target Milestone: ---
Assigned To: Michael Adam
QA Contact: Samba QA Contact
Depends on:
Blocks:
Reported: 2014-09-24 17:48 UTC by Damon Chitsaz
Modified: 2015-01-07 17:34 UTC (History)

Description Damon Chitsaz 2014-09-24 17:48:27 UTC
Frequently, one of the winbindd processes gets pinned at 100% CPU (see PID 27926):

# ps auxf |grep winbindd
root 6640 0.0 0.0 3608 748 pts/1 S+ 11:06 0:00 \_ grep winbindd
root 27487 0.0 0.3 163692 77884 ? Ss Aug19 20:11 /usr/sbin/winbindd
root 27488 0.0 0.2 134076 53320 ? S Aug19 14:42 \_ /usr/sbin/winbindd
root 27600 0.0 0.1 137284 31768 ? S Aug19 11:11 \_ /usr/sbin/winbindd
root 27601 0.0 0.0 137284 21608 ? S Aug19 11:00 \_ /usr/sbin/winbindd
root 27926 98.3 0.0 97644 10464 ? R Aug19 7385:39 \_ /usr/sbin/winbindd
root 28024 0.0 0.0 140280 12620 ? S Aug19 0:17 \_ /usr/sbin/winbindd
root 8106 0.0 0.0 38160 10384 ? S Aug20 0:00 \_ /usr/sbin/winbindd
root 18562 0.0 0.0 38160 10664 ? S Aug20 0:00 \_ /usr/sbin/winbindd
root 17520 0.0 0.0 70520 10492 ? S Aug22 0:00 \_ /usr/sbin/winbindd
root 19734 0.0 0.0 116392 10524 ? S Aug22 0:00 \_ /usr/sbin/winbindd

# strace -p 27926 -ff
Process 27926 attached
read(30, "", 4) = 0
gettimeofday({1410444433, 337148}, NULL) = 0
poll([{fd=30, events=POLLIN}], 1, -443701286) = 1 ([{fd=30, revents=POLLIN}])
read(30, "", 4) = 0
gettimeofday({1410444433, 337680}, NULL) = 0
poll([{fd=30, events=POLLIN}], 1, -443701287) = 1 ([{fd=30, revents=POLLIN}])
read(30, "", 4) = 0
gettimeofday({1410444433, 337915}, NULL) = 0
poll([{fd=30, events=POLLIN}], 1, -443701287) = 1 ([{fd=30, revents=POLLIN}])
read(30, "", 4) = 0
gettimeofday({1410444433, 338024}, NULL) = 0
poll([{fd=30, events=POLLIN}], 1, -443701287) = 1 ([{fd=30, revents=POLLIN}])
...

# lsof -p 27926
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
...
winbindd 27926 root 30u IPv4 208445530 0t0 TCP some-local-ip:43965->some-remote-ip:kerberos (CLOSE_WAIT)
...
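
The strace output above shows poll() reporting fd 30 as readable and read() returning 0 each time, and lsof shows the socket in CLOSE_WAIT, so the peer has already closed the connection. That pattern is consistent with a loop that does not treat a zero-length read as end-of-file. The following is only a minimal Python sketch of that general behaviour, not winbindd's actual code; `poll_loop` and `demo` are hypothetical names, and a local socket pair stands in for the TCP connection.

```python
import select
import socket


def poll_loop(sock, iterations, close_on_eof):
    """Poll sock for input and count how often poll() reports it readable.

    When close_on_eof is False, a zero-length read is ignored, which
    mimics the busy loop seen in the strace output.
    """
    poller = select.poll()
    poller.register(sock.fileno(), select.POLLIN)
    wakeups = 0
    for _ in range(iterations):
        if not poller.poll(1000):  # timeout in milliseconds
            continue
        wakeups += 1
        data = sock.recv(4)
        if data == b"" and close_on_eof:
            # Zero bytes means the peer closed its end: stop polling
            # this descriptor and close it.
            poller.unregister(sock.fileno())
            sock.close()
            break
    return wakeups


def demo(close_on_eof, iterations=5):
    """Close one end of a socket pair, then run the loop on the other end."""
    local, peer = socket.socketpair()
    peer.close()  # simulate the remote side closing the connection
    try:
        return poll_loop(local, iterations, close_on_eof)
    finally:
        local.close()  # closing an already closed socket is harmless
```

With `close_on_eof=False` every iteration wakes up immediately, because the descriptor stays readable for as long as it remains open; with `close_on_eof=True` the loop ends after the first zero-length read.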


It doesn't always happen on a Kerberos connection. I have seen the same thing happen on a DNS connection as well:
-----
winbindd 23114 root 30u IPv4 299792617 0t0 TCP 127.0.0.1:dns->127.0.0.1:34905 (CLOSE_WAIT)
-----

What's odd about the two occurrences I have looked at is that both times the FD (file descriptor) was 30.
Comment 1 Damon Chitsaz 2014-10-20 22:43:22 UTC
This happened again, this time on the second system. It was on a different fd (31), and the connection was to the Active Directory server.

This bug may go unnoticed on systems since there is no symptom other than 100% cpu utilization. I wonder how prevalent this bug is.
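
One way to check other systems for lingering CLOSE_WAIT sockets without waiting for the CPU symptom is to scan the kernel's TCP table. The sketch below parses text in the Linux /proc/net/tcp format (IPv4 only), where state code 08 means CLOSE_WAIT. The function names are hypothetical, the address decoding assumes a little-endian host such as x86, and mapping the reported socket inode back to a winbindd PID (for example with lsof) is left out.

```python
import socket

CLOSE_WAIT = "08"  # TCP state code for CLOSE_WAIT in /proc/net/tcp


def decode_ipv4(hex_addr):
    """Convert 'AABBCCDD:PPPP' from /proc/net/tcp to (ip, port).

    Assumes a little-endian host, where the address bytes appear reversed.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(bytes.fromhex(ip_hex)[::-1])
    return ip, int(port_hex, 16)


def close_wait_entries(lines):
    """Return (local, remote, inode) for every row in CLOSE_WAIT state."""
    results = []
    for line in lines[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 10 or fields[3] != CLOSE_WAIT:
            continue
        local = decode_ipv4(fields[1])
        remote = decode_ipv4(fields[2])
        results.append((local, remote, fields[9]))  # fields[9] is the inode
    return results
```

On a Linux host this could be fed with the lines of /proc/net/tcp, for example `close_wait_entries(open("/proc/net/tcp").read().splitlines())`.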
Comment 2 Björn Jacke 2014-10-21 07:43:23 UTC
Samba 3.6 is quite old and only gets security fixes. It would be helpful if you could update to 4.1 and see whether the problem still shows up there.
Comment 3 Richard Sharpe 2014-12-30 20:41:37 UTC
Are you able to attach with GDB and get a traceback?

It would be useful to know where this is occurring.

Also, did you have a chance to try Samba 4.1.x?
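
A traceback from every thread of the spinning process can be captured non-interactively. The Python sketch below wraps a gdb batch invocation using the standard `-p`, `-batch` and `-ex` options with the `thread apply all bt` command; the helper names are hypothetical, and it assumes gdb is installed, the caller has permission to attach, and ideally that Samba debug symbols are available.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build the gdb argument list that dumps a backtrace of all threads."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def capture_backtrace(pid, timeout=60):
    """Attach gdb to pid, print all backtraces, detach, and return the text."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

For the process in the description this would be called as `capture_backtrace(27926)`, run as root while the process is still spinning.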
Comment 4 Damon Chitsaz 2015-01-07 17:34:00 UTC
I will attach gdb and get a stack trace the next time this happens.

Do you need anything other than a stack trace from all running threads?