The Samba-Bugzilla – Bug 9509
WINS lookup may cause 1000s of concurrent threads
Last modified: 2012-12-19 15:56:16 UTC
The offending code is in UniAddress.java. The implementation of lookupServerOrWorkgroup() creates a pair of threads which appear to issue WINS query of some kind, listen for both to reply, and then finish. Both threads will exit very shortly afterwards, as their work is done.
Unfortunately, it is possible that once they have notified the main thread to continue, these threads will not be scheduled for execution (and thus do not get a chance to exit!) fast enough, when these lookups are occurring at a very high rate. The end result of the current implementation was over 2000 threads on my tomcat7 server which repeatedly caused it to stop responding. It was not until I was able to see it occurring with jconsole attached to the server that I saw the culprit, large number of "JCIFS-QueryThread" threads.
I believe I triggered the issue by performing listFiles() on a cifs share with 1000s of files in one directory. I believe I managed to workaround the issue by disabling WINS query altogether and only using DNS querying with jcifs.resolveOrder set to DNS.
General commentary: It would improve jcifs reliability if all its work was scheduled to occur via ThreadPoolExecutors. There should be 1 executor pool with 1 to maybe 10 threads capable of work, and short-lived Runnables for all things such as WINS queries, which would be passed to them as work items. The Executor can then schedule them when threads are available. The executor returns a Future which can be used to pause execution in the main thread, until its corresponding Runnable has run, thus suggesting a simple rewrite of the UniAddress lookup as follows, with QueryThread implementing Runnable instead, against some multithreading executor:
QueryRunnable w1x = new QueryRunnable(name, type, null, svr);
QueryRunnable w20 = new QueryRunnable(name 0x20, null, svr);
Future<?> f1x = executor.submit(w1x);
Future<?> f20 = executor.submit(w20);
w1.ans & w2.ans now hold the response. (Should probably use AtomicReference to hold them.)
Since the code primarily is interested of the w1x response, it would also be possible to only wait for w20 to finish if w1x's ans is null. w20 might be canceled if it didn't yet get chance to run (which is of course unlikely), but in principle it would be the right thing to do.
It appears that the tomcat7 server's nonresponsivity was an unrelated bug that occurred concurrently with the jcifs listFiles() operation, so that mystery got solved.
Technically the high thread count is harmless, I guess, just not particularly good, imho.