This is a continuation of Bug 460 into Samba 3.0.0rc4. I managed to get 3.0.0rc4 to build under AIX and installed it today. Under an 11% higher printing load than last Monday, when the load average got to 28+ and the service melted, the server load average today never exceeded 8, and some of that was I/O. So I should be happy, but I'm not. I am still seeing smbd processes chewing up CPU time with major network ping-ponging. Here is an example of the network traffic:

17:28:50.618706297 truncated-ip - 107 bytes missing!dhcp-151-106.johnson.cornell.edu.1027 > page2.cit.cornell.edu.445: P 347264:347397(133) ack 477996 win 64146 (DF)
17:28:50.621102325 truncated-ip - 157 bytes missing!page2.cit.cornell.edu.445 > dhcp-151-106.johnson.cornell.edu.1027: P 477996:478179(183) ack 347397 win 16384 (DF)
17:28:50.622408803 truncated-ip - 107 bytes missing!dhcp-151-106.johnson.cornell.edu.1027 > page2.cit.cornell.edu.445: P 347397:347530(133) ack 478179 win 64512 (DF)
17:28:50.624493533 truncated-ip - 157 bytes missing!page2.cit.cornell.edu.445 > dhcp-151-106.johnson.cornell.edu.1027: P 478179:478362(183) ack 347530 win 16384 (DF)

This goes on forever. My server serves 115 printers, hundreds of workstations, and thousands of students. Do you agree that this is a serious problem?
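A rough way to spot this kind of ping-pong in a tcpdump text capture is to count how often the same (source, destination, payload length) triple repeats. The sketch below is illustrative only: the function names, the regular expression, and the threshold are assumptions rather than anything from Samba or tcpdump, and the regular expression is only known to fit lines shaped like the four quoted above.

```python
import re
from collections import Counter

# Matches the "src > dst: flags seq:end(len)" part of a tcpdump text line.
# A leading "truncated-ip - N bytes missing!" prefix is skipped over.
PACKET_RE = re.compile(r"(?:^|[\s!])([\w.-]+) > ([\w.-]+): \S+ \d+:\d+\((\d+)\)")

def count_exchanges(lines):
    """Count packets per (source, destination, payload length)."""
    counts = Counter()
    for line in lines:
        m = PACKET_RE.search(line)
        if m:
            counts[(m.group(1), m.group(2), int(m.group(3)))] += 1
    return counts

def suspected_loops(counts, threshold=100):
    """Return the triples seen at least `threshold` times (threshold is arbitrary)."""
    return {key: n for key, n in counts.items() if n >= threshold}

# The four packets quoted in this report.
SAMPLE = [
    "17:28:50.618706297 truncated-ip - 107 bytes missing!dhcp-151-106.johnson.cornell.edu.1027 > page2.cit.cornell.edu.445: P 347264:347397(133) ack 477996 win 64146 (DF)",
    "17:28:50.621102325 truncated-ip - 157 bytes missing!page2.cit.cornell.edu.445 > dhcp-151-106.johnson.cornell.edu.1027: P 477996:478179(183) ack 347397 win 16384 (DF)",
    "17:28:50.622408803 truncated-ip - 107 bytes missing!dhcp-151-106.johnson.cornell.edu.1027 > page2.cit.cornell.edu.445: P 347397:347530(133) ack 478179 win 64512 (DF)",
    "17:28:50.624493533 truncated-ip - 157 bytes missing!page2.cit.cornell.edu.445 > dhcp-151-106.johnson.cornell.edu.1027: P 478179:478362(183) ack 347530 win 16384 (DF)",
]

if __name__ == "__main__":
    # A threshold of 2 is used only because the sample is four packets long.
    for key, n in suspected_loops(count_exchanges(SAMPLE), threshold=2).items():
        print(key, n)
```

On a real capture the threshold would need to be far higher, and a repeated 133-byte request answered by a repeated 183-byte reply only suggests a loop; it does not identify which call the client is repeating.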
Created attachment 181 [details] tcpdump hex output of packet traffic between client and server
Created attachment 182 [details] tcpdump ASCII output of packet traffic between client and server
One small detail which I always forget to mention is that my server has two CPUs. This could be why not everybody is having this problem.
do you have "disable spoolss = yes" by chance?
resetting target milestone. 3.0.1 has been frozen. Will have to re-evaluate these.
I'm betting this is a driver problem. I've seen the spooler on XP clients go into tight loops issuing getprinterdata() calls over and over when using certain Lexmark PCL drivers (even with Windows print servers).
Thanks for continuing to think about this problem. I suppose it could be driver-related. Your description of the symptoms is exactly what I'm seeing. My Windows instructions for my users say to use the Adobe PostScript driver, but I have no way of enforcing that. However, I do trash non-PostScript jobs (after they get through Samba), so I doubt that PCL drivers are involved. I am planning to experiment with enabling spoolss when I get the time. Meanwhile, it works to kill all the Samba daemons periodically and restart them. A load average drop from 6 to 1 is typical. If you've seen the getprinterdata() loop with spoolss enabled, that would be rather discouraging. It is my considered opinion that this problem is confined to XP clients. That is why I am strongly recommending that my users use LPR protocol under XP in spite of Microsoft's brain-dead implementation.
First comment is to enable spoolss. I have had reports of that making clients loop (not sure of the client OS). The Lexmark PCL driver bug is exactly that -- a driver bug. The workaround in that case would be to use the PS version of the same driver. At this point I believe that the spoolss code is a better choice than setting 'disable spoolss = yes'. That was provided as a temporary workaround during the 2.2 days.
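For anyone unsure whether the workaround is still set, the following is a minimal sketch of checking the [global] section of an smb.conf for 'disable spoolss'. It uses Python's generic configparser, not Samba's own parser, so it is an approximation: the function name is hypothetical, the set of truthy values is an assumption, and backslash line continuations and include directives are not handled.

```python
import configparser

# Values treated here as "true"; assumed, not taken from Samba's source.
TRUTHY = {"yes", "true", "on", "1"}

def spoolss_disabled(smb_conf_text):
    """Return True if [global] sets 'disable spoolss' to a truthy value."""
    # Strip indentation so configparser does not mistake indented
    # parameters for continuation lines of the previous value.
    flattened = "\n".join(line.strip() for line in smb_conf_text.splitlines())
    # interpolation=None because smb.conf values may contain '%' macros.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(flattened)
    for section in parser.sections():
        if section.lower() == "global":
            value = parser.get(section, "disable spoolss", fallback="no")
            return value.strip().lower() in TRUTHY
    return False

if __name__ == "__main__":
    example = "[global]\n    workgroup = EXAMPLE\n    disable spoolss = yes\n"
    print(spoolss_disabled(example))
```

Where Samba's testparm utility is available, its dump of the effective configuration is the more reliable check.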
I am now running 3.0.2a (which BTW builds flawlessly under AIX 5.2 with gcc-3.2.2) with spoolss _enabled_, and I am still experiencing the same problem. As I pointed out, Lexmark PCL drivers will not work with our printing system because we redirect non-PostScript jobs to /dev/null. It is possible that some misguided individuals might ignore our instructions, install a Lexmark PCL driver, find out that it doesn't work, and then leave it installed to bedevil Samba, but I find this extremely unlikely. In any case, this would constitute a DOS attack vulnerability in Samba which should still be fixed. My fallback strategy is to run an hourly cron job which will kill all Samba daemons and then restart Samba. This lacks something in esthetic appeal. Thanks for your continued interest.
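As a gentler alternative to killing every daemon on a timer, a watchdog could first look for smbd processes that have accumulated an unusual amount of CPU time. The sketch below only parses text in the shape of `ps -e -o pid,time,comm` output and reports matching PIDs; it does not kill anything. The function names, the sample listing, and the 10-minute limit are all hypothetical, and the TIME format is assumed to be the POSIX `[dd-]hh:mm:ss` form, which should be verified on AIX before relying on it.

```python
def cpu_seconds(time_field):
    """Convert a ps TIME field ([dd-]hh:mm:ss or mm:ss) to seconds."""
    days = 0
    if "-" in time_field:
        day_part, time_field = time_field.split("-", 1)
        days = int(day_part)
    seconds = 0
    for part in time_field.split(":"):
        seconds = seconds * 60 + int(part)
    return days * 86400 + seconds

def busy_smbd(ps_text, limit_seconds):
    """Return PIDs of smbd processes whose CPU time is at least limit_seconds."""
    busy = []
    for line in ps_text.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        pid, time_field, comm = fields
        if comm.strip() == "smbd" and cpu_seconds(time_field) >= limit_seconds:
            busy.append(int(pid))
    return busy

# Made-up listing for illustration; not taken from the affected server.
SAMPLE_PS = """\
  PID     TIME COMMAND
 1001 00:00:03 smbd
 1002 01:12:40 smbd
 1003 00:45:00 nmbd
"""

if __name__ == "__main__":
    print(busy_smbd(SAMPLE_PS, limit_seconds=600))
```

Accumulated CPU time is only a proxy: a long-lived, legitimately busy smbd would also be flagged, so the output is a list of candidates to inspect rather than processes that are safe to kill.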
Rick, please take a raw network trace and send it to me. also include the output of ps -aux (or AIX equiv) and netstat -pant.
Jerry, I would if I could but I can't. Since the day when I switched to 3.0.2a with spoolss and immediately encountered the same ping-pong problem, IT HASN'T HAPPENED AGAIN - even during periods of high printing load. Is it possible that there was a client out there which persisted in using the non-spoolss RPC calls until I bounced Samba one last time? I hope you will not be too disappointed by this lost debugging opportunity :-) Since I have been observing for only three days it could still blow up again, but I think this may have solved my problem. Thank you for your patient interest.
thanks for the update. i'll wait about a week and then close this bug out if we don't see any more comments.
Created attachment 474 [details] Tar containing lsof, tcpdump, ps, and netstat output
Looks like I spoke too soon. Again. AIX netstat doesn't appear to have any of the arguments you suggest. The tcpdump output is a "-w" file.
Rick, I finally looked at the tcpdump and you have a client that is still using the lanman printing interface. You will probably need to clean out the registry on the clients to remove any lanman printing ports. Looks like the spoolss feature is still the right solution.
On May 7, I upgraded to Samba 3.0.3 and I haven't encountered the problem since. This is nice since the two weeks following that (during one of which I was on vacation) have the most heavy printing traffic of the year. Our server handled the load without breaking a sweat. Unless there is some obscure side effect from just doing a build and install which caused the problem to go away, I would have to say that something got fixed between 3.0.2a and 3.0.3. This is a tremendous relief for us. Thanks for your help.
originally reported against one of the 3.0.0rc[1-4] releases. Cleaning up non-production versions.
sorry for the spam, cleaning up the database to prevent unnecessary reopens of bugs.