535 – ping-ponging smbd processes chewing up CPU time and network bandwidth

Status:	CLOSED FIXED

Alias:	None

Product:	Samba 3.0
Classification:	Unclassified
Component:	Printing
Version:	3.0.0preX
Hardware:	All AIX

Importance:	P5 major
Target Milestone:	none
Assignee:	Gerald (Jerry) Carter (dead mail address)

Reported:	2003-09-29 14:48 UTC by Rick Cochran
Modified:	2005-08-24 10:18 UTC
CC List:	0 users

Attachments
tcpdump hex output of packet traffic between client and server (47.92 KB, text/plain) 2003-10-06 07:18 UTC, Rick Cochran
tcpdump ASCII output of packet traffic between client and server (24.19 KB, text/plain) 2003-10-06 07:18 UTC, Rick Cochran
Tar containing lsof, tcpdump, ps, and netstat output (132.62 KB, application/octet-stream) 2004-04-23 15:01 UTC, Rick Cochran

Description Rick Cochran 2003-09-29 14:48:58 UTC

This is a continuation of Bug 460 into Samba 3.0.0rc4.

I managed to get 3.0.0rc4 to build under AIX and installed it today.  Under an
11% higher printing load than last Monday when the load average got to 28+ and
the service melted, the server load average today never exceeded 8 and some of
that was I/O.

So I should be happy, but I'm not.  I am still seeing smbd processes chewing up
CPU time with major network ping-ponging.

Here is an example of the network traffic:

17:28:50.618706297 truncated-ip - 107 bytes missing!dhcp-151-106.johnson.cornell.edu.1027 > page2.cit.cornell.edu.445: P 347264:347397(133) ack 477996 win 64146 (DF)
17:28:50.621102325 truncated-ip - 157 bytes missing!page2.cit.cornell.edu.445 > dhcp-151-106.johnson.cornell.edu.1027: P 477996:478179(183) ack 347397 win 16384 (DF)
17:28:50.622408803 truncated-ip - 107 bytes missing!dhcp-151-106.johnson.cornell.edu.1027 > page2.cit.cornell.edu.445: P 347397:347530(133) ack 478179 win 64512 (DF)
17:28:50.624493533 truncated-ip - 157 bytes missing!page2.cit.cornell.edu.445 > dhcp-151-106.johnson.cornell.edu.1027: P 478179:478362(183) ack 347530 win 16384 (DF)

This goes on forever.
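
A minimal sketch, not part of the original report, of how the repeating pattern in such a trace could be spotted automatically. It assumes classic tcpdump text output like the excerpt above. The names `PACKET_RE`, `parse_packets`, and `repeated_exchanges` are hypothetical, and counting identical (source, destination, payload length) triples is only a rough heuristic for a ping-pong loop.

```python
import re
from collections import Counter

# Matches the TCP part of a classic tcpdump text line, for example:
#   hostA.1027 > hostB.445: P 347264:347397(133) ack 477996 win 64146 (DF)
# The optional "missing!" prefix is the tail of tcpdump's truncated-ip notice.
PACKET_RE = re.compile(
    r"(?:missing!)?(?P<src>\S+) > (?P<dst>\S+): \S+ \d+:\d+\((?P<length>\d+)\)"
)

# The excerpt from the report, wrapped exactly as it was posted.
SAMPLE = """
17:28:50.618706297 truncated-ip - 107 bytes
missing!dhcp-151-106.johnson.cornell.edu.1027 > page2.cit.cornell.edu.445: P
347264:347397(133) ack 477996 win 64146 (DF)
17:28:50.621102325 truncated-ip - 157 bytes missing!page2.cit.cornell.edu.445 >
dhcp-151-106.johnson.cornell.edu.1027: P 477996:478179(183) ack 347397 win 16384
(DF)
17:28:50.622408803 truncated-ip - 107 bytes
missing!dhcp-151-106.johnson.cornell.edu.1027 > page2.cit.cornell.edu.445: P
347397:347530(133) ack 478179 win 64512 (DF)
17:28:50.624493533 truncated-ip - 157 bytes missing!page2.cit.cornell.edu.445 >
dhcp-151-106.johnson.cornell.edu.1027: P 478179:478362(183) ack 347530 win 16384
(DF)
"""


def parse_packets(text):
    """Yield (src, dst, payload_length) for each TCP data packet in text."""
    # The pasted output is line-wrapped, so collapse all whitespace first.
    flat = " ".join(text.split())
    for match in PACKET_RE.finditer(flat):
        yield match.group("src"), match.group("dst"), int(match.group("length"))


def repeated_exchanges(text, min_repeats=2):
    """Return {(src, dst, length): count} for packets seen at least min_repeats times."""
    counts = Counter(parse_packets(text))
    return {key: n for key, n in counts.items() if n >= min_repeats}


if __name__ == "__main__":
    for (src, dst, length), n in repeated_exchanges(SAMPLE).items():
        print(f"{src} -> {dst}: {n} packets of {length} bytes")
```

On a longer capture, a client whose request and reply sizes repeat hundreds of times in a row would be a candidate for the looping behaviour described here; the threshold `min_repeats` would need tuning for real traffic.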

My server serves 115 printers, hundreds of workstations, and thousands of students.

Do you agree that this is a serious problem?

Comment 1 Rick Cochran 2003-10-06 07:18:09 UTC

Created attachment 181
tcpdump hex output of packet traffic between client and server

Comment 2 Rick Cochran 2003-10-06 07:18:42 UTC

Created attachment 182
tcpdump ASCII output of packet traffic between client and server

Comment 3 Rick Cochran 2003-10-06 07:21:33 UTC

One small detail which I always forget to mention is that my server has two CPUs.

This could be why not everybody is having this problem.

Comment 4 Gerald (Jerry) Carter (dead mail address) 2003-11-13 12:21:03 UTC

do you have "disable spoolss = yes" by chance?

Comment 5 Gerald (Jerry) Carter (dead mail address) 2003-12-12 08:27:41 UTC

resetting target milestone.  3.0.1 has been frozen.  Will have to 
re-evaluate these.

Comment 6 Gerald (Jerry) Carter (dead mail address) 2004-03-04 07:34:23 UTC

I'm betting this is a driver problem.  I've seen the spooler 
on XP clients go into tight loops issuing getprinterdata() 
calls over and over when using certain Lexmark PCL drivers
(even with Windows print servers).

Comment 7 Rick Cochran 2004-03-04 08:56:29 UTC

Thanks for continuing to think about this problem.

I suppose it could be driver-related.  Your description of the symptoms is
exactly what I'm seeing.

My Windows instructions for my users say to use the Adobe PostScript driver, but
I have no way of enforcing that.  However, I do trash non-PostScript jobs (after
they get through Samba), so I doubt that PCL drivers are involved.

I am planning to experiment with enabling spoolss when I get the time.
Meanwhile, it works to kill all the Samba daemons periodically and restart them.
A load average drop from 6 to 1 is typical.

If you've seen the getprinterdata() loop with spoolss enabled, that would be
rather discouraging.

It is my considered opinion that this problem is confined to XP clients.  That
is why I am strongly recommending that my users use LPR protocol under XP in
spite of Microsoft's brain-dead implementation.

Comment 8 Gerald (Jerry) Carter (dead mail address) 2004-03-04 10:31:26 UTC

first comment is to enable spoolss.  I have had reports 
of that making clients loop (not sure the client OS).

The lexmark PCL driver bug is exactly that -- a driver 
bug.  The workaround in that case would be to use the PS 
version of the same driver.  

At this point I believe that the spoolss code is 
a better choice than setting 'disable spoolss = yes'.
That was provided as a temporary workaround during 
the 2.2 days.
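
A minimal sketch, not part of the original comment, of how one might check whether a configuration still carries the old workaround. It assumes a plain smb.conf without include files or backslash line continuations, and the function name `spoolss_disabled` is hypothetical. It relies on smb.conf parameter names ignoring case and whitespace, and on `disable spoolss` defaulting to `no`.

```python
def spoolss_disabled(conf_text):
    """Return True if the last 'disable spoolss' setting in conf_text is truthy."""
    disabled = False  # Samba's default: spoolss is enabled
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, section headers, and anything without '='.
        if not line or line[0] in "#;[" or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # smb.conf ignores case and whitespace in parameter names.
        if "".join(name.split()).lower() == "disablespoolss":
            disabled = value.strip().lower() in ("yes", "true", "1")
    return disabled


if __name__ == "__main__":
    example = "[global]\n   disable spoolss = yes\n"
    print(spoolss_disabled(example))
```

Samba's own `testparm` utility is the authoritative way to inspect the effective configuration; this sketch only scans the text.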

Comment 9 Rick Cochran 2004-04-17 06:38:37 UTC

I am now running 3.0.2a (which BTW builds flawlessly under AIX 5.2 with
gcc-3.2.2) with spoolss _enabled_, and I am still experiencing the same problem.

As I pointed out, Lexmark PCL drivers will not work with our printing system
because we redirect non-PostScript jobs to /dev/null.  It is possible that some
misguided individuals might ignore our instructions, install a Lexmark PCL
driver, find out that it doesn't work, and then leave it installed to bedevil
Samba, but I find this extremely unlikely.  In any case, this would constitute a
DoS attack vulnerability in Samba which should still be fixed.

My fallback strategy is to run an hourly cron job which will kill all Samba
daemons and then restart Samba.  This lacks something in esthetic appeal.

Thanks for your continued interest.

Comment 10 Gerald (Jerry) Carter (dead mail address) 2004-04-19 07:22:41 UTC

Rick, please take a raw network trace and send it to me.
also include the output of ps -aux (or AIX equiv) and 
netstat -pant.

Comment 11 Rick Cochran 2004-04-22 06:57:28 UTC

Jerry,  I would if I could but I can't.  Since the day when I switched to 3.0.2a
with spoolss and immediately encountered the same ping-pong problem, IT HASN'T
HAPPENED AGAIN - even during periods of high printing load.

Is it possible that there was a client out there which persisted in using the
non-spoolss RPC calls until I bounced Samba one last time?

I hope you will not be too disappointed by this lost debugging opportunity :-)

Since I have been observing for only three days it could still blow up again,
but I think this may have solved my problem.  Thank you for your patient interest.

Comment 12 Gerald (Jerry) Carter (dead mail address) 2004-04-22 18:22:57 UTC

thanks for the update.  i'll wait about a week and then 
close this bug out if we don't see any more comments.

Comment 13 Rick Cochran 2004-04-23 15:01:43 UTC

Created attachment 474
Tar containing lsof, tcpdump, ps, and netstat output

Looks like I spoke too soon.  Again.

AIX netstat doesn't appear to have any of the arguments you suggest.
The tcpdump output is a "-w" file.

Comment 14 Gerald (Jerry) Carter (dead mail address) 2004-06-03 11:55:34 UTC

Rick, I finally looked at the tcpdump and you have a client that
is still using the lanman printing interface.  You will probably
need to clean out the registry on the clients to remove any lanman 
printing ports.

Looks like the spoolss feature is still the right solution.

Comment 15 Rick Cochran 2004-06-03 12:19:13 UTC

On May 7, I upgraded to Samba 3.0.3 and I haven't encountered the problem since.

This is nice since the two weeks following that (during one of which I was on
vacation) have the most heavy printing traffic of the year.  Our server handled
the load without breaking a sweat.

Unless there is some obscure side effect from just doing a build and install
which caused the problem to go away, I would have to say that something got
fixed between 3.0.2a and 3.0.3.  This is a tremendous relief for us.

Thanks for your help.

Comment 16 Gerald (Jerry) Carter (dead mail address) 2005-02-07 09:06:01 UTC

originally reported against one of the 3.0.0rc[1-4] releases.
Cleaning up non-production versions.

Comment 17 Gerald (Jerry) Carter (dead mail address) 2005-08-24 10:18:41 UTC

sorry for the spam, cleaning up the database to prevent unnecessary reopens of bugs.