Bug 2107 - smbd processes take more and more memory on print server
smbd processes take more and more memory on print server
Status: CLOSED FIXED
Product: Samba 3.0
Classification: Unclassified
Component: Printing
3.0.9
Sparc Solaris
: P3 normal
: none
Assigned To: Gerald (Jerry) Carter
Samba QA Contact
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2004-12-01 03:31 UTC by Michael Young
Modified: 2005-08-24 10:26 UTC (History)
1 user (show)

See Also:


Attachments
output from testparm (969 bytes, text/plain)
2004-12-04 03:30 UTC, Michael Young
no flags Details
Output from smbd -b (7.98 KB, text/plain)
2004-12-04 09:25 UTC, Michael Young
no flags Details
patch for fix memory bloating caused by print_queue_update() messages (18.33 KB, patch)
2004-12-07 19:53 UTC, Gerald (Jerry) Carter
no flags Details
invalid patch (1008 bytes, patch)
2004-12-10 20:44 UTC, Jim Delahanty
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Michael Young 2004-12-01 03:31:59 UTC
I have just tried to upgrade our University print server to 3.0.9 and noticed
that over about 12 hours the smbd processes grew to about 100M in size. As all
the server does is printing it looks like the printing code isn't garbage
collecting its freed memory. I have reverted the server to 3.0.7 which didn't
have this problem (though for 3.0.7 I did notice a single smbd process - one
spawned at startup - growing more gradually, getting up to about 1G over about a
month).
Comment 1 Gerald (Jerry) Carter 2004-12-03 15:42:20 UTC
I've got another confirmed report of this on a Solaris system.
Could you attach your smb.conf to this bug report please?

Also, please include some details about you environment
such as the client OS of the machines attached to the 
smbd process with a large VSZ or RSS.
Comment 2 Gerald (Jerry) Carter 2004-12-03 15:45:57 UTC
might also be a good idea to include the output from 
'smbd -b' (attached as a file) and the options passed 
to configure.  Thanks.
Comment 3 Michael Young 2004-12-04 03:31:03 UTC
Created attachment 823 [details]
output from testparm
Comment 4 Michael Young 2004-12-04 03:36:01 UTC
The setup is samba on a Solaris 9 box, authenticating via Active Directory, with
probably over 50 print queues, and connections from over 1000 windows boxes,
mostly XP SP1 but could be anything. Spoolss isn't disabled, but there is no
[print$] share yet (though we will probably add one soon).
Comment 5 Gerald (Jerry) Carter 2004-12-04 07:28:37 UTC
Do you see any tdb files in $locakdir) that are extremely large?
Possibly the messages.tdb file ?
Comment 6 Michael Young 2004-12-04 09:23:27 UTC
No the largest file currently is netsamlogon_cache.tdb at about 6M, though I am
not running 3.0.9 at the moment as the system runs out of memory. Note that the
large size seemed to be uniform in the newer smbd processes (ones started soon
after I started samba were still small).
Comment 7 Michael Young 2004-12-04 09:25:42 UTC
Created attachment 824 [details]
Output from smbd -b
Comment 8 Gerald (Jerry) Carter 2004-12-07 19:53:16 UTC
Created attachment 827 [details]
patch for fix memory bloating caused by print_queue_update() messages
Comment 9 Gerald (Jerry) Carter 2004-12-07 19:56:40 UTC
messages.tdb is truncated at startup so it 
could have been trimmed by the restart.  The 
patch attached to this report should address 
this.  It's also being testing by another 
solaris site I've been working with.

the patch will apply cleanly to 3.0.9.
Comment 10 Jim Delahanty 2004-12-10 20:44:47 UTC
Created attachment 837 [details]
invalid patch

We had the same problem on Intel (Linux RH 7.3/Whitebox EL3).
This patch, applied with the previous patch (id=827), appears to do the right
thing.
Comment 11 Gerald (Jerry) Carter 2004-12-10 21:39:50 UTC
Jim,  The patch is comment #10 makes no sense.  Why do you 
think think fixes a memory leak.
Comment 12 Jim Delahanty 2004-12-13 05:45:42 UTC
(In reply to comment #11)
> Jim,  The patch is comment #10 makes no sense.  Why do you 

> think think fixes a memory leak.

Jerry -- Sorry for the late reply, I actually went home for the weekend!

Well, the first patch to strstr_m is gratuitious and keeps a warning from
cropping up :^), but the patch to nt_printing is the real deal.

I determined this by both code review and testing.  (Note, I'm using vim as my
editor, so if the line# are 1 off or so, forgive me.)  I didn't profile or
valgrind as the server was actually in use at the time.

Code review:
Examining the /unpack_values/ function in nt_printing.c, shows that you're
allocating a new key for the default "SPOOL_PRINTERDATA_KEY" every time (line#
3124) /unpack values/ is called -- even if no keys need be allocated at all (by
test, line# 3170).  Without a test this early in the function, you don't know
you need to add the key, and since you DO add the key if needed when you test
(at line# 3170), I removed the code.  Before the patch, (at the least, if you
only call the code once for every valid printer (say 20)), you would wind up
with 20 "SPOOL_PRINTERDATA_KEY" keys, and 20 printer keys.  At the worst, you
wind up with 20 valid printerkeys and (n) "SPOOL_PRINTERDATA_KEYS", where
n=number of calls to /unpack values/.  Our testing shows that (n) is
sufficiently large to rock our 120+ print user world.  The issue compounds with
number of users/connections.

Testing:
Linux: Redhat 7.3 and Whitebox EL3, say 120 users, 29 defined printers, lprng
printing environment.  Upgrade from Samba 2.2.10.  Standalone server w/domain
lookups.

YMMV, but before this patch Samba 3.0.9 would /absolutely/ lock up a dual Xeon
2.8G/1Gig Memory print server every four-five hours (120 users, highest load
average: 118!, smb processes: some with 40M memory usage, smbstatus: 300
entries!, smb connections to the printer NEVER leave.  After the first lockup,
we cron'd a restart every four hours as a precaution.

With the patch: We're back down to zero daily maintenance, a max load average of
about 0.25, smb processes of about 3-4K average and removal of unused smb
processes after 7 minutes or so.  No worries.

It applies cleanly to 3.0.9, and the same changes make 3.0.10-pre1 behave
correctly as well.
Comment 13 Andrzej Zawadzki 2004-12-13 18:46:51 UTC
Confirm that on PLD Linux Distribution x86 (kernel 2.6.9 - 2.6.10-rc2)

After few days with a frequent printers use (four network printers) my linux
server hanged down :-(
After 3 days this look like this (after <shift> + <M>)

top - 19:08:06 up 7 days, 21:09,  1 user,  load average: 0.07, 0.05, 0.00
Tasks: 244 total,   1 running, 243 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.4% us,  0.2% sy,  0.1% ni, 94.9% id,  0.3% wa,  0.0% hi,  0.1% si
Mem:    776556k total,   763908k used,    12648k free,        0k buffers
Swap:   979192k total,   438252k used,   540940k free,   126288k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
2636 root      21   5  589m 386m 1496 S  0.0 51.0  69:26.05 smbd
5373 root      22   0  715m 100m 4816 S  0.0 13.2   0:09.92 java
5374 root      16   0  715m 100m 4816 S  0.0 13.2   0:00.00 java
5375 root      16   0  715m 100m 4816 S  5.0 13.2   0:50.60 java
5376 root      16   0  715m 100m 4816 S  0.0 13.2   0:00.39 java
5377 root      16   0  715m 100m 4816 S  0.1 13.2   0:01.36 java 

I'll try patches (both?) tomorrow.
Comment 14 Gerald (Jerry) Carter 2004-12-13 21:43:24 UTC
(In reply to comment #12)

> Examining the /unpack_values/ function in nt_printing.c, 
> shows that you're allocating a new key for the default 
> "SPOOL_PRINTERDATA_KEY" every time (line# 3124) 
> /unpack values/ is called -- even if no keys need be 
> allocated at all (by test, line# 3170).  Without a 
> test this early in the function, you don't know
> you need to add the key, and since you DO add the 
> key if needed when you test (at line# 3170), I

Sorry but unless there a bug here I don't see, you just don't 
understand the code.  The printer date key that is added 
first is so that is appears first in the linked list.  That 
memory is talloc()'d and therefore freed when the context
is released (in free_a_printer() ).  So there is no memory leak
as far as I can tell.

And if you don't have printer data at all, then you don't 
have a valid device mode which means that you didn't 
initialize the printer data.  Many drivers will have problems
with a NULL device mode and no printer data.

> removed the code.  Before the patch, (at the least, 
> if you only call the code once for every valid printer 
> (say 20)), you would wind up with 20 "SPOOL_PRINTERDATA_KEY" 
> keys, and 20 printer keys.  At the worst, you
> wind up with 20 valid printerkeys and (n) 
> "SPOOL_PRINTERDATA_KEYS", where n=number of calls 
> to /unpack values/.  Our testing shows that (n) is
> sufficiently large to rock our 120+ print user world.  
> The issue compounds with number of users/connections.

Sorry this doesn't make sense.  I appreciate the patch
but you're saying that the we should throw away part 
of the data structure.  

> YMMV, but before this patch Samba 3.0.9 would /absolutely/ 
> lock up a dual Xeon 2.8G/1Gig Memory print server every 
> four-five hours (120 users, highest load average: 118!, 
> smb processes: some with 40M memory usage, smbstatus: 300
> entries!, smb connections to the printer NEVER leave.  
> After the first lockup, we cron'd a restart every four 
> hours as a precaution.

I just find this hard to believe.  I can't accept this 
patch until you can provide a code path where we leak 
this memory.  Telling me that we should never allocate 
it is wrong.

ps: running 'smbcontrol <pid> pool-usage' will show 
the talloc()'d memory by a process.


Comment 15 Jim Delahanty 2004-12-14 06:58:17 UTC
You're right, I was wrong -- I didn't understand the code.  Thanks for the
vetting, and the explanation!

I don't see really bad behavior printing using stock 3.0.9 with a small (2-5
user) test, only when we're in production.  3.0.9 + 12/7 patch dropped some
memory usage, but we still restarted twice a day.

Is it even possible that we're seeing good behavior due to the silly (char *)
cast to the return val in strstr_m?  Is that even possible?

Right now, with 3.0.9 code + your patch of 12/7 + a (char *) cast in strstr_m,
we're seeing max smbd process size hovering about 5K, which makes us happy.  If
you'd like, I'll test 3.0.9+(char *) and see what shakes out.

Thanks again for the patient explanation - jim
Comment 16 Andrzej Zawadzki 2004-12-15 16:55:45 UTC
(In reply to comment #9)
> messages.tdb is truncated at startup so it 
> could have been trimmed by the restart.  The 
> patch attached to this report should address 
> this.  It's also being testing by another 
> solaris site I've been working with.
> 
> the patch will apply cleanly to 3.0.9.

(In reply to comment #8)
> Created an attachment (id=827) [edit]
> patch for fix memory bloating caused by print_queue_update() messages
>
 
Works for me - smbd does't grow up
Thanks!!
Comment 17 Michael Young 2004-12-21 08:10:41 UTC
I am testing the patch in #8 on 3.0.10 (with smb_xmalloc( sizeof(char)*len )
replaced by SMB_XMALLOC_ARRAY(char, len) ). It has been running for about 7
hours now and there is no sign of the memory usage blowing up as in 3.0.9, or
even the lesser problem from 3.0.7.
Comment 18 Andrzej Zawadzki 2004-12-28 02:57:38 UTC
(In reply to comment #17)
> I am testing the patch in #8 on 3.0.10 (with smb_xmalloc( sizeof(char)*len )
> replaced by SMB_XMALLOC_ARRAY(char, len) ). It has been running for about 7
> hours now and there is no sign of the memory usage blowing up as in 3.0.9, or
> even the lesser problem from 3.0.7.

But I observe lateral effect. On windows, priners tasks doesn't vanish from
spool after beeing printed! So after one day people get mad ;-)
I cat delete them from domain administrtor account.
Comment 19 Gerald (Jerry) Carter 2005-01-05 20:32:43 UTC
Fixed in 3.0.11pre1.  peint queue list not updating is bug 2170 and may also be
fixed in 3.0.11pre1.  Still testing.
Comment 20 Andrzej Zawadzki 2005-01-07 05:43:17 UTC
(In reply to comment #19)
> Fixed in 3.0.11pre1.  peint queue list not updating is bug 2170 and may also be
> fixed in 3.0.11pre1.  Still testing.

Yes it is good now :-) Thanks.
I'm brave man ;-) and I'm using 3.0.11pre1 on my PLD server
http://www.pld-linux.org/
Still problem with slow dialog box after XP SP2... only
Comment 21 Gerald (Jerry) Carter 2005-08-24 10:26:37 UTC
sorry for the same, cleaning up the database to prevent unecessary reopens of bugs.