I have just tried to upgrade our University print server to 3.0.9 and noticed that over about 12 hours the smbd processes grew to about 100M in size. As all the server does is printing it looks like the printing code isn't garbage collecting its freed memory. I have reverted the server to 3.0.7 which didn't have this problem (though for 3.0.7 I did notice a single smbd process - one spawned at startup - growing more gradually, getting up to about 1G over about a month).
I've got another confirmed report of this on a Solaris system. Could you attach your smb.conf to this bug report please? Also, please include some details about you environment such as the client OS of the machines attached to the smbd process with a large VSZ or RSS.
might also be a good idea to include the output from 'smbd -b' (attached as a file) and the options passed to configure. Thanks.
Created attachment 823 [details] output from testparm
The setup is samba on a Solaris 9 box, authenticating via Active Directory, with probably over 50 print queues, and connections from over 1000 windows boxes, mostly XP SP1 but could be anything. Spoolss isn't disabled, but there is no [print$] share yet (though we will probably add one soon).
Do you see any tdb files in $locakdir) that are extremely large? Possibly the messages.tdb file ?
No the largest file currently is netsamlogon_cache.tdb at about 6M, though I am not running 3.0.9 at the moment as the system runs out of memory. Note that the large size seemed to be uniform in the newer smbd processes (ones started soon after I started samba were still small).
Created attachment 824 [details] Output from smbd -b
Created attachment 827 [details] patch for fix memory bloating caused by print_queue_update() messages
messages.tdb is truncated at startup so it could have been trimmed by the restart. The patch attached to this report should address this. It's also being testing by another solaris site I've been working with. the patch will apply cleanly to 3.0.9.
Created attachment 837 [details] invalid patch We had the same problem on Intel (Linux RH 7.3/Whitebox EL3). This patch, applied with the previous patch (id=827), appears to do the right thing.
Jim, The patch is comment #10 makes no sense. Why do you think think fixes a memory leak.
(In reply to comment #11) > Jim, The patch is comment #10 makes no sense. Why do you > think think fixes a memory leak. Jerry -- Sorry for the late reply, I actually went home for the weekend! Well, the first patch to strstr_m is gratuitious and keeps a warning from cropping up :^), but the patch to nt_printing is the real deal. I determined this by both code review and testing. (Note, I'm using vim as my editor, so if the line# are 1 off or so, forgive me.) I didn't profile or valgrind as the server was actually in use at the time. Code review: Examining the /unpack_values/ function in nt_printing.c, shows that you're allocating a new key for the default "SPOOL_PRINTERDATA_KEY" every time (line# 3124) /unpack values/ is called -- even if no keys need be allocated at all (by test, line# 3170). Without a test this early in the function, you don't know you need to add the key, and since you DO add the key if needed when you test (at line# 3170), I removed the code. Before the patch, (at the least, if you only call the code once for every valid printer (say 20)), you would wind up with 20 "SPOOL_PRINTERDATA_KEY" keys, and 20 printer keys. At the worst, you wind up with 20 valid printerkeys and (n) "SPOOL_PRINTERDATA_KEYS", where n=number of calls to /unpack values/. Our testing shows that (n) is sufficiently large to rock our 120+ print user world. The issue compounds with number of users/connections. Testing: Linux: Redhat 7.3 and Whitebox EL3, say 120 users, 29 defined printers, lprng printing environment. Upgrade from Samba 2.2.10. Standalone server w/domain lookups. YMMV, but before this patch Samba 3.0.9 would /absolutely/ lock up a dual Xeon 2.8G/1Gig Memory print server every four-five hours (120 users, highest load average: 118!, smb processes: some with 40M memory usage, smbstatus: 300 entries!, smb connections to the printer NEVER leave. After the first lockup, we cron'd a restart every four hours as a precaution. With the patch: We're back down to zero daily maintenance, a max load average of about 0.25, smb processes of about 3-4K average and removal of unused smb processes after 7 minutes or so. No worries. It applies cleanly to 3.0.9, and the same changes make 3.0.10-pre1 behave correctly as well.
Confirm that on PLD Linux Distribution x86 (kernel 2.6.9 - 2.6.10-rc2) After few days with a frequent printers use (four network printers) my linux server hanged down :-( After 3 days this look like this (after <shift> + <M>) top - 19:08:06 up 7 days, 21:09, 1 user, load average: 0.07, 0.05, 0.00 Tasks: 244 total, 1 running, 243 sleeping, 0 stopped, 0 zombie Cpu(s): 4.4% us, 0.2% sy, 0.1% ni, 94.9% id, 0.3% wa, 0.0% hi, 0.1% si Mem: 776556k total, 763908k used, 12648k free, 0k buffers Swap: 979192k total, 438252k used, 540940k free, 126288k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2636 root 21 5 589m 386m 1496 S 0.0 51.0 69:26.05 smbd 5373 root 22 0 715m 100m 4816 S 0.0 13.2 0:09.92 java 5374 root 16 0 715m 100m 4816 S 0.0 13.2 0:00.00 java 5375 root 16 0 715m 100m 4816 S 5.0 13.2 0:50.60 java 5376 root 16 0 715m 100m 4816 S 0.0 13.2 0:00.39 java 5377 root 16 0 715m 100m 4816 S 0.1 13.2 0:01.36 java I'll try patches (both?) tomorrow.
(In reply to comment #12) > Examining the /unpack_values/ function in nt_printing.c, > shows that you're allocating a new key for the default > "SPOOL_PRINTERDATA_KEY" every time (line# 3124) > /unpack values/ is called -- even if no keys need be > allocated at all (by test, line# 3170). Without a > test this early in the function, you don't know > you need to add the key, and since you DO add the > key if needed when you test (at line# 3170), I Sorry but unless there a bug here I don't see, you just don't understand the code. The printer date key that is added first is so that is appears first in the linked list. That memory is talloc()'d and therefore freed when the context is released (in free_a_printer() ). So there is no memory leak as far as I can tell. And if you don't have printer data at all, then you don't have a valid device mode which means that you didn't initialize the printer data. Many drivers will have problems with a NULL device mode and no printer data. > removed the code. Before the patch, (at the least, > if you only call the code once for every valid printer > (say 20)), you would wind up with 20 "SPOOL_PRINTERDATA_KEY" > keys, and 20 printer keys. At the worst, you > wind up with 20 valid printerkeys and (n) > "SPOOL_PRINTERDATA_KEYS", where n=number of calls > to /unpack values/. Our testing shows that (n) is > sufficiently large to rock our 120+ print user world. > The issue compounds with number of users/connections. Sorry this doesn't make sense. I appreciate the patch but you're saying that the we should throw away part of the data structure. > YMMV, but before this patch Samba 3.0.9 would /absolutely/ > lock up a dual Xeon 2.8G/1Gig Memory print server every > four-five hours (120 users, highest load average: 118!, > smb processes: some with 40M memory usage, smbstatus: 300 > entries!, smb connections to the printer NEVER leave. > After the first lockup, we cron'd a restart every four > hours as a precaution. I just find this hard to believe. I can't accept this patch until you can provide a code path where we leak this memory. Telling me that we should never allocate it is wrong. ps: running 'smbcontrol <pid> pool-usage' will show the talloc()'d memory by a process.
You're right, I was wrong -- I didn't understand the code. Thanks for the vetting, and the explanation! I don't see really bad behavior printing using stock 3.0.9 with a small (2-5 user) test, only when we're in production. 3.0.9 + 12/7 patch dropped some memory usage, but we still restarted twice a day. Is it even possible that we're seeing good behavior due to the silly (char *) cast to the return val in strstr_m? Is that even possible? Right now, with 3.0.9 code + your patch of 12/7 + a (char *) cast in strstr_m, we're seeing max smbd process size hovering about 5K, which makes us happy. If you'd like, I'll test 3.0.9+(char *) and see what shakes out. Thanks again for the patient explanation - jim
(In reply to comment #9) > messages.tdb is truncated at startup so it > could have been trimmed by the restart. The > patch attached to this report should address > this. It's also being testing by another > solaris site I've been working with. > > the patch will apply cleanly to 3.0.9. (In reply to comment #8) > Created an attachment (id=827) [edit] > patch for fix memory bloating caused by print_queue_update() messages > Works for me - smbd does't grow up Thanks!!
I am testing the patch in #8 on 3.0.10 (with smb_xmalloc( sizeof(char)*len ) replaced by SMB_XMALLOC_ARRAY(char, len) ). It has been running for about 7 hours now and there is no sign of the memory usage blowing up as in 3.0.9, or even the lesser problem from 3.0.7.
(In reply to comment #17) > I am testing the patch in #8 on 3.0.10 (with smb_xmalloc( sizeof(char)*len ) > replaced by SMB_XMALLOC_ARRAY(char, len) ). It has been running for about 7 > hours now and there is no sign of the memory usage blowing up as in 3.0.9, or > even the lesser problem from 3.0.7. But I observe lateral effect. On windows, priners tasks doesn't vanish from spool after beeing printed! So after one day people get mad ;-) I cat delete them from domain administrtor account.
Fixed in 3.0.11pre1. peint queue list not updating is bug 2170 and may also be fixed in 3.0.11pre1. Still testing.
(In reply to comment #19) > Fixed in 3.0.11pre1. peint queue list not updating is bug 2170 and may also be > fixed in 3.0.11pre1. Still testing. Yes it is good now :-) Thanks. I'm brave man ;-) and I'm using 3.0.11pre1 on my PLD server http://www.pld-linux.org/ Still problem with slow dialog box after XP SP2... only
sorry for the same, cleaning up the database to prevent unecessary reopens of bugs.