Bug 7020 - smbd using 2G memory
Status: RESOLVED FIXED
Product: Samba 3.4
Classification: Unclassified
Component: File services
Version: 3.4.3
Hardware: x86 Linux
Importance: P3 regression
Target Milestone: ---
Assigned To: Karolin Seeger
QA Contact: Samba QA Contact
Depends on:
Blocks:
Reported: 2010-01-04 16:31 UTC by Ross Boylan
Modified: 2010-01-11 05:09 UTC
CC: 1 user

See Also:


Attachments
smbcontrol -t 120 26853 pool-usage output (6.97 KB, application/octet-stream)
2010-01-05 14:11 UTC, Ross Boylan
no flags Details
Fix for 3.4.x (and above, probably). (741 bytes, patch)
2010-01-05 15:13 UTC, Jeremy Allison
no flags Details
git-am format patch for 3.4.4. (1.20 KB, patch)
2010-01-05 16:40 UTC, Jeremy Allison
jra: review?
Details
Another smbcontrol pool-usage (1.49 KB, application/octet-stream)
2010-01-05 17:35 UTC, Ross Boylan
no flags Details
pool-usage and /proc/.../map (3.10 KB, application/octet-stream)
2010-01-05 18:41 UTC, Ross Boylan
no flags Details
/proc/.../maps after a little more run time (7.58 KB, application/octet-stream)
2010-01-05 19:10 UTC, Ross Boylan
no flags Details
valgrind capture (1.96 KB, application/octet-stream)
2010-01-05 19:57 UTC, Ross Boylan
no flags Details
valgrind capture after daemon shutdown. (3.94 KB, application/octet-stream)
2010-01-05 20:03 UTC, Ross Boylan
no flags Details
wireshark packet capture (193.10 KB, application/octet-stream)
2010-01-05 20:56 UTC, Ross Boylan
no flags Details
valgrind after an extended run (3.49 KB, application/octet-stream)
2010-01-05 21:56 UTC, Ross Boylan
no flags Details
smb.conf (11.93 KB, application/octet-stream)
2010-01-05 23:58 UTC, Ross Boylan
no flags Details
Include valgrind trace (smbvalg9a) and wireshark capture (smbpackets9b) (918.45 KB, application/octet-stream)
2010-01-06 01:28 UTC, Ross Boylan
no flags Details
Second part of the fix for 3.4.4. (677 bytes, patch)
2010-01-06 14:27 UTC, Jeremy Allison
no flags Details
git-am format patch for 3.4.4 (1.15 KB, patch)
2010-01-06 15:49 UTC, Jeremy Allison
vl: review+
Details
valgrind with the latest 2 patches (111.59 KB, application/octet-stream)
2010-01-07 18:37 UTC, Ross Boylan
no flags Details

Description Ross Boylan 2010-01-04 16:31:12 UTC
The only client is a Vista laptop.  While doing a windows file-based backup to a samba share, smbd memory use went to nearly 2G (1.85G virtual memory, 1.57G resident most recently; over 2G in a previous problem).  The system has 2G RAM and 2G swap and is 32bit.

I am not certain the previous time this happened coincided with a Vista backup, but the dates are right.  The prior time resulted in the kernel killing processes (not including samba) and the system becoming almost unresponsive; it took about half an hour to get it back.  This time, I watched the smbd memory use grow as the backup took place, and shutdown other processes to make room.

The server also provides print services with CUPS.  Problems with the printing part of 3.2 motivated the upgrade (see https://bugzilla.samba.org/show_bug.cgi?id=6754).

Note that Debian has two open bugs, each with reports from multiple users, about apparent leaks with Samba 3.4 in both a backported Debian and a Debian unstable environment.  Those bugs include some valgrind captures.
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=558453
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=538819

I have some wireshark capture of the backup if you want it, though I'm a little concerned it might capture some sensitive information.

I am running with Debian 3.4.3-1 packages from unstable source built and installed on my Lenny system, pulling in some other unstable packages (e.g., pam) as necessary to build and run.


Today I first printed a file and then started the backup.  log.cotton, with cotton the Vista client, shows
[2010/01/04 11:09:01,  1] smbd/service.c:1047(make_connection_snum)
  cotton (192.168.40.46) connect to service cotback initially as user ross (uid=1000, gid=1000) (pid 4327)
[2010/01/04 11:09:02,  1] smbd/service.c:676(make_connection_snum)
  create_connection_server_info failed: NT_STATUS_ACCESS_DENIED
[2010/01/04 11:09:02,  1] smbd/service.c:676(make_connection_snum)
  create_connection_server_info failed: NT_STATUS_ACCESS_DENIED
[2010/01/04 11:09:31,  1] smbd/service.c:1226(close_cnum)
  cotton (192.168.40.46) closed connection to service cotback
[2010/01/04 11:14:48,  1] smbd/service.c:1047(make_connection_snum)
  cotton (192.168.40.46) connect to service cotback initially as user ross (uid=1000, gid=1000) (pid 4327)
[2010/01/04 11:15:13,  1] smbd/service.c:1226(close_cnum)
  cotton (192.168.40.46) closed connection to service cotback
[2010/01/04 11:17:41,  1] smbd/service.c:1047(make_connection_snum)
  cotton (192.168.40.46) connect to service cotback initially as user ross (uid=1000, gid=1000) (pid 4327)
[2010/01/04 11:17:43,  1] smbd/service.c:1047(make_connection_snum)
  cotton (192.168.40.46) connect to service ross initially as user ross (uid=1000, gid=1000) (pid 4327)
[2010/01/04 11:20:23,  1] smbd/service.c:1226(close_cnum)
  cotton (192.168.40.46) closed connection to service cotback
[2010/01/04 11:20:45,  1] smbd/service.c:1047(make_connection_snum)
  cotton (192.168.40.46) connect to service cotback initially as user ross (uid=1000, gid=1000) (pid 4327)
[2010/01/04 11:27:35,  1] smbd/service.c:1226(close_cnum)
  cotton (192.168.40.46) closed connection to service cotback
[2010/01/04 12:14:50,  1] smbd/service.c:1226(close_cnum)
  cotton (192.168.40.46) closed connection to service ross


log.nmbd shows entries like
[2010/01/04 11:13:54,  0] nmbd/nmbd_browsesync.c:350(find_domain_master_name_query_fail)
  find_domain_master_name_query_fail:
  Unable to find the Domain Master Browser name BOYLAN<1b> for the workgroup BOYLAN.
  Unable to sync browse lists in this workgroup.
every 15 minutes.

log.smbd shows only a startup message.
Comment 1 Ross Boylan 2010-01-05 13:25:47 UTC
The process grew to about 1G overnight, with no backup from the client during that time.  It seems to be growing fairly steadily, e.g.,
corn:/usr/local/src/tools# date; ps -lC smbd
Tue Jan  5 11:18:58 PST 2010
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
5 S     0 26851     1  0  80   0 -  3904 -      ?        00:00:00 smbd
1 S     0 26852 26851  0  80   0 -  3984 -      ?        00:00:00 smbd
5 S 65534 26853 26851  5  80   0 -  4082 -      ?        00:00:00 smbd
corn:/usr/local/src/tools# date; ps -lC smbd
Tue Jan  5 11:23:17 PST 2010
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
5 S     0 26851     1  0  80   0 -  3904 -      ?        00:00:00 smbd
1 S     0 26852 26851  0  80   0 -  3984 -      ?        00:00:00 smbd
5 S 65534 26853 26851  5  80   0 - 15138 -      ?        00:00:15 smbd

SZ is reported by ps in pages (typically 4 KiB on x86 Linux), not 512-byte blocks.
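For a rough sense of scale, the SZ column can be converted to bytes with the system page size (procps reports SZ in pages); a minimal sketch, assuming the usual 4 KiB x86 page size, using the 15138 value from the last ps line above:

```shell
# Convert a ps SZ value (size in pages) to MiB.
# 15138 is the SZ of the growing smbd (pid 26853) in the ps output above.
sz=15138
pagesize=$(getconf PAGESIZE)   # typically 4096 on x86 Linux
echo "$(( sz * pagesize / 1024 / 1024 )) MiB"
```

With a 4 KiB page that works out to about 59 MiB at this point in the trace.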
Comment 2 Volker Lendecke 2010-01-05 13:45:19 UTC
Can you run a "smbcontrol <pid> pool-usage" with <pid> replaced by the huge smbd's process ID and upload the output?

Thanks,

Volker
Comment 3 Ross Boylan 2010-01-05 14:11:47 UTC
Created attachment 5129 [details]
smbcontrol -t 120 26853 pool-usage  output

I had to add the -t option to get output.
A minute or two earlier the system showed
# date; ps -lC smbd
Tue Jan  5 11:58:55 PST 2010
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
5 S     0 26851     1  0  80   0 -  3904 -      ?        00:00:00 smbd
1 S     0 26852 26851  0  80   0 -  3984 -      ?        00:00:00 smbd
5 S 65534 26853 26851  5  80   0 - 109235 -     ?        00:02:18 smbd
Comment 4 Volker Lendecke 2010-01-05 14:16:37 UTC
Ok, that does look like my bug. Looking into it.

Thanks for reporting!

Volker
Comment 5 Jeremy Allison 2010-01-05 15:13:56 UTC
Created attachment 5130 [details]
Fix for 3.4.x (and above, probably).

This looks right to me, Volker, can you confirm ?

Jeremy.
Comment 6 Jeremy Allison 2010-01-05 15:53:05 UTC
Raising to blocker as it's critical to fix for 3.4.4 (memory leaks are *baad* :-).
Jeremy.
Comment 7 Volker Lendecke 2010-01-05 16:11:42 UTC
.... and another week without a 3.4.4.

Volker
Comment 8 Jeremy Allison 2010-01-05 16:13:09 UTC
What can we do ? This one really needs fixing.

Jeremy.
Comment 9 Volker Lendecke 2010-01-05 16:16:14 UTC
Let's just stop doing releases and point people at git? If this continues as it does right now, we will NEVER again see any release.

Volker
Comment 10 Guenther Deschner 2010-01-05 16:19:39 UTC
(In reply to comment #9)
> Lets just stop doing releases and point people at git? If this continues as it
> does right now, we will NEVER again see any release.

ehh, bugzilla entries are possibly not the best forum to discuss this :-)
Comment 11 Volker Lendecke 2010-01-05 16:22:10 UTC
Why not? This is a blocker now. Karolin is going to pick this and we have to settle for another week then. Within that week, another blocker will be uncovered, so we will do this forever.

Volker
Comment 12 Jeremy Allison 2010-01-05 16:25:49 UTC
Volker, if you don't think this is a blocker, you can lower the priority - and we'll discuss on samba-technical. Maybe we need a guide to categorize what types of bugs we consider blockers. IMHO security holes, data corruption and out-of-control memory leaks (which this one certainly is, IMHO) qualify. Those are pretty much the categories I use internally, but maybe we need an explicit policy.

It's unlikely we'll have two of these types of bugs in one week, so we will eventually ship.

Jeremy.
Comment 13 Jeremy Allison 2010-01-05 16:27:32 UTC
BTW I can reproduce this leak using rpcclient "sourcedata 9000" command, so
it's not exactly hard to trigger.

Jeremy.
Comment 14 Volker Lendecke 2010-01-05 16:31:32 UTC
I don't know what a blocker bug is. I just know for sure that there's a continuous, never-ending stream of blockers.

Volker
Comment 15 Jeremy Allison 2010-01-05 16:33:47 UTC
Well I've just given you a definition, which if we use as a guide I expect will converge on zero in a non-infinite time (I know you were a mathematician at one point :-). So let's just use that definition and see which of us is right. I'll give it another 4 weeks to win my bet :-).

Jeremy.
Comment 16 Jeremy Allison 2010-01-05 16:40:45 UTC
Created attachment 5131 [details]
git-am format patch for 3.4.4.

Volker, once you're happy please re-assign to Karolin for push to 3.4.4.
Jeremy.
Comment 17 Jeremy Allison 2010-01-05 16:43:16 UTC
Ok, I'm going to be more precise on the definition of "blocker" as:

Security holes.
Data corruption.
Resource leaks (covers fd's, memory, and anything else I might have missed).
Unrecoverable client-visible errors.

I think that covers everything and I still think we'll converge on zero blockers in less than 4 weeks.

Jeremy.

Comment 18 Ross Boylan 2010-01-05 17:18:43 UTC
I applied the patch (the first one), rebuilt, and reinstalled.
It still seems to be leaking:
corn:/usr/local/src/tools# date; ps -lC smbd
Tue Jan  5 15:12:29 PST 2010
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
5 S     0  2095     1  0  80   0 -  3902 -      ?        00:00:00 smbd
1 S     0  2096  2095  0  80   0 -  3984 -      ?        00:00:00 smbd
5 S 65534  2099  2095  6  80   0 -  4113 -      ?        00:00:01 smbd
corn:/usr/local/src/tools# date; ps -lC smbd
Tue Jan  5 15:14:16 PST 2010
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
5 S     0  2095     1  0  80   0 -  3902 -      ?        00:00:00 smbd
1 S     0  2096  2095  0  80   0 -  3984 -      ?        00:00:00 smbd
5 S 65534  2099  2095  4  80   0 -  4530 -      ?        00:00:06 smbd
corn:/usr/local/src/tools# date; ps -lC smbd
Tue Jan  5 15:15:16 PST 2010
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
5 S     0  2095     1  0  80   0 -  3902 -      ?        00:00:00 smbd
1 S     0  2096  2095  0  80   0 -  3984 -      ?        00:00:00 smbd
5 S 65534  2099  2095  4  80   0 -  4781 -      ?        00:00:09 smbd
corn:/usr/local/src/tools# date; ps -lC smbd
Tue Jan  5 15:15:59 PST 2010
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
5 S     0  2095     1  0  80   0 -  3902 -      ?        00:00:00 smbd
1 S     0  2096  2095  0  80   0 -  3984 -      ?        00:00:00 smbd
5 S 65534  2099  2095  5  80   0 -  4962 -      ?        00:00:11 smbd
corn:/usr/local/src/tools# 
Comment 19 Volker Lendecke 2010-01-05 17:20:33 UTC
new pool-usage?

Volker
Comment 20 Jeremy Allison 2010-01-05 17:25:38 UTC
Yeah, we'll need a new smbcontrol output after it's grown a bit. This patch definitely fixed the big leak I saw in the output (the struct smb_request values), so let's see what else it might be. Give it a while, as the memory usage does vary, so you still might be seeing natural variation as it works.

Jeremy.


Comment 21 Ross Boylan 2010-01-05 17:35:17 UTC
Created attachment 5132 [details]
Another smbcontrol pool-usage

Just before making this file:
corn:/usr/local/src/tools# date; ps -lC smbd
Tue Jan  5 15:30:38 PST 2010
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
5 S     0  2095     1  0  80   0 -  3902 -      ?        00:00:00 smbd
1 S     0  2096  2095  0  80   0 -  3984 -      ?        00:00:00 smbd
5 S 65534  2099  2095  5  80   0 -  8698 -      ?        00:01:03 smbd
Comment 22 Jeremy Allison 2010-01-05 17:43:58 UTC
There is no talloc memory leak in this pool file. Look at the overview of memory usage from smbpool2:

full talloc report on 'null_context' (total  17075 bytes in 240 blocks)

Compare this with the overview from smbpool1

full talloc report on 'null_context' (total 40687555 bytes in 146737 blocks)

Leave the process alone for an hour or so, then check again. We also might want to look inside the /proc/<pid>/maps entry to see what is mmap'ed where.

Jeremy.
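That comparison can be scripted; a minimal sketch, with the summary line from smbpool1 inlined as sample data (in practice, point the filter at the saved dump files):

```shell
# A pool-usage dump begins with a one-line summary; extracting it from
# successive dumps makes the totals easy to compare over time.
extract_total() { grep -o 'total *[0-9]* bytes in [0-9]* blocks'; }

# Summary line quoted from smbpool1 above:
extract_total <<'EOF'
full talloc report on 'null_context' (total 40687555 bytes in 146737 blocks)
EOF
# → total 40687555 bytes in 146737 blocks
```

A steady total here, while the process keeps growing, points away from talloc and toward raw malloc'ed memory, which is exactly the conclusion reached below.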
Comment 23 Ross Boylan 2010-01-05 18:41:01 UTC
Created attachment 5133 [details]
pool-usage and /proc/.../map

corn:/usr/local/src/tools# date; ps -lC smbd
Tue Jan  5 16:36:22 PST 2010
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
5 S     0  2095     1  0  80   0 -  3902 -      ?        00:00:00 smbd
1 S     0  2096  2095  0  80   0 -  3984 -      ?        00:00:00 smbd
5 S 65534  2099  2095  5  80   0 - 24856 -      ?        00:04:47 smbd
corn:/usr/local/src/tools# smbcontrol -t 120 2099 pool-usage > ~ross/smbpool3
corn:/usr/local/src/tools# cat /proc/2099/maps >> !$
Comment 24 Jeremy Allison 2010-01-05 18:45:14 UTC
Ok, so it definitely isn't a talloc leak. Inside smbpool3:

full talloc report on 'null_context' (total  17075 bytes in 240 blocks)

which is identical to smbpool2. The bug fix works as intended and fixes the leak reported. I'm examining the mmap data now.

Jeremy.
Comment 25 Jeremy Allison 2010-01-05 18:52:24 UTC
Ok, from smbpool3 we have the heap as:

09cc6000-0ee13000 rw-p 09cc6000 00:00 0          [heap]

Can you give it a little longer (10 mins say) and then post another text from /proc/pid/map ? I'm interested to see if the [heap] area is growing (if the process size is still growing). If so then we'll need to start looking at running this under valgrind to see where any malloc memory leaks might be.

Jeremy.
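The growth Jeremy is asking about can be read straight off the [heap] line; a small sketch (bash arithmetic on the hex address range, using the heap line quoted above):

```shell
# Compute the size in KiB of the [heap] mapping from a /proc/<pid>/maps line.
heap_kib() {
    local range=${1%% *}               # e.g. "09cc6000-0ee13000"
    local start=${range%-*} end=${range#*-}
    echo $(( (0x$end - 0x$start) / 1024 ))
}

# The [heap] line quoted above: roughly 81 MiB at this point.
heap_kib '09cc6000-0ee13000 rw-p 09cc6000 00:00 0          [heap]'
# → 83252
```

Running the same computation against successive /proc/&lt;pid&gt;/maps snapshots shows whether the heap itself accounts for the process growth, which is what distinguishes a malloc leak from, say, growing mmap'ed regions.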
Comment 26 Ross Boylan 2010-01-05 19:10:02 UTC
Created attachment 5134 [details]
/proc/.../maps after a little more run time

I wasn't sure if you meant 10 minutes from my previous capture, or from your message.  So I waited for the latter.  Just after the capture I see
# date; ps -lC smbd
Tue Jan  5 17:08:14 PST 2010
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
5 S     0  2095     1  0  80   0 -  3902 -      ?        00:00:00 smbd
5 S     0  2096  2095  0  80   0 -  3984 -      ?        00:00:00 smbd
5 S     0  2099  2095  5  80   0 - 31059 -      ?        00:06:09 smbd

Anticipating possible valgrind, I have 2 questions.  Do I need to install the debug symbols to make it effective?  And doesn't the process get overwhelmed rapidly since it's so slow?
Comment 27 Jeremy Allison 2010-01-05 19:12:37 UTC
What is this client doing whilst the process size is growing ? If I can
reproduce it here it will be much easier to investigate.

Yes, adding debug symbols will be very helpful. Thanks.

Jeremy.
Comment 28 Jeremy Allison 2010-01-05 19:15:43 UTC
Yes the heap seems to be growing (from spool4) :

09cc6000-10515000 rw-p 09cc6000 00:00 0          [heap]

so a valgrind trace would now be very helpful. What is this client doing ?

Jeremy.
Comment 29 Ross Boylan 2010-01-05 19:39:40 UTC
The client is doing "nothing," which doesn't prevent it from chatting
constantly.  It used the file share and the printer from samba. I can upload a
wireshark capture if that would help.  The client is Vista.

The client is also querying, using various protocols, for a printer it uses
directly, not through samba.  I don't think any of that is going to samba.

The constant chatter may be related to my original problem, bug 6754.  Things
are better under samba 3.4 than 3.2 (in terms of CPU), but still not great.
Comment 30 Ross Boylan 2010-01-05 19:57:09 UTC
Created attachment 5135 [details]
valgrind capture

I think I'm doing something wrong because the thing keeps terminating.
Comment 31 Ross Boylan 2010-01-05 20:03:39 UTC
Created attachment 5136 [details]
valgrind capture after daemon shutdown.
Comment 32 Jeremy Allison 2010-01-05 20:18:58 UTC
Yes, the valgrind output isn't showing anything (only the initial "globals" being "lost", which is just because we know they have to be around for the entirety of the program run, so we leave them to be cleaned up on normal exit). This means you're only getting the valgrind output from the initial daemons (and the background printer enumeration daemon).

Try doing this:

valgrind --trace-children=yes --leak-check=full /usr/bin/smbd

allowing the client to run a bit and grow the connected <pid>. Just terminate the <pid> that is leaking by doing:

kill <pid>

This should cause valgrind to print out the error log for the client connected pid that is causing a problem.

Is this problem still reproducible by doing a Vista file-based backup ?

Jeremy.
Comment 33 Jeremy Allison 2010-01-05 20:21:14 UTC
A wireshark trace would *definitely* help whilst the client is causing the smbd to grow.

Jeremy.
Comment 34 Ross Boylan 2010-01-05 20:56:38 UTC
Created attachment 5137 [details]
wireshark packet capture

These are from earlier today.  The .2 machine is the server running samba; the .46 is the Vista laptop client.  I did not touch the laptop during this period.
Comment 35 Ross Boylan 2010-01-05 21:56:11 UTC
Created attachment 5138 [details]
valgrind after an extended run

I think the part after kill 9390 is what you want.  I got thrown because the processes weren't showing up under smbd.
Comment 36 Jeremy Allison 2010-01-05 22:27:28 UTC
Unfortunately not (the part after 9390). None of these valgrind traces show any signs of any significant memory leaks. Yes, there are some in the libc part of the code (in the nss code), but not enough to explain the problem.

I'll take a look at the wireshark trace tomorrow. What would also be helpful is your smb.conf, and what resources the Windows Vista client connects to.

The possibility remaining is not a leak, but a reference to malloc'ed memory that is too long lived.

Jeremy.
Comment 37 Ross Boylan 2010-01-05 23:49:05 UTC
For the valgrind capture just reported, smbd.log shows
[2010/01/05 18:58:51,  0] smbd/server.c:1069(main)
  smbd version 3.4.3 started.
  Copyright Andrew Tridgell and the Samba Team 1992-2009
[2010/01/05 18:58:53,  1] smbd/files.c:179(file_init)
  file_init: Information only: requested 1014 open files, 994 are available.
[2010/01/05 18:58:53,  0] smbd/server.c:1115(main)
  standard input is not a socket, assuming -D option

Could the error about open files indicate that some part of samba, or some part
of valgrind, didn't run?  I'm not sure why the message is "Information only"
since it sounds as if necessary resources are lacking.  1,000 files sounds
awfully high.
Comment 38 Ross Boylan 2010-01-05 23:58:07 UTC
Created attachment 5139 [details]
smb.conf

Note that ethfast is the interface the laptop connects through, and the one subject to wireshark monitoring.

The client is talking to the printer, especially rawPrinter, and referencing files through the home directory (possibly of multiple users).  cotback is the share that received the backups, though it is not in active use now.

I don't know enough about how Vista operates to know how often, or even if, it hits the various resources.  Once I print something, wireshark shows a constant dialog with the printer.

I could try another backup, but it seems to be managing to leak without one.  However, after the valgrind I relaunched samba the regular way, and I am not observing any memory growth.  I'll try to get a reproducible recipe for growth and then run it under valgrind and do the packet capture again.
Comment 39 Jeremy Allison 2010-01-06 00:01:37 UTC
No that's fine. 3.4.4 will request around 16k open files, to fix a bug with Win7.
I'm starting to look at the spoolss pipe code, as the wireshark trace shows a continuous stream of spoolss requests. My best guess is that there's a pointer reference (malloc based) that is being held far too long....

Can you build from source ? If so I would like to send you a snapshot of what will be 3.4.4 (including the fix in comment #16), as there have been many spoolss fixes and I'd like to ensure this isn't something we've already fixed.

Jeremy.
Comment 40 Jeremy Allison 2010-01-06 00:31:47 UTC
Actually, there's one more thing we can use with valgrind to try and find a long-lived malloc pointer reference.

Can you add the option "--show-reachable=yes" to your valgrind command line and try again (run under valgrind, let it grow a while, then kill <pid>)?

So the full command line would be:

valgrind --trace-children=yes --leak-check=full --show-reachable=yes /usr/bin/smbd

This will dump out *all* known memory references on exit, even the ones that aren't lost. That will allow me to look for any large, spoolss-related allocations that are still reachable but probably should already have been freed.

Jeremy.
Comment 41 Ross Boylan 2010-01-06 01:28:51 UTC
Created attachment 5141 [details]
Include valgrind trace (smbvalg9a) and wireshark capture (smbpackets9b)

I think this is a much more satisfactory test, with the key thing being
==16908== 3,032,112 bytes in 698 blocks are definitely lost in loss record 49 of 49
==16908==    at 0x4023E8C: realloc (vg_replace_malloc.c:429)
==16908==    by 0x833F02C: Realloc (util.c:1115)
==16908==    by 0x8195048: prs_grow (parse_prs.c:292)
==16908==    by 0x8196A5E: prs_append_some_data (parse_prs.c:405)
==16908==    by 0x8196AD4: prs_append_some_prs_data (parse_prs.c:417)
==16908==    by 0x82C285D: create_next_pdu (srv_pipe.c:552)
==16908==    by 0x82B977D: np_read_send (srv_pipe_hnd.c:879)
==16908==    by 0x80E20E4: api_dcerpc_cmd_write_done (ipc.c:287)
==16908==    by 0x8351C49: _tevent_req_notify_callback (tevent_req.c:124)
==16908==    by 0x8351FB5: tevent_req_trigger (tevent_req.c:133)
==16908==    by 0x8350FAF: tevent_common_loop_immediate (tevent_immediate.c:135)
==16908==    by 0x834F3EF: run_events (events.c:94)

valgrind reported on a process that did not show up in my ps; I'm not sure what it was.

I shut down samba and the Vista spooler.
I started nmbd and then, once it settled, smbd. I started the Vista spooler, but there wasn't much activity until I opened the printers dialog and got properties of rawPrinter.  This led to steady traffic, and the gradual increase in size of the server process.  Capture ended because (I presume) of some wireshark limit.  After capture ended I killed the apparently leaking process, and then shutdown the others.

I included only about the first 1/3 of the packets because of limits on upload size.
Comment 42 Ross Boylan 2010-01-06 01:35:51 UTC
This comment has additional information about the conditions under which leaking occurred and other odd behavior.

As mentioned earlier, there was initially little packet activity.  I discovered I was unable to print a test page at all, with no obvious indication in the logs of why not.  I believe this was because an earlier page had been printed from a different account on the laptop; we have run into this problem repeatedly, though not repeatably.

By restarting samba and the Vista print spooler service (it may have been the second that was key) I was able to restore printing.  Packets were quiet until I printed a test page.  There was a flurry of exchanges, lasting perhaps a few minutes after the page printed.  Things went quiet for a few minutes, and then, for no obvious reason, they started up again and the server process started growing.

After all this I shutdown server and spooler, and performed the steps described in the preceding upload.  The behavior was slightly different, in that the rapid packet exchanges started as soon as I opened the printer (I think I had tried that previously).
Comment 43 Ross Boylan 2010-01-06 01:45:56 UTC
This is in response to Jeremy's questions in #39 and #40.

I am building from debianized source; I assume I could drop the new code in and repeat.  However, I think I have already incorporated the changes in the patch you mention (just from the earlier patch in different format).

If the spooler problems may have been fixed, that obviously would be very interesting.  The constant exchanges use perhaps 5 to 10% of the CPU (P4 3GHz) on the server and 10% on the client (Core 2 Duo @ 2GHz).  Though perhaps, for purposes of isolating the leak, they are a plus. :)

I could repeat with show-reachable, but perhaps the info I've already provided, which shows a real memory leak, means we don't need to go down that route.  Let me know.

Enough for today.
Comment 44 Jeremy Allison 2010-01-06 10:36:17 UTC
Yay ! I think the smb9 valgrind trace has the info I need to fix this bug....

==16948== 868,800 bytes in 200 blocks are definitely lost in loss record 60 of 60
==16948==    at 0x4023E8C: realloc (vg_replace_malloc.c:429)
==16948==    by 0x833F02C: Realloc (util.c:1115)
==16948==    by 0x8195048: prs_grow (parse_prs.c:292)
==16948==    by 0x8196A5E: prs_append_some_data (parse_prs.c:405)
==16948==    by 0x8196AD4: prs_append_some_prs_data (parse_prs.c:417)
==16948==    by 0x82C285D: create_next_pdu (srv_pipe.c:552)
==16948==    by 0x82B977D: np_read_send (srv_pipe_hnd.c:879)
==16948==    by 0x80E20E4: api_dcerpc_cmd_write_done (ipc.c:287)
==16948==    by 0x8351C49: _tevent_req_notify_callback (tevent_req.c:124)
==16948==    by 0x8351FB5: tevent_req_trigger (tevent_req.c:133)
==16948==    by 0x8350FAF: tevent_common_loop_immediate (tevent_immediate.c:135)
==16948==    by 0x834F3EF: run_events (events.c:94)

The "868,800 bytes in 200 blocks are definitely lost" is a bit of a give-away, and this is also an area that changed between 3.3.x and 3.4.x. Thanks a *lot* for this.

More once I've isolated the bug....

Jeremy.
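For anyone triaging a similar log: 868,800 bytes over 200 blocks is 4,344 bytes per lost allocation. A quick way to surface the dominant leak is to rank the "definitely lost" records by byte count; a sketch using the two records quoted in this bug (in practice, pipe the whole valgrind log through the same filter):

```shell
# Sort valgrind "definitely lost" records by byte count, largest first,
# so the dominant leak tops the list.
printf '%s\n' \
  '==16948== 868,800 bytes in 200 blocks are definitely lost in loss record 60 of 60' \
  '==16908== 3,032,112 bytes in 698 blocks are definitely lost in loss record 49 of 49' |
  grep 'definitely lost' | tr -d ',' | sort -k2 -rn | head -1
```

Here both records share the same prs_grow/create_next_pdu stack, so the ranking mainly confirms that one call path dominates the loss.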
Comment 45 Jeremy Allison 2010-01-06 14:27:29 UTC
Created attachment 5142 [details]
Second part of the fix for 3.4.4.

Ok, can you try this second patch as well as the one in the attachment on comment #5. This fixes the second set of leaks for me. It won't do anything about the CPU spinning problems with Vista continually requesting printer info, but that's what Guenther's other fix addresses.

I'll create a git-am format patch for 3.4.4 shortly, once I've tested and committed to master and 3.5.0.

Jeremy.
Comment 46 Jeremy Allison 2010-01-06 15:49:21 UTC
Created attachment 5143 [details]
git-am format patch for 3.4.4

Volker, once you've confirmed please re-assign to Karolin for inclusion in 3.4.4.

Karolin, we need attachments (once Volker confirms) : 

5131: git-am format patch for 3.4.4.
5142: Second part of the fix for 3.4.4.

Thanks !

Jeremy.
Comment 47 Ross Boylan 2010-01-07 01:41:45 UTC
I've been running with the old and new patches (applied to 3.4.3) for an hour, during which there has been constant (useless) chatter between server and client.  There's no sign of a leak.

Yay!

(the noticeable CPU use on client and server, and inability to print without restarting the client spooler continue to be issues.)

I'll leave it running overnight, but probably won't be able to report til later tomorrow.  But I don't expect any new problems will emerge.

The stack traces in comments #41 and #44 are the same problem, right?
Comment 48 Ross Boylan 2010-01-07 09:25:28 UTC
Just a little more memory after running overnight:
corn:/usr/local/src/tools# date; ps -lC smbd
Wed Jan  6 22:36:32 PST 2010
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
5 S 65534  1483 29227  7  80   0 -  4053 -      ?        00:00:18 smbd
5 S     0 29227     1  0  80   0 -  3902 -      ?        00:00:00 smbd
5 S     0 29228 29227  0  80   0 -  3983 -      ?        00:00:00 smbd
# one hour later was same (not shown)

#overnight
corn:/usr/local/src/tools# date; ps -lC smbd
Thu Jan  7 07:14:15 PST 2010
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
5 S 65534  1483 29227  7  80   0 -  4092 -      ?        00:38:04 smbd
5 S     0 29227     1  0  80   0 -  3902 -      ?        00:00:00 smbd
5 S     0 29228 29227  0  80   0 -  4014 -      ?        00:00:01 smbd
Comment 49 Jeremy Allison 2010-01-07 12:19:18 UTC
Reassigning to me so I can see it in "my bugs".
Jeremy.
Comment 50 Jeremy Allison 2010-01-07 12:26:58 UTC
#overnight
corn:/usr/local/src/tools# date; ps -lC smbd
Thu Jan  7 07:14:15 PST 2010
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
5 S 65534  1483 29227  7  80   0 -  4092 -      ?        00:38:04 smbd
5 S     0 29227     1  0  80   0 -  3902 -      ?        00:00:00 smbd
5 S     0 29228 29227  0  80   0 -  4014 -      ?        00:00:01 smbd

That's within normal variation (I think). Might be a minor leak, but doesn't seem so.

Jeremy.
Comment 51 Ross Boylan 2010-01-07 18:37:32 UTC
Created attachment 5149 [details]
valgrind with the latest 2 patches

Here's a valgrind with both leak fixes after running for a few hours.
It looks to me as if they show no accumulating leaks (I haven't looked really closely).

P.S.  The Vista spooler goes bananas, using about 50% cpu (with allied programs) when I shut the server down.  To me, this seems more of a MS issue, though it would be nice if samba could suppress the behavior.

I'm also a little curious about all those processes it seemed to spawn off and kill over the hours.  Anybody know what they are?
Comment 52 Guenther Deschner 2010-01-07 18:46:35 UTC
(In reply to comment #51)
> P.S.  The Vista spooler goes bananas, using about 50% cpu (with allied
> programs) when I shut the server down.  To me, this seems more of a MS issue,
> though it would be nice if samba could suppress the behavior.

Is this still the case with 3.4.4 ?
Comment 53 Ross Boylan 2010-01-08 00:26:01 UTC
(In reply to comment #52)
> (In reply to comment #51)
> > P.S.  The Vista spooler goes bananas, using about 50% cpu (with allied
> > programs) when I shut the server down.  To me, this seems more of a MS issue,
> > though it would be nice if samba could suppress the behavior.
> 
> Is this still the case with 3.4.4 ?
> 
I don't know; I've been working from the 3.4.3 codebase and the patches in this bug.  I can try 3.4.4 if you can point me to it, or to the other patches.
Ross
Comment 54 Volker Lendecke 2010-01-10 04:56:40 UTC
Comment on attachment 5143 [details]
git-am format patch for 3.4.4

Thanks, Jeremy, for looking at this!

Volker
Comment 55 Volker Lendecke 2010-01-10 04:59:55 UTC
Karolin, please merge
https://bugzilla.samba.org/attachment.cgi?id=5143&action=view
and
https://bugzilla.samba.org/attachment.cgi?id=5131&action=view

for the next 3.4 release!

Thanks,

Volker
Comment 56 Karolin Seeger 2010-01-11 05:09:10 UTC
Pushed to v3-4-test. Patches will be included in Samba 3.4.5.
Closing out bug report.

Thanks!