Bug 7518 - Log files overrunning maximum size specified in smb.conf
Log files overrunning maximum size specified in smb.conf
Status: ASSIGNED
Product: Samba 3.5
Classification: Unclassified
Component: Config Files
3.5.4
x64 Linux
: P3 normal
: ---
Assigned To: Volker Lendecke
Samba QA Contact
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-06-16 09:06 UTC by andrew
Modified: 2011-06-11 06:57 UTC (History)
2 users (show)

See Also:


Attachments
Patch for 3.5 (1.48 KB, patch)
2010-09-28 12:41 UTC, Volker Lendecke
vl: review? (metze)
Details
Zombie Process logs (491.54 KB, application/zip)
2010-10-15 15:44 UTC, andrew
no flags Details
Samba crashes to date (189.94 KB, application/octet-stream)
2011-06-10 15:58 UTC, andrew
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description andrew 2010-06-16 09:06:36 UTC
Samba 3.5.2 on RHEL 5.4, and the log files generated by a rogue/runaway smb process quite frequently fill the entire root partition (thank goodness for the 5% overhead).

/var/log/samba will end up with a log file that fills all available space.  Tracking down the offending file with lsof usually reveals an smbd process that isn't attached to any users.

# lsof | grep ws1018.old
smbd      11519             root    2w      REG              253,0 46460801024                              2783315 /var/log/samba/ws1018.old
smbd      11519             root  988w      REG              253,0 46460801024                              2783315 /var/log/samba/ws1018.old
smbd      32055             root    2w      REG              253,0 46460801024                              2783315 /var/log/samba/ws1018.old
smbd      32055             root  988w      REG              253,0 46460801024                              2783315 /var/log/samba/ws1018.old

# smbstatus | grep 11519
# smbstatus -p | grep 11519
# ps -ef | grep 11519
root     11519 32055 21 Jun15 ?        03:43:21 /usr/sbin/smbd -D
root     17044 16997  0 07:33 pts/2    00:00:00 grep 11519
# ps -ef | grep 32055
root     11519 32055 21 Jun15 ?        03:43:24 /usr/sbin/smbd -D
root     17047 16997  0 07:34 pts/2    00:00:00 grep 32055
root     32055 32043  0 Jun12 ?        00:01:42 /usr/sbin/smbd -D

Killing the offending process renders the file deletable and all is right with the world again.

# kill 11519
# lsof | grep ws1018.old

There have also been multiple reports of similar events on the samba mailing lists.

This particular time, the error that it was spamming the logs with was: 
select() error: Invalid argument
Comment 1 Karolin Seeger 2010-08-24 09:44:54 UTC
Any volunteer for this one?
Comment 2 andrew 2010-09-09 09:30:26 UTC
Problem still persists after fully updating and patching on July 3.  Now on samba 3.5.4

Log file filled entire root partition this time (100%), and crashed the server.
Comment 3 Björn Jacke 2010-09-23 01:29:57 UTC
is there anything special about your setup? Is is running on hardware or in a virtual machine? I ask because I wonder if the clock is "stable".
If this can be reproduced or if a developer can have access to a box in that state it would be easier to nail the error down.
Comment 4 Björn Jacke 2010-09-23 13:29:07 UTC
also please paste the last couple of lines from the log file which come before the select error starts showing up.
Comment 5 andrew 2010-09-24 14:57:31 UTC
This is on a physical x3650M2 box.  All local SAS disk, no funny configurations.  

It is using a W2008SP2 DC as ntp source, and *seems* to keep time quite well (logwatch mails faithfully at the same time and cron jobs happen on time as well).

When the log files overrun, it is *always* on a '<machine name>.old' log file - never the main one.  We wonder if 
1) a zombie smbd process (of which I have to kill one 1 week) has the log file open and locked
2) another smbd process attached to that same machine starts up when that machine reconnects
3) the zombie process continues writing to the log file, filling it with garbage
4) the new smbd process tries to rotate the log, and succeeds in naming it as ".old", zombie process still has it locked though
5) zombie process spams more entries into log file
6) main process tries to rotate .old file and fails as it's still locked by zombie
7) *crash*...

I might be able to arrange some remote access for the next time it happens.

As for the last few lines before the select error, I'll have to wait until the next time it happens (hopefully never).

I cannot, however, understand why we get so many zombie processes.
Comment 6 andrew 2010-09-27 11:27:31 UTC
After moving log files to a USB partition, I found a log file that had its last write on Sept 16.

> [root@Bubbles sambaUSB]# ls -l ws1011.old
> -rwxr-xr-x 1 root root 4294967295 Sep 16 19:21 ws1011.old

The top lines of the log file look like this, and notice that the last log entry before it started bombing was on Sep 14 - two days prior, and is was again on a "XXXXXX.old" file.  So it looks like log level 0 isn't catching anything.


[root@Bubbles sambaUSB]# more ws1011.old
[2010/09/14 18:43:46.026688,  0] lib/util_sock.c:675(write_data)
[2010/09/14 18:43:46.026751,  0] lib/util_sock.c:1432(get_peer_addr_internal)
  getpeername failed. Error was Transport endpoint is not connected
  write_data: write failure in writing to client 0.0.0.0. Error Connection timed out
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument

etc.

The process locking and writing to this file was killed some time ago during (I suspect) my routine zombie process turfing.
Comment 7 andrew 2010-09-27 12:15:11 UTC
Just caught a box doing it, was going to crash the box so I had to stop it.  (I will have to initiate another USB partition for logs... lol)

Here's some forensics of what I saw on the box.  I really think this is a zombie process problem.

***The root smbd process should check for detached processes occasionally and kill -9 them***

-----------------------------------

[root@sookie ~]# top
top - 10:54:25 up 85 days, 14:55,  4 users,  load average: 1.19, 1.19, 1.13
Tasks: 326 total,   2 running, 324 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.2%us,  7.2%sy,  0.0%ni, 85.6%id,  0.7%wa,  0.0%hi,  0.4%si,  0.0%st
Mem:  24661124k total, 24536016k used,   125108k free,   202396k buffers
Swap: 10120940k total,      164k used, 10120776k free, 22976172k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
17831 root      25   0  166m 3664 1108 R 99.8  0.0 117:21.02 smbd

(17831 - rogue process)


[root@sookie ~]# cd /var/log/samba
[root@sookie samba]# ls -lS | head
total 83354964
-rw-r--r-- 1 root root 85268071078 Sep 27 10:54 ws1004.old

(85 gigs of log file)


[root@sookie samba]# smbstatus | grep 17831

(main samba process doesn't know about 17831)


[root@sookie samba]# lsof | grep ws1004.old
smbd      10178             root    2w      REG              253,0 85547409412   15172420 /var/log/samba/ws1004.old
smbd      10178             root 1063w      REG              253,0 85547776900   15172420 /var/log/samba/ws1004.old
smbd      17831             root    2w      REG              253,0 85549207780   15172420 /var/log/samba/ws1004.old
smbd      17831             root 1063w      REG              253,0 85549575070   15172420 /var/log/samba/ws1004.old

(rogue process has log file open and locked so smbd can't delete it)


[root@sookie samba]# kill -9 17831

(kill the rogue process)


[root@sookie samba]# top
top - 10:55:23 up 85 days, 14:56,  4 users,  load average: 0.99, 1.14, 1.11
Tasks: 323 total,   1 running, 322 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.8%us,  1.4%sy,  0.0%ni, 97.3%id,  0.4%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  24661124k total, 24528132k used,   132992k free,   200812k buffers
Swap: 10120940k total,      164k used, 10120776k free, 22974424k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
19664 root      15   0 17116 1280  796 R  1.9  0.0   0:00.01 top

(things look good now)
(log file can be deleted and isn't filling up any more)

(log file looks the same as previous log files)

[2010/09/02 10:19:00.178607,  0] lib/util_sock.c:474(read_fd_with_timeout)
[2010/09/02 10:19:00.178657,  0] lib/util_sock.c:1432(get_peer_addr_internal)
  getpeername failed. Error was Transport endpoint is not connected
  read_fd_with_timeout: client 0.0.0.0 read error = Connection reset by peer.
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument

Comment 8 Volker Lendecke 2010-09-28 12:41:20 UTC
Created attachment 5988 [details]
Patch for 3.5

This is what I pushed to master. I'm not sure how we get into this situation. This patch should get us out of there however.

Volker
Comment 9 andrew 2010-09-29 15:15:43 UTC
Thanks Volker.  I have plugged some USB drives in and migrated the samba logs there.

I have also upped the log level to 3 to see if we can catch anything in there as the large number of zombie processes has me confused.  Given that we get zombie processes on a seemingly daily basis I should have something better for you to troll through Volker.

My suspicion is that a poorly coded app we are required to use, that crashes on the client side rather frequently, is exacerbating the problem.

That doesn't mean that samba shouldn't be robust enough to live through it though.
Comment 10 andrew 2010-10-07 17:07:42 UTC
I have captured some machine log files at level 3 from processes going zombie and chewing up 100% of a processor.

Where should I put them?
Comment 11 Volker Lendecke 2010-10-08 01:42:48 UTC
It still happens even with this patch? This is weird. Please bzip2 -9 the logfiles and upload them to this bug. Alternatively, if they are too large, you might want to send them via normal Email to my account. My inbox recently has received a legitimate mail of 100MB, so it is pretty patient.

Volkre
Comment 12 andrew 2010-10-12 12:18:31 UTC
Sorry - I should have been clearer - I can't apply this patch to the production system so it is unpatched.

I did manage to capture some logs from a process going "zombie" - as you were wondering how we got into this situation in the first place.  Maybe it will help track down the cause of the zombie processes.
Comment 13 Volker Lendecke 2010-10-13 08:11:36 UTC
Did you upload those logs somewhere? I don't see that.

Volker
Comment 14 andrew 2010-10-13 16:30:24 UTC
I sent them to your personal email with the title "Bug 7518 - zombie process logs"
Comment 15 Volker Lendecke 2010-10-14 01:15:13 UTC
That did not arrive, sorry. Maybe it was too large?

Volker
Comment 16 andrew 2010-10-15 15:44:41 UTC
Created attachment 6013 [details]
Zombie Process logs

Hi Volker, I have attached level 3 log files from the time around when process 14403 went zombie on our samba server.
I have a process monitoring for >90% CPU utilisation on a consistent basis, and it caught this process.

Attached are
-	samba output log file (ws1011)
-	/var/log/messages parsed output from around the time it went zombie (15_05_oct_7_2010_crash.messages.log) – the cores it references in there are not present in /var/log/samba/cores/smbd
-	list of open/locked files by that process from lsof (14403_crash.log)

If you need me to capture something else that would be more use in debugging the next time it happens, please let me know.
Comment 17 Volker Lendecke 2010-10-20 12:57:53 UTC
Oct  7 11:07:27 Bubbles smbd[30290]:   Too many open files, unable to open more!  smbd's max open files = 32768

You probably have seen those. This is the root of your problems. smbd probably does not deal properly with ENFILE in all situations, but for your client functionality you need to solve this issue. You seem to have an application that opens a large number of files. You have 1109 files open, which is an odd number, not a 1024 limit or so. If lots of clients access your file server with that application, you might overflow a global limit.

What kind of system is that? On my Linux system I have a file /proc/sys/fs/file-max which gives a global limit of file descriptors. You might want to increase that. If it's not Linux, I would have to investigate further.

Volker
Comment 18 andrew 2010-11-25 09:10:49 UTC
Here's the stack trace I caught with level 3 logging enabled:

[2010/11/01 16:59:41.077102,  0] lib/fault.c:46(fault_report)
  ===============================================================
[2010/11/01 16:59:41.077167,  0] lib/fault.c:47(fault_report)
  INTERNAL ERROR: Signal 11 in pid 12640 (3.5.4)
  Please read the Trouble-Shooting section of the Samba3-HOWTO
[2010/11/01 16:59:41.077219,  0] lib/fault.c:49(fault_report)

  From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
[2010/11/01 16:59:41.077267,  0] lib/fault.c:50(fault_report)
  ===============================================================
[2010/11/01 16:59:41.077302,  0] lib/util.c:1465(smb_panic)
  PANIC (pid 12640): internal error
[2010/11/01 16:59:41.128219,  0] lib/util.c:1569(log_stack_trace)
  BACKTRACE: 29 stack frames:
   #0 /usr/sbin/smbd(log_stack_trace+0x1a) [0x2ac094c772c5]
   #1 /usr/sbin/smbd(smb_panic+0x55) [0x2ac094c773c9]
   #2 /usr/sbin/smbd [0x2ac094c68101]
   #3 /lib64/libc.so.6 [0x2ac097ae42d0]
   #4 /usr/lib64/libwbclient.so.0 [0x2ac09768b3ce]
   #5 /usr/lib64/libwbclient.so.0 [0x2ac09768bc97]
   #6 /usr/lib64/libwbclient.so.0 [0x2ac097683b92]
   #7 /usr/lib64/libwbclient.so.0 [0x2ac097683bc5]
   #8 /usr/lib64/libwbclient.so.0(wbcPing+0x4d) [0x2ac097683fc1]
   #9 /usr/sbin/smbd(winbind_ping+0x9) [0x2ac094c40a00]
   #10 /usr/sbin/smbd(create_local_token+0x20) [0x2ac094cc72f0]
   #11 /usr/sbin/smbd [0x2ac094cca50d]
   #12 /usr/sbin/smbd [0x2ac094acd097]
   #13 /usr/sbin/smbd(ntlmssp_update+0x270) [0x2ac094accc86]
   #14 /usr/sbin/smbd(auth_ntlmssp_update+0x17) [0x2ac094cca215]
   #15 /usr/sbin/smbd [0x2ac094a40e5e]
   #16 /usr/sbin/smbd(reply_sesssetup_and_X+0x191) [0x2ac094a4118f]
   #17 /usr/sbin/smbd [0x2ac094a75eb1]
   #18 /usr/sbin/smbd [0x2ac094a78b4e]
   #19 /usr/sbin/smbd [0x2ac094a79341]
   #20 /usr/sbin/smbd(run_events+0x1d6) [0x2ac094c861f8]
   #21 /usr/sbin/smbd(smbd_process+0x97c) [0x2ac094a7837d]
   #22 /usr/sbin/smbd [0x2ac094f844ca]
   #23 /usr/sbin/smbd(run_events+0x1d6) [0x2ac094c861f8]
   #24 /usr/sbin/smbd [0x2ac094c86467]
   #25 /usr/sbin/smbd(_tevent_loop_once+0x84) [0x2ac094c867e9]
   #26 /usr/sbin/smbd(main+0xf83) [0x2ac094f841ff]
   #27 /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ac097ad1994]
   #28 /usr/sbin/smbd [0x2ac0949ff5a9]
[2010/11/01 16:59:41.131336,  0] lib/fault.c:326(dump_core)
  dumping core in /var/log/samba/cores/smbd
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
select() error: Invalid argument
.
.
.
etc.
Comment 19 andrew 2010-11-25 09:34:55 UTC
> You seem to have an application
> that opens a large number of files. You have 1109 files open, which is an odd
> number, not a 1024 limit or so. If lots of clients access your file server 
> with that application, you might overflow a global limit.

This application stores large (2+GB) files on the network and accesses them in a database-"like" fashion using ctree is what I have been able to gather.

I have collected a list of the files that a hung process has open for the past month or so, and it is always a list of the same application's files, and seems to be a random number of files

1230, 1176, 1925, 1112, 1121, 1107, 1145, 1183, 1111

It also looks like they are using large file locking so it ends up with at 10-400 locks on any given file, which is probably exacerbating the problem.

The application itself doesn't open a large number of files, but it does seem to access many different parts of the same large file.

This is a RHEL 5 system

# [root@Bubbles ~]# cat /proc/sys/fs/file-max
# 4670824

And I set up all the boxes with this:
# > add the following line to /etc/security/limits.conf
# *	soft	nofile	16384
# *	hard	nofile	100000

From the man page for smb.conf, 

"max open files (G)
This parameter limits the maximum number of open files that one smbd(8) file serving process may have open for a client at any one time"

So I can't see how having max open files = 32768 would make any difference in this case - this client sessions only have a max of 1900 "files" (or parts of a file) open at a time which is far below the limit.

Unless I misread something?
Comment 20 Volker Lendecke 2010-12-01 12:45:11 UTC
dumping core in /var/log/samba/cores/smbd

Can you upload the core file? I also need the precise versions of your packages. As this is a RHEL, you might also contact your RedHat support to create a "backtrace full" gdb output from that core file together with the debuginfo packages.
Comment 21 andrew 2011-06-10 15:58:36 UTC
Created attachment 6563 [details]
Samba crashes to date
Comment 22 andrew 2011-06-10 16:01:41 UTC
I have attached another log file that contains a whole whack of stack traces from Jan 1 of this year to date.

We are currently on 3.5.8, patched on Mar 10, 2011 - so a bunch of stack traces in that log file have occurred since that date.

What has stopped since 3.5.8, however, are the zombie processes.  We no longer see processes pegged at 100% that need intervention to kill - they seem to kill themselves somehow.  But the log files are still filling and smb user processes are still crashing.
Comment 23 Volker Lendecke 2011-06-11 06:57:35 UTC
(In reply to comment #22)
> I have attached another log file that contains a whole whack of stack traces
> from Jan 1 of this year to date.
> 
> We are currently on 3.5.8, patched on Mar 10, 2011 - so a bunch of stack traces
> in that log file have occurred since that date.
> 
> What has stopped since 3.5.8, however, are the zombie processes.  We no longer
> see processes pegged at 100% that need intervention to kill - they seem to kill
> themselves somehow.  But the log files are still filling and smb user processes
> are still crashing.

This is yet a different one. It crashes in

   #10 /usr/lib64/libgnutls.so.13(gnutls_global_init+0x20b) [0x2b6e57159fab]
   #11 /usr/lib64/libcups.so.2(httpInitialize+0x33) [0x2b6e54f49d13]
   #12 /usr/lib64/libcups.so.2(httpConnectEncrypt+0x95) [0x2b6e54f4b1d5]

Later on we see

   #7 /lib64/libc.so.6(__stack_chk_fail+0x2f) [0x2acb8be3a26f]
   #8 /usr/lib64/libcups.so.2(httpConnectEncrypt+0x176) [0x2acb8aa8f2b6]
   #9 /usr/sbin/smbd [0x2acb88edccd1]

or other crashes within gnutls. This is really, really weird.

I wonder if no other server process on that box has any problems.

Volker