Bug 4715 - samba crashes when user is trying to open a share residing on a memory disk
Status: ASSIGNED
Product: Samba 3.0
Classification: Unclassified
Component: File Services
Version: 3.0.25a
Hardware: x86 Windows XP
Importance: P3 normal
Target Milestone: none
Assigned To: Samba Bugzilla Account
QA Contact: Samba QA Contact
Duplicates: 4738 4858
Reported: 2007-06-20 02:49 UTC by drookie
Modified: 2008-08-28 04:44 UTC (History)

Attachments
Patch to turn off directory cache with new parameter. (4.86 KB, patch)
2007-08-23 16:54 UTC, Jeremy Allison
Q&D patch to disable repdir_* replacement functions. (319 bytes, patch)
2007-08-29 19:24 UTC, Timur Bakeyev

Description drookie 2007-06-20 02:49:23 UTC
Requirements: 
- Samba 3.0.25a on FreeBSD 6.2-RELEASE-p5, built from ports.
- Vista/Office2K7 tryouts on a UDF image
- a samba config with a share definition which points to the image mountpoint.

/mnt/office2k7 in this case is a memory disk, created with FreeBSD's mdconfig and then mounted locally via mount_udf (the whole setup is similar to the Linux mount -o loop option).

The bug is reproducible. I can also provide ssh access to the machine running this installation.

Note: in Samba 3.0.24 this partly worked. A user was able to open the share, and samba crashed only later, while the user was performing certain file operations.
I'm quite sure this bug is a continuation of bug 3683, described here - https://bugzilla.samba.org/show_bug.cgi?id=3683.

Config follows:

[global]
workgroup = SOFTLAB
machine password timeout = 0
netbios name = PANICBOX
server string = Samba 3.0.25a on FreeBSD 6.2-RELEASE-p5
hosts allow = 192.168. 127. 172.16.
guest account = pcguest
map to guest = bad user
log file = /var/log/samba/log.%m
encrypt passwords = yes
socket options = TCP_NODELAY
dns proxy = no
local master = no
os level = 32
interfaces = fxp0 lo0
bind interfaces only = yes
log level = 9
syslog = 4
deadtime = 15
wins server = 192.168.3.6
printing = BSD
unix charset = KOI8-R
dos charset = 866
passdb backend = smbpasswd
security = user

[office2k7]
comment = Panic Campground
path = /mnt/office2k7
browseable = yes
guest ok = yes
guest only = yes
writeable = no
#posix locking = no

[public]
comment = Public Share
path = /usr/local/public
browseable = yes
guest ok = yes
guest only = yes
writeable = yes



Log output follows:

[2007/06/20 13:01:55, 8] smbd/dosmode.c:dos_mode_from_sbuf(188)
  dos_mode_from_sbuf returning rd
[2007/06/20 13:01:55, 8] smbd/dosmode.c:dos_mode(409)
  dos_mode returning rd
[2007/06/20 13:01:55, 5] smbd/trans2.c:get_lanman2_dir_entry(1255)
  get_lanman2_dir_entry found ./sources fname=sources
[2007/06/20 13:01:55, 8] smbd/trans2.c:get_lanman2_dir_entry(1161)
  get_lanman2_dir_entry:readdir on dirptr 0x8384080 now at offset 156
[2007/06/20 13:01:55, 8] smbd/dosmode.c:dos_mode(371)
  dos_mode: ./support
[2007/06/20 13:01:55, 8] smbd/dosmode.c:dos_mode_from_sbuf(188)
  dos_mode_from_sbuf returning rd
[2007/06/20 13:01:55, 8] smbd/dosmode.c:dos_mode(409)
  dos_mode returning rd
[2007/06/20 13:01:55, 5] smbd/trans2.c:get_lanman2_dir_entry(1255)
  get_lanman2_dir_entry found ./support fname=support
[2007/06/20 13:01:55, 0] lib/fault.c:fault_report(41)
  ===============================================================
[2007/06/20 13:01:55, 0] lib/fault.c:fault_report(42)
  INTERNAL ERROR: Signal 6 in pid 1756 (3.0.25a)
  Please read the Trouble-Shooting section of the Samba3-HOWTO
[2007/06/20 13:01:55, 0] lib/fault.c:fault_report(44)

  From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
[2007/06/20 13:01:55, 0] lib/fault.c:fault_report(45)
  ===============================================================
[2007/06/20 13:01:55, 0] lib/util.c:smb_panic(1632)
  PANIC (pid 1756): internal error
[2007/06/20 13:01:55, 0] lib/util.c:log_stack_trace(1786)
  unable to produce a stack trace on this platform
[2007/06/20 13:01:55, 3] smbd/sec_ctx.c:push_sec_ctx(208)
  push_sec_ctx(1004, 1003) : sec_ctx_stack_ndx = 1
[2007/06/20 13:01:55, 3] smbd/uid.c:push_conn_ctx(358)
  push_conn_ctx(103) : conn_ctx_stack_ndx = 0
[2007/06/20 13:01:55, 3] smbd/sec_ctx.c:set_sec_ctx(243)
  setting sec ctx (0, 0) - sec_ctx_stack_ndx = 1
[2007/06/20 13:01:55, 5] auth/auth_util.c:debug_nt_user_token(448)
  NT user token: (NULL)
[2007/06/20 13:01:55, 5] auth/auth_util.c:debug_unix_user_token(474)
  UNIX token of user 0
  Primary group is 0 and contains 0 supplementary groups
[2007/06/20 13:01:55, 0] lib/fault.c:dump_core(181)
  dumping core in /var/log/samba/cores/smbd
Comment 1 Jeremy Allison 2007-06-20 18:31:54 UTC
Can you run a version with symbols, set the parameter :

panic action = "/bin/sleep 90000"

re-create the panic and then attach to the parent of the sleep process with gdb and get me a backtrace please ?

Thanks,

Jeremy.
Comment 2 drookie 2007-06-21 12:41:36 UTC
Still cannot obtain the backtrace. :/
When rebuilt with the -O0 -g options and run unstripped, it started to work as intended, or at least reverted to the original behaviour of bug 3683.
Comment 3 Matt H 2007-06-30 23:22:49 UTC
I am able to reproduce this.

Tested with: FreeBSD 6.2-RELEASE, FreeBSD 6.2-RELEASE-p5 or FreeBSD 6.2-STABLE.

The problem I have is identical, however, I'm using this with ntfs-3g and/or regular mount_ntfs.

There is also a report of it occurring with ZFS:
http://lists.freebsd.org/pipermail/freebsd-current/2007-May/072918.html

There's a PR here on the issue with a memory disk:
http://www.freebsd.org/cgi/query-pr.cgi?pr=113158

The PR states the problem doesn't exist in 3.0.24, I confirmed that it also does not exist in 3.0.23.  This seems to have been introduced in 3.0.25.

I tested it with both an NTFS drive and a file-based image.  An empty drive loaded OK, but as soon as I put a file in it, it would crash.  It's interesting that I created 3 directories, test, test2, and test3, but reading the output, it runs the get_lanman2_dir_entry stuff on test and test2, but doesn't get to test3.  I examined this on my other drives, and it crashes on reading the last directory on all of them.

I downgraded to samba-3.0.23 and the problem went away.

Here's some debug output:

[2007/06/30 21:16:05, 5] smbd/trans2.c:get_lanman2_dir_entry(1255)
  get_lanman2_dir_entry found ./test fname=test
[2007/06/30 21:16:05, 10] smbd/trans2.c:get_lanman2_dir_entry(1398)
  get_lanman2_dir_entry: SMB_FIND_FILE_BOTH_DIRECTORY_INFO
[2007/06/30 21:16:05, 8] smbd/trans2.c:get_lanman2_dir_entry(1161)
  get_lanman2_dir_entry:readdir on dirptr 0x836d180 now at offset 56
[2007/06/30 21:16:05, 8] smbd/dosmode.c:dos_mode(371)
  dos_mode: ./test2
[2007/06/30 21:16:05, 8] smbd/dosmode.c:dos_mode_from_sbuf(188)
  dos_mode_from_sbuf returning d
[2007/06/30 21:16:05, 8] smbd/dosmode.c:dos_mode(409)
  dos_mode returning d
[2007/06/30 21:16:05, 5] smbd/trans2.c:get_lanman2_dir_entry(1255)
  get_lanman2_dir_entry found ./test2 fname=test2
[2007/06/30 21:16:05, 10] smbd/trans2.c:get_lanman2_dir_entry(1398)
  get_lanman2_dir_entry: SMB_FIND_FILE_BOTH_DIRECTORY_INFO
[2007/06/30 21:16:05, 0] lib/fault.c:fault_report(41)
  ===============================================================
[2007/06/30 21:16:05, 0] lib/fault.c:fault_report(42)
  INTERNAL ERROR: Signal 6 in pid 2669 (3.0.25a)
  Please read the Trouble-Shooting section of the Samba3-HOWTO
[2007/06/30 21:16:05, 0] lib/fault.c:fault_report(44)

  From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
[2007/06/30 21:16:05, 0] lib/fault.c:fault_report(45)
  ===============================================================
[2007/06/30 21:16:05, 0] lib/util.c:smb_panic(1632)
  PANIC (pid 2669): internal error
[2007/06/30 21:16:05, 0] lib/util.c:log_stack_trace(1786)
  unable to produce a stack trace on this platform
[2007/06/30 21:16:05, 3] smbd/sec_ctx.c:push_sec_ctx(208)
  push_sec_ctx(1001, 100) : sec_ctx_stack_ndx = 1
[2007/06/30 21:16:05, 3] smbd/uid.c:push_conn_ctx(358)




Reproduction is quite simple: get a basic FreeBSD install, then install samba and ntfs-3g (or use a memory-based disk or ZFS).

I initialized a 50 MB image:
dd if=/dev/zero of=ntfs-test.img count=100000
mkntfs -fF -c 512 -s 512 ntfs-test.img 100000

then mounted it
ntfs-3g ntfs-test.img /mnt

then I added "/mnt" to my share list.

Put some directories in there.. mkdir /mnt/test1 /mnt/test2

Watch smbd crash.

If instructed, I can perhaps generate additional crash dumps, but it would seem it's unable to produce a stack trace, and I'm unsure why (sorry!).
Comment 4 Jeremy Allison 2007-07-01 00:39:01 UTC
I need a stack trace. I'm not a *BSD user and so it isn't so simple to reproduce for me. As it's happening in the directory code I have a sneaky feeling it's to do with this code :

lib/replace/repdir_getdents.c

/*
  a replacement for opendir/readdir/telldir/seekdir/closedir for BSD systems

  This is needed because the existing directory handling in FreeBSD
  and OpenBSD (and possibly NetBSD) doesn't correctly handle unlink()
  on files in a directory where telldir() has been used. On a block
  boundary it will occasionally miss a file when seekdir() is used to
  return to a position previously recorded with telldir().

  This also fixes a severe performance and memory usage problem with
  telldir() on BSD systems. Each call to telldir() in BSD adds an
  entry to a linked list, and those entries are cleaned up on
  closedir(). This means with a large directory closedir() can take an
  arbitrary amount of time, causing network timeouts as millions of
  telldir() entries are freed

  Note! This replacement code is not portable. It relies on getdents()
  always leaving the file descriptor at a seek offset that is a
  multiple of DIR_BUF_SIZE. If the code detects that this doesn't
  happen then it will abort(). It also does not handle directories
  with offsets larger than can be stored in a long,
---------------------

This only goes to show that you should never try and work around
bugs in the underlying platform - you should always scream until
they get fixed properly :-). I'm guessing the problem is the section
that states : 

"It relies on getdents() always leaving the file descriptor at a seek offset that is a multiple of DIR_BUF_SIZE. If the code detects that this doesn't happen then it will abort()."

Can you try disabling this code and see if the problem goes away ? I'm guessing the seek offset was always a multiple of DIR_BUF_SIZE on ufs, and isn't on other *BSD filesystems.

Can you also check if the underlying bug has been fixed ? If so we'll just remove this code. If not then you're between a rock and a hard place. *BSD
is still broken for large directories.

Jeremy.
Comment 5 Timur Bakeyev 2007-07-19 12:13:12 UTC
(In reply to comment #4)

Hi, Jeremy!

> "It relies on getdents() always leaving the file descriptor at a seek offset
> that is a multiple of DIR_BUF_SIZE. If the code detects that this doesn't
> happen then it will abort()."
> 
> Can you try disabling this code and see if the problem goes away ? I'm guessing
> the seek offset was always a multiple of DIR_BUF_SIZE on ufs, and isn't on
> other *BSD filesystems.

You were absolutely right in your guess: commenting out abort() let me connect to the MS-DOS partition and get its listing in smbclient. Still, an attempt to fetch a file led to the "short read" message and a 0-length file on the client side. But at least we know where it comes from.

> Can you also check if the underlying bug has been fixed ? If so we'll just
> remove this code. If not then you're between a rock and a hard place. *BSD
> is still broken for large directories.

I remember an old conversation between Tridge and PHK (phk@FreeBSD.org) about this problem and the workaround and, IIRC, PHK said there is no bug in FreeBSD, but rather a wrong (and Linux-specific :) approach by the Samba team. I can try to ask him again, but I don't expect that anything has changed in the FreeBSD kernel.

Is there a possibility not to rely on this very specific behaviour in Samba instead?

With best regards,
Timur.
Comment 6 Jeremy Allison 2007-07-19 12:16:12 UTC
I recently spoke to Julian Elischer about this. He thinks it might still be a problem in *BSD. You can comment out this replacement code for *BSD, but the problem will then be that you'll randomly miss files in directory listings, plus you'll still have the linked-list scaling problem.

Someone who *knows the code* needs to examine the *BSD kernels and give a definitive answer on this.

Jeremy.
Comment 7 Jeremy Allison 2007-08-04 23:45:57 UTC
*** Bug 4858 has been marked as a duplicate of this bug. ***
Comment 8 Jeremy Allison 2007-08-04 23:47:25 UTC
*** Bug 4738 has been marked as a duplicate of this bug. ***
Comment 9 Jeremy Allison 2007-08-04 23:53:14 UTC
I still need someone from the FreeBSD community to answer the question raised by this comment in the Samba code.

This is needed because the existing directory handling in FreeBSD
  and OpenBSD (and possibly NetBSD) doesn't correctly handle unlink()
  on files in a directory where telldir() has been used. On a block
  boundary it will occasionally miss a file when seekdir() is used to
  return to a position previously recorded with telldir().

Has this bug been fixed in FreeBSD ? I can remove this replacement code in Samba, but if I do the underlying bug of missing entries on unlink will return.

I urgently need confirmation from FreeBSD kernel developers.

Jeremy.

Comment 10 Jeremy Allison 2007-08-05 00:01:55 UTC
Just a further note for Timur@FreeBSD.org.

You say : "I remember old conversation between Tridge and PHK(phk@FreeBSD.org) about this problem and workaround, and IIRC, PHK said there is no bug in FreeBSD, but rather wrong(and Linux specific:) approach of the Samba team."

All Samba is doing in our directory code is the following :

 while ((n = vfs_readdirname(conn, dirp->dir))) {
          ....
          dirp->offset = SMB_VFS_TELLDIR(conn, dirp->dir);
 }

Where dirp->dir is a directory handle opened with opendir.

While this directory handle is open we expect that a
SMB_VFS_SEEKDIR(dirp->conn, dirp->dir, offset); call
will return to the same point in the directory, even
after an unlink() call.

All other platforms that Samba runs on have this property - does *BSD ?
If you say yes, remember to check the case described above :

"FreeBSD and OpenBSD (and possibly NetBSD) doesn't correctly handle unlink()
  on files in a directory where telldir() has been used. On a block
  boundary it will occasionally miss a file when seekdir() is used to
  return to a position previously recorded with telldir()."

This code isn't going to change in Samba, it makes directory operations from CIFS clients fast even on large directories. Without it we have to rewinddir and seek from the start after every unlink. When deleting a directory this becomes order O(n^2) - not feasible in working code (and a DoS attack to boot).

Jeremy.
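
The property described above, that seekdir() returns to a position recorded with telldir() on the same open handle, can be sketched as a small standalone check. This is an illustration only, not Samba code: the helper name `seekdir_telldir_roundtrip`, the scratch directory under /tmp and the file naming are all assumptions made for the example.

```c
#define _XOPEN_SOURCE 700   /* for mkdtemp(), telldir(), seekdir() */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Return the next entry name that is not "." or "..", or NULL at the end. */
static const char *next_name(DIR *d)
{
    struct dirent *e;

    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0)
            return e->d_name;
    }
    return NULL;
}

/*
 * Hypothetical helper, not Samba code. Creates nfiles empty files in a
 * scratch directory, reads one entry, records telldir(), reads the next
 * entry, seeks back and reads again.
 * Returns 1 if the same entry comes back, 0 if not, -1 on setup failure.
 * With unlink_between != 0 the already-read file is unlinked before
 * seekdir(), which is the case debated in this thread.
 */
int seekdir_telldir_roundtrip(int nfiles, int unlink_between)
{
    char tmpl[] = "/tmp/seekdirXXXXXX";
    char path[600], first[256], expect[256];
    const char *n;
    long pos;
    int i, result = -1;
    DIR *d;

    assert(nfiles >= 2);
    if (mkdtemp(tmpl) == NULL)
        return -1;
    for (i = 0; i < nfiles; i++) {
        FILE *f;
        snprintf(path, sizeof(path), "%s/f%04d", tmpl, i);
        f = fopen(path, "w");
        if (f != NULL)
            fclose(f);
    }

    d = opendir(tmpl);
    if (d != NULL) {
        if ((n = next_name(d)) != NULL) {
            snprintf(first, sizeof(first), "%s", n);
            pos = telldir(d);                 /* position after 'first' */
            if ((n = next_name(d)) != NULL) {
                snprintf(expect, sizeof(expect), "%s", n);
                if (unlink_between) {
                    snprintf(path, sizeof(path), "%s/%s", tmpl, first);
                    unlink(path);
                }
                seekdir(d, pos);              /* go back to the recorded spot */
                n = next_name(d);
                result = (n != NULL && strcmp(n, expect) == 0);
            }
        }
        closedir(d);
    }

    /* Clean up the scratch files and directory. */
    for (i = 0; i < nfiles; i++) {
        snprintf(path, sizeof(path), "%s/f%04d", tmpl, i);
        unlink(path);
    }
    rmdir(tmpl);
    return result;
}
```

Calling `seekdir_telldir_roundtrip(20, 0)` exercises the plain round trip; passing a non-zero second argument exercises the unlink() case, whose result on a given filesystem is exactly what is in question here.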
Comment 11 Jeremy Allison 2007-08-05 00:12:04 UTC
Actually it's not true when I say this code isn't going to change. I can change the Samba code here, but if I do so what I will have to do is disable the name cache for *BSD systems. This will mean the performance on these systems goes down the toilet for large directories, but at least I won't get bug reports complaining about missing files. All directory operations will require a rewinddir() and a search from the start by name on every FindNext operation. Is this what you want for *BSD ? This will hurt.
Jeremy.
Comment 12 Jeremy Allison 2007-08-05 00:19:16 UTC
James Peach - can you bug Terry Lambert and Jordan Hubbard to get some feedback on the state of this bug within FreeBSD please ?

Thanks,

Jeremy.
Comment 13 Carl Mascott 2007-08-05 14:46:32 UTC
Can you show me where in the documentation for any BSD OS it states that directory entry offsets do not change after deleting files in the directory?

I think you are relying on undocumented, unspecified internal behavior on the other platforms, which happens to be different from the behavior on BSD.  (It may also be different from how the other platforms behave next week.)

Since SVR4 uses Berkeley FFS (ufs) and FFS is the foundation for Veritas vxfs I would expect to see the same problem on more than just BSD systems.

QUESTION:
What is in your "name cache": ASCII strings representing file names?  Numbers representing directory entry offsets?  Something else?

Please bear with me: I don't have the time to become familiar w/ the Samba source.
Comment 14 Jeremy Allison 2007-08-05 15:01:52 UTC
No problem then - I'll just disable this code for *BSD. This is going to hurt you much more than it'll hurt me :-). I use Linux :-) :-).

We're depending on seekdir(telldir()) being an identity, so long as the handle isn't closed in the meantime. That seems pretty reasonable to me, and indeed all other platforms seem to guarantee this.

Jeremy.
Comment 15 Timur Bakeyev 2007-08-05 16:49:44 UTC
(In reply to comment #9)
> 
> I urgently need confirmation from FreeBSD kernel developers.
> 
> Jeremy.

I wish I could point to an appropriate one :( I'll send a call to the MLs; maybe someone will speak up.

Timur

Comment 16 Timur Bakeyev 2007-08-05 17:12:17 UTC
Hi, Jeremy!

(In reply to comment #14)
> No problem then - I'll just disable this code for *BSD. This is going to hurt
> you much more than it'll hurt me :-). I use Linux :-) :-).

Can we isolate this code and leave some #ifdef that would choose between the (currently non-portable) caching with the replacement lib and the portable but slow rewind/telldir/seekdir sequence? I can't give you an educated answer right now, as I'm not a kernel developer and haven't yet found one who is familiar enough with the VFS code, but that would be a compromise in the meantime for the next few versions of Samba.

Two more questions - is it possible to perform the deletion of several files in a directory as a batch, so that the rewinding is done only once?

How did it happen that, prior to 3.0.25, people were able to use non-UFS partitions in FreeBSD without a problem? I.e. 3.0.24 still works OK with ISO9660, UDF, MSDOS, ZFS, etc. And for UFS the fix from Tridge works OK. I remember that it was originally implemented around 2005 (http://samba.sernet.de/irclog/2005/01/20050130-Sun.log), but until now it wasn't a problem.

Would it be possible to revert back to .24 behavior?

With regards,
Timur.

 
Comment 17 Jeremy Allison 2007-08-05 19:07:26 UTC
Yes, that's what I'm talking about. It was an easy change to just delete the directory caching for *BSD - the problem is that for the sequence : findfirst -> findnext -> findnext....... findnext (end of dir) for a large directory the performance will be *horrible* without the directory cache as I have to rewinddir/readdir to find the correct resume point on every findnext. With a seekdir(telldir()) identity system we can associate the last 100 filenames read with a telldir() offset we know we can resume from - very efficient.

We can't change the orders of deletes as this is completely client driven. We delete what the client told us to delete as it tells us to delete it.

Jeremy.
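
The directory cache described in this comment can be pictured as a small ring buffer that pairs recently read names with the telldir() offset to resume from. The following is an illustrative sketch only, not the code in smbd/dir.c: the type and function names (`name_cache`, `name_cache_add`, `name_cache_lookup`) and the fixed size of 100 entries are assumptions based on the description above.

```c
#include <assert.h>
#include <string.h>

#define NAME_CACHE_SIZE 100   /* "last 100 filenames", per the comment above */
#define NAME_MAX_LEN    256   /* assumed upper bound on a name, incl. NUL */

struct name_cache_entry {
    char name[NAME_MAX_LEN];
    long offset;              /* telldir() value to resume from */
    int  used;
};

struct name_cache {
    struct name_cache_entry e[NAME_CACHE_SIZE];
    unsigned next;            /* slot that the next add will overwrite */
};

/* Remember that 'name' was read at directory offset 'offset'. */
void name_cache_add(struct name_cache *c, const char *name, long offset)
{
    struct name_cache_entry *slot;

    assert(c != NULL && name != NULL);
    slot = &c->e[c->next];
    strncpy(slot->name, name, NAME_MAX_LEN - 1);
    slot->name[NAME_MAX_LEN - 1] = '\0';
    slot->offset = offset;
    slot->used = 1;
    c->next = (c->next + 1) % NAME_CACHE_SIZE;   /* oldest entry is evicted */
}

/*
 * Return 1 and set *offset if 'name' is cached, else 0. On a miss the
 * caller would have to rewinddir() and scan from the start.
 */
int name_cache_lookup(const struct name_cache *c, const char *name, long *offset)
{
    unsigned i;

    for (i = 0; i < NAME_CACHE_SIZE; i++) {
        if (c->e[i].used && strcmp(c->e[i].name, name) == 0) {
            *offset = c->e[i].offset;
            return 1;
        }
    }
    return 0;
}
```

The structure must be zeroed (for example with memset) before first use. A hit lets the server seekdir() straight to the resume point, which only helps if seekdir(telldir()) is reliable on the platform.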
Comment 18 Jeremy Allison 2007-08-05 19:10:54 UTC
Timur wrote : "Would it be possible to revert back to .24 behavior?"

Nope, sorry. That behaviour was the cause of bug reports and so that's why it was changed.

Jeremy.
Comment 19 Timur Bakeyev 2007-08-05 19:31:07 UTC
(In reply to comment #17)
> Yes, that's what I'm talking about. It was an easy change to just delete the
> directory caching for *BSD - the problem is that for the sequence : findfirst
> -> findnext -> findnext....... findnext (end of dir) for a large directory the

Can we meanwhile leave a #ifdef'ed code around this caching, something like:

#ifdef FREEBSD_DIR_CACHE_OFF

and make it possible to define the macro at compile time, so that either the caching code with the replacement telldir/seekdir is used, or no caching at all (i.e. rewinddir() after each delete).

Then we can collect some user statistics and see what is better for the end user - after all, directories with more than 1000 files are not quick on UFS anyhow, but they are rare.

Another option, though possibly too complex to implement, is to fit the cache into the size of a sector and make sure it doesn't cross the boundary. But that sounds overcomplicated.

So, Jeremy, would it be possible to implement the #ifdef solution for 3.0.26, for example? I guess it's too late for 3.0.25c (but if you give me a patch, I can include it in the FreeBSD package).

As for Samba4, let's see what the outcome is with Samba3 first, before making similar changes there.
Comment 20 Timur Bakeyev 2007-08-05 19:32:31 UTC
(In reply to comment #18)
> Timur wrote : "Would it be possible to to revert back to .24 behavior?"
> 
> Nope, sorry. That behaviour was the cause of bug reports and so that's why it
> was changed.

Just curious - what was the bug, or what is its number in bugzilla, if any?
Comment 21 tim newsham (550 5.1.1 User unknown) 2007-08-06 19:12:44 UTC
[sorry, made this comment on bug#4858 earlier, copying it here as well]
I tested os2_delete.c on FreeBSD/amd64 6.2-stable (updated 8/2/2007)
and FreeBSD/x86 6.2-release and both still fail the test.

This issue arises (at least on my machine) because an assumption in the replacement code is violated -- that the file position is always padded out to 512-byte alignment[*].  Is it possible to fix the assumption in the replacement code without throwing out the (caching) replacement code?

[*] repdir_getdirentries.c line 135: abort() if (d->seekpos & (DIR_BUF_SIZE-1))
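
The check referred to in the footnote can be sketched as below. This is a simplified illustration rather than the actual repdir_getdirentries.c source: the value 512 for DIR_BUF_SIZE is assumed from the 512-byte alignment mentioned in this comment, and the helper name `seekpos_is_aligned` is hypothetical.

```c
#include <sys/types.h>

#define DIR_BUF_SIZE 512   /* assumed from the 512-byte alignment above */

/*
 * Returns 1 when seekpos is a multiple of DIR_BUF_SIZE, 0 otherwise.
 * The mask trick only works because DIR_BUF_SIZE is a power of two.
 * Per the footnote, the replacement code calls abort() when the
 * equivalent of this check fails.
 */
int seekpos_is_aligned(off_t seekpos)
{
    return (seekpos & (DIR_BUF_SIZE - 1)) == 0;
}
```

For example, an offset of 1024 passes this check while an offset of 156 does not, so any filesystem that leaves the descriptor at an offset like the latter would trip the abort().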
Comment 22 James Peach 2007-08-17 11:44:14 UTC
FWIW, Terry says:

There are two bugs; one is in Samba, the other in FreeBSD.

The first is in the Samba assumptions about being able to delete, create, and iterate at the same time.  POSIX specifies that it shall be possible to iterate and delete at the same time; however, it also states that an intervening change of position within the directory may result in undefined behaviour:

	<http://www.opengroup.org/onlinepubs/009695399/functions/readdir.html>

Specifically, from knowledge of historical implementations which are POSIX conformant, I would argue that any application which does this type of operation (deletion or creation of a file during the iteration) should expect that it may be required to perform duplicate elimination at a higher abstraction layer in their own software.  Similarly, historical implementations involving NFS will potentially result in an incomplete iteration of directory contents.

These particular problems arise because of cached multiple entries potentially being in a directory block which spans the area in which a new file is created, or file system or opendir implementations which make it impossible for a single directory entry block to be returned by the system's getdirentries (or equivalent) system call.

In the FreeBSD case in particular, the directory block boundary problem can result in incorrect operation with an intervening delete; this is specific to the read-restart code at the VNOP implementation layer.  The NetBSD implementation is more technically correct in this regard (supports arbitrary restart without lost entries), but both are insufficient.

The particular issue is an interaction between the cached getdirentries system call (this is the system call on BSD-based systems), and the library implementation of seekdir/telldir.

There are two ways of addressing this issue, but both will result in the necessity of duplicate suppression being implemented by the calling application.  Particularly, coalescing of free space in directory entry blocks in UFS, or, in other FS's, the rebalancing of btrees makes it impossible to avoid the problem, when seeking between offsets representing one cached buffer object and another.
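
The duplicate suppression mentioned above would have to live in the calling application. A minimal sketch of the idea, assuming the caller collects entry names into an array before returning them; the function name `suppress_duplicates` is hypothetical, and the quadratic scan is chosen for brevity rather than speed.

```c
#include <stddef.h>
#include <string.h>

/*
 * Remove repeated names from names[0..n-1] in place, keeping the first
 * occurrence of each and preserving order. Returns the new count.
 * Only the pointers are moved; the strings themselves are not freed.
 */
size_t suppress_duplicates(const char **names, size_t n)
{
    size_t i, j, kept = 0;

    for (i = 0; i < n; i++) {
        int seen = 0;

        for (j = 0; j < kept; j++) {
            if (strcmp(names[j], names[i]) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            names[kept++] = names[i];
    }
    return kept;
}
```

This only handles entries that are returned twice; it does nothing for entries that are skipped during the iteration.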

Comment 23 Jeremy Allison 2007-08-23 16:08:22 UTC
So should I take this :

"I would argue that any application which does this type of
operation (deletion or creation of a file during the iteration) should expect
that it may be required to perform duplicate elimination at a higher
abstraction layer in their own software."

as a "will not fix" from you Terry ? :-). Do you know of *any* file management software that does this ?

Jeremy.
Comment 24 Jeremy Allison 2007-08-23 16:46:24 UTC
Ok, I'm going to parameterize the directory cache so it can be disabled on *BSD systems. I'll try and get this into any releases after 3.0.25c.
Jeremy.
Comment 25 Jeremy Allison 2007-08-23 16:54:30 UTC
Created attachment 2879 [details]
Patch to turn off directory cache with new parameter.

To test this set "directory name cache size = 0". We still need to detect the broken directory handling on *BSD and set the #define accordingly.
Jeremy.
Comment 26 Timur Bakeyev 2007-08-29 19:22:58 UTC
(In reply to comment #25)
> Created an attachment (id=2879) [edit]
> Patch to turn off directory cache with new parameter.
> 
> To test this set "directory name cache size = 0". We still need to detect the
> broken directory handling on *BSD and set the #define accordingly.

Sorry for the delay; apparently it's not so easy to test this feature. In fact, I'm still not sure whether it works for me or not...

At minimum, the repdir_* functions have to be disabled in the code, as otherwise they dump core on non-UFS filesystems anyhow. I attach a tiny patch to do this.

OK, with repdir_* disabled it was possible to test directory caching. I created two shares, one on a UFS2 filesystem, another on FAT16. For both shares I set 'directory name cache size'.

On the first run I set the cache to 0 for both shares, created 1200 files on each share and removed them via smbclient. All the files were gone. Then I set the cache size to 200 for both shares, assuming that for FAT16 it should improve speed and work, and that for UFS2 it should expose the bug, with some files remaining undeleted. But all the files were gone on the UFS2 share as well. So here I'm already puzzled.

I understand that such testing doesn't prove anything, as there is a quite specific OS2 deletion pattern that has to be followed to expose the bug with caching enabled for UFS2, and which should be eliminated by disabling caching. But I'm not sure whether it is possible to repeat this pattern with smbclient alone. Possibly an smbtorture test exists (or can be created) to test this behavior against an SMB share.

So, at this point I don't know how I can check that the given patch works and addresses the bug in UFS2 dir handling.

With regards,
Timur
Comment 27 Timur Bakeyev 2007-08-29 19:24:32 UTC
Created attachment 2905 [details]
Q&D patch to disable repdir_* replacement functions.
Comment 28 Timur Bakeyev 2007-08-29 19:30:40 UTC
(In reply to comment #1)
> Can you run a version with symbols, set the parameter :
> 
> panic action = "/bin/sleep 90000"
> 
> re-create the panic and then attach to the parent of the sleep process with gdb
> and get me a backtrace please ?

I just happened to get the backtrace for the original bug. I guess it's too late, but I add it for completeness.
[2007/08/29 00:38:44, 1] smbd/service.c:make_connection_snum(1033)
  build (10.10.10.10) connect to service dos initially as user nobody (uid=65534, gid=65534) (pid 78443)
[2007/08/29 00:38:45, 0] lib/fault.c:fault_report(41)
  ===============================================================
[2007/08/29 00:38:45, 0] lib/fault.c:fault_report(42)
  INTERNAL ERROR: Signal 6 in pid 78443 (3.0.25c)
  Please read the Trouble-Shooting section of the Samba3-HOWTO
[2007/08/29 00:38:45, 0] lib/fault.c:fault_report(44)
  
  From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
[2007/08/29 00:38:45, 0] lib/fault.c:fault_report(45)
  ===============================================================
[2007/08/29 00:38:45, 0] lib/util.c:smb_panic(1626)
  smb_panic: clobber_region() last called from [get_lanman2_dir_entry(1140)]
[2007/08/29 00:38:45, 0] lib/util.c:smb_panic(1632)
  PANIC (pid 78443): internal error
[2007/08/29 00:38:45, 0] lib/util.c:log_stack_trace(1736)
  BACKTRACE: 19 stack frames:
   #0 0x8247278 <smb_panic+164> at /var/tmp/samba/sbin/smbd
   #1 0x82346e8 <debug_parse_levels+1224> at /var/tmp/samba/sbin/smbd
   #2 0x88826183 <sigaction+2503> at /usr/lib/libpthread.so.2
   #3 0xbfbfff94
   #4 0x888f39eb <abort+87> at /lib/libc.so.6
   #5 0x822eac0 <seekdir+0> at /var/tmp/samba/sbin/smbd
   #6 0x8236ff7 <sys_telldir+27> at /var/tmp/samba/sbin/smbd
   #7 0x810dce5 <posix_mangle_init+1337> at /var/tmp/samba/sbin/smbd
   #8 0x809e5df <ReadDirName+143> at /var/tmp/samba/sbin/smbd
   #9 0x809e8a2 <dptr_SearchDir+126> at /var/tmp/samba/sbin/smbd
   #10 0x809e977 <dptr_ReadDirName+155> at /var/tmp/samba/sbin/smbd
   #11 0x80d986c <send_trans2_replies+9296> at /var/tmp/samba/sbin/smbd
   #12 0x80db970 <send_trans2_replies+17748> at /var/tmp/samba/sbin/smbd
   #13 0x80e1e84 <handle_trans2+4132> at /var/tmp/samba/sbin/smbd
   #14 0x80e6959 <reply_trans2+2957> at /var/tmp/samba/sbin/smbd
   #15 0x8100615 <smb_fn_name+981> at /var/tmp/samba/sbin/smbd
   #16 0x8101afd <smbd_process+2613> at /var/tmp/samba/sbin/smbd
   #17 0x82f1b41 <main+2261> at /var/tmp/samba/sbin/smbd
   #18 0x8089ce1 <_start+137> at /var/tmp/samba/sbin/smbd
[2007/08/29 00:38:45, 0] lib/util.c:smb_panic(1637)
  smb_panic(): calling panic action [/bin/sleep 999999999]


And gdb backtrace looks like:

[Switching to LWP 100323]
0x88895db1 in wait4 () at wait4.S:2
2       RSYSCALL(wait4)
(gdb) bt
#0  0x88895db1 in wait4 () at wait4.S:2
#1  0x8885df34 in __system (command=0x88a1efb0 "/bin/sleep 999999999") at /usr/src/lib/libc/stdlib/system.c:91
#2  0x88820423 in _system (string=0x88a1efb0 "/bin/sleep 999999999") at /usr/src/lib/libpthread/thread/thr_system.c:47
#3  0x082472c0 in smb_panic (why=0x834fafa "internal error") at lib/util.c:1638
#4  0x082346e8 in sig_fault (sig=6) at lib/fault.c:47
#5  0x88826183 in _thr_sig_handler (sig=6, info=0xbfbf9980, ucp=0xbfbf96c0) at /usr/src/lib/libpthread/thread/thr_sig.c:392
#6  0xbfbfff94 in ?? ()
#7  0x00000006 in ?? ()
#8  0xbfbf9980 in ?? ()
#9  0xbfbf96c0 in ?? ()
#10 0x00000000 in ?? ()
#11 0x88825ddc in _thr_sig_dispatch (curkse=0xbfbf9a00, sig=-1077962232, info=0x0) at /usr/src/lib/libpthread/thread/thr_sig.c:291
#12 0x888f39eb in abort () at /usr/src/lib/libc/stdlib/abort.c:69
#13 0x0822eac0 in telldir (dir=0x88a0fc00) at lib/replace/repdir_getdirentries.c:135
#14 0x08236ff7 in sys_telldir (dirp=0x88a0fc00) at lib/system.c:492
#15 0x0810dce5 in vfswrap_telldir (handle=0x88a0f030, dirp=0x88a0fc00) at modules/vfs_default.c:127
#16 0x0809e5df in ReadDirName (dirp=0x88970540, poffset=0xbfbfa004) at smbd/dir.c:1169
#17 0x0809e8a2 in dptr_normal_ReadDirName (dptr=0x889b3680, poffset=0xbfbfa004, pst=0xbfbfb420) at smbd/dir.c:563
#18 0x0809e977 in dptr_ReadDirName (dptr=0x889b3680, poffset=0xbfbfa004, pst=0xbfbfb420) at smbd/dir.c:642
#19 0x080d986c in get_lanman2_dir_entry (conn=0x88a16030, inbuf=0x0, outbuf=0x88a49000 "", path_mask=0xbfbfb5b0 "*", 
    dirtype=<error type>, info_level=260, requires_resume_key=4, dont_descend=0, ppdata=0xbfbfb538, base_data=0x88a91000 "`", 
    space_remaining=14168, out_of_space=0xbfbfb53c, got_exact_match=0xbfbfb540, last_entry_off=0xbfbfb544, name_list=0x0, 
    ea_ctx=0x0) at smbd/trans2.c:1149
#20 0x080db970 in call_trans2findfirst (conn=0x88a16030, inbuf=0x88a28000 "", outbuf=0x88a49000 "", bufsize=65535, 
    pparams=0x88a1a368, total_params=18, ppdata=0x88a1a370, total_data=0, max_data_bytes=<error type>) at smbd/trans2.c:1857
#21 0x080e1e84 in handle_trans2 (conn=0x88a16030, state=0x88a1a230, inbuf=0x88a28000 "", outbuf=0x88a49000 "", size=90, 
    bufsize=65535) at smbd/trans2.c:6382
#22 0x080e6959 in reply_trans2 (conn=0x88a16030, inbuf=0x88a28000 "", outbuf=0x88a49000 "", size=90, bufsize=65535)
    at smbd/trans2.c:6652
#23 0x08100615 in switch_message (type=50, inbuf=0x88a28000 "", outbuf=0x88a49000 "", size=90, bufsize=65535)
    at smbd/process.c:1003
#24 0x08101afd in smbd_process () at smbd/process.c:1030
#25 0x082f1b41 in main (argc=4, argv=0xbfbfecb0) at smbd/server.c:1120
Current language:  auto; currently asm


Comment 29 Timur Bakeyev 2007-09-11 21:35:34 UTC
(In reply to comment #24)
> Ok, I'm going to parameterize the directory cache so it can be disabled on *BSD
> systems. I'll try and get this into any releases after 3.0.25c.

Tried this patch independently and in 3.0.26a - it seems it doesn't address the problem. Running 'smbtorture4 RAW-SEARCH' against a UFS share, with the cache both enabled and disabled and with the native libc seekdir/telldir, leaves files in the test directory. With caching enabled, 673 out of 700 files are deleted; with caching disabled, only 4 are...

Would it be possible to retrieve the type of the FS underneath the share and use different seekdir/telldir routines? Actually, that possibly won't help either, as the same code is used for all the FS types; just the blocksize is different, which breaks the replacement functions...

I'm lost at the moment... How did it work in pre-3.0.25 era?