Bug 3010 - samba-3.0.14a & samba-3.0.20pre2 endless loop AIX 5.3 (jfs2) & Win98
Status: RESOLVED FIXED
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: File Services
Version: 3.0.20
Hardware: Other Windows 98
Importance: P3 normal
Target Milestone: none
Assignee: Jeremy Allison
QA Contact: Samba QA Contact
Reported: 2005-08-17 20:13 UTC by Steve Williams
Modified: 2005-09-29 04:14 UTC (History)

Attachments
3.0.14a OR 3.0.20 debug from PC looping (46.62 KB, application/zip)
2005-08-17 20:21 UTC, Steve Williams
log level 10 of 3.0.20rc2 and Win98 DOS client (335.43 KB, text/plain)
2005-08-18 08:33 UTC, William Jojo
ZIP file containing debug 10 with AIX 5.3 and Samba 3.0.20 (530.32 KB, application/zip)
2005-08-18 13:03 UTC, Steve Williams
lame patch...still digging on bad offset values. (1.47 KB, patch)
2005-08-21 10:47 UTC, William Jojo
Proposed patch. (917 bytes, patch)
2005-08-21 14:09 UTC, Jeremy Allison
Fix going into 3.0.20a. (1.01 KB, patch)
2005-09-27 13:37 UTC, Jeremy Allison

Description Steve Williams 2005-08-17 20:13:25 UTC
I have replaced an older AIX system with a new one running AIX 5.3, all
the latest patches.  It is acting as a PDC (I think irrelevant).  The
old server was running AIX 4.3.2 with Samba 3.0.14a (upgraded from
2.0.7) , and was working 100% fine. I had the old server running 3.0.14a
for 6 weeks prior to the upgrade as part of my migration plan.

There are Windows 98 boxes that connect to this server (workgroup), as
well as XP SP2 boxes that connect to the server (domain).  The shares
that I am having problems with are on IBM's "jfs2" filesystem.

The XP boxes are working perfectly.

The Windows 98 boxes work to read and save files.  HOWEVER... if one
"Explores" into one of the folders, Samba goes into an endless loop. 
The little flashlight in Windows 98 Explorer just keeps waving back and
forth.

The behavior can be duplicated by going into a DOS prompt and doing a
"DIR" on the shared directory.  It is more obvious what is happening,
because the screen updates continuously.  It just scrolls forever.  It
gets to the end of the directory listing and starts again at the
top...looping forever.

1.  AIX 4.3.2, jfs, samba-3.0.14a worked perfectly
2.  AIX 5.3, jfs2, samba-3.0.14a & samba-3.0.20pre2 have problem with 
Windows 98 computers
Comment 1 Steve Williams 2005-08-17 20:14:46 UTC
From samba email list:

Jeremy Allison wrote:

> On Wed, Aug 17, 2005 at 05:26:36PM -0500, Gerald (Jerry) Carter wrote:
>  
>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Steve Williams wrote:
>>
>>   
>>
>>> My "gut feeling" is that it is related to jfs2.  No concrete proof though.
>>> This is the ONLY problem we encountered
>>> with the entire upgrade, and the only thing that we did
>>> "radically" different was use jfs2 rather than JFS.  The advantage
>>> we saw was that JFS2 can "shrink" the filesystems, which can
>>> be nice in a year or two when requirements change.
>>>     
>>> Did you do testing on AIX?  I was not aware that I could get an "ext3" fs on
>>> AIX.  If you are interested in pursuing
>>> this further, I will try to set things up to do some
>>> troubleshooting...  I am remote to the  location & will
>>> need to have someone work with me.. not a big deal, they
>>> have a good summer student... but does need some coordination.
>>>     
>>
>> I spoke with Jeremy about it.  He believes that it is a
>> problem with  the way we implement resume keys now.  Apparently
>> only win9x uses resume keys these days in the findfirst/findnext
>> sequence.  WinNT and later uses resume by name.
>>   
>
>
> Although to confirm it I'd like to see a debug level 10 log
> of one of your clients "looping" with a directory listing
> against a 3.0.20 Samba server please.
>
> Jeremy.
>  
>
Hi,

That's cool, I will try to get this for you tomorrow morning. 
How would you like me to get this to you?

Cheers,
Steve Williams
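
The two resume styles mentioned in the quoted mail can be pictured with a toy model. This is only a sketch under assumptions: the names toy_names, toy_resume_by_key and toy_resume_by_name are hypothetical, the "directory" is an in-memory array, and nothing here is Samba code or the actual SMB wire format.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-memory "directory" used only for illustration. */
static const char *const toy_names[] = { "alpha", "bravo", "charlie", "delta" };
#define TOY_COUNT 4L

/*
 * Resume by key: the server hands the client an opaque position and the
 * client sends it back. The server has to map that value onto a directory
 * position again, so it depends on the offsets being handled correctly.
 * Returns the index of the next entry to send (TOY_COUNT means "done").
 */
static long toy_resume_by_key(long key)
{
    assert(key >= 0);
    return key < TOY_COUNT ? key : TOY_COUNT;
}

/*
 * Resume by name: the client sends the last file name it received and the
 * server continues after that name. Returns the index of the next entry,
 * or TOY_COUNT if the name was the last entry or is unknown.
 */
static long toy_resume_by_name(const char *last_name)
{
    long i;

    assert(last_name != NULL);
    for (i = 0; i < TOY_COUNT; i++) {
        if (strcmp(toy_names[i], last_name) == 0)
            return i + 1;
    }
    return TOY_COUNT;
}
```

In this model both toy_resume_by_key(2) and toy_resume_by_name("bravo") continue at "charlie"; the difference is only in what the client has to send back.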
Comment 2 Steve Williams 2005-08-17 20:21:05 UTC
Created attachment 1384
3.0.14a OR 3.0.20 debug from PC looping

I am not sure if this is a log file from 3.0.14a or from 3.0.20.  To be honest,
I am not even sure what level of log file it is!  I was trying to troubleshoot
the problem in a production environment.  I include it here in case it will
help.  I will try to create a new debug level 10 ASAP
Comment 3 Steve Williams 2005-08-17 20:32:14 UTC
Further investigation has revealed the attached logfile is from 3.0.14a.

I am not sure that it is an internal problem in Samba.  I was running 3.0.14a on
AIX 4.3.2 for several months prior to the upgrade with Windows 98 hosts
accessing the system with no problem at all.

I installed a new system with AIX 5.3 and a freshly compiled 3.0.14a and
encountered the problem.  If it was a problem with Samba internals, I would have
thought that the problem would have arisen when I upgraded the original server
from 2.0.7 to 3.0.14a.

Comment 4 Steve Williams 2005-08-17 20:34:48 UTC
OOPS..

I forgot to mention... I upgraded the server to 3.0.20 to find out if the
problem went away.  It did not.  The 3.0.20 was compiled locally with IBM's C
compiler.

./configure --prefix=/usr/local/samba-3.0.20

Looking at the configure and compile logfiles everything seemed to work fine.
Comment 5 Jeremy Allison 2005-08-17 20:53:08 UTC
There is a known bug with 3.0.14a that can cause this. Please test the same thing
immediately with 3.0.20 as I've done a *lot* of work in this area between the
two releases.

Thanks,

Jeremy.
Comment 6 William Jojo 2005-08-18 08:33:18 UTC
Created attachment 1386
log level 10 of 3.0.20rc2 and Win98 DOS client

This is running on AIX 5.2 ML-06 on a JFS (not JFS2) filesystem.
Comment 7 Steve Williams 2005-08-18 09:03:32 UTC
Jerry,

I have an appointment August 18 at 15:00 Eastern Canada time (GMT-5??)  to get a
debug level 10 from my client's system.  Just a FYI..

Cheers,
Steve
Comment 8 Steve Williams 2005-08-18 13:03:23 UTC
Created attachment 1388
ZIP file containing debug 10 with AIX 5.3 and Samba 3.0.20

map_drive.log - This is a Windows 98 PC "win98test" connecting to a samba share
called "\\OSHAWA\EKG", and mapping it to drive letter.

log.win98test.part1
log.win98test.part2
Are debug level 10 outputs from the "looping" problem.
Comment 9 William Jojo 2005-08-21 10:47:13 UTC
Created attachment 1391
lame patch...still digging on bad offset values.

Ok, this patch is lame, but points out the flaw. I had to mod some DEBUGs to
find it. SeekDir RewindDir's when dptr->offset==END_OF_DIRECTORY_OFFSET. The
real problem is reply_search doesn't update the values with dptr_fill properly
because of a scope problem in smbd/dir.c get_dir_entry() or AIX's telldir is
broken...but since 3.0.11 works fine, I'm leaning toward the former.
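
The looping symptom can be reproduced in a toy model. The sketch below is not Samba code: TOY_END_OF_DIRECTORY, struct toy_dir and the toy_* functions are hypothetical stand-ins, and the sentinel value is an assumption. It only shows how a seek routine that rewinds on an "end of directory" resume value makes a batched listing start over.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel; the real END_OF_DIRECTORY_OFFSET may differ. */
#define TOY_END_OF_DIRECTORY (-1L)

struct toy_dir {
    const char *const *names; /* entries in the directory */
    long count;               /* number of entries */
    long pos;                 /* next index to return, or the sentinel */
};

/* Return the next name, or NULL at the end; *resume is the next resume key. */
static const char *toy_read(struct toy_dir *d, long *resume)
{
    assert(d != NULL && resume != NULL);
    if (d->pos == TOY_END_OF_DIRECTORY || d->pos >= d->count) {
        d->pos = TOY_END_OF_DIRECTORY;
        *resume = TOY_END_OF_DIRECTORY;
        return NULL;
    }
    *resume = d->pos + 1;
    return d->names[d->pos++];
}

/* Buggy seek: any offset it does not recognise, including the sentinel,
 * causes a rewind to the start of the directory. */
static void toy_seek_buggy(struct toy_dir *d, long offset)
{
    d->pos = (offset < 0 || offset > d->count) ? 0 : offset;
}

/* Fixed seek: the sentinel means "stay at the end". */
static void toy_seek_fixed(struct toy_dir *d, long offset)
{
    if (offset == TOY_END_OF_DIRECTORY) {
        d->pos = TOY_END_OF_DIRECTORY;
        return;
    }
    d->pos = (offset < 0 || offset > d->count) ? 0 : offset;
}

/* Simulated client: each request seeks to the last resume key and reads up
 * to two entries. Returns the number of entries received, stopping at an
 * empty reply or once the safety cap `limit` is reached. */
static long toy_list(void (*seek)(struct toy_dir *, long), long limit)
{
    static const char *const names[] = { "a.txt", "b.txt", "c.txt" };
    struct toy_dir d = { names, 3, 0 };
    long resume = 0, total = 0;

    for (;;) {
        long got = 0;
        seek(&d, resume);
        while (got < 2 && toy_read(&d, &resume) != NULL)
            got++;
        total += got;
        if (got == 0 || total >= limit)
            return total;
    }
}
```

With toy_seek_fixed the listing ends after the three entries. With toy_seek_buggy it only stops because of the safety cap, which mirrors the DIR output that starts again at the top.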
Comment 10 Jeremy Allison 2005-08-21 11:31:01 UTC
I don't think this part of the patch is correct :

 void SeekDir(struct smb_Dir *dirp, long offset)
 {
+
+	if ( dirp->offset == END_OF_DIRECTORY_OFFSET )
+		return ;
+

Shouldn't this be 

 void SeekDir(struct smb_Dir *dirp, long offset)
 {
+
+	if ( offset == END_OF_DIRECTORY_OFFSET )
+		return ;
+

instead ?

Jeremy.
Comment 11 Jeremy Allison 2005-08-21 14:09:55 UTC
Created attachment 1392
Proposed patch.

Can you try this patch instead please. I think it may fix the problems with
END_OF_DIRECTORY_OFFSET not being handled consistently.
Thanks,
Jeremy.
Comment 12 Jeremy Allison 2005-08-21 14:10:21 UTC
Reassigned to me - probably my bug.
Jeremy.
Comment 13 Steve Williams 2005-08-21 14:38:22 UTC
Jeremy,

What would you like me to try this patch against?  In the interest of least
change, I would be inclined to test it against 3.0.20pre2.  However, I have
downloaded & compiled 3.0.20 release.  I'd need to put it into production, but
that's not a big issue.

What is your preference?

Thanks,
Steve
Comment 14 Jeremy Allison 2005-08-21 14:43:28 UTC
Try against 3.0.20 - that's what I've applied it to. If your analysis is correct
on the mishandling of the END_OF_DIRECTORY "magic" value I'm hoping it'll work.
You might want to try it on a non-production server first - especially if the
looping behaviour is reproducible on demand.

Jeremy
Comment 15 Steve Williams 2005-08-24 09:17:16 UTC
The primary problem is that AIX does not have a "DIR" abstraction between a
"normal" directory entry (32 bit??) and 64 bit DIR entry.  Instead, they chose
to have a DIR, and a DIR64.  The assumption throughout configure and Samba was
that "DIR" would always be the "correct" type.  Well, on AIX it isn't.  This was
causing configure to do assorted "random" things, mixing 32 bit & 64 bit calls,
thus hammering memory, or subsequent calls not finding what they were expecting.

There was a change to return properly at the end of a directory, as well as
changes to configure.in to test for "telldir64", etc.

Jeremy added an abstraction "SMB_STRUCT_DIR" which will always be either DIR,
or DIR64 as appropriate.
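
The idea behind such an abstraction can be sketched as follows. This is an illustration only, not the contents of the SVN patches: TOY_HAVE_DIR64 is a hypothetical configure result, the toy_* names are made up, and the 64-bit branch assumes the platform provides opendir64, readdir64 and closedir64 alongside DIR64 (that branch is untested here).

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>

/* Pick one directory handle type and one matching set of calls, so the
 * rest of the code never mixes 32-bit and 64-bit variants. */
#if defined(TOY_HAVE_DIR64)
typedef DIR64 toy_dir_t;            /* assumed to exist on this platform */
#define toy_opendir  opendir64
#define toy_readdir  readdir64
#define toy_closedir closedir64
#else
typedef DIR toy_dir_t;
#define toy_opendir  opendir
#define toy_readdir  readdir
#define toy_closedir closedir
#endif

/* Count the entries in `path` through the abstraction only.
 * Returns -1 if the directory cannot be opened. */
static long toy_count_entries(const char *path)
{
    toy_dir_t *d;
    long n = 0;

    assert(path != NULL);
    d = toy_opendir(path);
    if (d == NULL)
        return -1;
    while (toy_readdir(d) != NULL)
        n++;
    toy_closedir(d);
    return n;
}
```

Callers such as toy_count_entries(".") never name DIR or DIR64 directly, so the choice is made in exactly one place.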

To properly resolve this problem, the following SVN patches were made.  These
have been applied to the 3.0.20 tree and have resolved the problem.

Most of the work was done by William Jojo to troubleshoot this problem.

svn_9456.patch
svn_9481.patch
svn_9484.patch
svn_9534.patch
svn_9536.patch
svn_9545.patch

Cheers,
Steve

This can be considered "RESOLVED".
Comment 16 Gerald (Jerry) Carter (dead mail address) 2005-08-24 10:23:18 UTC
sorry for the same, cleaning up the database to prevent unnecessary reopens of bugs.
Comment 17 Paul Kranenburg 2005-08-29 00:30:22 UTC
This PR is still not fully resolved. An endless loop over `.' and `..' entries
still occurs if the directory only contains entries that the client does not
want to see (e.g. entries that are invisible or don't match a requested pattern).


Suggested patch:

*** dir.c.trunk	Mon Aug 29 08:38:02 2005
--- dir.c.fix	Mon Aug 29 08:37:56 2005
***************
*** 1136,1142 ****
  void RewindDir(struct smb_Dir *dirp, long *poffset)
  {
  	SMB_VFS_REWINDDIR(dirp->conn, dirp->dir);
! 	dirp->file_number = 0;
  	dirp->offset = START_OF_DIRECTORY_OFFSET;
  	*poffset = START_OF_DIRECTORY_OFFSET;
  }
--- 1136,1143 ----
  void RewindDir(struct smb_Dir *dirp, long *poffset)
  {
  	SMB_VFS_REWINDDIR(dirp->conn, dirp->dir);
! 	if (*poffset != DOT_DOT_DIRECTORY_OFFSET)
! 		dirp->file_number = 0;
  	dirp->offset = START_OF_DIRECTORY_OFFSET;
  	*poffset = START_OF_DIRECTORY_OFFSET;
  }
Comment 18 Jeremy Allison 2005-09-27 12:26:09 UTC
This patch is not correct. Still examining the right way to fix this.
Jeremy.
Comment 19 Jeremy Allison 2005-09-27 12:34:00 UTC
I've now reproduced this. Working on a final fix.
Jeremy.
Comment 20 Jeremy Allison 2005-09-27 13:37:42 UTC
Created attachment 1459
Fix going into 3.0.20a.
Comment 21 Jeremy Allison 2005-09-27 13:38:30 UTC
Hopefully the long saga of this bug is now at an end.... Please test and let me
know.
Thanks,
Jeremy.
Comment 22 Paul Kranenburg 2005-09-29 04:14:50 UTC
(In reply to comment #21)

Yes, this fixes the problem, thanks!