Bug 2620 - Oplock PANIC when using NetApp opl1 Test Suite Tool; Resource deadlock avoided.
Summary: Oplock PANIC when using NetApp opl1 Test Suite Tool; Resource deadlock avoided.
Status: RESOLVED WORKSFORME
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: File Services (show other bugs)
Version: 3.0.11
Hardware: All Windows XP
: P3 normal
Target Milestone: none
Assignee: Samba Bugzilla Account
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-04-15 14:31 UTC by Tom Lackemann
Modified: 2006-04-20 08:25 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Tom Lackemann 2005-04-15 14:31:19 UTC
This is somewhat easy to reproduce using the NetApp tool opl1.exe. Sorry if this 
is too much information; although I think more is better than less.
Note that I built Samba 3.0.11 with "-g" option so I could get around easily in 
gdb.

Here is my environment setup:
Samba 3.0.11 serving a single share "pub"
Two Windows machines, mapped to "pub", each running the opl1 tool:
   winclient1: opl1 -t1 z:\m.bat
   winclient2: opl1 -t1 f:\m.bat

smb.conf:
[global]
        workgroup = TJAPAN
        realm = TJAPAN.ACME.NET
        netbios name = iqtest
        security = ADS
        password server = japan1,japan2
        encrypt passwords = Yes
        winbind uid = 10000-65534
        winbind gid = 10000-65534
        debug pid = yes

[pub]
        path = /shares/pub
        read only = yes
        valid users = "TJAPAN\Domain Users",
        write list = "TJAPAN\Domain Users",

Here is an overview of the opl1 tool, straight from NetApp Test Suite readme 
file:
---------------
opl1 is a command line test to test file sharing. It is designed to cause oplock 
breaks. N copies of opl1 are run against the same file either on the same 
machine or using different machines.

   usage: G:\OPL1.EXE [filename][-?hvt]
           filename - test file (default: opl1.dat)
           -?|h     - display help
           -t#      - set wait time in ms (default 5000ms)
           -v       - toggle messages (default: off)
   Program flow:
     while (true) {
       open file
       wait specified time
       perform single operation
       wait specified time
       close file
       wait specified time
     }

The usual test will be for two machines to open the same file. We generally run 
with both the default time (which increase the probability that two system will 
collide) and with -t = 1 (which increases the number of test cases). Locking 
problems show themselves as pauses in the test or stats within the filer. The 
test does not explicitly detect or flag errors.
-----------

While the test is running if you tail the smb.log on the server, after a while 
one of the Samba processes will panic with something like:
release_level_2_oplocks_on_change(1287)
  release_level_2_oplocks_on_change: PANIC. File m.bat still has a level II 
oplock.

A common precursor to this is a locking error reported by the other samba client 
daemon:
[2005/04/15 16:42:07, 0, pid=20423] tdb/tdbutil.c:tdb_log(725)
  tdb(/var/tmp/iqtest/locking.tdb): tdb_lock failed on list 57 ltype=1 (Resource 
deadlock avoided)
[2005/04/15 16:42:07, 0, pid=20423] smbd/oplock.c:oplock_break(708)
  oplock_break: unable to lock share entry for file m.bat

I added an F_GETLK fcntl() call to tdb_brlock() on a failure to verify it was 
the other Samba daemon holding the lock. I attached to both daemons from gdb, 
set a breakpoint on the F_SETLKW failure. Here are the relevant gdb backtraces 
and frame information for both:

PID 20423:
Breakpoint 1, tdb_brlock (tdb=0x8348770, offset=396, rw_type=1, lck_type=14,
    probe=0) at tdb/tdb.c:288
288              errno = saved_errno;
(gdb) bt
#0  tdb_brlock (tdb=0x8348770, offset=396, rw_type=1, lck_type=14, probe=0)
    at tdb/tdb.c:288
#1  0x081b9aaa in tdb_lock (tdb=0x8348770, list=57, ltype=1) at tdb/tdb.c:315
#2  0x081bc7bf in tdb_chainlock (tdb=0x8348770, key=
      {dptr = 0x82a2f20 "\001▒~V~R", dsize = 16}) at tdb/tdb.c:1974
#3  0x08172246 in lock_share_entry_fsp (fsp=0x83539f8) at locking/locking.c:392
#4  0x081d4a2a in oplock_break (dev=65025, inode=2582110, file_id=1026,
    local_request=0) at smbd/oplock.c:707
#5  0x081d42a4 in process_local_message (buffer=0x159f9008 "\026",
    buf_size=131137) at smbd/oplock.c:450
#6  0x080c8107 in async_processing (buffer=0x159f9008 "\026",
    buffer_len=131137) at smbd/process.c:285
#7  0x080c8624 in receive_message_or_smb (buffer=0x159f9008 "\026",
    buffer_len=131137, timeout=60000) at smbd/process.c:469
#8  0x080c9bc5 in smbd_process () at smbd/process.c:1543
#9  0x08211ef8 in main (argc=4, argv=0x3ffff7f4) at smbd/server.c:951
#10 0x1577d1c4 in __libc_start_main () from /lib/libc.so.6
(gdb) l
283
284              fcntl(tdb->fd, F_GETLK, &fl);
285              TDB_LOG((tdb, 0, "F_GETLCK:l_type:%d, l_whence:%d, l_pid:%d"
286                       " l_start:%Ld, l_len:%Ld\n", fl.l_type, fl.l_whence,
287                       fl.l_pid, fl.l_start, fl.l_len));
288              errno = saved_errno;
289                     }
290                     return TDB_ERRCODE(TDB_ERR_LOCK, -1);
291             }
292             return 0;
(gdb) p fl
$1 = {l_type = 1, l_whence = 0, l_start = 396, l_len = 1, l_pid = 20487}
(gdb) p errno
$2 = 35

PID 20487: (hit ctrl-c to see where it's blocked)
Program received signal SIGINT, Interrupt.
0x15835674 in open () from /lib/libc.so.6
(gdb) bt
#0  0x15835674 in open () from /lib/libc.so.6
#1  0x1588830c in __DTOR_END__ () from /lib/libc.so.6
#2  0x0819a1b0 in sys_open (path=0x3ffff120 "m.bat", oflag=66, mode=496)
    at lib/system.c:345
#3  0x080bebf4 in vfswrap_open (handle=0x0, conn=0x834fb38,
    fname=0x3ffff120 "m.bat", flags=66, mode=496) at smbd/vfs-wrap.c:190
#4  0x080b7206 in fd_open (conn=0x834fb38, fname=0x3ffff120 "m.bat", flags=66,
    mode=496) at smbd/open.c:47
#5  0x080b74ef in open_file (fsp=0x834c788, conn=0x834fb38,
    fname=0x3ffff120 "m.bat", psbuf=0x3ffff0c0, flags=578, mode=496,
    desired_access=131487) at smbd/open.c:175
#6  0x080b9487 in open_file_shared1 (conn=0x834fb38, fname=0x3ffff120 "m.bat",
    psbuf=0x3ffff0c0, desired_access=131487, share_mode=66, ofun=18,
    new_dos_mode=0, oplock_request=3, Access=0x3fffecac, paction=0x3fffecb0)
    at smbd/open.c:1250
#7  0x08096010 in reply_ntcreate_and_X (conn=0x834fb38, inbuf=0x159f9008 "",
    outbuf=0x15a1a008 "", length=102, bufsize=131072) at smbd/nttrans.c:823
#8  0x080c8d4a in switch_message (type=162, inbuf=0x159f9008 "",
    outbuf=0x15a1a008 "", size=102, bufsize=131072) at smbd/process.c:968
#9  0x080c8dd9 in construct_reply (inbuf=0x159f9008 "", outbuf=0x15a1a008 "",
    size=102, bufsize=131072) at smbd/process.c:998
#10 0x080c90e9 in process_smb (inbuf=0x159f9008 "", outbuf=0x15a1a008 "")
    at smbd/process.c:1098
#11 0x080c9bfd in smbd_process () at smbd/process.c:1558
#12 0x08211ef8 in main (argc=4, argv=0x3ffff7f4) at smbd/server.c:951
#13 0x1577d1c4 in __libc_start_main () from /lib/libc.so.6

Someone who understands the oplock logic can maybe figure out how this can 
occur. To me it looks as if the open() (pid 20487) call triggered a kernel 
oplock break to the other samba daemon (pid 20423). Granted pid 20487 has the 
locking.tdb list locked for this dev/inode so I'm not sure how pid 20423 placed 
an oplock since it seems you need to aquire the tdb lock prior to placing the 
file oplock. But I'm rambling.

If you need more info, please email me. I'm going to continue to look at this as 
well.
Comment 1 Jeremy Allison 2005-04-15 15:21:24 UTC
Do you have a copy of this tool I can download and use ? That would be helpful I
think.

Thanks,

Jeremy.
Comment 2 Tom Lackemann 2005-04-16 09:00:36 UTC
Sorry, having the tool would help I guess.

I believe opl1.exe depends on a few dll's in the netapp-testtools.exe package. 
But you can find this package at the Connectathon 2005 site, http://www.
connectathon.org/, under the Test Suites downloads.

TomL
Comment 3 Jeremy Allison 2005-06-02 14:58:14 UTC
Ok, I've been trying to reproduce this with the latest SVN build of Samba 3.0.x
(revision 7198) and can't. I'm using a 2.6.10 kernel... Can you try and
reproduce this with our latest SVN code please ?
Jeremy.
Comment 4 Gerald (Jerry) Carter (dead mail address) 2006-04-20 08:25:12 UTC
closing.  no response.