On SuSE SLES8 SP3 s390 our tdbs frequently become corrupted. The biggest culprits tend to be winbindd_cache.tdb and connections.tdb. This happens with all versions of Samba from 2.2.8a through 3.0.1. While our connections.tdb and sessionid.tdb files have been fine for the last few days since we increased the log level to 3, I looked back through the logs and noticed that one of the .tdb files we weren't monitoring (winbindd_cache.tdb) quickly became corrupt. We couldn't easily monitor this one, because the tool we've been using (tdbbackup) always reports an error on winbindd_cache.tdb for Samba 2, even when the tdb is freshly generated. But looking back in the logs, the corruption happened within about half an hour of the winbindd_cache being rebuilt:

Jan 30 12:24:27 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=43544
Jan 30 12:24:27 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=43544
Jan 30 12:24:27 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=43544
Jan 30 12:24:27 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=43256
Jan 30 12:24:27 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=43256
(etc)
Jan 30 12:24:27 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=42532
Jan 30 12:24:27 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=42532
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): tdb_oob len 1111638618 beyond eof at 57344
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): tdb_oob len 1111638618 beyond eof at 57344
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=44968
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=44968
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=44968
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=44660
(etc)
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=41356
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=41356
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0xd9fee666 at offset=41068
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0xd9fee666 at offset=41068
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=44592
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=44592
(etc)

On a Samba 3.0.1 print server, winbindd aborted:

string_to_sid: Sid BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB does not start with 'S-'.
string_to_sid: Sid BBBBBBBBBBB does not start with 'S-'.
string_to_sid: Sid 648334-1534^Gftpuser^A does not start with 'S-'.
string_to_sid: Sid 48334-153fitzpatp^A does not start with 'S-'.
string_to_sid: Sid 24891-1907648334-154quintasv^A does not start with 'S-'.
string_to_sid: Sid 20293289-429224891-1907648334-1552^Eletol^A does not start with 'S-'.
string_to_sid: Sid does not start with 'S-'.
string_to_sid: Sid -5-21-2020293289-429224891-1907648334-1554^Fstumpc does not start with 'S-'.
string_to_sid: Sid 1-5-21-2020293289-429224891-1907648334-156g does not start with 'S-'.
string_to_sid: Sid eickmac^A does not start with 'S-'.
string_to_sid: Sidwsmadmin^A does not start with 'S-'.
string_to_sid: Sid 8334-1588^Elambd^A does not start with 'S-'.
string_to_sid: Sid -1907648334-160nelavaij^A does not start with 'S-'.
string_to_sid: Sid 93289-429224891-1907648334-1604^Fmentzk^A does not start with 'S-'.
string_to_sid: Sid 21-2020293289-429224891-1907648334-160stric does not start with 'S-'
string_to_sid: Sid alkera^A does not start with 'S-'.
string_to_sid: Sid -162cavinesd^A does not start with 'S-'.
etc. etc.

It must have started pointing into the middle of records instead of at the beginnings. Anyway, it ended like this:

string_to_sid: Sid 0293289-429224891-1907648334-1787^Gwpscomm^A does not start with 'S-'.
string_to_sid: Sid 5-21-2020293289-429224891-1907648334-1796^Fsmi does not start with 'S-'.
nsswitch/winbindd_cache.c:centry_uint8(175) centry corruption? needed 1 bytes, have 0
lib/util.c:smb_panic(1400) PANIC: centry_uint32
lib/util.c:smb_panic(1408) BACKTRACE: 12 stack frames:
 #0 [0x8042ee3a]
 #1 [0x8042eea4]
 #2 [0x8042efe8]
 #3 [0x804319d4]
 #4 [0x80429b22]
 #5 [0x8042a8e0]
 #6 [0x80426c7e]
 #7 [0x80427028]
 #8 [0x80427a58]
 #9 [0x8042807e]
 #10 [0xc024f7bc]
 #11 [0x804263e0]

Though there was a backtrace, I doubt it will be too helpful, because this is just when the corruption was noticed.
But the backtrace is really:

 #0 [0x0042ee3a]
 #1 [0x0042eea4]
 #2 [0x0042efe8]
 #3 [0x004319d4]
 #4 [0x00429b22]
 #5 [0x0042a8e0]
 #6 [0x00426c7e]
 #7 [0x00427028]
 #8 [0x00427a58]
 #9 [0x0042807e]
 #10 [0x4024f7bc]
 #11 [0x004263e0]

# file winbindd
winbindd: ELF 32-bit MSB executable, version 1 (SYSV), dynamically linked (uses shared libs), not stripped

# readelf -l winbindd
Elf file type is EXEC (Executable file)
Entry point 0x4263a8
There are 7 program headers, starting at offset 52

Program Headers:
  Type          Offset   VirtAddr   PhysAddr   FileSiz  MemSiz   Flg Align
  PHDR          0x000034 0x00400034 0x00400034 0x000e0  0x000e0  R E 0x4
  INTERP        0x000114 0x00400114 0x00400114 0x0000d  0x0000d  R   0x1
      [Requesting program interpreter: /lib/ld.so.1]
  LOAD          0x000000 0x00400000 0x00400000 0x15f874 0x15f874 R E 0x1000
  LOAD          0x15f878 0x00560878 0x00560878 0x06d40  0x198b4  RW  0x1000
  DYNAMIC       0x165f98 0x00566f98 0x00566f98 0x00140  0x00140  RW  0x4
  NOTE          0x000124 0x00400124 0x00400124 0x00020  0x00020  R   0x4
  GNU_EH_FRAME  0x15f858 0x0055f858 0x0055f858 0x0001c  0x0001c  R   0x4

Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.ABI-tag .hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr
   03     .data .eh_frame .dynamic .ctors .dtors .jcr .got .bss
   04     .dynamic
   05     .note.ABI-tag
   06     .eh_frame_hdr

# ldd winbindd | awk '{print $4 " " $3 }' | sort
(0x40000000) /lib/ld.so.1
(0x4001b000) /lib/libcrypt.so.1
(0x4004e000) /lib/libattr.so.1
(0x40052000) /lib/libresolv.so.2
(0x40066000) /lib/libnsl.so.1
(0x4007d000) /lib/libdl.so.2
(0x40081000) /usr/lib/libpopt.so.0
(0x40088000) /usr/lib/libcom_err.so.1
(0x4008b000) /usr/lib/libgssapi.so.1
(0x40098000) /usr/lib/libkrb5.so.17
(0x400d6000) /usr/lib/libasn1.so.5
(0x400fd000) /usr/lib/libcrypto.so.0.9.6
(0x401de000) /usr/lib/libroken.so.9
(0x401f3000) /usr/lib/libldap.so.2
(0x40225000) /usr/lib/liblber.so.2
(0x40233000) /lib/libc.so.6
(0x40358000) /usr/lib/libdb-4.0.so
(0x403fc000) /usr/lib/libsasl.so.7
(0x40411000) /usr/lib/libssl.so.0.9.6
(0x40440000) /lib/libpam.so.0
(0x40449000) /usr/lib/libgdbm.so.2

Portions of nm output around the backtrace addresses:

Winbind:
004263a8 T _start
004263f4 t __do_global_dtors_aux
0042645c t frame_dummy
004264bc t reload_services_file
00426588 t dump_core
004267b8 t fault_quit
004267e0 t winbindd_status
004269c8 t print_winbindd_status
004269f8 t flush_caches
00426a20 t terminate
00426a84 t termination_handler
00426abc t sigusr2_handler
00426af4 t sighup_handler
00426b2c t msg_reload_services
00426b60 t msg_shutdown
00426b88 t process_request
00426d0c t new_connection
00426e60 t remove_client
00426edc t remove_idle_client
00426ff0 T winbind_process_packet
00427054 T winbind_client_read
004271dc t client_write
00427494 t process_loop
00427b3c T main
00428094 t winbindd_fill_pwent
004282dc T winbindd_getpwnam
004294c8 T winbindd_list_users
004297b0 t gr_mem_buffer
00429894 t fill_grent
00429904 t fill_grent_mem
0042a0e0 T winbindd_getgrnam
0042a594 T winbindd_getgrgid
0042a914 T winbindd_setgrent
0042e98c T winbindd_check_cache_size
0042eb80 t get_cache
0042ec54 t centry_free
0042ec9c t centry_uint32
0042ed94 t centry_uint8
0042ee5c t centry_string
0042efa4 t centry_sid
0042f00c t wcache_server_down
0042f0d8 t fetch_cache_seqnum
0042f3b8 t store_cache_seqnum
0042f600 t refresh_sequence_number
0042f7a4 t centry_expired
0042f934 t wcache_fetch
0042fc2c t centry_expand
0042fd00 t centry_put_uint32
0042fd7c t centry_put_uint8
0042fdc4 t centry_put_string
0042fe84 t centry_put_sid
0042fec0 T centry_start
0042ff48 t centry_end
0042fff8 t wcache_save_name_to_sid
00430110 t wcache_save_sid_to_name
00430210 t wcache_save_user
0043031c t query_user_list
00430784 t enum_dom_groups
00430ab8 t enum_local_groups
00430e68 t name_to_sid
0043109c t sid_to_name
004312b0 t query_user
00431540 t lookup_usergroups
00431880 t lookup_groupmem
00431c1c t sequence_number
00431c60 t trusted_domains
00431d2c t domain_sid
00431de8 t alternate_name
00431ea0 t traverse_fn
00431ef8 T wcache_invalidate_samlogon

nscd is off. Here is smb.conf for the two servers.

linps2, the Samba 3.0.1 test print server:

[global]
    workgroup = TCS_MAIN_DOM
    netbios name = linps2
    netbios aliases = gutenberg
    server string = Samba Server on linps2
    interfaces = eth0 127.0.0.1/24
    bind interfaces only = yes
    security = DOMAIN
    encrypt passwords = Yes
    password server = *
    username map = /etc/samba/smbusers
    log level = 1
    log file = /var/log/samba/%m.log
    mangling method = hash2
    preferred master = No
    domain master = No
    local master = No
    dns proxy = No
    wins server = tcs_main_pdc
    kernel oplocks = No
    winbind uid = 10000-20000
    winbind gid = 10000-20000
    winbind enum users = yes
    winbind enum groups = yes
    winbind use default domain = yes
    blocking locks = No
    locking = No
    oplocks = No
    level2 oplocks = No
    guest account = nobody
    load printers = no
    printer admin = @"TCS_MAIN_DOM\Domain Admins"
    printing = cups
    printcap name = cups

[printers]
    printable = yes
    path = /var/spool/samba

[print$]
    path = /var/lib/samba/drivers
    force group = +Domain Admins
    force create mode = 0664
    force directory mode = 0775
    write list = TCS_MAIN_DOM\jarboed, TCS_MAIN_DOM\mcginnig

[TC5-EAST-TWO]
    printable = yes
    path = /var/spool/samba

[TC5-DCEXEC]
    printable = yes
    path = /var/spool/samba

[TC5-WSM]
    printable = yes
    path = /var/spool/samba

(various other print shares)

For the Samba 2.2.8a test file server:

[global]
    workgroup = TCS_MAIN_DOM
    netbios name = linfs1
    server string = Samba Server on linfs1
    interfaces = eth0 127.0.0.1/24
    bind interfaces only = yes
    security = DOMAIN
    encrypt passwords = Yes
    password server = *
    username map = /etc/samba/smbusers
    log level = 3
    log file = /var/samba/log/%m.log
    mangling method = hash2
    preferred master = No
    domain master = No
    local master = Yes
    dns proxy = No
    wins server = tcs_main_pdc
    kernel oplocks = No
    winbind uid = 10000-20000
    winbind gid = 10000-20000
    winbind enum users = no
    winbind enum groups = no
    winbind use default domain = No
    blocking locks = No
    locking = No
    oplocks = No
    level2 oplocks = No
    guest account = nobody
    load printers = no

[netmail]
    browsable = yes
    path = /var/samba/netmail
    read only = no
    inherit acls = yes
    root preexec = /var/samba/scripts/initial_setup.sh '%u'

[netmail-root]
    browsable = no
    path = /var/samba/netmail
    valid users = @"Domain Admins" @"TCS_MAIN_DOM\Domain Admins"
    read only = no
    force group = TCS_MAIN_DOM\Domain Admins
    force user = root
    inherit acls = yes

[homes]
    comment = Personal Netmail Folder for %U
    path = /var/samba/netmail/%S
    valid users = TCS_MAIN_DOM\%S
    browseable = No
    read only = No
    inherit acls = yes
    root preexec = /var/samba/scripts/initial_setup.sh '%u'

Print jobs stopped being accepted when winbind shut down. Also, some time after connections.tdb becomes corrupt, all new connections are refused. What additional information would be helpful?
connections.tdb became corrupt between 9:15:00 and 9:20:00 AM today according to tdbbackup. Here are the tdb messages:

bc006727.log:[2004/02/03 09:19:39, 0] tdb/tdbutil.c:tdb_log(531)
bc006727.log:  tdb(/var/lib/samba/connections.tdb): rec_free_read non-free magic at offset=637606297 - fixing
bc006808.log:[2004/02/03 09:35:31, 0] tdb/tdbutil.c:tdb_log(531)
bc006808.log:  tdb(/var/lib/samba/connections.tdb): rec_free_read non-free magic at offset=637606297 - fixing
bc007110.log:[2004/02/03 09:38:04, 0] tdb/tdbutil.c:tdb_log(531)
bc007110.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=106096
bc007110.log:[2004/02/03 09:38:04, 0] tdb/tdbutil.c:tdb_log(531)
bc007110.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 106096
bc007110.log:[2004/02/03 09:38:48, 0] tdb/tdbutil.c:tdb_log(531)
bc007110.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=106096
bc007110.log:[2004/02/03 09:38:48, 0] tdb/tdbutil.c:tdb_log(531)
bc007110.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 106096
bc007384.log:[2004/02/03 09:46:25, 0] tdb/tdbutil.c:tdb_log(531)
bc007384.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=106096
bc007384.log:[2004/02/03 09:46:26, 0] tdb/tdbutil.c:tdb_log(531)
bc007384.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 106096
bc007306.log:[2004/02/03 09:49:54, 0] tdb/tdbutil.c:tdb_log(531)
bc007306.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=106096
bc007306.log:[2004/02/03 09:49:54, 0] tdb/tdbutil.c:tdb_log(531)
bc007306.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 106096
bc007298.log:[2004/02/03 11:35:06, 0] tdb/tdbutil.c:tdb_log(531)
bc007298.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=67656
bc007298.log:[2004/02/03 11:35:06, 0] tdb/tdbutil.c:tdb_log(531)
bc007298.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: right free failed at 67656
bc007298.log:[2004/02/03 11:35:06, 0] tdb/tdbutil.c:tdb_log(531)
bc007298.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=65176
bc007298.log:[2004/02/03 11:35:06, 0] tdb/tdbutil.c:tdb_log(531)
bc007298.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 65176
bc006722.log:[2004/02/03 09:47:27, 0] tdb/tdbutil.c:tdb_log(531)
bc006722.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=106096
bc006722.log:[2004/02/03 09:47:27, 0] tdb/tdbutil.c:tdb_log(531)
bc006722.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 106096
bc006785.log:[2004/02/03 11:33:02, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=67656
bc006785.log:[2004/02/03 11:33:02, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: right free failed at 67656
bc006785.log:[2004/02/03 11:33:02, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=65176
bc006785.log:[2004/02/03 11:33:02, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 65176
bc006785.log:[2004/02/03 11:36:50, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=67656
bc006785.log:[2004/02/03 11:36:50, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: right free failed at 67656
bc006785.log:[2004/02/03 11:36:50, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=65176
bc006785.log:[2004/02/03 11:36:50, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 65176

I have log level 3 logs for that time period. Anything specific that would be useful to provide?
Created attachment 378 [details]
example of connections.tdb file that samba 3 tdbbackup reports is bad

tdbbackup aborts with:

Failed to insert into connections.tdb.bad.bak.tmp
failed to copy connections.tdb.bad

tdbbackup fails in its copy_fn, which calls tdb_store with the TDB_INSERT flag. tdb_store finds that tdb_exists_hash returns 1 for one of the records, so tdbbackup aborts.
Created attachment 381 [details]
Now smbstatus is reporting locking.tdb has bad magic

smbstatus reports:

tdb(/var/lib/samba/locking.tdb): rec_read bad magic 0xd9fee666 at offset=6968
locked file list truncated

All the log messages with locking.tdb errors (10 or 11 different machines so far) report bad magic 0x0 at offset 7208:

bc006778.log-[2004/02/05 04:32:36, 0] tdb/tdbutil.c:tdb_log(531)
bc006778.log:  tdb(/var/lib/samba/locking.tdb): rec_free_read bad magic 0x0 at offset=7208
Can you please post the exact kernel level and filesystem info (on what filesystem, and with what options, do your tdbs reside, for example)? Also, the environment in terms of LPAR/VM/native...
Certainly...

# cat /etc/SuSE-release
SuSE SLES-8 (S/390)
VERSION = 8.1

# uname -a
Linux linfs1 2.4.21-94-default #1 SMP Sat Jan 10 01:13:17 UTC 2004 s390 unknown

The tdbs are in /var/lib/samba/... /var is an ext3 filesystem with 4k blocksize, mounted with defaults (rw), using dasd_eckd_mod. It's a z/VM 4.3 guest.
After going through the code some, I commented out the following in my smb.conf:

# blocking locks = No
# locking = No

I had read that various databases, Outlook PSTs, Word docs, etc. have problems unless you disable oplocks, level2 oplocks, kernel oplocks, locking, and blocking locks. After going through the tdb code and looking online more, I've kept the following, but turned blocking locks and locking back on:

oplocks = No
level2 oplocks = No
kernel oplocks = No

Could the locking and blocking locks options have been responsible for the tdb corruption? The tdb files are not available via any share, of course.
I restarted Samba with the new conf around 09:25 Saturday, and 15 hours later (00:15 Sunday) tdbbackup was reporting that connections.tdb was corrupt again. It has happened shortly after midnight in the past, though the time before this it ran for several days before becoming corrupt between 9:15 and 9:20.
The problem appears to be in tdb_brlock. In both Samba 2.2.8a and Samba 3.0.x, tdb_brlock in source/tdb/tdb.c uses a struct flock instead of a SMB_STRUCT_FLOCK, and calls fcntl with the lck_type set by the caller (in this case F_SETLKW). For standalone tdb compiles this is not a problem, because F_SETLKW is just that. But when built with Samba on this platform, that F_SETLKW appears to be redefined to 14 (F_SETLKW64). I'm not sure if fcntl is redefined to fcntl64 too, but either fcntl64 is getting passed a flock structure instead of a flock64, or fcntl is being called with an invalid cmd F_SETLKW64 instead of F_SETLKW. Since fcntl IS returning 0 and not EINVAL, it seems like the 64-bit lock operation is being applied to the 32-bit flock structure, but I could be wrong. Ideas?
More info... Changing F_SETLKW and F_SETLK to F_SETLKW64 and F_SETLK64 in tdb.c for the standalone compile, to match what Samba uses, results in plenty of tdbtorture errors:

# ./tdbtorture
rec_read bad magic 0x38 at offset=8876
rec_read bad magic 0x626e7778 at offset=9912
rec_read bad magic 0xd9fee666 at offset=744
rec_read bad magic 0xd9fee666 at offset=6488
rec_read bad magic 0x2 at offset=27600
rec_read bad magic 0x3 at offset=35556
rec_read bad magic 0xd9fee666 at offset=30132
rec_read bad magic 0xd9fee666 at offset=24564
rec_read bad magic 0xd9fee666 at offset=42348
rec_read bad magic 0x3 at offset=48340
rec_read bad magic 0x78726b74 at offset=6400
rec_read bad magic 0x0 at offset=0
rec_read bad magic 0xd9fee666 at offset=41100
rec_read bad magic 0x62656c6c at offset=11768
rec_read bad magic 0x64756476 at offset=25816
rec_read bad magic 0x75687669 at offset=23268
rec_read bad magic 0x6b7a7862 at offset=46460
rec_read bad magic 0xe1dead at offset=19292
rec_read bad magic 0x0 at offset=0
rec_read bad magic 0xd9fee666 at offset=9904
rec_read bad magic 0xd9fee666 at offset=24620
rec_read bad magic 0xd9fee666 at offset=328
rec_read bad magic 0xd9fee666 at offset=39756
rec_read bad magic 0x68626867 at offset=30440
rec_read bad magic 0x0 at offset=0
rec_read bad magic 0x0 at offset=0
rec_read bad magic 0x0 at offset=0
rec_read bad magic 0x1c at offset=34772
rec_read bad magic 0x9e9c at offset=45472
rec_read bad magic 0x0 at offset=0
rec_read bad magic 0xd9fee666 at offset=42300
etc etc

But all the errors disappear if you add a flock64 struct to tdb.h and use flock64 instead of flock in tdb_brlock, in addition to the F_SETLKW64 and F_SETLK64:

struct flock64 {
        short l_type;
        short l_whence;
        loff_t l_start;
        loff_t l_len;
        pid_t l_pid;
};

Now tdbtorture runs without errors again:

# ./tdbtorture
OK
It looks like the problem was that Samba had redefined F_SETLK and F_SETLKW to F_SETLK64 and F_SETLKW64, because SuSE SLES 8 supports 64-bit file locks, but was still using the plain-jane 32-bit file lock structure. tdbtorture didn't turn up anything, because it is built standalone and didn't have F_SETLK and F_SETLKW redefined by Samba, so it was only using 32-bit locking commands, which matched the 32-bit file lock structure. When I compiled tdbtorture standalone and redefined F_SETLK and F_SETLKW to the 64-bit locking versions to match Samba's behavior, tdbtorture returned plenty of errors. When I recompiled it with those changes plus the 64-bit file lock structure, tdbtorture ran without errors again. This does not appear to be an s390 thing. I suspect SuSE 8.1 on Intel (what SLES 8 s/390 is based on) would have this same problem, because of the 64-bit file lock support that Samba would detect. I have a version of Samba compiled that uses the 64-bit file lock structure for tdb now, which should fix our problem. Today I'll be working on the patch to do it right for everyone, plus fix some other tdb bugs that turned up.
Created attachment 394 [details]
diff -u patch; will apply to both samba 2.2.8a and samba 3.0.2

Potential tdb bugs:

Problem: Used struct flock, though F_SETLK and F_SETLKW may have been redefined by configure to F_SETLK64 and F_SETLKW64.
Patch: Use SMB_STRUCT_FLOCK if defined, otherwise use struct flock.

Problem: tdb_lock and tdb_unlock call tdb_brlock with an offset 4 bytes too far to the left. While this might not cause a problem if everyone is 4 bytes off, it really should lock FREELIST_TOP+4*(list+1)... otherwise locking the allocation list (list==-1) places the lock before FREELIST_TOP (in the TDB_HEADER), locking hash bucket 0 locks the allocation list, etc.
Patch: Change from FREELIST_TOP+4*list to FREELIST_TOP + (list+1)*sizeof(tdb_off), kind of like TDB_HASH_TOP.

Problem: When tdb_free looks right to determine whether the next record is FREE and can be merged with, it checks the offset against tdb->map_size instead of using tdb_oob (the tdb might have expanded). This isn't a big deal; it just might cause additional fragmentation in the freelist.
Patch: Change from (right + sizeof(*rec) <= tdb->map_size) to (!tdb_oob(tdb, right + sizeof(*rec), 1)).

Problem: Typo in tdb_free ("upfate trailer failed"), and TDB_LOG format string missing conversions in tdb_lock and rec_free_read.
Patch: Fix.

Problem: tdb_expand: no checks to ensure the tdb size does not overflow, or stays within reach of flock's l_start.
Patch: Does not address this.

Problem: A bunch of unsigned offsets are printed as %d, and some are even cast to int for this. Not sure why it's this way; not touching it.
Patch: Does not address this.
Created attachment 401 [details] config.log (Samba 3.0.2) Config.log... not sure why F_SETLK and F_SETLKW get redefined to F_SETLK64 and F_SETLKW64 (during Samba autoconfigure?). config.h to follow...
Created attachment 402 [details] config.h (Samba 3.0.2)
It might not be over... today this message showed up in the logs of a 3.0.2 server built with the patch:

winbindd.log:[2004/02/13 00:17:01, 0] tdb/tdbutil.c:tdb_log(724)
winbindd.log:  tdb(/var/lib/samba/winbindd_cache.tdb): remove_from_freelist: not on list at off=23220
winbindd.log:[2004/02/13 00:17:01, 0] tdb/tdbutil.c:tdb_log(724)
winbindd.log:  tdb(/var/lib/samba/winbindd_cache.tdb): tdb_free: left free failed at 23220

But that's it. The strange thing is, the samba/winbind processes restarted just before this (winbindd.log finished adding domains at 00:16:18), and tdbbackup does not report any errors for this tdb. I looked at the tdb, and sure enough this offset does have free magic but is not in the freelist. Its rec_len is 1332 (0x0534), with an empty key and data length. The record is all 0x42s and the tailer is the right size. The record following it (the one that was freed and tried to merge with this one) has since been used again, and looks valid. So the only real corruption that occurred here is that we lost track of a freed record. I know the winbindd_cache code is more complicated than the connection.c code I was primarily working with... could this possibly happen with a tdb_traverse starting in the middle of a tdb_store (which could tdb_expand) or tdb_delete?
That didn't do it; there is something else still wrong for sure :(. On our test fileserver (2.2.8a), connections.tdb and sessionid.tdb still seem okay, but winbindd_cache is reporting problems (bad magic 0x42424242 and the like). On our test printserver (3.0.2), sessionid.tdb now has problems (and the logs have "remove_from_freelist: not on list at off=38492" followed by "tdb_free: left free failed at 38492" repeatedly). This bug is not fixed.
Just thought I'd keep things updated. We were able to cause a corrupt connections.tdb with some regularity on Samba 2.2.8a by connecting/disconnecting several processes with smbclient. Haven't tried 3.0.2a yet. At the suggestion of Volker we tried "use mmap = no", and it appears we've been corruption-free. We've now been up for about 10 days with only tdb_fetch failed messages since enabling "use mmap = no" (no bad magic, not-in-freelist, etc.).

I stepped through the logs of one of the corrupted examples, and it now looks like this has nothing to do with fcntl locking. Sorry for initially going down the wrong path. In the connections.tdb corruption, a process in the middle of tdb operations was never interrupted by another process doing tdb operations until all its locks were released. The tdb had never expanded (it was still 8192 bytes). The samba logs reported messages like:

rec_read bad magic 0xd9fee666 at offset=4416
remove_from_freelist: not on list at off=4416

tdbtool reported:

tdb> list
hash=49
rec: offset=696 next=0 rec_len=596 key_len=264 data_len=328 full_hash=0x8e8162cb magic=0x26011999
hash=73
rec: offset=4416 next=0 rec_len=3752 key_len=264 data_len=328 full_hash=0xdb3889a1 magic=0xd9fee666
ERROR: tailer does not match record! tailer=3156 totalsize=3776
freelist:

Here are the operations that seemed to lead up to and show evidence of corruption for the record at offset 4416 (there are some tdb operations before, but they don't go near this offset). Jumping in:

p1 needed to store its second connection record, so it split the record in the freelist at 4416 into two records of 596 and 3132. Everything follows as you'd expect, with the record at 4416 being the first in its hash bucket list.
p1 then deleted that record (disconnect from share): it found the key in the appropriate hash bucket list, took the record out of the hash bucket list, and freed the record at 4416 (which merged it with its right neighbor at 5036, len 3132, to create a new record of length 3752, updated the tailer at 8188 and the record header at 4416, and prepended the new free record to the freelist). It unlocked its allocation list locks and hash bucket.

p1 then deleted its other connection record (disconnect), freeing the record at offset 1936 (neither right nor left were free). This finished without any apparent problems, and p1 exited.

p0 needed to store its second connection record (connect to share), and used the freelist record at 1936, with a next of 4416. The record at 1936 was just big enough (596 + header), and the next at 4416 became the new first entry in the freelist. Everything seems fine.

p0 then deleted that connection record (disconnect from share). It removed it from the hash bucket, freed the record at 1936 (left and right were not free), and put the newly freed record back at the front of the freelist.

p0 then deleted its first connection record (disconnect), putting the record at 1316 at the front of the freelist (merged right with 1936), so the first freelist record big enough for one connection was replaced with a merged record big enough for two. p0 exited.

p5 then stored its first connection record, splitting freelist record 1316 in two. 1936 was placed back at the beginning of the freelist after its tailer was written to 2552 (right and left were not free), etc. It used hash bucket 100.

p6 then stored its first connection record, using 1936. 4416 was next in line again. It used hash bucket 21.

p7 then stored its first connection record, using 4416 and splitting off the right at 5036 (len 3132) back to the front of the freelist, etc. It used hash bucket 73. It did a tdb_write of the record (using memcpy) to rewrite the record header at 4416 after the rec->magic = TDB_MAGIC; assignment, but a little bit later this record apparently had free magic according to p4!

p8 then stored its first connection record, using 5036 and splitting off the right at 5656 (len 2512) back to the front of the freelist, etc. It used hash bucket 125.

p4 then stored its second connection record, using 5656 and splitting off the right at 6276 (len 1892) back to the front of the freelist, etc. It used hash bucket 44.

p4 then deleted that second connection record (disconnect from share), bringing the next freelist record back to 5656 with len 2512.

p4 then deleted its first connection record (disconnect), which was at offset 3796. When tdb_free looked right to 4416, it said 4416 had free magic, though it shouldn't have been free. It reported corruption when that record was, as expected, not found in the freelist.

Jerry, I just saw your note; I'll try to reproduce the failure on a stock 3.0.2a at debug level 10 for you, thanks.
> Jerry, I just saw your note, I'll try to reproduce the failure on a stock > 3.0.2a at debug level 10 for you, thanks. Oops, different bug
Created attachment 422 [details]
tdb_dump_chain "freelist top" problem (for tdbtool, minor)

This doesn't address my corruption problem, but is a minor bug in tdb.c that affects tdbtool and anything else that uses tdb_dump_chain to print the freelist (like tdb_dump_all). tdb_dump_chain wrongly treats the list number (-1 through header.hash_size) as a hash value by calling the TDB_HASH_TOP macro (which calls the BUCKET macro). When BUCKET does a -1 modulus u32 tdb->header.hash_size, you get various values depending on the hash_size, but not the -1 you would need for this to work. diff -u against 3.0.2a sources.
Created attachment 423 [details] tdb_dump_chain "freelist top" problem (for tdbtool, minor) ugh, fixed typo
We've now gone a few weeks without tdb corruption by setting use mmap = no. So right now it's looking like this is not a Samba bug, but a SLES 8 s/390 bug with mmap. I've opened a ticket with SuSE.
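For reference, the workaround amounts to one line in the [global] section of smb.conf (this is the setting named above, shown here only as a fragment):

```ini
[global]
    # work around the apparent mmap coherence problem on this platform
    use mmap = no
```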
Thanks for the update Daniel. And nice work. I'll work on getting the freelist top patch in to the tree.
> I'll work on getting the freelist top patch in to the tree.

Any thoughts on some of the other bugs addressed in the first patch (apart from the struct flock stuff)? The "tdb_brlock(tdb, FREELIST_TOP+4*list" code (instead of list+1) causes the fcntl to lock 4 bytes left of the intended range... but I guess lock_record and friends are never used on the same lists as tdb_lock, so there are no conflicts.
I've created a cross-reference at SuSE's bugzilla (bug 36071) to this bug report. We'll set up a test scenario.
> I've created a cross reference at SuSE's bugzilla, 36071 to this bug
> report. We'll setup a test scenario.

Any news? Is this at bugzilla.suse.de? I tried that, got a 401, but my SuSE maintenance id/password failed. Is it not meant to be accessed by others?
Sorry, but SuSE's bugzilla is unfortunately not publicly accessible. I've moved the bug to the person responsible for S390 and requested a test environment; that's currently not ready. As this looks like a SuSE problem, I'll try to reassign the bug to myself.
Daniel: Which kernel version do you use? You can view the version of the currently running one with uname -r; the output includes the RPM release number. I'm testing with 2.4.21-107 as available from the SuSE Linux Maintenance Web.

The tests are performed with cifs_bm from cifs-load-gen, simulating ten clients (-N 10) from three hosts, using client_plain.txt from the cifs-load-gen sources and performing 1000000 operations. So far the bug isn't reproducible. If you would like to also test with cifs_bm, I'll provide an RPM to you.
> Daniel: Which kernel version do you use?

I just reproduced it on SuSE's 2.4.21-107. I couldn't immediately reproduce it with 10 smbclients, but after upping it to 13 I was able to reproduce it in a few attempts.

> The tests are performed with cifs_bm as from cifs-load-gen.
> Simulating ten clients (-N 10) from three hosts, using
> client_plain.txt as from the cifs-load-gen sources and
> performing 1000000 operations.

Are these fairly uniform, like 10 clients all connect, then all disconnect? I noticed in the smbclient output (each backgrounded client just does an ls and quit) that the connections and disconnections are being interleaved very well. It's not a single client connecting and disconnecting and then the next one, or 10 clients connecting and then all disconnecting. Each connection results in two records added to connections.tdb (and disconnecting results in two deletes). These stores/deletes are interleaved well, so the same parts of the tdb are being reused regularly among different processes as one frees a record and another allocates one. This "reuse" seems to be a requirement for the corruption to occur with the mmapped file.

> Till now the bug isn't reproduceable.

Any special Samba compilation options, or is it the standard package available on the SuSE Linux Maintenance Web? Testparm shows the default of use mmap = yes?

> If you would like to also test with cifs_bm, I'll provide a RPM to you.

Please do. Alone, cifs_bm might be inadequate to produce corruption. Or if it does produce corruption here, that points to something specific to our installation. Either way it should tell us something.

Thanks,
~ Daniel
I'm using Samba as from the SuSE Linux Maintenance Web, adding an smbtest share, and checking with testparm that 'use mmap = Yes' is set. If possible, please provide a 'working' script to reproduce the bug. I'm working on one using smbclient to transfer, list, and delete multiple little files. For which SuSE Linux version on which cpu architecture do you want cifs-load-gen?
I just run a very basic script. Something like the following will usually yield corruption in around 3 tries.

==================================================
#!/bin/sh
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
sleep 15
grep magic /var/log/samba/$HOSTNAME.log
tdbbackup /var/lib/samba/connections.tdb
==================================================

user/pass should be nothing important, since it will be visible in ps. This also assumes you have machine logs broken up like log file = /var/log/samba/%m.log, and a tmp share that the user can connect to. The commands file is just:

ls
quit

The transfer of files is not important; it's the tdb operations during connection/disconnection (or any other time tdb records are written) that really matter. When it breaks I end up with output like:

smb: \> tdb(/var/lib/samba/connections.tdb) pid 11675: rec_read bad magic 0xd9fee666 at offset=8756
failed to copy /var/lib/samba/connections.tdb

SuSE SLES-8 (S/390) VERSION = 8.1 SP3 ... this is the s390 arch, not s390x.

Thank you,
~ Daniel
I don't know if you are still looking at this, Lars Müller, but ever since 3.0.3, tdb corruption occurs even when use mmap = no is set in smb.conf. This is true in 3.0.4, too. Up until 3.0.2a, use mmap = no was an effective workaround, but no longer... Grepping through machine logs reveals:

./bc006755.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006755.log: BACKTRACE: 15 stack frames:
--
./bc006755.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006755.log: BACKTRACE: 22 stack frames:
--
./bc006772.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006772.log: BACKTRACE: 15 stack frames:
--
./bc006772.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006772.log: BACKTRACE: 22 stack frames:
--
./bc006327.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006327.log: BACKTRACE: 15 stack frames:
--
./bc006327.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006327.log: BACKTRACE: 22 stack frames:
--
./bc006754.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006754.log: BACKTRACE: 15 stack frames:
--
./bc006754.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006754.log: BACKTRACE: 22 stack frames:
--
./bc007526.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc007526.log: BACKTRACE: 15 stack frames:
--
./bc007526.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc007526.log: BACKTRACE: 22 stack frames:
--
./bc007119.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc007119.log: BACKTRACE: 15 stack frames:
--
./bc007119.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc007119.log: BACKTRACE: 22 stack frames:
--
./bc007112.log-[2004/05/11 10:15:18, 0] lib/util.c:smb_panic2(1406)
./bc007112.log: BACKTRACE: 28 stack frames:
--
./bc007112.log-[2004/05/11 10:15:18, 0] lib/util.c:smb_panic2(1406)
./bc007112.log: BACKTRACE: 35 stack frames:
--
./bc006584.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006584.log: BACKTRACE: 15 stack frames:
--
./bc006584.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006584.log: BACKTRACE: 22 stack frames:
./vpn-serv1.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./vpn-serv1.log: BACKTRACE: 15 stack frames:
--
./vpn-serv1.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./vpn-serv1.log: BACKTRACE: 22 stack frames:
./bc006971.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006971.log: BACKTRACE: 15 stack frames:
--
./bc006971.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006971.log: BACKTRACE: 22 stack frames:
--
./bc006962.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006962.log: BACKTRACE: 15 stack frames:
--
./bc006962.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006962.log: BACKTRACE: 22 stack frames:
--
./bc006775.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006775.log: BACKTRACE: 15 stack frames:
--
./bc006775.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006775.log: BACKTRACE: 22 stack frames:

And /var/log/messages:

May 11 10:10:00 linps1 /USR/SBIN/CRON[9028]: (root) CMD ( /usr/lib/sa/sa1 )
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] smbd/tdbutil.c:smbd_tdb_log(42)
May 11 10:15:18 linps1 smbd[9065]: tdb (/var/lib/samba/printing/HELPDESK.tdb): rec_free_read bad magic 0xfee1dead at offset=32764
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] smbd/tdbutil.c:smbd_tdb_log(48)
May 11 10:15:18 linps1 smbd[9065]: tdb_log: TDB /var/lib/samba/printing/HELPDESK.tdb is corrupt. Removing file and stopping this process.
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] lib/util.c:smb_panic2(1398)
May 11 10:15:18 linps1 smbd[9065]: PANIC: corrupt tdb
May 11 10:15:18 linps1 smbd[9065]:
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] lib/util.c:smb_panic2(1406)
May 11 10:15:18 linps1 smbd[9065]: BACKTRACE: 28 stack frames:
May 11 10:15:18 linps1 smbd[9065]: #0 [0x80560d98]
May 11 10:15:18 linps1 smbd[9065]: #1 [0x805ac862]
May 11 10:15:18 linps1 smbd[9065]: #2 [0x805ad31e]
May 11 10:15:18 linps1 smbd[9065]: #3 [0x805ae810]
May 11 10:15:18 linps1 smbd[9065]: #4 [0x805b5ee0]
May 11 10:15:18 linps1 smbd[9065]: #5 [0x805b6442]
May 11 10:15:18 linps1 smbd[9065]: #6 [0x805b7c7a]
May 11 10:15:18 linps1 smbd[9065]: #7 [0x804f5626]
May 11 10:15:18 linps1 smbd[9065]: #8 [0x804f7366]
May 11 10:15:18 linps1 smbd[9065]: #9 [0x804f7ac0]
May 11 10:15:18 linps1 smbd[9065]: #10 [0x804ea572]
May 11 10:15:18 linps1 smbd[9065]: #11 [0x8051c1ae]
May 11 10:15:18 linps1 smbd[9065]: #12 [0x8051be64]
May 11 10:15:18 linps1 smbd[9065]: #13 [0x80515a7e]
May 11 10:15:18 linps1 smbd[9065]: #14 [0x80515cca]
May 11 10:15:18 linps1 smbd[9065]: #15 [0x80515ffa]
May 11 10:15:18 linps1 smbd[9065]: #16 [0x80516258]
May 11 10:15:18 linps1 smbd[9065]: #17 [0x805161a4]
May 11 10:15:18 linps1 smbd[9065]: #18 [0x8045004a]
May 11 10:15:18 linps1 smbd[9065]: #19 [0x8045026c]
May 11 10:15:18 linps1 smbd[9065]: #20 [0x80450f18]
May 11 10:15:18 linps1 smbd[9065]: #21 [0x8049b2f2]
May 11 10:15:18 linps1 smbd[9065]: #22 [0x8049b3bc]
May 11 10:15:18 linps1 smbd[9065]: #23 [0x8049b7c8]
May 11 10:15:18 linps1 smbd[9065]: #24 [0x8049c3e6]
May 11 10:15:18 linps1 smbd[9065]: #25 [0x8060354a]
May 11 10:15:18 linps1 smbd[9065]: #26 [0xc02227bc]
May 11 10:15:18 linps1 smbd[9065]: #27 [0x8043a9d8]
May 11 10:15:18 linps1 smbd[9065]:
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] lib/fault.c:fault_report(36)
May 11 10:15:18 linps1 smbd[9065]: ===============================================================
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] lib/fault.c:fault_report(37)
May 11 10:15:18 linps1 smbd[9065]: INTERNAL ERROR: Signal 6 in pid 9065 (3.0.4-UL)
May 11 10:15:18 linps1 smbd[9065]: Please read the appendix Bugs of the Samba HOWTO collection
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] lib/fault.c:fault_report(39)
May 11 10:15:18 linps1 smbd[9065]: ===============================================================
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] lib/util.c:smb_panic2(1398)
May 11 10:15:18 linps1 smbd[9065]: PANIC: internal error
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] lib/util.c:smb_panic2(1406)
May 11 10:15:18 linps1 smbd[9065]: BACKTRACE: 35 stack frames:
May 11 10:15:18 linps1 smbd[9065]: #0 [0x80599a58]
May 11 10:15:18 linps1 smbd[9065]: #1 [0x80585a6a]
May 11 10:15:18 linps1 smbd[9065]: #2 [0x80585ad0]
May 11 10:15:18 linps1 smbd[9065]: #3 [0xffffd8c0]
May 11 10:15:18 linps1 smbd[9065]: #4 [0xc000a994]
May 11 10:15:18 linps1 smbd[9065]: #5 [0xc0237faa]
May 11 10:15:18 linps1 smbd[9065]: #6 [0x80599c9e]
May 11 10:15:18 linps1 smbd[9065]: #7 [0x80560d98]
May 11 10:15:18 linps1 smbd[9065]: #8 [0x805ac862]
May 11 10:15:18 linps1 smbd[9065]: #9 [0x805ad31e]
May 11 10:15:18 linps1 smbd[9065]: #10 [0x805ae810]
May 11 10:15:18 linps1 smbd[9065]: #11 [0x805b5ee0]
May 11 10:15:18 linps1 smbd[9065]: #12 [0x805b6442]
May 11 10:15:18 linps1 smbd[9065]: #13 [0x805b7c7a]
May 11 10:15:18 linps1 smbd[9065]: #14 [0x804f5626]
May 11 10:15:18 linps1 smbd[9065]: #15 [0x804f7366]
May 11 10:15:18 linps1 smbd[9065]: #16 [0x804f7ac0]
May 11 10:15:18 linps1 smbd[9065]: #17 [0x804ea572]
May 11 10:15:18 linps1 smbd[9065]: #18 [0x8051c1ae]
May 11 10:15:18 linps1 smbd[9065]: #19 [0x8051be64]
May 11 10:15:18 linps1 smbd[9065]: #20 [0x80515a7e]
May 11 10:15:18 linps1 smbd[9065]: #21 [0x80515cca]
May 11 10:15:18 linps1 smbd[9065]: #22 [0x80515ffa]
May 11 10:15:18 linps1 smbd[9065]: #23 [0x80516258]
May 11 10:15:18 linps1 smbd[9065]: #24 [0x805161a4]
May 11 10:15:18 linps1 smbd[9065]: #25 [0x8045004a]
May 11 10:15:18 linps1 smbd[9065]: #26 [0x8045026c]
May 11 10:15:18 linps1 smbd[9065]: #27 [0x80450f18]
May 11 10:15:18 linps1 smbd[9065]: #28 [0x8049b2f2]
May 11 10:15:18 linps1 smbd[9065]: #29 [0x8049b3bc]
May 11 10:15:18 linps1 smbd[9065]: #30 [0x8049b7c8]
May 11 10:15:18 linps1 smbd[9065]: #31 [0x8049c3e6]
May 11 10:15:18 linps1 smbd[9065]: #32 [0x8060354a]
May 11 10:15:18 linps1 smbd[9065]: #33 [0xc02227bc]
May 11 10:15:18 linps1 smbd[9065]: #34 [0x8043a9d8]
May 11 10:15:18 linps1 smbd[9065]:
May 11 10:16:03 linps1 smbd[9074]: [2004/05/11 10:16:03, 0] smbd/connection.c:register_message_flags(220)
May 11 10:16:03 linps1 smbd[9074]: register_message_flags: tdb_fetch failed

~ Daniel
Jerry has checked in a patch that re-enables the "use mmap = no" workaround; it had been inadvertently changed with another merge. On the system where I recreated the problem, it no longer occurs. Also, Anthony has pointed out http://www.mail-archive.com/linux-390@vm.marist.edu/msg11467.html, and I will be engaging the IBM zLinux team next week on this.
The SuSE bug #36071 was closed, as the real problem is the one Jim mentioned in comment #31. Fixed kernels for SuSE Linux Enterprise Server (SLES) 8 should be available in around a week. If these updates don't fix your problem without setting 'use mmap = No', please reopen the bug report.
Sorry for the spam; cleaning up the database to prevent unnecessary reopens of bugs.