On SuSE SLES8 SP3 s390 our tdbs frequently become corrupted. The biggest culprits tend to be winbindd_cache.tdb and connections.tdb. This happens with all versions of Samba from 2.2.8a through 3.0.1. While our connections.tdb and sessionid.tdb files have been fine for the last few days since we increased the log level to 3, I looked back through the logs and noticed that one of the .tdb files we weren't monitoring (winbindd_cache.tdb) quickly became corrupt. We couldn't easily monitor this one, because the tool we've been using (tdbbackup) always reports an error on winbindd_cache.tdb for Samba 2, even when the tdb is freshly generated. But looking back in the logs, the corruption happened within about half an hour of the winbindd_cache being rebuilt:

Jan 30 12:24:27 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=43544
Jan 30 12:24:27 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=43544
Jan 30 12:24:27 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=43544
Jan 30 12:24:27 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=43256
Jan 30 12:24:27 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=43256
(etc)
Jan 30 12:24:27 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=42532
Jan 30 12:24:27 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=42532
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): tdb_oob len 1111638618 beyond eof at 57344
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): tdb_oob len 1111638618 beyond eof at 57344
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=44968
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=44968
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=44968
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=44660
(etc)
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=41356
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=41356
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0xd9fee666 at offset=41068
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0xd9fee666 at offset=41068
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=44592
Jan 30 12:24:28 linfs1 winbindd[5729]: tdb (/var/lib/samba/winbindd_cache.tdb): rec_read bad magic 0x42424242 at offset=44592
(etc)

On a Samba 3.0.1 print server, winbindd aborted:

string_to_sid: Sid BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB does not start with 'S-'.
string_to_sid: Sid BBBBBBBBBBB does not start with 'S-'.
string_to_sid: Sid 648334-1534^Gftpuser^A does not start with 'S-'.
string_to_sid: Sid 48334-153fitzpatp^A does not start with 'S-'.
string_to_sid: Sid 24891-1907648334-154quintasv^A does not start with 'S-'.
string_to_sid: Sid 20293289-429224891-1907648334-1552^Eletol^A does not start with 'S-'.
string_to_sid: Sid does not start with 'S-'.
string_to_sid: Sid -5-21-2020293289-429224891-1907648334-1554^Fstumpc does not start with 'S-'.
string_to_sid: Sid 1-5-21-2020293289-429224891-1907648334-156g does not start with 'S-'.
string_to_sid: Sid eickmac^A does not start with 'S-'.
string_to_sid: Sidwsmadmin^A does not start with 'S-'.
string_to_sid: Sid 8334-1588^Elambd^A does not start with 'S-'.
string_to_sid: Sid -1907648334-160nelavaij^A does not start with 'S-'.
string_to_sid: Sid 93289-429224891-1907648334-1604^Fmentzk^A does not start with 'S-'.
string_to_sid: Sid 21-2020293289-429224891-1907648334-160stric does not start with 'S-'
string_to_sid: Sid alkera^A does not start with 'S-'.
string_to_sid: Sid -162cavinesd^A does not start with 'S-'.
etc. etc.

It must have started pointing into the middle of records instead of at the beginnings. Anyway, it ended like this:

string_to_sid: Sid 0293289-429224891-1907648334-1787^Gwpscomm^A does not start with 'S-'.
string_to_sid: Sid 5-21-2020293289-429224891-1907648334-1796^Fsmi does not start with 'S-'.
nsswitch/winbindd_cache.c:centry_uint8(175) centry corruption? needed 1 bytes, have 0
lib/util.c:smb_panic(1400) PANIC: centry_uint32
lib/util.c:smb_panic(1408) BACKTRACE: 12 stack frames:
 #0 [0x8042ee3a]
 #1 [0x8042eea4]
 #2 [0x8042efe8]
 #3 [0x804319d4]
 #4 [0x80429b22]
 #5 [0x8042a8e0]
 #6 [0x80426c7e]
 #7 [0x80427028]
 #8 [0x80427a58]
 #9 [0x8042807e]
 #10 [0xc024f7bc]
 #11 [0x804263e0]

Though there was a backtrace, I doubt it will be too helpful, because this is just when the corruption was noticed.
But the backtrace is really:

 #0 [0x0042ee3a]
 #1 [0x0042eea4]
 #2 [0x0042efe8]
 #3 [0x004319d4]
 #4 [0x00429b22]
 #5 [0x0042a8e0]
 #6 [0x00426c7e]
 #7 [0x00427028]
 #8 [0x00427a58]
 #9 [0x0042807e]
 #10 [0x4024f7bc]
 #11 [0x004263e0]

# file winbindd
winbindd: ELF 32-bit MSB executable, version 1 (SYSV), dynamically linked (uses shared libs), not stripped

# readelf -l winbindd
Elf file type is EXEC (Executable file)
Entry point 0x4263a8
There are 7 program headers, starting at offset 52

Program Headers:
  Type          Offset   VirtAddr   PhysAddr   FileSiz  MemSiz   Flg Align
  PHDR          0x000034 0x00400034 0x00400034 0x000e0  0x000e0  R E 0x4
  INTERP        0x000114 0x00400114 0x00400114 0x0000d  0x0000d  R   0x1
      [Requesting program interpreter: /lib/ld.so.1]
  LOAD          0x000000 0x00400000 0x00400000 0x15f874 0x15f874 R E 0x1000
  LOAD          0x15f878 0x00560878 0x00560878 0x06d40  0x198b4  RW  0x1000
  DYNAMIC       0x165f98 0x00566f98 0x00566f98 0x00140  0x00140  RW  0x4
  NOTE          0x000124 0x00400124 0x00400124 0x00020  0x00020  R   0x4
  GNU_EH_FRAME  0x15f858 0x0055f858 0x0055f858 0x0001c  0x0001c  R   0x4

Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.ABI-tag .hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr
   03     .data .eh_frame .dynamic .ctors .dtors .jcr .got .bss
   04     .dynamic
   05     .note.ABI-tag
   06     .eh_frame_hdr

# ldd winbindd | awk '{print $4 " " $3 }' | sort
(0x40000000) /lib/ld.so.1
(0x4001b000) /lib/libcrypt.so.1
(0x4004e000) /lib/libattr.so.1
(0x40052000) /lib/libresolv.so.2
(0x40066000) /lib/libnsl.so.1
(0x4007d000) /lib/libdl.so.2
(0x40081000) /usr/lib/libpopt.so.0
(0x40088000) /usr/lib/libcom_err.so.1
(0x4008b000) /usr/lib/libgssapi.so.1
(0x40098000) /usr/lib/libkrb5.so.17
(0x400d6000) /usr/lib/libasn1.so.5
(0x400fd000) /usr/lib/libcrypto.so.0.9.6
(0x401de000) /usr/lib/libroken.so.9
(0x401f3000) /usr/lib/libldap.so.2
(0x40225000) /usr/lib/liblber.so.2
(0x40233000) /lib/libc.so.6
(0x40358000) /usr/lib/libdb-4.0.so
(0x403fc000) /usr/lib/libsasl.so.7
(0x40411000) /usr/lib/libssl.so.0.9.6
(0x40440000) /lib/libpam.so.0
(0x40449000) /usr/lib/libgdbm.so.2

Portions of nm output around the backtrace addresses:

Winbind:
004263a8 T _start
004263f4 t __do_global_dtors_aux
0042645c t frame_dummy
004264bc t reload_services_file
00426588 t dump_core
004267b8 t fault_quit
004267e0 t winbindd_status
004269c8 t print_winbindd_status
004269f8 t flush_caches
00426a20 t terminate
00426a84 t termination_handler
00426abc t sigusr2_handler
00426af4 t sighup_handler
00426b2c t msg_reload_services
00426b60 t msg_shutdown
00426b88 t process_request
00426d0c t new_connection
00426e60 t remove_client
00426edc t remove_idle_client
00426ff0 T winbind_process_packet
00427054 T winbind_client_read
004271dc t client_write
00427494 t process_loop
00427b3c T main
00428094 t winbindd_fill_pwent
004282dc T winbindd_getpwnam
004294c8 T winbindd_list_users
004297b0 t gr_mem_buffer
00429894 t fill_grent
00429904 t fill_grent_mem
0042a0e0 T winbindd_getgrnam
0042a594 T winbindd_getgrgid
0042a914 T winbindd_setgrent
0042e98c T winbindd_check_cache_size
0042eb80 t get_cache
0042ec54 t centry_free
0042ec9c t centry_uint32
0042ed94 t centry_uint8
0042ee5c t centry_string
0042efa4 t centry_sid
0042f00c t wcache_server_down
0042f0d8 t fetch_cache_seqnum
0042f3b8 t store_cache_seqnum
0042f600 t refresh_sequence_number
0042f7a4 t centry_expired
0042f934 t wcache_fetch
0042fc2c t centry_expand
0042fd00 t centry_put_uint32
0042fd7c t centry_put_uint8
0042fdc4 t centry_put_string
0042fe84 t centry_put_sid
0042fec0 T centry_start
0042ff48 t centry_end
0042fff8 t wcache_save_name_to_sid
00430110 t wcache_save_sid_to_name
00430210 t wcache_save_user
0043031c t query_user_list
00430784 t enum_dom_groups
00430ab8 t enum_local_groups
00430e68 t name_to_sid
0043109c t sid_to_name
004312b0 t query_user
00431540 t lookup_usergroups
00431880 t lookup_groupmem
00431c1c t sequence_number
00431c60 t trusted_domains
00431d2c t domain_sid
00431de8 t alternate_name
00431ea0 t traverse_fn
00431ef8 T wcache_invalidate_samlogon

nscd is off. Here is smb.conf for the two servers.

linps2, the Samba 3.0.1 test print server:

[global]
    workgroup = TCS_MAIN_DOM
    netbios name = linps2
    netbios aliases = gutenberg
    server string = Samba Server on linps2
    interfaces = eth0 127.0.0.1/24
    bind interfaces only = yes
    security = DOMAIN
    encrypt passwords = Yes
    password server = *
    username map = /etc/samba/smbusers
    log level = 1
    log file = /var/log/samba/%m.log
    mangling method = hash2
    preferred master = No
    domain master = No
    local master = No
    dns proxy = No
    wins server = tcs_main_pdc
    kernel oplocks = No
    winbind uid = 10000-20000
    winbind gid = 10000-20000
    winbind enum users = yes
    winbind enum groups = yes
    winbind use default domain = yes
    blocking locks = No
    locking = No
    oplocks = No
    level2 oplocks = No
    guest account = nobody
    load printers = no
    printer admin = @"TCS_MAIN_DOM\Domain Admins"
    printing = cups
    printcap name = cups

[printers]
    printable = yes
    path = /var/spool/samba

[print$]
    path = /var/lib/samba/drivers
    force group = +Domain Admins
    force create mode = 0664
    force directory mode = 0775
    write list = TCS_MAIN_DOM\jarboed, TCS_MAIN_DOM\mcginnig

[TC5-EAST-TWO]
    printable = yes
    path = /var/spool/samba

[TC5-DCEXEC]
    printable = yes
    path = /var/spool/samba

[TC5-WSM]
    printable = yes
    path = /var/spool/samba

(various other print shares)

For the Samba 2.2.8a test file server:

[global]
    workgroup = TCS_MAIN_DOM
    netbios name = linfs1
    server string = Samba Server on linfs1
    interfaces = eth0 127.0.0.1/24
    bind interfaces only = yes
    security = DOMAIN
    encrypt passwords = Yes
    password server = *
    username map = /etc/samba/smbusers
    log level = 3
    log file = /var/samba/log/%m.log
    mangling method = hash2
    preferred master = No
    domain master = No
    local master = Yes
    dns proxy = No
    wins server = tcs_main_pdc
    kernel oplocks = No
    winbind uid = 10000-20000
    winbind gid = 10000-20000
    winbind enum users = no
    winbind enum groups = no
    winbind use default domain = No
    blocking locks = No
    locking = No
    oplocks = No
    level2 oplocks = No
    guest account = nobody
    load printers = no

[netmail]
    browsable = yes
    path = /var/samba/netmail
    read only = no
    inherit acls = yes
    root preexec = /var/samba/scripts/initial_setup.sh '%u'

[netmail-root]
    browsable = no
    path = /var/samba/netmail
    valid users = @"Domain Admins" @"TCS_MAIN_DOM\Domain Admins"
    read only = no
    force group = TCS_MAIN_DOM\Domain Admins
    force user = root
    inherit acls = yes

[homes]
    comment = Personal Netmail Folder for %U
    path = /var/samba/netmail/%S
    valid users = TCS_MAIN_DOM\%S
    browseable = No
    read only = No
    inherit acls = yes
    root preexec = /var/samba/scripts/initial_setup.sh '%u'

Print jobs stopped being accepted when winbind shut down. Also, some time after connections.tdb becomes corrupt, all new connections are refused. What additional information would be helpful?
connections.tdb became corrupt between 9:15:00 and 9:20:00 AM today according to tdbbackup. Here are the tdb messages:

bc006727.log:[2004/02/03 09:19:39, 0] tdb/tdbutil.c:tdb_log(531)
bc006727.log:  tdb(/var/lib/samba/connections.tdb): rec_free_read non-free magic at offset=637606297 - fixing
bc006808.log:[2004/02/03 09:35:31, 0] tdb/tdbutil.c:tdb_log(531)
bc006808.log:  tdb(/var/lib/samba/connections.tdb): rec_free_read non-free magic at offset=637606297 - fixing
bc007110.log:[2004/02/03 09:38:04, 0] tdb/tdbutil.c:tdb_log(531)
bc007110.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=106096
bc007110.log:[2004/02/03 09:38:04, 0] tdb/tdbutil.c:tdb_log(531)
bc007110.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 106096
bc007110.log:[2004/02/03 09:38:48, 0] tdb/tdbutil.c:tdb_log(531)
bc007110.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=106096
bc007110.log:[2004/02/03 09:38:48, 0] tdb/tdbutil.c:tdb_log(531)
bc007110.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 106096
bc007384.log:[2004/02/03 09:46:25, 0] tdb/tdbutil.c:tdb_log(531)
bc007384.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=106096
bc007384.log:[2004/02/03 09:46:26, 0] tdb/tdbutil.c:tdb_log(531)
bc007384.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 106096
bc007306.log:[2004/02/03 09:49:54, 0] tdb/tdbutil.c:tdb_log(531)
bc007306.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=106096
bc007306.log:[2004/02/03 09:49:54, 0] tdb/tdbutil.c:tdb_log(531)
bc007306.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 106096
bc007298.log:[2004/02/03 11:35:06, 0] tdb/tdbutil.c:tdb_log(531)
bc007298.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=67656
bc007298.log:[2004/02/03 11:35:06, 0] tdb/tdbutil.c:tdb_log(531)
bc007298.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: right free failed at 67656
bc007298.log:[2004/02/03 11:35:06, 0] tdb/tdbutil.c:tdb_log(531)
bc007298.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=65176
bc007298.log:[2004/02/03 11:35:06, 0] tdb/tdbutil.c:tdb_log(531)
bc007298.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 65176
bc006722.log:[2004/02/03 09:47:27, 0] tdb/tdbutil.c:tdb_log(531)
bc006722.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=106096
bc006722.log:[2004/02/03 09:47:27, 0] tdb/tdbutil.c:tdb_log(531)
bc006722.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 106096
bc006785.log:[2004/02/03 11:33:02, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=67656
bc006785.log:[2004/02/03 11:33:02, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: right free failed at 67656
bc006785.log:[2004/02/03 11:33:02, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=65176
bc006785.log:[2004/02/03 11:33:02, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 65176
bc006785.log:[2004/02/03 11:36:50, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=67656
bc006785.log:[2004/02/03 11:36:50, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: right free failed at 67656
bc006785.log:[2004/02/03 11:36:50, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): remove_from_freelist: not on list at off=65176
bc006785.log:[2004/02/03 11:36:50, 0] tdb/tdbutil.c:tdb_log(531)
bc006785.log:  tdb(/var/lib/samba/connections.tdb): tdb_free: left free failed at 65176

I have log level 3 logs for that time period. Anything specific that would be useful to provide?
Created attachment 378 [details]
example of connections.tdb file that samba 3 tdbbackup reports is bad

tdbbackup aborts with:

Failed to insert into connections.tdb.bad.bak.tmp
failed to copy connections.tdb.bad

tdbbackup fails in its copy_fn, which calls tdb_store with the TDB_INSERT flag. tdb_store finds that tdb_exists_hash returns 1 for one of the records, so tdbbackup aborts.
Created attachment 381 [details]
Now smbstatus is reporting locking.tdb has bad magic

smbstatus reports:

tdb(/var/lib/samba/locking.tdb): rec_read bad magic 0xd9fee666 at offset=6968
locked file list truncated

All the log messages with locking.tdb errors (10 or 11 different machines so far) report bad magic 0x0 at offset 7208:

bc006778.log-[2004/02/05 04:32:36, 0] tdb/tdbutil.c:tdb_log(531)
bc006778.log:  tdb(/var/lib/samba/locking.tdb): rec_free_read bad magic 0x0 at offset=7208
Can you please post the exact kernel level and filesystem info (on what filesystem, and with what options, do your tdbs reside, for example)? Also, the environment in terms of LPAR/VM/native...
Certainly...

# cat /etc/SuSE-release
SuSE SLES-8 (S/390)
VERSION = 8.1

# uname -a
Linux linfs1 2.4.21-94-default #1 SMP Sat Jan 10 01:13:17 UTC 2004 s390 unknown

The tdbs are in /var/lib/samba/... /var is an ext3 filesystem with 4k blocksize, mounted with defaults (rw), using dasd_eckd_mod. It's a z/VM 4.3 guest.
After going through the code some, I commented out the following in my smb.conf:

# blocking locks = No
# locking = No

I had read that various databases, Outlook PSTs, Word docs, etc. have problems unless you disable oplocks, level2 oplocks, kernel oplocks, locking, and blocking locks. After going through the tdb code and looking online more, I've kept the following, but turned blocking locks and locking back on:

oplocks = No
level2 oplocks = No
kernel oplocks = No

Could the locking and blocking locks options have been responsible for the tdb corruption? The tdb files are not available via any share, of course.
I restarted Samba with the new conf around 09:25 Saturday, and 15 hours later (00:15 Sunday) tdbbackup was reporting that connections.tdb was corrupt again. It has happened shortly after midnight in the past, though the time before this it ran for several days before becoming corrupt between 9:15 and 9:20.
The problem appears to be in tdb_brlock. In both Samba 2.2.8a and Samba 3.0.x, tdb_brlock in source/tdb/tdb.c uses a struct flock instead of a SMB_STRUCT_FLOCK, and calls fcntl with the lck_type set by the caller (in this case F_SETLKW). For standalone tdb compiles this is not a problem, because F_SETLKW is just that. But when built with Samba on this platform, that F_SETLKW appears to be redefined to 14 (F_SETLKW64). I'm not sure if fcntl is redefined to fcntl64 too, but either fcntl64 is getting passed a flock structure instead of a flock64, or fcntl is being called with an invalid cmd F_SETLKW64 instead of F_SETLKW. Since fcntl IS returning 0 and not EINVAL, it seems like the 64-bit lock operation is being applied to the 32-bit flock structure, but I could be wrong. Ideas?
More info... Changing F_SETLKW and F_SETLK to F_SETLKW64 and F_SETLK64 in tdb.c for the standalone compile, to match what Samba uses, results in plenty of tdbtorture errors:

# ./tdbtorture
rec_read bad magic 0x38 at offset=8876
rec_read bad magic 0x626e7778 at offset=9912
rec_read bad magic 0xd9fee666 at offset=744
rec_read bad magic 0xd9fee666 at offset=6488
rec_read bad magic 0x2 at offset=27600
rec_read bad magic 0x3 at offset=35556
rec_read bad magic 0xd9fee666 at offset=30132
rec_read bad magic 0xd9fee666 at offset=24564
rec_read bad magic 0xd9fee666 at offset=42348
rec_read bad magic 0x3 at offset=48340
rec_read bad magic 0x78726b74 at offset=6400
rec_read bad magic 0x0 at offset=0
rec_read bad magic 0xd9fee666 at offset=41100
rec_read bad magic 0x62656c6c at offset=11768
rec_read bad magic 0x64756476 at offset=25816
rec_read bad magic 0x75687669 at offset=23268
rec_read bad magic 0x6b7a7862 at offset=46460
rec_read bad magic 0xe1dead at offset=19292
rec_read bad magic 0x0 at offset=0
rec_read bad magic 0xd9fee666 at offset=9904
rec_read bad magic 0xd9fee666 at offset=24620
rec_read bad magic 0xd9fee666 at offset=328
rec_read bad magic 0xd9fee666 at offset=39756
rec_read bad magic 0x68626867 at offset=30440
rec_read bad magic 0x0 at offset=0
rec_read bad magic 0x0 at offset=0
rec_read bad magic 0x0 at offset=0
rec_read bad magic 0x1c at offset=34772
rec_read bad magic 0x9e9c at offset=45472
rec_read bad magic 0x0 at offset=0
rec_read bad magic 0xd9fee666 at offset=42300
etc etc

But all the errors disappear if you add a flock64 struct to tdb.h and use flock64 instead of flock in tdb_brlock, in addition to the F_SETLKW64 and F_SETLK64:

struct flock64 {
        short l_type;
        short l_whence;
        loff_t l_start;
        loff_t l_len;
        pid_t l_pid;
};

Now tdbtorture runs without errors again:

# ./tdbtorture
OK
It looks like the problem was that Samba had redefined F_SETLK and F_SETLKW to F_SETLK64 and F_SETLKW64, because SuSE SLES 8 supports 64-bit file locks, but was still using the plain-jane 32-bit file lock structure. tdbtorture didn't turn up anything, because it is built standalone and didn't have F_SETLK and F_SETLKW redefined by Samba, so it was only using 32-bit locking commands, which matched the 32-bit file lock structure. When I compiled tdbtorture standalone and redefined F_SETLK and F_SETLKW to the 64-bit locking versions to match Samba's behavior, tdbtorture returned plenty of errors. When I recompiled it with those changes plus the 64-bit file lock structure, tdbtorture ran without errors again. This does not appear to be an s390 thing. I suspect SuSE 8.1 on Intel (what SLES 8 s/390 is based on) would have this same problem, because of the 64-bit file lock support that Samba would detect. I have a version of Samba compiled that uses the 64-bit file lock structure for tdb now, which should fix our problem. Today I'll be working on the patch to do it right for everyone, plus fix some other tdb bugs that turned up.
Created attachment 394 [details]
diff -u patch; will apply to both samba 2.2.8a and samba 3.0.2

Potential tdb bugs:

Problem: Used struct flock, though F_SETLK and F_SETLKW may have been redefined by configure to F_SETLK64 and F_SETLKW64.
Patch: Use SMB_STRUCT_FLOCK if defined, otherwise use struct flock.

Problem: tdb_lock and tdb_unlock call tdb_brlock with an offset 4 bytes too far to the left. While this might not cause a problem if everyone is 4 bytes off, it really should lock FREELIST_TOP+4*(list+1)... otherwise locking the allocation list (list==-1) places the lock before FREELIST_TOP (in the TDB_HEADER), locking hash bucket 0 locks the allocation list, etc.
Patch: Change from FREELIST_TOP+4*list to FREELIST_TOP + (list+1)*sizeof(tdb_off), kind of like TDB_HASH_TOP.

Problem: When tdb_free looks right to determine whether the next record is FREE and can be merged with, it checks the offset against tdb->map_size instead of using tdb_oob (the tdb might have expanded). This isn't a big deal; it just might cause additional fragmentation in the freelist.
Patch: Change from (right + sizeof(*rec) <= tdb->map_size) to (!tdb_oob(tdb, right + sizeof(*rec), 1)).

Problem: Typo in tdb_free ("upfate trailer failed"), and TDB_LOG format string missing conversions in tdb_lock and rec_free_read.
Patch: Fix.

Problem: tdb_expand: no checks to ensure the tdb size does not overflow, or stays within reach of flock's l_start.
Patch: Does not address this.

Problem: A bunch of unsigned offsets are printed as %d, and some are even cast to int for this. Not sure why it's this way; not touching it.
Patch: Does not address this.
Created attachment 401 [details] config.log (Samba 3.0.2) Config.log... not sure why F_SETLK and F_SETLKW get redefined to F_SETLK64 and F_SETLKW64 (during Samba autoconfigure?). config.h to follow...
Created attachment 402 [details] config.h (Samba 3.0.2)
It might not be over... today this message showed up in the logs of a 3.0.2 server built with the patch:

winbindd.log:[2004/02/13 00:17:01, 0] tdb/tdbutil.c:tdb_log(724)
winbindd.log:  tdb(/var/lib/samba/winbindd_cache.tdb): remove_from_freelist: not on list at off=23220
winbindd.log:[2004/02/13 00:17:01, 0] tdb/tdbutil.c:tdb_log(724)
winbindd.log:  tdb(/var/lib/samba/winbindd_cache.tdb): tdb_free: left free failed at 23220

But that's it. The strange thing is, the samba/winbind processes restarted just before this (winbindd.log finished adding domains at 00:16:18), and tdbbackup does not report any errors for this tdb. I looked at the tdb, and sure enough this offset does have free magic but is not in the freelist. Its rec_len is 1332 (0x0534), with an empty key and data length. The record is all 0x42s and the tailer is the right size. The record following it (the one that was freed and tried to merge with this one) has since been used again, and looks valid. So the only real corruption that occurred here is that we lost track of a freed record. I know the winbindd_cache code is more complicated than the connection.c code I was primarily working with... could this possibly happen with a tdb_traverse starting in the middle of a tdb_store (which could tdb_expand) or tdb_delete?
That didn't do it; there is something else still wrong for sure :(. On our test fileserver (2.2.8a), connections.tdb and sessionid.tdb still seem okay, but winbindd_cache is reporting problems (bad magic 0x42424242 and the like). On our test printserver (3.0.2), sessionid.tdb now has problems (and the logs have "remove_from_freelist: not on list at off=38492" followed by "tdb_free: left free failed at 38492" repeatedly). This bug is not fixed.
Just thought I'd keep things updated. We were able to cause a corrupt connections.tdb with some regularity on Samba 2.2.8a by connecting/disconnecting several processes with smbclient. Haven't tried 3.0.2a yet. At the suggestion of Volker we tried "use mmap = no", and it appears we've been corruption-free. We've now been up for about 10 days with only tdb_fetch failed messages since enabling "use mmap = no" (no bad magic, not-in-freelist, etc.).

I stepped through the logs of one of the corrupted examples, and it now looks like this has nothing to do with fcntl locking. Sorry for initially going down the wrong path. In the connections.tdb corruption, a process in the middle of tdb operations was never interrupted by another process doing tdb operations until all its locks were released. The tdb had never expanded (it was still 8192 bytes). The samba logs reported messages like:

rec_read bad magic 0xd9fee666 at offset=4416
remove_from_freelist: not on list at off=4416

tdbtool reported:

tdb> list
hash=49
rec: offset=696 next=0 rec_len=596 key_len=264 data_len=328 full_hash=0x8e8162cb magic=0x26011999
hash=73
rec: offset=4416 next=0 rec_len=3752 key_len=264 data_len=328 full_hash=0xdb3889a1 magic=0xd9fee666
ERROR: tailer does not match record! tailer=3156 totalsize=3776
freelist:

Here are the operations that seemed to lead up to and show evidence of corruption for the record at offset 4416 (there are some tdb operations before, but they don't go near this offset). Jumping in:

p1 needed to store its second connection record, so it split the record in the freelist at 4416 into two records of 596 and 3132. Everything follows as you'd expect, with the record at 4416 being the first in its hash bucket list.
p1 then deleted that record (disconnect from share): it found the key in the appropriate hash bucket list, took the record out of the hash bucket list, and freed the record at 4416 (which merged it with its right neighbor at 5036, len 3132, to create a new record of length 3752, updated the tailer at 8188 and the record header at 4416, and prepended the new free record to the freelist). It unlocked its allocation list locks and hash bucket.

p1 then deleted its other connection record (disconnect), freeing the record at offset 1936 (neither right nor left were free). This finished without any apparent problems, and p1 exited.

p0 needed to store its second connection record (connect to share), and used the freelist record at 1936, with a next of 4416. The record at 1936 was just big enough (596 + header), and the next at 4416 became the new first entry in the freelist. Everything seems fine.

p0 then deleted that connection record (disconnect from share). It removed it from the hash bucket, freed the record at 1936 (left and right were not free), and put the newly freed record back at the front of the freelist.

p0 then deleted its first connection record (disconnect), putting the record at 1316 at the front of the freelist (merged right with 1936), so the first freelist record big enough for one connection was replaced with a merged record big enough for two. p0 exited.

p5 then stored its first connection record, splitting freelist record 1316 in two. 1936 was placed back at the beginning of the freelist after its tailer was written to 2552 (right and left were not free), etc. It used hash bucket 100.

p6 then stored its first connection record, using 1936. 4416 was next in line again. It used hash bucket 21.

p7 then stored its first connection record, using 4416 and splitting off the right at 5036 (len 3132) back to the front of the freelist, etc. It used hash bucket 73. It did a tdb_write of the record (using memcpy) to rewrite the record header at 4416 after the rec->magic = TDB_MAGIC; assignment, but a little bit later this record apparently had free magic according to p4!

p8 then stored its first connection record, using 5036 and splitting off the right at 5656 (len 2512) back to the front of the freelist, etc. It used hash bucket 125.

p4 then stored its second connection record, using 5656 and splitting off the right at 6276 (len 1892) back to the front of the freelist, etc. It used hash bucket 44.

p4 then deleted that second connection record (disconnect from share), bringing the next freelist record back to 5656 with len 2512.

p4 then deleted its first connection record (disconnect), which was at offset 3796. When tdb_free looked right to 4416, it said 4416 had free magic, though it shouldn't have been free. It reported corruption when that record was, as expected, not found in the freelist.

Jerry, I just saw your note; I'll try to reproduce the failure on a stock 3.0.2a at debug level 10 for you, thanks.
> Jerry, I just saw your note, I'll try to reproduce the failure on a stock > 3.0.2a at debug level 10 for you, thanks. Oops, different bug
Created attachment 422 [details]
tdb_dump_chain "freelist top" problem (for tdbtool, minor)

This doesn't address my corruption problem, but is a minor bug in tdb.c that affects tdbtool and anything else that uses tdb_dump_chain to print the freelist (like tdb_dump_all). tdb_dump_chain wrongly treats the list number (-1 through header.hash_size) as a hash value by calling the TDB_HASH_TOP macro (which calls the BUCKET macro). When BUCKET does a -1 modulus u32 tdb->header.hash_size, you get various values depending on the hash_size, but not the -1 you would need for this to work. diff -u against 3.0.2a sources.
Created attachment 423 [details] tdb_dump_chain "freelist top" problem (for tdbtool, minor) ugh, fixed typo
We've now gone a few weeks without tdb corruption by setting use mmap = no. So right now it's looking like this is not a Samba bug, but a SLES 8 s/390 bug with mmap. I've opened a ticket with SuSE.
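For reference, the workaround amounts to one line in the [global] section of smb.conf (this is the setting named above, shown here only as a fragment):

```ini
[global]
    # work around the apparent mmap coherence problem on this platform
    use mmap = no
```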
Thanks for the update Daniel. And nice work. I'll work on getting the freelist top patch in to the tree.
> I'll work on getting the freelist top patch in to the tree.

Any thoughts on some of the other bugs addressed in the first patch (apart from the struct flock stuff)? The "tdb_brlock(tdb, FREELIST_TOP+4*list" code (instead of list+1) causes the fcntl to lock 4 bytes left of the intended range... but I guess lock_record and friends are never used on the same lists as tdb_lock, so there are no conflicts.
I've created a cross-reference at SuSE's bugzilla (bug 36071) to this bug report. We'll set up a test scenario.
> I've created a cross reference at SuSE's bugzilla, 36071 to this bug
> report. We'll setup a test scenario.

Any news? Is this at bugzilla.suse.de? I tried that, got a 401, but my SuSE maintenance id/password failed. Is it not meant to be accessed by others?
Sorry, but SuSE's bugzilla is unfortunately not publicly accessible. I've moved the bug to the person responsible for S390 and requested a test environment; that's currently not ready. As this looks like a SuSE problem, I'll try to reassign the bug to myself.
Daniel: Which kernel version do you use? You can view the version of the currently running one with uname -r; the output includes the RPM release number. I'm testing with 2.4.21-107 as available from the SuSE Linux Maintenance Web.

The tests are performed with cifs_bm from cifs-load-gen, simulating ten clients (-N 10) from three hosts, using client_plain.txt from the cifs-load-gen sources and performing 1000000 operations. So far the bug isn't reproducible. If you would like to also test with cifs_bm, I'll provide an RPM to you.
> Daniel: Which kernel version do you use?

I just reproduced it on SuSE's 2.4.21-107. I couldn't immediately reproduce it with 10 smbclients, but after upping it to 13 I was able to reproduce it in a few attempts.

> The tests are performed with cifs_bm as from cifs-load-gen.
> Simulating ten clients (-N 10) from three hosts, using
> client_plain.txt as from the cifs-load-gen sources and
> performing 1000000 operations.

Are these fairly uniform, like 10 clients all connect, then all disconnect? I noticed in the smbclient output (each backgrounded client just does an ls and quit) that the connections and disconnections are being interleaved very well. It's not a single client connecting and disconnecting and then the next one, or 10 clients connecting and then all disconnecting. Each connection results in two records added to connections.tdb (and disconnecting results in two deletes). These stores/deletes are interleaved well, so the same parts of the tdb are being reused regularly among different processes as one frees a record and another allocates one. This "reuse" seems to be a requirement for the corruption to occur with the mmapped file.

> Till now the bug isn't reproduceable.

Any special Samba compilation options, or is it the standard package available on the SuSE Linux Maintenance Web? Testparm shows the default of use mmap = yes?

> If you would like to also test with cifs_bm, I'll provide a RPM to you.

Please do. Alone, cifs_bm might be inadequate to produce corruption. Or if it does produce corruption here, that points to something specific to our installation. Either way it should tell us something.

Thanks,
~ Daniel
I'm using Samba as from the SuSE Linux Maintenance Web, adding an smbtest share, and checking with testparm that 'use mmap = Yes' is set. If possible, please provide a 'working' script to reproduce the bug. I'm working on one using smbclient to transfer, list, and delete multiple little files. For which SuSE Linux version on which cpu architecture do you want cifs-load-gen?
I just run a very basic script. Something like the following will usually yield corruption in around 3 tries.

==================================================
#!/bin/sh
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
smbclient //$HOSTNAME/tmp -U 'user%pass' < commands &
sleep 15
grep magic /var/log/samba/$HOSTNAME.log
tdbbackup /var/lib/samba/connections.tdb
==================================================

user/pass should be nothing important, since it will be visible in ps. This also assumes you have machine logs broken up like log file = /var/log/samba/%m.log, and a tmp share that the user can connect to. The commands file is just:

ls
quit

The transfer of files is not important; it's the tdb operations during connection/disconnection (or any other time tdb records are written) that really matter. When it breaks I end up with output like:

smb: \> tdb(/var/lib/samba/connections.tdb) pid 11675: rec_read bad magic 0xd9fee666 at offset=8756
failed to copy /var/lib/samba/connections.tdb

SuSE SLES-8 (S/390) VERSION = 8.1 SP3 ... this is the s390 arch, not s390x.

Thank you,
~ Daniel
I don't know if you are still looking at this, Lars Müller, but ever since 3.0.3, tdb corruption occurs even when use mmap = no is set in smb.conf. This is true in 3.0.4, too. Up until 3.0.2a, use mmap = no was an effective workaround, but no longer... Grepping through machine logs reveals:

./bc006755.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006755.log: BACKTRACE: 15 stack frames:
--
./bc006755.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006755.log: BACKTRACE: 22 stack frames:
--
./bc006772.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006772.log: BACKTRACE: 15 stack frames:
--
./bc006772.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006772.log: BACKTRACE: 22 stack frames:
--
./bc006327.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006327.log: BACKTRACE: 15 stack frames:
--
./bc006327.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006327.log: BACKTRACE: 22 stack frames:
--
./bc006754.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006754.log: BACKTRACE: 15 stack frames:
--
./bc006754.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006754.log: BACKTRACE: 22 stack frames:
--
./bc007526.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc007526.log: BACKTRACE: 15 stack frames:
--
./bc007526.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc007526.log: BACKTRACE: 22 stack frames:
--
./bc007119.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc007119.log: BACKTRACE: 15 stack frames:
--
./bc007119.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc007119.log: BACKTRACE: 22 stack frames:
--
./bc007112.log-[2004/05/11 10:15:18, 0] lib/util.c:smb_panic2(1406)
./bc007112.log: BACKTRACE: 28 stack frames:
--
./bc007112.log-[2004/05/11 10:15:18, 0] lib/util.c:smb_panic2(1406)
./bc007112.log: BACKTRACE: 35 stack frames:
--
./bc006584.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006584.log: BACKTRACE: 15 stack frames:
--
./bc006584.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006584.log: BACKTRACE: 22 stack frames:
./vpn-serv1.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./vpn-serv1.log: BACKTRACE: 15 stack frames:
--
./vpn-serv1.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./vpn-serv1.log: BACKTRACE: 22 stack frames:
./bc006971.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006971.log: BACKTRACE: 15 stack frames:
--
./bc006971.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006971.log: BACKTRACE: 22 stack frames:
--
./bc006962.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006962.log: BACKTRACE: 15 stack frames:
--
./bc006962.log-[2004/05/11 10:17:20, 0] lib/util.c:smb_panic2(1406)
./bc006962.log: BACKTRACE: 22 stack frames:
--
./bc006775.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006775.log: BACKTRACE: 15 stack frames:
--
./bc006775.log-[2004/05/11 10:17:21, 0] lib/util.c:smb_panic2(1406)
./bc006775.log: BACKTRACE: 22 stack frames:

And /var/log/messages:

May 11 10:10:00 linps1 /USR/SBIN/CRON[9028]: (root) CMD ( /usr/lib/sa/sa1 )
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] smbd/tdbutil.c:smbd_tdb_log(42)
May 11 10:15:18 linps1 smbd[9065]: tdb (/var/lib/samba/printing/HELPDESK.tdb): rec_free_read bad magic 0xfee1dead at offset=32764
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] smbd/tdbutil.c:smbd_tdb_log(48)
May 11 10:15:18 linps1 smbd[9065]: tdb_log: TDB /var/lib/samba/printing/HELPDESK.tdb is corrupt. Removing file and stopping this process.
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] lib/util.c:smb_panic2(1398)
May 11 10:15:18 linps1 smbd[9065]: PANIC: corrupt tdb
May 11 10:15:18 linps1 smbd[9065]:
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] lib/util.c:smb_panic2(1406)
May 11 10:15:18 linps1 smbd[9065]: BACKTRACE: 28 stack frames:
May 11 10:15:18 linps1 smbd[9065]: #0 [0x80560d98]
May 11 10:15:18 linps1 smbd[9065]: #1 [0x805ac862]
May 11 10:15:18 linps1 smbd[9065]: #2 [0x805ad31e]
May 11 10:15:18 linps1 smbd[9065]: #3 [0x805ae810]
May 11 10:15:18 linps1 smbd[9065]: #4 [0x805b5ee0]
May 11 10:15:18 linps1 smbd[9065]: #5 [0x805b6442]
May 11 10:15:18 linps1 smbd[9065]: #6 [0x805b7c7a]
May 11 10:15:18 linps1 smbd[9065]: #7 [0x804f5626]
May 11 10:15:18 linps1 smbd[9065]: #8 [0x804f7366]
May 11 10:15:18 linps1 smbd[9065]: #9 [0x804f7ac0]
May 11 10:15:18 linps1 smbd[9065]: #10 [0x804ea572]
May 11 10:15:18 linps1 smbd[9065]: #11 [0x8051c1ae]
May 11 10:15:18 linps1 smbd[9065]: #12 [0x8051be64]
May 11 10:15:18 linps1 smbd[9065]: #13 [0x80515a7e]
May 11 10:15:18 linps1 smbd[9065]: #14 [0x80515cca]
May 11 10:15:18 linps1 smbd[9065]: #15 [0x80515ffa]
May 11 10:15:18 linps1 smbd[9065]: #16 [0x80516258]
May 11 10:15:18 linps1 smbd[9065]: #17 [0x805161a4]
May 11 10:15:18 linps1 smbd[9065]: #18 [0x8045004a]
May 11 10:15:18 linps1 smbd[9065]: #19 [0x8045026c]
May 11 10:15:18 linps1 smbd[9065]: #20 [0x80450f18]
May 11 10:15:18 linps1 smbd[9065]: #21 [0x8049b2f2]
May 11 10:15:18 linps1 smbd[9065]: #22 [0x8049b3bc]
May 11 10:15:18 linps1 smbd[9065]: #23 [0x8049b7c8]
May 11 10:15:18 linps1 smbd[9065]: #24 [0x8049c3e6]
May 11 10:15:18 linps1 smbd[9065]: #25 [0x8060354a]
May 11 10:15:18 linps1 smbd[9065]: #26 [0xc02227bc]
May 11 10:15:18 linps1 smbd[9065]: #27 [0x8043a9d8]
May 11 10:15:18 linps1 smbd[9065]:
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] lib/fault.c:fault_report(36)
May 11 10:15:18 linps1 smbd[9065]: ===============================================================
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] lib/fault.c:fault_report(37)
May 11 10:15:18 linps1 smbd[9065]: INTERNAL ERROR: Signal 6 in pid 9065 (3.0.4-UL)
May 11 10:15:18 linps1 smbd[9065]: Please read the appendix Bugs of the Samba HOWTO collection
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] lib/fault.c:fault_report(39)
May 11 10:15:18 linps1 smbd[9065]: ===============================================================
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] lib/util.c:smb_panic2(1398)
May 11 10:15:18 linps1 smbd[9065]: PANIC: internal error
May 11 10:15:18 linps1 smbd[9065]: [2004/05/11 10:15:18, 0] lib/util.c:smb_panic2(1406)
May 11 10:15:18 linps1 smbd[9065]: BACKTRACE: 35 stack frames:
May 11 10:15:18 linps1 smbd[9065]: #0 [0x80599a58]
May 11 10:15:18 linps1 smbd[9065]: #1 [0x80585a6a]
May 11 10:15:18 linps1 smbd[9065]: #2 [0x80585ad0]
May 11 10:15:18 linps1 smbd[9065]: #3 [0xffffd8c0]
May 11 10:15:18 linps1 smbd[9065]: #4 [0xc000a994]
May 11 10:15:18 linps1 smbd[9065]: #5 [0xc0237faa]
May 11 10:15:18 linps1 smbd[9065]: #6 [0x80599c9e]
May 11 10:15:18 linps1 smbd[9065]: #7 [0x80560d98]
May 11 10:15:18 linps1 smbd[9065]: #8 [0x805ac862]
May 11 10:15:18 linps1 smbd[9065]: #9 [0x805ad31e]
May 11 10:15:18 linps1 smbd[9065]: #10 [0x805ae810]
May 11 10:15:18 linps1 smbd[9065]: #11 [0x805b5ee0]
May 11 10:15:18 linps1 smbd[9065]: #12 [0x805b6442]
May 11 10:15:18 linps1 smbd[9065]: #13 [0x805b7c7a]
May 11 10:15:18 linps1 smbd[9065]: #14 [0x804f5626]
May 11 10:15:18 linps1 smbd[9065]: #15 [0x804f7366]
May 11 10:15:18 linps1 smbd[9065]: #16 [0x804f7ac0]
May 11 10:15:18 linps1 smbd[9065]: #17 [0x804ea572]
May 11 10:15:18 linps1 smbd[9065]: #18 [0x8051c1ae]
May 11 10:15:18 linps1 smbd[9065]: #19 [0x8051be64]
May 11 10:15:18 linps1 smbd[9065]: #20 [0x80515a7e]
May 11 10:15:18 linps1 smbd[9065]: #21 [0x80515cca]
May 11 10:15:18 linps1 smbd[9065]: #22 [0x80515ffa]
May 11 10:15:18 linps1 smbd[9065]: #23 [0x80516258]
May 11 10:15:18 linps1 smbd[9065]: #24 [0x805161a4]
May 11 10:15:18 linps1 smbd[9065]: #25 [0x8045004a]
May 11 10:15:18 linps1 smbd[9065]: #26 [0x8045026c]
May 11 10:15:18 linps1 smbd[9065]: #27 [0x80450f18]
May 11 10:15:18 linps1 smbd[9065]: #28 [0x8049b2f2]
May 11 10:15:18 linps1 smbd[9065]: #29 [0x8049b3bc]
May 11 10:15:18 linps1 smbd[9065]: #30 [0x8049b7c8]
May 11 10:15:18 linps1 smbd[9065]: #31 [0x8049c3e6]
May 11 10:15:18 linps1 smbd[9065]: #32 [0x8060354a]
May 11 10:15:18 linps1 smbd[9065]: #33 [0xc02227bc]
May 11 10:15:18 linps1 smbd[9065]: #34 [0x8043a9d8]
May 11 10:15:18 linps1 smbd[9065]:
May 11 10:16:03 linps1 smbd[9074]: [2004/05/11 10:16:03, 0] smbd/connection.c:register_message_flags(220)
May 11 10:16:03 linps1 smbd[9074]: register_message_flags: tdb_fetch failed

~ Daniel
Jerry has checked in a patch that re-enables the "use mmap = no" workaround; it had been inadvertently changed with another merge. On the system where I recreated the problem, it no longer occurs. Also, Anthony has pointed out http://www.mail-archive.com/linux-390@vm.marist.edu/msg11467.html, and I will be engaging the IBM zLinux team next week on this.
The SuSE bug #36071 was closed, as the real problem is the one Jim mentioned in comment #31. Fixed kernels for SuSE Linux Enterprise Server (SLES) 8 should be available in around a week. If these updates don't fix your problem without setting 'use mmap = No', please reopen the bug report.
Sorry for the spam; cleaning up the database to prevent unnecessary reopens of bugs.