Running smbd (Samba 4.9.3) on AIX immediately dumps core. The stack trace shows that the fault occurs inside tdb_write. Details are below.

# /opt/freeware/sbin/smbd --interactive
smbd version 4.9.3 started.
Copyright Andrew Tridgell and the Samba Team 1992-2018
===============================================================
INTERNAL ERROR: Signal 10 in pid 11141180 (4.9.3)
Please read the Trouble-Shooting section of the Samba HOWTO
===============================================================
PANIC (pid 11141180): internal error
unable to produce a stack trace on this platform
dumping core in /var/log/samba/cores/smbd
IOT/Abort trap(coredump)

# dbx /opt/freeware/sbin/smbd core
Type 'help' for help.
warning: The core file is not a fullcore. Some info may not be available.
[using memory image in core]
reading symbolic information ...warning: Unable to access the stab file. Some info may not be available

IOT/Abort trap in pthread_kill at 0xd0521f14
0xd0521f14 (pthread_kill+0xb4) 80410014 lwz r2,0x14(r1)
(dbx) where
pthread_kill(??, ??) at 0xd0521f14
_p_raise(??) at 0xd0521348
raise.raise(??) at 0xd011f9c0
abort() at 0xd01af584
dump_core(), line 338 in "dumpcore.c"
smb_panic_s3(why = "internal error"), line 839 in "util.c"
smb_panic(why = "internal error"), line 170 in "fault.c"
fault_report(sig = 10), line 84 in "fault.c"
sig_fault(sig = 10), line 95 in "fault.c"
.() at 0xf014
tdb_write(tdb = 0x3003b618, off = 16380, buf = 0x2ff223f0, len = 4), line 223 in "io.c"
tdb_ofs_write(tdb = 0x3003b618, offset = 16380, d = 0x2ff22440), line 674 in "io.c"
update_tailer(tdb = 0x3003b618, offset = 696, rec = 0x2ff224e0), line 96 in "freelist.c"
tdb_free(tdb = 0x3003b618, offset = 696, rec = 0x2ff224e0), line 316 in "freelist.c"
tdb_expand(tdb = 0x3003b618, size = 15688), line 655 in "io.c"
tdb_allocate_from_freelist(tdb = 0x3003b618, length = 108, rec = 0x2ff22650), line 577 in "freelist.c"
tdb_allocate(tdb = 0x3003b618, hash = 1167830340, length = 83, rec = 0x2ff22650), line 664 in "freelist.c"
tdb._tdb_storev(tdb = 0x3003b618, key = (...), dbufs = 0x2ff227dc, num_dbufs = 1, flag = 1, hash = 1167830340), line 591 in "tdb.c"
tdb_storev(tdb = 0x3003b618, key = (...), dbufs = 0x2ff227dc, num_dbufs = 1, flag = 1), line 700 in "tdb.c"
db_tdb_storev(rec = 0x3003b418, dbufs = 0x2ff227dc, num_dbufs = 1, flag = 1), line 298 in "dbwrap_tdb.c"
dbwrap_record_storev(rec = 0x3003b418, dbufs = 0x2ff227dc, num_dbufs = 1, flags = 1), line 90 in "dbwrap.c"
dbwrap_record_store(rec = 0x3003b418, data = (...), flags = 1), line 99 in "dbwrap.c"
smbXsrv_version_global_init(server_id = 0x2ff22ac0), line 240 in "smbXsrv_version.c"
main(0x2, 0x2ff22c74) at 0x10003700
(dbx) quit
Any update on this?
Running truss on smbd shows the following:

statx("/var/locks", 0x2FF22688, 128, 011) = 0
kopen("/var/locks/smbXsrv_version_global.tdb", O_RDWR|O_CREAT|O_LARGEFILE, S_IRUSR|S_IWUSR) = 13
kfcntl(13, F_GETFD, 0x00000000) = 0
kfcntl(13, F_SETFD, 0x00000001) = 0
kfcntl(13, 13, 0x2FF22250) = 0
kfcntl(13, 12, 0x2FF22250) = 0
kfcntl(13, 13, 0x2FF222B0) = 0
klseek(13, 0, 0, 0x00000000) = 0
kftruncate(13, 0x0000000000000000) = 0
kwrite(13, " T D B f i l e\n\0\0\0".., 696) = 696
kfcntl(13, 13, 0x2FF222C0) = 0
klseek(13, 0, 0, 0x00000000) = 0
kread(13, " T D B f i l e\n\0\0\0".., 168) = 168
fstatx(13, 0x2FF22438, 128, 010) = 0
fstatx(13, 0x2FF222A0, 128, 010) = 0
kmmap(0x00000000, 696, 3, 1, 13, 0x00000000, 0x00000000) = 0xB006C000
kfcntl(13, 13, 0x2FF22250) = 0
kfcntl(13, 13, 0x2FF22250) = 0
kfcntl(13, 13, 0x2FF22250) = 0
fstatx(13, 0x2FF22658, 128, 010) = 0
kfcntl(13, 13, 0x2FF22510) = 0
kfcntl(13, 13, 0x2FF22430) = 0
fstatx(13, 0x2FF22410, 128, 010) = 0
munmap(0xB006C000, 696) = 0
kmmap(0x00000000, 696, 3, 1, 13, 0x00000000, 0x00000000) = 0xB006C000
fstatx(13, 0x2FF20338, 128, 010) = 0
kioctl(13, -2147195273, 0x2FF20300, 0x00000000) = 1
kioctl(13, -2147195273, 0x2FF20300, 0x00000000) = 0
munmap(0xB006C000, 696) = 0
kmmap(0x00000000, 16384, 3, 1, 13, 0x00000000, 0x00000000) = 0xB006C000
Received signal #10, SIGBUS [caught]
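The trace ends with a 16384-byte mapping of a file that kwrite had only filled to 696 bytes, and no write or truncate that would extend it appears in between. That is consistent with a SIGBUS on the first store past the backed part of the file. A minimal sketch of a guard against that situation follows. `map_checked` is a hypothetical helper made up for illustration; it is not part of tdb and does not claim to mirror how tdb maps its files.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper, not part of tdb: map the first `len` bytes of `fd`
 * read/write and shared, but only if fstat() confirms that the file is
 * really at least `len` bytes long.
 * Returns NULL if the file is too short or a system call fails. */
static void *map_checked(int fd, size_t len)
{
	struct stat st;
	void *p;

	if (len == 0 || fstat(fd, &st) != 0) {
		return NULL;
	}
	if (st.st_size < 0 || (unsigned long long)st.st_size < len) {
		/* Touching pages of a shared file mapping that lie wholly
		 * beyond end-of-file can raise SIGBUS, so refuse to map. */
		return NULL;
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	return (p == MAP_FAILED) ? NULL : p;
}
```

With the sizes from the trace, `map_checked(fd, 16384)` would return NULL for a 696-byte file, while `map_checked(fd, 696)` would succeed.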
I just cleaned up the system and started fresh, removing the secrets.tdb file as well. Now I am getting an error with the secrets.tdb file.

# /opt/freeware/sbin/smbd -i
smbd version 4.9.3 started.
Copyright Andrew Tridgell and the Samba Team 1992-2018
tdb(/var/lib/samba/private/secrets.tdb): tdb_oob len 16408 beyond eof at 696
tdb(/var/lib/samba/private/secrets.tdb): tdb_transaction_recover: failed to read recovery record
Failed to open /var/lib/samba/private/secrets.tdb
tdb(/var/lib/samba/private/secrets.tdb): tdb_oob len 16408 beyond eof at 696
tdb(/var/lib/samba/private/secrets.tdb): tdb_transaction_recover: failed to read recovery record
Failed to open /var/lib/samba/private/secrets.tdb
exit_daemon: STATUS=daemon failed to start: smbd can not open secrets.tdb, error code 13

Any ideas would be really helpful.
Created attachment 14789: smbd-debug10

Attaching the output of smbd -i with debug=10.
Removing the file "/var/lib/samba/private/secrets.tdb" again and running smbd -i -d10 results in the error below.

Attempting to register passdb backend tdbsam
Successfully added passdb backend 'tdbsam'
Found pdb backend tdbsam
pdb backend tdbsam has a valid init
tdb(/var/lib/samba/private/secrets.tdb): tdb_transaction_start: nesting 1
dbwrap_lock_order_lock: check lock order 1 for /var/lib/samba/private/secrets.tdb
lock order: 1:/var/lib/samba/private/secrets.tdb 2:<none> 3:<none>
dbwrap_lock_order_unlock: release lock order 1 for /var/lib/samba/private/secrets.tdb
tdb(/var/lib/samba/private/secrets.tdb): tdb_transaction_start: nesting 1
tdb(/var/lib/samba/private/secrets.tdb): tdb_transaction_setup_recovery: transaction data over new region boundary
tdb(/var/lib/samba/private/secrets.tdb): tdb_transaction_prepare_commit: failed to setup recovery data
PANIC (pid 56951290): could not start commit secrets db
unable to produce a stack trace on this platform
dumping core in /var/log/samba/cores/smbd
IOT/Abort trap(coredump)

# dbx /opt/freeware/sbin/smbd core
Type 'help' for help.
[using memory image in core]
reading symbolic information ...warning: Unable to access the stab file. Some info may not be available

IOT/Abort trap in pthread_kill at 0xd05833ec ($t1)
0xd05833ec (pthread_kill+0xac) 80410014 lwz r2,0x14(r1)
(dbx) where
pthread_kill(??, ??) at 0xd05833ec
_p_raise(??) at 0xd05827c8
raise.raise(??) at 0xd01234a4
abort() at 0xd0189a18
dump_core(), line 338 in "dumpcore.c"
smb_panic_s3(why = "could not start commit secrets db"), line 839 in "util.c"
smb_panic(why = "could not start commit secrets db"), line 170 in "fault.c"
get_global_sam_sid(), line 217 in "machine_sid.c"
main(0x3, 0x2ff22ae0) at 0x10003694
After some debugging, it seems that posix_fallocate is broken on AIX 6.1, 7.1 and 7.2.

File: lib/tdb/common/io.c (excerpt starting at line 416)

    #if HAVE_POSIX_FALLOCATE
            ret = tdb_posix_fallocate(tdb, size, addition);
            if (ret == 0) {
                    return 0;
            }
            if (ret == ENOSPC) {

The helper it calls (excerpt starting at line 99):

    #if HAVE_POSIX_FALLOCATE
    static int tdb_posix_fallocate(struct tdb_context *tdb, off_t offset,
                                   off_t len)
    {
            ssize_t ret;

            if (!tdb_adjust_offset(tdb, &offset)) {
                    return -1;
            }

            do {
                    ret = posix_fallocate(tdb->fd, offset, len);
            } while ((ret == -1) && (errno == EINTR));

The call to posix_fallocate returns zero, but the size of the file (secrets.tdb) does not increase. Wrapping the code above in #if !defined(_AIX), which effectively disables it, makes the daemon run again.

I see there is one more place where posix_fallocate is called.

File: source3/lib/system.c (excerpt starting at line 439)

    int sys_posix_fallocate(int fd, off_t offset, off_t len)
    {
    #if defined(HAVE_POSIX_FALLOCATE)
            return posix_fallocate(fd, offset, len);
    #elif defined(F_RESVSP64)
            /* this handles XFS on IRIX */
            struct flock64 fl;

I am not sure what impact it would have if we disabled this call in the same way. Maybe the community can shed some light here.
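The workaround described above amounts to skipping posix_fallocate on AIX and growing the file some other way. A minimal sketch of that idea follows. `expand_file` is a hypothetical helper made up for illustration; it is not the tdb code, and tdb's own fallback path is not reproduced here. It assumes that `HAVE_POSIX_FALLOCATE` comes from the build system, as in the excerpts above. Note that POSIX specifies that posix_fallocate reports failure through its return value (an error number) rather than through errno.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not the tdb implementation: make sure `fd` is at
 * least `size + addition` bytes long.
 * Returns 0 on success or an errno-style error number on failure. */
static int expand_file(int fd, off_t size, off_t addition)
{
	struct stat st;
	off_t want = size + addition;

	if (fstat(fd, &st) != 0) {
		return errno;
	}
	if (st.st_size >= want) {
		return 0; /* already large enough, nothing to do */
	}

#if defined(HAVE_POSIX_FALLOCATE) && !defined(_AIX)
	{
		/* posix_fallocate() returns an error number, not -1. */
		int ret = posix_fallocate(fd, size, addition);
		if (ret == ENOSPC) {
			return ret;
		}
		/* Trust a zero return only if the size really changed. */
		if (ret == 0 && fstat(fd, &st) == 0 && st.st_size >= want) {
			return 0;
		}
	}
#endif

	/* Fallback: extend with ftruncate(). This may create a sparse
	 * file, so unlike posix_fallocate() it does not reserve blocks. */
	if (ftruncate(fd, want) != 0) {
		return errno;
	}
	if (fstat(fd, &st) != 0) {
		return errno;
	}
	return (st.st_size >= want) ? 0 : EIO;
}
```

With the values from the first stack trace, `expand_file(fd, 696, 15688)` should leave a 16384-byte file. Verifying the size with fstat after a successful return is the design choice that matters here, since the reported failure is a zero return without any growth.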
If you are confident that this is an AIX bug, can you please file a bug with AIX upstream and reference it here, so that the result can be followed?
I am from the IBM AIX Toolbox development team. This will be an internal defect, and there won't be any public URL to reference it.
We are not encountering this issue in the SerNet samba+ packages.
(In reply to Ayappan from comment #8) If this can be reproduced at will then a configure test could be written to detect this behaviour and then to blacklist the posix_fallocate() call.
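A minimal sketch of what the core of such a test could look like follows. `fallocate_grows_file` and its return convention are hypothetical and made up for illustration; this is not an existing Samba configure check. It is also an assumption that the failure shows up in a program this small, since it may depend on build flags and on the filesystem.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical probe, not an existing Samba check.
 * Creates a temporary file from the mkstemp() template `tmpl`, sets it to
 * `initial` bytes, then asks posix_fallocate() for `addition` more bytes.
 * Returns  1 if posix_fallocate() returned 0 and the file grew,
 *          0 if it returned 0 but the file did NOT grow (the reported bug),
 *         -1 if the probe itself could not run or posix_fallocate() failed. */
static int fallocate_grows_file(char *tmpl, off_t initial, off_t addition)
{
	struct stat st;
	int result = -1;
	int fd = mkstemp(tmpl);

	if (fd == -1) {
		return -1;
	}
	if (ftruncate(fd, initial) == 0 &&
	    posix_fallocate(fd, initial, addition) == 0 &&
	    fstat(fd, &st) == 0) {
		result = (st.st_size >= initial + addition) ? 1 : 0;
	}
	close(fd);
	unlink(tmpl);
	return result;
}
```

A configure-style program would call this from main with a writable template such as "/tmp/fallocXXXXXX" and exit non-zero when the result is 0, so that the build could then leave HAVE_POSIX_FALLOCATE undefined.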
Thanks for the update. A simple sample program using posix_fallocate works on AIX. This needs more analysis.
My Samba 4.10.2 on AIX 7.2 with posix_fallocate support has no problem.

```
# file /usr/local/sbin/smbd
/usr/local/sbin/smbd: executable (RISC System/6000 V3.1) or obj module not stripped
# /usr/local/sbin/smbd -b |grep -i posix_fallocate
HAVE_POSIX_FALLOCATE
_POSIX_FALLOCATE_CAPABLE_LIBC
```

What filesystem are you using for /var/lib/samba/private/secrets.tdb? `mount |grep /var`

Is your Samba a 32-bit or a 64-bit binary? `file /opt/freeware/sbin/smbd`
Interesting!

# mount |grep /var
/dev/hd9var /var jfs2 Apr 23 12:10 rw,log=/dev/hd8
/dev/livedump /var/adm/ras/livedump jfs2 Apr 23 12:10 rw,log=/dev/hd8

The Samba build is 32-bit.

Which AIX level are you using, as reported by oslevel -s? (Mine is 7200-03-00-0000.)

Can you run the command below on the tdb library and paste the output here?

# dump -Tov libtdb.so | grep posix
[32] 0x00000000 undef IMP DS EXTref libc.a(shr.o) posix_fallocate
(In reply to Ayappan from comment #13)

```
$ oslevel -s
7200-00-04-1717
$ dump -Tov libtdb.so |grep posix
[34] 0x00000000 undef IMP DS EXTref libc.a(shr.o) posix_fallocate64
```

I've compiled Samba on AIX 7.2 with GCC 4.8.5, CPPFLAGS="$CPPFLAGS -D_LARGE_FILES -DHAVE_BROKEN_READLINK -D_UINTPTR_T_DEFINED=1" and the patches from bugs #9557 and #10270. Without _LARGE_FILES and _UINTPTR_T_DEFINED in CPPFLAGS, the build fails.
Thanks for the info. I see "posix_fallocate64" in your case. AIX 7.2 provides posix_fallocate64 when "_LARGE_FILES" is defined, whereas AIX 6.1 does not, and my build is on AIX 6.1. It looks like the implementation could be wrong in AIX 6.1. I need to check with the AIX core team.
(In reply to Ayappan from comment #15)
> Looks like the implementation could be wrong in AIX 6.1
> Need to check with AIX core team.

Just for completeness: what was the final outcome of this? Can you say which AIX versions and OS levels have a broken posix_fallocate64 implementation and which no longer do?