Bug 13751 - Samba 4.9.3 smbd coredumps in AIX
Status: NEW
Alias: None
Product: TDB
Classification: Unclassified
Component: libtdb
Version: unspecified
Hardware: PPC AIX
Importance: P5 normal
Target Milestone: ---
Assignee: Samba QA Contact
QA Contact: Samba QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-16 15:48 UTC by Ayappan
Modified: 2020-08-13 08:28 UTC

See Also:


Attachments
smbd-debug10 (28.78 KB, text/plain)
2019-01-18 15:07 UTC, Ayappan

Description Ayappan 2019-01-16 15:48:14 UTC
Running smbd (Samba 4.9.3) on AIX dumps core immediately.
The stack trace shows that the fault occurs inside tdb_write.

Details are below.

# /opt/freeware/sbin/smbd --interactive
smbd version 4.9.3 started.
Copyright Andrew Tridgell and the Samba Team 1992-2018
===============================================================
INTERNAL ERROR: Signal 10 in pid 11141180 (4.9.3)
Please read the Trouble-Shooting section of the Samba HOWTO
===============================================================
PANIC (pid 11141180): internal error
unable to produce a stack trace on this platform
dumping core in /var/log/samba/cores/smbd
IOT/Abort trap(coredump)



# dbx /opt/freeware/sbin/smbd core
Type 'help' for help.
warning: The core file is not a fullcore. Some info may
not be available.
[using memory image in core]
reading symbolic information ...warning: Unable to access the stab file. Some info may not be available


IOT/Abort trap in pthread_kill at 0xd0521f14
0xd0521f14 (pthread_kill+0xb4) 80410014         lwz   r2,0x14(r1)
(dbx) where
pthread_kill(??, ??) at 0xd0521f14
_p_raise(??) at 0xd0521348
raise.raise(??) at 0xd011f9c0
abort() at 0xd01af584
dump_core(), line 338 in "dumpcore.c"
smb_panic_s3(why = "internal error"), line 839 in "util.c"
smb_panic(why = "internal error"), line 170 in "fault.c"
fault_report(sig = 10), line 84 in "fault.c"
sig_fault(sig = 10), line 95 in "fault.c"
.() at 0xf014
tdb_write(tdb = 0x3003b618, off = 16380, buf = 0x2ff223f0, len = 4), line 223 in "io.c"
tdb_ofs_write(tdb = 0x3003b618, offset = 16380, d = 0x2ff22440), line 674 in "io.c"
update_tailer(tdb = 0x3003b618, offset = 696, rec = 0x2ff224e0), line 96 in "freelist.c"
tdb_free(tdb = 0x3003b618, offset = 696, rec = 0x2ff224e0), line 316 in "freelist.c"
tdb_expand(tdb = 0x3003b618, size = 15688), line 655 in "io.c"
tdb_allocate_from_freelist(tdb = 0x3003b618, length = 108, rec = 0x2ff22650), line 577 in "freelist.c"
tdb_allocate(tdb = 0x3003b618, hash = 1167830340, length = 83, rec = 0x2ff22650), line 664 in "freelist.c"
tdb._tdb_storev(tdb = 0x3003b618, key = (...), dbufs = 0x2ff227dc, num_dbufs = 1, flag = 1, hash = 1167830340), line 591 in "tdb.c"
tdb_storev(tdb = 0x3003b618, key = (...), dbufs = 0x2ff227dc, num_dbufs = 1, flag = 1), line 700 in "tdb.c"
db_tdb_storev(rec = 0x3003b418, dbufs = 0x2ff227dc, num_dbufs = 1, flag = 1), line 298 in "dbwrap_tdb.c"
dbwrap_record_storev(rec = 0x3003b418, dbufs = 0x2ff227dc, num_dbufs = 1, flags = 1), line 90 in "dbwrap.c"
dbwrap_record_store(rec = 0x3003b418, data = (...), flags = 1), line 99 in "dbwrap.c"
smbXsrv_version_global_init(server_id = 0x2ff22ac0), line 240 in "smbXsrv_version.c"
main(0x2, 0x2ff22c74) at 0x10003700
(dbx) quit
Comment 1 Ayappan 2019-01-17 10:02:29 UTC
Any update on this?
Comment 2 Ayappan 2019-01-17 14:28:19 UTC
Running truss on smbd shows the following:

statx("/var/locks", 0x2FF22688, 128, 011)       = 0
kopen("/var/locks/smbXsrv_version_global.tdb", O_RDWR|O_CREAT|O_LARGEFILE, S_IRUSR|S_IWUSR) = 13
kfcntl(13, F_GETFD, 0x00000000)                 = 0
kfcntl(13, F_SETFD, 0x00000001)                 = 0
kfcntl(13, 13, 0x2FF22250)                      = 0
kfcntl(13, 12, 0x2FF22250)                      = 0
kfcntl(13, 13, 0x2FF222B0)                      = 0
klseek(13, 0, 0, 0x00000000)                    = 0
kftruncate(13, 0x0000000000000000)              = 0
kwrite(13, " T D B   f i l e\n\0\0\0".., 696)   = 696
kfcntl(13, 13, 0x2FF222C0)                      = 0
klseek(13, 0, 0, 0x00000000)                    = 0
kread(13, " T D B   f i l e\n\0\0\0".., 168)    = 168
fstatx(13, 0x2FF22438, 128, 010)                = 0
fstatx(13, 0x2FF222A0, 128, 010)                = 0
kmmap(0x00000000, 696, 3, 1, 13, 0x00000000, 0x00000000) = 0xB006C000
kfcntl(13, 13, 0x2FF22250)                      = 0
kfcntl(13, 13, 0x2FF22250)                      = 0
kfcntl(13, 13, 0x2FF22250)                      = 0
fstatx(13, 0x2FF22658, 128, 010)                = 0
kfcntl(13, 13, 0x2FF22510)                      = 0
kfcntl(13, 13, 0x2FF22430)                      = 0
fstatx(13, 0x2FF22410, 128, 010)                = 0
munmap(0xB006C000, 696)                         = 0
kmmap(0x00000000, 696, 3, 1, 13, 0x00000000, 0x00000000) = 0xB006C000
fstatx(13, 0x2FF20338, 128, 010)                = 0
kioctl(13, -2147195273, 0x2FF20300, 0x00000000) = 1
kioctl(13, -2147195273, 0x2FF20300, 0x00000000) = 0
munmap(0xB006C000, 696)                         = 0
kmmap(0x00000000, 16384, 3, 1, 13, 0x00000000, 0x00000000) = 0xB006C000
    Received signal #10, SIGBUS [caught]
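
The final kmmap call maps 16384 bytes, yet the only write visible in this trace is the initial 696-byte kwrite, and the earlier backtrace faults in tdb_write at off = 16380. POSIX specifies SIGBUS for accesses to whole pages of a mapping that lie past the end of the underlying file, which would be consistent with the file never having been extended. A minimal sketch of a check for that condition, assuming a hypothetical helper named mapping_fits_file (it is not part of tdb):

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of tdb: returns 1 if a mapping of
 * map_len bytes starting at offset 0 is fully backed by the file,
 * 0 if it would extend past end of file, and -1 if fstat() fails.
 */
int mapping_fits_file(int fd, off_t map_len)
{
    struct stat st;

    if (fstat(fd, &st) != 0) {
        return -1;
    }
    return (st.st_size >= map_len) ? 1 : 0;
}
```

With the sizes from the trace, a 696-byte file passes the check for a 696-byte mapping and fails it for a 16384-byte one.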
Comment 3 Ayappan 2019-01-18 14:48:43 UTC
I cleaned up the system and started fresh.
I removed the secrets.tdb file as well.

Now I am getting an error with the secrets.tdb file.

# /opt/freeware/sbin/smbd -i
smbd version 4.9.3 started.
Copyright Andrew Tridgell and the Samba Team 1992-2018
tdb(/var/lib/samba/private/secrets.tdb): tdb_oob len 16408 beyond eof at 696
tdb(/var/lib/samba/private/secrets.tdb): tdb_transaction_recover: failed to read recovery record
Failed to open /var/lib/samba/private/secrets.tdb
tdb(/var/lib/samba/private/secrets.tdb): tdb_oob len 16408 beyond eof at 696
tdb(/var/lib/samba/private/secrets.tdb): tdb_transaction_recover: failed to read recovery record
Failed to open /var/lib/samba/private/secrets.tdb
exit_daemon: STATUS=daemon failed to start: smbd can not open secrets.tdb, error code 13


Any ideas would be really helpful.
Comment 4 Ayappan 2019-01-18 15:07:35 UTC
Created attachment 14789
smbd-debug10

Attaching the output of smbd -i with debug=10
Comment 5 Ayappan 2019-01-18 15:55:24 UTC
Removing the file "/var/lib/samba/private/secrets.tdb" again and running smbd -i -d10 results in the following error:

Attempting to register passdb backend tdbsam
Successfully added passdb backend 'tdbsam'
Found pdb backend tdbsam
pdb backend tdbsam has a valid init
tdb(/var/lib/samba/private/secrets.tdb): tdb_transaction_start: nesting 1
dbwrap_lock_order_lock: check lock order 1 for /var/lib/samba/private/secrets.tdb
lock order:  1:/var/lib/samba/private/secrets.tdb 2:<none> 3:<none>
dbwrap_lock_order_unlock: release lock order 1 for /var/lib/samba/private/secrets.tdb
tdb(/var/lib/samba/private/secrets.tdb): tdb_transaction_start: nesting 1
tdb(/var/lib/samba/private/secrets.tdb): tdb_transaction_setup_recovery: transaction data over new region boundary
tdb(/var/lib/samba/private/secrets.tdb): tdb_transaction_prepare_commit: failed to setup recovery data
PANIC (pid 56951290): could not start commit secrets db
unable to produce a stack trace on this platform
dumping core in /var/log/samba/cores/smbd
IOT/Abort trap(coredump)


# dbx /opt/freeware/sbin/smbd core
Type 'help' for help.
[using memory image in core]
reading symbolic information ...warning: Unable to access the stab file. Some info may not be available


IOT/Abort trap in pthread_kill at 0xd05833ec ($t1)
0xd05833ec (pthread_kill+0xac) 80410014            lwz   r2,0x14(r1)
(dbx) where
pthread_kill(??, ??) at 0xd05833ec
_p_raise(??) at 0xd05827c8
raise.raise(??) at 0xd01234a4
abort() at 0xd0189a18
dump_core(), line 338 in "dumpcore.c"
smb_panic_s3(why = "could not start commit secrets db"), line 839 in "util.c"
smb_panic(why = "could not start commit secrets db"), line 170 in "fault.c"
get_global_sam_sid(), line 217 in "machine_sid.c"
main(0x3, 0x2ff22ae0) at 0x10003694
Comment 6 Ayappan 2019-01-23 11:03:15 UTC
After some debugging, it seems that posix_fallocate is broken on AIX 6.1, 7.1 and 7.2.
File: lib/tdb/common/io.c

#if HAVE_POSIX_FALLOCATE
        ret = tdb_posix_fallocate(tdb, size, addition);
        if (ret == 0) {
                return 0;
        }
        if (ret == ENOSPC) {


#if HAVE_POSIX_FALLOCATE
static int tdb_posix_fallocate(struct tdb_context *tdb, off_t offset,
                               off_t len)
{
        ssize_t ret;

        if (!tdb_adjust_offset(tdb, &offset)) {
                return -1;
        }

        do {
                ret = posix_fallocate(tdb->fd, offset, len);
        } while ((ret == -1) && (errno == EINTR));


The call to posix_fallocate returns zero, but the size of the file (secrets.tdb) does not increase. Wrapping the code above in #if !defined(_AIX) (effectively disabling it on AIX) makes the daemon run again.

I see there is one more place where posix_fallocate is called.

File: source3/lib/system.c

int sys_posix_fallocate(int fd, off_t offset, off_t len)
{
#if defined(HAVE_POSIX_FALLOCATE)
        return posix_fallocate(fd, offset, len);
#elif defined(F_RESVSP64)
        /* this handles XFS on IRIX */
        struct flock64 fl;

I am not sure what impact disabling this call in the same way would have.
Maybe the community can shed some light here.
Comment 7 Björn Jacke 2019-01-23 11:07:18 UTC
If you are confident that this is an AIX bug, could you please file an upstream AIX bug for it and reference it here, so that the outcome can be followed?
Comment 8 Ayappan 2019-01-23 11:15:40 UTC
I am from the IBM AIX Toolbox development team.
This will be an internal defect, so there won't be any public URL to reference.
Comment 9 Björn Jacke 2019-02-19 11:50:52 UTC
We are not encountering this issue in the SerNet samba+ packages.
Comment 10 Andrew Bartlett 2019-04-23 05:19:32 UTC
(In reply to Ayappan from comment #8)
If this can be reproduced at will then a configure test could be written to detect this behaviour and then to blacklist the posix_fallocate() call.
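
A configure-style probe along these lines might look like the sketch below. It is an illustration only: probe_posix_fallocate is a hypothetical name, not an existing Samba configure check, and the sizes (a 696-byte file expanded by 15688 bytes) are simply the values seen in the backtrace and truss output earlier in this report.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical probe, not an existing Samba configure check.
 * Returns 0 if posix_fallocate() reports success and the file really
 * grew, 1 if it reports success but the size is not the expected one
 * (the behaviour described in comment 6), and -1 on any other failure.
 */
int probe_posix_fallocate(void)
{
    static const char header[696];   /* zero-filled stand-in for the initial file */
    const off_t addition = 15688;    /* expansion size seen in the backtrace */
    struct stat st;
    int result = -1;
    FILE *tmp = tmpfile();

    if (tmp == NULL) {
        return -1;
    }
    int fd = fileno(tmp);

    if (write(fd, header, sizeof(header)) == (ssize_t)sizeof(header) &&
        posix_fallocate(fd, (off_t)sizeof(header), addition) == 0 &&
        fstat(fd, &st) == 0) {
        result = (st.st_size == (off_t)sizeof(header) + addition) ? 0 : 1;
    }
    fclose(tmp);
    return result;
}
```

A build system could wrap this in a small main() and disable HAVE_POSIX_FALLOCATE when the result is 1. Because the probe has to run on the target, it would not help cross-compiled builds.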
Comment 11 Ayappan 2019-04-23 06:43:10 UTC
Thanks for the update.

A simple sample program using posix_fallocate works on AIX. This needs more analysis.
Comment 12 SATOH Fumiyasu 2019-04-23 07:06:46 UTC
My Samba 4.10.2 on AIX 7.2 with posix_fallocate support has no problem.

```
# file /usr/local/sbin/smbd
/usr/local/sbin/smbd: executable (RISC System/6000 V3.1) or obj module not stripped
# /usr/local/sbin/smbd -b |grep -i posix_fallocate
   HAVE_POSIX_FALLOCATE
   _POSIX_FALLOCATE_CAPABLE_LIBC
```

What filesystem are you using for /var/lib/samba/private/secrets.tdb?
`mount |grep /var`

Is your Samba a 32-bit or 64-bit binary?
`file /opt/freeware/sbin/smbd`
Comment 13 Ayappan 2019-04-23 07:28:54 UTC
Interesting!!

# mount |grep /var
         /dev/hd9var      /var             jfs2   Apr 23 12:10 rw,log=/dev/hd8
         /dev/livedump    /var/adm/ras/livedump jfs2   Apr 23 12:10 rw,log=/dev/hd8

The Samba build is 32-bit.

Which AIX level are you using (output of oslevel -s)? Mine is 7200-03-00-0000.

Can you run the command below on the tdb library and paste the output here?

# dump -Tov libtdb.so | grep posix
[32]    0x00000000    undef      IMP     DS EXTref   libc.a(shr.o) posix_fallocate
Comment 14 SATOH Fumiyasu 2019-04-23 07:47:59 UTC
(In reply to Ayappan from comment #13)

```
$ oslevel -s
7200-00-04-1717
$ dump -Tov libtdb.so |grep posix
[34]  0x00000000    undef      IMP     DS EXTref   libc.a(shr.o) posix_fallocate64
```

I've compiled Samba on AIX 7.2 with GCC 4.8.5, CPPFLAGS="$CPPFLAGS -D_LARGE_FILES -DHAVE_BROKEN_READLINK -D_UINTPTR_T_DEFINED=1" and the patches from bug #9557 and bug #10270. Without _LARGE_FILES and _UINTPTR_T_DEFINED in CPPFLAGS, the build fails.
Comment 15 Ayappan 2019-04-23 08:54:26 UTC
Thanks for the info.

I see "posix_fallocate64" in your case. AIX 7.2 provides posix_fallocate64 when _LARGE_FILES is defined, whereas AIX 6.1 does not, and my build was done on AIX 6.1.

Looks like the implementation could be wrong in AIX 6.1 
Need to check with AIX core team.
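
The two dump outputs differ in which libc symbol libtdb.so imports: posix_fallocate in one build and posix_fallocate64 in the build compiled with -D_LARGE_FILES. One quick way to see which file-offset model a given compilation uses is to check the width of off_t. The sketch below uses a hypothetical helper name, off_t_bits; the expectation that _LARGE_FILES yields a 64-bit off_t in a 32-bit AIX build is an assumption drawn from this thread and has not been verified here.

```c
#include <limits.h>
#include <sys/types.h>

/*
 * Hypothetical helper: width of off_t in bits for this compilation
 * unit. Build it with the same CPPFLAGS as the library under test.
 */
int off_t_bits(void)
{
    return (int)(sizeof(off_t) * CHAR_BIT);
}
```

Compiling this once with and once without -D_LARGE_FILES and comparing the results would show whether the flag changes the offset width on a given system.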
Comment 16 Björn Jacke 2020-08-13 08:28:49 UTC
(In reply to Ayappan from comment #15)
> Looks like the implementation could be wrong in AIX 6.1 
> Need to check with AIX core team.

Just for completeness: what was the final outcome? Can you say which AIX versions and OS levels have a broken posix_fallocate64 implementation and which have been fixed?