Bug 5707 - smbclient process sometimes hangs up
Status: RESOLVED FIXED
Product: Samba 3.2
Classification: Unclassified
Component: Client tools
Version: 3.2.0
Hardware: x64 Solaris
Importance: P3 normal
Target Milestone: ---
Assigned To: Samba Bugzilla Account
QA Contact: Samba QA Contact
Reported: 2008-08-20 07:48 UTC by Igor Galić
Modified: 2008-09-21 22:24 UTC (History)

Attachments
Possible quick workaround (525 bytes, patch)
2008-08-21 05:19 UTC, Volker Lendecke

Description Igor Galić 2008-08-20 07:48:14 UTC
Normally the job takes about 200 seconds (roughly 3 minutes), but both of these smbclient processes have been running for 60 and 40 minutes so far!

root@uxdev02-olap:~> uname -a
SunOS uxdev02-olap 5.10 Generic_118855-36 i86pc i386 i86pc

load averages:  6.70,  6.41,  6.43;                    up 60+08:04:29                                                      15:00:34
164 processes: 156 sleeping, 4 running, 4 on cpu
CPU states:  5.7% idle, 79.8% user, 14.5% kernel,  0.0% iowait,  0.0% swap
Memory: 16G phys mem, 427M free mem, 31G swap, 28G free swap

   PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
  6892 oradwh     2   1    0   45M   29M run      0:07 70.90% sqlldr
 28639 oradwh     1   4    0 6940K 1720K cpu     35:33 59.93% smbclient
 24903 oradwh     1   5    0 6940K 1736K cpu     60:46 56.79% smbclient


Here are the core dumps:

igalic@ixopud024006:~> dbx /opt/baw/bin/smbclient /opt/install/unix/temp/core.24903
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.6' in your .dbxrc
Reading smbclient
core file header read successfully
Reading ld.so.1
Reading libthread.so.1
Reading libsendfile.so.1
Reading libresolv.so.2
Reading libnsl.so.1
Reading libsocket.so.1
Reading libpopt.so.0.0.0
Reading libtalloc.so
Reading libtdb.so
Reading libwbclient.so
Reading libc.so.1
Reading UTF-16LE%CP850.so
Reading CP850%UTF-16LE.so
WARNING!!
A loadobject was found with an unexpected checksum value.
See `help core mismatch' for details, and run `proc -map'
to see what checksum values were expected and found.
dbx: warning: Some symbolic information might be incorrect.
t@1 (l@1) program terminated by signal 0 (UNKNOWN SIGNAL)
0xfed71447: _ptrace+0x009b:     movl     0xfffffffd(%ebx,%ebx,4),%ebx
(dbx) where
current thread: t@1
=>[1] _ptrace(0x8046838, 0x0), at 0xfed71447
  [2] event_loop_once(0x84d7bf8, 0x80469f0, 0x8046988, 0x80d2d5e), at 0x817897d
  [3] cli_pull(0x84b6a60, 0xc002, 0x0, 0x0, 0xb4f2caf2, 0x0, 0x4000, 0x808d614, 0x80469f0, 0x8046a00, 0x1a4, 0x4), at 0x80d2dc2
  [4] do_get(0x84d7aa0, 0x84d78c0, 0x0, 0x0), at 0x808d8de
  [5] cmd_get(0x84614a8, 0x0, 0x0, 0x8409c68, 0x831d0f8, 0x831d0e0), at 0x808dc1e
  [6] process_command_string(0x84a38f0, 0x0, 0x8046ad8, 0x80947d4), at 0x8093a96
  [7] process(0x0, 0x8408c6c, 0x8046e58, 0x8095644), at 0x809483c
  [8] main(0xa, 0x8046e8c, 0x8046eb8, 0xfed9c6c0), at 0x8095753
(dbx) exit

igalic@ixopud024006:~> dbx /opt/baw/bin/smbclient /opt/install/unix/temp/core.28639
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.6' in your .dbxrc
Reading smbclient
core file header read successfully
Reading ld.so.1
Reading libthread.so.1
Reading libsendfile.so.1
Reading libresolv.so.2
Reading libnsl.so.1
Reading libsocket.so.1
Reading libpopt.so.0.0.0
Reading libtalloc.so
Reading libtdb.so
Reading libwbclient.so
Reading libc.so.1
Reading UTF-16LE%CP850.so
Reading CP850%UTF-16LE.so
WARNING!!
A loadobject was found with an unexpected checksum value.
See `help core mismatch' for details, and run `proc -map'
to see what checksum values were expected and found.
dbx: warning: Some symbolic information might be incorrect.
t@1 (l@1) program terminated by signal 0 (UNKNOWN SIGNAL)
0x08178980: event_loop_once+0x0070:     leal     0xfffffee4(%ebp),%eax
(dbx) where
current thread: t@1
=>[1] event_loop_once(0x84d7bf8, 0x80469f0, 0x8046988, 0x80d2d5e), at 0x8178980
  [2] cli_pull(0x84b6a60, 0x4003, 0x0, 0x0, 0xb3edf96a, 0x0, 0x4000, 0x808d614, 0x80469f0, 0x8046a00, 0x1a4, 0x4), at 0x80d2dc2
  [3] do_get(0x84d7aa0, 0x84d78c0, 0x0, 0x0), at 0x808d8de
  [4] cmd_get(0x84614a8, 0x0, 0x0, 0x8409c68, 0x831d0f8, 0x831d0e0), at 0x808dc1e
  [5] process_command_string(0x84a38f0, 0x0, 0x8046ad8, 0x80947d4), at 0x8093a96
  [6] process(0x0, 0x8408c6c, 0x8046e58, 0x8095644), at 0x809483c
  [7] main(0xa, 0x8046e8c, 0x8046eb8, 0xfed9c6c0), at 0x8095753
(dbx) exit


Currently, the zones where those scripts run are not DTrace-enabled. We could enable it if the information provided does not suffice.
Comment 1 Volker Lendecke 2008-08-20 09:29:47 UTC
Hmmmm. Unlikely, but do you have *anything* that the broken jobs have in common? Can you also do a quick truss -p <pid> to see what the processes do? Probably they're sitting in a gettimeofday/select loop, but I just want to make sure.

Volker
Comment 2 Igor Galić 2008-08-20 14:20:05 UTC
Hi Volker,

I tried to truss some of the smbclient processes today, but I didn't get very far:
truss -p 2304


And that's essentially it. I'm not entirely sure if that is, again, a limitation of zones, if those processes were hung at all, or if it was something completely different.
Comment 3 Volker Lendecke 2008-08-21 05:19:00 UTC
Created attachment 3498
Possible quick workaround

Hi!

Can you try the attached patch as a quick workaround? This is not the general bug fix, but it might help. To fix the bug I need more information about the clients that hang. Ideally, I would get a network sniff of such a client together with a debug level 10 log.

Volker
Comment 4 Volker Lendecke 2008-08-21 05:24:09 UTC
Ok, just seen your "truss" comment. Does the backtrace always show the same? Can you recompile with -g, so that we can get a line number?

Volker
Comment 5 Igor Galić 2008-08-22 09:15:28 UTC
Recompiled and redeployed smbclient with --enable-developer [and -g for CFLAGS and LDFLAGS]. I don't suppose we'll see much activity over the weekend, but I'll ask our devs to modify their scripts so that they log at log level 10, so we should see some results by Monday.
Comment 6 Igor Galić 2008-09-02 07:38:30 UTC
Update:

It seems that we're running into a Heisenbug.
When transferring with -d 10, we never run into a hang, but many of the files are corrupted and some of them are incomplete.

I have, right now, a hanging process, which gives me a rather poor dump:
root@uxdev02-olap:/dwh/develop/odi/log/save> ps -cafe | grep smbcl
    root 19132 10828  FSS  59 12:36:21 pts/4       0:00 grep smbcl
  oradwh 27069 27068  FSS   1 14:54:14 ?        1294:20 /opt/baw/bin/smbclient //AWT0DBSQL032/LOAD01 -I AWT0DBSQL032 -b 16384 -c get ST
root@uxdev02-olap:/dwh/develop/odi/log/save> pstack 27069/1
27069:  /opt/baw/bin/smbclient //AWT0DBSQL032/LOAD01 -I AWT0DBSQL032 -b 16384
 fffffd7fff05d924 memset () + 114

[I've taken a core dump and will analyze it later today on a different machine, where I have gdb/dbx.]
Comment 7 Volker Lendecke 2008-09-02 08:20:20 UTC
Yes, a heisenbug is very likely here. Debug level 10 changes timings a lot, and this is very likely a timing bug. Did you have a chance to test the workaround patch I sent?

Volker
Comment 8 Volker Lendecke 2008-09-02 08:31:34 UTC
BTW, what server platform are you pulling from?

Volker
Comment 9 Igor Galić 2008-09-03 03:23:35 UTC
As stated in the initial bug report, I'm pulling from Solaris/x86. Still:
root@uxdev02:~:> uname -a
SunOS uxdev02 5.10 Generic_118855-36 i86pc i386 i86pc
root@uxdev02:~:> isainfo -v
64-bit amd64 applications
        sse3 sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx cmov amd_sysc cx8
        tsc fpu
32-bit i386 applications
        sse3 sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx cmov amd_sysc cx8
        tsc fpu
root@uxdev02:~:> prtdiag
System Configuration: Sun Microsystems Sun Fire X4500
BIOS Configuration: American Megatrends Inc. 080010  08/04/2006
BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)

==== Processor Sockets ====================================

Version                          Location Tag
-------------------------------- --------------------------
Dual Core AMD Opteron(tm) Processor 285 H0
Dual Core AMD Opteron(tm) Processor 285 H1


Volker: I did apply the patch, and redeploy the package, so what we're looking at now, is 3.2.0 plus your patch.

Comment 10 Volker Lendecke 2008-09-03 03:30:14 UTC
Sorry, I had thought that your smbclient was running on Solaris, I did not know that the server side is also Solaris. What server are you running there? Is that also Samba, or is it the new Solaris in-kernel SMB server?

Volker
Comment 11 Igor Galić 2008-09-03 04:16:04 UTC
Must've been a misunderstanding on my side - I thought you were talking about the cores:

Here's the info on the server I'm talking to:
OS Name	Microsoft(R) Windows(R) Server 2003 Enterprise x64 Edition
Version	5.2.3790 Service Pack 2 Build 3790
Other OS Description 	Not Available
OS Manufacturer	Microsoft Corporation
System Name	AWT0DBSQL032
System Manufacturer	HP
System Model	ProLiant DL380 G5
System Type	x64-based PC
Processor	EM64T Family 6 Model 15 Stepping 6 GenuineIntel ~2667 Mhz
Processor	EM64T Family 6 Model 15 Stepping 6 GenuineIntel ~2667 Mhz
Processor	EM64T Family 6 Model 15 Stepping 11 GenuineIntel ~2667 Mhz
Processor	EM64T Family 6 Model 15 Stepping 11 GenuineIntel ~2667 Mhz
BIOS Version/Date	HP P56, 21.08.2007
SMBIOS Version	2.4
Windows Directory	C:\WINDOWS
System Directory	C:\WINDOWS\system32
Boot Device	\Device\HarddiskVolume1
Locale	Austria
Hardware Abstraction Layer	Version = "5.2.3790.3959 (srv03_sp2_rtm.070216-1710)"
User Name	Not Available
Time Zone	GMT Daylight Time
Total Physical Memory	8.189,67 MB
Available Physical Memory	183,57 MB
Total Virtual Memory	9,57 GB
Available Virtual Memory	2,01 GB
Page File Space	2,00 GB
Page File	C:\pagefile.sys
Comment 12 Volker Lendecke 2008-09-04 07:42:32 UTC
Are your binaries compiled as 64-bit binaries? If yes, can you try to compile as 32-bit? During the alpha phase of 3.2 there were problems where I got off_t/size_t wrong. If it works fine using 32 bit, then this would be a hint that more of those are lurking in the code.

Volker
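The off_t/size_t hypothesis raised in this comment can be illustrated with a minimal sketch. It is not taken from the Samba source: the helper names `truncated_offset` and `first_wrapped_offset` are hypothetical, and the 16 KiB chunk size only mirrors the `-b 16384` option seen in the report. The sketch shows how a 64-bit file offset silently wraps when it is stored in a 32-bit unsigned field, which is the general class of mistake being described.

```python
import ctypes

CHUNK = 0x4000  # 16 KiB per read request (assumption, mirrors "-b 16384")


def truncated_offset(offset):
    """Return the offset as it would appear after being stored in a
    32-bit unsigned field (hypothetical helper, not Samba code)."""
    # ctypes integer types do no overflow checking; high bits are dropped.
    return ctypes.c_uint32(offset).value


def first_wrapped_offset(file_size, chunk=CHUNK):
    """Return the first chunk offset that no longer survives a 32-bit
    round trip, or None if every offset in the file fits."""
    for off in range(0, file_size, chunk):
        if truncated_offset(off) != off:
            return off
    return None


if __name__ == "__main__":
    print(hex(truncated_offset(0x1_0000_4000)))
    print(first_wrapped_offset(2**30))
    print(first_wrapped_offset(2**32 + 2 * CHUNK))
```

A client with such a mistake would request the wrong region of any file larger than 4 GiB, so comparing a 32-bit and a 64-bit build is a cheap way to check whether this class of bug is involved.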
Comment 13 Igor Galić 2008-09-04 08:23:17 UTC
Yes, of course.

root@uxdev02:~:> file /opt/baw/bin/smbclient
/opt/baw/bin/smbclient: ELF 64-bit LSB executable AMD64 Version 1 [SSE2 SSE FXSR AMD_3DNow CMOV FPU], dynamically linked, not stripped

I'll try to get it deployed with 32bit as soon as possible, and report back.
Comment 14 Volker Lendecke 2008-09-04 10:16:42 UTC
FYI: I've just reproduced it on a sparc 64-bit box. Investigating.

Volker
Comment 15 Volker Lendecke 2008-09-05 05:10:02 UTC
http://git.samba.org/?p=samba.git;a=commitdiff;h=1558a5c1977b fixes the spinning smbclient for me. This happened if the network connection was broken for some reason. It does not clear up why the connection broke in the first place, but I would like to ask you to test this.

Thanks,

Volker
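How a broken connection can turn into a busy loop is sketched below. This is an illustration only and not the actual Samba fix from the commit above: `drain`, `demo`, and `max_wakeups` are hypothetical names, and a local socket pair stands in for the network connection. Once the peer has closed, `select` reports the socket as readable on every call and `recv` returns an empty result, so a loop that does not treat that result as end-of-stream never blocks again and burns CPU.

```python
import select
import socket


def drain(sock, handle_eof, max_wakeups=1000):
    """Read from sock via select(); return (bytes_read, select_wakeups).

    max_wakeups is a safety cap so the buggy variant terminates here.
    """
    total = 0
    wakeups = 0
    while wakeups < max_wakeups:
        readable, _, _ = select.select([sock], [], [], 1.0)
        if not readable:
            break  # timeout: nothing arrived within one second
        wakeups += 1
        data = sock.recv(4096)
        if data:
            total += len(data)
        elif handle_eof:
            break  # empty read means the peer closed the connection
        # Buggy variant: the empty read is ignored, and select() will
        # report the socket as readable again immediately.
    return total, wakeups


def demo(handle_eof):
    """Send 100 bytes, close the sending side, then drain the other end."""
    a, b = socket.socketpair()
    try:
        b.sendall(b"x" * 100)
        b.close()  # simulate the connection going away
        return drain(a, handle_eof)
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print("with EOF handling:   ", demo(handle_eof=True))
    print("without EOF handling:", demo(handle_eof=False))
```

Without the cap, the variant that ignores the empty read would loop forever at full CPU, which matches the symptom of smbclient processes accumulating CPU time while transferring nothing.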
Comment 16 Igor Galić 2008-09-05 09:18:09 UTC
Should I test this compiled for 64-bit, as before, or for 32-bit?

Also, I don't have a machine where I could build x86_64 binaries right now. I hope this will change by Monday.
Comment 17 Volker Lendecke 2008-09-05 09:23:46 UTC
This is not related to 32-bit vs. 64-bit.

Volker
Comment 18 Volker Lendecke 2008-09-12 08:52:33 UTC
Any updates here? Have you been able to test the patch?

Volker
Comment 19 Igor Galić 2008-09-18 03:22:16 UTC
Unfortunately I've been sick and hence out of business for the past (two?) weeks.

I'm now back and will look into getting this into testing as soon as possible.
(Unless you have any updates, which invalidate it.)

So long,
Igor
Comment 20 Volker Lendecke 2008-09-18 03:25:32 UTC
No, no updates from my side. I just want to close this bug report with a hopefully positive ack from you.

Volker
Comment 21 Karolin Seeger 2008-09-21 22:24:35 UTC
Closing out bug report.
Please re-open if it's still an issue for you.

Thank you very much for reporting!