Bug 5707 - smbclient process sometimes hangs up

Status:	RESOLVED FIXED

Alias:	None

Product:	Samba 3.2
Classification:	Unclassified
Component:	Client tools
Version:	3.2.0
Hardware:	x64 Solaris

Importance:	P3 normal
Target Milestone:	---
Assignee:	Samba Bugzilla Account
QA Contact:	Samba QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2008-08-20 07:48 UTC by Igor Galić
Modified:	2008-09-21 22:24 UTC
CC List:	0 users

See Also:

Attachments
Possible quick workaround (525 bytes, patch), 2008-08-21 05:19 UTC, Volker Lendecke

Description Igor Galić 2008-08-20 07:48:14 UTC

Normally the job takes ~200 seconds (about 3 minutes), but the two smbclient processes shown below have been running for 60 and 40 minutes so far!

root@uxdev02-olap:~> uname -a
SunOS uxdev02-olap 5.10 Generic_118855-36 i86pc i386 i86pc

load averages:  6.70,  6.41,  6.43;                    up 60+08:04:29                                                      15:00:34
164 processes: 156 sleeping, 4 running, 4 on cpu
CPU states:  5.7% idle, 79.8% user, 14.5% kernel,  0.0% iowait,  0.0% swap
Memory: 16G phys mem, 427M free mem, 31G swap, 28G free swap

   PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
  6892 oradwh     2   1    0   45M   29M run      0:07 70.90% sqlldr
 28639 oradwh     1   4    0 6940K 1720K cpu     35:33 59.93% smbclient
 24903 oradwh     1   5    0 6940K 1736K cpu     60:46 56.79% smbclient


Here are the core dumps:

igalic@ixopud024006:~> dbx /opt/baw/bin/smbclient /opt/install/unix/temp/core.24903
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.6' in your .dbxrc
Reading smbclient
core file header read successfully
Reading ld.so.1
Reading libthread.so.1
Reading libsendfile.so.1
Reading libresolv.so.2
Reading libnsl.so.1
Reading libsocket.so.1
Reading libpopt.so.0.0.0
Reading libtalloc.so
Reading libtdb.so
Reading libwbclient.so
Reading libc.so.1
Reading UTF-16LE%CP850.so
Reading CP850%UTF-16LE.so
WARNING!!
A loadobject was found with an unexpected checksum value.
See `help core mismatch' for details, and run `proc -map'
to see what checksum values were expected and found.
dbx: warning: Some symbolic information might be incorrect.
t@1 (l@1) program terminated by signal 0 (UNKNOWN SIGNAL)
0xfed71447: _ptrace+0x009b:     movl     0xfffffffd(%ebx,%ebx,4),%ebx
(dbx) where
current thread: t@1
=>[1] _ptrace(0x8046838, 0x0), at 0xfed71447
  [2] event_loop_once(0x84d7bf8, 0x80469f0, 0x8046988, 0x80d2d5e), at 0x817897d
  [3] cli_pull(0x84b6a60, 0xc002, 0x0, 0x0, 0xb4f2caf2, 0x0, 0x4000, 0x808d614, 0x80469f0, 0x8046a00, 0x1a4, 0x4), at 0x80d2dc2
  [4] do_get(0x84d7aa0, 0x84d78c0, 0x0, 0x0), at 0x808d8de
  [5] cmd_get(0x84614a8, 0x0, 0x0, 0x8409c68, 0x831d0f8, 0x831d0e0), at 0x808dc1e
  [6] process_command_string(0x84a38f0, 0x0, 0x8046ad8, 0x80947d4), at 0x8093a96
  [7] process(0x0, 0x8408c6c, 0x8046e58, 0x8095644), at 0x809483c
  [8] main(0xa, 0x8046e8c, 0x8046eb8, 0xfed9c6c0), at 0x8095753
(dbx) exit

igalic@ixopud024006:~> dbx /opt/baw/bin/smbclient /opt/install/unix/temp/core.28639
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.6' in your .dbxrc
Reading smbclient
core file header read successfully
Reading ld.so.1
Reading libthread.so.1
Reading libsendfile.so.1
Reading libresolv.so.2
Reading libnsl.so.1
Reading libsocket.so.1
Reading libpopt.so.0.0.0
Reading libtalloc.so
Reading libtdb.so
Reading libwbclient.so
Reading libc.so.1
Reading UTF-16LE%CP850.so
Reading CP850%UTF-16LE.so
WARNING!!
A loadobject was found with an unexpected checksum value.
See `help core mismatch' for details, and run `proc -map'
to see what checksum values were expected and found.
dbx: warning: Some symbolic information might be incorrect.
t@1 (l@1) program terminated by signal 0 (UNKNOWN SIGNAL)
0x08178980: event_loop_once+0x0070:     leal     0xfffffee4(%ebp),%eax
(dbx) where
current thread: t@1
=>[1] event_loop_once(0x84d7bf8, 0x80469f0, 0x8046988, 0x80d2d5e), at 0x8178980
  [2] cli_pull(0x84b6a60, 0x4003, 0x0, 0x0, 0xb3edf96a, 0x0, 0x4000, 0x808d614, 0x80469f0, 0x8046a00, 0x1a4, 0x4), at 0x80d2dc2
  [3] do_get(0x84d7aa0, 0x84d78c0, 0x0, 0x0), at 0x808d8de
  [4] cmd_get(0x84614a8, 0x0, 0x0, 0x8409c68, 0x831d0f8, 0x831d0e0), at 0x808dc1e
  [5] process_command_string(0x84a38f0, 0x0, 0x8046ad8, 0x80947d4), at 0x8093a96
  [6] process(0x0, 0x8408c6c, 0x8046e58, 0x8095644), at 0x809483c
  [7] main(0xa, 0x8046e8c, 0x8046eb8, 0xfed9c6c0), at 0x8095753
(dbx) exit


Currently the zones where those scripts run are not DTrace-enabled. We could enable DTrace there if the information provided does not suffice.

Comment 1 Volker Lendecke 2008-08-20 09:29:47 UTC

Hmmmm. Unlikely, but do you have *anything* that the broken jobs have in common? Can you also do a quick truss -p <pid> to see what the processes do? Probably they're sitting in a gettimeofday/select loop, but I just want to make sure.

Volker

Comment 2 Igor Galić 2008-08-20 14:20:05 UTC

Hi Volker,

I tried to truss some of the smbclient processes today, but I didn't get very far:
truss -p 2304


And that's essentially it. I'm not entirely sure if that is, again, a limitation of zones, if those processes were hung at all, or if it was something completely different.

Comment 3 Volker Lendecke 2008-08-21 05:19:00 UTC

Created attachment 3498
Possible quick workaround

Hi!

Can you try the attached patch as a quick workaround? This is not the general bug fix, but it might help. To fix the bug I need more information about the clients that hang. Ideally that would be a network sniff of such a client together with a debug level 10 log.

Volker

Comment 4 Volker Lendecke 2008-08-21 05:24:09 UTC

Ok, just seen your "truss" comment. Does the backtrace always show the same? Can you recompile with -g, so that we can get a line number?

Volker

Comment 5 Igor Galić 2008-08-22 09:15:28 UTC

Recompiled and redeployed the smbclient with --enable-developer [and -g for CFLAGS and LDFLAGS]. I don't suppose we'll see much activity over the weekend, but I'll ask our devs to modify their scripts to log at log level 10, so we should see some results by Monday.

Comment 6 Igor Galić 2008-09-02 07:38:30 UTC

Update:

It seems that we're running into a Heisenbug.
When transferring with -d 10, we never run into a hang, but many of the files are corrupted and some of them are incomplete.

I have, right now, a hanging process, which gives me a rather poor dump:
root@uxdev02-olap:/dwh/develop/odi/log/save> ps -cafe | grep smbcl
    root 19132 10828  FSS  59 12:36:21 pts/4       0:00 grep smbcl
  oradwh 27069 27068  FSS   1 14:54:14 ?        1294:20 /opt/baw/bin/smbclient //AWT0DBSQL032/LOAD01 -I AWT0DBSQL032 -b 16384 -c get ST
root@uxdev02-olap:/dwh/develop/odi/log/save> pstack 27069/1
27069:  /opt/baw/bin/smbclient //AWT0DBSQL032/LOAD01 -I AWT0DBSQL032 -b 16384
 fffffd7fff05d924 memset () + 114

[I've taken a core dump and will analyze it later today on a different machine, where I have gdb/dbx.]

Comment 7 Volker Lendecke 2008-09-02 08:20:20 UTC

Yes, a heisenbug here is very likely. debug level 10 changes timings a lot, and this is very likely a timing bug. Did you have a chance to test the workaround patch I sent?

Volker

Comment 8 Volker Lendecke 2008-09-02 08:31:34 UTC

BTW, what server platform are you pulling from?

Volker

Comment 9 Igor Galić 2008-09-03 03:23:35 UTC

As stated in the initial bug report, I'm pulling from Solaris/x86. Still:
root@uxdev02:~:> uname -a
SunOS uxdev02 5.10 Generic_118855-36 i86pc i386 i86pc
root@uxdev02:~:> isainfo -v
64-bit amd64 applications
        sse3 sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx cmov amd_sysc cx8
        tsc fpu
32-bit i386 applications
        sse3 sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx cmov amd_sysc cx8
        tsc fpu
root@uxdev02:~:> prtdiag
System Configuration: Sun Microsystems Sun Fire X4500
BIOS Configuration: American Megatrends Inc. 080010  08/04/2006
BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)

==== Processor Sockets ====================================

Version                          Location Tag
-------------------------------- --------------------------
Dual Core AMD Opteron(tm) Processor 285 H0
Dual Core AMD Opteron(tm) Processor 285 H1


Volker: I did apply the patch and redeployed the package, so what we're looking at now is 3.2.0 plus your patch.

Comment 10 Volker Lendecke 2008-09-03 03:30:14 UTC

Sorry, I had thought that your smbclient was running on Solaris, I did not know that the server side is also Solaris. What server are you running there? Is that also Samba, or is it the new Solaris in-kernel SMB server?

Volker

Comment 11 Igor Galić 2008-09-03 04:16:04 UTC

Must've been a misunderstanding on my side - I thought you were talking about the cores:

Here's the info on the server I'm talking to:
OS Name	Microsoft(R) Windows(R) Server 2003 Enterprise x64 Edition
Version	5.2.3790 Service Pack 2 Build 3790
Other OS Description 	Not Available
OS Manufacturer	Microsoft Corporation
System Name	AWT0DBSQL032
System Manufacturer	HP
System Model	ProLiant DL380 G5
System Type	x64-based PC
Processor	EM64T Family 6 Model 15 Stepping 6 GenuineIntel ~2667 Mhz
Processor	EM64T Family 6 Model 15 Stepping 6 GenuineIntel ~2667 Mhz
Processor	EM64T Family 6 Model 15 Stepping 11 GenuineIntel ~2667 Mhz
Processor	EM64T Family 6 Model 15 Stepping 11 GenuineIntel ~2667 Mhz
BIOS Version/Date	HP P56, 21.08.2007
SMBIOS Version	2.4
Windows Directory	C:\WINDOWS
System Directory	C:\WINDOWS\system32
Boot Device	\Device\HarddiskVolume1
Locale	Austria
Hardware Abstraction Layer	Version = "5.2.3790.3959 (srv03_sp2_rtm.070216-1710)"
User Name	Not Available
Time Zone	GMT Daylight Time
Total Physical Memory	8.189,67 MB
Available Physical Memory	183,57 MB
Total Virtual Memory	9,57 GB
Available Virtual Memory	2,01 GB
Page File Space	2,00 GB
Page File	C:\pagefile.sys

Comment 12 Volker Lendecke 2008-09-04 07:42:32 UTC

Are your binaries compiled as 64-bit binaries? If yes, can you try to compile as 32-bit? During the alpha phase of 3.2 there were problems where I got off_t/size_t wrong. If it works fine using 32 bit, then this would be a hint that more of those are lurking in the code.

Volker
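
To illustrate the class of off_t/size_t mistake mentioned above, here is a minimal sketch, assuming a file offset held in a 64-bit type gets stored in a 32-bit one somewhere along the way. This is not Samba code. `as_32bit` and `type_widths` are hypothetical helper names, and Python's ctypes is used only to mimic C integer widths.

```python
import ctypes


def as_32bit(value):
    """Return value as it would survive in an unsigned 32-bit integer.

    ctypes integer types do no overflow checking, so the upper bits are
    silently dropped, much like an implicit narrowing conversion in C.
    """
    return ctypes.c_uint32(value).value


def type_widths():
    """Widths in bytes of a few C types on the platform running this script."""
    return {
        "size_t": ctypes.sizeof(ctypes.c_size_t),
        "long": ctypes.sizeof(ctypes.c_long),
        "long long": ctypes.sizeof(ctypes.c_longlong),
    }


if __name__ == "__main__":
    offset = 5 * 2**30  # a 5 GiB file offset (hypothetical value)
    print("widths:", type_widths())
    print("offset:", offset, "after 32-bit truncation:", as_32bit(offset))
```

Offsets below 4 GiB pass through unchanged, so a mistake of this kind only shows up with large files or on builds where the type widths differ, which is why comparing a 32-bit and a 64-bit build can be a useful hint.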

Comment 13 Igor Galić 2008-09-04 08:23:17 UTC

Yes, of course.

root@uxdev02:~:> file /opt/baw/bin/smbclient
/opt/baw/bin/smbclient: ELF 64-bit LSB executable AMD64 Version 1 [SSE2 SSE FXSR AMD_3DNow CMOV FPU], dynamically linked, not stripped

I'll try to get it deployed as a 32-bit build as soon as possible, and report back.

Comment 14 Volker Lendecke 2008-09-04 10:16:42 UTC

FYI: I've just reproduced it on a 64-bit SPARC box. Investigating.

Volker

Comment 15 Volker Lendecke 2008-09-05 05:10:02 UTC

http://git.samba.org/?p=samba.git;a=commitdiff;h=1558a5c1977b fixes the spinning smbclient for me. This happened if the network connection was broken for some reason. It does not clear up why the connection broke in the first place, but I would like to ask you to test this.

Thanks,

Volker
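
A minimal sketch of how a select-based read loop can spin once the connection is gone, assuming a plain stream socket. This is not the Samba event loop and not the referenced commit. `count_wakeups`, `pull` and `demo` are hypothetical names. After the peer closes, select() keeps reporting the socket as readable and recv() keeps returning an empty result, so a loop that only waits for "more data" never sleeps and never finishes. Treating the empty read as an error ends the transfer instead.

```python
import select
import socket


def count_wakeups(sock, rounds=5, timeout=0.2):
    """Count how often select() reports sock readable although recv() is empty."""
    empty_reads = 0
    for _ in range(rounds):
        readable, _, _ = select.select([sock], [], [], timeout)
        if readable and sock.recv(4096) == b"":
            empty_reads += 1
    return empty_reads


def pull(sock, expected, timeout=5.0):
    """Read exactly `expected` bytes, failing fast if the peer disappears."""
    chunks = []
    remaining = expected
    while remaining > 0:
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable:
            raise TimeoutError("no data within timeout")
        data = sock.recv(min(remaining, 4096))
        if data == b"":
            # End of stream: without this check the loop would spin forever.
            raise ConnectionError("closed with %d bytes missing" % remaining)
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks)


def demo():
    """Simulate a server that sends 3 bytes and then drops the connection."""
    client, server = socket.socketpair()
    server.sendall(b"abc")
    server.close()
    try:
        first = pull(client, 3)
        wakeups = count_wakeups(client)
        try:
            pull(client, 1)
            failed_fast = False
        except ConnectionError:
            failed_fast = True
    finally:
        client.close()
    return first, wakeups, failed_fast


if __name__ == "__main__":
    print(demo())
```

In the demo every one of the five select() rounds wakes up immediately with nothing to read, which is the pattern that shows up as a process burning CPU inside its event loop.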

Comment 16 Igor Galić 2008-09-05 09:18:09 UTC

Should I test this compiled for 64 bits, as until now, or for 32 bits?

Also, I don't have a machine where I could build x86_64 stuff right now. I hope this will change by Monday.

Comment 17 Volker Lendecke 2008-09-05 09:23:46 UTC

This is not related to 32-bit vs. 64-bit.

Volker

Comment 18 Volker Lendecke 2008-09-12 08:52:33 UTC

Any updates here? Have you been able to test the patch?

Volker

Comment 19 Igor Galić 2008-09-18 03:22:16 UTC

Unfortunately I've been sick and hence out of business for the past (two?) weeks.

I'm now back and will look into getting this into testing as soon as possible.
(Unless you have any updates, which invalidate it.)

So long,
Igor

Comment 20 Volker Lendecke 2008-09-18 03:25:32 UTC

No, no updates from my side. I just want to close this bug report with a hopefully positive ack from you.

Volker

Comment 21 Karolin Seeger 2008-09-21 22:24:35 UTC

Closing out bug report.
Please re-open if it's still an issue for you.

Thank you very much for reporting!