Normally the job needs ~200 seconds (about 3 minutes), but both of them have been running for 60 and 40 minutes so far!

root@uxdev02-olap:~> uname -a
SunOS uxdev02-olap 5.10 Generic_118855-36 i86pc i386 i86pc

load averages: 6.70, 6.41, 6.43; up 60+08:04:29 15:00:34
164 processes: 156 sleeping, 4 running, 4 on cpu
CPU states: 5.7% idle, 79.8% user, 14.5% kernel, 0.0% iowait, 0.0% swap
Memory: 16G phys mem, 427M free mem, 31G swap, 28G free swap

  PID USERNAME LWP PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
 6892 oradwh     2   1    0   45M   29M run     0:07 70.90% sqlldr
28639 oradwh     1   4    0 6940K 1720K cpu    35:33 59.93% smbclient
24903 oradwh     1   5    0 6940K 1736K cpu    60:46 56.79% smbclient

Here are the core dumps:

igalic@ixopud024006:~> dbx /opt/baw/bin/smbclient /opt/install/unix/temp/core.24903
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.6' in your .dbxrc
Reading smbclient
core file header read successfully
Reading ld.so.1
Reading libthread.so.1
Reading libsendfile.so.1
Reading libresolv.so.2
Reading libnsl.so.1
Reading libsocket.so.1
Reading libpopt.so.0.0.0
Reading libtalloc.so
Reading libtdb.so
Reading libwbclient.so
Reading libc.so.1
Reading UTF-16LE%CP850.so
Reading CP850%UTF-16LE.so
WARNING!!
A loadobject was found with an unexpected checksum value.
See `help core mismatch' for details, and run `proc -map'
to see what checksum values were expected and found.
dbx: warning: Some symbolic information might be incorrect.
t@1 (l@1) program terminated by signal 0 (UNKNOWN SIGNAL)
0xfed71447: _ptrace+0x009b: movl 0xfffffffd(%ebx,%ebx,4),%ebx
(dbx) where
current thread: t@1
=>[1] _ptrace(0x8046838, 0x0), at 0xfed71447
  [2] event_loop_once(0x84d7bf8, 0x80469f0, 0x8046988, 0x80d2d5e), at 0x817897d
  [3] cli_pull(0x84b6a60, 0xc002, 0x0, 0x0, 0xb4f2caf2, 0x0, 0x4000, 0x808d614, 0x80469f0, 0x8046a00, 0x1a4, 0x4), at 0x80d2dc2
  [4] do_get(0x84d7aa0, 0x84d78c0, 0x0, 0x0), at 0x808d8de
  [5] cmd_get(0x84614a8, 0x0, 0x0, 0x8409c68, 0x831d0f8, 0x831d0e0), at 0x808dc1e
  [6] process_command_string(0x84a38f0, 0x0, 0x8046ad8, 0x80947d4), at 0x8093a96
  [7] process(0x0, 0x8408c6c, 0x8046e58, 0x8095644), at 0x809483c
  [8] main(0xa, 0x8046e8c, 0x8046eb8, 0xfed9c6c0), at 0x8095753
(dbx) exit

igalic@ixopud024006:~> dbx /opt/baw/bin/smbclient /opt/install/unix/temp/core.28639
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.6' in your .dbxrc
Reading smbclient
core file header read successfully
Reading ld.so.1
Reading libthread.so.1
Reading libsendfile.so.1
Reading libresolv.so.2
Reading libnsl.so.1
Reading libsocket.so.1
Reading libpopt.so.0.0.0
Reading libtalloc.so
Reading libtdb.so
Reading libwbclient.so
Reading libc.so.1
Reading UTF-16LE%CP850.so
Reading CP850%UTF-16LE.so
WARNING!!
A loadobject was found with an unexpected checksum value.
See `help core mismatch' for details, and run `proc -map'
to see what checksum values were expected and found.
dbx: warning: Some symbolic information might be incorrect.
t@1 (l@1) program terminated by signal 0 (UNKNOWN SIGNAL)
0x08178980: event_loop_once+0x0070: leal 0xfffffee4(%ebp),%eax
(dbx) where
current thread: t@1
=>[1] event_loop_once(0x84d7bf8, 0x80469f0, 0x8046988, 0x80d2d5e), at 0x8178980
  [2] cli_pull(0x84b6a60, 0x4003, 0x0, 0x0, 0xb3edf96a, 0x0, 0x4000, 0x808d614, 0x80469f0, 0x8046a00, 0x1a4, 0x4), at 0x80d2dc2
  [3] do_get(0x84d7aa0, 0x84d78c0, 0x0, 0x0), at 0x808d8de
  [4] cmd_get(0x84614a8, 0x0, 0x0, 0x8409c68, 0x831d0f8, 0x831d0e0), at 0x808dc1e
  [5] process_command_string(0x84a38f0, 0x0, 0x8046ad8, 0x80947d4), at 0x8093a96
  [6] process(0x0, 0x8408c6c, 0x8046e58, 0x8095644), at 0x809483c
  [7] main(0xa, 0x8046e8c, 0x8046eb8, 0xfed9c6c0), at 0x8095753
(dbx) exit

Currently the zones where those scripts run are not DTrace-enabled; we could enable that if the information provided does not suffice.
Hmmmm. Unlikely, but do you have *anything* that the broken jobs have in common? Can you also do a quick truss -p <pid> to see what the processes do? Probably they're sitting in a gettimeofday/select loop, but I just want to make sure. Volker
Hi Volker,

I tried to truss some of the smbclient processes today, but I didn't get very far:

truss -p 2304

And that's essentially it. I'm not entirely sure whether that is, again, a limitation of zones, whether those processes were hung at all, or whether it was something completely different.
Created attachment 3498: Possible quick workaround

Hi! Can you try the attached patch as a quick workaround? This is not the general bug fix, but it might help. To fix the bug I need more information about the clients that hang. Ideally I would get a network sniff of such a client together with a debug level 10 log. Volker
Ok, I've just seen your "truss" comment. Does the backtrace always look the same? Can you recompile with -g, so that we can get a line number? Volker
Recompiled and redeployed the smbclient with --enable-developer [and -g for CFLAGS and LDFLAGS]. I don't suppose we'll see much activity over the weekend, but I'll ask our devs to modify their scripts so that they log with log level 10. We should see some results by Monday.
Update: It seems that we're running into a Heisenbug. When transferring with -d 10, we never run into a hang, but many of the files are corrupted and some of them aren't complete. Right now I have a hanging process, which gives me a rather poor dump:

root@uxdev02-olap:/dwh/develop/odi/log/save> ps -cafe | grep smbcl
    root 19132 10828  FSS  59 12:36:21 pts/4    0:00 grep smbcl
  oradwh 27069 27068  FSS   1 14:54:14 ?     1294:20 /opt/baw/bin/smbclient //AWT0DBSQL032/LOAD01 -I AWT0DBSQL032 -b 16384 -c get ST
root@uxdev02-olap:/dwh/develop/odi/log/save> pstack 27069/1
27069: /opt/baw/bin/smbclient //AWT0DBSQL032/LOAD01 -I AWT0DBSQL032 -b 16384
 fffffd7fff05d924 memset () + 114

[I've taken a core dump and will analyze it later today on a different machine, where I have gdb/dbx.]
Yes, a Heisenbug is very likely here. Debug level 10 changes timings a lot, and this is very likely a timing bug. Did you have a chance to test the workaround patch I sent? Volker
BTW, what server platform are you pulling from? Volker
As stated in the initial bug report, I'm pulling from Solaris/x86. Still:

root@uxdev02:~:> uname -a
SunOS uxdev02 5.10 Generic_118855-36 i86pc i386 i86pc
root@uxdev02:~:> isainfo -v
64-bit amd64 applications
        sse3 sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx cmov amd_sysc cx8 tsc fpu
32-bit i386 applications
        sse3 sse2 sse fxsr amd_3dnowx amd_3dnow amd_mmx mmx cmov amd_sysc cx8 tsc fpu
root@uxdev02:~:> prtdiag
System Configuration: Sun Microsystems Sun Fire X4500
BIOS Configuration: American Megatrends Inc. 080010 08/04/2006
BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)

==== Processor Sockets ====================================

Version                          Location Tag
-------------------------------- --------------------------
Dual Core AMD Opteron(tm) Processor 285 H0
Dual Core AMD Opteron(tm) Processor 285 H1

Volker: I did apply the patch and redeploy the package, so what we're looking at now is 3.2.0 plus your patch.
Sorry, I had thought that only your smbclient was running on Solaris; I did not know that the server side is also Solaris. What server are you running there? Is that also Samba, or is it the new Solaris in-kernel SMB server? Volker
Must've been a misunderstanding on my side - I thought you were talking about the cores. Here's the info on the server I'm talking to:

OS Name                          Microsoft(R) Windows(R) Server 2003 Enterprise x64 Edition
Version                          5.2.3790 Service Pack 2 Build 3790
Other OS Description             Not Available
OS Manufacturer                  Microsoft Corporation
System Name                      AWT0DBSQL032
System Manufacturer              HP
System Model                     ProLiant DL380 G5
System Type                      x64-based PC
Processor                        EM64T Family 6 Model 15 Stepping 6 GenuineIntel ~2667 Mhz
Processor                        EM64T Family 6 Model 15 Stepping 6 GenuineIntel ~2667 Mhz
Processor                        EM64T Family 6 Model 15 Stepping 11 GenuineIntel ~2667 Mhz
Processor                        EM64T Family 6 Model 15 Stepping 11 GenuineIntel ~2667 Mhz
BIOS Version/Date                HP P56, 21.08.2007
SMBIOS Version                   2.4
Windows Directory                C:\WINDOWS
System Directory                 C:\WINDOWS\system32
Boot Device                      \Device\HarddiskVolume1
Locale                           Austria
Hardware Abstraction Layer       Version = "5.2.3790.3959 (srv03_sp2_rtm.070216-1710)"
User Name                        Not Available
Time Zone                        GMT Daylight Time
Total Physical Memory            8.189,67 MB
Available Physical Memory        183,57 MB
Total Virtual Memory             9,57 GB
Available Virtual Memory         2,01 GB
Page File Space                  2,00 GB
Page File                        C:\pagefile.sys
Are your binaries compiled as 64-bit binaries? If yes, can you try to compile as 32-bit? During the alpha phase of 3.2 there were problems where I got off_t/size_t wrong. If it works fine using 32 bit, then this would be a hint that more of those are lurking in the code. Volker
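To illustrate the general class of off_t/size_t mix-up mentioned above, here is a minimal Python sketch of what happens when a 64-bit file offset ends up in a 32-bit variable. It is not taken from the Samba source and makes no claim about where such a defect might sit; the helper name truncate_to_32bit and the 5 GiB offset are made up for the example.

```python
import ctypes

GIB = 1024 ** 3


def truncate_to_32bit(offset: int) -> int:
    """Mimic storing a file offset in an unsigned 32-bit variable.

    ctypes integer types do no overflow checking, so the value is
    silently reduced modulo 2**32, the same way a C assignment to a
    32-bit unsigned type would behave.
    """
    return ctypes.c_uint32(offset).value


if __name__ == "__main__":
    offset = 5 * GIB  # hypothetical position inside a large file
    print("requested offset:", offset)
    print("offset after 32-bit truncation:", truncate_to_32bit(offset))
```

Offsets below 4 GiB pass through unchanged, which is why this kind of mistake typically only shows up with large files.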
Yes, of course.

root@uxdev02:~:> file /opt/baw/bin/smbclient
/opt/baw/bin/smbclient: ELF 64-bit LSB executable AMD64 Version 1 [SSE2 SSE FXSR AMD_3DNow CMOV FPU], dynamically linked, not stripped

I'll try to get it deployed as a 32-bit build as soon as possible, and report back.
FYI: I've just reproduced it on a 64-bit SPARC box. Investigating. Volker
http://git.samba.org/?p=samba.git;a=commitdiff;h=1558a5c1977b fixes the spinning smbclient for me. This happened if the network connection was broken for some reason. It does not clear up why the connection broke in the first place, but I would like to ask you to test this. Thanks, Volker
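The failure mode described here (a client that keeps spinning once the connection is gone) can be sketched generically in Python. This is not the Samba code and not the content of the commit above; it only shows the common pattern where select() keeps reporting a closed socket as readable, and a loop that ignores the resulting zero-byte read burns CPU forever. The names pull, handle_eof and max_iterations are hypothetical, and max_iterations exists only so the buggy variant terminates in a demo.

```python
import select
import socket


def pull(sock, expected, handle_eof=True, max_iterations=50):
    """Try to read `expected` bytes; return (data, iterations, error)."""
    received = bytearray()
    iterations = 0
    while len(received) < expected:
        iterations += 1
        if iterations > max_iterations:
            # Safety net for the demo: a real spinning loop never stops.
            return bytes(received), iterations, "spinning"
        readable, _, _ = select.select([sock], [], [], 1.0)
        if not readable:
            continue  # timeout, wait again
        chunk = sock.recv(expected - len(received))
        if not chunk and handle_eof:
            # A zero-byte read means the peer closed the connection.
            return bytes(received), iterations, "connection closed"
        # With handle_eof=False the empty read is ignored, and select()
        # will report the socket as readable again immediately.
        received.extend(chunk)
    return bytes(received), iterations, None


def demo(handle_eof):
    """Send 3 of 10 expected bytes, then break the connection."""
    client, server = socket.socketpair()
    server.sendall(b"abc")
    server.close()
    try:
        return pull(client, 10, handle_eof=handle_eof)
    finally:
        client.close()


if __name__ == "__main__":
    print(demo(handle_eof=True))
    print(demo(handle_eof=False))
```

With handle_eof=True the loop stops as soon as the broken connection is noticed; with handle_eof=False it runs until the artificial iteration cap, which is the busy-loop behaviour a process monitor would show as a process pinned on the CPU.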
Should I test this compiled for 64 bits, as before, or for 32 bits? Also, I don't have a machine where I could build x86_64 stuff right now. I hope this will change by Monday.
This is not related to 32-bit vs. 64-bit. Volker
Any updates here? Have you been able to test the patch? Volker
Unfortunately I've been sick and hence out of business for the past (two?) weeks. I'm now back and will look into getting this into testing as soon as possible. (Unless you have any updates that invalidate it.) So long, Igor
No, no updates from my side. I just want to close this bug report with a hopefully positive ack from you. Volker
Closing out bug report. Please re-open if it's still an issue for you. Thank you very much for reporting!