Some months back, I submitted the following Samba "bug" report:
However, the reported problem turned out to be due to an innocuous e1000 driver bug for which a patch/driver update was eventually released. For more information, see:
It turned out that the e1000 WINDOWS driver had an identical bug which was observed in network traces by the driver developers. This is not surprising as a significant portion of the driver codebase is shared with Linux. What is surprising, however, is that the problem, as originally reported, did NOT manifest itself when running a Windows XP "server" on the same hardware!
This would seem to imply that the Windows/CIFS stack is more resilient than Linux/Samba.
I appreciate that this is a very sweeping statement but it is probably reasonable to conclude that, given Windows has a more widespread deployment, particularly in the SME space (where troublesome networks are more likely to arise), its stack has probably been "fortified" at a more rapid pace than Linux/Samba over the years.
What I'm trying to achieve with this bug report is to place the spotlight on this gray area and perhaps move from the general to the specific.
As an example, I have recently been using the "reset on zero vc" option, which solved a nasty problem I was having with interference from fluorescent lights (although it did not solve the particular problem reported above). I am wondering if other similar "fortification" measures are feasible, possibly even in the kernel itself.
Here are some relevant excerpts from my conversations with one of the driver developers, which I reproduce here with his permission:
Q: The Windows driver does NOT exhibit this bug on the same hardware...it
is surely a good reference point for this and other nasty bugs showing up
in the Linux driver.
A: Interestingly we found that the same problem does exist in the Windows
driver, though it seems that a higher layer retry allows the session to
recover & continue rather than hang. I monitored the link and observed the
errored checksum using a windows XP server to be sure.
Q: Are you saying that the Windows TCP/IP stack is somehow aware of this potential historical checksum problem in NIC drivers and may retry sending potentially troublesome packets with different payloads?
A: I don’t think it’s likely to be an awareness of the checksum problem in the Windows stack; more likely the Windows TCP/IP stack is configured with different parameters, either statically by configuration or hard-coded implementation, or by the SMB service when the socket is opened or used. I’ll see if I kept the Windows traces that showed that the issue is still there with Windows, as the answer may be inferable from that. It would be explained, for instance, if we see that the Windows SMB/TCP packets for this test include the (optional) TCP timestamp field, as that would probably change when a TCP resend occurs, thus changing the required checksum. But I don’t know, and though I did notice the invalid checksums coming out of Windows as the Ghost server, I only noticed that the session paused briefly and seemed to recover quickly without any obvious effect on the session.
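To make the timestamp hypothesis above concrete, here is a rough sketch (not taken from the traces) of how the TCP checksum behaves on a resend. It builds a bare TCP segment by hand and computes the standard one's-complement checksum over the pseudo-header and segment. The addresses, ports, sequence numbers, payload and timestamp values are all invented for illustration, and the function names are hypothetical.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def tcp_segment(src_port, dst_port, seq, ack, payload, tsval=None, tsecr=0):
    """Build a TCP segment with the checksum field zeroed.

    If tsval is given, a timestamp option (kind 8, length 10) is
    appended, preceded by two NOPs for alignment.
    """
    options = b""
    if tsval is not None:
        options = struct.pack("!BBBBII", 1, 1, 8, 10, tsval, tsecr)
    data_offset = (20 + len(options)) // 4
    flags = 0x18  # PSH | ACK
    header = struct.pack(
        "!HHIIHHHH",
        src_port, dst_port, seq, ack,
        (data_offset << 12) | flags,
        65535,  # window
        0,      # checksum placeholder
        0,      # urgent pointer
    )
    return header + options + payload


def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over the IPv4 pseudo-header plus the TCP segment."""
    pseudo = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BBH", 0, socket.IPPROTO_TCP, len(segment))
    )
    return internet_checksum(pseudo + segment)


if __name__ == "__main__":
    src, dst = "192.0.2.1", "192.0.2.2"  # documentation addresses, made up
    payload = b"example SMB payload"     # stand-in data, not a real SMB frame

    # Without timestamps, a resend is byte-identical to the first attempt.
    first = tcp_checksum(src, dst, tcp_segment(1025, 445, 1000, 2000, payload))
    again = tcp_checksum(src, dst, tcp_segment(1025, 445, 1000, 2000, payload))
    print("no timestamps, same checksum:", first == again)

    # With timestamps, the resend carries a newer TSval.
    ts1 = tcp_checksum(src, dst, tcp_segment(1025, 445, 1000, 2000, payload, tsval=5000))
    ts2 = tcp_checksum(src, dst, tcp_segment(1025, 445, 1000, 2000, payload, tsval=5001))
    print("timestamps, same checksum:", ts1 == ts2)
```

Without the timestamp option the resent segment needs exactly the same checksum as the original, so a driver that miscalculates it for one particular bit pattern would presumably fail on every retry. With the option present, a newer timestamp changes the required checksum, which may be enough to step around the bad pattern. Whether this is what actually happens on Windows is not established in this thread.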
Thanks for the long report, but I don't think Samba can do anything about this. We sit on the kernel socket layer, there's no way to trigger a retry from user space.
Thank you for your reply. I cannot argue with what you say, and indeed the error shows up in the Samba log as a sudden client disconnect, but it still does not solve the problem.
This "bug report" is not so much out of idle curiosity. It is driven by practical experience on the ground with Samba, e.g. the dreaded XP "Delayed Write Failed" events occasionally rear their ugly head, although they arise with Windows servers as well, but not as frequently. You start to get a gut feeling about these things....
Sometimes "fortification" measures in the lower layers are driven by problems in higher layers and, as I already said, this is a gray area. It goes without saying that networks need to be within spec but, in real life, you get glitches and these will tend to arise more frequently in SME networks where Windows dominates. I mentioned the "reset on zero vc" configuration option already and this definitely had a positive impact on Linux/Samba resilience.
Can you perhaps point me to some areas that I might research further? I'm sure you must have had previous discussions with the kernel people on this subject.
I have already tried sysctl TCP settings but with no joy.
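For anyone wanting to compare settings between machines, a minimal sketch for dumping a few TCP sysctls on Linux follows. The selection of knobs is an assumption made for illustration (the settings actually tried are not listed here), and the function name is hypothetical. It only reads values and changes nothing.

```python
from pathlib import Path

# Assumed selection of knobs; not a list of what was actually tried.
TCP_KNOBS = ("tcp_timestamps", "tcp_sack", "tcp_retries2", "tcp_keepalive_time")


def read_tcp_sysctls(base="/proc/sys/net/ipv4", names=TCP_KNOBS):
    """Return {name: value} for each sysctl file under base.

    A knob that is missing or unreadable maps to None rather than
    raising, so the same script can run on different kernels.
    """
    values = {}
    for name in names:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_tcp_sysctls().items():
        print(f"net.ipv4.{name} = {value}")
```

Running this on the Samba server before and after a change gives a quick record of what was in effect during a given test.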
As I said: Samba can't do anything here. A driver bug is a driver bug.
I accept what you say wholeheartedly Volker...and perhaps this is not the right forum...but this innocuous little e1000 checksum bug has been in both the Linux and Windows drivers for years yet Windows works and Linux/Samba does not. If it was just this particular bug, I wouldn't care but I have noticed a greater occurrence of "Delayed Write Failed" errors in Samba installations compared to Windows so my whole point about "resilience" may have some merit...and this driver issue is simply a repeatable test case more than anything else.
It may not be Samba's fault but it is an issue for Samba users...we do not live in a vacuum.
I will try raising this in a more appropriate forum and if, in the meantime, you can give me any pointers on previous Kernel/Samba discussions on this topic, it would be appreciated.
Sorry, I have no clue who is the right person to fix your particular driver bug. If you have an "Enterprise Linux", you might go to SuSE, RH or Canonical. If you have built the kernel on your own, then the linux-kernel ML is the right forum. You might also want to look at the MAINTAINERS file in the Linux kernel; maybe someone feels responsible for your driver. If you downloaded your driver from a hardware vendor's website, then your hardware vendor is the right one to ask. If you don't have any of those options, you might want to contact a local Linux service company to help you across the road.
You see, there are tons of options you have, I just have no idea about your environment.
I think the kernel maintainers are probably the best option here.
But just to correct you on one point...
The driver bug has been *fixed* and that is no longer the issue here. However, the somewhat convoluted process of identifying and fixing this bug has prompted concerns about stack resilience...concerns that should not simply be brushed under the carpet (regardless of where the solution lies).
If I manage to make progress on this issue, I will report my findings on a more appropriate forum at a later time.
Thanks for taking the time to discuss this,