I work at a university. Each day hundreds if not thousands of students use our computer labs and have a Samba drive mapped for their own personal use. They are given a 20 Megabyte quota.

What has been happening is that a student with their disk quota near full capacity will open an MS Word document, add a lot of text and/or images, and then go to save the document, filling their disk quota in the process. What happens next is very, very ugly. The save to the Samba share fails with Windows XP reporting a "delayed write failed" at the lower right of the desktop, the original document on the Samba share is now corrupt, and the document is also corrupted in Word in memory. It's actually quite the sight to behold: all the text on the screen in Word just transforms to garbage right before your eyes.

The production setup we use is Samba running on Solaris/Sparc. The student directories are served to the Samba server from a NetApp NFS server and then re-exported as Samba shares. However, I've been chipping away at this issue through much experimentation. It's not an NFS problem, or a quota problem, or a Solaris-specific problem. I can replicate it on x86 Red Hat Enterprise Linux 4 without quotas, just using a nearly full local disk partition.

The same scenario saving to Microsoft servers always results in Word just popping up a friendly disk full error, not corrupting the document on disk or in memory, and allowing you to save elsewhere, or delete some stuff to make room and then save again. It's been tested saving to MS Server 2003, Win 2K, and XP Pro.

To reproduce:

1. Fill an entire disk partition that is being served out by Samba to just under capacity. About 1 Megabyte free is perfect.
2. On your Windows XP or Win 2K system accessing that Samba share, put a smallish MS Word document onto it if there isn't one there already. The test document I've been using is 29 kilobytes.
3. Open the document in MS Word 2002 or 2003.
4. Make the document too large to be saved. I do this by browsing my hard drive to a several hundred kilobyte jpg file, right-clicking and selecting copy, then pasting into Word 2 or 3 or however many times is necessary so the document will be too large to save.
5. Attempt to save the document. Get a delayed write failure error.
6. Watch all the text on the screen in Word corrupt to garbage before your eyes.
7. Try to open the original document from the Samba share. It's completely corrupt as well.

Not only have you lost your new edits, you've lost the entire original document as well. Very nasty!
Can you reproduce this problem if you set the smb.conf parameter:

    strict allocate = true

in the [global] or share-specific part of the smb.conf? If you can, I'd like to see an ethereal capture trace of the problem. I'll try and reproduce this once I get back to the USA.

Jeremy.
(In reply to comment #1)

Wow! Hi Jeremy, I didn't expect to hear from you so soon, if ever, seeing as how you're employed by Novell now. They are lucky to have you. You may find this amusing: we used to be big Novell users around here for many years. Then, about 4 years ago, I demoed Samba for my boss's boss. He fell in love with it immediately. The rest is history, as they say. Over the course of the next 20-odd months the entire campus was migrated off Novell and onto Samba. :)

Back to the issue at hand: yes, I do have "strict allocate = yes" enabled in the global section. I came up with that one on my own in the course of trying to solve this. It didn't solve it.

I could do some packet captures. For them to be of any use I guess I'd have to do two, one to a Samba server and one to a Windows server, and you could try and see what's different. Honestly that would be somewhat of a hassle though. If you are going to be back from Germany relatively soon I'd probably just as soon wait. I do think you'll be able to replicate this without much trouble. When are you expecting to come back? If you are going to be over there a while I'll see about gathering up some traces to send your way.

Thank you,
Tom Schaefer
Tom, what would really help is to see a trace from a Windows client to a Windows server returning the disk full error. I need to see exactly when Windows returns this so we can match the call. Can you get me tcpdump or ethereal full traces of this between Windows -> Windows? A Windows -> Samba trace would also help but isn't as important.

With NFS we're returning disk full on close, as that's when we find out (when the client code in the kernel flushes the write onto the NFS server). This was a bug fixed for Intel in Roseville about 5 years ago, so I don't want to just ignore the error on close here. I'd rather find out where Windows detects the disk full problem and ensure we do the same.

As a test (although this will damage performance), try setting:

    strict sync = yes
    sync always = yes

in the [global] section of the smb.conf file. This will force a flush onto disk on every write. If a write returns disk full, it will force it to be detected on the write call, not the close call. Performance will suck, but I'll be interested to see if it fixes the problem.

Jeremy.
Jeremy, I enabled strict sync = yes and sync always = yes and it did not solve the problem. In fact I had temporarily enabled those parameters myself trying to come up with a solution before I ever even filed this bug report. I'll capture some traffic with windump (the Windows port of tcpdump) and get it to you as soon as I can. Probably tomorrow.

Also, I need to tell you that this problem isn't completely black and white. It's not reproducible 100% of the time. Well, it is and it isn't. Let me explain.

In some cases, attempting to save a Word document to a Samba server when the save would run out of disk space is handled flawlessly by the Windows/Word client: a friendly error is popped up telling you the disk is full and that is that. No "delayed write failed" errors. No document corruption. That generally seems to be the case if the document you are attempting to save is going to put you vastly over your quota.

In other cases it all goes berserk as outlined in the original bug report above. Those cases seem to be easily caused by following my guidelines above: about 1 Meg free on the disk to begin with, then add a little more than 1 Meg of new material to your document and attempt to save it.

So Jeremy, you may have to take a few cracks at it before being able to reproduce the problem, but once you do come up with a bug-producing evil magic formula of free disk space, original document size, and additions to the document before resaving, you'll be able to replicate the bug every single attempt without fail. In other words, if I can copy document1.doc from my C: drive onto a Samba share, then open it from the Samba share, paste a 600kb jpeg file into it twice, attempt to save it, and see all the text turn to garbage, then I could do it 10 times in a row by just repeating this exact same procedure.

As far as the document corruption goes, that doesn't really seem to be a great mystery.
When saving a document, Word seems to go through the routine of saving to a temp file, ~wrl0001.tmp or something like that, then deleting the original document and then renaming ~wrl0001.tmp to the name of the original document. What I've observed is that even when these saves out to the Samba share fail, Word will still delete the original document and rename ~wrl0001.tmp (the file it was writing when it ran out of space) to the name of the original document, and then I guess Word reads it back into memory from disk at that point, causing the corruption of the document in memory.

Now I'm going to share with you a feeling I've had as to what's maybe going on here. Please discard it without hesitation if this sounds ludicrous, which to you it very well might...

I've been getting the sense this might be related to the whole business of sparse files. strict allocate = yes makes total sense to me as a potential solution. Is it at all possible that even with that parameter enabled Samba is still creating sparse files? I ask because:

#1) What we have discovered is that in the end the Word document displays a size that shouldn't be possible given the quota or disk limitation in place. Say I start with 1 Meg of disk free and try to save what would result in a 2 Megabyte Word document, and it gets corrupted as I've described - the resulting corrupt file actually does list as a 2 Megabyte file and acts that way too - I can't move the corrupt 2 Megabyte Word file from the Samba share to my C: drive and back onto the Samba share, I just get an error that there's not enough disk space.

#2) It just kind of seems like it would fit - like that's how Windows would check if there's space available - try to grow the file and see if it succeeds.

#3) I do think strict allocate = yes helped. As I was saying, it's not a black and white situation, and it seems that with strict allocate enabled I have to work harder at coming up with a scenario to trigger the problem.

Thank you,
Tom Schaefer
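The sparse file suspicion in the comment above can be illustrated on a POSIX system with a minimal Python sketch. It is not taken from Samba. The file name probe.tmp and the helper names are made up for illustration, and how much space a filesystem actually allocates for the hole varies, so the sketch only reports the numbers.

```python
import os
import tempfile


def write_probe_byte(path, offset):
    """Write a single byte at `offset`, leaving everything before it unwritten."""
    with open(path, "wb") as f:
        f.seek(offset)
        f.write(b"\0")


def sizes(path):
    """Return (apparent size, allocated bytes or None if the OS does not report it)."""
    st = os.stat(path)
    # st_blocks counts 512-byte units on POSIX systems; it is absent on Windows.
    allocated = st.st_blocks * 512 if hasattr(st, "st_blocks") else None
    return st.st_size, allocated


def demo(offset=2 * 1024 * 1024 - 1):
    """Create a file with one byte at `offset` and report its sizes."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "probe.tmp")  # hypothetical file name
        write_probe_byte(path, offset)
        return sizes(path)


if __name__ == "__main__":
    apparent, allocated = demo()
    print("apparent size:", apparent)
    print("allocated bytes:", allocated)
```

On filesystems that support holes, the apparent size (what ls -l shows) is 2 Megabytes while the allocated figure is typically far smaller, which would match a file that lists as 2 Megabytes on a partition with only 1 Megabyte free.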
Jeremy, I made some Ethereal traces yesterday and had to e-mail them to you directly because they were too large to attach on Bugzilla. Hopefully you got them. Anyhow, I stared at them some more last night and I think I've probably determined where the problem is. I've analyzed other traces besides the ones I've sent you and what I'm seeing holds true for all of them.

If you look at the server2003 trace I sent, you'll see in packets 89 through 216 a big series of requests and responses where the Windows client is telling the server to write 1 byte in a file, incrementing the offset by about 32K each time. This goes on successfully in packets 89 through 178. At packet 179 the offset is up to 994815, which apparently makes the file too large to remain under quota, so in packets 179 through 216 the Windows client keeps requesting these 1 byte writes, upping the offset by about 32K each time, and keeps getting told STATUS_DISK_FULL in response to each of those requests.

...
89 0.321558 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 11305
90 0.321954 134.124.18.221 134.124.42.203 TCP netbios-ssn > 2724 [ACK] Seq=4401 Ack=4788 Win=64284 Len=0
91 0.322118 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 1 byte
92 0.322206 134.124.42.203 134.124.18.221 SMB Trans2 Request, QUERY_FILE_INFO, FID: 0x0007, Query File Standard Info
93 0.323763 134.124.18.221 134.124.42.203 SMB Trans2 Response, QUERY_FILE_INFO
94 0.324293 134.124.42.203 134.124.18.221 SMB Trans2 Request, SET_FILE_INFO, FID: 0x0007
95 0.324876 134.124.18.221 134.124.42.203 SMB Trans2 Response, SET_FILE_INFO
96 0.324918 134.124.18.221 134.124.42.203 SMB NT Trans Response, NT NOTIFY
97 0.324974 134.124.42.203 134.124.18.221 TCP 2724 > netbios-ssn [ACK] Seq=4960 Ack=4680 Win=65256 [TCP CHECKSUM INCORRECT] Len=0
98 0.324986 134.124.18.221 134.124.42.203 SMB NT Trans Response, NT NOTIFY
99 0.325132 134.124.42.203 134.124.18.221 SMB Trans2 Request, SET_FILE_INFO, FID: 0x0007
100 0.325353 134.124.42.203 134.124.18.221 SMB NT Trans Request, NT NOTIFY, FID: 0xc00f
101 0.325420 134.124.18.221 134.124.42.203 SMB Trans2 Response, SET_FILE_INFO
102 0.325539 134.124.18.221 134.124.42.203 SMB NT Trans Response, NT NOTIFY
103 0.325573 134.124.42.203 134.124.18.221 TCP 2724 > netbios-ssn [ACK] Seq=5136 Ack=4896 Win=65040 [TCP CHECKSUM INCORRECT] Len=0
104 0.326267 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 44543
105 0.326505 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 1 byte
...
179 0.399282 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 994815
180 0.399615 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
181 0.400564 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1027583
182 0.401010 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
183 0.401489 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1060351
184 0.402523 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
185 0.402946 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1093119
186 0.403338 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
187 0.403757 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1125887
188 0.404074 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
189 0.404508 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1158655
190 0.404786 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
191 0.405216 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1191423
192 0.405559 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
193 0.405972 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1224191
194 0.406280 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
195 0.406683 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1256959
196 0.407601 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
197 0.408056 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1289727
198 0.408650 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
199 0.409067 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1322495
200 0.410005 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
201 0.410358 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1355263
202 0.411128 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
203 0.411531 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1388031
204 0.412383 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
205 0.412786 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1420799
206 0.413405 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
207 0.413840 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1453567
208 0.415027 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
209 0.415447 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1486335
210 0.415853 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
211 0.416263 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1519103
212 0.417208 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
213 0.417560 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 1543057
214 0.418340 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL
215 0.418721 134.124.42.203 134.124.18.221 SMB Write AndX Request, FID: 0x0007, 1 byte at offset 970239
216 0.419224 134.124.18.221 134.124.42.203 SMB Write AndX Response, FID: 0x0007, 0 bytes, Error: STATUS_DISK_FULL

Now we jump over to the same situation saving to the Samba server. Samba never returns a STATUS_DISK_FULL, even though the series of 1 byte writes at incrementing offsets creates a file on the Samba server (it must be a sparse file) that is too large to fit if all the space were actually allocated to the file. This is with strict allocate, strict sync, and sync always all enabled.

The following is from the Samba Ethereal trace I sent you. There are some clusters of Windows probing around a bit with these 1 byte writes, then actually writing some big chunks out to the file, and then doing some more probing. The first probe is at packet 58, a 1 byte write at offset 11305. Throughout the capture there are many of these 1 byte writes with offsets well over 1 Megabyte, yet there was only about 1 Meg free on the disk partition. It culminates at packet 2509, writing 1 byte at an offset of 1575423. Unlike in the trace of the Windows Server 2003 system, none of these 1 byte writes ever fail with STATUS_DISK_FULL; they always return STATUS_SUCCESS.

I end up with a corrupt Word document on the disk that, if I do ls -l on it in Linux or look at its properties in Windows Explorer, is supposedly 1575424 bytes. In my understanding of the way the world operates that has to be a sparse file, because there was only about 1 Meg free on the disk to begin with and, as I said yesterday, I can't move that file from the Samba share to my C: drive and then back to the Samba share. I'll get an insufficient space error.

58 0.282850 134.124.42.203 134.124.48.126 SMB Write AndX Request, FID: 0x31d7, 1 byte at offset 11305
59 0.319457 134.124.48.126 134.124.42.203 TCP netbios-ssn > 2517 [ACK] Seq=3640 Ack=2938 Win=10220 Len=0
60 0.336431 134.124.48.126 134.124.42.203 SMB Write AndX Response, FID: 0x31d7, 1 byte
...
2509 8.567376 134.124.42.203 134.124.48.126 SMB Write AndX Request, FID: 0x31db, 1 byte at offset 1575423
2510 8.599248 134.124.48.126 134.124.42.203 TCP netbios-ssn > 2517 [ACK] Seq=38848 Ack=1781719 Win=10440 Len=0
2511 8.664959 134.124.48.126 134.124.42.203 SMB Write AndX Response, FID: 0x31db, 1 byte

There you go Jeremy, I'll be eagerly waiting to hear from you.

Thanks again,
Tom Schaefer
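The difference between the two traces can be sketched with a toy model in Python. This is an illustration under assumptions, not Windows or Samba code: the Server class, its allocate_on_extend flag, and the cost rules are invented for the example, the probe offsets simply step by 32K from 11305, and a real filesystem would allocate at least one block per probe where this model charges only one byte.

```python
STEP = 32 * 1024  # the traces show probe offsets rising by about 32K


class Server:
    """Toy file server holding one file and a fixed amount of free space."""

    def __init__(self, free_bytes, allocate_on_extend):
        self.free = free_bytes
        self.size = 0
        # True: extending the file charges the whole gap against free space.
        # False: the gap becomes a hole and only the written byte is charged.
        self.allocate_on_extend = allocate_on_extend

    def write_byte(self, offset):
        new_size = max(self.size, offset + 1)
        cost = (new_size - self.size) if self.allocate_on_extend else 1
        if cost > self.free:
            return "STATUS_DISK_FULL"
        self.free -= cost
        self.size = new_size
        return "STATUS_SUCCESS"


def first_failure(server, start=11305, stop=1575424, step=STEP):
    """Send 1 byte probes at rising offsets; return the first offset refused, or None."""
    failed_at = None
    for offset in range(start, stop, step):
        status = server.write_byte(offset)
        if status == "STATUS_DISK_FULL" and failed_at is None:
            failed_at = offset
    return failed_at


if __name__ == "__main__":
    allocating = Server(1_000_000, allocate_on_extend=True)
    sparse = Server(1_000_000, allocate_on_extend=False)
    print("allocating server first refusal:", first_failure(allocating))
    print("sparse server first refusal:", first_failure(sparse))
    print("sparse server file size:", sparse.size)
```

In this model the allocating server starts refusing probes once the file would exceed the free space, as in the Server 2003 trace, while the sparse server accepts every probe and ends up with a file whose apparent size exceeds the space that was free.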
Created attachment 1231
Proposed patch

Ok, please test the attached patch. You will need to set "strict allocate = yes" in order to execute the new code, and it will run slower than before, as each write beyond EOF will cause smbd to zero-fill the previously sparse area. It should catch the situation you describe in your (excellent) bug report though, i.e. all the 1 byte writes should force a DISK_FULL error return. It applies (with a little fuzz factor) to the 3.0.14a source code.

Cheers,
Jeremy.
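The idea described in the comment above, zero-filling the gap when a write lands beyond EOF, can be sketched in Python on a POSIX system. This is not the attached patch or smbd code. The function name write_with_zero_fill and the 64K chunk size are assumptions made for illustration, and writing zeros does not guarantee real allocation on filesystems that compress or deduplicate data.

```python
import os
import tempfile

CHUNK = 64 * 1024  # size of each zero-fill write; an arbitrary choice


def write_with_zero_fill(fd, data, offset):
    """Write `data` at `offset`, first filling any gap past EOF with real zeros.

    Because the gap is written out instead of left as a hole, a full disk
    surfaces as an OSError (ENOSPC) on this call, not later.
    """
    pos = os.fstat(fd).st_size  # current end of file
    zeros = bytes(CHUNK)
    while pos < offset:
        n = min(CHUNK, offset - pos)
        # os.pwrite may write fewer bytes than asked; advance by what was written.
        pos += os.pwrite(fd, zeros[:n], pos)
    # For simplicity the final write is assumed to complete in one call.
    return os.pwrite(fd, data, offset)


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        written = write_with_zero_fill(f.fileno(), b"x", 44543)
        print("bytes written:", written)
        print("file size:", os.fstat(f.fileno()).st_size)
```

The trade-off matches the one noted in the comment: every 1 byte probe far beyond EOF now costs a write of the whole gap, so it runs slower, but the error is reported at the write that would exhaust the space.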
Hi Jeremy, I finally got to test the patch over the weekend and today. The whole issue of Word corrupting documents seems to be eliminated. So very awesome!! Thank you, this will work.

However, and this is pretty much just an FYI, the patch doesn’t seem to be a "perfect" solution. Windows still gets a "Delayed Write Failed" error from the OS when saving the document, as well as the normal error from Word, whereas attempting to save too large a document to an actual Windows server only results in the one error message from Word; the Windows client OS never complains. In other words, the user saving too large a document has to click OK on two errors instead of one. I don’t care if you don’t. Or, if you are interested in pursuing what that’s about, I’ll go down that road with you. Whatever you want.

I’ve actually been looking at it already with Ethereal. What I’ve noticed now is that as Windows/Word is probing around on the server for available space it will write 1 byte to the new temporary file and then, if that succeeds, it immediately does a “Trans2 Request, Query File Info” on it. The Trans2 Response, Query File Info includes the End of File and Allocation Size values. On a Windows server, if the 1 byte was written at offset 44543, the Trans2 Response, Query File Info will then show End of File as 44544, and the allocation size will always be just a little greater than EOF, say about 46000. Then the next 1 byte probe will be made at a location a little bit higher than the allocation size, say about 47000. And so on. The Windows/Word client methodically stair-steps its way up in small increments to where STATUS_DISK_FULL is reached, and thus the Windows/Word client has pretty accurate knowledge of just how much space is available on the Windows server.
Contrast this with the situation where it’s a Samba server. What I’ve found there is that Windows/Word will just do a couple of 1 byte probes at low locations like 44543, and then when it does its Trans2 Query and gets the Response, the Samba server tells it that EOF is at 44544 and the Allocation Size is 1048576. So then the next Windows 1 byte probe will be at an offset a little higher than 1048576, say 1060000. If that succeeds then the next Trans2 File Query/Response will show Samba reporting that exactly 2 Megabytes is the Allocation Size, and 3 Megabytes as soon as something is written over 2 Megabytes, and so forth. In other words, the Samba server always responds with the allocation size in rounded-up whole Megabytes only, and so the Windows client can never really get a fine-grained picture of just how much space is available like it can when talking to a Windows server. When talking to the Samba server I think the Windows client can only get a picture of how much space is available rounded up to the next whole Megabyte. If there are actually 1.2 Megabytes available, Windows comes away from its one byte probes thinking there are exactly 2 Megabytes available.

Now, as far as the Delayed Write Failed errors go: what I see happening when talking to a Windows server is that the Windows/Word client will determine that there should be, say, at least 1.6 Megabytes available on the server, and then it will actually start writing the contents of the Word document into the file on the server and halt itself just short of 1.6 Megabytes. When saving out to a Samba server that has, say, 1.6 Megabytes available, the Windows/Word client thinks there are exactly 2 Megabytes available and starts writing the Word document out to the Samba share, intending to halt itself just short of 2 Megabytes. But beyond 1.6 Megabytes of writing Samba starts responding with STATUS_DISK_FULL, and I believe that is the source of the Delayed Write Failed errors.
Windows is getting an error writing out to a file position that it had previously determined via the 1 byte probes "should be" available for writing. I’m guesstimating the whole 1 byte probes thing is the Windows client OS figuring out how much space is available. Then it lets the Word application start writing out its file, and just short of where the server disk would fill, the Windows client OS tells the Word application that there is no more space, so that Word itself doesn’t have to deal with talking directly to a network disk server.

I got to wondering: if there was less than 1 Megabyte available on the Samba server, say 900k, and I tried to save a very small document, say 30k, onto the share, would I be told the disk is full? So I tried it and sure enough that's the case. I can't save any Word documents to a Samba share unless there is at least 1 Meg free, even if it's just a tiny little document that should fit onto the share no problem. I'm told by Word that there isn't disk space available and do not get any Delayed Write Failed errors.

I think if Samba could be made to round down on the whole Megabyte allocation sizes it reports, it would probably entirely solve the Delayed Write Failed errors. Or better yet, if Samba could be made to report those allocation sizes in much smaller than whole 1 Megabyte increments, that would be even better.

Anyhow, again Jeremy, the real crisis has been solved. If you’ve got time and/or an interest in solving the now cosmetic issue of these Delayed Write Failed errors I'm certainly willing to try and help. If nothing else though, if you've got at least a comment or two about my above analysis I'd love to hear it. Once again, the important problem, corrupt Word documents, is solved.

Thanks again,
Tom Schaefer
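The stair-stepping described in the comment above can be made concrete with a small Python model. It is a sketch of the behaviour as observed in the traces, not of actual Windows client logic, which is not documented here. The function names, the 1024-byte gap between the reported allocation size and the next probe, and the candidate 4096-byte granularity are all assumptions chosen for illustration.

```python
def round_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


def estimate_space(free_bytes, roundup, first_probe=44543, gap=1024):
    """Model a client probing with 1 byte writes.

    After each successful probe the server reports an allocation size equal to
    EOF rounded up to `roundup`, and the client probes again just above that
    figure. Returns the last allocation size reported before a probe failed.
    """
    offset = first_probe
    believed = 0
    while offset + 1 <= free_bytes:  # the probe fits, so the server accepts it
        eof = offset + 1
        believed = round_up(eof, roundup)
        offset = believed + gap  # next probe a little above the allocation size
    return believed


if __name__ == "__main__":
    free = 1258291  # roughly 1.2 Megabytes actually available
    print("1 Megabyte granularity:", estimate_space(free, 0x100000))
    print("4096 byte granularity: ", estimate_space(free, 4096))
```

With about 1.2 Megabytes free, the 1 Megabyte granularity leaves the modelled client believing 2 Megabytes are available, matching the observation above, while the finer granularity lands within a few kilobytes of the real figure.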
We already have the capability to change this. When a client asks for the "allocation size" we round up to the nearest multiple of the "allocation roundup size" parameter, which is a per-share parameter that defaults to 0x100000 bytes (1 MB). If you can tell me what roundup size Windows servers use when allocating space, then you can test the theory by setting "allocation roundup size = <windows value>" and this should behave exactly the same as Windows. From your mail it looks like the allocation size that Windows uses might be somewhere between 1k and 8k. We don't actually allocate this on the disk (as there is no way to do this using the POSIX APIs), we just round up to the given value.

The reason we use 1 MB is a cheat :-). Someone discovered that Windows clients use the allocation size as a disk cache tuning parameter, so if we set it very large then they cache much better to a Samba server than to a Windows one :-). But this large allocation roundup can cause problems (specifically with Visual Studio), so we made it a tunable parameter. If you can work out the right value for it, this should fix your problem.

In the meantime I'm going to close this one out as the data corruption bug is fixed. If you discover the correct allocation size, just add it to this bug report for future knowledge.

Thanks,
Jeremy.
Sorry for the spam, cleaning up the database to prevent unnecessary reopens of bugs.