A quick check using the smbtorture tests shows a failure when running the test smb2.rw.rw1 against a distributed-disperse volume.

The exact command run as part of the samba-integration tests is:

/bin/smbtorture --fullname --option=torture:progress=no --option=torture:sharedelay=100000 --option=torture:writetimeupdatedelay=500000 --format=subunit --target=samba3 --user=test1%x //192.168.123.10/gluster-vol-disperse smb2.rw.rw1

The test writes a random string to a file and immediately reads it back. The contents are then checked to confirm that the data returned by the read call matches the data written by the write call.

Two errors are seen:
1) Fewer bytes returned by the read call.
2) Mismatch between the buffers read and written.

I believe both errors are manifestations of the same problem. More details below:
1) The problem is only seen with a disperse volume. I have also tested against a replicated volume successfully.
2) The problem couldn't be reproduced with the vfs_gluster_fuse module on the same setup, i.e. using Samba-exported FUSE shares.
3) With slight modifications to the testing script, I determined that the read returns the data from the write which preceded the most recent one, i.e. the content returned on read is what was written 2 writes ago and not what was written by the last write.
4) On retrying the read, the correct data is returned.
5) I couldn't reproduce this issue with a test program written with gfapi.

Versions:
samba-20200826.095928.fbe5853-1.el7.1.x86_64 - recent upstream build
glusterfs-20200905.4e81a5e-0.0.el7.x86_64 - recent upstream gluster build
Created attachment 16217 [details]
Modifications to smb2.rw.rw1

Modifications to the smbtorture test smb2.rw.rw1. They include:
1) Use a single-char pattern, i.e. generate a buffer consisting of one character which changes every time a new buffer is generated. The first attempt fills the buffer with the repeating char 'a', the second call fills it with 'b', and so on. This was used to confirm that the problem read call returns the content written before the immediate write. It appears that the data written is neither in the cache nor flushed to disk before the read call.
2) Retry the read until we get the right buffer and keep track of the number of retries. This confirmed that the correct output was returned immediately after the failed read call.
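For illustration, the two modifications boil down to something like the following plain-C helpers. The names here are hypothetical and this is not the actual smbtorture patch:

#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helpers sketching the two test modifications; the real
 * changes live inside the smbtorture smb2.rw.rw1 test itself.
 */

/* 1) Single-character pattern: 'a' on the first call, 'b' on the next, ... */
static void fill_pattern(char *buf, size_t len)
{
        static unsigned int counter;

        memset(buf, 'a' + (counter % 26), len);
        counter++;
}

/* 2) Retry the read until the expected pattern comes back, counting retries. */
static int read_until_match(const char *expected, char *rbuf, size_t len,
                            ssize_t (*do_read)(char *buf, size_t len),
                            int max_retries)
{
        int retries;

        for (retries = 0; retries <= max_retries; retries++) {
                if (do_read(rbuf, len) == (ssize_t)len &&
                    memcmp(rbuf, expected, len) == 0) {
                        return retries; /* 0 means the very first read was correct */
                }
        }
        return -1; /* never matched */
}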
Created attachment 16218 [details]
test program using gfapi

Test program using gfapi to attempt to reproduce the problem. I was unsuccessful in reproducing the issue with this test program.
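For reference, a gfapi reproducer attempt of this shape looks roughly like the sketch below. This is not the attached program; the server address, volume name and file path are placeholders, and it builds with something like "gcc rwtest.c -lgfapi":

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <glusterfs/api/glfs.h>

#define LEN 65536

int main(void)
{
        /* Placeholders: adjust server, volume and path for your setup. */
        glfs_t *fs = glfs_new("gluster-vol-disperse");
        glfs_fd_t *fd;
        char wbuf[LEN], rbuf[LEN];
        int i;

        glfs_set_volfile_server(fs, "tcp", "192.168.123.10", 24007);
        if (glfs_init(fs) != 0) {
                perror("glfs_init");
                return 1;
        }

        fd = glfs_creat(fs, "/rwtest.dat", O_RDWR | O_TRUNC, 0644);
        if (fd == NULL) {
                perror("glfs_creat");
                return 1;
        }

        for (i = 0; i < 10; i++) {
                /* Single-character pattern 'a', 'b', ... as in the modified test. */
                memset(wbuf, 'a' + i, LEN);

                glfs_lseek(fd, 0, SEEK_SET);
                if (glfs_write(fd, wbuf, LEN, 0) != LEN) {
                        perror("glfs_write");
                        return 1;
                }

                /* Read back immediately and compare. */
                glfs_lseek(fd, 0, SEEK_SET);
                if (glfs_read(fd, rbuf, LEN, 0) != LEN ||
                    memcmp(wbuf, rbuf, LEN) != 0) {
                        fprintf(stderr, "mismatch at iteration %d\n", i);
                        return 1;
                }
        }

        glfs_close(fd);
        glfs_fini(fs);
        return 0;
}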
Hmmm. Doesn't this mean there's a bug in the gluster libraries ? It isn't as if Samba caches any of the data, everything is just going through the glfs_XXXX() API calls.
I couldn't recreate this with a test program using gfapi calls, thus taking Samba out of the equation. There is a chance that the bug is within the vfs plugin. We are still investigating the cause of this issue and gd thought that we could record it here while we investigate.

Sachin Prabhu
No it's a good idea to record here. Remember, the vfs plugin uses threaded calls to read/write data. Is your gfapi test program threaded ? My money would be on a threading issue in gfapi.
No. No threading in the test program. I'll run a few checks on the tcpdumps between Samba and gluster to confirm.

Thanks
Sachin Prabhu
Can you reproduce the issue with

aio read size = 0
aio write size = 0

which means we don't use threading? Just for another data point.
Created attachment 16226 [details]
pcap capturing the problem

This is a pcap file taken on the Samba server to which the client is connected. The modified smb2.rw.rw1 test (using char patterns - patch provided earlier) was run against the server. This clearly shows the problem.

For pattern 'd' - this worked as expected:

1187 SMB WRITE REQUEST 17447 bytes pattern - 'd'
1193 SMB WRITE RESPONSE successful
1194 SMB READ REQUEST Len 82423
1207 SMB READ STATUS_PENDING
1219 GLFS WRITE CALL pattern 'd'
1222 GLFS WRITE CALL pattern 'd'
1226 GLFS WRITE CALL pattern 'd'
1231 GLFS WRITE REPLY Success
1232 GLFS WRITE REPLY Success
1233 GLFS WRITE REPLY Success
1260 GLFS READ CALL
1261 GLFS READ CALL
1262 GLFS READ REPLY pattern 'd'
1264 GLFS READ REPLY pattern 'd'
1267 SMB READ RESPONSE pattern 'd'

In this case, we accept the write from the client, and then accept the read call but leave it on status pending. We then write to gluster. Subsequently, we make a read call on gluster which is successful. The data from the read is then sent back to the client over SMB.

For pattern 'e' - the problem write-read:

1271 SMB WRITE REQUEST 83692 bytes pattern - 'e'
1273 SMB WRITE RESPONSE success
1277 SMB READ REQUEST Len 83693
1282 GLFS READ CALL <--- no GLFS writes performed before we issue this READ call
1283 GLFS READ CALL
1288 GLFS READ REPLY pattern ''
1289 GLFS READ REPLY pattern 'd' <-- incorrect pattern returned from GLFS
1302 SMB READ RESPONSE STATUS_PENDING
1305 SMB READ RESPONSE pattern 'd' <-- incorrect pattern returned
1312 SMB CLOSE REQUEST <-- file closed
1318 GLFS WRITE CALL pattern 'e'
1322 GLFS WRITE CALL pattern ''
1324 GLFS WRITE CALL pattern 'e'
1333 GLFS WRITE REPLY (1324)
1334 GLFS WRITE REPLY (1322)
1344 GLFS WRITE REPLY (1318)

This is the problem case:
1) We see an SMB write call which is successful and then subsequently a read call which is set to status pending.
2) We can see that we make a read call to gluster first. This returns the data from the previous write, i.e. pattern 'd'.
3) This is sent to the SMB client. This fails the test and the file is closed.
4) We then see the pending writes made by the SMB client being flushed to the gluster fs.
Hello Volker,

A quick check with

aio read size = 0
aio write size = 0

did not fix the problem. I have checked the tcpdump and we see the same pattern as described in c#8, with the READ on the backing device being called before the WRITE.

I suspect we have seen these issues earlier in customer cases for which, unfortunately, we never had a good reproducer. I will continue investigating on Monday.
OK, it looks like you're going to need some kind of synchronization object (mutex) inside the vfs_glusterfs.c code on a per-fd basis to ensure that the ordering is maintained w.r.t. the SMB file accesses. Great analysis BTW !
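Purely as an illustration of that suggestion (note the next comment, which questions whether it is needed at all), a per-fd lock around the gluster I/O calls could look roughly like the sketch below. The struct and function names are hypothetical; this is not the actual vfs_glusterfs.c code:

#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <glusterfs/api/glfs.h>

/* Hypothetical per-fd state; only a sketch of the suggested serialization. */
struct fd_state {
        glfs_fd_t *glfd;
        pthread_mutex_t lock;   /* serializes all I/O on this file handle */
};

static ssize_t locked_write_at(struct fd_state *st, const void *buf,
                               size_t count, off_t offset)
{
        ssize_t ret;

        pthread_mutex_lock(&st->lock);
        glfs_lseek(st->glfd, offset, SEEK_SET);
        ret = glfs_write(st->glfd, buf, count, 0);
        pthread_mutex_unlock(&st->lock);
        return ret;
}

static ssize_t locked_read_at(struct fd_state *st, void *buf,
                              size_t count, off_t offset)
{
        ssize_t ret;

        pthread_mutex_lock(&st->lock);
        glfs_lseek(st->glfd, offset, SEEK_SET);
        ret = glfs_read(st->glfd, buf, count, 0);
        pthread_mutex_unlock(&st->lock);
        return ret;
}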
Just FYI, once the SMB2_WRITE returns success it means we have gotten a return from the glfs_pwrite() call made in the VFS module. We will only return

SMB WRITE RESPONSE success

once vfs_gluster_pwrite_done()/vfs_gluster_pwrite_recv() has completed. That means we've given the data to the glfs_XXXX() calls. So I may be wrong about needing a mutex inside the VFS, as in this case we've completed the glfs_XXXX() call. You might want to turn on as much debug as you can get inside the glfs_XXXX() code to try and track the path through this case.
In discussion with Sachin, we spotted one potential additional problem: the smbtorture test uses two tree connects. So it is using two gluster clients - i.e. two instances of the libgfapi. We can modify the test program to use two instances of libgfapi. So the problem might still be in the libgfapi or even the underlying implementation. Can it be reproduced with two fuse mounts?...
I was able to reproduce this with a test program using gfapi, bypassing Samba completely. A new issue has been created with the gluster project at https://github.com/gluster/glusterfs/issues/1501
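For reference, the two-client pattern from the previous two comments is roughly the sketch below. It is not the program attached to the gluster issue; the server address, volume name and file path are placeholders. One libgfapi instance writes the file and a second, independent instance reads it straight back:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <glusterfs/api/glfs.h>

#define LEN 65536

/* Sketch: create and initialise one libgfapi client instance. */
static glfs_t *client_new(void)
{
        glfs_t *fs = glfs_new("gluster-vol-disperse");  /* placeholder volume */

        glfs_set_volfile_server(fs, "tcp", "192.168.123.10", 24007);
        if (glfs_init(fs) != 0) {
                perror("glfs_init");
                return NULL;
        }
        return fs;
}

int main(void)
{
        glfs_t *writer = client_new();
        glfs_t *reader = client_new();
        glfs_fd_t *wfd, *rfd;
        char wbuf[LEN], rbuf[LEN];
        int i;

        if (writer == NULL || reader == NULL) {
                return 1;
        }

        wfd = glfs_creat(writer, "/rwtest.dat", O_RDWR | O_TRUNC, 0644);
        rfd = glfs_open(reader, "/rwtest.dat", O_RDONLY);

        for (i = 0; i < 10; i++) {
                memset(wbuf, 'a' + i, LEN);

                /* Write through the first client instance. */
                glfs_lseek(wfd, 0, SEEK_SET);
                glfs_write(wfd, wbuf, LEN, 0);

                /* Immediately read the file back through the second instance. */
                glfs_lseek(rfd, 0, SEEK_SET);
                glfs_read(rfd, rbuf, LEN, 0);

                if (memcmp(wbuf, rbuf, LEN) != 0) {
                        fprintf(stderr, "stale data at iteration %d\n", i);
                }
        }

        glfs_close(wfd);
        glfs_close(rfd);
        glfs_fini(writer);
        glfs_fini(reader);
        return 0;
}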
Great ! Thanks Sachin. Do you want to leave this open until the bug has been fixed in gluster in order for Samba users to track it ?
Hello Jeremy,

I would like to keep this open for some time, at least until we have an acknowledgement of the problem from the gluster team. We can also use it to track any changes (if recommended) to vfs_glusterfs by the gluster team.

Thanks
Sachin Prabhu
We have determined the cause of the problem to be the write-behind translator in gluster.
https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Developer-guide/write-behind/

In summary, the translator returns success for a write call before the data has been flushed to the backing device; it continues writing in the background. To disable this translator, run the following gluster command:

# gluster volume set <volname> write-behind off

This fixed the problem both for the test program and for the smbtorture tests using Samba, so I am closing this issue for now. We are evaluating the performance hit of disabling the translator for our installations and will re-open the bz to update the notes in case we decide to make any changes based on the results.
Oh, that "write-behind translator" is horrid. Is it off or on by default ?
It is on by default. We are discussing internally how best to fix this. There are new translators in development to replace this and other performance translators in gluster.
Oh, that's really bad :-(. Re-opening this bug. Can you add a patch to the vfs_glusterfs man page that covers this problem and explains how to turn this off. Samba users really need to be aware of this as it's a data corruption bug. Alternatively, is there an API vfs_glusterfs can call to turn this off ?
We are discussing how best to solve this internally. I will update the issue as soon as we have a solution.
Update:

A merge request has been made to modify the man page for vfs_glusterfs:
https://gitlab.com/samba-team/samba/-/merge_requests/1608

At the moment there is no programmatic way to disable the translator in the vfs_glusterfs plugin. We are working with GlusterFS upstream to add this feature. The patch has been proposed and is going through the system:
https://github.com/gluster/glusterfs/pull/1640
This, however, would only be available in newer versions of GlusterFS.

A second option is to programmatically detect the presence of the write-behind translator and, if found, either warn the user or refuse the mount. To do this, at the moment, the best method available is to call glfs_get_volfile() and parse the returned volume file for strings which indicate the presence of the write-behind translator. We are looking for other options and do not intend to move forward with this option yet.

Therefore, we are adding a note to the man page pointing out the possible data corruption issue.
"Determining it programatically will require parsing through a 100s of lines of text" Surely this is just a strstr() call on the output of glfs_get_volfile() into a large buffer and can be done at connection time. Sachin - what string should this be looking for ?
Created attachment 16319 [details] volfile with write-behind translator enabled volfile for volume vol-replicate obtained by calling glfs_get_volfile()
Created attachment 16320 [details] volfile with write-behind translator disabled for volume vol-replicated. The volfile was obtained with glfs_get_volfile() call.
I have uploaded two volfiles, with the write-behind translator enabled and disabled. The volfile is a graph representation of the translators in the order in which they are to be called.

In the files above, the volume is 'vol-replicate' with subvolume 'vol-replicate-md-cache'.
'vol-replicate-md-cache' has subvolume 'vol-replicate-nl-cache'.
'vol-replicate-nl-cache' has subvolume 'vol-replicate-quick-read'.
'vol-replicate-quick-read' has subvolume 'vol-replicate-open-behind'.
'vol-replicate-open-behind' has subvolume 'vol-replicate-write-behind'.

The last entry above is the representation of the write-behind translator. The graph continues with further subvolumes, eventually ending with the individual bricks where the file or fragment is actually written.

A simple way to determine the presence of the write-behind translator is to look for the string "volume vol-replicate-write-behind", i.e. "volume <volume name>-write-behind". However, the proper way to do this is to build the tree and to ensure that the write-behind translator is actually called when writing to the volume.

My hesitation comes from the fact that, apart from the text parsing and the possible bugs associated with it, this information is internal to GlusterFS and was only exposed through a feature request from the Ganesha team (rh bz 1090363), which needed it to determine the UUID of the device. I don't think there are any guarantees that the representation will not change in the future (I am not 100% sure of this and will have to check). In that case, the code checking for the presence of the translator may end up breaking, and we may have incompatible versions which will need more hacks to work around.
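For illustration, the strstr() approach Jeremy suggested would look roughly like the sketch below, keeping in mind the caveats above about the volfile format being internal to GlusterFS. It assumes glfs_get_volfile() reports the required size as a negative value when called with a too-small buffer and returns the number of bytes copied on success:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <sys/types.h>
#include <glusterfs/api/glfs.h>

/*
 * Sketch only: check an already-initialised glfs_t for the
 * "<volname>-write-behind" translator by scanning the volfile text.
 */
static bool volume_has_write_behind(glfs_t *fs, const char *volname)
{
        char pattern[256];
        char *volfile;
        ssize_t len;
        bool found = false;

        len = glfs_get_volfile(fs, NULL, 0);
        if (len >= 0) {
                return false;   /* no volfile available */
        }

        volfile = calloc(1, (size_t)(-len) + 1);
        if (volfile == NULL) {
                return false;
        }

        if (glfs_get_volfile(fs, volfile, (size_t)(-len)) <= 0) {
                free(volfile);
                return false;
        }

        /* e.g. "volume vol-replicate-write-behind" */
        snprintf(pattern, sizeof(pattern), "volume %s-write-behind", volname);
        found = (strstr(volfile, pattern) != NULL);

        free(volfile);
        return found;
}

Something like this could run once at connection time, as suggested in comment #23.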
This bug was referenced in samba master: 08f8f665d409ee7b93840c25a8142f2ce8bacfa1 2a49ccbcf5e3ff0f6833bcb7f04b800125f1783f
3 patches that are going into master incoming once they've all landed !
This bug was referenced in samba master: 7d846cd178d653600c71ee4bd6a491a9e48a56da
Created attachment 16322 [details] Patch with cherry-pick information for 4.13, 4.12 and 4.11 This patchset contains the 3 commits Jeremy mentioned cherry-picked from master. It applies to v4-13-test, v4-12-test and v4-11-test, but please double check if it's correct for all branches. Thanks!
Comment on attachment 16322 [details] Patch with cherry-pick information for 4.13, 4.12 and 4.11 lgtm
Comment on attachment 16322 [details] Patch with cherry-pick information for 4.13, 4.12 and 4.11 LGTM, thanks for backporting!
(In reply to Guenther Deschner from comment #32) Pushed to autobuild-v4-{13,12,11}-test.
This bug was referenced in samba v4-13-test: 8079e2a9116a70726cf99c7e7ad6b4ed0f925fbe 3d5be93eea886e31d1eaf087e9bc21bfae336126 2599b6bd3ef0b21590487c95de09be2f82c6d38b
This bug was referenced in samba v4-11-test: f214862ef7a3a6715ff19129a27c827330e03eca 28db03fbe0eb3fa84038089ba32777b3fa35bff7 f5135703e5f773a57369820cf667cf046741f322
This bug was referenced in samba v4-13-stable (Release samba-4.13.2): 8079e2a9116a70726cf99c7e7ad6b4ed0f925fbe 3d5be93eea886e31d1eaf087e9bc21bfae336126 2599b6bd3ef0b21590487c95de09be2f82c6d38b
This bug was referenced in samba v4-12-test: 0004099938068db5c81628a6a2b077807218dd86 329c95136ff82a398856fef174d4b3c899c8fcc8 5d78ec76c86634faabda04d356790f66c82dd21c
Closing out bug report. Thanks!
This bug was referenced in samba v4-11-stable (Release samba-4.11.16): f214862ef7a3a6715ff19129a27c827330e03eca 28db03fbe0eb3fa84038089ba32777b3fa35bff7 f5135703e5f773a57369820cf667cf046741f322
This bug was referenced in samba master: a51cda69ec6a017ad04b5690a3ae67a5478deee9
Reopening to track followup patch
Created attachment 16324 [details] followup patch for 4.13, 4.12 and 4.11
This bug was referenced in samba v4-12-stable (Release samba-4.12.10): 0004099938068db5c81628a6a2b077807218dd86 329c95136ff82a398856fef174d4b3c899c8fcc8 5d78ec76c86634faabda04d356790f66c82dd21c
(In reply to Guenther Deschner from comment #42) Pushed to autobuild-v4-{12,13}-test.
This bug was referenced in samba v4-12-test: 9215dc9dc69c76082d251b94b2d79c9129a732a3
This bug was referenced in samba v4-13-test: 4337a6378db35d6204886d4e3ad6add5c727c7cd
This bug was referenced in samba master: be03ce7d8bb213633eedcfc3299b8d9865a3c67f
I screwed this one up.. Errata patch incoming. Sorry :-(.
This bug was referenced in samba master: 457b49c67803dd95abc8502c2a410fac273f6fba
Created attachment 16333 [details] Errata patch for 4.13.next, 4.12.next, 4.11.next
Comment on attachment 16333 [details] Errata patch for 4.13.next, 4.12.next, 4.11.next LGTM, thanks!
Karolin can you add the errata patch to the 4.11.next, 4.12.next, 4.13.next branches. Thanks !
(In reply to Jeremy Allison from comment #52) Pushed to autobuild-v4-{13,12,11}-test. Do we need another 4.11 bugfix release here?
This bug was referenced in samba v4-13-test: 76f07c13cd665f00603718723f1ef716e1b3df5d
(In reply to Karolin Seeger from comment #53) Yes, I'm afraid.
This bug was referenced in samba v4-11-test: 49710332b5937bb7490db87b803faeaf8ca5190b
This bug was referenced in samba v4-12-test: a6782e760460d598709481b435fee209dae60de3
I'm sorry Karolin, but I think we do. I feel dreadful as this was my mistake :-(.
Created attachment 16345 [details]
follow up patch for 4.11

Hi Karolin,

I would like to include the attached follow-up patches (one from Guenther and another updating the vfs_glusterfs man page) for v4-11 as well. With those patches it should be complete.

For v4.12 and v4.13, only the vfs_glusterfs man page update is missing. I will add it next.
Created attachment 16346 [details] follow up patch for 4.12 and 4.13 Here is the man page update change to be included in v4.12 and v4.13.
This bug was referenced in samba master: 369c1d539837b70e94fe9d533d44860c8a9380a1
Created attachment 16354 [details] follow up patch for 4.11
Created attachment 16355 [details] follow up patch for 4.12 and 4.13
Comment on attachment 16354 [details] follow up patch for 4.11 LGTM
Comment on attachment 16355 [details] follow up patch for 4.12 and 4.13 LGTM
Karolin, please add the followup patches to the relevant branches. I'm confident these are the last and final patches for this bug :-)
(In reply to Guenther Deschner from comment #66) Pushed to autobuild-v4-{13,12,11}-test.
This bug was referenced in samba v4-13-test: 587fa331f62f6bd36fdb8688c8d0734d02f07ee8 585c49f21f7db686f479ce02b2ae647a313f1184
This bug was referenced in samba v4-11-test: d6fb44cba256ce98feea3dc968653ca22aa715bd eb525a3e0704595550a130ae865304a90f22b0f8 f1b1dc12abaecbdef68f752d9a424180b0e6890d
This bug was referenced in samba v4-12-test: 9bcd19c42aee884f79f19128bbf0293ad0da1fb6 ecdddde3c5387ec3749d9758a7191b9ff9bc91d8
This bug was referenced in samba v4-11-stable (Release samba-4.11.17): 49710332b5937bb7490db87b803faeaf8ca5190b d6fb44cba256ce98feea3dc968653ca22aa715bd eb525a3e0704595550a130ae865304a90f22b0f8 f1b1dc12abaecbdef68f752d9a424180b0e6890d
Closing out bug report - RIP
This bug was referenced in samba v4-13-stable (Release samba-4.13.3): 4337a6378db35d6204886d4e3ad6add5c727c7cd 76f07c13cd665f00603718723f1ef716e1b3df5d 587fa331f62f6bd36fdb8688c8d0734d02f07ee8 585c49f21f7db686f479ce02b2ae647a313f1184
This bug was referenced in samba v4-12-stable (Release samba-4.12.11): 9215dc9dc69c76082d251b94b2d79c9129a732a3 a6782e760460d598709481b435fee209dae60de3 9bcd19c42aee884f79f19128bbf0293ad0da1fb6 ecdddde3c5387ec3749d9758a7191b9ff9bc91d8