Bug 14486 - vfs_glusterfs: smbtorture test smb2.rw.rw1 failure on distributed disperse volume
Summary: vfs_glusterfs: smbtorture test smb2.rw.rw1 failure on distributed disperse vo...
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services (show other bugs)
Version: 4.13.0rc3
Hardware: All All
: P5 normal (vote)
Target Milestone: ---
Assignee: Guenther Deschner
QA Contact: Samba QA Contact
Depends on:
Reported: 2020-09-10 14:37 UTC by Sachin Prabhu
Modified: 2020-11-24 11:17 UTC (History)
6 users (show)

See Also:

Modifications to smb2.rw.rw1 (1.82 KB, application/x-compressed-tar)
2020-09-10 14:45 UTC, Sachin Prabhu
no flags Details
test program using gfapi (3.04 KB, text/x-csrc)
2020-09-10 14:47 UTC, Sachin Prabhu
no flags Details
pcap capturing the problem (1.69 MB, application/vnd.tcpdump.pcap)
2020-09-11 18:09 UTC, Sachin Prabhu
no flags Details
volfile with write-behind translator enabled (6.16 KB, text/plain)
2020-11-01 15:03 UTC, Sachin Prabhu
no flags Details
volfile with write-behind translator disabled (6.04 KB, text/plain)
2020-11-01 15:04 UTC, Sachin Prabhu
no flags Details
Patch with cherry-pick information for 4.13, 4.12 and 4.11 (6.72 KB, patch)
2020-11-03 07:25 UTC, Karolin Seeger
jra: review+
gd: review+
kseeger: review? (abartlet)
anoopcs: review+
followup patch for 4.13, 4.12 and 4.11 (2.96 KB, patch)
2020-11-05 08:26 UTC, Guenther Deschner
jra: review+
gd: ci-passed+
Errata patch for 4.13.next, 4.12.next, 4.11.next (2.42 KB, patch)
2020-11-12 18:27 UTC, Jeremy Allison
gd: review+
follow up patch for 4.11 (4.49 KB, patch)
2020-11-20 05:38 UTC, Anoop C S
anoopcs: review? (gd)
follow up patch for 4.12 and 4.13 (1.39 KB, patch)
2020-11-20 05:41 UTC, Anoop C S
anoopcs: review? (gd)

Note You need to log in before you can comment on or make changes to this bug.
Description Sachin Prabhu 2020-09-10 14:37:09 UTC
A quick check using the smbtorture tests shows a failure in conducting test smb2.rw.rw1 on a distributed-disperse volume

The exact command run as part of the samba-integration tests is 
/bin/smbtorture --fullname --option=torture:progress=no --option=torture:sharedelay=100000 --option=torture:writetimeupdatedelay=500000 --format=subunit --target=samba3 --user=test1%x // smb2.rw.rw1

The test involves writing a random string to a file and immediately reading it. The contents are then checked to confirm that the data returned for the read call matches the data written for write calls. There are two errors seen.
1) Fewer bytes read in the read call.
2) Mismatch in the buffer between read and write.
I believe both errors are manifestation of the same problem. More details below.

1) The problem is only seen with a disperse volume. I have also tested against a replicated volume successfully.
2) The problem couldn't be reproduced using vfs_gluster_fuse module using the same setup. ie. using samba exported fuse shares.
3) With slight modifications to the testing script, I determined that the read returns the write from the call which immediately preceeded this call. ie. The content returned on read is what was written 2 writes ago and not the last write.
4) On retrying the read, the correct data is returned.
5) I couldn't reproduce this issue using a test program written with gfapi.

samba-20200826.095928.fbe5853-1.el7.1.x86_64 - recent upstream build
glusterfs-20200905.4e81a5e-0.0.el7.x86_64 - recent upstream gluster build
Comment 1 Sachin Prabhu 2020-09-10 14:45:46 UTC
Created attachment 16217 [details]
Modifications to smb2.rw.rw1

Modifciations to smbtorture test smb2.rw.rw1.

This includes
1) Use a single char pattern. ie. Generates buffer consisting of one character which changes every time a new buffer is generated. So the first attempt sets up buffer with repeating char 'a'. The second call sets up the buffer to char 'b', etc. This was used to confirm that the problem read call returns the content written before the immediate write. It appears that the data written isn't either in cache or flushed to disk before the read call.

2) We retry until we get the right buffer and keep track of number of retries. This confirmed that the correct output was returned immediately after the failed read call.
Comment 2 Sachin Prabhu 2020-09-10 14:47:49 UTC
Created attachment 16218 [details]
test program using gfapi

Test program using gfapi to attempt to reproduce the problem.

I was unsuccessful in reproducing the issue with this test program.
Comment 3 Jeremy Allison 2020-09-10 15:55:47 UTC
Hmmm. Doesn't this mean there's a bug in the gluster libraries ?

It isn't as if Samba caches any of the data, everything is just going through the glfs_XXXX() API calls.
Comment 4 Sachin Prabhu 2020-09-10 15:59:29 UTC
I couldn't recreate with a test program using gfapi calls thus taking samba out of the equation. There is a chance that the bug is within the vfs plugin.

We are still investigating the cause of this issue and gd thought that we could record this here while we investigate.

Sachin Prabhu
Comment 5 Jeremy Allison 2020-09-10 16:08:44 UTC
No it's a good idea to record here. Remember, the vfs plugin uses threaded calls to read/write data. Is your gfapi test program threaded ?

My money would be on a threading issue in gfapi.
Comment 6 Sachin Prabhu 2020-09-10 16:10:53 UTC
No. No threading on the test program. I'll run a few checks on the tcpdumps between samba and gluster to confirm.

Sachin Prabhu
Comment 7 Volker Lendecke 2020-09-11 08:13:57 UTC
Can you reproduce the issue with

aio read size = 0
aio write size = 0

which means we don't use threading? Just for another data point.
Comment 8 Sachin Prabhu 2020-09-11 18:09:13 UTC
Created attachment 16226 [details]
pcap capturing the problem

This is a pcap file taken on the samba server to which the client is connected. The modified smb2.rw.rw1 test(using char patterns - patch provided earlier) was run against the server. This clearly shows the problem.

for pattern 'd' - This worked as expected.

1187 SMB WRITE REQUEST 17447 bytes pattern - 'd'
1193 SMB WRITE RESPONSE sucessfull
1194 SMB READ REQUEST Len 82423
1219 GLFS WRITE CALL pattern 'd'
1222 GLFS WRITE CALL pattern 'd'
1226 GLFS WRITE CALL pattern 'd'
1262 GLFS READ REPLY pattern 'd'
1264 GLFS READ REPLY pattern 'd'
1267 SMB READ RESPONSE pattern 'd'

In this case, we accept the write from the client, and then we accept the read call but leave it on status pending. 
We then write onto gluster. Then subsequently, we make a read call on gluster which is successful.
The data from the read is then sent back to the client over smb.

for pattern 'e' - Problem write-read

1271 SMB WRITE REQUEST 83692 bytes pattern - 'e'
1277 SMB READ REQUEST Len 83693
1282 GLFS READ CALL   <--- NO GLFS Writes performed before we call this READ call
1288 GLFS READ REPLY pattern ''
1289 GLFS READ REPLY pattern 'd' <-- incorrect pattern returned from GLFS
1305 SMB READ RESPONSE pattern 'd' <-- incorrect pattern returned
1312 SMB CLOSE REQUEST <-- file closed
1318 GLFS WRITE CALL pattern 'e'
1322 GLFS WRITE CALL pattern ''
1324 GLFS WRITE CALL pattern  'e'

This is the problem case.
1) We see a smb write call which is successful and then subsequently a read call which is set to status pending.
2) We can see that we make a read call to gluster first. This returns the data from the previous write ie. pattern 'd'.
3) This is sent to the smb client. This fails the test and the file is closed.
4) We then see the pending writes made by the smb client being flushed to the gluster fs.
Comment 9 Sachin Prabhu 2020-09-11 18:28:17 UTC
Hello Volker,

A quick check with 

aio read size = 0
aio write size = 0

Did not fix the problem.

I have checked the tcpdump and we see the same pattern as described in c#8 with the READ on the backing device being called before the WRITE is.

I suspect we have see these issues earlier in customer cases which unfortunately we never had a good reproducer for. I will continue investigating on Monday.
Comment 10 Jeremy Allison 2020-09-11 18:28:57 UTC
OK, it looks like you're going to need some kind of synchronization object (mutex) inside the vfs_glusterfs.c code on a per-fd basis to ensure that the ordering is maintained w.r.t to the SMB file accesses.

Great analysis BTW !
Comment 11 Jeremy Allison 2020-09-11 18:33:19 UTC
Just FYI. Once the SMB2_WRITE returns success it means we have gotten a return from the glfs_pwrite() call made in the VFS module. We will only return:


once vfs_gluster_pwrite_done()/vfs_gluster_pwrite_recv() has completed. That means we've given the data to the glfs_XXXX() calls.

So I may be wrong about needing a mutex inside the VFS as in this case we've completed the glfs_XXX() call.

You might want to turn on as much debug as you can get inside the glfs_XXX() code to try and track the path through this case.
Comment 12 Michael Adam 2020-09-15 10:18:41 UTC
In discussion with Sachin, we spotted one potential additional problem: the smbtorture test uses two tree connects. So it is using two gluster clients - i.e. two instances of the libgfapi. 

We can modify the test program to use two instances of libgfapi.

So the problem might still be in the libgfapi or even the underlying implementation. Can it be reproduced with two fuse mounts?...
Comment 13 Sachin Prabhu 2020-09-21 16:43:50 UTC
I was able to reproduce this with a test program using gfapi and by passing samba completely.

A new issue has been created with the gluster project at
Comment 14 Jeremy Allison 2020-09-21 16:48:42 UTC
Great ! Thanks Sachin. Do you want to leave this open until the bug has been fixed in gluster in order for Samba users to track it ?
Comment 15 Sachin Prabhu 2020-09-21 16:52:10 UTC
Hello Jeremy,

I would like to keep this open for some time. Atleast until we have an acknowledgement of the problem by the gluster team. We can also use it to track any changes(if recommended) to vfs_glusterfs by the gluster team.

Sachin Prabhu
Comment 16 Sachin Prabhu 2020-09-22 11:17:44 UTC
We have determined the cause of the problem to be the write-behind translator in gluster.

In summary, the translator returns a success for a write call before it has been flushed to the backing device. It continues writing in the background.

To disable this translator, run the following gluster command.

# gluster volume set <volname> write-behind off

This fixed the problem both for the test program as well as for the smbtorture tests using samba.

So closing this issue for now. We are considering the performance hits for disabling the translator for our installations and will re-open the bz to update notes in case we decide to make any changes based on the results.
Comment 17 Jeremy Allison 2020-09-22 16:45:23 UTC
Oh, that "write-behind translator" is horrid. Is it off or on by default ?
Comment 18 Sachin Prabhu 2020-09-22 17:00:48 UTC
It is on by default. We are discussing internally on how best to fix this.

There are new translators in development to replace this and other performance translators in gluster.
Comment 19 Jeremy Allison 2020-09-22 17:10:18 UTC
Oh, that's really bad :-(. Re-opening this bug. Can you add a patch to the vfs_glusterfs man page that covers this problem and explains how to turn this off. Samba users really need to be aware of this as it's a data corruption bug.

Alternatively, is there an API vfs_glusterfs can call to turn this off ?
Comment 20 Sachin Prabhu 2020-09-29 09:30:57 UTC
We are discussing how best to solve this internally. I will update the issue as soon as we have a solution.
Comment 21 Sachin Prabhu 2020-10-16 09:42:27 UTC

A merge request has been made to modify the man page for vfs_glusterfs

At the moment there is no programatic way to disable the translator in the vfs_glusterfs plugin. We are working with GlusterFS upstream to add this feature. The patch has been proposed and is going through the system.
This however would only be available in the newer versions of GlusterFS.

A second option was to determine programmatically if the write-behind translator could be detected and if so, either warn the user or do not allow the mount. To do this, at the moment, the best method available is to run glfs_get_volfile() and parse the volume file obtained for strings which would indicate the presence of the write-behind translator. We are looking for other options and do not intend to move forward with this option yet.

Therefore, we are adding a note to the man page pointing to a possible data corruption issue in the man page.
Comment 22 Jeremy Allison 2020-10-30 19:14:23 UTC
"Determining it programatically will require parsing through a 100s of lines of text"

Surely this is just a strstr() call on the output of glfs_get_volfile() into a large buffer and can be done at connection time.

Sachin - what string should this be looking for ?
Comment 23 Sachin Prabhu 2020-11-01 15:03:32 UTC
Created attachment 16319 [details]
volfile with write-behind translator enabled

volfile for volume vol-replicate obtained by calling  glfs_get_volfile()
Comment 24 Sachin Prabhu 2020-11-01 15:04:29 UTC
Created attachment 16320 [details]
volfile with write-behind translator disabled

for volume vol-replicated. The volfile was obtained with glfs_get_volfile() call.
Comment 25 Sachin Prabhu 2020-11-01 15:28:58 UTC
I have uploaded two volfiles with the write behind translator enabled/disabled.

The volfile is a graph representation of the translators in the order in which they are to be called.

In the files above, the volume is 'vol-replicate' with subvolume 'vol-replicate-md-cache'. 
'vol-replicate-md-cache' has subvolume 'vol-replicate-nl-cache'.
'vol-replicate-nl-cache' has subvolume 'vol-replicate-quick-read'.
'vol-replicate-quick-read' has subvolume 'vol-replicate-open-behind'.
'vol-replicate-open-behind' has subvolume 'vol-replicate-write-behind'.
This above is the representation for the write behind translator.
It continues with the subvolumes eventually ending with the individual bricks where the file or fragment is actually written.

A simple way to determine the presence of the write-behind translator is to look for the string
"volume vol-replicate-write-behind"
ie. "volume <volume name>-write-behind.
However the proper way for this is to build the tree and to ensure that the write-behind translator is called when writing to the volume.

My hesitation comes from the fact that apart from the text parsing and the possibilities of bugs associated with it, this information is internal to GlusterFS and was only exposed through a feature request from the Ganesha team(rh bz 1090363) which needed it to determine the UUID of the device. I don't think that there are any guaranties that the representation will not change in the future (I am not 100% sure of this and will have to check this). In that case, the code to check the presence of the translator may end up breaking and we may have incompatible versions which will need more hacks to work around.
Comment 26 Samba QA Contact 2020-11-02 21:41:10 UTC
This bug was referenced in samba master:

Comment 27 Jeremy Allison 2020-11-03 01:09:00 UTC
3 patches that are going into master incoming once they've all landed !
Comment 28 Jeremy Allison 2020-11-03 01:09:00 UTC
3 patches that are going into master incoming once they've all landed !
Comment 29 Samba QA Contact 2020-11-03 01:58:13 UTC
This bug was referenced in samba master:

Comment 30 Karolin Seeger 2020-11-03 07:25:16 UTC
Created attachment 16322 [details]
Patch with cherry-pick information for 4.13, 4.12 and 4.11

This patchset contains the 3 commits Jeremy mentioned cherry-picked from master. It applies to v4-13-test, v4-12-test and v4-11-test, but please double check if it's correct for all branches. Thanks!
Comment 31 Anoop C S 2020-11-03 07:40:21 UTC
Comment on attachment 16322 [details]
Patch with cherry-pick information for 4.13, 4.12 and 4.11

Comment 32 Guenther Deschner 2020-11-03 08:00:55 UTC
Comment on attachment 16322 [details]
Patch with cherry-pick information for 4.13, 4.12 and 4.11

LGTM, thanks for backporting!
Comment 33 Karolin Seeger 2020-11-03 08:03:19 UTC
(In reply to Guenther Deschner from comment #32)
Pushed to autobuild-v4-{13,12,11}-test.
Comment 34 Samba QA Contact 2020-11-03 10:17:05 UTC
This bug was referenced in samba v4-13-test:

Comment 35 Samba QA Contact 2020-11-03 11:31:13 UTC
This bug was referenced in samba v4-11-test:

Comment 36 Samba QA Contact 2020-11-03 12:35:41 UTC
This bug was referenced in samba v4-13-stable (Release samba-4.13.2):

Comment 37 Samba QA Contact 2020-11-03 13:03:04 UTC
This bug was referenced in samba v4-12-test:

Comment 38 Karolin Seeger 2020-11-04 08:04:26 UTC
Closing out bug report.

Comment 39 Samba QA Contact 2020-11-04 10:34:24 UTC
This bug was referenced in samba v4-11-stable (Release samba-4.11.16):

Comment 40 Samba QA Contact 2020-11-04 22:54:07 UTC
This bug was referenced in samba master:

Comment 41 Guenther Deschner 2020-11-05 08:24:51 UTC
Reopening to track followup patch
Comment 42 Guenther Deschner 2020-11-05 08:26:37 UTC
Created attachment 16324 [details]
followup patch for 4.13, 4.12 and 4.11
Comment 43 Samba QA Contact 2020-11-05 10:01:40 UTC
This bug was referenced in samba v4-12-stable (Release samba-4.12.10):

Comment 44 Karolin Seeger 2020-11-05 10:55:46 UTC
(In reply to Guenther Deschner from comment #42)
Pushed to autobuild-v4-{12,13}-test.
Comment 45 Samba QA Contact 2020-11-05 12:28:06 UTC
This bug was referenced in samba v4-12-test:

Comment 46 Samba QA Contact 2020-11-05 13:55:05 UTC
This bug was referenced in samba v4-13-test:

Comment 47 Samba QA Contact 2020-11-09 13:31:04 UTC
This bug was referenced in samba master:

Comment 48 Jeremy Allison 2020-11-10 20:39:05 UTC
I screwed this one up.. Errata patch incoming. Sorry :-(.
Comment 49 Samba QA Contact 2020-11-11 15:03:05 UTC
This bug was referenced in samba master:

Comment 50 Jeremy Allison 2020-11-12 18:27:55 UTC
Created attachment 16333 [details]
Errata patch for 4.13.next, 4.12.next, 4.11.next
Comment 51 Guenther Deschner 2020-11-12 21:42:30 UTC
Comment on attachment 16333 [details]
Errata patch for 4.13.next, 4.12.next, 4.11.next

LGTM, thanks!
Comment 52 Jeremy Allison 2020-11-12 22:26:08 UTC
Karolin can you add the errata patch to the 4.11.next, 4.12.next, 4.13.next branches. Thanks !
Comment 53 Karolin Seeger 2020-11-19 09:30:37 UTC
(In reply to Jeremy Allison from comment #52)
Pushed to autobuild-v4-{13,12,11}-test.
Do we need another 4.11 bugfix release here?
Comment 54 Samba QA Contact 2020-11-19 11:26:02 UTC
This bug was referenced in samba v4-13-test:

Comment 55 Guenther Deschner 2020-11-19 11:39:01 UTC
(In reply to Karolin Seeger from comment #53)
Yes, I'm afraid.
Comment 56 Samba QA Contact 2020-11-19 12:39:05 UTC
This bug was referenced in samba v4-11-test:

Comment 57 Samba QA Contact 2020-11-19 14:10:56 UTC
This bug was referenced in samba v4-12-test:

Comment 58 Jeremy Allison 2020-11-19 16:51:32 UTC
I'm sorry Karolin, but I think we do. I feel dreadful as this was my mistake :-(.
Comment 59 Anoop C S 2020-11-20 05:38:01 UTC
Created attachment 16345 [details]
follow up patch for 4.11

Hi Karolin,

I would like to include attached follow up patch(one from Guenther and another an update to vfs_glusterfs man page) also for v4-11. With those patches it should be complete.

For v4.12 and v4.13, only change to update vfs_glusterfs man page is missing. I will be adding it next.
Comment 60 Anoop C S 2020-11-20 05:41:24 UTC
Created attachment 16346 [details]
follow up patch for 4.12 and 4.13

Here is the man page update change to be included in v4.12 and v4.13.