Bug 14486 - vfs_gluster: smbtorture test smb2.rw.rw1 failure on distributed disperse volume
Summary: vfs_gluster: smbtorture test smb2.rw.rw1 failure on distributed disperse volume
Status: NEW
Alias: None
Product: Samba 4.1 and newer
Classification: Unclassified
Component: File services (show other bugs)
Version: 4.13.0rc3
Hardware: All All
Importance: P5 normal (vote)
Target Milestone: ---
Assignee: Samba QA Contact
QA Contact: Samba QA Contact
Depends on:
Reported: 2020-09-10 14:37 UTC by Sachin Prabhu
Modified: 2020-09-15 10:18 UTC (History)
3 users (show)

See Also:

Modifications to smb2.rw.rw1 (1.82 KB, application/x-compressed-tar)
2020-09-10 14:45 UTC, Sachin Prabhu
no flags Details
test program using gfapi (3.04 KB, text/x-csrc)
2020-09-10 14:47 UTC, Sachin Prabhu
no flags Details
pcap capturing the problem (1.69 MB, application/vnd.tcpdump.pcap)
2020-09-11 18:09 UTC, Sachin Prabhu
no flags Details

Description Sachin Prabhu 2020-09-10 14:37:09 UTC
A quick check using smbtorture shows a failure when running test smb2.rw.rw1 on a distributed-disperse volume.

The exact command run as part of the samba-integration tests is:
/bin/smbtorture --fullname --option=torture:progress=no --option=torture:sharedelay=100000 --option=torture:writetimeupdatedelay=500000 --format=subunit --target=samba3 --user=test1%x // smb2.rw.rw1

The test writes a random string to a file and immediately reads it back. The contents are then checked to confirm that the data returned by the read call matches the data written by the write calls. Two errors are seen:
1) Fewer bytes read in the read call.
2) A mismatch between the read and write buffers.
I believe both errors are manifestations of the same problem. More details below.

1) The problem is only seen with a disperse volume. I have also tested against a replicated volume successfully.
2) The problem couldn't be reproduced on the same setup when bypassing vfs_glusterfs, i.e. using Samba-exported FUSE shares.
3) With slight modifications to the testing script, I determined that the read returns the data written by an earlier write call, i.e. the content returned on read is what was written two writes ago and not the last write.
4) On retrying the read, the correct data is returned.
5) I couldn't reproduce this issue using a test program written with gfapi.
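For reference, the write-then-read cycle the test exercises can be sketched with plain POSIX I/O on a local file (the buffer size, file name, iteration count and return codes below are illustrative, not taken from the smbtorture source):

```c
/* Sketch of the smb2.rw.rw1 write-then-read cycle using plain POSIX I/O
 * instead of SMB2 operations. Illustrative only. */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE 8192

/* Returns 0 if every read returned exactly what was just written,
 * 1 on a short read (error 1), 2 on a data mismatch (error 2). */
int write_read_cycles(const char *path, int iterations)
{
    char wbuf[BUF_SIZE], rbuf[BUF_SIZE];
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    for (int i = 0; i < iterations; i++) {
        for (size_t j = 0; j < BUF_SIZE; j++)
            wbuf[j] = (char)(rand() & 0xff);    /* random payload */

        if (pwrite(fd, wbuf, BUF_SIZE, 0) != BUF_SIZE) {
            close(fd);
            return -1;
        }
        ssize_t n = pread(fd, rbuf, BUF_SIZE, 0);   /* immediate read-back */

        if (n != BUF_SIZE) { close(fd); return 1; } /* fewer bytes read */
        if (memcmp(wbuf, rbuf, BUF_SIZE) != 0) { close(fd); return 2; }
    }
    close(fd);
    unlink(path);
    return 0;
}
```

On a local filesystem this loop always passes; it is the equivalent SMB2 sequence against the disperse volume that intermittently hits one of the two error cases.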

samba-20200826.095928.fbe5853-1.el7.1.x86_64 - recent upstream build
glusterfs-20200905.4e81a5e-0.0.el7.x86_64 - recent upstream gluster build
Comment 1 Sachin Prabhu 2020-09-10 14:45:46 UTC
Created attachment 16217 [details]
Modifications to smb2.rw.rw1

Modifications to smbtorture test smb2.rw.rw1.

This includes
1) Use a single-character pattern, i.e. generate a buffer consisting of one repeated character which changes every time a new buffer is generated. The first attempt fills the buffer with the repeating char 'a', the second call fills it with 'b', and so on. This was used to confirm that the failing read call returns the content written before the immediately preceding write. It appears that the data written is neither in cache nor flushed to disk before the read call.

2) Retry the read until we get the right buffer and keep track of the number of retries. This confirmed that the correct data was returned immediately after the failed read call.
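The two modifications can be sketched roughly as follows (illustrative helpers, not the actual patch — the attachment above has the real changes):

```c
/* Sketch of the two test modifications: a single-character pattern
 * generator and a read-retry loop that counts attempts. Illustrative. */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Modification 1: each new buffer repeats one character, advancing
 * 'a' -> 'b' -> 'c' ..., so a stale read is immediately recognizable. */
void fill_pattern(char *buf, size_t len, unsigned generation)
{
    memset(buf, 'a' + (generation % 26), len);
}

/* Modification 2: retry the read until the expected pattern appears.
 * Returns the number of retries needed (0 = first read was correct),
 * or -1 if the data never matched within max_retries. */
int read_until_match(int fd, const char *expect, char *buf,
                     size_t len, int max_retries)
{
    for (int retry = 0; retry <= max_retries; retry++) {
        if (pread(fd, buf, len, 0) == (ssize_t)len &&
            memcmp(buf, expect, len) == 0)
            return retry;
    }
    return -1;
}
```

On a local file the retry count is always 0; against the disperse volume the failing reads matched only on a subsequent attempt.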
Comment 2 Sachin Prabhu 2020-09-10 14:47:49 UTC
Created attachment 16218 [details]
test program using gfapi

Test program using gfapi to attempt to reproduce the problem.

I was unsuccessful in reproducing the issue with this test program.
Comment 3 Jeremy Allison 2020-09-10 15:55:47 UTC
Hmmm. Doesn't this mean there's a bug in the gluster libraries ?

It isn't as if Samba caches any of the data, everything is just going through the glfs_XXXX() API calls.
Comment 4 Sachin Prabhu 2020-09-10 15:59:29 UTC
I couldn't recreate it with a test program using gfapi calls, thus taking Samba out of the equation. There is a chance that the bug is within the VFS plugin.

We are still investigating the cause of this issue and gd thought that we could record this here while we investigate.

Sachin Prabhu
Comment 5 Jeremy Allison 2020-09-10 16:08:44 UTC
No, it's a good idea to record it here. Remember, the VFS plugin uses threaded calls to read/write data. Is your gfapi test program threaded?

My money would be on a threading issue in gfapi.
Comment 6 Sachin Prabhu 2020-09-10 16:10:53 UTC
No. No threading on the test program. I'll run a few checks on the tcpdumps between samba and gluster to confirm.

Sachin Prabhu
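The interleaving Jeremy suspects can be forced deterministically with a small pthread sketch (plain POSIX I/O on a local file; the 100 ms sleep stands in for a slow worker thread, and all names are illustrative):

```c
/* Sketch of the suspected threading hazard: a write handed to a worker
 * thread races with a read issued immediately afterwards. The sleep
 * forces the "write lands late" interleaving that a real thread pool
 * would only hit occasionally. Illustrative only. */
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

static int g_fd;

static void *delayed_write(void *arg)
{
    (void)arg;
    usleep(100 * 1000);              /* worker thread runs late */
    (void)pwrite(g_fd, "b", 1, 0);   /* the "new" data arrives late */
    return NULL;
}

/* Returns the byte a read sees when issued right after the write was
 * *submitted* (not completed): the stale 'a' in this interleaving. */
char race_read(void)
{
    char c = 0;
    pthread_t t;

    g_fd = open("race-sketch.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (g_fd < 0 || pwrite(g_fd, "a", 1, 0) != 1)
        return '?';                  /* previous generation of data */

    pthread_create(&t, NULL, delayed_write, NULL); /* submit the write */
    (void)pread(g_fd, &c, 1, 0);     /* read before the write completes */

    pthread_join(t, NULL);
    close(g_fd);
    unlink("race-sketch.dat");
    return c;
}
```

The stale read here is exactly the symptom in the bug: the read overtakes a write that has been submitted but not yet applied.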
Comment 7 Volker Lendecke 2020-09-11 08:13:57 UTC
Can you reproduce the issue with

aio read size = 0
aio write size = 0

which means we don't use threading? Just for another data point.
Comment 8 Sachin Prabhu 2020-09-11 18:09:13 UTC
Created attachment 16226 [details]
pcap capturing the problem

This is a pcap file taken on the Samba server to which the client is connected. The modified smb2.rw.rw1 test (using char patterns - patch provided earlier) was run against the server. This clearly shows the problem.

for pattern 'd' - This worked as expected.

1187 SMB WRITE REQUEST 17447 bytes pattern - 'd'
1193 SMB WRITE RESPONSE successful
1194 SMB READ REQUEST Len 82423
1219 GLFS WRITE CALL pattern 'd'
1222 GLFS WRITE CALL pattern 'd'
1226 GLFS WRITE CALL pattern 'd'
1262 GLFS READ REPLY pattern 'd'
1264 GLFS READ REPLY pattern 'd'
1267 SMB READ RESPONSE pattern 'd'

In this case, we accept the write from the client, and then we accept the read call but leave it in status pending.
We then write to gluster. Subsequently, we make a read call on gluster, which is successful.
The data from the read is then sent back to the client over SMB.

for pattern 'e' - Problem write-read

1271 SMB WRITE REQUEST 83692 bytes pattern - 'e'
1277 SMB READ REQUEST Len 83693
1282 GLFS READ CALL   <--- NO GLFS Writes performed before we call this READ call
1288 GLFS READ REPLY pattern ''
1289 GLFS READ REPLY pattern 'd' <-- incorrect pattern returned from GLFS
1305 SMB READ RESPONSE pattern 'd' <-- incorrect pattern returned
1312 SMB CLOSE REQUEST <-- file closed
1318 GLFS WRITE CALL pattern 'e'
1322 GLFS WRITE CALL pattern ''
1324 GLFS WRITE CALL pattern  'e'

This is the problem case.
1) We see an SMB write call which is successful and then subsequently a read call which is set to status pending.
2) We can see that we make a read call to gluster first. This returns the data from the previous write, i.e. pattern 'd'.
3) This is sent to the SMB client. This fails the test and the file is closed.
4) We then see the pending writes made by the SMB client being flushed to the gluster fs.
Comment 9 Sachin Prabhu 2020-09-11 18:28:17 UTC
Hello Volker,

A quick check with 

aio read size = 0
aio write size = 0

Did not fix the problem.

I have checked the tcpdump and we see the same pattern as described in c#8, with the READ on the backing device being issued before the WRITE.

I suspect we have seen these issues earlier in customer cases, for which unfortunately we never had a good reproducer. I will continue investigating on Monday.
Comment 10 Jeremy Allison 2020-09-11 18:28:57 UTC
OK, it looks like you're going to need some kind of synchronization object (mutex) inside the vfs_glusterfs.c code on a per-fd basis to ensure that the ordering is maintained w.r.t. the SMB file accesses.

Great analysis BTW !
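A per-fd mutex of the kind suggested here could look roughly like the following sketch (plain POSIX I/O, not the actual vfs_glusterfs.c structures); note that a mutex only serializes calls that are in flight, so the submission order of the threaded requests would still have to be preserved separately:

```c
/* Sketch of per-fd serialization: wrap read/write in a mutex held per
 * file handle so a read can never run concurrently with a write on the
 * same fd. Illustrative only; not the actual VFS module code. */
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

struct fd_handle {
    int fd;
    pthread_mutex_t lock;   /* one mutex per open file handle */
};

ssize_t ordered_pwrite(struct fd_handle *h, const void *buf,
                       size_t len, off_t off)
{
    pthread_mutex_lock(&h->lock);
    ssize_t n = pwrite(h->fd, buf, len, off);
    pthread_mutex_unlock(&h->lock);
    return n;
}

ssize_t ordered_pread(struct fd_handle *h, void *buf,
                      size_t len, off_t off)
{
    pthread_mutex_lock(&h->lock);
    ssize_t n = pread(h->fd, buf, len, off);
    pthread_mutex_unlock(&h->lock);
    return n;
}
```

With the lock held across each call, a read that reaches the lock after a write has taken it will observe that write's data once it acquires the lock.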
Comment 11 Jeremy Allison 2020-09-11 18:33:19 UTC
Just FYI. Once the SMB2_WRITE returns success it means we have gotten a return from the glfs_pwrite() call made in the VFS module. We will only return once vfs_gluster_pwrite_done()/vfs_gluster_pwrite_recv() has completed. That means we've given the data to the glfs_XXXX() calls.

So I may be wrong about needing a mutex inside the VFS as in this case we've completed the glfs_XXX() call.

You might want to turn on as much debug as you can get inside the glfs_XXX() code to try and track the path through this case.
Comment 12 Michael Adam 2020-09-15 10:18:41 UTC
In discussion with Sachin, we spotted one potential additional problem: the smbtorture test uses two tree connects, so it is using two gluster clients, i.e. two instances of libgfapi.

We can modify the test program to use two instances of libgfapi.

So the problem might still be in libgfapi or even the underlying implementation. Can it be reproduced with two fuse mounts?...