A quick check using the smbtorture tests shows a failure in test smb2.rw.rw1 on a distributed-disperse volume.
The exact command run as part of the samba-integration tests is:
/bin/smbtorture --fullname --option=torture:progress=no --option=torture:sharedelay=100000 --option=torture:writetimeupdatedelay=500000 --format=subunit --target=samba3 --user=test1%x //192.168.123.10/gluster-vol-disperse smb2.rw.rw1
The test involves writing a random string to a file and immediately reading it back. The contents are then checked to confirm that the data returned by the read call matches the data written by the write calls. Two errors are seen:
1) Fewer bytes read in the read call than were written.
2) Mismatch between the read and write buffers.
I believe both errors are manifestations of the same problem. More details below.
1) The problem is only seen with a disperse volume. I have also tested against a replicated volume successfully.
2) The problem couldn't be reproduced with the same setup using Samba-exported FUSE shares, i.e. going through a FUSE mount instead of the vfs_glusterfs module.
3) With slight modifications to the testing script, I determined that the read returns the data from the write which preceded the most recent one, i.e. the content returned on read is what was written two writes ago and not the last write.
4) On retrying the read, the correct data is returned.
5) I couldn't reproduce this issue using a test program written with gfapi.
samba-20200826.095928.fbe5853-1.el7.1.x86_64 - recent upstream build
glusterfs-20200905.4e81a5e-0.0.el7.x86_64 - recent upstream gluster build
Created attachment 16217 [details]
Modifications to smb2.rw.rw1
Modifications to smbtorture test smb2.rw.rw1:
1) Use a single-char pattern, i.e. generate a buffer consisting of one repeating character which changes every time a new buffer is generated. So the first attempt fills the buffer with the repeating char 'a', the second call fills it with 'b', etc. This was used to confirm that the problematic read call returns the content written by the write before the immediately preceding one. It appears that the data written is neither in the cache nor flushed to disk before the read call.
2) Retry the read until we get the right buffer, keeping track of the number of retries. This confirmed that the correct data was returned immediately after the failed read call.
Created attachment 16218 [details]
test program using gfapi
Test program using gfapi to attempt to reproduce the problem.
I was unsuccessful in reproducing the issue with this test program.
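For context, such a gfapi test program presumably follows a call sequence like the outline below. This is pseudocode only — attachment 16218 is authoritative, the glfs_* signatures are abbreviated, and the path and loop are illustrative:

```
fs = glfs_new("gluster-vol-disperse");
glfs_set_volfile_server(fs, "tcp", "192.168.123.10", 24007);
glfs_init(fs);

fd = glfs_creat(fs, "/torture_rw1.dat", O_CREAT | O_RDWR, 0644);
for each iteration:
    fill wbuf with the next pattern char
    glfs_pwrite(fd, wbuf, len, offset, ...)   /* write */
    glfs_pread(fd, rbuf, len, offset, ...)    /* immediately read back */
    compare wbuf and rbuf                     /* never mismatched here */

glfs_close(fd);
glfs_fini(fs);
```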
Hmmm. Doesn't this mean there's a bug in the gluster libraries ?
It isn't as if Samba caches any of the data, everything is just going through the glfs_XXXX() API calls.
I couldn't recreate it with a test program using gfapi calls, thus taking Samba out of the equation. There is a chance that the bug is within the vfs plugin.
We are still investigating the cause of this issue and thought we could record it here while we investigate.
No, it's a good idea to record it here. Remember, the vfs plugin uses threaded calls to read/write data. Is your gfapi test program threaded ?
My money would be on a threading issue in gfapi.
No, no threading in the test program. I'll run a few checks on the tcpdumps between samba and gluster to confirm.
Can you reproduce the issue with
aio read size = 0
aio write size = 0
which means we don't use threading? Just for another data point.
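These are share-level smb.conf parameters; in this setup the share section would look something like the fragment below (the share and volume names are taken from the test command; the other options are illustrative, not confirmed from the actual config):

```ini
[gluster-vol-disperse]
    vfs objects = glusterfs
    glusterfs:volume = gluster-vol-disperse
    ; disable threaded aio for this data point
    aio read size = 0
    aio write size = 0
```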
Created attachment 16226 [details]
pcap capturing the problem
This is a pcap file taken on the Samba server to which the client is connected. The modified smb2.rw.rw1 test (using char patterns - patch provided earlier) was run against the server. This clearly shows the problem.
for pattern 'd' - This worked as expected.
1187 SMB WRITE REQUEST 17447 bytes pattern - 'd'
1193 SMB WRITE RESPONSE successful
1194 SMB READ REQUEST Len 82423
1207 SMB READ STATUS_PENDING
1219 GLFS WRITE CALL pattern 'd'
1222 GLFS WRITE CALL pattern 'd'
1226 GLFS WRITE CALL pattern 'd'
1231 GLFS WRITE REPLY Success
1232 GLFS WRITE REPLY Success
1233 GLFS WRITE REPLY Success
1260 GLFS READ CALL
1261 GLFS READ CALL
1262 GLFS READ REPLY pattern 'd'
1264 GLFS READ REPLY pattern 'd'
1267 SMB READ RESPONSE pattern 'd'
In this case, we accept the write from the client, and then we accept the read call but leave it pending (STATUS_PENDING).
We then write to gluster. Subsequently, we make a read call on gluster, which is successful.
The data from the read is then sent back to the client over SMB.
for pattern 'e' - Problem write-read
1271 SMB WRITE REQUEST 83692 bytes pattern - 'e'
1273 SMB WRITE RESPONSE success
1277 SMB READ REQUEST Len 83693
1282 GLFS READ CALL <--- NO GLFS Writes performed before we call this READ call
1283 GLFS READ CALL
1288 GLFS READ REPLY pattern ''
1289 GLFS READ REPLY pattern 'd' <-- incorrect pattern returned from GLFS
1302 SMB READ RESPONSE STATUS_PENDING
1305 SMB READ RESPONSE pattern 'd' <-- incorrect pattern returned
1312 SMB CLOSE REQUEST <-- file closed
1318 GLFS WRITE CALL pattern 'e'
1322 GLFS WRITE CALL pattern ''
1324 GLFS WRITE CALL pattern 'e'
1333 GLFS WRITE REPLY(1324)
1334 GLFS WRITE REPLY(1322)
1344 GLFS WRITE REPLY(1318)
This is the problem case.
1) We see an SMB write call which is successful, and then subsequently a read call which is set to STATUS_PENDING.
2) We can see that we make a read call to gluster first. This returns the data from the previous write, i.e. pattern 'd'.
3) This is sent to the SMB client. This fails the test and the file is closed.
4) We then see the pending writes made by the SMB client being flushed to the gluster filesystem.
A quick check with
aio read size = 0
aio write size = 0
did not fix the problem.
I have checked the tcpdump and we see the same pattern as described in c#8, with the READ on the backing device being called before the WRITE.
I suspect we have seen these issues earlier in customer cases, for which unfortunately we never had a good reproducer. I will continue investigating on Monday.
OK, it looks like you're going to need some kind of synchronization object (mutex) inside the vfs_glusterfs.c code on a per-fd basis to ensure that ordering is maintained w.r.t. the SMB file accesses.
Great analysis BTW !
Just FYI. Once the SMB2_WRITE returns success, it means we have gotten a return from the glfs_pwrite() call made in the VFS module. We will only return:
SMB WRITE RESPONSE success
once vfs_gluster_pwrite_done()/vfs_gluster_pwrite_recv() has completed. That means we've given the data to the glfs_XXXX() calls.
So I may be wrong about needing a mutex inside the VFS as in this case we've completed the glfs_XXX() call.
You might want to turn on as much debug as you can get inside the glfs_XXX() code to try and track the path through this case.
In discussion with Sachin, we spotted one potential additional problem: the smbtorture test uses two tree connects, so it is using two gluster clients - i.e. two instances of libgfapi.
We can modify the test program to use two instances of libgfapi.
So the problem might still be in libgfapi or even in the underlying implementation. Can it be reproduced with two FUSE mounts?