There are two different scenarios where we encountered the EBADF/EIO errors:

- Write test, where multiple threads write to files with a "_temp" suffix, and after each write completes, a rename removes the "_temp" suffix. Each process/thread writes to a unique file; there are no overlaps.

- Read test, where multiple threads repeatedly choose a random file out of a certain set and read it – so here multiple threads can access the same file at the same time. The error happens most often when the set of files is small; for example, repeated reads of 5 different 20M files by 128 threads triggered this for us.

We believe the root cause is the same for both: there seems to be a race condition in smb2_compound_op:

---
 after_close:
	num_rqst++;

	if (cfile) {
		cifsFileInfo_put(cfile); // sends SMB2_CLOSE to the server
		cfile = NULL;
---

This code is triggered by the smb2_query_path_info operation that happens during revalidate_dentry. In smb2_query_path_info, get_readable_path is called to load the cfile, increasing the reference counter. If in the meantime that reference becomes the very last one, this call to cifsFileInfo_put(cfile) triggers an SMB2_CLOSE request sent to the server just before the compound request is sent – and the compound request then fails with EBADF or EIO, depending on the timing at the server, because the handle is already closed.

In the first scenario, the race seems to happen between smb2_query_path_info, triggered by the rename operation, and the "cleanup" of asynchronous writes – while fsync(fd) likely waits for the asynchronous writes to complete, releasing the writeback structures can happen after the close(fd) call. So the EBADF/EIO errors pop up if the timing is such that:

1) There are still outstanding references in the writeback structures after close(fd)
2) smb2_query_path_info successfully fetches the cfile, increasing the refcounter by 1
3) All writeback structures release the same cfile, reducing the refcounter to 1
4) smb2_compound_op is called with that cfile

In the second scenario, the race is similar – here open triggers the smb2_query_path_info operation, and if all other threads decrease the refcounter to 1 in the meantime, similarly to the first scenario, SMB2_CLOSE is again sent to the server just before the compound request is issued. This case is harder to reproduce across different machines and the timing is trickier, so we attach pseudocode for it below.

Proposed patch: it seems to us that it should be sufficient, and harmless, to just remove the two lines:

	cifsFileInfo_put(cfile);
	cfile = NULL;

The cleanup happens anyway just below the call to compound_send_recv, under the finished: label. Unless something in compound_send_recv can interfere with that cleanup, it should be safe to remove them. In our testing we didn't encounter any issues after removing them.
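To make the proposal concrete, here is the suggested removal written out as a diff. This is only a sketch of the hunk from smb2_compound_op in fs/cifs/smb2inode.c; the surrounding context is abbreviated and may differ slightly between kernel versions:

 after_close:
 	num_rqst++;
 
 	if (cfile) {
-		cifsFileInfo_put(cfile);	/* may send SMB2_CLOSE before the compound */
-		cfile = NULL;
 	}

With the early put removed, the reference obtained via get_readable_path is released only in the common cleanup after compound_send_recv, under the finished: label, so the last reference – and with it the SMB2_CLOSE – can no longer race ahead of the compound request that still uses the handle.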
Observed stack traces by manual insertion of dump_stack into _cifsFileInfo_put/smb2_compound_op:

Write test, from _cifsFileInfo_put after hitting the last reference:

[Tue Apr 19 05:38:30 2022] Call Trace:
[Tue Apr 19 05:38:30 2022]  dump_stack_lvl+0x33/0x42
[Tue Apr 19 05:38:30 2022]  _cifsFileInfo_put+0x124/0x3b1 [cifs]
[Tue Apr 19 05:38:30 2022]  ? smb2_plain_req_init+0x43/0x45 [cifs]
[Tue Apr 19 05:38:30 2022]  ? SMB2_query_info_init+0x5c/0xf5 [cifs]
[Tue Apr 19 05:38:30 2022]  cifsFileInfo_put+0x14/0x15 [cifs]
[Tue Apr 19 05:38:30 2022]  smb2_compound_op+0xd78/0xfc0 [cifs]
[Tue Apr 19 05:38:30 2022]  smb2_query_path_info+0x14e/0x238 [cifs]
[Tue Apr 19 05:38:30 2022]  cifs_get_inode_info+0x1f5/0x6fb [cifs]
[Tue Apr 19 05:38:30 2022]  cifs_revalidate_dentry_attr+0x249/0x2e7 [cifs]
[Tue Apr 19 05:38:30 2022]  cifs_revalidate_dentry+0x1a/0x2b [cifs]
[Tue Apr 19 05:38:30 2022]  cifs_d_revalidate+0x6b/0x15c [cifs]
[Tue Apr 19 05:38:30 2022]  lookup_dcache+0x3b/0x60
[Tue Apr 19 05:38:30 2022]  __lookup_hash+0x1f/0xa0
[Tue Apr 19 05:38:30 2022]  ? down_write+0xe/0x40
[Tue Apr 19 05:38:30 2022]  do_renameat2+0x279/0x510
[Tue Apr 19 05:38:30 2022]  ? strncpy_from_user+0x41/0x1a0
[Tue Apr 19 05:38:30 2022]  __x64_sys_rename+0x3c/0x50
[Tue Apr 19 05:38:30 2022]  do_syscall_64+0x3a/0x80
[Tue Apr 19 05:38:30 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Read test, from smb2_compound_op after hitting rc = -9 (EBADF):

[Wed Apr 20 02:31:10 2022] Call Trace:
[Wed Apr 20 02:31:10 2022]  dump_stack_lvl+0x33/0x42
[Wed Apr 20 02:31:10 2022]  smb2_compound_op+0x902/0x103b [cifs]
[Wed Apr 20 02:31:10 2022]  ? build_path_from_dentry_optional_prefix+0xa0/0x298 [cifs]
[Wed Apr 20 02:31:10 2022]  ? cifsFileInfo_get+0x29/0x2f [cifs]
[Wed Apr 20 02:31:10 2022]  ? ___ratelimit+0x6f/0xd0
[Wed Apr 20 02:31:10 2022]  smb2_query_path_info+0x14c/0x248 [cifs]
[Wed Apr 20 02:31:10 2022]  cifs_get_inode_info+0x1f5/0x6fb [cifs]
[Wed Apr 20 02:31:10 2022]  cifs_revalidate_dentry_attr+0x249/0x2e7 [cifs]
[Wed Apr 20 02:31:10 2022]  cifs_revalidate_dentry+0x1a/0x2b [cifs]
[Wed Apr 20 02:31:10 2022]  cifs_d_revalidate+0x6b/0x15c [cifs]
[Wed Apr 20 02:31:10 2022]  lookup_fast+0xcd/0x150
[Wed Apr 20 02:31:10 2022]  path_openat+0x114/0x1050
[Wed Apr 20 02:31:10 2022]  ? cifsFileInfo_put_final+0xc0/0xc9 [cifs]
[Wed Apr 20 02:31:10 2022]  ? _cifsFileInfo_put+0x391/0x39b [cifs]
[Wed Apr 20 02:31:10 2022]  do_filp_open+0xb4/0x120
[Wed Apr 20 02:31:10 2022]  ? __check_object_size+0x15f/0x170
[Wed Apr 20 02:31:10 2022]  do_sys_openat2+0x242/0x300
[Wed Apr 20 02:31:10 2022]  do_sys_open+0x4b/0x80
[Wed Apr 20 02:31:10 2022]  do_syscall_64+0x3a/0x80
[Wed Apr 20 02:31:10 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Write-case small repro:

#include <iostream>
#include <memory>   // added: std::unique_ptr
#include <string>
#include <thread>
#include <vector>
#include <cerrno>   // added: errno
#include <cstdio>   // added: rename()
#include <fcntl.h>
#include <unistd.h>

std::string path1 = "/mnt/mountpoint/dir";

const int FILES_PER_WORKER = 1000;
const int BUF_SIZE = 1048576 * 20;
const int BUF_NUMBER_OF_WRITES = 1;
const int WORKERS = 4;

int main() {
	std::vector<std::thread> ts;
	for (int workerNum = 0; workerNum < WORKERS; workerNum++) {
		ts.emplace_back([workerNum]() {
			std::unique_ptr<char[]> buf(new char[BUF_SIZE]);
			for (int i = 0; i < BUF_SIZE; i++)
				buf[i] = 'A';
			for (int i = 0; i < FILES_PER_WORKER; i++) {
				std::string finalPath = path1 + std::to_string(FILES_PER_WORKER * workerNum + i);
				std::string tempPath = finalPath + "_temp";
				// O_CREAT requires a mode argument (0644 added)
				int fd = open(tempPath.c_str(), O_RDWR | O_CREAT | O_TRUNC, 0644);
				if (fd < 0) {
					std::cerr << "Open failed!" << std::endl;
					return;
				}
				for (int j = 0; j < BUF_NUMBER_OF_WRITES; j++) {
					int res = write(fd, buf.get(), BUF_SIZE);
					if (res != BUF_SIZE) {
						std::cerr << "Write error!" << std::endl;
						return;
					}
				}
				int res = fsync(fd);
				if (res != 0) {
					std::cerr << "Fsync error!" << std::endl;
					return;
				}
				res = close(fd);
				if (res != 0) {
					std::cerr << "Close error!" << std::endl;
					return;
				}
				res = rename(tempPath.c_str(), finalPath.c_str());
				if (res != 0) {
					std::cerr << "Rename error! (" << errno << ")" << std::endl;
					return;
				}
			}
		});
	}
	for (auto& t : ts)
		t.join();
	return 0;
}

For the read test, the following happens on each thread instead (on the files created by the write test); a compilable version is sketched below:

for (int j = 0; j < TOTAL_READS; j++) {
	usleep(/* random time */);
	std::string path = basePath + std::to_string(/* random number from 0 to 4 inclusive */);
	int fd = open(path.c_str(), O_RDONLY);
	read(fd, buf, /* 20M */);
	close(fd);
}
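Since the read case is only described as pseudocode, here is a compilable version of that loop for reference. The 128 threads and the set of 5 files come from the description above; TOTAL_READS, the sleep range, and the use of mt19937 are illustrative choices, not values from the original report:

#include <iostream>
#include <memory>
#include <random>
#include <string>
#include <thread>
#include <vector>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

const std::string basePath = "/mnt/mountpoint/dir"; // same prefix as the write repro
const int BUF_SIZE = 1048576 * 20;                  // 20M, matching the file size
const int READERS = 128;                            // thread count from the report
const int TOTAL_READS = 10000;                      // illustrative

int main() {
	std::vector<std::thread> ts;
	for (int r = 0; r < READERS; r++) {
		ts.emplace_back([]() {
			// note: each of the 128 threads allocates a 20M buffer
			std::unique_ptr<char[]> buf(new char[BUF_SIZE]);
			std::mt19937 rng(std::random_device{}());
			std::uniform_int_distribution<int> file(0, 4);       // 0 to 4 inclusive
			std::uniform_int_distribution<int> delayUs(0, 1000); // illustrative range
			for (int j = 0; j < TOTAL_READS; j++) {
				usleep(delayUs(rng));
				std::string path = basePath + std::to_string(file(rng));
				int fd = open(path.c_str(), O_RDONLY);
				if (fd < 0) {
					// the EBADF/EIO under discussion typically surfaces here
					std::cerr << "Open error! (" << errno << ")" << std::endl;
					return;
				}
				read(fd, buf.get(), BUF_SIZE);
				close(fd);
			}
		});
	}
	for (auto& t : ts)
		t.join();
	return 0;
}

Both this and the write repro build as ordinary user-space programs (e.g. g++ -pthread) and run against a cifs mount.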
Good analysis; this looks like a genuine race condition. You can add my Reviewed-by to the suggested patch.
Created attachment 17284 [details]
Created patch based on Ondrej's suggestion
Running regression tests on the patch now
The patch is in 5.19-rc1 and later, and is marked for stable, so it will be backported.