15051 – EBADF/EIO errors in rename/open caused by race condition in smb2_compound_op

Bug 15051 - EBADF/EIO errors in rename/open caused by race condition in smb2_compound_op

Summary: EBADF/EIO errors in rename/open caused by race condition in smb2_compound_op

Status:	RESOLVED FIXED

Alias:	None

Product:	CifsVFS
Classification:	Unclassified
Component:	kernel fs (show other bugs)
Version:	5.x
Hardware:	x64 Linux

Importance:	P5 normal
Target Milestone:	---
Assignee:	Steve French
QA Contact:	cifs QA contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2022-04-22 15:05 UTC by Ondrej Hubsch
Modified:	2022-07-12 19:41 UTC (History)
CC List:	0 users

See Also:

Attachments
created patch based on Ondrej's suggestion (2.86 KB, patch) 2022-05-12 15:50 UTC, Steve French	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Ondrej Hubsch 2022-04-22 15:05:29 UTC

There are two different scenarios where we encountered the EBADF/EIO errors: 

- Write test, where there are multiple threads writing to a file with "_temp" suffix and after the write completes a rename happens that removes the "_temp" suffix. Each process/thread writes to a unique file, there are no overlaps.
Read test, where multiple threads attempt to choose a random file out of a certain set and read it – so here it can happen that multiple threads access the same file at the same time. The error happens most often when the set of files is small; for example, repeated reads of 5 different 20M files by 128 threads triggered this for us.
- Read test, where multiple threads attempt to choose a random file out of a certain set and read it – so here it can happen that multiple threads access the same file at the same time. The error happens most often when the set of files is small; for example, repeated reads of 5 different 20M files by 128 threads triggered this for us.

We believe the root cause for both of them is the same: there seems to be a race condition in smb2_compound_op:

---
after_close:
	num_rqst++;

	if (cfile) {
		cifsFileInfo_put(cfile); // sends SMB2_CLOSE to the server
		cfile = NULL;
---

This code is triggered by smb2_query_path_info operation that happens during revalidate_dentry. In smb2_query_path_info, get_readable_path is called to load the cfile, increasing the reference counter. If in the meantime, this reference becomes the very last, this call to cifsFileInfo_put(cfile) will trigger a SMB2_CLOSE request sent to the server just before sending this compound request – and so then the compound request fails either with EBADF/EIO depending on the timing at the server, because the handle is already closed.


In the first scenario, the race seems to be happening between smb2_query_path_info triggered by the rename operation, and between “cleanup” of asynchronous writes – while fsync(fd) likely waits for the asynchronous writes to complete, releasing the writeback structures can happen after the close(fd) call. So the EBADF/EIO errors will pop up if the timing is such that:
1) There are still outstanding references after close(fd) in the writeback structures
2) smb2_query_path_info successfully fetches the cfile, increasing the refcounter by 1
3) All writeback structures release the same cfile, reducing refcounter to 1
4) smb2_compound_op is called with that cfile

In the second scenario, the race seems to be similar – here open triggers the smb2_query_path_info operation, and if all other threads in the meantime decrease the refcounter to 1 similarly to the first scenario, again SMB2_CLOSE will be sent to the server just before issuing the compound request. This case is harder to reproduce across different machines and the timing is more tricky, so we attach a pseudocode for that case.

Proposed patch:

It seems to us that it should be sufficient and harmless to just remove the two lines:

cifsFileInfo_put(cfile);
cfile = NULL;

The cleanup happens anyways just below the call to compound_send_recv under the finished: label. Unless something in compound_send_recv can interfere with this cleanup, it should be safe to remove them. In our testing we didn't encounter any issues after removing them.

Observed stack traces by manual insertion of dump_stack to _cifsFileInfo_put/smb2_compound_op:
rite test, from _cifsFileInfo_put after hitting last reference:

[Tue Apr 19 05:38:30 2022] Call Trace:
[Tue Apr 19 05:38:30 2022]  dump_stack_lvl+0x33/0x42
[Tue Apr 19 05:38:30 2022]  _cifsFileInfo_put+0x124/0x3b1 [cifs]
[Tue Apr 19 05:38:30 2022]  ? smb2_plain_req_init+0x43/0x45 [cifs]
[Tue Apr 19 05:38:30 2022]  ? SMB2_query_info_init+0x5c/0xf5 [cifs]
[Tue Apr 19 05:38:30 2022]  cifsFileInfo_put+0x14/0x15 [cifs]
[Tue Apr 19 05:38:30 2022]  smb2_compound_op+0xd78/0xfc0 [cifs]
[Tue Apr 19 05:38:30 2022]  smb2_query_path_info+0x14e/0x238 [cifs]
[Tue Apr 19 05:38:30 2022]  cifs_get_inode_info+0x1f5/0x6fb [cifs]
[Tue Apr 19 05:38:30 2022]  cifs_revalidate_dentry_attr+0x249/0x2e7 [cifs]
[Tue Apr 19 05:38:30 2022]  cifs_revalidate_dentry+0x1a/0x2b [cifs]
[Tue Apr 19 05:38:30 2022]  cifs_d_revalidate+0x6b/0x15c [cifs]
[Tue Apr 19 05:38:30 2022]  lookup_dcache+0x3b/0x60
[Tue Apr 19 05:38:30 2022]  __lookup_hash+0x1f/0xa0
[Tue Apr 19 05:38:30 2022]  ? down_write+0xe/0x40
[Tue Apr 19 05:38:30 2022]  do_renameat2+0x279/0x510
[Tue Apr 19 05:38:30 2022]  ? strncpy_from_user+0x41/0x1a0
[Tue Apr 19 05:38:30 2022]  __x64_sys_rename+0x3c/0x50
[Tue Apr 19 05:38:30 2022]  do_syscall_64+0x3a/0x80
[Tue Apr 19 05:38:30 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae


Read test, from smb2_compound_op after hitting rc = -9:

Call Trace:
[Wed Apr 20 02:31:10 2022]  dump_stack_lvl+0x33/0x42
[Wed Apr 20 02:31:10 2022]  smb2_compound_op+0x902/0x103b [cifs]
[Wed Apr 20 02:31:10 2022]  ? build_path_from_dentry_optional_prefix+0xa0/0x298 [cifs]
[Wed Apr 20 02:31:10 2022]  ? cifsFileInfo_get+0x29/0x2f [cifs]
[Wed Apr 20 02:31:10 2022]  ? ___ratelimit+0x6f/0xd0
[Wed Apr 20 02:31:10 2022]  smb2_query_path_info+0x14c/0x248 [cifs]
[Wed Apr 20 02:31:10 2022]  cifs_get_inode_info+0x1f5/0x6fb [cifs]
[Wed Apr 20 02:31:10 2022]  cifs_revalidate_dentry_attr+0x249/0x2e7 [cifs]
[Wed Apr 20 02:31:10 2022]  cifs_revalidate_dentry+0x1a/0x2b [cifs]
[Wed Apr 20 02:31:10 2022]  cifs_d_revalidate+0x6b/0x15c [cifs]
[Wed Apr 20 02:31:10 2022]  lookup_fast+0xcd/0x150
[Wed Apr 20 02:31:10 2022]  path_openat+0x114/0x1050
[Wed Apr 20 02:31:10 2022]  ? cifsFileInfo_put_final+0xc0/0xc9 [cifs]
[Wed Apr 20 02:31:10 2022]  ? _cifsFileInfo_put+0x391/0x39b [cifs]
[Wed Apr 20 02:31:10 2022]  do_filp_open+0xb4/0x120
[Wed Apr 20 02:31:10 2022]  ? __check_object_size+0x15f/0x170
[Wed Apr 20 02:31:10 2022]  do_sys_openat2+0x242/0x300
[Wed Apr 20 02:31:10 2022]  do_sys_open+0x4b/0x80
[Wed Apr 20 02:31:10 2022]  do_syscall_64+0x3a/0x80
[Wed Apr 20 02:31:10 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae


Write-case small repro:
#include <iostream>
#include <thread>
#include <vector>
#include <string>
#include <unistd.h>
#include <fcntl.h>

std::string path1 = "/mnt/mountpoint/dir";
const int FILES_PER_WORKER = 1000;
const int BUF_SIZE = 1048576 * 20;
const int BUF_NUMBER_OF_WRITES = 1;
const int WORKERS = 4;

int main() {
	std::vector<std::thread> ts;
	for (int workerNum = 0; workerNum < WORKERS; workerNum++) {
		ts.emplace_back([workerNum](){
			std::unique_ptr<char[]> buf(new char[BUF_SIZE]);
			for (int i = 0; i < BUF_SIZE; i++)
				buf[i] = 'A';
			for (int i = 0; i < FILES_PER_WORKER; i++) {
				std::string finalPath = path1 + std::to_string(FILES_PER_WORKER * workerNum + i);
				std::string tempPath = finalPath + "_temp";
				int fd = open(tempPath.c_str(), O_RDWR | O_CREAT | O_TRUNC);
				if (fd < 0) {
					std::cerr << "Open failed!" << std::endl;
					return;
				}

				for (int j = 0; j < BUF_NUMBER_OF_WRITES; j++) {
					int res = write(fd, buf.get(), BUF_SIZE);
					if (res != BUF_SIZE) {
						std::cerr << "Write error!" << std::endl;
						return;
					}
				}
				int res = fsync(fd);
				if (res != 0) {
					std::cerr << "Fsync error!" << std::endl;
					return;
				}
				res = close(fd);
				if (res != 0) {
					std::cerr << "Close error!" << std::endl;
					return;
				}
				res = rename(tempPath.c_str(), finalPath.c_str());
				if (res != 0) {
					std::cerr << "Rename error! (" << errno << ")" << std::endl;
					return;
				}
			}

		});
	}
	for(auto& t : ts)
		t.join();

	return 0;
}



For the read test, the following happens on each thread instead (on the files created by the write test).

for (int j = 0; j < TOTAL_READS; j++)
{
    usleep(random time);
    string path = basePath + to_string(random number from 0 to 4 inclusive);
    int fd = open(path, O_RDONLY);
    read(fd, buf, 20M);
    close(fd);
}

Comment 1 Ronnie Sahlberg 2022-05-12 05:10:57 UTC

Good analysis, this looks like a genuine race condition.
Reviewed-by me for the suggested patch.

Comment 2 Steve French 2022-05-12 15:50:35 UTC

Created attachment 17284 [details]
created patch based on Ondrej's suggestion

Comment 3 Steve French 2022-05-12 15:50:49 UTC

Running regression tests on the patch now

Comment 4 Steve French 2022-07-12 19:41:16 UTC

Patch is in 5.19-rc1 and later, and marked for stable so will be backported