Created attachment 18273 [details] Output of |modinfo cifs| I am running Debian GNU/Linux within VirtualBox. I was testing the behavior of thunderbird mail client when I/O error occurs. My simulated I/O errors are introduced by unplugging the network cable to a remote CIFS server where mail messages are stored. I used this setup very successfully several years ago to make the error handling of Thunderbird more robust. But this time, I noticed TB got stuck and could not be even killed by |kill| and am reporting the issue. Version of software involved. Thunderbird is compiled from comm-central tree locally with my own patches. VirtualBox: 7.0.14 r16109 Debian GNU/Linux: uname -a Linux ip030 6.6.15-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.6.15-2 (2024-02-04) x86_64 GNU/Linux CIFS client: 3.4.2 ? Not quite sure how I can check this. I found the following modules: locate smb | grep '\.ko' ... /usr/lib/modules/6.6.15-amd64/kernel/fs/smb/client/cifs.ko.xz /usr/lib/modules/6.6.15-amd64/kernel/fs/smb/common/cifs_arc4.ko.xz /usr/lib/modules/6.6.15-amd64/kernel/fs/smb/common/cifs_md4.ko.xz /usr/lib/modules/6.6.15-amd64/kernel/fs/smb/server/ksmbd.ko.xz Looking inside the kernel module by |strings -| after uncompressing a copy of it, I found the following from fs/smb/server/ksbmbd.ko: description=Linux kernel CIFS/SMB SERVER version=3.4.2 srcversion=1AFC26B1B221D0178BFD3EF vermagic=6.6.15-amd64 SMP preempt mod_unload modversions From fs/smb/client/cifs.ko: ... srcversion=642FAF28CDF9168BBABA0CA Output of |modinfo cifs| is lengthy and is in an attachment, but probably this is the relevant line (?). version: 2.45 The remote CIFS server is inside a TP-LINK router.: Its OS/firmware version is up-to-date. firmware version 1.3.8 Build 20230830 rel.19767(4A50) released in August, 2023. hardware version Archer AX20 v1.0 I am not sure what the version of CIFS in the firmware image is, though. How CIFS was mounted: I used the following command. I shortened the timeout value by setting echo_interval=3 so that I can test the error handling of TB quickly.. mount.cifs "//192.168.0.202/g" /mnt/L-temp -o username=admin,noperm,vers=1.0,sec=ntlmv2,echo_interval=3 How the problem happened: While thunderbird was downloading e-mails from the server and storing messages to the remote CIFS server (TP-LINK router's usb storage), I unplugged the network cable to the CIFS server, and then see Thunderbird displaying an error message (after TCP timeout caused an I/O error) , and then plugged the cable again to continue. This worked in the first test. But during the second test, thunderbird crashed. From the stack trace, I think libc's FILE structure was messed up and glibc complained and crashed. I filed an issue at mozilla bugzilla.: https://bugzilla.mozilla.org/show_bug.cgi?id=1887532 (C-C TB crashes when an I/O error is simulated.) I quit the gdb session which was attached to the crashing thunderbird (thunderbird offered to attach gdb to the running image from its own error handler within 5 minutes). I checked the stacktrace, and checked the trashed(?) FILE structure and quit GDB. (That is in the Mozilla's bugzilla.) I assumed that the thunderbird process also quit then. But maybe I was wrong here. I did not check it. On restarting thunderbird, I noticed that it did not show the starting window. And I noticed there is a dying thunderbird process. This was the reason for non-window. (thunderbird command tries to reuse a running thunderbird process if the user invokes the command twice. It is a wrapper to spawn a few processes if I am not mistaken.) I tried to kill the zombie process with |kill -KILL|, |kill -HUP|, etc. But to no avail. Eventually, I had to reboot the system since even restarting X windows did not eliminate this stuck process. But then I realized the problem may be related CIFS. While shutting down and rebooting, I notice that systemd-shtudown logged the thunderbird process which did not respond and disappear after its own timeout period. The log contains where the process got stuck by showing system stack. Below is the excerpt from the logged message from the journal log using the command after I successfully rebooted linux: journalctl | egrep -i40 "(cifs|syscall|TASK>|<IRQ>)" I don't know if the IRQ-related message which I noticed during booting process has anything to do with CIFS issue. I think it is more of a problem of VirtualBox and watchdog timer, but just in case, I included it here. Debian is VERY CONSERVATIVE and sticks to known stable software modules even though the upstream has come up with newer versions. I am not sure how current Debian's CIFS module(s) is although I am using Debian's testing repository since some updated version of tools that I need to develop patches for thunderbird mail client are only available in the testing repository. If you would like me to try a more recent module that is supposed to fix the problem, I will do that. In the coming days, I will repeat the I/O error simulation using this unplug/plug step. Thus thus not encountering this stuck process is a priority for me. It may take a while for me to get the latest module compiled and rebuild the kernel since I have not built the linux kernel for the last several years. I used to do it to obtain latest features and updated SCSI/SATA drivers many years ago, but I now have been paying more attention to application side issues since SCSI/SATA driver became stable enough for me [and SCSI is gone for home users practically for the last dozen years.]
Created attachment 18274 [details] Output of journalctl | egrep -i40 "(cifs|syscall|TASK>|<IRQ>)" I am afraid I selected a wrong component. Anyway, this is the log during rebooting the kernel to remove the stuck process. It is obtained by journalctl | egrep -i40 "(cifs|syscall|TASK>|<IRQ>)" As I mentioned in my previous comment, "I don't know if the IRQ-related message which I noticed during booting process has anything to do with CIFS issue. I think it is more of a problem of VirtualBox and watchdog timer, but just in case, I included it here."
I changed the component to the other.
(In reply to ishikawa from comment #1) > I am afraid I selected a wrong component. I think actually the wrong product. "Samba 4.1 and newer" is for the userspace Samba software. It looks like you want "CifsVFS".
Created attachment 18279 [details] ps output list that includes two dead processes. The output from |ps axg| command. The hung occurred about 5-6 hours ago. In the list I see. > 70 ? D 5:48 [kcompactd0] <----- This got stuck > 879933 pts/5 D+ 0:32 ./thunderbird-bin -p <--- This got stuck
Created attachment 18280 [details] dmesg output that shows the stacks two non-responding processes. I was testing the behavior of thunderbird-bin after figuring that it had use after free problem and tried to locate the issue using an ASAN build. I unplugged the network cable and caused thunderbird to show an error dialog and plugged the cable again. I repeated it prior to seeing a stuck thunderbird-bin. Then I noticed a kernel process also got stuck.
I uploaded a couple of files. ps output list that includes two dead processes. dmesg output that shows the stacks two non-responding processes. Today, as I repeated the test of I/O error handling by thunderbird mail client, I observed a stuck thunderbird (dead, but could not be removed from the process list) again. But this time, I also noticed by looking into the dmesg output that a kernel process, dcompactd0, got stuck too. It did not respond for a long duration of time. The stack from both processes may reveal something. I notice thunderbird seems to get stuck in an llseek call path. I suspect within this call path, it may have used bogus stdio FILE* structure. I have noticed a use after free problem, and I was trying to see where by an ASAN build (not locally created, but created in mozilla's compilation firm.) The crashes of thunderbird I observed were all related to a call to seek so far. But if there is a memory reference after free, anything goes. And given that fseek only uses an FD and nothing else in the system call interface at the lowest level, if file number descriptor is a valid one, then the kernel should not crash a kernel thread. Something is wrong. Another observation. In the xosview (I monitor the CPU usage all the time during build and testing), I see two CPUs busy 100% with WIO state. The processes are obviously very low-priority because once a build happens, they disappear, but as soon as other higher processes disappear, the two CPU cores get occupied fully and they are in WIO state. Presumably they are related to the two dead processes? I hope this helps to figure out the root cause of the wedged process.
Created attachment 18281 [details] Another stuck thunderbird-bin and kcompactd0 I saw another stuck processed, a user process, thunderbird-bin, and a kernel process, kcompactd0. Once this user process got stuck and cannot be killed and remains in the process list I cannot umount/mount the directory. root@ip030:/NREF-COMM-CENTRAL/work-dir# umount /mnt/L-temp umount: /mnt/L-temp: target is busy. This makes the testing a bit awkward since there was a time when I cannot even list the file entries in the mounted directly. This time around, I could, though.
I have another date point. Thunderbird created from the development tree had an issue in handling an I/O error. I realized TB was calling the I/O stdio code with bogus fileno such as -1 or for a file number which had already been closed previously after the error handling. I have fixed it locally for now. I have no idea if this had anything to do with the CIFS related issue. Right now, I am testing the I/O operation, I mean the I/O error handling using NFS because C-C TB often gets stuck in kernel CIFS code and I could not debug the error handling issue very well. Once I figure that every thing is more or less OK on the TB side, I will retry testing using CIFS server. TIA
(In reply to Douglas Bagnall from comment #3) Thank you for fixing the product category. I thought I had sent a thank you note, but I don't see it here. Thank you again.