Bug 15614 - An application (thunderbird) was wedged in CIFS-related calling path in linux kernel.
Summary: An application (thunderbird) was wedged in CIFS-related calling path in linux...
Status: NEW
Alias: None
Product: CifsVFS
Classification: Unclassified
Component: kernel fs (show other bugs)
Version: 3.x
Hardware: x64 Linux
: P5 critical
Target Milestone: ---
Assignee: Steve French
QA Contact: cifs QA contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-03-24 23:51 UTC by ishikawa
Modified: 2024-04-07 20:36 UTC (History)
2 users (show)

See Also:


Attachments
Output of |modinfo cifs| (3.01 KB, text/plain)
2024-03-24 23:51 UTC, ishikawa
no flags Details
Output of journalctl | egrep -i40 "(cifs|syscall|TASK>|<IRQ>)" (23.23 KB, text/plain)
2024-03-24 23:54 UTC, ishikawa
no flags Details
ps output list that includes two dead processes. (19.78 KB, text/plain)
2024-04-02 07:48 UTC, ishikawa
no flags Details
dmesg output that shows the stacks two non-responding processes. (74.72 KB, text/plain)
2024-04-02 07:52 UTC, ishikawa
no flags Details
Another stuck thunderbird-bin and kcompactd0 (23.41 KB, text/plain)
2024-04-03 11:44 UTC, ishikawa
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description ishikawa 2024-03-24 23:51:16 UTC
Created attachment 18273 [details]
Output of |modinfo cifs|

I am running Debian GNU/Linux within VirtualBox.

I was testing the behavior of thunderbird mail client when I/O error
occurs.
My simulated I/O errors are introduced by unplugging the network cable to
a remote CIFS server where mail messages are stored.
I used this setup very successfully several years ago to make the
error handling of Thunderbird more robust.


But this time, I noticed TB got stuck and could not be even killed by
|kill| and am reporting the issue.

Version of software involved.
Thunderbird is compiled from comm-central tree locally with my own patches.

VirtualBox: 
       7.0.14 r16109

Debian GNU/Linux:
       uname -a
       Linux ip030 6.6.15-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.6.15-2 (2024-02-04) x86_64 GNU/Linux

CIFS client: 3.4.2 ?
     Not quite sure how I can check this.
     I found the following modules:
     locate smb | grep '\.ko'
     ...
     /usr/lib/modules/6.6.15-amd64/kernel/fs/smb/client/cifs.ko.xz
     /usr/lib/modules/6.6.15-amd64/kernel/fs/smb/common/cifs_arc4.ko.xz
     /usr/lib/modules/6.6.15-amd64/kernel/fs/smb/common/cifs_md4.ko.xz
     /usr/lib/modules/6.6.15-amd64/kernel/fs/smb/server/ksmbd.ko.xz

     Looking inside the kernel module by |strings -| after
     uncompressing a copy of it, I found the following from
     fs/smb/server/ksbmbd.ko:

     description=Linux kernel CIFS/SMB SERVER
     version=3.4.2
     srcversion=1AFC26B1B221D0178BFD3EF
     vermagic=6.6.15-amd64 SMP preempt mod_unload modversions 

     From fs/smb/client/cifs.ko:
     ...
     srcversion=642FAF28CDF9168BBABA0CA

     Output of |modinfo cifs| is lengthy and is in 
     an attachment, but probably this is the relevant line (?).
     version:        2.45

The remote CIFS server is inside a TP-LINK router.:
       Its OS/firmware version is up-to-date.
       firmware version 1.3.8 Build 20230830 rel.19767(4A50)
           released in August, 2023.
       hardware version Archer AX20 v1.0

       I am not sure what the version of CIFS in the firmware image
       is, though.

How CIFS was mounted: 
I used the following command. I shortened the timeout value by setting
echo_interval=3 so that I can test the error handling of TB quickly..

mount.cifs "//192.168.0.202/g" /mnt/L-temp -o  username=admin,noperm,vers=1.0,sec=ntlmv2,echo_interval=3


How the problem happened:

While thunderbird was downloading e-mails from the server and storing
messages to the remote CIFS server (TP-LINK router's usb storage),
I unplugged the network cable to the CIFS server, and
then see Thunderbird displaying an error message (after TCP timeout
caused an I/O error) , and then plugged the cable again to continue.
This worked in the first test.
But during the second test, thunderbird crashed.
From the stack trace, I think libc's FILE structure was messed up and
glibc complained and crashed.
I filed an issue at mozilla bugzilla.: 
https://bugzilla.mozilla.org/show_bug.cgi?id=1887532
(C-C TB crashes when an I/O error is simulated.)

I quit the gdb session which was attached to the crashing thunderbird
(thunderbird offered to attach gdb to the running image from its own error handler
within 5 minutes).

I checked the stacktrace, and checked the trashed(?) FILE structure and
quit GDB. (That is in the Mozilla's bugzilla.)

I assumed that the thunderbird process also quit then.
But maybe I was wrong here. I did not check it.

On restarting thunderbird, I noticed that it did not show the starting
window.
And I noticed there is a dying thunderbird process.
This was the reason for non-window. (thunderbird command tries to
reuse a running thunderbird process if the user invokes the command
twice. It is a wrapper to spawn a few processes if I am not mistaken.)

I tried to kill the zombie process with |kill -KILL|, |kill -HUP|,
etc. But to no avail.
Eventually, I had to reboot the system since even restarting X windows
did not eliminate this stuck process.

But then I realized the problem may be related CIFS.
While shutting down and rebooting, I notice that systemd-shtudown
logged the thunderbird process which did not respond and disappear
after its own timeout period.  The log contains where the process got
stuck by showing system stack.

Below is the excerpt from the  logged message from the journal log using the  command after I
successfully rebooted linux:

 journalctl | egrep -i40 "(cifs|syscall|TASK>|<IRQ>)"

I don't know if the IRQ-related message which I noticed during booting
process has anything to do with CIFS issue.  I think it is more of a
problem of VirtualBox and watchdog timer, but just in case, I included
it here.

Debian is VERY CONSERVATIVE and sticks to known stable
software modules even though the upstream has come up with newer versions.
I am not sure how current Debian's CIFS module(s) is although
I am using Debian's testing repository since some updated
version of tools that I need to develop patches for thunderbird mail
client are only available in the testing repository.

If you would like me to try a more recent module that is supposed to
fix the problem, I will do that.
In the coming days, I will repeat the I/O error simulation using this
unplug/plug step. Thus thus not encountering this stuck process is a
priority for me.
It may take a while for me to get the latest module compiled and
rebuild the kernel since I have not built the linux kernel for the
last several years. I used to do it to obtain latest features and
updated SCSI/SATA drivers many years ago, but I now have been 
paying more attention to application side issues since SCSI/SATA driver
became stable enough for me [and SCSI is gone for home users
practically for the last dozen years.]
Comment 1 ishikawa 2024-03-24 23:54:50 UTC
Created attachment 18274 [details]
Output of  journalctl | egrep -i40 "(cifs|syscall|TASK>|<IRQ>)"

I am afraid I selected a wrong component.

Anyway, this is the log during rebooting the kernel to remove the stuck process.
It is obtained by

 journalctl | egrep -i40 "(cifs|syscall|TASK>|<IRQ>)"

As I mentioned in my previous comment, "I don't know if the IRQ-related message which I noticed during booting
process has anything to do with CIFS issue.  I think it is more of a
problem of VirtualBox and watchdog timer, but just in case, I included
it here."
Comment 2 ishikawa 2024-03-24 23:55:32 UTC
I changed the component to the other.
Comment 3 Douglas Bagnall 2024-03-27 22:41:26 UTC
(In reply to ishikawa from comment #1)

> I am afraid I selected a wrong component.

I think actually the wrong product. "Samba 4.1 and newer" is for the userspace Samba software. It looks like you want "CifsVFS".
Comment 4 ishikawa 2024-04-02 07:48:23 UTC
Created attachment 18279 [details]
ps output list that includes two dead processes.

The output from |ps axg| command.
The hung occurred about 5-6 hours ago.
In the list I see.

>     70 ?        D      5:48 [kcompactd0]       <----- This got stuck

> 879933 pts/5    D+     0:32 ./thunderbird-bin -p  <--- This got stuck
Comment 5 ishikawa 2024-04-02 07:52:37 UTC
Created attachment 18280 [details]
dmesg output that shows the stacks two non-responding processes.

I was testing the behavior of thunderbird-bin after figuring that
it had use after free problem and tried to locate the issue using an ASAN build.
I unplugged the network cable and caused thunderbird to show an error dialog and
plugged the cable again.
I repeated it prior to seeing a stuck thunderbird-bin.
Then I noticed a kernel process also got stuck.
Comment 6 ishikawa 2024-04-02 08:05:26 UTC
I uploaded a couple of files.
ps output list that includes two dead processes. 
dmesg output that shows the stacks two non-responding processes. 

Today, as I repeated the test of I/O error handling by thunderbird mail client,
I observed a stuck thunderbird (dead, but could not be removed from
the process list) again.
But this time, I also noticed by looking into the dmesg output that a kernel process, dcompactd0, got stuck too. It did not respond for a long duration of time.
The stack from both processes may reveal something.

I notice thunderbird seems to get stuck in an llseek call path.
I suspect within this call path, it may have used bogus stdio FILE* structure.
I have noticed a use after free problem, and I was trying to see where by an ASAN build (not locally created, but created in mozilla's compilation firm.)
The crashes of thunderbird I observed were all related to a call to seek so far.
But if there is a memory reference after free, anything goes.
And given that fseek only uses an FD and nothing else in the system call interface at the lowest level, if file number descriptor is a valid one, then the kernel should not crash a kernel thread. Something is wrong.

Another observation.
In the xosview (I monitor the CPU usage all the time during build and testing), I see two CPUs busy 100% with WIO state.
The processes are obviously very low-priority because once a build happens, they disappear, but as soon as other higher processes disappear, the two CPU cores 
get occupied fully and they are in WIO state.
Presumably they are related to the two dead processes?

I hope this helps to figure out the root cause of the wedged process.
Comment 7 ishikawa 2024-04-03 11:44:14 UTC
Created attachment 18281 [details]
Another stuck thunderbird-bin and kcompactd0

I saw another stuck processed, a user process, thunderbird-bin, and
a kernel process, kcompactd0.

Once this user process got stuck and cannot be killed and remains in the process list
I cannot umount/mount the directory.

root@ip030:/NREF-COMM-CENTRAL/work-dir# umount /mnt/L-temp
umount: /mnt/L-temp: target is busy.

This makes the testing a bit awkward since there was a time when I cannot even list the file entries in the mounted directly. 
This time around, I could, though.
Comment 8 ishikawa 2024-04-07 20:34:23 UTC
I have another date point.

Thunderbird created from the development tree
had an issue in handling an I/O error.
I realized TB was calling the  I/O stdio code with bogus fileno such as -1 or for a file number which had already been closed previously after the error handling.

I have fixed it locally for now.

I have no idea if this had anything to do with the CIFS related issue.
Right now, I am testing the I/O operation, I mean the I/O error handling using NFS because C-C TB often gets stuck in kernel CIFS code and I could not debug the error handling issue very well.

Once I figure that every thing is more or less OK on the TB side, 
I will retry testing using CIFS server.

TIA
Comment 9 ishikawa 2024-04-07 20:36:09 UTC
(In reply to Douglas Bagnall from comment #3)

Thank you for fixing the product category.

I thought I had sent a thank you note, but I don't see it here.

Thank you again.