xDroid's Blog

A geek pretending to be aloof

A crash caused by MATLAB?!

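Recently I ran into something really strange. I was doing my homework and needed to run some MATLAB code, and after a short while the screen simply went black. At first I assumed I had run out of memory, so I shrank the test matrix a bit, but it still crashed. Whenever it crashed, the mouse and keyboard stopped responding entirely, and the only way out was the power button. After a few failed attempts I decided to dig through the logs for the cause.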

Let's take a look at what's in journalctl --since=today:

Feb 28 18:37:30 xDroid-Arch26X MATLAB[4591]: Failed to load module "canberra-gtk-module"
Feb 28 18:37:50 xDroid-Arch26X kernel: pcieport 0000:03:01.0: can't change power state from D3cold to D0 (config space inacce>
Feb 28 18:37:51 xDroid-Arch26X kernel: xhci_hcd 0000:02:00.0: xHCI host controller not responding, assume dead
Feb 28 18:37:51 xDroid-Arch26X kernel: xhci_hcd 0000:02:00.0: HC died; cleaning up
Feb 28 18:37:51 xDroid-Arch26X kernel: usb 1-5: USB disconnect, device number 2
Feb 28 18:37:52 xDroid-Arch26X /usr/lib/gdm-x-session[1305]: (II) event16 - DELL Alienware 310K: device removed

After that come more disconnects (keyboard and mouse), and then:

Feb 28 18:37:53 xDroid-Arch26X kernel: iwlwifi 0000:06:00.0: Error sending CMD_DTS_MEASUREMENT_TRIGGER_WIDE: time out after 2>
Feb 28 18:37:53 xDroid-Arch26X kernel: iwlwifi 0000:06:00.0: Current CMD queue read_ptr 91 write_ptr 92
Feb 28 18:37:53 xDroid-Arch26X kernel: ------------[ cut here ]------------
Feb 28 18:37:53 xDroid-Arch26X kernel: Timeout waiting for hardware access (CSR_GP_CNTRL 0xffffffff)
Feb 28 18:37:53 xDroid-Arch26X kernel: WARNING: CPU: 5 PID: 5784 at drivers/net/wireless/intel/iwlwifi/pcie/trans.c:2084 __iw>

Then the wireless card seems to have blown up as well.
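Since several devices died within a few seconds of each other, it helps to see which PCI address complained first. Here is a minimal Python sketch that groups the kernel lines by PCI address, assuming the journal text has been pasted into a string (a few lines from above serve as sample data). The regex and the name first_pci_errors are my own invention, not part of journalctl:

```python
import re

# Kernel driver messages look like "kernel: <driver> <domain:bus:dev.fn>: <text>".
PCI_MSG = re.compile(
    r"kernel: (\S+) ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): (.*)"
)

# A few lines copied from the journal above; a trailing ">" marks text the pager cut off.
LOG = """\
Feb 28 18:37:30 xDroid-Arch26X MATLAB[4591]: Failed to load module "canberra-gtk-module"
Feb 28 18:37:50 xDroid-Arch26X kernel: pcieport 0000:03:01.0: can't change power state from D3cold to D0 (config space inacce>
Feb 28 18:37:51 xDroid-Arch26X kernel: xhci_hcd 0000:02:00.0: xHCI host controller not responding, assume dead
Feb 28 18:37:51 xDroid-Arch26X kernel: xhci_hcd 0000:02:00.0: HC died; cleaning up
Feb 28 18:37:51 xDroid-Arch26X kernel: usb 1-5: USB disconnect, device number 2
Feb 28 18:37:53 xDroid-Arch26X kernel: iwlwifi 0000:06:00.0: Error sending CMD_DTS_MEASUREMENT_TRIGGER_WIDE: time out after 2>
"""


def first_pci_errors(log: str) -> dict[str, tuple[str, str, str]]:
    """Map each PCI address to (time, driver, message) of its first kernel line."""
    first: dict[str, tuple[str, str, str]] = {}  # dicts keep insertion (log) order
    for line in log.splitlines():
        m = PCI_MSG.search(line)
        if m and m.group(2) not in first:
            time = line.split()[2]  # "Feb 28 18:37:50 ..." -> "18:37:50"
            first[m.group(2)] = (time, m.group(1), m.group(3))
    return first


if __name__ == "__main__":
    for addr, (time, driver, msg) in first_pci_errors(LOG).items():
        print(f"{time}  {addr}  {driver:9} {msg}")
```

On these lines it reports 03:01.0 (pcieport) at 18:37:50, followed by the USB controller 02:00.0 and the Wi-Fi card 06:00.0 over the next few seconds. Journal order is only a hint about what failed first, not proof of the cause.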

Hmm, let's start with 03:01.0, where the trouble began. lspci -vt shows the following devices:

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
           +-01.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-01.1-[01]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
           +-01.3-[02-06]--+-00.0  Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller
           |               +-00.1  Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller
           |               \-00.2-[03-06]--+-00.0-[04]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
           |                               +-01.0-[05]--
           |                               \-04.0-[06]----00.0  Intel Corporation Wireless 8265 / 8275
           +-03.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
           +-03.1-[07]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
           |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]

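Wow, I couldn't make sense of any of it. After a bit of reading I roughly understood that the numbers in square brackets are the bus numbers sitting behind each bridge, and that an address like 03:01.0 means bus 03, device 01, function 0. So 03:01.0 should be the "+-01.0-[05]--" entry under [03-06]: some kind of bridge... with nothing listed behind it? (facepalm)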
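In lspci -vt each bracket is the range of buses behind a bridge, so working out which bridges sit upstream of a device is mechanical enough to script. A rough Python sketch, with the bridge table typed in by hand from the tree above; BRIDGES, parse_bdf and upstream_bridges are hypothetical helpers of mine, and the table only describes this one machine:

```python
# Bridges from the lspci -vt tree above, hand-copied:
# address -> (secondary bus, subordinate bus), i.e. "[02-06]" -> (0x02, 0x06).
BRIDGES = {
    "00:01.1": (0x01, 0x01),  # NVMe SSD behind it
    "00:01.3": (0x02, 0x06),  # 400 Series chipset behind it
    "00:03.1": (0x07, 0x07),  # Radeon GPU behind it
    "02:00.2": (0x03, 0x06),
    "03:00.0": (0x04, 0x04),  # Realtek Ethernet behind it
    "03:01.0": (0x05, 0x05),  # the port from the log
    "03:04.0": (0x06, 0x06),  # Intel Wireless 8265 / 8275 behind it
}


def parse_bdf(addr: str) -> tuple[int, int, int]:
    """Split "[domain:]bus:device.function" (hex) into (bus, device, function)."""
    bus, devfn = addr.split(":")[-2:]
    dev, fn = devfn.split(".")
    return int(bus, 16), int(dev, 16), int(fn, 16)


def upstream_bridges(addr: str) -> list[str]:
    """Bridges whose bus range contains the device's bus, ordered root first."""
    bus, _, _ = parse_bdf(addr)
    hits = [b for b, (sec, sub) in BRIDGES.items() if sec <= bus <= sub]
    # Nested ranges shrink going down the tree, so the widest range is closest to the root.
    return sorted(hits, key=lambda b: BRIDGES[b][1] - BRIDGES[b][0], reverse=True)


if __name__ == "__main__":
    for dev in ("03:01.0", "02:00.0", "06:00.0"):
        print(dev, "is behind", " -> ".join(upstream_bridges(dev)))
```

For the three addresses in the log, every path starts at 00:01.3, the bridge leading to the 400 Series chipset. So the USB controller, the port 03:01.0 and the Wi-Fi card all share one upstream link, which would fit them dying together, though it says nothing definite about the cause.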

Anyway, let's look at it with lspci -v -s 03:01.0:

03:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset PCIe Port (rev 01) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 36, IOMMU group 0
Bus: primary=03, secondary=05, subordinate=05, sec-latency=0

So it really is just a bridge (one of the chipset's PCIe ports)... which doesn't tell us much either.
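The one concrete thing in that output is the Bus: line: the bridge sits on bus 03 (primary) and forwards buses 05 through 05 (secondary to subordinate). A small Python sketch that decodes the line and checks the forwarded range against the buses that actually have devices in the lspci -vt tree; POPULATED is hand-copied from that tree and bridge_buses is just an illustrative name:

```python
import re

# "Bus: primary=03, secondary=05, subordinate=05, ..." as printed by lspci -v.
BUS_LINE = re.compile(
    r"primary=([0-9a-f]+), secondary=([0-9a-f]+), subordinate=([0-9a-f]+)"
)

# Buses that have at least one device in the lspci -vt tree (hand-copied).
POPULATED = {0x00, 0x01, 0x02, 0x03, 0x04, 0x06, 0x07}


def bridge_buses(line: str) -> tuple[int, range]:
    """Return (bus the bridge sits on, range of buses forwarded below it)."""
    m = BUS_LINE.search(line)
    if m is None:
        raise ValueError(f"not an lspci 'Bus:' line: {line!r}")
    primary, secondary, subordinate = (int(g, 16) for g in m.groups())
    return primary, range(secondary, subordinate + 1)


if __name__ == "__main__":
    primary, below = bridge_buses(
        "Bus: primary=03, secondary=05, subordinate=05, sec-latency=0"
    )
    print(f"on bus {primary:02x}, forwards {below.start:02x}-{below.stop - 1:02x}")
    print("populated buses below:", [b for b in below if b in POPULATED])
```

Here the forwarded range is only bus 05, and nothing in the tree lives there, so 03:01.0 looks like an empty chipset port rather than something a peripheral is plugged into.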

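I searched around online and found some pretty outlandish workarounds (something like periodically resetting the USB controller)... which didn't feel very trustworthy.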

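Then I wondered whether MATLAB itself was to blame, so I uninstalled and reinstalled it (upgrading to R2021b along the way), but it made no difference at all. In the end I stumbled onto the real culprit: Dropbox. MATLAB writing figures out as color EPS files (the epsc format) was colliding with Dropbox's syncing (helpless laugh). Quite an eye-opener.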