Fix for Broken Boot After Failed amdgpu-dkms Install
I recently upgraded to a 7900 XTX GPU, which was a totally issue-free experience. Then today, I tried to install AMD ROCm so I could try out AMD’s TensorFlow fork that works with AMD GPUs.
I ran into a lot of issues with this that resulted in my computer not being able to boot for a while. I eventually figured it out, but it was quite a struggle.
It started after I downloaded and ran amdgpu-install, AMD’s tool for installing drivers and other software for use with their hardware.
I ran a variety of different commands with it at different stages of debugging: sudo amdgpu-install --usecase=rocm, then sudo amdgpu-install --uninstall, then sudo amdgpu-install --usecase=graphics,rocm.
The install itself failed because the kernel version needed by the amdgpu-dkms component of ROCm (5.x) was different from my kernel version (6.3), so the module build failed. amdgpu-dkms is a kernel module for amdgpu, and I didn’t and still don’t really understand how or if it differs from the amdgpu kernel module that comes built into the kernel.
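For anyone wanting to spot this kind of mismatch up front, something like the following gives a rough picture. It is only a sketch: kernel_major_minor is a helper name made up for this post, and dkms status only reports anything if DKMS is installed.

```shell
# Hypothetical helper: reduce a kernel release string to "major.minor".
# Uses the running kernel (uname -r) when no argument is given.
kernel_major_minor() {
    printf '%s\n' "${1:-$(uname -r)}" | cut -d. -f1,2
}

echo "running kernel series: $(kernel_major_minor)"

# dkms status lists each DKMS module and the kernels it is built or installed for.
if command -v dkms >/dev/null 2>&1; then
    dkms status
else
    echo "dkms is not installed"
fi
```

If the module you expect is missing from the dkms status output for the running kernel, its build probably did not succeed.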
Symptoms
At some point, I rebooted my computer, and the boot hung at the dmesg output that is displayed before my desktop environment pops up. I looked into a few red-herring errors in the logs that turned out to have nothing to do with the failure to boot.
I eventually figured out a way to get the boot to work: pressing the “e” key on the GRUB menu entry and adding nomodeset to the list of boot args.
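To confirm after booting that the flag really made it onto the kernel command line, /proc/cmdline can be checked. A small sketch: booted_with is a made-up helper name, and its optional second argument exists only so it can be pointed at a sample file instead of /proc/cmdline.

```shell
# Hypothetical helper: succeed if the given word appears on the kernel command line.
# The second argument defaults to /proc/cmdline.
booted_with() {
    grep -qwF -- "$1" "${2:-/proc/cmdline}" 2>/dev/null
}

if booted_with nomodeset; then
    echo "this boot used nomodeset"
else
    echo "nomodeset not found on the kernel command line"
fi
```

An edit made with “e” in the GRUB menu only applies to that one boot, so the check should come back negative again after a normal reboot.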
Although it did boot and most things worked alright, it was clear that nothing was GPU accelerated. Only one of my three monitors worked, Xorg had high CPU usage since it was clearly not accelerating anything with the GPU, and glxgears was running on a software rasterizer.
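One way to see which renderer is in use is to look at the renderer string that glxinfo reports. The sketch below assumes glxinfo is installed and a display is available. is_software_renderer is a made-up helper that just looks for the names of Mesa’s software renderers (llvmpipe, softpipe) in that string.

```shell
# Hypothetical helper: succeed if a renderer string names a Mesa software rasterizer.
is_software_renderer() {
    case "$1" in
        *llvmpipe*|*softpipe*) return 0 ;;
        *) return 1 ;;
    esac
}

# glxinfo -B prints a brief summary that includes the "OpenGL renderer string" line.
renderer="$(glxinfo -B 2>/dev/null | grep 'OpenGL renderer string' || true)"

if [ -z "$renderer" ]; then
    echo "could not query the renderer (is glxinfo installed and a display available?)"
elif is_software_renderer "$renderer"; then
    echo "software rendering: $renderer"
else
    echo "hardware rendering: $renderer"
fi
```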
The Cause
At some point, I ran lsmod | grep gpu to try to figure out if maybe there was some weird alternate kernel module dropped by amdgpu-install that was conflicting with the kernel’s built-in one.
However, what I saw was that there were no kernel modules loaded at all for amdgpu.
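The same check can be made a bit stricter by matching the module name exactly instead of grepping for a substring. This is a sketch only: module_loaded is a made-up helper that reads /proc/modules, which is the same information lsmod formats, and its second argument is there just so a sample file can be substituted.

```shell
# Hypothetical helper: succeed if a module with exactly this name is loaded.
# Reads a /proc/modules-style listing, where the first field is the module name.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}" 2>/dev/null
}

if module_loaded amdgpu; then
    echo "amdgpu is loaded"
else
    echo "amdgpu is NOT loaded"
fi
```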
The Fix
After a good bit of googling, I found a blog post written in Japanese which talks about this exact situation.
It turns out that when the amdgpu-dkms kernel module build fails, the file /etc/modprobe.d/blacklist-amdgpu.conf gets created. This prevents the amdgpu kernel module from loading during boot, which makes the boot fail (unless the nomodeset boot param is set).
After deleting that file, the computer booted normally.
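As a rough sketch of that step, the helper below prints the file so you can see what you are removing, then deletes it. remove_amdgpu_blacklist is a made-up name. The default path is the one from this post, and removing anything under /etc needs root.

```shell
# Hypothetical helper: show a modprobe blacklist file, then delete it.
# Defaults to the path named in this post; run as root for files under /etc.
remove_amdgpu_blacklist() {
    conf="${1:-/etc/modprobe.d/blacklist-amdgpu.conf}"
    if [ ! -f "$conf" ]; then
        echo "not found: $conf"
        return 1
    fi
    echo "contents of $conf:"
    cat "$conf"
    rm -- "$conf"
}

# Usage, from a root shell:
#   remove_amdgpu_blacklist
```

Defining the function does nothing on its own. The deletion only happens when it is called.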
I’ve given up on getting ROCm working for now, but might give it another go in the future. I have a hope that it will maybe work without installing the kernel module which caused all of these issues, but we’ll see!