Fix NVIDIA DKMS Build Failure After Ubuntu Kernel Upgrade on a Proxmox VM Using GPU Passthrough

I ran into a kernel upgrade issue on an Ubuntu 24.04 VM running in Proxmox with NVIDIA GPU passthrough enabled. During apt upgrade, the system attempted to install a newer HWE kernel, but the NVIDIA DKMS module failed to build for that kernel.

Because this VM relies on GPU passthrough, I wanted to be careful not to break the existing working NVIDIA configuration.

This article documents the issue, the checks I performed, and the conservative fix I used to return the VM to a clean working state.

Who This Applies To

This guide is for an Ubuntu VM where a newer kernel failed to configure because NVIDIA DKMS could not build for that kernel, while an older kernel and NVIDIA driver are still working.

This is especially relevant for Proxmox VMs using NVIDIA GPU passthrough, where removing NVIDIA packages could break a working GPU setup.

Environment

  • Ubuntu 24.04 VM
  • Proxmox host
  • NVIDIA GPU passthrough enabled
  • Working kernel: 6.14.0-29-generic
  • Attempted kernel upgrade (failed): 6.17.0-23-generic
  • NVIDIA driver/DKMS version: 575.57.08

Problem

At the end of apt upgrade, I saw errors similar to the following:

Autoinstall of module nvidia/575.57.08 for kernel 6.17.0-23-generic (x86_64)
Building module(s)...............(bad exit status: 2)

Error! Bad return status for module build on kernel: 6.17.0-23-generic (x86_64)
Consult /var/lib/dkms/nvidia/575.57.08/build/make.log for more information.

Autoinstall on 6.17.0-23-generic failed for module(s) nvidia(10).

run-parts: /etc/kernel/postinst.d/dkms exited with return code 1
dpkg: error processing package linux-image-6.17.0-23-generic (--configure):
 installed linux-image-6.17.0-23-generic package post-installation script subprocess returned error exit status 1

The key issue was that DKMS could not build the NVIDIA module for the newly installed 6.17.0-23-generic kernel.

Because the NVIDIA module build failed, the kernel package did not finish configuring successfully.
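If you want to see why the build failed, the log DKMS references contains the full compiler output; the last lines usually show the actual error, often an API mismatch between the driver and the newer kernel:

tail -n 50 /var/lib/dkms/nvidia/575.57.08/build/make.log

In this case the details did not change the plan, since the fix was to back out the new kernel rather than patch the driver.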

Important Note: Do Not Immediately Purge NVIDIA

In a normal VM with no GPU passthrough, removing the NVIDIA packages might be a reasonable fix.

In my case, this VM uses NVIDIA GPU passthrough, so purging NVIDIA would risk breaking the working GPU setup.

The goal was not to remove NVIDIA.

The goal was to remove the failed 6.17 kernel attempt while preserving the working 6.14 kernel and its installed NVIDIA DKMS module.

Initial Checks

First, I confirmed the currently running kernel:

uname -r

Result:

6.14.0-29-generic

Then I checked the package state:

sudo dpkg --audit

The result showed that some 6.17 kernel-related packages were unpacked or half-configured.

I also checked DKMS status:

dkms status

The important line was:

nvidia/575.57.08, 6.14.0-29-generic, x86_64: installed

That confirmed the NVIDIA module was installed for the working 6.14 kernel.
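As an extra cross-check, assuming the driver is currently loaded, the kernel's own view of the driver version can be read directly and compared against the DKMS entry:

cat /proc/driver/nvidia/version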

I also verified that the 6.14 kernel files were still present:

ls /boot | grep '6.14.0-29'

Expected output included:

config-6.14.0-29-generic
initrd.img-6.14.0-29-generic
System.map-6.14.0-29-generic
vmlinuz-6.14.0-29-generic

At this point, the existing working installation was still intact. The failed part was the attempted upgrade to the 6.17 kernel.

Take a Snapshot First

Before making changes, I created a Proxmox snapshot of the VM.

This is especially important when working on a VM with GPU passthrough, kernel packages, and NVIDIA DKMS modules.
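Snapshots can be taken from the Proxmox web UI, or from the host shell with qm. A minimal sketch, assuming the VM ID is 100 (substitute your own):

qm snapshot 100 pre-kernel-cleanup --description "before removing failed 6.17 kernel"

Note that with PCI passthrough, saving RAM state alongside the snapshot is generally not supported, so a disk-only snapshot (the qm snapshot default) or a full shutdown first is the usual approach.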

Remove the Stale Crash File

The upgrade output also mentioned that a crash report already existed under /var/crash.

I removed that first:

sudo rm -f /var/crash/nvidia-dkms-575.0.crash

This was cleanup only. It did not resolve the DKMS build failure, but it removed a stale crash report that was adding noise to the upgrade output.
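If you are unsure of the exact crash filename on your own system, list the directory first and adjust the rm command to match:

ls -l /var/crash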

Purge the Failed 6.17 Kernel Image and Headers

Next, I removed the failed 6.17 kernel image and headers:

Note: Replace the kernel version in these commands with the failed kernel version shown on your own system. Do not blindly copy 6.17.0-23-generic unless that is the failed kernel version on your machine.

sudo apt purge linux-image-6.17.0-23-generic linux-headers-6.17.0-23-generic
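If you want to preview exactly what will be removed before committing, apt can simulate the purge without changing anything:

sudo apt -s purge linux-image-6.17.0-23-generic linux-headers-6.17.0-23-generic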

During removal, I saw a warning like:

rmdir: failed to remove '/lib/modules/6.17.0-23-generic': Directory not empty

This was not fatal in my case. dpkg only removes files it installed, so files generated after installation (for example, depmod output) were still present in the module directory.

I confirmed the previous command completed successfully:

echo $?

Result:

0

Then I checked the package state again:

sudo dpkg --audit

At this stage, dpkg --audit returned nothing, which meant the package database was no longer in a broken state.

Check for Remaining 6.17 Packages

Even after removing the failed image and headers, several 6.17 support packages remained installed.

I checked with:

dpkg -l | grep '6.17.0-23'

The remaining packages included items like:

linux-hwe-6.17-headers-6.17.0-23
linux-hwe-6.17-tools-6.17.0-23
linux-modules-6.17.0-23-generic
linux-modules-extra-6.17.0-23-generic
linux-tools-6.17.0-23-generic

I removed those as well:

sudo apt purge \
  linux-hwe-6.17-headers-6.17.0-23 \
  linux-hwe-6.17-tools-6.17.0-23 \
  linux-modules-6.17.0-23-generic \
  linux-modules-extra-6.17.0-23-generic \
  linux-tools-6.17.0-23-generic

Again, I saw a warning about /lib/modules/6.17.0-23-generic not being empty.

I checked again:

echo $?
dpkg -l | grep '6.17.0-23'
sudo dpkg --audit

The command returned 0, no 6.17.0-23 packages remained, and dpkg --audit returned nothing.
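For future reference, the two purge rounds could be combined by feeding dpkg's package list straight into apt. A sketch, assuming you review the matched list first so the pattern only catches packages you actually intend to remove:

sudo apt purge $(dpkg -l | awk '/^ii/ && /6\.17\.0-23/ {print $2}')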

Remove the Leftover 6.17 Module Directory

Since no 6.17.0-23 packages remained installed, it was safe to remove the leftover module directory manually.
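To see what was actually left behind, list the directory before deleting it; typically these are files generated after installation, which dpkg does not remove:

find /lib/modules/6.17.0-23-generic -maxdepth 2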

NOTE: Only remove /lib/modules/6.17.0-23-generic manually after confirming that dpkg -l | grep '6.17.0-23' returns nothing. Removing the module directory of an installed or currently booted kernel can break that kernel.

sudo rm -rf /lib/modules/6.17.0-23-generic

Final Cleanup

Next, I ran the standard cleanup sequence: autoremove --purge drops orphaned dependencies, dpkg --configure -a finishes any pending package configuration, and apt -f install repairs broken dependencies:

sudo apt autoremove --purge
sudo dpkg --configure -a
sudo apt -f install
sudo dpkg --audit

The important result: sudo dpkg --audit returned nothing, which meant the package system was clean again.

Hold the HWE Kernel Meta Packages

To prevent Ubuntu from immediately retrying the same problematic HWE kernel upgrade, I placed the HWE meta packages on hold:

sudo apt-mark hold linux-generic-hwe-24.04 linux-headers-generic-hwe-24.04

I verified the hold with:

apt-mark showhold

Expected output:

linux-generic-hwe-24.04
linux-headers-generic-hwe-24.04

This is not necessarily a permanent solution. It is a safety measure to prevent the same failed kernel/DKMS combination from being attempted again before I am ready to plan a controlled kernel and NVIDIA driver update.
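When the time comes to retry the upgrade, releasing the hold is a single command:

sudo apt-mark unhold linux-generic-hwe-24.04 linux-headers-generic-hwe-24.04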

Rebuild Initramfs and GRUB

I rebuilt the initramfs for the known-good kernel:

sudo update-initramfs -u -k 6.14.0-29-generic

Then I updated GRUB:

sudo update-grub
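As an optional sanity check, the contents of the rebuilt image can be listed with lsinitramfs. Whether NVIDIA modules appear there depends on your initramfs configuration, so an empty result is not necessarily a problem:

lsinitramfs /boot/initrd.img-6.14.0-29-generic | grep -i nvidia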

Verify Before Reboot

Before rebooting, I checked:

uname -r
dkms status | grep '6.14.0-29'
sudo dpkg --audit
dpkg -l | grep '6.17.0-23'

Expected results:

6.14.0-29-generic
nvidia/575.57.08, 6.14.0-29-generic, x86_64: installed

sudo dpkg --audit should return nothing.

dpkg -l | grep '6.17.0-23' should also return nothing.

Reboot

After confirming the system looked clean, I rebooted:

sudo reboot

After rebooting, I confirmed:

uname -r
nvidia-smi
dkms status
sudo dpkg --audit

The system booted back into:

6.14.0-29-generic

nvidia-smi worked, DKMS still showed the NVIDIA module installed for 6.14, and dpkg --audit returned nothing.

Final Result

The VM was restored to a clean and working state.

Final state:

  • Kernel: 6.14.0-29-generic
  • NVIDIA DKMS: installed for 6.14
  • nvidia-smi: working
  • dpkg state: clean
  • SSH: working
  • 6.17 packages: removed
  • HWE meta packages: held

Takeaways

The existing working Ubuntu installation was not broken. The failed part was the attempted configuration of the newer 6.17.0-23-generic kernel.

For a Proxmox VM using NVIDIA GPU passthrough, it is important to avoid blindly purging NVIDIA packages unless that is truly intended. In this case, the safer fix was to preserve the working NVIDIA driver and kernel, remove the failed kernel upgrade attempt, clean the package state, and temporarily hold the HWE meta packages.

Going forward, I would treat kernel and NVIDIA driver updates on this VM as planned maintenance. Before allowing future kernel or NVIDIA changes, I would take a Proxmox snapshot and carefully review what apt plans to install or remove.
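A simple way to do that review is apt's simulate mode, which prints the full install/remove plan without touching the system:

sudo apt -s upgrade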