There was an issue with Nvidia that had been lingering for years:

nvidia-container-interface cdi

As documented in many places, all you had to do to reproduce the issue was run docker run -d --rm --gpus=all nvidia/cuda bash -c "while [ true ]; do nvidia-smi -L; sleep 10; done" and then run sudo systemctl daemon-reload.
If you go and look in the container logs (docker logs <container_id>), you’ll see that the GPU is no longer detected.
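For reference, the full sequence looks something like this (nvidia/cuda may need an explicit tag on your setup, and the failure typically surfaces in the logs as "Failed to initialize NVML: Unknown Error"):

    # Start a container that lists the visible GPUs every 10 seconds
    docker run -d --rm --gpus=all nvidia/cuda bash -c "while [ true ]; do nvidia-smi -L; sleep 10; done"

    # The GPU shows up as expected
    docker logs <container_id>

    # Reload systemd on the host...
    sudo systemctl daemon-reload

    # ...and shortly after, the GPU is gone from the container
    docker logs <container_id>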

Enter CDI (Container Device Interface), a new specification: https://github.com/cncf-tags/container-device-interface.

The workaround is to:

  • Use nvidia as the default runtime.
  • Switch the nvidia runtime to cdi mode instead of the OCI hooks.
  • Generate the CDI spec. This can be done with a systemd unit that runs on boot (in case you ever change the GPU); see the sketch after this list.
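As a rough sketch of those three steps, assuming a standard nvidia-container-toolkit install and a reasonably recent nvidia-ctk (the exact flags, paths, and the unit name below are illustrative, so check them against your toolkit version and distro):

    # 1. Make nvidia the default Docker runtime (updates /etc/docker/daemon.json)
    sudo nvidia-ctk runtime configure --runtime=docker --set-as-default

    # 2. Switch the nvidia runtime from the legacy OCI hooks to cdi mode,
    #    then restart Docker to pick up both changes
    sudo nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi
    sudo systemctl restart docker

    # 3. Generate the CDI spec
    sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

To re-run step 3 on every boot, a minimal oneshot unit (e.g. /etc/systemd/system/nvidia-cdi-generate.service, adjusting the binary path if needed) could look like:

    [Unit]
    Description=Generate the NVIDIA CDI specification

    [Service]
    Type=oneshot
    ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

    [Install]
    WantedBy=multi-user.target

Enable it with sudo systemctl enable nvidia-cdi-generate.service so the spec is regenerated even if the GPU changes.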

EKS introduced this in version 1.32. See https://github.com/awslabs/amazon-eks-ami/pull/2173 and https://github.com/awslabs/amazon-eks-ami/releases/tag/v20250317.

ECS should have it working soon, per their comment: https://github.com/aws/amazon-ecs-ami/pull/541#issuecomment-3357380221.

Yesterday (2025/10/21), nvidia-container-toolkit v1.18.0 was released (https://github.com/NVIDIA/nvidia-container-toolkit/releases/tag/v1.18.0). It handles CDI seamlessly.
So if you never had to deal with these issues, you won’t have to do anything, or even need to know that it’s using CDI.
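If you’re curious whether CDI is actually in play on a host, a quick check could look like this (assuming a container engine recent enough to consume CDI specs and a generated spec under /etc/cdi or /var/run/cdi; device names may differ):

    # Toolkit version; 1.18.0 or later handles CDI transparently
    nvidia-ctk --version

    # List the CDI device names known on this host, e.g. nvidia.com/gpu=0
    nvidia-ctk cdi list

    # Request a GPU by its CDI name directly (requires CDI support enabled in the engine)
    docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L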