This document provides a complete guide to setting up NVIDIA GPU support in k3s running on Windows WSL2.
Concepts and Terminology
Understanding the key concepts, technologies, and files involved is crucial for successful GPU integration with k3s on WSL2.
Core Technologies
WSL2 (Windows Subsystem for Linux 2)
A Windows feature that runs a real Linux kernel inside a lightweight virtual machine. WSL2 provides GPU paravirtualization, allowing Linux processes and containers to access the Windows NVIDIA GPU through a virtualized interface.
k3s
A lightweight, certified Kubernetes distribution designed for production workloads. k3s packages Kubernetes components into a single binary and includes containerd as the default container runtime.
containerd
A high-level container runtime that manages the complete container lifecycle. k3s uses containerd to run containers and manage images. containerd can be configured with different runtime engines (like runc or nvidia-container-runtime).
Kubernetes Device Plugins
Extensions that allow Kubernetes to advertise and schedule specialized hardware resources (like GPUs). Device plugins run as DaemonSets and register resources with the kubelet.
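For example, once a GPU device plugin has registered its resource, the resource appears in the node's capacity and allocatable fields. A quick way to check (a sketch; output abbreviated):
# List allocatable resources on the first node; a registered GPU plugin adds an
# entry such as "nvidia.com/gpu":"1"
kubectl get nodes -o jsonpath='{.items[0].status.allocatable}{"\n"}'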
NVIDIA Technologies
NVIDIA Container Toolkit (formerly nvidia-docker2)
A collection of tools and libraries that enable GPU support in containers. Includes:
- nvidia-container-runtime: OCI-compliant runtime that injects GPU support
- nvidia-container-cli: Low-level utility for configuring containers
- nvidia-ctk: High-level configuration tool
- libnvidia-container: Core library providing GPU container support
NVIDIA k8s Device Plugin
A Kubernetes device plugin that discovers NVIDIA GPUs and makes them available for pod scheduling. Runs as a DaemonSet and communicates with kubelet via Unix socket.
CUDA (Compute Unified Device Architecture)
NVIDIA’s parallel computing platform and API that allows applications to use GPUs for general-purpose processing.
Container Runtime Concepts
OCI Runtime Specification
Open Container Initiative standard defining how to run containers. Implementations include:
- runc: Default OCI runtime (CPU-only containers)
- nvidia-container-runtime: NVIDIA-enhanced OCI runtime (GPU-enabled containers)
Container Runtime Interface (CRI)
Kubernetes plugin interface that enables kubelet to use different container runtimes. containerd implements CRI.
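As an aside, k3s bundles crictl, which talks to its embedded containerd over CRI; it can be used to confirm which runtimes the CRI plugin knows about (a sketch, assuming the nvidia runtime has already been configured as described later in this guide):
# Dump the CRI runtime status/config and look for the nvidia runtime entry
sudo k3s crictl info | grep -A3 '"nvidia"'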
Container Device Interface (CDI)
Modern standard for exposing devices (like GPUs) to containers. CDI uses declarative YAML specifications instead of runtime hooks. Essential for WSL2 GPU support.
Key Configuration Files
/var/lib/rancher/k3s/agent/etc/containerd/config.toml
The active containerd configuration file used by k3s. Generated automatically from the template on k3s startup.
/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
The template file that defines containerd configuration. User modifications should go here, not in the generated config file.
/etc/cdi/nvidia.yaml
CDI specification file that describes available NVIDIA GPUs and required libraries/devices. Generated by nvidia-ctk cdi generate.
/var/lib/kubelet/device-plugins/
Directory where device plugins register their Unix sockets to communicate with kubelet.
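Once a device plugin pod is running, its registration can be confirmed by listing this directory (a sketch; the exact socket name depends on the plugin version):
# kubelet.sock belongs to the kubelet; the NVIDIA plugin adds its own socket
# (typically named nvidia*.sock) once it has registered successfully
sudo ls -la /var/lib/kubelet/device-plugins/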
/dev/nvidia*
Device files representing NVIDIA GPUs:
- /dev/nvidia0, /dev/nvidia1, etc. - Individual GPU devices
- /dev/nvidiactl - NVIDIA control device
- /dev/nvidia-uvm - Unified Virtual Memory device (if available)
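Note that under WSL2 the GPU is usually surfaced through the paravirtualization device /dev/dxg rather than native /dev/nvidia* nodes, so missing /dev/nvidia* entries are not necessarily an error as long as nvidia-smi works. A quick check:
# List whatever GPU-related device nodes exist; on WSL2 expect /dev/dxg
ls -la /dev/nvidia* /dev/dxg 2>/dev/null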
Kubernetes Resources
DaemonSet
Kubernetes workload that ensures a pod runs on all (or selected) nodes. The NVIDIA device plugin runs as a DaemonSet.
Tolerations
Allow pods to be scheduled on nodes with matching taints. Required for device plugin to run on control-plane nodes.
Node Selectors/Affinity
Mechanisms to constrain pod scheduling to specific nodes. Often used with GPU workloads to ensure scheduling on GPU-enabled nodes.
Resource Requests/Limits
Kubernetes mechanism for requesting and limiting compute resources. GPU resources are requested as nvidia.com/gpu: 1.
WSL2-Specific Concepts
GPU Passthrough
WSL2 feature that allows Linux containers to access Windows GPU hardware through virtualized drivers.
Driver Store
Windows mechanism for managing device drivers. WSL2 accesses NVIDIA drivers through the Windows driver store paths.
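On a typical WSL2 install the NVIDIA user-space libraries and nvidia-smi are mounted from the Windows driver store under /usr/lib/wsl/lib; a quick way to confirm they are present:
# These files come from the Windows driver, not from an apt package
ls /usr/lib/wsl/lib/ | grep -i nvidia
which nvidia-smi   # commonly resolves to /usr/lib/wsl/lib/nvidia-smi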
WDDM (Windows Display Driver Model)
Windows graphics driver architecture. NVIDIA WSL drivers use WDDM for GPU access.
Configuration Parameters
enable_cdi
containerd setting that enables Container Device Interface support. Critical for WSL2 GPU functionality.
default_runtime_name
containerd setting that specifies which runtime to use by default. Set to “nvidia” for automatic GPU support.
DEVICE_DISCOVERY_STRATEGY
Device plugin environment variable controlling how GPUs are discovered. Options include:
- auto: Automatic detection (may fail in WSL2)
- nvml: Use NVIDIA Management Library (recommended for WSL2)
NVIDIA_VISIBLE_DEVICES
Environment variable controlling which GPUs are visible to containers. Set to “all” to expose all available GPUs.
Common Terms
GRPC (Google Remote Procedure Call)
Protocol used by device plugins to communicate with kubelet. Device plugins expose GRPC servers on Unix sockets.
SystemdCgroup
Linux control group management using systemd. Required for proper resource isolation in containers.
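SystemdCgroup = true (used in the containerd template later in this guide) assumes systemd is PID 1 inside the distro; on WSL2 this requires enabling systemd in /etc/wsl.conf and restarting the distro. A quick check:
# Should print "systemd"; if not, add the following to /etc/wsl.conf and run
# `wsl --shutdown` from Windows:
#   [boot]
#   systemd=true
ps -p 1 -o comm=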
Privileged Containers
Containers with elevated permissions. Device plugins typically need privileged access to manage hardware.
Flannel
Default CNI (Container Network Interface) plugin used by k3s for pod networking.
Problem Summary
The NVIDIA k8s device plugin fails to detect GPU devices in k3s on WSL2, showing errors like:
Incompatible strategy detected auto
No devices found. Waiting indefinitely.
0/1 nodes are available: 1 Insufficient nvidia.com/gpu
Root Cause
The issue stems from k3s using its own containerd configuration that doesn’t include proper NVIDIA Container Toolkit integration for WSL2 environments. WSL2 requires Container Device Interface (CDI) support for GPU access, which needs to be explicitly configured in k3s’s containerd runtime.
Critical Missing Components in Basic Setups
Many initial attempts fail because they miss these essential WSL2-specific requirements:
- CDI (Container Device Interface) Support - The most critical missing piece
- Proper k3s containerd template configuration - Standard Docker GPU approaches don’t work
- WSL2-specific device plugin environment variables - Different discovery strategy needed
- Correct tolerations for single-node setups - Device plugin scheduling requirements
- Using template files instead of direct config modification - k3s regenerates config files
Prerequisites
Windows Host
- Windows 10/11 with WSL2 enabled
- NVIDIA GPU with latest drivers installed on Windows
- WSL2 with GPU support enabled
WSL2 Ubuntu Environment
- Ubuntu 22.04+ in WSL2
- NVIDIA drivers in WSL2 (usually auto-installed)
- Docker/containerd runtime support
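A quick way to confirm the host-side prerequisites from inside the distro (a sketch; wsl.exe --version requires a reasonably recent Store-distributed WSL):
# Windows interop lets wsl.exe run from inside WSL2; this prints the WSL,
# kernel, and host Windows versions
wsl.exe --version
# Kernel should identify as WSL2 (e.g. ...-microsoft-standard-WSL2)
uname -r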
Step-by-Step Setup Guide
1. Verify NVIDIA GPU Access in WSL2
# Check if nvidia-smi works
nvidia-smi
# You should see your GPU listed with driver version
2. Install NVIDIA Container Toolkit
# Add NVIDIA package repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Update and install
sudo apt update
sudo apt install -y nvidia-container-toolkit
3. Generate NVIDIA CDI Specification
# Create CDI directory
sudo mkdir -p /etc/cdi
# Generate CDI spec for WSL2 (this auto-detects WSL mode)
# This is the MOST CRITICAL missing piece from basic setups
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Verify CDI spec was created
ls -la /etc/cdi/nvidia.yaml
Why CDI is Critical for WSL2: Standard Docker/containerd GPU integration uses OCI hooks, but WSL2's virtualized GPU environment requires the Container Device Interface (CDI) for proper device mapping. Without CDI, the device plugin cannot detect GPUs even when nvidia-smi works.
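A quick way to confirm the generated spec is actually usable is to ask nvidia-ctk to list the CDI devices it resolves (output will vary with GPU count):
# Should print entries such as nvidia.com/gpu=0 and nvidia.com/gpu=all
nvidia-ctk cdi list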
4. Install and Configure k3s
# Install k3s (if not already installed)
curl -sfL https://get.k3s.io | sh -
# Stop k3s for configuration
sudo systemctl stop k3s
5. Create k3s Containerd Template with NVIDIA Support
Critical: k3s regenerates its containerd config from the template file on startup. Direct modifications to /var/lib/rancher/k3s/agent/etc/containerd/config.toml will be overwritten. Always use the .tmpl file.
Create the containerd configuration template with the essential missing components:
sudo tee /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl > /dev/null << 'EOF'
# K3s containerd config template with NVIDIA GPU support
version = 3
root = "/var/lib/rancher/k3s/agent/containerd"
state = "/run/k3s/containerd"
[grpc]
address = "/run/k3s/containerd/containerd.sock"
[plugins.'io.containerd.internal.v1.opt']
path = "/var/lib/rancher/k3s/agent/containerd"
[plugins.'io.containerd.grpc.v1.cri']
stream_server_address = "127.0.0.1"
stream_server_port = "10010"
[plugins.'io.containerd.cri.v1.runtime']
enable_selinux = false
enable_unprivileged_ports = true
enable_unprivileged_icmp = true
device_ownership_from_security_context = false
enable_cdi = true # CRITICAL: Enables CDI support for WSL2 GPU access
[plugins.'io.containerd.cri.v1.images']
snapshotter = "overlayfs"
disable_snapshot_annotations = true
[plugins.'io.containerd.cri.v1.images'.pinned_images]
sandbox = "rancher/mirrored-pause:3.6"
[plugins.'io.containerd.cri.v1.runtime'.cni]
bin_dir = "/var/lib/rancher/k3s/data/cni"
conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"
[plugins.'io.containerd.cri.v1.runtime'.containerd]
default_runtime_name = "nvidia" # CRITICAL: Makes nvidia runtime default for all pods
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.runhcs-wcow-process]
runtime_type = "io.containerd.runhcs.v1"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia]
runtime_type = "io.containerd.runc.v2"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime" # CRITICAL: Path to nvidia runtime
SystemdCgroup = true
[plugins.'io.containerd.cri.v1.images'.registry]
config_path = "/var/lib/rancher/k3s/agent/etc/containerd/certs.d"
EOF
6. Start k3s and Deploy NVIDIA Device Plugin
# Start k3s service
sudo systemctl start k3s
# Wait for k3s to be ready
sleep 30
sudo kubectl get nodes
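# Optional sanity check (a sketch): confirm the template above was rendered
# into the active config with CDI and the nvidia runtime enabled
sudo grep -E 'enable_cdi|default_runtime_name|nvidia-container-runtime' \
  /var/lib/rancher/k3s/agent/etc/containerd/config.toml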
# Deploy NVIDIA device plugin with WSL2-compatible configuration
# Key differences from standard device plugin: WSL2-specific env vars and tolerations
sudo tee /tmp/nvidia-device-plugin.yaml > /dev/null << 'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: nvidia-device-plugin-daemonset
namespace: kube-system
spec:
selector:
matchLabels:
name: nvidia-device-plugin-ds
updateStrategy:
type: RollingUpdate
template:
metadata:
labels:
name: nvidia-device-plugin-ds
spec:
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
- key: CriticalAddonsOnly
operator: Exists
- key: node-role.kubernetes.io/master
effect: NoSchedule
- key: node-role.kubernetes.io/control-plane
effect: NoSchedule
- operator: Exists # CRITICAL: Catch-all toleration for single-node k3s setups
priorityClassName: system-node-critical
containers:
- image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
name: nvidia-device-plugin-ctr
env:
- name: FAIL_ON_INIT_ERROR
value: "false"
- name: NVIDIA_DRIVER_CAPABILITIES
value: "compute,utility" # WSL2-specific capability requirements
- name: NVIDIA_VISIBLE_DEVICES
value: "all" # Expose all GPUs to device plugin
- name: DEVICE_DISCOVERY_STRATEGY
value: "nvml" # CRITICAL: Use NVML for WSL2 (auto detection fails)
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
volumeMounts:
- name: device-plugin
mountPath: /var/lib/kubelet/device-plugins
- name: dev
mountPath: /dev
volumes:
- name: device-plugin
hostPath:
path: /var/lib/kubelet/device-plugins
- name: dev
hostPath:
path: /dev
EOF
# Apply the device plugin
sudo kubectl apply -f /tmp/nvidia-device-plugin.yaml
7. Verification
# Check if device plugin is running
sudo kubectl get pods -n kube-system | grep nvidia
# Check if GPU resources are visible
sudo kubectl describe node | grep nvidia.com/gpu
# Create a test GPU job
sudo tee /tmp/gpu-test.yaml > /dev/null << 'EOF'
apiVersion: batch/v1
kind: Job
metadata:
name: gpu-test
spec:
template:
spec:
containers:
- name: pytorch-gpu
image: pytorch/pytorch:latest
command: ["python", "-c", "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU count: {torch.cuda.device_count()}')"]
resources:
limits:
nvidia.com/gpu: 1
requests:
nvidia.com/gpu: 1
restartPolicy: Never
backoffLimit: 4
EOF
# Run the test
sudo kubectl apply -f /tmp/gpu-test.yaml
# Check results (wait a few minutes for the job to complete)
sudo kubectl logs job/gpu-test
# Expected output:
# CUDA available: True
# GPU count: 1
Common Initial Mistakes and Why They Fail
1. Missing CDI Support (Most Critical)
What’s typically attempted: Using basic NVIDIA Container Toolkit setup without generating CDI specifications.
# This alone is NOT sufficient for WSL2
nvidia-ctk runtime configure --runtime=containerd
Why it fails: WSL2's virtualized GPU environment doesn't work with standard OCI hooks. CDI provides the device mapping layer that WSL2 requires.
Correct approach:
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Plus enable_cdi = true in containerd config
2. Incorrect Containerd Configuration
What’s typically attempted: Trying to configure system containerd or using Docker’s containerd config.
# This configures the wrong containerd instance
sudo nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd
Why it fails: k3s uses its own embedded containerd with its own config template that regenerates on startup.
Correct approach: Configure k3s’s containerd template at /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
3. Wrong Device Discovery Strategy
What’s typically attempted: Using default device plugin configuration with auto discovery.
env:
- name: DEVICE_DISCOVERY_STRATEGY
value: "auto" # Fails in WSL2
Why it fails: WSL2's GPU virtualization layer breaks automatic GPU discovery mechanisms.
Correct approach: Force NVML discovery strategy:
env:
- name: DEVICE_DISCOVERY_STRATEGY
value: "nvml" # Required for WSL2
4. Insufficient Tolerations
What’s typically attempted: Using only GPU-specific tolerations.
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
Why it fails: Single-node k3s setups may carry control-plane or other taints that prevent the device plugin from being scheduled.
Correct approach: Include catch-all toleration:
tolerations:
- operator: Exists # Allows scheduling on any node
5. Not Setting NVIDIA as Default Runtime
What’s typically attempted: Leaving runc as the default runtime and expecting GPU detection.
Why it fails: Device plugins need the nvidia runtime to be active to detect GPU resources properly.
Correct approach:
[plugins.'io.containerd.cri.v1.runtime'.containerd]
default_runtime_name = "nvidia"
Key Configuration Points
1. CDI Support
- Critical: enable_cdi = true in containerd config
- CDI specification generated with nvidia-ctk cdi generate
- WSL2 requires CDI for proper GPU device mapping
2. NVIDIA Runtime as Default
- Set default_runtime_name = "nvidia" in containerd
- Ensures all pods use the NVIDIA runtime by default
- Required for device plugin to detect GPUs
3. Device Plugin Tolerations
- Must include an operator: Exists toleration
- Required for control-plane/master node scheduling
- Essential for single-node k3s setups
4. WSL2-Specific Environment Variables
- DEVICE_DISCOVERY_STRATEGY: "nvml"
- NVIDIA_VISIBLE_DEVICES: "all"
- NVIDIA_DRIVER_CAPABILITIES: "compute,utility"
Troubleshooting
Device Plugin Shows “No devices found”
- Verify CDI specification exists: ls -la /etc/cdi/nvidia.yaml
- Check containerd config includes CDI: grep enable_cdi /var/lib/rancher/k3s/agent/etc/containerd/config.toml
- Ensure nvidia runtime is available: nvidia-container-runtime --version
Network Plugin Not Ready
- This usually indicates containerd config is incomplete
- Verify the full template was applied correctly
- Check k3s logs: journalctl -u k3s --no-pager -n 20
GPU Test Pod Stays Pending
- Check node GPU capacity: kubectl describe node | grep nvidia.com/gpu
- Verify device plugin is running: kubectl get pods -n kube-system | grep nvidia
- Check device plugin logs: kubectl logs -l name=nvidia-device-plugin-ds -n kube-system
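If those all look healthy, the scheduler events usually point at what is still missing (a sketch):
# Recent cluster events, most recent last
kubectl get events --sort-by=.lastTimestamp | tail -20
# Detailed scheduling information for the test job's pod
kubectl describe pods -l job-name=gpu-test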
Files Modified Summary
- /etc/cdi/nvidia.yaml - NVIDIA CDI specification for WSL2
- /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl - k3s containerd template with NVIDIA support
- Device plugin DaemonSet - modified with proper tolerations and WSL2 environment variables
Important Notes
- Always use the template file (config.toml.tmpl), not the generated config directly
- CDI is required for WSL2 GPU support - standard OCI hooks don't work reliably
- Restart k3s after making containerd template changes
- Single-node setups need special tolerations for control-plane scheduling
This configuration enables full GPU support in k3s on WSL2, allowing containers to access NVIDIA GPUs for CUDA workloads, machine learning, and other GPU-accelerated tasks.