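Specifying the GPU and CUDA Version

Specifying the GPU devices visible to a program

Suppose the server has two GPUs: GPU 0 is a 3090 and GPU 1 is a 1080 Ti. To run a program only on the 1080 Ti, select the card by temporarily setting an environment variable in the current shell session: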

export CUDA_VISIBLE_DEVICES=1

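Alternatively, add the following at the top of the Python script, before importing PyTorch or any other library that initializes CUDA: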

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
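To confirm that the setting took effect, the visible devices can be listed from inside the process. The following is a minimal sketch, assuming PyTorch is the framework in use; the helper name `visible_gpu_names` is hypothetical, and it returns an empty list when PyTorch or CUDA is unavailable.

```python
import os

# Must be set before PyTorch (or any other CUDA library) initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"


def visible_gpu_names():
    """Return the names of the GPUs this process can see ([] without PyTorch/CUDA)."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


print(visible_gpu_names())
```

Note that visible devices are renumbered from 0 inside the process, so with `CUDA_VISIBLE_DEVICES=1` the 1080 Ti is addressed as `cuda:0`.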

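1. CUDA Toolkit (system) vs. CUDA Runtime (installed via Conda or Pip)

The system-wide CUDA installation is the full CUDA Toolkit, which includes the nvcc compiler and can be used to compile CUDA code. The CUDA installed by Conda is only the CUDA Runtime: it provides the environment needed to run CUDA programs but cannot compile them.

When compiling CUDA code with the system CUDA, PyTorch, its CUDA Runtime (the CUDA version specified in the PyTorch install command), and the CUDA Toolkit active in the current terminal must all use the same CUDA version. Projects that require such compilation include: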

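Project	Notes
Deformable DETR (including its variants)
Flash Attention	Also has GPU hardware requirements (3090, 4090, A100, H100, and newer)
MMCV

For other projects, code runs normally as long as the CUDA Runtime version of the Conda-installed PyTorch is no higher than the highest CUDA version supported by the GPU driver. The system CUDA Toolkit version does not matter in that case.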

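Minimum CUDA version

The minimum CUDA version for the RTX 3090 is 11.1 (PyTorch releases built for it: 1.8.0 through 1.10.1). The minimum CUDA version for the RTX 4090 is 11.8 (PyTorch 2.0.0 and later).

The PyTorch versions available for each CUDA version are listed on the PyTorch website.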
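The minimum-version rule can be checked mechanically. The sketch below only encodes the two GPUs named above; the table `MIN_CUDA` and the helper names are hypothetical and would need extending for other cards.

```python
# Minimum CUDA versions taken from the text above; extend for other GPUs as needed.
MIN_CUDA = {
    "RTX 3090": (11, 1),
    "RTX 4090": (11, 8),
}


def parse_version(text):
    """Turn a string such as '11.8' into a comparable tuple such as (11, 8)."""
    return tuple(int(part) for part in text.split(".")[:2])


def cuda_supports_gpu(gpu_name, cuda_version):
    """Return True if cuda_version meets the minimum listed for gpu_name."""
    return parse_version(cuda_version) >= MIN_CUDA[gpu_name]


print(cuda_supports_gpu("RTX 3090", "11.3"))  # True
print(cuda_supports_gpu("RTX 4090", "11.3"))  # False: below 11.8
```

Versions are compared as integer tuples rather than strings, because a string comparison would order "11.10" before "11.8".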

2. Checking the CUDA Toolkit version used by the current terminal

nvcc --version
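Before compiling a CUDA extension, it can help to compare the Toolkit version reported by nvcc with the CUDA version PyTorch was built with. This is a rough sketch: the helper names are hypothetical, and the parser assumes the usual "release X.Y" line in the nvcc output. `torch.version.cuda` is the PyTorch attribute that holds the build-time CUDA version (it is `None` for CPU-only builds).

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract the 'major.minor' release string from `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def toolkit_version():
    """Return the version of the nvcc found on PATH, or None if nvcc is missing."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def pytorch_cuda_version():
    """Return the CUDA version PyTorch was built with, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


toolkit, runtime = toolkit_version(), pytorch_cuda_version()
print(f"nvcc (Toolkit): {toolkit}, PyTorch (Runtime): {runtime}")
if toolkit and runtime and toolkit != runtime:
    print("Version mismatch: compiling CUDA extensions may fail.")
```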

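3. Switching the CUDA Toolkit version

The script switch_cuda_version.sh in ~/Tools switches the CUDA Toolkit version. Run it with the following command and follow the prompts. It is invoked with source so that the change applies to the current terminal.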

source ~/Tools/switch_cuda_version.sh

4. Common CUDA-related errors

CUDA_ERROR_OUT_OF_MEMORY

The GPU has run out of memory. Reduce the batch size or the input image size.
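One way to apply the batch-size advice automatically is to retry with a smaller batch whenever an out-of-memory error is raised. The sketch below is framework-agnostic and illustrative only: `run_with_smaller_batches` and `fake_step` are hypothetical names, and it assumes the error surfaces as a `RuntimeError` whose message mentions "out of memory", as PyTorch's does.

```python
def run_with_smaller_batches(step, batch_size, min_batch_size=1):
    """Call step(batch_size), halving batch_size each time GPU memory runs out."""
    while batch_size >= min_batch_size:
        try:
            return batch_size, step(batch_size)
        except RuntimeError as err:
            message = str(err).lower()
            if "out of memory" not in message and "out_of_memory" not in message:
                raise  # unrelated error: do not hide it
            batch_size //= 2
    raise RuntimeError("out of memory even at the minimum batch size")


def fake_step(batch_size):
    """Stand-in for a training step that only fits in memory up to batch size 8."""
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory (simulated)")
    return f"ok with batch size {batch_size}"


print(run_with_smaller_batches(fake_step, 32))  # (8, 'ok with batch size 8')
```

In real training code, `step` would run one forward and backward pass. Changing the batch size can also affect the learning-rate schedule, so the value that finally fits is usually written back into the configuration rather than discovered on every run.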

CUDA error: device-side assert triggered

Error output

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [91,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [92,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [93,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [94,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [95,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

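The "index out of bounds" assertion means that an index exceeded the size of the tensor being indexed. For object detection tasks, first check whether the number of classes the model outputs equals the number of dataset classes + 1.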
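A quick way to test this hypothesis is to scan the dataset labels on the CPU before training. The sketch below assumes labels are plain integer class IDs; `find_out_of_range_labels` and the example values are hypothetical.

```python
def find_out_of_range_labels(labels, num_outputs):
    """Return the labels that fall outside the valid range [0, num_outputs)."""
    return sorted({label for label in labels if not 0 <= label < num_outputs})


dataset_classes = 90
labels = [1, 17, 90]  # hypothetical class IDs read from the annotations

# With exactly dataset_classes outputs, label 90 indexes past the end.
print(find_out_of_range_labels(labels, dataset_classes))      # [90]
# With dataset_classes + 1 outputs, every label is valid.
print(find_out_of_range_labels(labels, dataset_classes + 1))  # []
```

If the labels are all in range, rerunning with `CUDA_LAUNCH_BLOCKING=1`, as the error message suggests, makes the stack trace point at the operation that actually failed.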