
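Setting Up a Distributed Training Environment on Rocky Linux (A Step-by-Step Guide to Configuring a Multi-Node Deep Learning Training Cluster)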

主机测评网
RockyLinux
2025-12-07

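Distributed training across several machines is a key technique for shortening model training time. This guide is aimed at beginners: it walks through setting up a complete deep learning environment on Rocky Linux and running distributed training across multiple nodes. Every step is given as an explicit command, so readers who are new to Linux can follow along as well.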


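I. Preparation

You need at least two servers running Rocky Linux 8 or 9 (physical or virtual machines both work) that meet the following conditions:

All nodes can reach one another over the network (an internal network is recommended)
Each machine has a GPU (for example, an NVIDIA card)
Static IP addresses are already configured
You have sudo privileges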
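
The connectivity requirement above can be checked before going any further. The following is a minimal sketch, assuming every node accepts TCP connections on the SSH port 22; the helper name can_connect is hypothetical and not part of any library.

```python
import socket


def can_connect(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

Calling can_connect("192.168.1.11") from another node returns False when the host is down, firewalled, or not listening on that port, which is usually enough to catch network problems early.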

II. Base Environment Configuration (All Nodes)

1. Update the system and install the necessary tools

sudo dnf update -y
sudo dnf install -y epel-release
sudo dnf install -y git wget htop vim net-tools openssh-server

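2. Install the NVIDIA driver and CUDA

First add the official NVIDIA repository, then install the driver and the CUDA toolkit. The repository URL below targets Rocky Linux 8; on Rocky Linux 9, replace rhel8 with rhel9 in both places in the URL:

sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf module install -y nvidia-driver:latest-dkms
sudo dnf install -y cuda-toolkit-12-3

After rebooting, verify that the driver is loaded: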

nvidia-smi
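
When the same check has to be repeated on every node, the nvidia-smi output can also be collected from a script. Below is a minimal sketch that uses the --query-gpu and --format options of nvidia-smi; the function names parse_gpu_csv and query_gpus are hypothetical helpers introduced here for illustration.

```python
import subprocess
from typing import Dict, List


def parse_gpu_csv(text: str) -> List[Dict[str, str]]:
    """Parse 'name, driver_version, memory.total' CSV lines, one per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        name, driver, memory = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "driver": driver, "memory": memory})
    return gpus


def query_gpus() -> List[Dict[str, str]]:
    """Run nvidia-smi and return one dict per GPU.

    Raises if nvidia-smi is missing or exits with an error
    (for example, when the driver is not loaded).
    """
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,driver_version,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```

An empty list or an exception from query_gpus() on any node suggests the driver installation on that node should be revisited before continuing.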

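3. Install Python and Conda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
# The batch install (-b) does not modify ~/.bashrc, so call conda by its full path first
$HOME/miniconda3/bin/conda init bash
source ~/.bashrc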

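III. Configure Passwordless SSH Login (Master Node to All Worker Nodes)

Assume the master node's IP is 192.168.1.10 and the worker nodes are 192.168.1.11 and 192.168.1.12.

Generate an SSH key on the master node: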

ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa

Copy the public key to all worker nodes:

ssh-copy-id user@192.168.1.11
ssh-copy-id user@192.168.1.12

Test the passwordless login:

ssh user@192.168.1.11 'hostname'
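
The same test can be run against every worker in one go. The sketch below assumes the OpenSSH client is installed on the master node; BatchMode=yes makes ssh fail instead of prompting for a password, so a failure indicates the key was not accepted. The function names and the user name are illustrative assumptions.

```python
import subprocess
from typing import Dict, List


def ssh_command(user: str, host: str, remote_cmd: str) -> List[str]:
    """Build an ssh invocation that fails instead of asking for a password."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
            f"{user}@{host}", remote_cmd]


def check_passwordless(user: str, hosts: List[str]) -> Dict[str, bool]:
    """Return {host: True/False} depending on whether key-based login works."""
    results = {}
    for host in hosts:
        proc = subprocess.run(ssh_command(user, host, "hostname"),
                              capture_output=True, text=True)
        results[host] = proc.returncode == 0
    return results
```

For the example cluster, check_passwordless("user", ["192.168.1.11", "192.168.1.12"]) should report True for both workers once ssh-copy-id has succeeded.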

IV. Install PyTorch and Distributed Dependencies

Create the same Conda environment on every node:

conda create -n dist_train python=3.10 -y
conda activate dist_train
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install opencv-python numpy pandas
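
Before moving on, it can help to confirm on each node that the installed PyTorch build sees the GPU and includes NCCL support. The following is a minimal sketch; describe_torch is a hypothetical helper name, while the torch calls it uses (torch.cuda.is_available, torch.cuda.device_count, torch.distributed.is_nccl_available) are standard PyTorch functions.

```python
def describe_torch() -> dict:
    """Collect basic facts about the installed PyTorch build, if any."""
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
        "nccl_available": dist.is_available() and dist.is_nccl_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch().items():
        print(f"{key}: {value}")
```

Run it inside the dist_train environment on every node; the version should match across nodes, and both cuda_available and nccl_available should be True for the training script in the next section to work.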

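V. Write the Distributed Training Script

Create a simple train.py file that uses PyTorch's torch.distributed module:

import os

import torch
import torch.distributed as dist


def setup():
    # torchrun sets LOCAL_RANK; bind this process to its own GPU on the node
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are also read from the environment
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Rank {rank} initialized.")
    return rank


def cleanup():
    dist.destroy_process_group()


if __name__ == "__main__":
    rank = setup()
    # Example: each GPU creates a tensor and runs an all-reduce
    tensor = torch.ones(2, 2).cuda() * rank
    print(f"Rank {rank} before all_reduce: {tensor}")
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"Rank {rank} after all_reduce: {tensor}")
    cleanup()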
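
To see what result to expect from the script, the all-reduce can be worked through by hand. The sketch below is a pure-Python simulation (it does not use torch or NCCL) of an element-wise SUM all-reduce, assuming three ranks as in this guide; simulate_all_reduce_sum is a hypothetical name used only for illustration.

```python
from typing import List

Matrix = List[List[float]]


def simulate_all_reduce_sum(per_rank: List[Matrix]) -> List[Matrix]:
    """Return what each rank would hold after an element-wise SUM all-reduce."""
    rows, cols = len(per_rank[0]), len(per_rank[0][0])
    total = [[sum(t[r][c] for t in per_rank) for c in range(cols)]
             for r in range(rows)]
    # Every rank ends up with its own copy of the same summed tensor.
    return [[row[:] for row in total] for _ in per_rank]


world_size = 3  # three nodes with one GPU each, as in this guide
# Rank r starts with a 2x2 tensor filled with the value r.
before = [[[float(rank)] * 2 for _ in range(2)] for rank in range(world_size)]
after = simulate_all_reduce_sum(before)
print(after[0])  # [[3.0, 3.0], [3.0, 3.0]]
```

With ranks 0, 1 and 2, every element becomes 0 + 1 + 2 = 3 on all ranks, so each process running train.py should print a 2x2 tensor of threes after the all_reduce call.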

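VI. Launch Distributed Training

On the master node, create a host file hostfile.txt that lists the nodes and the number of GPU slots on each. Note that torchrun does not read this file; it serves as a record of the cluster layout for your own scripts or for launchers that accept a hostfile:

192.168.1.10 slots=1
192.168.1.11 slots=1
192.168.1.12 slots=1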
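
Since the host file is plain text, a script can read it to derive the node count and the GPUs per node. Below is a minimal sketch of such a parser, assuming the "host slots=N" layout shown above; parse_hostfile is a hypothetical helper, and the handling of comments and missing slots values is an assumption of this sketch.

```python
from typing import List, Tuple


def parse_hostfile(text: str) -> List[Tuple[str, int]]:
    """Parse lines of the form '<host> slots=<n>' into (host, slots) pairs."""
    nodes = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # allow comments and blank lines
        if not line:
            continue
        parts = line.split()
        host, slots = parts[0], 1  # default to one slot when none is given
        for token in parts[1:]:
            if token.startswith("slots="):
                slots = int(token.split("=", 1)[1])
        nodes.append((host, slots))
    return nodes


sample = """192.168.1.10 slots=1
192.168.1.11 slots=1
192.168.1.12 slots=1
"""
nodes = parse_hostfile(sample)
print(len(nodes), sum(slots for _, slots in nodes))  # 3 3
```

The number of entries corresponds to --nnodes and the slots value to --nproc_per_node in the launch command.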

Launch training with torchrun (assuming one GPU per machine). Run the following on the master node:

conda activate dist_train
torchrun \
  --nnodes=3 \
  --nproc_per_node=1 \
  --node_rank=0 \
  --master_addr="192.168.1.10" \
  --master_port=29500 \
  train.py

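Note: run the same command on the other nodes as well, changing --node_rank to 1 and 2 respectively; training starts only once all three nodes have joined. torch.distributed.run is the module form of torchrun (python -m torch.distributed.run) and takes the same arguments, so it does not schedule nodes from hostfile.txt by itself. To start every node from the master, wrap the command in an SSH loop or use a launcher that accepts a hostfile (see the official PyTorch documentation for advanced usage).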
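
Because the commands differ only in --node_rank, they can be generated rather than typed by hand. The sketch below builds the torchrun command line for each node using the same flags as above; torchrun_command is a hypothetical helper, and the host list is the example cluster from this guide with the master node first.

```python
import shlex
from typing import List


def torchrun_command(node_rank: int, nnodes: int, master_addr: str,
                     nproc_per_node: int = 1, master_port: int = 29500,
                     script: str = "train.py") -> List[str]:
    """Build the torchrun argument list for one node."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


hosts = ["192.168.1.10", "192.168.1.11", "192.168.1.12"]  # master first
for rank, host in enumerate(hosts):
    cmd = shlex.join(torchrun_command(rank, len(hosts), hosts[0]))
    print(f"{host}: {cmd}")
```

Each printed line can be pasted into a shell on the corresponding node, or passed to ssh from the master; in both cases the dist_train environment must be activated on that node first.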