当前位置：首页 > Rust > 正文

Rust语言从零实现决策树（手把手教你用Rust构建机器学习决策树模型）

主机测评网
Rust
2025-12-06
653

在当今的软件开发世界中，Rust语言因其内存安全、高性能和并发能力而备受关注。同时，决策树作为最基础且直观的机器学习算法之一，非常适合初学者理解模型如何做“决策”。本文将带你从零开始，使用Rust语言实现一个简单的决策树分类器，即使你是编程小白，也能轻松上手！

什么是决策树？

决策树是一种树形结构模型，它通过一系列“是/否”问题对数据进行分割，最终到达叶节点并给出预测结果。例如，判断一个人是否会购买某商品，可能基于年龄、收入、是否已婚等特征。

Rust语言从零实现决策树（手把手教你用Rust构建机器学习决策树模型） Rust语言决策树实现 Rust机器学习教程决策树算法Rust Rust编程入门第1张

为什么用Rust实现决策树？

Rust不仅安全高效，还拥有日益完善的生态系统。通过用Rust实现决策树算法，你可以深入理解算法原理，同时掌握Rust的基本语法、结构体、枚举和递归等核心概念。这也是学习Rust编程入门的绝佳项目！

项目准备

首先，确保你已安装Rust。打开终端，运行：

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

然后创建新项目：

$ cargo new rust_decision_treecd rust_decision_tree

定义数据结构

我们先定义几个关键结构：

Sample：表示一条训练数据，包含特征向量和标签。
Split：表示一次特征分割（如“年龄 > 30”）。
Node：决策树的节点，可以是内部节点或叶节点。

在 src/main.rs 中添加以下代码：

#[derive(Debug, Clone)]pub struct Sample {    pub features: Vec,    pub label: String,}#[derive(Debug, Clone)]pub struct Split {    pub feature_index: usize,    pub threshold: f64,}#[derive(Debug)]pub enum Node {    Leaf { prediction: String },    Internal {        split: Split,        left: Box<Node>,        right: Box<Node>,    },}

计算信息熵（Entropy）

决策树通常使用信息熵来衡量数据的“混乱度”。熵越低，数据越纯净。我们实现一个函数来计算一组样本的熵：

use std::collections::HashMap;fn entropy(samples: &[Sample]) -> f64 {    if samples.is_empty() {        return 0.0;    }    let total = samples.len() as f64;    let mut label_counts = HashMap::new();    for sample in samples {        *label_counts.entry(&sample.label).or_insert(0) += 1;    }    let mut entropy = 0.0;    for &count in label_counts.values() {        let p = count as f64 / total;        entropy -= p * p.log2();    }    entropy}

寻找最佳分割点

我们需要遍历所有特征和可能的阈值，找到使信息增益最大的分割：

fn find_best_split(samples: &[Sample]) -> Option<(Split, f64)> {    if samples.is_empty() {        return None;    }    let n_features = samples[0].features.len();    let mut best_gain = 0.0;    let mut best_split = None;    for feature_index in 0..n_features {        // 获取该特征的所有唯一值并排序        let mut thresholds: Vec<f64> = samples.iter()            .map(|s| s.features[feature_index])            .collect();        thresholds.sort_by(|a, b| a.partial_cmp(b).unwrap());        thresholds.dedup();        // 尝试每个可能的阈值（取相邻值的中点）        for i in 0..thresholds.len() - 1 {            let threshold = (thresholds[i] + thresholds[i + 1]) / 2.0;            let split = Split { feature_index, threshold };            let (left, right) = split_samples(samples, &split);            let gain = information_gain(samples, &left, &right);            if gain > best_gain {                best_gain = gain;                best_split = Some((split, gain));            }        }    }    best_split}fn split_samples(samples: &[Sample], split: &Split) -> (Vec<Sample>, Vec<Sample>) {    let mut left = Vec::new();    let mut right = Vec::new();    for sample in samples {        if sample.features[split.feature_index] <= split.threshold {            left.push(sample.clone());        } else {            right.push(sample.clone());        }    }    (left, right)}fn information_gain(parent: &[Sample], left: &[Sample], right: &[Sample]) -> f64 {    let p_entropy = entropy(parent);    let left_weight = left.len() as f64 / parent.len() as f64;    let right_weight = right.len() as f64 / parent.len() as f64;    let weighted_entropy = left_weight * entropy(left) + right_weight * entropy(right);    p_entropy - weighted_entropy}

递归构建决策树

现在我们可以递归地构建整棵树了：

fn build_tree(samples: &[Sample]) -> Node {    // 停止条件：如果所有样本属于同一类，返回叶节点    let labels: std::collections::HashSet<_> = samples.iter().map(|s| &s.label).collect();    if labels.len() == 1 {        return Node::Leaf {            prediction: samples[0].label.clone(),        };    }    // 如果无法再分割，返回多数类    if let Some((best_split, _)) = find_best_split(samples) {        let (left_samples, right_samples) = split_samples(samples, &best_split);        let left_child = build_tree(&left_samples);        let right_child = build_tree(&right_samples);        Node::Internal {            split: best_split,            left: Box::new(left_child),            right: Box::new(right_child),        }    } else {        // 返回出现次数最多的标签        let mut label_counts = HashMap::new();        for sample in samples {            *label_counts.entry(sample.label.clone()).or_insert(0) += 1;        }        let prediction = label_counts            .into_iter()            .max_by_key(|&(_, count)| count)            .map(|(label, _)| label)            .unwrap();        Node::Leaf { prediction }    }}

预测新样本

最后，我们实现预测函数：

fn predict(node: &Node, features: &[f64]) -> String {    match node {        Node::Leaf { prediction } => prediction.clone(),        Node::Internal { split, left, right } => {            if features[split.feature_index] <= split.threshold {                predict(left, features)            } else {                predict(right, features)            }        }    }}

完整示例与测试

在 main 函数中，我们用一个简单数据集测试模型：

fn main() {    // 示例数据：[年龄, 收入], 标签："买" 或 "不买"    let training_data = vec![        Sample { features: vec![25.0, 3000.0], label: "不买".to_string() },        Sample { features: vec![35.0, 8000.0], label: "买".to_string() },        Sample { features: vec![45.0, 7000.0], label: "买".to_string() },        Sample { features: vec![20.0, 2000.0], label: "不买".to_string() },    ];    let tree = build_tree(&training_data);    println!("构建的决策树: {:?}", tree);    // 预测新样本    let new_sample = vec![40.0, 6000.0];    let result = predict(&tree, &new_sample);    println!("预测结果: {}", result); // 应输出 "买"}