当前位置：首页 > C > 正文

C语言后缀树实现（从零构建高效字符串匹配结构）

主机测评网
C
2025-12-09
756

在计算机科学中，后缀树（Suffix Tree）是一种用于高效处理字符串操作的强大数据结构。它广泛应用于生物信息学、全文搜索、数据压缩等领域。本文将手把手教你用C语言后缀树实现一个基础但功能完整的后缀树，并深入浅出地解释其原理，即使你是编程小白也能轻松理解。

什么是后缀树？

后缀树是一种压缩的字典树（Trie），它将一个字符串的所有后缀都存储在一棵树中。例如，字符串 "banana$"（末尾加 $ 表示结束符）的所有后缀包括：

banana$
anana$
nana$
ana$
na$
a$
$

后缀树将这些后缀以共享前缀的方式组织成一棵树，从而节省空间并加速查询。

C语言后缀树实现（从零构建高效字符串匹配结构） C语言后缀树实现后缀树算法 C语言字符串处理高效字符串匹配第1张

为什么使用C语言实现后缀树？

C语言提供了对内存和指针的精细控制，非常适合实现底层数据结构。通过C语言字符串处理能力，我们可以高效构建和操作后缀树，为后续的高效字符串匹配打下基础。

基础结构定义

首先，我们需要定义后缀树的节点结构。每个节点包含子节点指针数组、起始和结束索引（用于表示边上的子串）、以及是否为叶节点的标记。

#include <stdio.h>#include <stdlib.h>#include <string.h>#define MAX_CHAR 256  // ASCII 字符集大小typedef struct SuffixTreeNode {    int start;                     // 边起始位置    int *end;                      // 边结束位置（使用指针以便所有叶节点共享）    struct SuffixTreeNode *children[MAX_CHAR];    struct SuffixTreeNode *suffixLink;  // 后缀链接    int index;                     // 叶节点存储后缀起始索引} Node;typedef struct SuffixTree {    char *text;    int textLen;    Node *root;    Node *activeNode;    int activeEdge;    int activeLength;    int remainingSuffixCount;    int *leafEnd;} SuffixTree;

Ukkonen 算法简介

构建后缀树最著名的算法是 Ukkonen 算法，它能在 O(n) 时间内构建后缀树（n 为字符串长度）。虽然完整实现较为复杂，但我们将采用简化版本，便于理解。

创建新节点

下面是一个创建新节点的辅助函数：

Node* createNode(int start, int *end) {    Node *node = (Node*)malloc(sizeof(Node));    node->start = start;    node->end = end;    node->suffixLink = NULL;    node->index = -1;    for (int i = 0; i < MAX_CHAR; i++) {        node->children[i] = NULL;    }    return node;}

构建后缀树（简化版）

为了教学目的，我们使用一个更直观但非线性时间的方法：逐个插入后缀。虽然效率不如 Ukkonen，但逻辑清晰，适合初学者。

void insertSuffix(Node *root, char *text, int suffixStart, int textLen) {    Node *current = root;    int i = suffixStart;    while (i < textLen) {        char ch = text[i];        if (current->children[ch] == NULL) {            // 创建新边            int *end = (int*)malloc(sizeof(int));            *end = textLen - 1;            current->children[ch] = createNode(i, end);            current->children[ch]->index = suffixStart;            break;        } else {            // 沿着现有边走            Node *child = current->children[ch];            int edgeLen = *(child->end) - child->start + 1;            int j;            for (j = 0; j < edgeLen && i + j < textLen; j++) {                if (text[child->start + j] != text[i + j]) break;            }            if (j == edgeLen) {                // 完全匹配，继续向下                i += edgeLen;                current = child;            } else {                // 需要分裂节点                int *oldEnd = child->end;                child->end = (int*)malloc(sizeof(int));                *(child->end) = child->start + j - 1;                // 创建新内部节点                Node *split = createNode(child->start, child->end);                split->children[text[child->start + j]] = child;                child->start += j;                // 创建新叶节点                int *newEnd = (int*)malloc(sizeof(int));                *newEnd = textLen - 1;                split->children[text[i + j]] = createNode(i + j, newEnd);                split->children[text[i + j]]->index = suffixStart;                // 更新父节点指针                current->children[ch] = split;                break;            }        }    }}SuffixTree* buildSuffixTree(char *text) {    int len = strlen(text);    SuffixTree *tree = (SuffixTree*)malloc(sizeof(SuffixTree));    tree->text = text;    tree->textLen = len;    tree->root = createNode(-1, NULL);    for (int i = 0; i < len; i++) {        insertSuffix(tree->root, text, i, len);    }    return tree;}

使用示例

下面是一个完整的使用示例，展示如何构建后缀树并打印所有后缀：

void printSuffixes(Node *node, char *text) {    if (node->index != -1) {        printf("Suffix: %s\n", text + node->index);        return;    }    for (int i = 0; i < MAX_CHAR; i++) {        if (node->children[i]) {            printSuffixes(node->children[i], text);        }    }}int main() {    char text[] = "banana$";    SuffixTree *tree = buildSuffixTree(text);    printf("All suffixes in the suffix tree:\n");    printSuffixes(tree->root, text);    return 0;}