← Tilbage til strukturer

Suffix Tree

Træ

En komprimeret trie der indeholder alle suffixes af en string, brugt til effektiv string matching og pattern søgning.

Kompleksitet

OperationTidskompleksitet
AccessO(m) for pattern length m
SearchO(m) for pattern matching
InsertionO(n) for construction
DeletionN/A (static after construction)

Beskrivelse

Suffix Tree er en af de mest kraftfulde datastrukturer til string processing og pattern matching. Den gemmer alle suffixes af en given string i en komprimeret trie struktur. For en string af længde n vil der være n suffixes, men ved at bruge path compression kan hele strukturen bygges i O(n) tid og plads med Ukkonen's algorithm. Hver path fra roden til et leaf repræsenterer en suffix. Suffix Trees kan løse mange string problems i linear eller næsten linear tid: pattern matching i O(m) for pattern af længde m, finding longest repeated substring i O(n), longest common substring mellem to strings i O(n+m), og mange flere. Strukturen er grundlaget for mange bioinformatics algoritmer hvor DNA sekvens matching er kritisk. Selvom Suffix Trees er utroligt kraftfulde, bruges Suffix Arrays ofte i prakti k fordi de bruger mindre memory. Suffix Trees finder også anvendelse i text editors, DNA sequencing, data compression, og plagiarism detection.

Kode Eksempel

// JavaScript Suffix Tree Implementation (Simplified)
class SuffixTreeNode {
  constructor() {
    this.children = {};
    this.start = -1;
    this.end = null;
    this.suffixLink = null;
    this.suffixIndex = -1;
  }

  getEdgeLength(position) {
    return (this.end || position) - this.start;
  }
}

class SuffixTree {
  constructor(text) {
    this.text = text + '$'; // Add terminator
    this.root = new SuffixTreeNode();
    this.root.suffixLink = this.root;
    this.buildNaive();
  }

  // Naive O(n²) construction for simplicity
  buildNaive() {
    const n = this.text.length;
    
    for (let i = 0; i < n; i++) {
      this.insertSuffix(i);
    }
  }

  insertSuffix(suffixStart) {
    let node = this.root;
    
    for (let i = suffixStart; i < this.text.length; i++) {
      const char = this.text[i];
      
      if (!node.children[char]) {
        const newNode = new SuffixTreeNode();
        newNode.start = i;
        newNode.end = this.text.length - 1;
        newNode.suffixIndex = suffixStart;
        node.children[char] = newNode;
        return;
      }
      
      node = node.children[char];
    }
  }

  // Search for pattern
  search(pattern) {
    let node = this.root;
    let i = 0;
    
    while (i < pattern.length) {
      const char = pattern[i];
      
      if (!node.children[char]) {
        return false;
      }
      
      node = node.children[char];
      const edgeLabel = this.getEdgeLabel(node);
      
      for (let j = 0; j < edgeLabel.length && i < pattern.length; j++, i++) {
        if (edgeLabel[j] !== pattern[i]) {
          return false;
        }
      }
    }
    
    return true;
  }

  getEdgeLabel(node) {
    if (node.start === -1) return '';
    return this.text.substring(node.start, (node.end || 0) + 1);
  }

  // Find all occurrences of pattern
  findOccurrences(pattern) {
    const occurrences = [];
    let node = this.root;
    let i = 0;
    
    // Navigate to pattern end
    while (i < pattern.length && node) {
      const char = pattern[i];
      
      if (!node.children[char]) {
        return occurrences;
      }
      
      node = node.children[char];
      const edgeLabel = this.getEdgeLabel(node);
      
      for (let j = 0; j < edgeLabel.length && i < pattern.length; j++, i++) {
        if (edgeLabel[j] !== pattern[i]) {
          return occurrences;
        }
      }
    }
    
    // Collect all suffix indices from this subtree
    this.collectSuffixIndices(node, occurrences);
    return occurrences;
  }

  collectSuffixIndices(node, result) {
    if (node.suffixIndex !== -1) {
      result.push(node.suffixIndex);
    }
    
    for (const child of Object.values(node.children)) {
      this.collectSuffixIndices(child, result);
    }
  }

  // Find longest repeated substring
  findLongestRepeatedSubstring() {
    let maxLength = 0;
    let maxDepth = 0;
    let result = '';
    
    const dfs = (node, depth, label) => {
      if (Object.keys(node.children).length > 0) {
        if (depth > maxDepth) {
          maxDepth = depth;
          result = label;
        }
        
        for (const [char, child] of Object.entries(node.children)) {
          const edgeLabel = this.getEdgeLabel(child);
          dfs(child, depth + edgeLabel.length, label + edgeLabel);
        }
      }
    };
    
    dfs(this.root, 0, '');
    return result.replace('$', '');
  }
}

// Brug
const text = "banana";
const tree = new SuffixTree(text);

console.log('Search "ana":', tree.search('ana')); // true
console.log('Search "app":', tree.search('app')); // false

console.log('Occurrences of "ana":', tree.findOccurrences('ana')); // [1, 3]
console.log('Occurrences of "na":', tree.findOccurrences('na')); // [2, 4]

const text2 = "abcabcabc";
const tree2 = new SuffixTree(text2);
console.log('Longest repeated substring:', tree2.findLongestRepeatedSubstring()); // "abcabc"

Anvendelsesområder

  • Pattern matching i store tekster
  • DNA/RNA sequence analysis i bioinformatics
  • Plagiarism detection
  • Data compression algorithms
  • Text editors med advanced search

Fordele

  • O(m) pattern matching for pattern af længde m
  • Kan finde alle occurrences samtidig
  • Løser mange string problems i linear tid
  • Powerful til bioinformatics
  • Understøtter advanced queries (longest repeat, etc)

Ulemper

  • Meget memory intensiv O(n) men med høj konstant
  • Kompleks at implementere korrekt
  • Ukkonen's algorithm er svær at forstå
  • Bygning tager tid selvom O(n)
  • Ofte erstattes med Suffix Arrays i praksis

Eksempler fra den virkelige verden

  • DNA sequence matching (BLAST algorithms)
  • Plagiarism detection software (Turnitin)
  • String search i text editors
  • Data deduplication i storage systems
  • Genome assembly i bioinformatics

Relaterede strukturer

TrieSuffix ArrayAho-Corasick AutomatonKMP Algorithm