Suffix Tree
TræEn komprimeret trie der indeholder alle suffixes af en string, brugt til effektiv string matching og pattern søgning.
Kompleksitet
| Operation | Tidskompleksitet |
|---|---|
| Access | O(m) for pattern length m |
| Search | O(m) for pattern matching |
| Insertion | O(n) for construction |
| Deletion | N/A (static after construction) |
Beskrivelse
Suffix Tree er en af de mest kraftfulde datastrukturer til string processing og pattern matching. Den gemmer alle suffixes af en given string i en komprimeret trie struktur. For en string af længde n vil der være n suffixes, men ved at bruge path compression kan hele strukturen bygges i O(n) tid og plads med Ukkonen's algorithm. Hver path fra roden til et leaf repræsenterer en suffix. Suffix Trees kan løse mange string problems i linear eller næsten linear tid: pattern matching i O(m) for pattern af længde m, finding longest repeated substring i O(n), longest common substring mellem to strings i O(n+m), og mange flere. Strukturen er grundlaget for mange bioinformatics algoritmer hvor DNA sekvens matching er kritisk. Selvom Suffix Trees er utroligt kraftfulde, bruges Suffix Arrays ofte i prakti k fordi de bruger mindre memory. Suffix Trees finder også anvendelse i text editors, DNA sequencing, data compression, og plagiarism detection.
Kode Eksempel
// JavaScript Suffix Tree Implementation (Simplified)
class SuffixTreeNode {
constructor() {
this.children = {};
this.start = -1;
this.end = null;
this.suffixLink = null;
this.suffixIndex = -1;
}
getEdgeLength(position) {
return (this.end || position) - this.start;
}
}
class SuffixTree {
constructor(text) {
this.text = text + '$'; // Add terminator
this.root = new SuffixTreeNode();
this.root.suffixLink = this.root;
this.buildNaive();
}
// Naive O(n²) construction for simplicity
buildNaive() {
const n = this.text.length;
for (let i = 0; i < n; i++) {
this.insertSuffix(i);
}
}
insertSuffix(suffixStart) {
let node = this.root;
for (let i = suffixStart; i < this.text.length; i++) {
const char = this.text[i];
if (!node.children[char]) {
const newNode = new SuffixTreeNode();
newNode.start = i;
newNode.end = this.text.length - 1;
newNode.suffixIndex = suffixStart;
node.children[char] = newNode;
return;
}
node = node.children[char];
}
}
// Search for pattern
search(pattern) {
let node = this.root;
let i = 0;
while (i < pattern.length) {
const char = pattern[i];
if (!node.children[char]) {
return false;
}
node = node.children[char];
const edgeLabel = this.getEdgeLabel(node);
for (let j = 0; j < edgeLabel.length && i < pattern.length; j++, i++) {
if (edgeLabel[j] !== pattern[i]) {
return false;
}
}
}
return true;
}
getEdgeLabel(node) {
if (node.start === -1) return '';
return this.text.substring(node.start, (node.end || 0) + 1);
}
// Find all occurrences of pattern
findOccurrences(pattern) {
const occurrences = [];
let node = this.root;
let i = 0;
// Navigate to pattern end
while (i < pattern.length && node) {
const char = pattern[i];
if (!node.children[char]) {
return occurrences;
}
node = node.children[char];
const edgeLabel = this.getEdgeLabel(node);
for (let j = 0; j < edgeLabel.length && i < pattern.length; j++, i++) {
if (edgeLabel[j] !== pattern[i]) {
return occurrences;
}
}
}
// Collect all suffix indices from this subtree
this.collectSuffixIndices(node, occurrences);
return occurrences;
}
collectSuffixIndices(node, result) {
if (node.suffixIndex !== -1) {
result.push(node.suffixIndex);
}
for (const child of Object.values(node.children)) {
this.collectSuffixIndices(child, result);
}
}
// Find longest repeated substring
findLongestRepeatedSubstring() {
let maxLength = 0;
let maxDepth = 0;
let result = '';
const dfs = (node, depth, label) => {
if (Object.keys(node.children).length > 0) {
if (depth > maxDepth) {
maxDepth = depth;
result = label;
}
for (const [char, child] of Object.entries(node.children)) {
const edgeLabel = this.getEdgeLabel(child);
dfs(child, depth + edgeLabel.length, label + edgeLabel);
}
}
};
dfs(this.root, 0, '');
return result.replace('$', '');
}
}
// Brug
const text = "banana";
const tree = new SuffixTree(text);
console.log('Search "ana":', tree.search('ana')); // true
console.log('Search "app":', tree.search('app')); // false
console.log('Occurrences of "ana":', tree.findOccurrences('ana')); // [1, 3]
console.log('Occurrences of "na":', tree.findOccurrences('na')); // [2, 4]
const text2 = "abcabcabc";
const tree2 = new SuffixTree(text2);
console.log('Longest repeated substring:', tree2.findLongestRepeatedSubstring()); // "abcabc"Anvendelsesområder
- •Pattern matching i store tekster
- •DNA/RNA sequence analysis i bioinformatics
- •Plagiarism detection
- •Data compression algorithms
- •Text editors med advanced search
Fordele
- ✓O(m) pattern matching for pattern af længde m
- ✓Kan finde alle occurrences samtidig
- ✓Løser mange string problems i linear tid
- ✓Powerful til bioinformatics
- ✓Understøtter advanced queries (longest repeat, etc)
Ulemper
- ✗Meget memory intensiv O(n) men med høj konstant
- ✗Kompleks at implementere korrekt
- ✗Ukkonen's algorithm er svær at forstå
- ✗Bygning tager tid selvom O(n)
- ✗Ofte erstattes med Suffix Arrays i praksis
Eksempler fra den virkelige verden
- •DNA sequence matching (BLAST algorithms)
- •Plagiarism detection software (Turnitin)
- •String search i text editors
- •Data deduplication i storage systems
- •Genome assembly i bioinformatics