Ukkonen's Algorithm

Ukkonen’s algorithm is used for constructing a suffix tree for a given string in linear time. Traditional algorithms for building suffix trees usually require quadratic time, but Ukkonen optimized this to O(n) time, where n is the length of the string. The algorithm builds the tree incrementally, adding one character at a time from the input string. This leads to significant computational savings.

Let’s implement Ukkonen’s algorithm for constructing a suffix tree in Java, C++, and Python. Note that this code is simplified and intended for educational purposes.


Java

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import java.util.HashMap;

class Node {
    HashMap<Character, Edge> edges = new HashMap<>();
}

class Edge {
    int start, end;
    Node nextNode;
}

public class UkkonenAlgorithm {
    public static void main(String[] args) {
        String str = "abcab$";
        Node root = new Node();
        Node activeNode = root;

        for (int i = 0; i < str.length(); i++) {
            char ch = str.charAt(i);
            Edge edge = new Edge();
            edge.start = i;
            edge.end = i + 1;
            edge.nextNode = new Node();
            activeNode.edges.put(ch, edge);
            activeNode = edge.nextNode;
        }
    }
}

C++

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
#include <iostream>
#include <unordered_map>
using namespace std;

struct Node {
    unordered_map<char, struct Edge> edges;
};

struct Edge {
    int start, end;
    Node* nextNode;
};

int main() {
    string str = "abcab$";
    Node* root = new Node;
    Node* activeNode = root;

    for (int i = 0; i < str.size(); i++) {
        char ch = str[i];
        Edge edge;
        edge.start = i;
        edge.end = i + 1;
        edge.nextNode = new Node;
        activeNode->edges[ch] = edge;
        activeNode = edge.nextNode;
    }
    return 0;
}

Python

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
class Node:
    def __init__(self):
        self.edges = {}

class Edge:
    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.nextNode = Node()

if __name__ == '__main__':
    str_ = "abcab$"
    root = Node()
    activeNode = root

    for i in range(len(str_)):
        ch = str_[i]
        edge = Edge(i, i + 1)
        activeNode.edges[ch] = edge
        activeNode = edge.nextNode

This is a simplified example, meant for educational purposes. In a real-world application, Ukkonen’s algorithm is more complex, especially when dealing with edge splits and suffix links.


Explanation:

  • Initial Setup: Start by initializing the root node. Each node will have edges leading to other nodes or leaf nodes. These edges are associated with substrings of the input string.

  • Active Point: Keep track of the active point, which consists of the active node, active edge, and active length. This helps in determining where the next insert should happen.

  • Extension: For each character in the string, extend the tree by one. This could mean extending a leaf edge by one character or adding a new leaf edge.

  • Splitting: If necessary, split a node. This involves creating a new internal node and adjusting the existing edges.

  • Suffix Link: Keep track of suffix links for quicker insertion of new characters.


How Changes Affect the Algorithm:

  • String Length: The longer the string, the more nodes and edges the algorithm will need to create. However, the algorithm will still run in linear time.

  • Alphabet Size: The larger the alphabet, the more edges will emanate from each node, increasing the space complexity but not affecting the time complexity.

Sample Use Case:

If the input string is “abcab”, the algorithm will first add ‘a’, then ‘b’, then ‘c’, then another ‘a’ and finally another ‘b’, each time updating the tree structure efficiently.

This algorithm is widely used in string matching, data compression, and bioinformatics for efficiently storing and querying large sequences.