Ukkonen’s Suffix Tree Construction

April 30, 2023

Ukkonen’s algorithm is used for constructing a suffix tree from a given input string. A suffix tree is a data structure that allows for efficient searching of substrings within a given string. Suffix trees have many applications, including but not limited to: string searching, pattern matching, and bioinformatics.

Brief History and Development

Ukkonen’s algorithm was developed by Esko Ukkonen in 1995. Prior to Ukkonen’s algorithm, suffix trees were constructed using quadratic time algorithms, making them inefficient for large datasets. Ukkonen’s algorithm has a linear time complexity of O(n), making it more efficient and practical for real-world applications.

Key Concepts and Principles

Suffix Tree

A suffix tree is a tree-like data structure that stores all the suffixes of a given string. The suffix tree for a string S is defined as follows:

  1. The root node represents the empty string.
  2. Each non-root node represents a non-empty substring of S.
  3. Each edge represents a single character in S.
  4. The concatenation of the characters along any path from the root to a leaf node represents a suffix of S.

End Position

The end position of a substring in a suffix tree is defined as the index of the last character in the substring. For example, if the substring “abc” appears in a string at index 5, then the end position of “abc” is 7 (the index of “c” in the string).

Active Point

The active point is a tuple (active node, active edge, active length) that represents the current state of the algorithm while constructing the suffix tree. Initially, the active point is defined as (root, null, 0), where “root” is the root node of the suffix tree, “null” represents no edge, and “0” represents an empty string.

Extension Rule

The extension rule is the key principle used by Ukkonen’s algorithm for constructing the suffix tree. The extension rule states that if the active edge ends at a leaf node, and the next character in the input string is not already in the tree, then a new leaf node is added to the tree with a new edge representing the next character in the input string.

Pseudocode and Implementation Details

def ukkonen_suffix_tree(string):
    # initialize suffix tree with root node
    tree = {0:{}}
    # initialize active point
    active_node = 0
    active_edge = None
    active_length = 0
    remainder = 0
    # iterate through each character in the string
    for i, c in enumerate(string):
        # update active point and remainder
        last_node = None
        while remainder >= 0:
            if active_edge is None:
                active_edge = c
            if active_edge not in tree[active_node]:
                # create new leaf node and add to tree
                tree[active_node][active_edge] = len(tree)
                tree[len(tree)] = {}
                # update suffix link from previous non-root node to current node
                if last_node is not None:
                    tree[last_node]["suffix_link"] = active_node
                    last_node = None
            next_node = tree[active_node][active_edge]
            next_char = string[next_node]
            if active_length < len(next_char):
                # move to next node and update active point
                break_edge = next_char[active_length]
                tree[next_node][break_edge] = len(tree)
                tree[len(tree)] = {c:len(string)-1}
                # update suffix link from previous non-root node to new node
                if last_node is not None:
                    tree[last_node]["suffix_link"] = next_node
                last_node = next_node
                # update active point and remainder
                active_node = tree[active_node]["suffix_link"] if active_node > 0 else 0
                active_length -= 1
                active_edge = string[i - remainder + 1]
            elif c == next_char[active_length]:
                # update active point and remainder
                if last_node is not None:
                    tree[last_node]["suffix_link"] = active_node
                last_node = None
                active_length += 1
                break
            else:
                # create new internal node and add to tree
                split_node = tree[active_node][active_edge]
                split_char = next_char[active_length]
                tree[split_node][split_char] = len(tree)
                tree[len(tree)] = {next_char[active_length:]:next_node, c:len(string)-1}
                # update suffix link from previous non-root node to new node
                if last_node is not None:
                    tree[last_node]["suffix_link"] = split_node
                last_node = split_node
                # update active point and remainder
                active_node = tree[split_node]["suffix_link"] if active_node > 0 else 0
                active_length -= 1
                active_edge = string[i - remainder + 1]
            remainder -= 1
        remainder += 1
    # return suffix tree
    return tree

The above pseudocode implements Ukkonen’s algorithm for constructing a suffix tree from a given string. The input string is iterated through character by character, and the active point and remainder are updated according to the extension rule. The suffix tree is stored in a dictionary format, where each key corresponds to a node in the tree, and the value is another dictionary containing the edges of the node.

Examples and Use Cases

Ukkonen’s algorithm can be used to efficiently search for substrings within a given string. For example, suppose we have the string “banana” and we want to search for all occurrences of the substring “ana”. We can construct the suffix tree for “banana” using Ukkonen’s algorithm, and then search for the substring “ana” by following the path in the tree corresponding to “ana”. This can be done in linear time, making it much more efficient than brute-force string searching algorithms.

Another use case for suffix trees is in bioinformatics, where they can be used for genome assembly and sequence alignment.

Advantages and Disadvantages

The main advantage of Ukkonen’s algorithm is its linear time complexity, which makes it much more efficient for constructing suffix trees than previous quadratic time algorithms. Additionally, Ukkonen’s algorithm is easy to implement and can handle large datasets.

One disadvantage of Ukkonen’s algorithm is its complexity, which can make it difficult to understand and debug. Additionally, the algorithm requires a lot of memory for constructing large suffix trees, which can be a limitation on resource-constrained systems.

Several variations of Ukkonen’s algorithm have been proposed, including the McCreight algorithm and the Weiner algorithm. These algorithms make use of different data structures or heuristics to improve the efficiency of constructing suffix trees. Additionally, there are several algorithms for constructing compressed suffix trees, which are a more space-efficient version of suffix trees.