Z Algorithm

April 30, 2023

The Z Algorithm is a string matching algorithm used to find all occurrences of a pattern in a given text. It is named after its main data structure, the Z-array, and is widely known from Dan Gusfield's 1997 textbook Algorithms on Strings, Trees, and Sequences.

Purpose and Usage

The Z Algorithm is primarily used for pattern matching in strings. It is a linear time algorithm: it finds all occurrences of a pattern in O(n+m) time, where n is the length of the text and m is the length of the pattern. This makes it more efficient in the worst case than the Naive algorithm, which has a time complexity of O(n*m).
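For comparison, the Naive algorithm mentioned above can be sketched in a few lines. The function name naive_search is a hypothetical choice for illustration. Each of the roughly n alignments may cost up to m character comparisons, which is where the O(n*m) worst case comes from.

```python
def naive_search(pattern, text):
    """Check every alignment of pattern against text: O(n*m) in the worst case."""
    n, m = len(text), len(pattern)
    # text[i:i + m] == pattern compares up to m characters per alignment.
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


print(naive_search("ab", "abababbaba"))  # [0, 2, 4, 7]
```

Inputs such as a text of repeated a characters and a pattern of several a characters followed by b trigger the worst case, because almost every alignment matches nearly to the end before failing.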

Brief History and Development

The Z Algorithm is usually attributed to Dan Gusfield, who presented it as the "fundamental preprocessing" of a string in his 1997 textbook Algorithms on Strings, Trees, and Sequences and used it to obtain a simple linear-time exact matching method. The same idea of reusing previously matched prefixes underlies the earlier Knuth-Morris-Pratt algorithm, published in 1977.

Key Concepts and Principles

The Z Algorithm works by constructing a Z-array for a string s. The Z-array is an array of integers in which z[i] stores the length of the longest substring starting at position i that matches a prefix of s. The first element of the Z-array is set to zero by convention (the whole string trivially matches itself), and the rest of the elements are calculated using the following algorithm:

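z[0] = 0
l = r = 0
for i from 1 to n-1:
  if i > r:
    // i is outside the current Z-box: compare from scratch
    l = r = i
    while r < n and s[r-l] == s[r]:
      r++
    z[i] = r-l
    r--
  else:
    // i is inside the current Z-box: reuse the value at mirror position k
    k = i-l
    if z[k] < r-i+1:
      z[i] = z[k]
    else:
      l = i
      while r < n and s[r-l] == s[r]:
        r++
      z[i] = r-l
      r--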

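Here, s is the string the Z-array is computed for; for pattern matching it is the pattern, a separator character that occurs in neither input, and the text, concatenated in that order. n is the length of s, z is the Z-array, and l and r keep track of the boundaries of the current Z-box, the rightmost segment s[l..r] found so far that matches a prefix of s. Positions inside the Z-box reuse values already computed for the prefix, and every successful character comparison moves r to the right, so the total work is linear in n. The Z-array can be used to find all occurrences of the pattern in the text by searching for all indices i such that z[i] = m, where m is the length of the pattern; each such index i corresponds to position i - m - 1 in the text.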
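To make the meaning of the Z-array concrete, here is a minimal sketch that computes it directly from the definition, without the l and r bookkeeping. The name z_naive is a hypothetical choice, and this version takes quadratic time in the worst case; it is meant as a reference for what the values mean, not as a replacement for the linear-time algorithm.

```python
def z_naive(s):
    """Z-array straight from the definition, in O(n^2) time."""
    n = len(s)
    z = [0] * n  # z[0] is left at 0 by convention
    for i in range(1, n):
        # Extend the match between s[i:] and the prefix of s one character at a time.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
    return z


print(z_naive("aabxaab"))  # [0, 1, 0, 0, 3, 1, 0]
```

In the output, z[4] = 3 because the substring aab starting at index 4 matches the first three characters of the string.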

Pseudocode and Implementation Details

The pseudocode for the Z Algorithm is given above. The implementation details depend on the programming language being used. Here is an implementation of the Z Algorithm in Python:

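def z_algorithm(s):
    """Return the Z-array of s, with z[0] set to 0 by convention."""
    n = len(s)
    z = [0] * n
    # l and r bound the rightmost Z-box found so far (s[l..r] matches a prefix of s).
    l, r = 0, 0
    for i in range(1, n):
        if i > r:
            # i lies outside every known Z-box, so compare from scratch.
            l = r = i
            while r < n and s[r-l] == s[r]:
                r += 1
            z[i] = r-l
            r -= 1
        else:
            # i lies inside the current Z-box; k is its mirror position in the prefix.
            k = i-l
            if z[k] < r-i+1:
                # The match at k ends before the box does, so it carries over unchanged.
                z[i] = z[k]
            else:
                # The match reaches the end of the box; keep comparing from r onward.
                l = i
                while r < n and s[r-l] == s[r]:
                    r += 1
                z[i] = r-l
                r -= 1
    return z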

This implementation takes a string s as input and returns its Z-array.

Examples and Use Cases

Here are some examples and use cases of the Z Algorithm:

Example 1: Finding all occurrences of a pattern in a text

Suppose we want to find all occurrences of the pattern ab in the text abababbaba. We concatenate the pattern and the text with a special character $ that occurs in neither, giving us the string ab$abababbaba. We then apply the Z Algorithm to this string to get the Z-array [0,0,0,2,0,2,0,2,0,0,2,0,1]. The indices where the Z-array is equal to the length of the pattern, which is 2, are 3, 5, 7, and 10. Subtracting the offset of 3 (the pattern length plus one for the separator) shows that the pattern occurs at indices 0, 2, 4, and 7 in the text.
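The search in Example 1 can be sketched as follows. The names z_array and find_occurrences are hypothetical, z_array is a compact variant of the Z-array computation rather than a copy of the implementation above, and the sketch assumes the separator character does not occur in the pattern or the text.

```python
def z_array(s):
    """Z-array of s in linear time, with z[0] = 0 by convention."""
    n = len(s)
    z = [0] * n
    l = r = 0  # s[l:r] is the rightmost segment known to match a prefix of s
    for i in range(1, n):
        if i < r:
            # Reuse the value at the mirror position, capped at the segment end.
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def find_occurrences(pattern, text, sep="$"):
    """Return the 0-based start positions of pattern in text."""
    if not pattern:
        return []
    if sep in pattern or sep in text:
        raise ValueError("separator must not occur in pattern or text")
    m = len(pattern)
    z = z_array(pattern + sep + text)
    # Index i in the combined string corresponds to text position i - (m + 1).
    return [i - (m + 1) for i in range(m + 1, len(z)) if z[i] == m]


print(find_occurrences("ab", "abababbaba"))  # [0, 2, 4, 7]
```

The separator guarantees that no Z-value can exceed the pattern length, so z[i] == m is an exact test for a full match.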

Example 2: Finding the longest repeated substring

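Suppose we want to find the longest repeated substring in the string ababababa. The Z-array of a string directly gives the longest prefix that occurs again later in the string, so we apply the Z Algorithm to ababababa itself and get the Z-array [0,0,7,0,5,0,3,0,1]. The highest value in the Z-array is 7, at index 2, which means the prefix abababa of length 7 occurs again starting at index 2 (the two occurrences overlap). For this string that prefix is also the longest repeated substring overall. In general, a repeated substring need not be a prefix, so the same step has to be applied to every suffix of the string, which takes quadratic time in total.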
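One way to sketch this search is to compute the Z-array of every suffix and keep the largest value seen. The helper names below are hypothetical, overlapping occurrences are counted as repeats, and the approach takes O(n^2) time, so it is an illustration rather than an efficient method.

```python
def z_array(s):
    """Z-array of s in linear time, with z[0] = 0 by convention."""
    n = len(s)
    z = [0] * n
    l = r = 0  # s[l:r] is the rightmost segment known to match a prefix of s
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def longest_repeated_substring(s):
    """Longest substring occurring at least twice (occurrences may overlap)."""
    best = ""
    for start in range(len(s)):
        suffix = s[start:]
        # max(z) is the longest prefix of this suffix that reappears later in it.
        longest = max(z_array(suffix), default=0)
        if longest > len(best):
            best = suffix[:longest]
    return best


print(longest_repeated_substring("ababababa"))  # abababa
print(longest_repeated_substring("banana"))     # ana
```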

Use Case: DNA sequencing

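The Z Algorithm can be used in DNA sequencing to find exact matches between a DNA sequence and a reference genome, since its running time is linear in the combined length of the two. It is a reasonable fit for one-off searches for a short sequence; when many sequences are matched against the same genome, index-based methods such as suffix arrays are typically preferred because they avoid rescanning the reference for every query.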

Advantages and Disadvantages

The advantages of the Z Algorithm are:

  • It has a time complexity of O(n+m), which is faster than the O(n*m) worst case of the Naive algorithm.
  • It can be used to find all occurrences of a pattern in a text.

The disadvantages of the Z Algorithm are:

  • It requires O(n+m) additional memory to store the concatenated string and its Z-array.
  • It may not be suitable for very long patterns or texts, as the Z-array is as long as the pattern and text combined and can become very large.
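The memory cost can be reduced. One possible approach, sketched below under stated assumptions, computes the Z-array of the pattern only and then scans the text with the same window logic, so the extra memory is proportional to the pattern length and no separator is needed. The function names are hypothetical, and this is an illustration rather than a tuned implementation.

```python
def z_array(s):
    """Z-array of s in linear time, with z[0] = 0 by convention."""
    n = len(s)
    z = [0] * n
    l = r = 0  # s[l:r] is the rightmost segment known to match a prefix of s
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def find_occurrences_low_memory(pattern, text):
    """Yield start positions of pattern in text using O(m) extra memory."""
    m = len(pattern)
    if m == 0:
        return
    zp = z_array(pattern)
    zp[0] = m  # the whole pattern matches itself
    l = r = 0  # text[l:r] is known to equal pattern[:r - l]
    for i in range(len(text)):
        k = 0  # length of the match between text[i:] and the pattern
        if i < r:
            # Reuse what the pattern's own Z-array says about this offset.
            k = min(r - i, zp[i - l])
        while k < m and i + k < len(text) and pattern[k] == text[i + k]:
            k += 1
        if i + k > r:
            l, r = i, i + k
        if k == m:
            yield i


print(list(find_occurrences_low_memory("ab", "abababbaba")))  # [0, 2, 4, 7]
```

Because the function is a generator, matches are reported as the scan proceeds and no array as long as the text is ever allocated.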

The term Z-box refers to the interval between l and r that the algorithm maintains, that is, a segment of the string known to match a prefix; it is part of the Z Algorithm itself rather than a separate algorithm. Other related algorithms include the Knuth-Morris-Pratt algorithm and the Boyer-Moore algorithm, which are also used for string matching. The Z-array carries the same information as the prefix (failure) function used by Knuth-Morris-Pratt, and either can be derived from the other in linear time.