# Z Algorithm

April 30, 2023

The Z Algorithm is a string matching algorithm used to find all occurrences of a pattern in a given text. It is named after its main data structure, the Z-array, and is widely known from Dan Gusfield's 1997 textbook *Algorithms on Strings, Trees, and Sequences*.

## Purpose and Usage

The Z Algorithm is primarily used for pattern matching in strings. It finds all occurrences of a pattern in O(n+m) time in the worst case, where n is the length of the text and m is the length of the pattern. This makes it asymptotically faster than the Naive algorithm, which has a worst-case time complexity of O(n*m).

## Brief History and Development

The algorithm is most often cited from Dan Gusfield's textbook *Algorithms on Strings, Trees, and Sequences* (1997), which presents the Z-array computation as a linear-time preprocessing step for exact matching. It has since become a common alternative to the Knuth-Morris-Pratt prefix function in teaching and competitive programming.

## Key Concepts and Principles

The Z Algorithm works by constructing a Z-array for a string `s`. For each position `i`, `z[i]` stores the length of the longest substring starting at `i` that matches a prefix of `s`. The first element is set to zero by convention here (some presentations set it to the length of `s` instead), and the rest of the elements are calculated using the following algorithm:

```
z[0] = 0
l = r = 0
for i from 1 to n-1:
    if i > r:
        l = r = i
        while r < n and s[r-l] == s[r]:
            r++
        z[i] = r-l
        r--
    else:
        k = i-l
        if z[k] < r-i+1:
            z[i] = z[k]
        else:
            l = i
            while r < n and s[r-l] == s[r]:
                r++
            z[i] = r-l
            r--
```

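Here, `s` is the string being processed (for pattern matching, the pattern followed by a separator character and then the text), `n` is the length of `s`, `l` and `r` are the left and right boundaries of the current Z-box, that is, the rightmost segment `s[l..r]` known to match a prefix of `s`, and `z` is the Z-array. Positions inside the Z-box reuse previously computed values instead of comparing characters again, and every successful comparison advances `r`, so the total number of comparisons is linear in `n`. The Z-array can be used to find all occurrences of the pattern in the text by searching for all indices `i` such that `z[i] = m`, where `m` is the length of the pattern. With a one-character separator, such an index `i` corresponds to position `i - m - 1` in the text.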
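To make the definition of `z[i]` concrete, here is a minimal brute-force sketch that computes the Z-array directly from the definition. The name `naive_z` is an assumption made for illustration; it runs in O(n^2) time and is meant as a reference for checking a linear-time implementation, not as a replacement for it.

```python
def naive_z(s):
    """Compute the Z-array straight from the definition (quadratic time).

    z[i] is the length of the longest substring starting at i that
    matches a prefix of s; z[0] is left as 0 by convention.
    """
    n = len(s)
    z = [0] * n
    for i in range(1, n):
        length = 0
        # Extend the match between s[0:] and s[i:] one character at a time.
        while i + length < n and s[length] == s[i + length]:
            length += 1
        z[i] = length
    return z


print(naive_z("aabxaab"))  # [0, 1, 0, 0, 3, 1, 0]
```

In `aabxaab`, the value 3 at index 4 records that `aab` at the end of the string matches the prefix `aab`.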

## Pseudocode and Implementation Details

The pseudocode for the Z Algorithm is given above. Here is an implementation in Python that follows it step by step:

```
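def z_algorithm(s):
    """Return the Z-array of s; z[0] is set to 0 by convention."""
    n = len(s)
    z = [0] * n
    # [l, r] is the rightmost Z-box found so far: s[l..r] matches s[0..r-l].
    l, r = 0, 0
    for i in range(1, n):
        if i > r:
            # i lies outside every known Z-box: compare from scratch.
            l = r = i
            while r < n and s[r - l] == s[r]:
                r += 1
            z[i] = r - l
            r -= 1
        else:
            # i lies inside the Z-box, so s[i..r] mirrors s[k..r-l].
            k = i - l
            if z[k] < r - i + 1:
                # The mirrored match ends inside the box: reuse it.
                z[i] = z[k]
            else:
                # The match may extend past r: keep comparing from r.
                l = i
                while r < n and s[r - l] == s[r]:
                    r += 1
                z[i] = r - l
                r -= 1
    return z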
```

This implementation takes a string `s` as input and returns its Z-array.

## Examples and Use Cases

Here are some examples and use cases of the Z Algorithm:

### Example 1: Finding all occurrences of a pattern in a text

Suppose we want to find all occurrences of the pattern `ab` in the text `abababbaba`. We concatenate the pattern and the text with a special character `$` that occurs in neither, giving us the string `ab$abababbaba`. We then apply the Z Algorithm to this string to get the Z-array `[0,0,0,2,0,2,0,2,0,0,2,0,1]`. The indices where the Z-array is equal to the length of the pattern, which is 2, are 3, 5, 7, and 10. Subtracting the 3 characters of `ab$` gives the 0-based positions in the text: the pattern occurs at indices 0, 2, 4, and 7.
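The search in this example can be sketched in a few lines. The helper names `z_array` and `find_occurrences` and the `sep` parameter are assumptions made for illustration; `z_array` is a compact variant of the Z-array computation that tracks a half-open window `[l, r)` rather than the closed Z-box used earlier, and it produces the same values.

```python
def z_array(s):
    """Linear-time Z-array (compact variant); z[0] is left as 0."""
    n = len(s)
    z = [0] * n
    l = r = 0  # s[l:r] is the rightmost segment known to match a prefix
    for i in range(1, n):
        if i < r:
            # Reuse the value mirrored inside the window, capped at its end.
            z[i] = min(r - i, z[i - l])
        # Extend the match by direct comparison.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def find_occurrences(pattern, text, sep="$"):
    """Return the 0-based start positions of pattern in text.

    Assumes sep occurs in neither pattern nor text.
    """
    if sep in pattern or sep in text:
        raise ValueError("separator must not occur in pattern or text")
    m = len(pattern)
    if m == 0:
        return []
    z = z_array(pattern + sep + text)
    offset = m + len(sep)  # where the text begins in the combined string
    return [i - offset for i in range(offset, len(z)) if z[i] == m]


print(z_array("ab$abababbaba"))  # [0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 0, 1]
print(find_occurrences("ab", "abababbaba"))  # [0, 2, 4, 7]
```

The separator guarantees that no Z value in the text part exceeds the pattern length, so `z[i] == m` identifies exactly the full matches.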

### Example 2: Finding the longest repeated substring

Suppose we want to find the longest repeated substring in the string `ababababa`. Applying the Z Algorithm to the string itself gives the Z-array `[0,0,7,0,5,0,3,0,1]`. The highest value is 7, at index 2, which means the prefix `abababa` occurs again starting at index 2 (the two occurrences overlap). For this string, that prefix is also the longest repeated substring. In general, a single Z-array only finds repeats of a prefix, so a full search has to apply the same idea to every suffix, which takes O(n^2) time, or use a suffix-based data structure instead.
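As a rough sketch of this use, the code below finds the longest prefix that reappears later in the string, and then extends the idea to arbitrary substrings by running a Z-array on every suffix. The function names are assumptions made for illustration, overlapping occurrences are counted as repeats, and the suffix-by-suffix search is quadratic, so it is only suitable for short strings.

```python
def z_array(s):
    """Linear-time Z-array; z[0] is left as 0 so it never counts as a repeat."""
    n = len(s)
    z = [0] * n
    l = r = 0  # s[l:r] is the rightmost segment known to match a prefix
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def longest_repeated_prefix(s):
    """Longest prefix of s that also starts at some later position."""
    best = max(z_array(s), default=0)
    return s[:best]


def longest_repeated_substring(s):
    """Longest substring occurring at least twice (overlaps allowed).

    Runs a Z-array on every suffix, so the total cost is O(n^2).
    """
    best = ""
    for start in range(len(s)):
        suffix = s[start:]
        length = max(z_array(suffix), default=0)
        if length > len(best):
            best = suffix[:length]
    return best


print(longest_repeated_prefix("ababababa"))     # abababa
print(longest_repeated_substring("xabcyabc"))   # abc
```

The second call shows why the per-suffix loop is needed: `abc` is repeated in `xabcyabc`, but it is not a prefix, so a single Z-array of the whole string would not find it.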

### Use Case: DNA sequencing

The Z Algorithm can be used in DNA sequence analysis to find exact matches between a query sequence, such as a short read or motif, and a reference sequence. Because it needs no precomputed index over the reference, it suits one-off searches; for many queries against a whole genome, index-based methods are typically preferred.

## Advantages and Disadvantages

The advantages of the Z Algorithm are:

- It has a worst-case time complexity of O(n+m), compared with O(n*m) for the Naive algorithm.
- It can be used to find all occurrences of a pattern in a text.

The disadvantages of the Z Algorithm are:

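- It requires O(n+m) additional memory to store the Z-array of the concatenated string.
- For very long texts this extra memory can become significant, unless the implementation stores Z values for the pattern only and scans the text without concatenation.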
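The memory cost listed above can be reduced to O(m) by computing the Z-array of the pattern only and then scanning the text with the same window technique. The following is a sketch under that assumption; the function names are hypothetical, and the code is one possible arrangement rather than a canonical form of the algorithm.

```python
def z_array(s):
    """Linear-time Z-array; z[0] is left as 0 by convention."""
    n = len(s)
    z = [0] * n
    l = r = 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def find_occurrences_low_memory(pattern, text):
    """Find pattern in text using only O(len(pattern)) extra memory.

    No concatenated string is built; only the pattern's Z-array is stored.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    z = z_array(pattern)
    hits = []
    l = r = 0  # text[l:r] is known to equal pattern[0:r-l]
    for i in range(n):
        k = 0  # length of the match between text[i:] and the pattern
        if i < r:
            # text[i:r] equals pattern[i-l:r-l], so reuse the pattern's Z value.
            k = min(r - i, z[i - l])
        while k < m and i + k < n and pattern[k] == text[i + k]:
            k += 1
        if i + k > r:
            l, r = i, i + k
        if k == m:
            hits.append(i)
    return hits


print(find_occurrences_low_memory("ab", "abababbaba"))  # [0, 2, 4, 7]
```

Because the match length is capped at `m`, no separator character is needed in this variant.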

## Related Algorithms or Variations

The Z-array is closely related to the prefix function used by the Knuth-Morris-Pratt algorithm: both describe how prefixes of a string recur within it, and either can be computed from the other in linear time. The term Z-box refers to the interval `[l, r]` maintained while computing the Z-array rather than to a separate algorithm. Other related algorithms include the Boyer-Moore algorithm, which is also used for exact string matching and can skip over parts of the text.