Rabin-Karp String Search

May 20, 2023

The Rabin-Karp String Search algorithm is a string searching algorithm that finds all occurrences of a pattern string within a larger text by performing a hash function on both the pattern and the text. It is named after its inventors, Michael O. Rabin and Richard M. Karp, who developed the algorithm in 1987. The Rabin-Karp algorithm is known for its efficient running time, which is O(n+m) in the best and average cases, and O(nm) in the worst case scenario.

Purpose and Usage

The Rabin-Karp algorithm is used to search for a substring (pattern) in a larger string (text). The algorithm compares the hash value of the pattern with the hash values of all possible substrings in the text, in order to determine if there is a match. The search is performed in a rolling manner, where the hash values of the substrings are updated by subtracting the hash value of the first character in the previous substring and adding the hash value of the next character in the text. If the hash value of the pattern and the hash value of a substring match, then the algorithm performs a character-by-character comparison to confirm the match. The Rabin-Karp algorithm is particularly useful in cases where the pattern string is longer than the text string, or when multiple patterns need to be searched for in a given text.

History and Development

The Rabin-Karp algorithm was first introduced in 1987 by Michael O. Rabin and Richard M. Karp, who were both working at Harvard University at the time. The algorithm was based on previous work done by Rabin in the field of randomized algorithms. The Rabin-Karp algorithm was one of the first string searching algorithms to make use of hashing techniques, and has since inspired the development of other string matching algorithms such as the Boyer-Moore algorithm and the Knuth-Morris-Pratt algorithm.

Key Concepts and Principles

Hash Function

The Rabin-Karp algorithm relies on a hash function to compute the hash values of the pattern and the text. A hash function is a mathematical function that converts an input (in this case, a string) into a fixed-size output (in this case, an integer). The hash function used in the Rabin-Karp algorithm is designed to minimize collisions (i.e. the occurrence of different inputs producing the same output) and to be fast to compute.

Rolling Hash

The Rabin-Karp algorithm uses a rolling hash to compute the hash values of the substrings in the text. A rolling hash is a hash function that updates the hash value of a string by subtracting the hash value of the first character in the previous substring and adding the hash value of the next character in the text. This allows the algorithm to compute the hash value of all possible substrings in the text with a single pass through the text.

Pseudocode and Implementation Details

The following is pseudocode for the Rabin-Karp algorithm:

function RabinKarp(string text, string pattern)
    hash_pattern = hash(pattern)
    hash_text = hash(text[0:|pattern|])
    for i from 0 to |text|-|pattern| do
        if hash_pattern == hash_text then
            if text[i:i+|pattern|-1] == pattern then
                return i
        hash_text = update_hash(hash_text, text[i], text[i+|pattern|])
    return -1

The hash function computes the hash value of a string, and the update_hash function updates the hash value of a string given the previous hash value, the first character in the previous substring, and the next character in the text. The |pattern| notation represents the length of the pattern string.

Examples and Use Cases

Suppose we have a text string ABABABABA and a pattern string ABA. The Rabin-Karp algorithm would proceed as follows:

  1. Compute the hash value of the pattern string, which is hash(ABA) = 1.
  2. Compute the hash value of the first substring in the text, which is hash(ABA) = 1.
  3. Compare the hash value of the pattern and the substring. Since they match, perform a character-by-character comparison to confirm the match. The match is confirmed, and the algorithm returns the index of the first occurrence of the pattern in the text.
  4. Update the hash value of the next substring in the text using the rolling hash technique.
  5. Repeat steps 3 and 4 for all substrings in the text.

The Rabin-Karp algorithm is particularly useful in cases where the pattern string is longer than the text string, or when multiple patterns need to be searched for in a given text. For example, the Rabin-Karp algorithm can be used to search for plagiarism in a large corpus of text, by comparing the hash values of all possible substrings in the corpus with the hash values of a known plagiarized text.

Advantages and Disadvantages

The Rabin-Karp algorithm has several advantages and disadvantages:

Advantages

  1. The Rabin-Karp algorithm is easy to implement and understand.
  2. The algorithm has a relatively low space complexity, as it only requires the storage of the hash value of the pattern and the text.
  3. The Rabin-Karp algorithm is particularly useful in cases where the pattern string is longer than the text string, or when multiple patterns need to be searched for in a given text.

Disadvantages

  1. The Rabin-Karp algorithm has a worst-case time complexity of O(nm), where n is the length of the text and m is the length of the pattern. This occurs when all possible substrings in the text need to be compared to the pattern.
  2. The Rabin-Karp algorithm is vulnerable to hash collisions, which can decrease the accuracy of the search results.
  3. The hash function used in the algorithm can be time-consuming to compute for very long strings.

There are several related algorithms and variations of the Rabin-Karp algorithm:

Rolling Hash

The rolling hash technique used in the Rabin-Karp algorithm is also used in other string searching algorithms such as the Boyer-Moore algorithm and the Knuth-Morris-Pratt algorithm.

Monte Carlo Algorithm

The Monte Carlo algorithm is a variation of the Rabin-Karp algorithm that only performs a character-by-character comparison if the hash values of the pattern and a substring match. This reduces the computational complexity of the algorithm, but can lead to false positives due to hash collisions.

Las Vegas Algorithm

The Las Vegas algorithm is a variation of the Rabin-Karp algorithm that repeats the search process with different hash functions until a match is found. This ensures that the search is accurate, but can be computationally expensive.