Published on : Apr 19, 2026

What is KMP Algorithm for Pattern Searching?

Understanding KMP algorithm and LPS array clearly

6 Minutes Read

Why the KMP Algorithm Changed Pattern Searching Forever

You are building a search feature for a text editor. A user types “Ctrl+F” and searches for the word “AAAB” in a document containing a million A characters followed by a B. The naive search checks “AAAB” against every position in the text, doing 4 comparisons per position for nearly one million positions. That is close to 4 million comparisons for one search.

The KMP algorithm does it in one pass: approximately one million comparisons total, regardless of how repetitive the pattern is.

That improvement does not sound dramatic until you apply it to DNA sequence matching. Human DNA is about 3 billion base pairs long. Pattern matching in bioinformatics tools searches for gene sequences of hundreds or thousands of characters across this entire string. Naive search at O(n × m) would be impossibly slow. KMP at O(n + m) makes it practical.

The same logic applies to text editors (Find and Replace), search engines indexing web pages, plagiarism detection tools comparing documents, and network intrusion detection systems scanning packets for attack signatures.

KMP (Knuth-Morris-Pratt) was published in 1977 by Donald Knuth, James Morris, and Vaughan Pratt. It was the first linear-time string matching algorithm. Understanding it teaches you one of the most elegant ideas in computer science: how to reuse previously matched information instead of throwing it away.

What is the KMP Algorithm?

The KMP Algorithm (Knuth-Morris-Pratt) is a string pattern searching algorithm that finds all occurrences of a pattern string (called pat) within a text string (called txt) in O(n + m) time, where n is the length of the text and m is the length of the pattern.

Break that apart:

Pattern searching: Given a text like "aabaacaadaabaaba" and a pattern like "aaba", find every starting index where pat occurs in txt
O(n + m) time: KMP preprocesses the pattern in O(m) time, then scans the text in O(n) time. The naive approach takes O(n × m) in the worst case

No backtracking on text: The key insight is that KMP never moves the text pointer backward. Each character in the text is compared at most once

Reuses match information: When a mismatch occurs, KMP uses a precomputed table (the LPS array) to skip redundant comparisons rather than restarting from scratch

How KMP Works Under the Hood

KMP has two phases:

Phase 1: PREPROCESSING (O(m))
  Build the LPS array from the pattern.
  LPS = Longest Proper Prefix which is also a Suffix.
  This tells KMP: "If a mismatch occurs at position j in the pattern,
  how far back in the pattern do we need to go?"

Phase 2: SEARCHING (O(n))
  Scan the text left to right (text pointer NEVER goes backward).
  When characters match, advance both text pointer (i) and pattern pointer (j).
  When mismatch occurs at pattern position j:
    If j > 0: jump j to lps[j-1] (use precomputed table, do NOT restart)
    If j == 0: advance text pointer i by 1
  When j reaches m (full match found): record position, jump j to lps[j-1]

Total: O(m) preprocessing + O(n) search = O(n + m)

Why KMP Never Backtracks on the Text

Naive: mismatch at position i in text → restart text from i-j+1
       Text pointer moves BACKWARD. Characters re-examined.

KMP:   mismatch at position i in text → text pointer stays at i
       Pattern pointer jumps to lps[j-1]
       Text pointer moves FORWARD only

This is the guarantee that gives O(n) search time.
Each character in the text is compared at most twice: O(n) total.

The Naive Approach (Why KMP Exists)

Before understanding KMP, understand what it replaces.

Naive Pattern Search

txt = "AAAAAAB"
pat = "AAAB"

Position 0: A A A A A A B
            A A A B          ← mismatch at index 3 (text B vs pattern B? No: A vs B)
            Compare 4 chars. FAIL. Restart from position 1.

Position 1:   A A A A A B
              A A A B        ← compare 4 chars. FAIL. Restart from position 2.

Position 2:     A A A A B
                A A A B      ← compare 4 chars. FAIL. Restart from position 3.

Position 3:       A A A B
                  A A A B    ← MATCH at position 3!

Total comparisons: 4 + 4 + 4 + 4 = 16
For n=1,000,000 A's and pattern of 999 A's + B:
  ~1,000 comparisons × ~1,000,000 positions = 1 BILLION comparisons

KMP on the Same Input

txt = "AAAAAAB"
pat = "AAAB"
lps = [0, 1, 2, 0]

Start: i=0, j=0
  A==A → match: i=1, j=1
  A==A → match: i=2, j=2
  A==A → match: i=3, j=3
  A≠B  → mismatch at j=3. j = lps[2] = 2 (don't restart!)
  A==A → match: i=4, j=3
  A≠B  → mismatch at j=3. j = lps[2] = 2
  A==A → match: i=5, j=3
  A≠B  → mismatch. j = lps[2] = 2
  B≠A  → mismatch at j=2. j = lps[1] = 1
  B≠A  → mismatch at j=1. j = lps[0] = 0
  B≠A  → mismatch at j=0. i++

Total comparisons: ~13 (not 16, and scales to ~n+m for large inputs)

The LPS Array: The Key to KMP

The LPS (Longest Proper Prefix which is also a Suffix) array is the precomputed table that makes KMP efficient.

For each position i in the pattern, lps[i] = length of the longest proper prefix of pat[0..i] that is also a suffix of pat[0..i].

A proper prefix does not include the full string. “ab” is a proper prefix of “abc”. “abc” is not a proper prefix of “abc”.

Pattern: "aabaaac"

Position 0: "a"       → No proper prefix/suffix     → lps[0] = 0
Position 1: "aa"      → "a" is both prefix & suffix  → lps[1] = 1
Position 2: "aab"     → No match                     → lps[2] = 0
Position 3: "aaba"    → "a" is both prefix & suffix  → lps[3] = 1
Position 4: "aabaa"   → "aa" is both prefix & suffix → lps[4] = 2
Position 5: "aabaaa"  → "aa" is both prefix & suffix → lps[5] = 2
Position 6: "aabaaac" → mismatch, reset              → lps[6] = 0

Final lps[] = [0, 1, 0, 1, 2, 2, 0]

More LPS Examples

Pattern "AAAA":       lps = [0, 1, 2, 3]
Pattern "ABCDE":      lps = [0, 0, 0, 0, 0]  (no repetition)
Pattern "AABAACAABAA":lps = [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
Pattern "ABCDABD":    lps = [0, 0, 0, 0, 1, 2, 0]

What lps[i] Tells KMP

When mismatch occurs at pattern index j:
  If j > 0: set j = lps[j-1]   ← jump back, NOT to 0
  If j == 0: advance text pointer

WHY this works:
  If we matched pat[0..j-1] before the mismatch,
  and lps[j-1] = k, then:
  pat[0..k-1] == pat[j-k..j-1]   (by definition of LPS)
  These k characters already matched in the text.
  So we do not need to recheck them.
  Resume matching from pat[k] instead of pat[0].

Building the LPS Array (Step by Step)

Algorithm: computeLPS(pattern)
  lps[0] = 0     (always: single char has no proper prefix)
  len = 0        (length of current matching prefix-suffix)
  i   = 1        (start from second character)

  WHILE i < m:
    CASE 1: pat[i] == pat[len]
      len++
      lps[i] = len
      i++

    CASE 2: pat[i] != pat[len] AND len == 0
      lps[i] = 0
      i++

    CASE 3: pat[i] != pat[len] AND len > 0
      len = lps[len-1]   ← try shorter prefix (DO NOT increment i)

Trace: Build LPS for “AABAACAABAA”

pat = A A B A A C A A B A A
idx = 0 1 2 3 4 5 6 7 8 9 10

i=1: A==A(len=0) → len=1, lps[1]=1, i=2
i=2: B≠A(len=1) → len=lps[0]=0
     B≠A(len=0) → lps[2]=0, i=3
i=3: A==A(len=0) → len=1, lps[3]=1, i=4
i=4: A==A(len=1) → len=2, lps[4]=2, i=5
i=5: C≠B(len=2) → len=lps[1]=1
     C≠A(len=1) → len=lps[0]=0
     C≠A(len=0) → lps[5]=0, i=6
i=6: A==A(len=0) → len=1, lps[6]=1, i=7
i=7: A==A(len=1) → len=2, lps[7]=2, i=8
i=8: B==B(len=2) → len=3, lps[8]=3, i=9
i=9: A==A(len=3) → len=4, lps[9]=4, i=10
i=10:A==A(len=4) → len=5, lps[10]=5, i=11

Final lps[] = [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]

KMP Search Algorithm (Step by Step)

Algorithm: KMPSearch(txt, pat)
  Build lps[] for pat
  i = 0   (text pointer)
  j = 0   (pattern pointer)

  WHILE i < n:
    IF txt[i] == pat[j]:   ← characters match
      i++, j++

    IF j == m:             ← full match found!
      RECORD position (i - j)
      j = lps[j-1]         ← look for more matches

    ELSE IF txt[i] != pat[j]:   ← mismatch
      IF j > 0:
        j = lps[j-1]       ← jump pattern pointer (text stays)
      ELSE:
        i++                ← advance text pointer only

Full KMP Trace

txt = "AABAACAABAA"
pat = "AABA"
lps = [0, 1, 0, 1]

i=0,j=0: A==A → i=1,j=1
i=1,j=1: A==A → i=2,j=2
i=2,j=2: B==B → i=3,j=3
i=3,j=3: A==A → i=4,j=4
j==m=4: MATCH at position i-j = 0! j = lps[3] = 1

i=4,j=1: A==A → i=5,j=2
i=5,j=2: C≠B → j=lps[1]=1 (text stays at 5)
i=5,j=1: C≠A → j=lps[0]=0
i=5,j=0: C≠A → i=6

i=6,j=0: A==A → i=7,j=1
i=7,j=1: A==A → i=8,j=2
i=8,j=2: B==B → i=9,j=3
i=9,j=3: A==A → i=10,j=4
j==m=4: MATCH at position i-j = 6! j = lps[3] = 1

i=10,j=1: A==A → i=11,j=2
i=11: loop ends (i == n)

Matches found at indices: [0, 6]

KMP Implementation in All Four Languages

def compute_lps(pat):
    """Build LPS array for pattern. O(m) time and space."""
    m   = len(pat)
    lps = [0] * m
    length = 0   # length of previous longest prefix-suffix
    i = 1

    while i < m:
        if pat[i] == pat[length]:
            length += 1
            lps[i]  = length
            i += 1
        elif length != 0:
            length = lps[length - 1]   # Try shorter prefix (don't advance i)
        else:
            lps[i] = 0
            i += 1

    return lps

def kmp_search(txt, pat):
    """
    KMP pattern search.
    Returns all starting indices where pat occurs in txt.
    Time: O(n + m). Space: O(m) for LPS array.
    """
    n, m   = len(txt), len(pat)
    lps    = compute_lps(pat)
    result = []
    i = j  = 0   # i = text pointer, j = pattern pointer

    while i < n:
        if txt[i] == pat[j]:
            i += 1
            j += 1

        if j == m:              # Full match found
            result.append(i - j)
            j = lps[j - 1]     # Look for next match

        elif i < n and txt[i] != pat[j]:
            if j != 0:
                j = lps[j - 1]  # Jump pattern pointer (text pointer stays)
            else:
                i += 1          # Advance text pointer

    return result

# Test cases
print(kmp_search("aabaacaadaabaaba", "aaba"))   # [0, 9, 12]
print(kmp_search("abcab", "ab"))                # [0, 3]
print(kmp_search("AAAAAAAAB", "AAAAB"))         # [5]

Complexity Analysis

Time Complexity

Phase 1: Build LPS array
  The outer loop runs m times (once per character of pattern).
  The inner backtrack step (len = lps[len-1]) can run at most m times total.
  Total LPS construction: O(m)

Phase 2: KMP Search
  Text pointer i only moves FORWARD (never backward).
  i goes from 0 to n: O(n) increments max.
  Pattern pointer j can decrease via lps but total decreases
  cannot exceed total increases, which is bounded by n.
  Total search: O(n)

TOTAL: O(m) + O(n) = O(n + m)

Compare to Naive: O(n × m) worst case
For n=1,000,000 and m=1,000:
  Naive:  1,000,000,000 comparisons (1 billion)
  KMP:    1,001,000 comparisons
  Speedup: ~1000x

Space Complexity

LPS array: O(m)           ← only the pattern length
No extra space for text   ← text is scanned in-place

Total Space: O(m)

KMP vs Naive Pattern Search

Feature	Naive Search	KMP Algorithm
Time (best case)	O(n)	O(n + m)
Time (worst case)	O(n × m)	O(n + m)
Space	O(1)	O(m) for LPS array
Text pointer	Can move backward	Moves forward only
Preprocessing	None	Build LPS array O(m)
Handles repetitive patterns	Very poorly	Excellently
Implementation complexity	Simple	Moderate
Best for	Short patterns or small text	Long patterns with repetition

Worst Case Comparison

txt = "AAAAAAAAAAAB"  (n A's followed by B)
pat = "AAAAB"         (m-1 A's followed by B)

Naive:
  At every position, matches m-1 characters then fails on B.
  n positions × m comparisons = O(n × m)
  1,000,000 positions × 1,000 pattern length = 1 billion ops

KMP:
  Text pointer never goes back.
  Total comparisons ≤ 2n = O(n)
  1,000,000 text length → ~2,000,000 ops max

Real difference: 500x faster for this worst case.

Real-World Applications

Text Editor Find and Replace: Every “Find” operation in VS Code, Notepad++, and Word scans potentially large documents for a pattern. KMP ensures the search runs in O(n + m) even for pathological inputs like searching “aaa…aaab” in a file of a million a characters.
DNA Sequence Matching: Bioinformatics tools like BLAST search for gene sequences inside 3-billion-character genomes. The pattern (a gene sequence) may be hundreds of characters long with repetitive base pairs (e.g., “AAATAAATAAATAAAC”). KMP handles these repetitive patterns that would destroy naive search performance.
Plagiarism Detection: Plagiarism checkers compare submitted documents against millions of stored texts. They search for verbatim sentence matches (patterns) inside large corpora (text). KMP’s linear time is essential for making this feasible at scale.
Network Intrusion Detection: Security tools like Snort scan every network packet for known attack signatures. A signature might be a 50-100 character byte sequence. KMP scans each packet in O(packet_size + signature_size) rather than the much slower naive O(packet_size × signature_size).

# Practical example: Find all occurrences of a DNA motif
def find_gene_motif(genome, motif):
    """
    Search for a genetic motif in a genome sequence.
    KMP handles the highly repetitive nature of DNA efficiently.

    genome: full DNA string (e.g., 3 billion chars for human genome)
    motif:  target gene sequence (e.g., "ATCGATCGAT")
    Returns: list of starting positions of motif in genome
    """
    return kmp_search(genome, motif)

# Simulated example
genome = "ATCGATCGATCGGCATCGATCG"
motif  = "ATCGATCG"
positions = find_gene_motif(genome, motif)
print(f"Motif found at positions:{positions}")
# Motif found at positions: [0, 4, 14]

Advantages and Disadvantages

Advantages

Linear time O(n + m) guaranteed: Unlike naive search, KMP performs in O(n + m) even for the worst possible inputs (highly repetitive patterns in repetitive text). This guarantee is critical for applications that must handle adversarial input without performance degradation
Text pointer never moves backward: KMP scans the text in a single left-to-right pass. This makes it ideal for streaming inputs like network packets or file streams where you cannot seek backward in the data
Efficient for repetitive patterns: The LPS table encodes all the information about repeated substrings within the pattern. The more repetitive the pattern, the more work KMP saves compared to naive search by leveraging these repetitions
O(m) space only: KMP only needs to store the LPS array of length m (pattern size). No extra space proportional to the text is needed, making it memory-efficient for searching large texts

Disadvantages

More complex to implement: KMP requires understanding and correctly building the LPS array, which involves a non-trivial loop with three distinct cases and index management. The naive approach is trivially simple to write and understand. Bugs in the LPS construction are common and subtle
Preprocessing overhead for short patterns: For very short patterns (length 1-5) in short texts, the O(m) preprocessing cost of building the LPS array is wasted effort. The naive algorithm with its O(1) space and simpler logic performs comparably and is easier to reason about
Not always the fastest in practice: For typical inputs with uniformly distributed characters, naive search rejects mismatches on the first character most of the time, giving nearly O(n) performance. In these cases, KMP’s preprocessing overhead makes it marginally slower than naive despite better asymptotic complexity. Boyer-Moore and Horspool algorithms are often faster for large alphabets

Conclusion

KMP is the algorithm that makes linear-time pattern searching possible even in the worst case. Here is the full recap:

KMP finds all occurrences of pattern pat (length m) in text txt (length n) in O(n + m) time and O(m) space.
The naive approach takes O(n × m) in the worst case because it backtracks the text pointer and re-examines characters. KMP avoids this entirely by never moving the text pointer backward.
The key innovation is the LPS array (Longest Proper Prefix which is also Suffix). For each position in the pattern, lps[i] = length of the longest proper prefix of pat[0..i] that is also a suffix. This tells KMP exactly where to resume after a mismatch.
Building LPS uses three cases: characters match (extend the prefix-suffix), mismatch with len > 0 (fall back via lps), mismatch with len = 0 (set 0, advance).
KMP search: advance both pointers on match, jump pattern pointer via lps on mismatch (text pointer stays), record match when j reaches m then jump j via lps.
Real-world uses: text editor search, DNA sequence matching, plagiarism detection, network intrusion detection, spam filtering.

FAQ

FREQUENTLY ASKED QUESTIONS

LPS stands for Longest Proper Prefix which is also a Suffix. For a given position i in the pattern, lps[i] stores the length of the longest substring that is simultaneously a proper prefix of the pattern from position 0 to i, and also a suffix of the same substring. “Proper prefix” means it does not include the entire string. The LPS array tells KMP how far to jump the pattern pointer back when a mismatch occurs without re-examining already-matched text characters.

The naive approach restarts the text pointer backward on every mismatch, potentially re-examining the same text characters many times. KMP’s text pointer only moves forward. After a mismatch, the pattern pointer jumps using the LPS table while the text pointer stays in place. Since the text pointer makes at most n forward moves and the pattern pointer can decrease at most as many times as it increases (bounded by n), total comparisons are O(n). Adding O(m) for LPS construction gives O(n + m).

Yes. When KMP finds a complete match (pattern pointer reaches m), it records the match position and then sets the pattern pointer to lps[m-1] instead of resetting to 0. This allows it to immediately continue searching for the next potential match, correctly handling overlapping matches. The result is a list of all starting positions where the pattern occurs in the text.

Both achieve better worst-case performance than naive search. KMP preprocesses the pattern and uses the LPS table to decide how far to slide the pattern after a mismatch, always scanning left-to-right. Boyer-Moore also preprocesses the pattern but scans the pattern right-to-left and can skip more characters at once for large alphabets, making it faster in practice for English text. KMP is simpler to implement and guarantees O(n + m). Boyer-Moore is harder to implement but often faster on real-world inputs.

For random text with a large alphabet (like English), the naive approach typically rejects a mismatch on the very first character comparison at each position. This means it rarely does more than one comparison per text position, giving near-O(n) performance. KMP’s advantage only becomes dramatic for repetitive patterns in repetitive text. For typical inputs, the naive approach performs well, which is why it is commonly used for short patterns in simple applications.

Yes. KMP correctly handles overlapping occurrences. For example, searching for “AA” in “AAAA” finds matches at positions 0, 1, and 2. When a match is found at position 0, the pattern pointer jumps to lps[1] = 1 (not 0), allowing the next comparison to start from the overlapping position correctly. Naive search handles this too but must be carefully coded to not skip positions after a match.

KMP provides the optimal worst-case O(n + m) time for exact string matching. However, Rabin-Karp uses hashing and is better when searching for multiple patterns simultaneously. Boyer-Moore is typically faster in practice for single pattern search with large alphabets. Aho-Corasick efficiently handles multiple patterns at once. KMP is an excellent general-purpose algorithm and the most important one to learn first because its LPS concept underlies many advanced string algorithms.

Text editors use string search when you press Ctrl+F. For simple cases (short pattern, small file), naive search is fast enough. For large files or repetitive content (like source code with repeated variable names), KMP or similar algorithms ensure the search completes quickly regardless of input. Some editors use Boyer-Moore for its better average-case performance on English text, but the core concept (preprocess the pattern to avoid redundant comparisons) is the same as KMP.