What is KMP Algorithm for Pattern Searching?
Understanding KMP algorithm and LPS array clearly
.png&w=3840&q=75)
Understanding KMP algorithm and LPS array clearly
.png&w=3840&q=75)
.png)
You are building a search feature for a text editor. A user types “Ctrl+F” and searches for the word “AAAB” in a document containing a million A characters followed by a B. The naive search checks “AAAB” against every position in the text, doing 4 comparisons per position for nearly one million positions. That is close to 4 million comparisons for one search.
The KMP algorithm does it in one pass: approximately one million comparisons total, regardless of how repetitive the pattern is.
That improvement does not sound dramatic until you apply it to DNA sequence matching. Human DNA is about 3 billion base pairs long. Pattern matching in bioinformatics tools searches for gene sequences of hundreds or thousands of characters across this entire string. Naive search at O(n × m) would be impossibly slow. KMP at O(n + m) makes it practical.
The same logic applies to text editors (Find and Replace), search engines indexing web pages, plagiarism detection tools comparing documents, and network intrusion detection systems scanning packets for attack signatures.
KMP (Knuth-Morris-Pratt) was published in 1977 by Donald Knuth, James Morris, and Vaughan Pratt. It was the first linear-time string matching algorithm. Understanding it teaches you one of the most elegant ideas in computer science: how to reuse previously matched information instead of throwing it away.
The KMP Algorithm (Knuth-Morris-Pratt) is a string pattern searching algorithm that finds all occurrences of a pattern string (called pat) within a text string (called txt) in O(n + m) time, where n is the length of the text and m is the length of the pattern.
.png)
Break that apart:
Pattern searching: Given a text like "aabaacaadaabaaba" and a pattern like "aaba", find every starting index where pat occurs in txt
O(n + m) time: KMP preprocesses the pattern in O(m) time, then scans the text in O(n) time. The naive approach takes O(n × m) in the worst case
No backtracking on text: The key insight is that KMP never moves the text pointer backward. Each character in the text is compared at most once
Reuses match information: When a mismatch occurs, KMP uses a precomputed table (the LPS array) to skip redundant comparisons rather than restarting from scratch
KMP has two phases:
Phase 1: PREPROCESSING (O(m))
Build the LPS array from the pattern.
LPS = Longest Proper Prefix which is also a Suffix.
This tells KMP: "If a mismatch occurs at position j in the pattern,
how far back in the pattern do we need to go?"
Phase 2: SEARCHING (O(n))
Scan the text left to right (text pointer NEVER goes backward).
When characters match, advance both text pointer (i) and pattern pointer (j).
When mismatch occurs at pattern position j:
If j > 0: jump j to lps[j-1] (use precomputed table, do NOT restart)
If j == 0: advance text pointer i by 1
When j reaches m (full match found): record position, jump j to lps[j-1]
Total: O(m) preprocessing + O(n) search = O(n + m)Naive: mismatch at position i in text → restart text from i-j+1
Text pointer moves BACKWARD. Characters re-examined.
KMP: mismatch at position i in text → text pointer stays at i
Pattern pointer jumps to lps[j-1]
Text pointer moves FORWARD only
This is the guarantee that gives O(n) search time.
Each character in the text is compared at most twice: O(n) total.Before understanding KMP, understand what it replaces.
txt = "AAAAAAB"
pat = "AAAB"
Position 0: A A A A A A B
A A A B ← mismatch at index 3 (text B vs pattern B? No: A vs B)
Compare 4 chars. FAIL. Restart from position 1.
Position 1: A A A A A B
A A A B ← compare 4 chars. FAIL. Restart from position 2.
Position 2: A A A A B
A A A B ← compare 4 chars. FAIL. Restart from position 3.
Position 3: A A A B
A A A B ← MATCH at position 3!
Total comparisons: 4 + 4 + 4 + 4 = 16
For n=1,000,000 A's and pattern of 999 A's + B:
~1,000 comparisons × ~1,000,000 positions = 1 BILLION comparisonstxt = "AAAAAAB"
pat = "AAAB"
lps = [0, 1, 2, 0]
Start: i=0, j=0
A==A → match: i=1, j=1
A==A → match: i=2, j=2
A==A → match: i=3, j=3
A≠B → mismatch at j=3. j = lps[2] = 2 (don't restart!)
A==A → match: i=4, j=3
A≠B → mismatch at j=3. j = lps[2] = 2
A==A → match: i=5, j=3
A≠B → mismatch. j = lps[2] = 2
B≠A → mismatch at j=2. j = lps[1] = 1
B≠A → mismatch at j=1. j = lps[0] = 0
B≠A → mismatch at j=0. i++
Total comparisons: ~13 (not 16, and scales to ~n+m for large inputs)The LPS (Longest Proper Prefix which is also a Suffix) array is the precomputed table that makes KMP efficient.
.png)
For each position i in the pattern, lps[i] = length of the longest proper prefix of pat[0..i] that is also a suffix of pat[0..i].
A proper prefix does not include the full string. “ab” is a proper prefix of “abc”. “abc” is not a proper prefix of “abc”.
Pattern: "aabaaac"
Position 0: "a" → No proper prefix/suffix → lps[0] = 0
Position 1: "aa" → "a" is both prefix & suffix → lps[1] = 1
Position 2: "aab" → No match → lps[2] = 0
Position 3: "aaba" → "a" is both prefix & suffix → lps[3] = 1
Position 4: "aabaa" → "aa" is both prefix & suffix → lps[4] = 2
Position 5: "aabaaa" → "aa" is both prefix & suffix → lps[5] = 2
Position 6: "aabaaac" → mismatch, reset → lps[6] = 0
Final lps[] = [0, 1, 0, 1, 2, 2, 0]Pattern "AAAA": lps = [0, 1, 2, 3]
Pattern "ABCDE": lps = [0, 0, 0, 0, 0] (no repetition)
Pattern "AABAACAABAA":lps = [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
Pattern "ABCDABD": lps = [0, 0, 0, 0, 1, 2, 0]When mismatch occurs at pattern index j:
If j > 0: set j = lps[j-1] ← jump back, NOT to 0
If j == 0: advance text pointer
WHY this works:
If we matched pat[0..j-1] before the mismatch,
and lps[j-1] = k, then:
pat[0..k-1] == pat[j-k..j-1] (by definition of LPS)
These k characters already matched in the text.
So we do not need to recheck them.
Resume matching from pat[k] instead of pat[0].Algorithm: computeLPS(pattern)
lps[0] = 0 (always: single char has no proper prefix)
len = 0 (length of current matching prefix-suffix)
i = 1 (start from second character)
WHILE i < m:
CASE 1: pat[i] == pat[len]
len++
lps[i] = len
i++
CASE 2: pat[i] != pat[len] AND len == 0
lps[i] = 0
i++
CASE 3: pat[i] != pat[len] AND len > 0
len = lps[len-1] ← try shorter prefix (DO NOT increment i)pat = A A B A A C A A B A A
idx = 0 1 2 3 4 5 6 7 8 9 10
i=1: A==A(len=0) → len=1, lps[1]=1, i=2
i=2: B≠A(len=1) → len=lps[0]=0
B≠A(len=0) → lps[2]=0, i=3
i=3: A==A(len=0) → len=1, lps[3]=1, i=4
i=4: A==A(len=1) → len=2, lps[4]=2, i=5
i=5: C≠B(len=2) → len=lps[1]=1
C≠A(len=1) → len=lps[0]=0
C≠A(len=0) → lps[5]=0, i=6
i=6: A==A(len=0) → len=1, lps[6]=1, i=7
i=7: A==A(len=1) → len=2, lps[7]=2, i=8
i=8: B==B(len=2) → len=3, lps[8]=3, i=9
i=9: A==A(len=3) → len=4, lps[9]=4, i=10
i=10:A==A(len=4) → len=5, lps[10]=5, i=11
Final lps[] = [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]Algorithm: KMPSearch(txt, pat)
Build lps[] for pat
i = 0 (text pointer)
j = 0 (pattern pointer)
WHILE i < n:
IF txt[i] == pat[j]: ← characters match
i++, j++
IF j == m: ← full match found!
RECORD position (i - j)
j = lps[j-1] ← look for more matches
ELSE IF txt[i] != pat[j]: ← mismatch
IF j > 0:
j = lps[j-1] ← jump pattern pointer (text stays)
ELSE:
i++ ← advance text pointer onlytxt = "AABAACAABAA"
pat = "AABA"
lps = [0, 1, 0, 1]
i=0,j=0: A==A → i=1,j=1
i=1,j=1: A==A → i=2,j=2
i=2,j=2: B==B → i=3,j=3
i=3,j=3: A==A → i=4,j=4
j==m=4: MATCH at position i-j = 0! j = lps[3] = 1
i=4,j=1: A==A → i=5,j=2
i=5,j=2: C≠B → j=lps[1]=1 (text stays at 5)
i=5,j=1: C≠A → j=lps[0]=0
i=5,j=0: C≠A → i=6
i=6,j=0: A==A → i=7,j=1
i=7,j=1: A==A → i=8,j=2
i=8,j=2: B==B → i=9,j=3
i=9,j=3: A==A → i=10,j=4
j==m=4: MATCH at position i-j = 6! j = lps[3] = 1
i=10,j=1: A==A → i=11,j=2
i=11: loop ends (i == n)
Matches found at indices: [0, 6]def compute_lps(pat): """Build LPS array for pattern. O(m) time and space.""" m = len(pat) lps = [0] * m length = 0 # length of previous longest prefix-suffix i = 1 while i < m: if pat[i] == pat[length]: length += 1 lps[i] = length i += 1 elif length != 0: length = lps[length - 1] # Try shorter prefix (don't advance i) else: lps[i] = 0 i += 1 return lps def kmp_search(txt, pat): """ KMP pattern search. Returns all starting indices where pat occurs in txt. Time: O(n + m). Space: O(m) for LPS array. """ n, m = len(txt), len(pat) lps = compute_lps(pat) result = [] i = j = 0 # i = text pointer, j = pattern pointer while i < n: if txt[i] == pat[j]: i += 1 j += 1 if j == m: # Full match found result.append(i - j) j = lps[j - 1] # Look for next match elif i < n and txt[i] != pat[j]: if j != 0: j = lps[j - 1] # Jump pattern pointer (text pointer stays) else: i += 1 # Advance text pointer return result # Test cases print(kmp_search("aabaacaadaabaaba", "aaba")) # [0, 9, 12] print(kmp_search("abcab", "ab")) # [0, 3] print(kmp_search("AAAAAAAAB", "AAAAB")) # [5]
Phase 1: Build LPS array
The outer loop runs m times (once per character of pattern).
The inner backtrack step (len = lps[len-1]) can run at most m times total.
Total LPS construction: O(m)
Phase 2: KMP Search
Text pointer i only moves FORWARD (never backward).
i goes from 0 to n: O(n) increments max.
Pattern pointer j can decrease via lps but total decreases
cannot exceed total increases, which is bounded by n.
Total search: O(n)
TOTAL: O(m) + O(n) = O(n + m)
Compare to Naive: O(n × m) worst case
For n=1,000,000 and m=1,000:
Naive: 1,000,000,000 comparisons (1 billion)
KMP: 1,001,000 comparisons
Speedup: ~1000xLPS array: O(m) ← only the pattern length
No extra space for text ← text is scanned in-place
Total Space: O(m)Feature | Naive Search | KMP Algorithm |
|---|---|---|
Time (best case) | O(n) | O(n + m) |
Time (worst case) | O(n × m) | O(n + m) |
Space | O(1) | O(m) for LPS array |
Text pointer | Can move backward | Moves forward only |
Preprocessing | None | Build LPS array O(m) |
Handles repetitive patterns | Very poorly | Excellently |
Implementation complexity | Simple | Moderate |
Best for | Short patterns or small text | Long patterns with repetition |
txt = "AAAAAAAAAAAB" (n A's followed by B)
pat = "AAAAB" (m-1 A's followed by B)
Naive:
At every position, matches m-1 characters then fails on B.
n positions × m comparisons = O(n × m)
1,000,000 positions × 1,000 pattern length = 1 billion ops
KMP:
Text pointer never goes back.
Total comparisons ≤ 2n = O(n)
1,000,000 text length → ~2,000,000 ops max
Real difference: 500x faster for this worst case.Text Editor Find and Replace: Every “Find” operation in VS Code, Notepad++, and Word scans potentially large documents for a pattern. KMP ensures the search runs in O(n + m) even for pathological inputs like searching “aaa…aaab” in a file of a million a characters.
DNA Sequence Matching: Bioinformatics tools like BLAST search for gene sequences inside 3-billion-character genomes. The pattern (a gene sequence) may be hundreds of characters long with repetitive base pairs (e.g., “AAATAAATAAATAAAC”). KMP handles these repetitive patterns that would destroy naive search performance.
Plagiarism Detection: Plagiarism checkers compare submitted documents against millions of stored texts. They search for verbatim sentence matches (patterns) inside large corpora (text). KMP’s linear time is essential for making this feasible at scale.
Network Intrusion Detection: Security tools like Snort scan every network packet for known attack signatures. A signature might be a 50-100 character byte sequence. KMP scans each packet in O(packet_size + signature_size) rather than the much slower naive O(packet_size × signature_size).
# Practical example: Find all occurrences of a DNA motif
def find_gene_motif(genome, motif):
"""
Search for a genetic motif in a genome sequence.
KMP handles the highly repetitive nature of DNA efficiently.
genome: full DNA string (e.g., 3 billion chars for human genome)
motif: target gene sequence (e.g., "ATCGATCGAT")
Returns: list of starting positions of motif in genome
"""
return kmp_search(genome, motif)
# Simulated example
genome = "ATCGATCGATCGGCATCGATCG"
motif = "ATCGATCG"
positions = find_gene_motif(genome, motif)
print(f"Motif found at positions:{positions}")
# Motif found at positions: [0, 4, 14]Linear time O(n + m) guaranteed: Unlike naive search, KMP performs in O(n + m) even for the worst possible inputs (highly repetitive patterns in repetitive text). This guarantee is critical for applications that must handle adversarial input without performance degradation
Text pointer never moves backward: KMP scans the text in a single left-to-right pass. This makes it ideal for streaming inputs like network packets or file streams where you cannot seek backward in the data
Efficient for repetitive patterns: The LPS table encodes all the information about repeated substrings within the pattern. The more repetitive the pattern, the more work KMP saves compared to naive search by leveraging these repetitions
O(m) space only: KMP only needs to store the LPS array of length m (pattern size). No extra space proportional to the text is needed, making it memory-efficient for searching large texts
More complex to implement: KMP requires understanding and correctly building the LPS array, which involves a non-trivial loop with three distinct cases and index management. The naive approach is trivially simple to write and understand. Bugs in the LPS construction are common and subtle
Preprocessing overhead for short patterns: For very short patterns (length 1-5) in short texts, the O(m) preprocessing cost of building the LPS array is wasted effort. The naive algorithm with its O(1) space and simpler logic performs comparably and is easier to reason about
Not always the fastest in practice: For typical inputs with uniformly distributed characters, naive search rejects mismatches on the first character most of the time, giving nearly O(n) performance. In these cases, KMP’s preprocessing overhead makes it marginally slower than naive despite better asymptotic complexity. Boyer-Moore and Horspool algorithms are often faster for large alphabets
KMP is the algorithm that makes linear-time pattern searching possible even in the worst case. Here is the full recap:
KMP finds all occurrences of pattern pat (length m) in text txt (length n) in O(n + m) time and O(m) space.
The naive approach takes O(n × m) in the worst case because it backtracks the text pointer and re-examines characters. KMP avoids this entirely by never moving the text pointer backward.
The key innovation is the LPS array (Longest Proper Prefix which is also Suffix). For each position in the pattern, lps[i] = length of the longest proper prefix of pat[0..i] that is also a suffix. This tells KMP exactly where to resume after a mismatch.
Building LPS uses three cases: characters match (extend the prefix-suffix), mismatch with len > 0 (fall back via lps), mismatch with len = 0 (set 0, advance).
KMP search: advance both pointers on match, jump pattern pointer via lps on mismatch (text pointer stays), record match when j reaches m then jump j via lps.
Real-world uses: text editor search, DNA sequence matching, plagiarism detection, network intrusion detection, spam filtering.
FAQ