Regular expression lookbehind behavior tests

The following test results help elucidate the basic semantics used by regular expression implementations that support capturing groups within variable-length lookbehind.

Lookbehind results in different regex implementations
Regex (?<=(.+))$, target "abc":
  • .NET: $1: "abc"
  • Java: Unsupported
  • PCRE: Unsupported
  • ActionScript: Unsupported
  • JGsoft: $1: "abc"

Regex (?<=(.+?))$, target "abc":
  • .NET: $1: "c"
  • Java: Unsupported
  • PCRE: Unsupported
  • ActionScript: Unsupported
  • JGsoft: $1: "c"

Regex (?<=(.{1,3}))$, target "abc":
  • .NET: $1: "abc"
  • Java: $1: "c"
  • PCRE: Unsupported
  • ActionScript: Unsupported
  • JGsoft: $1: "abc"

Regex (?<=(.{1,3}?))$, target "abc":
  • .NET: $1: "c"
  • Java: $1: "c"
  • PCRE: Unsupported
  • ActionScript: Unsupported
  • JGsoft: $1: "c"

Regex (?<=(.{1,3})(.{1,3}?))$, target "abc":
  • .NET: $1: "ab", $2: "c"
  • Java: $1: "b", $2: "c"
  • PCRE: Unsupported
  • ActionScript: Unsupported
  • JGsoft: $1: "ab", $2: "c"

Regex (?<=(.{1,3}?)(.{1,3}))$, target "abc":
  • .NET: $1: "a", $2: "bc"
  • Java: $1: "b", $2: "c"
  • PCRE: Unsupported
  • ActionScript: Unsupported
  • JGsoft: $1: "a", $2: "bc"

Regex (?<=(.)|(..)|(...))$, target "abc":
  • .NET: $1: "c"; $2 and $3 don't participate
  • Java: $1: "c"; $2 and $3 don't participate
  • PCRE: $1: "c"; $2 and $3 don't participate
  • ActionScript: $1: "c"; $2 and $3 don't participate
  • JGsoft: $1: "c"; $2 and $3 don't participate

Regex (?<=(...)|(..)|(.))$, target "abc":
  • .NET: $1: "abc"; $2 and $3 don't participate
  • Java: $3: "c"; $1 and $2 don't participate
  • PCRE: $1: "abc"; $2 and $3 don't participate
  • ActionScript: $1: "abc"; $2 and $3 don't participate
  • JGsoft: $1: "abc"; $2 and $3 don't participate

Other regex flavors:

.NET's lookbehind is the flagship implementation and the ideal. It supports all regex syntax inside lookbehind, gives intuitive greedy and nongreedy behavior, and does so while remaining as efficient as lookahead. Java's lookbehind behaves differently, although the difference is observable only when capturing groups are used within the lookbehind. Java applies lookbehind by stepping through the target string from right to left while applying the regex from left to right. .NET goes through both the target string and the regex from right to left, except that the order of alternation is unchanged.

Consider the regex (?<=(.{1,3}))$. When this is tested against "abc" in Java, backreference 1 is "c" (unlike .NET, where it's "abc").

.NET applies the regex in reverse, against all the text available to the left of the current matching position. Starting after "c", the greedy quantifier in .{1,3} matches three characters, and the regex engine then concludes that the match attempt has succeeded.

Java starts by going back one character, i.e., to the position between "b" and "c", and applies the regex from left to right. From there, .{1,3} can match only one character ("c"), so the lookbehind succeeds with "c" in backreference 1.
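The search order described for Java can be imitated in a few lines of Python. This is only a sketch of that order, not Java's actual implementation: the helper name `java_style_lookbehind` and its `min_len` and `max_len` parameters are made up for illustration, and Python's own `re` module runs the inner pattern left to right between each candidate start and the current position.

```python
import re

def java_style_lookbehind(inner, text, pos, min_len, max_len):
    """Emulate the described search order (hypothetical helper).

    Candidate start positions are tried from right to left, shortest
    lookback first. At each one, `inner` is applied left to right and
    must end exactly at `pos`. Returns the captured groups of the first
    success, or None if every candidate fails.
    """
    pattern = re.compile(inner)
    lowest_start = max(pos - max_len, 0)
    for start in range(pos - min_len, lowest_start - 1, -1):
        m = pattern.fullmatch(text, start, pos)
        if m:
            return m.groups()
    return None

text = "abc"
end = len(text)  # the position that $ matches in the examples

greedy = java_style_lookbehind(r"(.{1,3})", text, end, 1, 3)
lazy = java_style_lookbehind(r"(.{1,3}?)", text, end, 1, 3)
two_groups = java_style_lookbehind(r"(.{1,3})(.{1,3}?)", text, end, 2, 6)
print(greedy, lazy, two_groups)
```

Under these assumptions the greedy and lazy forms both capture "c", and the two-group regex captures "b" and "c", which agrees with the Java results listed earlier for the same regexes.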

Next, consider the regex (?<=(...)|(..)|(.))$. When tested against "abc" in Java, backreference 3 is "c" (unlike .NET, where backreference 1 is "abc").

Here, the story is the same as before. .NET goes in reverse through the regex and through the target string, but still processes alternation from left to right. Going backward from the position after "c", the first alternative (...) matches "abc". Alternation is eager, so the lookbehind succeeds immediately.

Java goes back one character. The first alternative (...) fails to match because "c" is the only character available, and the second alternative (..) fails for the same reason. The lone dot then matches "c".

If you had (?<=(abc)|(de)|(f)) and the string "abc" in Java, then it would start before the "c". First abc fails to match "c", then de fails to match "c", and then f fails to match "c". Java then looks back one more character. abc fails to match "bc", de fails to match "bc", and f fails to match "bc". Going back one more character, abc matches "abc", and the lookbehind succeeds.
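The step-by-step lookback in the two alternation examples can be traced with a small Python sketch. It models only the order of attempts described here and is not Java's implementation; `trace_java_style` and its parameters are hypothetical names chosen for illustration.

```python
import re

def trace_java_style(inner, text, pos, min_len, max_len):
    """Record each candidate the described strategy would try (sketch).

    Returns (attempts, groups): `attempts` lists the substring examined
    at each step, nearest start first; `groups` holds the captures of
    the first success, or None if nothing matched.
    """
    pattern = re.compile(inner)
    attempts = []
    for start in range(pos - min_len, max(pos - max_len, 0) - 1, -1):
        attempts.append(text[start:pos])
        m = pattern.fullmatch(text, start, pos)
        if m:
            return attempts, m.groups()
    return attempts, None

# Literal alternatives: two failed lookbacks before "abc" matches.
literal_attempts, literal_groups = trace_java_style(
    r"(abc)|(de)|(f)", "abc", 3, 1, 3)
print(literal_attempts, literal_groups)

# Dot alternatives: the very first lookback succeeds via the third group.
dot_attempts, dot_groups = trace_java_style(
    r"(...)|(..)|(.)", "abc", 3, 1, 3)
print(dot_attempts, dot_groups)
```

In this model the literal alternatives are tried against "c", then "bc", then "abc" before the first group matches, while the dot alternatives stop at "c" with only the third group participating.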

The reason Java doesn't support infinite repetition in lookbehind is that it would have to test the lookbehind regex at every character position in the target string before the current matching position, which would be inefficient with long target strings. .NET applies the lookbehind regex only once, which is no more expensive than a lookahead.
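The cost argument can be made concrete by counting how many start positions a right-to-left lookback would have to try when the lookbehind never matches. The sketch below is a rough model under that assumption, not a measurement of any real engine; `count_attempts` is a hypothetical helper, and passing `max_len = pos` stands in for an unbounded quantifier.

```python
import re

def count_attempts(inner, text, pos, min_len, max_len):
    """Count start positions tried before success or exhaustion (sketch)."""
    pattern = re.compile(inner)
    tried = 0
    for start in range(pos - min_len, max(pos - max_len, 0) - 1, -1):
        tried += 1
        if pattern.fullmatch(text, start, pos):
            break
    return tried

text = "a" * 1000
pos = len(text)

# Bounded quantifier: at most max_len - min_len + 1 start positions.
bounded = count_attempts(r"b{1,3}", text, pos, 1, 3)

# Stand-in for an unbounded quantifier such as b+: every earlier
# position in the string becomes a candidate start.
unbounded = count_attempts(r"b+", text, pos, 1, pos)
print(bounded, unbounded)
```

With a 1000-character target, the bounded case tries 3 positions and the unbounded stand-in tries 1000, so the work grows with the length of the text before the current position.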

— Adapted from email correspondence between Jan Goyvaerts and Steven Levithan, while working on Regular Expressions Cookbook, Second Edition

Miscellaneous notes: