Regular expression lookbehind behavior tests

The following test results help clarify the basic semantics used by regular expression implementations that support capturing groups within variable-length lookbehind.

Lookbehind results in different regex implementations
Regex	Target	.NET	Java	PCRE	ActionScript	JGsoft
`(?<=(.+))$`	`"abc"`	`$1`: `"abc"`	Unsupported	Unsupported	Unsupported	`$1`: `"abc"`
`(?<=(.+?))$`	`"abc"`	`$1`: `"c"`	Unsupported	Unsupported	Unsupported	`$1`: `"c"`
`(?<=(.{1,3}))$`	`"abc"`	`$1`: `"abc"`	`$1`: `"c"`	Unsupported	Unsupported	`$1`: `"abc"`
`(?<=(.{1,3}?))$`	`"abc"`	`$1`: `"c"`	`$1`: `"c"`	Unsupported	Unsupported	`$1`: `"c"`
`(?<=(.{1,3})(.{1,3}?))$`	`"abc"`	`$1`: `"ab"` `$2`: `"c"`	`$1`: `"b"` `$2`: `"c"`	Unsupported	Unsupported	`$1`: `"ab"` `$2`: `"c"`
`(?<=(.{1,3}?)(.{1,3}))$`	`"abc"`	`$1`: `"a"` `$2`: `"bc"`	`$1`: `"b"` `$2`: `"c"`	Unsupported	Unsupported	`$1`: `"a"` `$2`: `"bc"`
`(?<=(.)\|(..)\|(...))$`	`"abc"`	`$1`: `"c"` `$2` and `$3` don't participate	`$1`: `"c"` `$2` and `$3` don't participate	`$1`: `"c"` `$2` and `$3` don't participate	`$1`: `"c"` `$2` and `$3` don't participate	`$1`: `"c"` `$2` and `$3` don't participate
`(?<=(...)\|(..)\|(.))$`	`"abc"`	`$1`: `"abc"` `$2` and `$3` don't participate	`$1` and `$2` don't participate `$3`: `"c"`	`$1`: `"abc"` `$2` and `$3` don't participate	`$1`: `"abc"` `$2` and `$3` don't participate	`$1`: `"abc"` `$2` and `$3` don't participate

Other regex flavors:

Oniguruma (used by default in Ruby 1.9) supports `(?<=a|bc)` and `(?<=a(?:b|c)|a)`, but not `(?<=ab?)` or `(?<=a(?:b|cd))`. It also doesn't support capturing groups in lookbehind.
Perl, Python, and Boost.Regex support fixed-length lookbehind only: `(?<=abc)` or `(?<=a|b)`.
ECMAScript 5.1, Ruby 1.8, Tcl, XML Schema/XPath, Go/RE2, and POSIX ERE/BRE do not support lookbehind at all.
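
The fixed-length restriction is easy to observe in Python's built-in `re` module, which rejects variable-length lookbehind when the pattern is compiled. The following is a minimal sketch; the helper name `lookbehind_compiles` is made up for illustration and is not part of any library.

```python
import re

def lookbehind_compiles(pattern: str) -> bool:
    """Return True if Python's built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # re raises an error at compile time for lookbehind bodies
        # whose length is not fixed.
        return False

# Fixed-length lookbehind bodies are accepted.
print(lookbehind_compiles(r"(?<=abc)"))    # -> True
print(lookbehind_compiles(r"(?<=a|b)"))    # -> True

# Variable-length lookbehind bodies are rejected.
print(lookbehind_compiles(r"(?<=ab?)"))    # -> False
print(lookbehind_compiles(r"(?<=(.+))$"))  # -> False
```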

.NET's lookbehind is the ideal against which the other implementations can be measured. It supports all regex syntax inside lookbehind, provides intuitive greedy and nongreedy behavior, and does so while maintaining the same efficiency as lookahead.

Java's lookbehind behaves differently, although the difference is only observable when capturing groups are used within lookbehind. Java applies lookbehind by going through the target string from right to left while going through the regex from left to right. .NET goes through both the target string and the regex from right to left, except that the order of alternation is unchanged.

Consider the regex `(?<=(.{1,3}))$`. When this is tested against "abc" in Java, backreference 1 is "c" (unlike .NET, where it's "abc").

.NET applies the regex in reverse, against all text available to the left of the current matching position. It starts after "c". `.{1,3}` matches three characters because the quantifier is greedy, and the regex engine then concludes that the match attempt has succeeded.
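
The right-to-left strategy can be modeled in Python by reversing both the regex and the text and then making a single match attempt. This is only a sketch of the semantics described above, not how .NET is implemented. The function name `dotnet_style_lookbehind` and the idea of passing the lookbehind body as a list of consecutive capturing groups are assumptions made for this example, and the model only works when each group reads the same in either direction (as `.{1,3}` does).

```python
import re

def dotnet_style_lookbehind(parts, text, pos):
    """Model right-to-left matching by reversing regex and text.

    parts: the lookbehind body split into consecutive capturing groups,
           e.g. ["(.{1,3})", "(.{1,3}?)"] (hypothetical representation).
    pos:   the position in text where the lookbehind is applied.
    """
    # Reversing the order of the parts reverses the regex (assumption:
    # each individual part is direction-independent).
    reversed_regex = re.compile("".join(reversed(parts)))
    # Everything to the left of pos, read from right to left.
    reversed_text = text[:pos][::-1]
    m = reversed_regex.match(reversed_text)  # a single attempt
    if m is None:
        return None
    # Undo both reversals: group order and the characters of each capture.
    return tuple(g[::-1] if g is not None else None
                 for g in reversed(m.groups()))

print(dotnet_style_lookbehind(["(.{1,3})"], "abc", 3))
# -> ('abc',)
print(dotnet_style_lookbehind(["(.{1,3})", "(.{1,3}?)"], "abc", 3))
# -> ('ab', 'c')
print(dotnet_style_lookbehind(["(.{1,3}?)", "(.{1,3})"], "abc", 3))
# -> ('a', 'bc')
```

The results agree with the .NET column of the table for the corresponding rows.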

Java starts by going back one character, i.e., to the position between "b" and "c", and applies the regex from left to right from there. `.{1,3}` can match one character, so the lookbehind succeeds with "c" as backreference 1.
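
Java's strategy, as described here, can be modeled in Python: step back through the target string one character at a time, and at each step apply the regex left to right, requiring it to end exactly at the lookbehind position. This is a sketch of the described behavior rather than Java's actual code. The function name `java_style_lookbehind` and its explicit `min_len`/`max_len` parameters (the shortest and longest text the lookbehind body can match) are assumptions for this example.

```python
import re

def java_style_lookbehind(body, text, pos, min_len, max_len):
    """Model Java's lookbehind: string right to left, regex left to right.

    body:    the regex inside the lookbehind, e.g. r"(.{1,3})"
    pos:     the position in text where the lookbehind is applied
    min_len: shortest length the body can match (supplied by the caller)
    max_len: longest length the body can match (supplied by the caller)
    """
    regex = re.compile(body)
    earliest = max(pos - max_len, 0)
    # Try the closest start position first, then move further left.
    for start in range(pos - min_len, earliest - 1, -1):
        # The body must match exactly the text between start and pos.
        m = regex.fullmatch(text, start, pos)
        if m:
            return m.groups()
    return None

print(java_style_lookbehind(r"(.{1,3})", "abc", 3, 1, 3))
# -> ('c',)
print(java_style_lookbehind(r"(.{1,3})(.{1,3}?)", "abc", 3, 2, 6))
# -> ('b', 'c')
```

The first successful start position wins, which is why the greedy quantifier never gets the chance to capture "abc".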

Next, consider the regex `(?<=(...)|(..)|(.))$`. When tested against "abc" in Java, backreference 3 is "c" (unlike .NET, where backreference 1 is "abc").

Here, the story is the same as before. .NET goes in reverse through the regex and through the target string, but still processes alternation from left to right. Going backwards starting after "c", the first alternative, `...`, finds a match. Alternation is eager, so the lookbehind succeeds immediately.

Java goes back one character. `...` fails to match because "c" is the only character available. `..` also fails. The lone dot `.` then matches "c".

If you had `(?<=(abc)|(de)|(f))` and the string "abc" in Java, then it would start before the "c". First `abc` fails to match "c", then `de` fails to match "c", and then `f` fails to match "c". Java then looks back one more character. `abc` fails to match "bc", `de` fails to match "bc", and `f` fails to match "bc". Going back one more character, `abc` matches "abc", and the lookbehind succeeds.
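
The sequence of attempts in the two alternation examples can be traced with a small Python model of the Java strategy. As before, this is a sketch under assumptions: the name `trace_java_style` and the `min_len`/`max_len` parameters are invented for this example, and each attempt is modeled as matching the body against the whole window between the candidate start and the lookbehind position.

```python
import re

def trace_java_style(body, text, pos, min_len, max_len):
    """Return (windows tried, captured groups) for a Java-style lookbehind."""
    regex = re.compile(body)
    tried = []
    # Step back one character at a time, closest start position first.
    for start in range(pos - min_len, max(pos - max_len, 0) - 1, -1):
        window = text[start:pos]
        tried.append(window)
        # All alternatives are tried, left to right, against this window.
        m = regex.fullmatch(window)
        if m:
            return tried, m.groups()
    return tried, None

print(trace_java_style(r"(abc)|(de)|(f)", "abc", 3, 1, 3))
# -> (['c', 'bc', 'abc'], ('abc', None, None))
print(trace_java_style(r"(...)|(..)|(.)", "abc", 3, 1, 3))
# -> (['c'], (None, None, 'c'))
```

In the second call the very first window already satisfies the third alternative, so the longer alternatives are never tested against longer text.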

The reason Java doesn't support infinite repetition in lookbehind is that it would have to test the lookbehind regex at every character in the target string before the current matching position, which would be inefficient with long target strings. .NET only applies the lookbehind regex once, which is no more expensive than a lookahead.

Adapted from email correspondence between Jan Goyvaerts and Steven Levithan, while working on Regular Expressions Cookbook, Second Edition

Miscellaneous notes:

Lookbehind is tentatively scheduled to be added in ECMAScript 6. Details of how it should work are currently under discussion on the es-discuss mailing list.
The JGsoft implementation is accessible via Just Great Software applications only (RegexBuddy, etc.). The JGsoft implementation is able to emulate the limitations of lookbehind syntax in other regex flavors (e.g., differences in support of fixed-, finite-, and infinite-length repetition), but does not currently emulate Java's lookbehind behavior.
To do: Test Matthew Barnett's alternative Python regex module. Its documentation says it supports captures in variable-length lookbehind.
To do: Test ICU Regular Expressions. Its documentation says it supports finite-length lookbehind, without any further details.
To do: Add more tests (e.g., nested lookbehind).