What is it?

XRegExp is a JavaScript library that provides an augmented, cross-browser implementation of regular expressions, including support for additional modifiers and syntax. Several convenience methods and a powerful recursive-construct parser are also included.

Browser and standards compatibility

XRegExp has been tested with Firefox 2 – 3 beta 5, Internet Explorer 5.5 – 8 beta 1, Safari 3.1, and Opera 9.27.

XRegExp is compliant with the regular expression flavor specified in ECMA-262 Edition 3 (ES3), and is designed to be compatible with regular expression extensions proposed for ECMAScript 4 (ES4). In fact, XRegExp makes some of ES4's big regex features available in today's browsers.

Features

API

Constructor XRegExp(pattern, [modifiers]) : Global

Accepts a pattern and modifiers; returns a new, extended RegExp object. Differs from a native regular expression in that additional syntax and modifiers are supported and browser inconsistencies are ameliorated.

Parameters:
  • pattern : String
    The regular expression pattern.
  • modifiers : String [optional]
    The regular expression modifiers (aka flags); may include non-native modifiers.
Returns:
  • RegExp
    An extended regular expression object.

Example code

var regex = new XRegExp("(?<month> [0-9]+ ) [-/.\\s]  # month\n" +
                        "(?<day>   [0-9]+ ) [-/.\\s]  # day  \n" +
                        "(?<year>  [0-9]+ )           # year   ", "x");

var input = "04/20/2008";
var output = input.replace(regex, "${year}-${month}-${day}");
// -> "2008-04-20"

Method cache(pattern, [modifiers]) : XRegExp

Accepts a pattern and modifiers; returns an extended RegExp object. If the regex has previously been cached, the cached object is returned; otherwise, the new object is cached and returned.

Parameters:
  • pattern : String
    The regular expression pattern.
  • modifiers : String [optional]
    The regular expression modifiers (aka flags); may include non-native modifiers.
Returns:
  • RegExp
    The extended regular expression object.

Example code

var regex1 = XRegExp.cache("\\b ex", "gix");
var regex2 = XRegExp.cache("\\b ex", "gix");
// regex1 and regex2 now refer to the same RegExp object
// This has the benefit that it is only compiled once
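
The caching idea can be sketched with native regular expressions alone. The cachedRegex helper and regexCache object below are hypothetical names used for illustration; this is not XRegExp's implementation, and it handles native modifiers only.

```javascript
// Hypothetical helper: caches native RegExp objects by pattern and flags.
var regexCache = {};

function cachedRegex(pattern, flags) {
	flags = flags || "";
	// A flags string never contains "/", so the last "/" in the key
	// always separates the pattern from the flags
	var key = "/" + pattern + "/" + flags;
	if (!regexCache[key]) {
		regexCache[key] = new RegExp(pattern, flags);
	}
	return regexCache[key];
}

var first = cachedRegex("\\bex", "gi");
var second = cachedRegex("\\bex", "gi");
// first === second, so the pattern was compiled only once
```

Because the same object is shared by every caller, a cached regex that uses the g modifier also shares its lastIndex property, so it may need to be reset before reuse.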

Method escape(string) : XRegExp

Accepts a string; returns the string with regex metacharacters escaped. The returned string can safely be used within a regex to match the literal characters specified. The escaped characters are [, ], {, }, (, ), *, +, ?, ., \, ^, $, |, ,, -, #, and whitespace (see free-spacing and comments for the list of whitespace characters).

Parameters:
  • string : String
    The string to escape.
Returns:
  • String
    The escaped string.

Example code

var str = XRegExp.escape("([\\d])");
var regex = new XRegExp("^" + str + "$");
var matched = regex.test("([\\d])"); // -> true
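
For readers without the library at hand, the escaping described above can be approximated with a single native replace call. The escapeRegex name is hypothetical, and the sketch assumes the result is used in a regex compiled without the native u modifier, which rejects some of these escapes.

```javascript
// Hypothetical stand-alone helper: prefixes each character from the list
// above (including whitespace) with a backslash.
function escapeRegex(str) {
	return str.replace(/[-[\]{}()*+?.\\^$|,#\s]/g, "\\$&");
}

var literal = "1.5 (approx.)";
var escaped = escapeRegex(literal);
var exact = new RegExp("^" + escaped + "$");

var sameText = exact.test("1.5 (approx.)");
// -> true
var otherText = exact.test("1x5 (approx.)");
// -> false, because the escaped dot only matches a literal dot
```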

Method matchRecursive(string, left, right, [modifiers], [options]) : XRegExp

Accepts a string to search, left and right format delimiters as regex pattern strings, optional regex modifiers, and optional extended options. Returns an array of matches, allowing nested instances of the left and right delimiters. Use the g modifier to return all matches; otherwise, only the first is returned.

Parameters:
  • string : String
    The string to search.
  • left : String
    The left delimiter as a regex pattern.
  • right : String
    The right delimiter as a regex pattern.
  • modifiers : String [optional]
    The regular expression modifiers (aka flags); may include non-native modifiers.
  • options : Object [optional]
    • valueNames : Array [optional]
      Changes the return format from an array of matches to a two-dimensional array of identified string parts with extended position data. Expected to be either undefined or an array with four values to be used for identifying elements in the returned array. The four element types are the text between matches and at the beginning and end of the string, left delimiter, matched text, and right delimiter. If any of the four values is set to null, all instances of that element type are omitted from the returned array. See the example code below for more information.
    • escapeChar : String [optional]
      A single-character string to be used as an escape character. Instances of the left and right delimiters escaped with this character are ignored both inside and outside of non-escaped delimiters.
      Warning: The escapeChar option is considered experimental and might be changed or removed in future versions of XRegExp.
Returns:
  • Array
    The return format is determined by the valueNames option.
    • If valueNames is undefined:
      An array containing the text within each outermost delimiter pair. E.g., ["one", "two", "three"]. If there are no matches, an empty array is returned.
    • If valueNames is an array with four values:
      A two-dimensional array of identified string parts with extended position data. E.g., [["text","...",0,3], ["left","(",3,4], ["match","a(b())c",4,11], ["right",")",11,12]]. Empty "text" segments (the text before, after, and between matches) are omitted from the results.

Example code

var input = "(t((e))s)t()(ing)";
var output = XRegExp.matchRecursive(input, "\\(", "\\)");
// -> ["t((e))s"]

// Global match
output = XRegExp.matchRecursive(input, "\\(", "\\)", "g");
// -> ["t((e))s", "", "ing"]

// Unbalanced delimiter on the left or right
output = XRegExp.matchRecursive("<<t>est", "<", ">", "g");
output = XRegExp.matchRecursive("<t>>est", "<", ">", "g");
// Both lines throw an error

// Ignoring escaped delimiters
input = "t\\{e\\\\{s{t\\{i}ng}";
output = XRegExp.matchRecursive(input, "{", "}", "g", {escapeChar: "\\"});
// -> ["s{t\\{i}ng"]

// Extended information mode with valueNames
input = "HTML: <div id='x'>A <div>nested <div /></div> element.</div>";
// The left delimiter is designed to skip self-closed <div /> elements
output = XRegExp.matchRecursive(input, "<div\\b(?:[^>](?!/>))*>", "</div>", "i", {valueNames: ["text", "left", "match", "right"]});
/* ->
[
  ["text", "HTML: ", 0, 6],
  ["left", "<div id='x'>", 6, 18],
  ["match", "A <div>nested <div /></div> element.", 18, 54],
  ["right", "</div>", 54, 60]
]
*/

// Omitting unneeded parts with null valueNames
input = "...{1}..{function(a,b){return a+b;}}";
output = XRegExp.matchRecursive(input, "{", "}", "g", {valueNames: ["literal", null, "value", null]});
/* ->
[
  ["literal", "...", 0, 3],
  ["value", "1", 4, 5],
  ["literal", "..", 6, 8],
  ["value", "function(a,b){return a+b;}", 9, 35]
]
*/

/* The matchRecursive function specifically supports the y modifier (sticky mode). This mode
requires the first match to appear at the beginning of the string, with each subsequent match
immediately following the last. Outside of the matchRecursive function, the y modifier cannot
be used unless your browser supports it natively. */
input = "<1><2><3>4<5>";
output = XRegExp.matchRecursive(input, "<", ">", "gy");
// -> ["1", "2", "3"]
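
The core idea behind matching nested delimiters can be illustrated with a depth counter. The matchBalanced function below is a hypothetical, much-simplified sketch: it assumes literal, non-empty, distinct delimiter strings rather than regex patterns, always returns all matches, and supports none of the options described above. It is not how matchRecursive is implemented.

```javascript
// Hypothetical helper: returns the text inside each outermost pair of
// literal delimiters, and throws if the delimiters are unbalanced.
function matchBalanced(str, left, right) {
	var results = [];
	var depth = 0; // current nesting level
	var start = 0; // index just after the current outermost left delimiter
	var i = 0;
	while (i < str.length) {
		if (str.slice(i, i + left.length) === left) {
			if (depth === 0)
				start = i + left.length;
			depth++;
			i += left.length;
		} else if (str.slice(i, i + right.length) === right) {
			if (depth === 0)
				throw new Error("Unbalanced right delimiter at index " + i);
			depth--;
			// Back at the top level: record the text of this outermost pair
			if (depth === 0)
				results.push(str.slice(start, i));
			i += right.length;
		} else {
			i++;
		}
	}
	if (depth > 0)
		throw new Error("Unbalanced left delimiter");
	return results;
}

var balanced = matchBalanced("(t((e))s)t()(ing)", "(", ")");
// -> ["t((e))s", "", "ing"]
```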

Method addFlags(modifiers) : RegExp.prototype

Returns a new RegExp object generated by recompiling the regex with the additional modifiers (aka flags), which may include non-native modifiers. The original regex object is not altered. See modifiers for additional details.

Parameters:
  • modifiers : String
    The regular expression modifiers (aka flags) to apply; may include non-native modifiers.
Returns:
  • RegExp
    The extended regular expression object.

Example code

var regex = new RegExp("\\b ex", "g");
regex = regex.addFlags("ix");
// regex has had three modifiers applied: gix

...
function parse (input, regex) {
	var output = "", match;
	// regex must be global for the while loop to work correctly
	if (!regex.global)
		regex = regex.addFlags("g");
	while (match = regex.exec(input)) {

		...

		// Avoid an infinite loop with zero-length matches
		if (match.index == regex.lastIndex)
			regex.lastIndex++;
	}
	return output;
}
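
Recompiling a regex with extra modifiers can be sketched natively by rebuilding it from its source property. The withFlags helper is a hypothetical name, and the sketch assumes only the three ES3 modifiers (g, i, m); any other modifier on the original regex is not carried over.

```javascript
// Hypothetical helper: returns a new native RegExp with the original's
// g, i, and m modifiers plus any extra ones. The original is not altered.
function withFlags(regex, extraFlags) {
	var flags = (regex.global ? "g" : "") +
	            (regex.ignoreCase ? "i" : "") +
	            (regex.multiline ? "m" : "");
	for (var i = 0; i < extraFlags.length; i++) {
		var flag = extraFlags.charAt(i);
		// Skip modifiers that are already present
		if (flags.indexOf(flag) === -1)
			flags += flag;
	}
	return new RegExp(regex.source, flags);
}

var original = /ex/g;
var extended = withFlags(original, "gi");
// extended.global and extended.ignoreCase are both true;
// original.ignoreCase is still false
```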

Method apply(context, args) : RegExp.prototype

Returns the result of calling RegExp.prototype.exec on the first item in the args array. This is intended to allow working generically with both functions and regular expression objects.

Parameters:
  • context : Object
    The context is ignored. It is accepted for congruity with Function.prototype.apply.
  • args : Array
    The first value in the args array is passed to the RegExp.prototype.exec method.
Returns:
  • A result Array or null.

Example code

// Returns true if every element in the array satisfies the provided testing function
Array.prototype.every = function (fn, context) {
	for (var i = 0; i < this.length; i++) {
		if (!fn.apply(context, [this[i], i, this]))
			return false;
	}
	return true;
};

var output = ["a", "ba"].every(/^a/);
// -> false
var output = ["a", "ab"].every(/^a/);
// -> true

Method call(context, string) : RegExp.prototype

Returns the result of calling RegExp.prototype.exec on the provided string. This is intended to allow working generically with both functions and regular expression objects.

Parameters:
  • context : Object
    The context is ignored. It is accepted for congruity with Function.prototype.call.
  • string : String
    The value is passed to the RegExp.prototype.exec method.
Returns:
  • A result Array or null.

Example code

// Returns an array with the elements of an existing array for which the provided filtering function returns true
Array.prototype.filter = function (fn, context) {
	var results = [];
	for (var i = 0; i < this.length; i++) {
		if (fn.call(context, this[i], i, this))
			results.push(this[i]);
	}
	return results;
};

var output = ["a", "ba", "ab", "b"].filter(/^a/);
// -> ["a", "ab"]
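
The same generic treatment of functions and regular expressions can be sketched without adding methods to RegExp.prototype, by dispatching on the argument type instead. This is a different technique from the library's, and the invoke and filterWith names are hypothetical.

```javascript
// Hypothetical helper: runs exec on the first argument if given a RegExp,
// otherwise calls the function with the given context and arguments.
function invoke(fnOrRegex, context, args) {
	if (fnOrRegex instanceof RegExp)
		return fnOrRegex.exec(args[0]);
	return fnOrRegex.apply(context, args);
}

// Hypothetical filter that accepts either a function or a regex as the test
function filterWith(array, test, context) {
	var results = [];
	for (var i = 0; i < array.length; i++) {
		if (invoke(test, context, [array[i], i, array]))
			results.push(array[i]);
	}
	return results;
}

var byRegex = filterWith(["a", "ba", "ab", "b"], /^a/);
// -> ["a", "ab"]
var byFunction = filterWith([1, 2, 3, 4], function (n) { return n % 2 === 0; });
// -> [2, 4]
```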

Named capture

There are several different syntaxes in the wild for named capture. Although Python was the first to implement the feature, most libraries have adopted .NET's alternative syntax. The following comparison is based on my understanding of the regex libraries in question. XRegExp's syntax is included at the top. Capture names can use the characters A–Z, a–z, 0–9, _, and $ only.

XRegExp
  • Capture: (?<name>…)
  • Backref in regex: \k<name>
  • Backref in replacement: ${name}
  • Stored at: result.name, or arguments[0].name within replace callbacks
  • Backref numbering: Sequential
.NET
  • Capture: (?<name>…) or (?'name'…)
  • Backref in regex: \k<name> or \k'name'
  • Backref in replacement: ${name}
  • Stored at: Matcher.Groups('name')
  • Backref numbering: Unnamed first, then named
Python
  • Capture: (?P<name>…)
  • Backref in regex: (?P=name)
  • Backref in replacement: \g<name>
  • Stored at: result.group('name')
  • Backref numbering: Sequential
Perl 5.10
  • Capture: (?<name>…), (?'name'…), or (?P<name>…)
  • Backref in regex: \k<name>, \k'name', \k{name}, \g{name}, or (?P=name)
  • Backref in replacement: $+{name}
  • Stored at: $+{name}
  • Backref numbering: Sequential
PHP preg* functions (PCRE 7)
  • Capture and backref in regex: Perl styles
  • Backref in replacement: $regs['name']
  • Stored at: $result['name']
  • Backref numbering: Sequential
Oniguruma
  • Capture and backrefs: .NET styles
  • Stored at: N/A
  • Backref numbering: Unnamed groups default to non-capturing when mixed with named groups
JGsoft
  • Capture and backrefs: .NET and Python styles
  • Stored at: N/A
  • Backref numbering: .NET and Python styles, depending on capture syntax
JRegex
  • Capture: ({name}…)
  • Backref in regex: {\name}
  • Backref in replacement: ${name}
  • Stored at: Matcher.group('name')
  • Backref numbering: Unnamed only

Example code

var repeatedWords = new XRegExp("\\b (?<word>[a-z]+) \\s+ \\k<word> \\b", "gix");
var input = "The the test data.";

// Check if data contains repeated words
var hasRepeatedWords = repeatedWords.test(input);
// -> true

// Use the regex to remove repeated words
var output = input.replace(repeatedWords, "${word}");
// -> "The test data."

var url = "http://yahoo.com/path/to/file?q=1";
var parser = new XRegExp("^                            # start of string\n" +
                         "(?<protocol> [^:/?]+ ) ://   # protocol       \n" +
                         "(?<host>     [^/?]+  )       # domain name/IP \n" +
                         "(?<path>     [^?]*   ) \\??  # optional path  \n" +
                         "(?<query>    .*      )       # optional query   ", "x");

var parts = parser.exec(url);
/* ->
parts.protocol: "http"
parts.host:     "yahoo.com"
parts.path:     "/path/to/file"
parts.query:    "q=1"
*/

// Named backreferences are available in replacement functions as properties of the first argument
url = url.replace(parser, function (match) {
	return match.replace(match.host, "microsoft.com");
});
// -> "http://microsoft.com/path/to/file?q=1"

Compatibility with deprecated features

XRegExp's named capture functionality does not support the lastMatch property of the global RegExp object or the RegExp.prototype.compile method, since those features were deprecated in JavaScript 1.5. Previous versions of this library included a k modifier that engaged named capture using a non-standard syntax. This has been removed in favor of making named capture always available using the standard syntax.

Modifiers (aka flags)

Modifiers can be combined and arranged in any order. Unlike with native modifiers, the non-native modifiers do not show up as properties on RegExp objects.

Dot matches all (s)

Usually, dot does not match newlines. However, a mode in which dot matches newlines can be as useful as one where dot doesn't. The s modifier allows the mode to be selected on a per-regex basis. Escaped dots and dots within character classes (e.g. [.a-z]) are always equivalent to literal dots. In ECMA-262 Edition 3, the newline characters are \u000a (line feed), \u000d (carriage return), \u2028 (line separator), and \u2029 (paragraph separator).

ECMAScript 4 proposals indicate that the C1/Unicode NEL "next line" control character (\u0085) will be recognized as an additional newline character in that standard.
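
The difference between the two dot modes can be seen with native regular expressions, where [\s\S] is a common stand-in for a dot that matches everything. This is only an illustration of the behavior, not a statement about how the s modifier is implemented.

```javascript
var input = "line one\nline two";

// A native dot stops at the newline
var defaultDot = /one.line/.test(input);
// -> false

// [\s\S] matches any character, including newlines
var matchAll = /one[\s\S]line/.test(input);
// -> true

// The Unicode line separator also counts as a newline for dot
var lineSeparator = /a.b/.test("a\u2028b");
// -> false
```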

Free-spacing and comments (x)

This modifier has two complementary effects. First, it causes most whitespace to be ignored, so you can free-format the expression for readability. Second, it allows comments with a leading #. Specifically, it turns most whitespace into an "ignore me" metacharacter, and # into an "ignore me, and everything else up to the next newline" metacharacter. They aren't taken as metacharacters within character classes (which means that classes are not free-format, even with x), and as with other metacharacters, you can escape whitespace and # that you want to be taken literally. Of course, you can always use \s to match whitespace.

It might be better to think of whitespace and comments as do-nothing (rather than ignore-me) metacharacters. This distinction is important with something like \12 3, which with the x modifier is taken as \12 followed by 3, and not \123. However, quantifiers following whitespace or comments apply to the preceding token, so x + is equivalent to x+.
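
A rough feel for what the x modifier does can be had from a small preprocessing function that removes unescaped whitespace and # comments outside character classes. The stripFreeSpacing name is hypothetical and the sketch is deliberately simplified: it treats whitespace as "ignore me", so it would wrongly merge \12 3 into \123, and it is not the library's implementation.

```javascript
// Hypothetical helper: removes unescaped whitespace and #-comments from a
// pattern string, leaving escapes and character classes untouched.
function stripFreeSpacing(pattern) {
	var output = "";
	var inClass = false;
	for (var i = 0; i < pattern.length; i++) {
		var ch = pattern.charAt(i);
		if (ch === "\\") {
			// Copy the backslash and the escaped character verbatim
			output += ch + pattern.charAt(i + 1);
			i++;
		} else if (inClass) {
			// Classes are not free-format; copy until the closing bracket
			if (ch === "]")
				inClass = false;
			output += ch;
		} else if (ch === "[") {
			inClass = true;
			output += ch;
		} else if (ch === "#") {
			// Skip the comment, stopping before the next newline
			while (i + 1 < pattern.length && pattern.charAt(i + 1) !== "\n")
				i++;
		} else if (!/\s/.test(ch)) {
			output += ch;
		}
	}
	return output;
}

var source = "(\\d{4}) - (\\d{2})  # year and month\n" +
             "[ ]\\#                # literal space, then literal #";
var stripped = stripFreeSpacing(source);
// -> "(\\d{4})-(\\d{2})[ ]\\#"
var matches = new RegExp(stripped).test("2008-05 #");
// -> true
```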

ECMA-262 Edition 3 uses an interpretation of whitespace based on Unicode's Basic Multilingual Plane, from version 2.1 or later of the Unicode standard. Following are the characters that should be matched by \s according to ECMA-262 Edition 3 and Unicode 4.0:

JavaScript's \s is similar but not equivalent to \p{Z} from regex libraries that support Unicode properties, including ECMAScript 4 (as proposed). The difference is that \s includes characters \u0009 through \u000d, which are not assigned the Separator property in the Unicode character database.

Note that not all shorthand character classes and other JavaScript regex syntax are Unicode-aware. According to ECMA-262 Edition 3, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, while \d, \D, \w, \W, \b, and \B use ASCII-only interpretations of digit, word character, and word boundary (e.g. /a\b/.test("naïve") returns true). Actual browser implementations often differ on these points. For example, Firefox 2 and lower consider \d and \D to be Unicode-aware, while Firefox 3 fixes this bug, making \d equivalent to [0-9] as in most other browsers.
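
A few quick checks make the split between Unicode-based and ASCII-only tokens concrete. The values in the comments are what the specification calls for; as noted above, individual browsers may report something different.

```javascript
// \s is Unicode-based: the no-break space counts as whitespace
var nbspIsSpace = /\s/.test("\u00a0");
// per the specification -> true

// \d is ASCII-only: an Arabic-Indic digit is not matched
var arabicIsDigit = /\d/.test("\u0663");
// per the specification -> false

// \w is ASCII-only: an accented letter is not a word character
var accentedIsWord = /\w/.test("\u00ef");
// per the specification -> false

// Hence \b sees a word boundary between "a" and "\u00ef" in "na\u00efve"
var boundaryInside = /a\b/.test("na\u00efve");
// per the specification -> true
```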

To test which characters or positions are matched by the tokens mentioned above in your browser, see JavaScript Regex and Unicode Tests.

Comment patterns

Comment patterns use the syntax (?#comment). They are an alternative to the line-comments allowed in free-spacing and comments mode.

Example code

var regex = new XRegExp("(?#month)\\d\\d?[-/. ](?#day)\\d\\d?[-/. ](?#year)\\d{4}");
var isDate = regex.test("04/20/2008");
// -> true

Browser inconsistency fixes

RegExp.prototype.exec

The XRegExp library automatically fixes both of these cross-browser compatibility issues.

String.prototype.match

When String.prototype.match is called with a regular expression that doesn't use the /g modifier, it returns the same result as would RegExp.prototype.exec. Hence, this method also benefits from the exec cross-browser compatibility fixes noted above.
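
The equivalence is easy to confirm natively. In this small illustration the variable names are arbitrary, and the last call shows how the result changes once the g modifier is present.

```javascript
var pattern = /(\d+)-(\d+)/;
var text = "range: 10-20";

var viaMatch = text.match(pattern);
var viaExec = pattern.exec(text);
// Both are ["10-20", "10", "20"] with index 7 and input "range: 10-20"

// With the g modifier, match returns every full match and no captures
var allMatches = "1-2 3-4".match(/(\d+)-(\d+)/g);
// -> ["1-2", "3-4"]
```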

String.prototype.split

There are several cross-browser inconsistencies when using a regular expression as the delimiter with the native String.prototype.split method. Divergences from the ECMA-262 Edition 3 standard are listed below, based on results from Internet Explorer 5.5–7, Firefox 2.0.0.14, Safari 3.0.3 beta, and Opera 9.23.

XRegExp overrides the native String.prototype.split method with a uniform cross-browser implementation that attempts to precisely follow the relevant specification (ECMA-262 Edition 3, §15.5.4.14, pp.103–104).
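
As an illustration of what the specification requires, the calls below show three behaviors where implementations of that era tended to disagree. The values in the comments are those called for by ECMA-262 Edition 3; this is not a reproduction of the library's own test cases.

```javascript
// Captured groups are spliced into the result array
var withCaptures = "a1b2c".split(/(\d)/);
// per the specification -> ["a", "1", "b", "2", "c"]

// A group that does not participate in the match contributes undefined
var nonParticipating = "ab".split(/a(x)?/);
// per the specification -> ["", undefined, "b"]

// The optional limit caps the length of the result
var limited = "a,b,c".split(/,/, 2);
// per the specification -> ["a", "b"]
```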

Character class syntax

Unlike other regex flavors, ECMA-262 Edition 3 does not treat a leading, unescaped ] within a character class as a literal character. Instead, [] is an empty set that will never match (similar to (?!) or \B\b), and [^] matches any single character (similar to [\S\s] or [\0-\uFFFF]). However, Internet Explorer and older versions of Safari use the more traditional behavior instead.

XRegExp automatically enforces the ECMA-262 Edition 3 standard behavior cross-browser.
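
The standard behavior can be checked with two short tests. The patterns are built with the RegExp constructor so that an engine using the traditional interpretation fails at run time rather than at parse time; the commented values are those the specification calls for.

```javascript
// [] is an empty set, so it cannot match anything, not even "]"
var emptyClass = new RegExp("[]").test("]");
// per the specification -> false

// [^] is the negation of the empty set, so it matches any character
var anyCharacter = new RegExp("a[^]b").test("a\nb");
// per the specification -> true
```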

Downloads

Download XRegExp 0.5.2 (minified and gzipped: 2.4 KB) — released 2008-05-14.

XRegExp 0.5 is not fully backward compatible with previous releases. See the release notes and changelog for details.

Previous releases

Tests

More tests will be added.

Feedback

Feedback is very welcome on the related blog post (preferred) or by email.

References and sources