If JavaScript or regular expressions are your thing, I encourage you to take a look around my blog, which centers on JavaScript and regular expressions.
XRegExp is a JavaScript library that provides an augmented, cross-browser implementation of regular expressions, including support for additional modifiers and syntax. Several convenience methods and a powerful recursive-construct parser are also included.
XRegExp has been tested with Firefox 2 – 3 beta 5, Internet Explorer 5.5 – 8 beta 1, Safari 3.1, and Opera 9.27.
XRegExp is compliant with the regular expression flavor specified in ECMA-262 Edition 3 (ES3), and is designed to be compatible with regular expression extensions proposed for ECMAScript 4 (ES4). In fact, XRegExp makes some of ES4's big regex features available in today's browsers.
- Comment patterns: (?#…). (Included in ES4)
- s (singleline), to make dot match all characters including newlines.
- x (extended), for free-spacing and comments. (Included in ES4)
- call and apply methods, which make generically working with functions and regexes easier.

XRegExp(pattern, [modifiers]) : Global

Accepts a pattern and modifiers; returns a new, extended RegExp object. Differs from a native regular expression in that additional syntax and modifiers are supported and browser inconsistencies are ameliorated.
var regex = new XRegExp("(?<month> [0-9]+ ) [-/.\\s] # month\n" +
"(?<day> [0-9]+ ) [-/.\\s] # day \n" +
"(?<year> [0-9]+ ) # year ", "x");
var input = "04/20/2008";
var output = input.replace(regex, "${year}-${month}-${day}");
// -> "2008-04-20"
cache(pattern, [modifiers]) : XRegExp

Accepts a pattern and modifiers; returns an extended RegExp object. If the regex has previously been cached, the cached object is returned; otherwise, the new object is cached.
var regex1 = XRegExp.cache("\\b ex", "gix");
var regex2 = XRegExp.cache("\\b ex", "gix");
// regex1 and regex2 now refer to the same RegExp object
// This has the benefit that it is only compiled once
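As a rough illustration of the caching idea described above, here is a minimal sketch in plain JavaScript. It is not XRegExp's implementation: `cachedRegex` and `regexCache` are hypothetical names, and only native modifiers are assumed.

```javascript
// Hypothetical helper (not part of XRegExp): memoize compiled RegExp objects.
var regexCache = {};

function cachedRegex(pattern, flags) {
  flags = flags || "";
  // Flags contain only letters, so the last "/" always separates pattern from flags
  var key = "/" + pattern + "/" + flags;
  if (!regexCache[key]) {
    // Compile once; later calls with the same arguments reuse this object
    regexCache[key] = new RegExp(pattern, flags);
  }
  return regexCache[key];
}

var first = cachedRegex("\\bex", "gi");
var second = cachedRegex("\\bex", "gi");
var sameObject = first === second;
```

One consequence of sharing objects this way is that a cached regex with the g modifier also shares its lastIndex property among all callers, so it may need to be reset before use.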
escape(string) : XRegExp

Accepts a string; returns the string with regex metacharacters escaped. The returned string can safely be used within a regex to match the literal characters specified. The escaped characters are [, ], {, }, (, ), *, +, ?, ., \, ^, $, |, ,, -, #, and whitespace (see free-spacing and comments for the list of whitespace characters).
var str = XRegExp.escape("([\\d])");
var regex = new XRegExp("^" + str + "$");
var matched = regex.test("([\\d])"); // -> true
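For readers who want to see the escaping step on its own, the following is a minimal sketch using only native JavaScript. The name `escapeRegex` is hypothetical, and the character list is taken from the description above; this is an approximation of the behavior, not the library's source.

```javascript
// Hypothetical stand-alone helper: backslash-escape the metacharacters listed above.
function escapeRegex(str) {
  // "$&" in the replacement string inserts the matched character itself
  return String(str).replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
}

var literal = escapeRegex("1+1=2");
// -> "1\\+1=2" (a backslash now precedes the plus sign)

var exact = new RegExp("^" + escapeRegex("([\\d])") + "$");
var matchesItself = exact.test("([\\d])"); // -> true
var matchesDigit = exact.test("5");        // -> false
```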
matchRecursive(string, left, right, [modifiers], [options]) : XRegExp

Accepts a string to search, left and right format delimiters as regex pattern strings, optional regex modifiers, and optional extended options. Returns an array of matches, allowing nested instances of the left and right delimiters. Use the g modifier to return all matches, otherwise only the first is returned.
var input = "(t((e))s)t()(ing)";
var output = XRegExp.matchRecursive(input, "\\(", "\\)");
// -> ["t((e))s"]
// Global match
output = XRegExp.matchRecursive(input, "\\(", "\\)", "g");
// -> ["t((e))s", "", "ing"]
// Unbalanced delimiter on the left or right
output = XRegExp.matchRecursive("<<t>est", "<", ">", "g");
output = XRegExp.matchRecursive("<t>>est", "<", ">", "g");
// Both lines throw an error
// Ignoring escaped delimiters
input = "t\\{e\\\\{s{t\\{i}ng}";
output = XRegExp.matchRecursive(input, "{", "}", "g", {escapeChar: "\\"});
// -> ["s{t\\{i}ng"]
// Extended information mode with valueNames
input = "HTML: <div id='x'>A <div>nested <div /></div> element.</div>";
// The left delimiter is designed to skip self-closed <div /> elements
output = XRegExp.matchRecursive(input, "<div\\b(?:[^>](?!/>))*>", "</div>", "i", {valueNames: ["text", "left", "match", "right"]});
/* ->
[
["text", "HTML: ", 0, 6],
["left", "<div id='x'>", 6, 18],
["match", "A <div>nested <div /></div> element.", 18, 54],
["right", "</div>", 54, 60]
]
*/
// Omitting unneeded parts with null valueNames
input = "...{1}..{function(a,b){return a+b;}}";
output = XRegExp.matchRecursive(input, "{", "}", "g", {valueNames: ["literal", null, "value", null]});
/* ->
[
["literal", "...", 0, 3],
["value", "1", 4, 5],
["literal", "..", 6, 8],
["value", "function(a,b){return a+b;}", 9, 35]
]
*/
/* The matchRecursive function specifically supports the y modifier (sticky mode). This mode
requires the first match to appear at the beginning of the string, with each subsequent match
immediately following the last. Outside of the matchRecursive function, the y modifier cannot
be used unless your browser supports it natively. */
input = "<1><2><3>4<5>";
output = XRegExp.matchRecursive(input, "<", ">", "gy");
// -> ["1", "2", "3"]
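The core idea behind matching nested delimiters can be sketched with a depth counter. The function below is a simplified illustration in plain JavaScript, not the matchRecursive implementation: `matchBalanced` is a hypothetical name, it always behaves like a global search, and it does not support escapeChar, valueNames, or the y modifier.

```javascript
// Hypothetical sketch: return the contents of every outermost balanced pair.
function matchBalanced(str, left, right) {
  // One regex finds either delimiter; group 1 is set only when the left one matched
  var delimiter = new RegExp("(" + left + ")|(?:" + right + ")", "g");
  var results = [];
  var depth = 0;
  var start = 0;
  var match;
  while ((match = delimiter.exec(str))) {
    if (match[1] !== undefined) {
      // Left delimiter: remember where the outermost content begins
      if (depth === 0) start = delimiter.lastIndex;
      depth++;
    } else {
      if (depth === 0) throw new Error("Unbalanced right delimiter at index " + match.index);
      depth--;
      // Closing the outermost pair completes one result
      if (depth === 0) results.push(str.slice(start, match.index));
    }
    // Avoid an infinite loop if a delimiter pattern can match an empty string
    if (match.index === delimiter.lastIndex) delimiter.lastIndex++;
  }
  if (depth !== 0) throw new Error("Unbalanced left delimiter");
  return results;
}

var balanced = matchBalanced("(t((e))s)t()(ing)", "\\(", "\\)");
// -> ["t((e))s", "", "ing"]
```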
addFlags(modifiers) : RegExp.prototype

Returns a new RegExp object generated by recompiling the regex with the additional modifiers (aka flags), which may include non-native modifiers. The original regex object is not altered. See modifiers for additional details.
var regex = new RegExp("\\b ex", "g");
regex = regex.addFlags("ix");
// regex has had three modifiers applied: gix
...
function parse (input, regex) {
var output = "", match;
// regex must be global for the while loop to work correctly
if (!regex.global)
regex = regex.addFlags("g");
while (match = regex.exec(input)) {
...
// Avoid an infinite loop with zero-length matches
if (match.index == regex.lastIndex)
regex.lastIndex++;
}
return output;
}
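The recompilation step can be approximated without the library for native modifiers only. In this sketch, `withAddedFlags` is a hypothetical helper; it assumes that only the g, i, and m modifiers matter, and passing a non-native modifier such as x would make the native RegExp constructor throw.

```javascript
// Hypothetical helper: rebuild a regex with extra native flags, leaving the original untouched.
function withAddedFlags(regex, extraFlags) {
  // Start from the flags the regex already has (only g, i, and m are considered here)
  var flags = (regex.global ? "g" : "") +
              (regex.ignoreCase ? "i" : "") +
              (regex.multiline ? "m" : "");
  for (var i = 0; i < extraFlags.length; i++) {
    var flag = extraFlags.charAt(i);
    // Skip duplicates, which the RegExp constructor would reject
    if (flags.indexOf(flag) === -1) flags += flag;
  }
  return new RegExp(regex.source, flags);
}

var original = /ex/g;
var recompiled = withAddedFlags(original, "gi");
// recompiled is global and case insensitive; original is still only global
```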
apply(context, args) : RegExp.prototype

Returns the result of calling RegExp.prototype.exec on the first item in the args array. This is intended to allow working generically with both functions and regular expression objects.
// Returns true if every element in the array satisfies the provided testing function
Array.prototype.every = function (fn, context) {
for (var i = 0; i < this.length; i++) {
if (!fn.apply(context, [this[i], i, this]))
return false;
}
return true;
};
var output = ["a", "ba"].every(/^a/);
// -> false
var output = ["a", "ab"].every(/^a/);
// -> true
call(context, string) : RegExp.prototype

Returns the result of calling RegExp.prototype.exec on the provided string. This is intended to allow working generically with both functions and regular expression objects.
// Returns an array with the elements of an existing array for which the provided filtering function returns true
Array.prototype.filter = function (fn, context) {
var results = [];
for (var i = 0; i < this.length; i++) {
if (fn.call(context, this[i], i, this))
results.push(this[i]);
}
return results;
};
var output = ["a", "ba", "ab", "b"].filter(/^a/);
// -> ["a", "ab"]
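Where extending RegExp.prototype is not desirable, a similar "regex or function" convenience can be had by wrapping the regex instead. This is an alternative sketch rather than how call and apply work in the library; `toPredicate` is a hypothetical name.

```javascript
// Hypothetical adapter: accept either a function or a RegExp and return a predicate function.
function toPredicate(test) {
  if (test instanceof RegExp) {
    return function (value) {
      // Reset lastIndex so a global regex does not carry state between calls
      test.lastIndex = 0;
      return test.test(String(value));
    };
  }
  return test;
}

var startsWithA = ["a", "ba", "ab", "b"].filter(toPredicate(/^a/));
// -> ["a", "ab"]

var longWords = ["a", "ba", "ab", "b"].filter(toPredicate(function (s) {
  return s.length > 1;
}));
// -> ["ba", "ab"]
```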
There are several different syntaxes in the wild for named capture. Although Python was the first to implement the feature, most libraries have adopted .NET's alternative syntax. The following table is based on my understanding of the regex libraries in question. XRegExp's syntax is included at the top. Capture names can use the characters A–Z, a–z, 0–9, _, and $ only.
| Library | Capture | Backref in regex | Backref in replacement | Stored at | Backref numbering |
|---|---|---|---|---|---|
| XRegExp | (?<name>…) | \k<name> | ${name} | result.name | Sequential |
| .NET | | | ${name} | Matcher.Groups('name') | Unnamed first, then named |
| Python | | | \g<name> | result.group('name') | Sequential |
| Perl 5.10 | | | $+{name} | $+{name} | Sequential |
| PHP preg* functions (PCRE 7) | Perl styles | Perl styles | $regs['name'] | $result['name'] | Sequential |
| Oniguruma | .NET styles | .NET styles | N/A | N/A | Unnamed groups default to non-capturing when mixed with named groups |
| JGsoft | .NET and Python styles | .NET and Python styles | N/A | N/A | .NET and Python styles, depending on capture syntax |
| JRegex | | | ${name} | Matcher.group('name') | Unnamed only |
var repeatedWords = new XRegExp("\\b (?<word>[a-z]+) \\s+ \\k<word> \\b", "gix");
var input = "The the test data.";
// Check if data contains repeated words
var hasRepeatedWords = repeatedWords.test(input);
// -> true
// Use the regex to remove repeated words
var output = input.replace(repeatedWords, "${word}");
// -> "The test data."
var url = "http://yahoo.com/path/to/file?q=1";
var parser = new XRegExp("^ # start of string\n" +
"(?<protocol> [^:/?]+ ) :// # protocol \n" +
"(?<host> [^/?]+ ) # domain name/IP \n" +
"(?<path> [^?]* ) \\?? # optional path \n" +
"(?<query> .* ) # optional query ", "x");
var parts = parser.exec(url);
/* ->
parts.protocol: "http"
parts.host: "yahoo.com"
parts.path: "/path/to/file"
parts.query: "q=1"
*/
// Named backreferences are available in replacement functions as properties of the first argument
url = url.replace(parser, function (match) {
return match.replace(match.host, "microsoft.com");
});
// -> "http://microsoft.com/path/to/file?q=1"
XRegExp's named capture functionality does not support the lastMatch property of the global RegExp object or the RegExp.prototype.compile method, since those features were deprecated in JavaScript 1.5. Previous versions of this library included a k modifier that engaged named capture using a non-standard syntax. This has been removed in favor of making named capture always available using the standard syntax.
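To give a feel for how named capture can be layered on top of numbered groups, here is a simplified sketch. It is an assumption-laden illustration, not the library's code: `compileNamed` and `execNamed` are hypothetical names, results are returned in a separate `groups` object rather than on the match array, and \k<name> backreferences are not handled.

```javascript
// Hypothetical sketch: rewrite (?<name>…) into plain groups and remember each group's name.
function compileNamed(pattern, flags) {
  var names = []; // names[i] holds the name of capturing group i + 1, or null if unnamed
  // Tokens: escaped character | character class | named group opener | plain group opener
  var tokens = /\\[\s\S]|\[(?:\\[\s\S]|[^\]\\])*\]|\(\?<([A-Za-z0-9_$]+)>|\((?!\?)/g;
  var source = pattern.replace(tokens, function (token, name) {
    if (name) {
      names.push(name);
      return "("; // a named group becomes an ordinary capturing group
    }
    if (token === "(") names.push(null); // count unnamed capturing groups too
    return token;
  });
  return { regex: new RegExp(source, flags || ""), names: names };
}

function execNamed(compiled, str) {
  var match = compiled.regex.exec(str);
  if (!match) return null;
  var groups = {};
  for (var i = 0; i < compiled.names.length; i++) {
    if (compiled.names[i] !== null) groups[compiled.names[i]] = match[i + 1];
  }
  return { match: match, groups: groups };
}

var dateRegex = compileNamed("(?<month>\\d+)[-/.\\s](?<day>\\d+)[-/.\\s](?<year>\\d+)");
var date = execNamed(dateRegex, "04/20/2008");
var iso = date.groups.year + "-" + date.groups.month + "-" + date.groups.day;
// -> "2008-04-20"
```

Because numbering stays sequential in this sketch, unnamed and named groups can be mixed and the numbered backreferences keep their usual positions.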
- s: Dot matches all (singleline)
- x: Free-spacing and comments (extended)
- g: All matches (global)
- i: Case insensitive (ignoreCase)
- m: ^ and $ match at line breaks (multiline)

Modifiers can be combined and arranged in any order. Unlike with native modifiers, the non-native modifiers do not show up as properties on RegExp objects.
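The effect of the s modifier listed above can be approximated in an ES3-style engine by rewriting the pattern before compiling it. The sketch below assumes a hypothetical helper named `dotAllSource`; it replaces unescaped dots outside character classes with [\s\S] and is only an illustration of the idea.

```javascript
// Hypothetical helper: make every "live" dot match any character, including newlines.
function dotAllSource(pattern) {
  // Tokens: escaped character | character class | bare dot
  var tokens = /\\[\s\S]|\[(?:\\[\s\S]|[^\]\\])*\]|\./g;
  return pattern.replace(tokens, function (token) {
    // Escaped dots and dots inside classes are returned unchanged
    return token === "." ? "[\\s\\S]" : token;
  });
}

var plainDot = new RegExp("a.b").test("a\nb");                // -> false
var dotAll = new RegExp(dotAllSource("a.b")).test("a\nb");     // -> true
var untouched = dotAllSource("a\\.b");                         // -> "a\\.b"
```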
Dot matches all (s)

ECMAScript 4 proposals indicate that the C1/Unicode NEL "next line" control character (\u0085) will be recognized as an additional newline character in that standard.
Usually, dot does not match newlines. However, a mode in which dot matches newlines can be as useful as one where dot doesn't. The s modifier allows the mode to be selected on a per-regex basis. Escaped dots and dots within character classes (e.g. [.a-z]) are always equivalent to literal dots. The newline characters are listed below:
- \u000a: Line feed (\n)
- \u000d: Carriage return (\r)
- \u2028: Line separator
- \u2029: Paragraph separator

Without the s modifier, any character including newlines can be matched with [\S\s] or [\0-\uFFFF]. The s modifier is illegal in native ECMAScript 3 regular expressions, and it is not proposed for inclusion in ECMAScript 4.

Free-spacing and comments (x)

It might be better to think of whitespace and comments as do-nothing (rather than ignore-me) metacharacters. This distinction is important with something like \12 3, which with the x modifier is taken as \12 followed by 3, and not \123. However, quantifiers following whitespace or comments apply to the preceding token, so x + is equivalent to x+.
This modifier has two complementary effects. First, it causes most whitespace to be ignored, so you can free-format the expression for readability. Second, it allows comments with a leading #. Specifically, it turns most whitespace into an "ignore me" metacharacter, and # into an "ignore me, and everything else up to the next newline" metacharacter. They aren't taken as metacharacters within character classes (which means that classes are not free-format, even with x), and as with other metacharacters, you can escape whitespace and # that you want to be taken literally. Of course, you can always use \s to match whitespace.
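The rules above can be sketched as a small pattern rewriter. This is a simplified illustration under stated assumptions, not the library's parser: `stripFreeSpacing` is a hypothetical name, and it treats whitespace and comments as do-nothing tokens by substituting an empty group, except directly before a quantifier, where they are dropped so the quantifier applies to the preceding token.

```javascript
// Hypothetical helper: remove free-spacing whitespace and # comments from a pattern string.
function stripFreeSpacing(pattern) {
  // Tokens: escaped character | character class | run of whitespace and line comments
  var tokens = /\\[\s\S]|\[(?:\\[\s\S]|[^\]\\])*\]|(?:\s+|#.*)+/g;
  return pattern.replace(tokens, function (token, offset, whole) {
    var first = token.charAt(0);
    // Escapes and character classes are not free-format, so keep them verbatim
    if (first === "\\" || first === "[") return token;
    var next = whole.charAt(offset + token.length);
    // Before a quantifier, drop the run; elsewhere, keep the neighbors apart with (?:)
    return /[*+?{]/.test(next) ? "" : "(?:)";
  });
}

var quantified = stripFreeSpacing("x +");            // -> "x+"
var commented = stripFreeSpacing("a # comment\nb");  // -> "a(?:)b"
var wordStart = new RegExp(stripFreeSpacing("\\b ex"), "i").test("An Example"); // -> true
```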
ECMA-262 Edition 3 uses an interpretation of whitespace based on Unicode's Basic Multilingual Plane, from version 2.1 or later of the Unicode standard. The characters that should be matched by \s according to ECMA-262 Edition 3 and Unicode 4.0 are listed further below.
JavaScript's \s is similar but not equivalent to \p{Z} from regex libraries that support Unicode properties, including ECMAScript 4 (as proposed). The difference is that \s includes characters \u0009–\u000d, which are not assigned the Separator property in the Unicode character database.
Note that not all shorthand character classes and other JavaScript regex syntax is Unicode-aware. According to ECMA-262 Edition 3, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, while \d, \D, \w, \W, \b, and \B use ASCII-only interpretations of digit, word character, and word boundary (e.g. /a\b/. returns true). Actual browser implementations often differ on these points. For example, Firefox 2 and lower consider \d and \D to be Unicode-aware, while Firefox 3 fixes this bug, making \d equivalent to [0-9] as with most other browsers.
To test which characters or positions are matched by the tokens mentioned above in your browser, see JavaScript Regex and Unicode Tests.
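A few of these points can be spot-checked directly in any engine. The checks below are a small sketch; the `checks` object and its property names are made up for illustration, and the sample characters are my own choices. In an engine that follows the specification each property should be true, but as noted above, older browsers may differ.

```javascript
// Hypothetical spot checks for Unicode awareness of common regex tokens.
var checks = {
  // \s is Unicode-aware: the no-break space (U+00A0) counts as whitespace
  whitespaceIsUnicodeAware: /\s/.test("\u00a0"),
  // \d is ASCII-only: the Arabic-Indic digit three (U+0663) is not matched
  digitIsAsciiOnly: !/\d/.test("\u0663"),
  // \w is ASCII-only: e with acute accent (U+00E9) is not a word character
  wordIsAsciiOnly: !/\w/.test("\u00e9"),
  // \b therefore sees a word boundary between "a" and U+00E9
  boundaryIsAsciiOnly: /a\b/.test("a\u00e9"),
  // Dot treats the line separator (U+2028) as a newline and does not match it
  dotExcludesLineSeparator: !/./.test("\u2028")
};
```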
Whitespace characters matched by \s:

- \u0009: Tab (\t)
- \u000a: Line feed (\n), newline
- \u000b: Vertical tab (\v)
- \u000c: Form feed (\f)
- \u000d: Carriage return (\r), newline
- \u0020: Space
- \u00a0: No-break space
- \u1680: Ogham space mark
- \u180e: Mongolian vowel separator
- \u2000: En quad
- \u2001: Em quad
- \u2002: En space
- \u2003: Em space
- \u2004: Three-per-em space
- \u2005: Four-per-em space
- \u2006: Six-per-em space
- \u2007: Figure space
- \u2008: Punctuation space
- \u2009: Thin space
- \u200a: Hair space
- \u200b: Zero-width space
- \u2028: Line separator, newline
- \u2029: Paragraph separator, newline
- \u202f: Narrow no-break space
- \u205f: Medium mathematical space
- \u3000: Ideographic space

The x modifier is illegal in native ECMAScript 3 regular expressions. Note that line comments cannot contain invalid patterns if the user expects to apply this modifier post-compilation using addFlags. The x modifier is proposed for inclusion in ECMAScript 4. Some of its implementation details are currently under consideration.

Comment patterns use the syntax (?#comment). They are an alternative to the line comments allowed in free-spacing and comments mode.
var regex = new XRegExp("(?#month)\\d\\d?[-/. ](?#day)\\d\\d?[-/. ](?#year)\\d{4}");
var isDate = regex.test("04/20/2008");
// -> true
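For illustration, comment patterns can be removed from a pattern string before it is handed to the native RegExp constructor. The sketch below uses a hypothetical helper named `stripCommentPatterns`; it simply deletes each comment and does not try to reproduce every detail of how the library treats tokens on either side of a comment.

```javascript
// Hypothetical helper: delete (?#…) comment patterns outside escapes and character classes.
function stripCommentPatterns(pattern) {
  // Tokens: escaped character | character class | comment pattern
  var tokens = /\\[\s\S]|\[(?:\\[\s\S]|[^\]\\])*\]|\(\?#[^)]*\)/g;
  return pattern.replace(tokens, function (token) {
    return token.slice(0, 3) === "(?#" ? "" : token;
  });
}

var datePattern = stripCommentPatterns("(?#month)\\d\\d?[-/. ](?#day)\\d\\d?[-/. ](?#year)\\d{4}");
// -> "\\d\\d?[-/. ]\\d\\d?[-/. ]\\d{4}"
var looksLikeDate = new RegExp(datePattern).test("04/20/2008"); // -> true
```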
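- In some browsers, capturing groups that do not participate in a match are not returned as undefined.
- In some browsers, the lastIndex property is incorrectly incremented after zero-length matches.

The XRegExp library automatically fixes both of these cross-browser compatibility issues.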
When String.prototype.match is called with a regular expression that doesn't use the /g modifier, it returns the same result as would RegExp.prototype.exec. Hence, this method also benefits from the exec cross-browser compatibility fixes noted above.
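Both exec behaviors can be checked in a given engine with a few lines of native JavaScript. The variable and function names below are hypothetical, and the loop shows the usual defensive guard for zero-length matches rather than anything specific to XRegExp.

```javascript
// A group that does not participate in the match should be undefined, not ""
var optional = /(a)?b/.exec("b");
var nonParticipatingIsUndefined = optional[1] === undefined;

// After a zero-length match at index 0, lastIndex should still be 0
var emptyMatcher = /x*/g;
emptyMatcher.exec("abc");
var lastIndexNotIncremented = emptyMatcher.lastIndex === 0;

// Hypothetical helper: collect every match of a global regex without looping forever
function collectMatches(regex, str) {
  if (!regex.global) throw new Error("collectMatches needs a regex with the g modifier");
  var results = [];
  var match;
  regex.lastIndex = 0;
  while ((match = regex.exec(str))) {
    results.push(match[0]);
    // Step past zero-length matches manually
    if (match.index === regex.lastIndex) regex.lastIndex++;
  }
  return results;
}

var pieces = collectMatches(/x*/g, "axb");
// -> ["", "x", "", ""]
```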
There are several cross-browser inconsistencies when using a regular expression as the delimiter with the native String.prototype.split method. Divergences from the ECMA-262 Edition 3 standard are listed below, based on results from Internet Explorer 5.5–7, Firefox 2.0.0.14, Safari 3.0.3 beta, and Opera 9.23.
- Some browsers do not splice undefined values into the returned array as the result of non-participating capturing groups.
- Some browsers have further edge cases where they do not follow the split specification.

XRegExp overrides the native String.prototype.split method with a uniform cross-browser implementation that attempts to precisely follow the relevant specification (ECMA-262 Edition 3, §15.5.4.14, pp.103–104).
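The specified split behavior can be probed with a few small cases. The following sketch uses made-up variable names and inputs; the comments describe what an engine that follows the specification should produce, and a non-conforming browser would show the differences discussed above.

```javascript
// Captured text is spliced into the result between the pieces
var withCapture = "a1b2c".split(/(\d)/);
// -> ["a", "1", "b", "2", "c"]

// A non-participating group contributes undefined rather than being skipped
var withOptionalGroup = "a-b".split(/(x)?-/);
// -> ["a", undefined, "b"]

// Empty strings at the edges are kept
var emptyEdges = ",a,".split(/,/);
// -> ["", "a", ""]
```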
Unlike other regex flavors, ECMA-262 Edition 3 does not treat a leading, unescaped ] within a character class as a literal character. Instead, [] is an empty set that will never match (similar to (?!) or \B\b), and [^] matches any single character (similar to [\S\s] or [\0-\uFFFF]). However, Internet Explorer and older versions of Safari use the more traditional behavior instead.
XRegExp automatically enforces the ECMA-262 Edition 3 standard behavior cross-browser.
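The standard behavior for empty character classes is easy to verify natively. This short sketch uses hypothetical variable names and builds the patterns with the RegExp constructor; in an engine that follows ECMA-262 each value should be true.

```javascript
// [] is an empty set: it can never match, not even a literal "]"
var emptyClass = new RegExp("[]");
var emptyClassNeverMatches = !emptyClass.test("]") && !emptyClass.test("");

// [^] matches any single character, including newlines
var negatedEmptyClass = new RegExp("[^]");
var negatedMatchesAnything = negatedEmptyClass.test("\n") && negatedEmptyClass.test("]");

// It still needs one character to consume
var negatedNeedsACharacter = !negatedEmptyClass.test("");
```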
Download XRegExp 0.5.2 (minified and gzipped: 2.4 KB) — released 2008-05-14.
XRegExp 0.5 is not fully backward compatible with previous releases. See the release notes and changelog for details.
cache method and several performance tweaks.
addFlags method.
s and x modifiers.
More tests will be added.
Feedback is very welcome on the related blog post (preferred) or by email.