Regular Expressions
When searching for a text string using the Find, Replace, Replace Line Enders or Find Text in Disk Files commands, Boxer supports the use of Regular Expressions, a pattern matching grammar first popularized on the Unix operating system. Regular Expressions make it possible to specify a search string which can match many different target strings, or to restrict the ways in which a search string can be matched.
Boxer uses Perl-Compatible Regular Expressions as implemented by the increasingly popular PCRE 5.0 library. See the end of this topic for further information and acknowledgements.
A complete treatment of the topic of regular expressions could--and does--fill an entire book. Mastering Regular Expressions, by Jeffrey Friedl, is one such book, and a good one at that. This help topic was written to acquaint the typical user with the most common regular expression features, without getting bogged down in fine details. The advanced reader is encouraged to seek out additional information on the web, or within the PCRE documentation itself. We have posted one such reference document on our site for your convenience.
Regular Expressions are very powerful, and can be more easily understood by studying several examples.
Matching a Single Character The dot (.) will match any single character, except the newline character. Example: p.t will match pat, pet, pit, pot, and put, and in fact any 3-character sequence with p and t at its ends and a single character in the middle.
Matching with an Asterisk The asterisk (*) will match zero or more occurrences of the preceding character. Example: zo*m will match zm, zom, zoom and zooooooooom, among others. Note that the character preceding the asterisk can be the dot, so zero or more occurrences of any character will be matched when the construction .* is used. Example: Bo.*r will match Boxer, Bowler, Bookmaker and Bookkeeper, but not Building Manager, because the pattern requires the literal characters Bo at the start of the match.
Matching with a Plus Sign The plus sign (+) will match one or more occurrences of the preceding character. Example: ho+p will match hop, hoop and hooooooop, among others. Note that the character preceding the plus sign can be the dot, so that one or more occurrences of any character will be matched when the construction .+ is used.
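The dot, asterisk and plus sign can be tried out in a few lines of Python. This is only a sketch: it assumes Python's standard re module as a stand-in for Boxer's PCRE engine (the two agree on these basic constructs), and it uses fullmatch so that each test word must match in its entirety, whereas an editor search would also find matches inside longer text. The word lists are invented for illustration.

```python
import re

# Dot: exactly one character (other than newline) between p and t.
dot = re.compile(r"p.t")
dot_hits = [w for w in ["pat", "pet", "p-t", "pt", "part"] if dot.fullmatch(w)]

# Asterisk: zero or more occurrences of the preceding character.
star = re.compile(r"zo*m")
star_hits = [w for w in ["zm", "zom", "zoom", "zam"] if star.fullmatch(w)]

# Plus sign: one or more occurrences of the preceding character.
plus = re.compile(r"ho+p")
plus_hits = [w for w in ["hp", "hop", "hoop"] if plus.fullmatch(w)]

print(dot_hits)   # ['pat', 'pet', 'p-t']
print(star_hits)  # ['zm', 'zom', 'zoom']
print(plus_hits)  # ['hop', 'hoop']
```

Note that zm satisfies zo*m but hp does not satisfy ho+p: that is the whole difference between "zero or more" and "one or more".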
Matching at Start of Line The caret (^) can be used to force a match to occur at the start of a line. Example: ^The will match any line beginning with The.
Matching at End of Line The dollar sign ($) can be used to force a match to occur at the end of a line. Example: result$ will match any line ending with the word result.
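The two anchors can be sketched in Python as follows. The assumptions here are the use of Python's re module rather than Boxer itself, the re.MULTILINE flag (which makes ^ and $ apply to every line of a multi-line string, mirroring a line-oriented editor search), and the sample text, which is made up.

```python
import re

text = "The result\nShow the result\nThen stop"

# ^The must begin at the start of a line; .* then takes the rest of that line.
starts = re.findall(r"^The.*", text, flags=re.MULTILINE)

# result$ must finish at the end of a line; .* takes whatever precedes it.
ends = re.findall(r".*result$", text, flags=re.MULTILINE)

print(starts)  # ['The result', 'Then stop']
print(ends)    # ['The result', 'Show the result']
```

The second line is not reported by ^The because its "the" is neither capitalized nor at the start of the line, while "Then stop" is reported because ^The only requires those three characters, not a whole word.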
Character Classes or Range Expressions One or more characters can be placed within square brackets to designate the characters which can match in that position. Example: p[aeiou]t will match pat, pet, pit, pot and put. Note that digits are also characters, so an expression such as 201[1234] will match any of 2011, 2012, 2013 or 2014.
Characters can also be placed within square brackets with a dash between them to designate a range of characters. Example: [b-d]ent will match bent, cent and dent because the expression [b-d] is shorthand for all characters in that range. The character range must be entered in ascending order: [A-Z] is valid, whereas PCRE rejects a reversed range such as [Z-A] as out of order.
The character set appearing within square brackets can be negated by using the caret (^) as the first character within the opening square bracket. Example: [^cb]ent will match tent, rent, sent, dent and others, but not cent or bent. The caret can also be applied to negate a character range within square brackets: [^a-e] will match all characters except a, b, c, d and e. If the caret appears anywhere else within the range expression, its meaning reverts to that of matching the caret itself.
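The three forms of bracket expression (explicit set, range and negated set) can be compared side by side. The following is a minimal sketch in Python's re module, assumed here as a stand-in for Boxer's PCRE engine; the word list is invented, and fullmatch is used so each word is tested as a whole.

```python
import re

words = ["bent", "cent", "dent", "rent", "tent"]

# Explicit set: first letter must be b or c.
listed = [w for w in words if re.fullmatch(r"[bc]ent", w)]

# Range: first letter must be b, c or d.
ranged = [w for w in words if re.fullmatch(r"[b-d]ent", w)]

# Negated set: first letter may be anything except c or b.
negated = [w for w in words if re.fullmatch(r"[^cb]ent", w)]

# A caret that is not the first character in the brackets is a literal caret.
literal_caret = re.findall(r"[a^]", "a^b")

print(listed)         # ['bent', 'cent']
print(ranged)         # ['bent', 'cent', 'dent']
print(negated)        # ['dent', 'rent', 'tent']
print(literal_caret)  # ['a', '^']
```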
Matching Multiple Strings The vertical rule (|) can be used to separate two or more regular expressions so that any of the patterns will match. Example: red|green|blue|yellow will match any of the color names that are separated by the vertical rules.
Subpatterns Left and right parentheses can be used to start and end a subpattern. Example: c(ar|en|oun)t will match cart, cent and count. In the absence of the parentheses, car|en|ount would match car, en or ount... a very different result.
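The effect of the parentheses on alternation is easy to verify. This sketch again assumes Python's re module in place of Boxer's own engine, and the sample phrase and word lists are hypothetical.

```python
import re

# Plain alternation: any one of the listed color names.
color_hits = re.findall(r"red|green|blue|yellow", "a red door and a blue car")

# With parentheses, the alternatives are confined to the middle of the word.
grouped = re.compile(r"c(ar|en|oun)t")
grouped_hits = [w for w in ["cart", "cent", "count", "car"] if grouped.fullmatch(w)]

# Without parentheses, the alternatives are the three whole branches
# "car", "en" and "ount".
ungrouped = re.compile(r"car|en|ount")
ungrouped_hits = [w for w in ["cart", "car", "en", "ount"] if ungrouped.fullmatch(w)]

print(color_hits)      # ['red', 'blue']
print(grouped_hits)    # ['cart', 'cent', 'count']
print(ungrouped_hits)  # ['car', 'en', 'ount']
```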
Escape Character The backslash can be used to remove significance from a pattern matching character. Example: if you need to search for an asterisk, use \*. To search for a dot, use \.. To search for a plus sign, use \+. To search for the backslash itself, use \\.
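A short Python sketch shows the escaped forms at work. Python's re module is assumed as a stand-in for Boxer's engine, and the sample text is invented. Raw strings (r"...") are used so that the backslashes reach the regular expression engine exactly as they would be typed into a search dialog.

```python
import re

text = "3.5 * 2 + 1 = 8.0 (see C:\\temp)"  # the text holds a single backslash

stars = re.findall(r"\*", text)        # literal asterisk
dots = re.findall(r"\.", text)         # literal dot
pluses = re.findall(r"\+", text)       # literal plus sign
backslashes = re.findall(r"\\", text)  # literal backslash

# For contrast, an unescaped dot matches every character of "a.b".
unescaped = re.findall(r".", "a.b")

print(len(stars), len(dots), len(pluses), len(backslashes))  # 1 2 1 1
print(unescaped)  # ['a', '.', 'b']
```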
Matching Whole Words To force a pattern to find only those occurrences of a search string which appear as whole words, the pattern can be surrounded with a sequence that forces a match at a word boundary. Example: to find the word sign, but not words such as assign, signature or assignment, use \bsign\b.
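The difference the word-boundary sequence makes can be counted directly. The sketch below assumes Python's re module, whose \b behaves like PCRE's for this purpose, and uses a made-up sentence.

```python
import re

text = "Please sign the assignment; your signature is a sign of assent."

# Without boundaries, sign is also found inside assignment and signature.
loose = re.findall(r"sign", text)

# With \b on both sides, only the two stand-alone words are found.
whole = re.findall(r"\bsign\b", text)

print(len(loose))  # 4
print(len(whole))  # 2
```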
Matching Special Characters Several characters that are not readily typed from the keyboard can be matched using special character sequences:
Generic Character Types There are several convenient shorthand sequences for matching common character classes:
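As a hedged illustration, the sketch below assumes that the standard PCRE types \d (decimal digit), \w (word character) and \s (whitespace character) are among the sequences meant here. It uses Python's re module rather than Boxer's engine, with the re.ASCII flag so that the classes are limited to ASCII, which is closer to PCRE's default behavior. The sample text is invented.

```python
import re

text = "Order 66 shipped\ton 2004-09-13"

digits = re.findall(r"\d+", text, flags=re.ASCII)  # runs of digits
words = re.findall(r"\w+", text, flags=re.ASCII)   # runs of word characters
gaps = re.findall(r"\s", text, flags=re.ASCII)     # single whitespace characters

print(digits)  # ['66', '2004', '09', '13']
print(words)   # ['Order', '66', 'shipped', 'on', '2004', '09', '13']
print(gaps)    # [' ', ' ', '\t', ' ']
```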
Assertions The following sequences can be used to force a match to occur only at a required position:
Useful Constructions The following examples illustrate some common constructions, and give examples of the utility--and complexity--of some advanced regular expressions:
Min/Max Quantifiers A min/max quantifier can be used to control how many instances of the preceding entity are to be allowed within a match. The syntax for min/max quantifiers is summarized in this table:
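Example: the pattern [abc]{4,8} would match a sequence of characters consisting of the letters a, b or c, so long as at least 4 characters are present, and no more than 8 appear. Potential matches: aaaa, accb, abcabc, bbbbcccc. Non matches: aaa (too short) and abcd (only three qualifying characters in a row). A longer run such as abcabcabc is not matched in full; the match stops after the first 8 characters.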
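The {4,8} example can be checked with a short Python sketch, assuming Python's re module in place of Boxer's engine. It separates two questions that are easy to confuse: whether a whole string satisfies the pattern (fullmatch), and what a search would actually select within a longer run (search).

```python
import re

pattern = re.compile(r"[abc]{4,8}")
candidates = ["aaaa", "accb", "abcabc", "bbbbcccc", "aaa", "abcd", "abcabcabc"]

# Strings that satisfy the pattern from their first character to their last.
whole = [s for s in candidates if pattern.fullmatch(s)]

# In a nine-character run, a search selects at most eight characters.
partial = pattern.search("abcabcabc").group()

print(whole)    # ['aaaa', 'accb', 'abcabc', 'bbbbcccc']
print(partial)  # abcabcab
```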
Back References and Named Subpatterns One of the more powerful features of Perl regular expressions is the ability to make reference within a pattern to the string that matched a subpattern which occurred earlier in the pattern. Subpatterns are created when a portion of a pattern is enclosed in left and right parentheses. The first opening left parenthesis encountered starts a subpattern whose number is 1. The second left parenthesis creates subpattern 2, and so on. To make a back reference to a subpattern by number, this syntax is used:
\1 back reference to subpattern number 1
Referring to subpatterns by number can get confusing when a complex regular expression is being created. For this reason, named subpatterns are also permitted. To start a subpattern named 'foo', the following syntax would be used:
(?P<foo> start a subpattern named 'foo'
Later on in the pattern, the string that matched subpattern 'foo' could be referenced using this syntax:
(?P=foo) back reference to the subpattern named 'foo'
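Because the (?P<name>...) and (?P=name) syntax is shared with Python, a named back reference can be sketched there directly. The pattern below is hypothetical and not taken from Boxer's documentation: it names a subpattern 'tag' and then requires the same string to reappear in the closing tag.

```python
import re

# 'tag' captures the opening tag name; (?P=tag) demands the same name again.
paired = re.compile(r"<(?P<tag>\w+)>.*</(?P=tag)>")

ok = paired.search("<b>bold</b>")   # closing tag repeats the name: match
bad = paired.search("<b>bold</i>")  # closing tag differs: no match

print(ok.group("tag"))  # b
print(bad)              # None
```

The named form reads the same no matter how many other parentheses are later added to the pattern, which is the main reason to prefer it over numbered references in long expressions.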
The example presented above that matches repeated words uses a back reference:
\b(\w+)\s+\1\b
The subpattern (\w+) matches any string that contains one or more word characters. In order for the entire pattern to match, that same string must appear again (due to the \1 reference) with one or more whitespace characters (\s+) in between. Finally, the \b sequences at each end ensure that the pattern matches only at a word boundary.
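The repeated-word pattern can be exercised as follows. This is a sketch using Python's re module as a stand-in for Boxer's engine, with an invented sentence; the replacement step at the end is an illustrative extra, not something the pattern itself requires.

```python
import re

doubled = re.compile(r"\b(\w+)\s+\1\b")
text = "This is is a test of the the editor. Read the theme."

# Each hit is a word, some whitespace, and the same word again.
hits = [m.group(0) for m in doubled.finditer(text)]

# Replacing each hit with subpattern 1 collapses the repetition.
fixed = doubled.sub(r"\1", text)

print(hits)   # ['is is', 'the the']
print(fixed)  # This is a test of the editor. Read the theme.
```

The phrase "the theme" is left alone because the closing \b fails: after the second "the" comes the letter m, so there is no word boundary at that point.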
Closing Example Finally, it's worth mentioning that any or all of the expressions presented above can be used within the same regular expression. This artificially complex example:
^The\sq[^a]ic{1}k.*f[aeiou]x.*ov[a-e]r.*lazy\040dog\.$
would match the sentence:
The quick brown fox jumped over the lazy dog.
if it appeared on a single line.
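The claim can be verified mechanically. The sketch below assumes Python's re module, which accepts every construct used in this pattern, including the octal escape \040 for a space. The two altered sentences are hypothetical variations added to show which parts of the pattern reject them.

```python
import re

pattern = re.compile(r"^The\sq[^a]ic{1}k.*f[aeiou]x.*ov[a-e]r.*lazy\040dog\.$")

sentence = "The quick brown fox jumped over the lazy dog."
no_period = "The quick brown fox jumped over the lazy dog"
not_at_start = "Note: The quick brown fox jumped over the lazy dog."

matches = pattern.search(sentence) is not None            # full sentence
fails_end = pattern.search(no_period) is not None         # \.$ needs the final period
fails_start = pattern.search(not_at_start) is not None    # ^ needs The at line start

print(matches, fails_end, fails_start)  # True False False
```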
PCRE 5.0 License The Perl-Compatible Regular Expression (PCRE) package used by Boxer was written by Philip Hazel, and is used in accordance with the PCRE license:
Copyright (c) 1997-2004 University of Cambridge All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of the University of Cambridge nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.