Regular Expressions

Link: Learn Regex the Hard Way

Regular expressions ('regex') provide a means of identifying strings of interest, such as particular characters, words or patterns of characters. A regular expression describes a set of strings, giving a concise description of the set without having to list all its elements. The name Catherine might be expressed as (C|K)ath(e|a)rine (for Catherine, Catharine, Katherine, Katharine; extend it for 'Kathryn'). See also the car example in Wikipedia.
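As a minimal sketch of the Catherine pattern in use, assuming Python's standard re module and the pattern written without spaces, the following checks a hypothetical list of candidate names. The names NAME, candidates and matches are illustrative only.

```python
import re

# Alternation picks one option per group: C or K, then e or a.
NAME = re.compile(r"(C|K)ath(e|a)rine")

# Hypothetical inputs; fullmatch requires the whole string to fit the pattern.
candidates = ["Catherine", "Catharine", "Katherine", "Katharine", "Kathryn", "Cathy"]
matches = [name for name in candidates if NAME.fullmatch(name)]

print(matches)
```

'Kathryn' is rejected until the pattern is extended, which is the exercise suggested above.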

Regex can be used to search for any piece of text (e.g. 'Elvis') but is most useful when the text required follows a pattern that can be captured in a rule, e.g. a date, an IP address, a registration plate, a national health number, a bank account number, a card number, an ISBN, a Dewey Decimal number... The modern world is full of such structured data items.

Special characters, also known as metacharacters, include: . ^ $ * + ? { } [ ] \ | ( )

If you want to match one of these as an actual character it must be preceded by a \, which acts as an 'escape' character.
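To illustrate escaping, here is a small sketch assuming Python's re module; the sample text is made up. It compares an unescaped full stop with an escaped one, and uses re.escape, which adds the backslashes for you.

```python
import re

text = "Total: 3.50 (approx.) or 3x50?"

# Unescaped, '.' matches any character, so "3x50" is found as well.
loose = re.findall(r"3.50", text)

# Escaped, '\.' matches only a literal full stop.
strict = re.findall(r"3\.50", text)

# re.escape puts a backslash in front of each metacharacter in a string.
escaped = re.escape("(approx.)")
literal_hit = re.search(escaped, text).group()

print(loose, strict, escaped, literal_hit)
```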

RegularExpressions.info

Searching

The following characters are used in searching to match single characters:

Integer

The regular expression for any integer is: [+\-]?\d+

Break it down:

[+\-]? is for the optional sign; inside square brackets - is a metacharacter that marks a range, as in [a-z], so it is escaped as \- here (placing it last, as in [+-], also works).

\d+ is for one or more digits
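A short sketch of the integer pattern, assuming Python's re module and the sign class written as [+\-] (no space inside the brackets). The helper name is_integer is hypothetical.

```python
import re

# Optional sign followed by one or more digits.
INTEGER = re.compile(r"[+\-]?\d+")

def is_integer(text: str) -> bool:
    """Return True if the whole string is a signed or unsigned integer."""
    return INTEGER.fullmatch(text) is not None

for sample in ["42", "-7", "+0", "4.2", "--3", ""]:
    print(repr(sample), is_integer(sample))
```

fullmatch is used rather than search so that '4.2' is not accepted on the strength of the '4' alone.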

Real Number

The regular expression for a real number is: [+\-]?\d+(\.\d+)

Break it down:

[+\-]?\d+ is for the sign and whole number part.

(\.\d+) is for the fractional part: \. is for the fractional point: as '.' is a metacharacter it must be preceded by a \ to find the literal character; \d+ is for a sequence of digits in the fraction.
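The sketch below, assuming Python's re module, tries the real-number pattern on a few strings. As written the fractional part is required, so a whole number such as 42 is rejected; the second pattern, with ? appended to the group, is a suggested variant and not part of the original expression.

```python
import re

# Sign, whole part, and a mandatory fractional part.
REAL = re.compile(r"[+\-]?\d+(\.\d+)")

# Assumed variant: '?' makes the fractional group optional.
REAL_OPTIONAL_FRACTION = re.compile(r"[+\-]?\d+(\.\d+)?")

for sample in ["3.14", "-0.5", "42", "3.", ".5"]:
    strict = REAL.fullmatch(sample) is not None
    relaxed = REAL_OPTIONAL_FRACTION.fullmatch(sample) is not None
    print(repr(sample), strict, relaxed)
```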

Email Address

The regular expression for an email address is:

[_a-zA-Z\d\-.]+@([_a-zA-Z\d\-]+(\.[_a-zA-Z\d-]+)+)

Break it down:

[_a-zA-Z\d\-.] matches a single character which is either underscore or a letter, a digit, a hyphen or a full stop. The + immediately after means that there can be one or more of these characters.

@ matches the single at character

[_a-zA-Z\d\-] is the same as before but without the full stop.

(\.[_a-zA-Z\d-]+) matches a full stop followed by one or more of the same characters as before.

([_a-zA-Z\d\-]+(\.[_a-zA-Z\d-]+)+) after the @ means that the @ symbol is followed by one or more characters (excluding full stop), followed by a full stop and more characters; the full stop and characters must appear at least once and can be repeated.
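As a sketch of the email pattern in use, assuming Python's re module, the hypothetical helper below returns the domain captured by the outer group, or None if the string does not fit. The pattern is a simple illustration and does not cover every address that mail systems allow.

```python
import re

EMAIL = re.compile(r"[_a-zA-Z\d\-.]+@([_a-zA-Z\d\-]+(\.[_a-zA-Z\d-]+)+)")

def domain_of(address: str):
    """Return the part after the @ if the address fits the pattern, else None."""
    match = EMAIL.fullmatch(address)
    return match.group(1) if match else None

print(domain_of("jane.doe@example.co.uk"))  # domain has two dotted parts after 'example'
print(domain_of("jane@localhost"))          # no full stop after the @, so no match
```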

Applications

Regular expressions are used by many text editors and programming languages; Perl and Ruby, for example, have a regular expression engine built in. Wildcards (? and *) are regular expression tools of limited scope. A regular expression is written in a formal language that can be interpreted by a regular expression processor. One task for a regular expression processor might be to find all words that begin with 'ex', or all words that include 'iss' but not at the start or end of the word (fission, glissando, frisson...). This might also have applications in detecting banned words. Other applications are syntax highlighting in editors and data validation. Web search engines do not usually offer searching by regular expressions as this would be too time-consuming; one exception was Google Code Search.
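The two word-finding tasks mentioned above might be sketched as follows, assuming Python's re module and a made-up sentence. Requiring at least one word character on each side of 'iss' is one way to keep it away from the start and end of the word.

```python
import re

text = "The expert issued an exact fission report; the glissando gave a frisson. Miss!"

# Words that begin with 'ex': a word boundary, then 'ex', then the rest of the word.
starts_with_ex = re.findall(r"\bex\w*", text)

# Words with 'iss' strictly inside: \w+ on both sides forces a character before and after.
inner_iss = re.findall(r"\b\w+iss\w+\b", text)

print(starts_with_ex)
print(inner_iss)
```

'issued' and 'Miss' are skipped because 'iss' sits at the start or end of those words.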

Applications include: virus signatures, search and replace in text editors, filtering text such as spam and banned words (like "freedom", "amnesty", "democracy", etc.), input masks.

Searching text

You might want to search for a date. You might think: 'But I can enter a search string like "09/10/40" and scan the text for a match. Even if I can't match the string in one go I can search for "0" and then see if it is followed by "9" and so on.' This, however, misses the point: what if you want to search for any date? How can you identify a date in the text when you don't know the numbers? It could be done in code, of course, but this is where regex comes in: form a regular expression that identifies any date of the form dd/mm/yy (or any other date format) and then scan the text with a regex analyser.
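A minimal sketch of scanning text for any dd/mm/yy date, assuming Python's re module. The pattern and sample text are illustrative; the pattern checks only that the day is 01 to 31 and the month 01 to 12, not that the date exists in the calendar.

```python
import re

# Day 01-31, month 01-12, two-digit year, each separated by '/'.
DATE = re.compile(r"\b(0[1-9]|[12]\d|3[01])/(0[1-9]|1[0-2])/(\d{2})\b")

text = "Born 09/10/40, recorded on 22/06/67; ref 99/99/99 is not a date."

# finditer yields match objects; group(0) is the whole matched date.
dates = [match.group(0) for match in DATE.finditer(text)]

print(dates)
```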

Expresso

This is a free tool available here. There is a video that makes extensive use of it here (in two parts).

The program has three main sections: input text; regular expression;

Examples (Mainly from the Expresso Tutorial)

Explain the form and meaning of each regex:

Find a word: elvis: just the literal characters, any string will do; note that this will find substrings as well e.g. (p)elvis

Find a distinct word: \belvis\b

Find 'elvis' followed by 'alive': \belvis\b.*\balive\b

Find 7 digit phone numbers: \b\d\d\d-\d\d\d\d  or: \b\d{3}-\d{4}

The longest text starting with a and ending with b: a.*b

Words that begin with letter a: \ba\w*\b

Find all the 3 letter words: \b\w{3}\b

Find all the 4 digit numbers: \b\d{4}\b

Find words of 5 or 6 letters: \b\w{5,6}\b

The first word in a piece of text: ^\w*

Find all words beginning with 'ex': \be{1}x{1}\w*\b  or just: \bex\w*\b

Show the second part of words beginning with 're': (?<=\bre)\w+\b

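Find words that contain 'q' but not 'qu': \b\w*q[^u]\w*\b (note that [^u] must match some character, so a word ending in 'q' is missed or runs on into the next word; the negative lookahead form \b\w*q(?!u)\w*\b avoids this)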

Find any date in mm/dd/yy or mm/dd/yyyy format (written over several lines, so the 'ignore pattern whitespace' option must be on):

^
(?<Month>0?[1-9]|12|11|10)
/
(?<Day>[12]\d|0?[1-9]|3[01])
/
(?<Year>(?:\d{4}|\d{2}))
$

Switch month and day for British-style dates.

An IP address: (\d{1,3}\.){3}\d{1,3}

Show text between HTML tags: (?<=<(\w+)>).*(?=<\/\1>)

Postcodes

Regular Languages

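By definition, the languages accepted by finite state machines (FSMs) are regular languages; to put it another way, a language is regular if there is a finite state machine that accepts it. Regular expressions in the formal sense (without extensions such as back-references) describe exactly these languages.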
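To make the link between regular expressions and finite state machines concrete, here is a hand-built machine for the signed integer language, an optional + or - followed by one or more digits. It is a sketch in Python; the state names, the transition table and the accepts function are all illustrative assumptions.

```python
# States: 'start' (nothing read), 'signed' (sign read), 'digits' (at least one digit read).
TRANSITIONS = {
    ("start", "sign"): "signed",
    ("start", "digit"): "digits",
    ("signed", "digit"): "digits",
    ("digits", "digit"): "digits",
}
ACCEPTING = {"digits"}

def classify(ch: str) -> str:
    """Map a single input character to the symbol class used in the transition table."""
    if ch in "+-":
        return "sign"
    if ch in "0123456789":
        return "digit"
    return "other"

def accepts(text: str) -> bool:
    """Run the machine over the text; accept only if it ends in an accepting state."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, classify(ch)))
        if state is None:  # no transition defined: the machine rejects
            return False
    return state in ACCEPTING

for sample in ["42", "-7", "+", "4-2", ""]:
    print(repr(sample), accepts(sample))
```

The machine has a fixed, finite number of states and needs no other memory, which is what makes the language regular.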