Lab #7: Regular Expressions

Learning Objectives:

• To be able to understand, interpret and construct regular expressions
• To be able to use the symbols (*,+,?,[ ],|,^, etc...) properly in regular expressions
• To be able to test and debug regular expressions

Pre-Lab Instructions:

It is required that you complete Section 5 and Exercises 1-12 before you attend your lab session: 4 of the 16 lab marks are in this pre-lab section.  You should record your solutions (e.g.: email them to yourself) to those 12 exercises and have them ready to show your TA at the start of the lab.

This lab uses some JavaScript that may not work properly in your browser: if you are encountering problems, use the machines in the lab.

Section 1: Background and Introductory Language Theory (Optional)

Consider the following statements in a variety of different languages:

1. Mary had a little lamb. [English]
2. Parlez-vous l'anglais? [French]
3. Heghlu'meH QaQ jajvam! [Klingon]
4. for (j=0;j<100;j++) tot += j; [JAVA]
5. 01010101100111010101 [Binary Strings]

Each of the statements is a valid statement in its own language, but not in any of the others.  Let's say that we have a hypothetical computer program called LANGUAGECHECK where you could pass it a language and a statement, and the LANGUAGECHECK program could determine if the statement is a valid statement in that language.  In other words, the LANGUAGECHECK program would accept or reject the statement.  For example:

LANGUAGECHECK(English, "Mary had a little lamb.") => Accept
LANGUAGECHECK(JAVA,"for (j=0;j<100;j++) tot += j;") => Accept
LANGUAGECHECK(English, "01010101100111010101") => Reject

What about the following? What would LANGUAGECHECK return?

LANGUAGECHECK(English, "Mary a lamb little had.")
LANGUAGECHECK(English, "Lamb had a little Mary.")

That would depend on how sophisticated our hypothetical LANGUAGECHECK program was: if the LANGUAGECHECK program was a simple spell checker, then it would probably accept them.  One could argue that based on the syntax of these statements they should be accepted, but based on the semantics (the meaning) of the statements they should be rejected. This issue of syntax vs. semantics is pervasive throughout many areas of computer science: For example, your JAVA compiler will let you know if your JAVA program has a correct syntax (it compiles), but it cannot tell you if the program will execute the way you intended.  The development and definition of formal languages will be explored in other computer science classes, but in this course, and in this lab specifically, we will explore Regular Expressions which have straightforward and well defined syntaxes that are very popular in pattern matching and syntax checking applications.

Formally, a regular expression defines a unique set of strings.  Given a regular expression and a string, we can check if that string is an element of the set defined by the regular expression.  In other words, the regular expression will accept the string as a match or it will reject it.  This is very similar in concept to the LANGUAGECHECK program we had before.  For example, we could have a regular expression that defines the set of binary strings: all strings that contain no symbols other than 0's and 1's: {"", "0", "1", "00", "01", "10", and so on...}.  We will call that language [01]*, and returning to our hypothetical LANGUAGECHECK program from above:

LANGUAGECHECK([01]*, "10101111011") => Accept
LANGUAGECHECK([01]*, "22021120221") => Reject

As it turns out, [01]* is a valid regular expression that is equivalent to the set of all binary strings.  In the following sections we well explore that syntax further.

Section 2: The Cat in the Hat Rhymes with .*at  (Pre-Lab)

In this section we will take a page from the book of Dr. Seuss and look for potential words that rhyme with cat.

The period (.) symbol in a regular expression will match any individual character, so the following regular expression:

.at

means that the first character in the string can be anything, the second character must be an 'a', and the third character must be a 't', so it will accept the following strings:

aat, bat, cat, dat

and so on... but it will also accept strings such as:

7at, [at, @at

Our regular expression here is explicitly 3 characters long, so it will not accept:

phat, splat, at, aaaaaat, meerkat

In this lab we have provided exercise boxes that will allow you to create, edit and debug regular expressions.  In the following box, you can change the regular expression (defaults to .at) as well as enter your own test strings to help you better understand how regular expressions work.  Whenever you have changed your regular expression or your test strings, you can click the CHECK button to update the results.  Try adding dog and rat to the list of test strings.

If we want a regular expression that will accept any string that ends with at, we can use the regular expression:

.*at

Try changing the above regular expression from .at to .*at

The asterisk or star (*) symbol means the previous item can appear 0 or many times.  For example, examine the following regular expression:

the plus (+) symbol is very similar to the asterisk (*), except it means the previous item must appear at least once.  So ca+t will match everything ca*t matches except for ct.  The regular expressions ca+t and caa*t are equivalent.  Try them both in the previous example.  There are often many different ways of writing equivalent regular expressions.

the question mark (?) symbol means the previous item can appear exactly 0 or 1 times.  So ca?t will match only ct and cat.  Try this expression above.

The final method for specifying the quantity of an item is to use brace (curly) {} brackets that can be used in one of three forms: {exact} {min,} or {min,max} where you can specify an exact number of appearances, a minimum number of appearances, or a range of appearances between min and max.  So ca{2,4}t will match caat, caaat and caaaat and ca{0,3}t will match from ct through to caaat.  You can see that * is equivalent to {0,}, while + is equivalent to {1,}, and ? is equivalent to {0,1}.  Try experimenting with different configurations of brace brackets above.  Despite their versatility, in practice they are not used very often in regular expressions.

Parentheses (round brackets) () can be used to group items together and their placement in regular expressions is very important.  As you might expect, cat+ , (cat)+ and (ca)+t are all very different.  Try different combinations of parentheses and the * symbol to accept the following test cases:

The pipe (|) symbol is for alternation, although it is most commonly known as an OR.  Examine the following expression:

This OR is not an inclusive OR as we have studied before, but is instead more similar to an exclusive or operator in that it will match only one of the listed items.  If all the piped items are single characters, you can use a more shorthand notation that omits the pipes and uses square brackets:

You can specify a range of characters within square brackets []: Try [b-d] in the above example.

You can also exclude characters with the caret (^) symbol at the start of a section of square brackets. Try [^cd] in the above example.

We have seen a wide variety of special symbols such as +, and * and you may wish to match those symbols themselves in a regular expression.  To match special characters in a regular expression precede the special characters with a backslash (\).  For example, to match the string +* you would use the regular expression \+\*.  See the example below for some special characters:

You have now seen all of the essential elements required to construct regular expressions.  With the combination of *,+,?,{},(),|,[],[^] you can create a vast number of interesting and useful regular expressions.

Section 3: Programming with Regular Expressions (Optional)

Tools for using regular expressions are available in many languages. For example, In JAVA you can use the library java.util.regex. Regular expressions are so important in some text processing languages that they have been incorporated directly into the language specification.

In PERL, you can determine if a regular expression would accept a string by using the special =~ operator, so for example the statement:

if ("cat" =~ /^.at\$/)

would return true. In JavaScript, you could use the statement:

if ("cat".match(RegExp("^.at\$")))

In fact, try opening a new web browser, and pasting in the address bar: javascript:alert("cat".match(RegExp("^.at\$"))?true:false)

The ^ and the \$ surrounding the regular expression are there to identify the start and the end of the string: Both PERL and JavaScript add a theoretical .* at the start and end of the regular expression unless the ^ and \$ are present.  In this lab, the ^ and \$ are added automatically.

Regular expression implementations such as those used in JavaScript have some advanced features such as sub-expressions that be captured and re-used later in the expression, non-consumable matching, greedy and non-greedy (lazy) matching, and look-ahead matching.  Those advanced features will not be permitted in this course or this lab.

Regular expressions are used in the "real" programming world in many applications, including:

• Web form validation: Regular Expressions are used for validating phone numbers, postal codes, etc. on many websites.
• Advanced searching: Applications such as the unix grep utility, Google Codesearch, and many popular text editors allow for searching regular expressions such colou?r.
• Bioinformatics applications: Applications such as finding motifs and sequence analysis can benefit from regular expressions.
• Email and spam filtering: Many automated email filters can be configured with regular expressions.
• Automated web services and spiders: Many Internet tasks involve taking data from one source and reformatting it for another, a task made easier by regular expressions.

Section 4: Regular Expression Reference (Pre-Lab)

The following reference table has been provided for your convenience:

 Character(s) Description Example . Matches any character .at ={"aat", "bat", ...} [ ] Matches one character from those listed [cbr]at = {"cat", "bat", "rat"} [ - ] Matches one character from the range of characters listed [a-c]at = {"bat", "cat", "dat"} [^ ] Matches one character from those not listed [^b-d]at = {"aat","eat", "fat",...} | Matches one item from those separated by pipes | (ph|meerk|r)at = {"phat","meerkat","rat"} * Repeat the previous item 0 or many times ca*t = {"ct", "cat", "caat", ...} + Repeat the previous item 1 or many times ca+t = {"cat", "caat", ...} ?? Repeat the previous item 0 or 1 time ca?t = {"ct","cat"} {#} Repeat the previous item an exact number of times ca{3}t = {"caaat"} {#,}} Repeat the previous item a min. number of times ca{2,}t = {"caat","caaat",...} {#,#} Repeat the previous item between a min. and max. number of times ca{1,3}t = {"cat","caat","caaat"}

In addition, there are a few special character codes that come in handy:

 Symbol Description \d Equivalent to [0-9]: Matches any digit \D Equivalent to [^0-9]:  Matches any non-digit \s Matches white space character \S Matches a non-white space character \w Equivalent to [a-zA-Z0-9_]: Matches a "word" character \W Equivalent to [^a-zA-Z0-9_]: Matches a non-word character

Section 5: Pre-Lab Exercises (4 marks)

Remember: whenever you have changed your regular expression or your test strings, you can click the CHECK button to update the results.

For many of these exercises, we have provided a test suite of strings that will help you to test and debug your regular expressions.  For the pre-lab (Section 5) exercises, if some your regular expression does not pass all of the test suite examples, you can move on and ask for help during the lab.

Exercise 1: Develop a regular expression for any binary string.

Exercise 2: Develop a regular expression for any binary string that represents an unsigned integer that is EVEN:

Exercise 3: Develop a regular expression for any binary string of EVEN length: (Hint: remember the (cat)+ example?)

Exercise 4: Develop a regular expression for any binary string that contains the pattern 0110 OR the pattern 1001:

Exercise 5: Develop a regular expression for any binary string that contains the pattern 0110 AND the pattern 1001: (Hint: There are 4 cases, not 2)

Exercise 6: Identify what types of strings the regular expression [A-Z][0-9][A-Z]\s[0-9][A-Z][0-9] represents, and give some test strings:

Exercise 7: Identify what types of strings the regular expression ([0-9]|[A-F])+ represents, and give some test strings:

Exercise 8: Identify what types of strings the regular expression (\(\d\d\d\)\s)?\d\d\d-\d\d\d\d represents, and give some test strings:

Exercise 9: Develop a regular expression that will accept the following three strings: {"pickup truck", "pick up truck", "pick-up truck"}:

Exercise 10: Develop a regular expression that will accept three or four words:

Exercise 11: Develop a regular expression that will accept the word cat and the word hat with at most 2 other words between them:

Exercise 12: Develop a regular expression that will accept a properly formatted 24-hour time (0:00 ... 23:59):

Section 6: Lab Exercises

Exercise 13: (2 marks) Gene Identification

DNA sequences are comprised of a simple 4-alphabet language with the symbols {A,C,G,T}. Three consecutive letters are known as a codon, so ACT and TCG are both codons.  A Gene is a collection of at least three codons that starts with an ATG codon and ends with a TAA, TAG, or TGA codon.

You need to develop a regular expression that will match strings that contain a gene.

Exercise 14: (2 marks) Money Matters

Develop a regular expression that will accept a dollar amount. For example, \$3.56 and \$1,000,000 are valid amounts, whereas \$5.321 and \$5,29,40 are not.

Exercise 15: (2 marks) Testing Regular Expressions

Your co-worker has been assigned the task of writing a regular expression that tests to see if an email address is valid:

Identify 5 unique mistakes in the regular expression by creating test cases that are incorrectly rejected or accepted:

Exercise 16: (2 marks) Develop a regular expression for a binary string that has an even number of 0's:

Exercise 17: (2 marks) Develop a regular expression for a binary string that has no consecutive 0's and no consecutive 1's:

Exercise 18: (2 marks) Searching

You are in the market to buy a red pick-up truck, and you wish to develop an automated web searching program (a spider) to search daily through various online newsgroups and classified ad websites to find text containing the word red and the phrase pick-up truck close to each other, followed by a price. Specifically, you should match the words red and the phrase (pickup/pick-up/pick up) truck separated by at most two other words in between. The pick-up truck phrase could appear before or after the word red. After the words red and the phrase pick-up truck, the text should also contain a price, and you should be able to simply re-use your price regular expression you developed previously.

Section 7: Challenge Exercise

Exercise 19: (Bonus 2 marks) Develop a regular expression for a binary string that represents an unsigned value that is divisible by 3: