CPSC 121: Models of Computation

Lab #7: Regular Expressions


Learning Objectives:

Pre-Lab Instructions:

It is required that you complete Section 5 and Exercises 1-12 before you attend your lab session: 4 of the 16 lab marks are in this pre-lab section.  You should record your solutions (e.g.: email them to yourself) to those 12 exercises and have them ready to show your TA at the start of the lab.

Sections 1 and 3 are optional, but will help your understanding of regular expressions.

This lab uses some JavaScript that may not work properly in your browser: if you are encountering problems, use the machines in the lab.

Table of Contents:

Section 1: Background and Introductory Language Theory
Section 2: The Cat in the Hat Rhymes with .*at
Section 3: Programming with Regular Expressions
Section 4: Regular Expression Reference
Section 5: Pre-Lab Exercises (4 marks)
Section 6: Lab Exercises (12 marks)
Section 7: Challenge Exercise (2 marks)

Section 1: Background and Introductory Language Theory (Optional)

Consider the following statements in a variety of different languages:

  1. Mary had a little lamb. [English]
  2. Parlez-vous l'anglais? [French]
  3. Heghlu'meH QaQ jajvam! [Klingon]
  4. for (j=0;j<100;j++) tot += j; [JAVA]
  5. 01010101100111010101 [Binary Strings]

Each of the statements is a valid statement in its own language, but not in any of the others.  Let's say that we have a hypothetical computer program called LANGUAGECHECK where you could pass it a language and a statement, and the LANGUAGECHECK program could determine if the statement is a valid statement in that language.  In other words, the LANGUAGECHECK program would accept or reject the statement.  For example:

LANGUAGECHECK(English, "Mary had a little lamb.") => Accept
LANGUAGECHECK(JAVA,"for (j=0;j<100;j++) tot += j;") => Accept
LANGUAGECHECK(English, "01010101100111010101") => Reject

What about the following? What would LANGUAGECHECK return?

LANGUAGECHECK(English, "Mary a lamb little had.")
LANGUAGECHECK(English, "Lamb had a little Mary.")

That would depend on how sophisticated our hypothetical LANGUAGECHECK program was: if the LANGUAGECHECK program was a simple spell checker, then it would probably accept them.  One could argue that based on the syntax of these statements they should be accepted, but based on the semantics (the meaning) of the statements they should be rejected. This issue of syntax vs. semantics is pervasive throughout many areas of computer science: For example, your JAVA compiler will let you know if your JAVA program has a correct syntax (it compiles), but it cannot tell you if the program will execute the way you intended.  The development and definition of formal languages will be explored in other computer science classes, but in this course, and in this lab specifically, we will explore Regular Expressions which have straightforward and well defined syntaxes that are very popular in pattern matching and syntax checking applications.

Formally, a regular expression defines a unique set of strings.  Given a regular expression and a string, we can check if that string is an element of the set defined by the regular expression.  In other words, the regular expression will accept the string as a match or it will reject it.  This is very similar in concept to the LANGUAGECHECK program we had before.  For example, we could have a regular expression that defines the set of binary strings: all strings that contain no symbols other than 0's and 1's: {"", "0", "1", "00", "01", "10", and so on...}.  We will call that language [01]*, and returning to our hypothetical LANGUAGECHECK program from above:

LANGUAGECHECK([01]*, "10101111011") => Accept
LANGUAGECHECK([01]*, "22021120221") => Reject

As it turns out, [01]* is a valid regular expression that is equivalent to the set of all binary strings.  In the following sections we well explore that syntax further.

Section 2: The Cat in the Hat Rhymes with .*at  (Pre-Lab)

In this section we will take a page from the book of Dr. Seuss and look for potential words that rhyme with cat.

The period (.) symbol in a regular expression will match any individual character, so the following regular expression:

.at

means that the first character in the string can be anything, the second character must be an 'a', and the third character must be a 't', so it will accept the following strings:

aat, bat, cat, dat

and so on... but it will also accept strings such as:

7at, [at, @at

Our regular expression here is explicitly 3 characters long, so it will not accept:

phat, splat, at, aaaaaat, meerkat

In this lab we have provided exercise boxes that will allow you to create, edit and debug regular expressions.  In the following box, you can change the regular expression (defaults to .at) as well as enter your own test strings to help you better understand how regular expressions work.  Whenever you have changed your regular expression or your test strings, you can click the CHECK button to update the results.  Try adding dog and rat to the list of test strings.

Your First Regular Expression: .at
Test Strings:
ACCEPT: aat
ACCEPT: bat
ACCEPT: cat
ACCEPT: dat
ACCEPT: 7at
ACCEPT: [at
ACCEPT: @at
REJECT: phat
REJECT: splat
REJECT: at
REJECT: aaaaaat
REJECT: meerkat

If we want a regular expression that will accept any string that ends with at, we can use the regular expression:

.*at

Try changing the above regular expression from .at to .*at

The asterisk or star (*) symbol means the previous item can appear 0 or many times.  For example, examine the following regular expression:

The *, +, ? and {} Symbols
Test Strings:
ACCEPT: ct
ACCEPT: cat
ACCEPT: caat
ACCEPT: caaat
ACCEPT: caaaat
ACCEPT: caaaaat
ACCEPT: caaaaaaaaaaat

the plus (+) symbol is very similar to the asterisk (*), except it means the previous item must appear at least once.  So ca+t will match everything ca*t matches except for ct.  The regular expressions ca+t and caa*t are equivalent.  Try them both in the previous example.  There are often many different ways of writing equivalent regular expressions.

the question mark (?) symbol means the previous item can appear exactly 0 or 1 times.  So ca?t will match only ct and cat.  Try this expression above.

The final method for specifying the quantity of an item is to use brace (curly) {} brackets that can be used in one of three forms: {exact} {min,} or {min,max} where you can specify an exact number of appearances, a minimum number of appearances, or a range of appearances between min and max.  So ca{2,4}t will match caat, caaat and caaaat and ca{0,3}t will match from ct through to caaat.  You can see that * is equivalent to {0,}, while + is equivalent to {1,}, and ? is equivalent to {0,1}.  Try experimenting with different configurations of brace brackets above.  Despite their versatility, in practice they are not used very often in regular expressions.

Parentheses (round brackets) () can be used to group items together and their placement in regular expressions is very important.  As you might expect, cat+ , (cat)+ and (ca)+t are all very different.  Try different combinations of parentheses and the * symbol to accept the following test cases:

(Parentheses) Placement is Particular
Test Strings:
ACCEPT: cat
ACCEPT: catt
ACCEPT: cattt
REJECT: catcat
REJECT: catcatcat
REJECT: catat
REJECT: catatat
REJECT: cacat

The pipe (|) symbol is for alternation, although it is most commonly known as an OR.  Examine the following expression:

The Pipe (|) Symbol
Test Strings:
ACCEPT: bat
ACCEPT: cat
ACCEPT: rat
ACCEPT: meerkat
REJECT: hat

This OR is not an inclusive OR as we have studied before, but is instead more similar to an exclusive or operator in that it will match only one of the listed items.  If all the piped items are single characters, you can use a more shorthand notation that omits the pipes and uses square brackets:

Square Brackets: [abcd], [b-d], [^cd]
Test Strings:
ACCEPT: aat
ACCEPT: bat
ACCEPT: cat
ACCEPT: dat
REJECT: eat
REJECT: fat

You can specify a range of characters within square brackets []: Try [b-d] in the above example.

You can also exclude characters with the caret (^) symbol at the start of a section of square brackets. Try [^cd] in the above example.

We have seen a wide variety of special symbols such as +, and * and you may wish to match those symbols themselves in a regular expression.  To match special characters in a regular expression precede the special characters with a backslash (\).  For example, to match the string +* you would use the regular expression \+\*.  See the example below for some special characters:

Special Characters
Test Strings:
ACCEPT: .
ACCEPT: ..
ACCEPT: +++***+++
ACCEPT: .*?+\{}()[]^$

You have now seen all of the essential elements required to construct regular expressions.  With the combination of *,+,?,{},(),|,[],[^] you can create a vast number of interesting and useful regular expressions.

Section 3: Programming with Regular Expressions (Optional)

Tools for using regular expressions are available in many languages. For example, In JAVA you can use the library java.util.regex. Regular expressions are so important in some text processing languages that they have been incorporated directly into the language specification.

In PERL, you can determine if a regular expression would accept a string by using the special =~ operator, so for example the statement:

if ("cat" =~ /^.at$/)

would return true. In JavaScript, you could use the statement:

if ("cat".match(RegExp("^.at$")))

In fact, try opening a new web browser, and pasting in the address bar: javascript:alert("cat".match(RegExp("^.at$"))?true:false)

The ^ and the $ surrounding the regular expression are there to identify the start and the end of the string: Both PERL and JavaScript add a theoretical .* at the start and end of the regular expression unless the ^ and $ are present.  In this lab, the ^ and $ are added automatically.

Regular expression implementations such as those used in JavaScript have some advanced features such as sub-expressions that be captured and re-used later in the expression, non-consumable matching, greedy and non-greedy (lazy) matching, and look-ahead matching.  Those advanced features will not be permitted in this course or this lab.

Regular expressions are used in the "real" programming world in many applications, including:

Section 4: Regular Expression Reference (Pre-Lab)

The following reference table has been provided for your convenience:

Character(s) Description Example
. Matches any character .at ={"aat", "bat", ...}
[ ] Matches one character from those listed [cbr]at = {"cat", "bat", "rat"}
[ - ] Matches one character from the range of characters listed [a-c]at = {"bat", "cat", "dat"}
[^ ] Matches one character from those not listed [^b-d]at = {"aat","eat", "fat",...}
| Matches one item from those separated by pipes | (ph|meerk|r)at = {"phat","meerkat","rat"}
* Repeat the previous item 0 or many times ca*t = {"ct", "cat", "caat", ...}
+ Repeat the previous item 1 or many times ca+t = {"cat", "caat", ...}
?? Repeat the previous item 0 or 1 time ca?t = {"ct","cat"}
{#} Repeat the previous item an exact number of times ca{3}t = {"caaat"}
{#,}} Repeat the previous item a min. number of times ca{2,}t = {"caat","caaat",...}
{#,#} Repeat the previous item between a min. and max. number of times ca{1,3}t = {"cat","caat","caaat"}

In addition, there are a few special character codes that come in handy:

Symbol Description
\d Equivalent to [0-9]: Matches any digit
\D Equivalent to [^0-9]:  Matches any non-digit
\s Matches white space character
\S Matches a non-white space character
\w Equivalent to [a-zA-Z0-9_]: Matches a "word" character
\W Equivalent to [^a-zA-Z0-9_]: Matches a non-word character

 

Section 5: Pre-Lab Exercises (4 marks)

Remember: whenever you have changed your regular expression or your test strings, you can click the CHECK button to update the results.

For many of these exercises, we have provided a test suite of strings that will help you to test and debug your regular expressions.  For the pre-lab (Section 5) exercises, if some your regular expression does not pass all of the test suite examples, you can move on and ask for help during the lab.

Exercise 1: Develop a regular expression for any binary string.

Exercise 1
Test Strings:
Test Suite: (Should be ACCEPTED)
REJECT: 0
REJECT: 1
REJECT: 00
REJECT: 10
REJECT: 111010101010111
Test Suite: (Should be REJECTED):
REJECT: cat
REJECT: 123
REJECT: he110, w0r1d

Exercise 2: Develop a regular expression for any binary string that represents an unsigned integer that is EVEN:

Exercise 2
Test Strings:
Test Suite: (Should be ACCEPTED)
REJECT: 0
REJECT: 10
REJECT: 1110
REJECT: 010
REJECT: 00011000110110
Test Suite: (Should be REJECTED):
REJECT: 1
REJECT: 011
REJECT: 0101
REJECT: 0001101

Exercise 3: Develop a regular expression for any binary string of EVEN length: (Hint: remember the (cat)+ example?)

Exercise 3
Test Strings:
Test Suite: (Should be ACCEPTED)
REJECT: 00
REJECT: 01
REJECT: 1010
REJECT: 0001
REJECT: 0001101100011011
Test Suite: (Should be REJECTED):
REJECT: 0
REJECT: 1
REJECT: 111
REJECT: 010
REJECT: 0001101

Exercise 4: Develop a regular expression for any binary string that contains the pattern 0110 OR the pattern 1001:

Exercise 4
Test Strings:
Test Suite: (Should be ACCEPTED)
REJECT: 0110
REJECT: 1001
REJECT: 01101001
REJECT: 011001
REJECT: 111101101111
REJECT: 0000010010000
Test Suite: (Should be REJECTED):
REJECT: 0
REJECT: 1
REJECT: 0101
REJECT: 1010
REJECT: 1100010001110

Exercise 5: Develop a regular expression for any binary string that contains the pattern 0110 AND the pattern 1001: (Hint: There are 4 cases, not 2)

Exercise 5
Test Strings:
Test Suite: (Should be ACCEPTED)
REJECT: 01101001
REJECT: 000110001001000
REJECT: 011001
REJECT: 100110
REJECT: 000011001000
Test Suite: (Should be REJECTED):
REJECT: 0
REJECT: 1
REJECT: 0110
REJECT: 1001
REJECT: 01101
REJECT: 10010
REJECT: 1100010001110

Exercise 6: Identify what types of strings the regular expression [A-Z][0-9][A-Z]\s[0-9][A-Z][0-9] represents, and give some test strings:

Exercise 6
Test Strings:

Exercise 7: Identify what types of strings the regular expression ([0-9]|[A-F])+ represents, and give some test strings:

Exercise 7
Test Strings:

Exercise 8: Identify what types of strings the regular expression (\(\d\d\d\)\s)?\d\d\d-\d\d\d\d represents, and give some test strings:

Exercise 8
Test Strings:

Exercise 9: Develop a regular expression that will accept the following three strings: {"pickup truck", "pick up truck", "pick-up truck"}:

Exercise 9
Test Strings:
Test Suite: (Should be ACCEPTED)
REJECT: pickup truck
REJECT: pick up truck
REJECT: pick-up truck
Test Suite: (Should be REJECTED):
REJECT: pickuptruck
REJECT: pick.up.truck
REJECT: pick- up truck
REJECT: pick -up truck
REJECT: volkswagon

Exercise 10: Develop a regular expression that will accept three or four words:

Exercise 10
Test Strings:
Test Suite: (Should be ACCEPTED)
REJECT: one two three
REJECT: one two three four
REJECT: 1 2 3 4
REJECT: how many words?
Test Suite: (Should be REJECTED):
REJECT: word
REJECT: word word
REJECT: word word word word word

Exercise 11: Develop a regular expression that will accept the word cat and the word hat with at most 2 other words between them:

Exercise 11
Test Strings:
Test Suite: (Should be ACCEPTED)
REJECT: cat hat
REJECT: cat in hat
REJECT: cat in the hat
Test Suite: (Should be REJECTED):
REJECT: hat cat
REJECT: cathat
REJECT: catinthehat
REJECT: cat in the phat hat
REJECT: cat in the super phat hat

Exercise 12: Develop a regular expression that will accept a properly formatted 24-hour time (0:00 ... 23:59):

Exercise 12
Test Strings:
Test Suite: (Should be ACCEPTED)
REJECT: 12:34
REJECT: 0:00
REJECT: 23:59
REJECT: 9:00
REJECT: 10:00
REJECT: 11:11
REJECT: 15:15
Test Suite: (Should be REJECTED):
REJECT: 24:00
REJECT: 12:32 pm
REJECT: 0:60
REJECT: 9:99
REJECT: 04:00
REJECT: 4
REJECT: -4:00

Section 6: Lab Exercises

Exercise 13: (2 marks) Gene Identification

DNA sequences are comprised of a simple 4-alphabet language with the symbols {A,C,G,T}. Three consecutive letters are known as a codon, so ACT and TCG are both codons.  A Gene is a collection of at least three codons that starts with an ATG codon and ends with a TAA, TAG, or TGA codon.

You need to develop a regular expression that will match strings that contain a gene.

Exercise 13
Test Strings:
Test Suite: (Should be ACCEPTED)
REJECT: ATGCCCTAA
REJECT: ATGCCCTAG
REJECT: ATGCCCTGA
REJECT: CATGCCCTAA
REJECT: CATGCCCTAG
REJECT: CATGCCCTGA
REJECT: CATGCCCTAAC
REJECT: CATGCCCTAGC
REJECT: CATGCCCTGAT
REJECT: TCATGCCCTGACC
REJECT: TTATGCCCGGGTGACC
REJECT: AAACTCATGCCCGGGCCCTGACCTTAA
REJECT: ATGATGATGTAA
REJECT: ATGAAAAACAAGAATTAA
REJECT: ATGACAACCACGACTTAA
REJECT: ATGAGAAGCAGGAGTTAA
REJECT: ATGATAATCATGATTTAA
REJECT: ATGCAACACCAGCATTAA
REJECT: ATGCCACCCCCGCCTTAA
REJECT: ATGCGACGCCGGCGTTAA
REJECT: ATGCTACTCCTGCTTTAA
REJECT: ATGGAAGACGAGGATTAA
REJECT: ATGGCAGCCGCGGCTTAA
REJECT: ATGGGAGGCGGGGGTTAA
REJECT: ATGGTAGTCGTGGTTTAA
REJECT: ATGTACTATTCATCCTCGTCTTGCTGGTGTTTATTCTTGTTTTAA
Test Suite: (Should be REJECTED):
REJECT: GATTACA
REJECT: ATGTAA
REJECT: ATGTAG
REJECT: ATGTGA
REJECT: ATGCCCCTAG
REJECT: ATGCCCCCTAG
REJECT: CCCATGCCCCTAGCCC
REJECT: CCCATGCCCCCTAGCCC

Exercise 14: (2 marks) Money Matters

Develop a regular expression that will accept a dollar amount. For example, $3.56 and $1,000,000 are valid amounts, whereas $5.321 and $5,29,40 are not.

Exercise 14
Test Strings:
Test Suite: (Should be ACCEPTED)
REJECT: $0
REJECT: $0.00
REJECT: $0.99
REJECT: $4
REJECT: $4.00
REJECT: $10
REJECT: $10.00
REJECT: $1000
REJECT: $1000.00
REJECT: $1,000
REJECT: $1,000.00
REJECT: $8,888,888,888,888.88
REJECT: $88,888,888,888,888.88
REJECT: $888,888,888,888,888.88
REJECT: $1234567.89
Test Suite: (Should be REJECTED):
REJECT: $-0
REJECT: $ 0
REJECT: $1.9
REJECT: $.99
REJECT: bannana
REJECT: $,333.33
REJECT: $12,34
REJECT: $22,333,22,333.22
REJECT: $$$
REJECT: $$$0
REJECT: 3$

Exercise 15: (2 marks) Testing Regular Expressions

Your co-worker has been assigned the task of writing a regular expression that tests to see if an email address is valid:

Identify 5 unique mistakes in the regular expression by creating test cases that are incorrectly rejected or accepted:

Exercise 15
Test Strings:
ACCEPT: davet@cs.ubc.ca

Exercise 16: (2 marks) Develop a regular expression for a binary string that has an even number of 0's:

Exercise 16
Test Strings:
Test Suite: (Should be ACCEPTED)
REJECT: 1
REJECT: 11
REJECT: 111
REJECT: 1111
REJECT: 100
REJECT: 001
REJECT: 010
REJECT: 1001
REJECT: 100001
REJECT: 1001001
REJECT: 1010
REJECT: 101101
REJECT: 111101110111
REJECT: 00
REJECT: 0000
REJECT: 000000
REJECT: 1111011111110111110111110111
REJECT: 1101110111100111101110111111
Test Suite: (Should be REJECTED):
REJECT: 0
REJECT: 000
REJECT: 00000
REJECT: 10
REJECT: 01
REJECT: 110
REJECT: 011
REJECT: 0001
REJECT: 1000
REJECT: 10001
REJECT: 101001
REJECT: 1010101
REJECT: 11111111110
REJECT: 01111111111
REJECT: 11110111111

Exercise 17: (2 marks) Develop a regular expression for a binary string that has no consecutive 0's and no consecutive 1's:

Exercise 17
Test Strings:
Test Suite: (Should be ACCEPTED)
REJECT: 1
REJECT: 0
REJECT: 10
REJECT: 01
REJECT: 101
REJECT: 010
REJECT: 1010
REJECT: 0101
REJECT: 10101
REJECT: 01010
REJECT: 101010
REJECT: 010101
Test Suite: (Should be REJECTED):
REJECT: 00
REJECT: 11
REJECT: 001
REJECT: 100
REJECT: 110
REJECT: 011
REJECT: 1001
REJECT: 0110
REJECT: 01010100101010
REJECT: 10101011010101

Exercise 18: (2 marks) Searching

You are in the market to buy a red pick-up truck, and you wish to develop an automated web searching program (a spider) to search daily through various online newsgroups and classified ad websites to find text containing the word red and the phrase pick-up truck close to each other, followed by a price. Specifically, you should match the words red and the phrase (pickup/pick-up/pick up) truck separated by at most two other words in between. The pick-up truck phrase could appear before or after the word red. After the words red and the phrase pick-up truck, the text should also contain a price, and you should be able to simply re-use your price regular expression you developed previously.

Exercise 18
Test Strings:
Test Suite: (Should be ACCEPTED)
REJECT: red pickup truck $5000
REJECT: red pickup truck $5,000
REJECT: red pickup truck $1,234.56
REJECT: red pick-up truck $5000
REJECT: red pick up truck $5000
REJECT: red toyota pick-up truck $5000
REJECT: red toyota 1993 pick-up truck $5000
REJECT: blah blah red toyota 1993 pick-up truck blah blah $5000 blah
REJECT: pickup truck red $5000
REJECT: pick-up truck 1993 toyota red $5000
REJECT: blah blah blah pick-up truck toyota 1993 red blah blah blah $5000
REJECT: desperate: red 1993 toyota pickup truck for sale. $2,000 o.b.o.
REJECT: toy pickup truck - cherry red: $12.
REJECT: red red pickup pickup truck truck $5000.
Test Suite: (Should be REJECTED):
REJECT: red
REJECT: truck
REJECT: pickup truck
REJECT: red pickup truck
REJECT: red $5000
REJECT: pickup truck $5000
REJECT: red truck $5000
REJECT: $5000 red pickup truck
REJECT: blue pickup truck $5000
REJECT: red car $5000
REJECT: red toyota 1993 pick-up truck
REJECT: red 1993 toyota automatic pick-up truck $5000
REJECT: fred's pick-up truck sold for $5000
REJECT: pick-up trucks by fred: $5000
REJECT: reddy for sale pickup truck: $5000)

Section 7: Challenge Exercise

Exercise 19: (Bonus 2 marks) Develop a regular expression for a binary string that represents an unsigned value that is divisible by 3:

Exercise 19
Test Strings:
Test Suite: (Should be ACCEPTED)
REJECT: 11
REJECT: 110
REJECT: 1001
REJECT: 1100
REJECT: 10101
REJECT: 11011
REJECT: 100001
REJECT: 111111
REJECT: 1110101
REJECT: 101010110
REJECT: 000111000101
Test Suite: (Should be REJECTED):
REJECT: 1
REJECT: 10
REJECT: 100
REJECT: 111
REJECT: 110001
REJECT: 1000100
REJECT: 10010010
REJECT: 10101100
REJECT: 11100101
REJECT: 11011101
REJECT: 110110111