String Processing

IUB Ambiguity Codes

In this lesson the student will learn how to:
  1. Write regular expressions for IUB ambiguity codes
  2. Specify the portion of a sequence before and after a pattern match
By the end of this lesson the student will be able to:

	Write a script which allows the user to search for
	a pattern specified using IUB ambiguity codes.

IUB Ambiguity Codes

First of all, IUB stands for International Union of Biochemistry. (The organization's full name is now IUBMB, the International Union of Biochemistry and Molecular Biology.) The IUBMB is the organization responsible for the IUB ambiguity codes, which are standardized one-letter symbols that each stand for a set of possible nucleotides at a position in a sequence. The standard symbols are shown below:


                        R = G or A
                        Y = C or T
                        M = A or C
                        K = G or T
                        S = G or C
                        W = A or T
                        B = not A (C or G or T)
                        D = not C (A or G or T)
                        H = not G (A or C or T)
                        V = not T (A or C or G)
                        N = A or C or G or T

So, if we encounter a pattern like this:

	AGATCVWNNKAGATC

then any of the following sequences (among many others) matches it:

	AGATCAAGTGAGATC
	AGATCCTCATAGATC
	AGATCGTAGGAGATC
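
The sequences listed above are only a few of the possibilities. As a rough sketch of how many concrete sequences one ambiguous pattern covers, the script below multiplies the number of choices at each position. The helper name count_matches and the %CHOICES table are made up for this illustration and are not part of the lesson's own script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Number of concrete bases each symbol can stand for.
my %CHOICES = (
    A => 1, C => 1, G => 1, T => 1,
    R => 2, Y => 2, M => 2, K => 2, S => 2, W => 2,
    B => 3, D => 3, H => 3, V => 3,
    N => 4,
);

# count_matches is a hypothetical helper name, not part of the lesson.
# It multiplies the number of choices at each position of the pattern.
sub count_matches {
    my ($pattern) = @_;
    my $total = 1;
    foreach my $symbol (split //, uc $pattern) {
        die "Unknown symbol: $symbol\n" unless exists $CHOICES{$symbol};
        $total *= $CHOICES{$symbol};
    }
    return $total;
}

# V, W, N, N and K contribute 3 * 2 * 4 * 4 * 2 = 192 combinations.
print count_matches("AGATCVWNNKAGATC"), "\n";
```

For the pattern AGATCVWNNKAGATC this prints 192, since the fixed bases A, C, G and T each contribute a factor of 1.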

Regular Expressions Matching IUB Ambiguity Codes

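Suppose that we want to find a match for AGNVWCCT. How would we write a regular expression for this? Each ambiguity code becomes a character class listing the bases it stands for, so the pattern becomes AG[ACGT][ACG][AT]CCT. The following script performs that translation and then searches a sequence: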

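	#!/usr/bin/perl
	use strict;
	use warnings;

	# The search pattern, written with IUB ambiguity codes
	my $input = "AGNVWCCT";

	# Each ambiguity code and the character class it stands for
	my %IUB = (
		"R" => "[GA]",
		"Y" => "[CT]",
		"M" => "[AC]",
		"K" => "[GT]",
		"S" => "[GC]",
		"W" => "[AT]",
		"B" => "[CGT]",
		"D" => "[AGT]",
		"H" => "[ACT]",
		"V" => "[ACG]",
		"N" => "[ACGT]"
	);

	# Replace every ambiguity code in the pattern with its character class
	foreach my $item (keys %IUB) {
		$input =~ s/$item/$IUB{$item}/g;
	}

	my $sequence = "ACGACGAAGACTCCTAGAGACCACT";

	# $`, $& and $' are only meaningful after a successful match
	if ($sequence =~ /$input/) {
		print "BEFORE: $`, MATCH: $&, AFTER: $'\n";
	} else {
		print "No match found\n";
	}
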
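Notice the use of the special variables for the matched text ($&), the text before the match ($`), and the text after the match ($'). The after-match variable is written with a single quote, while the before-match variable is written with a backtick, the key to the left of the 1 key on most US keyboards. These variables are set only when a match succeeds; after a failed match they still hold whatever the last successful match left in them.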
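
As an alternative sketch, the same three pieces can be recovered from the match offsets that Perl stores in the built-in arrays @- and @+, which avoids the punctuation variables altogether (in older versions of Perl, using $&, $` or $' anywhere could slow down every match in the program). The helper names iub_to_regex and split_on_match are made up for this illustration. The translation is also done here in a single substitution rather than one pass per code, which is a design choice of this sketch and not the lesson's method.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %IUB = (
    R => "[GA]",  Y => "[CT]",  M => "[AC]",  K => "[GT]",
    S => "[GC]",  W => "[AT]",  B => "[CGT]", D => "[AGT]",
    H => "[ACT]", V => "[ACG]", N => "[ACGT]",
);

# iub_to_regex is a hypothetical helper name. It translates every
# ambiguity code in a single pass and leaves A, C, G and T untouched.
sub iub_to_regex {
    my ($pattern) = @_;
    (my $regex = uc $pattern) =~ s/([RYMKSWBDHVN])/$IUB{$1}/g;
    return $regex;
}

# split_on_match is also hypothetical. It returns the text before the
# first match, the match itself, and the text after it, or an empty
# list when the pattern does not occur in the sequence.
sub split_on_match {
    my ($pattern, $sequence) = @_;
    my $regex = iub_to_regex($pattern);
    return () unless $sequence =~ /$regex/;
    my ($start, $end) = ($-[0], $+[0]);   # offsets of the whole match
    return (
        substr($sequence, 0, $start),
        substr($sequence, $start, $end - $start),
        substr($sequence, $end),
    );
}

my ($before, $match, $after) =
    split_on_match("AGNVWCCT", "ACGACGAAGACTCCTAGAGACCACT");
print "BEFORE: $before, MATCH: $match, AFTER: $after\n";
```

With the pattern and sequence from the lesson, this prints BEFORE: ACGACGA, MATCH: AGACTCCT, AFTER: AGAGACCACT. Because split_on_match returns an empty list on failure, a caller can test the result before printing anything.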

ASSIGNMENT:

Write a script which allows the user to enter a search pattern (written with the IUB ambiguity codes) and a sequence to search with that pattern. Report the part of the sequence which comes before the pattern match, the actual matched portion of the sequence, and the part that comes after the match. Make a loop in your script which keeps prompting for a search pattern and a sequence until the user enters "q" to quit. Ensure that the pattern and sequence are remembered between loop iterations unless the user enters new ones. (This way the user can reuse one pattern for several sequences, or vice versa.)