String Processing

More Regular Expressions

In this lesson the student will learn how to:
  1. Incorporate quantification symbols into regular expressions
  2. Understand how the {}, *, +, and ? symbols are used to specify quantification
By the end of this lesson the student will be able to:

     Write a script which will count the number of repeats
     of a pattern within a sequence.

Telomeres

At the ends of eukaryotic chromosomes are telomeres. In humans, telomeres consist of roughly 10,000 nucleotides of the TTAGGG motif. That is, TTAGGG is repeated over and over, hundreds to thousands of times, at the ends of chromosomes. Telomeres are important since each time a cell divides and each chromosome is replicated, some of the nucleotides at each end of each chromosome are lost. When the telomeres become too short the cell stops dividing, a state known as replicative senescence.

An enzyme called telomerase can repair the telomeres, and cells which produce telomerase can keep dividing indefinitely (they are effectively immortal). This may, at first, sound like a good thing, but according to some researchers telomere shortening provides a mechanism to safeguard against uncontrolled cell proliferation (and thus protects us from cancer). Cancer cells often replicate in an out-of-control manner because they inappropriately express the telomerase enzyme. By the way, the study of cancer is called oncology.

Another important part of a chromosome is the centromere, which is located near the middle of human chromosomes. Centromeres are usually a little off center, and for this reason chromosomes are said to have a short and a long arm. In fact, the location of genes is specified in terms of which arm they are on. The short arm is annotated with a p (for petit) and the long arm is annotated with a q (which is the next letter after p). For instance, the KRTHB1 gene (which codes for one type of keratin) is located at 12q13 (long arm of chromosome 12). The MTNR1B gene (which codes for one type of melatonin receptor) is located at 11q21 (long arm of chromosome 11). The CD44 gene (which codes for the CD44 receptor) is located at 11p13 (short arm of chromosome 11). The number after the arm letter identifies a band on that arm; bands are numbered outward from the centromere, so a larger number means a position farther from the centromere.

More Regular Expressions

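    #!/usr/bin/perl
    # Test one line of input against four quantified patterns.
    print "ENTER: ";
    $s = <STDIN>;

    # C, then four to six A's, then C
    if ($s =~ /CA{4,6}C/) {
        print "C and A sandwich\n";
    }
    # G, then exactly five T's, then G
    if ($s =~ /GT{5}G/) {
        print "G and T sandwich\n";
    }
    # A, then exactly four of C, G, or T, then A
    if ($s =~ /A[CGT]{4}A/) {
        print "A and anything sandwich\n";
    }
    # five A's followed directly by five C's
    if ($s =~ /A{5}C{5}/) {
        print "five A's and five C's\n";
    }
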
Experiment with the script shown above. At the very least try the following as input:

  CAAAC
  CAAAAC
  CAAAAAC
  CAAAAAAC
  CAAAAAAAC
  GTTTTG
  GTTTTTG
  GTTTTTTTG
  ACGCGCGA
  ACCCCA
  AAAAACCCCC
  AAAAAAAACCCCCCCCCC
  AAAACCCC

The {n,m} construct is called a quantifier since it specifies how many times the element just before it may repeat. For instance, A{3} matches exactly three A's. A{3,5} matches from three to five A's. A{3,} matches three or more A's. None of the patterns in the script are anchored, so they can match anywhere inside the input; this is why AAAAAAAACCCCCCCCCC still passes the A{5}C{5} test. Other quantifiers include +, *, and ?. The + matches one or more repeats of a pattern. The * matches 0 or more. The ? matches 0 or 1.
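
A quantifier applies only to the single element immediately before it, so repeating a multi-letter unit requires grouping that unit in parentheses first. The following is a minimal sketch of the difference; the test strings and the helper name report are made up for illustration and are not part of the lesson's scripts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns "match" or "no match" for a string and pattern.
sub report {
    my ($s, $re) = @_;
    return $s =~ $re ? "match" : "no match";
}

# CA{3} means one C followed by three A's: the {3} binds to the A only.
print "CAAA   vs /CA{3}/     : ", report("CAAA",   qr/CA{3}/),     "\n";
print "CACACA vs /CA{3}/     : ", report("CACACA", qr/CA{3}/),     "\n";

# (?:CA){3} means the two-letter unit CA repeated three times in a row.
print "CACACA vs /(?:CA){3}/ : ", report("CACACA", qr/(?:CA){3}/), "\n";
print "CAAA   vs /(?:CA){3}/ : ", report("CAAA",   qr/(?:CA){3}/), "\n";
```

The (?: ... ) form groups without capturing; plain parentheses would group the unit just as well.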
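
    #!/usr/bin/perl
    # Compare the +, ?, and * quantifiers on the same base pattern.
    print "ENTER: ";
    $s = <STDIN>;
    chomp($s);

    # one or more of each letter around a single G (case-insensitive)
    if ($s =~ /A+C+GA+C+/i) {
        print "A PLUS\n";
    }
    # zero or one of each letter around a single G
    if ($s =~ /A?C?GA?C?/i) {
        print "A QUESTION MARK\n";
    }
    # zero or more of each letter around a single G
    if ($s =~ /A*C*GA*C*/i) {
        print "A STAR\n";
    }
    # the literal string ACGAC, no quantifiers
    if ($s =~ /ACGAC/i) {
        print "no symbols\n";
    }
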
Experiment with the script shown above. At the very least try the following inputs:

  g
  acgac
  acac
  aaccgaacc
  aaccgggaacc
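
To check your predictions for all of the inputs in one run instead of typing them one at a time, the four patterns can be placed in a list and tried in a loop. This is only a sketch of one way to do that; the helper name matching_labels and the table layout are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The four patterns from the script above, paired with their labels.
my @tests = (
    [ "A PLUS",          qr/A+C+GA+C+/i ],
    [ "A QUESTION MARK", qr/A?C?GA?C?/i ],
    [ "A STAR",          qr/A*C*GA*C*/i ],
    [ "no symbols",      qr/ACGAC/i     ],
);

# Hypothetical helper: returns the label of every pattern that matches $s.
sub matching_labels {
    my ($s) = @_;
    return map { $_->[0] } grep { $s =~ $_->[1] } @tests;
}

# Print each input next to the labels of the patterns it satisfies.
for my $input (qw(g acgac acac aaccgaacc aaccgggaacc)) {
    my @labels = matching_labels($input);
    printf "%-12s %s\n", $input, @labels ? join(", ", @labels) : "(none)";
}
```

Because * and ? both allow zero repeats, an input as short as g already satisfies two of the four patterns.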

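You can also count the number of occurrences of a particular pattern. With the g modifier, a match in list context returns every match it finds, and an array used in scalar context (here, in string concatenation) gives its number of elements:

    #!/usr/bin/perl
    $p = "the quick brown fox the the thethe train";
    # /g in list context returns all matches of "the"
    @n = $p =~ /the/g;
    # @n in scalar context is the number of matches
    print "NUMBER OF THE's: " . @n . "\n";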
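
When only the count is needed, the intermediate array can be skipped. A common Perl idiom assigns the matches to an empty list and reads that assignment in scalar context, which yields the number of matches. The sketch below wraps the idiom in a helper; the name count_matches is hypothetical, and the helper assumes the pattern contains no capturing parentheses (with captures, the count would be of captured groups instead).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: counts non-overlapping matches of $re in $text.
sub count_matches {
    my ($text, $re) = @_;
    # A list assignment read in scalar context gives the number of matches.
    my $count = () = $text =~ /$re/g;
    return $count;
}

my $p = "the quick brown fox the the thethe train";
print "NUMBER OF THE's: ", count_matches($p, qr/the/), "\n";

# Matches found with /g do not overlap: AAAA holds two AA's, not three.
print "NUMBER OF AA's: ", count_matches("AAAA", qr/AA/), "\n";
```

Each match resumes where the previous one ended, so overlapping occurrences are not counted twice.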

ASSIGNMENT:

Write a script which checks for repetitions of the TTAGGG motif found in an input sequence. Make sure that any sequence containing characters other than the four standard nucleotides (A, C, G, and T) gets rejected by the script.