String Processing

Transliteration

In this lesson the student will learn how to:
  1. Use the tr// operator
  2. List the nucleotides used in DNA and RNA
  3. Convert lowercase to uppercase and vice versa

By the end of this lesson the student will be able to:

   Write a Perl script which transcribes a DNA sequence to
   a complementary RNA sequence using the tr// operator.

Transcription is not nearly as simple as the last lesson made it seem. This unit will not address the mechanics of how DNA is read or how the transcription machinery locates genes. We will, however, talk a little about how genes are organized.

Often a full gene is broken into several sections with long stretches of other sequence in between. The sections that are spliced together to make the final transcript are called exons, and the parts that are cut out and thrown away are called introns. Remember that humans have twenty-three pairs of chromosomes and that each chromosome is a huge mass of double-stranded DNA. Many genes are located on each chromosome. The human genome holds roughly 20,000 to 30,000 genes, depending on the estimate, spread over 23 pairs of chromosomes made of over 3 billion nucleotide pairs, so a lot of information is stored on each chromosome. Even so, less than ten percent of the genome actually codes for proteins. The rest of the nucleotides are often referred to as junk DNA.

Recall that RNA is transcribed from DNA and that one important difference between RNA and DNA is that RNA uses uracil (U) instead of thymine (T) which is used by DNA. Also recall that the following are complementary strands of DNA:


     ATGGCAGAGTTAGA
     TACCGTCTCAATCT

but that an RNA strand which is complementary to the top strand would look like this:

     UACCGUCUCAAUCU

Thus the RNA transcribed from a DNA strand is complementary to that strand, with U's used in place of T's.

Perl often provides many ways to solve a single problem. Take the following example for instance:

#!/usr/bin/perl
$str = "ATTCAGAGCACCTAGGACCACGTCACTAGCACCATAGAGCGTAATAAA";
print "$str\n";
$str =~ tr/ATCG/UAGC/;
print "$str\n";

This script is a lot shorter than the one from the last assignment, but it accomplishes the exact same task. In many ways it's also clearer since you don't have to look through as many lines to find out what's going on.
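
As a check, the same operator can be applied to the short 14-base strand shown earlier in this lesson. The following is only a minimal sketch: the variable names ($dna, $dna_complement, $rna) are chosen for illustration, and the strand is copied into new variables before tr// is applied so that the original is left unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Top strand from the example earlier in this lesson.
my $dna = "ATGGCAGAGTTAGA";

# Complementary DNA strand: A pairs with T and C pairs with G.
(my $dna_complement = $dna) =~ tr/ATCG/TAGC/;

# Complementary RNA strand: same pairing, but U is used instead of T.
(my $rna = $dna) =~ tr/ATCG/UAGC/;

print "$dna\n";
print "$dna_complement\n";
print "$rna\n";
```

The only difference between the two tr// lines is the first character of the second list: T for the DNA complement and U for the RNA complement.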

The important thing to know about this script is that tr// is the transliteration operator. It replaces each character in the first list with the character at the same position in the second list, so in the script above every A becomes U, every T becomes A, every C becomes G and every G becomes C. Here's another example, which changes all lowercase letters to capitals.

#!/usr/bin/perl
$sample = "There was once a boy named Matt who lived in Lone Pine.";
print "$sample\n";
$sample =~ tr/a-z/A-Z/;
print "$sample\n";
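
Going the other way should only require swapping the two ranges. The sketch below reuses the sentence from the script above; the variable names $lower and $swapped are illustrative, and the second tr// is one possible way to flip the case of every letter in a single pass.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sample = "There was once a boy named Matt who lived in Lone Pine.";

# Uppercase to lowercase: the same two ranges as before, in the opposite order.
(my $lower = $sample) =~ tr/A-Z/a-z/;

# Swap the case of every letter: lowercase becomes uppercase and vice versa.
(my $swapped = $sample) =~ tr/a-zA-Z/A-Za-z/;

print "$lower\n";
print "$swapped\n";
```

Characters that appear in neither list, such as spaces and the period, pass through unchanged.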

ASSIGNMENT:

Use the tr// operator to help you write two scripts called encode.pl and decode.pl. The encode.pl script will take a message as input and output an encrypted version of the message. The decode.pl script will take the encrypted version of the message as input and output the original message. The character substitution scheme you choose is up to you, but you must implement it using the tr// operator (which is actually the easiest way to perform this kind of task). It would be a good idea to change any lowercase letters to uppercase before encoding or decoding, or you can come up with a better way to deal with uppercase and lowercase letters.
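
To see how one tr// can undo another, here is a rough sketch of a round trip in a single script. Everything specific in it is an assumption made for illustration: the message is made up, and the scheme (shifting each letter three places) is just one possibility, not the scheme you are expected to use. Your own scripts still need to be split into encode.pl and decode.pl and read the message as input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical message, used only for illustration.
my $message = "Meet me at noon";

# Change lowercase letters to uppercase first.
(my $plain = $message) =~ tr/a-z/A-Z/;

# Encode: shift each letter three places (A->D, B->E, ..., X->A, Y->B, Z->C).
(my $encoded = $plain) =~ tr/A-Z/D-ZA-C/;

# Decode: the same two lists in the opposite order undo the shift.
(my $decoded = $encoded) =~ tr/D-ZA-C/A-Z/;

print "$plain\n";
print "$encoded\n";
print "$decoded\n";
```

Decoding works here because every character in the first list maps to a different character in the second list, so no two letters are ever encoded the same way.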