String Processing

Substitution

In this lesson the student will learn how to:
  1. Use the s// operator
  2. Describe the difference between DNA and RNA in terms of nucleotide content
  3. Explain the process of transcription in general terms
By the end of this lesson the student will be able to:

  Write a Perl script which transcribes a DNA sequence to
  a complementary RNA sequence.

RNA

The DNA in your chromosomes encodes nearly all the information needed to build your body. A small amount of additional information is stored in the mitochondria, but we will not worry about that here. Sequences of DNA in the chromosomes that encode proteins are called genes. The organization of genes can be complex, but we will set those complexities aside. Genes contain the information the body needs to make proteins. The first step in using this information to create a protein is called transcription. During transcription, an RNA sequence complementary to the DNA sequence of the gene is created.

There is one important difference between RNA and DNA that we will focus on in this lesson. RNA is composed of the nucleotides adenine (A), uracil (U), cytosine (C), and guanine (G). As you will recall, DNA is composed of the same nucleotides, except that it uses thymine (T) instead of uracil. So, in RNA we will see U's where DNA has T's. For instance, while a sequence of DNA might look like this:


   CGATTACCGAGCCTA

the same sequence written as RNA, with U in place of T, would look like this:

   CGAUUACCGAGCCUA

Transcription is the process in which DNA is copied into RNA. The RNA will be complementary to the DNA, and wherever a T nucleotide would appear, the RNA has a U nucleotide instead.

The student should realize that there are many types of RNA. Our discussion here pertains to mRNA (messenger RNA). We will also be discussing tRNA (transfer RNA) in another lesson. There are still more types of RNA which are involved in the whole process of making proteins out of genes.

Here's a short Perl script which makes simple substitutions:

#!/usr/bin/perl
$str = "TTAAGGCCATGCGGATACACGATGAC";
print "ORIGINAL: $str\n";
$str =~ s/T/U/g;
print "AFTER SUBSTITUTION: $str\n";

The second-to-last line, $str =~ s/T/U/g;, changes every T in the sequence to a U. The g modifier makes the substitution global; without it, only the first T would be replaced.
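To see what the g modifier contributes, the following minimal sketch runs the same substitution with and without it on copies of a short sequence. The shortened sequence and the variable names ($dna, $first_only, $all) are illustrative choices, and the use of "use strict", "use warnings", and "my" is an addition that the scripts in this lesson do not require.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative sequence (a shortened version of the one used above).
my $dna = "TTAAGGCCAT";

# Without the g modifier, only the first T is replaced.
my $first_only = $dna;
$first_only =~ s/T/U/;

# With the g modifier, every T is replaced.
my $all = $dna;
$all =~ s/T/U/g;

print "ORIGINAL:   $dna\n";         # TTAAGGCCAT
print "WITHOUT /g: $first_only\n";  # UTAAGGCCAT
print "WITH /g:    $all\n";         # UUAAGGCCAU
```

Working on copies leaves $dna untouched, which makes it easy to compare the two results side by side.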

Here's another, not biological, example:

#!/usr/bin/perl
$str = "Fred went down to the store to buy some food.";
print "$str\n";
$str =~ s/Fred/Mary/g;
print "$str\n";
$str =~ s/down/up/g;
print "$str\n";
$str =~ s/buy/find/g;
print "$str\n";
$str =~ s/store/river/g;
print "$str\n";
$str =~ s/food/wood/g;
print "$str\n";
$str =~ s/went/walked/g;
print "$str\n";

As you can see, not only can we substitute a single character for a single character, we can also substitute words for words.
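One caveat when substituting words: the pattern matches any run of characters, not only whole words. The sketch below is a hypothetical extension of the example above and goes slightly beyond what this lesson covers. It replaces "to" with "at", first as a plain pattern and then with the \b word-boundary anchor, so that the "to" inside "store" is left alone. The variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "Fred went down to the store to buy some food.";

# A plain pattern also matches the "to" inside "store".
my $plain = $str;
$plain =~ s/to/at/g;

# \b matches at a word boundary, so only the standalone word "to" changes.
my $bounded = $str;
$bounded =~ s/\bto\b/at/g;

print "$plain\n";    # Fred went down at the satre at buy some food.
print "$bounded\n";  # Fred went down at the store at buy some food.
```

This does not matter for single-letter nucleotide substitutions, but it is worth keeping in mind whenever the pattern could occur inside a longer word.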

ASSIGNMENT:

Your job is to create an RNA sequence based on a DNA template. In other words, you will simulate the process of transcription. For example, if you had this strand of DNA:


   TGACCGATAGATACCAGT

you would produce this strand of RNA as your output:

   ACUGGCUAUCUAUGGUCA

The easiest way to produce this output is to first create a complement to the original strand of DNA (covered in the last lesson) and then to substitute a U for each T in the resulting string.
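One possible way to put the two steps together is sketched below. It assumes the complement is built with Perl's tr/// operator, which may not be the method used in the previous lesson, and the variable names are illustrative. Treat it as a starting point to compare against your own solution rather than as the required answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The sample DNA strand from the assignment.
my $dna = "TGACCGATAGATACCAGT";

# Step 1: build the complement (A<->T, C<->G).
# Assumption: tr/// is used here; it swaps all four letters in one pass.
my $rna = $dna;
$rna =~ tr/ACGT/TGCA/;

# Step 2: replace every T in the complement with U.
$rna =~ s/T/U/g;

print "DNA: $dna\n";  # TGACCGATAGATACCAGT
print "RNA: $rna\n";  # ACUGGCUAUCUAUGGUCA
```

The order of the steps matters: if the T-to-U substitution ran first, the complement step would no longer find the T's it needs to turn into A's.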