String Processing

Strings and DNA

In this lesson the student will learn how to:
  1. Represent a DNA sequence using a string
  2. Create complementary strings of nucleotides
  3. Use substr to inspect a single character in a string
  4. Concatenate strings using the dot operator
By the end of this lesson the student will be able to:

    Write a short Perl script which creates a 
    complementary sequence for an input DNA sequence.

Perl is often expanded as "Practical Extraction and Report Language," as you may recall from the first unit of this course. In this unit we will learn how to use Perl to process strings and analyze biological sequences. We will investigate three types of biological sequences: DNA, RNA, and polypeptide. In this lesson we will discuss only DNA sequences.

Strands of DNA

Your genome is composed of over 3 billion pairs of nucleotides and is organized into 23 pairs of chromosomes. Each chromosome consists of two complementary strands of DNA. Your genes are located on these strands of DNA. DNA is composed of four nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). Adenine always pairs with thymine. Cytosine always pairs with guanine. Two short complementary strand segments might look like this:

    AATGGCCGATAGAGGGATTTACC
    TTACCGGCTATCTCCCTAAATGG

You will notice that A's are always paired with T's and that C's are always paired with G's. This is a fairly simple arrangement.

Although there are many details about chromosomes, DNA, and genomes which concern the biologist, for now we will concern ourselves only with the nucleotide sequence.

The following Perl script will help us get started analyzing a strand of DNA.

    #!/usr/bin/perl
    $s1 = "GTAACGAGTTAACACTACGACCCGGCTTACAATGATGCCG";
    $len = length $s1;
    print $s1 . "\n";
    print "LENGTH: $len\n";

The length of a strand of DNA is about the most basic thing you might want to know about it. (Make sure you notice the use of the length function here to find out the length of the string.)

Another interesting thing to know is the number of each type of nucleotide. Here's a Perl script for that:

    #!/usr/bin/perl
    $s1 = "GTAACGAGTTAACACTACGACCCGGCTTACAATGATGCCG";
    $len = length $s1;

    # One counter for each nucleotide
    $a=0; $c=0; $g=0; $t=0;

    # Inspect each character in turn and increment the matching counter
    for($i=0; $i<$len; $i++){
      if(substr($s1,$i,1) eq "A"){
        $a++;
      }
      elsif(substr($s1,$i,1) eq "T"){
        $t++;
      }
      elsif(substr($s1,$i,1) eq "C"){
        $c++;
      }
      elsif(substr($s1,$i,1) eq "G"){
        $g++;
      }
      else{
        # Anything other than A, T, C, or G is not a nucleotide
        print "BAD CHARACTER\n";
      }
    }
    print $s1 . "\n";
    print "A: $a, C: $c, G: $g, T: $t\n";
    print "LENGTH: $len\n";

You should understand three things about this script:
  1. for loop - Make sure you understand how every nucleotide in the sequence is inspected, one at a time.
  2. if-elsif-else - Make sure you understand how the counter for each nucleotide is selected. Recall that only one branch of an if-elsif-else construct can be selected. In this case the construct specifies five branches: one for each of the four nucleotides, plus an else for any other character.
  3. substr - The substr function takes three arguments: a string, a starting point (counted from zero), and a length. In the usage shown in this example, the length is always one character since we are inspecting the nucleotides one at a time.
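
As a small illustration of the three arguments to substr, the sketch below pulls pieces out of a short made-up sequence. The sequence and the variable names ($seq, $first, $last, $piece) are invented for this example only. Note that the starting point counts from zero, so the last character sits at position length - 1.

```perl
#!/usr/bin/perl
# Sketch only: $seq and the other variable names are invented for this example.
$seq = "GATTACA";

$first = substr($seq, 0, 1);                 # starting point 0 is the first character
$last  = substr($seq, length($seq) - 1, 1);  # the last character is at length - 1
$piece = substr($seq, 1, 3);                 # three characters starting at position 1

print "FIRST: $first\n";   # prints FIRST: G
print "LAST: $last\n";     # prints LAST: A
print "PIECE: $piece\n";   # prints PIECE: ATT
```

Changing the third argument changes how many characters come back; the counting script uses a length of 1 because it looks at one nucleotide per pass through the loop.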

Here's a quick Perl script which illustrates concatenation (adding strings to strings).

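    #!/usr/bin/perl
    $str = "one";
    $str .= " two";
    $str .= " three";
    print $str ."\n";

The period in front of each equals sign (.=) shows one way to use the dot operator: it appends the string on the right to the variable on the left. The plain dot operator joins two strings without changing either one; an example is the second-to-last line of the first script in this lesson (print $s1 . "\n";).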
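
To compare the two forms side by side, here is a minimal sketch. The fragment names ($left, $right) and their sequences are invented for this example. It also shows, as one possible pattern, how .= and substr can be combined in a for loop to grow a string one character at a time.

```perl
#!/usr/bin/perl
# Sketch only: the fragment names and sequences are invented for this example.
$left  = "AATG";
$right = "GCCG";

# The dot operator builds a new string; $left and $right are unchanged.
$joined = $left . $right;

# The .= form appends to the variable on its left.
# It is shorthand for: $grown = $grown . $right;
$grown  = $left;
$grown .= $right;

# Growing a string one character at a time inside a loop.
$copy = "";
for($i=0; $i<length($left); $i++){
  $copy .= substr($left,$i,1);
}

print "JOINED: $joined\n";   # prints JOINED: AATGGCCG
print "GROWN: $grown\n";     # prints GROWN: AATGGCCG
print "COPY: $copy\n";       # prints COPY: AATG
```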

ASSIGNMENT:

Write a short script in which the user is prompted to enter a single strand of DNA of any length. Use a for loop, if-elsif-else constructs, and substr to produce a second sequence which is complementary to the first sequence. Here is a pair of complementary strands:


  INPUT:   ATCGGGCCTA
  OUTPUT:  TAGCCCGGAT
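
This lesson does not show how to prompt the user, so the following is only a sketch of one common way to read a line of input. The prompt text and the variable name $dna are assumptions made for this example. It reads and tidies the input but deliberately leaves the complement itself for you to write.

```perl
#!/usr/bin/perl
# Sketch only: the prompt text and the variable name $dna are assumptions.
print "Enter a DNA sequence: ";
$dna = <STDIN>;                   # read one line typed by the user
$dna = "" unless defined $dna;    # nothing was typed (end of input)
chomp $dna;                       # remove the trailing newline, which is not a nucleotide
$dna = uc $dna;                   # upper-case the input so "atcg" is treated like "ATCG"

print "INPUT:  $dna\n";
print "LENGTH: " . length($dna) . "\n";
```

Without chomp, the newline at the end of the typed line would be counted in the length and would fall through to the else branch of an if-elsif-else construct like the one in the counting script.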