String Processing

Arrays: Split and Join

In this lesson the student will learn how to:
  1. Use the split function
  2. Use the join function
  3. Use the push function

By the end of this lesson the student will be able to:

    Write a script which generates two random polynucleotides
    and compares them reporting a percent similarity.

Sequence Similarity

We can compare two sequences and report their level of similarity. Since DNA uses only four nucleotides, two random sequences will match at about 25% of positions on average, assuming all four bases are equally likely. Although there are different ways of comparing two polynucleotides, we will use a two-state system: at each position the two nucleotides either match (+) or do not (-).


  seq1: ATCCGTACGA
  comp: +--+-----+
  seq2: AGGCTAGACA

In this sample there are ten nucleotides in each sequence and three matches between them, so their similarity score is 30%.
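
The arithmetic behind a similarity score can be sketched in a few lines of Perl. This is only an illustration, not part of the lesson's scripts: the helper name percent_similarity is made up here, and it assumes the comparison line has already been built and contains only "+" and "-" characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: percent of "+" marks in a comparison line.
sub percent_similarity {
    my ($comp) = @_;
    my @marks = split("", $comp);              # one element per position
    return 0 unless @marks;                    # avoid dividing by zero
    my $matches = grep { $_ eq "+" } @marks;   # grep in scalar context counts
    return 100 * $matches / scalar(@marks);
}

# Three matches out of ten positions.
print percent_similarity("+--+-----+"), "%\n";   # prints 30%
```

Building the comparison line itself, position by position, is left for the assignment at the end of this lesson.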

Sequence similarity in biology is kind of a big deal. One method used to locate unknown genes is to search for similarity to known genes. The relatedness of two genes can also be measured in terms of similarity. For instance, the keratin genes of a human and a mouse may be more similar than the keratin genes of a bird and a human. Sequence comparisons can be done between nucleotide sequences or between amino acid sequences. With 20 different amino acids, similarity is more complicated, because some amino acids resemble each other in structure and other characteristics more than others do. Various scoring matrices have therefore been devised to score the similarity of polypeptides.

The following script provides a quick overview of how to use split and join:

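#!/usr/bin/perl
use strict;
use warnings;

my $str = "AGTACGATTACTAGCGCTATGGCTACCTATAATAAATAAA";
print $str . "\n";

# Break the string apart at every "A"; the separator itself is discarded.
my @list = split("A", $str);
foreach my $piece (@list) {
    print $piece . " ";
}
print "\n";

# Glue the pieces back together with "-" between them.
my $s2 = join("-", @list);
print $s2 . "\n";
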
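Split takes a separator and a string and returns the pieces of the string as a list; the separator itself is not included in the pieces. Join does the reverse: it takes a separator and a list and returns a single string. Here's a base counting script which takes a string as input and outputs a count of each base: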
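
#!/usr/bin/perl
use strict;
use warnings;

print "ENTER NUCLEOTIDE SEQUENCE: ";
my $str = <STDIN>;
chomp $str;                       # remove the trailing newline

# An empty separator splits the string into single characters.
my @list = split("", $str);

# The counters avoid the names $a and $b, which Perl reserves for sort.
my ($count_a, $count_c, $count_g, $count_t, $junk) = (0, 0, 0, 0, 0);

foreach my $item (@list) {
    if    ($item eq "A") { $count_a++; }
    elsif ($item eq "G") { $count_g++; }
    elsif ($item eq "C") { $count_c++; }
    elsif ($item eq "T") { $count_t++; }
    else                 { $junk++; }   # anything else, including lowercase
}

print "BASE COUNTS:\n";
print "A: $count_a\n";
print "C: $count_c\n";
print "G: $count_g\n";
print "T: $count_t\n";
print "JUNK: $junk\n";
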
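A few details of split are easy to miss, so here is a small sketch to experiment with. The sample string is made up for illustration. It shows that an empty separator gives one element per character, that empty pieces at the end of the string are dropped by default, and that a negative third argument (the limit) keeps them so that join can rebuild the original string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "GATTACAA";

# An empty separator yields one element per character.
my @chars = split("", $str);
print scalar(@chars), " characters\n";   # 8 characters

# Splitting on "A": empty pieces at the end are dropped by default.
my @pieces = split("A", $str);
print join("|", @pieces), "\n";          # G|TT|C

# A negative limit keeps the trailing empty pieces...
my @all = split("A", $str, -1);
# ...so joining with the same separator restores the original string.
print join("A", @all), "\n";             # GATTACAA
```
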
Here's another example which randomly generates a list of nucleotides and joins them into a single string:
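
#!/usr/bin/perl
use strict;
use warnings;

srand(time);                      # seed the random number generator
my @nucs = ("A", "T", "C", "G");
my @DNA  = ();

for (my $i = 0; $i < 20; $i++) {
    my $n = int rand(4);          # random index from 0 to 3
    push(@DNA, $nucs[$n]);        # append the chosen base
}

my $output = join("-", @DNA);
print $output . "\n";
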
The push function is new. It takes an array as its first argument and appends the remaining arguments (one or more scalars) to the end of that array.
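
As a minimal sketch of how push behaves (the array contents here are made up for illustration), the following shows appending one value, appending several values at once, and the return value of push, which is the new number of elements in the array:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @dna = ("A", "T");

push(@dna, "G");                  # append one value
push(@dna, "C", "A");             # append several values at once
my $length = push(@dna, "T");     # push returns the new element count

print join("", @dna), "\n";       # ATGCAT
print "$length\n";                # 6
```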

Make sure you spend time experimenting with each of the examples so that you have a strong understanding of split and join.

ASSIGNMENT:

Write a script which generates two random polynucleotides, each 20 bases in length. Next, compare these sequences and provide output which reports their similarity both numerically and graphically:


  seq1:  ATTATACGAGCTTAACTAGC
  comp:  +----------------++-
  seq2:  ACAGATACGATAGCGAGAGG

  similarity: 15%

Notice that "+" stands for a match and that "-" stands for a mismatch.