String Processing

Best Matches

In this lesson the student will learn how to:
  1. Modify a script to add functionality.
By the end of this lesson the student will be able to:

	Write a relatively complex script which
	checks for the best match between two
	nucleotide sequences.

Best Matches

Two sequences can anneal in different ways. Consider these possibilities:


	GGTTAACCGG   --------------------> 0
	 GGTTAACCGG   -------------------> 0
	  GGTTAACCGG   ------------------> 0
	   GGTTAACCGG   -----------------> 2
	    GGTTAACCGG   ----------------> 1
	     GGTTAACCGG   ---------------> 1
	      GGTTAACCGG   --------------> 1
	       GGTTAACCGG   -------------> 3
	        GGTTAACCGG   ------------> 1
	         GGTTAACCGG   -----------> 1
		 AGTCCGAGGT
	          GGTTAACCGG   ----------> 2
	           GGTTAACCGG   ---------> 2
	            GGTTAACCGG   --------> 3
	             GGTTAACCGG   -------> 3
	              GGTTAACCGG   ------> 1
	               GGTTAACCGG   -----> 1
	                GGTTAACCGG   ----> 2
	                 GGTTAACCGG   ---> 0
	                  GGTTAACCGG   --> 0

As you can see, three positions tie for the best match, with three matching pairs each. These are at offsets -2, 3, and 4, counted from the position where the two sequences line up end to end.

When two sequences anneal with unmatched ends, free nucleotides can fill in the missing positions, provided the appropriate enzymes are available to connect them.

#!/usr/bin/perl

# The two sequences to compare (both 10 nucleotides long).
$one = "AGCTAAATGA";
$two = "CCGTAACTTT";

# Complementary base pairs.
%trans = ( "A" => "T", "T" => "A", "C" => "G", "G" => "C" );

$start=0;
$end=10;

@s1 = split("", $one);
@s2 = split("", $two);

# First pass: pair $s1[$a] with $s2[$a+$start], shrinking the overlap each round.
do{
    print "ROUND " . ($start+1) . ":\n";
    for($a=0; $a<$end; $a++){
        if( $s1[$a] eq $trans{$s2[$a+$start]} ){
            print "MATCH: $s1[$a], $s2[$a+$start] -->$a\n";
        }
    }
    $start++;
    $end--;
}while($end>0);

$start=0;
$end=10;

# Second pass: the same comparison with the roles of the two sequences swapped.
do{
    print "ROUND " . ($start+1) . ":\n";
    for($a=0; $a<$end; $a++){
        if( $s2[$a] eq $trans{$s1[$a+$start]} ){
            print "MATCH: $s1[$a+$start], $s2[$a] -->$a\n";
        }
    }
    $start++;
    $end--;
}while($end>0);

Study this script and its output carefully.
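
The sample script prints every individual match, whereas the diagram earlier in this lesson shows a single tally per position. The following is a minimal sketch of such a tally, not part of the original script. It assumes the sequences contain only the uppercase letters A, C, G and T, and the helper names score_at_offset and best_offsets are hypothetical names chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Complement table, the same pairs as %trans in the sample script.
my %complement = ( A => 'T', T => 'A', C => 'G', G => 'C' );

# Count complementary pairs when $slide is shifted $offset positions
# to the right of $fixed (a negative offset shifts it to the left).
# Hypothetical helper, not part of the lesson's script.
sub score_at_offset {
    my ($fixed, $slide, $offset) = @_;
    my $score = 0;
    for my $i (0 .. length($slide) - 1) {
        my $j = $i + $offset;                     # matching position in $fixed
        next if $j < 0 || $j >= length($fixed);   # overhang: nothing to pair with
        my $wanted = $complement{ substr($slide, $i, 1) } // '';
        $score++ if substr($fixed, $j, 1) eq $wanted;
    }
    return $score;
}

# Try every offset at which the sequences overlap and return the
# best score followed by all offsets that reach it.
# Hypothetical helper, not part of the lesson's script.
sub best_offsets {
    my ($fixed, $slide) = @_;
    my ($best, @offsets) = (-1);
    for my $offset (-(length($slide) - 1) .. length($fixed) - 1) {
        my $score = score_at_offset($fixed, $slide, $offset);
        if ($score > $best) {
            ($best, @offsets) = ($score, $offset);   # new best: restart the list
        }
        elsif ($score == $best) {
            push @offsets, $offset;                  # tie: remember this offset too
        }
    }
    return ($best, @offsets);
}

# The two sequences from the sample script.
my ($best, @offsets) = best_offsets("AGCTAAATGA", "CCGTAACTTT");
print "Best score $best at offsets: @offsets\n";
```

Because the loop bounds come from length() rather than the constant 10, this sketch makes no assumption that the two sequences are the same length.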

ASSIGNMENT:

Modify the sample script so that it can handle any pair of nucleotide sequences of any length (including two sequences of different lengths). Accept the sequences as user input, and check that input to ensure that only proper nucleotide sequences are entered.
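
As a possible starting point for the input check only, the sketch below shows one way to test whether a string is a proper nucleotide sequence. It is an illustration under stated assumptions, not the required solution: it assumes that only the four DNA letters A, C, G and T are acceptable, that lowercase input should be accepted and converted, and the helper name clean_sequence is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the tidied sequence, or undef if the input is not a proper
# nucleotide sequence. Hypothetical helper name; assumes DNA letters only.
sub clean_sequence {
    my ($input) = @_;
    return undef unless defined $input;
    $input =~ s/\s+//g;      # drop the trailing newline and any spaces
    $input = uc $input;      # accept lowercase input
    return $input =~ /\A[ACGT]+\z/ ? $input : undef;
}

# Demonstration on fixed strings instead of live input.
for my $raw ("agctaaatga\n", "CCGTAACTTT", "AGXT", "") {
    my $seq = clean_sequence($raw);
    print defined $seq ? "accepted: $seq\n" : "rejected\n";
}
```

A line read from the keyboard with `my $line = <STDIN>;` can be passed to the helper directly, since the trailing newline is removed along with other whitespace.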