String Processing

Random Substitution

In this lesson the student will learn how to:
  1. Use the srand function to seed the random number generator
  2. Use the rand function to generate random numbers
  3. Generate a random sequence of nucleotides
  4. Use the substr and length functions to isolate portions of a string

By the end of this lesson the student will be able to:

    Write a Perl script which will insert a single random
    nucleotide into a randomly selected spot in a sequence.

Random Substitution

Although the DNA of any two people is far more similar than different, there are differences, as you would expect. Differences arise over time due to mutations, which are changes in the DNA sequence. Changes which occur within an exon of a gene can make the gene non-functional or dysfunctional, but every now and then a mutation simply changes the function of the gene in a neutral or occasionally positive way. A mutation which results in a positive change is likely to become prevalent, since individuals carrying it may have a selective advantage.

Mutations can take many forms, including deletions, insertions, and rearrangements. In this assignment we will consider how to represent a random mutation.

The effect of a single mutation is likely to be quite subtle, since gene products interact with other gene products. However, the cumulative effects of many mutations accumulated over thousands of generations may become quite noticeable and of great importance.

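    #!/usr/bin/perl
    use strict;
    use warnings;

    srand(time);                      # seed the generator with the current time
    my @nucs = ("A", "C", "T", "G");
    my $seq  = "";
    for (my $i = 0; $i < 20; $i++) {
        my $n = int rand(4);          # random index from 0 to 3
        $seq .= $nucs[$n];            # append one nucleotide
    }
    print $seq . "\n";
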
The s in srand stands for seed: the srand function seeds the random number generator. The generator does not produce values which are truly random, so it is really a pseudo-random number generator, and the same seed always yields the same sequence of numbers. By seeding it with time, which changes every second, we get a different sequence on almost every run (two runs started within the same second would share a seed). The rand function returns a random fractional number greater than or equal to 0 and less than the argument given to it (in this case 4). The int function in front of rand discards the fractional part, so the result is an integer from 0 to 3, which is exactly the range of indexes of @nucs.
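
To see what seeding does, it can help to use a fixed seed instead of time. The following is only a small sketch: the helper name seeded_seq and the seed value 42 are made up for illustration and are not part of the lesson's script. Calling it twice with the same seed should print the same sequence twice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the lesson): seed the generator with
# $seed, then build a random sequence of $len nucleotides.
sub seeded_seq {
    my ($seed, $len) = @_;
    srand($seed);                       # same seed => same pseudo-random stream
    my @nucs = ("A", "C", "T", "G");
    my $seq  = "";
    for (my $i = 0; $i < $len; $i++) {
        $seq .= $nucs[int rand(4)];     # int rand(4) is always 0, 1, 2 or 3
    }
    return $seq;
}

my $first  = seeded_seq(42, 20);
my $second = seeded_seq(42, 20);
print "$first\n$second\n";
print "identical\n" if $first eq $second;
```

Replacing the fixed seed with time brings back the behavior of the lesson's script, where each run normally gives a different sequence.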

So, each time we go through the loop (which is set for 20 iterations) a nucleotide is added to the sequence. If you run the script several times, you should wind up with a whole bunch of different sequences. You could throw an extra print statement within the loop to see the sequence grow one nucleotide at a time per iteration of the loop.
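
One possible way to add that extra print statement is sketched below. The wording of the printed label is an arbitrary choice, not something the lesson prescribes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

srand(time);
my @nucs = ("A", "C", "T", "G");
my $seq  = "";
for (my $i = 0; $i < 20; $i++) {
    $seq .= $nucs[int rand(4)];
    # Show the sequence after each nucleotide is appended.
    print "iteration " . ($i + 1) . ": $seq\n";
}
```

Each line of output should be one nucleotide longer than the line before it.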

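    #!/usr/bin/perl
    use strict;
    use warnings;

    my $str = "adenine guanine thymine cytosine";
    my @num = ( "A", "G", "T", "C" );    # not used in this example

    srand(time);
    my $p = int rand(length $str);       # random split point, 0 to length - 1

    print "FRONT: " . substr($str, 0, $p) . "\n";   # the first $p characters
    print "BACK: "  . substr($str, $p)    . "\n";   # everything from offset $p on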
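
The substr function takes a string, a starting offset (counted from 0), and an optional length; without a length it returns everything from the offset to the end of the string. As a small sketch, here is the same split done at a fixed position rather than a random one, so the result is predictable. The variable names $pos, $front, $back, and $joined and the "-" marker are illustrative choices, not part of the lesson.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "ACGTACGTAC";
my $pos = 3;                          # fixed split point, chosen for illustration

my $front = substr($str, 0, $pos);    # first $pos characters: "ACG"
my $back  = substr($str, $pos);       # everything from offset $pos: "TACGTAC"

# Join the two pieces back together with a marker in between.
my $joined = $front . "-" . $back;

print "FRONT:  $front\n";
print "BACK:   $back\n";
print "JOINED: $joined\n";            # ACG-TACGTAC
```

The joined string is one character longer than the original, because the two pieces together contain every character of $str exactly once.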

ASSIGNMENT:

Write a Perl script which begins by displaying a string containing ten adenines ("AAAAAAAAAA"). Next, write a loop which inserts a randomly selected C, G, or T at a random position in the current string and prints the modified string on each iteration. The loop should iterate 10 times, so the final string will be twenty nucleotides long and may look something like this:


   AAGTACAATAGCGAATAGCA

Obviously your final string may be somewhat different since random nucleotides are being inserted into random positions.