Loading Multiple Files

We will read several files stored in a single directory and load the information they contain into a database. The files we will use are in GenBank format. Download these files and store them in a directory called GENES.

GenBank Files



Loading The Database From Files

First, we need a way to read the files in a directory one by one. Here is a script that prints the first line (the LOCUS line) of every text file in a directory:

    #!/usr/bin/perl -w
    use strict;

    my @files  = ();
    my $folder = "GENES";

    # Read the names of all entries in the folder.
    unless (opendir(FOLDER, $folder)) {
        print "Cannot open folder $folder!\n";
        exit;
    }
    @files = readdir(FOLDER);
    closedir(FOLDER);

    foreach my $file (@files) {
        # Only process files whose names end in .txt
        if ($file =~ /\.txt$/) {
            unless (open(FILE, "$folder/$file")) {
                print "Cannot open file $folder/$file\n";
                next;    # skip this file instead of reading a closed handle
            }
            my @content = <FILE>;
            foreach my $line (@content) {
                if ($line =~ /^LOCUS/) {
                    print "$line";
                }
            }
            close(FILE);
        }
    }
    exit;

Now we need to isolate the information we are interested in. The locus name is in the first line. The organism is in a line that begins with an indentation followed by the word ORGANISM. The accession number is in the line that begins with the word ACCESSION. The counts for each of the four nucleotides are in the line that begins with BASE COUNT. The total number of base pairs can be computed from the individual base counts, but it is also found in the first line of the file. We will take the total from the first line, since that is slightly more challenging than just adding up the individual counts.
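For comparison, the simpler route mentioned above (adding up the individual counts) might look like the following. This is only a minimal sketch: it assumes a BASE COUNT line with the four counts in the order a, c, g, t, and both the sub name sum_base_counts and the sample line are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: add up the four counts on a BASE COUNT line.
# Returns undef (in scalar context) if the line does not match.
sub sum_base_counts {
    my ($line) = @_;
    my @counts = ( $line =~
        /^BASE COUNT\s+(\d+)\s+a\s+(\d+)\s+c\s+(\d+)\s+g\s+(\d+)\s+t/i );
    return unless @counts == 4;

    my $sum = 0;
    $sum += $_ for @counts;
    return $sum;
}

# Made-up sample line, for illustration only.
my $sample = "BASE COUNT      120 a     80 c     95 g    105 t\n";
print "TOTAL: ", sum_base_counts($sample), "\n";
```

The result can be compared with the total taken from the LOCUS line as a sanity check on the file.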

The following code goes in the inner foreach loop:

    # Because the script runs under "use strict", the variables below
    # must be declared with my before the loops.
    if ($line =~ /^LOCUS/) {
        ($locus) = ($line =~ /^LOCUS\s*([^ ]*).*/s);
        print "LOCUS: $locus\n";
        ($total) = ($line =~ /^LOCUS\s*[^\s]*\s*([\d]*).*/s);
        print "TOTAL: $total\n";
    }
    if ($line =~ /^BASE COUNT/) {
        ($a, $c, $g, $t) = ($line =~ /^BASE COUNT\s*(\d*)[^\d]*(\d*)[^\d]*(\d*)[^\d]*(\d*).*/s);
        print "A: $a, C: $c, G: $g, T: $t\n";
    }
    # The ORGANISM line is indented, so allow any leading whitespace.
    if ($line =~ /^\s+ORGANISM/) {
        chomp $line;
        ($organism) = ($line =~ /^\s+ORGANISM\s*(.*)/s);
        print "ORGANISM: $organism\n";
    }
    if ($line =~ /^ACCESSION/) {
        chomp $line;
        ($accession) = ($line =~ /^ACCESSION\s*([^ ]*).*/s);
        print "ACCESSION: $accession\n";
    }
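One possible way to keep the extracted values together, so that they can later be handed to the database step as a unit, is to collect them in a hash. The following is a sketch under stated assumptions rather than the tutorial's own method: the sub name parse_genbank_lines, the hash keys, and the sample record are all hypothetical, and the regular expressions are simplified versions of the ones above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect the fields of one GenBank record
# into a hash and return a reference to it.
sub parse_genbank_lines {
    my (@lines) = @_;
    my %rec;
    foreach my $line (@lines) {
        chomp $line;
        if ( $line =~ /^LOCUS\s+(\S+)\s+(\d+)/ ) {
            # Locus name and total number of base pairs
            @rec{qw(locus total)} = ( $1, $2 );
        }
        elsif ( $line =~ /^ACCESSION\s+(\S+)/ ) {
            $rec{accession} = $1;
        }
        elsif ( $line =~ /^\s+ORGANISM\s+(.*)/ ) {
            $rec{organism} = $1;
        }
        elsif ( $line =~ /^BASE COUNT\s+(\d+)\D+(\d+)\D+(\d+)\D+(\d+)/ ) {
            # Counts are assumed to appear in the order a, c, g, t
            @rec{qw(a c g t)} = ( $1, $2, $3, $4 );
        }
    }
    return \%rec;
}

# Made-up sample record, for illustration only.
my @sample = (
    "LOCUS       EXAMPLE1      400 bp    DNA\n",
    "ACCESSION   EX000001\n",
    "  ORGANISM  Example organism\n",
    "BASE COUNT      120 a     80 c     95 g    105 t\n",
);

my $rec = parse_genbank_lines(@sample);
print "$_: $rec->{$_}\n" for sort keys %$rec;
```

In the directory-reading script, the sub could be called once per file with @content as its argument.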

ASSIGNMENT: