Satellog
 
Satellog Resources
Query Database
Tutorial
Documentation
Downloads

Contact Us

Acknowledgements
Publications

GSC
UBiC
BCNet2nd place
BC Net's Coolest Application Contest, 2005



Satellog Database Documentation


APPENDIX C

 

Generating the repeat classifier

 

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

@ Determining the distinct repeat classes @

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

 

Mathematically, if one is looking at all 1mers to 16mers then there are huge number of potential combinations.  Not all mathematically predicted repeats exist in the genome (at least not pure repeats).  For the pure repeats that do exist in the human genome we need a way of classifying them into their repeat families.  For example:

 

CAG, AGC, GCA, GTC, TCG, CGT

 

are all the same repeat.  They are detected distinctly because of the way TRF works (see text).  To do this we developed a repeat class that detects all repeats in a family.  We also had to design it in a way so that it could be searched specifically for each repeat. To do this, all detected repeats had to be collected from the directory containing the TRF output files as follows:

 

for file in *.dat; do cut -d " " -f 14 $file; done > rep_units_4090

 

This cuts the TRF output for each chromosome separated by space at the 14th column (the column containing the repeat unit) and outputs it to rep_units_4090.

 

One erroneous repeat unit was detected with a AAA instead of an A.  The cause of this error is unknown but this class of repeat is represented by the A.  So this value was changed in chr14.fa.3.4090.4090.80.10.30.16.dat, and reflected in rep_units_4090.2

 

Original erroneous hit in chr14.fa.3.4090.4090.80.10.30.16.dat:

 

chr14

102425491 102425500 3 3.3 3 100 0 30 100 0 0 0 0.00 AAA AAAAAAAAAA

 

Next, we sought to determine all the distinct repeats, this was accomplished by making a temporary “staging” table:

 

CREATE TABLE rep_class

(

       class CHAR(16) PRIMARY KEY

);

 

and inserting all the repeats into this table:

 

LOAD DATA INFILE '/home/perseusm/goldenpath/3.4090.4090.80.10.30.16/rep_units_4090.2' IGNORE INTO TABLE rep_class;

 

from this table, unique repeats were selected with the following shell script:

 

echo "

SELECT DISTINCT class

FROM rep_class

" | mysql --quick -h athena -u schz_rw -prepeat schz_db > repeats_4090.txt

 

Next we needed to generate all the distinct repeat classes possible from the distinct repeats in our dataset. home/perseusm/progs/repeat_classer/repeat_classer2.pl does this.

 

# Execute

 

/home/perseusm/progs/repeat_classer/repeat_classer2.pl repeats_4090.txt

 

############################

# Begin repeat_classer2.pl #

############################

 

#!/usr/bin/perl

# repeat_classer2.pl

# usage: rep_classer2.pl rep_units.txt

# run this script before repeatalyzer.pl

# detects what macroclass each repeat belongs to

# rep_units.txt is a list of all distinct types of repeats in the db

# Perseus Missirlis - Mar 18, 2004

 

use strict;

 

my $rep = "rep_class.txt";

 

unless ( open(REP_CLASS, ">$rep") ) {

   die "Cannot open file \"$rep\" to write to!\n\n";

      }

        

my $i = 1;

 

while (<>) {

  chomp;

 

if ($_ =~ /(\S+)/) {

 

    print "line $i matched: $1\n\n";

      

       my @units;

       my $rep_class;

    my @repeat = split('',$1);

    my $lengther = scalar @repeat;

    my $n;

 

    print "this is the repeat: @repeat\n";

    print "this is the length of the repeat: $lengther\n\n";

 

       my $x;

 

    while ($n < $lengther) {

              my $for = shift @repeat;

              $repeat[($lengther - 1)] = $for;

              my $for_run = "@repeat";

              $for_run =~ s/\s//g;

              my $rev_run = reverse $for_run;

              $rev_run =~ tr/ATGCatgc/TACGtacg/;

      

              print "this is the repeat after popping it: @repeat\nthis is the variable for regex: $for_run\nthis is the variable in reverse: $rev_run\n\n";

 

              $units[$x] = $for_run;

              $x++;

              $units[$x] = $rev_run;

              $x++;

              $n++;

              }

 

              print "this is the units array: @units\n";

 

              @units = sort(@units);

              print "sorted units:\n";

              foreach my $unit(@units) {

              print $unit . "\n";

              $rep_class .= $unit . "o";

              }

       $rep_class = "o" . $rep_class;

       print "this is the rep_class: $rep_class\n\n";

       print REP_CLASS $rep_class . "\n";

       $i++; 

       }

}

close(REP_CLASS);

exit;

 

##############

# End script #

##############

 

This script generates all the distinct classes in a file repeats_classes_4090.txt

 

located at: /home/perseusm/progs/repeat_classer/repeats_classes_4090.txt

 

This is fed through the rep_class table like above to get all the distinct class which is saved in a file called repeats_classes_4090.txt.

 

located at: /home/perseusm/progs/repeat_classer/repeats_classes_4090.txt

 

Now create the final, usable rep_class table:

 

CREATE TABLE rep_class

(

       rep_class_id INT auto_increment NOT NULL PRIMARY KEY,

       class TEXT

);

 

CREATE INDEX rc_search ON rep_class (class (50) ASC,rep_class_id);

 

which can be fed into the db using LOAD DATA or the following script:

 

# execute:

 

/home/perseusm/progs/repeat_classer/reg_ex_test.pl repeats_classes_4090.txt

 

##################

# reg_ex_test.pl #

##################

 

#!/usr/bin/perl

# reg_ex_test.pl

# run this script before repeatalyzer.pl # detects what macroclass each repeat belongs to # rep_units.txt is a list of all distinct types of repeats in the db # Perseus Missirlis - Mar 18, 2004

 

use strict;

use DBI;

 

# DBI

 

 my ($dsn) = "DBI:mysql:schz_db:athena.bcgsc.ca";

 my ($user_name) = "schz_rw";

 my ($password) = "repeat";

 my ($dbh, $sth);

 my (@ary);

 

#######################

  # Connect to Database #

#######################

  

   $dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });

  

my $i = 1;

 

while (<>) {

  chomp;

 

if ($_ =~ /(\S+)/) {

#      print "line: 1 - Inserting: $1\n\n";

       $sth = $dbh->prepare ("INSERT INTO rep_class VALUES(NULL,'$1')");

       $sth->execute ();

       $i++;

    }

}

 

exit;

 

# End script #


 


 

^ top

 




 Satellog  W3C: XHTML, CSS