|
|
|
APPENDIX C Generating the repeat classifier @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @ Determining the distinct repeat classes @ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ Mathematically, if one is looking at all 1mers to
16mers then there are huge number of potential combinations. Not all mathematically predicted repeats
exist in the genome (at least not pure repeats). For the pure repeats that do exist in the
human genome we need a way of classifying them into their repeat families. For example: are all the same repeat. They are detected distinctly because of the
way TRF works (see text). To do this we
developed a repeat class that detects all repeats in a family. We also had to design it in a way so that it
could be searched specifically for each repeat.
To do this, all detected repeats had to be collected from the directory
containing the TRF output files as follows: for file in *.dat; do cut -d " " -f 14
$file; done > rep_units_4090 This cuts the TRF output for each chromosome
separated by space at the 14th column (the column containing the repeat unit)
and outputs it to rep_units_4090. One erroneous repeat unit was detected with a AAA
instead of an A. The cause of this error
is unknown but this class of repeat is represented by the A. So this value was changed in
chr14.fa.3.4090.4090.80.10.30.16.dat, and reflected in rep_units_4090.2 Original erroneous hit in
chr14.fa.3.4090.4090.80.10.30.16.dat: chr14 102425491 102425500 3 3.3 3 100 0 30 100 0 0 0 0.00
AAA AAAAAAAAAA Next, we sought to determine all the distinct
repeats, this was accomplished by making a temporary “staging” table: CREATE TABLE rep_class ( class ); and inserting all the repeats into this table: LOAD from this table, unique repeats were selected with
the following shell script: echo " SELECT DISTINCT class FROM rep_class " | mysql --quick -h athena -u schz_rw -prepeat
schz_db > repeats_4090.txt Next we needed to generate all the distinct repeat
classes possible from the distinct repeats in our dataset.
home/perseusm/progs/repeat_classer/repeat_classer2.pl does this. # Execute /home/perseusm/progs/repeat_classer/repeat_classer2.pl
repeats_4090.txt ############################ # Begin repeat_classer2.pl # ############################ #!/usr/bin/perl # repeat_classer2.pl # usage: rep_classer2.pl rep_units.txt # run this script before repeatalyzer.pl # detects what macroclass each repeat belongs to # rep_units.txt is a list of all distinct types of
repeats in the db # use strict; my $rep = "rep_class.txt"; unless ( open( die
"Cannot open file \"$rep\" to write to!\n\n"; } my $i = 1; while (<>) { chomp; if ($_ =~ /(\S+)/) { print
"line $i matched: $1\n\n"; my
@units; my
$rep_class; my @repeat
= split('',$1); my
$lengther = scalar @repeat; my $n; print
"this is the repeat: @repeat\n"; print
"this is the length of the repeat: $lengther\n\n"; my $x; while ($n
< $lengther) { my
$for = shift @repeat; $repeat[($lengther
- 1)] = $for; my
$for_run = "@repeat"; $for_run
=~ s/\s//g; my
$rev_run = reverse $for_run; $rev_run
=~ tr/ATGCatgc/TACGtacg/; print
"this is the repeat after popping it: @repeat\nthis is the variable for
regex: $for_run\nthis is the variable in reverse: $rev_run\n\n"; $units[$x]
= $for_run; $x++; $units[$x]
= $rev_run; $x++; $n++; } print
"this is the units array: @units\n"; @units
= sort(@units); print
"sorted units:\n"; foreach
my $unit(@units) { print
$unit . "\n"; $rep_class
.= $unit . "o"; } $rep_class
= "o" . $rep_class; print
"this is the rep_class: $rep_class\n\n"; print $i++; } } close( exit; ############## # End script # ############## This script generates all the distinct classes in a
file repeats_classes_4090.txt located at:
/home/perseusm/progs/repeat_classer/repeats_classes_4090.txt This is fed through the rep_class table like above
to get all the distinct class which is saved in a file called
repeats_classes_4090.txt. located at:
/home/perseusm/progs/repeat_classer/repeats_classes_4090.txt Now create the final, usable rep_class table: CREATE TABLE rep_class ( rep_class_id
class
TEXT ); CREATE INDEX rc_search ON rep_class (class (50)
ASC,rep_class_id); which can be fed into the db using LOAD # execute: /home/perseusm/progs/repeat_classer/reg_ex_test.pl
repeats_classes_4090.txt ################## # reg_ex_test.pl # ################## #!/usr/bin/perl # reg_ex_test.pl # run this script before repeatalyzer.pl # detects
what macroclass each repeat belongs to # rep_units.txt is a list of all
distinct types of repeats in the db # use strict; use DBI; # DBI my ($dsn) =
"DBI:mysql:schz_db:athena.bcgsc.ca"; my
($user_name) = "schz_rw"; my
($password) = "repeat"; my ($dbh,
$sth); my (@ary);
####################### # Connect to
Database #
####################### $dbh =
DBI->connect ($dsn, $user_name, $password, my $i = 1; while (<>) { chomp; if ($_ =~ /(\S+)/) { # print
"line: 1 - Inserting: $1\n\n"; $sth =
$dbh->prepare ("INSERT INTO rep_class VALUES(NULL,'$1')"); $sth->execute
(); $i++; } } exit; # End script # ^ top
|
|
|