APPENDIX B Running TRF on v.34 whole chromosome fasta files from UCSC

@@@@@@@@@@@@@@@@@@@@@
@ Human Genome v.34 @
@@@@@@@@@@@@@@@@@@@@@

The human genome goldenpath for all chromosomes (excluding random chromosome

How to download fasta files by FTP:

ftp -i hgdownload.cse.ucsc.edu    # -i turns off interactive mode, therefore no prompting during mget

login
    u:anonymous
    p:your@email.com

get files
    cd goldenPath/hg16/chromosomes/
    mget *

@@@@@@@
@ TRF @
@@@@@@@

We are interested in developing our own repeat
co-ordinates, distinct from the pre-computed co-ordinates provided by UCSC, for two reasons:

1) We want to detect repeats much smaller than the smallest at UCSC
2) We are only interested in pure repeats

The following parameters were recommended to detect the purest repeats possible with TRF without running out of memory:

Shell Script for TRF

/home/perseusm/trf/trf321.linux.exe /home/perseusm/goldenpath/chr7.fa 3 4090 4090 80

An interesting feature of this shell script is that it gets rid of all the HTML files spawned by TRF.
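
The cleanup step itself is not reproduced in this appendix, so what follows is only a minimal sketch of how such a wrapper might look, not the original script. It assumes that TRF writes its reports into the current working directory and names them after the input file with a .html suffix. The function name run_trf_all and the TRF and TRF_PARAMS variables are hypothetical. The trailing values "10 30 16" are read off the directory name 3.4090.4090.80.10.30.16 used elsewhere in this appendix, and the -d flag (write a .dat file) is likewise an assumption.

```shell
#!/bin/sh
# Sketch only: run TRF on each chromosome in turn and delete its HTML reports.

# Path to the TRF binary, as given in this appendix.
TRF=${TRF:-/home/perseusm/trf/trf321.linux.exe}

# ASSUMPTION: the first four values come from the command in this appendix;
# "10 30 16" are inferred from the output directory name and "-d" is assumed.
# Adjust these to the parameters actually used.
TRF_PARAMS=${TRF_PARAMS:-"3 4090 4090 80 10 30 16 -d"}

# run_trf_all DIR: hypothetical helper, processes DIR/chr*.fa one at a time.
run_trf_all() {
    fasta_dir=$1
    for fa in "$fasta_dir"/chr*.fa; do
        [ -e "$fa" ] || continue    # nothing matched: skip the literal pattern
        # One chromosome per iteration; the exit status is deliberately ignored.
        "$TRF" "$fa" $TRF_PARAMS || true
        # ASSUMPTION: reports land in the current directory, named after the
        # input file and ending in .html. The .dat file is left in place.
        rm -f -- "$(basename "$fa")".*.html
    done
}
```

Called as run_trf_all /home/perseusm/goldenpath from the directory that should receive the .dat files, the loop finishes one chromosome before starting the next, which keeps runs from overlapping.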
Much to my annoyance, the HTML output could not be disabled. Each chromosome must be run sequentially, because TRF creates temporary files that are needed to build the final .dat file.

The next step was to test that only pure repeats were in fact being detected. The following quick and dirty script extracts the largest hits from the TRF .dat files. Because of the way the scoring algorithm works, larger repeats have a higher chance of tolerating indels and substitutions, so the largest hits are the ones most likely to be impure. I wanted to make sure the
TRF parameters I selected reported only pure hits.

# Execute purity test script
/home/perseusm/goldenpath/3.4090.4090.80.10.30.16/parse_dis.pl > purity_test.txt

# Code for purity test script

#!/usr/bin/perl
################
# parse_dis.pl #
################
use strict;
use warnings;

# Reads TRF .dat files named on the command line (or standard input).
my $chrom;
while (<>) {
    chomp;
    if ($_ =~ /chr(\S+)/) {
        # header line naming the sequence, e.g. chr7
        $chrom = $1;
    }
    elsif ($_ =~ /^(\d+)\s+(\d+)\s+(\d+)\s+(\d+\.\d+)\s+(\d+)\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\.\d+\s+(\S+)\s+(\S+)/) {
        # pull out coords of interest
        my $chromStart   = $1;
        my $chromEnd     = $2;
        my $rptPeriod    = $3;    # period size
        my $rptSize      = $4;    # copy number
        my $rptConsensus = $5;    # consensus size
        my $rptUnit      = $6;    # consensus pattern
        my $rpt          = $7;    # repeat sequence
        my $rptLength    = length $rpt;

        # keep only the largest hits: period 16 with more than 10 copies
        if (($rptPeriod == 16) && ($rptSize > 10)) {
            print "$chrom\t$chromStart\t$chromEnd\t$rptUnit\t$rpt\t$rptLength\t$rptPeriod\t$rptSize\n";
        }
    }
}
#################
# End of Script #
#################

The contents of /home/perseusm/goldenpath/3.4090.4090.80.10.30.16/purity_test.txt indicate that all the largest hits are pure. Since larger repeats are the most likely to tolerate indels and substitutions, this implies that the smaller hits are pure as well. You should go through this file manually and ensure each hit is pure, i.e. composed only of tandem repeat units.
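
The manual inspection can be backed up by a mechanical check. The following is a minimal sketch, not part of the original workflow. It assumes purity_test.txt is tab-separated with the repeat unit in column 4 and the repeat sequence in column 5, as printed by parse_dis.pl. It takes "pure" to mean that the sequence is an uninterrupted run of the unit, possibly starting part-way through a unit and ending on a partial copy. The function name check_purity is hypothetical.

```shell
#!/bin/sh
# check_purity FILE: hypothetical helper. For each hit in FILE, prints PURE or
# IMPURE followed by the chromosome and the start and end co-ordinates.
check_purity() {
    awk -F '\t' '
    {
        unit = $4; seq = $5
        if (unit == "") next    # skip lines without a repeat unit
        # Build a run of the unit at least one unit longer than the hit, so
        # that any phase (rotation) of the repeat is covered.
        run = ""
        while (length(run) < length(seq) + length(unit)) run = run unit
        # A pure hit must occur somewhere inside that perfect run.
        verdict = (index(run, seq) > 0) ? "PURE" : "IMPURE"
        print verdict "\t" $1 "\t" $2 "\t" $3
    }' "$1"
}
```

Running check_purity purity_test.txt and searching the output for IMPURE narrows the manual review to the hits that contain a substitution or an indel.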