


Satellog Database Documentation


APPENDIX B

 

Running TRF on v.34 whole chromosome fasta files from UCSC

 

@@@@@@@@@@@@@@@@@@@@@

@ Human Genome v.34 @

@@@@@@@@@@@@@@@@@@@@@

 

The human genome golden path sequence for all chromosomes (excluding the random chromosome data) was downloaded from UCSC and saved in /home/perseusm/goldenpath for subsequent analysis.

 

How to download fasta files by FTP:

 

ftp -i hgdownload.cse.ucsc.edu

 

# -i turns off interactive prompting, so mget does not ask for confirmation before each file

 

login

 

u:anonymous

p:your@email.com

 

get files

 

cd goldenPath/hg16/chromosomes/

mget *
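Note that mget * fetches every file in the directory, so the random chromosome files still have to be set aside afterwards. A minimal sketch of one way to do that, assuming the random files carry "_random" in their names (the function name keep_assembled is hypothetical):

```sh
# Filter a list of file names (one per line on standard input) and drop
# every entry whose name contains "_random".
# Assumption: random chromosome files are named with a "_random" suffix.
keep_assembled() {
    grep -v '_random'
}
```

For example, `ls /home/perseusm/goldenpath | keep_assembled` would list only the files kept for analysis.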

 

@@@@@@@

@ TRF @

@@@@@@@

 

We generated our own repeat coordinates, rather than using the pre-computed coordinates provided by UCSC, for two reasons:

 

1)  We want to detect repeats much smaller than the smallest reported by UCSC

2)  We are only interested in pure repeats

 

The following parameters were recommended to detect the purest repeats possible with TRF without running out of memory:

 

Shell Script for TRF

 

/home/perseusm/trf/trf321.linux.exe /home/perseusm/goldenpath/chr7.fa 3 4090 4090 80 10 30 16 -d

# remove the html and tmp files left behind by TRF
rm -f /home/perseusm/chr7*.html /home/perseusm/chr7*.tmp

 

This shell script runs TRF on one chromosome and then deletes all the html and tmp files spawned by TRF. Much to my annoyance, the html output was not something that could be disabled.

 

You need to run the chromosomes sequentially, one at a time, because TRF creates temporary files that it needs in order to build the final .dat file.
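The sequential run over all chromosomes can be sketched as a dry-run generator that prints one TRF command per chromosome. The paths and parameter values are copied from the chr7 command above; the variable names, the function name trf_commands, and the chromosome list (1-22, X, Y) are assumptions. The parameter labels follow the TRF usage line (File Match Mismatch Delta PM PI Minscore MaxPeriod). Presumably the very large mismatch and indel penalties are what restrict the hits to pure repeats, since a single mismatch or indel would cost more than an alignment can gain.

```sh
# Dry-run sketch: print the TRF command for each chromosome, in order.
# Variable names and the chromosome list are assumptions for illustration.
TRF=/home/perseusm/trf/trf321.linux.exe
FASTA_DIR=/home/perseusm/goldenpath

# Positional TRF parameters, in the order given by the TRF usage line.
MATCH=3          # match weight
MISMATCH=4090    # mismatch penalty
DELTA=4090       # indel penalty
PM=80            # match probability
PI=10            # indel probability
MINSCORE=30      # minimum alignment score to report
MAXPERIOD=16     # maximum period size to report

trf_commands() {
    for chrom in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y; do
        echo "$TRF $FASTA_DIR/chr$chrom.fa $MATCH $MISMATCH $DELTA $PM $PI $MINSCORE $MAXPERIOD -d"
    done
}
```

Piping the output into a shell (`trf_commands | sh`) would run the commands one after another. The cleanup of the html and tmp files is not included in this sketch.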

 

Next, I wanted to verify that only pure repeats were in fact being detected with the TRF parameters I selected.  Due to the way the scoring algorithm works, larger repeats have a higher chance of tolerating indels and substitutions, so the largest hits are the ones most likely to be impure.  The following is a quick and dirty script that extracts the largest hits from the TRF .dat files so that they can be checked.

 

# Execute purity test script (it reads the TRF .dat files named on its command line, or standard input)

 

/home/perseusm/goldenpath/3.4090.4090.80.10.30.16/parse_dis.pl > purity_test.txt

 

# Code for purity test script

 

################

# parse_dis.pl #

################

 

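#!/usr/bin/perl

use strict;
use warnings;

# chromosome currently being parsed, taken from the line that names the sequence
my $chrom;

while (<>) {
    chomp;

    if (/chr(\S+)/) {
        $chrom = $1;
    }
    elsif (/^(\d+)\s+(\d+)\s+(\d+)\s+(\d+\.\d+)\s+(\d+)\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\.\d+\s+(\S+)\s+(\S+)/) {

        # pull out coords of interest
        my $chromStart   = $1;    # start of the repeat
        my $chromEnd     = $2;    # end of the repeat
        my $rptPeriod    = $3;    # period size
        my $rptSize      = $4;    # number of copies
        my $rptConsensus = $5;    # consensus size
        my $rptUnit      = $6;    # consensus repeat unit
        my $rpt          = $7;    # full repeat sequence
        my $rptLength    = length $rpt;

        # keep only the largest hits: maximum period (16) and more than 10 copies
        if ($rptPeriod == 16 && $rptSize > 10) {
            # the sixth column was a second copy of $rpt; $rptLength was computed but never used
            print "$chrom\t$chromStart\t$chromEnd\t$rptUnit\t$rpt\t$rptLength\t$rptPeriod\t$rptSize\n";
        }
    }
}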

 

#################

# End of Script #

#################

 

The contents of /home/perseusm/goldenpath/3.4090.4090.80.10.30.16/purity_test.txt indicate that all of the largest hits are pure.  Since larger repeats are the most likely to tolerate indels and substitutions, this implies that the smaller hits are pure as well.  You should still go through this file manually and confirm that each hit is pure, i.e. composed only of tandem repeat units.
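The manual check could be backed up by a small filter that prints every hit whose sequence is not an exact tandem repetition of its repeat unit. This is only a sketch under stated assumptions: the input is tab-separated with the repeat unit in column 4 and the repeat sequence in column 5 (as printed by parse_dis.pl), the sequence starts in phase with the unit, a trailing partial copy is allowed, and the comparison is case-sensitive. The function name report_impure is hypothetical.

```sh
# Print every line whose repeat sequence (column 5) is NOT made up solely of
# back-to-back copies of the repeat unit (column 4).
# Assumptions: tab-separated input, sequence in phase with the unit,
# a trailing partial copy of the unit is accepted.
report_impure() {
    awk -F '\t' '{
        unit = $4
        seq = $5
        ulen = length(unit)
        pure = (ulen > 0)
        # Compare each base of the sequence with the matching base of the unit.
        for (i = 1; pure && i <= length(seq); i++) {
            if (substr(seq, i, 1) != substr(unit, (i - 1) % ulen + 1, 1)) {
                pure = 0
            }
        }
        if (!pure) {
            print
        }
    }'
}
```

Running `report_impure < purity_test.txt` would print nothing if every hit passes this check. An empty result does not replace reading through the file, since the assumptions above may not hold for every hit.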



 
