|
|
|
APPENDIX E Downloading and processing UniGene data We were curious if there was
any indication of repeat polymorphism in the UniGene clusters posted at
NCBI. repeatalyzer.pl automatically
evaluates each repeat for polymorphisms within UniGene clusters. To do this however, we need the clusters and
all sequences: How to download fasta files by FTP: # open FTP connection to NCBI $ ftp –I ftp.ncbi.nih.gov # login u:anonymous p:your@email.com # change to UniGene directory cd /repository/UniGene # download all human UniGene files mget Hs* Convert FASTA formatting of Hs.seq.uniq file The
Hs.seq.uniq file contains all sequences representing the longest, highest
quality stretch of For example the FASTA header for Hs.2 is: >gnl|UG|Hs#S1728506 Homo sapiens
N-acetyltransferase 2 (arylamine N-acetyltransferase) ( From this, we only really need the cluster
identifier (Hs.2) and the UniGene identifier for this sequence within Hs.2
(Hs#S1728506). Run the following command-line perl script to format
this file for subsequent BLAT analysis: $ perl -i.bak -p -e
's/^.*(Hs\#\S+).*\/ug\=(\S+).*$/>\2\|\1/g' Hs.seq.uniq The FASTA header for all sequences in Hs.seq.uniq is
now: >Hs.2|Hs#S1728506 Now rename this file to Hs.seq.uniq2: $ mv Hs.seq.uniq Hs.seq.uniq2 And rename the back-up file created by command-line
file to the original: $ mv Hs.seq.uniq.bak Hs.seq.uniq Make Hs.seq.uniq2 into a BLATable database BLAT
requires multiple FASTA files converted to a .2bit file format in order to
process them. $ ~/blat/faToTwoBit Hs.seq.uniq2 Hs.seq.uniq2.2bit Remember where this file is, it is required by
repeatalyzer to work. Split the UniGene clusters into cluster delineated
multiple FASTA files The
Hs.seq.all file from UniGene is essentially one huge flat file. Within this file, UniGene clusters are
delimited by # followed by a collection of sequences that make up the UniGene
cluster. For repeatalyzer to work, the
UniGene clusters need to be parsed to separate files representing each cluster
with all of its associated sequences.
The Hs.seq.all file was parsed by the following script: # make a new directory (105680 files will be created!) # make a note of the absolute location of these
files # they will be needed by repeatalyzer $ mkdir ugc_fasta # run the script $ ./parse_unigene_all.pl
Hs.seq.all #
Code for parsing Hs.seq.all ########################### # parse_unigene_unique.pl
# ########################### #!/usr/bin/perl -w # parse_unigene_unique.pl use strict; my $outputfile = "frig";; my $i; my $count; while (<>) { if
(/^\s+$/) {
next; }
elsif (/#.*containing\s+(\d+)/) { $count = $1;
$i = 1;
print "conditional 2: $count\n"; }
elsif (/^(>.*\/ug\=(\S+).*$)/) {
if ($outputfile eq "frig") { $outputfile = $2; unless ( open(SEQ,
">$outputfile\.ugc") ) { die "Cannot open
file \"$outputfile\" to write to!\n\n";
} print SEQ
"$1\n"; $i++;
} elsif (($outputfile ne "frig") && ($i == 1)) {
close(SEQ); $outputfile = $2; unless ( open(SEQ,
">$outputfile\.ugc") ) { die "Cannot open
file \"$outputfile\" to write to!\n\n";
} print SEQ
"$1\n"; $i++;
} elsif (($outputfile ne "frig") && ($i <= $count))
{ print SEQ
"$1\n"; $i++; } } elsif (/(\S+)/) {
print SEQ "$1\n"; } } exit; ^ top
|
|
|