Satellog Resources
Query Database
Tutorial
Documentation
Downloads

Contact Us

Acknowledgements
Publications

GSC
UBiC

2nd place BC Net's Coolest Application Contest, 2005

Satellog Database Documentation

APPENDIX E

Downloading and processing UniGene data

We were curious if there was any indication of repeat polymorphism in the UniGene clusters posted at NCBI. repeatalyzer.pl automatically evaluates each repeat for polymorphisms within UniGene clusters. To do this however, we need the clusters and all sequences:

How to download fasta files by FTP:

# open FTP connection to NCBI

$ ftp –I ftp.ncbi.nih.gov

# login

u:anonymous

p:your@email.com

# change to UniGene directory

cd /repository/UniGene

# download all human UniGene files

mget Hs*

Convert FASTA formatting of Hs.seq.uniq file

The Hs.seq.uniq file contains all sequences representing the longest, highest quality stretch of DNA for each particular UniGene cluster. We will be using the BLAT algorithm to see if each repeat plus 10 bp of upstream and downstream genomic sequence can be detected within these sequences. The FASTA files provided by NCBI have a long, somewhat cumbersome naming convention that is too big for the BLAT output.

For example the FASTA header for Hs.2 is:

>gnl|UG|Hs#S1728506 Homo sapiens N-acetyltransferase 2 (arylamine N-acetyltransferase) (NAT2), mRNA /cds=(108,980) /gb=NM_000015 /gi=4557782 /ug=Hs.2 /len=1276

From this, we only really need the cluster identifier (Hs.2) and the UniGene identifier for this sequence within Hs.2 (Hs#S1728506).

Run the following command-line perl script to format this file for subsequent BLAT analysis:

$ perl -i.bak -p -e 's/^.*(Hs\#\S+).*\/ug\=(\S+).*$/>\2\|\1/g' Hs.seq.uniq

The FASTA header for all sequences in Hs.seq.uniq is now:

>Hs.2|Hs#S1728506

Now rename this file to Hs.seq.uniq2:

$ mv Hs.seq.uniq Hs.seq.uniq2

And rename the back-up file created by command-line file to the original:

$ mv Hs.seq.uniq.bak Hs.seq.uniq

Make Hs.seq.uniq2 into a BLATable database

BLAT requires multiple FASTA files converted to a .2bit file format in order to process them.

$ ~/blat/faToTwoBit Hs.seq.uniq2 Hs.seq.uniq2.2bit

Remember where this file is, it is required by repeatalyzer to work.

Split the UniGene clusters into cluster delineated multiple FASTA files

The Hs.seq.all file from UniGene is essentially one huge flat file. Within this file, UniGene clusters are delimited by # followed by a collection of sequences that make up the UniGene cluster. For repeatalyzer to work, the UniGene clusters need to be parsed to separate files representing each cluster with all of its associated sequences. The Hs.seq.all file was parsed by the following script:

# make a new directory (105680 files will be created!)

# make a note of the absolute location of these files

# they will be needed by repeatalyzer

$ mkdir ugc_fasta

# run the script

$ ./parse_unigene_all.pl Hs.seq.all

# Code for parsing Hs.seq.all

###########################

# parse_unigene_unique.pl #

###########################

#!/usr/bin/perl -w

# parse_unigene_unique.pl

use strict;

my $outputfile = "frig";;

my $i;

my $count;

while (<>) {

if (/^\s+$/) {

next;

} elsif (/#.*containing\s+(\d+)/) {

$count = $1;

$i = 1;

print "conditional 2: $count\n";

} elsif (/^(>.*\/ug\=(\S+).*$)/) {

if ($outputfile eq "frig") {

$outputfile = $2;

unless ( open(SEQ, ">$outputfile\.ugc") ) {

die "Cannot open file \"$outputfile\" to write to!\n\n";

}

print SEQ "$1\n";

$i++;

} elsif (($outputfile ne "frig") && ($i == 1)) {

close(SEQ);

$outputfile = $2;

unless ( open(SEQ, ">$outputfile\.ugc") ) {

die "Cannot open file \"$outputfile\" to write to!\n\n";

}

print SEQ "$1\n";

$i++;

} elsif (($outputfile ne "frig") && ($i <= $count)) {

print SEQ "$1\n";

$i++;

}

} elsif (/(\S+)/) {

print SEQ "$1\n";

}

exit;

^ top

Satellog

W3C: XHTML, CSS