

2nd place, BC Net's Coolest Application Contest, 2005

Satellog Database Documentation



Downloading and processing UniGene data


We were curious whether the UniGene clusters posted at NCBI show any indication of repeat polymorphism. Repeatalyzer automatically evaluates each repeat for polymorphisms within UniGene clusters. To do this, however, it needs the clusters and all of their sequences:


How to download the FASTA files by FTP:


# open FTP connection to NCBI

$ ftp -i


# login



# change to UniGene directory

cd /repository/UniGene


# download all human UniGene files

mget Hs*


Convert FASTA formatting of Hs.seq.uniq file


The Hs.seq.uniq file contains, for each UniGene cluster, the sequence representing the longest, highest-quality stretch of DNA.  We will use BLAT to test whether each repeat, plus 10 bp of upstream and downstream genomic sequence, can be detected within these sequences.  The FASTA headers provided by NCBI are long and cumbersome, and too long for the BLAT output.


For example the FASTA header for Hs.2 is:


>gnl|UG|Hs#S1728506 Homo sapiens N-acetyltransferase 2 (arylamine N-acetyltransferase) (NAT2), mRNA /cds=(108,980) /gb=NM_000015 /gi=4557782 /ug=Hs.2 /len=1276


From this, we only really need the cluster identifier (Hs.2) and the UniGene identifier for this sequence within Hs.2 (Hs#S1728506).


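Run the following command-line Perl script to reformat the headers in this file for subsequent BLAT analysis (the -i.bak option edits the file in place and keeps a backup of the original as Hs.seq.uniq.bak):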


$ perl -i.bak -p -e 's/^.*(Hs\#\S+).*\/ug\=(\S+).*$/>$2|$1/' Hs.seq.uniq
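Before editing the file in place, the substitution can be tried on a single header. The following is a minimal sketch that uses sed instead of Perl, so it is an equivalent rewrite rather than the command above; the function name reformat_header is hypothetical. It reads FASTA text on standard input, rewrites header lines to ">cluster|sequence" form, and passes sequence lines through unchanged.

```shell
# Rewrite UniGene FASTA headers to ">cluster_id|sequence_id".
# Lines without a "Hs#..." identifier and a "/ug=" field are left untouched.
reformat_header() {
  sed -E 's/^>.*(Hs#[^[:space:]]+).*\/ug=([^[:space:]]+).*$/>\2|\1/'
}

# Example: preview the first few rewritten headers without modifying the file
#   grep '^>' Hs.seq.uniq | head -n 3 | reformat_header
```

Because nothing is written back to Hs.seq.uniq, this preview is safe to run repeatedly while adjusting the pattern.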


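The FASTA headers for all sequences in Hs.seq.uniq now have the form cluster identifier, then sequence identifier. For Hs.2, the header is now:

>Hs.2|Hs#S1728506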



Now rename this file to Hs.seq.uniq2:

$ mv Hs.seq.uniq Hs.seq.uniq2


And rename the backup file created by the command-line script back to the original name:

$ mv Hs.seq.uniq.bak Hs.seq.uniq


Make Hs.seq.uniq2 into a BLATable database


Here the multi-FASTA file is converted to the .2bit format so that BLAT can use it as a database:


$ ~/blat/faToTwoBit Hs.seq.uniq2 Hs.seq.uniq2.2bit


Make a note of where this file is located; repeatalyzer requires it in order to work.
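As a rough sanity check on the conversion, one could compare the number of FASTA records in Hs.seq.uniq2 with the number of sequences reported for the .2bit file. The sketch below only counts FASTA headers; the helper name count_fasta_records is hypothetical, and the commented twoBitInfo line assumes that utility was installed under ~/blat alongside faToTwoBit.

```shell
# Count FASTA records (lines starting with ">") in a file.
count_fasta_records() {
  grep -c '^>' "$1"
}

# Example (not run here; the ~/blat path for twoBitInfo is an assumption):
#   count_fasta_records Hs.seq.uniq2
#   ~/blat/twoBitInfo Hs.seq.uniq2.2bit stdout | wc -l
```

If the two numbers differ, the headers in Hs.seq.uniq2 are worth inspecting before running BLAT.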


Split the UniGene clusters into cluster delineated multiple FASTA files


The Hs.seq.all file from UniGene is essentially one huge flat file.  Within this file, each UniGene cluster is introduced by a line beginning with #, followed by the sequences that make up the cluster.  For repeatalyzer to work, the clusters need to be parsed into separate files, one per cluster, each containing all of its associated sequences. The Hs.seq.all file was parsed with the following script:


# make a new directory (105680 files will be created!)

# make a note of the absolute location of these files

# they will be needed by repeatalyzer


$ mkdir ugc_fasta


# run the script


$ ./ Hs.seq.all


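# Code for parsing Hs.seq.all

#!/usr/bin/perl -w
# Splits Hs.seq.all into one multi-FASTA file per UniGene cluster
# (<cluster>.ugc), written to the current directory.

use strict;

my $outputfile = "frig";    # sentinel value: no cluster file opened yet
my $i;                      # position of the next sequence in the current cluster
my $count;                  # number of sequences announced for the cluster

while (<>) {
        if (/^\s+$/) {
                # skip blank lines
        } elsif (/^\s*#.*containing\s+(\d+)/) {
                # cluster delimiter line: store the cluster size, restart the counter
                # (anchored at line start so FASTA headers containing "Hs#" cannot match)
                $count = $1;
                $i = 1;
                print "conditional 2: $count\n";
        } elsif (/^(>.*\/ug\=(\S+).*$)/) {
                if ($outputfile eq "frig") {
                        # first sequence of the first cluster
                        $outputfile = $2;
                        unless ( open(SEQ, ">$outputfile\.ugc") ) {
                          die "Cannot open file \"$outputfile\" to write to!\n\n";
                        }
                        print SEQ "$1\n";
                        $i++;
                } elsif (($outputfile ne "frig") && ($i == 1)) {
                        # first sequence of a later cluster: switch output files
                        close(SEQ);
                        $outputfile = $2;
                        unless ( open(SEQ, ">$outputfile\.ugc") ) {
                          die "Cannot open file \"$outputfile\" to write to!\n\n";
                        }
                        print SEQ "$1\n";
                        $i++;
                } elsif (($outputfile ne "frig") && ($i <= $count)) {
                        # remaining sequences of the current cluster
                        print SEQ "$1\n";
                        $i++;
                }
        } elsif (/(\S+)/) {
                # sequence line: copy it to the current cluster file
                print SEQ "$1\n";
        }
}

close(SEQ) if $outputfile ne "frig";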
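A quick way to check the split is to compare the number of cluster delimiter lines in Hs.seq.all with the number of .ugc files that were written. The sketch below assumes, as described above, that each cluster is introduced by a line beginning with #; both function names are illustrative and not part of repeatalyzer.

```shell
# Count cluster delimiter lines (lines beginning with "#") in the flat file.
count_clusters() {
  grep -c '^#' "$1"
}

# Count the per-cluster .ugc files in a directory (non-recursive).
count_ugc_files() {
  ls -1 "$1" | grep -c '\.ugc$'
}

# Example (the directory name follows the mkdir step above):
#   count_clusters Hs.seq.all
#   count_ugc_files ugc_fasta
```

The two counts should agree; a mismatch suggests that some delimiter or header lines did not match the patterns used by the parsing script.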




