The MyGenBank Page

MyGenBank is a package for managing a local copy of GenBank using MySQL. Why would you want to do this? If you have to ask, you're not the intended audience.

How does it work?

MyGenBank stores only the most important data about each sequence in the actual database. This keeps it pretty small. The sequence itself is not stored in the database. Instead, it is kept in both FASTA and GenBank flat file forms.

Release Notes

This is unsupported, free software. If it doesn't meet your needs, don't use it. Please report bugs, but don't expect me to fix them immediately (if at all).

The main developer of MyGenBank is now Robert Citek. Contact him via rwcitek@cs.wustl.edu

Documentation

Introduction
Database Specification
Setting up MyGenBank
Querying MyGenBank

Introduction

Anyone interested in MyGenBank should first read the most recent GenBank release notes and perhaps also the DDBJ/EMBL/GenBank Feature Table definition and the taxonomy definitions (see the names.dmp file). MyGenBank consists of two main components:

the administration tool mygb_admin
the querying tools mygb_fetch and mygb_query

Database Specification

MyGenBank exists as a single table containing some of the most important sequence attributes. However, the sequence is not stored in the database. To get the raw sequence, Fasta file, or GenBank flat file, you use the mygb_fetch tool (see below).

Column     Type        Attributes             Indexed
accession  VARCHAR(8)  NOT NULL, PRIMARY KEY  yes
version    INT1        NOT NULL               no
gi         INT4        NOT NULL, UNIQUE       yes
length     INT3        NOT NULL               yes
date       DATE        NOT NULL               yes
taxid      INT3        NOT NULL               yes
mol_type   ENUM        NOT NULL               yes
division   ENUM        NOT NULL               yes
keywords   SET                                no
features   SET                                no
file       VARCHAR(5)  NOT NULL               no
fasta      INT4        NOT NULL               no
genbank    INT4        NOT NULL               no
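
The table does not spell out what the file, fasta, and genbank columns hold. One plausible reading, and it is only an assumption here, is that file names a flat file while fasta and genbank are byte offsets into its Fasta and GenBank versions, which would let a tool jump straight to a record without scanning. The Python sketch below illustrates that idea with a hypothetical helper, read_fasta_record; it is not taken from the MyGenBank code.

```python
import os
import tempfile

def read_fasta_record(path, offset):
    """Return the Fasta record that starts at byte `offset` in `path`.

    Assumes `offset` points at the '>' of a header line (hypothetical
    interpretation of the fasta column).
    """
    with open(path, "rb") as handle:
        handle.seek(offset)
        lines = [handle.readline()]        # the '>' header line
        for line in handle:
            if line.startswith(b">"):      # the next record begins here
                break
            lines.append(line)
    return b"".join(lines).decode("ascii")

# Demonstration with two made-up records in a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fasta")
    first = b">gi|1|demo one\nACGT\n"
    second = b">gi|2|demo two\nGGCC\nTTAA\n"
    with open(path, "wb") as handle:
        handle.write(first + second)
    record = read_fasta_record(path, len(first))
```

If the columns work this way, the cost of a fetch would be one seek plus the length of the record, regardless of how large the flat file is.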

Enums and Sets

The division enums are determined during setup and stored in the $MYGENBANK_DATA/Definition directory. The keywords, features, and mol_types are stored in the $MYGENBANK_CODE/def directory; terse copies are made during setup and saved to the $MYGENBANK_DATA/Definition directory. You may edit the keywords and features files to fit your own criteria; see the directions in each of the files in $MYGENBANK_CODE/def. The default keywords, features, mol_types, and divisions are given below.

mol_type (note that I include the ds- or ss- prefix in the mol_type rather than storing a separate value; the CIRCULAR tag is omitted): AA ds-RNA ds-mRNA ds-rRNA mRNA ms-DNA ms-RNA rRNA scRNA ss-DNA ss-RNA tRNA uRNA
division: BCT EST GSS HTG INV MAM PAT PHG PLN PRI ROD STS SYN UNA VRL VRT
keywords (my chosen set of keywords from the KEYWORDS field): EST HTG HTGS_PHASE0 HTGS_PHASE1 HTGS_PHASE2 HTGS_PHASE3 HTGS_DRAFT GSS STS
features (my own set of features, not the entire GenBank set; there is a maximum of 63 features, which is fewer than GenBank defines): 3_UTR 3_clip 5_UTR 5_clip CAAT_signal CDS GC_signal RBS STS TATA_signal conflict enhancer exon gene intron mRNA mat_peptide misc_RNA misc_binding misc_signal misc_structure modified_base polyA_signal polyA_site precursor_RNA prim_transcript promoter protein_bind rRNA repeat_region satellite scRNA sig_peptide snRNA stem_loop tRNA terminator transit_peptide transposon unsure variation

Setting up MyGenBank

Environment Variables

You need to set two environment variables. You should probably add these to your login scripts.

MYGENBANK_CODE: This should point to the directory where this documentation exists. You should find 4 subdirectories here: arch, bin, lib, and def.
MYGENBANK_DATA: This should point to a directory where MyGenBank will store its files. Four directories will be created here: Definition, GenBank, Fasta, and Table.
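
Before running anything, it can help to confirm that both variables are set and that the code directory looks as described. The following Python sketch is one way to do that; the helper names missing_variables and missing_code_subdirs are made up for illustration and are not part of MyGenBank.

```python
import os
import tempfile

REQUIRED_VARS = ("MYGENBANK_CODE", "MYGENBANK_DATA")
CODE_SUBDIRS = ("arch", "bin", "lib", "def")

def missing_variables(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

def missing_code_subdirs(code_dir):
    """Return the expected subdirectories that are absent from code_dir."""
    return [name for name in CODE_SUBDIRS
            if not os.path.isdir(os.path.join(code_dir, name))]

# Demonstration: a throwaway directory stands in for $MYGENBANK_CODE,
# and "def" is deliberately left out.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("arch", "bin", "lib"):
        os.mkdir(os.path.join(tmp, name))
    demo_missing = missing_code_subdirs(tmp)
```

Calling missing_variables() with no argument checks the real environment; passing a dictionary, as in a test, keeps the check free of side effects.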

$MYGENBANK_CODE Directory

arch: An archive of code that is not necessary to run the current version of MyGenBank.
bin: Contains the executable files for MyGenBank. Currently, this contains mygb_admin, mygb_fetch, and mygb_query.
def: Contains the default keywords, features, and mol_types which are stored as SET types in MyGenBank. These files may be edited to capture more sequence attributes. Keywords are parsed from the GenBank KEYWORDS lines and features are parsed from the "feature keys" in the feature table. See the Feature table definition for more information.
lib: Contains the GBlite.pm Perl module, which is used for parsing GenBank flat files and may be useful outside this context as well.
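
To make the idea of parsing "feature keys" concrete, here is a rough Python sketch that collects the feature keys of one flat-file record and keeps only those in a chosen set. It relies on the standard flat-file layout, where a feature key starts in column 6 and qualifier lines are indented further. The replacement of the apostrophe with an underscore (3'UTR becoming 3_UTR) is an assumption inferred from the default feature names, and feature_keys is a hypothetical function, not the GBlite.pm interface.

```python
def feature_keys(flatfile_text, allowed):
    """Return the feature keys of one record that are also in `allowed`."""
    keys = set()
    in_table = False
    for line in flatfile_text.splitlines():
        if line.startswith("FEATURES"):
            in_table = True
            continue
        if not in_table:
            continue
        if line[:1] not in (" ", ""):
            break                          # ORIGIN, //, etc. end the table
        # Key lines have exactly five leading spaces; qualifiers have more.
        if line.startswith("     ") and len(line) > 5 and line[5] != " ":
            key = line[5:21].strip().replace("'", "_")   # assumed mapping
            if key in allowed:
                keys.add(key)
    return keys

# A made-up record fragment for demonstration.
sample = "\n".join([
    "FEATURES             Location/Qualifiers",
    "     source          1..20",
    '                     /organism="Homo sapiens"',
    "     CDS             3..14",
    "     3'UTR           15..20",
    "ORIGIN",
    "        1 acgtacgtac gtacgtacgt",
    "//",
])
found = feature_keys(sample, {"CDS", "3_UTR", "exon"})
```

Here the source key is dropped because it is not in the chosen set, which mirrors how a trimmed features file would limit what gets recorded.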

$MYGENBANK_DATA Directory

Definition: Contains files for keywords, features, divisions, mol_types, and filenames. keywords and features are copied from $MYGENBANK_CODE/def. divisions, mol_types, and filenames are created by the "mygb_admin parse" command. Also contains the *.sql files. The MyGenBank.sql file contains the database definition. Other *.sql files correspond to the individual GenBank files.
Fasta: Contains the Fasta files corresponding to the sequence(s) from the GenBank flat file. The files are created by the "mygb_admin parse" command.
GenBank: Contains GenBank flat files downloaded from the NCBI. The files are created by the "mygb_admin ftp" command.
Table: Contains tab-delimited data for bulk loading into MySQL. The files are created by the "mygb_admin parse" command.

External Dependencies

Before you begin, you must have MySQL and Perl installed. You will also need the libnet modules (just Net::FTP actually) as well as the MySQL DBI for Perl. You can find these components at mysql.com and CPAN.

Space Requirements

You're going to need a lot of space. GenBank is continually growing. See the release notes to find out how big the flat files are for the latest release. You need to add about 1/3 more than this for the Fasta versions of the files. If you plan on doing incremental updates, you need to take this into account too (see growth of GenBank in the release notes).
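
As a back-of-the-envelope check, the rule of thumb above (flat files plus about one third more for the Fasta copies) can be written as a one-line calculation. The figure of 30 in this Python sketch is made up purely for illustration; take the real size from the release notes, and remember that updates add to it.

```python
def estimated_space(flat_file_size):
    """Flat files plus roughly one third more for the Fasta copies.

    The result is in the same unit as the input and ignores the
    database itself and any incremental updates.
    """
    return flat_file_size + flat_file_size / 3

example = estimated_space(30)   # 30 is a made-up size, in any unit
```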

mygb_admin

The mygb_admin tool is used to build MyGenBank. The first time you try building MyGenBank, you may wish to use the -t switch to enter testing mode. This will process just one GenBank file, which will allow you to determine if your environment is set up correctly before wasting a lot of download and CPU time.

mygb_admin    setup
mygb_admin -t ftp
mygb_admin -t parse
mygb_admin -t build
mygb_admin -t test

If you plan on doing incremental updates, you should test this too.

mygb_admin -t update
mygb_admin -t test

mygb_admin commands

setup

creates the directories in $MYGENBANK_DATA if necessary
copies definitions from $MYGENBANK_CODE/def to $MYGENBANK_DATA/Definition
creates filenames and divisions files in $MYGENBANK_DATA/Definition

ftp

reads the filenames from $MYGENBANK_DATA/Definition/filenames
skips files already transferred (checks for existence in $MYGENBANK_DATA/GenBank)
downloads each file, pipes it to gunzip, and saves it to $MYGENBANK_DATA/GenBank

parse

reads the filenames from $MYGENBANK_DATA/GenBank
skips files already parsed (checks for existence in $MYGENBANK_DATA/Table)
parses each GenBank file
creates a corresponding fasta file in $MYGENBANK_DATA/Fasta
creates a corresponding tab-delimited file in $MYGENBANK_DATA/Table

build

reads filenames from $MYGENBANK_DATA/Table and skips files already loaded into MySQL (checks for existence in $MYGENBANK_DATA/Definition)
defines the MyGenBank table (see the $MYGENBANK_DATA/Definition/MyGenBank.sql file)
loads each tab-delimited file in $MYGENBANK_DATA/Table

test

runs some simple queries on MyGenBank; see the section on querying below

update

gets a list of all GenBank update files from NCBI
ftp (unless already downloaded)
parse (unless already parsed)
build (unless already built)

You may put the commands together on a single line, and the typical command line for a test build of MyGenBank would look like this:

mygb_admin -t setup ftp parse build test

If everything works, then you should use the following command line:

mygb_admin setup ftp parse build test >& logfile &

This may take some time; the exact amount will depend on your network, CPU, filesystem, and the size of GenBank. On my workstation (900 MHz, 512 MB, 73 GB 10K SCSI, 400-600 Kb/sec bandwidth) with release 120, I was able to build MyGenBank in about 6-10 hours, depending on traffic and on whether I was also including the updates.

If the build stops for some reason, like network failure, you can restart it again and it won't download files previously fetched (see the command details above). You may have to delete the last file created if it has errors. If you're logging STDERR as shown above, you should be able to find the file without any problems.
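
The restart behaviour comes down to checking whether an output file already exists before doing the work. The Python sketch below shows that pattern for the parse step; it assumes, for illustration only, that an output in the Table directory carries the same name as its input in the GenBank directory. The function name pending and the file names are hypothetical.

```python
import os
import tempfile

def pending(source_dir, done_dir):
    """Names in source_dir that have no same-named file in done_dir."""
    done = set(os.listdir(done_dir))
    return sorted(name for name in os.listdir(source_dir) if name not in done)

# Demonstration: three downloaded files, one of which was already parsed.
with tempfile.TemporaryDirectory() as tmp:
    genbank = os.path.join(tmp, "GenBank")
    table = os.path.join(tmp, "Table")
    os.mkdir(genbank)
    os.mkdir(table)
    for name in ("file_a", "file_b", "file_c"):
        open(os.path.join(genbank, name), "w").close()
    open(os.path.join(table, "file_a"), "w").close()
    todo = pending(genbank, table)
```

An existence check cannot tell a complete file from a truncated one, which is why the last file written before a failure may need to be deleted by hand.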

If you want to do incremental updates, you can use the following command:

mygb_admin update >& update_log &

Querying MyGenBank

There are two command line tools for interacting with MyGenBank. These are explained below.

mygb_fetch

mygb_fetch is used for retrieving sequences in raw, Fasta, or GenBank format, singly or in batches. You may specify accession numbers, gi numbers, or query strings. The default format is Fasta. For example, to fetch a single specific sequence, gi=23456, in Fasta format, you would type:

mygb_fetch 23456

You could also retrieve that entry by its accession:

mygb_fetch Z16870

Or with an abbreviated SQL statement (without the leading "select ... from ...", and don't forget the inner quotes for strings):

mygb_fetch "gi = 23456"
mygb_fetch "accession = 'Z16870'"

You can also retrieve multiple sequences by including multiple arguments on the command line:

mygb_fetch 23456 45678

You can retrieve a batch of sequences with abbreviated SQL syntax. Here's how you build a Fasta database of all the human transcripts:

mygb_fetch "taxid = 9606 and mol_type = 'mRNA'" > human_tx.fasta

You can even mix and match if you like:

mygb_fetch Z16870 45678 "mol_type = 'uRNA'"
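
One way to picture how three kinds of argument can share a command line is to turn each one into a condition. The Python sketch below is a guess at such a rule, not the actual mygb_fetch logic: an all-digit argument is read as a gi, letters followed by digits as an accession, and anything else is passed through as an abbreviated SQL condition. The function name where_clause and the accession pattern are assumptions.

```python
import re

def where_clause(argument):
    """Turn one command-line argument into a SQL condition (illustrative)."""
    if argument.isdigit():
        return "gi = " + argument
    if re.fullmatch(r"[A-Za-z]+[0-9]+", argument):      # assumed pattern
        return "accession = '" + argument + "'"
    return argument                                      # already a condition

clauses = [where_clause(a) for a in ("Z16870", "45678", "mol_type = 'uRNA'")]
```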

Here's how to build a Fasta database of all human sequences with annotated coding sequences that have been deposited in GenBank since March 15th, 2000. Note the use of "find_in_set", which is used for querying features and keywords.

mygb_fetch "taxid = 9606 and find_in_set('CDS',features) and date > '2000-03-15'"
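
For readers unfamiliar with MySQL's FIND_IN_SET, it returns the 1-based position of a string within a comma-separated list, or 0 if the string is absent, and a WHERE clause treats any nonzero result as true. The Python sketch below mimics that behaviour for plain strings. It is a simplified stand-in that ignores NULL handling and compares case-sensitively, whereas MySQL's comparison depends on the column's collation.

```python
def find_in_set(item, strlist):
    """Return the 1-based position of item in a comma-separated list, else 0."""
    if not strlist:
        return 0
    members = strlist.split(",")
    return members.index(item) + 1 if item in members else 0

features = "CDS,exon,gene"           # a made-up features value
has_cds = bool(find_in_set("CDS", features))
```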

You can get the sequence in raw format or GenBank flat file format using the -r and -g switches (and be explicit about fasta if you like):

mygb_fetch -r 23456
mygb_fetch -g 23456
mygb_fetch -f 23456

You may find that the data in MyGenBank is limiting. For example, you might want to know who the authors of the sequences are. To do this, you can process the flat files as a post-processing step with UNIX shell tools, with the GBlite.pm Perl module included in the $MYGENBANK_CODE/lib directory, or with other tools, such as those found at bioperl.
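
As an example of such post-processing, the Python sketch below pulls the AUTHORS lines out of a GenBank flat-file record, joining continuation lines. It relies on the usual flat-file layout in which the AUTHORS subkeyword is indented two spaces and its text starts in column 13. The function name authors and the sample record are made up, and this is not how GBlite.pm does it.

```python
def authors(flatfile_text):
    """Return one author string per REFERENCE block in the record."""
    found = []
    collecting = False
    for line in flatfile_text.splitlines():
        if line.startswith("  AUTHORS"):
            found.append(line[12:].strip())
            collecting = True
        elif collecting and line.startswith(" " * 12):
            found[-1] += " " + line.strip()    # continuation line
        else:
            collecting = False
    return found

# A made-up reference block for demonstration.
sample = "\n".join([
    "REFERENCE   1  (bases 1 to 20)",
    "  AUTHORS   Doe,J., Roe,R. and",
    "            Poe,E.",
    "  TITLE     A made-up title",
    "  JOURNAL   Unpublished",
])
names = authors(sample)
```

You could feed it the output of "mygb_fetch -g", split on the "//" record terminator, to get authors for each fetched entry.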

For archival/publication reasons, you may want to exclude update sequences so you can just say "we used Release 120". You can either build without updates or use the -u switch in mygb_fetch.

mygb_fetch -u "division = 'EST'"

mygb_query

mygb_query is used for retrieving tab-delimited columns of data from the database. To use it, you give it straight SQL. These are the commands issued by "mygb_admin test":

"select COUNT(*) from MyGenBank",
"select accession, mol_type, date, taxid from MyGenBank limit 5",
"select length, accession from MyGenBank where length < 10 limit 5",
"select COUNT(*) from MyGenBank where division='BCT'",
"select COUNT(*) from MyGenBank where find_in_set('HTG', keywords)",
"select COUNT(*) from MyGenBank where find_in_set('CDS', features)",

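Because mygb_query returns tab-delimited columns, its output is easy to load into other programs. The Python sketch below parses such output into dictionaries. It assumes there is no header row and that the columns arrive in the order named in the select statement. The two sample rows are invented, and parse_rows is a hypothetical helper.

```python
import csv
import io

def parse_rows(output_text, columns):
    """Parse tab-delimited text into a list of dicts keyed by column name."""
    reader = csv.reader(io.StringIO(output_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader if row]

# Invented output for: select accession, mol_type, date, taxid ...
sample = ("X00001\tmRNA\t2000-01-01\t9606\n"
          "X00002\tss-DNA\t2000-02-01\t10090\n")
rows = parse_rows(sample, ["accession", "mol_type", "date", "taxid"])
```

All values come back as strings, so numeric columns such as taxid need an explicit int() before any arithmetic or comparison.
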
Latest version at sourceforge.net
