MyGenBank is a package for managing a local copy of GenBank using MySQL. Why
would you want to do this? If you have to ask, you're not the intended
audience.
How does it work?

MyGenBank stores only the most important data about each sequence in the actual database, which keeps the database small. The sequence itself is not stored in the database. Instead, it is kept in both FASTA and GenBank flat file forms.

Release Notes

This is unsupported, free software. If it doesn't meet your needs, don't use it. Please report bugs, but don't expect me to fix them immediately (if at all).

The main developer of MyGenBank is now Robert Citek. Contact him via rwcitek@cs.wustl.edu

Documentation

Introduction
Database Specification
Setting up MyGenBank
Querying MyGenBank

Introduction

Anyone interested in MyGenBank should first read the most recent GenBank release notes and perhaps also see the DDBJ/EMBL/GenBank Feature Table Definition and taxonomy definitions (see the names.dmp file). MyGenBank consists of two main components:
Database Specification

MyGenBank exists as a single table containing some of the most important sequence attributes. However, the sequence is not stored in the database. To get the raw sequence, Fasta file, or GenBank flat file, you use the mygb_fetch tool (see below).
Enums and Sets

The division enums are determined during setup and stored in the $MYGENBANK_DATA/Definition directory. The keywords, features, and mol_types are stored in the $MYGENBANK_CODE/def directory; terse copies are made during setup and saved to the $MYGENBANK_DATA/Definition directory. You may edit the keywords and features files to fit your own criteria; see the directions in each of the files in $MYGENBANK_CODE/def. The default keywords, features, mol_types, and divisions are given below.
Setting up MyGenBank

Environment Variables

You need to set two environment variables. You should probably add these to your login scripts.
$MYGENBANK_CODE Directory
$MYGENBANK_DATA Directory
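
As a minimal sketch, the two variables could be set like this in a Bourne-style login script. The directory paths are placeholders invented for illustration, not locations the package requires; csh users would use setenv instead of export.

```shell
# Hypothetical locations; substitute the directories you actually use.
# MYGENBANK_CODE is referenced elsewhere in this document as containing
# the def and lib directories.
export MYGENBANK_CODE="$HOME/mygenbank/code"
# MYGENBANK_DATA is referenced elsewhere as containing the Definition directory.
export MYGENBANK_DATA="$HOME/mygenbank/data"

echo "MYGENBANK_CODE=$MYGENBANK_CODE"
echo "MYGENBANK_DATA=$MYGENBANK_DATA"
```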
External Dependencies

Before you begin, you must have MySQL and Perl installed. You will also need the libnet modules (just Net::FTP actually) as well as the MySQL DBI for Perl. You can find these components at mysql.com and CPAN.
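
One way to check for these dependencies before starting is sketched below. The helper names check_cmd and check_perl_module are made up for this example, and reading "the MySQL DBI for Perl" as the DBI module plus the DBD::mysql driver is an assumption.

```shell
# Report whether a command is on the PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "MISSING: $1"
    fi
}

# Report whether Perl can load a module (also reports MISSING if perl is absent).
check_perl_module() {
    if perl -M"$1" -e 1 >/dev/null 2>&1; then
        echo "found:   perl module $1"
    else
        echo "MISSING: perl module $1"
    fi
}

check_cmd mysql
check_cmd perl
check_perl_module Net::FTP
# Assumption: the "MySQL DBI" means DBI together with the DBD::mysql driver.
check_perl_module DBI
check_perl_module DBD::mysql
```

Anything reported as MISSING would need to be installed before running mygb_admin.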
Space Requirements

You're going to need a lot of space. GenBank is continually growing. See the release notes to find out how big the flat files are for the latest release. You need to add about 1/3 more than this for the Fasta versions of the files. If you plan on doing incremental updates, you need to take this into account too (see growth of GenBank in the release notes).
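
The rule of thumb above (flat files plus about one third more for the Fasta copies) can be turned into a rough calculation. This is only a sketch: the function name estimate_space_gb is invented here, and the input of 30 GB is a placeholder, not the size of any real release.

```shell
# Estimate disk space: flat files plus about one third more for Fasta copies.
# Usage: estimate_space_gb FLAT_FILE_SIZE_IN_GB
estimate_space_gb() {
    flat_gb=$1
    # Integer ceiling of flat_gb * 4 / 3.
    echo $(( (flat_gb * 4 + 2) / 3 ))
}

# 30 is a placeholder; take the real flat file size from the release notes.
echo "Plan for at least $(estimate_space_gb 30) GB, before any incremental updates"
```

The estimate does not include room for incremental updates, which has to be added separately.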
mygb_admin

The mygb_admin tool is used to build MyGenBank. The first time you try building MyGenBank, you may wish to use the -t switch to enter testing mode. This will process just one GenBank file, which will allow you to determine if your environment is set up correctly before wasting a lot of download and CPU time.

    mygb_admin setup
    mygb_admin -t ftp
    mygb_admin -t parse
    mygb_admin -t build
    mygb_admin -t test

If you plan on doing incremental updates, you should test this too.

    mygb_admin -t update
    mygb_admin -t test

mygb_admin commands
    mygb_admin -t setup ftp parse build test

If everything works, then you should use the following command line:

    mygb_admin setup ftp parse build test >& logfile &

This may take some time; the exact amount depends on your network, CPU, filesystem, and the size of GenBank. On my workstation (900 MHz, 512 MB, 73 GB 10K SCSI, 400-600 Kb/sec bandwidth) with release 120, I was able to build MyGenBank in about 6-10 hours, depending on traffic and on whether I was also including the updates. If the build stops for some reason, such as a network failure, you can restart it and it won't download files previously fetched (see the command details above). You may have to delete the last file created if it has errors. If you're logging STDERR as shown above, you should be able to find the file without any problems.

If you want to do incremental updates, you can use the following command:

    mygb_admin update >& update_log &

Querying MyGenBank

There are two command line tools for interacting with MyGenBank. These are explained below.

mygb_fetch

mygb_fetch is used for retrieving sequences in raw, Fasta, or GenBank format, singly or in batches. You may specify accession numbers, gi numbers, or query strings. The default format is Fasta. For example, to fetch a single specific sequence, gi=23456, in fasta format, you would type:

    mygb_fetch 23456

You could also retrieve that entry by its accession:

    mygb_fetch Z16870

Or with an abbreviated SQL statement (without the "select ... from ..." prefix, and don't forget the inner quotes for strings):

    mygb_fetch "gi = 23456"
    mygb_fetch "accession = 'Z16870'"

You can also retrieve multiple sequences by including multiple arguments on the command line:

    mygb_fetch 23456 45678

You can retrieve a batch of sequences with abbreviated SQL syntax.
Here's how you build a Fasta database of all the human transcripts:

    mygb_fetch "taxid = 9606 and mol_type = 'mRNA'" > human_tx.fasta

You can even mix and match if you like:

    mygb_fetch Z16870 45678 "mol_type = 'uRNA'"

Here's how to build a fasta database of all human sequences with annotated coding sequences that have been deposited in GenBank since March 15th, 2000. Note the use of "find_in_set", which is used for querying features and keywords.

    mygb_fetch "taxid = 9606 and find_in_set('CDS',features) and date > '2000-03-15'"

You can get the sequence in raw format or GenBank flat file format using the -r and -g switches (and be explicit about fasta if you like):

    mygb_fetch -r 23456
    mygb_fetch -g 23456
    mygb_fetch -f 23456

You may find that the data in MyGenBank is limiting. For example, you might want to know who the authors of the sequences are. To do this, you can process the flat files as a post-processing step with UNIX shell tools, with the GBlite.pm Perl module included in the $MYGENBANK_CODE/lib directory, or with other tools, such as those found at bioperl.

For archival/publication reasons, you may want to exclude update sequences so you can just say "we used Release 120". You can either build without updates or use the -u switch in mygb_fetch.

    mygb_fetch -u "division = 'EST'"

mygb_query

mygb_query is used for retrieving tab-delimited columns of data from the database. To use it, you give it straight SQL. These are the commands issued by "mygb_admin test":

    "select COUNT(*) from MyGenBank"
    "select accession, mol_type, date, taxid from MyGenBank limit 5"
    "select length, accession from MyGenBank where length < 10 limit 5"
    "select COUNT(*) from MyGenBank where division='BCT'"
    "select COUNT(*) from MyGenBank where find_in_set('HTG', keywords)"
    "select COUNT(*) from MyGenBank where find_in_set('CDS', features)"
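
Because mygb_query produces tab-delimited columns, its output can be piped into standard UNIX tools. The sketch below totals the first column of such output. It assumes the output has no header row, and it uses two made-up rows in place of a real mygb_query call so that it runs without a database; sum_lengths is a name invented for this example.

```shell
# Sum the first column (length) of tab-delimited rows read from stdin.
sum_lengths() {
    awk -F '\t' '{ total += $1; n++ } END { printf "%d sequences, %d bases\n", n, total }'
}

# Stand-in for the output of a query such as
#   "select length, accession from MyGenBank where length < 10 limit 5"
# The two rows below are invented sample data, not real GenBank entries.
printf '7\tAAA00001\n9\tAAA00002\n' | sum_lengths
```

With a working installation, the printf line would presumably be replaced by the mygb_query command itself, with the SQL passed as a single quoted argument.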
Latest version at sourceforge.net