WebApollo provides an easy-to-use platform for manual curation. Because it is a web application, all you need to use it is a modern web browser and an internet connection. The trade-off is that deploying the application on a server for the first time takes more work.
Setting up the WebApollo server requires familiarity with server administration, database administration, and with the applications used by WebApollo. There are quite a few requirements to get the application up and running. We have developed two solutions to address this issue. The first is GMOD in the Cloud, a virtual machine for deployment on the cloud. It comes with WebApollo (among other GMOD tools) already installed. This provides a great solution for those who have no issues with their instances and data being hosted elsewhere.
For those who might have issues with keeping data on the cloud, we provide this virtual machine, which can be deployed locally.
You can download the latest version of the virtual machine here.
You'll need to download and install VirtualBox to run the virtual machine. It's freely available for all major operating systems and it has been greatly improved over the years.
The virtual machine comes with a data partition of about 4.2 GB. This should be enough for most smaller organisms without a massive amount of data. If you need more storage than that, you'll need to set up a shared folder for your virtual machine. All the system administration work needed to use the shared folder has already been done; you just need to follow a few steps:
Now WebApollo will use the shared folder for its data. This allows you to make the amount of data allocated as large as you'd like.
The virtual machine is running Ubuntu 12.04 Server. Since it's meant to be only run as a server, no desktop environment has been installed (this greatly decreases the virtual machine size). You can always access the server through the console in VirtualBox or SSH into it.
We have installed and configured all the requirements to run WebApollo. Applications:
Since most things are already configured, you will just need to configure and process the data for your organism.
The username and password to access the virtual machine are both 'webapollo'. You will likely want to change the password. Type in your shell:
Enter 'webapollo' as the current password and then the new password.
As mentioned previously, there's no desktop environment installed on the virtual machine. You will need to access it either through the VirtualBox console or SSH into the machine. If you want to SSH into the machine, you'll need to find the machine's IP address. You can do that from the VirtualBox console. Type in your shell:
You are interested in the 'inet addr:' value that's returned.
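If you want to script the SSH connection, it can help to pull just the address out of that output. The following is a minimal sketch, assuming the older `ifconfig` output format that labels addresses with 'inet addr:'. The function name `extract_inet_addr` and the sample text (including its address) are made up for illustration.

```shell
#!/bin/sh
# Print every IPv4 address labelled "inet addr:" in ifconfig-style text read
# from stdin, skipping loopback addresses (127.x.x.x).
extract_inet_addr() {
    sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p' | grep -v '^127\.'
}

# Illustrative sample only; the addresses below are made up.
sample='eth0      Link encap:Ethernet  HWaddr 08:00:27:00:00:01
          inet addr:192.168.56.101  Bcast:192.168.56.255  Mask:255.255.255.0
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0'

printf '%s\n' "$sample" | extract_inet_addr

# On the virtual machine itself, the same filter would read live output:
#   ifconfig | extract_inet_addr
```

The printed address is the one to pass to your SSH client, together with the 'webapollo' username.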
Before you can start using WebApollo, you'll need to configure and process your data. Note that whenever we use '~' as a directory, we're referencing the 'webapollo' user home directory ('/home/webapollo').
We have set up a number of symlinks in the '~/webapollo' directory to make it easier to reach the directories used when setting up the application. They're as follows:
For this guide, we'll be using the included sample data located in:
We have an experimental script to help set up your WebApollo instance. The script is interactive and will set up the configuration, database, and permissions, run the JBrowse setup scripts, and set up the BLAT database, all from just a GFF3 file. Let's try it with our sample data.
Some things to note about this script. First, it will assign default colors to the evidence tracks. If you want to modify their look, you'll need to configure them yourself. Second, it does not set up canned comments; to add them, edit '~/webapollo/config/canned_comments.xml'. Finally, this tool won't handle adding BAM and BigWig files. You'll need to run the corresponding scripts to add those types of files. See the sections on BAM and BigWig respectively.
The following sections cover the individual steps that were automated by the script. You'll want to read through them to better understand the steps involved in setting up the installation. The script is experimental, so if something goes wrong during the setup process, you'll need to handle the failed steps manually.
Also, you'll want to change the password for user 'web_apollo_admin':
The script also takes a number of arguments if you want to run it from the command line without the interactivity. Use the '-h' option to see the available arguments. One important argument is '-a', which is used when adding GFF3 files to an existing instance. It skips some of the steps and just processes the GFF3 file. Running the script without the '-a' option restarts the setup process from scratch, so you'll lose any previously processed data.
First we'll need to set up the user database. It stores the users who will have access to WebApollo and their corresponding access levels. The database uses PostgreSQL. We have already created the PostgreSQL user 'webapollo', with password 'webapollo', who has full access to the database used by WebApollo. Note that this is the user for accessing the PostgreSQL database, not a WebApollo user. You can change the password for the PostgreSQL user if you'd like, but PostgreSQL has been set up not to accept any outside connections, so unauthorized access to the database would mean that the server itself had already been compromised. If you do change the password, you'll need to update the WebApollo configuration to reflect the change (and provide the new password to all the included scripts).
We have already created the database 'web_apollo_users' and the user 'web_apollo_admin', with password 'web_apollo_admin', who has full access to WebApollo. You will most likely want to change this password, since it belongs to a WebApollo user who can log in from outside the server. You can do so with the 'change_user_password.pl' script:
You can create other users through the web interface when logged in as 'web_apollo_admin' or use the included 'add_user.pl' script:
Next we'll add the annotation track ids for the genomic sequences for our organism. We’ll use the 'add_tracks.pl' script. We need to generate a file of genomic sequence ids for the script. For convenience, there’s a script called 'extract_seqids_from_fasta.pl' which will go through a FASTA file and extract all the ids from the deflines. Let’s first create the list of genomic sequence ids. We'll store it in '~/scratch/seqids.txt'.
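To make clear what "extracting the ids from the deflines" means, here is a minimal sketch of the idea in plain shell. It is not the bundled 'extract_seqids_from_fasta.pl' script, whose output format may differ, so use the bundled script for the actual setup. The function name `list_fasta_ids` and the tiny FASTA file are made up for illustration.

```shell
#!/bin/sh
# List the sequence id of every record in a FASTA file: the text after ">"
# on each defline, up to the first whitespace character.
list_fasta_ids() {
    sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$1"
}

# Tiny made-up FASTA file so the sketch runs anywhere.
demo_fasta=$(mktemp)
cat > "$demo_fasta" <<'EOF'
>scaffold_1 example description
ACGTACGT
>scaffold_2
GGCCTTAA
EOF

# Prints one id per line.
list_fasta_ids "$demo_fasta"
```

Each printed line corresponds to one genomic sequence, which is the granularity at which annotation tracks and permissions are assigned in the steps that follow.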
Now we’ll add those ids to the user database.
Now that we have the annotation track ids loaded, we’ll need to give 'web_apollo_admin' permissions to access the data. We’ll give the user all permissions (read, write, publish, user manager). We’ll use the 'set_track_permissions.pl' script and will need to provide it with a list of genomic sequence ids, as in the previous step.
Note that we’re only using a subset of the options for all the scripts mentioned above. You can get more detailed information on any given script (and other available options) using the “-h” or “--help” flag when running the script.
We’re all done setting up the user database. Now we need to move on to configuring the application.
Note about text editors: We'll have to edit some of the configuration files before we're up and running. Since there's no desktop environment installed, you'll need to use a non-graphical editor. The virtual machine provides two options: 'nano' (easier to use) and 'vim' (super powerful but a lot harder to use). Unless you're already familiar with 'vim', you're probably better off using 'nano'.
Most of the main configuration, located in '~/webapollo/config/config.xml', has already been set. However, there are a few items you'll need to change.
First we'll need to replace '
Next we'll want to replace '
You should read the configuration section in the server installation documentation to get details about further customizing your WebApollo instance.
You can configure a set of predefined comments that will be available for users when adding comments through a dropdown menu. The configuration is stored in '~/webapollo/config/canned_comments.xml'. Let’s take a look at the configuration file.
You’ll need one '<comment>' element for each predefined comment. The element needs a 'feature_type' attribute, in the form 'CV:term', naming the feature type that the comment applies to. Let’s make a few comments for features of type 'sequence:gene' and 'sequence:transcript':
We're now done configuring WebApollo. Onto data generation.
The steps for generating data (in particular static data) are mostly similar to JBrowse data generation steps, with some extra steps required. All the data generation steps should be done within WebApollo's JBrowse directory. Let's change into that directory.
The first thing we need to do before processing our evidence is to generate the reference sequence data to be used by JBrowse. We'll use the 'prepare-refseqs.pl' script.
We now have the DNA track set up.
We now need to set up the data configuration to use the WebApollo plugin. We'll use the 'add-webapollo-plugin.pl' script to do so.
Generating data from GFF3 works best by having a separate GFF3 per source type. If your GFF3 has all source types in the same file, we need to split up the GFF3. We can use the 'split_gff_by_source.pl' script to do so. We'll output the split GFF3 to some temporary directory (we'll use '~/scratch/split_gff').
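For readers unfamiliar with the GFF3 layout, the source type is the second tab-separated column of each feature line. The sketch below illustrates the splitting idea with `awk`; it is not the bundled 'split_gff_by_source.pl' script and makes no claim about that script's options. The function name `split_gff_by_source_sketch`, the `<source>.gff` naming, and the demo records are assumptions for illustration.

```shell
#!/bin/sh
# Write each feature line of a GFF3 file to <out_dir>/<source>.gff, where
# <source> is the value of column 2. Comment lines, directives, and any line
# with fewer than 9 tab-separated columns are skipped.
split_gff_by_source_sketch() {
    in_gff=$1
    out_dir=$2
    mkdir -p "$out_dir"
    awk -F '\t' -v dir="$out_dir" '
        /^#/ || NF < 9 { next }
        { out = dir "/" $2 ".gff"; print >> out; close(out) }
    ' "$in_gff"
}

# Made-up three-feature GFF3 file with two sources.
demo_dir=$(mktemp -d)
{
    printf '##gff-version 3\n'
    printf 'chr1\tmaker\tgene\t1\t100\t.\t+\t.\tID=g1\n'
    printf 'chr1\tblastn\tmatch\t5\t50\t.\t+\t.\tID=m1\n'
    printf 'chr1\tmaker\tmRNA\t1\t100\t.\t+\t.\tID=t1;Parent=g1\n'
} > "$demo_dir/in.gff"

split_gff_by_source_sketch "$demo_dir/in.gff" "$demo_dir/split"
ls "$demo_dir/split"
```

Because lines are appended, rerunning the sketch on the same output directory would duplicate features; clear the directory first if you experiment with it.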
If we look at the contents of '~/scratch/split_gff', we can see we have the following files:
We need to process each file and create the appropriate tracks.
We'll start off with 'maker.gff'. We need to handle that file a bit differently from the rest, since its GFF3 represents the features as genes, transcripts, exons, and CDSs.
Note that 'brightgreen-80pct', 'darkgreen-60pct', 'container-100pct', 'container-16px', 'gray-center-20pct' are all CSS classes defined in WebApollo stylesheets that describe how to display their respective features and subfeatures. WebApollo also tries to use reasonable default CSS styles, so it is possible to omit these CSS class arguments. For example, to accept default styles for 'maker.gff', the above could instead be shortened to:
See the Customizing features section in the WebApollo installation guide for more information on CSS styles.
Now we need to process the other remaining GFF3 files. The entries in those are stored as 'match/match_part', so they can all be handled in a similar fashion.
We'll start off with blastn as an example.
Again, 'container-10px' and 'darkblue-80pct' are CSS class names that define how to display those elements. See the Customizing features section in the WebApollo installation guide for more information.
We need to follow the same steps for the remaining GFF3 files. Doing this by hand for the remaining six files can be tedious, so we can use a simple Bash shell script to help us out. Don't worry if the script doesn't make sense; you can always process each file manually.
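The looping idea can be sketched as a dry run that only prints the command it would execute for each file. This sketch assumes the tracks are built with JBrowse's 'bin/flatfile-to-json.pl' and shows only a few of its options ('--gff', '--type', '--getSubfeatures', '--trackLabel'); the CSS class arguments discussed above are left out, and the function name and demo file names are placeholders.

```shell
#!/bin/sh
# Dry run: print one track-building command per remaining GFF3 file instead
# of executing it. Remove the "echo" once the printed commands look right.
build_match_tracks() {
    gff_dir=$1
    for gff in "$gff_dir"/*.gff; do
        [ -e "$gff" ] || continue        # directory contained no .gff files
        label=$(basename "$gff" .gff)
        case "$label" in
            maker|blastn) continue ;;    # already processed individually
        esac
        echo bin/flatfile-to-json.pl --gff "$gff" --type match \
            --getSubfeatures --trackLabel "$label"
    done
}

# Placeholder files standing in for the split GFF3 output.
demo_gff_dir=$(mktemp -d)
touch "$demo_gff_dir/maker.gff" "$demo_gff_dir/blastn.gff" \
      "$demo_gff_dir/other_source_1.gff" "$demo_gff_dir/other_source_2.gff"

build_match_tracks "$demo_gff_dir"
```

Using the file name (minus '.gff') as the track label keeps each track tied to its source type; per-source styling would need extra arguments for each file.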
Once data tracks have been created, you will need to generate a searchable index of names using the generate-names.pl script:
This script creates an index of sequence names and feature names to enable auto-completion in the navigation text box. The index is required, so if you do not want any of the feature tracks to be indexed for auto-completion, you can instead run 'generate-names.pl' immediately after running 'prepare-refseqs.pl', but before generating other tracks.
The script can also be rerun after any additional tracks are generated if you want feature names from those tracks added to the index.
Now let's look at how to configure BAM support. WebApollo has native support for BAM, so no extra processing of the data is required.
First we'll copy the BAM data into the 'data/bam' directory. Keep in mind that this BAM data was randomly generated, so there's really no biological meaning to it. We only created it to show BAM support.
Now we need to add the BAM track.
You should now have a simulated BAM track available.
WebApollo has native support for BigWig files (.bw), so no extra processing of the data is required.
Configuring a BigWig track is very similar to configuring a BAM track. First we'll copy the BigWig data into the 'data/bigwig' directory. Keep in mind that this BigWig data was generated as a coverage map derived from the randomly generated BAM data, so like the BAM data there's really no biological meaning to it. We only created it to show BigWig support.
Now we need to add the BigWig track.
You should now have a simulated BigWig track available.
To allow sequence searching, we'll need to create a sequence database for BLAT to use. We'll use the 'faToTwoBit' tool to convert the sample data FASTA file into a 2bit database. The BLAT database must be stored in '~/webapollo/data/blat/db/genomic.2bit' (otherwise you'll need to update the 'blat.xml' configuration to point to the right file).
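The conversion itself is a two-argument call, `faToTwoBit input.fa output.2bit`. The wrapper below is a minimal sketch that also creates the target directory and fails cleanly when the tool is not on the PATH. The function name `make_blat_db` and the FASTA path in the example call are placeholders; whether 'faToTwoBit' is on the PATH of the virtual machine is an assumption to check.

```shell
#!/bin/sh
# Convert a FASTA file to a 2bit database at the given path, creating the
# target directory first. Returns non-zero if faToTwoBit is not on the PATH.
make_blat_db() {
    fasta=$1
    twobit=$2
    command -v faToTwoBit >/dev/null 2>&1 || {
        echo "faToTwoBit not found on PATH" >&2
        return 1
    }
    mkdir -p "$(dirname "$twobit")"
    faToTwoBit "$fasta" "$twobit"
}

# Example call (the FASTA path is a placeholder for your genome file):
#   make_blat_db /path/to/genome.fa ~/webapollo/data/blat/db/genomic.2bit
```

Writing to the default location avoids having to edit 'blat.xml' afterwards.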
WebApollo should now be up and running. You can access it:
Enter 'web_apollo_admin' for the 'User name' and whatever password you changed it to for 'Password' (default is 'web_apollo_admin').