The Maker Gene Annotation Pipeline (Cantarel et al. 2008; Holt and Yandell 2012) combines the outputs of several programs to generate a high quality set of gene annotations. Maker was designed primarily for use on non-model organisms and to not over predict genes as is often the case solely using gene predictors. Maker uses two types of evidence, intrinsic and extrinsic, to generate gene annotations. Extrinsic evidence is sequence homology to known repeats and mRNA assembled from the organism or closely related organisms. Homology is predicted from BLAST alignments which are then polished using exonerate. Intrinsic evidence is evidence is evidence such as start and stop codons and intron-exon boundaries predicted from gene predictors. The two gene predictors used by Maker are SNAP and Augustus. Annotations are only produced if they are supported by multiple lines of evidence. For example, a BLAST alignment must have at least a corresponding gene predictor result. A gene predictor result without additional evidence will not be annotated as a gene. Maker assigns each annotation an Annotation Evidence Distance (AED) score which corresponds to the amount of dissimilarity between the evidence and the annotation. For example, an annotation that perfectly matches the evidence would have an AED score of 0, an annotation with 50 % overlap with the evidence would have an AED of 0, and a theoretical annotation that did not overlap any evidence would have an AED of 1. The most important parts of Maker's output are a gff file, the predicted transcripts, and predicted proteins for each scaffold. The program Apollo can be used to view and edit gene annotations produced by Maker.
V&C core competencies addressed
1) Ability to apply the process of science: Experimental design, Evaluation of experimental evidence, Developing problem-solving strategies
2) Ability to use quantitative reasoning: Developing and interpreting graphs, Applying statistical
3) Use modeling and simulation to understand complex biological systems: Applying informatics tools, Managing and analyzing large data sets
GCAT-SEEK sequencing requirements
Computer/program requirements for data analysis
Linux OS, Maker, GCATSEEK Linux Virtual Machine or access to Juniata HHMI cluster
Apollo requires a GUI and can be run on Mac or PC.
If using cluster from Window OS: Putty
If using from Mac or Linux OS: SSH
This tutorial assumes you are using a fasta file named genome.fa you could, however, use any fasta file simple replacing genome.fa with the name of the fasta file being used. This also assumes that you are using the custom repeat library created in the RepeatScout tutorial, a protein file named proteins.fa, a snap HMM file named genome.hmm, and an Augustus configuration directory species.
Also, running any of the listed commands without any additional input (or running them with the -h option) will display the help message including a description of the program, the usage, and all options.
1) Make a Maker directory, move into it, and generate the Maker control files
This will generate three control files (maker_opts.ctl, maker_bopts.ctl, and maker_exe.ctl)
2) The maker_opts.ctl file contains all of the performance options for maker.
For this run, edit the following lines (nano -c maker_opts.ctl)
Line 2: genome=genome.fa
Line 22: protein=proteins.fa
Line 26: model_org=danio
Line 27: rmlib=genome.secondfilter.lib #created by repeatscout
Line 34: snaphmm=./44Sru.hmm
Line 36: augustus_species=SebastesRubrivinctus
Line 56: predstats=1
Note: The make_opts.ctl and maker_bopts.ctl (shown in full at the bottom) can be further edited to adjust Maker further to adjust stringency of the blast searches. The maker_exe.ctl simply tells Maker where to find the executables and should not need to be edited.
3) Once maker_opts.ctl and maker_bopts.ctl have been edited, run Maker.
If you are running on multiple processors:
$mpirun -n N maker
If you are running worker nodes on a cluster, use the line above as the last (command) line of the Qsub script, and then run Qsub (see module 3 for details on Qsub).
If only running on a single processor:
Maker will take each scaffold and place it in it's own directory along with the output that Maker generates for that scaffold. The three most important files that are created are the gff, Maker's predicted transcripts, and Maker's predicted proteins.
Maker's directory structure:
Directory_maker_ran_in -> genome.maker.output -> genome_datastore -> first_number -> second_number -> scaffold_directory
Maker divides the scaffold directories into several layers of subdirectories so the filesystem will not be slowed down by having to handle hundreds or even thousands of directories at one time.
Individual gffs can be viewed in genome browsers such as Apollo.
The most efficient way to make use of Maker's output is by combining Maker's output into a relatively small number of large files. More information on handling these files is provided in Module 10.
1) Gather the predicted proteins (the predicted proteome).
In the genome_datastore directory
$cat ./*/*/*/*.proteins.fasta > genome.predictedproteins.fasta
2) Gather the predicted transcripts (the predicted transcriptome)
In the same directory
$cat ./*/*/*/*.transcripts.fasta > genome.predictedtranscripts.fasta
3) Gather all of the GFFs into one conglomerate GFF.
In the genome.maker.output directory
Everything (including Maker's prediction, all evidence, masked repeats, and DNA)
$gff3_merge -d genome_master_datastore_index.log
No DNA at the end
$gff3_merge -n -d genome_master_datastore_index.log
Only Maker annotations
$gff3_merge -n -g -d genome_master_datastore_index.log
GFF file format
General Feature Format (GFF) is a standard file format, meaning that it is supported and used by several programs and appears in the same form regardless of what program generated it. GFF is a tab delimited file featuring nine columns of sequence data.
The name of the sequence
The program that created the feature
What the sequence is (gene, contig, exon, match)
First base in the sequence
Last base in the sequence
Score of the feature
+ or -
0,1, or 2 first, second, or thrid frame
List of miscellaneous features not covered in the first 8 columns, each feature separated by a ;
Example gff (Shows part of a single exon maker gene found on scaffold 1)
The windows at the bottom display detailed information about selected features
Maximize the Apollo window. Focus on the strand that contains the gene you are working on by clicking on the "view" tab in the menu bar at the top of the window and unchecking the strand that does NOT contain the gene of interest.
Click the "X2" button (for example) at the bottom of the screen to zoom in (or out) on your sequence. The structure of the genes will become apparent--with alternating exons (boxes) and introns (lines). You can continue zooming until the center scale displays the nucleotide sequence. The red and green bars at the top and the bottom of the screen represent potential start and stop codons in each of the three reading frames on the forward (top) and reverse (bottom) strands.
Clicking "reset" returns you to the normal view.
Click "Tiers" in the menu bar and select "Expand all tiers." This displays each piece of evidence on a separate line, allowing you to see all of the results of your analyses. The results of each prediction program (e.g. Augustus, Maker, blastx) display separately in white in the black evidence panel.
Clicking on any of the pieces of evidence will display additional information at the bottom of the window (see “selected element” in picture above).
Zoom in to focus on one set of gene models and BLAST evidence. Consider several reasons why various pieces of evidence may not align perfectly (notice how the BLASTX predicted exon is longer than the other predicted exons in the figure below):
The first and last exons are the hardest to predict because they have only one intron boundary.
Gene predictors may miss exons of an alternatively spliced gene.
BLAST evidence may come from different species, whose exons differ in length.
The BLAST algorithm does not determine splice sites.
BLAST results may include false matches.
Scroll left and right for a gene that has one or more predictions (FGenesH, Augustus, SNAP), as well as BLASTN or/and BLASTX evidence.
Use the "zoom" function to enlarge the gene and clearly display the exon/intron structure.
Compare the gene predictions from FGenesH, SNAP, and/or Augustus with evidence from BLASTN and BLASTX
The gene prediction algorithms are quite good at identifying locations of genes and correct splice sites (exon/intron boundaries). However, they have difficulty identifying first and last exons because these have only one splice site.
BLAST results are the most authoritative evidence for exons because they are retrieved from databases for biological evidence from mRNA or protein sequence from a given organism. However, BLAST results often do not accurately reflect splice sites. Also the exon structure may be different for a related gene from a different organism. Finally, BLAST may return incomplete results as not all genes or proteins are represented in the database.
Click on any exon in the evidence panel to highlight "edge matches" (making sure “show edge matches” is checked in the “view” tab of the menu bar). White bars on the exon indicate consensus exon-intron boundaries within the evidence. (However, these are not necessarily the correct splice sites.)
The predicted gene model may contain several errors that need to be fixed:
A green or red arrow indicates that the gene model is missing a start or stop codon
A yellow arrow indicates a splice site that does not follow the GT/AG rule (aka, “non-canonical”)
Annotations can be edited using the Exon Detail Editor. You can adjust annotations by dragging the boundaries of exons and save the changes directly into the GFF.
Double-clicking on a gene model will select an entire annotation. Right click, select sequence, then hit peptide or protein. From this window you can copy the predicted protein sequence across all exons, paste in the NCBI blastp web page and compare single annotations to other species of interest.
The number of genes that Maker annotates is sensitive to Blast stringency as specified in the bopts control file. Students could run the default sensitivity, then increase or decrease stringency to see the effect on number of genes predicted.
Use Apollo to pull a protein sequence from one gene and use gaps to see if any internal exons are missing. Compare full length of the homologous NCBI proteins to the Maker predicted protein to see if and edge exons were missed by Maker.
Time line of module
One two hour lab
Discussion topics for class
Look at the annotation of several scaffolds. Is there evidence that repeats disrupted the assembly?
Are all your genes complete?
Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M. 2008.MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res.18:188–196.
Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. 2011;12:491. doi: 10.1186/1471-2105-12-491.
Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics5:59
Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel.Bioinformatics. 2003;19:ii215–ii225. doi: 10.1093/bioinformatics/btg1080.
Stanke M, Schöffmann O, Morgenstern B, Waack S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics. 2006;7:62. doi: 10.1186/1471-2105-7-62.