Aligning the sequences using VISTA genome browser
The comparison of the mouse and human genomes has demonstrated the power of comparative genomics in inferring the evolutionary history of species and in identifying functional regions in genomes. The possibilities for identifying regions under selection are enhanced with the addition of more sequences and this observation has led to numerous ‘focused sequencing’ projects which seek to obtain sequence for a small region of a genome in numerous other organisms.
Biologists who seek to analyze conserved regions among homologous sequences are faced with the daunting task of aligning large genomic regions and subsequently sifting through massive amounts of data. In order to facilitate the discovery process without requiring biologists to download and install complex software, a number of web servers for alignment and analysis have been set up in recent years. These servers align submitted sequences and then generate plots or graphs designed to help researchers identify conserved regions.
AVID is a progressive alignment program. The program works by recursively aligning the ‘alignments’ at ancestral nodes of the guide tree. At each internal node, ancestral sequences are inferred from the existing alignments using maximum likelihood and these alignments are then aligned using the AVID program.
The server goes through a number of steps:
1) Sequences are repeat-masked using the DUST program (Tatusov and Lipman, unpublished).
2) A random (almost complete) binary guide tree is generated for alignment of the sequences using the progressive alignment method.
3) The sequences are aligned using AVID.
4) A phylogenetic tree is inferred from the multiple alignment using the neighbor joining method.
5) Steps 3 and 4 are repeated for a total of three iterations.
6) Pairwise alignments are generated from the multiple alignment with respect to all of the sequences and these are used to generate conservation plots and to identify conserved regions.
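The conservation plots produced in the last step are essentially percent identity computed in a sliding window along a pairwise alignment. The following is a minimal sketch of that idea, not the server's actual implementation. The function names are hypothetical, and the 100 bp window and 70% threshold are assumptions chosen for illustration.

```python
def window_identity(aligned_ref, aligned_other, window=100):
    """Percent identity per sliding window, indexed by reference position.

    Both arguments are rows of a pairwise alignment (same length, '-' for gaps).
    The window size is an assumption for illustration.
    """
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned sequences must have the same length")
    # One flag per reference base; columns where the reference has a gap are skipped.
    matches = [
        a.upper() == b.upper()
        for a, b in zip(aligned_ref, aligned_other)
        if a != "-"
    ]
    scores = []
    for start in range(0, len(matches) - window + 1):
        hits = sum(matches[start:start + window])
        scores.append(100.0 * hits / window)
    return scores


def conserved_windows(scores, threshold=70.0):
    """Return 1-based start positions of windows at or above the threshold."""
    return [i + 1 for i, score in enumerate(scores) if score >= threshold]


# Toy example with a window of 4 bases instead of 100.
toy_scores = window_identity("ACGTACGT", "ACGTTTTT", window=4)
toy_hits = conserved_windows(toy_scores, threshold=70.0)
```

Plotting the scores against the reference position gives a curve of the kind shown in the conservation plots; consecutive window starts returned by `conserved_windows` can be merged into conserved regions.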
In this exercise we will perform an alignment of orthologs of the pax6 gene.
1) First, we will download the sequences of the orthologs of the pax6 gene using BioMart.
2) Subsequently orthologs will be aligned using AVID/LAGAN.
Go to the Attributes section and select the homologous genes in chicken, fugu and mouse.
This information can be used to download the corresponding sequences.
Go to the attributes and select ‘sequences’.
You can select different parts of the sequence. Either you select the protein sequence (introns spliced out, only the translated part is downloaded), or you download the entire gene (introns and exons included, ignoring the transcript information) together with the 5′ and 3′ ends.
Try both. How many sequences do you get when you download the peptide? How many when you download the gene? Explain. You can do this via the Export data section in the left panel.
Download the gene (genomic sequence, unspliced gene). Do not forget to also export ‘header’ information such as the strand, the exon positions, and the gene start and end, and save the result as a text file (FASTA).
To annotate the file we will also download the gene's structural information (exon and intron start positions and the order of the exons and introns). Save it in Excel format. This information will be used to build an annotation file for our sequence file.
To test what exactly we downloaded, go to the Ensembl gene Pax6 and view the sequence information. Use the find function to locate the beginning (search for CCCTCTTTTCTTATCA) and the end of your downloaded sequence in the displayed sequence. This shows that the sequence you downloaded starts with the end of the last exon and ends with the beginning of the first exon (as the gene is located on the -1 strand). So the downloaded sequence is not reverse complemented.
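The manual find step above can also be sketched in code: search a reference sequence for the first bases of the download, and if they are not found, search for their reverse complement. This is only an illustration on toy strings. The function names are hypothetical, and only the probe CCCTCTTTTCTTATCA comes from the exercise.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def locate_probe(reference, probe):
    """Return (strand, 1-based position) of the probe in the reference, or None.

    '+' means the probe was found as given, '-' means its reverse complement
    was found instead.
    """
    ref = reference.upper()
    pos = ref.find(probe.upper())
    if pos != -1:
        return ("+", pos + 1)
    pos = ref.find(reverse_complement(probe).upper())
    if pos != -1:
        return ("-", pos + 1)
    return None


probe = "CCCTCTTTTCTTATCA"  # start of the downloaded human sequence
# Toy references (not real genomic sequence), used only to show both outcomes.
same_strand = locate_probe("GGG" + probe + "AAA", probe)
other_strand = locate_probe("TT" + reverse_complement(probe), probe)
```

A result on the '+' strand for the real download corresponds to the observation above that the sequence is not reverse complemented.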
Looking at the header information of the pax6 gene downloaded from BioMart, we see that the gene start is 31784792 and the gene end is 31817961.
We know from the structural information that our downloaded sequence starts at the end of the last exon (exon 14, position 31784792) and ends with the start of the first exon (position 31817961); these values can be obtained from the structural annotation file. So the beginning and the end of the downloaded sequence correspond to the end of exon 14 and the beginning of exon 1, respectively (as we observed previously).
So to annotate the positions of the exons:
Position 1 in our downloaded file corresponds to the genomic position 31784792.
Position 33170 in our downloaded file corresponds to the genomic position 31817961 (31817961 - 31784792 + 1 = 33170). This information will be used to annotate the positions of the exons on our downloaded file (see below).
You might wish to add the annotation (positions of the exons and introns) to the multiple alignment you are going to make in AVID. For this you have to construct a GFF file with the essential annotation, for which you need the exon positions. See the instructions for creating this file in the figures below.
Exon start (bp)  Exon end (bp)  Exon ID          Exon rank
31817809         31817961       ENSE00001479873  1
31784792         31790019       ENSE00001213516  14
31790710         31790860       ENSE00003700637  13
31793438         31793553       ENSE00003701932  12
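The conversion from genomic coordinates to positions in the downloaded file is a single subtraction: local = genomic - gene start + 1. A minimal sketch, using the gene start and end from the header and the four exon rows listed above (the variable and function names are hypothetical):

```python
GENE_START = 31784792  # gene start taken from the BioMart header
GENE_END = 31817961    # gene end taken from the BioMart header


def to_local(genomic_pos, gene_start=GENE_START):
    """Convert a genomic position to a 1-based position in the downloaded sequence."""
    return genomic_pos - gene_start + 1


# (exon rank, genomic start, genomic end), copied from the exon rows above.
exons = [
    (1, 31817809, 31817961),
    (14, 31784792, 31790019),
    (13, 31790710, 31790860),
    (12, 31793438, 31793553),
]

gene_length = to_local(GENE_END)  # 33170
local_exons = {rank: (to_local(start), to_local(end)) for rank, start, end in exons}
# local_exons == {1: (33018, 33170), 14: (1, 5228), 13: (5919, 6069), 12: (8647, 8762)}
```

Because the sequence was downloaded on the forward strand, no strand flip is needed here; exon 14 simply ends up at the beginning of the file and exon 1 at the end.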
If you do this in Excel, this results in:
< 1 33170 PAX6
14 1 5228 exon
13 5919 6069 exon
12 8647 8762 exon
11 8861 9011 exon
10 9241 9323 exon
9 9839 9997 exon
8 15900 16065 exon
7 16770 16985 exon
6 17080 17121 exon
4 17913 18043 exon
3 21621 21671 UTR
2 22058 22134 UTR
1 33018 33170 UTR
This information can also be obtained directly from Ensembl. Go to the genome browser and search for PAX6 human. Select the gene summary and, in the left panel, select ‘download sequences’.
< 1 33170 PAX6
1 5142 UTR
5143 5228 exon
5919 6069 exon
8647 8762 exon
8861 9011 exon
9241 9323 exon
9839 9997 exon
15900 16065 exon
16770 16985 exon
17080 17121 exon
17913 18043 exon
21611 21620 exon
21621 21671 UTR
22058 22134 UTR
26037 26224 UTR
33018 33170 UTR
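The listing above follows a simple line layout: one line with a strand marker, the gene span and the gene name, then one line per feature with its start, end and type. The sketch below writes that layout from a list of features. It is an illustration only: the function name is hypothetical, and the use of ">" for a gene on the +1 strand is an assumption mirroring the "<" used above for PAX6 on the -1 strand.

```python
def format_annotation(gene_name, strand, gene_length, features):
    """Build annotation text from (start, end, kind) tuples in local coordinates.

    kind is a label such as 'exon' or 'UTR'. The '<' / '>' strand markers are
    an assumption based on the PAX6 listing in this exercise.
    """
    marker = "<" if strand == -1 else ">"
    lines = [f"{marker} 1 {gene_length} {gene_name}"]
    for start, end, kind in sorted(features):
        # Reject features that fall outside the downloaded sequence.
        if not (1 <= start <= end <= gene_length):
            raise ValueError(f"feature {start}-{end} lies outside 1-{gene_length}")
        lines.append(f"{start} {end} {kind}")
    return "\n".join(lines) + "\n"


# First three feature rows of the listing above, given out of order on purpose.
annotation = format_annotation(
    "PAX6",
    -1,
    33170,
    [(5143, 5228, "exon"), (1, 5142, "UTR"), (5919, 6069, "exon")],
)
```

Sorting the features by start position keeps the file in the same order as the sequence, which makes it easier to compare with the plot afterwards.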
Download the orthologs
Repeat the complete workflow to download the corresponding complete gene sequences of the orthologs, making use of their Ensembl gene IDs (save them in separate FASTA files).
Check what exactly you downloaded by comparing it with the gene sequence in the Ensembl browser: indeed, your downloaded sequence starts with the first exon (the gene is located on the +1 strand).
Download the annotation file (for pax6 as well; the featured strand and the forward strand give the same result).
Before you use the downloaded FASTA files you have to adapt them: use a short header, because otherwise the visualization is messed up, and put a line break after the header and before the sequence, because otherwise you do not have a correct FASTA file.
Do not reverse complement the sequences.
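The header clean-up described above can be done by hand in a text editor, or sketched in a few lines of code. The example below assumes a file with a single record whose header and sequence are already on separate lines; the function name, the short label and the example header fields are hypothetical.

```python
import re


def clean_fasta(raw, short_name, width=60):
    """Return a single-record FASTA string with a short header.

    Assumes one record, with a line break between header and sequence.
    The sequence itself is left unchanged (not reverse complemented).
    """
    text = raw.strip()
    if not text.startswith(">"):
        raise ValueError("input does not start with a FASTA header ('>')")
    _header, _, body = text.partition("\n")
    # Keep letters only: drops whitespace, line breaks and stray digits.
    sequence = re.sub(r"[^A-Za-z]", "", body)
    if not sequence:
        raise ValueError("no sequence found after the header line")
    wrapped = "\n".join(
        sequence[i:i + width] for i in range(0, len(sequence), width)
    )
    return f">{short_name}\n{wrapped}\n"


# Hypothetical long header of the kind exported with extra attributes.
raw_record = ">ENSG00000007372|31784792|31817961|-1\nACGT\nAC GT\n"
cleaned = clean_fasta(raw_record, "human", width=6)
```

Using one short label per species (for example `human`, `mouse`, `fugu`) keeps the legends in the resulting plots readable.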
View the pdf file in which all sequences are compared relative to the human sequence.
The blue boxes are the exons in the human sequence from the annotation file. Note the high homology between the rat and the mouse sequence. Even in remote organisms such as fugu and zebrafish some of the human exons are conserved.
Between rat, mouse and human, parts of the sequence in the region 1000 bp upstream of the first exon are conserved as well. These might correspond to regulatory motifs responsible for transcriptional regulation.
The low homology between the two zebrafish sequences (paralogs) can be attributed to the poor sequence quality of the second zebrafish copy (the genome assembly is not complete yet) and the short sequence of fugu.
If you want more practice, you can also start from the gene ENSMUSG00000025190.
Aligning the sequences using VISTA genome browser
Go to the VISTA genome browser website: http://genome.lbl.gov/vista/index.shtml