Biomart/ GENOME ALIGNMENT III
Contents
Biomart/ GENOME ALIGNMENT III
Introduction
Downloading the sequences/ BIOMART
Aligning the sequences using AVID
Aligning the sequences using VISTA genome browser
Introduction
The comparison of the mouse and human genomes has demonstrated the power of comparative genomics in inferring the evolutionary history of species and in identifying functional regions in genomes. The possibilities for identifying regions under selection are enhanced with the addition of more sequences and this observation has led to numerous ‘focused sequencing’ projects which seek to obtain sequence for a small region of a genome in numerous other organisms.
Biologists who seek to analyze conserved regions among homologous sequences are faced with the daunting task of aligning large genomic regions and subsequently sifting through massive amounts of data. In order to facilitate the discovery process without requiring biologists to download and install complex software, a number of web servers for alignment and analysis have been set up in recent years. These servers align submitted sequences and then generate plots or graphs designed to help researchers identify conserved regions.
Multiple alignment with AVID is progressive. The program works by recursively aligning the ‘alignments’ at ancestral nodes of the guide tree. At each internal node, ancestral sequences are inferred from the existing alignments using maximum likelihood, and these ancestral sequences are then aligned using the AVID program.
The server goes through a number of steps:
1) Sequences are repeat-masked using the DUST program (Tatusov and Lipman, unpublished).
2) A random (almost complete) binary guide tree is generated for alignment of the sequences using the progressive alignment method.
3) The sequences are aligned using AVID.
4) A phylogenetic tree is inferred from the multiple alignment using the neighbor joining method.
5) Steps 3 and 4 are repeated for a total of three iterations.
6) Pairwise alignments are generated from the multiple alignment with respect to all of the sequences, and these are used to generate conservation plots and to identify conserved regions.
In this exercise we will align orthologs of the PAX6 gene.
1) First, we will download the sequences of the PAX6 orthologs using BioMart.
2) Subsequently, the orthologs will be aligned using AVID/LAGAN.
Downloading the sequences/ BIOMART
- Go to BioMart: http://www.ensembl.org/biomart/martview/f0e53c7f00c3cded9dff7e1e22d391dc
You can view the tutorial at http://www.ensembl.org/info/website/tutorials/index.html
- Choose a database (“Ensembl Genes 78”).
- Choose a dataset (“Homo sapiens genes (GRCh38)”).
- Go to the Filters section.
You want to select a specific gene in the human genome based on its Ensembl gene ID (ENSG00000007372).
- Move to the Attributes section.
Select the attributes you want to download:
You might want to select the position and the strand of the gene on the genome. Also select an external identifier (e.g. HGNC ID) and, most importantly, the Ensembl gene ID of the original gene.
How many PAX6 gene transcripts were selected?
- Go to the Attributes section and select the homologous genes in chicken, fugu and mouse.
This information can be used to download the corresponding sequences.
- Go to the Attributes section and select ‘Sequences’.
You can select different parts of the sequence. You can select the protein sequence (introns spliced out; only the translated part is downloaded), or you can download the entire gene (introns and exons included, ignoring the transcript information) together with the 5′ and 3′ ends.
Try both. How many sequences do you get when you download the peptide? How many when you download the gene? Explain.
You can download the results via the Export data section in the left panel.
Download the gene (genomic sequence, unspliced gene). Do not forget to also export ‘header’ information such as the strand, the exon positions, and the gene start and end, and save the result as a text file (FASTA).
To annotate the file we will also download the gene’s structural information (exon and intron start positions and the order of the exons and introns). Save it in Excel format. This information will be used to make an annotation file for our sequence file.
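The selections made in the web interface above (dataset, filter, attributes) can also be written down as a BioMart XML query. The following is a minimal sketch that only builds the query text and a request URL; it does not contact the server. The dataset name `hsapiens_gene_ensembl`, the filter and attribute names, and the `martservice` endpoint are assumptions that should be checked against the current Ensembl BioMart before use, and `build_biomart_query` is a hypothetical helper, not part of any BioMart library.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_biomart_query(dataset, filters, attributes):
    """Return a BioMart XML query string for a single dataset."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # One <Filter> per restriction, e.g. the Ensembl gene ID of PAX6.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    # One <Attribute> per column to export.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    prolog = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return prolog + ET.tostring(query, encoding="unicode")


xml_query = build_biomart_query(
    dataset="hsapiens_gene_ensembl",  # assumed internal name of the human genes dataset
    filters={"ensembl_gene_id": "ENSG00000007372"},
    attributes=[
        "ensembl_gene_id",
        "chromosome_name",
        "start_position",
        "end_position",
        "strand",
    ],
)

# Assumed service endpoint; the query is passed URL-encoded.
url = "http://www.ensembl.org/biomart/martservice?" + urlencode({"query": xml_query})
print(xml_query)
```

Writing the query down this way makes the download reproducible for each ortholog: only the dataset name and the gene ID need to change.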
To check what exactly we downloaded, go to the Ensembl gene page for PAX6 and view the sequence information. Use the find function to locate the beginning (search for CCCTCTTTTCTTATCA) and the end of your downloaded sequence in the displayed sequence. This shows that the downloaded sequence starts with the end of the last exon and ends with the beginning of the first exon (as the gene is located on the -1 strand). So the downloaded sequence is not reverse complemented.
Looking at the header information of the PAX6 gene downloaded from BioMart, we see that the gene start is 31784792 and the gene end is 31817961:
>ENSG00000007372|-1|31784792|31817961|31801776;31812926;31806013;31793802;31802834;31800856;31806462;31817948;31790019;31806925;31793553;31794788;31794114;31790860;31806921;31811677;31812183;31803673;31811015;31801912;31811118;31811137;31804046;31811331;31811045;31811237;31802971;31791309;31817961;31803333;31812177;31804619;31800646;31817937;31804025;31810667;31811308;31794126;31801335;31804044;31810305;31817874|31801561;31812572;31805389;31793652;31802729;31800763;31806344;31817809;31789936;31806849;31802704;31793438;31806402;31794630;31794032;31790710;31801728;31811115;31788910;31784792;31800691;31812093;31803398;31810828;31801869;31801871;31794780;31811213;31800707;31801762;31789830;31804452;31801578;31800539;31789182;31789918;31794098;31801617;31789917;31801745;31803952;31789922;31801230;31793173;31793483;31809906;31806406;31789913;31793674
We know from the structural information that our downloaded sequence starts at the end of the last exon (exon 14, 31784792) and ends with the start of the first exon (exon 1, 31817961); these values can be obtained from the structural annotation file. So the beginning and the end of the downloaded sequence correspond to the end of exon 14 and the beginning of exon 1, respectively (as we observed previously).
So to annotate the positions of the exons:
Position 1 in our downloaded file corresponds to the genomic position 31784792 (31784792-31784792+1).
Position 33170 in our downloaded file corresponds to the genomic position 31817961 (31817961-31784792+1).
This information will be used to annotate the positions of the exons on our downloaded file (see below).
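The conversion from genomic coordinates to positions in the downloaded file is a single subtraction. Here is a minimal sketch, assuming the header fields appear in the order gene ID, strand, gene start, gene end (this order depends on the attributes selected in BioMart). The helper names `parse_header` and `to_local` are hypothetical, and the exon lists of the real header are omitted here.

```python
def parse_header(header):
    """Split a '|'-separated BioMart FASTA header into its first four fields."""
    fields = header.lstrip(">").split("|")
    gene_id, strand, start, end = fields[:4]
    return gene_id, int(strand), int(start), int(end)


def to_local(genomic_pos, gene_start):
    """Convert a genomic coordinate to a 1-based position in the downloaded sequence."""
    return genomic_pos - gene_start + 1


# Header shortened for illustration: the two exon coordinate lists are left out.
header = ">ENSG00000007372|-1|31784792|31817961"
gene_id, strand, gene_start, gene_end = parse_header(header)

print(to_local(gene_start, gene_start))  # 1
print(to_local(gene_end, gene_start))    # 33170
```

Because the downloaded sequence is not reverse complemented, the same formula applies regardless of the strand; only the interpretation (which exon comes first in the file) changes.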
AVID
You might wish to add the annotation (positions of the exons and introns) to the multiple alignment you are going to make in AVID. To do so, you have to construct a gff file with the essential annotation, for which you need the exon positions. The worked examples below show how to create this file.
Exon 1
31817809 31817961 ENSE00001479873 1
31817809-31784792+1 31817961-31784792+1
33018 33170
Exon 14
31784792 31790019 ENSE00001213516 14
31784792-31784792+1 31790019-31784792+1
1 5228
Exon 13
31790710 31790860 ENSE00003700637 13
31790710-31784792+1 31790860-31784792+1
5919 6069
Exon 12
31793438 31793553 ENSE00003701932 12
31793438-31784792+1 31793553-31784792+1
8647 8762
If you do this in Excel, this results in:
< 1 33170 PAX6
14 1 5228 exon
13 5919 6069 exon
12 8647 8762 exon
11 8861 9011 exon
10 9241 9323 exon
9 9839 9997 exon
8 15900 16065 exon
7 16770 16985 exon
6 17080 17121 exon
4 17913 18043 exon
3 21621 21671 UTR
2 22058 22134 UTR
1 33018 33170 UTR
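The spreadsheet calculation can also be scripted. The sketch below is one possible way to do it: it takes the four exons from the worked examples (exons 1, 12, 13 and 14), converts them to local positions, and prints annotation lines in the same layout as the files shown in this section. It labels every feature as `exon`; distinguishing UTR from coding exon needs the coding start and end, which are not used here. The function name `annotation_lines` is hypothetical.

```python
def annotation_lines(gene_name, strand, gene_start, gene_end, exons):
    """Build annotation lines (gene line first, then exons sorted by position)."""
    def to_local(pos):
        return pos - gene_start + 1

    # '<' marks a gene on the -1 strand, '>' a gene on the +1 strand,
    # following the annotation files shown in this section.
    marker = "<" if strand < 0 else ">"
    lines = [f"{marker} 1 {to_local(gene_end)} {gene_name}"]
    for start, end in sorted(exons):
        lines.append(f"{to_local(start)} {to_local(end)} exon")
    return lines


# Genomic exon coordinates taken from the worked examples (exons 1, 14, 13, 12).
exons = [
    (31817809, 31817961),
    (31784792, 31790019),
    (31790710, 31790860),
    (31793438, 31793553),
]

lines = annotation_lines("PAX6", -1, 31784792, 31817961, exons)
print("\n".join(lines))
```

Sorting by genomic start gives the exons in the order in which they occur in the downloaded file, which for a gene on the -1 strand is the reverse of the exon rank.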
This information can also be obtained directly from Ensembl. Go to the genome browser and search for human PAX6. Select the gene summary and, in the left panel, select ‘download sequences’.
< 1 33170 PAX6
1 5142 UTR
5143 5228 exon
5919 6069 exon
8647 8762 exon
8861 9011 exon
9241 9323 exon
9839 9997 exon
15900 16065 exon
16770 16985 exon
17080 17121 exon
17913 18043 exon
21611 21620 exon
21621 21671 UTR
22058 22134 UTR
26037 26224 UTR
33018 33170 UTR
Download the orthologs
Repeat the complete workflow to download the corresponding complete gene sequences of the orthologs using their Ensembl gene IDs (save them in separate FASTA files).
ENSMUSG00000027168
Check what exactly you downloaded by comparing it with the gene sequence in the Ensembl browser: indeed, your downloaded sequence starts with the first exon (the gene is located on the +1 strand).
Download the annotation file (for Pax6, the featured strand and the forward strand also give the same result).
Forward strand
> 1 28465 Pax6
1 150 UTR
10911 10990 UTR
11348 11398 UTR
11399 11408 exon
14929 15059 exon
15852 15893 exon
15988 16203 exon
16879 17044 exon
22666 22824 exon
23295 23377 exon
23571 23721 exon
23838 23953 exon
26407 26557 exon
27371 27456 exon
27457 28465 UTR
Featured strand
> 1 28465 Pax6
1 150 UTR
10911 10990 UTR
11348 11398 UTR
11399 11408 exon
14929 15059 exon
15852 15893 exon
15988 16203 exon
16879 17044 exon
22666 22824 exon
23295 23377 exon
23571 23721 exon
23838 23953 exon
26407 26557 exon
27371 27456 exon
27457 28465 UTR
Compare with the annotation you make yourself.
Download the gene structural information and save it in xls format.
Exon 1
ENSMUSG00000027168 105668900 105669049 1
105668900-105668900+1 105669049-105668900+1
1 150
> 1 28465 Pax6
1 150 UTR
10911 10990 UTR
11348 11398 UTR
11399 11408 exon
14929 15059 exon
15852 15893 exon
15988 16203 exon
16879 17044 exon
22666 22824 exon
23295 23377 exon
23571 23721 exon
23838 23953 exon
26407 26557 exon
27371 27456 exon
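Comparing a self-made annotation with the downloaded one is easier when both are parsed into sets of features. The following is a minimal sketch, assuming each feature line has the form `start end type` and that the gene line starts with `<` or `>`. Only the last lines of the two mouse annotations above are used as input, and `parse_annotation` is a hypothetical helper.

```python
def parse_annotation(text):
    """Return the set of (start, end, type) features in an annotation file."""
    features = set()
    for line in text.strip().splitlines():
        parts = line.split()
        # Skip empty lines and the gene line ('<' or '>' in the first column).
        if not parts or parts[0] in ("<", ">"):
            continue
        start, end, kind = parts[:3]
        features.add((int(start), int(end), kind))
    return features


# Excerpts (last lines only) of the downloaded and the self-made mouse annotation.
downloaded = """> 1 28465 Pax6
26407 26557 exon
27371 27456 exon
27457 28465 UTR
"""
own = """> 1 28465 Pax6
26407 26557 exon
27371 27456 exon
"""

missing = sorted(parse_annotation(downloaded) - parse_annotation(own))
print(missing)  # [(27457, 28465, 'UTR')]
```

The set difference lists the features that are present in one file but not in the other, here the final UTR that the self-made annotation does not contain.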
Aligning the sequences using AVID
- Go to the mVISTA tools of the VISTA genome browser (http://genome.lbl.gov/vista/index.shtml).
- Align the sequences with LAGAN.
- Specify the number of sequences you want to align, then press ‘Submit’.
- Fill in your email address and provide the FASTA and annotation files, then press ‘Submit’.
- Wait until you get an email with the results.
Before you use the downloaded FASTA files you have to adapt them: shorten the header (otherwise the visualization is messed up) and make sure there is a line break after the header and before the sequence (otherwise the file is not a valid FASTA file).
ENSMUSG00000027168_gene_unspliced_23012014_adapted.txt
ENSG00000007372_gene_unspliced_23012014_adapted.txt
Annotation files:
human_PAX6_annotatie_23012014.txt
mus_PAX6_annotatie_23012014.txt
Do not reverse complement the sequences.
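Adapting the FASTA files by hand is error-prone for sequences of this length, so the step can be scripted. The sketch below is one possible approach, assuming a single-record file in which the long BioMart header occupies the first line. The short name `human_PAX6` and the function `adapt_fasta` are hypothetical, and the example input contains only the first 16 bases mentioned earlier in this exercise. The sequence itself is left unchanged and is not reverse complemented.

```python
def adapt_fasta(text, short_name, width=60):
    """Return a single-record FASTA with a short header and wrapped sequence lines."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA record starting with '>'")
    # Join all sequence lines, then re-wrap them at a fixed width.
    sequence = "".join(line.strip() for line in lines[1:])
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    # The header sits on its own line, followed by a line break.
    return "\n".join([">" + short_name] + wrapped) + "\n"


raw = ">ENSG00000007372|-1|31784792|31817961\nCCCTCTTTTCTTATCA\n"
adapted = adapt_fasta(raw, "human_PAX6")
print(adapted)
```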
View the PDF file in which all sequences are compared relative to the human sequence.
The blue boxes are the exons in the human sequence from the annotation file. Note the high similarity between the rat and the mouse sequences. Even in remote organisms such as fugu and zebrafish some of the human exons are conserved.
Between rat, mouse and human, parts of the sequence in the region 1000 bp upstream of the first exon are conserved as well. These might correspond to regulatory motifs responsible for transcriptional regulation.
The low similarity between the two zebrafish sequences (paralogs) can be attributed to the poor sequence quality of the second zebrafish copy (the genome assembly is not complete yet) and to the short sequence of fugu.
For additional practice, you can also start from the gene ENSMUSG00000025190.
Aligning the sequences using VISTA genome browser
- Go to the VISTA genome browser website: http://genome.lbl.gov/vista/index.shtml
- Go to the Precomputed Alignments.
- Provide the proper coordinates for the human PAX6 gene (Chr11: 31,806,340-31,839,509) and press ‘Submit’.
- Compare with the previous results (note that we do not have the UTRs).
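As a quick sanity check before comparing the results, the interval entered in the browser can be compared with the sequence downloaded from BioMart. The sketch below parses the region string and computes its length. The interval has the same length as the downloaded gene (33170), although the coordinates differ from those in the BioMart header; this may indicate that the browser uses coordinates from a different genome assembly, but that interpretation is an assumption. `parse_region` is a hypothetical helper.

```python
def parse_region(region):
    """Parse a region such as 'Chr11: 31,806,340-31,839,509' into (chrom, start, end)."""
    chrom, span = region.replace(",", "").split(":")
    start, end = (int(x) for x in span.strip().split("-"))
    return chrom.strip(), start, end


chrom, start, end = parse_region("Chr11: 31,806,340-31,839,509")
browser_length = end - start + 1

# Length of the BioMart download, from the gene start and end in the FASTA header.
biomart_length = 31817961 - 31784792 + 1

print(chrom, browser_length, biomart_length)  # Chr11 33170 33170
```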