BioMart / genome alignment III

Contents





Introduction

Downloading the sequences/ BIOMART

Aligning the sequences using AVID

Aligning the sequences using VISTA genome browser


The comparison of the mouse and human genomes has demonstrated the power of comparative genomics in inferring the evolutionary history of species and in identifying functional regions in genomes. The possibilities for identifying regions under selection are enhanced with the addition of more sequences and this observation has led to numerous ‘focused sequencing’ projects which seek to obtain sequence for a small region of a genome in numerous other organisms.

Biologists who seek to analyze conserved regions among homologous sequences are faced with the daunting task of aligning large genomic regions and subsequently sifting through massive amounts of data. In order to facilitate the discovery process without requiring biologists to download and install complex software, a number of web servers for alignment and analysis have been set up in recent years. These servers align submitted sequences and then generate plots or graphs designed to help researchers identify conserved regions.

MAVID, the multiple alignment program built on AVID, is a progressive alignment program. It works by recursively aligning the ‘alignments’ at the ancestral nodes of a guide tree. At each internal node, ancestral sequences are inferred from the existing alignments using maximum likelihood, and these inferred sequences are then aligned using the pairwise AVID program.

The server goes through a number of steps:

  1. Sequences are repeat-masked using the DUST program (Tatusov and Lipman, unpublished).

  2. A random (almost complete) binary guide tree is generated for alignment of the sequences using the progressive alignment method.

  3. The sequences are aligned using AVID.

  4. A phylogenetic tree is inferred from the multiple alignment using the neighbor joining method.

  5. Steps 3 and 4 are repeated for a total of three iterations.

  6. Pairwise alignments are generated from the multiple alignment with respect to all of the sequences and these are used to generate conservation plots and to identify conserved regions.
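Step 6 can be illustrated with a minimal sketch of a conservation plot computed from one gapped pairwise alignment. This is not the server's own code: the function names, the 100-column window and the 70% threshold are assumptions chosen for illustration only.

```python
def window_identity(aligned_a, aligned_b, window=100):
    """Percent identity for every full window of a gapped pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where both rows carry the same base, 0 for mismatches and gap columns
    matches = [
        1 if a == b and a != "-" else 0
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    ]
    scores = []
    for start in range(0, len(matches) - window + 1):
        scores.append(100.0 * sum(matches[start:start + window]) / window)
    return scores


def conserved_windows(scores, threshold=70.0):
    """0-based start columns of the windows at or above the threshold."""
    return [i for i, score in enumerate(scores) if score >= threshold]
```

Plotting the returned scores against the alignment column gives a curve of the kind shown in the conservation plots; windows returned by `conserved_windows` are candidate conserved regions.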

In this exercise we will align orthologs of the pax6 gene.

1) First, we will download the sequences of the orthologs of the pax6 gene using BioMart.

2) Subsequently, the orthologs will be aligned using AVID/LAGAN.

Downloading the sequences/ BIOMART


view the tutorial

  • Choose a database (“Ensembl Genes 78”)

  • Choose a dataset (“Homo sapiens genes (GRCh38)”)

  • Go to the Filters section

Select a specific gene in the human genome by its Ensembl gene ID (ENSG00000007372).

  • Move to the Attributes section

Select the attributes you want to download:

You might want to select the position and the strand of the gene on the genome. Also select a protein identifier (e.g. HGNC ID) and, most importantly, the Ensembl gene ID of the original gene.
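The same filter and attribute selection can also be written down as a BioMart XML query, which the web interface can display for a configured query. The sketch below only builds the XML text and does not send it anywhere. The dataset name `hsapiens_gene_ensembl` and the internal filter and attribute names are assumptions that should be checked against the XML shown by the BioMart interface; `build_biomart_query` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, gene_id, attributes):
    """Return a BioMart-style XML query string for one gene ID."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    dataset_node = ET.SubElement(
        query, "Dataset", {"name": dataset, "interface": "default"}
    )
    # Filter: restrict the result to the requested Ensembl gene ID
    ET.SubElement(dataset_node, "Filter",
                  {"name": "ensembl_gene_id", "value": gene_id})
    # Attributes: the columns to download
    for name in attributes:
        ET.SubElement(dataset_node, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed internal names for gene ID, chromosome, position and strand
query_xml = build_biomart_query(
    "hsapiens_gene_ensembl",
    "ENSG00000007372",
    ["ensembl_gene_id", "chromosome_name",
     "start_position", "end_position", "strand"],
)
print(query_xml)
```

Keeping the query as text makes the selection reproducible when the same flow has to be repeated for other genes.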

How many PAX6 gene transcripts were selected?

  • Go to the Attributes section and select the homologous genes in chicken, fugu and mouse

This information can be used to download the corresponding sequences.

  • Go to the attributes and select ‘sequences’

You can select different parts of the sequence. Either you select the protein sequence (introns spliced out, only the translated part is downloaded), or you download the entire gene (introns and exons included, ignoring the transcript information) together with the 5′ and 3′ ends.

Try both. How many sequences do you get when you download the peptide? How many when you download the gene? Explain.
You can do this via the Export data section in the left panel.

Download the gene (genomic sequence, unspliced gene). Do not forget to also export ‘header’ information such as the strand, the exon positions and the gene start and end, and save the result as a text file (FASTA).

To annotate the file we will also download the gene's structural information (exon and intron start positions and the order of the exons and introns). Save it in Excel format. This information will be used to build an annotation file for our sequence file.
To check exactly what we downloaded, go to the Ensembl page of the PAX6 gene and view the sequence information. Use the find function to locate the beginning (search for CCCTCTTTTCTTATCA) and the end of your downloaded sequence in the displayed sequence. This shows that the downloaded sequence starts with the end of the last exon and ends with the beginning of the first exon, because the gene is located on the -1 strand. So the downloaded sequence is not reverse complemented.
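The find step can also be repeated outside the browser. The following is a minimal sketch, with hypothetical helper names, that reports whether a fragment occurs in a reference sequence as given or only after reverse complementing.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate(fragment, reference):
    """Return 'forward', 'reverse' or None for a fragment in a reference."""
    if fragment in reference:
        return "forward"
    if reverse_complement(fragment) in reference:
        return "reverse"
    return None


# The search string from the exercise and its reverse complement
print(reverse_complement("CCCTCTTTTCTTATCA"))  # TGATAAGAAAAGAGGG
```

If `locate` returns "forward" for the first bases of the download against the forward-strand sequence of the region, the download was not reverse complemented.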

Looking at the header information of the pax6 gene downloaded from BioMart, we see that the gene start is 31784792 and the gene end is 31817961.


We know from the structural information that our downloaded sequence starts at the end of the last exon (exon 14, 31784792) and ends at the start of the first exon (31817961); these values can be obtained from the structural annotation file. So the beginning and the end of the downloaded sequence correspond to the end of exon 14 and the beginning of exon 1, as we observed previously.

So to annotate the positions of the exons:

Position 1 in our downloaded file corresponds to the genomic position 31784792.

Position 33170 in our downloaded file corresponds to the genomic position 31817961 (31817961-31784792+1 = 33170).
This information will be used to annotate the positions of the exons on our downloaded file (see below).
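The conversion between genomic coordinates and positions in the downloaded file is a fixed offset. A small sketch, using the gene start and end quoted above (the function names are hypothetical):

```python
GENE_START = 31784792  # gene start from the BioMart header
GENE_END = 31817961    # gene end from the BioMart header


def genomic_to_local(position, gene_start=GENE_START):
    """1-based position in the downloaded sequence for a genomic coordinate."""
    return position - gene_start + 1


def local_to_genomic(offset, gene_start=GENE_START):
    """Genomic coordinate for a 1-based position in the downloaded sequence."""
    return offset + gene_start - 1


print(genomic_to_local(GENE_START))  # 1
print(genomic_to_local(GENE_END))    # 33170
```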

You might wish to add the annotation (positions of the exons and introns) to the multiple alignment you are going to make in AVID. You will have to construct a gff file with the essential annotation. To construct this file you need the exon positions. The worked examples below show how to derive them.

Exon 1

31817809 31817961 ENSE00001479873 1

31817809-31784792+1 31817961-31784792+1

33018 33170

Exon 14

31784792 31790019 ENSE00001213516 14

31784792-31784792+1 31790019-31784792+1

1 5228

Exon 13

31790710 31790860 ENSE00003700637 13

31790710-31784792+1 31790860-31784792+1

5919 6069

Exon 12

31793438 31793553 ENSE00003701932 12

31793438-31784792+1 31793553-31784792+1

8647 8762

If you do this in Excel, this results in:

< 1 33170 PAX6
14 1 5228 exon
13 5919 6069 exon
12 8647 8762 exon
11 8861 9011 exon
10 9241 9323 exon
9 9839 9997 exon
8 15900 16065 exon
7 16770 16985 exon
6 17080 17121 exon
4 17913 18043 exon
3 21621 21671 UTR
2 22058 22134 UTR
1 33018 33170 UTR
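The spreadsheet step can also be scripted. The sketch below is a hypothetical helper, not part of any of the tools used here. It takes genomic exon coordinates and writes annotation lines in the layout used in this exercise: one header line starting with `<` for a gene on the -1 strand or `>` for a gene on the +1 strand, followed by one `start end label` line per feature.

```python
def build_annotation(gene_name, gene_start, gene_end, strand, features):
    """features: iterable of (genomic_start, genomic_end, label) tuples."""
    marker = ">" if strand > 0 else "<"
    lines = [f"{marker} 1 {gene_end - gene_start + 1} {gene_name}"]
    # Sort by genomic start so the file runs from position 1 upwards
    for start, end, label in sorted(features):
        lines.append(f"{start - gene_start + 1} {end - gene_start + 1} {label}")
    return "\n".join(lines)


# Four of the exons worked out by hand in the text
exons = [
    (31817809, 31817961, "UTR"),   # exon 1
    (31790710, 31790860, "exon"),  # exon 13
    (31793438, 31793553, "exon"),  # exon 12
    (31784792, 31790019, "exon"),  # exon 14
]
print(build_annotation("PAX6", 31784792, 31817961, -1, exons))
# < 1 33170 PAX6
# 1 5228 exon
# 5919 6069 exon
# 8647 8762 exon
# 33018 33170 UTR
```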

This information can also be obtained directly from Ensembl. Go to the genome browser, search for human PAX6, select the gene summary and choose ‘download sequences’ in the left panel:

< 1 33170 PAX6
1 5142 UTR
5143 5228 exon
5919 6069 exon
8647 8762 exon
8861 9011 exon
9241 9323 exon
9839 9997 exon
15900 16065 exon
16770 16985 exon
17080 17121 exon
17913 18043 exon
21611 21620 exon
21621 21671 UTR
22058 22134 UTR
26037 26224 UTR
33018 33170 UTR

Download the orthologs

Repeat the complete workflow to download the corresponding complete gene sequences of the orthologs, making use of their Ensembl gene IDs (save them in separate FASTA files).


Check what exactly you downloaded by comparing it with the gene sequence in the Ensembl browser: here the downloaded sequence indeed starts with the first exon (the gene is located on the +1 strand).

Download the annotation file (for Pax6, too, the featured strand and the forward strand give the same result).

Forward strand

> 1 28465 Pax6
1 150 UTR
10911 10990 UTR
11348 11398 UTR
11399 11408 exon
14929 15059 exon
15852 15893 exon
15988 16203 exon
16879 17044 exon
22666 22824 exon
23295 23377 exon
23571 23721 exon
23838 23953 exon
26407 26557 exon
27371 27456 exon
27457 28465 UTR

Featured strand

> 1 28465 Pax6
1 150 UTR
10911 10990 UTR
11348 11398 UTR
11399 11408 exon
14929 15059 exon
15852 15893 exon
15988 16203 exon
16879 17044 exon
22666 22824 exon
23295 23377 exon
23571 23721 exon
23838 23953 exon
26407 26557 exon
27371 27456 exon
27457 28465 UTR

Compare with the annotation you make yourself.

Download the gene structural information and save it in xls format.

Exon 1

ENSMUSG00000027168 105668900 105669049 1

105668900-105668900+1 105669049-105668900+1

1 150

> 1 28465 Pax6
1 150 UTR
10911 10990 UTR
11348 11398 UTR
11399 11408 exon
14929 15059 exon
15852 15893 exon
15988 16203 exon
16879 17044 exon
22666 22824 exon
23295 23377 exon
23571 23721 exon
23838 23953 exon
26407 26557 exon
27371 27456 exon
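Comparing a hand-made annotation with the one downloaded from Ensembl is easier when both are reduced to sets of features. The sketch below uses hypothetical helper names and assumes the `start end label` line layout used in this exercise. It lists the features present in only one of the two files.

```python
def parse_annotation(text):
    """Collect (start, end, label) features; header lines are skipped."""
    features = set()
    for line in text.splitlines():
        fields = line.split()
        # Feature lines have exactly three fields: start, end, label
        if len(fields) == 3 and fields[0].isdigit() and fields[1].isdigit():
            features.add((int(fields[0]), int(fields[1]), fields[2]))
    return features


def compare_annotations(mine, reference):
    """Return (only in mine, only in reference), each sorted by position."""
    a, b = parse_annotation(mine), parse_annotation(reference)
    return sorted(a - b), sorted(b - a)


mine = "> 1 28465 Pax6\n1 150 UTR\n27371 27456 exon\n"
reference = mine + "27457 28465 UTR\n"
print(compare_annotations(mine, reference))
# ([], [(27457, 28465, 'UTR')])
```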

Aligning the sequences using AVID

  • Go to the mVISTA tools of the VISTA genome browser.

  • Align the sequences with LAGAN

  • Specify the number of sequences you want to align, then press ‘Submit’.

  • Fill in your email address and provide the FASTA and annotation files, then press ‘Submit’.

  • Wait until you get an email with the results.

Before you use the downloaded FASTA files you have to adapt them: shorten the header, because otherwise the visualization is messed up, and make sure there is a line break after the header and before the sequence, because otherwise the file is not valid FASTA.
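This clean-up can be scripted. The sketch below is a hypothetical helper. It assumes each record already starts with a `>` header line that ends in a line break, so a file in which header and sequence share one line still has to be fixed by hand. The short names are supplied by the user in file order.

```python
def clean_fasta(text, short_names, width=60):
    """Replace each header by a short name and re-wrap the sequence lines."""
    names = iter(short_names)
    records = []  # each entry: [short name, list of sequence lines]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = next(names, None)
            if name is None:
                raise ValueError("more records than short names")
            records.append([name, []])
        elif records:
            records[-1][1].append(line)
    out = []
    for name, parts in records:
        sequence = "".join(parts)
        wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        out.append(">" + name + "\n" + "\n".join(wrapped) + "\n")
    return "".join(out)


raw = ">ENSG00000007372|11|31784792|31817961|-1\nACGT\nACGT\n"
print(clean_fasta(raw, ["human"]), end="")
# >human
# ACGTACGT
```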



Annotation files: do not reverse complement the sequences.
View the pdf file in which all sequences are compared relative to the human sequence.

The blue boxes are the exons in the human sequence from the annotation file. Note the high homology between the rat and the mouse sequence. Even in remote organisms such as fugu and zebrafish some of the human exons are conserved.

Between rat, mouse and human, parts of the sequence in the region 1000 bp upstream of the first exon are conserved as well. These might correspond to regulatory motifs responsible for transcriptional regulation.

The low homology between the two zebrafish sequences (paralogs) can be attributed to the poor sequence quality of the second zebrafish copy (the genome assembly is not complete yet) and to the short sequence of fugu.

For further practice you can also start from the gene ENSMUSG00000025190.

Aligning the sequences using VISTA genome browser

  • Go to the VISTA genome browser website.

  • Go to the Precomputed Alignments

  • Provide the proper coordinates for the human PAX6 gene (Chr11: 31,806,340-31,839,509) and Submit

  • Compare with the previous results (note that we do not have the UTRs).
