DEEP: Data Standards and Policies
Introduction
The German Epigenome Program DEEP is a consortium for interdisciplinary epigenome research in Germany and is associated with the IHEC initiative. It currently comprises 31 research centers. The aim of the project is to generate and analyze more than 70 human epigenomes from 13 tissue types in the context of metabolic, inflammatory and neurodegenerative diseases. ChIP-seq, DNase-seq, RNA-seq and bisulfite sequencing data will be processed along with genomic sequencing data. Because of its high sequencing depth (more than 30X), bisulfite sequencing data will be particularly demanding to process in terms of resources. The substantial volume and diversity of the generated data require the adoption of standardized data formats and experiment descriptions. Consistent quality standards are crucial for downstream data integration.
This document describes the data standards and policies used by the DEEP consortium to exchange data between producers and the Data Coordination Center (DCC) and to make the data available to the whole consortium. It proposes both detailed data exchange policies and standard data formats for data distribution.
Data flow and distribution
The six data providers in the DEEP project will submit the raw sequence data to the DCC at the DKFZ. Metadata describing the sequencing experiments, as well as quality values, will be transferred to the DCC together with the sequencing results. The DCC at the DKFZ will provide access to these data for the whole consortium. The sequencing data will be transferred (on request/automatically) to the Data Analysis Center (DAC) in Saarbrücken via an Aspera interface. Metadata describing biological samples will be transferred from the sample providers to the DCC by the data producer, in parallel with or prior to the sequencing results.
Primary Analysis results will be accessible from the DCC, if feasible. Initially, this will be restricted to the Consortium members. Public access will be provided according to the IHEC policies (http://ihec-epigenomes.net/about/policies-and-guidelines/).
We plan to follow the BLUEPRINT recommendation and make our high-level analysis results (the ChIP-seq, DNase-seq and methylation signals and the RNA-seq expression analysis) visible via a common web portal. The Spanish colleagues of BLUEPRINT (the groups of Ivo Gut from Barcelona Supercomputing Center (BSC) and Alfonso Valencia from the Spanish National Cancer Research Center (CNIO)) first attempted the BioMart approach already in use for ICGC. BioMart failed because it did not scale up to epigenetics approaches. Consequently, the BLUEPRINT colleagues are currently developing a new database using MongoDB (a NoSQL technology). It was agreed that DEEP will wait until the MongoDB implementation provided by BLUEPRINT’s Spanish colleagues is available. Afterwards, we will continue the discussion with the BLUEPRINT team to harmonize the web presentation of high-level analysis results between DEEP and BLUEPRINT.
For access to the raw data from outside the DEEP consortium, the sequencing data will additionally be transferred (automated, on request?) to the European Genome-phenome Archive (EGA). The EGA will provide access to the wider public, subject to review by the DEEP Data Access Compliance Office (DACO), which still has to be established. Data access rules will follow the general guidelines from ICGC and IHEC.
Figure 1: The Data Coordination Center has a central role in the project, providing a platform for receiving and storing raw data.
Data Formats
Large-scale projects like DEEP need to specify their data formats and data standards to ensure interoperability and maximum usability of the data. DEEP supports community-driven data standards for all its data types.
Each kind of data transfer (each vertical arrow in Figure 1) requires a clearly defined data format specification.
Metadata for sequencing raw data (transfer from sequencing groups to DCC):
The format will follow the ICGC specifications.
Metadata for sequencing experimental data (transfer from sequencing groups to DCC):
The format will follow the IHEC recommendations (NIH Roadmap).
Metadata for primary cell lines (transfer from sample provider to DCC):
To be determined by the sample provider.
The format will follow the NIH Roadmap specification for IHEC.
In agreement with BLUEPRINT, we propose to use the EBI or NIH BioSample database as a guideline.
http://www.ncbi.nlm.nih.gov/biosample
http://www.ebi.ac.uk/biosamples/index.html
Metadata for patient information (transfer from sample provider to DCC):
To be defined by the sample provider.
Sample naming will be defined together by the DAC and the DCC.
Metadata for raw data to EGA (for submission from DCC to EGA at EBI):
According to EGA requirements, similar to the ICGC procedure.
A Data Access Compliance Office (DACO) is required.
Data exchange from DCC to DAC
The DAC obtains access to all the data gathered by the DCC.
Data (high-level analysis) exchange from DAC to DCC
Based on BLUEPRINT precedent.
High level analysis results (for submission from DCC to the BLUEPRINT MongoDB):
Based on BLUEPRINT precedent.
Naming conventions for high-level analysis files will be defined together by DCC and DAC.
Primary data formats
The primary data formats produced by DEEP are described in Table 1.
Experiment Type | Data Type | Data Format
All | Alignment | SAM/BAM
RNA-Seq | Expression Levels | GTF
BS-Seq, DNase-Seq and ChIP-Seq | Regions | BED
BS-Seq, DNase-Seq and ChIP-Seq | Signal | WIG
Table 1: File Formats used to store different data types produced by DEEP
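As an illustration of how Table 1 could be used at submission time, the following minimal Python sketch checks whether a file name matches the format expected for its data type. The file extensions and the function name `has_expected_format` are assumptions made for this example; they are not part of the DEEP specification.

```python
from pathlib import Path

# Data type -> accepted file extensions, transcribed from Table 1.
# The extensions themselves are an assumption of this sketch.
EXPECTED_EXTENSIONS = {
    "Alignment": {".sam", ".bam"},
    "Expression Levels": {".gtf"},
    "Regions": {".bed"},
    "Signal": {".wig"},
}


def has_expected_format(filename: str, data_type: str) -> bool:
    """Return True if the file extension matches the format listed for data_type."""
    if data_type not in EXPECTED_EXTENSIONS:
        raise KeyError(f"unknown data type: {data_type!r}")
    return Path(filename).suffix.lower() in EXPECTED_EXTENSIONS[data_type]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the call.
    print(has_expected_format("sample_01.bam", "Alignment"))
    print(has_expected_format("sample_01.bed", "Signal"))
```

A check of this kind only inspects the file name; it does not validate the file content.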
Quality Metrics
The DCC will record some standard metrics to assist with quality control of both the incoming data and primary analysis results.
All DEEP data will be subjected to three main types of quality control. First, all sequence data will be checked for quality and contamination. Second, quality statistics will be calculated for all alignments and distributed by the DCC together with the alignments for QC purposes. Finally, there will be data-type-specific metrics for RNA-Seq, ChIP-Seq, DNase-Seq and BS-Seq, which are also described here.
It is expected that each sequencing center will perform internal quality control prior to submission to the DCC. Once sequence reads are stored by the DCC group, they are assessed for quality with specialized software.
Sequence Quality and Contamination
FastQC assesses multiple aspects of sequence quality and library diversity. These include the per-base quality across reads, the degree of sequence duplication and overrepresented sequences.
(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
FastQ Screen is used to check the data for contamination and to ensure that the reads map to the expected genome. Reads are mapped against the following:
Human genome & transcriptome
E. coli and yeast genomes
UniVec (http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html)
Common contaminants & PhiX
(http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen)
The output from both programs is available via the DCC web portal, allowing transparent comparison of sequencing quality by all interested consortium members.
Alignment Quality Metrics
All our alignment metrics are collected from BAM files using programs from the Picard toolkit (http://picard.sourceforge.net/). The alignment quality metrics are collected using AlignmentSummaryMetrics and DuplicationSummaryMetrics. These provide statistics such as the number of reads mapped, mismatch rates, reads aligned in pairs and duplication rates (Table 2a and 2b). These statistics will be presented in tab-delimited format on the DCC web page alongside any defined data freezes.
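Tab-delimited metrics files of this kind are straightforward to read programmatically. The sketch below assumes the usual layout of a Picard metrics file: comment lines, a line starting with `## METRICS CLASS`, one header row, and data rows terminated by a blank line. The function name and the example values are illustrative only.

```python
import csv
import io


def parse_picard_metrics(text: str) -> list:
    """Return the metrics table of a Picard-style metrics file as a list of dicts.

    Assumed layout: the table starts on the line after '## METRICS CLASS'
    and ends at the first blank line.
    """
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            break
    else:
        raise ValueError("no '## METRICS CLASS' section found")

    table_lines = []
    for line in lines[index + 1:]:
        if not line.strip():
            break  # a blank line ends the metrics table
        table_lines.append(line)

    reader = csv.DictReader(io.StringIO("\n".join(table_lines)), delimiter="\t")
    return list(reader)


# Made-up example content; the numbers are not real DEEP results.
EXAMPLE = (
    "## METRICS CLASS\tpicard.analysis.AlignmentSummaryMetrics\n"
    "CATEGORY\tTOTAL_READS\tPF_READS_ALIGNED\n"
    "PAIR\t1000\t950\n"
    "\n"
    "## HISTOGRAM\tjava.lang.Integer\n"
)

if __name__ == "__main__":
    for row in parse_picard_metrics(EXAMPLE):
        print(row["CATEGORY"], row["TOTAL_READS"], row["PF_READS_ALIGNED"])
```

All values are returned as strings; converting them to numbers is left to the caller, since the column types differ between metrics classes.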
RNA-Seq Quality Metrics
Picard also provides an RNA-Seq statistics utility called RnaSeqMetrics, which collects information such as the number of ribosomal bases, the number of coding bases and other coverage statistics for an RNA-Seq data set against a specific transcriptome (Table 3).
ChIP-Seq and DNase-Seq Quality Metrics
For these data sets, measures including the percentage of reads mapped to enriched regions, the mean and median region length, and the region length variance will be calculated. The ENCODE Project has also defined a number of ChIP-Seq metrics that we will measure as well, such as the Irreproducible Discovery Rate (IDR), the Normalised Strand Cross-correlation coefficient (NSC) and Uniquely Mappable Reads (http://genome.ucsc.edu/ENCODE/qualityMetrics.html#chipSeq). These statistics will be distributed in tab-delimited files alongside the region and signal files (Table 4).
BS-Seq Quality Metrics
The primary metric measured for BS-Seq is the conversion efficiency. This is estimated using reads that map uniquely to the phage lambda reference; these should be derived from the spiked-in unmethylated lambda DNA. The conversion efficiency λ is estimated from the formula λ = 1 - (nC1 + nG2) / (nG1 + nC2), where nXy is the number of observed X bases in read y (for example, nC1 is the number of C bases observed in read 1). This metric will be distributed alongside the BS-Seq data.
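The formula can be written down directly in code. The following Python sketch takes the four base counts from the lambda-mapped reads and returns λ; the function and parameter names are chosen for this example, and the counts in the usage line are invented.

```python
def conversion_efficiency(n_c1: int, n_g2: int, n_g1: int, n_c2: int) -> float:
    """Estimate bisulfite conversion efficiency from lambda spike-in reads.

    Implements lambda = 1 - (nC1 + nG2) / (nG1 + nC2), where nXy is the
    number of observed X bases in read y.
    """
    denominator = n_g1 + n_c2
    if denominator == 0:
        raise ValueError("nG1 + nC2 is zero; no lambda-mapped bases to evaluate")
    return 1.0 - (n_c1 + n_g2) / denominator


if __name__ == "__main__":
    # Invented counts: 10 unconverted bases against 1000 reference bases,
    # which gives an efficiency of about 0.99.
    print(conversion_efficiency(n_c1=5, n_g2=5, n_g1=500, n_c2=500))
```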
A
Metric | Description
#Reads | Total Number of Reads
#Noise Reads | Number of Noise Reads
#Reads Aligned | Number of Reads Aligned
#Aligned Bases | Number of Aligned Bases
#High-Quality Aligned Reads | Number of Aligned Reads with Mapping Quality of 20 or higher
#High-Quality Aligned Bases | Number of Aligned Bases with Mapping Quality of 20 or higher
#High-Quality Aligned Q20 Bases | Number of High-Quality Aligned Bases with Base Quality of 20 or higher
Median Mismatches | Median Number of Mismatches in High-Quality Reads
Mismatch Rate | Rate of Mismatches in all Reads
In-del Rate | Number of Indels per 100 bp
Mean Read Length | Average Read Length
Reads Aligned in Pairs | Number of Read Pairs where both reads aligned to the reference
%Reads Aligned in Pairs | Percentage of Read Pairs where both reads aligned
#Bad Cycles | Number of Cycles where 80% of the bases were no-calls
Strand Balance | Ratio of Number of Reads Aligned to the Positive Strand to the Number of Reads Aligned to the Whole Genome
%Chimeras | Percentage of Read Pairs where the reads map outside the maximum insert size or on different chromosomes
%Adapter | Percentage of Reads which failed to map to the genome but do map to a known adapter sequence
B
Metric | Description
#Unpaired Reads | Total Number of Unpaired Reads
#Read Pairs | Total Number of Read Pairs
#Unmapped Reads | Total Number of Unmapped Reads
#Unpaired Read Duplicates | Number of Unpaired Reads which are Duplicated
#Read Pair Duplicates | Number of Read Pairs which are Duplicated
#Read Pair Optical Duplicates | Number of Read Pairs which are Optical Duplicates
%Duplication | Percentage of Duplication
Estimated Library Size | Estimated Number of Unique Molecules in the Library
Table 2: Alignment Quality Metrics Collected by Picard AlignmentSummary and DuplicationSummary Metrics
Metric | Description
#Bases | Number of Bases
#Aligned Bases | Number of Aligned Bases
#Ribosomal Bases | Number of Ribosomal Bases
#Coding Bases | Number of Coding Bases
#UTR Bases | Number of UTR Bases
#Intronic Bases | Number of Intronic Bases
#Intergenic Bases | Number of Intergenic Bases
%Ribosomal Bases | Percentage of Ribosomal Bases
%Coding Bases | Percentage of Coding Bases
%UTR Bases | Percentage of UTR Bases
%Intronic Bases | Percentage of Intronic Bases
%Intergenic Bases | Percentage of Intergenic Bases
%mRNA Bases | Percentage of mRNA Bases: (#UTR Bases + #Coding Bases) / #Aligned Bases
%Usable Bases | Percentage of Usable Bases: (#UTR Bases + #Coding Bases) / #Bases
Median CV Coverage | Median Coefficient of Variation of the Coverage of the 1000 most highly expressed transcripts
Median 5' Bias | Median 5' Bias
Median 3' Bias | Median 3' Bias
Median 5' to 3' Bias | Median 5' to 3' Bias
Table 3: RNA-Seq Quality Metrics Collected by Picard RnaSeqMetrics
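The two derived percentages in Table 3 follow from the base counts in the same table. The sketch below shows the arithmetic, reading the formulas as the sum of UTR and coding bases divided by the aligned or total bases; the function name and the input numbers are invented for illustration.

```python
def mrna_percentages(bases: int, aligned_bases: int,
                     coding_bases: int, utr_bases: int) -> dict:
    """Compute %mRNA Bases and %Usable Bases from the counts in Table 3.

    %mRNA Bases   = (#UTR Bases + #Coding Bases) / #Aligned Bases
    %Usable Bases = (#UTR Bases + #Coding Bases) / #Bases
    """
    if bases <= 0 or aligned_bases <= 0:
        raise ValueError("base counts must be positive")
    mrna_bases = utr_bases + coding_bases
    return {
        "pct_mrna_bases": 100.0 * mrna_bases / aligned_bases,
        "pct_usable_bases": 100.0 * mrna_bases / bases,
    }


if __name__ == "__main__":
    # Invented counts, not real DEEP data.
    print(mrna_percentages(bases=1000, aligned_bases=800,
                           coding_bases=300, utr_bases=100))
```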
Metric | Description
Region Enrichment | Percentage of reads mapping to enriched regions
Median Region Length | Median Region Length
Mean Region Length | Mean Region Length
Region Length Variance |
Region Length Standard Deviation |
Table 4: ChIP-Seq and DNase-Seq Quality Metrics
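The region length statistics in Table 4 can be derived from the BED region files. The following Python sketch is one possible way to do so; it assumes standard BED coordinates (0-based start, exclusive end) and uses the population variance, since the document does not state whether the population or the sample variance is intended.

```python
import statistics


def region_length_metrics(bed_lines) -> dict:
    """Compute length statistics over the regions of a BED file.

    Assumes the first three columns are chrom, start and end, with a
    0-based start and an exclusive end, so length = end - start.
    """
    lengths = []
    for line in bed_lines:
        # Skip blank lines and the optional comment/track/browser header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        _chrom, start, end = line.split()[:3]
        lengths.append(int(end) - int(start))

    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "variance": statistics.pvariance(lengths),  # population variance (assumption)
        "stdev": statistics.pstdev(lengths),
    }


if __name__ == "__main__":
    # Invented regions with lengths 100, 300 and 200 bp.
    example = ["chr1\t100\t200", "chr1\t500\t800", "chr2\t0\t200"]
    print(region_length_metrics(example))
```

An input without any regions raises `statistics.StatisticsError`, which a caller would need to handle for empty peak files.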
DACO (Data Access Compliance Office): Public dissemination
To access the raw data files, users need the correct authentication and authorization. The EGA itself does not grant access to the data; access must be applied for from the DEEP Data Access Compliance Office (DEEP-DACO). The EGA provides a personal account with access permissions for each successful applicant. Applicants must also sign the Data Access Agreement with the DEEP-DACO, which provides instructions on how data can be stored, used and transferred once they have been downloaded from our system.
Conclusions
This report describes a coherent set of standards and policies for the DEEP consortium to follow. As the project progresses and data types evolve, the DCC will seek appropriate new measures to ensure that the data DEEP produces continue to represent a high-quality epigenomic resource.
Scheduling issues
Metadata for sequencing raw data (transfer from sequencing groups to DCC):
A first test submission of raw data from the sequencing centers to the DCC will be initiated.
Time: after the first runs have been performed (expected in summer 2013)
Metadata for sequencing experimental data (transfer from sequencing groups to DCC):
The extract of the current BLUEPRINT metadata is available and will be distributed by the DAC to the sequencing centers and sample labs.
Time: will be discussed at the Saarbrücken conference (28.6.)
Metadata for patient information (transfer from sample provider to DCC):
Has been initiated by the DAC.
Time: waiting for an answer from the sample provider
Metadata for raw data to EGA (for submission from DCC to EGA at EBI):
According to EGA requirements, similar to the ICGC procedure.
A Data Access Compliance Office (DACO) is required. It will be initiated by the DEEP coordinator.
Time: waiting for an answer from the coordinator