
DEEP: Data Standards and Policies

Introduction

The German Epigenome Program DEEP is a consortium for interdisciplinary epigenome research in Germany, associated with the IHEC initiative. It currently comprises 31 research centers. The aim of the project is to generate and analyze more than 70 human epigenomes from 13 tissue types in the context of metabolic, inflammatory and neurodegenerative diseases. ChIP-seq, DNase-seq, RNA-seq and bisulfite sequencing data will be processed along with genomic sequencing data. Because of its high sequencing depth (more than 30X), bisulfite sequencing data will be particularly demanding to process in terms of resources. The substantial volume and diversity of the generated data require the adoption of standardized data formats and experiment descriptions. Consistent quality standards are crucial for downstream data integration.

This document describes the data standards and policies used by the DEEP consortium to exchange data between producers and the Data Coordination Center (DCC) and to make the data available to the whole consortium. We propose both detailed data exchange policies and standard data formats for data distribution.

Data flow and distribution

The six data providers in the DEEP project will submit raw sequence data to the DCC at the DKFZ. Metadata describing the sequencing experiments, as well as quality values, will be transferred to the DCC together with the sequencing results. The DCC at the DKFZ will provide access to these data for the whole consortium. The sequencing data will be transferred (on request or automatically) to the Data Analysis Center (DAC) in Saarbrücken via an Aspera interface. Metadata describing biological samples will be transferred from the sample providers to the DCC by the data producer, in parallel with or prior to the sequencing results.

Primary analysis results will be accessible from the DCC, if feasible. Initially, access will be restricted to the consortium members. Public access will be provided according to the IHEC policies (http://ihec-epigenomes.net/about/policies-and-guidelines/).
We plan to follow the BLUEPRINT recommendation and make our high-level analysis results (the ChIP-seq, DNase-seq and methylation signals and the RNA-seq expression analysis) visible via a common web portal. The BioMart approach already realized for ICGC was first attempted by the Spanish colleagues of BLUEPRINT (the groups of Ivo Gut from the Barcelona Supercomputing Center (BSC) and Alfonso Valencia from the Spanish National Cancer Research Center (CNIO)). BioMart failed because it did not scale to epigenetics approaches. Consequently, the BLUEPRINT colleagues are currently developing a new database using MongoDB (a NoSQL technology). It was agreed that DEEP will wait until the MongoDB implementation provided by BLUEPRINT's Spanish colleagues is available. Afterwards, we will continue the discussion with the BLUEPRINT team to harmonize the web presentation of high-level analysis results between DEEP and BLUEPRINT.
For access to the raw data from outside the DEEP consortium, the sequencing data will additionally be transferred (automatically or on request; still to be decided) to the European Genome-phenome Archive (EGA). The EGA will provide access to the wider public, subject to review by the DEEP Data Access Compliance Office (DACO), which still has to be established. Data access rules will follow the general guidelines from ICGC and IHEC.

Figure 1: The Data Coordination Center has a central role in the project, providing a platform for the reception and storage of raw data

Data Formats
Large-scale projects like DEEP need to specify their data formats and data standards to ensure interoperability and maximum usability of the data. DEEP supports community-driven data standards for all its data types.
Each kind of data transfer (each vertical arrow in Figure 1) requires a clearly defined data format specification.

Metadata for sequencing raw data (transfer from sequencing groups to DCC):
- The format will follow the ICGC specifications.

Metadata for sequencing experimental data (transfer from sequencing groups to DCC):
- The format will follow the IHEC recommendations (NIH Roadmap).

Metadata for primary cell lines (transfer from sample provider to DCC):
- To be determined by the sample provider.
- The format will follow the NIH Roadmap specification for IHEC.
- In agreement with BLUEPRINT, we propose to use the EBI or NIH BioSample database as a guideline.
  - http://www.ncbi.nlm.nih.gov/biosample
  - http://www.ebi.ac.uk/biosamples/index.html

Metadata for patient information (transfer from sample provider to DCC):
- To be defined by the sample provider.
- Sample naming will be defined jointly by the DAC and the DCC.

Metadata for raw data to EGA (for submission from DCC to EGA at EBI):
- According to EGA requirements; the procedure is similar to that of ICGC.
- A Data Access Compliance Office (DACO) is required.

Data exchange from DCC to DAC:
- The DAC obtains access to all the data gathered by the DCC.

Data (high-level analysis) exchange from DAC to DCC:
- Based on the BLUEPRINT precedent.

High-level analysis results (for submission from DCC to the BLUEPRINT MongoDB):
- Based on the BLUEPRINT precedent.
- Naming conventions for high-level analysis files will be defined jointly by the DCC and the DAC.

Primary data formats

The primary data formats produced by DEEP are listed in Table 1.

Experiment Type	Data Type	Data Format
All	Alignment	SAM/BAM
RNA-Seq	Expression Levels	GTF
BS-Seq, DNase-Seq and ChIP-Seq	Regions	BED
BS-Seq, DNase-Seq and ChIP-Seq	Signal	WIG

Table 1: File formats used to store the different data types produced by DEEP

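The mapping in Table 1 can be written down as a small lookup, for example to check an incoming file against the expected format. The following is only a sketch: the experiment-type labels used as keys and the function name `expected_formats` are illustrative assumptions, not identifiers defined by DEEP.

```python
# Formats per (experiment type, data type), transcribed from Table 1.
# Key spellings and the helper name are illustrative assumptions.
ALIGNMENT_FORMATS = {"SAM", "BAM"}  # Table 1: applies to all experiment types

TABLE_1 = {
    ("RNA-Seq", "Expression Levels"): {"GTF"},
    ("BS-Seq", "Regions"): {"BED"},
    ("DNase-Seq", "Regions"): {"BED"},
    ("ChIP-Seq", "Regions"): {"BED"},
    ("BS-Seq", "Signal"): {"WIG"},
    ("DNase-Seq", "Signal"): {"WIG"},
    ("ChIP-Seq", "Signal"): {"WIG"},
}


def expected_formats(experiment_type, data_type):
    """Return the set of formats Table 1 lists for this combination.

    An empty set means Table 1 has no entry for the combination.
    """
    if data_type == "Alignment":
        return ALIGNMENT_FORMATS
    return TABLE_1.get((experiment_type, data_type), set())


if __name__ == "__main__":
    print(expected_formats("ChIP-Seq", "Signal"))
    print(expected_formats("RNA-Seq", "Alignment"))
```
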
Quality Metrics
The DCC will record a set of standard metrics to assist with quality control of both the incoming data and the primary analysis results.
All DEEP data will be subjected to three main types of quality control. First, all sequence data will be checked for quality and contamination. Second, quality statistics will be calculated for all alignments and distributed by the DCC together with the alignments for QC purposes. Finally, there are data-type-specific metrics for RNA-Seq, ChIP-Seq, DNase-Seq and BS-Seq, which are also described here.
It is expected that each sequencing center will perform internal quality control prior to submission to the DCC. Once sequence reads are stored by the DCC group, they are assessed for quality with specialized software.
Sequence Quality and Contamination
FastQC assesses multiple aspects of sequence quality and library diversity. These include the per-base quality across reads, the degree of sequence duplication and overrepresented sequences.

(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

FastQ Screen is used to check the data for contamination and to ensure that the reads map to the expected genome. Reads are mapped against the following:

Human genome & transcriptome
E. coli and yeast genomes
UniVec (http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html)
Common contaminants & PhiX

(http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen)

The output from both programs is available via the DCC web portal, allowing all interested consortium members to compare sequencing quality transparently.

Alignment Quality Metrics
All alignment metrics are collected from BAM files using programs from the Picard toolkit (http://picard.sourceforge.net/), specifically AlignmentSummaryMetrics and DuplicationSummaryMetrics. These provide statistics such as the number of reads mapped, mismatch rates, reads aligned in pairs and duplication rates (Tables 2a and 2b). These statistics will be presented in tab-delimited format on the DCC web page alongside any defined data freezes.
RNA-Seq Quality Metrics
Picard also has an RNA-Seq statistics utility called RnaSeqMetrics, which collects information such as the number of ribosomal bases, the number of coding bases and other coverage statistics for the RNA-Seq data set against a specific transcriptome (Table 3).
ChIP-Seq and DNase-Seq Quality Metrics
For these data sets, the following measures will be calculated: the percentage of reads mapped to enriched regions, the mean and median region length, and the region length variance (Table 4). The ENCODE Project has also defined a number of ChIP-Seq metrics that we will measure, such as the Irreproducible Discovery Rate (IDR), the Normalised Strand Cross-correlation coefficient (NSC) and Uniquely Mappable Reads (http://genome.ucsc.edu/ENCODE/qualityMetrics.html#chipSeq). These statistics will be distributed in tab-delimited files alongside the region and signal files.
BS-Seq Quality Metrics
The primary metric measured for BS-Seq is the conversion efficiency. It is estimated using reads that map uniquely to the phage lambda reference; these should derive from the spiked-in unmethylated lambda DNA. The conversion efficiency λ is estimated from the formula λ = 1 - (n_C^1 + n_G^2) / (n_G^1 + n_C^2), where n_X^y is the number of observed X bases in read y. This metric will be distributed along with the BS-Seq data.
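The conversion efficiency formula can be illustrated with a short calculation. This is a minimal sketch, assuming the lambda-mapped reads are available as plain sequence strings in the orientation in which they were sequenced, with read 1 and read 2 kept in separate lists; the function names are hypothetical and the toy sequences are not real data.

```python
from collections import Counter


def count_bases(reads):
    """Count how often each base occurs across a list of read sequences."""
    counts = Counter()
    for read in reads:
        counts.update(read.upper())
    return counts


def conversion_efficiency(read1_seqs, read2_seqs):
    """Estimate lambda = 1 - (n_C^1 + n_G^2) / (n_G^1 + n_C^2).

    n_X^y is the number of observed X bases in read y, counted over
    reads that map uniquely to the phage lambda reference.
    """
    n1 = count_bases(read1_seqs)
    n2 = count_bases(read2_seqs)
    numerator = n1["C"] + n2["G"]
    denominator = n1["G"] + n2["C"]
    if denominator == 0:
        raise ValueError("no G bases in read 1 and no C bases in read 2")
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    # Toy input for illustration only.
    read1 = ["TTGGTTGG", "CTGG"]
    read2 = ["CCAACCAA", "CCAG"]
    print(conversion_efficiency(read1, read2))
```

In this toy input the numerator is 2 and the denominator is 12, so the estimate is 1 - 2/12.
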
(a)

Metric	Description
#Reads	Total number of reads
#Noise Reads	Number of noise reads
#Reads Aligned	Number of reads aligned
#Aligned Bases	Number of aligned bases
#High-Quality Aligned Reads	Number of aligned reads with a mapping quality of 20 or higher
#High-Quality Aligned Bases	Number of aligned bases with a mapping quality of 20 or higher
#High-Quality Aligned Q20 Bases	Number of high-quality aligned bases with a base quality of 20 or higher
Median Mismatches	Median number of mismatches in high-quality reads
Mismatch Rate	Rate of mismatches in all reads
In-del Rate	Number of indels per 100 bp
Mean Read Length	Average read length
Reads Aligned in Pairs	Number of read pairs where both reads aligned to the reference
%Reads Aligned in Pairs	Percentage of read pairs where both reads aligned
#Bad Cycles	Number of cycles where 80% of the bases were no-calls
Strand Balance	Ratio of the number of reads aligned to the positive strand to the number of reads aligned to the whole genome
%Chimeras	Percentage of read pairs where the reads map outside the maximum insert size or on different chromosomes
%Adapter	Percentage of reads which failed to map to the genome but do map to a known adapter sequence

(b)

Metric	Description
#Unpaired Reads	Total number of unpaired reads
#Read Pairs	Total number of read pairs
#Unmapped Reads	Total number of unmapped reads
#Unpaired Read Duplicates	Number of unpaired reads which are duplicated
#Read Pair Duplicates	Number of read pairs which are duplicated
#Read Pair Optical Duplicates	Number of read pairs which are optical duplicates
%Duplication	Percentage of duplication
Estimated Library Size	Estimated number of unique molecules in the library

Table 2: Alignment quality metrics collected by Picard (a) AlignmentSummaryMetrics and (b) DuplicationSummaryMetrics
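Since these statistics are exchanged as tab-delimited text, a consumer needs only a small reader. The sketch below assumes the usual layout of a Picard metrics file (comment lines starting with "#", then one header row and one or more data rows, ended by a blank line). The function name `parse_metrics` and the sample content are illustrative and not real DEEP output.

```python
def parse_metrics(text):
    """Parse the first metrics table of a Picard-style metrics file.

    Returns a list of dicts, one per data row, keyed by column name.
    Values are kept as strings; callers convert them as needed.
    """
    header = None
    rows = []
    for line in text.splitlines():
        if line.startswith("#"):
            if header is not None:
                break  # a later comment (e.g. a histogram section) ends the table
            continue
        if not line.strip():
            if header is not None:
                break  # a blank line after the table ends it
            continue
        fields = line.split("\t")
        if header is None:
            header = fields
        else:
            rows.append(dict(zip(header, fields)))
    return rows


# Made-up sample in the assumed layout, for illustration only.
SAMPLE = (
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics\n"
    "LIBRARY\tUNPAIRED_READS_EXAMINED\tREAD_PAIRS_EXAMINED\tPERCENT_DUPLICATION\n"
    "lib1\t100\t450\t0.12\n"
    "\n"
    "## HISTOGRAM\tjava.lang.Double\n"
    "BIN\tVALUE\n"
    "1.0\t2.0\n"
)

if __name__ == "__main__":
    for row in parse_metrics(SAMPLE):
        print(row["LIBRARY"], row["PERCENT_DUPLICATION"])
```
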

Metric	Description
#Bases	Number of Bases
#Aligned Bases	Number of Aligned Bases
#Ribosomal Bases	Number of Ribosomal Bases
#Coding Bases	Number of Coding Bases
#UTR Bases	Number of UTR Bases
#Intronic Bases	Number of Intronic Bases
#Intergenic Bases	Number of Intergenic Bases
%Ribosomal Bases	Percentage of Ribosomal Bases
%Coding Bases	Percentage of Coding Bases
%UTR Bases	Percentage of UTR Bases
%Intronic Bases	Percentage of Intronic Bases
%Intergenic Bases	Percentage of Intergenic Bases
%mRNA Bases	Percentage of mRNA bases: (#UTR Bases + #Coding Bases) / #Aligned Bases
%Usable Bases	Percentage of usable bases: (#UTR Bases + #Coding Bases) / #Bases
Median CV Coverage	Median coefficient of variation of the coverage of the 1000 most highly expressed transcripts
Median 5' Bias	Median 5' bias
Median 3' Bias	Median 3' bias
Median 5' to 3' Bias	Median 5' to 3' bias

Table 3: RNA-Seq quality metrics collected by Picard RnaSeqMetrics
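The two derived percentages in Table 3 follow directly from the base counts. A minimal sketch of the arithmetic is given below; the function name and dictionary keys are hypothetical, and the counts in the usage example are invented for illustration.

```python
def rna_seq_percentages(bases, aligned_bases, coding_bases, utr_bases):
    """Compute %mRNA bases and %usable bases as defined in Table 3.

    %mRNA bases   = (UTR bases + coding bases) / aligned bases
    %usable bases = (UTR bases + coding bases) / all bases
    Both are returned as percentages.
    """
    if bases <= 0 or aligned_bases <= 0:
        raise ValueError("base counts must be positive")
    mrna_bases = utr_bases + coding_bases
    return {
        "pct_mrna_bases": 100.0 * mrna_bases / aligned_bases,
        "pct_usable_bases": 100.0 * mrna_bases / bases,
    }


if __name__ == "__main__":
    # Invented counts, for illustration only.
    result = rna_seq_percentages(
        bases=1000, aligned_bases=800, coding_bases=300, utr_bases=100
    )
    print(result)
```
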

Metric	Description
Region Enrichment	Percentage of reads mapping to enriched regions
Median Region Length	Median region length
Mean Region Length	Mean region length
Region Length Variance	Variance of the region length
Region Length Standard Deviation	Standard deviation of the region length

Table 4: ChIP-Seq and DNase-Seq quality metrics
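The region metrics in Table 4 can be derived from the BED region files with standard summary statistics. The sketch below assumes that regions are given as (chromosome, start, end) tuples in BED coordinates (0-based start, exclusive end) and that population variance is intended; the source does not say whether population or sample variance is meant. The function names are hypothetical.

```python
import statistics


def region_length_metrics(regions):
    """Summarise region lengths for (chrom, start, end) tuples.

    BED intervals are 0-based and half-open, so length = end - start.
    Population variance and standard deviation are an assumption here.
    """
    lengths = [end - start for _chrom, start, end in regions]
    return {
        "mean_length": statistics.mean(lengths),
        "median_length": statistics.median(lengths),
        "length_variance": statistics.pvariance(lengths),
        "length_stdev": statistics.pstdev(lengths),
    }


def region_enrichment(reads_in_regions, mapped_reads):
    """Percentage of mapped reads that fall into enriched regions."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    return 100.0 * reads_in_regions / mapped_reads


if __name__ == "__main__":
    # Invented regions and counts, for illustration only.
    regions = [("chr1", 0, 100), ("chr1", 200, 500), ("chr2", 50, 250)]
    print(region_length_metrics(regions))
    print(region_enrichment(reads_in_regions=250, mapped_reads=1000))
```
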

DACO (Data Access Compliance Office): Public dissemination

To access the raw data files, users need the correct authentication and authorization. The EGA itself does not grant access to the data; access must be applied for from the DEEP Data Access Compliance Office (DEEP-DACO). The EGA provides a personal account with access permissions to each successful applicant. Applicants must also sign the Data Access Agreement with the DEEP-DACO, which specifies how data may be stored, used and transferred once they have been downloaded from our system.

Conclusions
This report describes a coherent set of standards and policies for the DEEP consortium to follow. As the project progresses and data types evolve, the DCC will seek appropriate new measures to ensure that the data DEEP produces continue to represent a high-quality epigenomic resource.

Scheduling issues

Metadata for sequencing raw data (transfer from sequencing groups to DCC):
- A first test submission of raw data from the sequencing centers to the DCC will be initiated.
- Time: after the first runs have been performed (expected in summer 2013)

Metadata for sequencing experimental data (transfer from sequencing groups to DCC):
- An extract of the current BLUEPRINT metadata is available and will be distributed by the DAC to the sequencing centers and sample labs.
- Time: to be discussed at the Saarbrücken conference (28.6.)

Metadata for patient information (transfer from sample provider to DCC):
- Has been initiated by the DAC.
- Time: waiting for an answer from the sample provider

Metadata for raw data to EGA (for submission from DCC to EGA at EBI):
- According to EGA requirements; the procedure is similar to that of ICGC.
- A Data Access Compliance Office (DACO) is required. It will be initiated by the DEEP coordinator.
- Time: waiting for an answer from the coordinator
