Date: 09.06.2018
DEEP: Data Standards and Policies

Introduction

The German Epigenome Program (DEEP) is a consortium for interdisciplinary epigenome research in Germany and is associated with the IHEC initiative. It currently comprises 31 research centers. The aim of the project is to generate and analyze more than 70 human epigenomes from 13 tissue types in the context of metabolic, inflammatory and neurodegenerative diseases. ChIP-seq, DNase-seq, RNA-seq and bisulfite sequencing data will be processed along with genomic sequencing data. Among these experiments, bisulfite sequencing will be particularly demanding with respect to resources because of its high sequencing depth (more than 30X). The substantial volume and diversity of the generated data require the adoption of standardized data formats and experiment descriptions. Consistent quality standards are crucial for downstream data integration.


This document describes the data standards and policies used by the DEEP consortium to exchange data between producers and the Data Coordination Center (DCC) and to make the data available to the whole consortium. It proposes both detailed data exchange policies and standard data formats for data distribution.

Data flow and distribution

The six data providers in the DEEP project will submit their raw sequence data to the DCC at the DKFZ. Metadata describing the sequencing experiments, as well as quality values, will be transferred to the DCC together with the sequencing results. The DCC at the DKFZ will provide access to these data for the whole consortium. The sequencing data will be transferred (on request or automatically) to the Data Analysis Center (DAC) in Saarbrücken via an Aspera interface. Metadata describing biological samples will be transferred from the sample providers to the DCC by the data producer, in parallel with or prior to the sequencing results.


Primary analysis results will be accessible from the DCC, if feasible. Initially, access will be restricted to consortium members. Public access will be provided according to the IHEC policies (http://ihec-epigenomes.net/about/policies-and-guidelines/).
We plan to follow the BLUEPRINT recommendation and make our high-level analysis results (the ChIP-seq, DNase-seq, methylation signals and RNA-seq expression analysis) visible via a common web portal. The BioMart approach already in use for ICGC was first tried by the Spanish colleagues of BLUEPRINT (the groups of Ivo Gut from the Barcelona Supercomputing Center (BSC) and Alfonso Valencia from the Spanish National Cancer Research Center (CNIO)). BioMart failed because it did not scale up to epigenetics approaches. Consequently, the BLUEPRINT colleagues are currently developing a new database using MongoDB (a NoSQL technology). It was agreed that DEEP will wait until the MongoDB implementation provided by BLUEPRINT's Spanish colleagues is available. Afterwards, we will continue the discussion with the BLUEPRINT team to harmonize the web presentation of high-level analysis results between DEEP and BLUEPRINT.
For access to the raw data from outside the DEEP consortium, the sequencing data will additionally be transferred (automatically or on request; still to be decided) to the European Genome-phenome Archive (EGA). The EGA will provide access to the wider public under review by the DEEP Data Access Compliance Office (DACO), which still has to be established. Data access rules will follow the general guidelines from ICGC and IHEC.

Figure 1: The Data Coordination Center has a central role in the project, providing a platform for storage and reception of raw data.

Data Formats
Large-scale projects like DEEP need to specify their data formats and data standards to ensure interoperability and maximum usability of the data. DEEP supports community-driven data standards for all its data types.
Each kind of data transfer (each vertical arrow in Figure 1) requires a clearly defined data format specification.


  • Metadata for sequencing raw data (transfer from sequencing groups to DCC):
    • The format will follow the ICGC specifications.

  • Metadata for sequencing experimental data (transfer from sequencing groups to DCC):
    • The format will follow the IHEC recommendations (NIH Roadmap).

  • Metadata for primary cell lines (transfer from sample provider to DCC):
    • To be determined by the sample provider.
    • The format will follow the NIH Roadmap specification for IHEC.
    • In agreement with BLUEPRINT, we propose to use the EBI or NIH BioSample database as a guideline:
      • http://www.ncbi.nlm.nih.gov/biosample
      • http://www.ebi.ac.uk/biosamples/index.html

  • Metadata for patient information (transfer from sample provider to DCC):
    • To be defined by the sample provider.
    • Sample naming will be defined jointly by the DAC and the DCC.

  • Metadata for raw data to EGA (for submission from DCC to EGA at EBI):
    • According to EGA requirements, similar to the ICGC procedure.
    • A Data Access Compliance Office (DACO) is required.

  • Data exchange from DCC to DAC:
    • The DAC obtains access to all data gathered by the DCC.

  • Data (high-level analysis) exchange from DAC to DCC:
    • Based on the BLUEPRINT precedent.

  • High-level analysis results (for submission from DCC to the BLUEPRINT MongoDB):
    • Based on the BLUEPRINT precedent.
    • Naming conventions for high-level analysis files will be defined jointly by the DCC and the DAC.

Primary data formats

The primary data formats produced by DEEP are described in Table 1.

Experiment Type                  Data Type            Data Format
All                              Alignment            SAM/BAM
RNA-Seq                          Expression Levels    GTF
Bisulfite, DNase and ChIP-Seq    Regions              BED
Bisulfite, DNase and ChIP-Seq    Signal               WIG

Table 1: File formats used to store the different data types produced by DEEP

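
The mapping in Table 1 can also be written as a small lookup, for example to check which format to expect for an incoming file. This is only an illustrative sketch: the function name `expected_format` and the experiment labels used as keys are assumptions, not part of the DEEP specification.

```python
# Rows of Table 1 as (experiment type, data type, data format).
# "All" means the row applies to every experiment type.
TABLE_1 = [
    ("All", "Alignment", "SAM/BAM"),
    ("RNA-Seq", "Expression Levels", "GTF"),
    ("Bisulfite", "Regions", "BED"),
    ("DNase", "Regions", "BED"),
    ("ChIP-Seq", "Regions", "BED"),
    ("Bisulfite", "Signal", "WIG"),
    ("DNase", "Signal", "WIG"),
    ("ChIP-Seq", "Signal", "WIG"),
]


def expected_format(experiment_type, data_type):
    """Return the Table 1 format for this combination, or None if not listed."""
    for experiment, dtype, data_format in TABLE_1:
        if dtype == data_type and experiment in ("All", experiment_type):
            return data_format
    return None


print(expected_format("ChIP-Seq", "Signal"))
```

Combinations that Table 1 does not list, such as regions for RNA-Seq, return `None` rather than a guessed format.
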
Quality Metrics
The DCC will record a set of standard metrics to assist with quality control of both the incoming data and the primary analysis results.
All DEEP data will be subjected to three main types of quality control. First, all sequence data will be checked for quality and contamination. Second, alignments will be calculated and distributed by the DCC together with the corresponding quality statistics for QC purposes. Finally, there will be data-type-specific metrics for RNA-Seq, ChIP-Seq, DNase-Seq and BS-Seq sequencing, which are also described here.
Each sequencing center is expected to perform internal quality control prior to submission to the DCC. Once sequence reads are stored by the DCC group, they are assessed for quality with specialized software.
Sequence Quality and Contamination
FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) assesses multiple aspects of sequence quality and library diversity. These include the per-base quality across reads, the degree of sequence duplication and overrepresented sequences.

FastQ Screen (http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen) is used to check the data for contamination and to ensure that the reads map to the expected genome. Reads are mapped against the following:

  • Human genome & transcriptome
  • E. coli and yeast genomes
  • UniVec (http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html)
  • Common contaminants & PhiX

The output from both programs is available via the DCC web portal, allowing transparent comparison of sequencing quality by all interested consortium members.
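
As a rough illustration of how the two checks might be launched for one FASTQ file, the sketch below only builds the command lines. The helper names and all file paths are hypothetical, and it is an assumption that both tools are installed and configured this way at the DCC; the FastQ Screen configuration file is where the reference databases listed above would be declared.

```python
def fastqc_command(fastq, outdir):
    """Argument list for running FastQC on one FASTQ file."""
    return ["fastqc", "--outdir", str(outdir), str(fastq)]


def fastq_screen_command(fastq, outdir, conf):
    """Argument list for running FastQ Screen with a given configuration file."""
    return ["fastq_screen", "--conf", str(conf), "--outdir", str(outdir), str(fastq)]


# Hypothetical paths, shown only to illustrate the call pattern.
for command in (
    fastqc_command("sample_R1.fastq.gz", "qc/fastqc"),
    fastq_screen_command("sample_R1.fastq.gz", "qc/screen", "fastq_screen.conf"),
):
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once the tools are on the `PATH`.
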



Alignment Quality Metrics
All alignment metrics are collected from BAM files using programs from the Picard toolkit (http://picard.sourceforge.net/). The alignment quality metrics are collected using AlignmentSummaryMetrics and DuplicationSummaryMetrics. These provide statistics such as the number of reads mapped, mismatch rates, reads aligned in pairs and duplication rates (Tables 2a and 2b). These statistics will be presented in tab-delimited format on the DCC web page alongside any defined data freezes.
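
Picard writes its metrics as tab-delimited text in which a line starting with "## METRICS CLASS" precedes the column header. The following is a minimal sketch of reading such a file into dictionaries. The function name `parse_picard_metrics` and the sample values are invented for illustration, and only a few of the columns are shown.

```python
import csv
import io


def parse_picard_metrics(text):
    """Return the metrics table of a Picard metrics file as a list of dicts."""
    lines = text.splitlines()
    # The column header is the first line after the "## METRICS CLASS" marker.
    start = next(
        (i for i, line in enumerate(lines) if line.startswith("## METRICS CLASS")),
        None,
    )
    if start is None:
        raise ValueError("no '## METRICS CLASS' section found")
    table = []
    for line in lines[start + 1:]:
        if not line.strip():
            break  # a blank line ends the metrics table
        table.append(line)
    return list(csv.DictReader(io.StringIO("\n".join(table)), delimiter="\t"))


# Invented sample content; real files contain many more columns.
SAMPLE = (
    "## htsjdk.samtools.metrics.StringHeader\n"
    "# example header line\n"
    "\n"
    "## METRICS CLASS\tpicard.analysis.AlignmentSummaryMetrics\n"
    "CATEGORY\tTOTAL_READS\tPF_READS_ALIGNED\tPF_MISMATCH_RATE\n"
    "PAIR\t1000\t950\t0.004\n"
    "\n"
)

rows = parse_picard_metrics(SAMPLE)
print(rows[0]["TOTAL_READS"])
```

All values are returned as strings, so numeric columns need to be converted before they are compared or aggregated.
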
RNA-Seq Quality Metrics
Picard also has an RNA-Seq statistics utility called RnaSeqMetrics, which collects information such as the number of ribosomal bases, the number of coding bases and other coverage statistics for the RNA-Seq data set against a specific transcriptome (Table 3).
ChIP-Seq and DNase-Seq Quality Metrics
For these data sets, measures including the percentage of reads mapped to enriched regions, the mean and median region length and the region length variance will be calculated (Table 4). The ENCODE Project has also defined a number of ChIP-Seq metrics that we will measure, such as the Irreproducible Discovery Rate (IDR), the Normalised Strand Cross-correlation coefficient (NSC) and Uniquely Mappable Reads (http://genome.ucsc.edu/ENCODE/qualityMetrics.html#chipSeq). These statistics will be distributed in tab-delimited files alongside the region and signal files.
BS-Seq Quality Metrics
The primary metric measured for BS-Seq is the conversion efficiency. It is estimated using reads that map uniquely to the phage lambda reference; these should be derived from the spiked-in unmethylated lambda DNA. The conversion efficiency λ is estimated from the formula λ = 1 - (nC1 + nG2)/(nG1 + nC2), where nXy is the number of observed X bases in read y. This metric will be distributed alongside the BS-Seq data.
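
As a worked example, the sketch below evaluates the formula exactly as it is written above. The function name and the base counts are invented for illustration; how the counts are tallied from the lambda-mapped reads is not specified here and is left out.

```python
def conversion_efficiency(n_c1, n_g2, n_g1, n_c2):
    """Evaluate 1 - (nC1 + nG2) / (nG1 + nC2) for lambda-mapped reads.

    n_c1 and n_g1 are the C and G counts observed in read 1,
    n_g2 and n_c2 are the G and C counts observed in read 2.
    """
    denominator = n_g1 + n_c2
    if denominator == 0:
        raise ValueError("no bases counted in the denominator")
    return 1.0 - (n_c1 + n_g2) / denominator


# Invented counts, purely to show the arithmetic: 1 - 10 / 2000.
print(conversion_efficiency(n_c1=5, n_g2=5, n_g1=1000, n_c2=1000))
```
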
Table 2a

Metric                             Description
#Reads                             Total Number of Reads
#Noise Reads                       Number of Noise Reads
#Reads Aligned                     Number of Reads Aligned
#Aligned Bases                     Number of Aligned Bases
#High-Quality Aligned Reads        Number of Aligned Reads with Mapping Quality of 20 or higher
#High-Quality Aligned Bases        Number of Aligned Bases with Mapping Quality of 20 or higher
#High-Quality Aligned Q20 Bases    Number of High-Quality Aligned Bases with Base Quality of 20 or higher
Median Mismatches                  Median Number of Mismatches in High-Quality Reads
Mismatch Rate                      Rate of Mismatches in all Reads
In-del Rate                        Number of Indels per 100 bp
Mean Read Length                   Average Read Length
Reads Aligned in Pairs             Number of Read Pairs where both Reads aligned to the Reference
%Reads Aligned in Pairs            Percentage of Read Pairs where both Reads aligned
#Bad Cycles                        Number of Cycles where 80% of the Bases were No-Calls
Strand Balance                     Ratio of the Number of Reads Aligned to the Positive Strand to the Number of Reads Aligned to the Whole Genome
%Chimeras                          Percentage of Read Pairs where the Reads map outside the Maximum Insert Size or on different Chromosomes
%Adapter                           Percentage of Reads which failed to map to the Genome but do map to a known Adapter Sequence

Table 2b

Metric                           Description
#Unpaired Reads                  Total Number of Unpaired Reads
#Read Pairs                      Total Number of Read Pairs
#Unmapped Reads                  Total Number of Unmapped Reads
#Unpaired Read Duplicates        Number of Unpaired Reads which are Duplicated
#Read Pair Duplicates            Number of Read Pairs which are Duplicated
#Read Pair Optical Duplicates    Number of Read Pairs which are Optical Duplicates
%Duplication                     Percentage of Duplication
Estimated Library Size           Estimated Number of Unique Molecules in the Library

Table 2: Alignment Quality Metrics Collected by Picard AlignmentSummaryMetrics and DuplicationSummaryMetrics

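
To make the relationship between the counts in the duplication table concrete, the sketch below derives %Duplication from the other rows. It assumes that each read pair counts as two reads, which is our reading of how Picard reports this value; the function name and the example counts are invented.

```python
def percent_duplication(unpaired_reads, read_pairs, unpaired_dups, pair_dups):
    """Percentage of reads marked as duplicates.

    Assumption: every read pair contributes two reads to both the
    duplicate count and the total.
    """
    total_reads = unpaired_reads + 2 * read_pairs
    if total_reads == 0:
        return 0.0
    return 100.0 * (unpaired_dups + 2 * pair_dups) / total_reads


# Invented counts: (10 + 2 * 45) duplicates out of (100 + 2 * 450) reads.
print(percent_duplication(100, 450, 10, 45))
```
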


Metric                  Description
#Bases                  Number of Bases
#Aligned Bases          Number of Aligned Bases
#Ribosomal Bases        Number of Ribosomal Bases
#Coding Bases           Number of Coding Bases
#UTR Bases              Number of UTR Bases
#Intronic Bases         Number of Intronic Bases
#Intergenic Bases       Number of Intergenic Bases
%Ribosomal Bases        Percentage of Ribosomal Bases
%Coding Bases           Percentage of Coding Bases
%UTR Bases              Percentage of UTR Bases
%Intronic Bases         Percentage of Intronic Bases
%Intergenic Bases       Percentage of Intergenic Bases
%mRNA Bases             Percentage of mRNA Bases: (#UTR Bases + #Coding Bases) / #Aligned Bases
%Usable Bases           Percentage of Usable Bases: (#UTR Bases + #Coding Bases) / #Bases
Median CV Coverage      Median Coefficient of Variation of the Coverage of the 1000 most highly expressed Transcripts
Median 5' Bias          Median 5' Bias
Median 3' Bias          Median 3' Bias
Median 5' to 3' Bias    Median 5' to 3' Bias

Table 3: RNA-Seq Quality Metrics Collected by Picard RnaSeqMetrics

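
The two derived percentages in the RNA-Seq table differ only in their denominator. A small worked sketch follows, with an invented function name and invented base counts.

```python
def rnaseq_percentages(total_bases, aligned_bases, coding_bases, utr_bases):
    """Compute %mRNA Bases and %Usable Bases from the raw base counts."""
    mrna_bases = coding_bases + utr_bases
    return {
        # mRNA bases relative to the aligned bases only
        "pct_mrna_bases": 100.0 * mrna_bases / aligned_bases,
        # mRNA bases relative to all sequenced bases
        "pct_usable_bases": 100.0 * mrna_bases / total_bases,
    }


# Invented counts: 400 mRNA bases out of 800 aligned and 1000 total bases.
print(rnaseq_percentages(1000, 800, 300, 100))
```

Because the number of aligned bases cannot exceed the total, %Usable Bases is never larger than %mRNA Bases.
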


Metric                              Description
Region Enrichment                   Percentage of Reads mapping to enriched Regions
Median Region Length                Median Region Length
Mean Region Length                  Mean Region Length
Region Length Variance              Variance of the Region Length
Region Length Standard Deviation    Standard Deviation of the Region Length

Table 4: ChIP-Seq and DNase-Seq Quality Metrics

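
The region length statistics can be derived directly from a BED file of enriched regions. The sketch below is one possible way to do this and rests on two assumptions: BED coordinates are treated as 0-based and half-open, so a region's length is end minus start, and the population variance is used, since this document does not say which variance is meant. The function name and the example regions are invented.

```python
import statistics


def region_length_stats(bed_text):
    """Summarise region lengths from BED-formatted text."""
    lengths = []
    for line in bed_text.splitlines():
        # Skip blank lines and the optional header or comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        _chrom, start, end = line.split()[:3]
        lengths.append(int(end) - int(start))
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "variance": statistics.pvariance(lengths),
        "std_dev": statistics.pstdev(lengths),
    }


# Invented regions with lengths 100, 200 and 300.
EXAMPLE_BED = "chr1\t100\t200\nchr1\t500\t700\nchr2\t1000\t1300\n"
print(region_length_stats(EXAMPLE_BED))
```

Input without any regions raises `statistics.StatisticsError`, which a caller would need to handle for samples with no called regions.
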

DACO (Data Access Compliance Office): Public dissemination

Users need the correct authentication and authorization to access the raw data files. The EGA itself does not grant access to the data; access must be applied for from the DEEP Data Access Compliance Office (DEEP-DACO). The EGA provides each successful applicant with a personal account carrying the appropriate access permissions. Applicants must also sign the Data Access Agreement with the DEEP-DACO, which specifies how data may be stored, used and transferred once it has been downloaded from our system.


Conclusions
This report describes a coherent set of standards and policies for the DEEP consortium to follow. As the project progresses and data types evolve, the DCC will seek appropriate new measures to ensure that the data DEEP produces continue to represent a high-quality epigenomic resource.

Scheduling issues


  • Metadata for sequencing raw data (transfer from sequencing groups to DCC):
    • A first test submission of raw data from the sequencing centers to the DCC will be initiated.
    • Time: after the first runs have been performed (expected in summer 2013)

  • Metadata for sequencing experimental data (transfer from sequencing groups to DCC):
    • The extract of the current BLUEPRINT metadata is available and will be distributed by the DAC to the sequencing centers and sample labs.
    • Time: to be discussed at the Saarbrücken conference (28.6.)

  • Metadata for patient information (transfer from sample provider to DCC):
    • Has been initiated by the DAC.
    • Time: waiting for an answer from the sample provider

  • Metadata for raw data to EGA (for submission from DCC to EGA at EBI):
    • According to EGA requirements, similar to the ICGC procedure.
    • A Data Access Compliance Office (DACO) is required. It will be initiated by the DEEP coordinator.
    • Time: waiting for an answer from the coordinator
