Use Cases from NBD (NIST Big Data) Requirements WG 0


Healthcare and Life Sciences



Date: 21.06.2017


Healthcare and Life Sciences
NBD(NIST Big Data) Requirements WG Use Case Template Aug 11 2013

Use Case Title

Genomic Measurements

Vertical (area)

Healthcare

Author/Company/Email

Justin Zook/NIST/jzook@nist.gov

Actors/Stakeholders and their roles and responsibilities

NIST/Genome in a Bottle Consortium – public/private/academic partnership

Goals

Develop well-characterized Reference Materials, Reference Data, and Reference Methods needed to assess performance of genome sequencing


Use Case Description

Integrate data from multiple sequencing technologies and methods to develop highly confident characterization of whole human genomes as Reference Materials, and develop methods to use these Reference Materials to assess performance of any genome sequencing run
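As a rough illustration of what assessing a sequencing run against a Reference Material could look like, the sketch below compares a run's variant calls with a set of reference calls and reports precision and recall. The variant tuple layout, the function name `benchmark_calls`, and the example data are assumptions made for this example and are not taken from the Consortium's methods. Real comparisons also have to deal with issues such as differing variant representations, which this sketch ignores.

```python
from typing import NamedTuple, Set, Tuple

# Assumed representation: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


class Benchmark(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    recall: float


def benchmark_calls(reference: Set[Variant], test: Set[Variant]) -> Benchmark:
    """Compare a run's calls with reference calls by exact tuple match."""
    tp = len(reference & test)   # called in the run and present in the reference
    fp = len(test - reference)   # called in the run but absent from the reference
    fn = len(reference - test)   # present in the reference but missed by the run
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return Benchmark(tp, fp, fn, precision, recall)


# Made-up example data.
reference_calls = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 500, "G", "A"),
}
run_calls = {
    ("chr1", 1000, "A", "G"),
    ("chr2", 500, "G", "A"),
    ("chr2", 900, "T", "C"),
}

result = benchmark_calls(reference_calls, run_calls)
print(result)
```

In this toy case two of the three run calls match the reference and one reference call is missed, so precision and recall are both two thirds.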



Current Solutions

Compute(System)

72-core cluster for our NIST group, collaboration with >1000 core clusters at FDA, some groups are using cloud

Storage

~40TB NFS at NIST, PBs of genomics data at NIH/NCBI

Networking

Varies. Significant I/O intensive processing needed

Software

Open-source sequencing bioinformatics software from academic groups (UNIX-based)

Big Data Characteristics




Data Source (distributed/centralized)

Sequencers are distributed across many laboratories, though some core facilities exist.

Volume (size)

40TB NFS is full, will need >100TB in 1-2 years at NIST; Healthcare community will need many PBs of storage

Velocity (e.g. real time)

DNA sequencers can generate ~300GB compressed data/day. Velocity has increased much faster than Moore’s Law
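To put the volume and velocity figures side by side, the following back-of-the-envelope sketch estimates how quickly a sequencer producing ~300GB/day would fill 40TB. It assumes decimal units (1 TB = 1000 GB) and a sequencer running continuously at the quoted rate. Both are simplifying assumptions, and the function names are made up for this example.

```python
GB_PER_TB = 1000  # assumption: decimal units


def days_to_fill(capacity_tb: float, rate_gb_per_day: float, sequencers: int = 1) -> float:
    """Days until storage of the given capacity is full at a constant data rate."""
    return capacity_tb * GB_PER_TB / (rate_gb_per_day * sequencers)


def yearly_output_tb(rate_gb_per_day: float, sequencers: int = 1, days: int = 365) -> float:
    """Total data produced over a year at a constant data rate, in TB."""
    return rate_gb_per_day * sequencers * days / GB_PER_TB


fill_days = days_to_fill(40, 300)
per_year = yearly_output_tb(300)
print(round(fill_days, 1))  # 133.3
print(round(per_year, 1))   # 109.5
```

Under these assumptions a single sequencer would fill 40TB in roughly four and a half months and produce about 110TB of compressed data per year.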

Variety (multiple datasets, mashup)

File formats not well-standardized, though some standards exist. Generally structured data.

Variability (rate of change)

Sequencing technologies have evolved very rapidly, and new technologies are on the horizon.

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues)

All sequencing technologies have significant systematic errors and biases, which require complex analysis methods and combining multiple technologies to understand, often with machine learning
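One very simple way to picture combining multiple technologies is a support-count vote: keep a variant only if enough independent technologies report it, and set the rest aside for closer analysis. The sketch below shows that idea only. It is not the Consortium's integration method, which the text describes as complex and often based on machine learning. The technology labels, the `min_support` parameter, and the data are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# Assumed representation: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def consensus_calls(
    calls_by_technology: Dict[str, Set[Variant]], min_support: int
) -> Tuple[Set[Variant], Set[Variant]]:
    """Split variants into those seen by at least min_support technologies and the rest."""
    support: Counter = Counter()
    for calls in calls_by_technology.values():
        # Each technology contributes at most one vote per variant because calls is a set.
        support.update(calls)
    confident = {variant for variant, votes in support.items() if votes >= min_support}
    discordant = set(support) - confident
    return confident, discordant


v1 = ("chr1", 1000, "A", "G")
v2 = ("chr1", 2000, "C", "T")
v3 = ("chr2", 500, "G", "A")

# Hypothetical technology labels and made-up calls.
calls = {
    "tech_a": {v1, v2},
    "tech_b": {v1, v3},
    "tech_c": {v1, v2},
}

confident, discordant = consensus_calls(calls, min_support=2)
print(sorted(confident))
print(sorted(discordant))
```

A plain vote like this cannot correct for systematic errors shared by several technologies, which is one reason a more careful analysis is needed in practice.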

Visualization

“Genome browsers” have been developed to visualize processed data

Data Quality

Sequencing technologies and bioinformatics methods have significant systematic errors and biases

Data Types

Mainly structured text

Data Analytics

Processing of raw data to produce variant calls. Also, clinical interpretation of variants, which is now very challenging.

Big Data Specific Challenges (Gaps)

Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

Big Data Specific Challenges in Mobility

Physicians may need access to genomic data on mobile platforms

Security & Privacy Requirements

Sequencing data in health records or clinical research databases must be kept secure/private, though our Consortium data is public.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

I have included some generalizations to medical genome sequencing above, but the focus is on NIST/Genome in a Bottle Consortium work. Currently, labs doing sequencing range from small to very large. Future data could include other ‘omics’ measurements, which could be even larger than DNA sequencing.

More Information (URLs)

Genome in a Bottle Consortium: www.genomeinabottle.org


Note:

Healthcare and Life Sciences
NBD(NIST Big Data) Requirements WG Use Case Template Aug 11 2013

Use Case Title

Comparative analysis for metagenomes and genomes

Vertical (area)

Scientific Research: Genomics

Author/Company/Email

Ernest Szeto / LBNL / eszeto@lbl.gov

Actors/Stakeholders and their roles and responsibilities

Joint Genome Institute (JGI) Integrated Microbial Genomes (IMG) project. Heads: Victor M. Markowitz, and Nikos C. Kyrpides. User community: JGI, bioinformaticians and biologists worldwide.

Goals

Provide an integrated comparative analysis system for metagenomes and genomes. This includes interactive Web UI with core data, backend precomputations, batch job computation submission from the UI.


Use Case Description

Given a metagenomic sample, (1) determine the community composition in terms of other reference isolate genomes, (2) characterize the function of its genes, (3) begin to infer possible functional pathways, (4) characterize similarity or dissimilarity with other metagenomic samples, (5) begin to characterize changes in community composition and function due to changes in environmental pressures, (6) isolate sub-sections of data based on quality measures and community composition.
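Step (4), characterizing similarity or dissimilarity between metagenomic samples, can be pictured with a standard ecological measure such as Bray-Curtis dissimilarity computed on relative taxon abundances. The sketch below is an illustration under that assumption. The source does not say which measure IMG uses, and the taxon names and counts are made up.

```python
from typing import Dict


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw per-taxon counts to fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counts")
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for no shared taxa."""
    taxa = set(a) | set(b)
    numerator = sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in taxa)
    denominator = sum(a.get(t, 0.0) + b.get(t, 0.0) for t in taxa)
    return numerator / denominator if denominator else 0.0


# Made-up counts of reads assigned to reference taxa in two samples.
sample_1 = {"taxon_A": 60, "taxon_B": 30, "taxon_C": 10}
sample_2 = {"taxon_A": 20, "taxon_B": 30, "taxon_D": 50}

profile_1 = relative_abundance(sample_1)
profile_2 = relative_abundance(sample_2)
dissimilarity = bray_curtis(profile_1, profile_2)
print(dissimilarity)
```

For these two made-up samples the dissimilarity is approximately 0.5: half of the relative abundance is shared between the two profiles.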

Current Solutions

Compute(System)

Linux cluster, Oracle RDBMS server, large memory machines, standard Linux interactive hosts

Storage

Oracle RDBMS, SQLite files, flat text files, Lucy (a version of Lucene) for keyword searches, BLAST databases, USEARCH databases

Networking

Provided by NERSC

Software

Standard bioinformatics tools (BLAST, HMMER, multiple alignment and phylogenetic tools, gene callers, sequence feature predictors…), Perl/Python wrapper scripts, Linux Cluster scheduling

Big Data Characteristics




Data Source (distributed/centralized)

Centralized.

Volume (size)

50 TB

Velocity (e.g. real time)

Front end web UI must be real time interactive. Back end data loading processing must keep up with exponential growth of sequence data due to the rapid drop in cost of sequencing technology.

Variety (multiple datasets, mashup)

Biological data is inherently heterogeneous, complex, structural, and hierarchical. One begins with sequences, followed by features on sequences, such as genes, motifs, regulatory regions, followed by organization of genes in neighborhoods (operons), to proteins and their structural features, to coordination and expression of genes in pathways. Besides core genomic data, new types of “Omics” data such as transcriptomics, methylomics, and proteomics describing gene expression under a variety of conditions must be incorporated into the comparative analysis system.

Variability (rate of change)

The sizes of metagenomic samples can vary by several orders of magnitude, from several hundred thousand genes to a billion genes (e.g., the latter in a complex soil sample).

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues)

Metagenomic sampling science is currently preliminary and exploratory. Procedures for evaluating the assembly of highly fragmented data in raw reads are better defined, but this is still an open research area.

Visualization

Interactive speed of the web UI on very large data sets is an ongoing challenge. Web UIs still seem to be the preferred interface for most biologists. They are used for basic querying and browsing of data. More specialized tools may be launched from them, e.g. for viewing multiple alignments. The ability to download large amounts of data for offline analysis is another requirement of the system.

Data Quality

Improving the quality of metagenomic assembly is still a fundamental challenge. Improving the quality of reference isolate genomes, in terms of coverage in the phylogenetic tree, gene calling, and functional annotation, is a more mature process, but an ongoing project.

Data Types

Cf. above on “Variety”

Data Analytics

Descriptive statistics, statistical significance in hypothesis testing, discovery of new relationships, and data clustering and classification are standard parts of the analytics. The less quantitative part includes the ability to visualize structural details at different levels of resolution. Data reduction, removing redundancies through clustering, and more abstract representations, such as representing a group of highly similar genomes in a pangenome, are all strategies for both data management and analytics.
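Removing redundancies through clustering can be sketched as a greedy pass in which each sequence either joins the first sufficiently similar representative or becomes a new representative itself. The version below measures similarity as Jaccard overlap of k-mer sets. That measure, the threshold, the function names, and the sequences are assumptions chosen for brevity, not a description of the tools used by IMG.

```python
from typing import Dict, List, Set


def kmers(sequence: str, k: int) -> Set[str]:
    """Set of all overlapping substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(sequences: Dict[str, str], k: int, threshold: float) -> Dict[str, List[str]]:
    """Map each representative name to the names of the sequences it represents."""
    clusters: Dict[str, List[str]] = {}
    rep_kmers: Dict[str, Set[str]] = {}
    # Longest sequences first, so representatives tend to be the most complete members.
    ordered = sorted(sequences.items(), key=lambda item: len(item[1]), reverse=True)
    for name, seq in ordered:
        seq_kmers = kmers(seq, k)
        for rep, rep_set in rep_kmers.items():
            if jaccard(seq_kmers, rep_set) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[name] = [name]
            rep_kmers[name] = seq_kmers
    return clusters


# Made-up sequences: s1 and s2 differ only in the last base.
example = {
    "s1": "ATGCGTACGTTAGC",
    "s2": "ATGCGTACGTTAGG",
    "s3": "TTTTCCCCGGGGAAAA",
}

clusters = greedy_cluster(example, k=3, threshold=0.8)
print(clusters)
```

Only the representatives need to be kept for many downstream analyses, which is where the data reduction comes from. Here `s1` stands in for both `s1` and `s2`.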

Big Data Specific Challenges (Gaps)

The biggest friend for dealing with the heterogeneity of biological data is still the relational database management system (RDBMS). Unfortunately, it does not scale for the current volume of data. NoSQL solutions aim at providing an alternative. Unfortunately, NoSQL solutions do not always lend themselves to real-time interactive use or to rapid and parallel bulk loading, and sometimes have issues regarding robustness. Our current approach is ad hoc and custom, relying mainly on the Linux cluster and the file system to supplement the Oracle RDBMS. The custom solution often relies on knowledge of the peculiarities of the data, allowing us to devise horizontal partitioning schemes as well as inversion of data organization when applicable.
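A horizontal partitioning scheme of the kind mentioned above can be sketched as follows: rows are routed to one of a fixed number of partitions by hashing a key, so that a query for one key only has to scan a single partition. This is a generic in-memory illustration, not the IMG implementation. The choice of `sample_id` as the partition key, the partition count, and the record layout are all assumptions.

```python
import zlib
from collections import defaultdict
from typing import Dict, List

Record = Dict[str, str]


def partition_for(key: str, num_partitions: int) -> int:
    """Stable partition index for a key (CRC32 is deterministic across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_records(
    records: List[Record], key_field: str, num_partitions: int
) -> Dict[int, List[Record]]:
    """Route every record to the partition chosen by its key."""
    partitions: Dict[int, List[Record]] = defaultdict(list)
    for record in records:
        partitions[partition_for(record[key_field], num_partitions)].append(record)
    return partitions


def lookup(
    partitions: Dict[int, List[Record]], key_field: str, key: str, num_partitions: int
) -> List[Record]:
    """Scan only the one partition that can contain the key."""
    bucket = partitions.get(partition_for(key, num_partitions), [])
    return [record for record in bucket if record[key_field] == key]


# Hypothetical gene rows keyed by the sample they came from.
rows = [
    {"sample_id": "sample_1", "gene_id": "g1"},
    {"sample_id": "sample_2", "gene_id": "g2"},
    {"sample_id": "sample_1", "gene_id": "g3"},
    {"sample_id": "sample_3", "gene_id": "g4"},
]

NUM_PARTITIONS = 4
parts = partition_records(rows, "sample_id", NUM_PARTITIONS)
hits = lookup(parts, "sample_id", "sample_1", NUM_PARTITIONS)
print([row["gene_id"] for row in hits])
```

In a file-based setup each partition would typically be its own file or directory on the cluster file system. Knowledge of the data, such as which key most queries filter on, determines whether a scheme like this helps.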

Big Data Specific Challenges in Mobility

No special challenges. Just world wide web access.


Security & Privacy Requirements

No special challenges. Data is either public or requires standard login with password.


Highlight issues for generalizing this use case (e.g. for ref. architecture)

A replacement for the RDBMS in big data would be of benefit to everyone. Many NoSQL solutions attempt to fill this role, but have their limitations.



More Information (URLs)

http://img.jgi.doe.gov


Note:

