Use Cases from NBD (NIST Big Data) Requirements WG

Healthcare and Life Sciences

Date	21.06.2017

Healthcare and Life Sciences
NBD(NIST Big Data) Requirements WG Use Case Template Aug 11 2013

Use Case Title		Genomic Measurements
Vertical (area)		Healthcare
Author/Company/Email		Justin Zook/NIST/jzook@nist.gov
Actors/Stakeholders and their roles and responsibilities		NIST/Genome in a Bottle Consortium – public/private/academic partnership
Goals		Develop well-characterized Reference Materials, Reference Data, and Reference Methods needed to assess performance of genome sequencing
Use Case Description		Integrate data from multiple sequencing technologies and methods to develop highly confident characterization of whole human genomes as Reference Materials, and develop methods to use these Reference Materials to assess performance of any genome sequencing run
Current Solutions	Compute(System)		72-core cluster for our NIST group; collaboration with >1000-core clusters at FDA; some groups are using cloud
	Storage		~40TB NFS at NIST, PBs of genomics data at NIH/NCBI
	Networking		Varies. Significant I/O intensive processing needed
	Software		Open-source sequencing bioinformatics software from academic groups (UNIX-based)
Big Data Characteristics	Data Source (distributed/centralized)		Sequencers are distributed across many laboratories, though some core facilities exist.
	Volume (size)		The 40 TB NFS at NIST is full, and NIST will need >100 TB in 1-2 years; the healthcare community will need many PBs of storage
	Velocity (e.g. real time)		DNA sequencers can generate ~300GB compressed data/day. Velocity has increased much faster than Moore’s Law
	Variety (multiple datasets, mashup)		File formats not well-standardized, though some standards exist. Generally structured data.
	Variability (rate of change)		Sequencing technologies have evolved very rapidly, and new technologies are on the horizon.
Big Data Science (collection, curation, analysis, action)	Veracity (Robustness Issues)		All sequencing technologies have significant systematic errors and biases, which require complex analysis methods and combining multiple technologies to understand, often with machine learning
	Visualization		“Genome browsers” have been developed to visualize processed data
	Data Quality		Sequencing technologies and bioinformatics methods have significant systematic errors and biases
	Data Types		Mainly structured text
	Data Analytics		Processing of raw data to produce variant calls. Also, clinical interpretation of variants, which is now very challenging.
Big Data Specific Challenges (Gaps)		Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.
Big Data Specific Challenges in Mobility		Physicians may need access to genomic data on mobile platforms
Security & Privacy Requirements		Sequencing data in health records or clinical research databases must be kept secure/private, though our Consortium data is public.
Highlight issues for generalizing this use case (e.g. for ref. architecture)		I have some generalizations to medical genome sequencing above, but focus on NIST/Genome in a Bottle Consortium work. Currently, labs doing sequencing range from small to very large. Future data could include other ‘omics’ measurements, which could be even larger than DNA sequencing
More Information (URLs)		Genome in a Bottle Consortium: www.genomeinabottle.org
Note:
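The Volume and Velocity rows above invite a quick back-of-envelope check. The sketch below is only an illustration, not part of the use case: it takes the ~300 GB/day and 40 TB / 100 TB figures from the table, and assumes decimal units, a constant output rate, and a hypothetical `sequencers` parameter for the number of instruments writing to the same storage.

```python
# Back-of-envelope storage arithmetic. The rates and capacities come from the
# table above; the unit convention and the sequencer count are assumptions.

GB_PER_TB = 1000  # decimal units (assumption; the source does not specify)


def days_to_fill(capacity_tb: float, gb_per_day: float, sequencers: int = 1) -> float:
    """Days until `capacity_tb` of storage is consumed at a constant output rate."""
    if gb_per_day <= 0 or sequencers <= 0:
        raise ValueError("rate and sequencer count must be positive")
    return capacity_tb * GB_PER_TB / (gb_per_day * sequencers)


if __name__ == "__main__":
    # One sequencer at ~300 GB/day against a 40 TB volume.
    print(f"40 TB, 1 sequencer: {days_to_fill(40, 300):.0f} days")
    # The extra 60 TB between the current 40 TB and the projected 100 TB.
    print(f"60 TB, 1 sequencer: {days_to_fill(60, 300):.0f} days")
```

Under these assumptions a single sequencer running continuously would fill 40 TB in roughly four and a half months. The estimate counts compressed raw output only; intermediate and processed files would shorten the time further.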

Healthcare and Life Sciences
NBD(NIST Big Data) Requirements WG Use Case Template Aug 11 2013

Use Case Title		Comparative analysis for metagenomes and genomes
Vertical (area)		Scientific Research: Genomics
Author/Company/Email		Ernest Szeto / LBNL / eszeto@lbl.gov
Actors/Stakeholders and their roles and responsibilities		Joint Genome Institute (JGI) Integrated Microbial Genomes (IMG) project. Heads: Victor M. Markowitz and Nikos C. Kyrpides. User community: JGI, bioinformaticians and biologists worldwide.
Goals		Provide an integrated comparative analysis system for metagenomes and genomes. This includes interactive Web UI with core data, backend precomputations, batch job computation submission from the UI.
Use Case Description		Given a metagenomic sample, (1) determine the community composition in terms of other reference isolate genomes, (2) characterize the function of its genes, (3) begin to infer possible functional pathways, (4) characterize similarity or dissimilarity with other metagenomic samples, (5) begin to characterize changes in community composition and function due to changes in environmental pressures, (6) isolate sub-sections of data based on quality measures and community composition.
Current Solutions	Compute(System)		Linux cluster, Oracle RDBMS server, large memory machines, standard Linux interactive hosts
	Storage		Oracle RDBMS, SQLite files, flat text files, Lucy (a version of Lucene) for keyword searches, BLAST databases, USEARCH databases
	Networking		Provided by NERSC
	Software		Standard bioinformatics tools (BLAST, HMMER, multiple alignment and phylogenetic tools, gene callers, sequence feature predictors…), Perl/Python wrapper scripts, Linux Cluster scheduling
Big Data Characteristics	Data Source (distributed/centralized)		Centralized.
	Volume (size)		50 TB
	Velocity (e.g. real time)		Front end web UI must be real time interactive. Back end data loading processing must keep up with exponential growth of sequence data due to the rapid drop in cost of sequencing technology.
	Variety (multiple datasets, mashup)		Biological data is inherently heterogeneous, complex, structural, and hierarchical. One begins with sequences, followed by features on sequences, such as genes, motifs, regulatory regions, followed by organization of genes in neighborhoods (operons), to proteins and their structural features, to coordination and expression of genes in pathways. Besides core genomic data, new types of “Omics” data such as transcriptomics, methylomics, and proteomics describing gene expression under a variety of conditions must be incorporated into the comparative analysis system.
	Variability (rate of change)		The sizes of metagenomic samples can vary by several orders of magnitude, from several hundred thousand genes to a billion genes (e.g., the latter in a complex soil sample).
Big Data Science (collection, curation, analysis, action)	Veracity (Robustness Issues)		Metagenomic sampling science is currently preliminary and exploratory. Procedures for evaluating assembly of highly fragmented data in raw reads are better defined, but this remains an open research area.
	Visualization		Interactive speed of the web UI on very large data sets is an ongoing challenge. Web UIs still seem to be the preferred interface for most biologists. They are used for basic querying and browsing of data. More specialized tools may be launched from them, e.g. for viewing multiple alignments. The ability to download large amounts of data for offline analysis is another requirement of the system.
	Data Quality		Improving the quality of metagenomic assembly is still a fundamental challenge. Improving the quality of reference isolate genomes, in terms of coverage of the phylogenetic tree, gene calling, and functional annotation, is a more mature process but remains an ongoing project.
	Data Types		Cf. above on “Variety”
	Data Analytics		Descriptive statistics, statistical significance in hypothesis testing, discovery of new relationships, and data clustering and classification are standard parts of the analytics. The less quantitative part includes the ability to visualize structural details at different levels of resolution. Data reduction, removing redundancies through clustering, and more abstract representations, such as representing a group of highly similar genomes as a pangenome, are all strategies for both data management and analytics.
Big Data Specific Challenges (Gaps)		The biggest friend for dealing with the heterogeneity of biological data is still the relational database management system (RDBMS). Unfortunately, it does not scale to the current volume of data. NoSQL solutions aim to provide an alternative, but they do not always lend themselves to real-time interactive use or to rapid, parallel bulk loading, and they sometimes have robustness issues. Our current approach is ad hoc and custom, relying mainly on the Linux cluster and the file system to supplement the Oracle RDBMS. The custom solution often relies on knowledge of the peculiarities of the data, which allows us to devise horizontal partitioning schemes as well as inversion of data organization when applicable.
Big Data Specific Challenges in Mobility		No special challenges. Just world wide web access.
Security & Privacy Requirements		No special challenges. Data is either public or requires standard login with password.
Highlight issues for generalizing this use case (e.g. for ref. architecture)		A replacement for the RDBMS in big data would be of benefit to everyone. Many NoSQL solutions attempt to fill this role, but have their limitations.
More Information (URLs)		http://img.jgi.doe.gov
Note:
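The Gaps row above mentions horizontal partitioning schemes without describing them. As a rough illustration only, and not the scheme IMG actually uses, the sketch below splits gene records into shards by a stable hash of the genome ID, so that all genes of one genome stay together. The record layout and the names `GeneRecord`, `shard_for`, and `partition` are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (genome_id, gene_id, annotation).
GeneRecord = Tuple[str, str, str]


def shard_for(genome_id: str, num_shards: int) -> int:
    """Map a genome ID to a shard using a hash that is stable across runs."""
    return zlib.crc32(genome_id.encode("utf-8")) % num_shards


def partition(records: Iterable[GeneRecord], num_shards: int) -> Dict[int, List[GeneRecord]]:
    """Group records so that all genes of one genome land in the same shard."""
    shards: Dict[int, List[GeneRecord]] = defaultdict(list)
    for record in records:
        shards[shard_for(record[0], num_shards)].append(record)
    return dict(shards)


if __name__ == "__main__":
    demo = [
        ("genome_A", "gene_1", "kinase"),
        ("genome_B", "gene_2", "transporter"),
        ("genome_A", "gene_3", "hypothetical protein"),
    ]
    for shard_id, rows in sorted(partition(demo, num_shards=4).items()):
        print(shard_id, [gene_id for _, gene_id, _ in rows])
```

Each shard could then be written to its own flat file or SQLite file and processed in parallel on the cluster. `zlib.crc32` is used instead of Python's built-in `hash` because the latter is randomized per process for strings, which would move genomes between shards from one run to the next.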
