Census 2010 and 2000 – Title 13 Big Data; Vivek Navale & Quyen Nguyen, NARA
National Archives and Records Administration (NARA) Accession, Search, Retrieve, Preservation; Vivek Navale & Quyen Nguyen, NARA
Commercial
Cloud Eco-System, for Financial Industries (Banking, Securities & Investments, Insurance) transacting business within the United States; Pw Carey, Compliance Partners, LLC
Mendeley – An International Network of Research; William Gunn, Mendeley
IaaS (Infrastructure as a Service) Big Data Business Continuity & Disaster Recovery (BC/DR) Within A Cloud Eco-System; Pw Carey, Compliance Partners, LLC
Cargo Shipping; William Miller, MaCT USA
Materials Data for Manufacturing; John Rumble, R&R Data Services
Simulation driven Materials Genomics; David Skinner, LBNL
Healthcare and Life Sciences
Electronic Medical Record (EMR) Data; Shaun Grannis, Indiana University
Pathology Imaging/digital pathology; Fusheng Wang, Emory University
Genomic Measurements; Justin Zook, NIST
Comparative analysis for metagenomes and genomes; Ernest Szeto, LBNL (Joint Genome Institute)
Individualized Diabetes Management; Ying Ding, Indiana University
Statistical Relational Artificial Intelligence for Health Care; Sriraam Natarajan, Indiana University
World Population Scale Epidemiological Study; Madhav Marathe, Stephen Eubank or Chris Barrett, Virginia Tech
Social Contagion Modeling for Planning, Public Health and Disaster Management; Madhav Marathe or Chris Kuhlman, Virginia Tech
Biodiversity and LifeWatch; Wouter Los, Yuri Demchenko, University of Amsterdam
Deep Learning and Social Media
Large-scale Deep Learning; Adam Coates, Stanford University
Organizing large-scale, unstructured collections of consumer photos; David Crandall, Indiana University
Truthy: Information diffusion research from Twitter Data; Filippo Menczer, Alessandro Flammini, Emilio Ferrara, Indiana University
CINET: Cyberinfrastructure for Network (Graph) Science and Analytics; Madhav Marathe or Keith Bisset, Virginia Tech
NIST Information Access Division analytic technology performance measurement, evaluations, and standards; John Garofolo, NIST
The Ecosystem for Research
DataNet Federation Consortium DFC; Reagan Moore, University of North Carolina at Chapel Hill
The ‘Discinnet process’, metadata <-> big data global experiment; P. Journeau, Discinnet Labs
Highlight issues for generalizing this use case (e.g. for ref. architecture)
More Information (URLs)
Note: No proprietary or confidential information should be included
Government Operation
NBD (NIST Big Data) Requirements WG Use Case Template
Use Case Title
Big Data Archival: Census 2010 and 2000 – Title 13 Big Data
Vertical (area)
Digital Archives
Author/Company/Email
Vivek Navale & Quyen Nguyen (NARA)
Actors/Stakeholders and their roles and responsibilities
NARA’s Archivists
Public users (after 75 years)
Goals
Preserve data over the long term in order to provide access and perform analytics after 75 years.
Use Case Description
Maintain data “as-is”. No access and no data analytics for 75 years.
Preserve the data at the bit-level.
Perform curation, which includes format transformation if necessary.
Provide access and analytics after nearly 75 years.
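Bit-level preservation is commonly checked with fixity digests: a checksum is recorded for every file at ingest and recomputed periodically to detect silent corruption or loss. The following is a minimal sketch of that idea, not a description of NARA's actual system. The function names (`sha256_of`, `build_manifest`, `verify_manifest`) and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest at ingest time."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose digest has changed."""
    failures = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel)
    return failures
```

In such a scheme the manifest would be stored separately from the data and `verify_manifest` would be run on a schedule. Any path it returns would need to be restored from another copy, for example a tape replica.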
Current Solutions
Compute(System)
Linux servers
Storage
NetApp storage, magnetic tapes.
Networking
Software
Big Data Characteristics
Data Source (distributed/centralized)
Centralized storage.
Volume (size)
380 Terabytes.
Velocity (e.g. real time)
Static.
Variety (multiple datasets, mashup)
Scanned documents
Variability (rate of change)
None
Big Data Science (collection, curation, analysis, action)
Veracity (Robustness Issues)
Cannot tolerate data loss.
Visualization
TBD
Data Quality
Unknown.
Data Types
Scanned documents
Data Analytics
Only after 75 years.
Big Data Specific Challenges (Gaps)
Preserve data for a long time scale.
Big Data Specific Challenges in Mobility
TBD
Security & Privacy Requirements
Title 13 data.
Highlight issues for generalizing this use case (e.g. for ref. architecture)
More Information (URLs)
Government Operation
NBD (NIST Big Data) Requirements WG Use Case Template
Use Case Title
National Archives and Records Administration (NARA) Accession, Search, Retrieve, Preservation
Vertical (area)
Digital Archives
Author/Company/Email
Quyen Nguyen & Vivek Navale (NARA)
Actors/Stakeholders and their roles and responsibilities
Agencies’ Records Managers
NARA’s Records Accessioners
NARA’s Archivists
Public users
Goals
Accession, Search, Retrieval, and Long-term Preservation of Big Data.
Use Case Description
Take physical and legal custody of the data. In the future, if the data reside in the cloud, taking physical custody should not require transferring big data from cloud to cloud or from cloud to data center.
Pre-process data: scan for viruses, identify file formats, and remove empty files
Index
Categorize records (sensitive, non-sensitive, privacy data, etc.)
Transform old file formats to modern formats (e.g. WordPerfect to PDF)
E-discovery
Search and retrieve to respond to special requests
Search and retrieve of public records by public users
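The pre-processing and format-transformation steps in the list above can be sketched as follows. This is an illustrative outline only, assuming format identification by leading "magic" bytes. The names `MAGIC`, `LEGACY`, `identify_format`, and `preprocess` are hypothetical. Virus scanning is omitted, and empty files are reported rather than deleted.

```python
from pathlib import Path

# Leading byte signatures for a few common formats (illustrative, not exhaustive).
MAGIC = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),        # little-endian TIFF
    (b"MM\x00*", "tiff"),        # big-endian TIFF
    (b"\xffWPC", "wordperfect"),
]

# Formats assumed here to need transformation to a modern format.
LEGACY = {"wordperfect"}


def identify_format(path: Path) -> str:
    """Guess a file's format from its first bytes; 'unknown' if no match."""
    with path.open("rb") as handle:
        head = handle.read(16)
    for signature, name in MAGIC:
        if head.startswith(signature):
            return name
    return "unknown"


def preprocess(root: Path) -> dict:
    """Report empty files, per-file formats, and files flagged as legacy."""
    report = {"empty": [], "legacy": [], "formats": {}}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if path.stat().st_size == 0:
            report["empty"].append(rel)  # candidate for removal
            continue
        fmt = identify_format(path)
        report["formats"][rel] = fmt
        if fmt in LEGACY:
            report["legacy"].append(rel)  # candidate for transformation
    return report
```

A production workflow would more likely rely on a maintained format registry than on a hand-written signature table. The sketch only shows how the listed steps fit together in one pass over an accessioned batch.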
Current solution requires transfer of those data to a centralized storage.
In the future, those data sources may reside in different Cloud environments.
Volume (size)
Hundreds of terabytes, and growing.
Velocity (e.g. real time)
Input rate is relatively low compared to other use cases, but the trend is bursty. That is, the data can arrive in batches ranging in size from GB to hundreds of TB.
Variety (multiple datasets, mashup)
A variety of data types, both unstructured and structured: textual documents, emails, photos, scanned documents, multimedia, social networks, web sites, databases, etc.