Opportunities and Challenges in Digital Curation for Science and the Humanities
Museums often own thousands more objects than they can ever put on display. As a result, many objects sit in storage while a select few are brought out, conserved, and displayed for visitors to see up close. Traditionally, this has been the definition of curation: the selection, care, and display of objects in a cultural institution. With the explosion of digital information, the term curation has increasingly been applied to data and digital objects. Digital curation is of particular importance for research institutions. Researchers in fields from climate science to ecology to papyrology produce large volumes of digital information, which must be not only preserved but also curated, so that a vast amount of data is presented in a manner that is searchable and relevant to future students and researchers.
Digital curation can be particularly challenging when data is culled from or relevant to multiple departments within an institution, multiple sites within a large research study, or multiple institutions and disciplines. There have been many exciting developments in digital curation that will hopefully facilitate and nurture innovative research in both the sciences and the humanities. At the same time, challenges persist in managing time, the needs and priorities of researchers, and institutional goals. The question that persists across these readings is not only how research data can be preserved, but how it can be preserved meaningfully. While funding and technical issues present challenges, the fact that digital curation strives not only to preserve bytes but also to preserve understanding makes the human factors particularly significant. In order for a digital curation project to succeed, the needs and desires of researchers must be properly assessed, researchers must buy into the idea that data sharing and curation are worth the investment of time and money, and projects must be properly staffed not only with strong information managers but also with researchers.
Before one can appreciate the various challenges and goals of digital curation, it is useful to define what digital curation actually comprises. The Digital Curation Center (DCC), a UK-based institution providing expertise and assistance to research institutions, defines digital curation as an “ongoing process,” a lifecycle consisting of: the conceptualization and creation of digital objects; provision of access to these objects and selection of objects for either disposal or long-term preservation; ingestion of objects selected for long-term preservation into an archive; ongoing maintenance and reappraisal of archived objects; ensuring access to objects in the archive for relevant users; and transformation of objects into new formats, as necessary. As new digital objects are created and reappraisal of existing objects continues, this lifecycle goes on indefinitely.1
Elements of this lifecycle reappear in descriptions of digital curation projects across disciplines. However, not all institutions or research projects prioritize every point of the cycle. Some focus almost exclusively on the initial creation of digital objects with little attention to the creation of a long-term archive, while others might have a robust archive of long-term data but no well-developed system for providing access to this information. In his article about data curation at Georgia Tech, Tyler Walters writes: “Once it is determined which lifecycle steps are most critical to an institution’s scientists, then those people responsible for curation can scrutinize and test certain curation-related software components.”2 However, while understanding the researchers’ priorities is extremely important and may be helpful for devising initial goals, the entire digital curation lifecycle needs to be taken into account, particularly as many digital curation projects are in their infancy. It is important for project managers and administrators to think broadly, with a view toward an ultimate plan for preservation that covers all aspects of this lifecycle, even if not all of these steps are implemented immediately.
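The lifecycle described above, and the point that a project may cover only some of its stages, can be sketched in a few lines of Python. This is only an illustrative sketch: the stage names are paraphrases of the DCC description rather than official DCC terms, the simple wrap-around from the last stage to the first is a simplification of a lifecycle that "goes on indefinitely," and the example project is hypothetical.

```python
from enum import Enum


class Stage(Enum):
    """Lifecycle stages, paraphrased from the DCC description (names are assumptions)."""
    CREATE = "conceptualize and create digital objects"
    APPRAISE = "provide access and select objects for disposal or preservation"
    INGEST = "ingest selected objects into an archive"
    MAINTAIN = "maintain and reappraise archived objects"
    ACCESS = "ensure access for relevant users"
    TRANSFORM = "transform objects into new formats as necessary"


def next_stage(stage):
    """Return the stage that follows `stage`; the last stage wraps to the first."""
    stages = list(Stage)
    return stages[(stages.index(stage) + 1) % len(stages)]


def uncovered_stages(implemented):
    """Return, in lifecycle order, the stages a project has not yet planned for."""
    return [s for s in Stage if s not in implemented]


# Hypothetical project that creates and ingests data but does little else.
project = {Stage.CREATE, Stage.INGEST}
gaps = uncovered_stages(project)
for stage in gaps:
    print(f"not yet covered: {stage.name} ({stage.value})")
```

A checklist of this kind could help a project manager see at a glance which parts of the lifecycle still need a plan, even when researchers' immediate priorities lie elsewhere.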
Opportunities for preserving understanding through digital curation
Digital datasets in the sciences are enabling researchers to ask questions, track patterns, and analyze phenomena in ways that were not previously possible. When properly curated, scientific datasets can be used to verify past research and recreate trials. They can be shared with scholars at other institutions and research sites, where they can complement and inform those scholars' own research projects. In ecology, datasets collected over long time spans can reveal historical changes and are vital to contextualizing ecological theory, understanding current conditions, and predicting future conditions. In the National Science Foundation-sponsored Long Term Ecological Research (LTER) Network, data is collected at multiple sites, and researchers from other sites, as well as outside users, can enhance the data being gathered and create new understandings of it. For example, “a study of storms and their consequent ecological disturbance in coastal regions (Hayden, 2000) contributed to a network theme of disturbance not as an anomaly but as a natural part of an ecosystem’s development.”3
In climate science, past weather data can be combined with contemporary modeling techniques and current climate data in order to improve forecasting and climate change models. Reference data can also be accessed and used to contextualize and support newer studies. A particularly notable example of this use is in the Research Data Archive (RDA), part of the Mass Storage System data archive of the National Center for Atmospheric Research (NCAR). In financial year 2007, 100 TB of data stored in the RDA was accessed by 5400 unique users, mainly through its web pathway, and access has continued to grow. The RDA has been curated for over 40 years: 10-20 new datasets are added each year, and 100 of its 580 datasets are regularly updated.4 This is an excellent example of a successful data curation initiative that follows the digital curation lifecycle, taking in new data when relevant, updating and sustaining access to older datasets, and providing researchers access to archival data.
In addition to opportunities in the sciences, scholars in the humanities are also creating and using digital objects and datasets in their research. Some of these objects are digital versions of tools with analog precedents, such as finding aids and concordances. Others are “products of new digital research methods,” such as digital text corpora or the use of GIS and 3D technologies to image archaeological sites and ancient monuments.5 The Digital Hammurabi project from Johns Hopkins University has been a pioneer in digitization of cuneiform tablets, in the process creating digital resources including standard encoding for Cuneiform, a Java program called iClay that provides for the virtual manipulation of ancient tablets, and three-dimensional scanning technologies to allow cuneiform tablets to be fully rendered.6 Many resources in the humanities provide access to text corpora, which are essentially datasets comprised of primary texts. Examples include the Integrating Digital Papyrology project, which standardized a large corpus of Greek texts using an XML standard called EpiDoc and provides a single interface (papyri.info) through which scholars can access transcriptions, images, and metadata for papyri texts from three partner organizations, maintained by Duke University.7 This tool, which provides scholars with easy access to both transcriptions and images of the original papyri, has allowed other registered users to view and suggest edits and corrections to existing transcriptions, which are vetted by an editing team. Thus, the IDP not only serves as a resource, but it allows scholars to contribute their own knowledge to refining and augmenting the sources that are offered.8 This kind of dynamic environment, where the sharing of information between scholars allows for the improvement and enrichment of the resources for the entire research community, is one of the most exciting aspects of digital curation. 
At the same time, the dynamism of these sources also brings with it an expectation (noted in the DCC lifecycle model) for continued reevaluation and additions to digital archives.
Technical Factors in Data Curation
Reading about the various ways in which data can be used to advance research and understanding in the sciences and humanities, one can’t help but feel excited: data and modeling techniques that will allow for more accurate hurricane forecasts, 3D renderings of tablets and monuments, and more. This is a brave new world, yet myriad challenges exist in preserving this data.
Some of the challenges for digital curation are in the technical realm. At Georgia Tech both neuroscience and bioscience researchers identified long-term storage (and associated access) issues as some of the most significant impediments to long term data management: the repository services used by the bioscientists are not able to accommodate all file formats used, and accessing older data from these companies would be prohibitively expensive.9 Other technical challenges exist, as well, particularly when dealing with large, diverse datasets. The data collected through TIGGE, an initiative of the World Meteorological Organization World Weather Research Program to improve weather forecasting, is a particularly complex and illustrative example. TIGGE data is received almost constantly from 10 different international providers and archived and distributed through three separate sites. It must account for outages, power-downs, and production delays at all the different sites. The fact that this massive and complicated data is collected and made accessible in real-time is another major technical challenge.10
Funding issues accompany both the technical and human sides of digital curation, but they are particularly acute in relation to the technical needs of a digital preservation project. The funding needed for data storage is a significant issue, though this may become less pressing as storage costs continue to decrease. Funding is also an issue when it comes to what kinds of curation activities get support: often it is easier to secure funds for special projects than it is for ongoing maintenance, but it is ongoing maintenance that is crucial to the long-term relevance of curation projects. As Jacobs and Worley argue, “there needs to be a clear understanding that sustained funding is necessary to keep a curated collection viable.”11
Technical issues such as storage, back-up plans, and well designed cyberinfrastructure are crucial to successful data curation. Jacobs and Worley mention “robust storage facilities” and “backup plans” first in their list of important factors for sustainable data curation. However, the factors they cite as requirements for successful projects — “a stable commitment from people, facilities, and an institutional organization” — are ultimately human factors.12 While the technical challenges of storing and managing data throughout the lifecycle cannot be overlooked, the human aspects of data curation must be fully understood and thoughtfully addressed for even the best technical solutions to be successfully implemented.
Human Factors in Digital Curation
Understanding Researcher Needs
Digital curation initiatives must take into account the needs and desires of researchers so they can be designed in beneficial and useful ways. Walters comments that assessing faculty needs and desires often begins with “bottom-up” assessments of faculty data practices. Once these are understood, they can be evaluated in the context of the OAIS and DCC Lifecycle models with an eye toward which pieces of these models are most important to faculty. Appropriate technology programs can then be built and implemented accordingly.13 At the same time, researcher needs are just one piece of the puzzle: the larger institution may have its own needs and goals for curation initiatives that differ from those of researchers, and those need to be considered as well.
One of the biggest “human” challenges for research institutions is staffing of data curation projects. Allocation of responsibility for curating data, the skills and expertise required by those involved, the time commitment made by researchers versus information managers, and consistency in staffing are some of the biggest concerns in this area. In order for any data curation project to be successful, it must be appropriately staffed. Jacobs and Worley mention the importance of having staff educated in the scientific area of the data so that they can be literate in the topics they are working with, able to produce high quality metadata, and design systems with a good understanding of the user base. In addition, they discuss the crucial importance of consistent staffing: “No matter how well an archive is documented, a great deal of information is held by individuals who have performed the stewardship and curation work.”14 For this reason, academic libraries and subject-area librarians are natural partners for researchers in the work of digital curation. At Georgia Tech, for example, the library has been actively engaging in efforts to become involved with research data curation, including establishing a position of research data specialist, whose responsibilities include building relationships with faculty and other data management groups on campus. The library’s digital development team is also working to develop an infrastructure for data curation.15 Partnerships between researchers and librarians, particularly librarians with literacy in the subject knowledge of the area they are curating, are a promising direction for data curation programs.
While almost all can agree that appropriately educated, consistent staff is key to success, aligning the needs of information managers and researchers is not always straightforward. One issue is getting researchers to buy into the concept of sharing their data. This was initially a concern for researchers in the LTER, though interviews of scientists further along in the project suggest that most have concluded that there is “more to be gained than to be lost” in sharing data.16 However, this issue is not isolated to the LTER: when writing about Integrating Digital Papyrology, classicist Roger Bagnall comments that many scholars are concerned about their work being subsumed into the larger body of scholarship without receiving sufficient credit for it.17 Another staffing challenge is getting scholars to buy into the time commitment required for digital stewardship. Interviews with ecologists in the LTER indicated that many scientists viewed the process of data documentation as burdensome. As one scientist commented, “[scientists] realize it will probably take them 20 to 30% more time [to put out a paper] if they actually really clean the data up, figure out what it is and get it stored away properly. And some people don’t want to make that investment, other people want to but haven’t effectively been able to do it, and some people do it.”18
Proper outreach to researchers must be a part of every digital curation initiative, be it a special project or an ongoing effort by an institution: if researchers are to be willing to share their data and invest their own or their staff’s time in making it searchable and accessible to others, they must understand why these acts are important to their field and ultimately beneficial to their own work as well. This will hopefully get progressively easier as scholars see the benefits of utilizing shared data in their own work. Reward structures, argue Karasti, et al., may be necessary as well.19 Furthermore, as Walters comments, rewards or incentives may need to be offered at a higher level, to universities and institutions themselves, as an impetus for them to commit to the curation of their researchers’ data.20
Yet another human challenge exists when a digital curation project involves cooperation between multiple institutions or even simply multiple sites or large departments. Managing a wide variety of formats and many different contributors is a technical challenge, as discussed earlier with regard to the TIGGE project. However, even putting aside technical issues, human factors make these collaborations challenging as well. Often, these institutions will come with their own file formats and ways of doing business. For example, a digital curation project in the field of medieval studies to create a shared interface for digitized manuscripts ran into difficulties when each of the three participating institutions wanted the others to adopt some or all of its own system. While a collaborative solution was ultimately reached, these territorial disagreements have the potential to arrest the progress of collaborative projects and must be anticipated and addressed thoughtfully by project administrators.21
Conclusion
With so many technical and human challenges to surmount, with so many cyberinfrastructures to build, metadata schemas to develop, faculty interviews to be administered and administrative policies to be put in place, some may wonder if all of this “noise” surrounding digital curation is distracting or even detracting from research itself. Bill LeFurgy suggests that the sheer volume of digital information being generated is a stumbling block to preserving understanding. “It’s interesting to wonder,” LeFurgy writes, “if our constant generation of new content is putting down layer upon layer of info-fill that hinders our ability to remember, find and make sense of older content–even yesterday’s.”22 However, it is important to remember that throughout history, transformational technologies have wrought major changes and led to periods of uncertainty. Tom Scheinfeldt of George Mason University articulates this point beautifully:
Sometimes new tools are built to answer pre-existing questions. Sometimes. . . new questions and answers are the byproduct of the creation of new tools. Sometimes it takes a while, in which meantime tools themselves and the whiz-bang effects they produce must be the focus of scholarly attention. Like 18th century natural philosophers confronted with a deluge of strange new tools like microscopes, air pumps, and electrical machines, maybe we need time to articulate our digital apparatus, to produce new phenomena that we can neither anticipate nor explain immediately.23
In the end the trial and error--the negotiation of responsibilities, the working committees, the new technical platforms and policy development--are all steps, however halting, towards solutions, best practices, and ultimately to understanding. While human factors may create many of the challenges of digital curation, they also drive its path toward improvement. If history teaches us anything, it is that humans will always be desirous of discovering more, and that the best we can do is create the best possible conditions for future discovery. There is historical precedent for this: Hugh Cayless, a Digital Library Programmer at the Institute for the Study of the Ancient World at NYU, points out that “self-sustaining communities of interest provide the best insurance against the ravages of time.” The Vergilian corpus was saved because Vergil had several communities that used his texts. Conversely, Sappho survives only in fragments, largely because her poems did not survive a period of societal disinterest during which they were not copied, despite being kept in the Library of Alexandria. This suggests an important role for digital archivists: “facilitating communication between interested users and creating communities that care about our materials.”24 Museum curators have been doing this for centuries by putting precious objects on display in order to tell stories.
While we must be clear-eyed about the challenges of digital curation, we must also maintain a sense of openness and wonder at the possibilities for sharing knowledge that technology provides, even as we figure out the specifics of what and how.
1. When determining a digital curation policy for a research institution, should the goals and research practices of faculty be paramount, or should institutional goals (which may or may not dovetail with those of faculty) take precedence?
2. When trying to create a staff for data management, is it most important to have staff with a background in data curation, or those with a knowledge of the subject matter, if one needs to choose? If you were in charge of hiring a staff of information managers for a scientific research project, what would your job requirements be, and where would you be willing to compromise?
3. We have often spoken about the challenges of preserving not only the content of a digital object, but also the “user experience.” When establishing policies for digital curation initiatives, how important should preserving the user experience be as compared with preserving content?
4. The Georgia Tech case provides an example of a library reaching out to departments within its own institution to serve as a partner in data curation. Do you agree that libraries should serve as centers for digital curation, or does it make more sense for individual research projects and departments to have their own embedded information managers? What might be the pros and cons of each scenario?
Sources
Babeu, A. “Classics, ‘Digital Classics,’ and Issues for Data Curation.” DH Curation Guide: a community resource guide to data curation in the digital humanities. Retrieved from http://guide.dhcuration.org/researchpractices/classics/#digital_preservation_lessons_resources.
Cayless, H.A. “Ktêma es aiei: Digital Permanence from an Ancient Perspective.” In G. Bodard and S. Mahony, (Eds.), Digital Research in the Study of Classical Antiquity (part 3, chapter 8), pp. 139-150. Available from http://site.ebrary.com.proxy2.library.illinois.edu/lib/uiuc/docDetail.action?docID=10385816.
Digital Curation Center. (2010) “What is Digital Curation?” Retrieved from http://www.dcc.ac.uk/digital-curation/what-digital-curation.
Integrating Digital Papyrology. “Background and Funding: Integrating Digital Papyrology.” Retrieved from http://idp.atlantides.org/trac/idp/wiki/BackgroundAndFunding.
Jacobs, C.A. and Worley, S.J. “Data Curation in Climate and Weather: Transforming Our Ability to Improve Predictions through Global Knowledge Sharing.” The International Journal of Digital Curation, 2(4), pp. 68-79. Retrieved from http://www.ijdc.net/index.php/ijdc/article/viewFile/119/122.
Johns Hopkins University. “Digital Hammurabi.” Retrieved from http://www.jhu.edu/digitalhammurabi.
Karasti, H., Baker, K.S. & Halkola, E. (2006). “Enriching the Notion of Data Curation in E-Science: Data Managing and Information Infrastructuring in the Long Term Ecological Research (LTER) Network.” Computer Supported Cooperative Work 15(4), p. 321-558. DOI 10.1007/s10606-006-9023-2.
LeFurgy, Bill. “Persistent Paleontology: How do Stones and Bones Relate to Digital Preservation?” (10 January, 2013). Retrieved from: http://blogs.loc.gov/digitalpreservation/2013/01/persistent-paleontology-how-do-stones-and-bones-relate-to-digital-preservation/.
Walters, T. (2009). “Data Curation Program Development in U.S. Universities: The Georgia Institute of Technology Example.” The International Journal of Digital Curation 3(4), p. 83-92. Retrieved from: http://www.ijdc.net/index.php/ijdc/article/viewFile/136/153.
Waters, D.J. (2013). “Digital Humanities and the Changing Ecology of Scholarly Communications.” International Journal of Humanities and Arts Computing. pp. 13-28. DOI: 10.3366/ijhac.2013.0057.
1 Digital Curation Center. (2010) “What is Digital Curation?” Retrieved from http://www.dcc.ac.uk/digital-curation/what-digital-curation.
2 Walters, T. (2009). “Data Curation Program Development in U.S. Universities: The Georgia Institute of Technology Example.” The International Journal of Digital Curation 3(4), p. 89. Retrieved from: http://www.ijdc.net/index.php/ijdc/article/viewFile/136/153.
3 Karasti, H., Baker, K.S. & Halkola, E. (2006). “Enriching the Notion of Data Curation in E-Science: Data Managing and Information Infrastructuring in the Long Term Ecological Research (LTER) Network.” Computer Supported Cooperative Work 15(4), p. 325. Retrieved from UIUC Library.
4 Jacobs, C.A. and Worley, S.J. “Data Curation in Climate and Weather: Transforming Our Ability to Improve Predictions through Global Knowledge Sharing.” The International Journal of Digital Curation, 2(4), pp. 71-73. Retrieved from http://www.ijdc.net/index.php/ijdc/article/viewFile/119/122.
5 Babeu, A. “Classics, ‘Digital Classics,’ and Issues for Data Curation.” DH Curation Guide: a community resource guide to data curation in the digital humanities. Retrieved from http://guide.dhcuration.org/research-practices/classics/#digital_preservation_lessons_resources.
6 Babeu, A. and Johns Hopkins University, “Digital Hammurabi.” Retrieved from http://www.jhu.edu/digitalhammurabi.
7 Integrating Digital Papyrology. “Background and Funding: Integrating Digital Papyrology.” Retrieved from http://idp.atlantides.org/trac/idp/wiki/BackgroundAndFunding.
8 Waters, D.J. (2013). “Digital Humanities and the Changing Ecology of Scholarly Communications.” International Journal of Humanities and Arts Computing. pp. 16-17. doi: 10.3366/ijhac.2013.0057.
9 Walters, pp. 87-88.
10 Jacobs and Worley, pp. 75-76.
11 Jacobs and Worley, p. 77.
13 Walters, p. 89.
14 Jacobs and Worley, p. 77.
15 Walters, pp. 84-85.
16 Karasti, et al., p. 327.
18 Karasti, et al., p. 327.
19 Ibid., p. 328.
20 Walters, p. 89.
21 Waters, pp. 17-19.
22 LeFurgy, Bill. “Persistent Paleontology: How do Stones and Bones Relate to Digital Preservation?” (10 January, 2013). Retrieved from http://blogs.loc.gov/digitalpreservation/2013/01/persistent-paleontology-how-do-stones-and-bones-relate-to-digital-preservation/.
23 Tom Scheinfeldt, quoted in Waters, p. 26.
24 Cayless, H.A. “Ktêma es aiei: Digital Permanence from an Ancient Perspective.” In G. Bodard and S. Mahony, (Eds.), Digital Research in the Study of Classical Antiquity (part 3, chapter 8). Available from http://site.ebrary.com.proxy2.library.illinois.edu/lib/uiuc/docDetail.action?docID=10385816.