Earth science data result from production methods that often involve complex algorithms and tangled webs of production and validation.
Each of these characteristics influences the way data producers generate their data and the way users access and retrieve those data. Of course, the perspective of the data producers is far from identical to the perspective of the many user communities. As we explore these perspectives and their implications in this paper, we will be reminded that digital archives take a long-term view of their data, but that they also need to stay in intimate contact with the living communities that produce and use those data.
We anticipate that the producer and user communities will experience significant changes in the way they produce data, store it, search for it, and use it as they gain experience in working with data that are linked in the fashion of ‘hypertext’. The WWW pages and links that move us from web site to web site show how the familiar ‘Table of Contents’ and ‘Index’ expand when computers power these ‘search engines’ for books. Eventually the influence of these pages and links will reach into the way data producers package their ‘data products’. When we arrive at that part of the next millennium, fluid forms of data packaging will fit naturally into the world of the digital archive. However, we still have the pangs of childhood and adolescence to endure before this field is mature.
In the body of this paper, we begin with the views of the data producers. We are particularly interested in the data structures that producers currently create. These structures often reveal a great deal about the scientific producer’s ‘world view’. They also reveal the constraints that current technology imposes on data production. For example, most of the data producers in the first EOS missions produce ‘files’. One of the reasons they do so is that database approaches still raise serious concerns about data throughput. An additional concern arises from the complexity of data production and configuration management. We illustrate this complexity by showing how these file structures provide a useful ‘tree’ to index the data in the archives. Data producers have to introduce special branches in these trees to account for versions of data products that arise from validating scientific data. We also recognize that this tree structure for files and file collections corresponds to a ‘Table of Contents’ for a particular data producer’s products. Metadata that aggregates data values in the data files then plays the role of an ‘Index’ that allows users to search through data in a ‘random access’ mode.
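The ‘Table of Contents’ and ‘Index’ analogy above can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the product names, version labels, file names, and per-file summary values are hypothetical and are not taken from any EOS data product. The tree nests files by product, version, and date, so that a revalidated product becomes a new version branch, while the index bins a per-file summary value so that files can be found without walking the tree.

```python
from collections import defaultdict

# Hypothetical file records: (product, version, date, filename, summary value).
# A second version branch ("v2") stands in for reprocessing after validation.
FILES = [
    ("cloud_fraction", "v1", "1998-01-01", "cf_v1_19980101.dat", 0.42),
    ("cloud_fraction", "v1", "1998-01-02", "cf_v1_19980102.dat", 0.67),
    ("cloud_fraction", "v2", "1998-01-01", "cf_v2_19980101.dat", 0.45),
    ("radiance", "v1", "1998-01-01", "rad_v1_19980101.dat", 0.10),
]


def build_tree(files):
    """Nest files as product -> version -> date -> filename (the 'Table of Contents')."""
    tree = {}
    for product, version, date, name, _ in files:
        tree.setdefault(product, {}).setdefault(version, {})[date] = name
    return tree


def build_index(files, bin_width=0.25):
    """Map a binned per-file summary value to filenames (the 'Index')."""
    index = defaultdict(list)
    for _, _, _, name, value in files:
        index[int(value // bin_width)].append(name)
    return dict(index)


tree = build_tree(FILES)
index = build_index(FILES)

# Sequential access: follow the producer's branches down to one file.
print(tree["cloud_fraction"]["v2"]["1998-01-01"])

# 'Random access': every file whose summary value falls in bin 1 (0.25 to 0.50).
print(index[1])
```

The tree reflects how the producer organized the files, whereas the index cuts across products and versions, which is why the two structures serve different kinds of searches.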
The second section of the body of this text considers how users view what the data producers create. Users often want to search for data in ways that are quite different from the paths that data producers use. We first consider the diversity of user communities that may access Earth science data. From this discussion, we see that users often adopt an ‘object-oriented’ frame of mind, a ‘retail’ approach to data. Data producers, on the other hand, tend to look at data from a ‘wholesaler’ point of view that emphasizes uniform blocks of data that do not have many external differences. When the archivist needs to mediate between these two points of view, he may want to use several different approaches to building searchable structures. He might include producer-provided statistical summaries of the data product files. He might use secondary indices to the files. Independent investigators could build such indices. We are probably just at the beginning of an era in which this kind of ‘data scholarship’ will markedly expand the usefulness of scientific data archives.
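One way to picture a secondary index built from producer-provided statistical summaries is the sketch below. It is an illustration under stated assumptions, not a description of any archive's actual mechanism: the `FileSummary` fields, the region names, and the file names are all hypothetical. The producer's ‘wholesale’ unit is the file; the index regroups the per-file summaries by a key a user cares about (here a region) so that a ‘retail’ question can be answered without opening any data file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSummary:
    """Hypothetical producer-provided statistics for one data file."""
    filename: str
    region: str
    mean: float
    maximum: float


SUMMARIES = [
    FileSummary("granule_001.dat", "tropics", mean=291.0, maximum=305.5),
    FileSummary("granule_002.dat", "arctic", mean=251.2, maximum=268.0),
    FileSummary("granule_003.dat", "tropics", mean=289.4, maximum=299.1),
    FileSummary("granule_004.dat", "arctic", mean=255.9, maximum=274.3),
]


def build_secondary_index(summaries):
    """Group summaries by region, sorted by descending maximum within each region."""
    index = {}
    for summary in summaries:
        index.setdefault(summary.region, []).append(summary)
    for entries in index.values():
        entries.sort(key=lambda s: s.maximum, reverse=True)
    return index


def find_files(index, region, min_maximum):
    """Return filenames in a region whose recorded maximum reaches min_maximum."""
    return [s.filename for s in index.get(region, []) if s.maximum >= min_maximum]


secondary_index = build_secondary_index(SUMMARIES)

# A user-oriented question: which arctic files ever exceeded 270.0?
print(find_files(secondary_index, "arctic", 270.0))
```

Because the index is built only from summaries, an independent investigator could construct and publish it without access to the producer's processing system.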
In the third section of this text, we consider the role of digital archive centers for scientific data. Before we discuss this role, we note that the divergence between data producer views and data user views naturally leads us to consider subsetting data in files and then combining the subsets into new supersets. In print libraries, such an approach would be equivalent to allowing users to Xerox individual pages, then to excerpt the material from the pages, and finally, to create new scholarly works from interpretations of the excerpts. With subsetting and supersetting of the basic data, scientists can produce entirely new interpretations of natural phenomena. Of course, these new interpretations need peer review and extended discussion by scientists before the scientific community accepts them. The archive centers that contain the data may help by serving as guides and facilitators for this discussion and consensus-building activity. In order to fulfill this role, these centers need to:
Provide documentation about the data
Provide new mechanisms for secondary indexing and other kinds of data searches
This perspective suggests that in the long run, digital archive centers will become centers for ‘scholarly access’ to scientific data. It also suggests that the current notion of placing ‘used’ data in ‘long-term archives’ (which often appear to be viewed as ‘write-once, read-never’ data sinks in a managerial context) is incorrect. A much more useful notion is to view digital archive centers as specialized research libraries that are one component of a community of data scholars.
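The subsetting and supersetting discussed earlier in this overview, and the idea of ‘data scholarship’, can be illustrated with a small sketch. The record layout, the file names, and the latitude band used as a selection criterion are assumptions chosen for the example. Each extracted sample keeps the name of the file it came from, in the same way that a scholarly excerpt keeps its citation, so that a new superset remains traceable to the archived originals.

```python
def subset(file_record, predicate):
    """Extract the samples of one file that satisfy predicate, tagging each with its source."""
    return [
        {"source": file_record["filename"], **sample}
        for sample in file_record["samples"]
        if predicate(sample)
    ]


def superset(*subsets):
    """Merge several subsets into one collection ordered by time, then by source."""
    merged = [row for part in subsets for row in part]
    return sorted(merged, key=lambda row: (row["time"], row["source"]))


# Hypothetical archived files, each holding a few samples.
file_a = {
    "filename": "granule_a.dat",
    "samples": [
        {"time": 1, "lat": -10.0, "value": 3.1},
        {"time": 2, "lat": 5.0, "value": 2.7},
    ],
}
file_b = {
    "filename": "granule_b.dat",
    "samples": [
        {"time": 1, "lat": 8.0, "value": 4.0},
        {"time": 3, "lat": 40.0, "value": 1.2},
    ],
}


def in_tropics(sample):
    """Selection criterion assumed for the example: latitude within 23.5 degrees of the equator."""
    return -23.5 <= sample["lat"] <= 23.5


combined = superset(subset(file_a, in_tropics), subset(file_b, in_tropics))
for row in combined:
    print(row["time"], row["source"], row["value"])
```

Keeping the source with every sample is a design choice made here so that later reviewers can return to the original files when they assess an interpretation built on the superset.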