Beyond the base definition of documents, we may additionally characterize documents in terms of content, attributes, and their relation to other documents.
A document is first and foremost defined by its content. The content of a document is a symbolic manifestation used for communication. The internal representation of documents in a system is composed of many attributes. These attributes serve different functions in a system and control the nature of the document space that is defined. Finally, within a space, the relations among documents define the document space. We will look at each of these in turn.
2.2.1 Content and Document Encoding
A simple form of document content is a sequence of ASCII codes. In a graphical user interface (GUI), control codes may be added to define the presentation of content: font, type size, color, etc. In structured documents, such as those encoded in Standard Generalized Markup Language (SGML), embedded control codes known as markup are used to structure the document. The markup is generally semantically meaningful, and presentation control is kept separate. Using the SGML attribute feature, attributes may be embedded into a document as part of the markup. Generally, a markup attribute is not presented to users.
In order to correctly present the contents of a document to users, a document is typed by its encoding scheme. The encoding scheme is selected for efficiency in a particular application. A standard encoding scheme is considered mainly for interoperability.
The document’s content may consist of text and other components, including pictures and graphics. In multimedia applications that use a computer as the medium, the components may also include sound, movies, or other data forms with a temporal dimension. With a temporal dimension, the reading process is complicated by the need to synchronize temporal media.
Beyond basic temporal media, a computer-based document may integrate interaction with the reader. Hypertext may be viewed as one example of a coarse level interactive document where the content of a document depends on the path selected by the user. The content of a document may be adapted to user needs and profiles.
2.2.2 Document Attributes (Metadata)
Attributes of a document can be defined implicitly by a system or explicitly by users. Attributes may be stored within a document or in a separate record system. Attributes may be required for processing a document or may be optional and simply used as information for users.
Many attributes are driven by applications which collect information for a certain function. For example, in a file system, certain information is kept about each file, e.g. system flag, archive flag, read only flag, etc. These attributes are useful in managing the file.
Authority and access information may also be maintained by a system. Group role, user membership, and access rights information are kept for each document. The authority and access information consists of ownership information, access protections, and access rights combined with business processing rules.
In this discussion, attributes of documents are broken down into two groups: identifiers and attributes suggested by various systems. An identifier of a document is a special attribute that is unique within its scope. While an application collects attributes, it may also create secondary attributes, that is, attributes derived from other attributes.
Attributes may be classified by data type. Data can be classified as discrete or continuous and as bounded or unbounded. An enumerated data type is discrete and bounded. The set of integers is a discrete data type that may or may not be bounded. A continuous data type, such as the real numbers, has non-finite resolution. A bounded space with non-finite resolution is the dual of an unbounded space with finite resolution. In most cases, attributes of a document space are assumed to be discrete at a certain resolution. For example, while time is arguably a continuous property, most systems round time off at some discrete level. Real number computation in computers is also discrete at a fixed resolution. Most unbounded data types are also treated as bounded within a certain scope. For example, the Y2K problem was caused by “time-unbounded” data that was assumed to be bounded within a hundred-year period. File size is theoretically unbounded, but in implementation it is limited by the size of the field in the operating system's file control record that stores the file size. Whether a data type is discrete or continuous, and bounded or unbounded, affects the method of presenting the data and thus the navigation tools.
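The two simplifications described above, storing a continuous value at a fixed resolution and treating an unbounded value as bounded, can be illustrated with a minimal sketch. The function names `discretize` and `two_digit_year` are hypothetical and chosen only for this illustration; they are not taken from any particular system.

```python
import math


def discretize(value, resolution):
    """Round a continuous value down to the nearest multiple of resolution."""
    return math.floor(value / resolution) * resolution


def two_digit_year(year):
    """Keep only the last two digits of a year (a hundred-year bound)."""
    return year % 100


# A time of 125.0 seconds stored at one-minute resolution.
stored_seconds = discretize(125.0, 60)

# Under the hundred-year assumption, two different years become indistinguishable.
collision = two_digit_year(1900) == two_digit_year(2000)
```

The second function shows why the bounded assumption fails once the data leaves the assumed scope: distinct values map to the same stored value.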
The data type of an attribute may be classified as nominal, ordinal, interval, or ratio. Many presentation theories address this classification in order to assure the accurate presentation of information (Bertin, 1981; Cleveland, 1985; Spring & Jenning, 1993). The type of presentation should match the data type of the attribute. This will be discussed later.
Secondary attributes may be created in order to change the data type for presentation and convey more information. For instance, a word count may be created from document contents. The relative frequency of a word is its count divided by the total number of words. In this process, the data type of the attribute changes from nominal (each word) to ordinal (word count) to ratio (word frequency). File size, which is ordinal data, can be presented as the percentage of disk space used, that is, the file size divided by the disk size. Some derived attributes may be viewed as abstractions of attributes.
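The derivation of these secondary attributes can be sketched as follows. This is a minimal illustration, not a prescribed method: the tokenization rule (lowercased runs of letters, digits, and apostrophes) and the names `word_statistics` and `percent_of_disk` are assumptions made for the example.

```python
import re
from collections import Counter


def word_statistics(text):
    """Return (word counts, relative frequencies) derived from document content."""
    # Assumed tokenization: lowercase runs of letters, digits, and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    # Relative frequency = count of a word / total number of words.
    freqs = {word: count / total for word, count in counts.items()} if total else {}
    return counts, freqs


def percent_of_disk(file_size_bytes, disk_size_bytes):
    """Express a file size as a percentage of the disk size."""
    return 100.0 * file_size_bytes / disk_size_bytes


counts, freqs = word_statistics("the cat and the hat")
share = percent_of_disk(250, 1000)
```

Here `counts` holds the word counts and `freqs` the relative frequencies, so each step produces a new attribute computed entirely from existing ones.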
In most standards for data interchange and in many systems, the set of attributes is extensible, or there is some “other” type as an option. This feature makes a standard flexible enough to be used in unforeseen circumstances. For example, the header of a Hypertext Markup Language (HTML) document allows an application to supply any meta-information in its META element. The HTML standard allows implementing applications to use an “ignorance strategy” when they encounter unknown attributes (World Wide Web Consortium [W3C], 1998).
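One way an application might apply such an ignorance strategy to META elements is sketched below, using Python's standard `html.parser` module. The set `KNOWN_META_NAMES`, the class `MetaCollector`, and the sample markup are hypothetical; a real application would define its own list of recognized names.

```python
from html.parser import HTMLParser

# Hypothetical set of META names this application understands.
KNOWN_META_NAMES = {"author", "description", "keywords"}


class MetaCollector(HTMLParser):
    """Collect recognized META name/content pairs and skip unknown ones."""

    def __init__(self):
        super().__init__()
        self.known = {}
        self.ignored = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name is None or content is None:
            return
        if name.lower() in KNOWN_META_NAMES:
            self.known[name.lower()] = content
        else:
            # Ignorance strategy: note the unknown name and carry on.
            self.ignored.append(name)


def read_meta(html_text):
    """Parse html_text and return (known attributes, ignored names)."""
    parser = MetaCollector()
    parser.feed(html_text)
    parser.close()
    return parser.known, parser.ignored


sample = (
    '<html><head><meta name="author" content="A. Writer">'
    '<meta name="x-shelf" content="12"></head></html>'
)
known, ignored = read_meta(sample)
```

The unknown `x-shelf` entry does not cause an error; it is simply set aside, which is what allows new attributes to be introduced without breaking existing applications.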
2.2.2.1 Identifier Attributes
An identifier serves as a reference to a document. It serves to allow a user to locate and access a document's contents. An identifier is used by a system as a unique name for objects. An identifier may be generated by a system or by a user with a specific rule. It may be a combination of attributes. An identifier must be unique within its scope.
A physical document may be referred to in many ways; the document's title is commonly used. In a research paper, referenced works are referred to by their bibliographical data. Books are identified by an International Standard Book Number (ISBN), used for ordering, inventory, and marketing. The International Standard Serial Number (ISSN) identifies any serial publication, including newspapers, journals, and electronic publications. In library systems, call numbers are assigned to books. A call number may be a combination of a code for the collection, a code for the classification, a cutter number, call letters, and the publication date (Taylor, 1999). The classifications commonly used include Dewey Decimal Classification (DDC) and Library of Congress Classification (LCC). However, call numbers are not necessarily unique.
In electronic document systems, an object, document, or document fragment, is referred to by an identifier provided by the system. Identifiers such as a file name are supported by the file management system. The identifiers used in various systems will be discussed later, including the widely used Web identifier, the Uniform Resource Locator (URL).
An identifier, by itself, is a nominal datatype. Generally, there is no ordering or relationship among identifiers. Collections of attributes may be used to construct an identifier. This is similar to the concept used in relational data systems where a primary key can be an arbitrary system generated number, or a collection of attributes that make a record unique for access.
An identifier of a document indicates whether objects in a system are the same. For example, a file system considers two files to be different if they have different names, even if they have the same content. In a library context, a given identifier represents a document. In electronic media, however, a document is not always treated as one file or object. It is more likely that the notion of a document is found only in a document management system. The “one to many” or “many to many” mapping between a document and an identifier in a system leads to some conceptual problems. For many functions, such as “copy”, the basic unit of a user’s consideration may be a document. Copying a composite file containing links to other files may not result in the copying of the whole document. The copy function may be considered to apply in a file system context and not in a document context. On the other hand, editor programs present a composite file as a document. For instance, a Web page may be composed of HTML text and embedded images, which are viewed as a single page using a Web browser. When a page is saved from a WWW browser, only the single HTML file is actually kept, not the embedded image files. When the page is later viewed in the browser, the images cannot be shown. This function is being improved as WWW browser technology evolves.
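The difference between identity by identifier and identity by content can be shown with a small sketch. The file names are hypothetical, and the use of a SHA-256 digest to compare content is an assumption made for illustration rather than something a file system does itself.

```python
import hashlib
import tempfile
from pathlib import Path


def content_digest(path):
    """Return a digest that depends only on the bytes stored in the file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def same_by_identifier(path_a, path_b):
    """File-system view: two objects are the same only if their names match."""
    return Path(path_a).resolve() == Path(path_b).resolve()


def same_by_content(path_a, path_b):
    """Content view: two objects are the same if their bytes match."""
    return content_digest(path_a) == content_digest(path_b)


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "report.txt"
    second = Path(tmp) / "report-copy.txt"
    first.write_text("same words")
    second.write_text("same words")
    by_name = same_by_identifier(first, second)
    by_content = same_by_content(first, second)
```

The two files are different objects to the file system (`by_name` is false) even though a reader would regard them as the same document (`by_content` is true).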
2.2.2.2 Attributes Suggested by Various Systems
Some attributes are general and common across systems and applications, such as title, author, creation date, and size. A title or label may be used instead of an identifier to represent a document in a meaningful way. Date-time stamping is used in various forms, including the publication date in a bibliography, the file creation date in a file system, or the last modification date in a version control system. A document size is given at the byte level, but sizes may not be comparable if documents are of different types. A document size may be useful in file management; however, the number of pages in a document cannot be deduced from its file size without specifying the encoding method, paper size, presentation settings, etc.
From a historical perspective on document spaces, catalog data represents one of the earliest records of attributes of a document. Of all the catalog data record sets, the Machine-Readable Cataloging (MARC) format is the best known. Currently managed by the Library of Congress, the MARC record has been used for interchanging information about documents since the mid-1960s. The MARC standards define the fields, formats, and allowable contents of document attributes in a catalog. Recently, the Library of Congress has developed the MARC DTD, which provides an SGML representation of the MARC record (Myers, 1995; Vinson, 1999).
MARC defines a format for bibliography, authority, holdings, community information, and classification data. MARC information comes from a standard cataloging rule, standard subject heading list, and standard classification scheme.
MARC standards are complex standards requiring hundreds of pages of definition and qualification. In contrast, the Dublin Core (Weibel, Kunze, Lagoze, & Wolf, 1998) metadata element set is comprised of only fifteen elements presented in a six-page description. Initiated by the Online Computer Library Center (OCLC) and the UK Office for Library and Information Networking (UKOLN), through two workshops in 1995 and 1996, the Dublin Core has been defined for author-generated description of Web resources. The Dublin Core defines the fields and format of data in each field. Formal syntax and implementation of the Dublin Core is a part of the meta tag definitions for HTML 4.0, the Platform for Internet Content Selection (PICS), and the Resource Description Framework (RDF). Dublin Core elements are shown in Table 2.
Note that most of the data elements in MARC and Dublin Core are nominal data, which have no ordering relation and whose values cannot be compared. However, from a practical point of view, nominal data elements are treated as ordinal, in alphabetical order, for searching purposes. A few data elements are of ordinal or ratio type, e.g. date.
Table 2: Dublin Core elements
Content | Intellectual Property | Instantiation
Title | Creator | Date
Subject | Publisher | Format
Description | Contributor | Identifier
Type | Rights | Language
Source | |
Relation | |
Coverage | |
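As a rough sketch of how a record built from the elements in Table 2 might be embedded in an HTML header, the following groups the fifteen elements as in the table and emits one META element per supplied value. The `DC.` name prefix is assumed here as the naming convention, and the function name `to_meta_tags` and the sample record are hypothetical.

```python
from html import escape

# The fifteen elements, grouped as in Table 2.
DUBLIN_CORE_GROUPS = {
    "Content": ["Title", "Subject", "Description", "Type",
                "Source", "Relation", "Coverage"],
    "Intellectual Property": ["Creator", "Publisher", "Contributor", "Rights"],
    "Instantiation": ["Date", "Format", "Identifier", "Language"],
}
ALL_ELEMENTS = [name for group in DUBLIN_CORE_GROUPS.values() for name in group]


def to_meta_tags(record):
    """Render the recognized elements of record as HTML META elements.

    Keys that are not Dublin Core elements are silently skipped.
    The "DC." prefix is an assumed naming convention.
    """
    lines = []
    for element in ALL_ELEMENTS:
        if element in record:
            value = escape(str(record[element]), quote=True)
            lines.append(f'<meta name="DC.{element}" content="{value}">')
    return lines


# Hypothetical record; "Shelf" is not a Dublin Core element.
tags = to_meta_tags({"Title": "Document Spaces", "Creator": "A. Writer", "Shelf": "12"})
```

Every element is optional in this sketch, so an author can describe a resource with as few or as many of the fifteen fields as are known.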
2.2.3 Document Relations
Document space implies that documents are related to each other. This may be explicit as in hypertext links or implicit based on common attributes. Documents always have multi-dimensional relations. However, in using or presenting documents, it is not necessary to show all the relations in a multi-dimensional form. For example, in a file system, all files can be considered as an unordered set. Files may be ordered by their file sizes, resulting in a linear-ordered list. It is also convenient to present files in a hierarchical relation among directories and files. What relational attribute is used for presentation will depend on the goal of the presentation.
Implemented systems use some internal representation for a document space. Depending on the design of the system, only certain relations among the documents in the space will be captured.
Relations may be classified as follows:
- Unordered set: the relation among files in a directory, or the result of a Boolean query.
- Linearly ordered list by an attribute or measurement function: the result of a vector model query.
- Linear order plus branch and join: files ordered by version in a revision control system, e.g. RCS or SCCS.
- Hierarchical (parent-children relation): file-directory relationships.
- Network:
  - Sparse connection: links in hypertext.
  - Full connection: distance matrix.
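Each class of relation in this list suggests a different internal representation. The sketch below shows one plausible data structure per class; the file names, sizes, version numbers, and the choice of size difference as a distance measure are all hypothetical and serve only to make the structures concrete.

```python
# Hypothetical documents: name -> size in bytes.
files = {"a.txt": 300, "b.txt": 120, "c.txt": 980}

# Unordered set: membership only, no order implied.
unordered = set(files)

# Linearly ordered list: documents sorted by one attribute (size).
by_size = sorted(files, key=files.get)

# Linear order plus branch and join: each version lists its parent versions.
versions = {
    "1.1": [],
    "1.2": ["1.1"],
    "1.2.1.1": ["1.2"],          # branch
    "1.3": ["1.2", "1.2.1.1"],   # join
}

# Hierarchical: nested directories; None marks a file.
hierarchy = {"docs": {"a.txt": None, "drafts": {"b.txt": None, "c.txt": None}}}

# Network, sparse connection: adjacency list of hypertext links.
links = {"a.txt": ["b.txt"], "b.txt": ["c.txt"], "c.txt": []}

# Network, full connection: a distance for every pair of documents.
names = sorted(files)
distance = [[abs(files[x] - files[y]) for y in names] for x in names]


def count_files(tree):
    """Count the files (None leaves) in a nested directory structure."""
    return sum(1 if child is None else count_files(child) for child in tree.values())
```

The same three documents appear in every structure; what differs is which relation among them the representation makes explicit.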
In database terminology, relations are classified as one to one, one to many, and many to many. The relation may be directed or un-directed. Relations may have attributes attached to them, such as link type or value in distance matrix.