Document space may be classified in any number of ways. Surely, there is a difference between the space defined by all the documents of the World Wide Web and the space defined by a set of documents returned based on a Boolean query.
Document spaces may be divided into classes as follows:
- Partial document
- Document
- Document set
- Document collection
- Document universe
This segmentation is not ideal for our purposes but it provides some separation for the comparison of systems. It is roughly hierarchical in that each class can be considered a subset of the next. Partial documents are significant when a document is large. A document is a basic unit in a document space. A document set is a group of documents which is described by conditions for membership. The most common document set is a set of documents resulting from a query. The document collection implies a larger and more permanent set of documents organized in some systematic fashion. For example, a computer file system is a document collection. Each individual document belongs to at least one document collection. A document set may span multiple collections but is normally thought of as a subset of a single collection. As a result, only common attributes are valid in the scope of a set. The document universe is presented as the conceptualization of a space for all documents -- along the lines of Ted Nelson's Xanadu (1987).
A classification of document spaces may also distinguish between private, group, and public spaces. In comparison to a public area, a private space is usually smaller, and the owner may use personal organizational schemes to manage it. Public spaces are more likely to be managed by an administrator and organized in some standard fashion. When navigating through public spaces, knowledge of organization schemes is advantageous. On the other hand, the owners of private space have an intuitive knowledge of their space and the location of documents within it.
Document spaces may also be classified by the relations among their documents. The topologies of these relations, as described, include linear, hierarchical, and network. Presentation and navigation differ for each kind of topology.
2.4 Major Document Systems and Their Requirements
The document systems included in this review are file systems, information retrieval (IR) systems, hypertext systems including the World Wide Web, and structured document management systems. In each case, the system is discussed in terms of its generic capabilities and characteristics. Each is treated as a system focused on a document collection, except the structured document system, which naturally focuses on a structured document. Each system is discussed in terms of how it works and what attributes it makes use of.
The systems selected for review comprise the major categories of document systems. They may be enhanced and combined to support specific functions or applications. Applications involving documents include document management, digital library, and office management. Integration of systems is common and will increase in the future. All of these systems are different views of document spaces. The same document in each system will use a different attribute set.
2.4.1 File Systems
A file system is a system that provides basic mechanisms for storing and accessing data and programs on computer storage devices. A file system is a part of an operating system. File systems date back to the development of computer storage systems. The sequential file is a reflection of the property of tape devices. The development of disks was a driving force for random access file systems. This is simply to suggest that the underlying technology available defines the objects that might be created, and these in turn play a role in defining the attributes of the document space.
In a file system, a document may be a part of a single file, isomorphic with a single file, or contained within a set of files. The content of a document is normally the content of a file. In order to handle a large number of files, file systems use a directory abstraction. A directory is a container that can store files and other directories. A directory may be considered as a special type of file whose content is a list of file names and directory names. Given the above definition of a directory, it may be deduced that there is a hierarchical relation among directories. In many file systems, there is a provision for linked files. A link may be considered as a special file that contains a pointer to a file or directory. Links create a non-hierarchical structure. A link also provides a shortcut that allows a user to access a file or directory that is remote in the file system hierarchy.
In a file system, files can be referred to by file names and path names. A file name may include a file type, as in the MS DOS™ system. In a network file system, the machine name is used as part of an identifier, either internally or shown explicitly, as an extension of scope. The result is a nested, hierarchical structure.
The main functions of a file system related to document processing are creating, reading, writing, deleting, and changing the location of a file. To access a file, it is referred to by a name. File names may be absolute or relative. An absolute name provides a unique identifier within the context of a given system, while a relative file name provides an identifier relative to the current location. A default search path for a file name is applied in some conditions. For example, in the Unix system, a help file can be shown by using the “man” command; the argument for this command is a name that is matched against files found by searching through the paths listed in the MANPATH environment variable, and the first match is used. For linked files, some operations resolve the link to its destination file, e.g. a “read” operation reads the content of the file that the link points to. Other operations act differently; for example, the “remove” command removes the link file, not the file the link points to.
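The difference between operations that resolve a link and operations that act on the link itself can be illustrated with a minimal sketch. It assumes a POSIX-style system where symbolic links are permitted; the file names `report.txt` and `shortcut` are hypothetical and are created in a temporary directory.

```python
import os
import tempfile


def demo_link_semantics():
    """Show that reading follows a symbolic link but removing does not."""
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "report.txt")   # hypothetical document
        link = os.path.join(root, "shortcut")       # hypothetical link file

        with open(target, "w") as f:
            f.write("document content")
        os.symlink(target, link)                    # the link stores a pointer to the target

        with open(link) as f:                       # "read" is resolved to the destination file
            content_via_link = f.read()

        os.remove(link)                             # "remove" deletes the link, not the target
        target_still_exists = os.path.exists(target)
        link_still_exists = os.path.lexists(link)
        return content_via_link, target_still_exists, link_still_exists


result = demo_link_semantics()
```

Here `result` holds the content read through the link, followed by two flags showing that the target survives the removal while the link does not.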
File systems maintain information or attributes related to a file such as file name, size, creation date, last modification date, system flag, archive flag, read only flag, etc. The complete list of file attributes can be found in the system-specific manuals. Attributes of a file are changed after certain operations, e.g. the “last modification” date will be changed after file content is written. The file system automatically handles these updates. Note that the attributes in a file system are commonly of binary, integer, or date type.
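As a rough illustration of how such attributes are exposed and updated, the sketch below reads a few of them through Python's `os.stat` and shows that size and modification time change after a write. The helper name `file_attributes`, the file name `notes.txt`, and the choice of attributes are assumptions made for this example; the modification time is first set artificially into the past so that the automatic update is visible.

```python
import os
import stat
import tempfile
import time


def file_attributes(path):
    """Collect a few attributes the file system keeps for a file."""
    st = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size": st.st_size,                              # integer attribute
        "modified": st.st_mtime,                         # date attribute (seconds since epoch)
        "read_only": not (st.st_mode & stat.S_IWUSR),    # binary (flag) attribute
    }


def demo_attribute_update():
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "notes.txt")           # hypothetical file
        with open(path, "w") as f:
            f.write("first draft")
        old = time.time() - 1000
        os.utime(path, (old, old))                       # pretend the file was written earlier
        before = file_attributes(path)
        with open(path, "a") as f:                       # writing content updates the attributes
            f.write(", revised")
        after = file_attributes(path)
        return before, after


before, after = demo_attribute_update()
```

The program never sets the new modification time itself; the file system updates it as a side effect of the write.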
In multi-user environments, a file system also provides for access control. Users belong to groups, and a group has a defined access right to a particular file or directory. Access rights include the rights to access, read, modify, create, and delete files. The access right controls the visibility of the space for a particular user. Protected documents may not be visible, or they may be visible but not accessible. Document access rights and document protections can become very complex in document and shared file systems. Concurrency control of a file, e.g. a locking mechanism, is used to maintain file integrity when multiple users are allowed access.
Files take up storage space, and the file size records how much. A file is in reality an abstraction. For example, a file is generally viewed as a single unit that takes up a given amount of space. In reality, its physical storage may not be contiguous, and the amount of space it occupies will normally be greater than the reported file size. Creation date and last modification date are valuable for some utilities, such as the backup process and managing disk space.
In the Unix system, file space is viewed as a single hierarchical tree whether it is a combination of many physical disk devices or network connections. On the other hand, MS DOS™ separates structures based on disk device.
The Unix system also has another view of files, as byte streams of data. Rather than a static chunk of data, a file may be viewed as a stream of data that flows from a storage device, passing through programs, to a display device or another storage device. From this view, many Unix utilities are designed to process a file regardless of its size, by accepting a stream of data as an input and then delivering a stream of data as an output. This makes it possible to “pipe” data, cascading through many programs.
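The stream view can be imitated in a few lines. The sketch below is an analogy rather than the Unix mechanism itself: Python generators stand in for programs connected by pipes, and the stage names `read_lines`, `grep`, and `upper` are hypothetical. Each stage consumes one line at a time, so the cascade works regardless of the size of the input.

```python
import io


def read_lines(stream):
    """Source stage: emit lines from a stream one at a time."""
    for line in stream:
        yield line.rstrip("\n")


def grep(lines, word):
    """Filter stage: pass through only lines containing the word."""
    for line in lines:
        if word in line:
            yield line


def upper(lines):
    """Transform stage: convert each line to upper case."""
    for line in lines:
        yield line.upper()


# An in-memory stream stands in for a file on a storage device.
source = io.StringIO("alpha file\nbeta stream\ngamma file\n")

# Cascading the stages corresponds to piping one program into the next.
pipeline = upper(grep(read_lines(source), "file"))
output = list(pipeline)
```

No stage holds the whole input in memory; only the final `list` call collects the results.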
Research on File Systems
There is little empirical data about the shape and size of personal file systems. Some research suggests that a directory structure is highly varied in both width and depth.
Spasojevic (1996) studied usage of a wide-area distributed file system, the Andrew File System (AFS), which supports thousands of servers and tens of thousands of clients. A sample of servers was analyzed. The whole system at that time was estimated to hold more than 3 to 4 terabytes of data. In the sample, “user type” volumes made up 52% of all volumes but only 19% of the data size. This may indicate that private space is a smaller portion compared to public space in this kind of environment. About 40% of volumes were accessed daily, and these comprised 50% of the data size. There was an 80% chance that those volumes would be accessed the next day. This is an indication of recency of usage.
Data from protocol analysis, expressed as a percentage of total Remote Procedure Call (RPC) requests observed at servers and clients, show that 73.5% and 64.2%, respectively, are “fetch-status” requests, which are expected when users list directories. “Fetch-data” requests account for about 8.4% and 14.9%, and “store-data” requests for only 3.5% and 4.6%, at servers and clients respectively. This may indicate that the main activity is navigation via directory listings. File creations and file deletions are roughly equal at 1.5% each, which suggests that the total number of files is approximately steady with respect to these two functions.
The percentage of references to files in a foreign cell, i.e. someone else’s files, was about 5%. Of all directory modifications, 99.1% were done by previous writers. These observations indicate that file sharing is quite rare, although the author suggests that file sharing is a major benefit.
2.4.2 Information Retrieval Systems
Information retrieval systems arise from two sources. The first is databases of bibliographic or abstract data available to some special community, e.g. LEXIS. The other is an extension of card catalogs supported by computer systems, or Online Public Access Catalogs (OPACs). The main purpose of an information retrieval system is to organize and find documents. Any system must address the basic problem of how to retrieve relevant documents for a given query.
In an information retrieval system, an identifier is assigned to each record, i.e. each data record has a document identifier. A record may have a one-to-many relationship to the actual documents. The combination of title, author, and publication date is conventional and is often used in a bibliography as an identifier of a document. If document content is available electronically, the identifier of the file or Web page, e.g. a file name or URL, is used. A system may also use an internal scheme to access a document -- as would be the case for access controlled by a Database Management System (DBMS).
Parts of text, i.e. chapters or paragraphs, are considered by some researchers as a document in a collection of documents (Korfhage, 1997). One reason is that some large texts, e.g. an encyclopedia, would tend to match every query if the entire document were stored as a single record.
Historically, documents were represented in information retrieval systems by document surrogates as records in a database. Document attributes including title, author, publication date, etc. made up the record. In most systems, attributes were comparable to MARC records. Subjects or keywords were normally used to classify a document. One document could relate to many subjects. In addition, a subject could have a relationship to other subjects.
The query process matches words or values to certain record attributes, e.g. to find a document where subject is specified by a topic. Matching may be based on complex logical operations, i.e. Boolean searching. The result of a pure Boolean search is a set of documents without ordering.
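A pure Boolean match against a record attribute can be sketched with set operations. The records, titles, and subject terms below are invented for illustration, and the helper `having_subject` is hypothetical; the point is only that AND, OR, and NOT map onto set intersection, union, and difference, and that the result is an unordered set.

```python
# Hypothetical document surrogates: each record carries a set of subject terms.
records = {
    1: {"title": "Navigating hypertext", "subjects": {"hypertext", "navigation"}},
    2: {"title": "Searching hypertext", "subjects": {"hypertext", "retrieval"}},
    3: {"title": "Directory structures", "subjects": {"file systems"}},
    4: {"title": "Browsing and querying", "subjects": {"hypertext", "navigation", "retrieval"}},
}


def having_subject(subject):
    """Return the set of record identifiers classified under a subject."""
    return {doc_id for doc_id, record in records.items()
            if subject in record["subjects"]}


# hypertext AND navigation
and_result = having_subject("hypertext") & having_subject("navigation")

# "file systems" OR retrieval
or_result = having_subject("file systems") | having_subject("retrieval")

# hypertext AND NOT retrieval
not_result = having_subject("hypertext") - having_subject("retrieval")
```

Because the results are sets, nothing in them says which matching document is "better"; any ordering has to come from elsewhere.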
Current information retrieval systems may actually store the entire document as an attribute of the record. In this case, it becomes possible to form far more complex searches on this “attribute” of the record. Full text search uses words that appear in documents. Structured search involves a query that references structure and key words against words in a document, for example, searching for a word in a chapter heading.
Full text search methods may use the words in a document as a representation of the document. Inverted indices, in which words point to documents, are commonly used to improve search efficiency. Documents may be represented by vectors of word occurrences, as in the vector model (Salton & McGill, 1983). These objects populate an n-dimensional space. The retrieval process in this kind of space involves computing the “distance” between a query and the document objects. The search result is a list of documents that fall within some distance of the query, ordered by distance.
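The following sketch puts the two ideas together under stated assumptions: a toy collection of three invented documents, whitespace tokenization, raw word counts as vector components, and one minus the cosine of the angle between vectors as the "distance". None of these choices is prescribed by the cited work; they are simply one common way to instantiate the description.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy collection.
docs = {
    "d1": "file system directory file",
    "d2": "hypertext link node",
    "d3": "file link",
}


def build_inverted_index(docs):
    """Map each word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index


def cosine_distance(a, b):
    """Distance between two word-count vectors: 1 - cosine similarity."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def rank(query, docs, index, max_distance=1.0):
    """Return documents within max_distance of the query, nearest first."""
    q = Counter(query.split())
    candidates = set()
    for word in q:                       # the index limits work to documents sharing a word
        candidates |= index.get(word, set())
    scored = [(cosine_distance(q, Counter(docs[d].split())), d) for d in candidates]
    return [d for dist, d in sorted(scored) if dist <= max_distance]


index = build_inverted_index(docs)
ranked = rank("file link", docs, index)
close_only = rank("file link", docs, index, max_distance=0.5)
```

Unlike a Boolean result, `ranked` is an ordered list, and tightening `max_distance` shrinks it from the far end.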
One interesting property of a document space using the vector model is that the dimensions of the space depend on the documents in the space. The dimensions of the space are selected words that appear in the document collection. A document may be viewed in a lower dimension than the space itself if it does not contain every word in that space. Word count is usually used as the value in a word dimension. In this case, the value is discrete. Word count divided by document length is also used to compensate for differences in document size; the value is then a fraction between zero and one.
Word dimensions are not truly unrelated and orthogonal. A thesaurus is one way of describing word or term relations. A thesaurus may be derived from co-occurrence of words in a document set. A thesaurus has a complex structure of broader terms, narrower terms, and related terms (Foskett, 1980).
The vector space model may also be viewed in terms of distance measurements (Salton, Wong, et al., 1975). All pairwise distances among documents are computed, which creates a document distance matrix. In this view, documents are positioned based on their relation to each other (Olsen, Korfhage, Sochats, Spring, & Williams, 1993).
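A document distance matrix of this kind can be sketched as follows. The three documents are invented, and Euclidean distance over raw word counts is an assumption chosen for simplicity; the cited work may use a different measure.

```python
import math
from collections import Counter

# Hypothetical toy collection.
docs = {
    "d1": "file system file",
    "d2": "file system",
    "d3": "hypertext link",
}


def euclidean(a, b):
    """Euclidean distance between two word-count vectors."""
    words = set(a) | set(b)
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in words))


vectors = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
ids = sorted(vectors)

# matrix[i][j] is the distance between document ids[i] and document ids[j].
matrix = [[euclidean(vectors[i], vectors[j]) for j in ids] for i in ids]
```

The matrix is symmetric with a zero diagonal, so each document is characterized only by how far it lies from the others.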
Example of an information retrieval system
The OCLC system contained more than 37 million records and 638 million location listings of 44.5 million books and other materials in 1997, when 2.1 million records were added. An average of 3.1 million catalog and data transactions per day were requested. These records can be represented in MARC form (OCLC Annual Report 1996-1997). Because of the size of this data, an overview of all the records is difficult.
Lawrence and Giles (1999) reported that in February 1999, the Web search engines indexed about 50 to 150 million pages out of an estimated total of over 800 million indexable Web pages. Each major search engine had indexed about 16% or less of the whole. Many expected that search engines should provide better coverage because they may be the only way for some Web sites to become visible to the public. However, a search engine site manager argued that search result sets are normally large, often comprising thousands of records, and that adding more Web pages would therefore not make any difference (Brake, 1997). Most queries, probably 90%, were matched within the first one million pages, and 90% of indexed pages were never retrieved as query results. Given economic constraints, Web search engine providers changed their strategy from collecting all the Web pages to providing better information and interfaces.
In the Web environment, it is common for a search result to contain a large set of data. Navigation tools are required to examine the resulting data set, and query refinement is important for narrowing the size of the result set.
2.4.3 Hypertext Systems
In a hypertext system, a document is no longer a single integrated unit, but may consist of a network of text components. A document is no longer linear but consists of a graph of “nodes” and “links.” One may consider hypertext as a set of documents, where each path through the nodes may be considered as one document. Further, because users can choose any path when reading or can create new links, the structure of the document is both dynamic and publicly and privately extensible.
A node contains content and anchors. A link is defined as the relation between two anchors. In general, a link contains a source anchor and a destination anchor. In implementation, a link also contains source node identification and destination node identification. The scope of an anchor is bound to a node. A link may contain other attributes such as link types and directions. Links may be managed by a link manager to maintain consistency when a node is moved or deleted. More details about the concepts and implementations of hypertext systems can be found in Conklin (1987).
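The node, anchor, and link relationships described above can be sketched as a small data structure. All class, field, and method names here (`Anchor`, `Link`, `Node`, `LinkManager`, and so on) are hypothetical and do not reproduce any particular hypertext system; the sketch only shows how a link manager might refuse dangling links and drop links when a node is deleted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Anchor:
    node_id: str        # an anchor is scoped to one node
    name: str


@dataclass(frozen=True)
class Link:
    source: Anchor
    destination: Anchor
    link_type: str = "reference"    # optional link attribute


@dataclass
class Node:
    node_id: str
    content: str
    anchors: set = field(default_factory=set)   # anchor names defined in this node


class LinkManager:
    """Keeps links consistent with the nodes they refer to."""

    def __init__(self):
        self.nodes = {}
        self.links = []

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def _has_anchor(self, anchor):
        node = self.nodes.get(anchor.node_id)
        return node is not None and anchor.name in node.anchors

    def add_link(self, link):
        # Refuse links whose source or destination does not exist.
        if not (self._has_anchor(link.source) and self._has_anchor(link.destination)):
            raise ValueError("link refers to a missing node or anchor")
        self.links.append(link)

    def delete_node(self, node_id):
        # Deleting a node also removes every link that touches it.
        del self.nodes[node_id]
        self.links = [l for l in self.links
                      if node_id not in (l.source.node_id, l.destination.node_id)]


manager = LinkManager()
manager.add_node(Node("n1", "Memex overview", {"a1"}))
manager.add_node(Node("n2", "Trail example", {"b1"}))
manager.add_link(Link(Anchor("n1", "a1"), Anchor("n2", "b1")))
links_before = len(manager.links)
manager.delete_node("n2")
links_after = len(manager.links)
```

Routing every change through the manager is what keeps links from pointing at nodes that no longer exist.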
Hypertext was first envisioned by Vannevar Bush (1945). The memex (memory extension) he envisioned contained a very large library and personal notes. It was used to make links to related documents, thereby joining them into a trail. The system was optimized for scientific use and the primary goals were to support making notes and browsing the documents. Douglas Engelbart (1963) developed the first operational computer-based hypertext system, NLS (oN Line System). The statement, in a file, is a hierarchical structure that may contain any number of reference links. Statements can be displayed in each window. A viewing filter is used for selecting statements for a display. Ted Nelson (1987) is considered by many to be the spiritual father of a proposed global hypertext system -- Xanadu. Related documents would be linked together on a large scale where everything would be in a single system. Further, he envisioned a document being archived with a history of its development -- versioning.
The Dexter Hypertext Reference Model (Halasz & Schwartz, 1994) suggests that a hypertext system may be broken down into three components, the within-component layer, the storage layer, and the run-time layer. The run-time layer describes mechanisms supporting the user’s interaction with the hypertext. The presentation of the hypertext is controlled by an instantiator function that defines presentation specifications. A presentation will also include link markers, an instantiated anchor, and an active area that allows the user to traverse the link. In the model, sessions are captured as history. A component, in the model, is an atom, a link, or a composite entity made from other components. A composite entity has its own presentation specification, describing how the composite object should be presented.
The Amsterdam Hypermedia model (Hardman, Bulterman, & van Rossum, 1994) extends the Dexter model by adding the notions of time, high-level presentation attributes, and link context. Temporal information is a part of the component, which also contains synchronization information. Coarse-grained synchronization might include the relative start time of each component to be presented. High-level presentation attributes contain global attributes such as fonts and style for text. The global attributes are used as defaults but can be overridden by component specifications. Link context uses source context and destination context to specify display options.
2.4.4 The World Wide Web
The World Wide Web (WWW) originated as a distributed hypertext system. It consists of an address system (Universal Resource Identifiers: URI), a network protocol (Hypertext Transfer Protocol: HTTP), and a markup language (Hypertext Markup Language: HTML) (Berners-Lee, Cailliau, Luotonen, Nielsen, & Secret, 1994). A WWW system is composed of one or more WWW servers and one or more WWW browsers. An early graphical WWW browser, Mosaic, was able to view plain documents and pictures. It was also capable of using the Gopher and FTP protocols.
The Uniform Resource Identifier (URI) standard specifies mechanisms for uniquely identifying objects (Internet Engineering Task Force [IETF], 1998 [RFC2396]). It is currently implemented in the WWW as a Uniform Resource Locator (URL) (IETF, 1994 [RFC1738]). It is a set of URI schemes for locations of resources in the Internet. The URL standard specifies the syntax and semantics in the context of the Internet. It comprises protocol names, host Internet addresses, and internal file names. The “query operator” may be applied to a URL as a mechanism to pass state parameters through a URL.
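The parts of a URL named above can be made concrete with Python's standard `urllib.parse` module. The URL itself, including the host `www.example.org` and the `session` and `view` parameters, is a made-up example.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL carrying state parameters after the "?" query operator.
url = "http://www.example.org/docs/report.html?session=42&view=outline"

parts = urlsplit(url)
protocol = parts.scheme          # protocol name
host = parts.netloc              # host Internet address
path = parts.path                # internal file name on that host
state = parse_qs(parts.query)    # parameters passed through the URL
```

Since HTTP itself keeps no memory between requests, parameters carried in the query part are one way for an application to pass state from one page to the next.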
The HTTP protocol is stateless (IETF, 1999 [RFC2616]). HTTP 1.1 offers nine operations of which “GET” and “POST” are the most frequently used. Resources can be obtained from or stored on a server. It also provides a flexible scheme for transferring many types of data.
HTML (W3C, 1998), while considered by many to be a markup language in its own right, is in reality an instantiation of one Document Type Definition (DTD) under Standard Generalized Markup Language (SGML). It provides the syntax of markup in an HTML document. HTML specifies the syntax for specifying hypertext links. The browser is able to recognize an anchor and traverse a link embedded within an HTML document. The distinction between links and anchors is collapsed into a single anchor tag and the HREF attribute. It is a unidirectional, untyped, and direct link. (HTML version 4 proposes the capability for link types and direction.) WWW clients are required to comprehend an HTML document. Current WWW clients also have the ability to present a variety of document formats.
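How a client might recognize the anchors embedded in a page can be sketched with Python's standard `html.parser` module. The page fragment and the class name `AnchorCollector` are invented for this example; a real browser does far more than collect HREF values.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the HREF value of every anchor tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser reports tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Hypothetical page with one relative and one absolute link.
page = (
    '<html><body><p>See <a href="chapter2.html">chapter 2</a> and '
    '<A HREF="http://www.example.org/">an external site</A>.</p></body></html>'
)

collector = AnchorCollector()
collector.feed(page)
links = collector.hrefs
```

Each collected value names only a destination, which reflects the point that an HTML link is unidirectional and untyped: the source is implied by the page that contains the tag.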
According to Conklin's definition of hypertext (1987), the WWW is a weak example of hypertext. It lacks the database aspect of hypertext, lacking a node and link manager. There is nothing to prevent the dissolution of links or the creation of invalid links. The “browser,” which is used to display the network graphically for navigation, does not exist as a standard part of the WWW. Conklin suggested that an essential component of hypertext was a “browser” which provided a graphical overview of multiple nodes and links. What is commonly referred to as a browser in the WWW is simply a tool for viewing a node.
In the original design (Berners-Lee et al., 1994), a WWW server is a front end that translates a document space into HTML format. It is possible that an underlying system is a hypertext system. In this case, links within a single hypertext manager can be maintained with integrity, but such a system still lacks external link updating.
The WWW uses the concept of a page (a hypertext node). A document is not specifically defined. It may be a single page or a set of pages. Multiple documents could be included on one page. It may be determined by a link to a start page. Most components, e.g. server, client, and search engine, use a page as a basic unit for service since it can be pointed to by a URL. The HTML standard is flexible; metadata may be used to describe a set of pages as a document.
Relationships of pages are explicitly defined by links as in any hypertext system. The frame feature in HTML creates a complex relation between pages allowing new kinds of implicit relations. On the presentation level, frames create an effect of state. The view is dependent on which combinations of nodes are used to fill a frame.
In general, hypertext systems use proprietary mechanisms for identifying nodes. In the WWW, the URI is a standard that is used for identification of a page. The URL, one URI scheme, also uses a file name in combination with a host’s Internet name. A URL is allowed to use a relative file name, combined with a “base” parameter as a current path name.
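Resolution of a relative name against a base can be illustrated with `urljoin` from Python's standard library. The base URL and the relative references are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical base: the URL of the page in which the relative references appear.
base = "http://www.example.org/docs/chapter1/index.html"

sibling = urljoin(base, "figures/fig1.html")             # relative to the current path
parent = urljoin(base, "../chapter2/index.html")         # climbs one level in the hierarchy
rooted = urljoin(base, "/about.html")                    # relative to the host only
absolute = urljoin(base, "http://other.example.net/x.html")  # already absolute; base is ignored
```

The behavior mirrors relative and absolute file names in a file system, with the host name acting as an outer level of the hierarchy.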
One major problem in WWW applications is that URLs are sometimes void -- the result of the destination page pointed to by a URL being moved or deleted, or never having existed. This looseness may not be acceptable for applications such as digital library collections or a research paper referring to WWW resources. The Internet Engineering Task Force (IETF) is currently working on Uniform Resource Names (URNs) (IETF, 1994 [RFC1737]; IETF, 1997 [RFC2141]), which are intended to serve as persistent, location-independent resource identifiers. A URN will be unique within the URN namespace. Once a URN is defined, it will not be changed. OCLC, working alongside the IETF's URN effort, has developed Persistent URLs (PURLs) as an intermediate solution. PURLs are supported by a PURL resolution service, which maps a PURL to an actual URL. The URL to which a PURL resolves can be changed, but the PURL cannot. (Salton, Allan, Buckley, & Singhal, 1996)
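The indirection behind a persistent identifier can be sketched as a lookup table. This is not OCLC's resolution service or its interface; the class `PurlResolver`, its method names, and the example identifiers are all hypothetical, and a real resolver answers over HTTP rather than through a method call.

```python
class PurlResolver:
    """Toy resolver: a persistent name maps to a changeable target URL."""

    def __init__(self):
        self._targets = {}

    def register(self, purl, url):
        # The persistent name is fixed once it has been assigned.
        if purl in self._targets:
            raise ValueError("persistent name is already assigned")
        self._targets[purl] = url

    def retarget(self, purl, new_url):
        # Only the target may change, e.g. when the page moves to another host.
        if purl not in self._targets:
            raise KeyError(purl)
        self._targets[purl] = new_url

    def resolve(self, purl):
        return self._targets[purl]


resolver = PurlResolver()
name = "purl.example.org/docs/report"                       # hypothetical persistent name
resolver.register(name, "http://host-a.example.org/report.html")
first = resolver.resolve(name)

resolver.retarget(name, "http://host-b.example.org/archive/report.html")
second = resolver.resolve(name)
```

A citation that uses the persistent name keeps working after the move, because only the table entry changes.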
The Resource Description Framework (RDF) (W3C, 1999) defines data models and syntax specification, in XML, of Web resource metadata for data interchange and application interoperability. RDF addresses metadata needs for many applications, including sitemaps, content ratings, search engine data collection, digital library collections, and distributed authoring. For example, Cooperative Online Resource Catalog (CORC) is a research project under OCLC, which uses RDF for cataloging of Web resources.
The WWW was originally designed to both read and write documents. Currently, however, the major use of the WWW is reading only. Research is being conducted and commercial products are being developed to support collaborative authoring, distributed annotation, and document management. These systems include an authoring scheme and versioning for editing WWW contents. Many features were added to HTML versions 3 and 4 to support a variety of interactions for WWW clients. These include applets, intrinsic event declaration, and scripting. With add-on technology and improvements in browsers, current WWW content may also include programs. As a result, the WWW interface is equivalent to an interactive program, not only a text viewer.
According to Nielsen (Myers, 1993), the Web will grow to 200 million sites in the year 2003, an exponential growth rate from approximately 4 million sites in early 1999. As reported by Lawrence and Giles (1999), in February 1999, the estimated number of Web servers was 2.8 million. Estimates of the number of sites vary because different estimation methods are used. Based on a sampling of the number of pages on thousands of servers, the mean number of Web pages per server was about 300, and the distribution was very skewed. The estimated total was about 800 million Web pages.
A comprehensive summary of WWW data can be found in Albers (1997). The summary includes a characterization of clients, proxies and gateways, servers, and the WWW as a whole. The two studies of Web page characteristics show mean page sizes of 4.4 KB and 6.5 KB, a median of 2 KB, and a very high deviation with a very long tail. Over 50% of pages contain more than one image. The HTML format is used in over 76%, and nearly 95% of HTML documents had the HREF attribute, occurring an average of 14 times per document, which is an indication of the number of links (Woodruff, Aoki, Brewer, Gauthier, & Rowe, 1996). The number of links between sites was small: nearly 80% of sites had no links to other sites, and 80% of sites had 1-10 links pointing to them. This indicates that only a small number of major sites contributed to navigation to other sites. The life span of documents was around 50 days before modification or disappearance. The following usage of the Web was reported: on the client side, 75% of requests were not local; on the server side, 70% of requested files and 60% of requested data came from remote sites. There are notions of popularity in usage: requested files showed a Zipf distribution in both client usage and requested files from servers, and 25% of sites were responsible for 80 to 95% of accesses.
2.4.5 Structured Documents
A document may be large. For example, the 32-volume Encyclopædia Britannica has over 7,000 articles in approximately 32,000 pages. It contains over 44 million words and 23,000 illustrations. From this perspective, a document is composed of many components. The components may include a title, table of contents, chapters, paragraphs, etc. A document also changes over time. It may be viewed as an original text, changes, annotations, comments and revision. These create a space within a document.
At the word level, each word has its own attributes; e.g. meaning and part of speech. The meaning of a word is context-dependent. Position of a word relative to a sentence and other words determines its part of speech. Syntax of a language defines legitimate word arrangement to form a sentence. Sentences form a paragraph. A document may have other components such as title and chapter header. There is also a semantic structure in a document.
In many document-publishing applications, a document is a composite of several pieces: text fragments, pictures, and their layout. The relation is explicitly defined by putting the components together, in linear or spatial layout, as they should be seen on paper. For example, in the MS Word™ embedded-object model, objects such as pictures or tables can be stored in separate files. Links to the objects are added to the document. In normal presentation, the objects are resolved to their presentations and shown as parts of the document.
In order to exchange a document between systems, a standard that explicitly defines structure was proposed. The Standard Generalized Markup Language (SGML) is a standard (International Standards Organization [ISO], 1986 [ISO 8879]) that specifies the rules for the definition of a document's structure and the encoding of documents so as to show the instantiation of that definition. The standard distinguishes the structure and content of the document instance. The structure of a document must adhere to a definition known as a Document Type Definition (DTD). The SGML standard defines the rules for the construction of a DTD as well as the rules for constructing a document in accordance with the developed DTD. In the content of a document, SGML defines the syntax of how to embed the DTD components, i.e. elements, using “mark up” known as “tags.” An SGML document has an explicit structure that is hierarchical. The DTD is constructed from declarative descriptions of elements, attributes, and entities. An element is a conceptual structure, nested with other elements or consisting of atomic elements, e.g. text data. The relation of elements is described by the “content model.” Attributes are used to provide additional information about an element. An attribute is described by its name and data type in the DTD. In the document, the value of an attribute is assigned. Entities are shortcuts that describe predefined information by a “name” that will be replaced when the document is processed.
In practice, standard DTDs are defined for use within an organization or community of users. Documents of the same type will use that common DTD. By design, DTD-compliant documents have a tree structure based on elements. Many standard DTDs have been developed for interchanging documents at a public level, such as the Text Encoding Initiative (TEI) and the Hypertext Markup Language (HTML). It is common to define DTDs such that an instance document will have self-contained metadata information, data about a document, in its header.
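The tree structure of elements and attributes can be illustrated with a small marked-up document. The sketch uses XML syntax and Python's standard `xml.etree.ElementTree` parser as a stand-in for an SGML system; this parser checks only that the markup is well formed and does not validate against a DTD. The element names (`report`, `header`, `chapter`, `heading`, `para`), the attribute names, and the helper functions are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document instance with a metadata header and two chapters.
document = """
<report status="draft">
  <header>
    <title>Document Spaces</title>
    <author>A. Writer</author>
  </header>
  <chapter id="c1">
    <heading>File Systems</heading>
    <para>A directory is a container for files.</para>
  </chapter>
  <chapter id="c2">
    <heading>Hypertext Systems</heading>
    <para>A link relates two anchors.</para>
  </chapter>
</report>
"""

root = ET.fromstring(document.strip())


def outline(element, depth=0):
    """List element names, indented by their depth in the tree."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


def chapters_with_heading_word(root, word):
    """Structured search: chapters whose heading contains the word."""
    return [chapter.get("id") for chapter in root.iter("chapter")
            if word.lower() in (chapter.findtext("heading") or "").lower().split()]


tree_outline = outline(root)
status = root.get("status")                              # attribute value on the root element
matches = chapters_with_heading_word(root, "hypertext")
```

Because the structure is explicit, a query can be restricted to one kind of element, such as chapter headings, rather than matching words anywhere in the text.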
The Text Encoding Initiative (TEI) produces guidelines for the preparation and interchange of electronic texts for scholarly research (Sperberg-McQueen & Burnard, 1990). The TEI is sponsored by the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing (ALLC). In 1990, the TEI issued its “Guidelines for the Encoding and Interchange of Machine-Readable Texts.” The guideline defines DTDs for each type of document, e.g. dictionaries, drama, hypermedia, etc.
The TEI header includes a wide range of metadata, a set different from that of MARC. Its source description part includes bibliographic data, and its revision history part can record the details of changes to a document. Developers of the TEI DTD were included in early planning for the MARC DTD so that the two standards could complement each other.
Presentation of a structured document is not covered in the SGML standard but in a companion standard, the Document Style Semantics and Specification Language (DSSSL) (ISO, 1996 [ISO/IEC 10179]). Similarly, the HyTime standard provides linking and hypertext-like extensions for SGML.
Most recently, the Extensible Markup Language (XML) (W3C, 1998) has been developed as a simplification of SGML with more restrictions. Any valid XML document is also a valid SGML document. XML has been developed for use in the WWW environment. It greatly extends the definitional capability afforded to the user: whereas an HTML-compliant browser handles documents conforming to the HTML DTD, an XML-compliant browser allows access to documents conforming to any DTD written in compliance with XML. Just as SGML is complemented by presentation and linking standards, so XML is complemented by the Extensible Stylesheet Language (XSL) (W3C, 2000) and XML Linking Language (XLink) (W3C, 2000) standards.
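As an illustration of the elements, attributes, and entities described above, the following minimal sketch parses a small XML document that carries its DTD in an internal subset. The "memo" document type, its element names, the `status` attribute, and the `org` entity are all hypothetical and chosen only for this example. Note also that Python's standard `xml.etree.ElementTree` parser is non-validating: it reads the declarations and expands the internal entity, but it does not check the document against the content model.

```python
import xml.etree.ElementTree as ET

# Hypothetical "memo" document type, declared in an internal DTD subset.
MEMO = """<!DOCTYPE memo [
  <!ELEMENT memo  (title, para+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT para  (#PCDATA)>
  <!ATTLIST memo status CDATA #REQUIRED>
  <!ENTITY org "Example Organization">
]>
<memo status="draft">
  <title>Meeting notes</title>
  <para>Prepared by &org;.</para>
  <para>Second paragraph.</para>
</memo>"""


def outline(element, depth=0):
    """Return the element hierarchy as indented lines, one per element."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(MEMO)
memo_outline = outline(root)           # the tree structure implied by the tags
memo_status = root.get("status")       # attribute value assigned in the document
first_para = root.find("para").text    # entity reference replaced during parsing

print("\n".join(memo_outline))
print(memo_status)
print(first_para)
```

The outline makes the hierarchical structure visible: `memo` is the root, with `title` and `para` elements nested beneath it. A validating parser would additionally reject a memo that lacked a title or a `status` attribute.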
A hypertext document is also a structured document. Unlike SGML documents, which are hierarchically structured, hypertexts exhibit a network structure of nodes and links. Bernstein (1998) analyzed hypertexts in terms of their topological and rhetorical structures. He proposed many patterns of hypertext structure, including cycle, counterpoint, mirrorworld, tangle, sieve, montage, and split/join.
Highly structured documents may also be found in artificial languages such as programming languages. Many applications, such as Emacs or integrated development environments (IDEs), present or regenerate the source code of software in a structured spatial layout or with color coding.
Printed text in books and papers is physically linear, from left to right and top to bottom in the English language. However, accessing text is not necessarily linear. Readers may skim or jump from chapter head to chapter head. Some chapters may be skipped. Some will be read later in a different order. In order to guide readers, the structure of a document may be generated by analyzing its content. Automatic extraction of global and local text vectors can be used (Salton, Allan, Buckley, & Singhal, 1996). The relation of local vectors can be used to create links within text or to compare it to other documents. Text themes can be indicated by patterns of relations. An ordered link path can be created by a depth-first search of a text theme network, considering text coherence as the main criterion.
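The idea of linking text segments by vector similarity and then deriving a reading path can be sketched as follows. This is a simplified illustration, not the method of Salton et al.: it assumes raw term-frequency vectors, cosine similarity, an arbitrary threshold of 0.3, and a depth-first traversal that visits neighbors in paragraph order rather than by a text coherence criterion. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a text segment as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def link_paragraphs(paragraphs, threshold=0.3):
    """Link every pair of paragraphs whose similarity reaches the threshold."""
    vectors = [term_vector(p) for p in paragraphs]
    links = {i: [] for i in range(len(paragraphs))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                links[i].append(j)
                links[j].append(i)
    return links


def reading_path(links, start=0):
    """Depth-first traversal of the link network, lowest index first."""
    path, seen, stack = [], set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        path.append(node)
        # Push neighbors in reverse so the lowest-numbered one is visited next.
        stack.extend(sorted(links[node], reverse=True))
    return path


paragraphs = ["cats chase mice", "mice fear cats", "stocks fell today"]
links = link_paragraphs(paragraphs)
print(links)
print(reading_path(links))
```

In this toy input the first two paragraphs share vocabulary and are linked, while the third is unrelated and is never reached from the first. A fuller treatment would weight terms, compare the local vectors against a global document vector, and rank candidate links by coherence.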
2.4.6 Other Systems
Other types of systems that may be considered generators of document spaces include version control systems, used to keep track of changes; distributed systems, increasingly used because of portable computers and ubiquitous networks; and multi-user environments, in which conflicting and collaborative usage are addressed. These systems support a variety of functions in applications. Each system has its own attributes, and those attributes can be used as information for navigation in a document space.
In version control systems, a document is an entity that changes over time. It is created, changed, and destroyed. Changes are captured as the differences from a previous version. There may be multiple paths of changes, e.g., branching in RCS. It is also possible to merge changes made along different paths into a single unit. Change also takes place in the relationships between documents: a file may be moved within a hierarchical structure, and hypertext links may be relocated.
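Capturing a change as the difference from a previous version can be sketched with Python's standard `difflib` module. The two versions below are invented sample data, and the sketch only illustrates the principle; it is not a description of how RCS or any other system stores its deltas internally.

```python
import difflib

# Two hypothetical versions of the same document, as lists of lines.
version_1 = ["Title\n", "First draft.\n"]
version_2 = ["Title\n", "Second draft.\n", "Added line.\n"]

# The delta records which lines were kept, removed, or added.
delta = list(difflib.ndiff(version_1, version_2))

# Either version can be regenerated from the delta alone.
restored_1 = list(difflib.restore(delta, 1))
restored_2 = list(difflib.restore(delta, 2))

print("".join(delta))
```

Because both versions are recoverable from the delta, a system needs to keep only one full copy of the document plus a chain of such differences.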
Nowadays, many people use more than one computer, one at home and one at the office, and may carry a notebook computer. In this environment, there is a need for synchronization between a document and its copy while editing. The alternative is to connect computers to networks and work through a central document management system. Reading a document may not cause any differences because its content will not be changed, but different computers may require different presentations. For instance, notebooks normally have smaller display areas and lower resolution than does a desktop monitor.
In multi-user environments, conflicting actions can be expected to occur. There are many locking schemes that can be used to ensure consistency of data. Most file systems provide a basic write lock and do not allow other users to write into a currently locked file. Most read-only applications do not use lock protection beyond that provided by the operating system. As a result, a file can be deleted while it is being read. In collaborative activity, other information related to documents may be added so that users are aware of each other’s activities. Examples include user awareness, group navigation, and commenting systems.
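A basic write-lock scheme of the kind mentioned above can be sketched as a table that maps each document to the single user currently allowed to write it. The class and method names are hypothetical, and the table lives only in memory; real file systems and document management systems implement locking in their own ways.

```python
import threading


class WriteLockTable:
    """In-memory table of exclusive write locks, one owner per document."""

    def __init__(self):
        self._owners = {}               # document name -> user holding the lock
        self._guard = threading.Lock()  # protects the table itself

    def acquire(self, document, user):
        """Grant the write lock unless another user already holds it."""
        with self._guard:
            owner = self._owners.setdefault(document, user)
            return owner == user

    def release(self, document, user):
        """Release the lock, but only on behalf of its current holder."""
        with self._guard:
            if self._owners.get(document) == user:
                del self._owners[document]
                return True
            return False

    def holder(self, document):
        """Report who is editing a document, or None if it is unlocked."""
        with self._guard:
            return self._owners.get(document)


locks = WriteLockTable()
print(locks.acquire("report.txt", "alice"))  # alice obtains the lock
print(locks.acquire("report.txt", "bob"))    # bob is refused while alice holds it
print(locks.holder("report.txt"))
```

Reads are deliberately not locked here, which mirrors the situation described above in which a file can change or disappear while it is being read. The `holder` query hints at how lock information could also feed a simple form of user awareness.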