b) XML
eXtensible Markup Language, XML, is rapidly establishing itself as an industry-wide format for data and document exchange, being the de facto standard for representation of information content optimised for Web delivery in variable formats (Miller 2000, Marco 2001, Banerjee 2002). Every serious Web technology is now expected to define its relationship to XML (Rhyno 2002). Virtually all major software developers have now integrated support for XML into their products10. The overarching issue for libraries, as we have seen, is that of resource integration across a distributed information environment (see above – Powell and Lyon 2002, Brygfjeld 2001). XML is a language or format capable of representing complex structures in non-proprietary and self-explanatory ways. It appears to have the potential of facilitating such integration, through the possibilities it affords for metasearching across different document types and metadata formats, and as a vehicle for systems interoperability (via Web services?). The potential applications of XML within library systems have drawn considerable interest from the beginning, yet it is apparent that this widespread interest has not led to clear directions about what the role of XML should be (cf. Carvalho and Cordeiro 2002).
It appears that this complex issue has three main aspects: the extent to which XML is being used as a format for core library metadata standards and employed within data exchange technologies; the penetration of XML standards within e-book publishing (one may assume that library systems vendors will introduce support for such standards once they have stabilised); and the “pure” technology issues of data storage and manipulation, and of the use of so-called Web services within library systems.
I aim to provide here:
a) an overview of XML fundamentals
b) a summary of metadata issues, especially use of MARC XML
c) a summary of the issues concerning data storage, database technologies, and Web services
The data exchange issue has already been covered under Z39.50 and EDI.
a) XML was introduced in 1996 and endorsed by the World Wide Web Consortium in 1998. XML, like HTML, is a derivative of Standard Generalised Markup Language (SGML). XML provides a customisable, structural markup of a document, unlike HTML, which provides a presentation format rather than a structure. It is a meta-language for defining an unlimited number of specific markup languages, each of which may contain an unlimited number of tags (hence extensible); its elements are defined by the user. In XML, content is separated entirely from presentation; the presentation of XML content needs to be specified by style sheets.
XML was designed for a number of specific purposes:
- to enable international media-independent electronic publishing
- to allow for the definition of platform-independent protocols for data exchange
- to deliver information to user agents in a manner that allows for automatic processing post-receipt, and to reduce the costs of the processing
- to permit choice of display format by means of style-sheet control
- to facilitate the production of metadata
(W3C XML Activity Statement, quoted by Fichter and Cervone 2000)
The XML syntax has three main building blocks: elements, attributes and entities. XML documents contain a hierarchy of named elements, the structure of which can be conceived of as an inverted tree with the actual data values occupying the “leaves”. Container elements contain text and/or other elements. Named attributes of an element can be specified in its “start” tag. Entities permit components of a document to be named and stored separately. A Document Type Definition (DTD) can be set up to declare each of the permitted elements, attributes and entities, and their inter-relationships (Brandt 2001). A DTD expresses the hierarchy and granularity of data, allowable attribute values, and whether elements are optional, repeatable, etc. Such DTDs form templates for the logical structure of associated XML documents. An XML document that conforms to a DTD is said to be valid. A DTD can be established by the user, or the document can refer to an existing DTD. The notion of an XML namespace has been established to address the issue of potential name clashes of XML elements, whereby a machine-readable definition of the element set for a document (e.g. a resource list) is given at a fictitious URL or URI (Uniform Resource Identifier) (Kelly 2000). Since a DTD defines a single namespace, a suite of DTDs can be defined to permit elements from different DTDs to occur in one document.
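The three building blocks described above can be seen together in a small example. The following is a minimal sketch in Python, assuming an invented catalogue record: the element names, the `id` attribute and the `lib` entity are hypothetical and are not drawn from any library standard. Note that the standard-library parser used here reads the DTD's entity declaration but does not validate the document against the DTD; a validating parser would be needed for that.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: an internal DTD declares the permitted elements,
# one required attribute, and one entity (a named, reusable component).
RECORD = """<!DOCTYPE record [
  <!ELEMENT record (title, author+)>
  <!ATTLIST record id CDATA #REQUIRED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ENTITY lib "Example University Library">
]>
<record id="r001">
  <title>Holdings of the &lib;</title>
  <author>Smith, A.</author>
  <author>Jones, B.</author>
</record>"""


def summarise(xml_text):
    """Return the root element name, its attribute, and the leaf values."""
    root = ET.fromstring(xml_text)  # parses only; no DTD validation is done
    return {
        "element": root.tag,
        "id": root.get("id"),              # attribute from the start tag
        "title": root.findtext("title"),   # entity reference is expanded
        "authors": [a.text for a in root.findall("author")],
    }


summary = summarise(RECORD)
print(summary)
```

The repeated `author` element corresponds to the `author+` declaration in the DTD, which marks the element as required and repeatable.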
At a more fundamental level (e.g. across a particular industry sector), it is possible to define XML schemata or specialised derivative markup languages, which define within the overall XML Schema framework, using a separate XML document, the particular tags and attributes used for applications within that industry (van der Vlist 2000). This has been done, e.g. for voice recognition (VoxML), multimedia (Synchronised Multimedia Integration Language – SMIL) and wireless access (Wireless Markup Language – WML). A draft of XML Schema was released by the W3C as a Proposed Recommendation on March 21st 2001.11 While much more powerful than a DTD, XML Schema has only limited support within currently available software.
XML by itself models and delivers structured data without any reference to documents. The display and linking of XML data are defined by several related technologies and standards (the proliferation of which is fairly described by Peek (2000) as “alphabet soup”!):
XSL
In a manner analogous to the use of Cascading Style Sheets with HTML, the appearance of XML documents is controlled by eXtensible Stylesheet Language (XSL). This separation of form from content is a powerful feature of XML; XML data can be displayed in many different formats using different style sheets, hence it can be used to customise user interfaces. XSL may also be used to perform calculations. XML can be converted to HTML using a variety of methods at either client or server side. A further development, XSL Transformations (XSLT), enables one XML document to be transformed into another according to an XSL style sheet, so, for instance, XSLT can convert an XML document into HTML, or reformat it for display within the screen of a WAP mobile ‘phone. XHTML, which is effectively replacing versions of HTML according to the W3C’s recommendations, is a representation of HTML in XML (Kelly 2001).
XML query languages
A variety of XML query languages have been proposed; in the library systems literature one can find references to XQL, XSL Patterns, and XQuery. XQuery is the most likely candidate to emerge as a W3C standard: XQuery 1.0, which relates closely to XPath (see below), is the subject of a W3C Working Group.12
XLink, XPointer and XPath
XLink provides hyperlinking functionality considerably greater than that of HTML. In addition to HTML-like simple links, it offers extended links that lead users to multiple destinations (so that, e.g., a hyperlink to an author’s name could yield a list of multiple options, such as secondary sources, bibliographical information, further links, portraits etc.), bi-directional links, and links with special actions (Kim and Choi 2000, Miller 2000). With XLink it is possible to set up external link databases to facilitate the maintenance of hyperlinks. Extended links are of two sorts, inline and out-of-line. In the latter, the links between documents are not stored in the documents themselves, but in a separate linking document. XPointer addresses the limitations inherent in HTML for processing pointers into documents. Using XPointer it is possible to link to any portion of an XML document, even if the author has not provided an internal anchor. It uses another XML technology, XPath, to specify locations within the document and to provide a means of querying the document (Evans 2002).
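To give a flavour of how XPath specifies locations within a document, here is a minimal sketch using Python's standard library, which implements only a limited subset of XPath 1.0. The bibliography document and its element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography; no anchors or identifiers are provided by
# the "author" of the document, yet any part of it can still be addressed.
DOC = """<bibliography>
  <entry type="book"><author>Smith</author><title>XML Basics</title></entry>
  <entry type="article"><author>Jones</author><title>Linking Data</title></entry>
  <entry type="book"><author>Lee</author><title>Markup Today</title></entry>
</bibliography>"""

root = ET.fromstring(DOC)

# Select by attribute value: the titles of all entries of type "book".
book_titles = [t.text for t in root.findall("./entry[@type='book']/title")]

# Select by position: the author of the second entry (XPath counts from 1).
second_author = root.find("./entry[2]/author").text

print(book_titles, second_author)
```

The same path expressions serve both purposes mentioned above: addressing a location (the second entry) and querying the content (all books).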
XML documents may of course be created using text editors, but specialised XML editors are required for producing them in quantity; again, several are now available. XML is supported by the most recent versions of Web browsers13.
To validate XML documents and provide access to their content, an XML parser is required. A variety of academic and free parsers is available, mostly coded in Java. In addition, several commercial companies have started offering updated versions of these, or have built their own. There are two main standard application programming interfaces (APIs) specifying how an application may access an XML document once it is in a parser: the tree-based Document Object Model (DOM) and the event-based, less memory-intensive Simple API for XML (SAX). DOM reproduces an XML document’s data hierarchy in a programming language’s native object format, providing programmers with an easy and familiar way of working with the data in the document. The DOM API loads the entire document into memory, favouring repetitive operations performed on short documents. For lengthy documents, SAX is a better choice. Unlike DOM, however, it cannot make backward or multiple passes through the data (Yager 2000).
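The contrast between the two APIs can be sketched with the DOM and SAX implementations in Python's standard library. The catalogue document and the function and class names below are invented for illustration; both routines extract the same titles, one from an in-memory tree and one from a single forward pass of events.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical catalogue used by both approaches.
DOC = ("<catalogue>"
       "<book><title>First title</title></book>"
       "<book><title>Second title</title></book>"
       "</catalogue>")


def titles_dom(text):
    """DOM: build the whole tree in memory, then navigate it freely."""
    dom = minidom.parseString(text)
    return [node.firstChild.data for node in dom.getElementsByTagName("title")]


class TitleHandler(xml.sax.ContentHandler):
    """SAX: react to start/end/character events as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is accumulated.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def titles_sax(text):
    handler = TitleHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.titles


print(titles_dom(DOC), titles_sax(DOC))
```

The SAX handler has to keep its own state (`_in_title`, `_buffer`) because nothing is retained once an event has passed, which is the price paid for the lower memory use.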
b) Since the 1960s the main metadata format used within the library community for print-based materials has been MARC. A huge amount of bibliographic data is extant in MARC formats. With the advent of XML and other Web metadata standards, the question arises for libraries of what the prospects are for MARC in an integrated information environment. A great deal of work has been carried out in relation to XML and MARC. Approaches and perspectives have varied: the issue is bound up with a complex debate, which is beyond the scope of this article, concerning the suitability of MARC as a bibliographic format for cataloguing Web resources, and the desirability of its replacement with an XML-based alternative.14 Some efforts (e.g. those of Miller and his team at the Lane Medical Library) have focused on replacing MARC content with an XML schema for bibliographic records (Miller 2000, 2002). Several teams and agencies (e.g. Miller 2000, Logos Research Systems) have developed methods and tools for conversion of MARC to XML at the structural level. The other main emphasis has been on XML implementations of MARC. In the spring of 2002 the Library of Congress announced an official specification for representing MARC data in an XML environment, MARC XML. It seems reasonable to suppose that, while MARC implementation efforts and experiments will continue, the library community is unlikely to abandon MARC within the foreseeable future (Johnson 2001).
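As an illustration of what a MARC record looks like once expressed in XML, the following sketch reads a small record in the Library of Congress MARC XML layout, in which MARC fields become `datafield` elements carrying `tag` and indicator attributes, and subfields become `subfield` elements with a `code` attribute. The record content is invented, and the helper function `subfields` is a hypothetical name introduced here.

```python
import xml.etree.ElementTree as ET

# Namespace of the Library of Congress MARC XML ("slim") schema.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented bibliographic record, for illustration only.
RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">example-0001</controlfield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Smith, Anne.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An invented title :</subfield>
    <subfield code="b">for illustration only /</subfield>
    <subfield code="c">Anne Smith.</subfield>
  </datafield>
</record>"""


def subfields(root, tag):
    """Map subfield codes to values for the first-level field with this tag."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield"
    return {sf.get("code"): sf.text for sf in root.findall(path, NS)}


root = ET.fromstring(RECORD)
control_number = root.find("marc:controlfield[@tag='001']", NS).text
title_field = subfields(root, "245")
print(control_number, title_field)
```

The XML form preserves the MARC structure (tags, indicators, subfield codes) rather than replacing it, which is what distinguishes this approach from the schema-replacement efforts mentioned above.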
XML has already been widely adopted as the language of other metadata standards within the library and information community. For instance,
i) XML is itself the syntax for the Resource Description Framework (RDF). RDF is the central component of W3C “semantic web” activity (Medeiros 2000) and a major application for digital libraries (Kelly 2000, Bray 2001). It is not itself a metadata scheme, but a system for encoding metadata schemes within a standardised framework; it provides a standard way of describing element names, their content and their relationships (ODL 2001).
ii) The Open Archives Initiative15 aims to enhance access to e-print archives as a means of improving access to scholarly communication; within the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), XML is used both for the responses to protocol requests (the requests themselves being issued over HTTP) and for delivering metadata, Dublin Core16 based metadata in XML being the main metadata format that it uses (Kent 2002).
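A harvesting exchange of this kind can be sketched as follows, assuming version 2.0 of the protocol, in which a request is an HTTP query string and the response is an XML document. The repository address is hypothetical, no network call is made, and the sample response is invented and trimmed (a real response also carries elements such as the response date and an echo of the request).

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.org/oai"  # hypothetical repository

# Namespaces used by OAI-PMH 2.0 and its unqualified Dublin Core format.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented, trimmed response for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An invented e-print</dc:title>
          <dc:creator>Smith, Anne</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request asking for Dublin Core metadata."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def harvest_titles(response_text):
    """Pull the Dublin Core titles out of a ListRecords response."""
    root = ET.fromstring(response_text)
    path = ".//oai:record/oai:metadata/oai_dc:dc/dc:title"
    return [title.text for title in root.iterfind(path, NS)]


request_url = list_records_url(BASE_URL)
titles = harvest_titles(SAMPLE_RESPONSE)
print(request_url, titles)
```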
This is far from being a complete list. Other library-related metadata standards using XML are summarised by Rhyno (2002a) and by ODL (2001).
c) There is an obvious issue for software developers as to how XML-based documents and data may be stored and managed. Two main approaches are possible: the relational database (RDBMS) using XML extensions or middleware, and the “native” XML database.17 In both these types of database, the tools to manage the XML documents conform to XML-related data models: XPath, DOM, and sometimes XQuery. Relational databases store large XML documents as Binary Large Objects (BLOBs) or Character Large Objects (CLOBs), using an XML parser to manipulate the XML as it is moved in or out of the BLOB or CLOB (Trippe 2002). They use SQL for querying, and a variety of mapping tools and technologies for mapping the XML data to the relational fields and back again. They may have built-in extensions for transferring data between XML documents and themselves (in which case they are referred to as XML-enabled databases) or may employ third-party middleware for this purpose (Castor, IBM Database DOM, and Breeze are products that are readily available). This can be processing-intensive and relatively slow, losing the performance advantage of a relational database system; it also has the disadvantage that, depending on the type of mapping employed, it may not always be possible to retrieve documents in the form in which they were input (“round-tripping”). All the major relational database players have moved to strengthen XML support (the use of XML query languages and the more stable standards) within their products: Oracle (Oracle 8i and 9i), IBM (DB2) and Microsoft (SQL Server using SQLXML) are some of the market leaders (Mable 2002).
“Native” XML databases, such as Ipedo (Ipedo Inc.), eXcelon’s XIS, and Tamino (Software AG) may use any physical storage model. By definition, they store and retrieve documents according to an XML-derived hierarchical data model, generally as indexed text or some variant of the DOM mapped to an existing data store. (Content management systems, incidentally, use such “native” XML databases for storage, but have additional functionality such as editors, workflow control, and version control built in.) XML-enabled relational databases, however, conventionally break down the XML hierarchy into sets of relational tables (Mable 2002, Bourret 2002).
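The two relational strategies discussed above, storing a document whole and breaking it down into tables, can be sketched with SQLite standing in for a full RDBMS. This is an illustration only: the table and column names are invented, a `TEXT` column plays the part of a CLOB, and no vendor's XML extension is being modelled.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record used for both storage strategies.
DOC = ('<record id="r001"><title>Invented title</title>'
       '<author>Smith</author><author>Jones</author></record>')


def store_whole(conn, xml_text):
    """Keep the document intact in one large text column (CLOB-style)."""
    conn.execute("CREATE TABLE IF NOT EXISTS documents "
                 "(id TEXT PRIMARY KEY, body TEXT)")
    root = ET.fromstring(xml_text)
    conn.execute("INSERT INTO documents VALUES (?, ?)",
                 (root.get("id"), xml_text))


def store_shredded(conn, xml_text):
    """Map the element hierarchy onto relational tables."""
    conn.execute("CREATE TABLE IF NOT EXISTS records "
                 "(id TEXT PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS authors "
                 "(record_id TEXT, position INTEGER, name TEXT)")
    root = ET.fromstring(xml_text)
    record_id = root.get("id")
    conn.execute("INSERT INTO records VALUES (?, ?)",
                 (record_id, root.findtext("title")))
    # Sibling order is kept only because it is stored explicitly.
    for position, author in enumerate(root.findall("author")):
        conn.execute("INSERT INTO authors VALUES (?, ?, ?)",
                     (record_id, position, author.text))


conn = sqlite3.connect(":memory:")
store_whole(conn, DOC)
store_shredded(conn, DOC)

# The whole-document store returns exactly what was put in.
round_trip = conn.execute("SELECT body FROM documents WHERE id = ?",
                          ("r001",)).fetchone()[0]
# The shredded store can be queried field by field with ordinary SQL.
names = [row[0] for row in conn.execute(
    "SELECT name FROM authors WHERE record_id = ? ORDER BY position",
    ("r001",))]
print(round_trip == DOC, names)
```

The shredded form supports ordinary SQL queries on individual fields, but anything the mapping does not record (here, for example, whitespace or comments) could not be reconstructed, which is the round-tripping problem in miniature.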
The relative merits of these different types of database depend very much on how the application makes use of the XML document. The terms data-centric and document-centric (Bourret 2002) are used to describe the primary function that an XML document provides for an application. A data-centric XML document is one:
- which is designed primarily as a vehicle for data transport
- which is intended to be processed by an application, and is accessed and manipulated at the level of individual fields
- which has a regular structure with specified field lengths
- which is fine-grained
- in which sibling order (i.e. of elements) is not important.
A document-centric document, by contrast:
- is intended to be updated and edited at the document level
- is designed to be read by human beings
- has a variable structure
- has larger grained data
- is one in which sibling order matters
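The practical difference between the two kinds of document can be sketched briefly. Both sample documents below are invented: the first is a loan transaction of the data-centric kind, the second a sentence of running text with inline markup.

```python
import xml.etree.ElementTree as ET

# Data-centric: regular, fine-grained fields; sibling order is irrelevant.
DATA_CENTRIC = ("<loan><borrower>12345</borrower>"
                "<item>b1001</item><due>2002-12-01</due></loan>")
REORDERED = ("<loan><due>2002-12-01</due>"
             "<item>b1001</item><borrower>12345</borrower></loan>")

# Document-centric: text mixed with markup; order carries the meaning.
DOCUMENT_CENTRIC = ("<para>The <term>DTD</term> was revised "
                    "<emph>after</emph> the schema.</para>")


def as_fields(xml_text):
    """Treat the document as a flat set of named fields."""
    return {child.tag: child.text for child in ET.fromstring(xml_text)}


def as_text(xml_text):
    """Treat the document as running text, read in document order."""
    return "".join(ET.fromstring(xml_text).itertext())


fields = as_fields(DATA_CENTRIC)
same_after_reordering = fields == as_fields(REORDERED)
sentence = as_text(DOCUMENT_CENTRIC)
print(fields, same_after_reordering, sentence)
```

Reordering the loan's fields changes nothing for the application, whereas reordering the children of the paragraph would produce a different sentence.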
These distinctions are not absolute, as many XML documents are of “mixed” type. A hybrid library obviously incorporates a heterogeneous range, from MARC records (data-centric) to e-books and even collections of e-archives. XML databases are also much slower than relational database management systems at working with internal data structures (McCarthy 2000, Rich 1999), although they can provide very good performance for certain types of information retrieval. Hitherto, relational databases have been considered more appropriate for data-centric applications, whereas for document-centric applications…native XML databases, object-relational databases, and other solutions that can maintain XML documents as a more complete unit, have been preferred.18
Library systems vendors so far seem to have chosen to adhere to relational database solutions, although native XML databases such as Tamino (Software AG) and Ixiasoft’s TEXTML are being used in some digital library projects (Yeates 2002).
XML is not a panacea to end all data exchange and interoperability problems. Its versatility comes at the expense of computational efficiency. A considerable expenditure is required to create XML documents and standards, while the documents themselves are verbose and slow-loading. Also, there is currently little awareness and use of it within the wider library and information community. It is, however, an important enabling technology.