“Libraries exist to preserve society's cultural artifacts and to provide access to them. If libraries are to continue to foster education and scholarship in this era of digital technology, it's essential for them to extend those functions into the digital world.” (Internet Archive, 2001).
The Internet Archive (http://www.archive.org/details/texts) is actually a collaboration of several smaller, but by no means small, digital libraries, including the North Carolina State University libraries, the Library of Congress and Project Gutenberg, as well as libraries in Canada and China. It is a 501(c)(3) non-profit organization, established in 1996 to “offer permanent access to researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format.”
The Internet Archive's goal is to preserve the cultural memory of each country that it is currently working with. “Without paper libraries, it would be hard to exercise our ‘right to remember’ our political history or hold government accountable” (Internet Archive, 2001). Preserving business and government documents is essential to maintaining the current culture of our nations.
The Internet Archive serves several different purposes, all of which use the archive as a springboard for seeing where the internet has come from and for tracing both the economic and social history of our societies. For this purpose, the Internet Archive offers much more than just texts. There are digital images, audio recordings, software programs and archived websites that show just how far our societies have come with the advent of the internet. Additionally, the preservation of software will allow users to access documents and other items that were created using software that is outdated or no longer available.
Despite the many different formats available on this site, I will focus on the books available from the American libraries for this case study.
The Internet Archive itself currently has over 850,000 registered users, although registration is not necessary to view the site or to download materials. It is adding approximately another 12,000 users per month (http://www.archive.org/about/graphs.php).
Tracking of its users is handled by Alexa.com, a partner in the project. According to its website, Alexa.com has “developed and installed, based of millions of toolbars, one of the largest Web crawls and an infrastructure to process and serve massive amounts of data” (Alexa.com, 2010). In keeping with the Internet Archive's desire to trace the roots of the web, this seems like the best tool at its disposal to do so.
Alexa.com tracks the number of users and what they look at on the website. It records the demographics of the users as well as how long they spend on the site. It also traces the traffic on particular sites, a useful tool for those trying to study internet usage, as the Internet Archive does on its website.
Statistics provided by Alexa.com, http://www.alexa.com/siteinfo/archive.org
The Internet Archive does not try to hide the fact that it is collecting data. In fact, it has a page that explains its mission and its policy for doing so. On this page, it states that it adheres to several different standards and that the information is kept for no more than thirty days, for research purposes only. “In the typical default configuration, this includes the IP address, last page visited, and browser used, among other information. This log data is useful for security, improving usability, and marketing purposes, but contrary to the operating assumptions of many organizations, it typically includes personally identifiable information (PII), or information that can be used to generate PII. PII is governed by a complex set of state, federal, and European laws, and mismanagement of this information can result in potential civil liability or regulatory action by state or federal agencies.” (http://groups.ischool.berkeley.edu/log-mgmt/)
Currently, the Internet Archive comprises 1,882,138 texts, a majority of which come from American libraries, although the Canadian libraries have recently added their 200,000th text. Most of the texts are older, with fiction works mostly dating back to the 1800s, mainly because of copyright restrictions. Unfortunately, most of the more scholarly texts, such as those in Economics and other disciplines, are just as old. There are, however, a few newer texts, dating back to the 1990s, but those would be used more for researching past economic trends than current ones.
The main screen for the American libraries showcases their recently added texts as well as the most popular searches.
American Libraries Main Page, Internet Archives, 2001
Its most popular exhibition is its collection of children’s books from the late 1800s. These are truly fascinating for their workmanship and illustrations. Children’s books in the 1800s were considered works of art, and this collection is a prime example of the Archive's effort to preserve the culture for future generations.
Each of these books is scanned as it was found, with the name plates, title pages and tables of contents for both text and illustrations all included in the text.
“Goody Two Shoes” text, Internet Archives, 2001
Navigation through the texts is easily done using the toolbar at the top of the page. The toolbar allows you to zoom in or out of the text, as well as find a particular page or passage using the search box to the right. Moving from page to page is done using the icon of the book at the top of the page. Additionally, readers can skip to a particular page using the box next to the page number. Simply enter the desired page number and hit enter, and the reader is taken to that page.
From the main page, all texts can be printed or downloaded in a variety of formats. By clicking on “details” next to the name of the text, the reader is brought to a page with all of the bibliographic information as well as the scanning information. This text, like most of the texts on this site, is available for download in PDF, Kindle, and DjVu formats, in both color and black and white.
“Goody Two Shoes” details, Internet Archives, 2001
Unfortunately, not all texts are available this way. Project Gutenberg texts, since they were donated in their own format, are just really long web pages. They are only available in the HTML format and are not available for download. The user must read the book on the site, with no bookmarks available.
Bibliographic information, including copyright, is available at the top of the page. Also listed is the contributing library. Further down on this page is the scanning information. All bibliographic information on this site is based on MARC II records.
Because each text is scanned by different people at different libraries, there are no set standards for scanning. For example, this text was scanned at 500 ppi. Generally, on this site, everything is scanned between 400-600 ppi.
When Internet Archive scans a book for a library, they use a custom-engineered workstation with “Scribe” technology. However, when others contribute, they ask that texts be donated in a PDF format, with images made with scanners or digital cameras. They accept JPEG2000, standard JPEGs or TIFF formats. There are no naming requirements so long as the text name ends in “.pdf”. “If the pdf has no hidden text layer (i.e., isn't searchable), then after doing OCR, Archive.org creates a second pdf with a text layer.” (Internet Archive, 2001, http://www.archive.org/about/faqs.php#276) Therefore, even if the text is completely uploaded by the donating library, the Internet Archive still has work to do, since most of the texts are not searchable.
Since the Internet Archive is a collection of several different types of media, the search engine is a little more detailed. Users are given the opportunity to search all media or within a specific format. Within that particular format, for example, text, the user is then given the opportunity to search different subsets. Subsets include American libraries as opposed to Canadian libraries, or any of the other ones that they have, such as Project Gutenberg or Additional Collections, which are a variety of texts donated by individuals or libraries other than their member libraries.
Most of the books within their archives are not recent so searching can be a little daunting. Typing in a subject does bring several results, however. Searching for “fashion”, for example, returns over 9 pages of results, all sorted by relevance. It starts with any work that has “fashion” in its title, and eventually moves down to works that are tagged as “fashion” in their keywords. With almost 2,000,000 texts, this can be very labor intensive.
Advanced search allows the user to narrow down the field considerably in several different ways. One way is to search the collection with more specific information such as date range or collection description.
Internet Archive, 2001 http://www.archive.org/advancedsearch.php?q=fashion%20AND%20mediatype:texts
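The search URL cited above can be assembled programmatically. The following is a minimal sketch in Python that only reproduces the query string shown in the citation (a search term combined with "mediatype:texts" using AND). The function name build_search_url is hypothetical, and any other search parameters the site may accept are not covered here.

```python
from urllib.parse import urlencode, quote

# Base address taken from the URL cited in this case study.
BASE_URL = "http://www.archive.org/advancedsearch.php"


def build_search_url(term, mediatype="texts"):
    """Build an advanced-search URL of the form cited in the text.

    Hypothetical helper for illustration only; it mirrors the
    'fashion AND mediatype:texts' query pattern and nothing more.
    """
    query = f"{term} AND mediatype:{mediatype}"
    # quote_via=quote encodes spaces as %20 (not '+');
    # safe=":" leaves the colon in 'mediatype:texts' unescaped.
    return BASE_URL + "?" + urlencode({"q": query}, quote_via=quote, safe=":")


if __name__ == "__main__":
    print(build_search_url("fashion"))
```

Calling build_search_url("fashion") yields the same address as the one cited, so swapping in another subject term is a quick way to repeat the search for a different topic.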
For more advanced users, mainly those who have created or donated texts, there are other ways to search. Using the XML descriptors and bibliographic information allows those who are donating to see if their text is already among the many that the site already offers.
Internet Archive search, 2001 http://www.archive.org/advancedsearch.php?q=fashion%20AND%20mediatype:texts
Additionally, for those who are having trouble finding exactly what it is that they are looking for, the Internet Archive offers instructions on exactly how to search what they have more efficiently. I found this particular section very useful and wish that other sites offered this as well.
Internet Archive, 2001. http://www.archive.org/advancedsearch.php?q=fashion%20AND%20mediatype:texts
This also allows for searching within the texts, including the Project Gutenberg texts, which are only available in HTML format.
Browsing through the library is done from the main page. Besides the most popular texts that are showcased on the front page, users may click on a particular library and access their collections that way. Browsing can also be done by most recently added items. Each page has a side bar that lists the most recently downloaded items, items with the most downloads by week or staff picks.
Internet Archive, 2001 http://www.archive.org/details/americana
However, there is no browsing by subject, which I find to be particularly difficult. If I am a parent trying to find different children’s books by just looking at titles, not searching, then I cannot do so. I would have to know the title and/or the author to be able to browse in this fashion.
The only collections that allow browsing by subject are the “additional collections”. These are usually texts that were uploaded by individuals or by smaller libraries not actively associated with the project, and include, among other items, dance manuals and cookbooks. Once a user selects one of these categories, however, they are faced with the same browsing mechanism that only allows them to choose by title or author.
The Internet Archive also offers reviews of its titles, which, much like those on Amazon.com, allow users to see what others thought before investing their time in downloading and reading the text in question. The reviews offer insights such as “corrupted file”, which are particularly useful. The user must be logged in to read the reviews or post their own.
There are also forums available on the site, which allow the users to discuss the texts or ask questions to clarify the reading. Both of these are accessed from the main page of the library and can be searched using the same search engine as the regular texts.
Storage and Preservation
Perhaps the most difficult challenge the Internet Archive has, besides the browsing feature, is the storage and preservation of their materials.
“Storing the Archive's collections involves parsing, indexing, and physically encoding the data. With the Internet collections growing at exponential rates, this task poses an ongoing challenge.” (Internet Archive, 2001, http://www.archive.org/about/about.php)
Currently, the collection is housed on DLT tapes, which allow for scalability and backward-reading compatibility and hold approximately 1600 GB of compressed data each at a transfer rate of 432 GB/h (http://www.quantum.com/Products/TapeDrives/DLT/Index.aspx#). “Web data is received and stored in archive format of 100-megabyte ARC files made up of many individual files.” (Internet Archive, 2001, http://www.archive.org/about/about.php)
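To put the tape figures in perspective, the short Python sketch below works out how long it would take to fill one tape at the quoted transfer rate and roughly how many 100-megabyte ARC files one tape could hold. It assumes decimal units (1 GB = 1000 MB) and that the quoted capacity and rate apply uniformly, which the sources do not state; the function names are hypothetical.

```python
# Figures quoted in the text for a DLT tape and for ARC files.
TAPE_CAPACITY_GB = 1600          # approximate compressed capacity per tape
TRANSFER_RATE_GB_PER_HOUR = 432  # quoted transfer rate
ARC_FILE_MB = 100                # size of one ARC file

# Assumption: decimal units, 1 GB = 1000 MB.
MB_PER_GB = 1000


def hours_to_fill_tape(capacity_gb=TAPE_CAPACITY_GB,
                       rate_gb_per_hour=TRANSFER_RATE_GB_PER_HOUR):
    """Hours needed to write one full tape at the quoted rate."""
    return capacity_gb / rate_gb_per_hour


def arc_files_per_tape(capacity_gb=TAPE_CAPACITY_GB, arc_file_mb=ARC_FILE_MB):
    """Whole number of ARC files that fit in the quoted capacity."""
    return (capacity_gb * MB_PER_GB) // arc_file_mb


if __name__ == "__main__":
    print(f"Hours to fill one tape: {hours_to_fill_tape():.2f}")
    print(f"ARC files per tape: {arc_files_per_tape()}")
```

Under these assumptions a single tape takes a little under four hours to fill and holds on the order of 16,000 ARC files.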
Migration of such a large collection is a huge undertaking. The Internet Archive migrates its data approximately every 10 years, even though the industry generally does so every thirty years. Its information is currently being moved away from the DLT tapes and onto a new medium, the Petabox (http://www.archive.org/web/petabox.php). The Petabox, custom-designed for this project, now provides approximately 3 PB of storage and is an inexpensive design for such a large project. It also has software that automates full mirroring of the site, and it requires only one system administrator for each petabyte, which cuts down on labor costs, a good thing for a project that depends on donations to continue.
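The scale of this migration can be estimated from the figures already given. The Python sketch below is a rough calculation only: it assumes decimal units (1 PB = 1,000,000 GB) and compares the Petabox's quoted 3 PB directly with the 1600 GB compressed capacity quoted for a DLT tape, which may not be a like-for-like comparison. The function names are hypothetical.

```python
import math

# Figures quoted in the text.
PETABOX_CAPACITY_PB = 3      # approximate Petabox storage
PB_PER_ADMIN = 1             # one system administrator per petabyte
DLT_TAPE_CAPACITY_GB = 1600  # approximate compressed capacity per tape

# Assumption: decimal units, 1 PB = 1,000,000 GB.
GB_PER_PB = 1_000_000


def admins_needed(capacity_pb=PETABOX_CAPACITY_PB, pb_per_admin=PB_PER_ADMIN):
    """System administrators needed at one per petabyte, rounded up."""
    return math.ceil(capacity_pb / pb_per_admin)


def tape_equivalents(capacity_pb=PETABOX_CAPACITY_PB,
                     tape_gb=DLT_TAPE_CAPACITY_GB):
    """Number of tapes whose quoted capacity adds up to the given storage."""
    return math.ceil(capacity_pb * GB_PER_PB / tape_gb)


if __name__ == "__main__":
    print(f"Administrators needed: {admins_needed()}")
    print(f"Equivalent DLT tapes: {tape_equivalents()}")
```

On these assumptions, 3 PB corresponds to three administrators and to nearly two thousand tapes, which helps explain why a disk-based system is attractive for a donation-funded project.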
Most importantly, as mentioned before, the Internet Archive is also seeking to preserve the software used for many of these items so that they can continue to offer these texts and other formats even after the software has become obsolete.
Comparison to Other Internet Archives
The size of the Internet Archive, due mainly to the many different libraries that make up the archive, makes it almost impossible to compare it to another one of the same size. The only other site that I could find of almost the same size was Amazon.com, and that is a pay per use site (downloading using Kindle).
Since the Internet Archive has several university libraries within its website, I will compare the University of Pennsylvania libraries' offerings through the Internet Archive with the collection that is available directly online (http://onlinebooks.library.upenn.edu/). The most interesting thing about the University of Pennsylvania website is that it is merely a portal, which leads you to other sites that have digitized books.
The University of Pennsylvania website is maintained and edited by John Mark Ockerbloom, a digital library architect and researcher at the University of Pennsylvania. Although the server is through the university, it does not look like the site is actually maintained by the university, just one of its employees.
The University of Pennsylvania website also has their books available in HTML and PDF format, when available. Unfortunately, although they boast over 35,000 texts, most of the links are no longer active. A majority of them have the message “Error 404 - Page not found”.
Where the University of Pennsylvania does excel, however, is in the browse function. The site allows the user to browse by subject, something not easily possible in the Internet Archive. And within the subject search, there is another search available, with the alphabetical search at the top and an additional “search by prefix” option.
The other nice feature of the University of Pennsylvania site is that each of the books is divided into chapters and is downloadable in this format. Researchers may find this particularly useful with the larger books, as they do not have to use so much storage space if they only need a few pages or one chapter.
However, all of these additional features are hindered by the fact that most of the links are broken, and therefore what should be a 35,000-book collection is really much smaller than that.
In researching this topic, I came across many different book-based digital libraries but decided to focus on the Internet Archive for many reasons. It seems to be the largest, most comprehensive collection of books, free of charge, on the Internet. Although it would prefer the user to log in, this is not required.
Other digital libraries sometimes require memberships, and do not even offer as many books. One site, maintained by Gale Publishing, has several of their reference books available for use online. However, this requires a subscription through your own library, be it the public library or through a university. And then, the subscription allows you to access books that the subscriber has decided to purchase.
The Internet Archive has a large selection of books on a variety of topics available free of charge and is doing everything in its power to make sure that those books are available for years to come.
I found the website easy to use, with a little bit of everything for all users. If I were doing a large research project, however, I am not sure that I would be able to find everything I was looking for. The Internet Archive is limited by copyright laws, so it may never be able to be everything for everyone unless deals are negotiated with the publishing houses. Overall, though, it does have a lot of resources available for all users.
The Internet Archive (http://www.archive.org/details/texts) Accessed February 21, 2010.
Smith, Alastair G. (2000) "Search features of digital libraries" Information Research, 5(3) Available at: http://informationr.net/ir/5-3/paper73.html (Accessed February 6, 2010)
Alexa.com, (http://www.alexa.com/) Accessed February 21, 2010.
Digital Library of the University of Pennsylvania, (http://onlinebooks.library.upenn.edu/) Accessed February 21, 2010.
Goody Two Shoes. (New York: McLoughlin Bros, 1888). http://www.archive.org/stream/goodytwoshoes00newyiala#page/n0/mode/2up Accessed February 21, 2010.