|PP: Digital Humanities & Digitised Newspapers: The Australian Story
The title of my paper comprises two parts – ‘digital humanities’ and ‘digitised newspapers’ – and my rather ambitious aim today is to tell the ‘Australian story’ about both; that is, to give some background and an overview of digital humanities in Australia as well as to discuss my current research with digitised newspapers. I’ll give a brief introduction to this project, but I’ll focus on the epistemological issues it raises: specifically, regarding the relationship between the humanities and the archive in our age of digital remediation.
Digital humanities in Australia
It’s possible to say now – it wouldn’t have been a few years ago – that Australian universities incorporate a number of centres for digital humanities research. These include centres at the Australian National University, where I’m located, and which hosted the first Digital Humanities Australasia conference in 2012 (at which Julia Flanders was one of our keynotes); at the University of Western Australia, which hosted the second of these conferences this year; at the University of Western Sydney, where the international Digital Humanities conference will be held next year; and there are other places I could name.
However, my perception is that, in Australia, these digital humanities centres are not really leading the charge with digital research. In fact, I would suggest that they’re rather post-hoc add-ons to a pre-existing and vibrant research culture that uses digital technologies to progress humanities research. Rather than driving the uptake of digital methods, in other words, the ‘digital humanities’ movement or moment (or whatever you want to call it) stands, in the Australian context, largely as an institutional attempt to group together existing research trajectories and strengths in a way that taps into what is increasingly an internationally recognised brand.
On the one hand, this post-hoc situation makes it a bit difficult to survey digital humanities in Australia, because the field does not really cohere in relation to particular debates, or institutional centres with specific research strengths, as it might in the United States. On the other hand, I would argue that this post-hoc development has, in some ways, been a strength for research at the intersection of digital technologies and humanities in Australia, in that it has left room for the emergence of a great diversity of strong projects in a wide range of disciplines. In particular, due to the lack of a digital humanities context, digital projects have had – and still have – to speak, first and foremost, to the disciplines from which they arise.
One example of this feature of digital research in Australia – one that I think also points to a set of unique research questions the field is pursuing – can be found in projects relating to Aboriginal or indigenous Australian language and culture. Researchers such as Nick Thieberger and Howard Morphy have been involved, for decades, in building digital archives and developing digital tools to collect, curate, represent, and analyse cultural artefacts and linguistic information arising from indigenous Australian cultures. This research has not set them apart from other scholars in anthropology and linguistics, but has been integral to them becoming leaders in their fields. Their digital work has also led these researchers to very interesting and pertinent philosophical and practical considerations of data ethics. For instance, for some Aboriginal groups it’s forbidden to say the name or see an image of a relative who has died. How can data be thought of and structured in a way that acknowledges and respects cultural differences?
PP: For those of you interested in this research area the “Endangered Languages and Cultures” blog provides an excellent introduction, and is part of Nick’s PARADISEC project (which stands for Pacific and Regional Archive for Digital Sources in Endangered Cultures): http://www.paradisec.org.au/blog/
This might be controversial, but rather than do a show-and-tell of the features of the projects I’m going to discuss, I’m just going to put up links. We’ve got limited time, and I’m sure that if you’re interested in a particular topic or project, you’ll want to explore it yourself, at your leisure. I’ve put this paper and slides up on my WordPress site – the address is down the bottom there – so you can download the PowerPoint and access the URLs that way if you’d like.
I could go on naming individual projects, and areas of research strength, but I want instead to elaborate what I see as another key feature of digital humanities research in Australia (besides its post-hoc evolution): that is, its basis in, and emergence from, a number of very successful and innovative national eResearch infrastructure projects. In part, Australia’s success in the development of such infrastructure relates to the duration of European/Australian history as well as to the country’s population size: Australia was settled by Europeans in 1788, so fairly recently, and the population today – 21 million – is significantly smaller than that of the United States and, indeed, of California. So there are fewer objects in fewer cultural collections to digitise, and a smaller population to deliver them to. Notwithstanding these advantages, Australian cultural institutions – and the researchers who work with them – have developed digital infrastructure, and are using it, in world-leading ways.
I’ll give a few examples close to my heart (i.e. things I’ve been involved in). AustLit: The Australian Literature Resource (http://www.austlit.edu.au/) is the most comprehensive online bibliography of a national literature. Although you can access most areas of AustLit freely, it unfortunately retains a residual subscription model; I’m told that if you email them to say your library does not subscribe, they will grant researchers free access.
As well as being one of the first digital resources of this type, AustLit has developed a wide range of digital tools for researchers to explore and enhance its content, including federated searches, annotation features, an editing platform, and a range of visualisation tools (including network and object-oriented methods). If you’re interested in the data modelling and visualisation strategies AustLit is pursuing and making available, this website provides some information and video demonstrations of their activities: http://www.itee.uq.edu.au/eresearch/projects/aus-e-lit/demos.
AustLit is the basis for a number of digital humanities projects. It was the main source for my most recent monograph, Reading by Numbers, which demonstrated how quantitative bibliometric analysis can reconfigure our understanding of such issues as periodization, genre, gender trends in publication, cultural value, and so on. AustLit has also become a collecting point for research projects in Australian literature, hosting “Research Communities” and “Specialist Datasets” – which are projects that employ bibliography, data collection and analysis, scholarly editing, image and text annotation, and a range of other methods to explore particular aspects of Australian literary culture. This hosting function allows researchers to do such things as publish datasets, including images and annotated files, describe methodologies and publish findings.
PP: Cultural institutions, such as the National Library of Australia, the National Archives, and the National Gallery, have also been at the forefront of international best practice in digitising their collections. A good example of this innovation is Trove, which is the basis of my current research. Trove (http://trove.nla.gov.au/) is a large, federated archive that includes digitised images, music, maps, letters, and the section I’m focused on, newspapers; it also enables simultaneous searching of libraries and collections across the country.
Trove newspapers is the most popular subsection of the broader Trove database (https://trove.nla.gov.au/newspaper). It provides access to the largest collection of digitised newspapers in the world, with its 14 million pages dwarfing both Chronicling America (with 8 million pages) and the British Newspaper Archive (with 8.5 million). Besides its size it has a number of other features that make it particularly innovative and useful for historical research, including article segmentation in its digitisation process (which enables targeted searching); a very successful crowd-sourced model for correction of Optical Character Recognition text; and an Application Programming Interface (or API) that enables researchers (and members of the general public) to export bibliographic metadata and full-text records of search results as csv and text files – a facility that’s been integral to my research.
Australian cultural institutions are world-leading not only in the scale and quality of their digitisation programs, but in their willingness to make their collections available, and to work with researchers to create innovative means of accessing and using the collections. Tim Sherratt, who’s now the manager of Trove, actually came to their notice because he built an API to harvest Trove data for various experiments, ranging from fun to serious. One of his more light-hearted experiments is Headline Roulette, which asks you to guess the year a particular headline featured in the newspapers:
His blog archive (http://discontents.com.au/archive/) provides details of many more of his projects, including serious work on the “White Australia” policy archives.
Another Australian scholar who’s been doing wonderful things with the digital archives of cultural institutions is Mitchell Whitelaw, who emphasises the limitations of a “search” model and argues for an approach he calls “generous interfaces” (http://mtchl.net/tag/generousinterfaces/). His generous interfaces aim to put digital collections on display, and allow users to interact with them, rather than forcing them to guess what might be contained within by using search terms. My favourite is his interface for the Australian National Gallery’s collection of prints (http://printsandprintmaking.gov.au/explore/), but I’d encourage you to have a look at them all as they’re very beautiful and I think an important step forward in how we might imagine and represent digital collections.
So, that’s my brief fly-over of digital humanities in Australia, for what it’s worth. I’m sure the next Australian you have out here would give an entirely different perspective.
What I want to do now, is turn to my own work with a major cultural archive – the Trove Newspaper database I was discussing previously.
As I said in the introduction, I’ll spend a little bit of time describing my project – background, aims, method and research questions. But I’ll focus on describing the epistemological principles this project raises for digital research, and the relationship between the humanities and the archive more broadly in our age of digital remediation.
So, to background: In the nineteenth century, newspapers were not only the main source of serial fiction in Australia, but the main source of fiction – indeed, of reading material – in general, due to low levels of book ownership and access to lending libraries. Until recently, however, as is the case in respect to newspaper publishing everywhere – and indeed, to many humanities topics based in the archive – we’ve had very little idea about what fiction was published in these newspapers due to the size of the archive. Manually searching and cataloguing fiction in the thousands of newspapers published in Australia is simply unfeasible. Like an emerging number of digitisation projects, Trove Newspapers changes this situation, by enabling us to explore and represent the archive in new and powerful ways.
My project specifically doesn’t search the archive for specific titles and authors – an intuitive method, but one that tends to find what we already assume to be present in the archive. Instead it leverages the generic quality of newspapers by searching for the terms used to frame and introduce serial fiction in these publications, including “to be continued”, “serial story”, “our novelist” and, the first term we actually trialled: “chapter”. (When I say “we” I’m referring to myself and Carol Hetherington, who’s a bibliographer employed full-time on the project for three years with funding from the Australian Research Council.) “Chapter” has proven very effective in optimising results for fiction, because the word often occurs in the title of the “article” (which is defined by Trove as the first four lines of text) and multiple times in the “article” text (because a single instalment frequently contains many chapters). Both of these features push fictional titles to the top of Trove’s search results. Here are some of the fiction instalments that a “chapter” search turns up: PP; PP; PP.
Here is the way the search results appear in the Python-based harvester as it cycles through the Application Programming Interface and extracts them in both csv form (for the bibliographic metadata) and text files (for the full text of the instalments): PP.
Entering “chapter” into Trove in July last year returned more than 800,000 results, and we exported the full-text and bibliographic metadata for the first 250,000 using a slightly modified version of the API Tim Sherratt designed. Due to the usefulness of “chapter” in optimising the relevance ranking for serial fiction, we found that the first 30 sets of 5000 results were almost exclusively fiction, with the share of other records to fiction increasing over the next 20 sets of 5000 results. At this point, we deemed the share of non-relevant to relevant material too high to warrant further investigation (the relevance ranking algorithm had, for our processes, exhausted its usefulness). Other results of the “chapter” search not relevant to our project include reports of meetings of a chapter of a lodge or religious association, accounts of a chapter in the life of a town or person, or even public documents such as deeds of grant and regulations organised in chapter divisions.
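For the technically minded, the harvesting step might be sketched like this in Python. This is only a sketch, not the script we actually used: the endpoint and parameter names follow Trove’s version 2 API, but the record fields shown are illustrative, and a real harvest needs paging over the result sets, throttling, and error handling.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.trove.nla.gov.au/v2/result"  # Trove API v2 search endpoint


def fetch_page(query, key, start="*", per_page=100):
    """Fetch one page of newspaper search results, full text included."""
    params = urllib.parse.urlencode({
        "q": query, "zone": "newspaper", "key": key,
        "encoding": "json", "n": per_page, "s": start,
        "include": "articletext",
    })
    with urllib.request.urlopen(f"{API}?{params}") as resp:
        return json.load(resp)


def articles_to_rows(articles):
    """Flatten article records into rows for the bibliographic csv.

    The field names here (id, heading, date, title) are illustrative
    of the kind of metadata the API returns for each article.
    """
    return [
        (a.get("id"), a.get("heading", ""), a.get("date", ""),
         a.get("title", {}).get("value", ""))
        for a in articles
    ]
```

The full text of each article would be written out separately as a text file, one per instalment, while the flattened rows accumulate in the csv.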
We took these 250,000 results and removed all duplicates – of which there were many – deleted non-relevant material, and created and populated a large number of additional metadata fields. I won’t go into detail about these (although I’m happy to discuss them in question time if anyone’s interested); basically, my point is that, while automatic search and harvesting significantly expedites the bibliographic process, it by no means removes the necessity of bibliographic scholarship.
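A minimal sketch of the deduplication step might look like this; the matching key is hypothetical, and in practice OCR-damaged headings require far fuzzier matching, plus the manual bibliographic scholarship just mentioned.

```python
def deduplicate(records):
    """Keep the first occurrence of each record key.

    The key used here – newspaper, date, page, and normalised heading –
    is illustrative only; real duplicates in OCR'd data often need
    fuzzier matching (edit distance on headings, date windows, etc.).
    """
    seen = set()
    unique = []
    for rec in records:
        key = (rec["newspaper"], rec["date"], rec["page"],
               rec["heading"].strip().upper())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```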
PP: After all this, for the nineteenth century a “chapter” search has yielded the following results:
58,717 unique records (or instalments – and remember, we also have the full text for all of these)
these instalments constitute 6,269 titles
1,212 of those titles are completed in one issue (in some cases, these are short stories with chapters; in others, the stories are more like novellas, running over 10 or more pages in the case of some special supplements. In some cases, a story that is completed in one issue in one newspaper runs across multiple issues in another).
altogether we’ve found 4,076 unique titles (as you can see in the difference between the number of titles and the number of unique titles, many stories are published multiple times in different newspapers – and even, in some cases, in the same newspaper, a decade or so apart).
A great many of these stories were published anonymously, pseudonymously, or with signatures only (that is, with a note saying “By the author of …” and giving other titles, but not giving the author’s name). We’ve been able to identify
1,693 individual authors of these titles;
there remain 1,301 titles by authors we have not yet been able (and in all probability will never be able) to identify
This fiction comes from across the world: there are Australian stories, but British and American serial fiction is incredibly prevalent, and there is also serial fiction originating from Canada, France, Germany, New Zealand, Russia, South Africa and beyond.
PP: As I said, this is only the outcome of the first search, and already, it massively expands our bibliographic record. As a result, it provides the basis for a range of interesting research questions:
obviously, using it we can ask questions about overall trends in serial fiction publication, including such things as:
the prevalence of publication at different times as well as trends in the gender and nationality of authors;
we can also use these overall trends to consider fiction publication as a system, for instance: as the basis for identifying newspapers that frequently shared the same stories in order to ask questions about the underlying structures that produced those commonalities (relationships between editors and authors; the workings of syndication agencies, and so on)
it allows us to explore the prevalence of anonymous fiction publication in the nineteenth century, and the insights this offers into the “author function” operating in this period (and it’s my prediction that anonymity is going to be an increasingly important topic in literary and book history in coming years, not only because of its topicality with respect to comment on the internet, but because analyses of digitised archives (which are mostly eighteenth and nineteenth century due to availability and copyright restrictions) are going to turn up not only thousands of works, but thousands of works unattached to authors);
this expanded bibliographical record allows us to ask questions about the content of the fiction, whether reading titles newly recovered to literary history, perhaps by famous authors; reading titles based on elements of the data (for instance, those that were the most republished in Australian newspapers to explore those that had a particular resonance with readers); or using digital methods such as topic modelling to identify genres and themes prevalent throughout the corpus.
Now I’m happy to say more about these specific research questions and how I’m approaching them – I’m more advanced in some areas than in others.
PP: But as I said, I want to focus not on findings, but on epistemological principles: what does it mean to “know” literary and print culture through analysis of the digitised archive and on the basis of data?
This question is important for two main, interconnected, reasons.
One is the enormous rhetorical power of data in contemporary society. (I was at a paper and discussion last week, in the Digital Scholarship Group, about visualisation, where this issue of data as rhetoric was raised.) As was discussed there, for many reasons, data appears to us – that is to say, the rhetoric surrounding data makes it seem – true, objective, seamless, totalising, and commensurate with the world. This rhetoric can lead us to believe that data-based analyses and visualisations show us the truth about what happened in the past.
The second reason relates to the first, and is the unfortunate tendency of some data-based literary research to work within the framework of this rhetoric – or even to buy into it as a way of asserting the importance of its findings – by presenting the results of data-led analysis as, precisely, true, objective, and complete.
This is a big topic in and of itself, and has been the subject of a lot of debate. I’m happy to discuss it in more detail if anyone would like to do so, but what I want to signal, for now, is that there is a desire in digital humanities and in the humanities more broadly to articulate a framework for data-rich research that does not resonate with and/or deploy this rhetoric of data as true, objective, etc.
So, in what remains of my paper, I want to suggest four epistemological principles that my project has brought to the foreground for me, and that I think make a contribution to this discussion about how to pursue data-rich research in ways that resist and suggest alternatives to the rhetoric of data as complete, objective, true and so on. Very broadly, my claim is that we should not be looking to the sciences and the model of “big data” as a framework for digital research. We already have a more appropriate model worked out over many decades of philosophical and practical discussion: that of the archive, and our awareness of how its structures and practices shape what we can know about the past and how we can know it.
PP: Here, then, are my four principles. The first two relate to how we might imagine and therefore represent the objects in the archive. I am specifically interested in how to move beyond the one-dimensionality of data points to the multiplicity and complexity of human history and culture. As I said, these principles arise specifically from my project, so are oriented toward literary and print cultural objects – but I hope they have relevance to other forms of data-led humanities research. The third and fourth principles are broader, and concern the overall indeterminacy of the archive and how to conceptualise, articulate, and accommodate this indeterminacy in this age of digital remediation. There are probably more … I’m not suggesting this list is comprehensive (let alone true, objective, complete)! All four draw on the established critical view of archives – as always and already constructed, value-laden, mediated, and so on – to insist on an understanding of data as the same.
First, a literary work is not singular, but a process that unfolds over time and space; and in quantitative or data-led research we need to conceptualise and represent the print cultural archive in ways that acknowledge this multiplicity.
We tend to think of literary works as singular – James Joyce’s Ulysses, Nathaniel Hawthorne’s The Scarlet Letter. This might be true on an abstract level, but on a documentary or archival level, it is the exception rather than the rule. Not only are literary works often published in multiple editions, but even in single editions, multiple documents are put forth into the world. While literary studies is preoccupied with textual multiplicity, it trains us not to care about this ‘material’ multiplicity. But this approach ignores how every publication event changes both the definition and the meaning of the work – the definition, in terms of the array of things in the world that go under the name of and therefore constitute the existence of that literary work; and the meaning, in terms of the different presentations and constitutions of text and paratext that make up the work in its different publication events.
The chaotic world of newspaper publishing makes it impossible to avoid this multiplicity. Not only are we finding multiple republications of the same story, but stories with essentially the same text are published under multiple titles and with various attributions. We’ve found cases with up to eight different titles accorded to substantially the same text. And as fiction moved across national borders in the nineteenth century, it was often plagiarised, rewritten and localised. For instance, American author “Old Sleuth’s” novel The American Detective in Russia is serialised in several Australian newspapers as “Barnes, the Australian Detective”. These publications compel us to ask: which of the many publication events constitutes “the novel”? What is the relationship between these publication events and prior or subsequent serial or book publications? Are works published with different titles but substantially similar texts the same novel? And so on.
Such questions highlight the problem with analyses that represent literary works as single data points: for instance, charting (as I have been guilty of in the past) the publication of “Australian novels” – or “British” or “Irish” or “American” fiction. If we adopt an archival approach, we can see, instead, that the literary work is not a singular entity in time but a process that unfolds over time and space. One way I’ve tried to represent this, in my project, is by collecting data relating to the literary work at both abstract and documentary levels (and this terminology borrows from a scholar called Paul Eggert, in his recently published Biography of a Book). By abstract level I mean the title of the work and name of the author as commonly given in literary history; by documentary level, I mean the way that both are given in the particular newspapers in the archive (and I’m including here the ascription of titles to anonymous or pseudonymous authors, lists of other titles by the same author listed in the newspaper, and so on). This approach attempts to represent literary works as processes constituted by multiple events as well as literary historical objects indicative of a particular set of social and aesthetic relationships.
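A hypothetical sketch of this two-level data model – not the project’s actual schema – might look like this in Python, using the “Old Sleuth” example from a moment ago (the newspaper name is a placeholder, since I haven’t named the specific papers):

```python
from dataclasses import dataclass, field


@dataclass
class PublicationEvent:
    """Documentary level: how one newspaper actually presented the work."""
    newspaper: str
    start_date: str
    title_as_printed: str
    author_as_printed: str  # may be a pseudonym, a signature, or empty


@dataclass
class Work:
    """Abstract level: the work as named in literary history."""
    title: str
    author: str
    events: list = field(default_factory=list)


barnes = Work(title="The American Detective in Russia", author="Old Sleuth")
barnes.events.append(PublicationEvent(
    newspaper="(an Australian weekly)",  # placeholder name
    start_date="1880s",
    title_as_printed="Barnes, the Australian Detective",
    author_as_printed="",  # published anonymously in this event
))
```

The point of the structure is that the abstract title and author never overwrite the documentary record: each publication event retains the title, attribution, and framing under which readers actually encountered the text.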
Second, another issue relating to multiplicity: the archive contains multiple systems of meaning, and in modelling literary works we need to think about how to represent these.
This point extends on the first. When we approach the print cultural archive, we tend to do so first and foremost by asking and answering questions such as “who wrote it?” and “what is it called?” to supply the author and title of a work. This information then provides the route to further bibliographic facts – such as previous or subsequent publications of a work, or the gender and nationality of the author – that function as the means by which we locate literary works in time and space, and thus render them amenable to literary analysis. Identifying these facts regarding the documentary record, and modelling the digital archive accordingly, can lull us into believing that we are accurately representing that archive. And in one respect we would be. For instance, knowing the nationalities of authors whose novels were published in Australian newspapers allows us to see how fiction moved globally in the nineteenth century. However, such bibliographical facts do not constitute the entirety of – and in some cases stand in opposition to and function to obscure – other systems of meaning that the archive contains.
In respect to serial fiction, the newspaper archive contains bibliographical traces of the experience of historical newspaper readers. For instance, with novels often published anonymously or pseudonymously, or with signatures only, knowledge of the author’s name, let alone their nationality, may not have been part of the contemporary experience of reading serial fiction. In other cases, including where the author’s name was not given, nationality was emphasised, for instance, in by-lines announcing a story to be “A daring tale of the American frontier”, or in pseudonyms such as “A London Man”. Such features of the publication event bring nationality to the foreground of readers’ experiences. However, the nationalities with which these stories are inscribed may not be – and in many cases are not – related to the nationalities of authors discovered through bibliographic research.
The point is not that one is true and the other is false, but that both represent systems of meaning within the archive. An important challenge for data-rich humanities research is working out how to conceptualise and visualise the archive in ways that do not privilege one system of meaning and, in so doing, overwrite and obscure all others.
So I’ve been arguing, with both of these epistemological principles, that multiplicity is something we need to acknowledge and represent when we analyse the digital archive and represent literary and print culture at scale. Now, I take as given that, even if we do this – that is, no matter how many layers we include in our representation of the archive – it is never possible to fully access or represent the archive.
My third epistemological principle is more subtle (and I’m sorry that I couldn’t find a simpler way to put it, but hopefully if it is opaque it will become clearer as I elaborate). It is this: digital methods of accessing the archive increase the potential for unrealised mismatches between the access we intend and the access we achieve.
All archival research is provisional, in that we can never know the documentary record in its entirety because aspects of that record have been misplaced, mislabelled, damaged, lost, or destroyed. Part of the argument for digital archival methods is that, within this constraint, they provide much more access to the archive than analogue methods. Obviously, I agree with this argument to some extent: the amount of serial fiction I’ve been able to discover using automatic search and harvesting methods significantly expands our understanding of what was available for nineteenth-century Australian readers. However, at the same time, I worry that this perception of being able to access more blinds us to the ways in which digital methods create gaps between our perceived and achieved access. They do this, I argue, by multiplying the number of proxies or models involved in accessing the archive.
All archival research involves proxies or models. In traditional literary or book historical research, each print cultural document represents countless other print cultural documents (such as the many newspapers of the same title and date, the vast majority of which no longer exist) and is accessed via a particular collection model, such as a library card catalogue. But automatic search and retrieval methods introduce many additional proxies into these processes of representation and access. In the case of my project, the csv and full text data extracted from Trove are models for the OCR-rendered digitised newspaper pages collected by Trove; these digitised pages are, in turn, models of the newspapers—or of the microfiche models of those newspapers—held in library archives throughout Australia. These physical newspapers are, in turn, models or proxies – by virtue of their collection in the archive – for the many newspapers of the same name published on the same date, the vast majority of which no longer exist. Trove’s search interface not only provides a system for accessing these documentary proxies, but mediates such access via multiple other models. For instance, each search result or collection “view” “has its own home page, its own relevance ranking algorithm, and its own facets, influenced by the type of material included in the view”.
As the number of proxies or models multiplies, so too does the potential for any changes in these representational systems, or any biases or errors they introduce or perpetuate, to alter how the collection is accessed and represented. This is what I mean by digital research methods increasing the potential for a mismatch between our perceived and achieved access to the archive. Some changes or problems—such as the digitisation of additional documents, or biases in the search results because certain titles are not digitised—might be relatively easy to identify, and hence, to accommodate. The same cannot be said of the processes that construct, and frame our access to, the archive. In particular, the effects of OCR and search algorithms on search results can be difficult to perceive and rectify. Sampling can reveal specific ways in which these technologies misrepresent or omit aspects of the archive. But one can never be certain that all issues have been dealt with because, without these devices, our only access to the archive is manual, returning us to difficulties presented by its sheer size.
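The kind of sampling just described can at least quantify its own uncertainty. Here is a hypothetical sketch – not our actual procedure – of estimating the proportion of problem records (OCR-garbled, mis-segmented, not fiction at all) from a hand-checked random sample, with a confidence interval to make the residual uncertainty explicit:

```python
import math
import random


def sample_error_rate(records, check_fn, n=100, seed=0, z=1.96):
    """Estimate the proportion of problem records from a random sample.

    check_fn(record) should return True when a hand-check shows the
    record is wrong.  Returns the point estimate plus a
    normal-approximation 95% confidence interval.
    """
    rng = random.Random(seed)
    sample = rng.sample(records, min(n, len(records)))
    errors = sum(1 for r in sample if check_fn(r))
    p = errors / len(sample)
    half = z * math.sqrt(p * (1 - p) / len(sample))
    return p, max(0.0, p - half), min(1.0, p + half)
```

Even so, as the text above notes, such a check only measures the errors one knows how to look for; it cannot certify that the proxies have introduced no others.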
Emphasising provisionality in respect to digital projects is important because of the disruptive and obscuring potential of these multiple proxies; it is also important because of the rhetoric of objectivity and comprehensiveness that attends such research, where the very scale of information recovered, as well as the supposedly direct access to an underlying system provided by computers, encourage a perception that all has been revealed.
This, then, is the conundrum of digital archival research methods – at the same time they massively increase and enhance our understanding of the archive, they introduce gaps and uncertainties into this process of knowing. But these gaps and uncertainties are only an insurmountable problem if we exist in the fantasy world of data as a source of objectivity and truth. If we acknowledge, instead, that both data and the archive are constructed, value-laden, changeable, and indeterminate, we can move from the idea of “big data” providing us with the overall picture to “thick data” – highlighting multiple layers and the ongoing process of thickening our understanding and representation of them. Rigor and method are not less important from this perspective but more, in that we move forward not by uncovering the truth in some dramatic data-dump, but by gradually thickening what we know of the past and continually reflecting on how it is we know it.
So data as “thick” not “big,” and also as freely available, or published – which is my fourth epistemological principle. Although the most straightforward, it is also probably the most important, because in its absence all the other principles are undercut. Basically, as we reconceptualise “big data” research as digital archival research, we need to operate as archivists as well as researchers: that is, we need to make our representations of the archive freely available, for many reasons:
so that other scholars can see what it is that we’re actually analysing with our quantitative or visual models, including what layers of the archive we represent as well as leave out;
so that other scholars can engage with our arguments in the terms we make them, and also contest them in that way;
and so that other scholars can use our representations of the archive in different ways and/or to different ends, thus continuing the process of thickening our knowledge of the past, and contributing to an ongoing discussion based in – but not reducible to – data.
It is in these principles of digital archival research – based in theories of the archive rather than prevailing ideas about “big data” – that we can propose a model for data-led humanities research that conceptualises data as multiple, complex, provisional and perpetual rather than objective, totalising, comprehensive and true.