Analysis of Text as Data: Definitions and Disclaimers

Crystal Hall and Birgit Tautz

Many of the essays in this volume engage with counting, connecting, and mapping words from pages via computational methods. Here we present a brief description of what is happening "under the hood" of these methods, moving from a written or printed word to its digital representation on the screen to its eventual analysis. These methods wrestle with obstacles in text recognition that will surely strike some readers as unexpected, while relying on ordering principles (that is, metadata) imported into the texts from elsewhere.

The first section will engage directly with a chief limitation of computational research that relies on material composed before the 1900s: the quality of page scans for creating accurate digital texts. An ancillary problem lies in the prior digitization and file-storage processes of the archives and repositories that house these materials. While this pertains primarily to the essay by Tautz and the interview with Erlin and Walsh, it has implications for any digital humanities research that relies on existing scans, that is, on images rather than text files.

From Print to Screen: Hurdles of Full-Text Search

The chief hurdles to full-text search are not imposed, as one might assume, by old German script in the Zeitschriften der Aufklärung (Journals of the Enlightenment) collection. Rather, they were erected in the process of digitizing the journals as JPEG files. When the physical pages were photographed or scanned, decisions were made about how to store the resulting files, and those storage decisions have consequences for the methods available to scholars to query and study the repository. To be clear, the PDF digitization of the journals reveals itself to be an assemblage of individual page JPEGs packaged as image-only PDFs. Unlike conventional PDFs, which contain convertible text in which each text character is referenced to its image counterpart, these files are and remain images. They are not searchable and must be converted into searchable text files (.txt files).
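
Whether a downloaded file actually contains a text layer can be checked programmatically. The following is a minimal sketch in R, assuming the pdftools package and a hypothetical local file name ("volume.pdf"); it illustrates the general check, not the project's own code.

    # Minimal sketch, assuming the pdftools package; "volume.pdf" is a
    # hypothetical file name standing in for a downloaded journal volume.
    library(pdftools)

    # pdf_text() returns one character string per page; for image-only PDFs
    # these strings are empty or nearly so, confirming there is no text layer.
    pages <- pdf_text("volume.pdf")
    cat("Pages with an embedded text layer:", sum(nchar(trimws(pages)) > 0), "\n")

    # Render each page to a PNG image so it can be handed to OCR (next step).
    pngs <- pdf_convert("volume.pdf", format = "png", dpi = 300)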

Similarly, the Journals of the Enlightenment present the researcher with another peculiar choice. Since the documents retain the boundaries of the image (i.e., the page from an issue in a volume), individual articles cannot be downloaded as a single file. One must download articles page by page or download a single volume of an individual journal in its entirety, leaving scholars with a bulk of text not needed to answer the research question. Both processes happen at a very slow speed,[1] and neither yields the penultimate step for computational analysis such as topic modeling (Tautz) or co-mention analysis (Erlin and Walsh). A fairly recent feature, added to make the database appear more user-friendly, is “download the PDF.” Still, this is only possible page by page while “opening” the section of the volume – or, in the case of multivolume journal issues, the volume – in which the essay appears. This “improvement” adds visual confusion.

Subsequently, in order to make the image files readable, a process called Optical Character Recognition (OCR) must be employed to extract the journals’ texts. This step involves adapting machine learning to identify text characters in the patterns of pixels in the images, followed by manual cleaning of illegible passages, all in order to convert the image-only PDFs, finally, into fully searchable text files. (A list of sources identified on the K/cosmopolit project website allows readers to experiment with adaptation and replication.)
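
As a concrete illustration of the OCR step, the following sketch uses the tesseract package in R; the file names are hypothetical, and journals set in Fraktur may require a dedicated trained model, an assumption to verify against the K/cosmopolit sources.

    # Minimal sketch, assuming the tesseract package and the PNG page images
    # rendered from the image-only PDFs; file names are hypothetical.
    library(tesseract)

    # Download trained data for German; Fraktur typefaces may need a separate
    # model, which is an assumption to check for a given journal.
    tesseract_download("deu")
    deu <- tesseract("deu")

    # Run OCR on every page image and write the result to a plain-text file,
    # which can then be cleaned manually and searched.
    pages <- list.files(pattern = "\\.png$")
    text  <- vapply(pages, function(p) ocr(p, engine = deu), character(1))
    writeLines(text, "volume.txt")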

Text as a Network: Nodes, Edges, and Library of Congress Subject Headings

Several essays in this volume engage with connections between words, represented or conceptualized as networks (Erlin and Walsh, Hall, Höyng). Therefore, we provide here a corresponding overview of network analysis terminology. It takes as its example a curious pamphlet that was part of James Bowdoin III's personal library collection donated to Bowdoin College. The focus here is on metadata about the book, not its contents. The metadata, much like that used by Erlin and Walsh, has been transferred into digital databases. It represents the material bibliographic information about the texts (title, author, date, container, place of publication, pages) as well as interpretative layers added by librarians and scholars (i.e., subject headings).

The Core of a Network: An Example

While some subjects could be gleaned from a title (see Erlin and Walsh), the pamphlet in question offers a way to see the value added to an analysis if we include the Library of Congress data as part of the research. (Hall speaks about the limitations and biases in this data in her essay.) Bartholomew Ruspini's 8-page pamphlet bears the curious title Mr. Ruspini, earnestly recommends the following short observations to the perusal of the nobility, gentry, and others, but particularly to parents, and to those who have the care of young persons (Pall Mall, 1785).

Supplemental Figure 1. Title page of Bartholomew Ruspini's short treatise. Courtesy of the George J. Mitchell Department of Special Collections & Archives, Bowdoin College Library, Brunswick, Maine.

Ruspini uses his text to advocate for the regular intervention of a dentist and constant oral hygiene practices at home as part of a pitch to sell his dentifrice (toothpaste) and tincture for the gums. Accordingly, the Library of Congress subject headings include the following: "Teeth -- Care and hygiene -- Early works to 1800"; "Dental hygiene -- Early works to 1800"; and "Dentifrices -- Early works to 1800." In Hall's essay Ruspini was presented as evidence of the varied subject matter from Italian-born authors in Bowdoin's collection, and the headings would support that, although they are not part of the title. Ruspini's is the only work in the Bowdoin collection from the relatively young field of dentistry.

The pamphlet is at the center of a metaphorical molecule of ideas, or the center of a network, built from the connections among each of the ideas in the headings: teeth, care and hygiene, dental hygiene, and dentifrices. The subheading "Early works to 1800" is associated with nearly every volume in the collection and is therefore omitted from analysis, along with "17th century," "18th century," and "19th century." To study the network quantitatively, we can document, visualize, and analyze the connections among ideas that occur in the same book.

Supplemental Figure 2. Representation of the subject headings for Ruspini's pamphlet as a network of interconnected subjects.

The circles in Supplemental Figure 2 are the nodes, each representing a unique idea, including teeth, dental hygiene, and dentifrices. The lines are the edges that connect these nodes, indicating that they co-appear in at least one book.
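
The same structure can be expressed in a few lines of code. The sketch below, assuming the igraph package in R, builds the co-occurrence network for just this one record; the heading strings are taken from the Ruspini entry, with the omitted subheadings dropped.

    # Minimal sketch, assuming the igraph package; headings are copied from
    # the Ruspini record, without the "Early works to 1800" subheading.
    library(igraph)

    headings <- c("Teeth -- Care and hygiene", "Dental hygiene", "Dentifrices")

    # Split the compound heading into its component ideas (the nodes).
    nodes <- unique(unlist(strsplit(headings, " -- ")))

    # Connect every pair of ideas that co-appear in the same book (the edges).
    edges <- t(combn(nodes, 2))
    g <- graph_from_edgelist(edges, directed = FALSE)

    plot(g)  # roughly reproduces the shape of Supplemental Figure 2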

Clearly, this method overlooks aspects of the tone and content of the text, as well as the identity of the author. For example, when remarking that the English are more at risk for tooth decay, Ruspini points out, with no small amount of patriotic bias: "In France and Italy, where the Teeth of the Natives are in general perfect and durable, it may be observed, that very great Attention is given to them in the early Stages of Life, but principally during the whole Time of the Second Dentition."[2] Close reading captures what the material aspects and subject headings do not: Ruspini's own posturing as an Italian surgeon living abroad trying to convince his neighbors to employ his services. (He lived on the same street in London as Bowdoin's sister, Elizabeth Temple.) The material features put Ruspini into the print context in which he would have been received, one in which few Italians had authoritative voices, where there was scattered topical interest outside French and English, and where people often reported their cultural observations to their home audience. The Library of Congress subject headings are an additional tool for identifying pathways of exploration, but not exhaustive.

Text as Topics: Concepts and Challenges

Topic modeling looks for groups of words that travel together in full texts of documents, short passages (Tautz), or even just titles (Hall). This requires counting all of the words that appear in the passage or title, irrespective of their order, in what is referred to as a bag-of-words model. Each bag of words is compared to the others in the corpus and a model is built. Importantly, the scholar must determine how many topics should be in the model: too few and the resulting model is too general; too many and the model is overfit and likely filled with nonsensical topics. In Supplemental Figure 3 the image and the boxes below it represent the relationship between Ruspini's text and the corpus as a whole.

Supplemental Figure 3. Conceptual illustration of building a topic model from words that appear in the title of Ruspini's pamphlet.

In the example of the pamphlet on dentistry, the model would take into account that in the collection overall the words nobility and gentry frequently occur together, but that in some titles observations appears with state, short with treatise, young with students or letters, and so on. In Supplemental Figure 3, green is used to show words that are typically excluded from analysis given their frequency and lack of semantic value. As highlighted in Supplemental Figure 3, the output is a list of words for each topic, which are the terms most likely to indicate that a topic is present in the texts under consideration. Thus, if a title uses nobility it might, but does not necessarily, use elite. The analyst then infers a label from the words in the topic and would say that a title with those words represents a topic on social status.
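
The counting and exclusion steps can be made concrete with a small sketch. The one below uses the tm package in R on two invented title fragments (illustrations only, not titles from the Bowdoin collection); it lowercases the text, drops common stop words, and produces the document-term counts from which a bag-of-words model starts.

    # Minimal sketch, assuming the tm package; the two "documents" are
    # invented title fragments used only for illustration.
    library(tm)

    titles <- c("short observations recommended to the nobility and gentry",
                "observations on the state of the nobility")

    corpus <- VCorpus(VectorSource(titles))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop low-value words

    # Each row is a document, each column a word, each cell a count;
    # word order is discarded, which is what "bag of words" means.
    dtm <- DocumentTermMatrix(corpus)
    inspect(dtm)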

Topic Model Challenges

Digital Humanities scholars frequently seem to speak in metaphors whenever they find themselves in dialog with the traditionalists; as they liken the underlying processes of topic modeling to an intuitive grabbing of words and arranging them in bags, they borrow from mathematicians the expressions for simplifying a model. In any case, analysts provide the labels in congruity with the semantic fields a conventional reader may discern when consuming the texts through “close reading”; in fact, labeling follows the steps of such a designating process. Not surprisingly, this process has informed calling the approach a topic model. David Blei has detailed the methods and workings of building the model and the genesis of the technique over the past 15 years or so.[3] While the open-source Voyant Tools (https://voyant-tools.org/) allow for preliminary text analysis tasks such as word frequency lists and frequency distribution plots to describe vocabulary usage in the documents of a corpus, Tautz (with the assistance of Quyen Ha) employed Blei's more sophisticated approach to model vocabulary usage at the corpus level with latent Dirichlet allocation (LDA). The difference is that word frequency and collocation can show specific instances of usage, whereas topic modeling suggests overall patterns of usage. They implemented this algorithm in R and visualized the results using an R package called LDAvis, which extracts information from a fitted LDA topic model to build an interactive web-based visualization.[4]
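
For readers who want to see the shape of such an analysis, the following sketch fits an LDA model with the topicmodels package and prepares an LDAvis visualization; it assumes a document-term matrix dtm built as in the bag-of-words sketch above, and it is a generic illustration under those assumptions, not the project's own script.

    # Minimal sketch, assuming the topicmodels, LDAvis, and slam packages and
    # a document-term matrix `dtm` as built above; parameters are illustrative.
    library(topicmodels)
    library(LDAvis)

    k   <- 4                                  # number of topics, chosen by the analyst
    lda <- LDA(dtm, k = k, method = "Gibbs", control = list(seed = 1789))

    terms(lda, 10)                            # ten most probable words per topic

    # LDAvis needs the word-topic and document-topic distributions plus
    # document lengths, the vocabulary, and corpus-wide term frequencies.
    post <- posterior(lda)
    json <- createJSON(phi            = post$terms,
                       theta          = post$topics,
                       doc.length     = slam::row_sums(dtm),
                       vocab          = colnames(dtm),
                       term.frequency = slam::col_sums(dtm))
    serVis(json)                              # opens the interactive bubble view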

Unlike in Hall’s analysis of titles in Bowdoin's collection, the K/cosmopolit project uses an HTML representation of topics as circles or bubbles of different sizes. The dynamic bubbles stand for the topics: Tautz and Ha offer links to 3- and 4-topic models for the entire corpus grabbed by k/cosmopolit* searches from the Enlightenment Journals. The bubbles visualize the topics that the LDA algorithm identifies in a given corpus and the distance from one topic to another (distance here is defined as the degree of similarity between different topics, calculated from the probability distributions of words). The graphic display of the 30 most salient terms refers to the entire corpus, whereby only the length of a bar is relevant to interpretation, not its vertical placement in the diagram. Hovering over or clicking on a circle will display the 30 most relevant terms for the selected topic. The red bars represent the frequency of a term in a given topic, and the overlapping blue bars represent that term’s frequency across the entire corpus.

  1. While one needs to break down the corpus into individual pages (or even smaller parts) through “chunking” for analysis, in order to interpret the data for our purposes and vis-à-vis close reading, we need to know the source, title, author, year, and other metadata for each piece of writing. Treating all the individual JPEGs as part of one large text and chunking it is therefore not practical; it also remains to be seen whether the one-page unit is the proper chunk size. We thank Quyen Ha for preparing and visualizing the data for Tautz's essay in this volume. ↑

  2. Ruspini, p. 1 ↑

  3. See, first and foundational to the present essay, David M. Blei, Andrew Ng, and Michael Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research 3 (2003): 993-1022. ↑

  4. See, in addition to the sources identified through the link above, Matthew Jockers’ comprehensive book Text Analysis with R for Students of Literature (Berlin: Springer, 2014); a helpful author-narrated introduction can be found at http://www.matthewjockers.net/text-analysis-with-r-for-students-of-literature/. ↑
