Graphs Based On Google Books’s Ngram Data | Appendix 2: Constituting the graphs based on Google Books’s Ngram data | Liverpool University Press Digital Collaboration Hub

Constituting the Graphs based on Google Books’ Ngram Data

Figures 2 and 13 were created using the raw data behind Google Books’ Ngram Viewer.^[1] An ‘Ngram’ is a consecutive sequence of case-sensitive blocks of text, with ‘N’ representing the number of blocks in a given series. For example, ‘littérature’ is a 1-gram, ‘belles lettres’ is a 2-gram and ‘belles-lettres’ is a 3-gram (since a hyphen is counted as its own block), and so on. At the time of writing, Ngram Viewer is a free, publicly available search engine that allows users to look for Ngrams (up to 5-grams) in a large corpus of printed books dating from 1500 onwards. This corpus is a subset of the books digitised as part of the Google Books project, for which Google initially scanned and digitised some 15 million books held in over forty different university libraries (with additional items contributed by publishers). These books were then run through optical character recognition (OCR) software, enabling each book to be converted into digital text that can be read by a machine. From this corpus of over 15 million books, the Google Ngram team originally selected just over 5 million to form the first Ngram corpus, which was constituted in 2009. These 5 million items were drawn from seven languages (Chinese, English, French, German, Hebrew, Russian, Spanish). The engineers and scholars who created the Ngram tool state that these books were chosen from the full Google Books corpus ‘on the basis of the quality of their OCR and metadata’, and that periodicals are excluded from the Ngram corpus.^[2] Ngram Viewer was launched online in 2010.
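
The block-counting rule illustrated by ‘littérature’, ‘belles lettres’ and ‘belles-lettres’ can be sketched in a few lines of Python. This is a deliberately simplified approximation, not Google’s actual tokeniser, whose rules are more elaborate: the function names and the regular expression are assumptions made for the example. The sketch treats each run of letters or digits as one block and each punctuation mark (such as a hyphen) as a block of its own.

```python
import re


def split_into_blocks(text: str) -> list[str]:
    """Split text into blocks: runs of word characters, or single punctuation marks.

    Assumption: this regular expression is a simplified stand-in for the
    tokenisation applied to the Ngram corpus.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def ngram_order(text: str) -> int:
    """Return N, the number of blocks in the given string."""
    return len(split_into_blocks(text))


for phrase in ("littérature", "belles lettres", "belles-lettres"):
    print(phrase, split_into_blocks(phrase), ngram_order(phrase))
```

Under this rule the hyphen in ‘belles-lettres’ is returned as a separate block, which is why the hyphenated form counts as a 3-gram while the unhyphenated ‘belles lettres’ counts as a 2-gram.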

In 2012, the Ngram corpus was expanded and amended. A further language was added (Italian); the OCR software was improved, which increased the accuracy of word recognition; and the total corpus size increased to over 8 million books. Google claimed that this represented around 6 per cent of all books ever published.^[3] Although Google has not yet published a paper detailing the changes made to its 2020 corpus, a quick comparison shows that its French-language corpus has increased in size from a total of almost 800,000 volumes (in 2012) to more than 3 million (in 2020). In 2012, these texts came from the Bibliothèque municipale de Lyon, the Bibliothèque universitaire de Lausanne and (primarily) US university libraries. I have used the French-language corpus of the 2020 edition for this study, and I consider only those items published between 1650 and 1900 inclusive. This subset of the French-language corpus comprises 1,286,124 volumes, over 725 million pages and almost 140 billion 1-grams. For the years 1750 to 1800, the subset totals 184,864 volumes.

Since the launch of Ngram Viewer, dozens of scholarly articles have used its data to help shed light on cultural changes.^[4] However, many scholars and lay commentators have also noted the limitations of this software, and of the corpora on which it is based. Several of these are noted in the introduction.^[5] Others include the fact that the Ngram corpus sets out to include only one of each book. This means that Ngram Viewer gives users no sense of whether a text (and the words within it) was frequently republished and likely to have been read by many people, or whether it had a small print run and could only have been seen by a few.^[6] In sum, Ngram weights all uses of a word equally, even if one received much more publicity than another.

The best-known criticism of Ngram Viewer relates to OCR errors in the data. Although Google’s OCR software has been much improved for recent editions of the Ngram corpus, errors do of course remain. In my search of Ngram’s raw data, for instance, I came across ‘betles-lettres’, in which an ‘l’ has been misread as a ‘t’, and ‘littérature.’, in which a full stop has been attached to the word. Because of these OCR errors, such Ngrams are not returned in searches for belles-lettres or littérature made using the standard Ngram Viewer. On the basis of its limitations, Pechenick, Danforth and Dodds conclude that Google’s Ngram data ‘warrants a very cautious approach to any effort to extract scientifically meaningful results’.^[7]

To reduce data inaccuracy, then, I downloaded the raw data files (publicly available in comma-separated values, or CSV, format), and searched for non-standard spellings and capitalisation and for OCR errors for the words belles-lettres and littérature. I compiled a database of the results, which records the annual frequency of each Ngram. I then retained only those Ngrams that occur one hundred times or more across the date range. I also omitted foreign-language words (such as ‘literature’), as well as Ngrams that might plausibly have been intended as a different word (for instance, littératur, which could have been a misrecognition of littérateur just as much as littérature). I likewise excluded instances where a number is attached to a word (such as belles-lettres2), which are likely to be footnote misrecognitions. This left me with forty-one variants of belles-lettres and fifty-three variants of littérature, as listed below.^[8] I totalled the frequency of all variants to arrive at one sum per year, for each of belles-lettres and littérature. Finally, I calculated these yearly totals as a proportion of the number of pages within the whole corpus, each year.^[9] The result provides the relative use of these words within the Ngram French-language corpus, shown in Figures 2 and 13. Below are the raw figures for each variant:

No.	Ngram	Total count (1650-1900)	Year in which annual count first reaches 200 or more (where applicable)
1	belles-lettres	664,288	1751
2	Belles-Lettres	231,148	1734
3	belleslettres	85,565	1785
4	belles lettres	85,515	1691
5	BellesLettres	30,135	1802
6	BELLES-LETTRES	26,747	1836
7	Belles-lettres	18,618	1824
8	Belles Lettres	16,942	1722
9	belles Lettres	14,699	1701
10	BELLEs-LETTREs	3,756	1833
11	Belles lettres	2,978	1885
12	BELLES LETTRES	2,791
13	BELLES-LETTREs	2,335
14	Belleslettres	2,004
15	BELLESLETTRES	1,167
16	belles-lettre	877
17	belles-Lettres	832
18	belles-leltres	649
19	Belles-Lettre	549
20	belles-letlres	505
21	BELLEs LETTREs	486
22	Belles-Letres	470
23	Belles-Leltres	386
24	belles-lellres	349
25	bettes-lettres	346
26	BELLEsLETTREs	328
27	belles-lettrés	280
28	bellesleltres	267
29	Belles-Letlres	257
30	Belles-Lellres	241
31	bellesletlres	239
32	Belles-Lettres_	231
33	bellesLettres	189
34	BELLESLETTREs	155
35	belles-lettres_	147
36	betles-lettres	143
37	belles-letires	142
38	BELLEsLETTRES	112
39	Betles-Lettres	112
40	belles lettre	105
41	belles-tettres	105

No.	Ngram	Total count (1650-1900)	Year in which annual count first reaches 200 or more (where applicable)
1	littérature	5,260,362	1733
2	Littérature	349,036	1730
3	littératures	274,931	1811
4	LITTÉRATURE	213,126	1798
5	litterature	18,877	1800
6	LITTERATURE	14,346	1829
7	Littératures	7,807	1879
8	Litterature	7,633
9	LITTÉRATURES	4,327
10	litérature	4,162
11	lalittérature	3,537
12	littéralure	2,310
13	litlérature	1,934
14	littératurc	1,729
15	lillérature	1,694
16	Littérature.	1,441	1822
17	liltérature	1,394
18	Litérature	1,093
19	litte'rature	993
20	littératnre	842
21	littérarature	663
22	LITTÉRATUREs	537
23	litteratures	529
24	littératuré	513
25	lilléralure	509
26	Littérature_	492
27	litléralure	404
28	LITTERATURES	375
29	littératuro	343
30	Litlérature	314
31	liltéralure	301
32	littérature.	296
33	littératuie	293
34	Littéralure	278
35	littérarure	272
36	littéraiure	256
37	littcrature	256
38	littératare	240
39	littératre	205
40	la'littérature	202
41	LITTÉRATURE.	199
42	LITTÉRATuRE	196
43	Lilléralure	187
44	Littératurc	184
45	llttérature	150
46	littératture	147
47	littératurè	134
48	Litte'rature	125
49	littératute	116
50	littératute	116
51	Litteratures	108
52	litlerature	108
53	littéralures	102
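
The filtering and aggregation steps described before the tables can be sketched in Python. This is an illustrative reconstruction, not the script used for the study. The record layout (Ngram, year, count), the function names, the pattern used to detect grammatical tags and all of the sample numbers are assumptions invented for the example. The manual exclusions (foreign-language words and ambiguous forms such as littératur) are represented by a hand-supplied set, since they rest on judgement rather than on a rule.

```python
import re
from collections import defaultdict

START_YEAR, END_YEAR = 1650, 1900   # date range considered, inclusive
MIN_TOTAL = 100                     # keep variants with 100+ occurrences overall


def in_range(year):
    return START_YEAR <= year <= END_YEAR


def select_variants(records, manual_exclusions=frozenset()):
    """Return the set of Ngram variants that pass the filters.

    records: iterable of (ngram, year, count) tuples (hypothetical layout).
    """
    totals = defaultdict(int)
    for ngram, year, count in records:
        if in_range(year):
            totals[ngram] += count

    kept = set()
    for ngram, total in totals.items():
        if total < MIN_TOTAL:
            continue                      # too rare across the date range
        if re.search(r"\d", ngram):
            continue                      # number attached, e.g. belles-lettres2
        if re.search(r"_[A-Z]+$", ngram):
            continue                      # assumed tag form, e.g. run_NOUN
        if ngram in manual_exclusions:
            continue                      # foreign or ambiguous, judged by hand
        kept.add(ngram)
    return kept


def relative_use(records, variants, pages_by_year):
    """Sum all variants per year, then divide by that year's page count."""
    yearly = defaultdict(int)
    for ngram, year, count in records:
        if ngram in variants and in_range(year):
            yearly[year] += count
    return {
        year: yearly[year] / pages_by_year[year]
        for year in sorted(yearly)
        if pages_by_year.get(year)        # skip years with no page total
    }


# Invented sample data, for illustration only.
sample_records = [
    ("belles-lettres", 1751, 150),
    ("belles-lettres", 1752, 50),
    ("Belles-Lettres", 1751, 120),
    ("belles-lettres2", 1751, 300),       # excluded: digit attached
    ("belles-lettres_NOUN", 1751, 150),   # excluded: tagged form
    ("betles-lettres", 1751, 40),         # excluded: fewer than 100 in total
    ("belles-lettres", 1901, 500),        # ignored: outside the date range
]
sample_pages = {1751: 1000, 1752: 500}

variants = select_variants(sample_records)
result = relative_use(sample_records, variants, sample_pages)
print(sorted(variants))
print(result)
```

Dividing by pages rather than by the number of 1-grams mirrors the choice explained in note 9: pages provide a denominator that is common to 1-grams, 2-grams and 3-grams alike.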

[1] Google Books Ngram Viewer, https://books.google.com/ngrams (last accessed 8 September 2022). The raw data can be downloaded at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html.

[2] See Michel et al., ‘Quantitative’, p.176. Google confirms that the updated, 2012 corpus was selected according to these same criteria. See Lin et al., ‘Syntactic’, p.170.

[3] The updates are explained in Lin et al., ‘Syntactic’, p.169.

[4] Nadja Younes and Ulf-Dietrich Reips offer an overview of such studies in ‘Guideline for improving the reliability of Google Ngram studies: evidence from religious terms’, PLoS ONE 14 (2019), p.1-17 (2).

[5] See p.12-14.

[6] Ngram’s inability to account for a book’s popularity is discussed in Eitan Adam Pechenick, Christopher M. Danforth and Peter Sheridan Dodds, ‘Characterizing the Google Books corpus: strong limits to inferences of socio-cultural and linguistic evolution’, PLoS ONE 10 (2015), p.1-24 (2). The authors also note that despite Google’s claims to only include one copy of each book, ‘new editions and reprints allow some books to appear more than once’ (p.2).

[7] Pechenick, Danforth and Dodds, ‘Characterizing’, p.23.

[8] One of Google’s updates to the 2012 iteration of Ngram was to tag each word according to grammatical function. For example, one finds ‘run’, ‘run_NOUN’ and ‘run_VERB’. I retain only untagged words, which comprise all uses of a term. For more on Google’s tagging, see Lin et al., ‘Syntactic’, p.171-72.

[9] The online Ngram Viewer only allows users to view Ngrams as a proportion of the total 1-grams in a given year. This would have been slightly inaccurate for my study, since I look not just at 1-grams, but also at 2-grams (‘belles lettres’) and 3-grams (‘belles-lettres’). The number of pages within the corpus represents, therefore, a common denominator, which can be used to calculate proportionate values for the different Ngrams I consider.
