Notes
Constituting the Graphs based on Google Books’ Ngram Data
Figures 2 and 13 were created using the raw data behind Google Books’ Ngram Viewer.[1] An ‘Ngram’ is a consecutive sequence of case-sensitive blocks of text, with ‘N’ representing the number of blocks in a given series. For example, ‘littérature’ is a 1-gram, ‘belles lettres’ is a 2-gram, ‘belles-lettres’ is a 3-gram (since a hyphen is counted as its own block) and so on. At the time of writing, Ngram Viewer is a free, publicly available search engine that allows users to look for Ngrams (up to a 5-gram) in a large corpus of printed books dating from 1500. This corpus is a subset of the books digitised as part of the Google Books project, for which Google initially scanned and digitised some 15 million books held in over forty different university libraries (with additional items contributed by publishers). These books were then run through optical character recognition (OCR) software, enabling each book to be converted into digital text that can be read by a machine. From this corpus of over 15 million books, the Google Ngram team originally selected just over 5 million to form the first Ngram corpus, which was constituted in 2009. These 5 million items were drawn from seven languages (Chinese, English, French, German, Hebrew, Russian, Spanish). The engineers and scholars who created the Ngram tool state that these books were chosen from the full Google Books corpus ‘on the basis of the quality of their OCR and metadata’, and that periodicals are excluded from the Ngram corpus.[2] Ngram Viewer was launched online in 2010.
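The way blocks are counted in the examples above can be illustrated with a short sketch. It is a simplification for illustration only and is not Google’s tokeniser: the function name `count_blocks` is hypothetical, and the only rule assumed is the one stated above, namely that blocks are separated by spaces and that a hyphen counts as a block of its own.

```python
import re


def count_blocks(phrase: str) -> int:
    """Return the 'N' of an Ngram under a simplified counting rule.

    Assumption (not Google's actual tokeniser): blocks are separated by
    whitespace, and each hyphen is counted as a separate block.
    """
    # Match either a single hyphen or a run of characters that are
    # neither whitespace nor hyphens.
    blocks = re.findall(r"-|[^\s-]+", phrase)
    return len(blocks)


for example in ("littérature", "belles lettres", "belles-lettres"):
    print(example, count_blocks(example))
```

Under this rule the three examples given above count as a 1-gram, a 2-gram and a 3-gram respectively.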
In 2012, the Ngram corpus was expanded and amended. A further language was added (Italian); the OCR software was improved, which increased accuracy of word recognition; and the total corpus size increased to over 8 million books. Google claimed that this represented around 6 per cent of all books ever published.[3] Although Google has not yet published a paper detailing the changes made to its 2020 corpus, a quick comparison shows that its French-language corpus has increased in size from a total of almost 800,000 volumes (in 2012) to more than 3 million. In 2012, these texts came from the [p.265] Bibliothèque municipale de Lyon, the Bibliothèque universitaire de Lausanne and (primarily) US university libraries. I have used the French-language corpus of the 2020 edition for this study. I consider those items published between 1650 and 1900 inclusive. This subset of the French-language corpus comprises 1,286,124 volumes, over 725 million pages and almost 140 billion 1-grams. Between 1750 and 1800, the subset totals 184,864 volumes.
Since the launch of Ngram Viewer, dozens of scholarly articles have used its data to help shed light on cultural changes.[4] However, many scholars and lay commentators have also noted the limitations of this software, and of the corpora on which it is based. Several of these are noted in the introduction.[5] Others include the fact that the Ngram corpus sets out to include only one copy of each book. This means that Ngram Viewer gives users no sense of whether a text (and the words within it) was frequently republished and likely to have been read by many people, or whether it had a small print run and could only have been seen by a few.[6] In sum, Ngram Viewer weights all uses of a word equally, even if one received much more publicity than another.
The best-known criticism of Ngram Viewer relates to OCR errors in the data. Although Google’s OCR software has been much improved for recent editions of the Ngram corpus, errors do of course remain. In my search of Ngram’s raw data, for instance, I came across ‘betles-lettres’ and ‘littérature.’, in which an ‘l’ has been misread as a ‘t’ and a full stop has been attached to the word respectively. Such OCR errors mean that these Ngrams are not returned in searches for belles-lettres or littérature made using the standard Ngram Viewer. On the basis of such limitations, Pechenick, Danforth and Dodds conclude that Google’s Ngram data ‘warrants a very cautious approach to any effort to extract scientifically meaningful results’.[7]
To reduce data inaccuracy, then, I downloaded the raw data files (publicly available in comma-separated value or CSV format), and searched them for non-standard spellings and capitalisations, and for OCR [p.266] errors, of the words belles-lettres and littérature. I compiled a database of the results, which records the annual frequency of each Ngram. I then retained only those Ngrams that occur one hundred times or more across the date range. I also omitted foreign-language words (such as ‘literature’), as well as Ngrams that might plausibly have been intended as a different word (for instance, littératur, which could have been a misrecognition of littérateur just as much as littérature). I likewise excluded instances where a number is attached to a word (such as belles-lettres2), which is likely to be a footnote misrecognition. This left me with forty-one variants of belles-lettres and fifty-three variants of littérature, as listed below.[8] I totalled the frequency of all variants to arrive at one sum per year, for each of belles-lettres and littérature. Finally, I calculated these yearly totals as a proportion of the number of pages within the whole corpus, each year.[9] The result provides the relative use of these words within the Ngram French-language corpus, shown in Figures 2 and 13. Below are the raw figures for each variant:
No. | Ngram | Total count (1650-1900) | Year in which annual count first reaches 200 or more (where applicable) |
1 | belles-lettres | 664,288 | 1751 |
2 | Belles-Lettres | 231,148 | 1734 |
3 | belleslettres | 85,565 | 1785 |
4 | belles lettres | 85,515 | 1691 |
5 | BellesLettres | 30,135 | 1802 |
6 | BELLES-LETTRES | 26,747 | 1836 |
7 | Belles-lettres | 18,618 | 1824 |
8 | Belles Lettres | 16,942 | 1722 |
9 | belles Lettres | 14,699 | 1701 |
[p.267] | |||
10 | BELLEs-LETTREs | 3,756 | 1833 |
11 | Belles lettres | 2,978 | 1885 |
12 | BELLES LETTRES | 2,791 | |
13 | BELLES-LETTREs | 2,335 | |
14 | Belleslettres | 2,004 | |
15 | BELLESLETTRES | 1,167 | |
16 | belles-lettre | 877 | |
17 | belles-Lettres | 832 | |
18 | belles-leltres | 649 | |
19 | Belles-Lettre | 549 | |
20 | belles-letlres | 505 | |
21 | BELLEs LETTREs | 486 | |
22 | Belles-Letres | 470 | |
23 | Belles-Leltres | 386 | |
24 | belles-lellres | 349 | |
25 | bettes-lettres | 346 | |
26 | BELLEsLETTREs | 328 | |
27 | belles-lettrés | 280 | |
28 | bellesleltres | 267 | |
29 | Belles-Letlres | 257 | |
30 | Belles-Lellres | 241 | |
31 | bellesletlres | 239 | |
32 | Belles-Lettres_ | 231 | |
33 | bellesLettres | 189 | |
34 | BELLESLETTREs | 155 | |
35 | belles-lettres_ | 147 | |
36 | betles-lettres | 143 | |
37 | belles-letires | 142 | |
38 | BELLEsLETTRES | 112 | |
39 | Betles-Lettres | 112 | |
40 | belles lettre | 105 | |
41 | belles-tettres | 105 | |
[p.268]
No. | Ngram | Total count (1650-1900) | Year in which annual count first reaches 200 or more (where applicable) |
1 | littérature | 5,260,362 | 1733 |
2 | Littérature | 349,036 | 1730 |
3 | littératures | 274,931 | 1811 |
4 | LITTÉRATURE | 213,126 | 1798 |
5 | litterature | 18,877 | 1800 |
6 | LITTERATURE | 14,346 | 1829 |
7 | Littératures | 7,807 | 1879 |
8 | Litterature | 7,633 | |
9 | LITTÉRATURES | 4,327 | |
10 | litérature | 4,162 | |
11 | lalittérature | 3,537 | |
12 | littéralure | 2,310 | |
13 | litlérature | 1,934 | |
14 | littératurc | 1,729 | |
15 | lillérature | 1,694 | |
16 | Littérature. | 1,441 | 1822 |
17 | liltérature | 1,394 | |
18 | Litérature | 1,093 | |
19 | litte'rature | 993 | |
20 | littératnre | 842 | |
21 | littérarature | 663 | |
22 | LITTÉRATUREs | 537 | |
23 | litteratures | 529 | |
24 | littératuré | 513 | |
25 | lilléralure | 509 | |
26 | Littérature_ | 492 | |
27 | litléralure | 404 | |
28 | LITTERATURES | 375 | |
29 | littératuro | 343 | |
30 | Litlérature | 314 | |
31 | liltéralure | 301 | |
[p.269] | |||
32 | littérature. | 296 | |
33 | littératuie | 293 | |
34 | Littéralure | 278 | |
35 | littérarure | 272 | |
36 | littéraiure | 256 | |
37 | littcrature | 256 | |
38 | littératare | 240 | |
39 | littératre | 205 | |
40 | la'littérature | 202 | |
41 | LITTÉRATURE. | 199 | |
42 | LITTÉRATuRE | 196 | |
43 | Lilléralure | 187 | |
44 | Littératurc | 184 | |
45 | llttérature | 150 | |
46 | littératture | 147 | |
47 | littératurè | 134 | |
48 | Litte'rature | 125 | |
49 | littératute | 116 | |
50 | littératute | 116 | |
51 | Litteratures | 108 | |
52 | litlerature | 108 | |
53 | littéralures | 102 | |
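The aggregation steps described before the tables (restricting to 1650-1900, retaining only variants that occur one hundred times or more, and summing all retained variants into one figure per year) can be sketched as follows. This is a minimal illustration in Python and not a record of the procedure actually used for this study. The function name `yearly_totals`, the `(ngram, year, count)` input format and the `excluded` parameter are assumptions made for the example. The manual judgements described above, such as setting aside foreign-language words or forms with a number attached, are represented here simply as a set of excluded strings.

```python
from collections import defaultdict


def yearly_totals(rows, start=1650, end=1900, min_total=100,
                  excluded=frozenset()):
    """Sum the counts of all retained variants into one total per year.

    rows: iterable of (ngram, year, count) tuples for candidate variants
    (a hypothetical input format, assumed for this sketch).
    """
    # Annual counts per variant, restricted to the date range.
    per_variant = defaultdict(lambda: defaultdict(int))
    for ngram, year, count in rows:
        if start <= year <= end and ngram not in excluded:
            per_variant[ngram][year] += count

    # Retain only variants that reach the threshold across the whole
    # date range, then add their annual counts together.
    totals = defaultdict(int)
    for by_year in per_variant.values():
        if sum(by_year.values()) >= min_total:
            for year, count in by_year.items():
                totals[year] += count
    return dict(totals)


# Invented figures, for illustration only.
sample_rows = [
    ("belles-lettres", 1750, 120),
    ("Belles-Lettres", 1750, 30),
    ("Belles-Lettres", 1751, 80),
    ("betles-lettres", 1750, 5),      # below the threshold, dropped
    ("belles-lettres", 1901, 999),    # outside the date range
    ("belles-lettres2", 1750, 150),   # set aside as a footnote misreading
]
print(yearly_totals(sample_rows, excluded={"belles-lettres2"}))
```

The threshold is applied to each variant’s total across the whole date range rather than to any single year, which matches the selection criterion stated above.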
[1] Google Books Ngram Viewer, https://books.google.com/ngrams (last accessed 8 September 2022). The raw data can be downloaded at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html.
[2] See Michel et al., ‘Quantitative’, p.176. Google confirms that the updated, 2012 corpus was selected according to these same criteria. See Lin et al., ‘Syntactic’, p.170.
[3] The updates are explained in Lin et al., ‘Syntactic’, p.169.
[4] Nadja Younes and Ulf-Dietrich Reips offer an overview of such studies in ‘Guideline for improving the reliability of Google Ngram studies: evidence from religious terms’, PLoS ONE 14 (2019), p.1-17 (2).
[5] See p.12-14.
[6] Ngram’s inability to account for a book’s popularity is discussed in Eitan Adam Pechenick, Christopher M. Danforth and Peter Sheridan Dodds, ‘Characterizing the Google Books corpus: strong limits to inferences of socio-cultural and linguistic evolution’, PLoS ONE 10 (2015), p.1-24 (2). The authors also note that despite Google’s claims to only include one copy of each book, ‘new editions and reprints allow some books to appear more than once’ (p.2).
[7] Pechenick, Danforth and Dodds, ‘Characterizing’, p.23.
[8] One of Google’s updates to the 2012 iteration of Ngram was to tag each word according to grammatical function. For example, one finds ‘run’, ‘run_NOUN’ and ‘run_VERB’. I retain only untagged words, which comprise all uses of a term. For more on Google’s tagging, see Lin et al., ‘Syntactic’, p.171-72.
[9] The online Ngram Viewer only allows users to view Ngrams as a proportion of the total 1-grams in a given year. This would have been slightly inaccurate for my study, since I look not just at 1-grams, but also at 2-grams (‘belles lettres’) and 3-grams (‘belles-lettres’). The number of pages within the corpus represents, therefore, a common denominator, which can be used to calculate proportionate values for the different Ngrams I consider.
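The calculation described in note 9 can be sketched in a few lines. The sketch is illustrative only: the function name `relative_frequency` and the two dictionaries are hypothetical, and the figures in the usage example are invented. It assumes that the yearly totals for a word and the number of pages in the corpus for each year are already available.

```python
def relative_frequency(variant_totals, pages_per_year):
    """Express each yearly total as a proportion of the pages in that year.

    variant_totals: {year: summed count of all variants of a word}
    pages_per_year: {year: number of pages in the corpus for that year}
    Both inputs are assumptions made for this sketch.
    """
    result = {}
    for year, total in variant_totals.items():
        pages = pages_per_year.get(year)
        # Skip years with no page figure, to avoid dividing by zero.
        if pages:
            result[year] = total / pages
    return result


# Invented figures, for illustration only.
print(relative_frequency({1750: 150, 1751: 80}, {1750: 3000, 1751: 2000}))
```

Dividing by pages rather than by the number of 1-grams gives the 1-gram, 2-gram and 3-gram variants the same denominator, which is the reason given in note 9 for the choice.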