Appendix 2: Constituting the graphs based on Google Books’s Ngram data

The Emergence of Literature in Eighteenth-Century France

Constituting the Graphs based on Google Books’ Ngram Data

 

 

Figures 2 and 13 were created using the raw data behind Google Books’ Ngram Viewer.[1] An ‘Ngram’ is a consecutive sequence of case-sensitive blocks of text, with ‘N’ representing the number of blocks in a given series. For example, ‘littérature’ is a 1-gram, ‘belles lettres’ is a 2-gram, ‘belles-lettres’ is a 3-gram (since a hyphen is counted as its own block) and so on. At the time of writing, Ngram Viewer is a free, publicly available search engine that allows users to look for Ngrams (up to a 5-gram) in a large corpus of printed books dating from 1500 onwards. This corpus is a subset of the books digitised as part of the Google Books project, for which Google initially scanned and digitised some 15 million books held in over forty different university libraries (with additional items contributed by publishers). These books were then run through optical character recognition (OCR) software, enabling each book to be converted into digital text that can be read by a machine. From this corpus of over 15 million books, the Google Ngram team originally selected just over 5 million to form the first Ngram corpus, which was constituted in 2009. These 5 million items were drawn from seven languages (Chinese, English, French, German, Hebrew, Russian, Spanish). The engineers and scholars who created the Ngram tool state that these books were chosen from the full Google Books corpus ‘on the basis of the quality of their OCR and metadata’, and that periodicals are excluded from the Ngram corpus.[2] Ngram Viewer was launched online in 2010.
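The way in which blocks are counted can be illustrated with a short Python sketch. This is not Google’s tokeniser: the regular expression is an assumption that simply treats each run of letters or digits as one block and each punctuation mark (such as a hyphen) as a block of its own. The names TOKEN_PATTERN, tokenize and ngram_order are hypothetical.

```python
import re

# Assumption: runs of letters/digits form one block; any other
# non-space character (for example a hyphen) is a block of its own.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into blocks, preserving case."""
    return TOKEN_PATTERN.findall(text)


def ngram_order(phrase):
    """Return N, the number of blocks that the phrase contains."""
    return len(tokenize(phrase))


for phrase in ("belles lettres", "belles-lettres", "Belles-Lettres"):
    # Blocks are case-sensitive, so the last two phrases are distinct 3-grams.
    print(phrase, tokenize(phrase), ngram_order(phrase))
```

Under this simplified rule, ‘belles lettres’ yields two blocks and ‘belles-lettres’ yields three, which matches the counting convention described above.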

 

In 2012, the Ngram corpus was expanded and amended. A further language was added (Italian); the OCR software was improved, which increased the accuracy of word recognition; and the total corpus size increased to over 8 million books. Google claimed that this represented around 6 per cent of all books ever published.[3] Although Google has not yet published a paper detailing the changes made for the 2020 edition of its corpus, a quick comparison shows that the French-language corpus has grown from almost 800,000 volumes (in 2012) to more than 3 million. In 2012, these texts came from the [p.265] Bibliothèque municipale de Lyon, the Bibliothèque universitaire de Lausanne and (primarily) US university libraries. I have used the French-language corpus of the 2020 edition for this study, and I consider only those items published between 1650 and 1900 inclusive. This subset of the French-language corpus comprises 1,286,124 volumes, over 725 million pages and almost 140 billion 1-grams. For the years 1750 to 1800, the subset totals 184,864 volumes.

Since the launch of Ngram Viewer, dozens of scholarly articles have used its data to help shed light on cultural changes.[4] However, many scholars and lay commentators have also noted the limitations of this software, and of the corpora on which it is based. Several of these are noted in the introduction.[5] Others include the fact that the Ngram corpus sets out to include only one copy of each book. This means that Ngram Viewer gives users no sense of whether a text (and the words within it) was frequently republished and likely to have been read by many people, or whether it had a small print run and could only have been seen by a few.[6] In sum, Ngram weights all uses of a word equally, even if one received much more publicity than another.

The best-known criticism of Ngram Viewer relates to OCR errors in the data. Although Google’s OCR software has been much improved for recent editions of the Ngram corpus, errors do of course remain. In my search of Ngram’s raw data, for instance, I came across ‘betles-lettres’, in which an ‘l’ has been misread as a ‘t’, and ‘littérature.’, in which a full stop has been attached to the word. Such OCR errors mean that these Ngrams are not returned in searches for belles-lettres or littérature made using the standard Ngram Viewer. On the basis of its limitations, Pechenick, Danforth and Dodds conclude that Google’s Ngram data ‘warrants a very cautious approach to any effort to extract scientifically meaningful results’.[7]
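One way in which such misrecognised forms might be surfaced from a list of Ngrams is sketched below in Python. The text does not specify how the search for variants was carried out, so this is an assumption rather than the method used: it relies on the similarity ratio from Python’s standard difflib module, and the function name ocr_candidates, the cutoff of 0.85 and the sample list are all hypothetical. Any candidates returned would still need to be reviewed by hand.

```python
import difflib


def ocr_candidates(target, ngrams, cutoff=0.85):
    """Return Ngrams that closely resemble target, ignoring case.

    The exact target string itself is skipped; capitalised or
    misrecognised forms are kept as candidates for manual review.
    """
    target_folded = target.casefold()
    candidates = []
    for ngram in ngrams:
        if ngram == target:
            continue
        ratio = difflib.SequenceMatcher(
            None, target_folded, ngram.casefold()
        ).ratio()
        if ratio >= cutoff:
            candidates.append(ngram)
    return candidates


# Hypothetical sample of strings as they might appear in the raw data.
sample = [
    "belles-lettres",
    "betles-lettres",
    "Belles-Lettres",
    "belles-leltres",
    "bretelles",
    "lettre",
]
print(ocr_candidates("belles-lettres", sample))
```

A similarity cutoff cannot distinguish a misreading of the intended word from a misreading of a neighbouring one, which is why ambiguous forms call for a manual decision.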

 

To reduce data inaccuracy, then, I downloaded the raw data files (publicly available in comma-separated values, or CSV, format), and searched for non-standard spellings and capitalisation and for OCR [p.266] errors for the words belles-lettres and littérature. I compiled a database of the results, which records the annual frequency of each Ngram. I then retained only those Ngrams that occur one hundred times or more across the date range. I also omitted foreign-language words (such as ‘literature’), as well as Ngrams that might plausibly have been intended as a different word (for instance, littératur, which could have been a misrecognition of littérateur just as much as of littérature). Finally, I chose not to include instances where a number is attached to a word (such as belles-lettres2), which is likely to be a misrecognised footnote marker. This left me with forty-one variants of belles-lettres and fifty-three variants of littérature, as listed below.[8] I totalled the frequency of all variants to arrive at one sum per year, for each of belles-lettres and littérature. Finally, I calculated these yearly totals as a proportion of the number of pages within the whole corpus, each year.[9] The result provides the relative use of these words within the Ngram French-language corpus, shown in Figures 2 and 13. Below are the raw figures for each variant:

 

 

No.  Ngram            Total count (1650-1900)  Year in which annual count first reaches 200 or more (where applicable)
1    belles-lettres   664,288                  1751
2    Belles-Lettres   231,148                  1734
3    belleslettres    85,565                   1785
4    belles lettres   85,515                   1691
5    BellesLettres    30,135                   1802
6    BELLES-LETTRES   26,747                   1836
7    Belles-lettres   18,618                   1824
8    Belles Lettres   16,942                   1722
9    belles Lettres   14,699                   1701
[p.267]
10   BELLEs-LETTREs   3,756                    1833
11   Belles lettres   2,978                    1885
12   BELLES LETTRES   2,791
13   BELLES-LETTREs   2,335
14   Belleslettres    2,004
15   BELLESLETTRES    1,167
16   belles-lettre    877
17   belles-Lettres   832
18   belles-leltres   649
19   Belles-Lettre    549
20   belles-letlres   505
21   BELLEs LETTREs   486
22   Belles-Letres    470
23   Belles-Leltres   386
24   belles-lellres   349
25   bettes-lettres   346
26   BELLEsLETTREs    328
27   belles-lettrés   280
28   bellesleltres    267
29   Belles-Letlres   257
30   Belles-Lellres   241
31   bellesletlres    239
32   Belles-Lettres_  231
33   bellesLettres    189
34   BELLESLETTREs    155
35   belles-lettres_  147
36   betles-lettres   143
37   belles-letires   142
38   BELLEsLETTRES    112
39   Betles-Lettres   112
40   belles lettre    105
41   belles-tettres   105

 

 

[p.268]

 

 

No.  Ngram            Total count (1650-1900)  Year in which annual count first reaches 200 or more (where applicable)
1    littérature      5,260,362                1733
2    Littérature      349,036                  1730
3    littératures     274,931                  1811
4    LITTÉRATURE      213,126                  1798
5    litterature      18,877                   1800
6    LITTERATURE      14,346                   1829
7    Littératures     7,807                    1879
8    Litterature      7,633
9    LITTÉRATURES     4,327
10   litérature       4,162
11   lalittérature    3,537
12   littéralure      2,310
13   litlérature      1,934
14   littératurc      1,729
15   lillérature      1,694
16   Littérature.     1,441                    1822
17   liltérature      1,394
18   Litérature       1,093
19   litte'rature     993
20   littératnre      842
21   littérarature    663
22   LITTÉRATUREs     537
23   litteratures     529
24   littératuré      513
25   lilléralure      509
26   Littérature_     492
27   litléralure      404
28   LITTERATURES     375
29   littératuro      343
30   Litlérature      314
31   liltéralure      301
[p.269]
32   littérature.     296
33   littératuie      293
34   Littéralure      278
35   littérarure      272
36   littéraiure      256
37   littcrature      256
38   littératare      240
39   littératre       205
40   la'littérature   202
41   LITTÉRATURE.     199
42   LITTÉRATuRE      196
43   Lilléralure      187
44   Littératurc      184
45   llttérature      150
46   littératture     147
47   littératurè      134
48   Litte'rature     125
49   littératute      116
50   littératute      116
51   Litteratures     108
52   litlerature      108
53   littéralures     102
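The aggregation described before the tables (retaining variants that occur one hundred times or more, totalling them per year, and dividing each yearly total by the number of pages in the corpus for that year) can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used: the function name relative_use, the dictionary layout and all sample figures are invented, and the counts are assumed to have been restricted already to the years 1650 to 1900.

```python
from collections import defaultdict

MIN_TOTAL = 100  # threshold stated in the text: one hundred occurrences or more


def relative_use(counts, pages_per_year, min_total=MIN_TOTAL):
    """Return {year: summed count of retained variants / pages that year}.

    counts maps each variant to {year: annual count};
    pages_per_year maps each year to the number of pages in the corpus.
    """
    # Keep only variants whose total across the date range meets the threshold.
    retained = {
        variant: by_year
        for variant, by_year in counts.items()
        if sum(by_year.values()) >= min_total
    }
    # Total the retained variants to obtain one sum per year.
    yearly_totals = defaultdict(int)
    for by_year in retained.values():
        for year, count in by_year.items():
            yearly_totals[year] += count
    # Express each yearly total as a proportion of that year's pages.
    return {
        year: yearly_totals[year] / pages_per_year[year]
        for year in sorted(yearly_totals)
    }


# Invented figures, for illustration only.
sample_counts = {
    "belles-lettres": {1750: 120, 1751: 200},
    "Belles-Lettres": {1750: 60, 1751: 80},
    "betles-lettres": {1750: 1, 1751: 2},  # total of 3, so it is dropped
}
sample_pages = {1750: 90_000, 1751: 140_000}

print(relative_use(sample_counts, sample_pages))
```

Dividing by pages rather than by the number of 1-grams gives a single denominator that can be applied to 1-grams, 2-grams and 3-grams alike, which is the rationale given in note 9.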

 

 


[1] Google Books Ngram Viewer, https://books.google.com/ngrams (last accessed 8 September 2022). The raw data can be downloaded at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html.

[2] See Michel et al., ‘Quantitative’, p.176. Google confirms that the updated, 2012 corpus was selected according to these same criteria. See Lin et al., ‘Syntactic’, p.170.

[3] The updates are explained in Lin et al., ‘Syntactic’, p.169.

[4] Nadja Younes and Ulf-Dietrich Reips offer an overview of such studies in ‘Guideline for improving the reliability of Google Ngram studies: evidence from religious terms’, PLoS ONE 14 (2019), p.1-17 (2).

[5] See p.12-14.

[6] Ngram’s inability to account for a book’s popularity is discussed in Eitan Adam Pechenick, Christopher M. Danforth and Peter Sheridan Dodds, ‘Characterizing the Google Books corpus: strong limits to inferences of socio-cultural and linguistic evolution’, PLoS ONE 10 (2015), p.1-24 (2). The authors also note that despite Google’s claims to only include one copy of each book, ‘new editions and reprints allow some books to appear more than once’ (p.2).

[7] Pechenick, Danforth, and Dodds, ‘Characterizing’, p.23.

[8] One of Google’s updates to the 2012 iteration of Ngram was to tag each word according to grammatical function. For example, one finds ‘run’, ‘run_NOUN’ and ‘run_VERB’. I retain only untagged words, which comprise all uses of a term. For more on Google’s tagging, see Lin et al., ‘Syntactic’, p.171-72.

[9] The online Ngram Viewer only allows users to view Ngrams as a proportion of the total 1-grams in a given year. This would have been slightly inaccurate for my study, since I look not just at 1-grams, but also at 2-grams (‘belles lettres’) and 3-grams (‘belles-lettres’). The number of pages within the corpus represents, therefore, a common denominator, which can be used to calculate proportionate values for the different Ngrams I consider.

Corpora and Appendices
World English language rights reserved by Liverpool University Press