HOUSE_OVERSIGHT_016996.jpg

3.14 MB

Extraction Summary

13 People
7 Organizations
4 Locations
2 Events
2 Relationships
3 Quotes

Document Information

Type: Scientific research article / evidence document
File Size: 3.14 MB
Summary

This document is the first page of a scientific research article titled 'Quantitative Analysis of Culture Using Millions of Digitized Books' published in Sciencexpress on December 16, 2010. The paper introduces 'Culturomics' using data from Google Books. It is stamped 'HOUSE_OVERSIGHT_016996', indicating it was part of the House Oversight Committee's investigation, likely due to the involvement of author Martin A. Nowak, the director of the Program for Evolutionary Dynamics at Harvard, which received significant funding from Jeffrey Epstein.

People (13)

Name | Role | Context
Jean-Baptiste Michel | Author | Affiliated with Harvard University and others; corresponding author
Yuan Kui Shen | Author | Affiliated with Computer Science and Artificial Intelligence Laboratory, MIT
Aviva Presser Aiden | Author | Affiliated with Harvard Medical School
Adrian Veres | Author | Affiliated with Harvard College
Matthew K. Gray | Author | Affiliated with Google, Inc.
Joseph P. Pickett | Author | Affiliated with Houghton Mifflin Harcourt
Dale Hoiberg | Author | Affiliated with Encyclopaedia Britannica, Inc.
Dan Clancy | Author | Affiliated with Google, Inc.
Peter Norvig | Author | Affiliated with Google, Inc.
Jon Orwant | Author | Affiliated with Google, Inc.
Steven Pinker | Author | Affiliated with Department of Psychology, Harvard University
Martin A. Nowak | Author | Affiliated with Program for Evolutionary Dynamics, Harvard University (Note: Known associate of Jeffrey Epstein who r...
Erez Lieberman Aiden | Author | Affiliated with Harvard University and others; corresponding author

Organizations (7)

Name | Context
Harvard University | Multiple departments including Program for Evolutionary Dynamics
Google, Inc. | Affiliation of several authors; source of digitized books data
Houghton Mifflin Harcourt | Affiliation of Joseph P. Pickett
Encyclopaedia Britannica, Inc. | Affiliation of Dale Hoiberg
MIT | Computer Science and Artificial Intelligence Laboratory
Sciencexpress | Publisher of the article
House Oversight Committee | Implied by Bates stamp 'HOUSE_OVERSIGHT'

Timeline (2 events)

2010-12-16 | Article downloaded from www.sciencemag.org | Online
2010-12-16 | Publication date of the article in Sciencexpress | N/A | All Authors

Locations (4)

Location | Context
Cambridge, MA | Location of Harvard University and MIT
Boston, MA | Location of Harvard Medical School and Houghton Mifflin Harcourt
Mountain View, CA | Location of Google, Inc.
Chicago, IL | Location of Encyclopaedia Britannica, Inc.

Relationships (2)

Jean-Baptiste Michel and Erez Lieberman Aiden: both listed as corresponding authors contributing equally.
Martin A. Nowak Employment/Affiliation Harvard University
Listed as affiliated with Program for Evolutionary Dynamics, Harvard University.

Key Quotes (3)

"We constructed a corpus of digitized texts containing about 4% of all books ever printed."
Source
HOUSE_OVERSIGHT_016996.jpg
Quote #1
"The corpus cannot be read by a human. If you tried to read only the entries from the year 2000 alone... it would take eighty years."
Source
HOUSE_OVERSIGHT_016996.jpg
Quote #2
"“Culturomics” extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities."
Source
HOUSE_OVERSIGHT_016996.jpg
Quote #3

Full Extracted Text

Complete text extracted from the document (5,060 characters)

Sciencexpress Research Article
Quantitative Analysis of Culture Using Millions of Digitized Books
Jean-Baptiste Michel [1,2,3,4]*†, Yuan Kui Shen [5], Aviva Presser Aiden [6], Adrian Veres [7], Matthew K. Gray [8], The Google Books Team [8], Joseph P. Pickett [9], Dale Hoiberg [10], Dan Clancy [8], Peter Norvig [8], Jon Orwant [8], Steven Pinker [4], Martin A. Nowak [1,11,12], Erez Lieberman Aiden [1,12,13,14,15,16]*†
[1] Program for Evolutionary Dynamics, Harvard University, Cambridge, MA 02138, USA.
[2] Institute for Quantitative Social Sciences, Harvard University, Cambridge, MA 02138, USA.
[3] Department of Psychology, Harvard University, Cambridge, MA 02138, USA.
[4] Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA.
[5] Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA 02139, USA.
[6] Harvard Medical School, Boston, MA 02115, USA.
[7] Harvard College, Cambridge, MA 02138, USA.
[8] Google, Inc., Mountain View, CA 94043, USA.
[9] Houghton Mifflin Harcourt, Boston, MA 02116, USA.
[10] Encyclopaedia Britannica, Inc., Chicago, IL 60654, USA.
[11] Dept of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.
[12] Dept of Mathematics, Harvard University, Cambridge, MA 02138, USA.
[13] Broad Institute of Harvard and MIT, Harvard University, Cambridge, MA 02138, USA.
[14] School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA.
[15] Harvard Society of Fellows, Harvard University, Cambridge, MA 02138, USA.
[16] Laboratory-at-Large, Harvard University, Cambridge, MA 02138, USA.
*These authors contributed equally to this work.
†To whom correspondence should be addressed. E-mail: jb.michel@gmail.com (J.B.M.); erez@erez.com (E.A.).
We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of “culturomics”, focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. “Culturomics” extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.
Reading small collections of carefully chosen works enables scholars to make powerful inferences about trends in human thought. However, this approach rarely enables precise measurement of the underlying phenomena. Attempts to introduce quantitative methods into the study of culture (1-6) have been hampered by the lack of suitable data.
We report the creation of a corpus of 5,195,769 digitized books containing ~4% of all books ever published. Computational analysis of this corpus enables us to observe cultural trends and subject them to quantitative investigation. “Culturomics” extends the boundaries of scientific inquiry to a wide array of new phenomena.
The corpus has emerged from Google’s effort to digitize books. Most books were drawn from over 40 university libraries around the world. Each page was scanned with custom equipment (7), and the text digitized using optical character recognition (OCR). Additional volumes – both physical and digital – were contributed by publishers. Metadata describing date and place of publication were provided by the libraries and publishers, and supplemented with bibliographic databases. Over 15 million books have been digitized [12% of all books ever published (7)]. We selected a subset of over 5 million books for analysis on the basis of the quality of their OCR and metadata (Fig. 1A) (7). Periodicals were excluded.
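The percentages quoted above can be cross-checked with a little arithmetic. The following is only an illustrative sketch and is not part of the article: it treats "over 15 million" as exactly 15 million, and the variable names are made up for the example.

```python
# Figures quoted in the extracted text. "Over 15 million" is rounded
# down to exactly 15 million, which is an assumption of this sketch.
digitized_books = 15_000_000
digitized_share = 0.12        # "12% of all books ever published"
corpus_books = 5_195_769      # subset selected for analysis

# Total number of books ever published implied by the two figures above.
implied_total = digitized_books / digitized_share

# Fraction of all books represented by the analysis corpus.
corpus_share = corpus_books / implied_total

print(f"implied total: {implied_total:,.0f} books")
print(f"corpus share: {corpus_share:.1%}")
```

Under these assumptions the implied total is about 125 million books, and the analysis corpus comes to a little over 4% of it, in line with the "~4%" stated in the text.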
The resulting corpus contains over 500 billion words, in English (361 billion), French (45B), Spanish (45B), German (37B), Chinese (13B), Russian (35B), and Hebrew (2B). The oldest works were published in the 1500s. The early decades are represented by only a few books per year, comprising several hundred thousand words. By 1800, the corpus grows to 60 million words per year; by 1900, 1.4 billion; and by 2000, 8 billion.
The corpus cannot be read by a human. If you tried to read only the entries from the year 2000 alone, at the reasonable pace of 200 words/minute, without interruptions for food or sleep, it would take eighty years. The sequence of letters is one thousand times longer than the human genome: if you wrote it out in a straight line, it would reach to the moon and back 10 times over (8).
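The "eighty years" figure can be roughly reproduced from numbers given in the text: 8 billion words for the year 2000 and a reading pace of 200 words per minute. This is an illustrative sketch rather than the authors' calculation; the 365.25-day year and the variable names are assumptions.

```python
# Figures from the extracted text.
words_in_2000 = 8_000_000_000   # "by 2000, 8 billion" words per year
words_per_minute = 200          # "the reasonable pace of 200 words/minute"

# Nonstop reading time in minutes, then converted to years.
minutes = words_in_2000 / words_per_minute
minutes_per_year = 60 * 24 * 365.25   # assumed average year length
years = minutes / minutes_per_year

print(f"{years:.0f} years")   # prints "76 years"
```

The result, about 76 years of uninterrupted reading, is consistent with the rounded "eighty years" in the text.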
To make release of the data possible in light of copyright constraints, we restricted our study to the question of how often a given “1-gram” or “n-gram” was used over time. A 1-gram is a string of characters uninterrupted by a space; this includes words (“banana”, “SCUBA”) but also numbers
Sciencexpress / www.sciencexpress.org / 16 December 2010 / Page 1 / 10.1126/science.1199644
Downloaded from www.sciencemag.org on December 16, 2010
HOUSE_OVERSIGHT_016996
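The extracted text defines a 1-gram as a string of characters uninterrupted by a space and frames the study as counting how often a given n-gram was used over time. A minimal sketch of that idea follows. It is not the authors' pipeline: the function names and the toy data are hypothetical, and splitting on any whitespace is a simplification of whatever tokenization the article actually used.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Return the n-grams of `text` as space-joined strings.

    A 1-gram here is any run of non-whitespace characters (a simplification).
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def counts_by_year(books, n):
    """Map each publication year to a Counter of n-gram frequencies.

    `books` is an iterable of (year, text) pairs.
    """
    table = defaultdict(Counter)
    for year, text in books:
        table[year].update(ngrams(text, n))
    return table


# Hypothetical toy data, for illustration only.
books = [
    (1900, "the banana is ripe"),
    (1900, "the banana"),
    (2000, "SCUBA gear 3.14"),
]

one_grams = counts_by_year(books, 1)
two_grams = counts_by_year(books, 2)

print(one_grams[1900]["banana"])       # 2
print(two_grams[1900]["the banana"])   # 2
```

Note that numbers such as "3.14" count as 1-grams under this definition, matching the text's remark that 1-grams include numbers as well as words.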
