HOUSE_OVERSIGHT_017023.jpg

2.29 MB

Extraction Summary

People: 2
Organizations: 4
Locations: 8
Events: 1
Relationships: 0
Quotes: 2

Document Information

Type: Scientific/academic paper (methodology section) / congressional oversight document
File Size: 2.29 MB
Summary

This document appears to be page 15 of a scientific methodology paper or appendix regarding linguistic analysis and '1-grams' (likely related to the 'Culturomics' study or Google Ngrams). It details control methods using historical data (presidents, treaties, country name changes) to verify frequency peaks in a dataset and estimates word counts using the American Heritage and Webster's dictionaries. While the document bears a 'HOUSE_OVERSIGHT' Bates stamp, indicating it was part of a congressional document production, the text itself contains no direct references to Jeffrey Epstein, his associates, or criminal activities.

People (2)

Name Role Context
President Truman Historical Figure
Used as an example for 'heads of state' control data.
President Roosevelt Historical Figure
Mentioned as an ambiguous name removed from the dataset.

Organizations (4)

Name Type Context
Wikipedia
Used as a primary source for word lists.
American Heritage Dictionary (AHD4)
Source of lexicon data (4th Edition, 2000).
Webster's Third New International Dictionary
Source of lexicon data (2002 edition).
House Oversight Committee
Implied by the Bates stamp 'HOUSE_OVERSIGHT_017023'.

Timeline (1 event)

19th and 20th Centuries
Signing of 198 treaties used as data points.
Global

Locations (8)

Location Context
Historical name change example.
Historical name change example.
Historical name change example.
Historical name change example.
Historical name change example.
Historical name change example.
Historical name change example.
Historical name change example.

Key Quotes (2)

"To confirm the quality of our data in the English language, we sought positive controls in the form of words that should exhibit very strong peaks around a date of interest."
Source
HOUSE_OVERSIGHT_017023.jpg
Quote #1
"We are indebted to the editorial staff of AHD4 for providing us the list of the 153,459 headwords that make up the entries of AHD4."
Source
HOUSE_OVERSIGHT_017023.jpg
Quote #2

Full Extracted Text

Complete text extracted from the document (3,286 characters)

language lexica, we tried whenever possible to have the annotation performed by a third party with no knowledge of the analyses we were undertaking.
III.3. Controls
To confirm the quality of our data in the English language, we sought positive controls in the form of words that should exhibit very strong peaks around a date of interest. We used three categories of such words: heads of state ('President Truman'), treaties ('Treaty of Versailles'), and geographical name changes ('Byelorussia' to 'Belarus'). We used Wikipedia as a primary source of such words, and manually curated the lists as described below. We computed the time series of each n-gram, centered it on the date of interest (the year when the person became president, for instance), and normalized the time series by overall frequency. Then, we took the mean trajectory for each of the three cohorts and plotted them in Figure S5.
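The centering and averaging steps described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the 10-year window, and the reading of "normalized by overall frequency" as division by the series' mean frequency over all years are assumptions made for illustration.

```python
from statistics import mean

def centered_normalized(series, event_year, window=10):
    """Return frequencies for offsets -window..+window around event_year.

    `series` maps year -> frequency. Values are divided by the mean
    frequency of the whole series (an assumed reading of "normalized
    by overall frequency"). Years missing from the series count as 0.
    """
    overall = mean(series.values())
    if overall == 0:
        return None  # an all-zero series carries no signal
    return [series.get(event_year + off, 0.0) / overall
            for off in range(-window, window + 1)]

def mean_trajectory(cohort, window=10):
    """Average the centered, normalized series of a cohort.

    `cohort` is a list of (series, event_year) pairs, e.g. one pair
    per head of state, treaty, or new country name.
    """
    rows = [centered_normalized(s, y, window) for s, y in cohort]
    rows = [r for r in rows if r is not None]
    # zip(*rows) groups values that share the same offset from the event
    return [mean(col) for col in zip(*rows)]
```

Aligning every series on its own date of interest before averaging is what lets peaks from different decades reinforce each other in a single mean curve.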
The list of heads of state includes all US presidents and British monarchs who gained power in the 19th or 20th centuries (we removed ambiguous names, such as 'President Roosevelt'). The list of treaties is taken from the list of 198 treaties signed in the 19th or 20th centuries (S7), but we kept only the 121 names that referred to only one known treaty and that have nonzero time series. The list of country name changes is taken from Ref S8. The lists are given in the APPENDIX.
The correspondence between the expected and observed presence of peaks was excellent. 42 out of 44 heads of state had a frequency increase of over 10-fold in the decade after they took office (expected if the year of interest were random: 1). Similarly, 85 out of 92 treaties had a frequency increase of over 10-fold in the decade after they were signed (expected: 2). Finally, 23 out of 28 new country names became more frequent than the country name they replaced within 3 years of the name change; exceptions include Kampuchea/Cambodia (the name Cambodia was later reinstated), Iran/Persia (Iran is still referred to as Persia in many contexts today) and Sri Lanka/Ceylon (Ceylon is also a popular tea).
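The two checks reported above (a 10-fold increase in the following decade, and a new name overtaking the old one within 3 years) could be sketched as follows. The text does not define the baseline for the fold increase, so comparing the mean frequency of the decade after the event with the mean of the decade before it is an assumption, as are the function names.

```python
def fold_increase(series, event_year, span=10):
    """Ratio of mean frequency in the `span` years after the event
    to the mean in the `span` years before it (assumed baseline).

    `series` maps year -> frequency; missing years count as 0.
    """
    before = [series.get(event_year - k, 0.0) for k in range(1, span + 1)]
    after = [series.get(event_year + k, 0.0) for k in range(1, span + 1)]
    base = sum(before) / span
    peak = sum(after) / span
    if base == 0:
        # no prior usage: any later usage is an unbounded increase
        return float("inf") if peak > 0 else 0.0
    return peak / base

def overtakes_within(new_series, old_series, change_year, years=3):
    """True if the new name is more frequent than the old name in any
    year from change_year through change_year + years."""
    return any(
        new_series.get(change_year + k, 0.0) > old_series.get(change_year + k, 0.0)
        for k in range(years + 1)
    )
```

Under these assumptions, a control word passes the first check when `fold_increase(series, year) > 10`.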
III.4. Lexicon Analysis
III.4A. Estimation of the number of 1-grams defined in leading dictionaries of the English language.
(a) American Heritage Dictionary of the English Language, 4th Edition (2000)
We are indebted to the editorial staff of AHD4 for providing us the list of the 153,459 headwords that make up the entries of AHD4. However, many headwords are not single words ("preferential voting" or "men's room"), and others are listed as many times as there are grammatical categories ("to console", the verb; "console", the piece of furniture).
Among those entries, we find 116,156 unique 1-grams (such as "materialism" or "extravagate").
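One way to picture the reduction from headwords to unique 1-grams is the sketch below. It assumes that a 1-gram is a headword containing no whitespace and that repeated headwords collapse to a single entry; whether the authors also applied case folding or other cleanup is not stated, so none is done here.

```python
def unique_one_grams(headwords):
    """Return the set of distinct single-word headwords.

    Multi-word entries (anything containing whitespace) are dropped,
    and headwords listed once per grammatical category are counted once.
    """
    seen = set()
    for hw in headwords:
        token = hw.strip()
        if token and not any(ch.isspace() for ch in token):
            seen.add(token)
    return seen

# Illustrative input built from the examples quoted in the text
sample = ["preferential voting", "men's room", "console", "console",
          "materialism", "extravagate"]
print(len(unique_one_grams(sample)))
```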
(b) Webster's Third New International Dictionary (2002)
The editorial staff communicated to us the number of "boldface entries" of the dictionary, which is taken to be the number of n-grams defined: 476,330.
The editorial staff also communicated the number of multi-word entries, 74,000, out of a total of 275,000 entries. They estimate a lower bound of multi-word entries at 27% of the entries.
Therefore, we estimate an upper bound of unique 1-grams defined by this dictionary as (1 - 0.27)*476,330 = 0.73*476,330, which is approximately 348,000.
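The arithmetic behind this estimate can be checked in a few lines. If at least 27% of entries are multi-word, then at most 73% are single words, and that share of the 476,330 boldface entries gives the figure of roughly 348,000. The variable names below are illustrative only.

```python
multi_word_entries = 74_000
total_entries = 275_000
boldface_entries = 476_330

# Share of multi-word entries, rounded to two decimals as in the text (27%)
multi_word_share = round(multi_word_entries / total_entries, 2)

# A lower bound on multi-word entries gives an upper bound on 1-grams
upper_bound_one_grams = (1 - multi_word_share) * boldface_entries

print(f"multi-word share: {multi_word_share:.2f}")
print(f"upper bound on 1-grams: about {round(upper_bound_one_grams, -3):,.0f}")
```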
15
HOUSE_OVERSIGHT_017023
