HOUSE_OVERSIGHT_017024.jpg

2.29 MB

Extraction Summary

People: 2
Organizations: 3
Locations: 0
Events: 0
Relationships: 0
Quotes: 3

Document Information

Type: Academic paper / scientific report (appendix or supplementary material)
File Size: 2.29 MB
Summary

This document is page 16 of a scientific or academic paper regarding quantitative linguistics, specifically the 'Estimation of Lexicon Size.' It details a methodology for analyzing word frequency (1-grams) over time (1900-2000) using the Oxford English Dictionary and other sources. The text explains a classification system for filtering words (e.g., typos, proper nouns, foreign words) to estimate the size of the English lexicon. While the content is purely academic, the footer 'HOUSE_OVERSIGHT_017024' indicates this document was collected as part of a House Oversight Committee investigation, likely related to the broader Epstein document production.

People (2)

Name | Role | Context
Native English Speaker (Unnamed) | Annotator | Classified random samples of alphabetical forms into categories.
Different Native Speaker (Unnamed) | Annotator | Repeated the sampling process for the year 2000 lexicon to confirm independence.

Organizations (3)

Name | Type | Context
Oxford English Dictionary | | Used to estimate the upper bound of unique 1-grams.
House Oversight Committee | | The document bears the Bates stamp 'HOUSE_OVERSIGHT', indicating it was part of a congressional document production.
AHD4 (American Heritage Dictionary, 4th Ed.) | | Used to plot frequency histograms of 1-grams.

Key Quotes (3)

Quote #1 (Source: HOUSE_OVERSIGHT_017024.jpg)
"Therefore, we estimate an upper bound of the number of unique 1-grams defined by this dictionary as 615,100-169,000 which is approximately 446,000."

Quote #2 (Source: HOUSE_OVERSIGHT_017024.jpg)
"We found that 90% of 1-gram headwords had a frequency greater than 10^-9, but only 70% were more frequent than 10^-8."

Quote #3 (Source: HOUSE_OVERSIGHT_017024.jpg)
"A typo is a one-time typing error by someone who presumably knows the correct spelling (as in improtant); a misspelling, which generally has the same pronunciation as the correct spelling, arises when a person is ignorant of the correct spelling (as in abberation)."

Full Extracted Text

Complete text extracted from the document (3,639 characters)

(c) Oxford English Dictionary (Reference in main text)
From the website of the OED we can read that the “number of word forms defined and/or illustrated” is 615,100; and that we find 169,000 “italicized-bold phrases and combinations”.
Therefore, we estimate an upper bound of the number of unique 1-grams defined by this dictionary as 615,100 - 169,000, which is approximately 446,000.
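The subtraction behind this upper bound can be checked with a short sketch. The two counts are the ones quoted from the OED website; the variable names are illustrative only.

```python
# Counts quoted in the text from the OED website.
word_forms = 615_100          # "word forms defined and/or illustrated"
phrases_and_combos = 169_000  # "italicized-bold phrases and combinations"

# Upper bound on unique 1-grams: remove the multi-word entries.
upper_bound = word_forms - phrases_and_combos
print(upper_bound)  # 446100, i.e. approximately 446,000
```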
III.4B. Estimation of Lexicon Size
How frequent does a 1-gram have to be in order to be considered a word? We chose a minimum frequency threshold for ‘common’ 1-grams by attempting to identify the largest frequency decile that remains lower than the frequency of most dictionary words.
We plotted a histogram showing the frequency of the 1-grams defined in AHD4, as measured in our year 2000 lexicon. We found that 90% of 1-gram headwords had a frequency greater than 10^-9, but only 70% were more frequent than 10^-8. Therefore, the frequency 10^-9 is a reasonable threshold for inclusion in the lexicon.
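A minimal sketch of this threshold check follows, assuming the headword frequencies are available as a plain list of floats. The function names, the toy frequencies, and the 90% coverage cutoff used by `pick_threshold` are assumptions chosen to mirror the figures quoted above; they are not taken from the paper's own code.

```python
def fraction_above(frequencies, threshold):
    """Share of headword frequencies strictly greater than `threshold`."""
    if not frequencies:
        raise ValueError("need at least one frequency")
    return sum(f > threshold for f in frequencies) / len(frequencies)


def pick_threshold(frequencies, candidates, min_coverage=0.9):
    """Largest candidate threshold that still keeps `min_coverage` of headwords.

    `min_coverage` is an assumed reading of "most dictionary words".
    Returns None if no candidate qualifies.
    """
    eligible = [t for t in candidates
                if fraction_above(frequencies, t) >= min_coverage]
    return max(eligible) if eligible else None


# Toy data (hypothetical): 10 headword frequencies in a year-2000 lexicon.
toy_freqs = [3e-6, 5e-7, 2e-7, 8e-8, 4e-8, 3e-8, 2e-8, 5e-9, 2e-9, 4e-10]

print(fraction_above(toy_freqs, 1e-9))  # 0.9
print(fraction_above(toy_freqs, 1e-8))  # 0.7
print(pick_threshold(toy_freqs, [1e-10, 1e-9, 1e-8, 1e-7]))  # 1e-09
```

With the toy data, 10^-9 is the largest power-of-ten threshold that still covers 90% of headwords, which matches the reasoning in the paragraph above.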
To estimate the number of words, we began by generating the list of common 1-grams with a higher chronological resolution, namely 11 different time points from 1900 until 2000 (1900, 1910, 1920, ... 2000) as described above. We next excluded all 1-grams with non-alphabetical characters in order to produce a list of common alphabetical forms for each time point.
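The time points and the alphabetical filter can be sketched as below. The text does not say exactly how "non-alphabetical characters" were detected; this sketch assumes Python's `str.isalpha()`, which also accepts accented letters, and the example 1-grams are hypothetical.

```python
# The 11 time points named in the text: 1900, 1910, ..., 2000.
TIME_POINTS = list(range(1900, 2001, 10))


def alphabetical_forms(common_1grams):
    """Keep only 1-grams made up entirely of letters.

    Assumption: "alphabetical" is interpreted via str.isalpha(), so digits,
    hyphens, periods, and apostrophes all disqualify a 1-gram.
    """
    return [w for w in common_1grams if w.isalpha()]


# Hypothetical list of common 1-grams for one time point.
example = ["word", "3rd", "co-op", "naïve", "U.S."]
print(len(TIME_POINTS))             # 11
print(alphabetical_forms(example))  # ['word', 'naïve']
```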
For three of the time points (1900, 1950, 2000), we took a random sample of 1000 alphabetical forms from the resulting set of alphabetical forms. These were classified by a native English speaker with no knowledge of the analyses being performed. The results of the classification are found in the Appendix. We asked the speaker to classify the candidate words into 8 categories:
M if the word is a misspelling or a typo or seems like gibberish*
N if the word derives primarily from a personal or a company name
P for any other kind of proper nouns
H if the word has lost its original hyphen
F if the word is a foreign word not generally used in English sentences
B if it is a ‘borrowed’ foreign word that is often used in English sentences
R for anything that does not fall into the above categories
U unclassifiable for some reason
We computed the fraction of these 1000 words at each time point that were classified as P, N, B, or R, which we call the ‘word fraction for year X’, or WFx. To compute the estimated lexicon size for 1900, 1950, and 2000, we multiplied the word fraction by the number of alphabetical forms in those years.
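The sampling and word-fraction steps can be sketched as follows. The category letters come from the list above; the function names, the fixed random seed, the label counts, and the figure of 1,000,000 alphabetical forms are all hypothetical values used only for illustration.

```python
import random
from collections import Counter

# Categories counted as words in the text: P, N, B, and R.
WORD_CATEGORIES = {"P", "N", "B", "R"}


def draw_sample(forms, k=1000, seed=0):
    """Random sample of up to k alphabetical forms (seed is an assumption)."""
    rng = random.Random(seed)
    return rng.sample(forms, min(k, len(forms)))


def word_fraction(labels):
    """Fraction of annotated forms whose label is a word category."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    return sum(counts[c] for c in WORD_CATEGORIES) / len(labels)


def estimated_lexicon_size(labels, n_alphabetical_forms):
    """Word fraction multiplied by the number of alphabetical forms."""
    return word_fraction(labels) * n_alphabetical_forms


# Hypothetical annotation of a 1000-form sample for one year.
toy_labels = (["R"] * 600 + ["N"] * 100 + ["P"] * 50 + ["B"] * 10
              + ["M"] * 150 + ["H"] * 40 + ["F"] * 30 + ["U"] * 20)

print(word_fraction(toy_labels))  # 0.76
print(round(estimated_lexicon_size(toy_labels, 1_000_000)))  # 760000
```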
For the other 8 time points, we did not perform a separate sampling step. Instead, we estimated the word fraction by linearly interpolating between the word fractions of the nearest sampled time points; e.g., the word fraction in 1920 satisfied WF1920 = WF1900 + 0.4*(WF1950 - WF1900). We then multiplied the word fraction by the number of alphabetical forms in the corresponding year, as above.
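The interpolation step can be sketched as below, reading "nearest sampled time points" as the two sampled years that bracket the target year. The function name and the word-fraction values are hypothetical; only the sampled years (1900, 1950, 2000) and the 0.4 weight for 1920 come from the text.

```python
def interpolate_word_fraction(year, sampled):
    """Linearly interpolate the word fraction for `year`.

    `sampled` maps a sampled year to its measured word fraction.
    Years outside the sampled range are rejected rather than extrapolated.
    """
    if year in sampled:
        return sampled[year]
    years = sorted(sampled)
    if not years[0] < year < years[-1]:
        raise ValueError("year lies outside the sampled range")
    for lo, hi in zip(years, years[1:]):
        if lo < year < hi:
            weight = (year - lo) / (hi - lo)  # 0.4 for 1920 between 1900 and 1950
            return sampled[lo] + weight * (sampled[hi] - sampled[lo])


# Hypothetical word fractions for the three sampled years.
toy_wf = {1900: 0.70, 1950: 0.75, 2000: 0.80}

wf_1920 = interpolate_word_fraction(1920, toy_wf)
print(round(wf_1920, 4))  # 0.72
```

Multiplying the interpolated value by the number of alphabetical forms for that year then gives the lexicon estimate, as for the sampled years.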
For the year 2000 lexicon, we repeated the sampling and annotation process using a different native speaker. The results were similar, which confirmed that our findings were independent of the person doing the annotation.
We note that the trends shown in Fig 2A are similar when proper nouns (N) are excluded from the lexicon (i.e., the only categories are P, B and R). Figure S7 shows the estimates of the lexicon excluding the category ‘N’ (proper nouns).
* A typo is a one-time typing error by someone who presumably knows the correct spelling (as in improtant); a misspelling, which generally has the same pronunciation as the correct spelling, arises when a person is ignorant of the correct spelling (as in abberation).
16
HOUSE_OVERSIGHT_017024
