This document is page 16 of a scientific or academic paper regarding quantitative linguistics, specifically the 'Estimation of Lexicon Size.' It details a methodology for analyzing word frequency (1-grams) over time (1900-2000) using the Oxford English Dictionary and other sources. The text explains a classification system for filtering words (e.g., typos, proper nouns, foreign words) to estimate the size of the English lexicon. While the content is purely academic, the footer 'HOUSE_OVERSIGHT_017024' indicates this document was collected as part of a House Oversight Committee investigation, likely related to the broader Epstein document production.

| Name | Role | Context |
|---|---|---|
| Native English Speaker (Unnamed) | Annotator | Classified random samples of alphabetical forms into categories. |
| Different Native Speaker (Unnamed) | Annotator | Repeated the sampling process for the year 2000 lexicon to confirm independence. |

| Name | Type | Context |
|---|---|---|
| Oxford English Dictionary | | Used to estimate the upper bound of unique 1-grams. |
| House Oversight Committee | | The document bears the Bates stamp 'HOUSE_OVERSIGHT', indicating it was part of a congressional document production. |
| AHD4 (American Heritage Dictionary, 4th Ed.) | | Used to plot frequency histograms of 1-grams. |

"Therefore, we estimate an upper bound of the number of unique 1-grams defined by this dictionary as 615,100-169,000 which is approximately 446,000."

"We found that 90% of 1-gram headwords had a frequency greater than 10^-9, but only 70% were more frequent than 10^-8."

"A typo is a one-time typing error by someone who presumably knows the correct spelling (as in improtant); a misspelling, which generally has the same pronunciation as the correct spelling, arises when a person is ignorant of the correct spelling (as in abberation)."
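
The arithmetic in the first two quotes can be checked with a minimal sketch. The two counts are taken from the quoted text; the variable names, the helper `fraction_above`, and the sample frequencies are hypothetical and invented for illustration only, not data from the paper.

```python
# Counts quoted above; the names are placeholders, since the excerpt
# does not say what each count represents.
total_count = 615_100
subtracted_count = 169_000
upper_bound = total_count - subtracted_count  # exact difference, quoted as ~446,000


def fraction_above(frequencies, threshold):
    """Return the share of frequencies strictly greater than threshold."""
    if not frequencies:
        raise ValueError("frequencies must not be empty")
    return sum(f > threshold for f in frequencies) / len(frequencies)


# Hypothetical headword frequencies (NOT from the paper).
sample = [3e-7, 5e-8, 2e-8, 4e-9, 2e-9, 5e-10]

print(upper_bound)
print(fraction_above(sample, 1e-9))
print(fraction_above(sample, 1e-8))
```

Raising the threshold from 10^-9 to 10^-8 can only shrink the share of words that clear it, which is the pattern the second quote reports (90% versus 70%).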