HOUSE_OVERSIGHT_017022.jpg


Extraction Summary

People: 1
Organizations: 3
Locations: 0
Events: 0
Relationships: 3
Quotes: 3

Document Information

Type: Research paper page / technical report
File Size: 2.42 MB
Summary

This document page details methodologies for analyzing n-gram frequencies over time, specifically addressing how to handle multiple query cohorts by normalizing data to avoid bias from frequency differences. It also outlines the sources used for collecting historical and cultural data, primarily citing Wikipedia and Encyclopedia Britannica, while noting efforts to verify accuracy and minimize manual annotation bias.

People (1)

Wolfgang Hermann

Organizations (3)

Wikipedia
Encyclopedia Britannica
American Heritage Dictionary

Relationships (3)

Key Quotes (3)

1. "Such methods can be confounded by the vast frequency differences among the various constituent queries."
2. "We often used Wikipedia in the process of obtaining these lists."
3. "We avoided doing manual annotation ourselves wherever possible, in an effort to avoid biasing the results."

Source for all quotes: HOUSE_OVERSIGHT_017022.jpg

Full Extracted Text

Complete text extracted from the document (4,062 characters)

a particular n-gram in year X as shown in the plots is the mean of the raw frequency values for the
n-gram in the years X-1, X, and X+1.
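A minimal sketch of this 3-year centered smoothing, assuming the raw yearly counts live in a simple dict keyed by year (the function and variable names are illustrative, not from the paper):

    # 3-year centered smoothing: the plotted value for year X is the mean
    # of the raw frequencies in years X-1, X, and X+1.
    def smooth_3yr(freq_by_year, year):
        window = [freq_by_year[y] for y in (year - 1, year, year + 1)
                  if y in freq_by_year]  # edge years use whatever is present
        return sum(window) / len(window) if window else None

    freqs = {1899: 120, 1900: 150, 1901: 90}
    print(smooth_3yr(freqs, 1900))  # -> 120.0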
Note that for each n-gram in the corpus, we can provide three measures as a function of year of
publication:
1- the number of times it appeared
2- the number of pages where it appeared
3- the number of books where it appeared.
Throughout the paper, we make use only of the first measure, but the other two remain available. The
three measures are generally in agreement, but can reflect distinct cultural effects; these distinctions are
not explored in this paper.
For example, in the Appendix we give measures for the frequency of the word 'evolution': the number of
times it appeared, the normalized number of times it appeared (relative to the total number of words that
year), the normalized number of pages it appeared in, and the normalized number of books it appeared
in, as a function of the date.
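As a hedged illustration of the normalization just described (a raw yearly count divided by the total number of words in the corpus that year), assuming both series are available as dicts keyed by year; `raw_counts` and `words_per_year` are hypothetical names, not the authors' data structures:

    # Normalized frequency: raw count / total words in the corpus that year.
    def normalized_frequency(raw_counts, words_per_year):
        return {year: count / words_per_year[year]
                for year, count in raw_counts.items()}

    raw_counts = {1950: 4200, 1951: 4550}            # e.g. 'evolution'
    words_per_year = {1950: 1_000_000_000, 1951: 1_050_000_000}
    print(normalized_frequency(raw_counts, words_per_year))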
III.1B. Multiple Query/Cohort Timelines
Where indicated, timeline plots may reflect the aggregates of multiple query results, such as a cohort of
individuals or inventions. In these cases, the raw data for each query were used to associate each year
with a set of frequencies. The plot was generated by choosing a measure of central tendency (either the
mean or the median) to characterize the set of frequencies and associating the resulting value with the
corresponding year.
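A short sketch of this aggregation step, assuming each query's timeline is a dict mapping year to frequency (the names here are assumptions for illustration; the paper does not specify an implementation):

    from statistics import mean, median

    # For each year, collect the frequencies contributed by every query in
    # the cohort and summarize them with a measure of central tendency.
    def cohort_timeline(query_series, measure=mean):
        years = set().union(*query_series)  # union of all years observed
        return {year: measure([s[year] for s in query_series if year in s])
                for year in sorted(years)}

Passing measure=median gives the median-based variant mentioned above.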
Such methods can be confounded by the vast frequency differences among the various constituent
queries. For instance, the mean will tend to be dominated by the most frequent queries, which might be
several orders of magnitude more frequent than the least frequent queries. If the absolute frequency of
the various query results is not of interest, but only their relative change over time, then individual query
results may be normalized so that they yield a total of 1. This results in a probability mass function for
each query describing the likelihood that a random instance of a query derives from a particular year.
These probability mass functions may then be summed to characterize a set of multiple queries. This
approach eliminates bias due to inter-query differences in frequency, making the change over time in the
cohort easier to track.
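A minimal sketch of this probability-mass-function approach, under the same dict-per-query assumption as above (all names are illustrative):

    # Rescale one query's timeline so its values total 1: a probability
    # mass function over years of publication.
    def to_pmf(series):
        total = sum(series.values())
        return {year: f / total for year, f in series.items()}

    # Sum the per-query PMFs; each query now contributes equally to the
    # cohort curve regardless of its absolute frequency.
    def cohort_pmf_sum(query_series):
        summed = {}
        for series in query_series:
            for year, p in to_pmf(series).items():
                summed[year] = summed.get(year, 0.0) + p
        return summed

    common = {1900: 9000, 1910: 1000}   # frequent query
    rare = {1900: 1, 1910: 9}           # rare query, opposite trend
    print(cohort_pmf_sum([common, rare]))  # -> {1900: 1.0, 1910: 1.0}

With a raw mean, the common query would dominate the cohort curve; after normalization the two queries contribute equally, which is exactly the inter-query frequency bias the text describes removing.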
III.2. Note on collection of historical and cultural data
In performing the analyses described in this paper, we frequently required additional curated datasets of
various cultural facts, such as dates of rule of various monarchs, lists of notable people and inventions,
and many others. We often used Wikipedia in the process of obtaining these lists. Where Wikipedia is
merely digitizing the content available in another source (for instance, the blacklists of Wolfgang
Hermann), we corrected the data using the original sources. In other cases this was not possible, but we
felt that the use of Wikipedia was justifiable given that (i) the data, including all prior versions, is publicly
available; (ii) it was created by third parties with no knowledge of our intended analyses; and (iii) the
specific statistical analyses performed using the data were robust to errors; i.e., they would be valid as
long as most of the information was accurate, even if some fraction of the underlying information was
wrong. (For instance, the aggregate analysis of treaty dates as compared to the timeline of the
corresponding treaty, shown in the control section, will work as long as most of the treaty names and
dates are accurate, even if some fraction of the records is erroneous.)
We also used several datasets from the Encyclopedia Britannica, to confirm that our results were
unchanged when high-quality, carefully curated data was used. For the lexicographic analyses, we relied
primarily on existing data from the American Heritage Dictionary.
We avoided doing manual annotation ourselves wherever possible, in an effort to avoid biasing the
results. When manual annotation had to be performed, such as in the classification of samples from our
