c. Using the online Wikipedia website, find all Wikipedia articles which are listed in the body of the chosen Wikipedia lists.
d. Intersect the set of all articles belonging to the relevant Lists and Categories with the set of people born 1800-1980. For people in both sets, append the occupation information.
e. Associate the records of these articles with the occupation.
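Steps d and e can be sketched as a set intersection followed by a tagging pass. This is a minimal illustration only: the function name `attach_occupations`, the dictionary layouts, and the sample entries are assumptions made for the example, not the format of the database actually used.

```python
def attach_occupations(people, list_members):
    """Tag people born 1800-1980 with the occupations whose lists mention them.

    people:       dict mapping article title -> record dict (hypothetical layout)
    list_members: dict mapping occupation -> set of article titles found in the
                  chosen Wikipedia Lists and Categories (hypothetical layout)
    """
    tagged = {}
    for occupation, titles in list_members.items():
        # Step d: keep only the titles present in both sets.
        for title in people.keys() & titles:
            record = tagged.setdefault(
                title, {**people[title], "occupations": set()}
            )
            # Step e: associate the record with the occupation.
            record["occupations"].add(occupation)
    return tagged


# Illustrative input only.
people = {
    "Marie Curie": {"birth_year": 1867},
    "Mark Twain": {"birth_year": 1835},
}
list_members = {
    "physicist": {"Marie Curie", "Isaac Newton"},
    "chemist": {"Marie Curie"},
}
tagged = attach_occupations(people, list_members)
```

In this sketch a person appearing in several lists accumulates several occupations; whether the original analysis kept one occupation or many per person is not stated in the text.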
III.7.A.3 - Extraction of individuals appearing in Encyclopedia Britannica.
Encyclopedia Britannica is a hand-curated, high quality encyclopedic dataset with many detailed biographical entries. We obtained, in a private communication, structured datasets from Encyclopedia Britannica Inc. These datasets contain a complete record of all entries relating to individuals in the Encyclopedia Britannica. Each record contains the birth and death of the person at hand, as well as a set of information snippets summarizing the most critical biographical information available within the encyclopedia.
For the analysis of fame, we extract, from the dataset provided by Encyclopedia Britannica Inc., records of individuals born between 1800 and 1980. For every person, we retain, as a measure of their notability, a count of the number of biographical snippets present in the dataset. Figure S10b outlines the number of records parsed from the Encyclopedia Britannica dataset, as well as the number of these records ultimately retained for final analysis. Table S8 displays examples of records parsed in this step of the analysis procedure.
3) Create a database of records referring to people born 1800-1980 in Encyclopedia Britannica.
a. Using the internal database records provided by Encyclopedia Britannica Inc., find all entries referring to individuals born 1700-1980. Only people born in 1800-1980 are used for the purposes of fame analysis. People born in 1700-1799 are used to identify naming ambiguities as described in section III.7.A.7 of this Supplementary Material.
b. For these entries, create a record identified by a unique integer containing the individual's full name, as listed in the encyclopedia, and the individual's birth year.
c. For every record, find the number of encyclopedic informational snippets present in the Encyclopedia Britannica dataset. Append this count to the record.
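A minimal sketch of steps a through c follows. The input layout (`full_name`, `birth_year`, `snippets`) and the function name `build_records` are assumptions for illustration; the actual structure of the dataset supplied by Encyclopedia Britannica Inc. is not described here.

```python
from itertools import count


def build_records(entries):
    """Split Britannica entries into fame-analysis and disambiguation records.

    entries: iterable of dicts with the hypothetical keys
             'full_name', 'birth_year' and 'snippets' (a list of strings).
    """
    ids = count(1)  # unique integer identifiers
    fame_records, disambiguation_only = [], []
    for entry in entries:
        year = entry["birth_year"]
        # Step a: keep only individuals born 1700-1980.
        if not 1700 <= year <= 1980:
            continue
        # Steps b and c: record with id, name, birth year and snippet count.
        record = {
            "id": next(ids),
            "full_name": entry["full_name"],
            "birth_year": year,
            "snippet_count": len(entry["snippets"]),
        }
        # People born 1700-1799 are kept only for detecting name ambiguities.
        if year >= 1800:
            fame_records.append(record)
        else:
            disambiguation_only.append(record)
    return fame_records, disambiguation_only


# Illustrative input only; the snippets are placeholders.
entries = [
    {"full_name": "Person A", "birth_year": 1850, "snippets": ["s1", "s2", "s3"]},
    {"full_name": "Person B", "birth_year": 1750, "snippets": ["s1"]},
    {"full_name": "Person C", "birth_year": 1650, "snippets": ["s1", "s2"]},
]
fame_records, disambiguation_only = build_records(entries)
```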
III.7.A.4 - Produce spelling variants of the full names of individuals.
We ultimately wish to identify the name most commonly used to refer to an individual. Given the limits of OCR and the specificities of the method used to create the word frequency database, certain typographic elements such as accents, hyphens or quotation marks can complicate this process. As such, for every full name present in our database of people, we append variants of the full names where these typographic elements have been removed or, when possible, replaced. Table S9 presents examples of spelling variants for multiple names.
4) In both databases, for every record, create a set of raw name variants. To create the set:
a. Include the original raw name.
b. If the name includes apostrophes or quotation marks, include a variant where these elements are removed.
c. If the first word in the name contains a hyphen, include a name where this hyphen is replaced with a whitespace.
d. If the last word of the name is a numeral, include a name where this numeral has been removed.
e. For every element in the set which contains non-Latin characters, include a variant where these characters have been replaced using the closest Latin equivalent.
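Rules a through e can be sketched as follows. Several choices here are assumptions rather than details given in the text: a "numeral" is taken to mean a Roman numeral or a string of digits, rules b through d are each applied to the original name rather than chained, and the "closest Latin equivalent" is approximated by Unicode decomposition, which strips diacritics but leaves letters such as "ø" unchanged. The helper names are hypothetical.

```python
import re
import unicodedata

# Straight and curly apostrophes / quotation marks (assumed set).
QUOTES = "'\"\u2018\u2019\u201c\u201d"
# Assumption: a numeral is a Roman numeral or a run of digits.
NUMERAL = re.compile(r"[IVXLCDM]+|\d+")


def to_latin(text):
    """Approximate the closest Latin form by dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_variants(raw_name):
    variants = {raw_name}  # rule a: the original raw name
    words = raw_name.split()
    # Rule b: remove apostrophes and quotation marks.
    if any(ch in QUOTES for ch in raw_name):
        no_quotes = "".join(ch for ch in raw_name if ch not in QUOTES)
        variants.add(" ".join(no_quotes.split()))
    # Rule c: replace a hyphen in the first word with whitespace.
    if words and "-" in words[0]:
        variants.add(" ".join([words[0].replace("-", " ")] + words[1:]))
    # Rule d: drop a trailing numeral.
    if len(words) > 1 and NUMERAL.fullmatch(words[-1]):
        variants.add(" ".join(words[:-1]))
    # Rule e: add a Latin-only form of every element collected so far.
    for name in list(variants):
        if not name.isascii():
            variants.add(to_latin(name))
    return variants
```

For example, under these assumptions `name_variants("Jean-François Millet")` yields four variants: the original, the form with the hyphen replaced by a space, and the accent-free version of each.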