Technical Methodology Report / Supplementary Material (House Oversight Committee) - HOUSE_OVERSIGHT_017029

Extraction Summary

People: 0
Organizations: 3
Locations: 0
Events: 0
Relationships: 1
Quotes: 4

Document Information

Type: Technical methodology report / supplementary material (house oversight committee)

File Size: 2.21 MB

Summary

This document page (21) appears to be part of a technical appendix or supplementary material for a report regarding data analysis methodology, specifically related to measuring 'fame' or 'notability'. It details the process of extracting and processing biographical data from Encyclopedia Britannica and Wikipedia for individuals born between 1800 and 1980, including handling spelling variants and OCR limitations. The footer 'HOUSE_OVERSIGHT_017029' indicates this document is part of materials collected by the House Oversight Committee.

Organizations (3)

Name	Type	Context
Encyclopedia Britannica Inc.		Provided structured datasets via private communication.
Wikipedia		Source of articles and lists for data extraction.
House Oversight Committee		Identified via footer stamp HOUSE_OVERSIGHT_017029.

Relationships (1)

Report Authors → Data sharing → Encyclopedia Britannica Inc.

We obtained, in a private communication, structured datasets from Encyclopedia Britannica Inc.

Key Quotes (4)

"Encyclopedia Britannica is a hand-curated, high quality encyclopedic dataset with many detailed biographical entries."

Source

HOUSE_OVERSIGHT_017029.jpg

Quote #1

"We obtained, in a private communication, structured datasets from Encyclopedia Britannica Inc."

Source

HOUSE_OVERSIGHT_017029.jpg

Quote #2

"For the analysis of fame, we extract, from the dataset provided by Encyclopedia Britannica Inc., records of individuals born in between 1800 and 1980."

Source

HOUSE_OVERSIGHT_017029.jpg

Quote #3

"We ultimately wish to identify the most relevant name used to commonly refer to an individual."

Source

HOUSE_OVERSIGHT_017029.jpg

Quote #4

Full Extracted Text

Complete text extracted from the document (3,546 characters)

c. Using the online Wikipedia website, find all Wikipedia articles which are listed in the body of the chosen Wikipedia lists.
d. Intersect the set of all articles belonging to the relevant Lists and Categories with the set of people born 1800-1980. For people in both sets, append the occupation information.
e. Associate the records of these articles with the occupation.
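
Steps d and e above can be sketched as a set intersection. This is a minimal illustration only, not the authors' code: the function name `attach_occupations` and both input shapes (a mapping from article title to birth year, and a mapping from occupation to the article titles found in the chosen Lists and Categories) are assumptions made for the example.

```python
def attach_occupations(people, list_members):
    """Return {article title: set of occupations} for people born 1800-1980.

    people       -- hypothetical mapping: article title -> birth year
    list_members -- hypothetical mapping: occupation -> set of article titles
                    found in the chosen Wikipedia Lists and Categories
    """
    # Set of people whose birth year falls in the range used for analysis.
    in_range = {title for title, year in people.items() if 1800 <= year <= 1980}

    records = {}
    for occupation, titles in list_members.items():
        # Step d: keep only articles present in both sets.
        for title in titles & in_range:
            # Step e: associate the record with the occupation.
            records.setdefault(title, set()).add(occupation)
    return records
```

An article that appears in several lists accumulates several occupations; people outside the birth-year range, or absent from every list, are dropped.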

III.7.A.3 - Extraction of individuals appearing in Encyclopedia Britannica.
Encyclopedia Britannica is a hand-curated, high quality encyclopedic dataset with many detailed biographical entries. We obtained, in a private communication, structured datasets from Encyclopedia Britannica Inc. These datasets contain a complete record of all entries relating to individuals in the Encyclopedia Britannica. Each record contains the birth and death of the person at hand, as well as a set of information snippets summarizing the most critical biographical information available within the encyclopedia.

For the analysis of fame, we extract, from the dataset provided by Encyclopedia Britannica Inc., records of individuals born in between 1800 and 1980. For every person, we retain, as a measure of their notability, a count of the number of biographical snippets present in the dataset. Figure S10b outlines the number of records parsed from the Encyclopedia Britannica dataset, as well as the number of these records ultimately retained for final analysis. Table S8 displays examples of records parsed in this step of the analysis procedure.

3) Create a database of records referring to people born 1800-1980 in Encyclopedia Britannica.
a. Using the internal database records provided by Encyclopedia Britannica Inc., find all entries referring to individuals born 1700-1980. Only people born in 1800-1980 are used for the purposes of fame analysis. People born in 1700-1799 are used to identify naming ambiguities as described in section III.7.A.7 of this Supplementary Material.
b. For these entries, create a record identified by a unique integer containing the individual's full name, as listed in the encyclopedia, and the individual's birth year.
c. For every record, find the number of encyclopedic informational snippets present in the Encyclopedia Britannica dataset. Append this count to the record.
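
The record-building procedure in step 3 might look like the sketch below. The layout of the Britannica dataset is not described in this document, so the input format (a list of dictionaries with `full_name`, `birth_year` and `snippets` keys) and the names `Record` and `build_records` are hypothetical.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Record:
    record_id: int       # unique integer identifier (step b)
    full_name: str       # name as listed in the encyclopedia (step b)
    birth_year: int      # step b
    snippet_count: int   # number of biographical snippets (step c)
    used_for_fame: bool  # True only for people born 1800-1980 (step a)


def build_records(entries):
    """Build records from hypothetical entry dicts, keeping births 1700-1980."""
    ids = count(1)
    records = []
    for entry in entries:
        year = entry["birth_year"]
        # Step a: skip entries with no birth year or outside 1700-1980.
        if year is None or not 1700 <= year <= 1980:
            continue
        records.append(
            Record(
                record_id=next(ids),
                full_name=entry["full_name"],
                birth_year=year,
                snippet_count=len(entry["snippets"]),
                used_for_fame=1800 <= year <= 1980,
            )
        )
    return records
```

Records with `used_for_fame` set to False correspond to people born 1700-1799, who are retained only so that naming ambiguities can be detected later.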

III.7.A.4 – Produce spelling variants of the full names of individuals.
We ultimately wish to identify the most relevant name used to commonly refer to an individual. Given the limits of OCR and the specificities of the method used to create the word frequency database, certain typographic elements such as accents, hyphens or quotation marks can complicate this process. As such, for every full name present in our database of people, we append variants of the full names where these typographic elements have been removed or, when possible, replaced. Table S9 presents examples of spelling variants for multiple names.

4) In both databases, for every record, create a set of raw name variants. To create the set:
a. Include the original raw name.
b. If the name includes apostrophes or quotation marks, include a variant where these elements are removed.
c. If the first word in the name contains a hyphen, include a name where this hyphen is replaced with a whitespace.
d. If the last word of the name is a numeral, include a name where this numeral has been removed.
e. For every element in the set which contains non-Latin characters, include a variant where these characters have been replaced using the closest Latin equivalent.
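
Rules a through e above can be sketched as follows. Several details are assumptions not stated in the text: "numeral" is read as a Roman numeral or a run of digits, rules b through d are applied to the original name only, and the "closest Latin equivalent" is approximated by Unicode NFKD decomposition with combining marks stripped (which handles accented letters but leaves characters such as "ø" unchanged). The names `name_variants` and `to_latin` are illustrative.

```python
import re
import unicodedata

# Straight and curly apostrophes and quotation marks (rule b).
QUOTE_CHARS = "'\"\u2018\u2019\u201c\u201d"
# Assumption: a "numeral" is an upper-case Roman numeral or a run of digits.
NUMERAL = re.compile(r"^(?:[IVXLCDM]+|\d+)$")


def to_latin(text):
    """Approximate the closest Latin form by stripping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_variants(raw_name):
    variants = {raw_name}  # rule a: keep the original raw name
    # Rule b: drop apostrophes and quotation marks.
    if any(ch in raw_name for ch in QUOTE_CHARS):
        variants.add("".join(ch for ch in raw_name if ch not in QUOTE_CHARS))
    words = raw_name.split()
    # Rule c: hyphen in the first word becomes a space.
    if words and "-" in words[0]:
        variants.add(" ".join([words[0].replace("-", " ")] + words[1:]))
    # Rule d: drop a trailing numeral.
    if len(words) > 1 and NUMERAL.match(words[-1]):
        variants.add(" ".join(words[:-1]))
    # Rule e: add a Latin-only form of every variant collected so far.
    for name in list(variants):
        if not name.isascii():
            variants.add(to_latin(name))
    return variants
```

For example, under these assumptions "Jean-Paul Sartre" yields the additional variant "Jean Paul Sartre", and "Henry VIII" yields "Henry".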

21

HOUSE_OVERSIGHT_017029
