HOUSE_OVERSIGHT_017031.jpg

2.33 MB

Extraction Summary

People: 0
Organizations: 3
Locations: 0
Events: 0
Relationships: 0
Quotes: 4

Document Information

Type: Technical report / methodology appendix (House Oversight Committee production)
File Size: 2.33 MB
Summary

This document is page 23 of a technical report produced to the House Oversight Committee (Bates stamp HOUSE_OVERSIGHT_017031). It details a data processing methodology for analyzing name frequencies and 'fame signals' within a database. The text focuses on algorithmic steps (III.7.A.6 through III.7.A.8) to handle ambiguous names (homonymity) and distinguish between different individuals with similar names using data sources like Encyclopedia Britannica and Wikipedia.

Organizations (3)

Key Quotes (4)

Quote #1: "The frequency of the name, which corresponds to a measure of how often an individual is mentioned, provides a metric for the fame of that person." (Source: HOUSE_OVERSIGHT_017031.jpg)
Quote #2: "The fame signal is the timeseries of normalized word matches in the complete English database." (Source: HOUSE_OVERSIGHT_017031.jpg)
Quote #3: "Certain names are particularly popular and are shared by multiple people. This results in ambiguity, as the same query name may refer to a plurality of individuals." (Source: HOUSE_OVERSIGHT_017031.jpg)
Quote #4: "For the database of people extracted from Encyclopedia Britannica, we argue that the quantity of information available about an individual provides a proxy for their relevance." (Source: HOUSE_OVERSIGHT_017031.jpg)

Full Extracted Text

Complete text extracted from the document (3,564 characters)

h. Add to the set of query names all pairs of “first names + last names” produced by combining the sets of first and last names.
i. This procedure is carried out for every raw name variant.
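Steps h and i above amount to taking the Cartesian product of the first-name and last-name sets. A minimal sketch, assuming the names are already available as plain string sets; the function name `name_pairs` and the sample names are hypothetical and do not come from the report:

```python
from itertools import product


def name_pairs(first_names, last_names):
    """Return every "first last" combination, sorted for reproducibility."""
    return sorted(
        {f"{first} {last}" for first, last in product(first_names, last_names)}
    )


# Hypothetical example: two first-name variants and one last name.
query_names = name_pairs({"Charles", "Chuck"}, {"Darwin"})
```

Running the same function once per raw name variant and taking the union of the results would correspond to step i.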
III.7.A.6 – Find the word match frequencies of all names.
Given the set of names which may refer to an individual, we wish to find the time-resolved word frequencies of these names. The frequency of the name, which corresponds to a measure of how often an individual is mentioned, provides a metric for the fame of that person. We append the word frequencies of all the names which can potentially refer to an individual. This enables us, in a later step, to identify which name is the relevant one.
6) Append the fame signal for each query name of each record. The fame signal is the timeseries of normalized word matches in the complete English database.
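The page does not say how the word matches are normalized. One plausible reading, shown here purely as an assumption, is to divide the number of matches for a query name in each year by the total number of words in the database for that year. The function name `fame_signal` and both input dictionaries are hypothetical:

```python
def fame_signal(match_counts, total_counts):
    """Normalize yearly match counts by the yearly database size.

    match_counts: year -> number of matches for one query name
    total_counts: year -> total number of words in the database that year
    Years with an empty database are skipped to avoid division by zero.
    """
    return {
        year: match_counts.get(year, 0) / total
        for year, total in sorted(total_counts.items())
        if total > 0
    }


# Hypothetical counts, chosen only to keep the arithmetic easy to check.
signal = fame_signal({1900: 1, 1901: 3}, {1900: 4, 1901: 4, 1902: 0})
```

One such time series would then be appended to each query name of each record.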
III.7.A.7 – Find ambiguous names which can refer to multiple individuals.
Certain names are particularly popular and are shared by multiple people. This results in ambiguity, as the same query name may refer to a plurality of individuals. Homonymity conflicts occur between a group of individuals when they share part or all of their name. When these homonymity conflicts arise, the word frequency of a specific name may not reflect the number of references to a unique person, but rather the number of references to an entire group. As such, the word frequency does not constitute a clear means of tracking the fame of the individuals concerned. We identify homonymity conflicts by finding instances of individuals whose names contain complete or partial matches. These conflicts are, when possible, resolved in the following step on the basis of the importance of the conflicted individuals. Typical homonymity conflicts are shown in Table S11.
7) Identify homonymity conflicts. Homonymity conflicts arise when the query names of two or more individuals contain a substring match. These conflicts are distinguished as follows:
a. For every query name of every record, find the set of substrings of query names.
b. For every query name of every record, search for matches in the set of query name substrings of all other records.
c. Bidirectional homonymity conflicts occur when a query name fully matches another query name. The conflicted name could be used to refer to both individuals. Unidirectional conflicts occur when a query name has a substring match within another query name. Thus, the conflicted name can refer to one of the individuals, but can also be part of a name referring to another.
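Steps a through c can be sketched as follows. This is an illustration under stated assumptions, not the report's implementation: "substring" is interpreted here as a contiguous run of whole words (so that a short name is not matched inside an unrelated longer word), and the names `word_substrings`, `find_conflicts`, and the sample records are hypothetical.

```python
def word_substrings(name):
    """All contiguous word sequences of a name, e.g. 'a b' -> {'a', 'b', 'a b'}."""
    words = name.split()
    return {
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    }


def find_conflicts(records):
    """Find conflicts between records (record id -> set of query names).

    Returns a sorted list of (kind, name, record_a, record_b) tuples.
    For 'unidirectional' entries, name is a query name of record_a that
    appears inside a longer query name of record_b.
    """
    conflicts = set()
    for id_a, names_a in records.items():
        for id_b, names_b in records.items():
            if id_a == id_b:
                continue
            # Step a: substrings of every query name of the other record.
            subs_b = set().union(*(word_substrings(n) for n in names_b))
            # Step b: look up each query name of this record.
            for name in names_a:
                if name in names_b:
                    # Step c: full match; store once with ids in sorted order.
                    first, second = sorted((id_a, id_b))
                    conflicts.add(("bidirectional", name, first, second))
                elif name in subs_b:
                    # Step c: partial match inside a longer query name.
                    conflicts.add(("unidirectional", name, id_a, id_b))
    return sorted(conflicts)


# Hypothetical records, for illustration only.
records = {
    "A": {"John Smith"},
    "B": {"John Smith", "John Smith Jr"},
    "C": {"Smith"},
}
conflicts = find_conflicts(records)
```

In this example, "John Smith" is reported once as a bidirectional conflict between records A and B, and "Smith" is reported as a unidirectional conflict from record C into both A and B. The pairwise loop is quadratic in the number of records; a production version would more likely index substrings in a dictionary first.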
III.7.A.8 – Resolve, when possible, the most likely origin of ambiguous names.
The problem of homonymous individuals is limiting because the word frequency data do not allow us to resolve the true identity behind a homonymous name. Nonetheless, in some cases, it is possible to distinguish conflicted individuals on the basis of their importance. For the database of people extracted from Encyclopedia Britannica, we argue that the quantity of information available about an individual provides a proxy for their relevance. Likewise, for people obtained from Wikipedia, we can judge their importance by the size of the article written about the person and the quantity of traffic the article generates. As such, we approach the problem of ambiguous names by comparing the notability of individuals, as evaluated by the amount of information available about them in the respective encyclopedic source. Examples of conflict resolution are shown in Tables S12 and S13.
8) Resolve homonymity conflicts.
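The sub-steps of step 8 are not included on this page, so the following is only a sketch of the general idea described in the paragraph above: compare the amount of information available about each conflicted individual and assign the name to one of them only when that individual clearly dominates. The function name `resolve_conflict`, the `min_ratio` threshold, and the sample scores are all hypothetical and are not taken from the report.

```python
def resolve_conflict(notability, min_ratio=2.0):
    """Return the id of the dominant individual, or None if nobody dominates.

    notability: record id -> amount of information about that individual
                (for example, article length; the exact measure is assumed).
    min_ratio:  hypothetical dominance threshold, not a value from the report.
    """
    ranked = sorted(notability.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    (top_id, top_score), (_, runner_up) = ranked[0], ranked[1]
    if top_score > 0 and top_score >= min_ratio * runner_up:
        return top_id
    return None  # conflict left unresolved


# Hypothetical notability scores, for illustration only.
winner = resolve_conflict({"A": 12000, "B": 800})
unresolved = resolve_conflict({"A": 1000, "B": 900})
```

Returning `None` when no individual dominates mirrors the "when possible" wording: some conflicts are simply left unresolved.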
23
HOUSE_OVERSIGHT_017031
