HOUSE_OVERSIGHT_017030.jpg

2.23 MB

Extraction Summary

People: 2
Organizations: 3
Locations: 0
Events: 0
Relationships: 0
Quotes: 3

Document Information

Type: Technical methodology report / data processing protocol (House Oversight Committee production)
File Size: 2.23 MB
Summary

This is page 22 of a technical document produced to the House Oversight Committee (Bates stamp HOUSE_OVERSIGHT_017030). The text describes a data processing methodology (Section III.7.A.5) for standardizing and extracting individual names from databases like Encyclopedia Britannica and Wikipedia to create 'query names.' It outlines specific algorithmic rules for handling titles, prefixes (e.g., 'von', 'de'), and formatting issues to accurately identify individuals despite variations in how their names appear in text.

People (2)

Name | Role | Context
Henry David Thoreau | Example Subject | Used as an example of naming conventions in Encyclopedia Britannica/Wikipedia.
Oliver Joseph Lodge | Example Subject | Used as an example of naming conventions where the middle name is dropped.

Organizations (3)

Name Type Context
Encyclopedia Britannica
Source database mentioned for name extraction rules.
Wikipedia
Source database mentioned for name extraction rules.
US House Committee on Oversight and Accountability
Implied by the 'HOUSE_OVERSIGHT' Bates stamp.

Key Quotes (3)

"Find possible names used to refer to individuals."
Source
HOUSE_OVERSIGHT_017030.jpg
Quote #1
"Given a full name with complex structure potentially containing details such as titles, initials, nobility rights and ranks, in addition to multiple first and last names, we must extract a list of simple names"
Source
HOUSE_OVERSIGHT_017030.jpg
Quote #2
"Query names are (2,3) grams which will be used in order to measure the fame of the individual."
Source
HOUSE_OVERSIGHT_017030.jpg
Quote #3

Full Extracted Text

Complete text extracted from the document (3,487 characters)

III.7.A.5 – Find possible names used to refer to individuals.
The common name of an individual sometimes differs significantly from the complete, formal name present in Encyclopedia Britannica and Wikipedia. This encyclopedia full name can contain details such as titles, initials and military or nobility standings, which are not commonly used when referring to an individual in most publications. Even in simpler cases, when the full name contains only first, middle and last names, there is no systematic convention for which names to use when talking about an individual. Henry David Thoreau is most commonly referred to by his full name, not "Henry Thoreau" nor "David Thoreau", whereas Oliver Joseph Lodge is mentioned by his first and last name "Oliver Lodge", not his full name "Oliver Joseph Lodge".
Given a full name with complex structure potentially containing details such as titles, initials, nobility rights and ranks, in addition to multiple first and last names, we must extract a list of simple names, using three words at most, which can potentially be used to refer to this individual. This set of names is created by generating combinations of names found in the raw name. Furthermore, whenever they appear we systematically exclude common words such as titles or ranks from these names. The query name sets of several individuals are displayed in Table S10.
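As a rough illustration of the combination idea, the sketch below builds two- and three-word candidates from given names and a last name that have already been separated. The function name `candidate_names` and the exact combination rule (one or more given names, kept in their original order, followed by the last name) are assumptions made for illustration; the text here only states that combinations of at most three words are generated.

```python
from itertools import combinations


def candidate_names(given_names, last_name, max_words=3):
    """Return candidate query names of at most max_words words.

    Assumption (not stated in the text): a candidate is one or more
    given names, in their original order, followed by the last name.
    """
    # A prefixed last name such as "van Beethoven" counts as two words.
    last_len = len(last_name.split())
    candidates = []
    for total in range(last_len + 1, max_words + 1):
        k = total - last_len  # number of given names in this candidate
        for combo in combinations(given_names, k):
            candidates.append(" ".join(combo + (last_name,)))
    return candidates


print(candidate_names(["Henry", "David"], "Thoreau"))
# ['Henry Thoreau', 'David Thoreau', 'Henry David Thoreau']
```

Under this assumed rule, both "Oliver Lodge" and "Oliver Joseph Lodge" would appear among the candidates for Oliver Joseph Lodge, so the commonly used form is covered whichever convention applies to a given person.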
5) For every record, using the set of raw names, create a set of query names. Query names are (2,3) grams which will be used in order to measure the fame of the individual. The following procedure is iterated on every raw name variant associated with the record. Steps for which the record type is not specified are carried out for both record types.
a. For Encyclopedia Britannica records, truncate the raw name at the second comma, then reorder it so that the part preceding the first comma follows the part after that comma.
b. For Wikipedia records, replace the underscores with spaces.
c. Truncate the name string at the first (if any) parenthesis or comma.
d. Truncate the name string at the beginning of the words 'in', 'In', 'the', 'The', 'of' and 'Of', if these are present.
e. Create the last name set. Iterating over the words of the name from last to first, add the first word encountered that has the following properties:
i. Begin with a capitalized letter.
ii. Longer than 1 character.
iii. Not ending in a period.
iv. If the word preceding this last name is identified as a prefix ('von', 'de', 'van', 'der', 'de', 'd'', 'al-', 'la', 'da', 'the', 'le', 'du', 'bin', 'y', 'ibn' and their capitalized versions), the last name is a 2-gram containing both the prefix and the word.
f. If the last name contains a capitalized character besides the first one, add to the set of last names a variant of this word in which only the first letter is capitalized.
g. Create the set of first names. Iterating over the raw name elements which are not part of the last name set, candidate first names are words with the following properties:
i. Begin with a capitalized letter.
ii. Longer than 1 character.
iii. Not ending in a period.
iv. Not a title. ('Archduke', 'Saint', 'Emperor', 'Empress', 'Mademoiselle', 'Mother', 'Brother', 'Sister', 'Father', 'Mr', 'Mrs', 'Marshall', 'Justice', 'Cardinal', 'Archbishop', 'Senator', 'President', 'Colonel', 'General', 'Admiral', 'Sir', 'Lady', 'Prince', 'Princess', 'King', 'Queen', 'de', 'Baron', 'Baroness', 'Grand', 'Duchess', 'Duke', 'Lord', 'Count', 'Countess', 'Dr')
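Steps a through g, as far as they are listed on this page, can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `source` labels, whitespace tokenization, and the sample input strings are all assumptions, and the `d'` prefix entry is a reading of a garbled item in the list above.

```python
import re

# Word lists copied from steps d, e.iv and g.iv; "d'" is a guessed reading.
STOP_WORDS = {"in", "In", "the", "The", "of", "Of"}
PREFIXES = {"von", "de", "van", "der", "d'", "al-", "la", "da", "the",
            "le", "du", "bin", "y", "ibn"}
TITLES = {
    "Archduke", "Saint", "Emperor", "Empress", "Mademoiselle", "Mother",
    "Brother", "Sister", "Father", "Mr", "Mrs", "Marshall", "Justice",
    "Cardinal", "Archbishop", "Senator", "President", "Colonel", "General",
    "Admiral", "Sir", "Lady", "Prince", "Princess", "King", "Queen", "de",
    "Baron", "Baroness", "Grand", "Duchess", "Duke", "Lord", "Count",
    "Countess", "Dr",
}


def normalize(raw, source):
    """Steps a-d: turn a raw record name into a list of words."""
    name = raw
    if source == "britannica":
        # Step a: keep the text before the second comma, then move the
        # part before the first comma to the end.
        parts = name.split(",")[:2]
        if len(parts) == 2:
            name = parts[1].strip() + " " + parts[0].strip()
    elif source == "wikipedia":
        # Step b: underscores become spaces.
        name = name.replace("_", " ")
    # Step c: cut at the first parenthesis or comma, if any.
    name = re.split(r"[(,]", name, maxsplit=1)[0]
    # Step d: cut at the first occurrence of a listed stop word.
    words = []
    for word in name.split():
        if word in STOP_WORDS:
            break
        words.append(word)
    return words


def is_name_word(word):
    """Properties i-iii, shared by steps e and g."""
    return len(word) > 1 and word[0].isupper() and not word.endswith(".")


def split_last_name(words):
    """Steps e-f: return (last-name variants, remaining words)."""
    for i in range(len(words) - 1, -1, -1):
        if not is_name_word(words[i]):
            continue
        # Step e.iv: a preceding prefix (any capitalization) joins the name.
        has_prefix = i > 0 and words[i - 1].lower() in PREFIXES
        start = i - 1 if has_prefix else i
        prefix = words[i - 1] + " " if has_prefix else ""
        variants = {prefix + words[i]}
        # Step f: add a variant with only the first letter capitalized.
        if words[i][1:] != words[i][1:].lower():
            variants.add(prefix + words[i][0] + words[i][1:].lower())
        return variants, words[:start] + words[i + 1:]
    return set(), list(words)


def first_names(rest):
    """Step g: candidate first names among the non-last-name words."""
    return [w for w in rest if is_name_word(w) and w not in TITLES]


# Hypothetical raw names, invented for illustration.
for raw, source in [("Lodge, Sir Oliver Joseph", "britannica"),
                    ("Ludwig_van_Beethoven", "wikipedia")]:
    lasts, rest = split_last_name(normalize(raw, source))
    print(sorted(lasts), first_names(rest))
```

Matching prefixes on the lowercased word is one way to cover "their capitalized versions" without listing each prefix twice. Names whose prefix is fused to the surname in a single token would need extra handling that the steps above do not spell out.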
22
HOUSE_OVERSIGHT_017030
