HOUSE_OVERSIGHT_017028.jpg

2.35 MB

Extraction Summary

People: 1
Organizations: 3
Locations: 0
Events: 0
Relationships: 0
Quotes: 3

Document Information

Type: Academic paper / methodology supplement (house oversight production)
File Size: 2.35 MB
Summary

This document appears to be a page from a supplementary methodology section of an academic paper or report regarding data analysis of Wikipedia entries. It details the process of creating a database of people born between 1800 and 1980 to analyze 'fame' and identify occupations using DBPedia and Wikipedia categories. The document is stamped with 'HOUSE_OVERSIGHT_017028', indicating it was part of a document production to the House Oversight Committee.

People (1)

Name Role Context
Che Guevara Historical Figure / Example
Used as an example of occupation classification issues in Wikipedia (listed as Biologist, though a medical doctor by ...

Organizations (3)

Name Type Context
Wikipedia
Source of data for the methodology described.
DBPedia
Framework used to find articles and categories.
House Oversight Committee
Implied by the footer 'HOUSE_OVERSIGHT_017028'.

Key Quotes (3)

"Create a database of records referring to people born 1800-1980 in Wikipedia."
Source
HOUSE_OVERSIGHT_017028.jpg
Quote #1
"Only people both in 1800-1980 are used for the purposes of fame analysis."
Source
HOUSE_OVERSIGHT_017028.jpg
Quote #2
"For instance, 'Che Guevara' was listed in Biologists; so even though he was a medical doctor by training, this is not his primary historical contribution."
Source
HOUSE_OVERSIGHT_017028.jpg
Quote #3

Full Extracted Text

Complete text extracted from the document (3,731 characters)

known, their article will be a member of a “decade_births” category such as “1890s_births” and
“1930s_births”. We treat these individuals as if born at the beginning of the decade.
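The decade rule above can be illustrated with a small sketch. The function name and the regular expression are hypothetical choices made for this example and are not taken from the source; only the category naming pattern ("1890s_births") and the rule of assigning the first year of the decade come from the text.

```python
import re
from typing import Optional

# Matches "1894_births" (exact year) or "1890s_births" (decade only).
# The pattern is an assumption based on the category names quoted in the text.
_BIRTH_CATEGORY = re.compile(r"^(\d{4})(s?)_births$")


def birth_year_from_category(category: str) -> Optional[int]:
    """Return the birth year implied by a births category, or None otherwise.

    Decade categories are treated as if the person was born at the
    beginning of the decade, e.g. "1890s_births" gives 1890.
    """
    match = _BIRTH_CATEGORY.match(category)
    if match is None:
        return None
    year = int(match.group(1))
    if match.group(2) == "s":
        year -= year % 10  # snap to the first year of the decade
    return year
```

For example, `birth_year_from_category("1930s_births")` yields 1930, while a category that is not about births yields `None`.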
For every parsed article, we append metadata relating to the importance of the article within Wikipedia,
namely the size in words of the article and the number of page views which it obtains. The article word
count is created by directly accessing the article using its URL. The traffic statistics for Wikipedia articles
are obtained from http://stats.grok.se/.
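A minimal sketch of the word-count step, assuming the article HTML has already been downloaded as a string. The class and function names are hypothetical, and skipping script and style content is an added assumption; the source only states that markup tags are removed before counting.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring tags and script/style content."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Joining with a space keeps adjacent block elements apart; it can
        # split a word that is only partly wrapped in a tag, so the result
        # is an approximate count.
        return " ".join(self._chunks)


def count_words(html: str) -> int:
    """Approximate word count of an HTML fragment after removing markup."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return len(parser.text().split())
```

The count is whitespace-delimited, which matches the "approximate word count" wording used for Table S7 but is not necessarily the tokenization the authors used.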
Figure S10a displays the number of records parsed from Wikipedia and retained for the final cohort
analysis. Table S7 displays specific examples from the extraction’s output, including name, year of birth,
year of death, approximate word count of main article and traffic statistics for March 2010.
1) Create a database of records referring to people born 1800-1980 in Wikipedia.
a. Using the DBPedia framework, find all articles which are members of the categories
‘1700_births’ through ‘1980_births’. Only people born in 1800-1980 are used for the
purposes of fame analysis. People born in 1700-1799 are used to identify naming
ambiguities as described in section III.7.A.7 of this Supplementary Material.
b. For all these articles, create a record identified by the article URL, and append the birth
year.
c. For every record, use the URL to navigate to the online Wikipedia page. Within the main
article body text, remove all HTML markup tags and perform a word count. Append this
word count to the record.
d. For every record, use the URL to determine the page’s traffic statistics for the month of
March 2010. Append the number of views to the record.
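The record structure built in steps a through d might look like the following sketch. The `Record` class, its field names, and the helper `birth_categories` are hypothetical; the year bounds (1700, 1800, 1980) and the split between fame analysis and naming-ambiguity use are taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

FIRST_YEAR = 1700   # earliest births category collected
FAME_START = 1800   # earliest birth year used for fame analysis
LAST_YEAR = 1980    # latest births category collected


def birth_categories() -> List[str]:
    """Category names '1700_births' through '1980_births'."""
    return [f"{year}_births" for year in range(FIRST_YEAR, LAST_YEAR + 1)]


@dataclass
class Record:
    url: str                                  # article URL, used as the identifier
    birth_year: int
    word_count: Optional[int] = None          # filled in by step c
    views_march_2010: Optional[int] = None    # filled in by step d

    @property
    def used_for_fame_analysis(self) -> bool:
        # People born 1700-1799 are kept only to detect naming ambiguities.
        return FAME_START <= self.birth_year <= LAST_YEAR
```

Leaving `word_count` and `views_march_2010` optional reflects that they are appended in later steps, after the record has been created from the category membership.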
III.7.A.2 – Identification of occupation for individuals appearing in Wikipedia.
Two types of structural elements within Wikipedia enable us to identify, for certain individuals, their
occupation. The first, Wikipedia Categories, was previously described and used to recognize articles
about people. Wikipedia Categories also contain information pertaining to occupation. The categories
“Physicists”, “Physicists by Nationality”, “Physicists stubs”, along with their subcategories, pinpoint articles
relating to the occupation of physicist. The second, Wikipedia Lists, are special pages dedicated to
listing Wikipedia articles which fit a precise subject. For physicists, relevant examples are “List of
physicists”, “List of plasma physicists” and “List of theoretical physicists”. Given their redundancy, these
two structural elements, when used in combination, provide a strong means of identifying the occupation
of an individual.
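One way to combine the two sources is sketched below. The text does not say whether the combination is a union, an intersection, or something else, so this example simply records which source supports each article and leaves that decision to the caller. The function name and the "category" and "list" labels are hypothetical.

```python
from typing import Dict, Iterable, Set


def occupation_evidence(
    category_members: Iterable[str],
    list_members: Iterable[str],
) -> Dict[str, Set[str]]:
    """Map each article URL to the set of sources that link it to an occupation."""
    evidence: Dict[str, Set[str]] = {}
    for url in category_members:
        evidence.setdefault(url, set()).add("category")
    for url in list_members:
        evidence.setdefault(url, set()).add("list")
    return evidence
```

Articles supported by both sources can then be selected with `{u for u, s in evidence.items() if len(s) == 2}`, and the union is simply the set of keys.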
Next, we selected the top 50 individuals in each category, and annotated each one manually as a function
of the individual’s main occupation, as determined by reading the associated Wikipedia article. For
instance, “Che Guevara” was listed in Biologists; so even though he was a medical doctor by training, this
is not his primary historical contribution. The most famous individuals of each category born between
1800 and 1920 are given in the Appendix.
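The selection of the top 50 individuals per category, prior to manual annotation, could be sketched as follows. This page does not state which quantity defines "top", so the numeric `score` here is an assumption standing in for whatever fame measure is used; the function name and the tuple layout are also hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_n_per_category(
    rows: Iterable[Tuple[str, str, int]],
    n: int = 50,
) -> Dict[str, List[str]]:
    """Return the n highest-scoring names in each category.

    Each row is (category, name, score). Ties are broken alphabetically
    by name so that the output is deterministic.
    """
    by_category = defaultdict(list)
    for category, name, score in rows:
        by_category[category].append((score, name))
    return {
        category: [
            name
            for _score, name in sorted(members, key=lambda m: (-m[0], m[1]))[:n]
        ]
        for category, members in by_category.items()
    }
```

The manual annotation itself is not automated here: each selected name would still be checked by reading the associated article, as in the Che Guevara example.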
In our database of individuals, we append, when available, information about the occupations of people.
This enables the comparison, on the basis of fame, of groups of individuals distinguished by their
occupational decisions.
2) Associate Wikipedia records of individuals with occupations using relevant Wikipedia
“Categories” and “Lists” pages. For every occupation to be investigated:
a. Manually create a list of Wikipedia categories and lists associated with this defined
occupation.
b. Using the DBPedia framework, find all the Wikipedia articles which are members of the
chosen Wikipedia categories.
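Step b can be approximated today with a SPARQL query against DBpedia. The text does not describe how the "DBPedia framework" was actually accessed, so the following is only a sketch: it assumes the current DBpedia conventions of linking articles to categories through the `dct:subject` predicate and of naming category resources under `http://dbpedia.org/resource/Category:`. The function name is hypothetical, and the code builds the query string without sending it anywhere.

```python
from typing import Iterable

# Assumed DBpedia conventions; these are not stated in the source text.
CATEGORY_PREFIX = "http://dbpedia.org/resource/Category:"
SUBJECT_PREDICATE = "http://purl.org/dc/terms/subject"


def category_members_query(categories: Iterable[str]) -> str:
    """Build a SPARQL query listing articles that belong to the given categories."""
    values = " ".join(
        f"<{CATEGORY_PREFIX}{name.replace(' ', '_')}>" for name in categories
    )
    return (
        "SELECT DISTINCT ?article ?category WHERE {\n"
        f"  VALUES ?category {{ {values} }}\n"
        f"  ?article <{SUBJECT_PREDICATE}> ?category .\n"
        "}"
    )
```

Calling `category_members_query(["Physicists", "Physicists by nationality"])` produces a query over both categories; subcategories would need to be expanded separately, since this query matches direct membership only.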
HOUSE_OVERSIGHT_017028
