Technical Report / Academic Paper Supplement (Methodology Section) - HOUSE_OVERSIGHT_017013

Extraction Summary

People: 1
Organizations: 3
Locations: 0
Events: 2
Relationships: 0
Quotes: 3

Document Information

Type: Technical report / academic paper supplement (methodology section)

File Size: 2.3 MB

Summary

This document appears to be a methodology appendix for a study on 'Historical N-grams Corpora' utilizing Google Books data. It describes the technical process of filtering metadata to ensure accuracy, specifically removing serial publications via an algorithm dubbed 'Serial Killer.' The document bears a 'HOUSE_OVERSIGHT_017013' Bates stamp, indicating it was part of a document production to the House Oversight Committee, though the text itself contains no direct references to Epstein, Maxwell, or specific criminal activities.

People (1)

Name	Role	Context
Annotator	Researcher/Verifier	An individual with no knowledge of the study who manually determined date-of-publication for 1000 volumes.

Organizations (3)

Name	Type	Context
Google		Digitized 15 million books used as the source for the study.
US Government		Mentioned in the context of 'US Government report' as a filter phrase.
House Oversight Committee		Document bears the Bates stamp 'HOUSE_OVERSIGHT'.

Timeline (2 events)

Date	Event	Location	Participants
1550-2008	Analysis period for n-gram frequency tables.	Global (English and foreign language corpora)
1801-2000	Metadata accuracy examination of 1000 filtered volumes.	N/A	Annotator

Key Quotes (3)

Quote #1: "As noted in the paper text, we did not analyze the entire set of 15 million books digitized by Google."
Source: HOUSE_OVERSIGHT_017013.jpg

Quote #2: "Our 'Serial Killer' algorithm removed serial publications by looking for suggestive metadata entries"
Source: HOUSE_OVERSIGHT_017013.jpg

Quote #3: "For English books, 29.4% of books were filtered using the 'Serial Killer'"
Source: HOUSE_OVERSIGHT_017013.jpg

Full Extracted Text

Complete text extracted from the document (3,399 characters)

II. Construction of Historical N-grams Corpora

As noted in the paper text, we did not analyze the entire set of 15 million books digitized by Google.
Instead, we
1. Performed further filtering steps to select only a subset of books with highly accurate metadata.
2. Subdivided the books into 'base corpora' using such metadata fields as language, country of publication, and subject.
3. Constructed, for each base corpus, a massive numerical table that lists, for each n-gram (often a word or phrase), how often it appears in the given base corpus in every single year between 1550 and 2008.
In this section, we will describe these three steps. These additional steps ensure high data quality, and also make it possible to examine historical trends without violating the 'fair use' principle of copyright law: our object of study is the frequency tables produced in step 3 (which are available as supplemental data), and not the full-text of the books.
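As a rough illustration of step 3, the sketch below counts n-grams per year for one small base corpus. It is not the authors' pipeline: the (year, text) input shape, the whitespace tokenizer, and the names `ngrams` and `build_frequency_table` are assumptions made only for this example.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def build_frequency_table(books, n, first_year=1550, last_year=2008):
    """Map each n-gram to a Counter of {year: occurrences} for one base corpus.

    `books` is an iterable of (year, text) pairs. This input shape and the
    whitespace tokenizer are assumptions, not the method used in the study.
    """
    table = defaultdict(Counter)
    for year, text in books:
        if not first_year <= year <= last_year:
            continue  # outside the analysis period
        tokens = text.split()
        for gram in ngrams(tokens, n):
            table[gram][year] += 1
    return table


# Toy base corpus: two books dated 1900 and one dated 1950.
toy_corpus = [
    (1900, "the cat sat on the mat"),
    (1900, "the cat ran"),
    (1950, "the dog sat"),
]
table = build_frequency_table(toy_corpus, n=2)
print(table["the cat"][1900])
```

Only the resulting counts are kept, which mirrors the point made above: the object of study is the frequency table rather than the full text of the books.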

II.1. Additional filtering of books

II.1A. Accuracy of Date-of-Publication metadata
Accurate date-of-publication data is a crucial component in the production of time-resolved n-grams data. Because our study focused most centrally on the English language corpus, we decided to apply more stringent inclusion criteria in order to make sure the accuracy of the date-of-publication data was as high as possible.
We found that the lion's share of date-of-publication errors were due to so-called 'bound-withs' - single volumes that contain multiple works, such as anthologies or collected works of a given author. Among these bound-withs, the most inaccurately dated subclass were serial publications, such as journals and periodicals. For instance, many journals had publication dates which were erroneously attributed to the year in which the first issue of the journal had been published. These journals and serial publications also represented a different aspect of culture than the books did. For these reasons, we decided to filter out all serial publications to the extent possible. Our 'Serial Killer' algorithm removed serial publications by looking for suggestive metadata entries, containing one or more of the following:
1. Serial-associated titles, containing such phrases as 'Journal of', 'US Government report', etc.
2. Serial-associated authors, such as cases in which the author field is blank, lists too many authors, or contains words such as 'committee'.
Note that the match is case-insensitive, and it must be to a complete word in the title; thus the filtering of titles containing the word 'digest' does not lead to the removal of works with 'digestion' in the title. The entire list of serial-associated title phrases and serial-associated author phrases is included as supplemental data (Appendix). For English books, 29.4% of books were filtered using the 'Serial Killer', with the title filter removing 2% and the author filter removing 27.4%. Foreign language corpora were filtered in a similar fashion.
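The matching rule described above (case-insensitive, complete words only) can be sketched with regular-expression word boundaries. This is a minimal sketch, not the 'Serial Killer' implementation: the phrase lists here are short stand-ins for the supplemental lists, the `MAX_AUTHORS` cutoff for "too numerous" is an invented value, the author field is assumed to be a list of strings, and applying whole-word matching to authors as well as titles is an assumption.

```python
import re

# Hypothetical stand-ins; the actual phrase lists are in the supplemental data.
TITLE_PHRASES = ["journal of", "us government report", "digest"]
AUTHOR_WORDS = ["committee"]
MAX_AUTHORS = 5  # assumed cutoff for "too numerous"; not given in the text


def _whole_word_pattern(phrases):
    """Compile a case-insensitive pattern matching any phrase as whole words."""
    alternatives = "|".join(re.escape(p) for p in phrases)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


TITLE_RE = _whole_word_pattern(TITLE_PHRASES)
AUTHOR_RE = _whole_word_pattern(AUTHOR_WORDS)


def looks_like_serial(title, authors):
    """Return True if the metadata suggests a serial publication."""
    if TITLE_RE.search(title):
        return True  # serial-associated title phrase
    if not authors or all(not a.strip() for a in authors):
        return True  # blank author field
    if len(authors) > MAX_AUTHORS:
        return True  # too many authors
    return any(AUTHOR_RE.search(a) for a in authors)


print(looks_like_serial("Reader's DIGEST", ["A. Smith"]))
print(looks_like_serial("The Physiology of Digestion", ["A. Smith"]))
```

The `\b` anchors are what keep 'digest' from matching inside 'digestion', and `re.IGNORECASE` handles the case-insensitive requirement.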
This filtering step markedly increased the accuracy of the metadata dates. We determined metadata accuracy by examining 1000 filtered volumes distributed uniformly over time from 1801-2000 (5 per year). An annotator with no knowledge of our study manually determined the date-of-publication. The annotator was aware of the Google metadata dates during this process. We found that 5.8% of English books had
