HOUSE_OVERSIGHT_017014.jpg

Extraction Summary

People: 1
Organizations: 2
Locations: 0
Events: 1
Relationships: 0
Quotes: 3

Document Information

Type: Technical report / methodology paper (page 6)
File Size: 3.11 MB
Summary

This document is page 6 of a technical report describing the methodology used to create data corpora from Google Book Search. It covers Sections II.1B (OCR Quality) and II.1C (Accuracy of language metadata), which explain the algorithms used to filter out volumes with poor-quality text and incorrect dates. Although the content is technical, the footer 'HOUSE_OVERSIGHT_017014' indicates that the document was produced as evidence or reference material for a US House Oversight Committee investigation.
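
The OCR-quality filtering step mentioned above can be illustrated with a minimal sketch. It assumes each volume already carries a numeric quality score; the `Volume` type, the `ocr_quality` field, the `filter_by_ocr_quality` function, and the 0.8 cutoff are all hypothetical and are not taken from the report, which attributes its actual quality-assessment algorithm to Popat et al.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    volume_id: str
    ocr_quality: float  # hypothetical score in [0, 1]; higher means cleaner OCR


def filter_by_ocr_quality(volumes, min_quality=0.8):
    """Keep only volumes whose score meets the (assumed) cutoff."""
    return [v for v in volumes if v.ocr_quality >= min_quality]


# Illustrative data only; these are not values from the report.
volumes = [
    Volume("vol-a", 0.95),
    Volume("vol-b", 0.42),
    Volume("vol-c", 0.80),
]

kept = filter_by_ocr_quality(volumes)
print([v.volume_id for v in kept])
```

A simple threshold is only one possible exclusion rule; the report's own criterion may differ.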

People (1)

Name | Role | Context
Popat et al. | Researcher/Algorithm Developer | Developed the algorithm used to assess OCR quality (Ref S3).

Organizations (2)

Name | Type | Context
Google | - | Mentioned in context of 'Google Book Search' and the generation of corpora.
House Oversight Committee | - | Implied by the footer 'HOUSE_OVERSIGHT_017014', indicating this document is part of a congressional investigation pro...

Timeline (1 event)

August 2009
Generation of the base corpora from Google Book Search.
N/A

Key Quotes (3)

"Note that since the base corpora were generated (August 2009), many additional improvements have been made to the metadata dates used in Google Book Search itself."
Source
HOUSE_OVERSIGHT_017014.jpg
Quote #1
"The challenge of performing accurate OCR on the entire books dataset is compounded by variations in such factors as language, font, size, legibility, and physical condition of the book."
Source
HOUSE_OVERSIGHT_017014.jpg
Quote #2
"To ensure the highest quality data, we excluded volumes with poor OCR quality."
Source
HOUSE_OVERSIGHT_017014.jpg
Quote #3
