HOUSE_OVERSIGHT_017014.jpg
3.11 MB
Extraction Summary
| Category | Count |
|---|---|
| People | 1 |
| Organizations | 2 |
| Locations | 0 |
| Events | 1 |
| Relationships | 0 |
| Quotes | 3 |
Document Information
Type: Technical report / methodology paper (page 6)
File Size: 3.11 MB
Summary
This document is page 6 of a technical report describing the methodology used to create data corpora from Google Book Search. It covers Sections II.1B (OCR Quality) and II.1C (Accuracy of language metadata), explaining the algorithms used to filter out poor-quality text and incorrect dates. Although the content is technical, the footer 'HOUSE_OVERSIGHT_017014' indicates that the document was produced as evidence or reference material for a US House Oversight Committee investigation.
People (1)
| Name | Role | Context |
|---|---|---|
| Popat et al. | Researcher/Algorithm Developer | Developed the algorithm used to assess OCR quality (Ref S3). |
Organizations (2)
| Name | Type | Context |
|---|---|---|
| Google | | Mentioned in context of 'Google Book Search' and the generation of corpora. |
| House Oversight Committee | | Implied by the footer 'HOUSE_OVERSIGHT_017014', indicating this document is part of a congressional investigation pro... |
Key Quotes (3)
Quote #1
"Note that since the base corpora were generated (August 2009), many additional improvements have been made to the metadata dates used in Google Book Search itself."
Source: HOUSE_OVERSIGHT_017014.jpg
Quote #2
"The challenge of performing accurate OCR on the entire books dataset is compounded by variations in such factors as language, font, size, legibility, and physical condition of the book."
Source: HOUSE_OVERSIGHT_017014.jpg
Quote #3
"To ensure the highest quality data, we excluded volumes with poor OCR quality."
Source: HOUSE_OVERSIGHT_017014.jpg