Technical Report / Methodology Paper (Page 6) - HOUSE_OVERSIGHT_017014

Extraction Summary

People: 1
Organizations: 2
Locations: 0
Events: 1
Relationships: 0
Quotes: 3

Document Information

Type: Technical report / methodology paper (page 6)

File Size: 3.11 MB

Summary

This document is page 6 of a technical report describing the methodology for creating data corpora from Google Book Search. It covers Sections II.1B (OCR Quality) and II.1C (Accuracy of language metadata) and explains the algorithms used to filter out poor-quality text and incorrect dates. Although the content is technical, the footer 'HOUSE_OVERSIGHT_017014' indicates that the document was produced as evidence or reference material for a US House Oversight Committee investigation.

People (1)

Name	Role	Context
Popat et al.	Researcher/Algorithm Developer	Developed the algorithm used to assess OCR quality (Ref S3).

Organizations (2)

Name	Type	Context
Google		Mentioned in context of 'Google Book Search' and the generation of corpora.
House Oversight Committee		Implied by the footer 'HOUSE_OVERSIGHT_017014', indicating this document is part of a congressional investigation pro...

Timeline (1 event)

August 2009

Generation of base corpora for Google Book Search.

N/A

Google

Key Quotes (3)

Quote #1 (Source: HOUSE_OVERSIGHT_017014.jpg)

"Note that since the base corpora were generated (August 2009), many additional improvements have been made to the metadata dates used in Google Book Search itself."

Quote #2 (Source: HOUSE_OVERSIGHT_017014.jpg)

"The challenge of performing accurate OCR on the entire books dataset is compounded by variations in such factors as language, font, size, legibility, and physical condition of the book."

Quote #3 (Source: HOUSE_OVERSIGHT_017014.jpg)

"To ensure the highest quality data, we excluded volumes with poor OCR quality."
