Technical Report / Government Exhibit - HOUSE_OVERSIGHT_017011

Extraction Summary

Category	Count
People	2
Organizations	6
Locations	4
Events	2
Relationships	3
Quotes	3

Document Information

Type: Technical report / government exhibit

File Size: 2.33 MB

Summary

This document is a technical overview of the Google Books digitization project, focusing on scanning, the collection of metadata from libraries, retailers, aggregators, and publishers, and the creation of a database of consensus records. It reports that an August 2010 evaluation identified 129 million book editions. Although the Bates stamp 'HOUSE_OVERSIGHT_017011' suggests the document was produced as part of a government investigation, the text contains no references to Jeffrey Epstein, his associates, or his activities.

People (2)

Name	Role	Context
Mark Twain	Author	Mentioned as the author of Tom Sawyer in the context of book editions.
Johan Braakensiek	Translator	Translator of the Dutch edition of Tom Sawyer.

Organizations (6)

Name	Type	Context
Google		The entity conducting the book digitization project.
University of Michigan		Source of books for scanning.
New York Public Library		Source of books for scanning.
Decitre		French bookseller providing metadata.
Ingram		Source of metadata.
House Oversight Committee		Implied by the footer 'HOUSE_OVERSIGHT'.

Timeline (2 events)

Date	Event	Organization
2004	Google began scanning books to make their contents searchable and discoverable online.	Google
August 2010	Evaluation identified 129 million editions of books ever published.	Google

Locations (4)

Location	Context
Michigan	Location of University of Michigan.
New York	Location of New York Public Library.
Bosnia	Mentioned regarding 'Bosnian libraries'.
France	Implied by 'French bookseller'.

Relationships (3)

Google → Partnership → University of Michigan

books... are borrowed from large libraries such as the University of Michigan

Google → Partnership → New York Public Library

books... are borrowed from large libraries such as... the New York Public Library

Johan Braakensiek → Translator/Author → Mark Twain

De lotgevallen van Tom Sawyer, translated from English to Dutch by Johan Braakensiek

Key Quotes (3)

1. "In 2004, Google began scanning books to make their contents searchable and discoverable online." (Source: HOUSE_OVERSIGHT_017011.jpg)
2. "To date, Google has scanned over fifteen million books: over 11% of all the books ever published." (Source: HOUSE_OVERSIGHT_017011.jpg)
3. "In August 2010, this evaluation identified 129 million editions, which is the working estimate we use in this paper of all the editions ever published" (Source: HOUSE_OVERSIGHT_017011.jpg)

Full Extracted Text

Complete text extracted from the document (3,717 characters)

I. Overview of Google Books Digitization

In 2004, Google began scanning books to make their contents searchable and discoverable online. To date, Google has scanned over fifteen million books: over 11% of all the books ever published. The collection contains over five billion pages and two trillion words, with books dating back to as early as 1473 and with text in 478 languages. Over two million of these scanned books were given directly to Google by their publishers; the rest are borrowed from large libraries such as the University of Michigan and the New York Public Library. The scanning effort involves significant engineering challenges, some of which are highly relevant to the construction of the historical n-grams corpus. We survey those issues here.
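As a rough consistency check, the "over 11%" share can be recomputed from two counts given in this text: the fifteen million scanned books and the 129 million edition estimate from August 2010. The sketch below treats both as exact round numbers, which is an assumption, since the source only says "over fifteen million".

```python
# Rough check of the "over 11%" figure. Both inputs are rounded values
# taken from the text, so the result is only approximate.
SCANNED_BOOKS = 15_000_000         # "over fifteen million books"
ESTIMATED_EDITIONS = 129_000_000   # August 2010 working estimate


def scanned_share(scanned: int, total: int) -> float:
    """Return the scanned fraction of all known editions."""
    return scanned / total


share = scanned_share(SCANNED_BOOKS, ESTIMATED_EDITIONS)
print(f"{share:.1%}")
```

With these rounded inputs the share comes out slightly above 11%, which agrees with the figure quoted in the text.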

The result of the next three steps is a collection of digital texts associated with particular book editions, as well as composite metadata for each edition combining the information contained in all metadata sources.

I.1. Metadata

Over 100 sources of metadata information were used by Google to generate a comprehensive catalog of books. Some of these sources are library catalogs (e.g., the list of books in the collections of University of Michigan, or union catalogs such as the collective list of books in Bosnian libraries), some are from retailers (e.g., Decitre, a French bookseller), and some are from commercial aggregators (e.g., Ingram). In addition, Google also receives metadata from its 30,000 partner publishers. Each metadata source consists of a series of digital records, typically in either the MARC format favored by libraries, or the ONIX format used by the publishing industry. Each record refers to either a specific edition of a book or a physical copy of a book on a library shelf, and contains conventional bibliographic data such as title, author(s), publisher, date of publication, and language(s) of publication.
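The bibliographic fields listed above can be pictured as a simple record type. The following is a minimal sketch, not the MARC or ONIX schema and not Google's internal representation; the class names, field names, and the "library-catalog" source label are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class RecordKind(Enum):
    """What a record refers to, per the text (names are hypothetical)."""
    EDITION = "edition"        # a specific edition of a book
    PHYSICAL_COPY = "copy"     # one physical copy on a library shelf


@dataclass
class MetadataRecord:
    source: str                # hypothetical label for the metadata source
    source_format: str         # "MARC" or "ONIX", the two formats named in the text
    kind: RecordKind
    title: str
    authors: List[str] = field(default_factory=list)
    publisher: Optional[str] = None          # None when the source omits it
    publication_year: Optional[int] = None
    languages: List[str] = field(default_factory=list)


# Illustrative values only, based on the Dutch Tom Sawyer edition named in the text.
record = MetadataRecord(
    source="library-catalog",
    source_format="MARC",
    kind=RecordKind.PHYSICAL_COPY,
    title="De lotgevallen van Tom Sawyer",
    authors=["Mark Twain"],
    publication_year=1909,
    languages=["nl"],
)
print(record)
```

Optional fields default to empty values because, as the text notes, records differ in which fields they populate.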

Cataloguing practices vary widely among these sources, and even within a single source over time. Thus two records for the same edition will often differ in multiple fields. This is especially true for serials (e.g., the Congressional Record) and multivolume works such as sets (e.g., the three volumes of The Lord of the Rings).

The matter is further complicated by ambiguities in the definition of the word 'book' itself. Including translations, there are over three thousand editions derived from Mark Twain's original Tom Sawyer.

Google's process of converting the billions of metadata records into a single nonredundant database of book editions consists of the following principal steps:

1. Coarsely dividing the billions of metadata records into groups that may refer to the same work (e.g., Tom Sawyer).
2. Identifying and aggregating multivolume works based on the presence of cues from individual records.
3. Subdividing the group of records corresponding to each work into constituent groups corresponding to the various editions (e.g., the 1909 publication of De lotgevallen van Tom Sawyer, translated from English to Dutch by Johan Braakensiek).
4. Merging the records for each edition into a new "consensus" record.
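Steps 1 and 3 above can be illustrated with a toy grouping routine. The text does not describe how Google actually forms these groups, so the sketch below swaps in a generic record-linkage approach: a coarse "blocking" key (normalized author surname) followed by a finer edition key (normalized title, year, language). The record layout, key choices, and function names are all assumptions, and step 2 (multivolume works) is not covered.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", no_accents.lower()).split())


def author_surname(author: str) -> str:
    """Coarse key: handles 'Surname, Given' and 'Given Surname' forms only."""
    name = author.split(",")[0] if "," in author else author.split()[-1]
    return normalize(name)


def edition_key(record: dict) -> tuple:
    """Finer key: records agreeing on title, year and language share an edition."""
    return (normalize(record["title"]), record["year"], record["language"])


def group_by(records, key_fn):
    groups = defaultdict(list)
    for rec in records:
        groups[key_fn(rec)].append(rec)
    return dict(groups)


# Illustrative records; the differing spellings mimic cataloguing variation.
records = [
    {"author": "Twain, Mark", "title": "The Adventures of Tom Sawyer", "year": 1876, "language": "en"},
    {"author": "Mark Twain", "title": "The adventures of Tom Sawyer.", "year": 1876, "language": "en"},
    {"author": "Twain, Mark", "title": "De lotgevallen van Tom Sawyer", "year": 1909, "language": "nl"},
]

blocks = group_by(records, lambda rec: author_surname(rec["author"]))   # step 1 (coarse)
editions = {key: group_by(members, edition_key) for key, members in blocks.items()}  # step 3
print({key: len(groups) for key, groups in editions.items()})
```

Blocking by author is deliberately over-inclusive: it only narrows down which records "may refer to the same work", leaving the finer key to separate editions.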

The result is a set of consensus records, where each record corresponds to a distinct book edition and work, and where the contents of each record are formed out of fields from multiple sources. The number of records in this set -- i.e., the number of known book editions -- increases every year as more books are written.
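One simple way to picture a consensus record built "out of fields from multiple sources" is a per-field majority vote. This is only an assumed merge rule for illustration; the text does not say how Google resolves conflicting fields, and the function name and record layout are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def consensus_record(records: List[Dict[str, Optional[str]]]) -> Dict[str, Optional[str]]:
    """Merge records for one edition: most common non-empty value per field.

    Ties are broken in favour of the value seen first (Counter ordering).
    """
    field_names = sorted({name for rec in records for name in rec})
    merged: Dict[str, Optional[str]] = {}
    for name in field_names:
        values = [rec[name] for rec in records if rec.get(name)]
        merged[name] = Counter(values).most_common(1)[0][0] if values else None
    return merged


# Illustrative records for the 1909 Dutch edition named in the text.
edition_records = [
    {"title": "De lotgevallen van Tom Sawyer", "year": "1909", "language": "nl"},
    {"title": "De lotgevallen van Tom Sawyer", "year": "1909", "language": None},
    {"title": "Lotgevallen van Tom Sawyer", "year": "1909", "translator": "Johan Braakensiek"},
]

merged = consensus_record(edition_records)
print(merged)
```

A field supplied by only one source (here the translator) still reaches the merged record, which is how a consensus record can be more complete than any single input.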

In August 2010, this evaluation identified 129 million editions, which is the working estimate we use in this paper of all the editions ever published (this includes serials and sets but excludes kits, mixed media, and

