HOUSE_OVERSIGHT_017011.jpg

2.33 MB

Extraction Summary

People: 2
Organizations: 6
Locations: 4
Events: 2
Relationships: 3
Quotes: 3

Document Information

Type: Technical report / government exhibit
File Size: 2.33 MB
Summary

This document is a technical overview describing the Google Books Digitization project, specifically focusing on the processes of scanning, metadata collection from libraries and publishers, and the creation of a consensus record database. It details that by August 2010, Google had identified 129 million book editions. Despite the Bates stamp 'HOUSE_OVERSIGHT_017011' suggesting it is part of a government investigation production, the text contains no references to Jeffrey Epstein, his associates, or his activities.

People (2)

Name | Role | Context
Mark Twain | Author | Mentioned as the author of Tom Sawyer in the context of book editions.
Johan Braakensiek | Translator | Translator of the Dutch edition of Tom Sawyer.

Organizations (6)

Name | Type | Context
Google | | The entity conducting the book digitization project.
University of Michigan | | Source of books for scanning.
New York Public Library | | Source of books for scanning.
Decitre | | French bookseller providing metadata.
Ingram | | Source of metadata.
House Oversight Committee | | Implied by the footer 'HOUSE_OVERSIGHT'.

Timeline (2 events)

2004: Google began scanning books to make their contents searchable and discoverable online.
August 2010: An evaluation of the consensus records identified 129 million book editions, used as the working estimate of all editions ever published.

Locations (4)

Location Context
Location of University of Michigan.
Location of New York Public Library.
Mentioned regarding 'Bosnian libraries'.
Implied by 'French bookseller'.

Relationships (3)

Google | Partnership | University of Michigan
"books... are borrowed from large libraries such as the University of Michigan"
Google | Partnership | New York Public Library
"books... are borrowed from large libraries such as... the New York Public Library"
Johan Braakensiek | Translator/Author | Mark Twain
"De lotgevallen van Tom Sawyer, translated from English to Dutch by Johan Braakensiek"

Key Quotes (3)

Quote #1: "In 2004, Google began scanning books to make their contents searchable and discoverable online."
Source: HOUSE_OVERSIGHT_017011.jpg

Quote #2: "To date, Google has scanned over fifteen million books: over 11% of all the books ever published."
Source: HOUSE_OVERSIGHT_017011.jpg

Quote #3: "In August 2010, this evaluation identified 129 million editions, which is the working estimate we use in this paper of all the editions ever published"
Source: HOUSE_OVERSIGHT_017011.jpg

Full Extracted Text

Complete text extracted from the document (3,717 characters)

I. Overview of Google Books Digitization
In 2004, Google began scanning books to make their contents searchable and discoverable online. To date, Google has scanned over fifteen million books: over 11% of all the books ever published. The collection contains over five billion pages and two trillion words, with books dating back to as early as 1473 and with text in 478 languages. Over two million of these scanned books were given directly to Google by their publishers; the rest are borrowed from large libraries such as the University of Michigan and the New York Public Library. The scanning effort involves significant engineering challenges, some of which are highly relevant to the construction of the historical n-grams corpus. We survey those issues here.
The result of the next three steps is a collection of digital texts associated with particular book editions, as well as composite metadata for each edition combining the information contained in all metadata sources.
I.1. Metadata
Over 100 sources of metadata information were used by Google to generate a comprehensive catalog of books. Some of these sources are library catalogs (e.g., the list of books in the collections of University of Michigan, or union catalogs such as the collective list of books in Bosnian libraries), some are from retailers (e.g., Decitre, a French bookseller), and some are from commercial aggregators (e.g., Ingram). In addition, Google also receives metadata from its 30,000 partner publishers. Each metadata source consists of a series of digital records, typically in either the MARC format favored by libraries, or the ONIX format used by the publishing industry. Each record refers to either a specific edition of a book or a physical copy of a book on a library shelf, and contains conventional bibliographic data such as title, author(s), publisher, date of publication, and language(s) of publication.
Cataloguing practices vary widely among these sources, and even within a single source over time. Thus two records for the same edition will often differ in multiple fields. This is especially true for serials (e.g., the Congressional Record) and multivolume works such as sets (e.g., the three volumes of The Lord of the Rings).
The matter is further complicated by ambiguities in the definition of the word 'book' itself. Including translations, there are over three thousand editions derived from Mark Twain's original Tom Sawyer.
Google's process of converting the billions of metadata records into a single nonredundant database of book editions consists of the following principal steps:
1. Coarsely dividing the billions of metadata records into groups that may refer to the same work (e.g., Tom Sawyer).
2. Identifying and aggregating multivolume works based on the presence of cues from individual records.
3. Subdividing the group of records corresponding to each work into constituent groups corresponding to the various editions (e.g., the 1909 publication of De lotgevallen van Tom Sawyer, translated from English to Dutch by Johan Braakensiek).
4. Merging the records for each edition into a new "consensus" record.
The result is a set of consensus records, where each record corresponds to a distinct book edition and work, and where the contents of each record are formed out of fields from multiple sources. The number of records in this set -- i.e., the number of known book editions -- increases every year as more books are written.
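The grouping and merging steps above can be illustrated with a minimal Python sketch. It is not Google's implementation: the text does not say how records are matched, so the keys used here (author plus a hypothetical "work" hint for step 1, language plus year for step 3) and the majority-vote merge for step 4 are assumptions chosen for illustration. Step 2 (multivolume works) is omitted, and the sample records are made up for the example.

```python
from collections import Counter, defaultdict

# Fields carried into the consensus record (a subset of those named in the text).
FIELDS = ("title", "author", "year", "language")


def normalize(text):
    """Lowercase and keep only letters, digits and single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def work_key(record):
    # Step 1 (coarse grouping). Assumption: author plus a hypothetical
    # "work" hint is enough to decide that records may share a work.
    return (normalize(record["author"]), normalize(record["work"]))


def edition_key(record):
    # Step 3 (subdivision). Assumption: language plus year identifies an edition.
    return (normalize(record["language"]), record["year"])


def consensus(records):
    # Step 4 (merge). Assumption: the most common non-empty value wins per field.
    merged = {}
    for field in FIELDS:
        values = [r[field] for r in records if r.get(field)]
        merged[field] = Counter(values).most_common(1)[0][0] if values else ""
    return merged


def build_editions(records):
    """Return one consensus record per (work, edition) group."""
    works = defaultdict(lambda: defaultdict(list))
    for record in records:
        works[work_key(record)][edition_key(record)].append(record)
    return [
        consensus(group)
        for editions in works.values()
        for group in editions.values()
    ]


# Made-up records from three imaginary sources describing two editions.
SAMPLE_RECORDS = [
    {"title": "De lotgevallen van Tom Sawyer", "author": "Mark Twain",
     "year": "1909", "language": "Dutch", "work": "Tom Sawyer"},
    {"title": "Lotgevallen van Tom Sawyer", "author": "MARK TWAIN",
     "year": "1909", "language": "dutch", "work": "Tom Sawyer"},
    {"title": "De lotgevallen van Tom Sawyer", "author": "Mark Twain",
     "year": "1909", "language": "Dutch", "work": "Tom Sawyer"},
    {"title": "The Adventures of Tom Sawyer", "author": "Mark Twain",
     "year": "1876", "language": "English", "work": "Tom Sawyer"},
]

editions = build_editions(SAMPLE_RECORDS)
for edition in editions:
    print(edition)
```

In this sketch the three Dutch records collapse into a single consensus record even though one of them differs in title and capitalization, while the English record remains a separate edition of the same work. A real pipeline would need far more tolerant matching than exact normalized keys, since the text notes that records for the same edition often differ in multiple fields.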
In August 2010, this evaluation identified 129 million editions, which is the working estimate we use in this paper of all the editions ever published (this includes serials and sets but excludes kits, mixed media, and
3
HOUSE_OVERSIGHT_017011
