metadata dates that were more than 5 years from the date determined by a human examining the book.
Because errors are much more common among older books, and because the actual corpora are strongly
biased toward recent works, the likelihood of error in a randomly sampled book from the final corpus is
much lower than 6.2%. As a point of comparison, 27 of 100 books (27%) selected at random from an
unfiltered corpus contained date-of-publication errors of greater than 5 years. The unfiltered corpus was
created using a sampling strategy similar to that of Eng-1M. This selection mechanism favored recent
books (which are more frequent) and pre-1800 books, which were excluded in the sampling strategy for
filtered books. As such, the two numbers (6.2% and 27%) give a sense of the improvement but are not
strictly comparable.
Note that since the base corpora were generated (August 2009), many additional improvements have
been made to the metadata dates used in Google Book Search itself. As such, these numbers do not
reflect the accuracy of the Google Book Search online tool.
II.1B. OCR quality
The challenge of performing accurate OCR on the entire books dataset is compounded by variations in
such factors as language, font, size, legibility, and physical condition of the book. OCR quality was
assessed using an algorithm developed by Popat et al. (Ref S3). This algorithm yields a probability that
expresses the confidence that a given sequence of text generated by OCR is correct. Incorrect or
anomalous text can result from gross imperfections in the scanned images, or as a result of markings or
drawings. This algorithm uses sophisticated statistics, a variant of the Prediction by Partial Matching (PPM)
model, to compute for each glyph (character) the probability that it is anomalous given other nearby
glyphs. ('Nearby' refers to 2-dimensional distance on the original scanned image, hence glyphs above,
below, to the left, and to the right of the target glyph.) The model parameters are tuned using multi-
language subcorpora, one in each of the 32 supported languages. From the per-glyph probability one can
compute an aggregate probability for a sequence of glyphs, including the entire text of a volume. In this
manner, every volume is assigned a probabilistic OCR quality score (quantized to an integer
between 0 and 100; note that the OCR quality score should not be confused with character or word accuracy).
In addition to error detection, the Popat model is also capable of computing the probability that the text is
in a particular language given any sequence of characters. Thus the algorithm serves the dual purpose of
detecting anomalous text while simultaneously identifying the language in which the text is written.
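The aggregation step described above can be illustrated with a minimal sketch. The text does not say how the per-glyph probabilities are combined, so the geometric mean used here is purely an assumption, as are the function name aggregate_quality and the clamping constant; this is not the Popat et al. implementation.

```python
import math

def aggregate_quality(p_anomalous):
    """Turn per-glyph anomaly probabilities into an integer score in 0..100.

    ASSUMPTION: the score is the geometric mean of the per-glyph
    probabilities of being correct (1 - p_anomalous). The source text
    does not specify the aggregation rule.
    """
    if not p_anomalous:
        raise ValueError("need at least one glyph probability")
    eps = 1e-12  # hypothetical floor to avoid log(0)
    # Average in log space so long sequences do not underflow.
    mean_log = sum(math.log(max(1.0 - p, eps)) for p in p_anomalous) / len(p_anomalous)
    # Quantize to an integer between 0 and 100.
    return round(100 * math.exp(mean_log))
```

Under this assumption a sequence of glyphs that are all judged non-anomalous scores 100, and a single highly anomalous glyph pulls the score of a short sequence down sharply, which is one reason such a score should not be read as a character accuracy.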
To ensure the highest quality data, we excluded volumes with poor OCR quality. For the languages that
use a Latin alphabet (English, French, Spanish, and German), the OCR quality is generally higher, and
more books are available. As a result, we filtered out all volumes whose quality score was lower than
80%. For Chinese and Russian, fewer books were available, and we did not apply the OCR filter. For
Hebrew, a 50% threshold was used, because its OCR quality was relatively better than that of Chinese or
Russian. For geographically specific corpora, English US and English UK, a less stringent 60% threshold
was used, in order to maximize the number of books included (note that, as such, these two corpora are
not strict subsets of the broader English corpus). Figure S4 shows the distribution of OCR quality scores
as a function of the fraction of books in the English corpus. An 80% cutoff removes the books
with the worst OCR while retaining the vast majority of the books in the original corpus.
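The per-corpus thresholds listed in this paragraph can be summarized as a small lookup, sketched below. The threshold values are taken from the text; the names OCR_THRESHOLDS and passes_ocr_filter are hypothetical, and the sketch assumes the quality score is the integer between 0 and 100 described earlier.

```python
# Minimum OCR quality score per corpus, as stated in the text.
# None means that no OCR filter was applied to that corpus.
OCR_THRESHOLDS = {
    "English": 80,
    "French": 80,
    "Spanish": 80,
    "German": 80,
    "Hebrew": 50,
    "Chinese": None,
    "Russian": None,
    "English US": 60,
    "English UK": 60,
}

def passes_ocr_filter(corpus, quality_score):
    """Return True if a volume with this score is kept in the given corpus."""
    threshold = OCR_THRESHOLDS[corpus]
    # Volumes scoring below the threshold are filtered out.
    return threshold is None or quality_score >= threshold
```

For example, a volume with a score of 70 passes the filter for English US but not for English, which is why the geographically specific corpora are not strict subsets of the broader English corpus.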
The OCR quality scores were also used as a localized indicator of textual quality in order to remove
anomalous sections of otherwise high-quality texts. We ensured that the final source text was of
comparable quality to the post-OCR text presented in "text-mode" on the Google Books website.
II.1C. Accuracy of language metadata
We applied an additional filter to remove books with dubious language-of-composition metadata. This filter
removed volumes whose metadata language tag disagreed with the language determined by the
statistical language detection algorithm described in section 2A. For our English corpus, 8.56%