HOUSE_OVERSIGHT_017017.jpg


Extraction Summary

People: 2
Organizations: 2
Locations: 0
Events: 0
Relationships: 0
Quotes: 2

Document Information

Type: Technical specification / research methodology (discovery document)
File Size: 1.49 MB
Summary

This document appears to be page 9 of a technical appendix or methodology paper regarding text processing, specifically describing tokenization rules for a dataset (likely the Google Books N-gram corpus or similar). It details how algorithms handle specific characters like ampersands, periods, dollar signs, hashes, and plus signs, as well as Chinese characters. While the document bears a 'HOUSE_OVERSIGHT' Bates stamp indicating it was part of a congressional production, the text itself is purely technical and contains no narrative information regarding Jeffrey Epstein, his associates, or specific events.

People (2)

ALICE (Example): Used as a grammatical example for possessive apostrophe handling ('ALICE'S').
Bob (Example): Used as a grammatical example for possessive apostrophe handling ('Bob's').

Organizations (2)

AT&T: Used as an example of a word containing an ampersand.
House Oversight Committee: Implied by the Bates stamp 'HOUSE_OVERSIGHT_017017' at the bottom right.

Key Quotes (2)

"Each book edition was broken down into a series of 1-grams on a page-by-page basis."
Source
HOUSE_OVERSIGHT_017017.jpg
Quote #1
"The tokenization process for Chinese was different. For Chinese, an internal CJK (Chinese/Japanese/Korean) segmenter was used to break characters into word units."
Source
HOUSE_OVERSIGHT_017017.jpg
Quote #2

Full Extracted Text

Complete text extracted from the document (2,308 characters)

, (comma)
> (greater-than)
? (question-mark)
/ (forward-slash)
~ (tilde)
` (back-tick)
“ (double quote)
(3) The following characters are not tokenized as separate words:
& (ampersand)
_ (underscore)
Examples of the resulting words include AT&T, R&D, and variable names such as HKEY_LOCAL_MACHINE.
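The actual tokenizer behind the corpus is not published; as a minimal illustrative sketch in Python, the rule above amounts to treating & and _ as word-internal characters:

```python
import re

# Treat & and _ as word-internal characters so that AT&T, R&D, and
# HKEY_LOCAL_MACHINE each survive as a single token. This pattern is
# an assumption for illustration, not the corpus's actual tokenizer.
WORD = re.compile(r"[A-Za-z0-9&_]+")

def tokenize(text):
    return WORD.findall(text)
```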
(4) . (period) is treated as a separate word, except when it is part of a number or price, such as 99.99 or $999.95. A specific pattern matcher looks for numbers or prices and tokenizes these special strings as separate words.
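One way to realize the number/price exception is to try the numeric patterns before falling back to a bare-period token. A hedged sketch (the real pattern matcher is not specified in the document):

```python
import re

# Try prices and decimal numbers first ($999.95, 99.99), then plain
# numbers and words; any remaining period becomes its own token.
# Illustrative only; the document does not give the actual patterns.
TOKEN = re.compile(r"\$?\d+\.\d+|\$?\d+|[A-Za-z]+|\.")

def tokenize(text):
    return TOKEN.findall(text)
```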
(5) $ (dollar-sign) is treated as a separate word, except where it is the first character of a word consisting entirely of numbers, possibly containing a decimal point. Examples include $71 and $9.95.
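The dollar-sign rule can be sketched the same way: a $ glued to a pure number stays attached, and any other $ is emitted alone. An illustrative pattern, not the corpus's actual one:

```python
import re

# "$71" and "$9.95" remain single tokens; a "$" not followed by a
# pure number is its own token (sketch of the stated rule).
TOKEN = re.compile(r"\$\d+(?:\.\d+)?|\$|[A-Za-z0-9]+")

def tokenize(text):
    return TOKEN.findall(text)
```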
(6) # (hash) is treated as a separate word, except when it is preceded by a-g, j, or x. This covers musical notes such as A# (A-sharp), as well as the programming languages j# and x#.
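A sketch of the hash rule for the simple single-letter cases it names (A#, C#, j#, x#); longer words ending in # fall back to a separate # token. Again, this is an assumed pattern for illustration:

```python
import re

# Single letters a-g, j, or x keep a following "#" (A#, C#, j#, x#);
# any other "#" becomes its own token (sketch of the stated rule).
TOKEN = re.compile(r"[A-Ga-gJjXx]#|#|[A-Za-z0-9]+")

def tokenize(text):
    return TOKEN.findall(text)
```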
(7) + (plus) is treated as a separate word, except when it appears at the end of a sequence of alphanumeric characters or "+" characters. Thus the strings C++ and Na2+ would be treated as single words. These cases include many programming language names and chemical compound names.
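The plus rule can be sketched by letting one or more trailing plus signs bind to a preceding alphanumeric run, while a plus anywhere else stands alone. An illustrative pattern under that reading:

```python
import re

# Trailing "+" runs bind to the preceding alphanumerics (C++, Na2+);
# a leading or isolated "+" is its own token (sketch of the rule).
TOKEN = re.compile(r"[A-Za-z0-9]+\++|\+|[A-Za-z0-9]+")

def tokenize(text):
    return TOKEN.findall(text)
```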
(8) ’ (apostrophe/single-quote) is treated as a separate word, except when it precedes the letter s, as in ALICE'S and Bob's.
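The rule only says the apostrophe is not split off when it precedes an s; treating the whole possessive as one token is an assumption of this sketch:

```python
import re

# Possessives such as Bob's and ALICE'S stay intact; any other
# apostrophe is emitted as its own token. Keeping the full possessive
# as one token is an assumed reading of the stated rule.
TOKEN = re.compile(r"[A-Za-z]+'[sS]|'|[A-Za-z]+")

def tokenize(text):
    return TOKEN.findall(text)
```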
The tokenization process for Chinese was different. For Chinese, an internal CJK (Chinese/Japanese/Korean) segmenter was used to break characters into word units. The CJK segmenter inserts spaces along common semantic boundaries. Hence, 1-grams that appear in the Chinese simplified corpora will sometimes contain strings with 1 or more Chinese characters.
Given a sequence of n 1-grams, we denote the corresponding n-gram by concatenating the 1-grams with a plain space character in between. A few examples of the tokenization and 1-gram construction method are provided in Table S2.
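The construction described above is simply a space-join over token windows; a minimal sketch:

```python
def make_ngram(unigrams):
    # An n-gram is its constituent 1-grams joined by plain spaces.
    return " ".join(unigrams)

def ngrams(tokens, n):
    # All contiguous n-grams over a token sequence.
    return [make_ngram(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```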
Each book edition was broken down into a series of 1-grams on a page-by-page basis. For each page of each book, we counted the number of times each 1-gram appeared. We further counted the number of times each n-gram appeared (e.g., a sequence of n 1-grams) for all n less than or equal to 5. Because this was done on a page-by-page basis, n-grams that span two consecutive pages were not counted.
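The page-by-page counting procedure can be sketched as follows: each page is counted independently, so n-grams spanning a page boundary never appear, matching the described behavior.

```python
from collections import Counter

def count_page_ngrams(pages, max_n=5):
    # pages: list of per-page token lists. Counting within each page
    # separately means n-grams spanning two pages are never counted.
    counts = Counter()
    for tokens in pages:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts
```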
9
HOUSE_OVERSIGHT_017017
