HOUSE_OVERSIGHT_017017.jpg


Extraction Summary

People: 2
Organizations: 2
Locations: 0
Events: 0
Relationships: 0
Quotes: 2

Document Information

Type: Technical specification / research methodology (discovery document)
File Size: 1.49 MB
Summary

This document appears to be page 9 of a technical appendix or methodology paper regarding text processing, specifically describing tokenization rules for a dataset (likely the Google Books N-gram corpus or similar). It details how algorithms handle specific characters like ampersands, periods, dollar signs, hashes, and plus signs, as well as Chinese characters. While the document bears a 'HOUSE_OVERSIGHT' Bates stamp indicating it was part of a congressional production, the text itself is purely technical and contains no narrative information regarding Jeffrey Epstein, his associates, or specific events.

People (2)

ALICE (Example): Used as a grammatical example for possessive apostrophe handling ('ALICE'S').
Bob (Example): Used as a grammatical example for possessive apostrophe handling ('Bob's').

Organizations (2)

AT&T: Used as an example of a word containing an ampersand.
House Oversight Committee: Implied by the Bates stamp 'HOUSE_OVERSIGHT_017017' at the bottom right.

Key Quotes (2)

"Each book edition was broken down into a series of 1-grams on a page-by-page basis."
Source
HOUSE_OVERSIGHT_017017.jpg
Quote #1
"The tokenization process for Chinese was different. For Chinese, an internal CJK (Chinese/Japanese/Korean) segmenter was used to break characters into word units."
Source
HOUSE_OVERSIGHT_017017.jpg
Quote #2

Full Extracted Text

Complete text extracted from the document (2,308 characters)

, (comma)
> (greater-than)
? (question-mark)
/ (forward-slash)
~ (tilde)
` (back-tick)
“ (double quote)
(3) The following characters are not tokenized as separate words:
& (ampersand)
_ (underscore)
Examples of the resulting words include AT&T, R&D, and variable names such as HKEY_LOCAL_MACHINE.
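The actual tokenizer behind the corpus is not published; as a minimal illustrative sketch in Python, the rule above amounts to treating & and _ as word-internal characters:

```python
import re

# Treat & and _ as word-internal characters so that AT&T, R&D, and
# HKEY_LOCAL_MACHINE each survive as a single token. This pattern is
# an assumption for illustration, not the corpus's actual tokenizer.
WORD = re.compile(r"[A-Za-z0-9&_]+")

def tokenize(text):
    return WORD.findall(text)
```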
(4) . (period) is treated as a separate word, except when it is part of a number or price, such as 99.99 or $999.95. A specific pattern matcher looks for numbers or prices and tokenizes these special strings as separate words.
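One way to realize the number/price exception is to try the numeric patterns before falling back to a bare-period token. A hedged sketch (the real pattern matcher is not specified in the document):

```python
import re

# Try prices and decimal numbers first ($999.95, 99.99), then plain
# numbers and words; any remaining period becomes its own token.
# Illustrative only; the document does not give the actual patterns.
TOKEN = re.compile(r"\$?\d+\.\d+|\$?\d+|[A-Za-z]+|\.")

def tokenize(text):
    return TOKEN.findall(text)
```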
(5) $ (dollar-sign) is treated as a separate word, except where it is the first character of a word consisting entirely of numbers, possibly containing a decimal point. Examples include $71 and $9.95.
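The dollar-sign rule can be sketched the same way: a $ glued to a pure number stays attached, and any other $ is emitted alone. An illustrative pattern, not the corpus's actual one:

```python
import re

# "$71" and "$9.95" remain single tokens; a "$" not followed by a
# pure number is its own token (sketch of the stated rule).
TOKEN = re.compile(r"\$\d+(?:\.\d+)?|\$|[A-Za-z0-9]+")

def tokenize(text):
    return TOKEN.findall(text)
```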
(6) # (hash) is treated as a separate word, except when it is preceded by a-g, j, or x. This covers musical notes such as A# (A-sharp), as well as the programming languages j# and x#.
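A sketch of the hash rule for the simple single-letter cases it names (A#, C#, j#, x#); longer words ending in # fall back to a separate # token. Again, this is an assumed pattern for illustration:

```python
import re

# Single letters a-g, j, or x keep a following "#" (A#, C#, j#, x#);
# any other "#" becomes its own token (sketch of the stated rule).
TOKEN = re.compile(r"[A-Ga-gJjXx]#|#|[A-Za-z0-9]+")

def tokenize(text):
    return TOKEN.findall(text)
```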
(7) + (plus) is treated as a separate word, except when it appears at the end of a sequence of alphanumeric characters or "+" characters. Thus the strings C++ and Na2+ would be treated as single words. These cases include many programming language names and chemical compound names.
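The plus rule can be sketched by letting one or more trailing plus signs bind to a preceding alphanumeric run, while a plus anywhere else stands alone. An illustrative pattern under that reading:

```python
import re

# Trailing "+" runs bind to the preceding alphanumerics (C++, Na2+);
# a leading or isolated "+" is its own token (sketch of the rule).
TOKEN = re.compile(r"[A-Za-z0-9]+\++|\+|[A-Za-z0-9]+")

def tokenize(text):
    return TOKEN.findall(text)
```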
(8) ’ (apostrophe/single-quote) is treated as a separate word, except when it precedes the letter s, as in ALICE'S and Bob's.
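The rule only says the apostrophe is not split off when it precedes an s; treating the whole possessive as one token is an assumption of this sketch:

```python
import re

# Possessives such as Bob's and ALICE'S stay intact; any other
# apostrophe is emitted as its own token. Keeping the full possessive
# as one token is an assumed reading of the stated rule.
TOKEN = re.compile(r"[A-Za-z]+'[sS]|'|[A-Za-z]+")

def tokenize(text):
    return TOKEN.findall(text)
```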
The tokenization process for Chinese was different. For Chinese, an internal CJK (Chinese/Japanese/Korean) segmenter was used to break characters into word units. The CJK segmenter inserts spaces along common semantic boundaries. Hence, 1-grams that appear in the Chinese simplified corpora will sometimes contain strings with 1 or more Chinese characters.
Given a sequence of n 1-grams, we denote the corresponding n-gram by concatenating the 1-grams with a plain space character in between. A few examples of the tokenization and 1-gram construction method are provided in Table S2.
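The construction described above is simply a space-join over token windows; a minimal sketch:

```python
def make_ngram(unigrams):
    # An n-gram is its constituent 1-grams joined by plain spaces.
    return " ".join(unigrams)

def ngrams(tokens, n):
    # All contiguous n-grams over a token sequence.
    return [make_ngram(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```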
Each book edition was broken down into a series of 1-grams on a page-by-page basis. For each page of each book, we counted the number of times each 1-gram appeared. We further counted the number of times each n-gram appeared (e.g., a sequence of n 1-grams) for all n less than or equal to 5. Because this was done on a page-by-page basis, n-grams that span two consecutive pages were not counted.
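The page-by-page counting procedure can be sketched as follows: each page is counted independently, so n-grams spanning a page boundary never appear, matching the described behavior.

```python
from collections import Counter

def count_page_ngrams(pages, max_n=5):
    # pages: list of per-page token lists. Counting within each page
    # separately means n-grams spanning two pages are never counted.
    counts = Counter()
    for tokens in pages:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts
```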
9
HOUSE_OVERSIGHT_017017
