Dasher character language model, February 2021

Unlike our previous character language models, this model was trained on a single training set, namely Common Crawl. It is also a small 4-gram model, in contrast to our previous large 12-gram models. We evaluated the model by computing the average per-word perplexity on 36.6K words of conversational AAC-like data.

Compressed size	Perplexity	Download
2.8MB	4.6133	[Link]

The test set consisted of 4,567 sentences (36.6K words). The test set was a combination of:

COMM - Sentences written in response to hypothetical communication situations (2.1K words). [Link]
COMM2 - Crowdsourced sentences written in response to hypothetical communication situations (13K words). [Link]
Specialists - Phrases suggested by AAC specialists, taken from University of Nebraska-Lincoln web pages that are no longer available (4.2K words).
Enron mobile - Messages written by Enron employees on Blackberry mobile devices (18K words). [Link]

The test set was in lowercase and included limited punctuation (apostrophes, commas, periods, exclamation points, and question marks). We excluded the sentence end word from our perplexity calculations. This was necessary because the models were trained only on sentences ending in a period, exclamation point, or question mark. Some test sentences lack end-of-sentence punctuation, which made the sentence end word very improbable for those sentences.
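As a rough illustration of excluding the sentence end word, the sketch below computes perplexity over every event in a sentence except the final sentence end. The function name `perplexity` and the `logprob_fn(history, symbol)` callback are hypothetical stand-ins for a real model query, and the sketch simply normalizes by the number of scored events; it is not the evaluation script used for the figure in the table.

```python
import math

def perplexity(sentences, logprob_fn):
    """Perplexity over all events except the sentence end.

    sentences:  list of token lists, without <s> or </s> markers.
    logprob_fn: hypothetical callback (history, token) -> log10 probability.
    """
    total_logprob = 0.0
    scored_events = 0
    for tokens in sentences:
        history = ["<s>"]
        for token in tokens:
            total_logprob += logprob_fn(history, token)
            scored_events += 1
            history.append(token)
        # The </s> event is deliberately not scored here.
    return 10 ** (-total_logprob / scored_events)
```

With a toy model that assigns probability 0.25 to every event, the result is a perplexity of 4 regardless of whether the sentences end in punctuation.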

This model is licensed under a Creative Commons Attribution 4.0 License.

Training details:

Vocabulary. These models use a 34 symbol vocabulary. The vocabulary contains the lowercase letters a-z, apostrophe, period, exclamation mark, question mark, comma, space (<sp>), sentence start (<s>), and sentence end (</s>).
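A minimal sketch of this vocabulary, and of mapping a sentence onto it, might look as follows. The names `VOCAB` and `to_symbols` are hypothetical, and raising an error on characters outside the vocabulary is an assumption; how such characters were handled during training is not stated here.

```python
import string

# 26 letters + 5 punctuation marks + <sp>, <s>, </s> = 34 symbols.
PLAIN_CHARS = string.ascii_lowercase + "'.!?,"
VOCAB = list(PLAIN_CHARS) + ["<sp>", "<s>", "</s>"]

def to_symbols(sentence):
    """Map a sentence to vocabulary symbols, wrapped in <s> ... </s>."""
    symbols = ["<s>"]
    for ch in sentence.lower():
        if ch == " ":
            symbols.append("<sp>")
        elif ch in PLAIN_CHARS:
            symbols.append(ch)
        else:
            # Assumption: reject anything outside the vocabulary.
            raise ValueError(f"character not in vocabulary: {ch!r}")
    symbols.append("</s>")
    return symbols
```

For example, `to_symbols("hi you")` gives the sequence `<s> h i <sp> y o u </s>`.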

Training data. We used the text of web pages from Common Crawl, September 2020 version. [Link]

We dropped sentences where 20% or more of the words were out-of-vocabulary (OOV) with respect to our 735K word list. Each sentence in the parsed data was treated as an independent training example. We removed sentences that occurred more than once on any particular web page. Our final parsed data set consisted of 16.1B sentences totaling 257B words.
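The per-page filtering described above could be sketched roughly as follows. The function names, whitespace tokenization, and the choice to drop every copy of a repeated sentence (rather than keep one copy) are assumptions made for illustration; the actual parsing pipeline is not described in that detail.

```python
from collections import Counter

def oov_fraction(sentence, word_list):
    """Fraction of whitespace-separated words not found in word_list."""
    words = sentence.split()
    if not words:
        return 1.0
    return sum(1 for w in words if w not in word_list) / len(words)

def filter_page(sentences, word_list, max_oov=0.2):
    """Keep sentences that are unique on the page and below the OOV limit."""
    counts = Counter(sentences)
    kept = []
    for sentence in sentences:
        if counts[sentence] > 1:
            # Assumption: drop all copies of a sentence repeated on the page.
            continue
        if oov_fraction(sentence, word_list) >= max_oov:
            # "20% or more" OOV words means the sentence is dropped.
            continue
        kept.append(sentence)
    return kept
```

Here `word_list` would be a set holding the 735K known words; a set keeps each membership test cheap even at that size.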

Data filtering. We used cross-entropy difference selection to select a subset of sentences from the full training set. We scored each sentence under three different in-domain language models:

AAC - Conversational AAC-like data obtained via crowdsourcing (30K words). [Link]
Short email - Sentences with between 1 and 12 words taken from the W3C email, TREC, and SpamAssassin corpora (2.1M words). [Link] [Link] [Link]
Dialogue - Turns from everyday conversations, 90% of the training set of the Daily Dialogue corpus (948K words). [Link]

The out-of-domain model was trained on 100M words of data from Common Crawl. We took the score of a sentence to be the minimum cross-entropy difference under any of the three in-domain models.
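The scoring rule can be sketched as below: for each in-domain model, take the sentence's cross-entropy under that model minus its cross-entropy under the out-of-domain model, then keep the smallest difference. The toy unigram models, the per-token normalization, and all function names are assumptions made so the example runs on its own; they are not the models or units used for the released data.

```python
import math

def make_unigram(probs, floor=1e-6):
    """Toy stand-in for a language model: returns total log2 probability."""
    def total_log2prob(tokens):
        return sum(math.log2(probs.get(t, floor)) for t in tokens)
    return total_log2prob

def cross_entropy(model, tokens):
    """Per-token cross-entropy in bits (normalization is an assumption)."""
    return -model(tokens) / len(tokens)

def ce_difference_score(tokens, in_domain_models, out_domain_model):
    """Minimum of H_in(s) - H_out(s) over the in-domain models."""
    h_out = cross_entropy(out_domain_model, tokens)
    return min(cross_entropy(m, tokens) - h_out for m in in_domain_models)
```

A lower (more negative) score means the sentence looks more like at least one of the in-domain sources than like general Common Crawl text.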

We found an optimal cross-entropy difference threshold with respect to the average perplexity on the three in-domain development sets. Only sentences scoring below this threshold were kept in the training data. We trained a 4-gram language model on the selected subset of the training set. This model was trained using MITLM with modified Kneser-Ney smoothing. Discounting parameters were optimized with respect to a development set consisting of equal amounts of data from the three in-domain data sources (10K total words).
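One way to picture the threshold search is a simple sweep over candidate thresholds, as in the sketch below. The `train_and_eval` callback is a hypothetical placeholder for training a model on the selected subset and returning the average development-set perplexity; the candidate grid and the function names are likewise assumptions, not a description of how the search was actually run.

```python
def select_below(scored, threshold):
    """Keep sentences whose score is strictly below the threshold."""
    return [sentence for sentence, score in scored if score < threshold]

def best_threshold(scored, candidates, train_and_eval):
    """Return (threshold, perplexity) with the lowest dev perplexity.

    scored:         list of (sentence, cross-entropy difference score).
    candidates:     thresholds to try (hypothetical grid).
    train_and_eval: hypothetical callback, subset -> average dev perplexity.
    """
    best = None
    for threshold in candidates:
        subset = select_below(scored, threshold)
        if not subset:
            continue  # nothing to train on at this threshold
        ppl = train_and_eval(subset)
        if best is None or ppl < best[1]:
            best = (threshold, ppl)
    return best
```

In practice each call to `train_and_eval` would be expensive, so a coarse grid of thresholds is the natural starting point.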
