Dasher word language models, February 2021
Models:
Unlike our previous word language models, these models were trained on only a single training set, namely Common Crawl. Here are our trained ARPA format 4-gram language models. We evaluated each model by computing the average per-word perplexity on 36.6K words of conversational AAC-like data.
Name | Compressed size | Perplexity | Download |
Tiny | 29MB | 107.9 | [Link] |
Small | 141MB | 86.0 | [Link] |
Medium | 268MB | 80.3 | [Link] |
Large | 836MB | 74.0 | [Link] |
Huge | 5.9GB | 71.4 | [Link] |
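The models are standard ARPA files, so any n-gram toolkit that reads ARPA should be able to load them. As a rough sketch, assuming the KenLM Python bindings and a hypothetical local copy named dasher_4gram.arpa, scoring a sentence looks something like this:

```python
import kenlm  # KenLM's Python module (https://github.com/kpu/kenlm)

# Hypothetical local filename for one of the downloaded models.
model = kenlm.Model("dasher_4gram.arpa")

# score() returns the log10 probability of the whole sentence,
# including the begin- and end-of-sentence markers by default.
sentence = "i will be home in about an hour ."
print(model.score(sentence, bos=True, eos=True))
```

Since the vocabulary treats periods, commas, and other punctuation as words (see the vocabulary section below), punctuation is space-separated in the example sentence.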
The test set consisted of 4,567 sentences (36.6K words) and was a combination of:
- COMM - Sentences written in response to hypothetical communication situations (2.1K words). [Link]
- COMM2 - Crowdsourced sentences written in response to hypothetical communication situations (13K words). [Link]
- Specialists - Phrases suggested by AAC specialists. Taken from University of Nebraska-Lincoln web pages that are no longer available (4.2K words).
- Enron mobile - Messages written by Enron employees on Blackberry mobile devices (18K words). [Link]
The test set was in lowercase and included limited punctuation (apostrophes, commas, periods, exclamation points, and question marks).
We excluded the sentence end word from our perplexity calculations.
This was necessary since the models were trained only on sentences ending in a period, exclamation point, or question mark.
But some test sentences do not have end-of-sentence punctuation, which made the sentence end word very improbable.
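One way to compute this style of per-word perplexity while dropping the sentence end word is sketched below, assuming the KenLM Python bindings and treating perplexity as the geometric mean over all scored words (the exact averaging used for the numbers above may differ):

```python
import kenlm

model = kenlm.Model("dasher_4gram.arpa")  # hypothetical local filename

def avg_per_word_perplexity(sentences):
    """Per-word perplexity over a list of sentences, excluding the sentence end word."""
    total_log10 = 0.0
    total_words = 0
    for sentence in sentences:
        # full_scores() yields one (log10 prob, ngram order, oov flag) tuple
        # per word, with the final tuple covering the sentence end word.
        scores = list(model.full_scores(sentence, bos=True, eos=True))
        for log10_prob, _, _ in scores[:-1]:  # drop the </s> entry
            total_log10 += log10_prob
            total_words += 1
    return 10 ** (-total_log10 / total_words)
```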
These models are licensed under a Creative Commons Attribution 4.0 License.
Training details:
Vocabulary.
These models use a 100K vocabulary including an unknown word, and words for commas, periods, exclamation points, and question marks.
We first trained a unigram language model on all of our training data.
We only included words that were in a list of 735K English words we obtained via human-edited dictionaries, including a February 2019 parse of Wiktionary.
We converted all words to lowercase.
We took the most probable 100K words as our vocabulary.
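In code, the selection step amounts to something like the following sketch (function and variable names are illustrative). Ranking lowercased words by raw counts gives the same ordering as ranking by unigram maximum-likelihood probabilities; the unknown and punctuation words described above would then be added on top.

```python
from collections import Counter

def select_vocab(sentences, dictionary_words, size=100_000):
    """Return the `size` most frequent lowercased in-dictionary words."""
    dictionary = {w.lower() for w in dictionary_words}
    counts = Counter()
    for sentence in sentences:
        for word in sentence.lower().split():
            if word in dictionary:
                counts[word] += 1
    return [word for word, _ in counts.most_common(size)]
```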
Training data.
We used the text of web pages from Common Crawl, September 2020 version. [Link]
We dropped sentences where 20% or more of the words were out-of-vocabulary (OOV) with respect to our 735K word list. Each sentence in the parsed data was treated as an independent training example. We removed sentences that occurred more than once on any particular web page. Our final parsed data set consisted of 16.1B sentences totaling 257B words.
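A sketch of this sentence-level filtering (names are illustrative; reading the duplicate rule as keeping a single copy per page is an assumption):

```python
def keep_sentence(words, dictionary, max_oov_rate=0.20):
    """Drop sentences where 20% or more of the words are OOV."""
    if not words:
        return False
    oov = sum(1 for w in words if w not in dictionary)
    return oov / len(words) < max_oov_rate

def dedupe_page(page_sentences):
    """Collapse sentences repeated on the same web page.

    Keeping the first copy is an assumption; the rule could also be read
    as dropping every copy of a repeated sentence.
    """
    seen = set()
    unique = []
    for sentence in page_sentences:
        if sentence not in seen:
            seen.add(sentence)
            unique.append(sentence)
    return unique
```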
Data filtering.
We used cross-entropy difference selection to choose a subset of sentences from the full training set.
We scored each sentence under three different in-domain language models:
- AAC - Conversational AAC-like data obtained via crowdsourcing (30K words). [Link]
- Short email - Sentences with between 1 and 12 words taken from the W3C email, TREC, and SpamAssassin corpora (2.1M words). [Link] [Link] [Link]
- Dialogue - Turns from everyday conversations, 90% of the training set of the DailyDialog corpus (948K words). [Link]
The out-of-domain model was trained on 100M words of data from Common Crawl.
We took the score of a sentence to be the minimum cross-entropy difference under any of the three in-domain models.
We found an optimal cross-entropy difference threshold with respect to the average perplexity of the three in-domain development sets.
Only sentences scoring below this threshold were kept in the training data.
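A sketch of the selection score (a Moore-Lewis style cross-entropy difference), assuming the in-domain and out-of-domain models are available as ARPA files loadable with KenLM; the filenames, the +1 for the sentence end word, and the log base are assumptions here:

```python
import kenlm

# Hypothetical filenames for the three in-domain models and the
# 100M-word out-of-domain Common Crawl model.
IN_DOMAIN = [kenlm.Model(f) for f in ("aac.arpa", "short_email.arpa", "dialogue.arpa")]
OUT_OF_DOMAIN = kenlm.Model("common_crawl_100m.arpa")

def cross_entropy(model, sentence):
    """Per-word cross-entropy (in log10) of a sentence under a model."""
    num_words = len(sentence.split()) + 1  # +1 for the sentence end word
    return -model.score(sentence, bos=True, eos=True) / num_words

def selection_score(sentence):
    """Minimum cross-entropy difference over the three in-domain models."""
    out_domain = cross_entropy(OUT_OF_DOMAIN, sentence)
    return min(cross_entropy(m, sentence) - out_domain for m in IN_DOMAIN)

def select(sentences, threshold):
    """Keep only sentences scoring below the tuned threshold."""
    return [s for s in sentences if selection_score(s) < threshold]
```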
We trained a 4-gram language model on the selected subset of the training data.
This model was trained using MITLM and modified Kneser-Ney smoothing.
Discounting parameters were optimized with respect to a development set consisting of equal amounts of data from the three in-domain data sources (10K total words).
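The training command would look roughly like the following sketch, which drives MITLM's estimate-ngram tool from Python; the file names are hypothetical and the exact flags should be checked against the tool's usage message.

```python
import subprocess

# Rough sketch of the MITLM invocation; file names are hypothetical.
subprocess.run(
    [
        "estimate-ngram",
        "-order", "4",
        "-text", "selected_sentences.txt",  # output of the data selection step
        "-smoothing", "ModKN",              # modified Kneser-Ney
        "-write-lm", "dasher_4gram.arpa",
    ],
    check=True,
)
# estimate-ngram can also tune the discounting parameters against a
# held-out development set; see its usage message for the exact option.
```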
Pruning.
We used SRILM's ngram tool to entropy prune the 4-gram language model.
For pruning purposes, we used a 3-gram mixture model trained with Good-Turing discounting.
The 3-gram mixture model used the same training data as the 4-gram MKN model.
The thresholds for producing the Tiny, Small, Medium, Large, and Huge models were 1e-7, 1.6e-8, 7e-9, 1.3e-9, and 4e-11 respectively.
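A sketch of this pruning step, driving SRILM's ngram tool from Python with the thresholds above (input and output file names are hypothetical; supplying the Good-Turing 3-gram mixture model as the helper model for pruning needs an additional option, see SRILM's ngram man page):

```python
import subprocess

# Entropy-pruning thresholds from the text above; output names are illustrative.
THRESHOLDS = {
    "tiny": "1e-7",
    "small": "1.6e-8",
    "medium": "7e-9",
    "large": "1.3e-9",
    "huge": "4e-11",
}

for name, threshold in THRESHOLDS.items():
    subprocess.run(
        [
            "ngram",
            "-order", "4",
            "-lm", "dasher_4gram.arpa",   # unpruned 4-gram MKN model
            "-prune", threshold,
            "-write-lm", f"dasher_4gram_{name}.arpa",
        ],
        check=True,
    )
```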
5K vocab models:
Here are some cute little language models with a 5K word vocabulary.
They exist mostly for use in testing.
They were trained on the same data as above, except that only sentences in which all words were in the 5K vocabulary were included in the training data (865M words).
The language models don't have an unknown word.
Order | Compressed size | Perplexity | Download |
2-gram | 1.4MB | 126.6 | [Link] |
3-gram | 3.6MB | 90.7 | [Link] |