Dasher word language models, February 2021


Unlike our previous word language models, here we trained only on a single training set, namely Common Crawl. Here are our trained ARPA format 4-gram language models. We evaluated each model by computing the average per-word perplexity on 36.6K words of conversational AAC-like data.

Name     Compressed size   Perplexity   Download
Tiny     29MB              107.9        [Link]
Small    141MB             86.0         [Link]
Large    836MB             74.0         [Link]
Huge     5.9GB             71.4         [Link]

The test set consisted of 4,567 sentences (36.6K words). The test set was a combination of:

The test set was in lowercase and included limited punctuation (apostrophes, commas, periods, exclamation points, and question marks). We excluded the sentence end word from our perplexity calculations. This was necessary because the models were trained only on sentences ending in a period, exclamation point, or question mark, but some test sentences have no end-of-sentence punctuation. In those sentences the models assign the sentence end word a very low probability.
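As an illustration of why excluding the sentence end word matters, here is a minimal sketch of a per-word perplexity calculation. The `logprob10` callback and the toy probabilities are hypothetical stand-ins for a real ARPA model query; this is not the evaluation script we used.

```python
import math

def perplexity(sentences, logprob10, skip_end=True):
    """Average per-word perplexity over tokenized sentences.

    logprob10(word, history) is a hypothetical callback that returns the
    base-10 log probability of `word` given the preceding words.
    """
    total_logprob = 0.0
    total_words = 0
    for words in sentences:
        history = ["<s>"]
        # Optionally score the sentence end word as one more prediction.
        targets = list(words) if skip_end else list(words) + ["</s>"]
        for w in targets:
            total_logprob += logprob10(w, tuple(history))
            total_words += 1
            history.append(w)
    return 10 ** (-total_logprob / total_words)

# Toy model (assumption): every word has probability 0.1, but the
# sentence end word is very improbable after a non-punctuation word.
def toy_logprob10(word, history):
    return math.log10(0.001 if word == "</s>" else 0.1)

test_sentences = [["how", "are", "you"], ["fine", "thanks"]]
ppl_without_end = perplexity(test_sentences, toy_logprob10, skip_end=True)
ppl_with_end = perplexity(test_sentences, toy_logprob10, skip_end=False)
print(ppl_without_end, ppl_with_end)
```

With the toy numbers, including the sentence end word raises the perplexity several times over, even though the predictions for the actual words are unchanged.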

These models are licensed under a Creative Commons Attribution 4.0 License.

Training details:

Vocabulary. These models use a 100K vocabulary including an unknown word, and words for commas, periods, exclamation points, and question marks. We first trained a unigram language model on all of our training data. We only included words that were in a list of 735K English words we obtained via human-edited dictionaries, including a February 2019 parse of Wiktionary. We converted all words to lowercase. We took the most probable 100K words as our vocabulary.
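The vocabulary selection can be sketched roughly as follows. This is a simplified illustration, not our actual pipeline: it ranks words by raw count (which gives the same ordering as an unsmoothed unigram model), splits on whitespace only, and the function name, the `<unk>` token, and the toy data are assumptions.

```python
from collections import Counter

def build_vocab(sentences, allowed_words, size, specials=("<unk>",)):
    """Pick the `size` most frequent allowed words, reserving slots for specials."""
    counts = Counter()
    for sentence in sentences:
        for word in sentence.lower().split():
            # Only count words found in the dictionary-derived word list.
            if word in allowed_words:
                counts[word] += 1
    n_regular = size - len(specials)
    top = [w for w, _ in counts.most_common(n_regular)]
    return list(specials) + top

# Toy data standing in for the training text and the 735K word list.
toy_sentences = ["The cat sat", "the cat ran", "the dog xyzzy"]
toy_word_list = {"the", "cat", "sat", "ran", "dog"}
vocab = build_vocab(toy_sentences, toy_word_list, size=3)
print(vocab)
```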

Training data. We used the text of web pages from Common Crawl, September 2020 version. [Link]

We dropped sentences where 20% or more of the words were out-of-vocabulary (OOV) with respect to our 735K word list. Each sentence in the parsed data was treated as an independent training example. We removed sentences that occurred more than once on any particular web page. Our final parsed data set consisted of 16.1B sentences totaling 257B words.
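A minimal sketch of these two per-page filters is shown below. It assumes that every copy of a repeated sentence is dropped (keeping the first copy would be another reading), and the function name and toy data are hypothetical.

```python
from collections import Counter

def filter_page(sentences, word_list, max_oov_percent=20):
    """Filter the sentences of one web page.

    Drops sentences repeated on the page and sentences where
    max_oov_percent or more of the words are out-of-vocabulary.
    """
    counts = Counter(sentences)
    kept = []
    for sentence in sentences:
        if counts[sentence] > 1:
            continue  # assumption: drop every copy of a repeated sentence
        words = sentence.split()
        if not words:
            continue
        oov = sum(1 for w in words if w not in word_list)
        # Integer comparison avoids floating-point edge cases at exactly 20%.
        if oov * 100 >= max_oov_percent * len(words):
            continue
        kept.append(sentence)
    return kept

toy_word_list = {"the", "cat", "sat", "on", "mat", "click", "here"}
page = [
    "the cat sat on the mat",
    "click here",
    "click here",
    "the qwrt cat sat here",  # 1 of 5 words is OOV, exactly 20%
]
kept_sentences = filter_page(page, toy_word_list)
print(kept_sentences)
```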

Data filtering. We used cross-entropy difference selection to select a subset of sentences from the full training set. We scored each sentence under three different in-domain language models:

The out-of-domain model was trained on 100M words of data from Common Crawl. We took the score of a sentence to be the minimum cross-entropy difference under any of the three in-domain models.
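The scoring rule can be illustrated with a small sketch. It uses toy unigram models stored as dictionaries in place of real n-gram models, assumes a single out-of-domain model, and floors unseen words at a small probability; all names, probabilities, and the threshold of 0.0 are made up for the example.

```python
import math

def cross_entropy(words, probs, floor=1e-6):
    """Per-word cross-entropy in bits under a toy unigram model."""
    return -sum(math.log2(probs.get(w, floor)) for w in words) / len(words)

def ced_score(words, in_domain_models, out_model):
    """Minimum of (in-domain minus out-of-domain) cross-entropy over the in-domain models."""
    h_out = cross_entropy(words, out_model)
    return min(cross_entropy(words, m) - h_out for m in in_domain_models)

def select(sentences, in_domain_models, out_model, threshold):
    """Keep sentences whose score falls below the threshold."""
    return [s for s in sentences
            if ced_score(s.split(), in_domain_models, out_model) < threshold]

# Toy stand-ins for the three in-domain models and the out-of-domain model.
in_a = {"i": 0.5, "want": 0.25, "water": 0.25}
in_b = {"hello": 0.5, "there": 0.5}
in_c = {"yes": 0.5, "no": 0.5}
out_model = {"the": 0.5, "market": 0.125, "i": 0.125, "want": 0.125, "water": 0.125}

candidates = ["i want water", "the market"]
selected = select(candidates, [in_a, in_b, in_c], out_model, threshold=0.0)
print(selected)
```

Taking the minimum means a sentence is kept if it looks in-domain to at least one of the in-domain models, rather than requiring agreement from all three.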

We found an optimal cross-entropy difference threshold with respect to the average perplexity of the three in-domain development sets. Only sentences scoring below this threshold were kept in the training data. We then trained a 4-gram language model on the selected subset of the training set. This model was trained using MITLM and modified Kneser-Ney smoothing. Discounting parameters were optimized with respect to a development set consisting of equal amounts of data from the three in-domain data sources (10K total words).

Pruning. We used SRILM's ngram tool to entropy prune the 4-gram language model. For pruning purposes, we used a 3-gram mixture model trained with Good-Turing discounting. The 3-gram mixture model used the same training data as the 4-gram MKN model. The thresholds for producing the Tiny, Small, Medium, Large, and Huge models were 1e-7, 1.6e-8, 7e-9, 1.3e-9, and 4e-11 respectively.

5K vocab models:

Here are some cute little language models with a 5K word vocabulary. They exist mostly for use in testing. They were trained on the same data as above, except that only sentences whose words were all in the 5K vocabulary were included in the training data (865M words). The language models don't have an unknown word.
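The sentence restriction for these models amounts to a simple all-words-in-vocabulary check, sketched here with a hypothetical function name and a four-word stand-in for the 5K vocabulary.

```python
def in_vocab_only(sentences, vocab):
    """Keep only sentences in which every word is in the vocabulary."""
    kept = []
    for sentence in sentences:
        words = sentence.split()
        if words and all(w in vocab for w in words):
            kept.append(sentence)
    return kept

toy_vocab = {"i", "want", "some", "water"}  # stand-in for the 5K vocabulary
kept_5k = in_vocab_only(["i want water", "i want espresso", "some water"], toy_vocab)
print(kept_5k)
```

Because no training sentence contains an out-of-vocabulary word, there is nothing from which to estimate an unknown word probability, which is consistent with these models not having one.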

Order    Compressed size   Perplexity   Download
2-gram   1.4MB             126.6        [Link]
3-gram   3.6MB             90.7         [Link]