ImagineVille

Word language models, December 2019

Models:

Here are our trained ARPA format 4-gram language models. We evaluated each model by computing the average per-word perplexity on 36.6K words of conversational AAC-like data.

Name      Compressed size   Perplexity   Download
Tiny      25 MB             90.1         [Link]
Small     110 MB            73.7         [Link]
Medium    205 MB            69.2         [Link]
Large     662 MB            63.3         [Link]
Huge      4.9 GB            59.1         [Link]

The test set consisted of 4,567 sentences (36.6K words) drawn from a combination of sources. It was in lowercase and included limited punctuation (apostrophes, commas, periods, exclamation points, and question marks). We excluded the sentence-end word from our perplexity calculations. This was necessary because the models were trained only on sentences ending in a period, exclamation point, or question mark; since some test sentences lack end-of-sentence punctuation, the sentence-end word would otherwise be very improbable.
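For reference, the per-word perplexity of one of these models on such a test set can be computed with the KenLM Python bindings roughly as follows; the file name and test sentences are illustrative, and this is a sketch rather than our actual evaluation code. Passing eos=False leaves the sentence-end word out of the score, mirroring the exclusion described above.

    import kenlm

    # Load one of the ARPA (or KenLM binary) models; the path is illustrative.
    model = kenlm.Model("large.arpa")

    def perplexity(sentences):
        """Average per-word perplexity, excluding the sentence-end word."""
        total_log10 = 0.0
        total_words = 0
        for sentence in sentences:
            # bos=True prepends <s>; eos=False omits </s> from the score.
            total_log10 += model.score(sentence, bos=True, eos=False)
            total_words += len(sentence.split())
        return 10.0 ** (-total_log10 / total_words)

    print(perplexity(["i would like a cup of coffee .", "how are you today ?"]))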

These models are licensed under a Creative Commons Attribution 4.0 License. There are also BerkeleyLM and KenLM versions of these models (see bottom of page).

Training details:

Vocabulary. These models use a 100K vocabulary including an unknown word and words for commas, periods, exclamation points, and question marks. We first trained a unigram language model on each of our training sources. We only included words that appeared in a list of 697K English words obtained from human-edited dictionaries, including a February 2019 parse of Wiktionary. We converted all words to lowercase. We then trained a linear mixture model using equal weights for each of the unigram models. Finally, we took the 100K most probable words as our vocabulary.
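The vocabulary selection can be sketched as follows; the helpers and the equal-weight unigram mixture are a simplified illustration of the steps above, not our exact scripts.

    from collections import Counter

    def unigram_distribution(sentences, word_list):
        """Relative-frequency unigram model over in-dictionary, lowercased words."""
        counts = Counter()
        for sentence in sentences:
            for word in sentence.lower().split():
                if word in word_list:
                    counts[word] += 1
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}

    def select_vocabulary(per_source_sentences, word_list, size=100_000):
        """Equal-weight linear mixture of per-source unigram models; keep the top `size` words."""
        models = [unigram_distribution(s, word_list) for s in per_source_sentences]
        weight = 1.0 / len(models)
        mixture = Counter()
        for model in models:
            for word, p in model.items():
                mixture[word] += weight * p
        return [word for word, _ in mixture.most_common(size)]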

Training data. We used a number of training sources. We dropped sentences in which 20% or more of the words were out-of-vocabulary (OOV) with respect to our 698K-word list.
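The OOV filter is a simple per-sentence ratio test; a minimal sketch, assuming the word list is available as a Python set:

    def keep_sentence(sentence, word_list, max_oov_ratio=0.2):
        """Keep a sentence only if fewer than 20% of its words are OOV."""
        words = sentence.split()
        if not words:
            return False
        oov = sum(1 for word in words if word not in word_list)
        return oov / len(words) < max_oov_ratio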

We wanted our models to learn across end-of-sentence punctuation words. After each parsed sentence from a training set, we concatenated another sentence from that training set with probability 0.5. Thus each training sentence consisted of one or more actual sentences from the original training data. We only concatenated sentences that came from the same apparent author (i.e. from the same tweet, blog post, web page, etc.). Within each training set, we removed training sentences that occurred more than once.
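A sketch of the concatenation and de-duplication steps, assuming sentences arrive grouped by apparent author; the grouping and the keep-one-copy de-duplication are simplifying assumptions.

    import random

    def build_training_sentences(sentences_by_author, p=0.5, seed=0):
        """Join consecutive same-author sentences with probability p, then de-duplicate."""
        rng = random.Random(seed)
        joined = []
        for sentences in sentences_by_author:
            i = 0
            while i < len(sentences):
                current = sentences[i]
                i += 1
                # With probability p, append the next sentence from the same author,
                # so each training sentence covers one or more original sentences.
                while i < len(sentences) and rng.random() < p:
                    current = current + " " + sentences[i]
                    i += 1
                joined.append(current)
        # Remove training sentences occurring more than once (keeping one copy here).
        seen = set()
        deduped = []
        for sentence in joined:
            if sentence not in seen:
                seen.add(sentence)
                deduped.append(sentence)
        return deduped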

Data filtering. We used cross-entropy difference selection to select a subset of sentences from each training set. We scored each sentence under three different in-domain language models, as well as an out-of-domain model trained on 100M words of data from the corresponding training source. We took the score of a sentence to be the minimum cross-entropy difference under any of the three in-domain models.
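The selection score of a sentence is the smallest difference between its in-domain and out-of-domain cross-entropy; a sketch, where each model's log2prob method is a hypothetical accessor returning the total log2 probability of a sentence.

    def cross_entropy(sentence, model):
        """Per-word cross-entropy (bits) of a sentence under a language model."""
        return -model.log2prob(sentence) / len(sentence.split())

    def selection_score(sentence, in_domain_models, out_of_domain_model):
        """Minimum cross-entropy difference over the in-domain models.

        Lower scores indicate sentences that look more in-domain relative to the
        out-of-domain model trained on the sentence's own source."""
        return min(
            cross_entropy(sentence, model) - cross_entropy(sentence, out_of_domain_model)
            for model in in_domain_models
        )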

For each training set, we found an optimal cross-entropy difference threshold. Only sentences scoring below this threshold were kept in the training data. We trained a 4-gram language model on the retained subset of a given training set. This model was trained using MITLM and modified Kneser-Ney smoothing. Discounting parameters were optimized with respect to a development set consisting of equal amounts of data from the three in-domain data sources (10K total words). The optimal cross-entropy threshold was the one whose model minimized perplexity on this development set.
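Finding each training set's threshold then reduces to a sweep over candidate values; train_lm and perplexity below are hypothetical stand-ins for the MITLM training and development-set evaluation described above.

    def best_threshold(scored_sentences, candidate_thresholds, dev_set, train_lm, perplexity):
        """Pick the cross-entropy difference threshold that minimizes dev-set perplexity.

        scored_sentences is a list of (score, sentence) pairs; train_lm and
        perplexity are stand-ins for the MITLM steps."""
        best = None
        for threshold in candidate_thresholds:
            subset = [sentence for score, sentence in scored_sentences if score < threshold]
            model = train_lm(subset, order=4)
            ppl = perplexity(model, dev_set)
            if best is None or ppl < best[1]:
                best = (threshold, ppl)
        return best  # (threshold, dev perplexity)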

Mixture model training. The resulting models from each training set, at their optimal cross-entropy thresholds, were combined via linear interpolation into a single mixture model. The mixture weights were optimized with respect to the development set. At first we created a mixture model from all 14 training sets. Since only the Subtitle, Reddit, Twitter, and Common sets had a mixture weight greater than 0.01, we simplified and created a mixture model from just these four sets. Details for each set were as follows (a sketch of the interpolation and weight optimization appears after the table):

                          Subtitle   Reddit   Twitter   Common
Mixture weight            0.27       0.18     0.13      0.43
Cross-entropy threshold   0.30       0.10     0.10      0.10
Training words            113M       3.3B     1.2B      3.9B
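A minimal sketch of the linear interpolation and of tuning the mixture weights on a development set with expectation-maximization; the prob(word, context) method on each component model is a hypothetical accessor, not part of any particular toolkit.

    def mixture_prob(word, context, models, weights):
        """Linearly interpolated word probability."""
        return sum(w * m.prob(word, context) for m, w in zip(models, weights))

    def optimize_weights(models, dev_tokens, iterations=20):
        """EM updates for the interpolation weights on development data.

        dev_tokens is a list of (word, context) pairs from the development set."""
        weights = [1.0 / len(models)] * len(models)
        for _ in range(iterations):
            expected = [0.0] * len(models)
            for word, context in dev_tokens:
                component = [w * m.prob(word, context) for m, w in zip(models, weights)]
                total = sum(component)
                for i, value in enumerate(component):
                    expected[i] += value / total
            weights = [e / len(dev_tokens) for e in expected]
        return weights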

The unpruned 4-model mixture has a compressed disk size of 37GB and a perplexity on our test set of 58.15. The unpruned 14-model mixture has a compressed disk size of 40GB and a perplexity on our test set of 58.20.

Pruning. We used SRILM's ngram tool to entropy prune the 4-gram mixture model. For pruning purposes, we used a 3-gram mixture model trained with Good-Turing discounting. The 3-gram mixture model used the same training data as the 4-gram MKN model. We optimized the mixture weights on the same development data as the 4-gram model. The thresholds for producing the Tiny, Small, Medium, Large, and Huge models were 1e-7, 1.6e-8, 7e-9, 1.3e-9, and 4e-11 respectively.
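For reference, the pruning step corresponds roughly to the following invocation of SRILM's ngram tool, run here from Python. The file names are illustrative, and using -prune-history-lm to supply the Good-Turing 3-gram mixture for the history marginals is our reading of the setup above, not a quoted command.

    import subprocess

    # Entropy-pruning thresholds for the released models.
    thresholds = {
        "tiny": "1e-7",
        "small": "1.6e-8",
        "medium": "7e-9",
        "large": "1.3e-9",
        "huge": "4e-11",
    }

    for name, threshold in thresholds.items():
        subprocess.run(
            [
                "ngram",
                "-order", "4",
                "-lm", "mixture_4gram.arpa",                   # illustrative file name
                "-prune", threshold,
                "-prune-history-lm", "mixture_3gram_gt.arpa",  # illustrative file name
                "-write-lm", name + ".arpa",
            ],
            check=True,
        )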

Other models:

Some of our software clients use the excellent BerkeleyLM and KenLM libraries for making efficient language model queries. We also provide binary versions of our models compatible with these two libraries.

BerkeleyLM:

Name      Compressed size   Perplexity   Download
Tiny      36 MB             90.1         [Link]
Small     148 MB            73.7         [Link]
Medium    254 MB            69.2         [Link]
Large     837 MB            63.3         [Link]
Huge      6.2 GB            59.1         [Link]

The BerkeleyLM models are context encoded using the -e switch of the MakeLmBinaryFromArpa program.

KenLM:

Name      Compressed size   Perplexity   Download
Tiny      18 MB             90.1         [Link]
Small     72 MB             73.6         [Link]
Medium    132 MB            69.2         [Link]
Large     391 MB            63.2         [Link]
Huge      2.9 GB            58.9         [Link]

The KenLM models are trie models. We used 10 bits to quantize probabilities and 8 bits to quantize backoff weights.
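As an example of the kind of incremental query these binaries support, the KenLM Python bindings can score one word at a time while carrying the model state forward; the file name is illustrative.

    import kenlm

    model = kenlm.Model("huge_trie.bin")  # any of the KenLM trie binaries

    state, next_state = kenlm.State(), kenlm.State()
    model.BeginSentenceWrite(state)  # start in the <s> context

    total_log10 = 0.0
    for word in "how are you today ?".split():
        # BaseScore returns log10 P(word | state) and fills in the next state.
        total_log10 += model.BaseScore(state, word, next_state)
        state, next_state = next_state, state

    print(total_log10)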