ImagineVille

Character language models, December 2019

Models:

Here are our trained ARPA-format 12-gram character language models. We evaluated each model by computing the average per-character perplexity on conversational AAC-like data.

Name     Compressed size   Perplexity   Download
Tiny     22MB              3.1805       [Link]
Small    114MB             2.8757       [Link]
Medium   213MB             2.8097       [Link]
Large    473MB             2.7581       [Link]
Huge     2.7GB             2.7263       [Link]

The test set consisted of 4,567 sentences (36.6K words, 161K characters). The test set was a combination of: The test set was in lowercase and included limited punctuation (apostrophes, commas, periods, exclamation points, and question marks). We excluded the sentence end word from our perplexity calculations. This was necessary since the models were trained only on sentences ending in a period, exclamation point, or question mark, while some test sentences have no end-of-sentence punctuation; this made the sentence end word very improbable on those sentences.
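
For illustration, here is a minimal Python sketch of this per-character perplexity calculation using the kenlm Python module. The file names are placeholders and this is not the exact evaluation script we used; it assumes the test sentences are already in the space-separated character format described under Training details.

    import kenlm

    model = kenlm.Model("huge_char_12gram.arpa")  # placeholder path

    def char_perplexity(lines):
        """Average per-character perplexity, excluding the sentence end word."""
        total_log10 = 0.0
        total_tokens = 0
        for line in lines:
            # Each line is space-separated characters with <sp> between words,
            # e.g. "h o w <sp> a r e <sp> y o u ?".
            # bos=True adds <s>; eos=False leaves </s> out of the calculation.
            total_log10 += model.score(line, bos=True, eos=False)
            total_tokens += len(line.split())
        return 10.0 ** (-total_log10 / total_tokens)

    with open("test_char.txt") as f:  # placeholder path
        print(char_perplexity(line.strip() for line in f))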

These models are licensed under a Creative Commons Attribution 4.0 License. There are also BerkeleyLM and KenLM versions of these models (see bottom of page).

Training details:

Vocabulary. These models use a 34-symbol vocabulary: the lowercase letters a-z, apostrophe, period, exclamation mark, question mark, comma, space (<sp>), sentence start (<s>), and sentence end (</s>).
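
As a quick check, the vocabulary can be written out as a Python set (a sketch, not part of our released tooling):

    import string

    # 26 lowercase letters + 5 punctuation symbols + 3 pseudo-words = 34 symbols.
    VOCAB = set(string.ascii_lowercase) | set("'.!?,") | {"<sp>", "<s>", "</s>"}
    assert len(VOCAB) == 34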

Training data. We used the following training sources: We dropped sentences where 20% or more of the words were out-of-vocabulary (OOV) with respect to our 698K word list.

We wanted our models to learn dependencies that span end-of-sentence punctuation. After each parsed sentence from a training set, we concatenated another sentence from that training set with probability 0.5. Thus each training sentence consisted of one or more actual sentences from the original training data. We only concatenated sentences that came from the same apparent author (i.e. from the same tweet, blog post, web page, etc.). Within each training set, we removed training sentences that occurred more than once.
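
A minimal Python sketch of this concatenation step is below. It assumes sentences are already parsed and grouped by author; the exact order of concatenation and duplicate removal in our pipeline may differ.

    import random

    def build_training_sentences(sentences_by_author, p=0.5, seed=0):
        """Concatenate each sentence with following sentences from the same
        author with probability p, then drop duplicate training sentences."""
        rng = random.Random(seed)
        out = []
        for sentences in sentences_by_author:  # e.g. one tweet or blog post
            i = 0
            while i < len(sentences):
                combined = sentences[i]
                i += 1
                # Append the author's next sentence with probability p,
                # possibly chaining several original sentences together.
                while i < len(sentences) and rng.random() < p:
                    combined += " " + sentences[i]
                    i += 1
                out.append(combined)
        # Keep only one copy of any training sentence that occurs more than once.
        seen = set()
        unique = []
        for s in out:
            if s not in seen:
                seen.add(s)
                unique.append(s)
        return unique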

Finally, we lowercased the word-level training data and converted it into character-level data. We used the pseudo-word <sp> to represent the spaces between words.
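
For example, the conversion can be sketched in Python as:

    def to_character_level(sentence):
        """Lowercase a word-level sentence and rewrite it as space-separated
        characters, with <sp> standing in for the spaces between words."""
        words = sentence.lower().split()
        return " <sp> ".join(" ".join(word) for word in words)

    # "How are you?" -> "h o w <sp> a r e <sp> y o u ?"
    print(to_character_level("How are you?"))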

Data filtering. We used cross-entropy difference selection to select a subset of sentences from each training set. We scored each sentence under three different in-domain character language models: The out-of-domain models were trained on 100M words of data from each training source. We took the score of a sentence to be the minimum cross-entropy difference under any of the three in-domain models.
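
A sketch of the per-sentence score, assuming the total log10 probabilities of the sentence under the in-domain and out-of-domain character models have already been computed (the function and argument names here are hypothetical):

    def selection_score(num_chars, in_domain_logprobs, out_of_domain_logprob):
        """Minimum cross-entropy difference for one sentence.

        in_domain_logprobs: total log10 probability of the sentence under each
            of the three in-domain character models.
        out_of_domain_logprob: total log10 probability under the out-of-domain
            model trained on the sentence's own source.
        Lower scores look more like the in-domain data."""
        out_xent = -out_of_domain_logprob / num_chars
        return min((-in_lp / num_chars) - out_xent for in_lp in in_domain_logprobs)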

For each training set, we found an optimal cross-entropy difference threshold; only sentences scoring below the threshold were kept in the training data. For a given threshold, we trained a 12-gram language model on the selected subset of the training set using MITLM with modified Kneser-Ney smoothing. No pruning was done during training. Discounting parameters were optimized with respect to a development set consisting of equal amounts of data from the three in-domain data sources (10K total words). The optimal cross-entropy threshold was the one whose model minimized the perplexity on this development set.
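
The threshold search can be summarized with the following Python sketch; train_subset_model and dev_perplexity are hypothetical stand-ins for the MITLM training run and the development-set evaluation described above.

    def best_threshold(scored_sentences, candidate_thresholds,
                       train_subset_model, dev_perplexity):
        """Return the threshold whose selected subset gives the lowest
        development-set perplexity.

        scored_sentences: (score, sentence) pairs from the selection step."""
        best = None
        for threshold in candidate_thresholds:
            subset = [s for score, s in scored_sentences if score < threshold]
            model = train_subset_model(subset)  # 12-gram, modified Kneser-Ney
            ppl = dev_perplexity(model)         # 10K-word development set
            if best is None or ppl < best[1]:
                best = (threshold, ppl)
        return best  # (threshold, dev perplexity)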

Mixture model training. We computed the mixture weights between the 14 trained models with respect to the development data using the compute-best-mix script of SRILM. We found that only the Subtitle, Reddit, Twitter, and Common sets had mixture weights greater than 0.03. As such, we created our mixture models with just these four sets.
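
Conceptually, the mixture is a linear interpolation of the component models. The following sketch shows a single mixture query; the component model callables are hypothetical, and in practice the components are merged into one ARPA model rather than queried separately, but the probability being computed is the same.

    def mixture_prob(char, history, component_probs, weights):
        """P_mix(c | h) = sum_i w_i * P_i(c | h)

        component_probs: callables returning P_i(c | h) for the Subtitle,
            Reddit, Twitter, and Common models.
        weights: the corresponding mixture weights estimated on the
            development data."""
        return sum(w * p(char, history) for w, p in zip(weights, component_probs))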

We created a mixture model from our four models trained with modified Kneser-Ney smoothing. For comparison, we created a second mixture model using four models trained with Witten-Bell smoothing. The unpruned modified Kneser-Ney mixture was 37GB compressed (3,812,961,036 n-grams) and had a perplexity of 2.684 on our test set. The unpruned Witten-Bell mixture was 34GB compressed (3,812,960,848 n-grams) and had a perplexity of 2.729 on our test set.

For our final model set, we used the Witten-Bell mixture. Effectively pruning the modified Kneser-Ney model requires training and loading an 11-gram model trained with another smoothing method [Link], which increased memory demands during pruning. Further, we found that, at aggressive amounts of pruning, Witten-Bell pruned models outperformed similarly sized modified Kneser-Ney models pruned using -prune-history-lm and a Good-Turing 11-gram model.

Details of the final Witten-Bell mixture model were as follows:

                          Subtitle   Reddit   Twitter   Common
Mixture weight            0.353      0.260    0.174     0.331
Cross-entropy threshold   0.10       0.00     0.05      0.00
Training characters       739M       5.44B    8.46B     6.45B

Pruning. We used SRILM's ngram tool to entropy prune the 12-gram mixture model. The thresholds for producing the Tiny, Small, Medium, Large, and Huge models were 4e-9, 1.3e-9, 8e-10, 4e-10, and 4e-11 respectively.
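
A sketch of driving that pruning step from Python (file names are placeholders; see the SRILM ngram documentation for the full set of options):

    import subprocess

    # Entropy-pruning thresholds for each model size, as listed above.
    PRUNE_THRESHOLDS = {
        "tiny":   "4e-9",
        "small":  "1.3e-9",
        "medium": "8e-10",
        "large":  "4e-10",
        "huge":   "4e-11",
    }

    for name, threshold in PRUNE_THRESHOLDS.items():
        subprocess.run(
            ["ngram", "-order", "12",
             "-lm", "mixture.arpa",   # placeholder: the 12-gram mixture model
             "-prune", threshold,
             "-write-lm", f"{name}.arpa"],
            check=True)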

Other models:

Some of our software clients use the excellent BerkeleyLM and KenLM libraries for making efficient language model queries. We also provide binary versions of our models compatible with these two libraries.

BerkeleyLM:

Name     Compressed size   Perplexity   Download
Tiny     39MB              3.180        [Link]
Small    198MB             2.876        [Link]
Medium   353MB             2.810        [Link]
Large    782MB             2.762        [Link]
Huge     4.4GB             2.726        [Link]

The BerkeleyLM models are context-encoded, using the -e switch of the MakeLmBinaryFromArpa program.

KenLM:

Name     Compressed size   Perplexity   Download
Tiny     15MB              3.190        [Link]
Small    71MB              2.883        [Link]
Medium   127MB             2.818        [Link]
Large    231MB             2.766        [Link]
Huge     1.4GB             2.742        [Link]

The KenLM models are trie models. We used 10 bits to quantize probabilities and 8 bits to quantize backoff weights.
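
For reference, a trie model with this quantization can be built from one of the ARPA files with KenLM's build_binary tool; a sketch is below (placeholder file names, and the exact flags are best confirmed against the KenLM documentation).

    import subprocess

    # trie data structure, 10-bit probability and 8-bit backoff quantization.
    subprocess.run(
        ["build_binary", "-q", "10", "-b", "8", "trie",
         "large.arpa", "large.klm"],
        check=True)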