ImagineVille

Word language models, February 2019

Models:

Here are our trained ARPA format 4-gram language models. We evaluated each model by computing the average per-word perplexity on 37K words of conversational AAC-like data.

Name      Compressed size   Perplexity   Download
Small     103MB             73.2         [Link]
Medium    201MB             69.9         [Link]
Large     682MB             65.9         [Link]
Huge      5.1GB             63.4         [Link]

The test set was a combination of:

We excluded the sentence end word from our perplexity calculations. The test set was in lowercase and included limited punctuation (apostrophes, commas, periods, exclamation points, and question marks).

These models are licensed under a Creative Commons Attribution 4.0 License.
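To make the perplexity figures in the table above concrete, here is a minimal Python sketch of average per-word perplexity that leaves the sentence end word out of the calculation. The function name, the `log10_prob` callback, and the `<s>` start symbol are assumptions for illustration; this is not the evaluation script used for the models above.

```python
def per_word_perplexity(sentences, log10_prob, order=4):
    """Average per-word perplexity over tokenized sentences.

    sentences:  list of token lists (lowercase words and punctuation words).
    log10_prob: hypothetical callback (word, history_tuple) -> log10 probability,
                e.g. a wrapper around an ARPA model query.
    order:      n-gram order; the history holds at most order - 1 words.
    """
    total_log10 = 0.0
    n_words = 0
    for tokens in sentences:
        history = ["<s>"]  # assumed sentence start symbol
        for word in tokens:
            total_log10 += log10_prob(word, tuple(history[-(order - 1):]))
            n_words += 1
            history.append(word)
        # The sentence end word is deliberately not scored or counted.
    return 10.0 ** (-total_log10 / n_words)


# Usage: a toy model that gives every word probability 0.1.
toy_model = lambda word, history: -1.0
ppl = per_word_perplexity([["how", "are", "you", "?"], ["fine", "."]], toy_model)
```

With a uniform probability of 0.1 per word, the perplexity comes out at 10, which is a quick sanity check for any real model wrapper plugged in as `log10_prob`.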


Training details:

Vocabulary. These models use a 100K vocabulary including an unknown word, and words for commas, periods, exclamation points, and question marks. We first trained a unigram language model on each of our training sources. We only included words that were in a list of 697K English words we obtained via human-edited dictionaries, including a February 2019 parse of Wiktionary. We converted all words to lowercase. We trained a linear mixture model using equal weights for each of the unigram models. Finally, we took the most probable 100K words as our vocabulary.
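The vocabulary procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the original pipeline: `build_vocabulary` is a hypothetical name, the unigram models are plain maximum-likelihood estimates, and words are lowercased before being checked against the word list.

```python
from collections import Counter


def build_vocabulary(sources, allowed_words, size):
    """Pick the `size` most probable words under an equal-weight unigram mixture.

    sources:       list of token lists, one per training source.
    allowed_words: set of lowercase dictionary words.
    size:          number of words to keep (100K in the text above).
    """
    mixture = Counter()
    weight = 1.0 / len(sources)  # equal mixture weight per source
    for tokens in sources:
        lowered = (w.lower() for w in tokens)
        counts = Counter(w for w in lowered if w in allowed_words)
        total = sum(counts.values())
        if total == 0:
            continue  # source contributed no in-list words
        for word, count in counts.items():
            # Maximum-likelihood unigram probability (an assumption here).
            mixture[word] += weight * count / total
    # Sort by descending probability, breaking ties alphabetically.
    ranked = sorted(mixture.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:size]]


# Usage with two tiny toy sources.
vocab = build_vocabulary(
    [["The", "cat", "the"], ["the", "dog", "xyzzy"]],
    allowed_words={"the", "cat", "dog"},
    size=2,
)
```

Because each source's unigram model is normalized before mixing, a small source has the same influence on the ranking as a very large one.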

Training data. We used the following training sources:

We dropped sentences where 20% or more of words were out-of-vocabulary (OOV) with respect to our 698K word list.
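A minimal Python sketch of this OOV filter is shown below. The function name and the choice to count every token as a word are assumptions; how punctuation words were counted in the original pipeline is not stated.

```python
def keep_sentence(tokens, word_list, max_oov_rate=0.2):
    """Return True if fewer than `max_oov_rate` of the tokens are OOV.

    tokens:    list of words in one sentence.
    word_list: set of lowercase in-vocabulary words.
    """
    if not tokens:
        return False  # nothing to train on
    oov = sum(1 for w in tokens if w.lower() not in word_list)
    # "20% or more" is dropped, so the comparison is strict.
    return oov / len(tokens) < max_oov_rate


# Usage: one OOV word out of five is exactly 20%, so the sentence is dropped.
words = {"i", "want", "some", "water", "please", "now", "thank", "you", "very", "much"}
dropped = keep_sentence(["i", "want", "some", "wattter", "please"], words)
kept = keep_sentence(
    ["i", "want", "some", "wattter", "please", "now", "thank", "you", "very", "much"],
    words,
)
```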

We wanted our models to learn across end-of-sentence punctuation words. After each parsed sentence from a training set, we concatenated another sentence from that training set with probability 0.5. Thus each training sentence consisted of one or more actual sentences from the original training data. We only concatenated sentences that came from the same apparent author (i.e. from the same tweet, blog post, web page, etc.).
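One way to read the concatenation step is sketched below in Python. It assumes the sentences of each document (tweet, blog post, web page) are kept in order and that, after each sentence, the next sentence from the same document is appended with probability 0.5. The function name and data layout are hypothetical.

```python
import random


def concatenate_sentences(documents, p=0.5, rng=None):
    """Merge consecutive sentences from the same document into training lines.

    documents: list of documents, each a list of sentence strings by one author.
    p:         probability of appending the next sentence to the current line.
    rng:       optional random.Random instance for reproducibility.
    """
    rng = rng or random.Random()
    lines = []
    for sentences in documents:
        current = []
        for sentence in sentences:
            current.append(sentence)
            if rng.random() >= p:  # stop extending with probability 1 - p
                lines.append(" ".join(current))
                current = []
        if current:  # never carry text across a document boundary
            lines.append(" ".join(current))
    return lines


# Usage with a fixed seed so the split is reproducible.
docs = [["hi .", "how are you ?", "i am fine ."], ["see you later ."]]
lines = concatenate_sentences(docs, p=0.5, rng=random.Random(0))
```

Under this reading, the number of original sentences per training line follows a geometric distribution, capped by the length of the document.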

We used cross-entropy difference selection to select training sentences. Our in-domain language model was trained on 30K words of conversational AAC-like data. The out-of-domain model was trained on a similar amount of data from the training source being selected from. We merged training sentences from all sources, sorting them by their cross-entropy difference. To focus on the most conversational-like text, we only included sentences with a cross-entropy difference of -0.2 or smaller. We also removed training sentences that occurred more than once. This reduced our parsed training data from 280B words to a final training set of 2.6B words.
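The selection step can be sketched in Python as below. Several details are assumptions: the two scoring callbacks are hypothetical and stand in for per-word cross-entropy under the in-domain and out-of-domain models, a single out-of-domain callback is used instead of one per training source, and "occurred more than once" is read as keeping a single copy of each repeated sentence.

```python
def select_sentences(sentences, in_domain_xent, out_domain_xent, threshold=-0.2):
    """Keep sentences whose cross-entropy difference is at most `threshold`.

    sentences:       iterable of sentence strings from all sources.
    in_domain_xent:  hypothetical callback, sentence -> per-word cross-entropy
                     under the in-domain (AAC-like) model.
    out_domain_xent: hypothetical callback, sentence -> per-word cross-entropy
                     under the out-of-domain model.
    """
    scores = {}
    for sentence in sentences:
        if sentence in scores:
            continue  # assumption: keep one copy of a repeated sentence
        scores[sentence] = in_domain_xent(sentence) - out_domain_xent(sentence)
    kept = [(diff, s) for s, diff in scores.items() if diff <= threshold]
    kept.sort()  # most in-domain-like (most negative difference) first
    return [s for _, s in kept]


# Usage with toy cross-entropy tables in place of real language models.
in_x = {"i need help": 1.0, "stock prices fell": 2.0, "thank you": 3.0}
out_x = {"i need help": 1.5, "stock prices fell": 2.1, "thank you": 4.0}
selected = select_sentences(
    ["i need help", "stock prices fell", "thank you", "i need help"],
    in_x.get,
    out_x.get,
)
```

A more negative difference means the in-domain model finds the sentence easier to predict than the out-of-domain model does, which is why the threshold is applied as an upper bound.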

Model estimation and pruning. These 4-gram word language models were trained and pruned using the variKN toolkit. We used the following command for training:

varigram_kn --dscale scale1 --dscale2 scale2 --arpa --3nzer --norder 4 --longint --opti dev_data --clear_history --vocabin=vocab_lower_100k.txt

For dev_data, we used an equal amount of development data from each of the training sets and all our AAC-like development data (38K total words). Here are the scale values we used to generate the models at the top of the page: