ImagineVille

Language models, Feb 2019

Models:

Here are our trained ARPA format 4-gram language models. We evaluated the average per-word perplexity of each model on 34K words of conversational AAC-like data.

The test data was a combination of:

These models are licensed under a Creative Commons Attribution 4.0 License.
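As a rough illustration, the per-word perplexity of one of these models on a tokenized test file could be computed along the lines of the sketch below. The kenlm Python bindings and the file names are assumptions, not the evaluation script we actually used; any ARPA-compatible scorer that reports sentence log probabilities would work similarly.

# Illustrative per-word perplexity computation over a tokenized test file.
# The kenlm bindings and file names are assumptions.
import math
import kenlm

model = kenlm.Model("lm_4gram.arpa")                # hypothetical model file

total_log10, total_words = 0.0, 0
with open("aac_test.txt", encoding="utf-8") as f:   # hypothetical test set
    for line in f:
        sentence = line.strip()
        if not sentence:
            continue
        total_log10 += model.score(sentence, bos=True, eos=True)
        total_words += len(sentence.split()) + 1    # +1 for the end-of-sentence event

# Average per-word perplexity: 10 ** (-average log10 probability per word).
perplexity = 10.0 ** (-total_log10 / total_words)
print(f"per-word perplexity: {perplexity:.1f}")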


Training details:

Vocabulary. These models use a 100K vocabulary that includes an unknown word, as well as words for commas, periods, exclamation points, and question marks. To select the vocabulary, we first trained a unigram language model on each type of training data. We only included words that were in a list of 697K English words we obtained via human-edited dictionaries, including a February 2019 parse of Wiktionary. We converted all words to lowercase.
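As a rough illustration, vocabulary selection along these lines might look like the following sketch. The file names, the simple count-based ranking, and the equal weighting of sources are assumptions, not a description of our exact pipeline.

# Illustrative sketch of vocabulary selection: count words per source,
# lowercase, filter against the human-edited word list, keep the top 100K.
# (The unknown word and punctuation words would be added separately.)
from collections import Counter

ALLOWED_WORDS_FILE = "english_words_697k.txt"    # hypothetical path
SOURCE_FILES = ["source_a.txt", "source_b.txt"]  # hypothetical training sources
VOCAB_SIZE = 100_000

with open(ALLOWED_WORDS_FILE, encoding="utf-8") as f:
    allowed = {line.strip().lower() for line in f if line.strip()}

# One unigram count per training source, lowercased and filtered.
per_source_counts = []
for path in SOURCE_FILES:
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            for word in line.lower().split():
                if word in allowed:
                    counts[word] += 1
    per_source_counts.append(counts)

# Combine per-source statistics (weighting each source equally, an assumption)
# and keep the most probable 100K words.
combined = Counter()
for counts in per_source_counts:
    total = sum(counts.values()) or 1
    for word, c in counts.items():
        combined[word] += c / total
vocab = [w for w, _ in combined.most_common(VOCAB_SIZE)]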

Training data. We used the following training sources:

We dropped sentences where 20% or more of the words were out-of-vocabulary (OOV) with respect to our 698K word list. We wanted our models to learn across end-of-sentence punctuation words, so after each parsed sentence from a training set, we concatenated another sentence from that training set with probability 0.5. Thus each training sentence consisted of one or more actual sentences from the original training data.
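A minimal sketch of this preprocessing step is shown below. Whitespace tokenization, the function names, and the interpretation that concatenation can chain across several consecutive sentences are assumptions.

# Illustrative sketch of the OOV filter and probabilistic sentence concatenation.
import random

def oov_rate(sentence, word_list):
    """Fraction of tokens not found in the reference word list."""
    tokens = sentence.split()
    if not tokens:
        return 1.0
    return sum(1 for t in tokens if t not in word_list) / len(tokens)

def build_training_sentences(sentences, word_list, max_oov=0.20, p_concat=0.5):
    """Drop high-OOV sentences, then randomly glue consecutive sentences
    together so the model sees text that crosses sentence boundaries."""
    kept = [s for s in sentences if oov_rate(s, word_list) < max_oov]
    out = []
    i = 0
    while i < len(kept):
        current = kept[i]
        i += 1
        # With probability 0.5, append the next sentence (possibly repeatedly).
        while i < len(kept) and random.random() < p_concat:
            current = current + " " + kept[i]
            i += 1
        out.append(current)
    return out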

We used cross-entropy difference selection to select training sentences. Each sentence is scored by its cross-entropy under an in-domain language model minus its cross-entropy under an out-of-domain model; lower (more negative) scores indicate text that looks more like the in-domain data. Our in-domain language model was trained on 30K words of conversational AAC-like data. The out-of-domain model was trained on a similar amount of data from the training source being selected from. We merged the training sentences from all sources, sorting them by their cross-entropy difference. To focus on the most conversational-like text, we only included sentences with a cross-entropy difference of -0.2 or smaller. We also removed training sentences that occurred more than once. This reduced our parsed training data from 280B words to a final training set of 2.6B words.
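The selection step can be sketched roughly as follows. The kenlm scorer, the model file names, and the choice of bits as the cross-entropy unit for the -0.2 threshold are assumptions; any toolkit that reports per-word log probabilities could be substituted.

# Illustrative sketch of cross-entropy difference selection.
import math
import kenlm

in_domain = kenlm.Model("aac_in_domain.arpa")       # hypothetical model files
out_domain = kenlm.Model("source_out_domain.arpa")

def cross_entropy(model, sentence):
    """Per-word cross-entropy in bits; kenlm returns log10 probabilities."""
    log10_prob = model.score(sentence, bos=True, eos=True)
    num_tokens = len(sentence.split()) + 1          # +1 for the end-of-sentence event
    return -log10_prob / num_tokens / math.log10(2.0)

def select(sentences, threshold=-0.2):
    """Keep sentences whose in-domain minus out-of-domain cross-entropy
    is at or below the threshold, dropping exact duplicates."""
    seen = set()
    kept = []
    for s in sentences:
        diff = cross_entropy(in_domain, s) - cross_entropy(out_domain, s)
        if diff <= threshold and s not in seen:
            seen.add(s)
            kept.append(s)
    return kept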

Model estimation and pruning. These 4-gram word language models were trained and pruned using the variKN toolkit. We used the following command for training:

varigram_kn --dscale scale1 --dscale2 scale2 --arpa --3nzer --noorder 4 --longint --opti dev_data --clear_history --vocabin=vocab_lower_100k.txt
For dev_data, we used an equal amount of development data from each of the training sets and all our AAC-like development data (38K total words). Here are the scale values we used to generate the models at the top of the page: