Word language models, December 2019
Models:
Here are our trained ARPA-format 4-gram language models.
We evaluated each model by computing the average per-word perplexity on 36.6K words of conversational AAC-like data.
| Name | Compressed size | Perplexity | Download |
| --- | --- | --- | --- |
| Tiny | 25MB | 90.1 | [Link] |
| Small | 110MB | 73.7 | [Link] |
| Medium | 205MB | 69.2 | [Link] |
| Large | 662MB | 63.3 | [Link] |
| Huge | 4.9GB | 59.1 | [Link] |
The test set consisted of 4,567 sentences (36.6K words) and was a combination of:
- COMM - Sentences written in response to hypothetical communication situations (2.1K words). [Link]
- COMM2 - Crowdsourced sentences written in response to hypothetical communication situations (13K words). [Link]
- Specialists - Phrases suggested by AAC specialists. Taken from University of Nebraska-Lincoln web pages that are no longer available (4.2K words).
- Enron mobile - Messages written by Enron employees on Blackberry mobile devices (18K words). [Link]
The test set was in lowercase and included limited punctuation (apostrophes, commas, periods, exclamation points, and question marks).
We excluded the sentence end word from our perplexity calculations.
This was necessary since the models were trained only on sentences ending in a period, exclamation point, or question mark.
Because some test sentences have no end-of-sentence punctuation, the sentence end word would otherwise have been very improbable.
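For concreteness, here is a minimal sketch of how such a perplexity could be computed with the kenlm Python module, dropping the final sentence-end entry from each sentence's scores; the file names are placeholders, not the actual evaluation setup.

```python
# Sketch: corpus-level per-word perplexity with the kenlm Python module,
# excluding the sentence-end word from the calculation.  File names are
# placeholders, not the actual evaluation setup.
import kenlm

model = kenlm.Model("large.arpa")            # placeholder path to one of the models
log10_sum = 0.0
word_count = 0

with open("test_sentences.txt") as f:        # placeholder test set, one sentence per line
    for line in f:
        sentence = line.strip()
        if not sentence:
            continue
        # full_scores yields (log10 prob, ngram order, is_oov) per word,
        # plus a final entry for the sentence-end word </s>, which we drop.
        scores = list(model.full_scores(sentence, bos=True, eos=True))[:-1]
        log10_sum += sum(s[0] for s in scores)
        word_count += len(scores)

perplexity = 10 ** (-log10_sum / word_count)
print(f"{word_count} words, perplexity {perplexity:.1f}")
```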
These models are licensed under a Creative Commons Attribution 4.0 License.
There are also BerkeleyLM and KenLM versions of these models (see bottom of page).
Training details:
Vocabulary.
These models use a 100K vocabulary that includes an unknown word and words for commas, periods, exclamation points, and question marks.
We first trained a unigram language model on each of our training sources.
We only included words that were in a list of 697K English words obtained from human-edited dictionaries, including a February 2019 parse of Wiktionary.
We converted all words to lowercase.
We trained a linear mixture model using equal weights for each of the unigram models.
Finally, we took the most probable 100K words as our vocabulary.
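A rough sketch of this vocabulary selection, assuming plain-text training files and a simple whitespace tokenizer (the special unknown and punctuation words are omitted for brevity):

```python
# Sketch: selecting a 100K vocabulary from equal-weight unigram mixtures.
# File names and the whitespace tokenizer are placeholders; the special
# unknown and punctuation words are omitted for brevity.
from collections import Counter

def unigram_model(path, allowed_words):
    """Relative-frequency unigram model restricted to the allowed word list."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            for word in line.lower().split():
                if word in allowed_words:
                    counts[word] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

with open("word_list.txt") as f:                      # placeholder 697K word list
    allowed = {w.strip().lower() for w in f if w.strip()}

sources = ["amazon.txt", "common.txt", "reddit.txt"]  # placeholder subset of sources

# Equal-weight linear mixture of the per-source unigram models.
mixture = Counter()
weight = 1.0 / len(sources)
for path in sources:
    for word, prob in unigram_model(path, allowed).items():
        mixture[word] += weight * prob

# The most probable 100K words under the mixture become the vocabulary.
vocab = [w for w, _ in mixture.most_common(100_000)]
```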
Training data.
We used the following training sources:
- Amazon - Amazon review corpus (4.0B words). [Link]
- Common - Text of web pages from Common Crawl, Oct 2018 version (185B words). [Link]
- Blog - ICWSM 2009 Spinn3r Blog Dataset (599M words). [Link]
- Book - Books from Project Gutenberg (1.2B words). [Link]
- Email - Archived email messages from Apache lists, through Jan 2019 (167M words). [Link]
- Forum - Messages parsed from web forums (278M words). [Link]
- News - Newswire text from the one billion word language modeling benchmark (272M words). [Link]
- Reddit - Reddit corpus, Dec 2005 - May 2019 (82.0B words). [Link]
- Social - Social media data from the ICWSM 2011 Spinn3r blog data set (829M words). [Link]
- Subtitle - OpenSubtitles corpus (677M words). [Link]
- Twitter - Twitter messages sampled between Dec 2010 and Aug 2019 (18B words).
- Usenet - WestburyLAB reduced redundancy USENET corpus (2005-2011) (1.6B words). [Link]
- Wiki - Dump of Wikipedia including talk pages but not page history (1.5B words). [Link]
- Yelp - Yelp open dataset, 6.7M restaurant reviews (470M words). [Link]
We dropped sentences where 20% or more of the words were out-of-vocabulary (OOV) with respect to our 698K word list.
We wanted our models to learn n-gram statistics that span end-of-sentence punctuation words.
After each parsed sentence from a training set, we concatenated another sentence from that training set with probability 0.5.
Thus each training sentence consisted of one or more actual sentences from the original training data.
We only concatenated sentences that came from the same apparent author (i.e. from the same tweet, blog post, web page, etc).
Within each training set, we removed training sentences that occurred more than once.
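One way this concatenation and de-duplication could be implemented is sketched below; the grouping of sentences by document/author and the reading of "removed" as keeping a single copy are assumptions.

```python
# Sketch: concatenating consecutive same-author sentences with probability 0.5
# and then de-duplicating.  The grouping of sentences by document/author and
# the reading of "removed" as keeping a single copy are assumptions.
import random

def build_training_sentences(docs, p_concat=0.5, seed=0):
    """docs maps a document id (tweet, blog post, web page, ...) to its sentences."""
    rng = random.Random(seed)
    out = []
    for sentences in docs.values():
        i = 0
        while i < len(sentences):
            combined = sentences[i]
            i += 1
            # With probability 0.5, append the next sentence from the same
            # document; keep flipping until a flip fails or sentences run out.
            while i < len(sentences) and rng.random() < p_concat:
                combined += " " + sentences[i]
                i += 1
            out.append(combined)
    # Drop repeated training sentences, keeping the first occurrence.
    seen, unique = set(), []
    for s in out:
        if s not in seen:
            seen.add(s)
            unique.append(s)
    return unique

docs = {"tweet_1": ["i am on my way .", "see you soon !"],
        "post_2": ["what a great day ."]}
print(build_training_sentences(docs))
```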
Data filtering.
We used cross-entropy difference selection to select a subset of sentences from each training set.
We scored each sentence under three different in-domain language models:
- AAC - Conversational AAC-like data obtained via crowdsourcing (30K words). [Link]
- Short email - Sentences with between 1 and 12 words taken from the W3C email, TREC, and SpamAssassin corpora (2.1M words). [Link] [Link] [Link]
- Dialogue - Turns from everyday conversations, 90% of the training set of the DailyDialog corpus (948K words). [Link]
For each training source, the corresponding out-of-domain model was trained on 100M words of data from that source.
We took the score of a sentence to be the minimum cross-entropy difference under any of the three in-domain models.
For each training set, we found an optimal cross-entropy difference threshold.
Only sentences scoring below this threshold were kept in the training data.
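A sketch of this selection step using the kenlm module is shown below; the model paths, the bits-per-word normalization, and the example threshold are assumptions rather than the exact setup used.

```python
# Sketch of cross-entropy difference selection with the kenlm module.
# Model paths, the per-word bits normalization, and the threshold value are
# placeholders; the real in-domain models were the AAC, short email, and
# dialogue models described above.
import math
import kenlm

in_domain = [kenlm.Model(p) for p in
             ("aac.arpa", "short_email.arpa", "dialogue.arpa")]  # placeholder paths
out_of_domain = kenlm.Model("source_100m.arpa")                  # placeholder path

def cross_entropy(model, sentence):
    """Per-word cross-entropy in bits, counting the sentence-end word."""
    n_words = len(sentence.split()) + 1          # +1 for </s>
    log10_prob = model.score(sentence, bos=True, eos=True)
    return -log10_prob / math.log10(2) / n_words

def ce_difference(sentence):
    """Minimum in-domain minus out-of-domain cross-entropy (lower = keep)."""
    h_out = cross_entropy(out_of_domain, sentence)
    return min(cross_entropy(m, sentence) - h_out for m in in_domain)

threshold = 0.10   # placeholder; a separate threshold was tuned per training source
with open("source_sentences.txt") as f:          # placeholder training-source file
    selected = [s.strip() for s in f
                if s.strip() and ce_difference(s.strip()) < threshold]
```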
We trained a 4-gram language model on the selected subset of each training set.
This model was trained using MITLM with modified Kneser-Ney smoothing.
Discounting parameters were optimized with respect to a development set consisting of equal amounts of data from the three in-domain data sources (10K total words).
The optimal cross-entropy threshold was the one that resulted in a model that minimized the perplexity on this development set.
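The threshold search might be organized roughly as follows; `train_mkn_4gram` and `dev_perplexity` are hypothetical helpers standing in for an MITLM training run and a development-set perplexity evaluation.

```python
# Sketch of the per-source threshold search.  `train_mkn_4gram` and
# `dev_perplexity` are hypothetical helpers standing in for an MITLM training
# run (4-gram, modified Kneser-Ney, discounts tuned on the dev set) and a
# perplexity evaluation on the 10K-word development set.
def pick_threshold(scored_sentences, candidate_thresholds,
                   train_mkn_4gram, dev_perplexity):
    """Return the threshold whose selected subset yields the lowest dev perplexity."""
    best_threshold, best_ppl = None, float("inf")
    for threshold in candidate_thresholds:
        subset = [s for s, score in scored_sentences if score < threshold]
        model = train_mkn_4gram(subset)   # e.g., a wrapper around MITLM's estimate-ngram
        ppl = dev_perplexity(model)
        if ppl < best_ppl:
            best_threshold, best_ppl = threshold, ppl
    return best_threshold, best_ppl
```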
Mixture model training.
The resulting models from each training set at their optimal cross-entropy threshold were combined via linear interpolation into a single mixture model.
The mixture weights were optimized with respect to the development set.
At first we created a mixture model from all 14 training sets.
Since only the Subtitle, Reddit, Twitter, and Common sets had a mixture weight greater than 0.01, we simplified and created a mixture model from just these four sets.
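This page does not state how the interpolation weights were optimized; a common choice is EM on the development set, sketched here under that assumption.

```python
# Sketch of estimating linear-interpolation weights on a development set with
# EM (the optimizer actually used is not stated on this page, so EM is an
# assumption).  probs[i][j] holds model i's probability for development word j.
def interpolation_weights(probs, iterations=50):
    n_models, n_words = len(probs), len(probs[0])
    weights = [1.0 / n_models] * n_models      # start from a uniform mixture
    for _ in range(iterations):
        totals = [0.0] * n_models
        for j in range(n_words):
            mix = sum(weights[i] * probs[i][j] for i in range(n_models))
            for i in range(n_models):
                totals[i] += weights[i] * probs[i][j] / mix
        weights = [t / n_words for t in totals]
    return weights

# Toy example: two component models scored on a three-word development set.
print(interpolation_weights([[0.20, 0.10, 0.40], [0.05, 0.30, 0.10]]))
```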
Details for each set were as follows:
| | Subtitle | Reddit | Twitter | Common |
| --- | --- | --- | --- | --- |
| Mixture weight | 0.27 | 0.18 | 0.13 | 0.43 |
| Cross-entropy threshold | 0.30 | 0.10 | 0.10 | 0.10 |
| Training words | 113M | 3.3B | 1.2B | 3.9B |
The unpruned 4-model mixture has a compressed disk size of 37GB and a perplexity on our test set of 58.15.
The unpruned 14-model mixture has a compressed disk size of 40GB and a perplexity on our test set of 58.20.
Pruning.
We used SRILM's ngram tool to entropy-prune the 4-gram mixture model.
For pruning purposes, we used a 3-gram mixture model trained with Good-Turing discounting.
The 3-gram mixture model used the same training data as the 4-gram MKN model.
We optimized the mixture weights on the same development data as the 4-gram model.
The thresholds for producing the Tiny, Small, Medium, Large, and Huge models were 1e-7, 1.6e-8, 7e-9, 1.3e-9, and 4e-11 respectively.
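For illustration, a pruning run at these thresholds might look roughly like the sketch below; the file names are placeholders, and supplying the separate Good-Turing 3-gram model via -prune-history-lm is an assumption about the exact SRILM invocation, not a record of the commands actually used.

```python
# Sketch: entropy pruning the unpruned 4-gram mixture with SRILM's ngram tool
# at the five thresholds above.  File names are placeholders; passing the
# separate Good-Turing 3-gram model via -prune-history-lm is an assumption
# about the exact invocation.
import subprocess

thresholds = {"tiny": "1e-7", "small": "1.6e-8", "medium": "7e-9",
              "large": "1.3e-9", "huge": "4e-11"}

for name, threshold in thresholds.items():
    subprocess.run(
        ["ngram",
         "-order", "4",
         "-lm", "mixture_4gram_mkn.arpa",               # unpruned MKN mixture (placeholder)
         "-prune", threshold,
         "-prune-history-lm", "mixture_3gram_gt.arpa",  # Good-Turing 3-gram (assumed flag)
         "-write-lm", f"{name}.arpa"],
        check=True)
```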
Other models:
Some of our software clients use the excellent BerkeleyLM and KenLM libraries for making efficient language model queries.
We also provide binary versions of our models compatible with these two libraries.
BerkeleyLM:
| Name | Compressed size | Perplexity | Download |
| --- | --- | --- | --- |
| Tiny | 36MB | 90.1 | [Link] |
| Small | 148MB | 73.7 | [Link] |
| Medium | 254MB | 69.2 | [Link] |
| Large | 837MB | 63.3 | [Link] |
| Huge | 6.2GB | 59.1 | [Link] |
The BerkeleyLM models are context encoded using the -e switch of the MakeLmBinaryFromArpa program.
KenLM:
| Name | Compressed size | Perplexity | Download |
| --- | --- | --- | --- |
| Tiny | 18MB | 90.1 | [Link] |
| Small | 72MB | 73.6 | [Link] |
| Medium | 132MB | 69.2 | [Link] |
| Large | 391MB | 63.2 | [Link] |
| Huge | 2.9GB | 58.9 | [Link] |
The KenLM models are trie models. We used 10 bits to quantize probabilities and 8 bits to quantize backoff weights.
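As a sketch of how such a quantized trie binary could be built from an ARPA model and queried from Python (the paths are placeholders and this is not necessarily the exact build command used to produce the downloads above):

```python
# Sketch: building a quantized KenLM trie binary (10-bit probabilities,
# 8-bit backoffs) and querying it from Python.  Paths are placeholders,
# not the exact commands used to produce the downloads above.
import subprocess
import kenlm

subprocess.run(
    ["build_binary", "-q", "10", "-b", "8", "trie",
     "large.arpa", "large.klm"],
    check=True)

model = kenlm.Model("large.klm")
# Total log10 probability of the sentence, including <s> and </s>.
print(model.score("how was your day ?", bos=True, eos=True))
```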