Word language models, February 2019
Models:
Here are our trained ARPA-format 4-gram language models.
We evaluated each model by computing the average per-word perplexity on 37K words of conversational AAC-like data.
| Name   | Compressed size | Perplexity | Download |
|--------|-----------------|------------|----------|
| Small  | 103 MB          | 73.2       | [Link]   |
| Medium | 201 MB          | 69.9       | [Link]   |
| Large  | 682 MB          | 65.9       | [Link]   |
| Huge   | 5.1 GB          | 63.4       | [Link]   |
The test set was a combination of:
- COMM - Sentences written in response to hypothetical communication situations (2.1K words). [Link]
- COMM2 - Crowdsourced sentences written in response to hypothetical communication situations (13K words). [Link]
- Specialists - Phrases suggested by AAC specialists. Taken from University of Nebraska-Lincoln web pages that are no longer available (4.2K words).
- Enron mobile - Messages written by Enron employees on Blackberry mobile devices (18K words). [Link]
We excluded the sentence end word from our perplexity calculations.
The test set was in lowercase and included limited punctuation (apostrophes, commas, periods, exclamation points, and question marks).
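For reference, here is a minimal sketch of how a per-word perplexity of this kind can be computed from an ARPA model, assuming the kenlm Python module (the model path and test sentence are placeholders, not our actual evaluation code). As in the evaluation above, no probability is assigned to the sentence end word.

```python
import kenlm

def average_perplexity(arpa_path, sentences):
    """Average per-word perplexity over tokenized, lowercased sentences."""
    model = kenlm.Model(arpa_path)
    total_log10 = 0.0
    total_words = 0
    for sentence in sentences:
        # eos=False: no probability is assigned to the sentence end word </s>.
        for log10_prob, ngram_len, oov in model.full_scores(sentence, bos=True, eos=False):
            total_log10 += log10_prob
            total_words += 1
    # Per-word perplexity = 10^(-average log10 probability).
    return 10.0 ** (-total_log10 / total_words)

# e.g. average_perplexity("lm_word_small.arpa", ["do you want to go out for lunch ?"])
```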
These models are licensed under a Creative Commons Attribution 4.0 License.
Training details:
Vocabulary.
These models use a 100K vocabulary that includes an unknown word and words for commas, periods, exclamation points, and question marks.
We first trained a unigram language model on each of our training sources.
We only included words that were in a list of 697K English words obtained from human-edited dictionaries, including a February 2019 parse of Wiktionary.
We converted all words to lowercase.
We trained a linear mixture model using equal weights for each of the unigram models.
Finally, we took the most probable 100K words as our vocabulary.
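A rough sketch of this vocabulary selection step, assuming each training source is available as an in-memory list of tokens and the curated word list is loaded as a Python set (the names here are illustrative, not the actual pipeline):

```python
from collections import Counter

def unigram_distribution(tokens, allowed_words):
    """Lowercased unigram probabilities restricted to the curated word list."""
    counts = Counter(w.lower() for w in tokens if w.lower() in allowed_words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def select_vocab(source_token_lists, allowed_words, vocab_size=100_000):
    # Equal-weight linear mixture of the per-source unigram models.
    mixture = Counter()
    weight = 1.0 / len(source_token_lists)
    for tokens in source_token_lists:
        for word, prob in unigram_distribution(tokens, allowed_words).items():
            mixture[word] += weight * prob
    # Keep the most probable words under the mixture as the vocabulary.
    return [w for w, _ in mixture.most_common(vocab_size)]
```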
Training data.
We used the following training sources:
- Amazon - Amazon review corpus (4.0B words). [Link]
- Common - Text of web pages from Common Crawl, Oct 2018 version (185B words). [Link]
- Blog - ICWSM 2009 Spinn3r Blog Dataset (599M words). [Link]
- Book - Books from Project Gutenberg (1.2B words). [Link]
- Email - Archived email messages from Apache lists, through Jan 2019 (167M words). [Link]
- Forum - Messages parsed from web forums (278M words). [Link]
- News - Newswire text from the one billion word language modeling benchmark (272M words). [Link]
- Reddit - Reddit corpus, Dec 2005 - Oct 2017 (70.3B words). [Link]
- Social - Social media data from the ICWSM 2011 Spinn3r blog data set (829M words). [Link]
- Subtitle - OpenSubtitles corpus (677M words). [Link]
- Twitter - Twitter messages sampled between Dec 2010 and Feb 2019 (16B words).
- Usenet - WestburyLAB reduced redundancy USENET corpus (2005-2011) (1.6B words). [Link]
- Wiki - Dump of Wikipedia including talk pages but not page history (1.5B words). [Link]
- Yelp - Yelp open dataset, 6.7M restaurant reviews (470M words). [Link]
We dropped sentences where 20% or more of the words were out-of-vocabulary (OOV) with respect to our 698K word list.
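A minimal sketch of this filter, assuming tokenized sentences and the word list held in a Python set:

```python
def keep_sentence(tokens, word_list, max_oov_fraction=0.2):
    """Keep a sentence only if fewer than 20% of its tokens are OOV."""
    if not tokens:
        return False
    oov = sum(1 for w in tokens if w not in word_list)
    return oov / len(tokens) < max_oov_fraction
```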
We wanted our models to learn across end-of-sentence punctuation words.
After each parsed sentence from a training set, we concatenated another sentence from that training set with probability 0.5.
Thus each training sentence consisted of one or more actual sentences from the original training data.
We only concatenated sentences that came from the same apparent author (i.e. from the same tweet, blog post, web page, etc.).
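A sketch of this concatenation step, assuming the parsed sentences are already grouped by apparent author (tweet, blog post, web page, etc.); repeating the coin flip so that runs of more than two sentences can form is one reading of the procedure above:

```python
import random

def concatenate_sentences(author_sentences, p=0.5, rng=random):
    """author_sentences: list of tokenized sentences from one author/document."""
    merged = []
    i = 0
    while i < len(author_sentences):
        current = list(author_sentences[i])
        i += 1
        # With probability p, append the next sentence from the same author,
        # so a training sentence can span end-of-sentence punctuation words.
        while i < len(author_sentences) and rng.random() < p:
            current.extend(author_sentences[i])
            i += 1
        merged.append(current)
    return merged
```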
We used cross-entropy difference selection to select training sentences.
Our in-domain language model was trained on 30K words of conversational AAC-like data.
The out-of-domain model was trained on a similar amount of data from the training source being selected from.
We merged training sentences from all sources, sorting them by their cross-entropy difference.
To focus on the most conversational-like text, we only included sentences with a cross-entropy difference of -0.2 or smaller.
We also removed training sentences that occurred more than once.
This reduced our parsed training data from 280B words to a final training set of 2.6B words.
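A sketch of the selection and de-duplication steps, in the style of Moore-Lewis cross-entropy difference selection, assuming kenlm stands in for the in-domain and out-of-domain language models; the log base and per-word normalization are assumptions, and the -0.2 threshold matches the text above:

```python
import kenlm

def select_training_sentences(sentences, in_domain_arpa, out_domain_arpa, threshold=-0.2):
    in_lm = kenlm.Model(in_domain_arpa)
    out_lm = kenlm.Model(out_domain_arpa)
    seen = set()
    scored = []
    for sentence in sentences:
        n = len(sentence.split())
        if n == 0 or sentence in seen:
            continue  # drop empty and duplicate training sentences
        seen.add(sentence)
        # Per-word cross-entropy difference: in-domain minus out-of-domain.
        diff = (-in_lm.score(sentence) / n) - (-out_lm.score(sentence) / n)
        if diff <= threshold:
            scored.append((diff, sentence))
    # Sort so the most conversational-like sentences come first.
    scored.sort()
    return [s for _, s in scored]
```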
Model estimation and pruning.
These 4-gram word language models were trained and pruned using the variKN toolkit.
We used the following command for training:
varigram_kn --dscale scale1 --dscale2 scale2 --arpa --3nzer --noorder 4 --longint --opti dev_data --clear_history --vocabin=vocab_lower_100k.txt
For dev_data, we used an equal amount of development data from each of the training sets and all our AAC-like development data (38K total words).
Here are the scale values we used to generate the models at the top of the page:
- Small - scale1 1e-1, scale2 9e-1
- Medium - scale1 1e-1, scale2 4e-1
- Large - scale1 1e-2, scale2 1e-1
- Huge - scale1 1e-2, scale2 1e-2