Character language models, December 2019
Models:
Here are our trained ARPA format 12-gram language models.
We evaluated each model by computing the average per-character perplexity on conversational AAC-like data.
Name | Compressed size | Perplexity | Download |
Tiny | 22MB | 3.1805 | [Link] |
Small | 114MB | 2.8757 | [Link] |
Medium | 213MB | 2.8097 | [Link] |
Large | 473MB | 2.7581 | [Link] |
Huge | 2.7GB | 2.7263 | [Link] |
The test set consisted of 4,567 sentences (36.6K words, 161K characters). It was a combination of:
- COMM - Sentences written in response to hypothetical communication situations (2.1K words). [Link]
- COMM2 - Crowdsourced sentences written in response to hypothetical communication situations (13K words). [Link]
- Specialists - Phrases suggested by AAC specialists. Taken from University of Nebraska-Lincoln web pages that are no longer available (4.2K words).
- Enron mobile - Messages written by Enron employees on Blackberry mobile devices (18K words). [Link]
The test set was in lowercase and included limited punctuation (apostrophes, commas, periods, exclamation points, and question marks).
We excluded the sentence end word from our perplexity calculations.
This was necessary since the models were trained only on sentences ending in a period, exclamation point, or question mark; for test sentences without end-of-sentence punctuation, this made the sentence end word very improbable.
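As an illustration, the average per-character perplexity could be computed along the following lines using the KenLM Python bindings. The file names and tokenization details are placeholders, not our actual evaluation tooling; the key point is scoring with eos=False so the sentence end word is excluded.

```python
# Illustrative sketch: average per-character perplexity with the KenLM Python
# bindings. File names are placeholders; this is not the actual evaluation script.
import kenlm

def to_char_tokens(sentence):
    """Map a lowercase sentence to character tokens, with <sp> for the spaces
    between words."""
    return ["<sp>" if ch == " " else ch for ch in sentence]

model = kenlm.Model("large_char_12gram.arpa")  # placeholder file name

total_log10 = 0.0
total_chars = 0
with open("test_sentences.txt") as f:          # placeholder test file
    for line in f:
        tokens = to_char_tokens(line.strip())
        # bos=True adds the sentence start word; eos=False excludes the
        # sentence end word from the score, as described above.
        total_log10 += model.score(" ".join(tokens), bos=True, eos=False)
        total_chars += len(tokens)

# Per-character perplexity: 10 to the negative average log10 probability.
print(10 ** (-total_log10 / total_chars))
```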
These models are licensed under a Creative Commons Attribution 4.0 License.
There are also BerkeleyLM and KenLM versions of these models (see bottom of page).
Training details:
Vocabulary.
These models use a 34-symbol vocabulary.
The vocabulary contains the lowercase letters a-z, apostrophe, period, exclamation point, question mark, comma, space (<sp>), sentence start (<s>), and sentence end (</s>).
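Written out in Python (the variable name is ours), the vocabulary is:

```python
# The 34-symbol character vocabulary: 26 lowercase letters, 5 punctuation
# symbols, and 3 pseudo-words for space, sentence start, and sentence end.
import string

VOCAB = list(string.ascii_lowercase) + ["'", ".", "!", "?", ","] + ["<sp>", "<s>", "</s>"]
assert len(VOCAB) == 34
```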
Training data.
We used the following training sources:
- Amazon - Amazon review corpus (4.0B words). [Link]
- Common - Text of web pages from Common Crawl, Oct 2018 version (185B words). [Link]
- Blog - ICWSM 2009 Spinn3r Blog Dataset (599M words). [Link]
- Book - Books from Project Gutenberg (1.2B words). [Link]
- Email - Archived email messages from Apache lists, through Jan 2019 (167M words). [Link]
- Forum - Messages parsed from web forums (278M words). [Link]
- News - Newswire text from the one billion word language modeling benchmark (272M words). [Link]
- Reddit - Reddit corpus, Dec 2005 - May 2019 (82.0B words). [Link]
- Social - Social media data from the ICWSM 2011 Spinn3r blog data set (829M words). [Link]
- Subtitle - OpenSubtitles corpus (677M words). [Link]
- Twitter - Twitter messages sampled between Dec 2010 and Aug 2019 (18B words).
- Usenet - WestburyLAB reduced redundancy USENET corpus (2005-2011) (1.6B words). [Link]
- Wiki - Dump of Wikipedia including talk pages but not page history (1.5B words). [Link]
- Yelp - Yelp open dataset, 6.7M restaurant reviews (470M words). [Link]
We dropped sentences where 20% or more of the words were out-of-vocabulary (OOV) with respect to our 698K-word list.
We wanted our models to learn dependencies that span end-of-sentence punctuation.
After each parsed sentence from a training set, we concatenated another sentence from that training set with probability 0.5.
Thus each training sentence consisted of one or more actual sentences from the original training data.
We only concatenated sentences that came from the same apparent author (i.e. from the same tweet, blog post, web page, etc.).
Within each training set, we removed training sentences that occurred more than once.
Finally, we lowercased the word-level training data and converted it into character-level data.
We used the pseudo-word <sp> to represent the spaces between words.
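A sketch of this preprocessing, assuming the sentences from one author have already been grouped together (function names are illustrative, not our actual pipeline):

```python
# Sketch of the preprocessing described above: with probability 0.5, keep
# appending the author's next sentence, then lowercase and convert to
# character tokens with <sp> between words. Names are illustrative only.
import random

def concatenate_sentences(author_sentences, p=0.5, rng=random):
    """Group consecutive sentences from one author into training sentences."""
    grouped, i = [], 0
    while i < len(author_sentences):
        combined = author_sentences[i]
        i += 1
        # After each sentence, append the next one with probability p.
        while i < len(author_sentences) and rng.random() < p:
            combined += " " + author_sentences[i]
            i += 1
        grouped.append(combined)
    return grouped

def to_char_level(sentence):
    """Lowercase a word-level sentence and convert it to character tokens."""
    return " ".join("<sp>" if ch == " " else ch for ch in sentence.lower())

# Example: "How are you? I'm fine." becomes
# "h o w <sp> a r e <sp> y o u ? <sp> i ' m <sp> f i n e ."
print(to_char_level("How are you? I'm fine."))
```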
Data filtering.
We used cross-entropy difference selection to select a subset of sentences from each training set.
We scored each sentence under three different in-domain character language models:
- AAC - Conversational AAC-like data obtained via crowdsourcing (30K words). [Link]
- Short email - Sentences with between 1 and 12 words taken from the W3C email, TREC, and SpamAssassin corpora (2.1M words). [Link] [Link] [Link]
- Dialogue - Turns from everyday conversations, 90% of the training set of the Daily Dialogue corpus (948K words). [Link]
The out-of-domain models were trained on 100M words of data from each training source.
We took the score of a sentence to be the minimum cross-entropy difference under any of the three in-domain models.
For each training set, we found an optimal cross-entropy difference threshold.
Only sentences scoring below this threshold were kept in the training data.
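A sketch of this selection step is below; the scoring functions are placeholders for per-character cross-entropies under the in-domain models and under the out-of-domain model for a given source.

```python
# Sketch of cross-entropy difference selection. The scoring functions are
# placeholders: each maps a sentence to its per-character cross-entropy under
# one in-domain model or under the out-of-domain model for this source.

def cross_entropy_difference(sentence, in_domain_ces, out_domain_ce):
    """Minimum over the in-domain models of (in-domain cross-entropy minus
    out-of-domain cross-entropy)."""
    out_h = out_domain_ce(sentence)
    return min(ce(sentence) - out_h for ce in in_domain_ces)

def select_sentences(sentences, in_domain_ces, out_domain_ce, threshold):
    """Keep only sentences scoring below the threshold."""
    return [s for s in sentences
            if cross_entropy_difference(s, in_domain_ces, out_domain_ce) < threshold]
```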
We trained a 12-gram language model on the selected subset of each training set.
This model was trained with MITLM using modified Kneser-Ney smoothing.
No pruning was done during training.
Discounting parameters were optimized with respect to a development set consisting of equal amounts of data from the three in-domain data sources (10K total words).
The optimal cross-entropy threshold was the one that resulted in a model that minimized the perplexity on this development set.
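In other words, the threshold search was a small grid search. A sketch, where train_subset_model and dev_perplexity are stand-ins for the MITLM training and development-set evaluation steps, not released tooling:

```python
# Sketch of the threshold search: train a model on each candidate subset and
# keep the threshold whose model minimizes development-set perplexity.
# train_subset_model and dev_perplexity are stand-ins, not released tooling.

def best_threshold(candidate_thresholds, scored_sentences, train_subset_model, dev_perplexity):
    """scored_sentences: (sentence, cross-entropy difference score) pairs."""
    best, best_ppl = None, float("inf")
    for threshold in candidate_thresholds:
        subset = [s for s, score in scored_sentences if score < threshold]
        ppl = dev_perplexity(train_subset_model(subset))
        if ppl < best_ppl:
            best, best_ppl = threshold, ppl
    return best
```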
Mixture model training.
We computed the mixture weights between the 14 trained models with respect to the development data using the compute-best-mix script of SRILM.
We found that only the Subtitle, Reddit, Twitter, and Common sets had a mixture weight greater than 0.03.
As such, we created our mixture models with just these four sets.
We created a mixture model from our four models trained with modified Kneser-Ney smoothing.
For comparison, we created a second mixture model using four models trained with Witten-Bell smoothing.
The unpruned modified Kneser-Ney mixture was 37GB compressed (3,812,961,036 n-grams) and had a perplexity of 2.684 on our test set.
The unpruned Witten-Bell mixture was 34GB compressed (3,812,960,848 n-grams) and had a perplexity of 2.729 on our test set.
For our final model set, we used the Witten-Bell mixture.
Effectively pruning the modified Kneser-Ney model requires training and loading an 11-gram model trained with another smoothing method [Link].
This increased memory demands during pruning.
Further, we found that for aggressive amounts of pruning, pruned Witten-Bell models outperformed similarly sized modified Kneser-Ney models pruned using -prune-history-lm and a Good-Turing 11-gram model.
Details of the final Witten-Bell mixture model were as follows:
| Subtitle | Reddit | Twitter | Common |
Mixture weight | 0.353 | 0.260 | 0.174 | 0.331 |
Cross-entropy threshold | 0.10 | 0.00 | 0.05 | 0.00 |
Training characters | 739M | 5.44B | 8.46B | 6.45B |
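Conceptually, the mixture assigns each next-character probability as a weighted sum of the component model probabilities. A minimal sketch is below; the component probability functions are placeholders, and the actual mixtures were built from the trained ARPA models, not with this code.

```python
# Linear interpolation: P(char | history) is a weighted sum of the component
# model probabilities. component_probs maps a source name to a placeholder
# conditional-probability function; the weights should sum to one.

def mixture_prob(history, char, component_probs, weights):
    return sum(w * component_probs[name](history, char)
               for name, w in weights.items())
```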
Pruning.
We used SRILM's ngram tool to entropy-prune the 12-gram mixture model.
The thresholds for producing the Tiny, Small, Medium, Large, and Huge models were 4e-9, 1.3e-9, 8e-10, 4e-10, and 4e-11 respectively.
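For reference, this step can be scripted. A sketch invoking SRILM's ngram tool via Python's subprocess module, where the input and output file names are placeholders:

```python
# Sketch of producing the five pruned models with SRILM's ngram tool.
# File names are placeholders; -prune applies entropy-based pruning at the
# given threshold and -write-lm writes the pruned ARPA model.
import subprocess

THRESHOLDS = {"tiny": 4e-9, "small": 1.3e-9, "medium": 8e-10, "large": 4e-10, "huge": 4e-11}

for name, threshold in THRESHOLDS.items():
    subprocess.run(
        ["ngram", "-order", "12",
         "-lm", "mixture_wb.arpa",             # placeholder: unpruned mixture
         "-prune", str(threshold),
         "-write-lm", f"{name}_12gram.arpa"],  # placeholder output name
        check=True,
    )
```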
Other models:
Some of our software clients use the excellent BerkeleyLM and KenLM libraries for making efficient language model queries.
We also provide binary versions of our models compatible with these two libraries.
BerkeleyLM:
Name | Compressed size | Perplexity | Download |
Tiny | 39MB | 3.180 | [Link] |
Small | 198MB | 2.876 | [Link] |
Medium | 353MB | 2.810 | [Link] |
Large | 782MB | 2.762 | [Link] |
Huge | 4.4GB | 2.726 | [Link] |
The BerkeleyLM models are context-encoded using the -e switch of the MakeLmBinaryFromArpa program.
KenLM:
Name | Compressed size | Perplexity | Download |
Tiny | 15MB | 3.190 | [Link] |
Small | 71MB | 2.883 | [Link] |
Medium | 127MB | 2.818 | [Link] |
Large | 231MB | 2.766 | [Link] |
Huge | 1.4GB | 2.742 | [Link] |
The KenLM models are trie models. We used 10 bits to quantize probabilities and 8 bits to quantize backoff weights.
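As a usage example, one of the KenLM binaries could be queried from Python roughly as follows (the file name is a placeholder):

```python
# Usage sketch: querying one of the KenLM trie models from Python. The file
# name is a placeholder; input must be lowercase character tokens separated
# by spaces, with <sp> standing in for the spaces between words.
import kenlm

model = kenlm.Model("huge_char_12gram.klm")  # placeholder file name

# "i'm hungry" as character tokens:
tokens = "i ' m <sp> h u n g r y"
print(model.score(tokens, bos=True, eos=False))  # total log10 probability
```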