Dasher character language model, February 2021


Unlike our previous character language models, this model was trained on a single training set, Common Crawl. It is also a small 4-gram model, in contrast to our previous large 12-gram models. We evaluated the model by computing the average per-word perplexity on 36.6K words of conversational AAC-like data.

Compressed size    Perplexity    Download

The test set consisted of 4,567 sentences (36.6K words). The test set was a combination of: The test set was in lowercase and included limited punctuation (apostrophes, commas, periods, exclamation points, and question marks). We excluded the sentence end word from our perplexity calculations. This was necessary because the models were trained only on sentences ending in a period, exclamation point, or question mark, while some test sentences have no end-of-sentence punctuation. For those sentences, the sentence end word would be very improbable.
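The exclusion of the sentence end word can be illustrated with a minimal sketch. It assumes the model's scores are already available as (symbol, log10 probability) pairs; the function name `perplexity` and the `num_units` parameter are hypothetical, and the evaluation tool actually used is not stated here. By default the sketch averages over the scored symbols; passing a word count as `num_units` gives a per-word figure instead.

```python
SENTENCE_END = "</s>"

def perplexity(scored_sentences, skip=(SENTENCE_END,), num_units=None):
    """Perplexity from per-symbol log10 probabilities.

    scored_sentences: iterable of sentences, each a list of
        (symbol, log10_prob) pairs (assumed input format).
    skip: symbols left out of the calculation, here the sentence end.
    num_units: optional normalizer (for example a word count);
        defaults to the number of symbols that were scored.
    """
    total_log10 = 0.0
    scored = 0
    for sentence in scored_sentences:
        for symbol, log10_prob in sentence:
            if symbol in skip:
                continue  # excluded events contribute nothing
            total_log10 += log10_prob
            scored += 1
    units = scored if num_units is None else num_units
    if units <= 0:
        raise ValueError("nothing to normalize by")
    return 10.0 ** (-total_log10 / units)
```

With the sentence end skipped, a very improbable `</s>` on an unpunctuated test sentence no longer dominates the average.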

This model is licensed under a Creative Commons Attribution 4.0 License.

Training details:

Vocabulary. These models use a 34-symbol vocabulary. The vocabulary contains the lowercase letters a-z, apostrophe, period, exclamation point, question mark, comma, space (<sp>), sentence start (<s>), and sentence end (</s>).
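As a rough illustration of this vocabulary, the sketch below builds the 34 symbols and maps a sentence onto them. The helper name `to_symbols` is hypothetical, and rejecting characters outside the vocabulary is an assumption; how the original pipeline handled such characters is not described here.

```python
import string

SPACE, START, END = "<sp>", "<s>", "</s>"
PUNCTUATION = ["'", ".", "!", "?", ","]

# 26 letters + 5 punctuation marks + 3 special symbols = 34 symbols.
VOCAB = list(string.ascii_lowercase) + PUNCTUATION + [SPACE, START, END]
VOCAB_SET = set(VOCAB)

def to_symbols(sentence):
    """Convert a lowercase sentence into a list of vocabulary symbols."""
    symbols = [START]
    for ch in sentence:
        if ch == " ":
            symbols.append(SPACE)
        elif ch in VOCAB_SET:
            symbols.append(ch)
        else:
            # Assumption: characters outside the vocabulary are an error.
            raise ValueError(f"character not in vocabulary: {ch!r}")
    symbols.append(END)
    return symbols
```

For example, `to_symbols("hi you?")` yields the sentence start symbol, the characters with the space replaced by `<sp>`, and the sentence end symbol.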

Training data. We used the text of web pages from Common Crawl, September 2020 version.

We dropped sentences where 20% or more of the words were out-of-vocabulary (OOV) with respect to our 735K-word list. Each sentence in the parsed data was treated as an independent training example. We removed sentences that occurred more than once on any particular web page. Our final parsed data set consisted of 16.1B sentences totaling 257B words.
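The two per-page filters described above can be sketched as follows. This is a minimal illustration under stated assumptions: words are split on whitespace, a repeated sentence is dropped entirely rather than kept once, and the names `oov_fraction` and `filter_page` are hypothetical.

```python
from collections import Counter

def oov_fraction(words, word_list):
    """Fraction of words that are not in the word list."""
    if not words:
        return 1.0  # assumption: treat empty sentences as fully OOV
    return sum(1 for w in words if w not in word_list) / len(words)

def filter_page(sentences, word_list, max_oov=0.20):
    """Filter the sentences of one web page.

    Drops sentences that occur more than once on the page and
    sentences where max_oov or more of the words are OOV.
    """
    counts = Counter(sentences)
    kept = []
    for sentence in sentences:
        if counts[sentence] > 1:
            continue  # repeated on this page (assumption: drop all copies)
        if oov_fraction(sentence.split(), word_list) >= max_oov:
            continue  # 20% or more of the words are OOV
        kept.append(sentence)
    return kept
```

Each surviving sentence would then be treated as an independent training example.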

Data filtering. We used cross-entropy difference selection to select a subset of sentences from the full training set. We scored each sentence under three different in-domain language models. The out-of-domain model was trained on 100M words of data from Common Crawl. We took the score of a sentence to be the minimum cross-entropy difference under any of the three in-domain models.
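A minimal sketch of the scoring step, assuming each model provides per-token log2 probabilities for a sentence and that a single out-of-domain cross-entropy is shared across the three comparisons. The function names are hypothetical, and the sign convention (in-domain minus out-of-domain, so lower means more in-domain) follows the usual formulation of cross-entropy difference selection.

```python
def cross_entropy(log2_probs):
    """Average negative log2 probability per token, in bits."""
    if not log2_probs:
        raise ValueError("empty sentence")
    return -sum(log2_probs) / len(log2_probs)

def sentence_score(in_domain_entropies, out_domain_entropy):
    """Minimum cross-entropy difference over the in-domain models.

    in_domain_entropies: cross-entropy of the sentence under each
        in-domain model (three values in the setup described above).
    out_domain_entropy: cross-entropy under the out-of-domain model.
    """
    return min(h_in - out_domain_entropy for h_in in in_domain_entropies)
```

A sentence that looks much more likely under any one in-domain model than under the out-of-domain model receives a strongly negative score.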

We chose the cross-entropy difference threshold that minimized the average perplexity on the three in-domain development sets. Only sentences scoring below this threshold were kept in the training data. We trained a 4-gram language model on the resulting subset. The model was trained using MITLM with modified Kneser-Ney smoothing. Discounting parameters were optimized with respect to a development set consisting of equal amounts of data from the three in-domain data sources (10K total words).
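The threshold search could look something like the sketch below. It assumes a simple search over a list of candidate thresholds, which is not necessarily the procedure that was used. The `dev_perplexities` callback is hypothetical: it stands in for training a model on the selected subset and returning its perplexity on each in-domain development set.

```python
def best_threshold(scored_sentences, candidates, dev_perplexities):
    """Pick the threshold with the lowest average development perplexity.

    scored_sentences: list of (sentence, score) pairs, where score is
        the sentence's cross-entropy difference.
    candidates: thresholds to try.
    dev_perplexities: hypothetical callback taking the kept sentences
        and returning one perplexity per development set.
    Returns (threshold, average_perplexity), or None if every
    candidate selects an empty subset.
    """
    best = None
    for threshold in candidates:
        # Keep only sentences scoring below the threshold.
        subset = [s for s, score in scored_sentences if score < threshold]
        if not subset:
            continue
        ppls = dev_perplexities(subset)
        average = sum(ppls) / len(ppls)
        if best is None or average < best[1]:
            best = (threshold, average)
    return best
```

In practice each call to the callback would involve a full model training run, so the number of candidate thresholds would likely be kept small.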