Software and data sets

Language models, Feb 2019 - Word language models trained on 2.6 billion words of text from a diverse set of training sources.

Filtered Turk dialogues - A filtered set of the dialogues invented by Amazon Mechanical Turk workers in previous work. We removed dialogues that contained potentially offensive language. This set served as the basis for our two-sided audio dialogue collection.

Audio collector - Java program for collecting two-sided audio dialogues. The program can record from multiple microphones simultaneously. This allows you to compare the recognition accuracy of different microphones on the same speech data.
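
One common way to compare recognition accuracy across microphones is word error rate (WER): the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The listing above does not say which metric was used, so the following is only a minimal Python sketch under that assumption. The function name, microphone labels, and transcripts are all invented for illustration and are not part of the audio collector.

```python
def word_error_rate(reference, hypothesis):
    """Return the word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; not part of the audio collector.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in recognizer output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented transcripts for two hypothetical microphones on the same utterance.
reference = "turn the lights on"
outputs = {
    "mic_a": "turn the lights on",
    "mic_b": "turn the light on",
}
scores = {mic: word_error_rate(reference, text) for mic, text in outputs.items()}
```

Because every microphone records the same speech, the reference transcript is shared and the per-microphone scores can be compared directly.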

Two-sided audio dialogues - Recordings of 196 of the filtered Turk dialogues, made with three different microphones. The corpus contains the audio data as well as recognition results from the Google and IBM speech recognizers.

Text prediction web API - REST web API providing text predictions based on an n-gram language model. The API can also generate random words from a given vocabulary.

Nomon keyboard - Nomon is a novel interface for single-switch computer access. In collaboration with MIT, we are working on an onscreen predictive keyboard based on the Nomon idea.

Text predictor for Python - Set of Python classes that make word predictions based on context and/or a prefix of the current word. The classes can also make character predictions based on context. These classes were incorporated into the Nomon keyboard.
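
To illustrate what predicting a word from context and/or a prefix can look like, here is a minimal sketch built on bigram counts. It is a toy stand-in, not the released classes: the class name BigramPredictor, its predict method, and the training sentences are all hypothetical, and a real n-gram model would use longer contexts and smoothing.

```python
from collections import Counter, defaultdict


class BigramPredictor:
    """Toy word predictor (hypothetical; not the released Python classes)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)
        for sentence in sentences:
            # "<s>" marks the start of a sentence so the first word has context.
            words = ["<s>"] + sentence.lower().split()
            for prev, word in zip(words, words[1:]):
                self.unigrams[word] += 1
                self.bigrams[prev][word] += 1

    def predict(self, context="", prefix="", n=3):
        """Return up to n words that start with prefix, most frequent first."""
        prefix = prefix.lower()
        words = context.lower().split()
        prev = words[-1] if words else "<s>"

        def matching(counts):
            return [(w, c) for w, c in counts.items() if w.startswith(prefix)]

        # Prefer words seen after the previous word; if none match the
        # prefix, back off to overall word frequencies.
        matches = matching(self.bigrams.get(prev, {})) or matching(self.unigrams)
        # Sort by descending count, breaking ties alphabetically.
        matches.sort(key=lambda pair: (-pair[1], pair[0]))
        return [word for word, _ in matches[:n]]


# Invented training text for illustration only.
predictor = BigramPredictor(["the cat sat", "the cat ran", "the dog sat"])
after_the = predictor.predict(context="the")
after_the_d = predictor.predict(context="the", prefix="d")
```

Passing only a context ranks likely next words, passing only a prefix completes the current word from sentence-start statistics, and passing both narrows the contextual candidates to those that match the typed letters.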