Software and data

Language model personalization dataset - Contains resources we used in this paper to conduct language model adaptation experiments based on the Enron Personalization Validation Set.

Noisy typing on QWERTY keyboards - Contains typing data from participants in nine previously published text entry studies. Data was collected on a variety of devices (touchscreen phone, smartwatch, mid-air VR/AR keyboard, desktop keyboard) with different keyboard features and user input strategies.

Dasher character language model (Feb 2021) - Character language model created for use in the Dasher project. Trained on 10.7 billion characters of text from web pages. Note there is only a 4-gram model; if you are looking for longer span models, see the other character models on this page.

Dasher word language models (Feb 2021) - Word language models created for use in the Dasher project. Trained on 2.3 billion words of text from web pages.

Dasher - We are part of a project to reimplement the Dasher text entry interface. Dasher is particularly well-suited for people who write via pointing, for example via an eye-tracker.

Baton - Application that allows AAC users to select sentences from their AAC history to share with researchers. If you are an AAC user, please consider joining our study.

Character language models (Dec 2019) - Character language models optimized for AAC-like text. Trained on 21 billion characters of text from four training sources.

Word language models (Dec 2019) - Word language models optimized for AAC-like text. Trained on 8.6 billion words of text from four training sources.

Word language models (Feb 2019) - Word language models optimized for AAC-like text. Trained on 2.6 billion words of text from a diverse set of training sources.

Filtered Turk dialogues - This is a filtered set of the dialogues invented by Amazon Mechanical Turk workers in previous work. We removed dialogues that contained potentially offensive language. This set served as the basis for our two-sided audio dialogue collection.

Audio collector - Java program for collecting two-sided audio dialogues. The program can record from multiple microphones simultaneously. This allows you to compare the recognition accuracy of different microphones on the same speech data.

Two-sided audio dialogues - Recordings of 196 of the filtered Turk dialogues using three different microphones. This corpus contains the audio data as well as recognition results using the Google and IBM speech recognizers.

Text prediction web API - REST web API providing text predictions based on an n-gram language model. The API can also generate random words from a given vocabulary. Version 2 of the API adds endpoints for character predictions and for anticipating future predictions in support of the browser-based version of Nomon.

Nomon keyboard - Nomon is a novel interface for single-switch computer access. In collaboration with MIT, we are working on an onscreen predictive keyboard based on the Nomon idea. You can learn more about Nomon in this video and poster.

Text predictor for Python - Set of Python classes that make word predictions based on context and/or a prefix of the current word. The classes can also make character predictions based on context. These classes were incorporated into the Nomon keyboard.
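
To illustrate what "word predictions based on context and/or a prefix" can mean in practice, here is a minimal sketch in Python. It is not the code from the package above: the class name ToyWordPredictor, its predict method, and the toy training sentences are all hypothetical, and it uses simple bigram counts rather than the n-gram language models listed on this page. Candidates are filtered by the typed prefix, then ranked by how often they followed the previous word, with overall word frequency as a tie-breaker.

```python
from collections import Counter, defaultdict


class ToyWordPredictor:
    """Hypothetical illustration only; not the classes from the listed package."""

    def __init__(self, sentences):
        # Count single words and adjacent word pairs from the training text.
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)
        for sentence in sentences:
            words = sentence.lower().split()
            self.unigrams.update(words)
            for prev, cur in zip(words, words[1:]):
                self.bigrams[prev][cur] += 1

    def predict(self, context="", prefix="", n=3):
        """Return up to n words that start with prefix, best guess first."""
        context_words = context.lower().split()
        prefix = prefix.lower()
        # Words seen after the last context word (empty if no context given).
        followers = Counter()
        if context_words:
            followers = self.bigrams.get(context_words[-1], Counter())
        scored = [
            (followers.get(word, 0), count, word)
            for word, count in self.unigrams.items()
            if word.startswith(prefix)
        ]
        # Rank by bigram count, then unigram count, then alphabetically.
        scored.sort(key=lambda item: (-item[0], -item[1], item[2]))
        return [word for _, _, word in scored[:n]]


if __name__ == "__main__":
    predictor = ToyWordPredictor([
        "i want to go home",
        "i want to eat",
        "i want coffee",
        "we go home",
    ])
    print(predictor.predict(context="i want", n=2))
    print(predictor.predict(context="we", prefix="g"))
```

A real predictor would use smoothed probabilities from a longer-span model and a fixed vocabulary; the counts here only show how context and prefix combine to narrow and rank the candidates.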
