Open Datasets

ML requires training data, and authentic data is often hard to come by. This is a collection of datasets that are generally useful.

If you have a dataset you’d like to share (or want to find one not listed here), you might consider one of the open data repositories (e.g. Google Research Datasets, EdData, Data.gov, Dataverse, ICPSR, Datashop, UCI ML Repository).

Kaggle ASAP AES Dataset: A dataset containing essays written in response to 8 different essay prompts.
Kaggle ASAP AES v2 Dataset: A large dataset of 24,278 essays written in response to various prompts.

Kaggle ASAP SAS Dataset: A dataset containing student responses to 10 different short constructed response prompts.
Powergrading Short Answer Grading Corpus: This corpus contains the original data analyzed in the following paper: Basu, Jacobs, and Vanderwende, “Powergrading: a Clustering Approach to Amplify Human Effort for Short Answer Grading,” Transactions of the ACL, 2013. Last published: October 4, 2013.

PERSUADE Corpus: The Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus contains over 280,000 discourse annotations for over 25,000 argumentative essays.
IteraTeR, R3 System, and DElIteraTeR: A repository that provides datasets and code for preprocessing, training, and testing models for Iterative Text Revision.

SQuAD: Version 2.0 of this dataset combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. While often used to train question-answering models, this dataset has also been used to help train question-generation models.
FairytaleQA: The FairytaleQA dataset was created to address the gaps present in similar datasets, as existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. The goal of the challenge in the link is to generate high quality questions from the dataset.

EdNet: A dataset of all student-system interactions collected over 2 years by Santa, a multi-platform AI tutoring service with more than 780K users in Korea, available through Android, iOS, and the web. Details of the dataset are available as a preprint.
HarvardX-MITx Person-Course Dataset: This dataset contains information from the first year of edX courses, which are Harvard and MIT’s Massive Open Online Courses (MOOCs). The data includes over 600,000 students and over a billion records.
Open University Learning Analytics dataset: Contains data about courses, students, and their interactions with the Virtual Learning Environment (VLE) for seven selected courses (called modules). Presentations of courses start in February and October; they are marked by “B” and “J” respectively. The dataset consists of tables connected using unique identifiers. All tables are stored in CSV format.
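
For a dataset organized like the Open University one (several CSV tables connected by unique identifiers), a join can be sketched with Python's standard library alone. This is only an illustration under stated assumptions: the column names (id_student, code_module, code_presentation, final_result, sum_click), the sample rows, and the presentation code shape (a four-digit year followed by “B” or “J”) are assumed here rather than taken from the dataset's documentation, so check them against the real files before relying on them.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-ins for two CSV tables. The column names and
# rows are assumptions made for illustration, not real dataset content.
STUDENT_INFO_CSV = """id_student,code_module,code_presentation,final_result
101,AAA,2013J,Pass
102,AAA,2014B,Fail
"""

STUDENT_VLE_CSV = """id_student,code_module,code_presentation,sum_click
101,AAA,2013J,4
101,AAA,2013J,6
102,AAA,2014B,3
"""

# "B" marks a February start and "J" an October start.
START_MONTH = {"B": "February", "J": "October"}


def presentation_start(code):
    """Map a presentation code such as '2013J' to (year, start month).

    Assumes the code is a four-digit year followed by the B/J marker.
    """
    return int(code[:-1]), START_MONTH[code[-1]]


def total_clicks_per_student(info_csv, vle_csv):
    """Join the two tables on their shared identifiers and sum the clicks."""
    clicks = defaultdict(int)
    for row in csv.DictReader(io.StringIO(vle_csv)):
        key = (row["id_student"], row["code_module"], row["code_presentation"])
        clicks[key] += int(row["sum_click"])

    joined = []
    for row in csv.DictReader(io.StringIO(info_csv)):
        key = (row["id_student"], row["code_module"], row["code_presentation"])
        year, month = presentation_start(row["code_presentation"])
        joined.append({
            "id_student": row["id_student"],
            "start": f"{month} {year}",
            "final_result": row["final_result"],
            "total_clicks": clicks.get(key, 0),
        })
    return joined


if __name__ == "__main__":
    for record in total_clicks_per_student(STUDENT_INFO_CSV, STUDENT_VLE_CSV):
        print(record)
```

To run this against downloaded files, the io.StringIO wrappers could be swapped for open(path, newline="") handles. The composite key (student, module, presentation) is used because the same student may appear in more than one module or presentation.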