The recent explosion of “Generative AI” has broad and deep implications for educational measurement. By “Generative AI” we mean algorithms that create new text, images, or sound using machine learning models trained on existing samples of such data. This technology is already in operational use in assessments and learning solutions and is developing rapidly. At the same time, it poses significant challenges to the foundational measurement principles of validity, fairness, and reliability. With evidence and constructs constantly recreated, and data that are neither representative nor static, many measurement concepts, and the practices used to implement them, need to be rethought.

Grounding innovation in measurement principles can help reduce the risk of these new approaches, improve their impact on student learning, and ensure that students from minoritized communities are not harmed, all without rejecting the potential value of the technology. This genie will not be put back into its bottle. Instead, measurement experts can help shape these tools, collaboratively identifying important areas of focus and sharing innovative solutions.

This special interest group (SIGIME) seeks to advance theoretical and applied research into AI in educational measurement by bringing together data scientists, psychometricians, education researchers, and other interested stakeholders. The SIGIME will discuss current practices in using Generative AI, approaches to evaluating its precision and accuracy, and areas where more foundational research is required into the way we test and measure educational outcomes. This group seeks to create a strong professional identity and intellectual home for those interested in the use of AI in many areas, including automated scoring, item evaluation, validity studies, formative feedback, and automated item generation with generative AI. A critical initial responsibility is to ensure that there are guidelines around FATE (fair, accountable, transparent, and ethical) principles for applying AI in measurement (Harris 2023). Further, the limited interpretability of machine learning models often makes it difficult to evaluate the validity of these methods using conventional psychometric approaches, although there is active research in this area (Dorsey and Michaels 2022).

We suggest three initial applied areas for exploration: 

  1. Item Generation. Generative AI is rapidly changing the way we think about education and problem-solving. It puts within our grasp a virtually infinite item pool, one that can be customized to the interests or social context of the learner. However, some important ethical and psychometric considerations need to be addressed. How do we filter this pool for the best items? What are our standards of validity and accuracy when there is no baseline? How should assessment item development processes be designed? It is important for those of us who understand these systems best to help educators and test developers make informed decisions about how to use generative AI fairly and appropriately.

  2. Automated Scoring of Open-Ended Items. Automated scoring is possibly the most widely used application of AI in education, and has been identified as the top use of AI published in measurement journals (Zheng et al. 2023). The methods used have shifted substantially in the past few years. The development and incorporation of large language models (LLMs) in automated scoring systems have led to substantial increases in scoring accuracy and greater flexibility to generalize models to new items with less training effort. However, LLMs are difficult to interpret and carry a greater risk of bias. Because LLMs are (fortunately) trained on large amounts of data that do not come from students, they introduce new ways in which scoring engines could be biased. As we move forward, we need to ensure that the use of LLMs in automated scoring remains fair and equitable.

  3. Formative Feedback. In many contexts, assessment is rapidly moving from an isolated activity to one that is embedded within the learning experience. As AI methods continue to advance, an open area of research is how we can leverage AI to provide students with improved diagnostic feedback. Integrating AI-based tools into automated writing evaluation (AWE) systems requires that the feedback returned be grounded in educational theory. Dialogue between data scientists, educators, and psychometricians is key to ensuring that AI is utilized to improve educational outcomes. A special interest group within the NCME could provide an appropriate forum for these discussions.

Generative AI is taking the world by storm, and we believe NCME can make a deep and substantive contribution to how this technology is used in education. A special interest group on Artificial Intelligence (AI) in Measurement and Education would provide a timely forum for researchers and practitioners who use AI in education and measurement to share their work, discuss challenges, and develop best practices.


Dorsey, David W., and Hillary R. Michaels. 2022. “Validity Arguments Meet Artificial Intelligence in Innovative Educational Assessment: A Discussion and Look Forward.” Journal of Educational Measurement 59 (3): 389–94. https://doi.org/10.1111/jedm.12330.
Harris, Robert. 2023. “FATE (Fairness and Transparency) Hits the Mainstream: UK’s 5 Principles for AI Regulation.” https://feedzai.com/blog/fate-fairness-and-transparency-hits-the-mainstream-uks-5-principles-for-ai-regulation/.