Guides and resources

resources

Paper

Reza, M., Anastasopoulos, I., Bhandari, S., & Pardos, Z. A. (2025). PromptHive: Bringing subject matter experts back to the forefront with collaborative prompt engineering for educational content creation. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (Article 148, pp. 1–22).

Abstract

Involving subject matter experts in prompt engineering can guide LLM outputs toward more helpful, accurate, and tailored content that meets the diverse needs of different domains. However, iterating towards effective prompts can be challenging without adequate interface support for systematic experimentation within specific task contexts. In this work, we introduce PromptHive, a collaborative interface for prompt authoring designed to better connect domain knowledge with prompt engineering through features that encourage rapid iteration on prompt variations. We conducted an evaluation study with ten subject matter experts in math and validated our design through two collaborative prompt writing sessions and a learning gain study with 358 learners. Our results elucidate the prompt iteration process and validate the tool’s usability, enabling non-AI experts to craft prompts that generate content comparable to human-authored materials while reducing perceived cognitive load by half and shortening the authoring process from several months to just a few hours.

Read research paper 

Liu, Y., Bhandari, S., & Pardos, Z. A. (2025). Leveraging LLM respondents for item evaluation: A psychometric analysis. British Journal of Educational Technology, 56, 1028–1052.

Abstract

Effective educational measurement relies heavily on the curation of well-designed item pools. However, item calibration is time consuming and costly, requiring a sufficient number of respondents to estimate the psychometric properties of items. In this study, we explore the potential of six different large language models (LLMs; GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro and Cohere Command R Plus) to generate responses with psychometric properties comparable to those of human respondents. Results indicate that some LLMs exhibit proficiency in College Algebra that is similar to or exceeds that of college students. However, we find the LLMs used in this study to have narrow proficiency distributions, limiting their ability to fully mimic the variability observed in human respondents, but that an ensemble of LLMs can better approximate the broader ability distribution typical of college students. Utilizing item response theory, the item parameters calibrated by LLM respondents have high correlations (eg, >0.8 for GPT-3.5) with their human calibrated counterparts. Several augmentation strategies are evaluated for their relative performance, with resampling methods proving most effective, enhancing the Spearman correlation from 0.89 (human only) to 0.93 (augmented human).

Read research paper 

Pardos, Z. A., & Bhandari, S. (2024). ChatGPT-generated help produces learning gains equivalent to human tutor-authored help on mathematics skills. PLOS ONE, 19(5), e0304013.

Abstract

Authoring of help content within educational technologies is labor intensive, requiring many iterations of content creation, refining, and proofreading. In this paper, we conduct an efficacy evaluation of ChatGPT-generated help using a 3 x 4 study design (N = 274) to compare the learning gains of ChatGPT to human tutor-authored help across four mathematics problem subject areas. Participants are randomly assigned to one of three hint conditions (control, human tutor, or ChatGPT) paired with one of four randomly assigned subject areas (Elementary Algebra, Intermediate Algebra, College Algebra, or Statistics). We find that only the ChatGPT condition produces statistically significant learning gains compared to a no-help control, with no statistically significant differences in gains or time-on-task observed between learners receiving ChatGPT vs human tutor help. Notably, ChatGPT-generated help failed quality checks on 32% of problems. This was, however, reducible to nearly 0% for algebra problems and 13% for statistics problems after applying self-consistency, a “hallucination” mitigation technique for Large Language Models.

Read research paper 

Bhandari, S., Liu, Y., Kwak, Y., & Pardos, Z. A. (2024). Evaluating the psychometric properties of ChatGPT-generated questions. Computers and Education: Artificial Intelligence, 7, 100284.

Abstract

Not much is known about how LLM-generated questions compare to gold-standard, traditional formative assessments concerning their difficulty and discrimination parameters, which are valued properties in the psychometric measurement field. We follow a rigorous measurement methodology to compare a set of ChatGPT-generated questions, produced from one lesson summary in a textbook, to existing questions from a published Creative Commons textbook. To do this, we collected and analyzed responses from 207 test respondents who answered questions from both item pools and used a linking methodology to compare IRT properties between the two pools. We find that neither the difficulty nor discrimination parameters of the 15 items in each pool differ statistically significantly, with some evidence that the ChatGPT items were marginally better at differentiating different respondent abilities. The response time also does not differ significantly between the two sources of items. The ChatGPT-generated items showed evidence of unidimensionality and did not affect the unidimensionality of the original set of items when tested together. Finally, through a fine-grained learning objective labeling analysis, we found greater similarity in the learning objective distribution of ChatGPT-generated items and the items from the target OpenStax lesson (0.9666) than between ChatGPT-generated items and adjacent OpenStax lessons (0.6859 for the previous lesson and 0.6153 for the subsequent lesson). These results corroborate our conclusion that generative AI can produce algebra items of similar quality to existing textbook questions that share the same construct or constructs as those questions.

Read research paper 

Kwak, Y., & Pardos, Z. A. (2024). Bridging large language model disparities: Skill tagging of multilingual educational content. British Journal of Educational Technology, 55, 2039–2057.

Abstract

The adoption of large language models (LLMs) in education holds much promise. However, like many technological innovations before them, adoption and access can often be inequitable from the outset, creating more divides than they bridge. In this paper, we explore the magnitude of the country and language divide in the leading open-source and proprietary LLMs with respect to knowledge of K-12 taxonomies in a variety of countries and their performance on tagging problem content with the appropriate skill from a taxonomy, an important task for aligning open educational resources and tutoring content with state curricula. We also experiment with approaches to narrowing the performance divide by enhancing LLM skill tagging performance across four countries (the USA, Ireland, South Korea and India–Maharashtra) for more equitable outcomes. We observe considerable performance disparities not only with non-English languages but with English and non-US taxonomies. Our findings demonstrate that fine-tuning GPT-3.5 with a few labelled examples can improve its proficiency in tagging problems with relevant skills or standards, even for countries and languages that are underrepresented during training. Furthermore, the fine-tuning results show the potential viability of GPT as a multilingual skill classifier. Using both an open-source model, Llama2-13B, and a closed-source model, GPT-3.5, we also observe large disparities in tagging performance between the two and find that fine-tuning and skill information in the prompt improve both, but the closed-source model improves to a much greater extent. Our study contributes to the first empirical results on mitigating disparities across countries and languages with LLMs in an educational context.

Read research paper 

Li, Z., Pardos, Z. A., & Ren, C. (2024). Aligning open educational resources to new taxonomies: How AI technologies can help and in which scenarios. Computers & Education, 216, 105027.

Abstract

Aligning open educational resources (OER) to skill taxonomies is a common task in the education field and helps teachers better locate material that aligns with the standards of their curriculum. When taxonomies change, as they periodically do, re-tagging the increasing mass of open educational resources is needed. The process of manual tagging is, however, exceedingly labor intensive. We propose and evaluate a novel combination of machine learning methods to help automate tagging open educational resources with skills from an existing taxonomy as well as skills from any newly introduced taxonomy. We collected text, image figures, and videos from tens of thousands of educational resources from two major digital learning platforms to answer the research questions of: how effective are machine learning models in automatically updating OER classification to reflect a new taxonomy (RQ1), and which models may be of practical use in different scenarios (RQ2)? Using several taxonomies, including the US Common Core, we find that while full automation is not practically viable, our most generalizable model can reach non-expert human labeling performance requiring only 100 labeled examples and near expert level with 5000. We believe these novel findings may have immediate utility for practitioners and policymakers and better ready the growing landscape of open educational resources for the advent of new taxonomies ahead. We publicly release our pre-trained US Common Core and new taxonomy tagging models, providing guidance on their viability in various real-world scenarios.

Read research paper 

Sheel, S., Anastasopoulos, I., & Pardos, Z. A. (2024). Comparing authoring experiences with spreadsheet interfaces vs GUIs. In Proceedings of the 14th Learning Analytics and Knowledge Conference (LAK '24) (pp. 598–607).

Abstract

There is little consensus over whether graphical user interfaces (GUIs) or programmatic systems are better for word processing. Even less is known about each interfaces’ affordances and limitations in the context of creating content for adaptive tutoring systems. In order to afford instructors the use of such systems with their own or adapted pedagogies, we must study their experiences in inputting their content. In this study, we conduct a between-subjects A/B test with two content authoring interfaces, a GUI and spreadsheet, to explore 32 instructors’ experiences in authoring algebra content with hints, scaffolds, images, and special characters. We study their experiences by measuring time taken, accuracy, and their perceptions of each interfaces’ usability. Our findings indicate no significant relationship between interface used and time taken authoring problems but significantly more accuracy in authoring problems in the spreadsheet interface over the GUI. Although both interfaces performed reasonably well in time taken and accuracy, both were perceived as average to low in usability, highlighting a dissonance between instructors’ perceptions and actual performances. Since both interfaces are reasonable in authoring content, other factors can be explored, such as cost and author incentive, when deciding which interface approach to take for authoring tutor content.

Read research paper 

Pardos, Z. A., & Bhandari, S. (2023). Learning gain differences between ChatGPT and human tutor generated algebra hints. arXiv.

Abstract

Large Language Models (LLMs), such as ChatGPT, are quickly advancing AI to the frontiers of practical consumer use and leading industries to re-evaluate how they allocate resources for content production. Authoring of open educational resources and hint content within adaptive tutoring systems is labor intensive. Should LLMs like ChatGPT produce educational content on par with human-authored content, the implications would be significant for further scaling of computer tutoring system approaches. In this paper, we conduct the first learning gain evaluation of ChatGPT by comparing the efficacy of its hints with hints authored by human tutors with 77 participants across two algebra topic areas, Elementary Algebra and Intermediate Algebra. We find that 70% of hints produced by ChatGPT passed our manual quality checks and that both human and ChatGPT conditions produced positive learning gains. However, gains were only statistically significant for human tutor created hints. Learning gains from human-created hints were substantially and statistically significantly higher than ChatGPT hints in both topic areas, though ChatGPT participants in the Intermediate Algebra experiment were near ceiling and not even with the control at pre-test. We discuss the limitations of our study and suggest several future directions for the field. Problem and hint content used in the experiment is provided for replicability.

Read research paper 

Open-Source Code

Explore the full OATutor codebase on GitHub. Our open-source repository provides all the tools you need to customize, deploy, and contribute to the development of this adaptive tutoring system.

Access Github

Guides and resources

Implementation Guide

Accessibility and User Guides

Paper

Open-Source Code

Stay Updated with OATutor

Contact Us