A Dataset for Reproducing How Students Use Generative AI for Essay Writing

Virginia Tech · Blacksburg, VA

NIRVANA CHI 2026 Paper

Abstract

Overview

With the rapid adoption of AI writing assistants in education, educators and researchers need empirical evidence to understand their impact on student writing and inform effective pedagogical design. Despite widespread use, we lack a systematic understanding of how students engage with these tools during authentic writing tasks: when they seek assistance, what they ask, and how they incorporate AI-generated content into their essays. This gap limits evidence-based policy development and rigorous evaluation of generative AI's learning effects. To address this gap, we introduce NIRVANA, a dataset capturing how university students use generative AI while writing an analytical essay. The dataset includes 77 students who completed an essay task with access to ChatGPT, recording keystroke-level writing behavior, full ChatGPT conversation histories, and all text copied from ChatGPT, enabling a complete reconstruction of the writing process and revealing how AI assistance shapes student work. Our analysis identifies key behavioral patterns, including variation in ChatGPT query frequency and its relationship to essay characteristics such as length and readability. We identify four writing profiles based on students' contribution and revision patterns: Lead Authors, Collaborators, Drafters, and Vibe Writers. To support deeper investigation, we developed a replay interface that reconstructs the writing process; qualitative analysis of sampled replays demonstrates how this tool enables systematic examination of student-AI interactions.

The Dataset

NIRVANA at a Glance

77 university students completed an analytical essay task with unrestricted access to ChatGPT (GPT-3.5-turbo). Every keystroke, GPT query, and copy-paste event was recorded — enabling complete reconstruction of the writing process.

77 participants (university students)

4.51 average GPT queries (σ = 5.11, median = 2)

~30 min average session length (per writing task)

9.96 average Dale-Chall score (college reading level, 9.0+)
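For readers unfamiliar with the Dale-Chall score reported above, the following is a minimal sketch of the standard New Dale-Chall readability formula. It assumes the count of "difficult" words (words not on the Dale-Chall familiar-word list) has already been computed elsewhere; the function name and arguments are illustrative, and the source does not state which implementation was used for NIRVANA.

```python
def dale_chall_score(total_words, total_sentences, difficult_words):
    """New Dale-Chall readability score from pre-computed counts.

    Assumption: `difficult_words` was counted against the Dale-Chall
    familiar-word list, which is not bundled with this sketch.
    """
    pct_difficult = 100.0 * difficult_words / total_words
    avg_sentence_length = total_words / total_sentences
    score = 0.1579 * pct_difficult + 0.0496 * avg_sentence_length
    # The formula adds a constant when more than 5% of words are difficult.
    if pct_difficult > 5.0:
        score += 3.6365
    return score


# Made-up counts for illustration only; not taken from NIRVANA.
example = dale_chall_score(total_words=400, total_sentences=20, difficult_words=80)
```

Higher scores indicate text that is harder to read, which is why a mean of 9.96 falls in the college-level band.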

What is captured?

✓ Keystroke-level writing behavior in a browser-based editor
✓ Full ChatGPT conversation histories (queries & responses)
✓ All text copied from ChatGPT responses into the essay editor
✓ Copy and paste events with precise timestamps
✓ Post-study surveys: Self-Efficacy, TAM, Perceived Ownership, CSI
✓ Participant demographics: age, gender, race

Quantitative Analysis

Key Findings

Spearman rank correlations reveal significant associations between ChatGPT query frequency and writing process outcomes. No significant correlations were found between query count and post-task survey measures (Perceived Ownership or CSI).

Query Count ↔ Word Count (ρ = 0.485, p < 0.001)

Students who asked more questions produced longer essays, suggesting ChatGPT may function as a generative scaffold.

Query Count ↔ Time Spent (ρ = 0.493, p < 0.001)

Higher query frequency was associated with longer sessions — contrary to the expectation that AI use would expedite writing.

Query Count ↔ Dale-Chall Score (ρ = 0.380, p < 0.001)

More queries correlated with higher readability difficulty scores, consistent with the lexical complexity of LLM-generated text.
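The correlations above are Spearman rank correlations. As a rough illustration of how such a coefficient is obtained, the sketch below ranks both variables (averaging ranks for ties) and takes the Pearson correlation of the ranks. The helper names and the sample values are assumptions made for this example; they are not NIRVANA data, and the sketch does not compute p-values.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values for illustration only; not taken from NIRVANA.
query_counts = [0, 2, 2, 5, 9, 14]
word_counts = [310, 295, 420, 480, 455, 610]
rho = spearman_rho(query_counts, word_counts)
```

In practice a statistics library such as SciPy (`scipy.stats.spearmanr`) would typically be used, since it also reports the p-value.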

Novel Metrics

HCR & HER

To characterize the degree of human involvement in AI-assisted writing, we introduce two complementary metrics derived from editor-tracked event counts.

Human Contribution Ratio (HCR)

HCR = (HA − HD) / [(HA − HD) + (GP − GD)]

Measures the proportion of user-written words in the final essay. A value of 1 means every word was written by the participant; 0 means the essay consists entirely of pasted ChatGPT text.

Limitation: HCR does not capture cases where a participant wrote an entire draft and then replaced it with a ChatGPT revision — the score would be 0 even though the ideas originated with the student.

Human Edit Ratio (HER)

HER = (HA + HD + GD) / (HA + HD + GP + GD)

Measures the proportion of all additions and deletions attributable to the participant. Deleting AI-generated content counts as human contribution, since it reflects deliberate curation. A value of 1 means no AI text was ever pasted.

Limitation: HER does not capture cognitive effort involved in planning, ideation, or deciding whether to incorporate AI output — it reflects observable editing effort, not cognitive contribution.

HA — Words added by the human

HD — Human words deleted

GP — Words pasted from ChatGPT

GD — ChatGPT words deleted by human
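The two ratios can be computed directly from the four word counts defined above. The following is a minimal sketch: the `WordCounts` container and the choice to return `None` when a denominator is zero are assumptions made for this example, not part of the published definitions.

```python
from dataclasses import dataclass


@dataclass
class WordCounts:
    ha: int  # HA: words added by the human
    hd: int  # HD: human words deleted
    gp: int  # GP: words pasted from ChatGPT
    gd: int  # GD: ChatGPT words deleted by the human


def hcr(c):
    """Human Contribution Ratio: share of surviving words written by the human."""
    human_left = c.ha - c.hd
    gpt_left = c.gp - c.gd
    total = human_left + gpt_left
    # Assumption: an empty final essay has no defined ratio, so return None.
    return human_left / total if total else None


def her(c):
    """Human Edit Ratio: share of all additions and deletions made by the human."""
    total = c.ha + c.hd + c.gp + c.gd
    # Assumption: a session with no recorded edits has no defined ratio.
    return (c.ha + c.hd + c.gd) / total if total else None


# Hypothetical session: the student drafts 300 words, deletes all of them,
# and pastes a 400-word ChatGPT rewrite that is left untouched.
drafter = WordCounts(ha=300, hd=300, gp=400, gd=0)
print(hcr(drafter))  # 0.0
print(her(drafter))  # 0.6
```

This hypothetical session reproduces the HCR limitation noted earlier: the student's draft leaves no trace in HCR (0.0), while HER (0.6) still registers the editing effort.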

Cluster Analysis

Four Writer Profiles

K-means clustering (K=4, selected via silhouette analysis and the elbow method) on HCR and HER scores identified four distinct patterns of ChatGPT integration. Groups differed significantly in query count, time spent, essay readability, and perceived ownership.

Lead Authors

n = 37

Wrote essays primarily independently. ChatGPT was used for idea generation or information search, with little to no AI-generated text retained in the final essay.

HCR = 0.98, HER = 0.95

Collaborators

n = 15

Worked alongside ChatGPT in a balanced manner. Final essays contained a roughly even mixture of AI-generated and user-written text, with direct revision of pasted content.

HCR = 0.53, HER = 0.62

Drafters

n = 11

Wrote independent drafts first, then asked ChatGPT to rewrite them. Despite high editing effort, final essays were primarily composed of AI-generated content.

HCR = 0.09, HER = 0.64

Vibe Writers

n = 14

Delegated the writing entirely to ChatGPT. Analogous to 'Vibe Coding', these participants generated a full essay via prompts and pasted it with minimal or no personal edits.

HCR = 0.01, HER = 0.06
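The clustering step behind these profiles can be illustrated with a small, dependency-free K-means sketch over (HCR, HER) pairs. Several things here are assumptions made for the example: the farthest-point initialisation (the source does not state how centroids were initialised), the function names, and the sample points, which are invented values placed near the four reported profiles rather than real participant data. Silhouette analysis and the elbow method are not implemented.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic start: repeatedly pick the point farthest from chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[j])  # leave an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, labels


# Invented (HCR, HER) pairs for illustration only; not NIRVANA participants.
points = [
    (0.98, 0.95), (0.96, 0.97),  # mostly self-written
    (0.53, 0.62), (0.50, 0.58),  # mixed human and AI text
    (0.09, 0.64), (0.12, 0.66),  # mostly AI text, substantial editing
    (0.01, 0.06), (0.02, 0.04),  # mostly AI text, little editing
]
centroids, labels = kmeans(points, k=4)
```

Cluster indices are arbitrary, so mapping each cluster to a named profile remains an interpretive step based on its HCR and HER values.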

The Tool

NIRVANA Replay System

Quantitative metrics alone cannot capture all forms of human–AI interaction. To support deeper qualitative analysis, we developed a web-based replay interface that reconstructs the essay composition process event-by-event.

→ Side-by-side view: essay editor replay alongside the full ChatGPT conversation
→ Timeline annotated with copy events, paste events, and GPT queries
→ Variable-speed playback: pause, scrub, or step through specific moments
→ Word-count-over-time graph revealing sudden AI-generated text insertions
→ Summary statistics for each session (word counts, query count, survey scores)
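The event-by-event reconstruction and the word-count-over-time view listed above can be sketched as follows. The `EditEvent` schema (timestamp, kind, character offset, text, length) and the 50-word jump threshold are hypothetical choices for this example; the actual NIRVANA log format and replay implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class EditEvent:
    t: float         # seconds since session start (hypothetical field)
    kind: str        # "insert" or "delete"
    pos: int         # character offset in the editor buffer
    text: str = ""   # inserted text, used by "insert" events
    length: int = 0  # number of characters removed, used by "delete" events


def replay(events):
    """Rebuild the buffer event by event, yielding (time, word_count, text)."""
    buffer = ""
    for ev in sorted(events, key=lambda e: e.t):
        if ev.kind == "insert":
            buffer = buffer[:ev.pos] + ev.text + buffer[ev.pos:]
        elif ev.kind == "delete":
            buffer = buffer[:ev.pos] + buffer[ev.pos + ev.length:]
        else:
            raise ValueError(f"unknown event kind: {ev.kind!r}")
        yield ev.t, len(buffer.split()), buffer


def sudden_insertions(events, min_words=50):
    """Times at which a single event raises the word count by at least min_words."""
    jumps, previous = [], 0
    for t, words, _ in replay(events):
        if words - previous >= min_words:
            jumps.append(t)
        previous = words
    return jumps


# Made-up session: two typed fragments, one large paste, one small deletion.
events = [
    EditEvent(t=1.0, kind="insert", pos=0, text="My essay "),
    EditEvent(t=2.0, kind="insert", pos=9, text="begins here."),
    EditEvent(t=60.0, kind="insert", pos=21, text=" " + "word " * 80),
    EditEvent(t=75.0, kind="delete", pos=0, length=3),
]
word_counts = [words for _, words, _ in replay(events)]
print(word_counts)                # [2, 4, 84, 83]
print(sudden_insertions(events))  # [60.0]
```

Plotting `word_counts` against the event times gives the kind of curve in which a pasted block appears as a sharp vertical step.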

Case Studies

Two instructors of writing-intensive courses conducted a reflexive review of selected sessions. The following cases illustrate how process-level replay reveals distinctions that quantitative clustering alone may miss.

P77 (Vibe Writers)

Copied the writing prompt into ChatGPT, received a complete essay, and pasted it directly without revision. A single sharp increase in word count marks the end of the session.

P30 (Lead Authors)

Used ChatGPT exclusively for grammar and vocabulary help. Writing grew steadily, with ideas developed entirely by the student.

P57 (Lead Authors)

Asked ChatGPT for a full essay but used it only as reference material. The submitted essay differed substantially from the AI-generated draft.

P54 (Drafters)

Wrote a three-paragraph draft independently, then asked ChatGPT to elaborate and replaced the original with AI-generated content.

P26 (Collaborators)

Generated an initial draft with ChatGPT and made minimal edits. Although the clustering classified this participant as a Collaborator, qualitative review suggests ChatGPT acted as the primary author.

Publications

Publications & Citation

CHI 2026

An Empirical Study to Understand How Students Use ChatGPT for Writing Essays

Andrew Jelson, Daniel Manesh, Alice Jang, Daniel Dunlap, Young-Ho Kim, Sang Won Lee · Virginia Tech

2026

NIRVANA: A Dataset for Reproducing How Students Use Generative AI for Essay Writing

Andrew Jelson, Daniel Manesh, Sangwook Lee, Alice Jang, Daniel Dunlap, Tamara Maddox, Young-Ho Kim, Sang Won Lee · Virginia Tech

arXiv PDF

Interactive Tool

Replay a Session

Select any of the 77 participants and explore their writing process step by step.
