ChatGPT as an automated writing evaluation tool increases text complexity during use but does not produce sustained writing improvements once AI assistance is removed, a finding that underscores the need for balanced approaches in EFL instruction.
Objective: The main goal of this study was to examine the effectiveness of ChatGPT in improving English writing skills among English as a Foreign Language (EFL) students over a nine-week intervention. Specifically, the research explored ChatGPT's impact on syntactic and lexical complexity, two key dimensions of writing development, and asked whether gains observed during AI-assisted writing persisted once students wrote independently without AI support. The study addressed a gap in existing research by focusing on longitudinal writing gains in a naturalistic setting rather than on short-term redrafting processes.
Methods: The research employed a quasi-experimental design with 105 first-year university students enrolled in an English translation program in Belgium. Participants were assigned to two conditions: an experimental group (n=66) that received both teacher feedback and ChatGPT access, and a control group (n=39) that received teacher feedback only. The nine-week intervention began with a baseline writing assessment in week 1, followed by regular instruction during weeks 2-4, supplemented for the experimental group by a ChatGPT training module covering prompt engineering and feedback applications including structural, grammatical, and lexical support. Writing assessments followed in weeks 6, 8, and 9, with the final assessment conducted without AI access to measure sustained gains. All writing took place under controlled classroom conditions with a 90-minute time limit. The researchers used Natural Language Processing tools, TAALES (Tool for the Automatic Analysis of Lexical Sophistication) and TAASSC (Tool for the Automatic Analysis of Syntactic Sophistication and Complexity), to compute over 400 indices of linguistic complexity across 373 essays totaling 161,532 words. Data analysis employed multilevel linear regression models controlling for initial ability. Syntactic complexity was measured through mean length of sentence, clauses per sentence, mean length of T-unit, and mean length of clause; lexical complexity was assessed via age of acquisition, concreteness ratings, word frequency, and academic word prevalence.
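To make these indices concrete, the sketch below approximates three of the syntactic measures (mean length of sentence, clauses per sentence, and mean length of clause) with an off-the-shelf dependency parser. This is not the authors' pipeline, which used TAALES and TAASSC; spaCy and the clause-counting heuristic are illustrative assumptions.

```python
# Minimal sketch of three syntactic complexity indices; NOT the authors'
# TAALES/TAASSC pipeline. spaCy and the clause heuristic are assumptions.
import spacy

nlp = spacy.load("en_core_web_sm")

# Dependency labels that usually head a clause in spaCy's English scheme.
CLAUSE_DEPS = {"ROOT", "ccomp", "xcomp", "advcl", "acl", "relcl", "csubj"}

def complexity_indices(text: str) -> dict:
    doc = nlp(text)
    n_sents = n_words = n_clauses = 0
    for sent in doc.sents:
        n_sents += 1
        n_words += sum(1 for t in sent if t.is_alpha)   # word tokens only
        n_clauses += sum(1 for t in sent if t.dep_ in CLAUSE_DEPS)
    return {
        "mean_length_of_sentence": n_words / max(n_sents, 1),
        "clauses_per_sentence": n_clauses / max(n_sents, 1),
        "mean_length_of_clause": n_words / max(n_clauses, 1),
    }

print(complexity_indices(
    "The student wrote an essay. She revised it because the feedback helped."
))
```

T-unit segmentation (needed for mean length of T-unit) requires splitting coordinated main clauses and is omitted here for brevity.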
Key Findings: The study revealed mixed results regarding ChatGPT's effectiveness as a writing development tool. While ChatGPT was available, the experimental group consistently wrote significantly longer texts, with large effect sizes, and showed greater syntactic complexity (longer sentences and clauses) as well as greater lexical sophistication (vocabulary with a higher age of acquisition, a denser concentration of academic lexis, and more low-frequency words). These advantages disappeared when AI access was removed in week 9, with no significant differences remaining between groups. Multilevel analysis showed significant main effects of time across all syntactic and lexical complexity measures for both groups, indicating that traditional instruction contributed to writing development. Critically, condition (access to ChatGPT) had no statistically significant main effect on any variable, suggesting that ChatGPT alone did not substantially contribute to sustained gains. Two significant interaction effects were found for syntactic complexity measures (mean length of T-unit and mean length of clause), indicating that combining traditional instruction with ChatGPT-supported feedback provided additional support specifically for syntactic development. No significant interaction effects were observed for lexical complexity variables, suggesting that vocabulary improvements were driven primarily by instructional time rather than AI assistance. Post-writing surveys revealed that students used ChatGPT mainly for grammatical and lexical feedback rather than content generation, viewing it as a scaffolding tool rather than a text generator.
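As an illustration of the kind of model behind these interaction results, the sketch below fits a multilevel regression with essays nested within students, a baseline covariate for initial ability, and a time-by-condition interaction. The statsmodels specification and all column names (mlt, week, condition, student_id, baseline_mlt) are hypothetical, not taken from the paper.

```python
# Hypothetical multilevel model for one syntactic index (mean length of
# T-unit). Column names and the exact specification are illustrative.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("essay_indices.csv")  # hypothetical: one row per essay

# Random intercept per student; fixed effects for time, condition, their
# interaction, and a baseline covariate controlling for initial ability.
# A significant week:condition coefficient indicates diverging group
# trajectories, as reported for mean length of T-unit and of clause.
model = smf.mixedlm(
    "mlt ~ week * condition + baseline_mlt",
    data=df,
    groups=df["student_id"],
)
result = model.fit()
print(result.summary())
```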
Implications: The findings carry important implications for integrating AI tools in language education, particularly regarding realistic expectations for automated writing evaluation systems. The research demonstrates that while ChatGPT can provide immediate benefits during use, such as increased text length and complexity, these gains may not translate into sustained writing competency. AI tools should therefore complement rather than replace human feedback in writing instruction, serving as scaffolding devices within broader pedagogical frameworks. The study highlights the importance of targeted training for both educators and students in effective AI use, particularly in prompt engineering and specific feedback applications. Traditional instruction remains the primary driver of writing development, with AI providing supplementary support, most clearly for syntactic complexity. For EFL contexts, the findings suggest that ChatGPT is most valuable when integrated strategically with teacher instruction rather than used as a standalone solution, and they underscore the need for careful monitoring of AI usage patterns and clear guidance on maximizing educational benefits while avoiding over-reliance on technological tools.
Limitations: Several important limitations affect the interpretation and generalizability of the results. The nine-week intervention, while longer than in most studies in this area, may not be sufficient to capture long-term writing development or sustained learning effects. The focus on syntactic and lexical complexity excluded other critical writing dimensions such as content quality, organization, coherence, and rhetorical effectiveness, so the research did not evaluate the full construct of writing ability. The real-world classroom setting, while enhancing ecological validity, limited the researchers' ability to fully control how students used ChatGPT, potentially introducing variability in treatment implementation. The participant population was restricted to first-year university students in an English translation program in Belgium, which may limit generalizability to other educational contexts, age groups, or language learning backgrounds. The study used ChatGPT version 3, and the findings may not apply to newer models with enhanced capabilities. Finally, the research did not examine individual differences in AI usage effectiveness or explore optimal training approaches for maximizing educational benefits.
Future Directions: The research identifies several critical areas requiring further investigation to advance understanding of AI integration in language education. Longitudinal studies spanning multiple semesters or academic years are needed to assess long-term impacts of sustained AI usage on writing development and determine optimal intervention durations. Research should expand beyond complexity measures to examine AI effects on content quality, organization, argumentation skills, and overall communicative effectiveness. Comparative studies across different AI versions and tools would help identify which technological features most effectively support language learning. Investigation into individual differences in AI usage effectiveness, including learning styles, proficiency levels, and technological literacy, could inform personalized integration approaches. Research on optimal training protocols for both educators and students would help maximize educational benefits while minimizing potential negative effects. Studies examining AI integration across diverse educational contexts, age groups, and languages would enhance generalizability. Future work should also explore the development of AI-resistant assessment methods and investigate potential negative effects such as over-reliance, reduced critical thinking, or diminished human interaction in learning processes. Cross-cultural research could examine how AI effectiveness varies across different educational systems and cultural contexts.
Title and Authors: "The impact of automated writing evaluation on writing gains" by Bart Deygers, Liisa Buelens, David Chan, Laura Schildt, Amaury Van Parys, and Marieke Vanbuel.
Published On: February 2025
Published By: ELT Journal