
Grammar Error Correction (GEC) using GPT

Author: Youngsook Song (ML Researcher)
Category: Paper Review
Tags: Startups, LLM
Published: 2024/06/27
This article examines four papers on grammatical error correction (GEC) using GPT. GEC covers methods for detecting and correcting grammatical errors for both native speakers and foreign-language learners, which makes it an important research area: it helps native speakers use language according to norms and lowers language barriers for learners. Studies of grammatical errors existed before GPT, but they focused on minimal corrections that did not go against the user's intent, such as fixing spacing. With the emergence of generative models and the diversification of AI language models, however, user requirements are also diversifying, driven by models that (1) offer improved speech-recognition accuracy and (2) can rewrite entire sentences to sound more natural, like a native speaker. Through the four papers below, we examine how well GPT-based GEC models can meet these requirements.
Papers

1. Is ChatGPT a Highly Fluent Grammatical Error Correction System? A Comprehensive Evaluation
- Data samples: CoNLL14, NLPCC18
- Target language: English, German, Chinese
- Experiment: Zero-shot CoT & few-shot CoT
- Results:
  - Precision and F0.5 scores lower than SOTA models, but higher recall
  - Strong in sentence-level fluency and error detection; weak in document-level error correction such as inter-sentence consistency, pronoun reference, and tense errors
  - Difficulty correcting errors across sentence boundaries; document-level error detection and specific error types need improvement
  - Very high fluency in sentences generated by ChatGPT, but less adherence to the minimal-edit principle and a tendency to over-correct
2. ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark
- Data samples: CoNLL14
- Target language: English
- Experiment: Analyzed performance differences by sentence length; investigated cases of under-correction, mis-correction, and over-correction
- Results:
  - ChatGPT scored lower than Grammarly and GECToR on automatic evaluation metrics
  - ChatGPT's performance dropped on longer sentences
  - ChatGPT tends to naturally change sentence structure and expressions rather than simply correcting errors one by one
  - Human evaluation showed ChatGPT produced less under-correction and mis-correction but more over-correction
3. Exploring Effectiveness of GPT-3 in Grammatical Error Correction: A Study on Performance and Controllability in Prompt-Based Methods
- Data samples: CoNLL14
- Target language: English
- Experiment: Zero-shot & few-shot prompting (see the prompting sketch after this table)
- Results:
  - GEC can be effectively controlled with GPT-3 by providing appropriate task instructions and examples
  - Performance tends to improve as the number of examples increases in few-shot settings
  - Examples play a more important role than task instructions in controlling model output
4. GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning
- Data samples: 1,061 parallel samples in total, collected from websites
- Target language: Chinese
- Experiment: Built a hybrid dataset combining ChatGPT-generated and human-annotated data
- Results:
  - GrammarGPT showed the best performance, ranking 1st-3rd in the NLPCC2023 SharedTask
  - Model parameters are 20x larger, but the data required for fine-tuning is reduced to roughly 1/1200
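
For reference, the zero-shot and few-shot settings compared in these papers differ mainly in whether corrected example pairs are included in the prompt. The sketch below illustrates both modes, assuming the OpenAI Python client; the model name, instruction wording, and example pairs are placeholders of mine, not the prompts used in the papers.

```python
# Minimal zero-shot vs. few-shot GEC prompting sketch (illustrative only).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTION = ("Correct the grammatical errors in the following sentence. "
               "Return only the corrected sentence.")

# Hypothetical in-context examples; the GPT-3 study found that examples
# influence the output more than the task instruction does.
EXAMPLES = [
    ("She go to school every day.", "She goes to school every day."),
    ("I have much friends in this city.", "I have many friends in this city."),
]

def correct(sentence: str, few_shot: bool = False) -> str:
    messages = [{"role": "system", "content": INSTRUCTION}]
    if few_shot:
        for src, tgt in EXAMPLES:
            messages.append({"role": "user", "content": src})
            messages.append({"role": "assistant", "content": tgt})
    messages.append({"role": "user", "content": sentence})
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the papers used ChatGPT / GPT-3
        messages=messages,
        temperature=0,  # deterministic decoding suits correction tasks
    )
    return response.choices[0].message.content.strip()

print(correct("He have been learn English since three years.", few_shot=True))
```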

Quantitative Performance Comparison

Performance Report from “Is ChatGPT a Highly Fluent Grammatical Error Correction System? A Comprehensive Evaluation”
- Automatic evaluation: separate results for English, and for German and Chinese
- Human evaluation: average scores from three evaluators for fluency, minimal edits, over-correction, and under-correction
- Findings: very high fluency in sentences generated by ChatGPT, but less adherence to the minimal-edit principle and a tendency to over-correct

Performance Report from “ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark”

Performance Report from “Exploring Effectiveness of GPT-3 in Grammatical Error Correction: A Study on Performance and Controllability in Prompt-Based Methods”
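
A note on the metrics in these reports: GEC benchmarks such as CoNLL-2014 score systems with precision, recall, and F0.5 (via the M2 scorer). Because F0.5 weights precision twice as heavily as recall, a system that finds many errors but also over-corrects, as ChatGPT does, is penalized. A minimal sketch of the formula, with illustrative numbers that are not from the papers:

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).

    beta = 0.5 weights precision twice as heavily as recall, the
    convention in GEC evaluation (e.g., the CoNLL-2014 M2 scorer).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Swapping precision and recall shows how F0.5 penalizes a
# high-recall, low-precision system.
print(f_beta(precision=0.40, recall=0.70))  # 0.4375
print(f_beta(precision=0.70, recall=0.40))  # ~0.609
```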

Discussion Points

- On CoNLL-2014, GEC using general-purpose language models (GPT) is not significantly superior to models designed for the task (such as Grammarly).
- If GPT-type models rewrite at the level of rephrasing rather than making minimal edits, it is difficult to trust the quantitative results above, which evaluate simple grammar-correction ability.
- Controllability to prevent excessive rephrasing can be increased by adapting general models to the task, whether through specific prompts or through supervised fine-tuning as in GrammarGPT (see the sketch after this list).
- Human evaluations that score GPT high on fluency but low on controllability suggest a value judgment about how wide a range of modifications yields more satisfactory results.
- Additional research seems necessary, such as using open-source models or extending to Korean.
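
To make the controllability discussion concrete, here is a small sketch of how edit aggressiveness might be steered at the prompt level and roughly quantified. The prompt wordings are hypothetical, and the word-overlap ratio is only a crude proxy for adherence to the minimal-edit principle, not a metric used in the papers.

```python
import difflib

# Hypothetical prompt variants for steering edit aggressiveness, in the
# spirit of the GPT-3 controllability study; wording is illustrative.
MINIMAL_EDIT_PROMPT = (
    "Correct only clear grammatical errors. Keep the original wording "
    "and sentence structure wherever possible; do not rephrase for style."
)
FLUENCY_PROMPT = (
    "Rewrite the sentence so it is grammatical and sounds natural to a "
    "native speaker. You may restructure the sentence freely."
)

def edit_ratio(source: str, corrected: str) -> float:
    """Word-overlap ratio between source and correction (1.0 = unchanged).
    A crude proxy for adherence to the minimal-edit principle."""
    return difflib.SequenceMatcher(None, source.split(), corrected.split()).ratio()

src = "He have been learn English since three years ."
minimal = "He has been learning English for three years ."  # minimal edit
rewrite = "He started learning English three years ago ."   # heavy rephrase
print(edit_ratio(src, minimal))  # ~0.67: most of the source is preserved
print(edit_ratio(src, rewrite))  # ~0.59: aggressive rephrasing scores lower
```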