Artificial Intelligence and Its Impact on Education Essay

Contents: Introduction; AI’s Impact on Education; The Impact of AI on Teachers; The Impact of AI on Students; Reference List.

Rooted in computer science, Artificial Intelligence (AI) refers to the development of digital systems that can perform tasks normally requiring human intelligence (Rexford, 2018). Interest in the adoption of AI in the education sector began in the 1980s, when researchers explored the possibilities of using robotic technologies in learning (Mikropoulos, 2018). Their mission was to help learners study conveniently and efficiently. Today, the impact of AI on the education sector is concentrated in the fields of online learning, task automation, and personalized learning (Chen, Chen and Lin, 2020). The COVID-19 pandemic has recently drawn attention to AI and its role in facilitating online learning and other virtual educational programs. This paper seeks to establish the possible impact of artificial intelligence on the education sector from the perspectives of teachers and learners.

Technology has transformed the education sector in unique ways, and AI is no exception. As highlighted above, AI is a relatively new area of technological development that has attracted global interest in academic and teaching circles. Increased awareness of the benefits of AI in the education sector and the integration of high-performance computing systems in administrative work have accelerated the pace of transformation in the field (Fengchun et al., 2021). This change has affected different facets of learning to the extent that government agencies and companies are looking to replicate the same success in their respective fields (IBM, 2020). However, while the advantages of AI are widely reported in the corporate scene, few people understand its impact on the interactions between students and teachers. This research gap can be filled by understanding the impact of AI on the education sector as a holistic learning ecosystem.

As these gaps are addressed, AI is contributing to the growth of the education sector. In particular, it has increased the number of online learning platforms that use big data intelligence systems (Chen, Chen and Lin, 2020). This outcome has been achieved by exploiting opportunities in big data analysis to enhance educational outcomes (IBM, 2020). Overall, the positive contributions of AI mean that it has expanded opportunities for growth and development in the education sector (Rexford, 2018). Teachers are therefore likely to benefit from the increased opportunities for learning and growth that would emerge from the adoption of AI in the education system.

The impact of AI on teachers can be estimated by examining its effects on the learning environment. Some of the positive outcomes that teachers have associated with AI adoption include increased work efficiency, expanded opportunities for career growth, and an improved rate of innovation adoption (Chen, Chen and Lin, 2020). These benefits are achievable because AI makes it possible to automate learning activities. This process gives teachers the freedom to complete supplementary tasks that support their core activities. At the same time, that freedom may be used to enhance creativity and innovation in their teaching practice. Despite the positive outcomes of AI adoption in learning, it may undermine the relevance of teachers as educators (Fengchun et al., 2021). This concern is shared among educators because the increased reliance on robotics and automation through AI adoption has created conditions for learning to occur without human input. Therefore, there is a risk that teacher participation may be replaced by machine input.

Performance evaluation emerges as a critical area where teachers can benefit from AI adoption. This outcome is feasible because AI empowers teachers to monitor the behaviors of their learners and the differences in their scores over a specific period (Mikropoulos, 2018). This comparative analysis is achievable using advanced data management techniques in AI-backed performance appraisal systems (Fengchun et al., 2021). Researchers have used these systems to enhance adaptive group formation programs, in which groups of students are formed based on a balance of the strengths and weaknesses of their members (Live Tiles, 2021). The information collected using AI-backed data analysis techniques can be recalibrated to capture different types of data. For example, teachers have used AI to understand students’ learning patterns and how these patterns correlate with individual understanding of learning concepts (Rexford, 2018). Furthermore, advanced biometric techniques in AI have made it possible for teachers to assess their students’ learning attentiveness.

Overall, the contributions of AI to the teaching practice empower teachers to redesign their learning programs to fill the gaps identified in performance assessments. Employing the capabilities of AI in their teaching programs has also made it possible to personalize curriculums and empower students to learn more effectively (Live Tiles, 2021). Nonetheless, the benefits of AI to teachers could be undermined by the possibility of job losses due to the replacement of human labor with machines and robots (Gulson et al., 2018). These fears are yet to materialize, but indications suggest that AI adoption may elevate the importance of machines above that of human beings in learning.

The benefits of AI to teachers can be replicated in student learning because learners are the recipients of the teaching strategies adopted by teachers. In this regard, AI has created unique benefits for different groups of learners based on the supportive role it plays in the education sector (Fengchun et al., 2021). For example, it has created the conditions necessary for the use of virtual reality in learning. This development has given students the opportunity to learn at their own pace, which has enhanced their learning experiences given their varied learning speeds (Live Tiles, 2021). The creation of virtual reality through AI has also played a significant role in promoting equality in learning by adapting to different learning needs (Live Tiles, 2021). For example, it has helped students to better track their performance at home and identify areas of improvement in the process. In this regard, the adoption of AI in learning has allowed for the customization of learning styles to improve students’ attention and involvement in learning.

AI also benefits students by personalizing education activities to suit different learning styles and competencies. In this analysis, AI holds the promise of delivering personalized learning at scale by customizing the tools and features of learning in contemporary education systems (du Boulay, 2016). Personalized learning offers several benefits to students, including a reduction in learning time, increased levels of engagement with teachers, improved knowledge retention, and increased motivation to study (Fengchun et al., 2021). The presence of these benefits means that AI enriches students’ learning experiences. Furthermore, AI holds the promise of expanding educational opportunities for people who would otherwise have been unable to access learning. For example, students with disabilities have often been unable to access the same quality of education as their peers. Today, technology has made it possible for these underserved learners to access education services.

Based on the findings highlighted above, AI has made it possible to customize education services to suit the needs of unique groups of learners. By extension, AI has made it possible for teachers to select the most appropriate teaching methods to use for these student groups (du Boulay, 2016). Teachers have reported positive outcomes of using AI to meet the needs of these underserved learners (Fengchun et al., 2021). For example, through online learning, some of them have learned to be more patient and tolerant when interacting with disabled students (Fengchun et al., 2021). AI has also made it possible to integrate the educational and curriculum development plans of disabled and mainstream students, thereby standardizing the education outcomes across the divide. Broadly, these statements indicate that the expansion of opportunities via AI adoption has increased access to education services for underserved groups of learners.

Overall, AI holds the promise to solve most educational challenges that affect the world today. UNESCO (2021) affirms this statement by saying that AI can address most problems in learning through innovation. Therefore, there is hope that the adoption of new technology would accelerate the process of streamlining the education sector. This outcome could be achieved by improving the design of AI learning programs to make them more effective in meeting student and teachers’ needs. This contribution to learning will help to maximize the positive impact and minimize the negative effects of AI on both parties.

The findings of this study demonstrate that the application of AI in education has a largely positive impact on students and teachers. The positive effects can be summarized as follows: improved access to education for underserved populations, improved teaching practices and instructional learning, and enhanced enthusiasm among students to stay in school. Despite these positive views, negative outcomes have also been highlighted in this paper. They include the potential for job losses, an increase in education inequalities, and the high cost of installing AI systems. These concerns are relevant to the adoption of AI in the education sector, but the benefits of integration outweigh them. Therefore, more support should be given to educational institutions that intend to adopt AI. Overall, this study demonstrates that AI is beneficial to the education sector. It will improve the quality of teaching, help students to understand knowledge quickly, and spread knowledge via the expansion of educational opportunities.

Chen, L., Chen, P. and Lin, Z. (2020) ‘Artificial intelligence in education: a review’, Institute of Electrical and Electronics Engineers Access, 8(1), pp. 75264–75278.

du Boulay, B. (2016) ‘Artificial intelligence as an effective classroom assistant’, Institute of Electrical and Electronics Engineers Intelligent Systems, 31(6), pp. 76–81.

Fengchun, M. et al. (2021) AI and education: a guide for policymakers. Paris: UNESCO Publishing.

Gulson, K. et al. (2018) Education, work and Australian society in an AI world. Web.

IBM (2020) Artificial intelligence. Web.

Live Tiles (2021) 15 pros and 6 cons of artificial intelligence in the classroom. Web.

Mikropoulos, T. A. (2018) Research on e-Learning and ICT in education: technological, pedagogical and instructional perspectives. New York, NY: Springer.

Rexford, J. (2018) The role of education in AI (and vice versa). Web.

Seo, K. et al. (2021) ‘The impact of artificial intelligence on learner–instructor interaction in online learning’, International Journal of Educational Technology in Higher Education, 18(54), pp. 1–12.

UNESCO (2021) Artificial intelligence in education. Web.




Mike Sharples

May 17th, 2022: New AI tools that can write student essays require educators to rethink teaching and assessment


AI tools are available today that can write compelling university-level essays. Taking as an example a sample essay produced by the GPT-3 transformer, Mike Sharples discusses the implications of this technology for higher education and argues that such tools should be used to enhance pedagogy, rather than accelerating an ongoing arms race between increasingly sophisticated fraudsters and fraud detectors.

With all the focus in artificial intelligence and education on AI personalised learning and tutoring systems, one AI technology has slipped by unnoticed. It could become a gift for student cheats, or a powerful teaching assistant, or a tool for creativity. It’s called a Transformer, it acts as a universal language tool and it is set to disrupt education. Here’s how.

Take a minute to read this graduate student essay on the controversial topic of “learning styles”. If this were your student, how would you assess the essay? What mark would you give? What are its strengths and weaknesses?

Learning Styles

The construct of “learning styles” is problematic because it fails to account for the processes through which learning styles are shaped. Some students might develop a particular learning style because they have had particular experiences. Others might develop a particular learning style by trying to accommodate to a learning environment that was not well suited to their learning needs. Ultimately, we need to understand the interactions among learning styles and environmental and personal factors, and how these shape how we learn and the kinds of learning we experience.

Learning styles and learning modes

In the 1970s, learning styles became an important feature of educational practice. In the 1990s, “learning modalities” became a major focus of learning style research. This is the idea that we process information in different ways. For example, some individuals may be auditory learners, benefiting from listening to lectures and using recorded materials. Others may be visual learners, who gain from using visual materials such as textbooks, handouts, and computer graphics. Still others may be kinesthetic learners, who benefit from learning by movement and doing. Numerous experiments have been conducted to test this idea, and a number of different classifications of learning modalities have been proposed.

In a review of this research, Dunn and Dunn (1997) found that the evidence does not support the idea that people have a particular learning mode that is dominant or that they process information in a particular way. In their study, the authors found that when students were given a particular instructional treatment, they could be classified in a particular learning modality category. But when they were given a different instructional treatment, they were classified in a different category. In a follow-up study in which they gave students a number of instructional treatments, Dunn and Dunn found that students could be placed in a particular category in some treatments, but not others.

Dunn and Dunn also found that the results varied by age. Older students generally fit the results of more than one modality, whereas younger students were more likely to fit the results of just one modality. This led them to conclude that style preferences are learned over the course of development. This is consistent with the idea that people are flexible in responding to the way they are taught.

Dunn, R., & Dunn, K. (1997). Learning styles: Research and practice. Journal of Research in Education, 7(2), 139-151.

The entire “student essay” was written by a Transformer AI program, GPT-3. I gave it the prompt “The construct of ‘learning styles’ is problematic because” and it produced the rest, including headings and references. Plagiarism software will not detect essays written by Transformers, because the text is generated, not copied. A Google search of the essay shows that each sentence is original.

Note its reference for “Dunn and Dunn (1997)”. There is a journal of Research in Education, but no issue 7(2) in 1997. Dunn and Dunn did publish on learning styles, but not in that journal. GPT-3 has fashioned a plausible-looking but fake reference. It has also invented a fictitious research study from Dunn and Dunn to critique learning styles.
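One practical response to such fabricated references (echoed by commenters further down) is simply to check each citation against a bibliographic index. The sketch below is not from the original post; it is a minimal illustration, assuming Python with the requests library, that queries Crossref’s public works search to see whether anything resembling the cited work exists:

    # Illustrative sketch: look up a suspicious citation against the Crossref works API.
    # The citation string is the fabricated Dunn and Dunn reference from the essay above.
    import requests

    citation = ("Dunn & Dunn (1997) Learning styles: research and practice. "
                "Journal of Research in Education, 7(2)")

    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation, "rows": 3},
        timeout=10,
    )
    resp.raise_for_status()

    for item in resp.json()["message"]["items"]:
        title = item.get("title", ["(no title)"])[0]
        year = item.get("issued", {}).get("date-parts", [[None]])[0][0]
        print(year, "-", title)
    # If none of the closest matches line up with the claimed authors, journal and year,
    # the reference deserves a closer look.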

Anyone can sign up for an account with the GPT-3 Transformer then go to its “Playground”, type the opening words of an assignment, press Submit and let it do the rest. A student can now generate an entire essay or assignment in seconds, at a cost of around 1 US cent. Other companies and universities are developing Transformer systems that write in multiple languages and learn as they go.
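For readers who want to try this programmatically rather than through the Playground, the workflow is roughly as follows. This is a minimal sketch, assuming the OpenAI Python client as it existed for the GPT-3 Completions API at the time of the post; the engine name and maximum length mirror the settings described in the comments below, and the API key is a placeholder:

    # Minimal sketch: essay continuation with the GPT-3 Completions API (circa 2022 client).
    import openai

    openai.api_key = "YOUR_API_KEY"  # placeholder

    prompt = "The construct of 'learning styles' is problematic because"
    response = openai.Completion.create(
        engine="davinci",   # the Davinci engine mentioned in the comments
        prompt=prompt,
        max_tokens=2000,    # the "maximum length" setting described below
    )
    print(prompt + response.choices[0].text)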

Some Transformers, such as GPT-3, can filter bad language. But they are essentially amoral. They don’t know what they have written – they can’t reflect on their output. Can a powerful but amoral text generator be used for good in education? Here are some suggestions.

Just as a student can generate an essay in seconds, so a teacher can assess it. Add “Here is a short assessment of this student essay:” to the end of the text and GPT-3 writes a review.

The essay discusses the idea of learning styles, and how they may be developed over time. It cites research by Dunn and Dunn which suggests that learning styles are not fixed, and that people can learn in different ways depending on the situation.

The essay is well-organized and flows smoothly. It makes a clear argument and supports it with evidence from research.

One potential criticism is that the essay does not discuss the implications of learning styles research, or how this research can be used to improve learning. However, this is not necessarily a flaw, as the essay is only meant to be a brief overview of the topic.

It does a reasonable job. The program transforms surface text into a review but it doesn’t get to the essay’s meaning, nor does it find the fake reference or the made-up research study.
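In code, the review step described above is just one more completion call, with the essay plus the assessment cue as the prompt. Again, a sketch under the same assumptions as the earlier example; the essay text and token limit are placeholders:

    # Sketch: asking GPT-3 to assess an essay by appending an assessment cue.
    import openai

    openai.api_key = "YOUR_API_KEY"  # placeholder

    essay_text = "..."  # the full student essay goes here
    review_prompt = essay_text + "\n\nHere is a short assessment of this student essay:"

    review = openai.Completion.create(
        engine="davinci",
        prompt=review_prompt,
        max_tokens=300,
    )
    print(review.choices[0].text.strip())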

Students will employ AI to write assignments. Teachers will use AI to assess them. Nobody learns, nobody gains. If ever there were a time to rethink assessment, it’s now. Instead of educators trying to outwit AI Transformers, let’s harness them for learning.


First, Transformers can quickly show students different ways to express ideas and structure assignments. A teacher can run a classroom exercise to generate a few assignments on a topic, then get students to critique them and write their own better versions.

Second, AI Transformers can be creativity tools. Each student writes a short story with an AI. The student writes the first paragraph, AI continues with the second, and so on. It’s a good way to explore possibilities and overcome writer’s block.

Third, teachers can explore the ethics and limits of AI. How does it feel to interact with an expert wordsmith that has no morals and no experience of the world? Does a “deep neural network” have a mind, or is it just a big data processor?

Finally, as educators, if we are setting students assignments that can be answered by AI Transformers, are we really helping students learn? There are many better ways to assess for learning : constructive feedback, peer assessment, teachback. If Transformer AI systems have a lasting influence on education, maybe that will come from educators and policy makers having to rethink how to assess students, away from setting assignments that machines can answer, towards assessment for learning.

For more on AI Transformers and computers as story generators, see Mike Sharples and Rafael Pérez y Pérez, Story Machines: How Computers Have Become Creative Writers, to be published by Routledge in July 2022.


About the author


Mike Sharples is Emeritus Professor of Educational Technology at The Open University, UK. His research involves human-centred design of new technologies and environments for learning. His recent books are Practical Pedagogy: 40 New Ways to Teach and Learn, and Story Machines: How Computers Have Become Creative Writers, published by Routledge.

38 Comments

Many thanks Mike, a really interesting and thought-provoking piece. I wonder if you’d be able to share the settings you used on GPT-3 to generate the essay above? I’ve not been able to reproduce anything close using the same prompt, which I’m sure is due to my lack of knowledge about the technology. Thanks.

Glad you like the piece.

I used the standard settings with the Davinci GPT-3 engine, apart from setting “maximum length” to 2000. It generated the text as shown, up to “they are taught.”. I then appended “References” and GPT-3 added the reference, as shown.

Thanks so much Mike! Really interesting. I wouldn’t expect to get the same result as you – I guess that’s the whole point. GPT-3 should be “creating” an original answer for anyone who puts in the same prompt (and students are unlikely to use the same prompt anyway).

Some of the first attempts gave me a pretty underwhelming attempt at an essay that might just pass for something a 14-year old could write, but nothing at UG or PG level. So I carried on playing around.

I’ve since reset all settings and have set maximum length to 2000 as per your query, and am getting the following:

“The concept of ‘learning styles’ is problematic because it is based on the idea that there is a single way to learn that is optimal for each individual. However, research has shown that there is no evidence to support the existence of learning styles. Furthermore, the idea of learning styles can actually be harmful, as it can lead students to believe that they cannot learn in any other way than their preferred style. This can lead to students feeling discouraged and unmotivated when they are faced with material that they find difficult to learn using their preferred style.”

Even some tweaking of settings doesn’t seem to encourage GPT-3 to give me any more than this. I wonder how much it might depend on my previous use of Playground, and whether I can “train” GPT-3 to give me “better” responses in future?

I’ll be really interested to see if anyone else tries this and what they get. Thanks so much for getting me thinking about this Mike!

The latest version of GPT-3 has been trained to accept instructions, such as “Write a student essay on the topic of ‘A critique of learning styles'”. I was using an earlier version trained for text continuation. You could either try with the earlier version (it should still be available on the OpenAI Playground), or try giving the newer version a direct instruction.

Thank you so much Mike for the insight. It is interesting to realise that the same entry, repeated with exactly the same wording, generates a different response each time.

Hi Mike – This is truly fascinating (and of course scary). I particularly liked your idea of using GPT-3 as a tool to teach students creative writing and critiquing academic writing. I created an account in GPT-3 and I must be doing something wrong because I am not able to get beyond the tutorials. I’ll keep trying.

You need to go to the API, then Playground.

Thanks Mike. That worked.


Hi Mike, I really like the way you’ve repositioned the debate.

I was inspired to have a go this morning, with a primary education creative writing focus.

My (partially) successful results are: https://www.linkedin.com/posts/activity-6940252465426571265-0u-M?utm_source=linkedin_share&utm_medium=member_desktop_web

Thanks Mike, how do we know this post wasn’t written with AI? And the responses in the comments generated by bots?

This has already gone mainstream. YouMakr.com is a tool which helps students with their writing assignment and has already gone viral in many countries globally. They are on track for a billion $ valuation

Thank you for this! I think you’ve really helped frame an important discussion about using LLM transformers to help students learn. However, I am curious about what sort of essays or writing in general we as educators could assign that a GPT-3 could not eventually answer?

I think Google’s LaMDA transformer contains nearly tenfold the amount of data engineers used with GPT-3, even causing the bizarre Google employee episode where he tried to convince folks that it had become sentient (it isn’t, of course). However, as transformers develop and progress I don’t really know what sort of assignments we can come up with that only a human could write.

Even SudoWrite’s algorithm can do a fairly decent job of mimicking phrases and moments of empathy, so I assume that transformers will be able to tackle creative writing one day.

Maybe the key is to continually engage students through co-writing with AI and let them practice critical thinking, self assessment, and reflection by emphasizing that an AI, when used effectively, can help an immature writer’s process or even a mature writer who suffers from decision fatigue.

Inspiring stuff Mike, and your perspectives that move away from the sensationalist approach of the negative connotations of AI in education are refreshing to see. I have taken a look at the tools you used and tried them out for myself. The one thing I am not able to reproduce is the length of the piece you were able to get the AI to produce. I seem only able to get one paragraph from the tool.


We are doing some exploratory research on views related to AI writing tools in education. Please share your thoughts and consider sending it onward.

We are using a tool called Polis, where you can vote on individual statements about the topic, see a visualisation of where your position sits in relation to others who voted, and you can also add your own perspective for others to vote on.

https://pol.is/7ncmuk4ume


There’s a very simple way to control the abuse of “AI” to write student essays – the personal tutorial where the student has to read their essay out and be questioned on it. This has worked well in the past. Maybe it’s time to revive it.

In an attempt to cut down on plagiarism and purchased papers, I revised my assignments so that they were both scaffolded and required the use of assigned sources. When outside research is required, students must justify the reliability of the sources. Requiring regular annotated responses to the readings also gives me the ability to see when essays seem to be in line with student work. Based on my limited exploration of the app, that approach also seems to address the problems raised by the new technology.

The problem is that it requires a lot more grading on my part than the traditional exam-essay assessment approach, but I don’t see that very much anymore.

Suppose students were asked to integrate their own personal experiences with learning styles (or any topic of an assignment) and specific examples from different points in their life where they learned to learn as they do, how would AI handle that? Since our individual experiences are points on a distribution captured in research data, could application of research to understand experience help?

Mike, when I tried this on different topics with “citations” requested, it produced all fake citations. Plausible looking, but fake. A student doing that would likely fail! Or at the least be in for a grilling and a stern warning. So, I don’t think ChatGPT is much of a threat as a source of academic misconduct. Just ask for references and check them to catch chancers taken in by the hype.


Can we please stop calling Turnitin etc “Plagiarism software”!? Thanks David Callaghan

Excellent points and thank you Mike for bringing up the topic of AI tools that can write university level essays. I am fascinated by the potential impact this technology could have on higher education. I appreciate your argument that these tools should be used to enhance pedagogy, rather than accelerating cheating and fraud.

It’s interesting to see how the GPT-3 Transformer AI program was able to generate a compelling essay on the topic of “learning styles”. The essay provides a well-organized and evidence-based argument, despite being generated by an AI language model. I particularly like how it explores the idea that learning styles are not fixed, and can be influenced by personal and environmental factors.

Although there are concerns about how these AI tools could be misused for cheating, there is potential for them to be used as a powerful teaching assistant or tool for creativity. I hope that these tools will be used responsibly and ethically to enhance the learning experience, rather than undermine it.





Published: 24 February 2023

Artificial intelligence in academic writing: a paradigm-shifting technological advance

Roei Golan, Rohit Reddy, Akhil Muthigi & Ranjith Ramasamy

Nature Reviews Urology, volume 20, pages 327–328 (2023)


Artificial intelligence (AI) has rapidly become one of the most important and transformative technologies of our time, with applications in virtually every field and industry. Among these applications, academic writing is one of the areas that has experienced perhaps the most rapid development and uptake of AI-based tools and methodologies. We argue that use of AI-based tools for scientific writing should widely be adopted.



Acknowledgements

The manuscript was edited for grammar and structure using the advanced language model ChatGPT. The authors thank S. Verma for addressing inquiries related to artificial intelligence.

Author information

These authors contributed equally: Roei Golan, Rohit Reddy.

Authors and Affiliations

Department of Clinical Sciences, Florida State University College of Medicine, Tallahassee, FL, USA

Desai Sethi Urology Institute, University of Miami Miller School of Medicine, Miami, FL, USA

Rohit Reddy, Akhil Muthigi & Ranjith Ramasamy


Corresponding author

Correspondence to Ranjith Ramasamy .

Ethics declarations

Competing interests

R.R. is funded by the National Institutes of Health Grant R01 DK130991 and the Clinician Scientist Development Grant from the American Cancer Society. The other authors declare no competing interests.

Additional information

Related links

ChatGPT: https://chat.openai.com/

Cohere: https://cohere.ai/

CoSchedule Headline Analyzer: https://coschedule.com/headline-analyzer

DALL-E 2: https://openai.com/dall-e-2/

Elicit: https://elicit.org/

Penelope.ai: https://www.penelope.ai/

Quillbot: https://quillbot.com/

Semantic Scholar: https://www.semanticscholar.org/

Wordtune by AI21 Labs: https://www.wordtune.com/

Writefull: https://www.writefull.com/


Golan, R., Reddy, R., Muthigi, A. et al. Artificial intelligence in academic writing: a paradigm-shifting technological advance. Nat Rev Urol 20, 327–328 (2023). https://doi.org/10.1038/s41585-023-00746-x




Artificial intelligence in education

Artificial Intelligence (AI) has the potential to address some of the biggest challenges in education today, innovate teaching and learning practices, and accelerate progress towards SDG 4. However, rapid technological developments inevitably bring multiple risks and challenges, which have so far outpaced policy debates and regulatory frameworks. UNESCO is committed to supporting Member States to harness the potential of AI technologies for achieving the Education 2030 Agenda, while ensuring that its application in educational contexts is guided by the core principles of inclusion and equity.

UNESCO’s mandate calls inherently for a human-centred approach to AI. It aims to shift the conversation to include AI’s role in addressing current inequalities regarding access to knowledge, research and the diversity of cultural expressions, and to ensure AI does not widen the technological divides within and between countries. The promise of “AI for all” must be that everyone can take advantage of the technological revolution under way and access its fruits, notably in terms of innovation and knowledge.

Furthermore, within the framework of the Beijing Consensus, UNESCO has developed a publication aimed at fostering the readiness of education policy-makers in artificial intelligence. This publication, Artificial Intelligence and Education: Guidance for Policy-makers, will be of interest to practitioners and professionals in the policy-making and education communities. It aims to generate a shared understanding of the opportunities and challenges that AI offers for education, as well as its implications for the core competencies needed in the AI era.


Through its projects, UNESCO affirms that the deployment of AI technologies in education should be purposed to enhance human capacities and to protect human rights for effective human-machine collaboration in life, learning and work, and for sustainable development. Together with partners, international organizations, and the key values that it holds as pillars of its mandate, UNESCO hopes to strengthen its leading role in AI in education, as a global laboratory of ideas, standard setter, policy advisor and capacity builder. If you are interested in leveraging emerging technologies like AI to bolster the education sector, we look forward to partnering with you through financial, in-kind or technical advice contributions.

'We need to renew this commitment as we move towards an era in which artificial intelligence – a convergence of emerging technologies – is transforming every aspect of our lives (…),' said Ms Stefania Giannini, UNESCO Assistant Director-General for Education, at the International Conference on Artificial Intelligence and Education held in Beijing in May 2019. 'We need to steer this revolution in the right direction, to improve livelihoods, to reduce inequalities and promote a fair and inclusive globalization.'


Artificial intelligence is getting better at writing, and universities should worry about plagiarism


Assistant Professor, Faculty of Education, Brock University


Educational Leader in Residence, Academic Integrity and Assistant Professor, University of Calgary


The dramatic rise of online learning during the COVID-19 pandemic has spotlit concerns about the role of technology in exam surveillance — and also in student cheating.

Some universities have reported more cheating during the pandemic, and such concerns are unfolding in a climate where technologies that allow for the automation of writing continue to improve.

Over the past two years, the ability of artificial intelligence to generate writing has leapt forward significantly, particularly with the development of what’s known as the language generator GPT-3. With this, companies such as Google, Microsoft and NVIDIA can now produce “human-like” text.

AI-generated writing has raised the stakes of how universities and schools will gauge what constitutes academic misconduct, such as plagiarism. As scholars with an interest in academic integrity and the intersections of work, society and educators’ labour, we believe that educators and parents should be, at the very least, paying close attention to these significant developments.

AI & academic writing

The use of technology in academic writing is already widespread. For example, many universities already use text-based plagiarism detectors like Turnitin, while students might use Grammarly, a cloud-based writing assistant. Examples of writing support include automatic text generation, extraction, prediction, mining, form-filling, paraphrasing, translation and transcription.

Read more: In an AI world we need to teach students how to work with robot writers

Advancements in AI technology have led to new tools, products and services being offered to writers to improve content and efficiency. As these improve, soon entire articles or essays might be generated and written entirely by artificial intelligence. In schools, the implications of such developments will undoubtedly shape the future of learning, writing and teaching.


Misconduct concerns already widespread

Research has revealed that concerns over academic misconduct are already widespread across higher education institutions in Canada and internationally.

In Canada, there is little data regarding the rates of misconduct. Research published in 2006, based on data from mostly undergraduate students at 11 higher education institutions, found that 53 per cent reported having engaged in one or more instances of serious cheating on written work, defined as copying material without footnoting, copying material almost word for word, submitting work done by someone else, fabricating or falsifying a bibliography, or submitting a paper they either bought or got from someone else for free.

Academic misconduct is in all likelihood under-reported across Canadian higher education institutions.

There are different types of violations of academic integrity, including plagiarism, contract cheating (where students hire other people to write their papers) and exam cheating, among others.

Unfortunately, with technology, students can use their ingenuity and entrepreneurialism to cheat. These concerns are also applicable to faculty members, academics and writers in other fields, bringing new concerns surrounding academic integrity and AI such as:

  • If a piece of writing was 49 per cent written by AI, with the remaining 51 per cent written by a human, is this considered original work?
  • What if an essay was 100 per cent written by AI, but a student did some of the coding themselves?
  • What qualifies as “AI assistance” as opposed to “academic cheating”?
  • Do the same rules apply to students as they would to academics and researchers?

We are asking these questions in our own research, and we know that in the face of all this, educators will be required to consider how writing can be effectively assessed or evaluated as these technologies improve.

Augmenting or diminishing integrity?

At the moment, little guidance, policy or oversight is available regarding technology, AI and academic integrity for teachers and educational leaders.

Over the past year, COVID-19 has pushed more students towards online learning — a sphere where teachers may become less familiar with their own students and thus, potentially, their writing.

While it remains impossible to predict the future of these technologies and their implications in education, we can attempt to discern some of the larger trends and trajectories that will impact teaching, learning and research.

Technology & automation in education

A key concern moving forward is the apparent movement towards the increased automation of education where educational technology companies offer commodities such as writing tools as proposed solutions for the various “problems” within education.

An example of this is automated assessment of student work, such as automated grading of student writing. Numerous commercial products already exist for automated grading, though the ethics of these technologies are yet to be fully explored by scholars and educators.

Read more: Online exam monitoring can invade privacy and erode trust at universities

Overall, the traditional landscape surrounding academic integrity and authorship is being rapidly reshaped by technological developments. Such technological developments also spark concerns about a shift of professional control away from educators and ever-increasing new expectations of digital literacy in precarious working environments.

Read more: Precarious employment in education impacts workers, families and students

These complexities, concerns and questions will require further thought and discussion. Educational stakeholders at all levels will be required to respond and rethink definitions as well as values surrounding plagiarism, originality, academic ethics and academic labour in the very near future.

The authors would like to sincerely thank Ryan Morrison, from George Brown College, who provided significant expertise, advice and assistance with the development of this article.



Automated Essay Scoring and the Deep Learning Black Box: How Are Rubric Scores Determined?

  • Published: 15 September 2020
  • Volume 31, pages 538–584 (2021)

Vivekanandan S. Kumar & David Boulanger


This article investigates the feasibility of using automated scoring methods to evaluate the quality of student-written essays. In 2012, Kaggle hosted an Automated Student Assessment Prize contest to find effective solutions to automated testing and grading. This article: a) analyzes the datasets from the contest – which contained hand-graded essays – to measure their suitability for developing competent automated grading tools; b) evaluates the potential for deep learning in automated essay scoring (AES) to produce sophisticated testing and grading algorithms; c) advocates for thorough and transparent performance reports on AES research, which will facilitate fairer comparisons among various AES systems and permit study replication; d) uses both deep neural networks and state-of-the-art NLP tools to predict finer-grained rubric scores, to illustrate how rubric scores are determined from a linguistic perspective, and to uncover important features of an effective rubric scoring model. This study’s findings first highlight the level of agreement that exists between two human raters for each rubric as captured in the investigated essay dataset, that is, 0.60 on average as measured by the quadratic weighted kappa (QWK). Only one related study has been found in the literature which also performed rubric score predictions through models trained on the same dataset. At best, the predictive models had an average agreement level (QWK) of 0.53 with the human raters, below the level of agreement among human raters. In contrast, this research’s findings report an average agreement level per rubric with the two human raters’ resolved scores of 0.72 (QWK), well beyond the agreement level between the two human raters. Further, the AES system proposed in this article predicts holistic essay scores through its predicted rubric scores and produces a QWK of 0.78, a competitive performance according to recent literature where cutting-edge AES tools generate agreement levels between 0.77 and 0.81, results computed as per the same procedure as in this article. This study’s AES system goes one step further toward interpretability and the provision of high-level explanations to justify the predicted holistic and rubric scores. It contends that predicting rubric scores is essential to automated essay scoring, because it reveals the reasoning behind AIED-based AES systems. Will building AIED accountability improve the trustworthiness of the formative feedback generated by AES? Will AIED-empowered AES systems thoroughly mimic, or even outperform, a competent human rater? Will such machine-grading systems be subjected to verification by human raters, thus paving the way for a human-in-the-loop assessment mechanism? Will trust in new generations of AES systems be improved with the addition of models that explain the inner workings of a deep learning black box? This study seeks to expand these horizons of AES to make the technique practical, explainable, and trustable.


Introduction

Recent advances in deep learning and natural language processing (NLP) have challenged automated testing and grading methods to improve their performance and to harness valuable hand-graded essay datasets – such as the free Automated Student Assessment Prize (ASAP) datasets – to accurately measure performance. Presently, reports about the performance of automated essay scoring (AES) systems commonly – and perhaps inadvertently – lack transparency. Such ambiguity in research outcomes of AES techniques hinders performance evaluations and comparative analyses of techniques. This article argues that AES research requires proper protocols to describe methodologies and to report outcomes. Additionally, the article reviews state-of-the-art AES systems assessed using ASAP’s seventh dataset to: a) underscore features that facilitate reasonable evaluation of AES performances; b) describe cutting-edge natural language processing tools, explaining the extent to which writing metrics can now capture and indicate performance; c) predict rubric scores using six different feature-based multi-layer perceptron deep neural network architectures and compare their performance; and d) assess the importance of the features present in each of the rubric scoring models.

The following section provides background information on the datasets used in this study that are also extensively exploited by the research community to train and evaluate AES systems. The third section synthesizes relevant literature about recent developments in AES, compares contemporary AES systems, and evaluates their features. The fourth section examines methodologies that support finer-grained rubric score prediction. The fifth and sixth sections explore the distribution of holistic and rubric scores, delineate the performance of naïve and “smart” deep/shallow neural network predictors, and discuss implications. The seventh section initiates a discussion on the linguistic aspects considered by the rubric scoring models and how each rubric scoring model differs from each other. Finally, the last section summarizes conclusions, highlights limitations, and discusses next stages of AES research.

Background: The Automated Student Assessment Prize

In 2012, the Hewlett-Packard Foundation funded an Automated Student Assessment Prize (ASAP) contest to evaluate both the progress of automated essay scoring and its readiness to be implemented across the United States in state-wide writing assessments (Shermis 2014). Kaggle collected eight essay datasets from state-wide assessments of student-written essays – which Grade 7 to Grade 10 students from six different states in the USA had written. Kaggle then subcontracted commercial vendors to grade the essays adhering to a thorough scoring process.

Each essay dataset originated from a single assessment for a specific grade (7–10) in a specific state. The ASAP contest asked participants to develop AES systems to automatically grade the essays in the database and report on the level of agreement between the machine grader and human graders, measured by the quadratic weighted kappa. This article argues that the performance comparison process was neither effective nor balanced since, as Table 1 demonstrates, each dataset had a unique underlying writing construct. AES performance should therefore be analyzed per writing task rather than globally.

Both commercial vendors and data scientists from academia participated in the contest. Officials determined the winners based on the average quadratic weighted kappa value on all eight essay datasets. While this measure was useful for contest purposes, it does not offer a transparent account of research processes and results. For instance, it has been shown that more interpretable and trustworthy models can be less accurate (Ribeiro et al. 2016 ). Following the publication of the contest results (Shermis 2014 ), Perelman ( 2013 , 2014 ) warned against swift conclusions that AES could perform better than human graders simply because it surpassed the level of agreement among human graders. For example, Perelman ( 2013 , 2014 ) illustrated how one could easily mislead an AES system by submitting meaningless text with a sufficiently large number of words.

The ASAP study design had several pitfalls. For example, none of the essay datasets had an articulated writing construct (Perelman 2013 , 2014 ; Kumar et al. 2017 ) and only essays in datasets 1, 2, 7, and 8 truly tested the writing ability of students. Datasets 1, 2, and 8 had a mean number of words greater than 350 words, barely approaching typical lengths of high-school essays. Finally, only datasets 7 and 8 were hand-graded according to a set of four rubrics.

The eighth essay dataset (D8) stood out from others because 1) it did not suffer from a bias in the way holistic scores were resolved (Perelman 2013, 2014; Kumar et al. 2017), 2) it had the highest mean number of words (622), reflecting a more realistic essay length, 3) the holistic scores had the largest scoring scale computed out of a set of rubric scores (see Table 1), and 4) it had one of the lowest AES mean quadratic weighted kappa values (0.67). Accordingly, D8 seemed both challenging and promising for machine learning and for providing formative feedback to students and teachers. However, a previous study (Boulanger and Kumar 2019) has shown that D8 was insufficient to train an accurate and generalizable feature-based deep learning AES model, because it had both an unbalanced distribution of holistic scores (high-quality essays were clearly under-represented) and a very small sample size (several holistic and rubric scores did not have enough samples to learn from). After the ASAP contest, only the labeled (holistic/rubric scores) training set was made available to the public; the labels of the validation and testing sets were no longer accessible. Thus, the essay sample totals currently available per dataset are less than the numbers listed in Table 1; only 722 essays of D8 were available to train an AES model. These limitations served as a key motivation for this study to target the seventh dataset (D7), which contained 1567 essay samples, despite its mean essay length of only about 171 words (roughly one paragraph). D7 was the only other available dataset whose essays were graded following a grid of scoring rubrics. D7’s holistic scoring scale was 0–30 compared to D8’s 10–60, and D7’s rubric scoring scales were 0–3 compared to D8’s 1–6.

Table 2 (Shermis 2014) shows the level of agreement between the two human graders’ ratings and the resolved scores for all eight datasets. Each essay was scored by two human graders, except for the second dataset where the final score was decided by only one human grader. For D7, the resolved rubric scores were computed by adding the human raters’ rubric scores. Footnote 2 Hence, each human rater gave a score between 0 and 3 for each rubric (Ideas, Organization, Style, and Conventions; see Table 3). Subsequently, the two scores were added together, yielding a rubric score between 0 and 6. Finally, the holistic score was determined according to the following formula: HS = R1 + R2 + R3 + (2 × R4), for a score ranging from 0 to 30. All agreement levels are calculated using the quadratic weighted kappa (QWK). For each essay dataset, the mean quadratic weighted kappa value (AES mean) of the commercial vendors in 2012 is also reported.
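To make the scoring arithmetic concrete, here is a minimal Python sketch of how D7’s resolved rubric and holistic scores could be reproduced from two raters’ rubric scores; the function and variable names are illustrative and are not taken from the ASAP data files.

```python
# Illustrative reconstruction of D7's resolved scores (names are hypothetical).
RUBRICS = ["ideas", "organization", "style", "conventions"]

def resolve_scores(rater1: dict, rater2: dict):
    """Each rater assigns 0-3 per rubric; resolved rubric scores are their sums (0-6)."""
    resolved = {r: rater1[r] + rater2[r] for r in RUBRICS}
    # Holistic score: HS = R1 + R2 + R3 + (2 * R4), yielding a 0-30 scale.
    holistic = (resolved["ideas"] + resolved["organization"]
                + resolved["style"] + 2 * resolved["conventions"])
    return resolved, holistic

resolved, hs = resolve_scores(
    {"ideas": 2, "organization": 2, "style": 3, "conventions": 2},
    {"ideas": 3, "organization": 2, "style": 2, "conventions": 2},
)
print(resolved, hs)  # {'ideas': 5, 'organization': 4, 'style': 5, 'conventions': 4} 22
```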

D7’s writing assessment, intended for Grade-7 students, was of persuasive/narrative/expository type, and had the following prompt:

Write about patience. Being patient means that you are understanding and tolerant. A patient person experiences difficulties without complaining. Do only one of the following: write a story about a time when you were patient OR write a story about a time when someone you know was patient OR write a story in your own way about patience.

Table 3 describes the rubric guidelines that were provided to the two human raters who graded each of the 1567 essays made available in the training set.

Related Work

This section provides detailed analysis of recent advances in automated essay scoring, by examining AES systems trained on ASAP’s datasets. Most of the published research measured and reported their performance as the level of agreement between the machine and human graders, expressed in terms of both the quadratic weighted kappa on ASAP’s D7 and the average agreement level on all eight datasets. Table 13 (see Appendix 1; due to the size of some tables in this article, they have been moved to appendices so they do not interrupt its flow) compares the various methods and parameters used to achieve the reported performances.

One of the most relevant research projects involved experimenting with an AES system based on string kernels (i.e., the histogram intersection string kernel), ν-Support Vector Regression (ν-SVR), and word embeddings (i.e., a bag of super-word embeddings) (Cozma et al. 2018). String kernels measure the similarity between strings by counting the number of common character n-grams. The AES models were trained on the ASAP essay datasets and tested both with and without transfer learning across essay datasets. Transfer learning stores knowledge learned on one task and applies it to another (similar) task in which labeled data is not abundant. Footnote 3 Accordingly, the knowledge from the former task becomes the starting point for the model in the latter task. Footnote 4 The outcomes of this experiment are reported in Table 13.
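To illustrate the histogram intersection string kernel concept mentioned above (this is not Cozma et al.’s implementation), the following sketch builds character n-gram histograms for two strings and sums the element-wise minima; the choice of n = 3 and the toy strings are arbitrary.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Histogram of character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def intersection_kernel(a: str, b: str, n: int = 3) -> int:
    """Histogram intersection string kernel: sum of minimum counts over shared n-grams."""
    ha, hb = char_ngrams(a, n), char_ngrams(b, n)
    return sum(min(ha[g], hb[g]) for g in ha.keys() & hb.keys())

print(intersection_kernel("patience is a virtue", "being patient is hard"))
```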

A second, highly relevant study, using ASAP datasets, investigated how transfer learning could alleviate the need for big prompt-specific training datasets (Cummins et al. 2016 ). The proposed AES model consisted of both an essay rank prediction model and a holistic score prediction model. The AES model was trained based on the differences between the two essays, generating a difference vector. Accordingly, the model predicted which of the two essays had higher quality. Subsequently, a simple linear regression modeled the holistic scores using the ranking data. The process reduced the data requirements of AES systems and improved the performance of the proposed approach, which proved to be competitive.

Thirdly, the notable research by Mesgar and Strube (2018) effectively exhibited how deep learning could help with crafting complex writing indices, such as a neural local coherence model. Their architecture consisted of a convolutional neural network (CNN) layer on top of a long short-term memory (LSTM) recurrent neural network (RNN). It leveraged word embeddings to derive sentence embeddings, which were fed into the coherence model. The coherence model was designed to analyze the semantic flow between adjacent sentences in a text. A vector – which consisted of LSTM weights at a specific point in the sequence – modeled the evolving state of the semantics of a sentence at every word. The two most similar states in each pair of sentences were used to assess the coherence between them and were given a value between 0 and 1, inclusive, where 1 indicated no semantic change and 0 a major change. The CNN layer extracted patterns of semantic changes that correlated with the final writing task.

A two-stage AES model combining feature-based learning and raw text-based learning was tested (Liu et al. 2019) and was found to be able to detect adversarial samples (i.e., essays with permuted sentences, prompt-irrelevant essays). The literature (Perelman 2013, 2014) identifies such samples as a major weakness of AES. In the first stage, three distinct LSTM recurrent neural networks were employed to a) assess the semantics of a text independently of the essay prompt (e.g., through sentence embeddings), b) estimate coherence scores (to detect permuted paragraphs), and c) estimate prompt-relevance scores (to detect whether an essay complies with prompt requirements). These three scores, along with spelling and grammatical features, were input to the second learning stage to predict the final score of the essay.

Another study examined the data constraints related to the deployment of a large-scale AES system (Dronen et al. 2015). Three optimal design algorithms were tested: Fedorov exchange with D-optimality, Kennard-Stone, and K-means. Each optimal design algorithm recommended which student-written essays should be scored by a human or machine (a noteworthy example of the separation of duties among human and AI agents (Abbass 2019)). However, a few hundred essays were required to bootstrap these optimal design algorithms. The goal was to minimize the teacher’s workload while maximizing the information obtained from the human grader to improve accuracy. The three optimal design algorithms were evaluated using ASAP’s eight datasets. Each essay was transformed into a 28-feature vector based on mechanics, grammar, lexical sophistication, and style, extracted by the Intelligent Essay Assessor. The AES system also leveraged a regularized regression model (Ridge regression) to predict essays’ holistic scores. The Fedorov exchange algorithm with D-optimality delivered the best results: for certain datasets, training a model with 30–50 carefully selected essays “yielded approximately the same performance as a model trained with hundreds of essays” (Dronen et al. 2015). Results were reported in terms of Pearson correlation coefficients between the machine and human scores. However, correlation coefficients were not provided for all ASAP essay datasets; for instance, the correlation coefficient for D7 was not included in the report.

A feature-based AES system called SAGE was designed and tested using several machine learning architectures such as linear regression, regression trees, neural networks, random forests, and extremely randomized trees (Zupanc and Bosnić 2017). SAGE was unique in that it incorporated, for the first time, 29 semantic coherence and 3 consistency metrics, in addition to 72 linguistic and content metrics. Interestingly, SAGE appears to have been tested using the original labeled ASAP testing sets made available during the 2012 ASAP contest. Unfortunately, those testing sets are no longer available. Hence, most research on AES simply reports performance on the training sets, which may prevent a fair comparison of performance and technique against the reports of Zupanc and Bosnić (2017). Nevertheless, SAGE distinguishes itself from other systems because it undertook a deeper analysis at the rubric level for the eighth ASAP dataset, investigating D8’s second rubric, ‘Organization’. SAGE’s capacity both to predict the Organization rubric, which will be discussed later in this article, and to leverage metrics related to semantic coherence is of special interest.

Automated essay scoring (AES) comprises few but highly distinct areas of exploration, and significant advances in deep learning have renewed interest in pushing the frontiers of AES. Table 13 shows that most publication years range between 2016 and 2019. The table highlights the latest research endeavors in AES, including their respective algorithms. All models were trained on the ASAP datasets. As mentioned above, this article investigates the underpinnings of AES systems on ASAP’s seventh dataset. Accordingly, it reports both the performance of these models on that dataset and their average performance on all eight datasets. Table 13 shows that Zupanc and Bosnić (2017) reached the highest performance, i.e., a quadratic weighted kappa of 0.881 on the seventh dataset. However, note that they seemingly had access to the original labeled testing sets that were available during the 2012 ASAP competition, which should be factored into efforts to compare their performance against that of other models.

Literature is scarce when it comes to measuring the level of agreement between the machine and the human graders at the rubric score level (Jankowska et al. 2018). A synthesis of rubric-level comparison is presented in the Discussion section below. This research investigates the prediction of ASAP D7’s rubric scores by applying deep learning techniques to a vast range of writing features.

Methodology

Natural Language Processing

The essay samples (1567) were processed by the Suite of Automatic Linguistic Analysis Tools (SALAT) Footnote 5 – GAMET, SEANCE, TAACO, TAALED, TAALES, and TAASSC. Each essay was characterized by a total of 1592 writing features. This study opted to maximize the number of low-level writing features and left the optimal selection of features for the AES model to an automated deep learning mechanism. The commercial AES system called Revision Assistant, developed by Turnitin, demonstrated that automatically selected features are not less interpretable than those engineered by experts (West-Smith et al. 2018; Woods et al. 2017). The following subsections describe the individual SALAT tools and the writing indices they measure, while a later subsection (“How the Tools Were Applied”) describes how these tools were applied.

Grammar and Mechanics Error Tool (GAMET)

GAMET is an extension of the LanguageTool (version 3.2) API that measures structural and mechanical errors. LanguageTool has been demonstrated to have high precision but low recall (e.g., poor recognition of punctuation errors). It can flag a subset of 324 spelling, style, and grammar errors (Crossley et al. 2019a) and classify them into the six macrofeatures listed below:

Grammar: errors related to verb, noun, adjective, adverb, connector, negation, and fragment.

Spelling: deviations from conventional dictionary spellings of words.

Style: wordiness, redundancy, word choice, etc.

Typography: capitalization errors, missing commas and possessive apostrophes, punctuation errors, etc.

White space: inappropriate spacing such as unneeded space (e.g., before punctuation) or missing space.

Duplication: word duplications (e.g., You you have eaten this banana.).

For analysis purposes, these macrofeatures are more efficient than individual microfeatures. The literature shows that automated assessment of spelling accuracy had a higher correlation with human judgments of essay quality than grammatical accuracy, possibly because mechanical errors interfere more directly with meaning and because grammatical errors were only weakly associated with writing quality (correlations below 0.15) (Crossley et al. 2019a).
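As a rough approximation of how such error macrofeatures could be counted (GAMET itself wraps LanguageTool 3.2 and is not reproduced here), the sketch below uses the language_tool_python wrapper, which downloads a local LanguageTool server on first use; the ‘ruleIssueType’ attribute is assumed to carry the issue category and may differ across versions, hence the defensive getattr.

```python
# Approximate GAMET-style error macrofeature counts via the language_tool_python wrapper.
from collections import Counter
import language_tool_python

def error_macrofeatures(essay: str) -> Counter:
    tool = language_tool_python.LanguageTool("en-US")
    matches = tool.check(essay)
    # Bucket flagged issues by their LanguageTool issue type
    # (e.g., grammar, misspelling, style, typographical, whitespace, duplication).
    return Counter(getattr(m, "ruleIssueType", "unknown") for m in matches)

print(error_macrofeatures("You you have eaten this banana ,and it were good."))
```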

Sentiment Analysis and Cognition Engine (SEANCE)

SEANCE is a sentiment analysis tool that calculates more than 3000 indices relying on third-party dictionaries (e.g., SenticNet, EmoLex, GALC, Lasswell, VADER, General Inquirer, etc.) and part-of-speech (POS) tagging, component scores (macrofeatures), and negation rules.

This study configured SEANCE to include only word vectors from the General Inquirer, which encompasses over 11,000 words organized into 17 semantic categories, including semantic dimensions, pleasure, overstatements, institutions, roles, social categories, references to places and objects, communication, motivation, cognition, pronouns, assent and negation, and verb and adjective types.

Since most essays in ASAP’s D7 are not high-quality writings, this study only used the writing indices that were independent of POS. SEANCE includes a smaller set of 20 macrofeatures that combine similar indices from the full set of indices, which were derived by conducting a principal component analysis on a movie review corpus. For more information, please consult Crossley et al. ( 2017 ).

Tool for the Automatic Analysis of Cohesion (TAACO)

TAACO (Crossley et al. 2016 , 2019b ) provides a set of over 150 indices related to local, global, and overall text cohesion. Texts are first lemmatized and grouped per sentence and paragraph before TAACO employs a part-of-speech tagger and synonym sets from the WordNet lexical database to compute cohesion metrics.

TAACO’s indices can be grouped into five categories: connectives, givenness, type-token ratio, lexical overlap, and semantic overlap. Lexical overlap measures the level of local and global cohesion between adjacent sentences and paragraphs. The overlap between sentences or paragraphs is estimated by considering lemmas, content word lemmas, and the lemmas of nouns and pronouns. TAACO not only counts how many sentences or paragraphs overlap, but also assesses how much they overlap. Like lexical overlap, TAACO estimates the degree of semantic overlap between sentences and paragraphs.

TAACO assesses the amount of information that can be recovered from previous sentences, called givenness, and computes counts of various types of pronouns (i.e., first/second/third person pronouns, subject pronouns, quantity pronouns). It calculates the ratio of nouns to pronouns, the numbers of definite articles and demonstratives, and the number and ratio of unique content word lemmas throughout the text. Moreover, TAACO measures the repetition of words and provides indices to measure local cohesion through connectives.

Tool for the Automatic Analysis of Lexical Diversity (TAALED)

TAALED calculates 38 indices of lexical diversity. At the basic level, TAALED counts the number of tokens, the number of unique tokens, the number of tokens that are content words, the number of unique content words, the number of tokens that are function words, and the number of unique function words (6 metrics). Subsequently, it calculates features of lexical diversity and lexical density (Johansson 2009 ).

Lexical diversity metrics include simple, square root, and log type-token ratios (TTR) calculated on the sets of all words, content words, and function words (9 metrics). Lexical density metrics calculate the percentage of content words and the ratio of the number of unique content words over the number of unique tokens (2 metrics).

More complex variants of TTR are provided by TAALED such as the Maas index, which linearizes the TTR curve using log transformation (Fergadiotis et al. 2015 ) (3 metrics); the mean segmental TTR with 50 segments (MSTTR50) (3 metrics); and the more effective moving average TTR with window size of 50 (MATTR50) (Covington and McFall 2010 ) (3 metrics). These variants are all computed in relation to the sets of all words, content words, and function words.

Still more advanced metrics include the hypergeometric distribution’s D index (HD-D 42), which calculates the probability of drawing from the text a certain number of tokens of a particular type from a random sample of 42 tokens (McCarthy and Jarvis 2010 ; Torruella and Capsada 2013 ) (3 metrics).

Finally, TAALED’s features include the original measure of textual lexical diversity (MTLD), which “is calculated as the mean length of sequential word strings in a text that maintain a given TTR value” (McCarthy and Jarvis 2010 ), along with two of its variants, the bidirectional moving average (MTLD-MA-BI) and the wrapping moving average (MTLD-MA-Wrap) (9 metrics).
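For illustration, here is a minimal sketch of three of the lexical diversity indices described above (simple TTR, root TTR, and MATTR with a 50-token window), computed over a whitespace-tokenized text; TAALED’s actual tokenization, lemmatization, and index definitions are more elaborate.

```python
import math

def simple_ttr(tokens):
    # Type-token ratio: number of unique tokens divided by total tokens.
    return len(set(tokens)) / len(tokens)

def root_ttr(tokens):
    # Root TTR: types divided by the square root of the number of tokens.
    return len(set(tokens)) / math.sqrt(len(tokens))

def mattr(tokens, window=50):
    # Moving-average TTR: mean TTR over all windows of a fixed size.
    if len(tokens) <= window:
        return simple_ttr(tokens)
    ttrs = [simple_ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)

tokens = "a patient person experiences difficulties without complaining".split()
print(simple_ttr(tokens), root_ttr(tokens), mattr(tokens))
```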

Tool for the Automatic Analysis of Lexical Sophistication (TAALES)

TAALES (Kyle et al. 2018 ) measures over 400 indices of lexical sophistication related to word and n-gram frequency and range, academic language, psycholinguistic word information, n-gram strength of association, contextual distinctiveness, word recognition norms, semantic network, and word neighbors. Several of these metrics are normed such as word and n-gram frequency and range metrics, which are measured according to the number of word or n-gram occurrences found in large corpora of English writings (i.e., Corpus of Contemporary American English (COCA), British National Corpus (BNC), and Hyperspace Analogue to Language (HAL) corpus) and frequency lists (i.e., Brown, Kucera-Francis, SUBTLEXus, and Thorndike-Lorge). These 268 frequency and range metrics are calculated according to five domains of literature: academic, fiction, magazine, news, and spoken. These metrics allow one to measure the number of times a word or n-gram occurs in a corpus and the number of texts in which it is found.

Fifteen academic language metrics measure the proportions of words or phrases in a text that are frequently found in academic contexts but are less generally used in mainstream language. Using the MRC database (Coltheart 1981 ), psycholinguistic word information (14 metrics) gauges concreteness, familiarity, meaningfulness, and age of acquisition observed in the text.

Further, age of exposure/acquisition values (7 metrics) are derived from the set of words in the Touchstone Applied Science Associates (TASA) corpus, which consists of 13 grade-level textbooks from the USA. This makes it possible to measure the complexity of the words employed within a text and their links to semantic concepts as found in larger corpora.

Word recognition norms (8 metrics) make it possible to estimate the difficulty of processing a given word, such as, the time it takes a person to recognize that a specific word is an English word and the time it takes to read the word aloud. These word recognition scores have been calculated on a bank of 40,481 real words from the English Lexical Project, which includes the response latencies, standard deviations, and accuracies of 816 native English speakers on lexical-decision and word-naming tasks.

Word neighborhood indices (14 metrics) report the similarity of a word in a text to other similar orthographic (words that are formed by changing just one letter), phonographic (words that differ by one letter and one phoneme), and phonological words (words that differ by only one phoneme).

TAALES includes 8 metrics related to contextual distinctiveness, based on the diversity of contexts in which a word occurs. It evaluates how much the words in a text are contextually distinct using free association norms and corpus-driven statistical approaches based on the Edinburgh Associative Thesaurus and the University of South Florida norms.

TAALES provides information (14 metrics) using the WordNet lexical database on the polysemy and hypernymy semantic networks of a word making it possible to measure the number of related senses and the number of superordinate terms that the word has.

TAALES assesses the strength of association within n-grams by computing the conditional probability that the words in bigrams and trigrams in a specific text will occur together based on the n-gram frequency norms derived from large corpora (75 metrics).

Tool for the Automatic Assessment of Syntactic Sophistication and Complexity (TAASSC)

TAASSC (Kyle 2016 ) quantitatively evaluates the syntactic sophistication and complexity of English writing. It calculates 367 indices, grouped into four categories: 14 Syntactic Complexity Analysis (SCA) indices (Lu 2010 ), 31 fine-grained indices on clausal complexity, 132 indices related to fine-grained phrasal complexity, and 190 syntactic sophistication indices.

Basically, the 14 SCA indices are derived from the counts, ratios, and mean lengths of the following syntactic structures within a text: words, verb phrases, complex nominals, coordinate phrases, clauses, dependent clauses, T-units, complex T-units, and sentences. Table 4 provides some definitions of these structures.

TAASSC enhances the previous SCA set of indices by adding 31 new clausal complexity indices, obtained by calculating the average number of every type of structure per clause. In other words, TAASSC measures the length of clauses as the number of direct dependents rather than words; it also provides separate counts of each type of structure instead of combining them all; and, finally, it considers clauses as being both finite and non-finite. See (Kyle 2016 ) for the complete list of clausal dependent types.

TAASSC computes 132 indices of phrasal complexity by identifying seven types of noun phrases and 10 types of phrasal dependents.

Finally, TAASSC supplies 15 basic indices related to syntactic sophistication, grounded in empirical theories of language acquisition and computed from reference corpora such as the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA). Each index has 38 variants derived from the five COCA subcorpora (academic, fiction, magazine, newspaper, and spoken), for a total of 190 indices.

How the Tools Were Applied

This article Footnote 6 revises the methodology adopted in previous work (Boulanger and Kumar 2019) to predict rubric scores using feature-based deep learning (a multi-layer perceptron (MLP) neural network) and proposes a way to expose the reasoning that typically remains hidden inside the deep learning algorithm. As mentioned in the Background section, Boulanger and Kumar trained an AES model that scored D8’s essays with a feature-based approach, rather than training a sequence model such as an LSTM (long short-term memory) recurrent neural network on the raw text data. In addition to the inadequate sample size of D8, their study had one major limitation: feature selection. Ninety-six (96) writing features that correlated most with the holistic scores were selected to predict all four rubrics (Ideas and Content, Organization, Sentence Fluency, Conventions) of the D8 writing assessment. In other words, unique writing features should have been identified for each of the four rubrics instead of re-using the same feature set for all four of them. The present study also assesses feature importance per rubric.

The first step in the analysis was to examine D7’s distributions of holistic and rubric scores and evaluate the number of essay samples per rubric score. The analysis continues by quantitatively measuring the interrelationships between the underlying rubrics. Next, the analysis establishes a baseline naïve predictor against which the performance of various deep neural network architectures can be precisely measured and compared.

The second step designs and applies a feature selection process for each of the four rubrics. There are three types of feature selection methods: filter (a statistical method determines which features should be pruned), wrapper (an external predictive model is trained to evaluate the impact of different combinations of features on model accuracy), and embedded (feature selection occurs as the model is trained, through regularization methods). Footnote 7 This study applies both the filter and embedded methods. First, feature data were normalized, and features with variances lower than 0.01 were pruned. Second, for any pair of features having an absolute Pearson correlation coefficient greater than 0.9, the feature that comes last in terms of the column ordering in the datasets was pruned. Footnote 8 This second operation was performed individually on the feature set of each SALAT tool to ensure fair representation among grammar and mechanics, sentiment analysis and cognition, text cohesion, lexical diversity, lexical sophistication, and syntactic sophistication and complexity. Third, any feature that directly counted the number of words or tokens in the essays was also pruned, to reduce the rubric scoring models’ dependence on the very influential word counts (Perelman 2013, 2014). After the application of these filter methods, the number of features was reduced from 1592 to 397. All four rubric models fed upon these 397 features. Fourth and finally, the Lasso (Fonti and Belitser 2017) and Ridge regression regularization methods (whose combination is also called ElasticNet) were applied as part of the rubric scoring models’ training. Lasso is responsible for pruning features further, while Ridge regression is entrusted with eliminating multicollinearity among features. This final phase allowed for customized feature selection per rubric.
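The filter steps above could be sketched as follows with pandas and scikit-learn; the 0.01 variance and 0.9 correlation thresholds come from the text, while the min–max normalization, the DataFrame named ‘features’, and the simplification of applying the correlation pruning to all columns at once (rather than per SALAT tool) are assumptions of this sketch.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

def filter_features(features: pd.DataFrame, var_threshold=0.01, corr_threshold=0.9):
    """Normalize, drop low-variance features, then drop the later of any highly correlated pair."""
    scaled = pd.DataFrame(MinMaxScaler().fit_transform(features), columns=features.columns)
    keep = scaled.columns[VarianceThreshold(var_threshold).fit(scaled).get_support()]
    scaled = scaled[keep]
    corr = scaled.corr().abs()
    to_drop = set()
    cols = list(scaled.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if b not in to_drop and corr.loc[a, b] > corr_threshold:
                to_drop.add(b)  # prune the feature that comes last in column order
    return scaled.drop(columns=sorted(to_drop))
```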

In the third step, six deep learning (MLP) architectures were tested. Seven of the most important hyperparameters were tuned to find a quasi-optimal combination: 1) activation function (selu, elu, relu, tanh, sigmoid, exponential), 2) optimizer (Adam, SGD, Adadelta, Adamax, Nadam), 3) L1 penalties (10 equally distributed values between 0.0035 and 0.0125), 4) L2 penalties (10 equally distributed values between 0.0035 and 0.0125), 5) number of hidden layers (2, 3, 4, 5), 6) number of neurons in the first hidden layer (128, 256), and 7) number of neurons in the last hidden layer (16, 32). This hyperparameter space encompasses 48,000 different combinations. A randomized search over 200 randomly sampled combinations of hyperparameters was performed for each of the six architectures. Table 5 delineates the final hyperparameters of the six architectures. Only three architectures are shown in Table 5, each of them being used again as part of a bagging ensemble technique. Interestingly, all three architectures have only two hidden layers, below the recommended threshold of three for a network to be considered “deep” (Rosebrock 2017). More hyperparameters, larger ranges of finer-grained hyperparameter values, and a larger number of hyperparameter combinations should have been tested to learn better model parameters. The small sample of hyperparameter combinations and the limited hyperparameter space are a limitation of this study, owing to a lack of high-performance computing (HPC) resources. As a follow-up study, we plan to re-design and re-run the analysis once HPC resources are secured.
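As a rough sketch of this setup (not the exact architectures of Table 5), the following builds a two-hidden-layer Keras MLP with L1/L2 (ElasticNet-style) regularization and randomly samples a handful of hyperparameter combinations from the space described above; the placeholder arrays X_train, y_train, X_val, and y_val, the epoch and batch settings, and the reduced number of sampled combinations are assumptions of this sketch.

```python
import random
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_mlp(n_features, activation, optimizer, l1, l2, first_units, last_units):
    # Two-hidden-layer MLP regressor with ElasticNet-style (L1 + L2) weight penalties.
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(first_units, activation=activation,
                     kernel_regularizer=regularizers.l1_l2(l1=l1, l2=l2)),
        layers.Dense(last_units, activation=activation,
                     kernel_regularizer=regularizers.l1_l2(l1=l1, l2=l2)),
        layers.Dense(1),  # single real-valued rubric score (regression variant)
    ])
    model.compile(optimizer=optimizer, loss="mse")
    return model

space = {
    "activation": ["selu", "elu", "relu", "tanh", "sigmoid", "exponential"],
    "optimizer": ["adam", "sgd", "adadelta", "adamax", "nadam"],
    "l1": list(np.linspace(0.0035, 0.0125, 10)),
    "l2": list(np.linspace(0.0035, 0.0125, 10)),
    "first_units": [128, 256],
    "last_units": [16, 32],
}

def random_search(X_train, y_train, X_val, y_val, n_iter=10):
    # Randomized search over sampled hyperparameter combinations (the study sampled 200).
    best = None
    for _ in range(n_iter):
        params = {k: random.choice(v) for k, v in space.items()}
        model = build_mlp(X_train.shape[1], **params)
        model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)
        loss = model.evaluate(X_val, y_val, verbose=0)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best
```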

Finally, the last step estimates feature importance for each of the trained rubric scoring models. A technique called permutation importance (Breiman 2001 ) randomizes the values of a feature in the testing set multiple times, one feature at a time, measures the average change (called the weight) in the model’s predictive power (e.g., percentage of accurate predictions), and ranks features by weight in descending order. Rubric score predictions are presented in confusion matrices, and corresponding precision, recall, and F1-score values are reported. However, only one set of hyperparameter values has been optimized per deep learning architecture. Hence, all four rubric scoring models pertaining to each architecture have been trained with the same set of hyperparameter values. This is another limitation that we aim to address in the follow-up study.
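A bare-bones version of this permutation importance procedure is sketched below, assuming a fitted scikit-learn-style classifier whose predict method returns discrete rubric scores and NumPy arrays X_test and y_test; the fixed random seed is arbitrary.

```python
import numpy as np

def permutation_importance(model, X_test, y_test, n_repeats=5, seed=42):
    """Average drop in exact-match accuracy when one feature column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(model.predict(X_test) == y_test)
    weights = []
    for j in range(X_test.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_test.copy()
            # Randomize one feature's values while keeping all other features intact.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            acc = np.mean(model.predict(X_perm) == y_test)
            drops.append(baseline - acc)
        weights.append((j, np.mean(drops), np.std(drops)))
    # Rank features by average accuracy drop (the "weight"), in descending order.
    return sorted(weights, key=lambda t: t[1], reverse=True)
```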

This study investigated both the feasibility and the benefits of applying automated essay scoring at the rubric level. Rubric scores provide high-level formative feedback that is useful to both student-writers and teachers. Most of the literature in this domain focuses almost exclusively on predicting holistic scores. This article goes one step further by analyzing the performance of deep/shallow learning on rubric score prediction and by investigating the most important writing indices that determine those rubric scores.

The first step in the process is to know the dataset that will train the rubric scoring models. Figure 1 shows the distribution of D7’s holistic scores. Note that the scoring scale is from 0 to 24. Footnote 9 The distribution of holistic scores appears quite balanced from score 6 up to 24. Very few or no essays have been assigned a final score lower than 6, which is reasonable. The most frequently given scores are 16 (199 times) and 17 (160 times); other scores’ frequencies are relatively uniform (between 20 and 118), allowing an AES system to learn from high-quality, average, and low-quality essays.

Figure 1. Distribution of holistic scores

The human raters’ distributions of scores, all rubrics combined (R1 + R2 + R3 + R4), are exhibited in Fig. 2. The figure highlights the raters’ biases. According to Table 2, their agreement level, measured by QWK, is 0.72.

Figure 2. Distributions of human raters’ scores (all rubrics combined)

Additionally, to take advantage of this unique hand-graded set of essays, this article was motivated by the idea that holistic scores could be better predicted through the prediction of their constituent rubric scores. The rationale is simple. In the best case, if the distribution of holistic scores were uniform, there would be about 1567/25 ≈ 62 examples per score, which is not much to teach all the intricacies of English writing. On the other hand, the scale of rubric scores ranges from 0 to 6, implying that in the best case there would be 1567/7 ≈ 223 essays per rubric score, which means more essays from which to assess a narrower competence of English writing. However, it is beyond the scope of this single study to verify whether the prediction of holistic scores is more accurate through the prediction of their constituent rubric scores, especially given that less accurate models may well be more interpretable (Ribeiro et al. 2016) and that predictive accuracy alone is not the only criterion for determining the trustworthiness of an AES system (Murdoch et al. 2019). Figure 3 exhibits the distributions of scores by rubric, while Table 6 presents their descriptive statistics.

Figure 3. Distributions of rubric scores

Table 7 shows the level of agreement among rubrics, measured by the quadratic weighted kappa, along with the average agreement level per rubric. Note from Fig. 3 and Table 6 that each rubric’s most frequent score is 4, and Table 7 indicates that the rubrics generally have moderate-to-strong levels of agreement. An AES system could therefore naively predict 4 as the score for every rubric and a holistic score of 16 as per the formula HS = R1 + R2 + R3 + R4.

The performance of this naïve AES system (also known as majority classifier) is delineated in Table 8 . Four metrics are used throughout this article to measure the performance of an AES model. They are, a) the quadratic weighted kappa; b) the percentage of exact predictions; c) the percentage of exact and adjacent (±1) predictions (simply denoted by “adjacent (±1)” from now on); and d) the percentage of exact, adjacent (±1), and adjacent (±2) predictions (simply denoted by “adjacent (±2)” from now on).

The quadratic weighted kappa measures the level of agreement between two raters by controlling for chance agreement and by penalizing larger distances between pairs of ratings more heavily (quadratically). The weights are normalized by the size of the scoring scale and range between 0 and 1: each weight is the squared difference between two ratings divided by the squared maximum possible difference on that scale. For example, holistic scores are predicted on a 0–24 scale, while rubric scores lie between 0 and 6. Hence, the penalty (weight) assigned to a predicted holistic score that is ‘off by 2’ will be smaller than the penalty assigned to a predicted rubric score that is ‘off by 2’.

The percentage of exact matches measures the accuracy of the AES model as the proportion of predictions that equal the expected score; all other predictions count as wrong, no matter their distance from the expected value. Percentages of adjacent matches indicate the proportions of predictions that fall within a certain distance of the expected value. This study considers distances of one and two.
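These four reporting metrics can be computed with scikit-learn and NumPy roughly as follows; y_true and y_pred stand for resolved and predicted scores, and the example values are made up.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def aes_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    dist = np.abs(y_true - y_pred)
    return {
        "qwk": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
        "exact": np.mean(dist == 0),
        "adjacent_1": np.mean(dist <= 1),  # exact + off-by-one
        "adjacent_2": np.mean(dist <= 2),  # exact + off-by-one + off-by-two
    }

print(aes_metrics([4, 3, 5, 2, 4], [4, 4, 5, 4, 3]))
```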

Both supervised classification and regression techniques have been leveraged to model rubric scores. Table 9 reports the performance of six distinct deep learning MLP architectures that are trained, validated, and tested on ASAP’s seventh dataset. Each essay was converted to a vector of 397 writing features. The 1567 essays have been randomly split into a training and testing set, that is, 80% of the essays formed the training set (1254), while the remaining 20% formed the testing set (313). Although validation results are not reported in this article, 15% of the training set was used as a validation set to drive model training toward better accuracy (except in ensemble-based models). A 5-fold cross-validation was performed for each ensemble-based architecture, implying that the training set was randomly split into a smaller training dataset (80% = 1003 essays) and a validation set (20% = 251 essays). Following recommendations from the literature (Boulanger and Kumar 2019 ; Cozma et al. 2018 ; Taghipour and Ng 2016 ), model performance was reported on the testing set as an average of several testing iterations (various samplings) instead of choosing the “highest kappa” produced, to avoid overfitting the AES models to the testing set. Hence, each architecture has been trained and evaluated five times and the resulting performance measurements were averaged. Table 9 reports the average performance of each architecture for each rubric.

The first architecture, called “Classification”, selects the most likely rubric score among a set of seven discrete scores between 0 and 6. The second architecture, called “Classification Ensemble”, leverages a bagging ensemble technique. Essentially, each model trained per fold during cross-validation makes up a machine grader. Thus, five machine graders with distinct “expertise” determine the predicted score by averaging their assessed scores (alternatively, the score could be determined by vote, selecting the most frequent score (mode) or using the median score). The third architecture, “Regression”, is similar to “Classification”; however, instead of selecting the most likely score on a scale of seven discrete scores, it computes a single real-number score (e.g., 4.3333) and rounds it to the nearest integer, clipping it to 0 or 6 if the real number falls below or exceeds the scale. The fourth architecture is identical to “Classification Ensemble”, except that it employs regression instead of classification. The fifth architecture, “Multiple Regression”, considers the interdependencies among rubrics (shown in Table 7) that underlie the determination of the holistic score: instead of predicting a single rubric score in a siloed approach, it predicts all four rubric scores at once. Subsequently, the agreement level between the machine and the human graders is analyzed and reported per rubric. Finally, the sixth architecture, “Multiple Regression Ensemble”, employs an ensemble technique on top of the fifth architecture to predict rubric scores.

Given that, on average, classifiers have higher accuracy in terms of percentage of exact matches (the most important metric), their performance was investigated further. Figure 4 shows the normalized confusion matrix of each rubric along with the precision, recall, and F1-score of each rubric score. Note that precision is the proportion of essays predicted to belong to a rubric score category that were indeed assigned that score by the human raters. Conversely, recall is the proportion of essays that were assigned a specific rubric score by the human raters and that were also correctly predicted by the machine. The F1-score is the harmonic mean of precision and recall. Note that precision and recall do not consider how far a prediction is from the expected value (the QWK does). For example, Fig. 4 demonstrates that 25% of all essays were given a 4 on the Ideas Rubric by both the human graders and the machine grader (accurate predictions). By adding up the ratios along the diagonal, it is possible to calculate the percentage of exact matches. Hence, for the Ideas Rubric, the percentage of exact matches is 0.05 + 0.06 + 0.25 + 0.04 + 0.05 = 0.45 (45%); for the Organization Rubric, it is 0.08 + 0.02 + 0.23 + 0.05 + 0.07 = 0.45 (45%); for the Style Rubric, 0.05 + 0.04 + 0.37 + 0.04 + 0.05 = 0.55 (55%); and for the Conventions Rubric, 0.05 + 0.03 + 0.21 + 0.11 + 0.10 = 0.50 (50%). Note that these are close to the exact-match percentages in Table 9 under the Classification heading (Ideas Rubric: 45.1%, Organization Rubric: 47.1%, Style Rubric: 54.3%, Conventions Rubric: 47.8%). They differ slightly because the confusion matrices were derived from a single classifier per rubric instead of being averaged over five iterations. Similarly, adjacent matches (±1) are calculated by adding the ratios along the diagonal plus the ratios directly above and below each diagonal element.
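The diagonal arithmetic above can be expressed compactly with NumPy traces over a confusion matrix normalized over all essays; the function below is a small illustration, not the code used in the study.

```python
import numpy as np

def match_rates(norm_cm: np.ndarray):
    """Exact and adjacent (±1) match rates from a confusion matrix normalized over all essays."""
    exact = np.trace(norm_cm)
    adjacent_1 = exact + np.trace(norm_cm, offset=1) + np.trace(norm_cm, offset=-1)
    return exact, adjacent_1
```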

Figure 4. Normalized confusion matrices and classification reports for all rubrics (classifier). Human and machine scores are represented by the vertical and horizontal axes, respectively

Figure 5 shows the normalized confusion matrix of each rubric between the two human graders for comparison with those of the machine grader. Notice how the rubric score scales vary between Figs. 4 and 5 . The machine grader predicts rubric scores on a 0–6 scale, the scale of resolved rubric scores derived by adding the two human raters’ rubric scores which were originally on a 0–3 scale. Thus, the resolved scores are used to benchmark the performance of the machine marker, as depicted in Fig. 4 . Once the performance of the machine marker is measured, it is important to compare it against human performance to determine whether the AES system can minimally meet human expectations. Hence, measuring human performance implies assessing the level of agreement between human raters, which in this case can only be measured by comparing the two human raters’ rubric scores on the 0–3 scale, creating a scale discrepancy between machine and human performances.

Figure 5. Confusion matrices for all rubrics between the two human raters. Rater 1 and Rater 2 are represented by the vertical and horizontal axes, respectively

Finally, to assess the role that each writing feature played in the essay scoring per rubric, a method called permutation importance was run. It estimates the impact that randomizing the values of a single feature in the testing set has on the accuracy of the trained model (classifier), while keeping all other feature values intact. Five randomized permutations are performed and the average impact on the percentage of exact matches along with the standard deviation are reported for the selected feature. The process is repeated for all writing features selected for training. Finally, the writing features are listed in order of importance. Tables  14 and 15 (see Appendix 2) delineate the 20 most important writing features by rubric. For example, the ‘you_gi’ feature (Rank 1) in Table 14 has a weight of 2.10% and a standard deviation of 0.25% (the weight will tend to vary by 0.25% depending on the selected permutation). This means that on average, the accuracy of the rubric scoring model in terms of exact matches drops by 2.10% when this piece of information is unknown. According to Table 9 , the accuracy of the ‘Classification’ architecture for Ideas Rubric is 45.1% (exact matches). Thus, dropping the feature will result in a model with a lower accuracy, that is, approximately 43.0%.

The next table (Table 16 ; see Appendix 2) shows by order of importance, all rubrics combined, the writing features that influenced the determination of essays’ holistic scores the most. The ‘Avg. Weight’ column indicates by how much each writing feature on average improves each rubric scoring model’s accuracy. The ‘Avg. Std. Dev.’ column is the average level of uncertainty (standard deviation) about the improvement that each feature brings to the four rubric scoring models. Table 16 identifies which writing features are of global importance to all four rubrics. By comparing every rubric’s list of most important features (Tables  14 and 15 ) against Table 16 , it is also possible to determine which features are of local importance to a specific rubric.

Discussion: Performance of Linguistic Indices-Based Deep Learning

The four rubric score distributions in Fig. 3 are quite similar to one another, with means ranging from 3.68 to 4.34 and standard deviations from 1.10 to 1.62. They all show that the most frequent score for every rubric is 4. Table 7 shows that the rubric scores are moderately to strongly dependent on each other. The quadratic weighted kappa values all range from 0.46 to 0.79. Ideas Rubric scores on average agree at 0.62 with the other rubrics, Organization Rubric on average agrees at 0.73, Style Rubric at 0.69, and Conventions Rubric has the lowest average agreement level at 0.60. The following adjacent pairs of rubrics have strong levels of agreement: Ideas-Organization, Organization-Style, and Style-Conventions. In contrast, the non-adjacent Ideas-Conventions pair shows the weakest agreement. This implies a partial overlap but also a progression among the rubrics.

Table 8 reveals from the distributions of rubric scores that 34.5% of scores in the Ideas rubric are a 4, 33.1% of scores in the Organization rubric are a 4, 46.3% in the Style rubric, and 33.6% in the Conventions rubric. Thus, systematically assigning a 4 to every rubric would “predict” accurate scores 36.9% of the time on average. Similarly, giving a score of 4 to all rubrics for all essays would result in 65.5% adjacent (±1) matches for the Ideas Rubric, 69.8% for the Organization Rubric, 79.7% for the Style Rubric, and 70.3% for the Conventions Rubric, an average of 71.3%. This is a deceptively strong performance for a completely naïve AES model. Interestingly, the quadratic weighted kappa proves to be an effective indicator for detecting such constant guessing, as shown by the QWK of zero for every rubric in Table 8. The fact that every predicted score is always 4 on the 0–6 scale while the resolved scores vary significantly on the same scale exhibits a gaming behavior that the QWK formula can detect. To be relevant, the performance of an autonomous (or a human-in-the-loop) AES system should significantly exceed that of this baseline model.

By aggregating and selecting the best performance per rubric on the testing set, all architectures combined, as demonstrated in Table 10, it can be observed that the QWKs lie between 0.69 and 0.77. The average QWK over the four rubrics is equal to the level of agreement on holistic scores between the two human raters (0.72) and close to the 2012 commercial vendors’ mean agreement level (0.76) (see Table 2). This is remarkable because 1) the models were trained, validated, and tested on a smaller essay dataset, that is, the equivalent of the original training dataset only, and 2) smaller rubric score scales tend to produce lower QWKs than larger scales such as the holistic score scale (0–24 for D7), the scale on which the performances reported in Table 2 were measured (Perelman 2013).

Table 10 also presents the agreement level between the two human raters at the rubric level. Notice that the original rubric scores given by the human raters were on a 0–3 scale. It can be seen from Table 10 that the AES rubric models have a mean agreement level of 0.72 with the resolved scores, which is significantly greater than the mean agreement level between the two human raters (0.60). Nevertheless, as mentioned above, the human raters’ scoring scale is smaller than the scale of resolved scores (the prediction scale of the machine marker), which tends to produce smaller QWKs when agreement between the human raters is measured. Since the difference in performance between the machine marker and the human markers is quite large, it is very likely that the accuracy of the trained rubric scoring models is equivalent, if not superior, to human performance, even when controlling for this scale discrepancy.

This study’s best exact match percentages range between 45.6% and 55.4%, a 9.1–19.5% improvement in comparison to the majority classifiers previously discussed. The least accurate rubric is the Ideas rubric, which intuitively requires more background knowledge than the other rubrics. On average 89–96% of the predicted rubric scores are adjacent (±1) matches, compared to 65–80% for the naïve predictor. In other words, on average 50.6% of rubric scores are exact, 42.2% are off by 1, 6.7% are off by 2, and 0.5% are off by 3 or more.

Table 11 summarizes the information from the confusion matrices in Figs. 4 and 5. On average, the human raters assigned identical rubric scores and adjacent (±1) scores 63% and 99% of the time, respectively. In contrast, the machine marker’s predictions on average are exact, adjacent (±1), and adjacent (±2) 49%, 88%, and 99% of the time, respectively. The fact that the scale of resolved scores is almost twice as large as the human raters’ rubric score scale helps explain why the machine marker’s percentages of exact matches are smaller than its human counterparts’. For instance, the machine marker is 45% accurate on the Ideas Rubric, and 85% − 45% = 40% of predictions are off by one. If the predicted rubric scores are rescaled to a 0–3 scale, approximately half of this 40% of predicted scores that are off by one would be rounded toward their rescaled resolved score and the other half would be rounded away from it. The percentage of exact matches would then reach approximately 45% + (0.5 × 40%) = 65%, and the percentage of rescaled adjacent (±1) scores would be close to 100%. Note that this calculation is based on the machine marker’s performance on the Ideas rubric, which, among the four rubrics, is the furthest from the human raters’ performance.

To the best of our knowledge, only one study attempted to predict rubric scores using D7 (Jankowska et al. 2018), only one study investigated rubric score prediction on D8 (Zupanc and Bosnić 2017), and very few AES systems in general predict essay scores at the rubric level (Kumar et al. 2017). Zupanc and Bosnić (2017) reported an agreement level (QWK) of 0.70 on the Organization Rubric (D8). Their feature-based AES model included 29 coherence metrics, which greatly contributed to the observed performance (these coherence metrics alone achieved a QWK of 0.60).

Similarly, Jankowska et al. (2018) trained an AES system on D7 using Common N-Gram, Support Vector Machine, and Naïve Bayes classifiers. The 13 feature sets used to train the various classifiers consisted of character n-grams, with n ∈ {1, 2, 3, …, 10}, and word and stemmed-word n-grams of length 1 and 2. Rubric scores were predicted on the 0–3 scale, the scale of the human raters, and two machine markers were trained per rubric, one per human rater. Table 12 reports the best agreement levels between each machine marker and its corresponding human rater and between the two human raters (Jankowska et al. 2018). It is interesting to observe that the typical agreement levels on holistic scores reported in the literature (between 0.76 and 0.88; see Tables 2 and 13) did not translate into equally high agreement levels on rubric scores (between 0.428 and 0.657). The agreement levels on rubric scores between the two human raters (between 0.544 and 0.695) were also smaller than their agreement level on holistic scores (0.72; see Table 2). The exception is the Ideas rubric: the feature sets were likely more suited to the scoring of this rubric than to the other rubrics, which are more concerned with how ideas are formulated.

It can also be observed that this study’s models (Table 10 ) on average outperform results reported by Jankowska et al., although this study’s prediction scale is larger (0–6 instead of 0–3). Tables  10 and 12 reveal that the mean agreement level (QWK) between the two human raters is 0.60, averaged over all four rubrics. Jankowska et al.’s rubric scoring models at best had an average agreement level (QWK) of 0.531 with the human raters, below the human raters’ agreement level. In contrast, this study reports a mean agreement level with the human raters’ resolved scores of 0.72, well beyond the agreement level between the two human raters.

To evaluate whether the parallel prediction of the four rubrics could accurately predict and explain essay holistic scores, this study predicted holistic scores as the sum of the rubric score predictions. It was found that 1) the agreement level with the human raters’ resolved scores, as measured by QWK, was 0.785; 2) 17.2% of holistic score predictions were accurate; 3) 46.1% were adjacent (±1); and 4) 64.0% were adjacent (±2). Remember that holistic scores range from 0 to 24 and that these performance indicators were averaged over five iterations to avoid reporting overfit performance.

The rubric-based AES system proposed in this article exhibits performance comparable to contemporary related works (see Table 13), where cutting-edge AES tools generate agreement levels between 0.766 and 0.811 (Cozma et al. 2018; Taghipour and Ng 2016; Wang et al. 2018). Note that Cozma et al., Taghipour and Ng, and Wang et al. are the only ones to have averaged their reported performance over multiple training runs of their AES models, a practice that tends to lower performance numbers.

Discussion: What are the Most Important Features per Rubric?

As previously mentioned, D7’s writing assessment, written by Grade-7 students, had an average length of approximately 171 words, was of persuasive/narrative/expository type, and required students to write a story about patience. The marking guidelines provided to the two human raters were described in Table 3. Each essay was processed by the Suite of Automatic Linguistic Analysis Tools (SALAT), converting each essay into a 1592-metric vector. After performing feature selection through filter (i.e., pruning of low-variance and correlated features) and embedded methods (i.e., ElasticNet regularization), and after deleting all features that directly counted the number of words/tokens in an essay (Perelman 2013, 2014), 397 writing features were selected to train the four rubric scoring models presented in the previous section. Of the 397 selected features, 12 were generated by GAMET (grammar and mechanics), 97 by SEANCE (sentiment analysis and cognition), 76 by TAACO (cohesion), 14 by TAALED (lexical diversity), 108 by TAALES (lexical sophistication), and 90 by TAASSC (syntactic sophistication and complexity).

Tables 14 and 15 list the 20 most important features per rubric based on the classification model trained for each of them. It can be noted that for Rubrics 1, 2, and 4, writing features from all six SALAT tools rank among the top 20; as for the Style Rubric, no linguistic indices generated by GAMET were included in the list of the 20 most important features.

Each rubric has a distinct set of most important features and feature weights are relatively small, which leads to the hypothesis that the models do not significantly suffer from dependence on a single or a few dominant features. For example, Ideas Rubric’s most important feature (you_gi: number of 2nd-person pronouns divided by number of words) carries a weight as low as 2.10% on the model’s performance. In other words, knowledge about the usage of second-person pronouns (suggesting a direct dialogue) allowed the rubric scoring model (Ideas Rubric) to increase the accuracy of its predictions by 2.10%. Similarly, when the Organization rubric scoring model is informed of the usage of words associated with infants and adolescents (nonadlt_gi, a category of words expressing social relations), the accuracy of its predictions is improved by 1.21%. The measure of textual lexical diversity assessed on content words (mtld_original_cw) (Fergadiotis et al. 2015 ; McCarthy and Jarvis 2010 ) improved the accuracy of Style Rubric’s predictions by 3.12%. The type-token ratio of function words calculated by taking the square root of the total number of function words (root_ttr_fw) (Torruella and Capsada 2013 ) is responsible for 2.68% of the Conventions rubric scoring model’s accuracy. This linguistic index is an example of the indirect influence that text length has on rubric score predictions; although word-count variables have been removed from consideration, the ‘root_ttr_fw’ feature is actually dependent on text length (see Table 17 in Appendix 3).

Table 16 lists the 20 most important features, all rubrics combined, obtained by averaging the weights and standard deviations that each feature carries across the four rubrics. It demonstrates the potential importance of each feature for the prediction of essays’ holistic scores. For instance, the most important feature, the hypergeometric distribution’s D index, which assesses the diversity of function words, has an average weight of 1.53% across all four rubrics. This implies that the accuracy of each rubric scoring model is improved by approximately 1.53% ± 0.87% (Footnote 10). Only two writing features of global importance (‘pos_gi’ and ‘rcmod_nsubj_deps_nn_struct’) are not listed in at least one of the rubrics’ lists of 20 most important features. Moreover, among the 66 linguistic indices forming the four lists of 20 most important features per rubric, only 18 are of global importance for all four rubrics. This offers a new perspective from which to analyze the importance of a feature and confirms the necessity of performing customized feature selection at the rubric level.

Ideas Rubric

Five types of words (SEANCE) are of interest to the Ideas rubric scoring model. Together, their usage improves the model’s accuracy by 2.5% to 4.9% (3.7% ± 1.2%). These types of words are a) any direct reference to another person, b) abstract nouns (e.g., ability, accuracy, action, activity, administration), c) words expressing non-work social rituals (e.g., adjournment, affair, ambush, appointment, armistice), d) words introducing a time dimension in the story (e.g., abrupt, advance, after, afternoon, afterward), and e) words expressing the idea of fetching (denoting effort) (e.g., acquire, altruistic, apprehend, bandit, benefactor). In summary, the usage or the non-usage of these types of words might be indicative of whether the essay writer tries to tell a story about a tangible situation that happened in one’s life.

From a grammar and mechanics perspective (GAMET), the rubric scoring model only notices the number of grammatical errors and the number of misspellings of English contractions, which together represent between 0.3% and 2.3% of the model’s predictive power. Further verification is required to confirm whether a non-trivial presence of grammatical errors prevents the effective transmission of ideas.

The Ideas rubric’s machine marker is attentive to four metrics of lexical diversity (TAALED) and six metrics of lexical sophistication (TAALES), which account for 3.5–6.8% and 1.6–7.2% of its predictive power, respectively. It takes notice of the variety of function words and content words through the lens of the measure of textual lexical diversity (MTLD) and the hypergeometric distribution’s D index (see Table 17 in Appendix 3 for more information about these indices). The machine marker also considers criteria such as the strength of association within trigrams and the number of occurrences of bigrams and content words in general usage (measured with the COCA magazine corpus and the HAL corpus). It looks at the mean range of content words, that is, the average number of texts in the COCA fiction corpus in which a content word appears. The rubric scoring model also draws on the proportion of trigrams in an essay that are among the 10,000 most frequent trigrams in the COCA fiction corpus. In addition, it looks at the average generality or specificity (number of senses) of the adverbs used.

The rubric scoring model is not sensitive to text cohesion, except for the number of sentences with any verb lemma repeated somewhere in the next two sentences (0.4–1.5% of the predictive power). Likewise, syntactic sophistication and complexity play a limited role in determining the rubric score as they account for between 0.2% and 1.2% of the accuracy of the rubric score predictions.

Overall, the 20 most important features explain 16.2% ± 7.7% of the model’s predictive power. Consequently, they are arguably responsible for the increase in the percentages of exact and adjacent (±1) predictions in comparison to the naïve predictor, improving exact matches from 34.5% to 45.6% and improving adjacent (±1) matches from 65.5% to 89.5%.
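The exact and adjacent (±1) percentages used throughout this section can be computed directly from predicted and resolved rubric scores. A small illustrative sketch, with placeholder data rather than this study's predictions:

```python
# Exact and adjacent (+/-1) agreement, as percentages, between predicted and
# resolved rubric scores. The arrays below are illustrative placeholders.
import numpy as np

def exact_and_adjacent(pred, true):
    pred, true = np.asarray(pred), np.asarray(true)
    exact = np.mean(pred == true) * 100
    adjacent = np.mean(np.abs(pred - true) <= 1) * 100
    return exact, adjacent

pred = [2, 3, 1, 0, 2, 3]
true = [2, 2, 1, 1, 3, 3]
print(exact_and_adjacent(pred, true))  # (50.0, 100.0) for this toy example
```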

Organization Rubric

The following vocabulary elements play an important role in predicting the score of the Organization rubric: a) references to children and adults (e.g., baby, boy, child, childish, children); b) expressions of positive affect through adjectives; c) usage of positive adjectives; d) expressions of increase in quality or quantity (e.g., abound, absorbent, absorption, accelerate, acceleration); and e) allusions to physical body parts and tangible objects. Together, they explain between 1.4% and 5.9% of the model’s predictive power. It is important to note that feature importance does not indicate whether important features are desirable traits to be found within an essay (e.g., high usage of body-part words), but rather that the inclusion or exclusion of the information they provide contributes to a more accurate scoring of the Organization rubric. Nevertheless, it can be hypothesized that these categories of vocabulary are desirable since they are indicative of storytelling, a requirement of the writing assessment’s prompt.

The rubric scoring model considers the following four TAALES lexical sophistication metrics: a) the frequency of content words in general usage as per the HAL corpus, b) the degree of concreteness expressed by concrete words, c) the genericity or specificity of the adjectives used (average number of senses per adjective), and d) the proportion of trigrams in an essay that are among the 10,000 most frequent trigrams in the COCA fiction corpus. Their contribution to the model’s accuracy is somewhat ambiguous, ranging between −0.7% and 5.2%; the inclusion of certain features was therefore sometimes found to degrade the model’s predictive power.

The rubric scoring model uses only one punctuation-related metric from grammar and mechanics, that is, the number of times a comma is missing after a conjunctive/linking adverb at the beginning of a new sentence (e.g., however, besides, nonetheless, etc.). This writing feature contributes rather minimally to the model’s predictive power (between 0.4% and 1.0%).

The Organization rubric’s machine marker considers three writing features that measure aspects of text cohesion (0.8–2.6%): a) the variety of connective types, b) the variety of adjectives used, and c) the presence of negative connectives (e.g., admittedly, alternatively, although).

Among the most important features are linguistic indices that describe the complexity of the noun phrases and clauses used (syntactic sophistication and complexity), such as a) the number of adjectival modifiers per direct object, b) the number of modal auxiliaries per clause, c) the number of prepositions per object of the preposition, and d) the number of phrasal verb particles per clause. Together they explain between −0.1% and 4.16% of the model’s accuracy. However, their impact carries some uncertainty because the interval crosses zero.

The list of 20 most important features includes only one metric pertaining to lexical diversity. This classical type-token ratio metric (the ratio of unique words to the total number of words (Gregori-Signes and Clavel-Arroitia 2015 )) has a negligible effect on the model’s performance (−0.1–1.1%).
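For reference, the classical type-token ratio mentioned here, and the root TTR variant used by the Conventions model (root_ttr_fw), can be sketched as follows. This illustration runs on all tokens, whereas the actual indices restrict the computation to function or content words as appropriate.

```python
# Classical type-token ratio (unique words over total words) and the root TTR
# variant (types over the square root of the token count). Illustrative only;
# the restriction to function words (as in root_ttr_fw) is omitted here.
import math

def ttr(tokens):
    return len(set(tokens)) / len(tokens)

def root_ttr(tokens):
    return len(set(tokens)) / math.sqrt(len(tokens))

tokens = "the patient student waited and waited for the result".lower().split()
print(f"TTR = {ttr(tokens):.2f}, Root TTR = {root_ttr(tokens):.2f}")
```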

Overall, the 20 most important features explain 11.8% ± 10.5% of the model’s predictive power. Consequently, it is hypothesized that they are responsible for the increase in the percentages of exact and adjacent (±1) matches in comparison to the naïve predictor, improving from 33.1% to 48.2% for exact predictions, and improving from 69.8% to 92.3% for adjacent (±1) matches. Remarkably, there is no dominant feature, with all weights ranging between 0.45% and 1.21%.

Style Rubric

Six features of lexical diversity carry a heavy weight in the rubric scoring model’s predictive power, accounting for between 9.8% and 21.7% of it. Noticeably, the feature ‘lexical_density_type’ is counted twice because it is included in both TAALED’s and TAACO’s sets of linguistic indices; this is a minor limitation that will be addressed in future work.

In terms of vocabulary, the model considers whether words are indicative of: a) an assessment of moral approval or good fortune (the writing assessment’s prompt requires writing about patience), b) quantities such as cardinal numbers, c) frequency or pattern of recurrence, and d) socially-defined interpersonal processes (e.g., abolish, absentee, absolve, abuse, accept). Between 3.8% and 9.7% of the overall predictive power of the rubric scoring model is explained by these features.

The machine marker examines the number of sentence linking words (e.g., nonetheless, therefore, although), the number of sentences with any lemma overlap with the next two sentences, and the variety of content word lemmas, accounting for 2.3–5.7% of the rubric scoring model’s accuracy.

Information about lexical sophistication is also important to the model and improves its accuracy by 1.5–6.3%. In particular, the model considers word-neighbor information such as the average number of phonological neighbors for each word in the text (the number of words that differ by one phoneme, excluding homophones), the degree of academic language within the essay, and the average strength of association within bigrams, that is, the mean probability that any two consecutive words will occur together (Kyle et al. 2018).
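As an illustration of such a bigram association measure, the sketch below averages the conditional probability P(w2 | w1) of each bigram in an essay, estimated from a reference corpus of counts. The tiny corpus and the exact formula are placeholders; TAALES derives its association strengths from large reference corpora and its own measures.

```python
# Illustrative bigram association score: average, over an essay's bigrams, of
# the conditional probability P(w2 | w1) estimated from a reference corpus.
# The reference text and the formula are hypothetical stand-ins.
from collections import Counter

reference_text = "the cat sat on the mat and the cat slept on the mat".split()
unigram_counts = Counter(reference_text)
bigram_counts = Counter(zip(reference_text, reference_text[1:]))

def mean_bigram_association(essay_tokens):
    probs = []
    for w1, w2 in zip(essay_tokens, essay_tokens[1:]):
        if unigram_counts[w1]:
            probs.append(bigram_counts[(w1, w2)] / unigram_counts[w1])
    return sum(probs) / len(probs) if probs else 0.0

print(mean_bigram_association("the cat sat on the mat".split()))  # 0.7
```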

The machine marker considers the number of relative clause modifiers per nominal, variety of dependents per nominal complement, and the number of nominal complements per clause. These measures of noun phrase/clause complexity and variety together explain 1.5–5.4% of the model’s predictions.

In sum, the Style rubric’s 20 most important features explain 33.8% ± 14.9% of the model’s predictive power. Consequently, it is hypothesized that they are responsible for the increase in the percentages of exact and adjacent (±1) predictions in comparison to the naïve predictor, improving from 46.3% to 55.4% for exact matches and from 79.7% to 95.9% for adjacent (±1) matches. Notably, this set of 20 most important features is the most predictive among the four rubrics, representing at minimum 18.9% and at maximum 48.7% of the predictive power. The Style rubric scoring model has both the highest naïve-predictor performance and the best trained-model performance of all four rubrics.

Conventions Rubric

The Conventions rubric scoring model uses two metrics of lexical diversity to determine rubric scores. Both measure the variety of function words within an essay and together carry a weight of 2.6–7.4% of the scoring model’s total accuracy.

Syntactic sophistication and complexity play an important role in the predictive power of the rubric scoring model, accounting for between 5.0% and 11.2% of it. The model pays attention to a) the number of complex nominals per clause; b) the variety in the number of dependents per nominal complement; c) factors related to the preferential lexical company kept by a verb-construction (the interface between lexis and grammar), measured in relation to an academic corpus; d) the usage of less frequent verb-construction combinations (based on the lemmas of the constructions); e) the use of possessives in nominal subjects, direct objects, and prepositional objects (e.g., my, his, her, their); and f) the number of relative clause modifiers per nominal (Kyle 2016).

Between 4.4% and 10.4% of the model’s accuracy depends on information about aspects of lexical sophistication. For instance, the model is interested in the contextual distinctiveness of the vocabulary employed, that is, the likelihood that a word will come to mind in response to a variety of stimuli (Kyle et al. 2018). It also looks at the average probability that any two consecutive words in an essay will co-occur. Further, it considers how frequently words occur in general usage: low frequency is indicative of more sophisticated vocabulary and carries more information, while higher frequency denotes more familiar vocabulary and is less informative. The model also takes into account the average range of texts in which words are used (normed on the SUBTLEXus corpus), the mean number of orthographic neighbors for each word in the essay (the average number of words that can be formed by changing just one letter) (Kyle et al. 2018), and word recognition norms in terms of response latency and accuracy.

Regarding text cohesion, the model considers the variety of nouns and pronouns, lexical diversity in terms of the lemmas used in the essay text, and the average similarity between any pair of adjacent sentences. Their combined impact on the model’s accuracy adds up to between 1.6% and 5.4%.

Surprisingly, the Conventions rubric’s list of 20 most important features includes only one metric directly related to grammar and mechanics, namely the number of all grammatical errors. This confirms the findings of Crossley et al. (2019a) that GAMET’s macrofeatures are more efficient than individual microfeatures. The number of grammatical errors plays a limited role in the determination of the Conventions rubric score (1.0% ± 0.7%), again confirming the weaker association that grammatical accuracy has with human judgment of essay quality, as reported by Crossley et al. (2019a). This study hypothesizes that the Conventions rubric is more sensitive to complex aspects of grammar than to the simple grammar/spelling rules tracked by GAMET. Alternatively, low-quality essays may not exhibit the minimum level of quality required for effective parsing, preventing further detection of grammatical and spelling errors. The follow-up study will test the hypothesis that the effect of grammatical and spelling accuracy on the Conventions rubric scoring model’s performance is mediated through features measuring aspects of syntactic and lexical sophistication and complexity.

The Conventions rubric’s 20 most important features explain 27.1% ± 12.0% of the model’s predictive power. It is hypothesized that they are responsible for the increase in the percentages of exact and adjacent (±1) matches in comparison to the naïve predictor, improving from 33.6% to 53.1% for exact matches and from 70.3% to 93.1% for adjacent (±1) matches. Notably, the Conventions rubric scoring model is the most improved over its corresponding naïve predictor’s performance. This set of 20 most important features is the second most predictive among the four rubrics, representing at minimum 15.1% and at maximum 39.1% of the predictive power.

This article notes that writing features could have been grouped more consistently among the rubrics. For example, all SEANCE writing features could have been grouped under the same rubric, that is, either Ideas or Style. Accordingly, this study hypothesizes that the moderate-to-strong levels of agreement among rubrics (Table 7) contribute to this phenomenon.

Conclusion

This study investigated the potential of feature-based deep learning (a multi-layer perceptron) in AES to predict rubric scores and explained how rubric scores are derived. The rubric scoring models were trained on the 1567 Grade-7 essays of the Automated Student Assessment Prize contest’s seventh dataset. The Suite of Automatic Linguistic Analysis Tools processed each essay, converting it into a vector of 1592 writing features.

This study was a continuation and an improvement of a previous study, which had certain limitations in training generalizable rubric scoring models. The previous study used the 722 Grade-10 essays of ASAP’s eighth dataset, which had a small sample size, a large scoring scale, and an imbalanced distribution of holistic and rubric scores in which high-quality essays were underrepresented (Boulanger and Kumar 2019). It also suffered from a limited feature selection process and from the fact that the rubric scoring models were all trained on the same set of features, which hindered the ability to select the best-fitting features for each rubric. The research discussed in this article employed a larger essay dataset, applied thorough feature selection customized to each rubric, and tested six different deep learning architectures trained on a sample of 200 hyperparameter combinations randomly selected out of a space of 48,000 possible combinations. Training revealed that the best deep learning architectures had only two hidden layers, suggesting that shallower MLP neural networks were more accurate than deeper ones (three or more hidden layers).
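As a rough illustration of this procedure, the sketch below trains a two-hidden-layer MLP with a small random hyperparameter search. The synthetic data, search space, and scikit-learn MLPClassifier stand in for the study's own feature vectors, architectures, and 48,000-combination space.

```python
# Two-hidden-layer MLP rubric scorer with a small random hyperparameter search.
# Synthetic placeholder data stands in for the selected writing features and
# rubric scores (0-3); the search space is illustrative.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import ParameterSampler, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))            # placeholder feature vectors
y = rng.integers(0, 4, size=500)          # placeholder rubric scores
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

param_space = {
    "hidden_layer_sizes": [(64,), (128, 64), (256, 128), (128, 64, 32)],
    "alpha": [1e-4, 1e-3, 1e-2],          # L2 regularization strength
    "learning_rate_init": [1e-3, 1e-2],
    "batch_size": [32, 64],
}

best_model, best_acc = None, -1.0
for params in ParameterSampler(param_space, n_iter=10, random_state=0):
    model = MLPClassifier(max_iter=300, early_stopping=True, random_state=0, **params)
    model.fit(X_train, y_train)
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best_model, best_acc = model, acc

print(f"Best validation accuracy: {best_acc:.3f}")
```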

Based on thorough analyses of the distributions of rubric score predictions and of the distributions of resolved and human raters’ rubric scores, this study reveals that the rubric scoring models closely approximate the performance of human raters. This raises several questions: if a machine marker’s performance can become equivalent to that of human raters, is it possible to teach the machine to outperform the humans from whom it learns? What quantity and variety of data are needed to carry out research into this issue? What additional writing features need to be developed? These questions will be considered in a longitudinal follow-up study.

This study adopted a series of good practices to train generalizable rubric scoring models and made these practices completely transparent. The black box of each rubric scoring model was then scrutinized to determine which features contributed to the determination of rubric scores and to what degree. A set of the 20 most important features emerged for each rubric, in which at least 15 features were unique to that rubric and did not significantly contribute to the prediction of the other rubrics’ scores. The study also revealed that rubric score prediction does not directly depend on a few word-count-based features (all word-count features were pruned). Moreover, each rubric selected many intuitive features with no particularly dominant ones, making it more difficult to trick the AES system.

The results of the study are innovative for the AES research community because 1) they are derived from non-linear models, that is, no linearity assumption is made; 2) they provide explanations of the “reasoning” behind the rubric scores the AES system assigns; and 3) the study sets forth a methodology that promotes transparency and understanding of feature-based deep/shallow neural networks. Mechanisms to introduce AI accountability and build trust between AI and human agents are crucial for the reliable and large-scale deployment of AES systems.

This study has limitations. For example, the feature selection filter methods were applied to the entire original training set (from which labeled validation and testing sets were created). To minimize overfitting further, they should have been applied to the training set alone (not to the derived validation and testing sets). However, this is expected to have had a trivial impact on the rubric scoring models’ performance. Nonetheless, it is recommended that feature selection filter methods be applied to the unlabeled original validation and testing sets provided by ASAP. Another limitation relates to how the study estimated feature importance, by randomizing the values of a single feature at a time and observing the impact on the rubric scoring model’s accuracy. It remains unclear whether this approach captures interaction effects between features on rubric score prediction; this should be clarified and considered when explaining the AI’s reasoning. An additional limitation is the lack of testing of the rubric scoring models against gaming behaviors; such tests would assess how well the most important features work together in detecting counterfeit essays. Finally, the rubric scoring models could be retrained using only the most important features to limit the noise introduced by less important or even detrimental features. This will be part of a follow-up study.

This article envisions an end-to-end AES system that provides student writers with predicted holistic and rubric scores and that clearly identifies the scoring criteria behind each rubric. Such a system could detect suboptimal characteristics in student essays and offer formative feedback, helping students get past the learning plateau in English writing. Future work will cluster student essays into groups corresponding to the rubric score levels to discover discriminative patterns that can improve formative and remedial feedback. Alternatively, statistical methods such as ANOVA or the non-parametric Kruskal-Wallis test could be used to detect differences in feature distributions per rubric score. Further, an LSTM recurrent neural network with an attention mechanism (Alikaniotis et al. 2016; Dong et al. 2017) could be trained to locate the spots in student essays that influence the AES system’s decision when assigning rubric scores.

https://www.kaggle.com/c/asap-aes

As for D8, the resolved scores were determined by a set of adjudication rules, where a third human grader was involved if the disagreement between the first two human graders was too significant, making the adjudication process less biased (Kumar et al. 2017 ; Perelman 2013 , 2014 ).

https://www.datacamp.com/community/tutorials/transfer-learning

https://machinelearningmastery.com/transfer-learning-for-deep-learning/

https://www.linguisticanalysistools.org/

Code and data are available at: https://osf.io/puyvh/ .

For more information, consult https://machinelearningmastery.com/an-introduction-to-feature-selection/ .

https://github.com/WillKoehrsen/feature-selector

As previously indicated, the actual scale of holistic scores is 0–30 because the Conventions rubric score counts twice: HS = R1 + R2 + R3 + (2 × R4). For simplicity, the Conventions rubric is counted just once in this study so that the scoring scale is 0–24.

Ranges of accuracy, as reported in this section, are rough estimates based on ±1 standard deviation from the weight.

Abbass, H. A. (2019). Social integration of artificial intelligence: Functions, automation allocation logic and human-autonomy trust. Cognitive Computation, 11 (2), 159–171.

Alikaniotis, D., Yannakoudakis, H., & Rei, M. (2016). Automatic text scoring using neural networks. ArXiv Preprint ArXiv:1606.04289.

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., Neely, J. H., Nelson, D. L., Simpson, G. B., & Treiman, R. (2007). The English lexicon project. Behavior Research Methods, 39 (3), 445–459.

Boulanger, D., & Kumar, V. (2019). Shedding light on the automated essay scoring process. In Proceedings of the 12th International conference on educational data mining (EDM). Retrieved from http://educationaldatamining.org/edm2019/proceedings/

Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46 (3), 904–911.

Coltheart, M. (1981). The MRC psycholinguistic database. The Quarterly Journal of Experimental Psychology Section A, 33 (4), 497–505.

Covington, M. A., & McFall, J. D. (2010). Cutting the Gordian knot: The moving-average type-token ratio (MATTR). Journal of Quantitative Linguistics, 17 (2), 94–100.

Cozma, M., Butnaru, A. M., & Ionescu, R. T. (2018). Automated essay scoring with string kernels and word embeddings. ArXiv Preprint ArXiv:1804.07954.

Crossley, S. A., Kyle, K., & McNamara, D. S. (2016). The tool for the automatic analysis of text cohesion (TAACO): Automatic assessment of local, global, and text cohesion. Behavior Research Methods, 48 (4), 1227–1237.

Crossley, S. A., Kyle, K., & McNamara, D. S. (2017). Sentiment analysis and social cognition engine (SEANCE): An automatic tool for sentiment, social cognition, and social-order analysis. Behavior Research Methods, 49 (3), 803–821.

Crossley, S. A., Bradfield, F., & Bustamante, A. (2019a). Using human judgments to examine the validity of automated grammar, syntax, and mechanical errors in writing. Journal of Writing Research, 11 (2), 251–270.

Crossley, S. A., Kyle, K., & Dascalu, M. (2019b). The tool for the automatic analysis of cohesion 2.0: Integrating semantic similarity and text overlap. Behavior Research Methods, 51 (1), 14–27.

Cummins, R., Zhang, M., & Briscoe, E. J. (2016). Constrained multi-task learning for automated essay scoring . Association for Computational Linguistics.

Dong, F., Zhang, Y., & Yang, J. (2017). Attention-based recurrent convolutional neural network for automatic essay scoring. In Proceedings of the 21st conference on computational natural language learning (CoNLL 2017) (pp. 153–162).

Dronen, N., Foltz, P. W., & Habermehl, K. (2015). Effective sampling for large-scale automated writing evaluation systems. In Proceedings of the second (2015) ACM conference on learning@scale (pp. 3–10).

Fergadiotis, G., Wright, H. H., & Green, S. B. (2015). Psychometric evaluation of lexical diversity indices: Assessing length effects. Journal of Speech, Language, and Hearing Research, 58 (3), 840–852.

Fonti, V., & Belitser, E. (2017). Feature selection using lasso. VU Amsterdam Research Paper in Business Analytics.

Gregori-Signes, C., & Clavel-Arroitia, B. (2015). Analysing lexical density and lexical diversity in university students’ written discourse. Procedia-Social and Behavioral Sciences, 198 , 546–556.

Jankowska, M., Conrad, C., Harris, J., & Kešelj, V. (2018). N-gram based approach for automatic prediction of essay rubric marks. In E. Bagheri & J. C. K. Cheung (Eds.), Advances in Artificial Intelligence (pp. 298–303). Cham: Springer International Publishing.

Johansson, V. (2009). Lexical diversity and lexical density in speech and writing: A developmental perspective. Lund Working Papers in Linguistics, 53 , 61–79.

Kumar, V., Fraser, S. N., & Boulanger, D. (2017). Discovering the predictive power of five baseline writing competences. Journal of Writing Analytics, 1 (1), 176–226 Retrieved from https://journals.colostate.edu/analytics/article/view/107 .

Kyle, K. (2016). Measuring syntactic development in L2 writing: Fine grained indices of syntactic complexity and usage-based indices of syntactic sophistication . [Doctoral dissertation, Georgia State University]. https://scholarworks.gsu.edu/alesl_diss/35

Kyle, K., Crossley, S., & Berger, C. (2018). The tool for the automatic analysis of lexical sophistication (TAALES): Version 2.0. Behavior Research Methods, 50 (3), 1030–1046.

Liang, G., On, B.-W., Jeong, D., Kim, H.-C., & Choi, G. (2018). Automated essay scoring: A Siamese bidirectional LSTM neural network architecture. Symmetry, 10 (12), 682.

Liu, J., Xu, Y., & Zhao, L. (2019). Automated essay scoring based on two-stage learning. ArXiv Preprint ArXiv:1901.07744.

Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15 (4), 474–496.

Malvern, D., Richards, B., Chipere, N., & Durán, P. (2004). Lexical diversity and language development . New York: Palgrave Macmillan.

McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42 (2), 381–392.

Mesgar, M., & Strube, M. (2018). A neural local coherence model for text quality assessment. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 4328–4339).

Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., & Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences, 116 (44), 22071–22080.

Perelman, L. (2013). Critique of Mark D. Shermis & Ben Hammer, contrasting state-of-the-art automated scoring of essays: Analysis. Journal of Writing Assessment, 6 (1).

Perelman, L. (2014). When “the state of the art” is counting words. Assessing Writing, 21 , 104–111.

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?”: Explaining the predictions of any classifier. CoRR, abs/1602.0.

Rosebrock, A. (2017). Deep learning for computer vision with Python . PyImageSearch.

Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20 , 53–76.

Stefanowitsch, A., & Gries, S. T. (2003). Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics, 8 (2), 209–243.

Taghipour, K., & Ng, H. T. (2016). A neural approach to automated essay scoring. In Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 1882–1891).

Torruella, J., & Capsada, R. (2013). Lexical statistics and typological structures: A measure of lexical richness. Social and Behavioral Sciences, 95 , 447–454.

Wang, Y., Wei, Z., Zhou, Y., & Huang, X. (2018). Automatic essay scoring incorporating rating schema via reinforcement learning. In Proceedings of the 2018 conference on empirical methods in natural language processing (pp. 791–797).

West-Smith, P., Butler, S., & Mayfield, E. (2018). Trustworthy automated essay scoring without explicit construct validity. In 2018 AAAI Spring Symposium Series.

Woods, B., Adamson, D., Miel, S., & Mayfield, E. (2017). Formative essay feedback using predictive scoring models. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 2071–2080).

Zhao, S., Zhang, Y., Xiong, X., Botelho, A., & Heffernan, N. (2017). A memory-augmented neural model for automated grading. In Proceedings of the fourth (2017) ACM conference on learning@scale (pp. 189–192).

Zupanc, K., & Bosnić, Z. (2017). Automated essay evaluation with semantic analysis. Knowledge-Based Systems, 120 , 118–132.


Author information

Authors and Affiliations

Faculty of Science & Technology, Athabasca University, Edmonton, AB, Canada

Vivekanandan S. Kumar & David Boulanger


Corresponding author

Correspondence to Vivekanandan S. Kumar .

Additional information

Prof Jim Greer, along with Prof Gord McCalla, in the late 90s supervised my (Vive Kumar’s) doctoral research at the ARIES lab, University of Saskatchewan. In those days, the pursuit of autonomous AIED was at its frenzy. Jim was one of the first ones to realize the need for the continued existence of an umbilical cord even after the birth of a machine intelligence from its human creators. ARIES later formalized it as ‘human-in-the-loop’, where humans co-create knowledge by cooperating, at various degrees of aggregation and abstraction, with an autonomous learning machine. Jim’s vision was a companionship, where every piece of data, knowledge, advice, decision, and policy that were in play would require an equal say from both the machine and its human creator. The human might convince the machine, or the machine might explain away its reasoning for something to exist in that world of companions. Jim and I had several thoroughly enjoyable conversations about the centrality of humans in a machine-supplemented world and vice-versa. We even had one during a friendly faculty-student baseball game, as he differentiated between a baseball catcher and a cricket wicketkeeper, on a beautiful spring day, at the best university campus in North America. Jim argued for the continued existence of the cord, as a precursor to building a notion of trust between the two entities. That was Jim, seeding his ideas in our minds, no matter the place or the situation.

About a decade later, Jim was on the advisory board of the Faculty of Science and Technology at Athabasca University where I had joined as a faculty member in 2008. Normally, he would attend the board meetings via teleconference, but in one such meeting he was there in Edmonton, Alberta, in person. For some reason, he took me aside during the lunch break for a chat. He said he was looking deeply into analytics and urged me to pursue the low-hanging fruits of learning analytics! He wondered about the feasibility of doing analytics with smalldata while not ignoring the compelling need for the AIED community to push the data boundary toward bigdata. We joked about the luxury of our research colleagues in Physics, Astronomy and Biology working with truly big exabyte datasets in subatomic data, astronomy data and genomic datasets, respectively. We talked about ways in which AIED researchers could find a way to collect live educational data autonomously and continually, from a rather large number of educational institutions, at a global level, for use as secondary data in exascale. That, we felt was a way to target multiple sigma improvements in educational outcomes. That was the last time I saw Jim in his blue shirt and that was also the last time I heard him passionately speak about AIED research, sharing a piece of his wisdom with me. And, he must have triggered something then, as he always does. Since then our little research group at Athabasca University managed to churn out a crop of publications, including this one, inspired a number of budding researchers, including the second author of this article, based solely on open data, a sort of low hanging, smalldata research. We will pursue several more harvests along the lines of deep-learned automated essay scoring, human-in-the-loop, and trust models, in memory of Jim – a great soul, my supervisor, my inspiration – Vive

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix 1

This appendix lists the related works in the literature that, like this paper, used the Automated Student Assessment Prize’s seventh essay dataset to train their automated essay scoring models. Table 13 is referenced in the Related Work section.

Appendix 2

This appendix includes Tables 14, 15, and 16, which delineate the most important writing indices per rubric and across all rubrics combined. These tables are referenced in Section 5.

Appendix 3

This appendix lists and describes, along with their corresponding automated linguistic analysis tools, the important writing indices mentioned in this article. Table 17 is mainly referenced in the “Discussion: What are the Most Important Features per Rubric?” section.


About this article

Kumar, V.S., Boulanger, D. Automated Essay Scoring and the Deep Learning Black Box: How Are Rubric Scores Determined?. Int J Artif Intell Educ 31 , 538–584 (2021). https://doi.org/10.1007/s40593-020-00211-5


Published: 15 September 2020

Issue Date: September 2021

DOI: https://doi.org/10.1007/s40593-020-00211-5


  • Automated essay scoring
  • Deep learning
  • Neural network
  • Natural language processing
  • Feature importance


‘Full-on robot writing’: the artificial intelligence challenge facing universities

AI is becoming more sophisticated, and some say capable of writing academic essays. But at what point does the intrusion of AI constitute cheating?


“Waiting in front of the lecture hall for my next class to start, and beside me two students are discussing which AI program works best for writing their essays. Is this what I’m marking? AI essays?”

The tweet by historian Carla Ionescu late last month captures growing unease about what artificial intelligence portends for traditional university assessment. “No. No way,” she tweeted. “Tell me we’re not there yet.”

But AI has been banging on the university’s gate for some time now.

In 2012, computer theorist Ben Goertzel proposed what he called the “robot university student test”, arguing that an AI capable of obtaining a degree in the same way as a human should be considered conscious.

Goertzel’s idea – an alternative to the more famous “Turing test” – might have remained a thought experiment were it not for the successes of AIs employing natural language processing (NLP): most famously, GPT-3, the language model created by the OpenAI research laboratory.

Two years ago, computer scientist Nassim Dehouche published a piece demonstrating that GPT-3 could produce credible academic writing undetectable by the usual anti-plagiarism software.

“[I] found the output,” Dehouche told Guardian Australia, “to be indistinguishable from an excellent undergraduate essay, both in terms of soundness and originality. [My article] was initially subtitled, ‘The best time to act was yesterday, the second-best time is now’. Its purpose was to call for an urgent need to, at the very least, update our concepts of plagiarism.”


He now thinks we’re already well past the time when students could generate entire essays (and other forms of writing) using algorithmic methods.

“A good exercise for aspiring writers,” he says, “would be a sort of reverse Turing test: ‘Can you write a page of text that could not have been generated by an AI, and explain why?’ As far as I can see, unless one is reporting an original mathematics theorem and its proof, it is not possible. But I would love to be proven wrong.”

Many others now share his urgency. In news and opinion articles, GPT-3 has convincingly written on whether it poses a threat to humanity (it says it doesn’t), and about animal cruelty in the styles of both Bob Dylan and William Shakespeare.

A 2021 Forbes article about AI essay writing culminated in a dramatic mic-drop: “this post about using an AI to write essays in school,” it explained, “was written using an artificial intelligence content writing tool”.

Of course, the tech industry thrives on unwarranted hype. Last month, S Scott Graham, in a piece for Inside Higher Education, described encouraging students to use the technology for their assignments, with decidedly mixed results. The very best, he said, would have fulfilled the minimum requirements but little more. Weaker students struggled, since giving the system effective prompts (and then editing its output) required writing skills of a sufficiently high level to render the AI superfluous.

“I strongly suspect,” he concluded, “full-on robot writing will always and forever be ‘just around the corner’.”

That might be true, though only a month earlier, Slate’s Aki Peritz concluded precisely the opposite, declaring that “with a little bit of practice, a student can use AI to write his or her paper in a fraction of the time that it would normally take”.

Nevertheless, the challenge for higher education can’t be reduced merely to “full-on robot writing”.

Universities don’t merely face essays or assignments entirely generated by algorithms: they must also adjudicate a myriad of more subtle problems. For instance, AI-powered word processors habitually suggest alternatives to our ungrammatical phrases. But if software can algorithmically rewrite a student’s sentence, why shouldn’t it do the same with a paragraph – and if a paragraph, why not a page?

At what point does the intrusion of AI constitute cheating?

Deakin University’s Prof Phillip Dawson specialises in digital assessment security.

He suggests regarding AI merely as a new form of a technique called cognitive offloading.

“Cognitive offloading,” he explains, is “when you use a tool to reduce the mental burden of a task. It can be as simple as writing something down so you don’t have to try to remember it for later. There have long been moral panics around tools for cognitive offloading, from Socrates complaining about people using writing to pretend they knew something, to the first emergence of pocket calculators.”

Dawson argues that universities should make clear to students the forms and degree of cognitive offloading permitted for specific assessments, with AI increasingly incorporated into higher level tasks.

“I think we’ll actually be teaching students how to use these tools. I don’t think we’re going to necessarily forbid them.”

The occupations for which universities prepare students will, after all, soon also rely on AI, with the humanities particularly affected. Take journalism, for instance. A 2019 survey of 71 media organisations from 32 countries found AI already a “significant part of journalism”, deployed for news gathering (say, sourcing information or identifying trends), news production (anything from automatic fact checkers to the algorithmic transformation of financial reports into articles) and news distribution (personalising websites, managing subscriptions, finding new audiences and so on). So why should journalism educators penalise students for using a technology likely to be central to their future careers?


“I think we’ll have a really good look at what the professions do with respect to these tools now,” says Dawson, “and what they’re likely to do in the future with them, and we’ll try to map those capabilities back into our courses. That means figuring out how to reference them, so the student can say: I got the AI to do this bit and then here’s what I did myself.”

Yet formulating policies on when and where AI might legitimately be used is one thing – and enforcing them is quite another.

Dr Helen Gniel directs the higher education integrity unit of the Tertiary Education Quality and Standards Agency (TEQSA), the independent regulator of Australian higher education.

Like Dawson, she sees the issues around AI as, in some senses, an opportunity – a chance for institutions to “think about what they are teaching, and the most appropriate methods for assessing learning in that context”.

Transparency is key.

“We expect institutions to define their rules around the use of AI and ensure that expectations are clearly and regularly communicated to students.”

She points to ICHM, the Institute of Health Management, and Flinders University as three providers that now have explicit policies, with Flinders labelling the submission of work “generated by an algorithm, computer generator or other artificial intelligence” as a form of “contract cheating”.

But that comparison raises other issues.

In August, TEQSA blocked some 40 websites associated with the more traditional form of contract cheating – the sale of pre-written essays to students. The 450,000 visits those sites received each month suggest a massive potential market for AI writing, as those who once paid humans to write for them turn instead to digital alternatives.

Research by Dr Guy Curtis from the University of Western Australia found that respondents from a non-English-speaking background were three times more likely to buy essays than those with English as a first language. That figure no doubt reflects the pressures heaped on the nearly 500,000 international students taking courses at Australian institutions, who may struggle with insecure work, living costs, social isolation and the inherent difficulty of assessment in a foreign language.

But one could also note the broader relationship between the expansion of contract cheating and the transformation of higher education into a lucrative export industry. If a university degree becomes merely a product to be bought and sold, the decision by a failing student to call upon an external contractor (whether human or algorithmic) might seem like simply a rational market choice.

It’s another illustration of how AI poses uncomfortable questions about the very nature of education.

Ben Goertzel imagined his “robot university student test” as a demonstration of “artificial general intelligence”: a digital replication of the human intellect. But that’s not what NLP involves. On the contrary, as Luciano Floridi and Massimo Chiriatti say, with AI, “we are increasingly decoupling the ability to solve a problem effectively … from any need to be intelligent to do so”.


The new AIs train on massive data sets, scouring vast quantities of information so they can extrapolate plausible responses to textual and other prompts. Emily M Bender and her colleagues describe a language model as a “stochastic parrot”, something that “haphazardly [stitches] together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning”.

So if it’s possible to pass assessment tasks without understanding their meaning, what, precisely, do the tasks assess?

In his 2011 book For the University: Democracy and the Future of the Institution , the University of Warwick’s Thomas Docherty suggests that corporatised education replaces open-ended and destabilising “knowledge” with “the efficient and controlled management of information”, with assessment requiring students to demonstrate solely that they have gained access to the database of “knowledge” … and that they have then manipulated or “managed” that knowledge in its organisation of cut-and-pasted parts into a new whole.

The potential proficiency of “stochastic parrots” at tertiary assessment throws a new light on Docherty’s argument, confirming that such tasks do not, in fact, measure knowledge (which AIs innately lack) so much as the transfer of information (at which AIs excel).

To put the argument another way, AI raises issues for the education sector that extend beyond whatever immediate measures might be taken to govern student use of such systems. One could, for instance, imagine the technology facilitating a “boring dystopia” , further degrading those aspects of the university already most eroded by corporate imperatives. Higher education has, after all, invested heavily in AI systems for grading , so that, in theory, algorithms might mark the output of other algorithms, in an infinite process in which nothing whatsoever ever gets learned.

But maybe, just maybe, the challenge of AI might encourage something else. Perhaps it might foster a conversation about what education is and, most importantly, what we want it to be. AI might spur us to recognise genuine knowledge, so that, as the university of the future embraces technology, it appreciates anew what makes us human.



Artificial Intelligence Essay in 100, 200, and 300 Words for School Students


Artificial Intelligence (AI) has revolutionized every economic sector in the 21st century, owing to its ability to create efficient machines with cognitive abilities. Key elements of this contemporary advancement, such as machine learning (ML), deep learning, and natural language processing (NLP), have resulted in machines that use data and algorithms to perform human-like tasks. To learn more about AI, you can read the three sample essays on Artificial Intelligence in this blog. Keep browsing!

This Blog Includes:

  • Artificial Intelligence Essay in 100 Words for School Students
  • Artificial Intelligence Essay in 200 Words for School Students
  • Artificial Intelligence Essay in 300 Words for School Students


Artificial Intelligence, or AI, is a revolutionary technology that allows machines to undertake tasks that usually need human intelligence. It includes natural language processing (NLP), problem-solving capabilities, and machine learning (ML). In contemporary times, AI technology has entered all aspects of human life, from virtual assistants like Siri to advanced applications in the finance, industrial, and healthcare sectors.

Even though AI presents extensive potential for innovation and efficiency, it also brings adverse effects such as ethical concerns and job displacement. Therefore, it is important that we use AI technology in a controlled manner and always aspire to the wholesome development of human civilization.


Artificial Intelligence (AI) is an evolving technology in the field of computer science. It aims to create machines capable of reasoning, learning, and problem-solving by mimicking human cognitive functions. Using machine learning (ML), a subset of AI, computer and machine systems can adapt and improve performance without explicit programming. Furthermore, AI technology is used in streaming platforms, autonomous vehicles, and advanced medical diagnostics. Owing to this widespread usage, AI has become an integral part of modern society.

However, the rapid development of AI technology raises ethical concerns, especially about data privacy and security. Advancements in this technology also have the potential to increase unemployment across various sectors. Thus, it is essential to strike a balance between innovation and the protection of human interests in both the job sector and personal life.

Therefore, as we adopt AI in our daily lives, it becomes our responsibility to generate awareness about the crucial elements of AI and its impact on human civilization. Moreover, the technology should always adhere to ethical guidelines on data privacy and cybersecurity. Federal and state governments must also implement laws that prevent AI developers from threatening social harmony and national security within and beyond the nation’s territorial boundaries.


Artificial Intelligence (AI) is an attempt to converge computer science, mathematics, and cognitive science to create machines with human-like intelligence. These machines can learn, reason, solve problems, and understand language. Owing to the varied tasks performed by AI-enabled machines, AI technology has become a landmark element of the technological revolution of the 21st century.

AI technology has transformed the fields of healthcare, finance, automobiles, and streaming services. In healthcare, AI is used for drug discovery and diagnosis, while in the financial sector it is used for risk management and the optimization of trading strategies. It has also led to the development of AI-powered vehicles, which promise efficient and safe transportation. Furthermore, AI is used in recommendation systems on platforms like Netflix and Amazon to provide personalized suggestions based on users’ behaviour and preferences.

However, exemplary technological advancements always come with drawbacks. One of the grave consequences of the large-scale adoption of AI is the fear of job loss: in recent times, automation has left many jobless, especially in the manufacturing sector. Thus, governments and technology companies must ensure a balance between technological advancement and societal well-being. Policymakers must also maintain a regular dialogue with ethicists to control the exploitation of citizens’ data by technology companies.

Nevertheless, AI has opened exciting opportunities for inquisitive minds, as it has created new jobs in coding, data science, and robotics. Owing to the increased use of AI technology in these arenas, people have the opportunity to improve their critical thinking, creativity, and problem-solving skills. To leverage opportunities like these, students should start gaining hands-on experience with AI technologies at an early age.

Thus, Artificial Intelligence is a revolutionary force that will shape the future of mankind. To harness the maximum benefits of this technology, it is advisable to use it wisely. Any development or innovation in AI technology must be directed towards the betterment of society and must aim to protect the privacy of the weakest sections of society.


Ans: AI is a branch of technology and computer science that empowers machines to perform tasks that require human intelligence. AI systems or AI-enabled machines use data and algorithms to learn, reason, and solve problems. It is widely used for speech recognition, autonomous vehicles, image processing, and disease diagnosis.

Ans: Machine learning (ML), natural language processing (NLP), deep learning, algorithms, and problem-solving are some of the essential AI elements. 

Ans: You can start by defining Artificial Intelligence. Then you can highlight its application in different fields. Thereafter, you can highlight the ethical concerns around this technology. Finally, you can conclude by weighing in on the future implications of this technological advancement. 




Artificial Intelligence Essay for Students and Children

500+ Words Essay on Artificial Intelligence

Artificial Intelligence refers to the intelligence of machines, in contrast to the natural intelligence of humans and animals. With Artificial Intelligence, machines perform functions such as learning, planning, reasoning, and problem-solving. Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines. It is probably the fastest-growing development in the world of technology and innovation. Furthermore, many experts believe AI could solve major challenges and crisis situations.


Types of Artificial Intelligence

First of all, Artificial Intelligence can be categorized into four types, a categorization proposed by Arend Hintze. The categories are as follows:

Type 1: Reactive machines – These machines can react to situations, but they lack memory and so cannot use past experiences to inform future decisions. A famous example is Deep Blue, the IBM chess program that defeated chess legend Garry Kasparov. Such a system analyses all possible alternatives and chooses the best one.

Type 2: Limited memory – These AI systems are capable of using recent past experiences to inform future decisions. Self-driving cars are a good example: their decision-making systems observe surrounding traffic and take actions such as changing lanes, but there is no permanent storage of these observations.

Type 3: Theory of mind – This refers to understanding others. Above all, it means recognising that others have their own beliefs, intentions, desires, and opinions. However, this type of AI does not exist yet.

Type 4: Self-awareness – This is the highest and most sophisticated level of Artificial Intelligence. Such systems would have a sense of self, together with awareness, consciousness, and emotions. This type of technology does not yet exist, and building it would certainly be a revolution.
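To make the difference between the first two types concrete, here is a minimal, hypothetical Python sketch: a reactive agent decides from the current observation alone, while a limited-memory agent also consults a short buffer of recent observations. The class names and the toy following-distance scenario are invented for illustration.

```python
from collections import deque

# Toy observations: distance (in metres) to the vehicle ahead.
OBSERVATIONS = [30, 22, 15, 9, 7, 18, 25]

class ReactiveAgent:
    """Type 1: reacts only to the current observation; keeps no memory."""
    def act(self, distance):
        return "brake" if distance < 10 else "cruise"

class LimitedMemoryAgent:
    """Type 2: also considers a short window of recent observations."""
    def __init__(self, window=3):
        self.memory = deque(maxlen=window)  # transient, not permanent storage

    def act(self, distance):
        self.memory.append(distance)
        # Brake early if the gap has been shrinking quickly over the window.
        closing_fast = (len(self.memory) == self.memory.maxlen
                        and self.memory[0] - self.memory[-1] > 10)
        return "brake" if distance < 10 or closing_fast else "cruise"

if __name__ == "__main__":
    reactive, limited = ReactiveAgent(), LimitedMemoryAgent()
    for d in OBSERVATIONS:
        print(f"{d:>3} m -> reactive: {reactive.act(d)}, limited-memory: {limited.act(d)}")
```

The limited-memory agent's buffer is deliberately short-lived, mirroring the point above that such systems use recent observations without storing them permanently.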


Applications of Artificial Intelligence

First of all, AI has significant uses in healthcare. Companies are developing technologies for faster and more accurate diagnosis, and AI-assisted robotic surgery is already taking place, although still under human supervision. Another notable healthcare technology is IBM Watson.

In business, Artificial Intelligence can save significant time and effort. Robotic process automation is being applied to repetitive business tasks, machine-learning algorithms help companies serve customers better, and chatbots provide immediate responses and service to customers.


AI can greatly increase the pace of work in manufacturing. Huge numbers of products can be manufactured with AI, and much of the production process can run with minimal human intervention, saving a great deal of time and effort.

Artificial Intelligence also has applications in many other fields, including the military, law, video games, government, finance, automotive, auditing, and art. Clearly, AI has an enormous range of applications.

To sum it up, Artificial Intelligence looks set to shape the future of the world. Experts believe AI will soon become part and parcel of human life and will change the way we view the world. With Artificial Intelligence, the future seems intriguing and exciting.



Estelle Erasmus

How to Resist the Temptation of AI When Writing


Whether you're a student, a journalist, or a business professional, knowing how to do high-quality research and writing using trustworthy data and sources, without giving in to the temptation of AI or ChatGPT, is a skill worth developing.

As I detail in my book Writing That Gets Noticed, locating credible databases and sources and accurately vetting information can be the difference between turning a story around quickly or getting stuck with outdated information.

For example, several years ago the editor of Parents.com asked for a hot-take reaction to country singer Carrie Underwood saying that, because she was 35, she had missed her chance at having another baby. Since I had written about getting pregnant in my forties, I knew that as long as I updated my facts and figures, and included supportive and relevant peer-reviewed research, I could pull off this story. And I did.

The story ran later that day, and it led to other assignments. Here are some tips I’ve learned that you should consider mastering before you turn to automated tools like generative AI to handle your writing work for you.

Identify experts, peer-reviewed research study authors, and sources who can speak with authority—and ideally, offer easily understood sound bites or statistics on the topic of your work. Great sources include professors at major universities and media spokespeople at associations and organizations.

For example, writer and author William Dameron pinned his recent essay in HuffPost Personal around a statistic from the American Heart Association on how LGBTQ people experience higher rates of heart disease based on discrimination. Although he first found the link in a secondary source (an article in The New York Times), he made sure that he checked the primary source: the original study that the American Heart Association gleaned the statistic from. He verified the information, as should any writer, because anytime a statistic is cited in a secondary source, errors can be introduced.

Jen Malia, author of The Infinity Rainbow Club series of children’s books (whom I recently interviewed on my podcast), wrote a piece about dinosaur-bone hunting for Business Insider, a subject she covers in her book Violet and the Jurassic Land Exhibit.

After a visit to the Carnegie Museum of Natural History in Pittsburgh, Pennsylvania, Malia, whose books are set in Philadelphia, found multiple resources online and on the museum site that gave her the history of the Bone Wars, information on the exhibits she saw, and the scientific names of the dinosaurs she was inspired by. She also used the Library of Congress’ website, which offers digital collections and links to the Library of Congress Newspaper Collection.

Malia is a fan of searching for additional resources and citable documents with Google Scholar. “If I find that a secondary source mentions a newspaper article, I’m going to go to the original newspaper article, instead of just stopping there and quoting,” she says.


Your local public library is a great source of free information, journals, and databases (even ones that generally require a subscription and include embargoed research). For example, your search should include everything from health databases (Sage Journals, Scopus, PubMed) to databases for academic sources and journalism (American Periodical Series Online, Statista, Academic Search Premier) and databases for news, trends, market research, and polls (the Harris Poll, Pew Research Center, Newsbank, ProPublica).

Even if you find a study or paper that you can’t access in one of those databases, consider reaching out to the study’s lead author or researcher. In many cases, they’re happy to discuss their work and may even share the study with you directly and offer to talk about their research.

For her article on ADHD in The New York Times, journalist Paulette Perhach used Epic Research to find “dual team studies,” in which two independent teams address the same topic or question and, ideally, come to the same conclusions. She recommends locating research and experts via key associations for your topic. She also likes searching via Google Scholar but advises filtering it for studies and research from recent years to avoid using old data. She suggests keeping your links and research organized. “Always be ready to be peer-reviewed yourself,” Perhach says.

When you are looking for information for a story or project, you might be inclined to start with a regular Google search. But keep in mind that the internet is full of false information, and websites that look trustworthy can sometimes turn out to be businesses or companies with a vested interest in you taking their word as objective fact without additional scrutiny. Regardless of your writing project, unreliable or biased sources are a great way to torpedo your work—and any hope of future work.

Author Bobbi Rebell researched her book Launching Financial Grownups using the IRS’ website. “I might say that you can contribute a certain amount to a 401K, but it might be outdated because those numbers are always changing, and it’s important to be accurate,” she says. “AI and ChatGPT can be great for idea generation,” says Rebell, “but you have to be careful. If you are using an article someone was quoted in, you don’t know if they were misquoted or quoted out of context.”

If you use AI and ChatGPT for sourcing, you not only risk introducing errors, you risk introducing plagiarism—there is a reason OpenAI, the company behind ChatGPT, is being sued for downloading information from all those books.

Audrey Clare Farley, who writes historical nonfiction, has used a plethora of sites for historical research, including Women Also Know History, which allows searches by expertise or area of study, and JSTOR, a digital library database that offers a number of free downloads a month. She also uses Chronicling America, a project from the Library of Congress which gathers old newspapers to show how a historical event was reported, and Newspapers.com (which you can access via free trial but requires a subscription after seven days).

When it comes to finding experts, Farley cautions against choosing the loudest voices on social media platforms. “They might not necessarily be the most authoritative. I vet them by checking if they have a history of publication on the topic, and/or educational credentials.”

When vetting an expert, look for these red flags:

  • You can’t find their work published or cited anywhere.
  • They were published in an obscure journal.
  • Their research is funded by a company, not a university, or they are the spokesperson for the company they are doing research for. (This makes them a public relations vehicle and not an appropriate source for journalism.)

And finally, the best endings for virtually any writing, whether it’s an essay, a research paper, an academic report, or a piece of investigative journalism, circle back to the beginning of the piece, and show your reader the transformation or the journey the piece has presented in perspective.

As always, your goal should be strong writing supported by research that makes an impact without cutting corners. Only then can you explore tools that might make the job a little easier, for instance by generating subheads or discovering a concept you might be missing—because then you'll have the experience and skills to see whether it's harming or helping your work.


The Hechinger Report

Covering Innovation & Inequality in Education


OPINION: Artificial intelligence can be game-changing for students with special needs


Much has been made of artificial intelligence’s potential to revolutionize education. AI is making it increasingly possible to break down barriers so that no student is ever left behind.

This potential is real, but only if we are ensuring that all learners benefit.

Far too many students, especially those with special needs, do not progress as well as their peers do academically. Meanwhile, digital media, heavily reliant on visuals and text, with audio often secondary, is playing an increasing role in education.

For a typical user in most cases, this is fine. But not for blind or deaf students, whose sensory limitations frequently impede their access to quality education. The stakes are much higher for these students, and digital media often underserves them.

That’s why the development of AI-powered tools that can accommodate all learners must be a priority for policymakers, districts and the education technology industry.


Good instruction is not a one-way street where students simply absorb information passively. For learning content to be most effective, the student must be able to interact with it. But doing so can be especially challenging for students with special needs working with traditional digital interfaces.

A mouse, trackpad, keyboard or even a touch screen may not always be appropriate for a student’s sensory or developmental capabilities. AI-driven tools can enable more students to interact in ways that are natural and accessible for them.

For blind and low-vision students

For blind and low-vision students, digital classroom materials have historically been difficult to use independently. Digital media is visual, and to broaden access, developers usually have to manually code descriptive information into every interface.

These technologies also often impose a rigid information hierarchy that the user must tab through with keys or gestures. The result is a landscape of digital experiences that blind and low-vision students either cannot access at all or experience in a form that lacks the richness of the original.

For these students, AI-powered computer vision offers a solution — it can scan documents, scenes and apps and then describe visual elements aloud through speech synthesis. Coupled with speech recognition, this allows seamless conversational navigation without rigid menus or keyboard commands.

Free tools like Ask Envision and Be My Eyes demonstrate this potential. Using just an AI-enabled camera and microphone, these apps can capture and explain anything the user points them toward, and then answer follow-up questions.
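The workflow these apps enable can be approximated with off-the-shelf components: an image-captioning model describes what the camera sees, and a text-to-speech engine reads the description aloud. Below is a rough, hypothetical Python sketch using a public Hugging Face captioning model and the pyttsx3 speech library; it is not how Ask Envision or Be My Eyes are actually built, and the image file name is a placeholder.

```python
# pip install torch transformers pillow pyttsx3
from transformers import pipeline
import pyttsx3

def describe_aloud(image_path: str) -> str:
    """Caption an image and speak the caption aloud: a rough stand-in for the
    'point the camera and hear a description' workflow described above."""
    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    caption = captioner(image_path)[0]["generated_text"]

    engine = pyttsx3.init()  # offline text-to-speech engine
    engine.say(caption)
    engine.runAndWait()
    return caption

if __name__ == "__main__":
    # "worksheet.jpg" is a placeholder for whatever the student points the camera at.
    print(describe_aloud("worksheet.jpg"))
```

A production tool would layer speech recognition on top of this so the student could ask follow-up questions about the described scene.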

These technologies have the potential to allow blind and low-vision students to get the full benefit of the same engaging, personalized ed tech experiences that their peers have been using for years.

For deaf and hard-of-hearing students

In some ways, the visually oriented world of digital media is an ideal fit for deaf and hard-of-hearing students. Audio is often a secondary consideration, particularly once users can read.

In cases in which audio is required for comprehension, like with video, the accommodation most digital developers provide is text-based captioning. Unfortunately, this means that a user must already be a proficient reader.

For younger learners, or any learner who does not read fluently or quickly, translation into sign language is a preferable solution. AI can be of service here, translating speech and text into animated signs while computer vision reads the user’s gestures and translates them into text or commands.

There are some early developments in this area, but more work is needed to create a fully sign language-enabled solution.

For the youngest learners

For young learners, even those without diagnosed disabilities, developmentally appropriate interactions with conventional desktop/mobile apps remain a challenge. A young child cannot read or write, which makes most text-based interfaces impossible for them. And their fine motor control is not fully developed, which makes using a mouse or keyboard or trackpad more difficult.

AI voice controls address these problems by enabling students to simply speak requests or responses, a more natural interaction for these pre-readers and -writers. Allowing a child to simply ask for what they want and verbally answer questions gives them a more active role in their learning.

Voice control may also enable a more reliable assessment of their knowledge, as there are fewer confounding variables when the student is not trying to translate what they understand into an input that a computer will understand.

Computer vision can smooth over text-based methods of interaction. For example, username/password login forms can be replaced with QR codes; many school-oriented systems have already done so.
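A simple version of that QR-code login could work like this: the school system issues a short-lived token, encodes it in a QR image printed on a card or badge, and the classroom device signs the student in when its camera reads the code. The sketch below, using the qrcode library, only generates the code; the token format and URL are invented for illustration.

```python
# pip install qrcode[pil]
import secrets
import qrcode

def make_login_qr(student_id: str, filename: str) -> str:
    """Encode a one-time login token for a student into a QR code image."""
    token = secrets.token_urlsafe(16)  # would normally be issued and stored server-side
    payload = f"https://classroom.example/login?sid={student_id}&token={token}"
    qrcode.make(payload).save(filename)  # the classroom device scans this image
    return token

if __name__ == "__main__":
    make_login_qr("student-042", "student-042-login.png")
```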

Computer vision can also be used to enable interactions between the physical and digital world. A student can complete a task by writing or drawing on paper or constructing something from objects, and a computer can “see” and interpret their work.

Using physical objects can be more developmentally appropriate for teaching certain concepts. For example, having a child count with actual objects is often better than using digital representations. Traditional methods can also be more accurate in some cases, such as practicing handwriting with pencil and paper instead of a mouse or trackpad.

Even without physical objects, computer vision can enable the assessment of kinesthetic learning, like calculating on fingers or clapping to indicate syllables in a word.


A major hurdle in education is that although every student is unique, we have not had the tools or resources to truly tailor their learning to their individualized strengths and needs. AI technology has the potential for transformative change.

The responsibility falls on all of us — districts, policymakers and the ed tech industry — to collaborate and ensure that AI-powered accessibility becomes the norm, not the exception.

We must share knowledge and urgently advocate for policies that prioritize and fund the swift deployment of these game-changing tools to all learners. Accessibility can’t be an afterthought; it must be a top priority baked into every program, policy and initiative.

Only through concerted efforts can we bring the full potential of accessible AI to every classroom.

Diana Hughes is the vice president of Product Innovation and AI at Age of Learning .

This story about AI and special needs students was produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education.


The End of Foreign-Language Education

Thanks to AI, people may no longer feel the need to learn a second language.


A few days ago, I watched a video of myself talking in perfect Chinese. I’ve been studying the language on and off for only a few years, and I’m far from fluent. But there I was, pronouncing each character flawlessly in the correct tone, just as a native speaker would. Gone were my grammar mistakes and awkward pauses, replaced by a smooth and slightly alien-sounding voice. “My favorite food is sushi,” I said— wo zui xihuan de shiwu shi shousi —with no hint of excitement or joy.

I’d created the video using software from a Los Angeles–based artificial-intelligence start-up called HeyGen. It allows users to generate deepfake videos of real people “saying” almost anything based on a single picture of their face and a script, which is paired with a synthetic voice and can be translated into more than 40 languages. By merely uploading a selfie taken on my iPhone, I was able to glimpse a level of Mandarin fluency that may elude me for the rest of my life.

HeyGen’s visuals are flawed—the way it animates selfies almost reminded me of the animatronics in Disney’s It’s a Small World ride—but its language technology is good enough to make me question whether learning Mandarin is a wasted effort. Neural networks, the machine-learning systems that power generative-AI programs such as ChatGPT, have rapidly improved the quality of automatic translation over the past several years, making even older tools like Google Translate far more accurate.
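Those neural translation systems are now easy to try directly: small open-source models trained on a single language pair can run locally in a few lines of code. Here is a minimal, hypothetical sketch using a publicly available English-to-Chinese model from the Helsinki-NLP OPUS-MT family via the transformers library; the example sentence echoes the anecdote above.

```python
# pip install torch transformers sentencepiece
from transformers import pipeline

# A small open-source English-to-Chinese model; it downloads on first use.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")

if __name__ == "__main__":
    result = translator("My favorite food is sushi.")
    print(result[0]["translation_text"])
```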

At the same time, the number of students studying foreign languages in the U.S. and other countries is shrinking. Total enrollment in language courses other than English at American colleges decreased 29.3 percent from 2009 to 2021, according to the latest data from the Modern Language Association, better known as the MLA. In Australia, only 8.6 percent of high-school seniors were studying a foreign language in 2021—a historic low. In South Korea and New Zealand, universities are closing their French, German, and Italian departments. One recent study from the education company EF Education First found that English proficiency is decreasing among young people in some places.

Many factors could help explain the downward trend, including pandemic-related school disruptions, growing isolationism, and funding cuts to humanities programs. But whether the cause of the shift is political, cultural, or some mix of things, it’s clear that people are turning away from language learning just as automatic translation becomes ubiquitous across the internet.


Within a few years, AI translation may become so commonplace and frictionless that billions of people take for granted the fact that the emails they receive, videos they watch, and albums they listen to were originally produced in a language other than their native one. Something enormous will be lost in exchange for that convenience. Studies have suggested that language shapes the way people interpret reality. Learning a different way to speak, read, and write helps people discover new ways to see the world—experts I spoke with likened it to discovering a new way to think. No machine can replace such a profoundly human experience. Yet tech companies are weaving automatic translation into more and more products. As the technology becomes normalized, we may find that we’ve allowed deep human connections to be replaced by communication that’s technically proficient but ultimately hollow.

AI language tools are now in social-media apps, messaging platforms, and streaming sites. Spotify is experimenting with using a voice-generation tool from the ChatGPT maker OpenAI to translate podcasts in the host’s own voice, while Samsung is touting that its new Galaxy S24 smartphone can translate phone calls as they’re occurring. Roblox, meanwhile, claimed last month that its AI translation tool is so fast and accurate, its English-speaking users might not realize that their conversation partner “is actually in Korea.” The technology—which works especially well for “high-resource languages” such as English and Chinese, and less so for languages such as Swahili and Urdu—is being used in much more high-stakes situations as well, such as translating the testimony of asylum seekers and firsthand accounts from conflict zones. Musicians are already using it to translate songs, and at least one couple credited it with helping them to fall in love.

One of the most telling use cases comes from a start-up called Jumpspeak, which makes a language-learning app similar to Duolingo and Babbel. Instead of hiring actual bilingual actors, Jumpspeak appears to have used AI-generated “people” reading AI-translated scripts in at least four ads on Instagram and Facebook. At least some of the personas shown in the ads appear to be default characters available on HeyGen’s platform. “I struggled to learn languages my whole life. Then I learned Spanish in six months, I got a job opportunity in France, and I learned French. I learned Mandarin before visiting China,” a synthetic avatar says in one of the ads, while switching between all three languages. Even a language-learning app is surrendering to the allure of AI, at least in its marketing.

Alexandru Voica, a communications professional who works for another video-generating AI service, told me he came across Jumpspeak’s ads while looking for a program to teach his children Romanian, the language spoken by their grandparents. He argued that the ads demonstrated how deepfakes and automated-translation software could be used to mislead or deceive people. “I'm worried that some in the industry are currently in a race to the bottom on AI safety,” he told me in an email. (The ads were taken down after I started reporting this story, but it’s not clear if Meta or Jumpspeak removed them; neither company returned requests for comment. HeyGen also did not immediately respond to a request for comment about its product being used in Jumpspeak’s marketing.)

The world is already seeing how all of this can go wrong. Earlier this month, a far-right conspiracy theorist shared several AI-generated clips on X of Adolf Hitler giving a 1939 speech in English instead of the original German. The videos, which were purportedly produced using software from a company called ElevenLabs, featured a re-creation of Hitler’s own voice. It was a strange experience, hearing Hitler speak in English, and some people left comments suggesting that they found him easy to empathize with: “It sounds like these people cared about their country above all else,” one X user reportedly wrote in response to the videos. ElevenLabs did not immediately respond to a request for comment. (The Atlantic uses ElevenLabs’ AI voice generator to narrate some articles.)


Gabriel Nicholas, a research fellow at the nonprofit Center for Democracy and Technology, told me that part of the problem with machine-translation programs is that they’re often falsely perceived as being neutral, rather than “bringing their own perspective upon how to move text from one language to another.” The truth is that there is no single right or correct way to transpose a sentence from French to Russian or any other language—it’s an art rather than a science. “Students will ask, ‘How do you say this in Spanish?’ and I’ll say, ‘You just don’t say it the same way in Spanish; the way you would approach it is different,’” Deborah Cohn, a Spanish- and Portuguese-language professor at Indiana University Bloomington who has written about the importance of language learning for bolstering U.S. national security, told me.

I recently came across a beautiful and particularly illustrative example of this fact in an article written by a translator in China named Anne. “Building a ladder between widely different languages, such as Chinese and English, is sometimes as difficult as a doctor building a bridge in a patient's heart,” she wrote. The metaphor initially struck me as slightly odd, but thankfully I wasn’t relying on ChatGPT to translate Anne’s words from their original Mandarin. I was reading a human translation by a professor named Jeffrey Ding, who helpfully noted that Anne may have been referring to a type of heart surgery that has recently become common in China. It's a small detail, but understanding that context brought me much closer to the true meaning of what Anne was trying to say.


But most students will likely never achieve anything close to the fluency required to tell whether a translation rings close enough to the original or not. If professors accept that automated technology will far outpace the technical skills of the average Russian or Arabic major, their focus would ideally shift from grammar drills to developing cultural competency, or understanding the beliefs and practices of people from different backgrounds. Instead of cutting language courses in response to AI, schools should “stress more than ever the intercultural components of language learning that tremendously benefit the students taking these classes,” Jen William, the head of the School of Languages and Cultures at Purdue University and a member of the executive committee of the Association of Language Departments, told me.

Paula Krebs, the executive director of the MLA, referenced a beloved 1991 episode of Star Trek: The Next Generation to make a similar point. In “Darmok,” the crew aboard the starship Enterprise struggles to communicate with aliens living on a planet called El-Adrel IV. They have access to a “universal translator” that allows them to understand the basic syntax and semantics of what the Tamarians are saying, but the greater meaning of their utterances remains a mystery.

It later becomes clear that their language revolves around allegories rooted in the Tamarians’ unique history and practices. Even though Captain Picard was translating all the words they were saying, he “couldn’t understand the metaphors of their culture,” Krebs told me. More than 30 years later, something like a universal translator is now being developed on Earth. But it similarly doesn’t have the power to bridge cultural divides the way that humans can.

Is a robot writing your kids’ essays? We asked educators to weigh in on the growing role of AI in classrooms.


Kara Baskin talked to several educators about what kind of AI use they’re seeing in classrooms and how they’re monitoring it.

Remember writing essays in high school? Chances are you had to look up stuff in an encyclopedia — an actual one, not Wikipedia — or else connect to AOL via a modem bigger than your parents’ Taurus station wagon.

Now, of course, there’s artificial intelligence. According to new research from Pew, about 1 in 5 US teens who’ve heard of ChatGPT have used it for schoolwork. Kids in upper grades are more apt to have used the chatbot: About a quarter of 11th- and 12th-graders who know about ChatGPT have tried it.

For the uninitiated, ChatGPT arrived on the scene in late 2022, and educators continue to grapple with the ethics surrounding its growing popularity. Essentially, it generates free, human-like responses based on commands. (I’m sure this sentence will look antiquated in about six months, like when people described the internet as the “information superhighway.”)


I used ChatGPT to plug in this prompt: “Write an essay on ‘The Scarlet Letter.’” Within moments, ChatGPT created an essay as thorough as anything I’d labored over in AP English.

Is this cheating? Is it just part of our strange new world? I talked to several educators about what they’re seeing in classrooms and how they’re monitoring it. Before you berate your child over how you wrote essays with a No. 2 pencil, here are some things to consider.

Adapting to new technology isn’t immoral. “We have to recalibrate our sense of what’s acceptable. There was a time when every teacher said: ‘Oh, it’s cheating to use Wikipedia.’ And guess what? We got used to it, we decided it’s reputable enough, and we cite Wikipedia all the time,” says Noah Giansiracusa, an associate math professor at Bentley University who hosts the podcast “AI in Academia: Navigating the Future.”

“There’s a calibration period where a technology is new and untested. It’s good to be cautious and to treat it with trepidation. Then, over time, the norms kind of adapt,” he says — just like new-fangled graphing calculators or the internet in days of yore.

“I think the current conversation around AI should not be centered on an issue with plagiarism. It should be centered on how AI will alter methods for learning and expressing oneself. ‘Catching’ students who use fully AI-generated products ... implies a ‘gotcha’ atmosphere,” says Jim Nagle, a history teacher at Bedford High School. “Since AI is already a huge part of our day-to-day lives, it’s no surprise our students are making it a part of their academic tool kit. Teachers and students should be at the forefront of discussions about responsible and ethical use.”


Teachers and parents could use AI to think about education at a higher level. Really, learning is about more than regurgitating information — or it should be, anyway. But regurgitation is what AI does best.

“If our system is just for students to write a bunch of essays and then grade the results? Something’s missing. We need to really talk about their purpose and what they’re getting out of this, and maybe think about different forms of assignments and grading,” Giansiracusa says.

After all, while AI aggregates and organizes ideas, the quality of its responses depends on the users’ prompts. Instead of recoiling from it, use it as a conversation-starter.

“What parents and teachers can do is to start the conversation with kids: ‘What are we trying to learn here? Is it even something that ChatGPT could answer? Why did your assignment not convince you that you need to do this thinking on your own when a tool can do it for you?’” says Houman Harouni, a lecturer on education at the Harvard Graduate School of Education.

Harouni urges parents to read an essay written by ChatGPT alongside their student. Was it good? What could be done better? Did it feel like a short cut?

“What they’re going to remember is that you had that conversation with them; that someone thought, at some point in their lives, that taking a shortcut is not the best way ... especially if you do it with the tool right in front of you, because you have something real to talk about,” he says.

Harouni hopes teachers think about its implications, too. Consider math: So much grunt work has been eliminated by calculators and computers. Yet kids are still tested as in days of old, when perhaps they could expand their learning to be assessed in ways that are more personal and human-centric, leaving the rote stuff to AI.

“We could take this moment of confusion and loss of certainty seriously, at least in some small pockets, and start thinking about what a different kind of school would look like. Five years from now, we might have the beginnings of some very interesting exploration. Five years from now, you and I might be talking about schools wherein teaching and learning is happening in a very self-directed way, in a way that’s more based on … igniting the kid’s interest and seeing where they go and supporting them to go deeper and to go wider,” Harouni says.

Teachers have the chance to offer assignments with more intentionality.

“Really think about the purpose of the assignments. Don’t just think of the outcome and the deliverable: ‘I need a student to produce a document.’ Why are we getting students to write? Why are we doing all these things in the first place? If teachers are more mindful, and maybe parents can also be more mindful, I think it pushes us away from this dangerous trap of thinking about it in terms of ‘cheating,’ which, to me, is a really slippery path,” Giansiracusa says.

AI can boost confidence and reduce procrastination. Sometimes, a robot can do something better than a human, such as writing a dreaded resume and cover letter. And that’s OK; it’s useful, even.

“Often, students avoid applying to internships because they’re just overwhelmed at the thought of writing a cover letter, or they’re afraid their resume isn’t good enough. I think that tools like this can help them feel more confident. They may be more likely to do it sooner and have more organized and better applications,” says Kristin Casasanto, director of post-graduate planning at Olin College of Engineering.

Casasanto says that AI is also useful for de-stressing during interview prep.

“Students can use generative AI to plug in a job description and say, ‘Come up with a list of interview questions based on the job description,’ which will give them an idea of what may be asked, and they can even then say, ‘Here’s my resume. Give me answers to these questions based on my skills and experience.’ They’re going to really build their confidence around that,” Casasanto says.

Plus, when students use AI for basics, it frees up more time to meet with career counselors about substantive issues.

“It will help us as far as scalability. … Career services staff can then utilize our personal time in much more meaningful ways with students,” Casasanto says.

We need to remember: These kids grew up during a pandemic. We can’t expect kids to resist technology when they’ve been forced to learn in new ways since COVID hit.

“Now we’re seeing pandemic-era high school students come into college. They’ve been channeled through Google Classroom their whole career,” says Katherine Jewell, a history professor at Fitchburg State University.

“They need to have technology management and information literacy built into the curriculum,” Jewell says.

Jewell recently graded a paper on the history of college sports. It was obvious which papers were written by AI: They didn’t address the question. In her syllabus, Jewell defines plagiarism as “any attempt by a student to represent the work of another, including computers, as their own.”

This means that AI qualifies, but she also has an open mind, given students’ circumstances.

“My students want to do the right thing, for the most part. They don’t want to get away with stuff. I understand why they turned to these tools; I really do. I try to reassure them that I’m here to help them learn systems. I’m focusing much more on the learning process. I incentivize them to improve, and I acknowledge: ‘You don’t know how to do this the first time out of the gate,’” Jewell says. “I try to incentivize them so that they’re improving their confidence in their abilities, so they don’t feel the need to turn to these tools.”

Understand the forces that make kids resort to AI in the first place. Clubs, sports, homework: Kids are busy and under pressure. Why not do what’s easy?

“Kids are so overscheduled in their day-to-day lives. I think there’s so much enormous pressure on these kids, whether it’s self-inflicted, parent-inflicted, or school-culture inflicted. It’s on them to maximize their schedule. They’ve learned that AI can be a way to take an assignment that would take five hours and cut it down to one,” says a teacher at a competitive high school outside Boston who asked to remain anonymous.

Recently, this teacher says, “I got papers back that were just so robotic and so cold. I had to tell [students]: ‘I understand that you tried to use a tool to help you. I’m not going to penalize you, but what I am going to penalize you for is that you didn’t actually answer the prompt.’”

Afterward, more students felt safe to come forward to say they’d used AI. This teacher hopes that age restrictions become implemented for these programs, similar to apps such as Snapchat. Educationally and developmentally, they say, high-schoolers are still finding their voice — a voice that could be easily thwarted by a robot.

“Part of high school writing is to figure out who you are, and what is your voice as a writer. And I think, developmentally, that takes all of high school to figure out,” they say.

And AI can’t replicate voice and personality — for now, at least.


Guest Essay

A.I.-Generated Garbage Is Polluting Our Culture


By Erik Hoel

Mr. Hoel is a neuroscientist and novelist and the author of The Intrinsic Perspective newsletter.

Increasingly, mounds of synthetic A.I.-generated outputs drift across our feeds and our searches. The stakes go far beyond what’s on our screens. The entire culture is becoming affected by A.I.’s runoff, an insidious creep into our most important institutions.

Consider science. Right after the blockbuster release of GPT-4, the latest artificial intelligence model from OpenAI and one of the most advanced in existence, the language of scientific research began to mutate. Especially within the field of A.I. itself.

[Chart: Adjectives associated with A.I.-generated text, such as “commendable,” have increased in peer reviews of scientific papers about A.I. The vertical axis shows the frequency of each adjective per one million words.]

A study published this month examined scientists’ peer reviews — researchers’ official pronouncements on others’ work that form the bedrock of scientific progress — across a number of high-profile and prestigious scientific conferences studying A.I. At one such conference, those peer reviews used the word “meticulous” more than 34 times as often as reviews did the previous year. Use of “commendable” was around 10 times as frequent, and “intricate,” 11 times. Other major conferences showed similar patterns.

Such phrasings are, of course, some of the favorite buzzwords of modern large language models like ChatGPT. In other words, significant numbers of researchers at A.I. conferences were caught handing their peer review of others’ work over to A.I. — or, at minimum, writing them with lots of A.I. assistance. And the closer to the deadline the submitted reviews were received, the more A.I. usage was found in them.
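The headline numbers in that study come from a simple corpus statistic: count how often each target adjective appears, normalize per one million words, and compare one year's reviews with the next. A minimal Python sketch of that calculation, with made-up review snippets standing in for the real conference data, might look like this:

```python
import re
from collections import Counter

TARGET_ADJECTIVES = {"meticulous", "commendable", "intricate"}

def per_million(reviews):
    """Frequency of each target adjective per one million words of review text."""
    words = [w for review in reviews for w in re.findall(r"[a-z]+", review.lower())]
    counts = Counter(w for w in words if w in TARGET_ADJECTIVES)
    scale = 1_000_000 / max(len(words), 1)
    return {adj: counts[adj] * scale for adj in TARGET_ADJECTIVES}

if __name__ == "__main__":
    # Placeholder corpora; a real analysis would load peer reviews by year.
    reviews_2022 = ["the method is sound and the experiments are adequate"]
    reviews_2023 = ["a meticulous and commendable study with intricate analysis"]
    before, after = per_million(reviews_2022), per_million(reviews_2023)
    for adj in sorted(TARGET_ADJECTIVES):
        print(f"{adj}: {before[adj]:.0f} -> {after[adj]:.0f} occurrences per million words")
```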

If this makes you uncomfortable — especially given A.I.’s current unreliability — or if you think that maybe it shouldn’t be A.I.s reviewing science but the scientists themselves, those feelings highlight the paradox at the core of this technology: It’s unclear what the ethical line is between scam and regular usage. Some A.I.-generated scams are easy to identify, like the medical journal paper featuring a cartoon rat sporting enormous genitalia. Many others are more insidious, like the mislabeled and hallucinated regulatory pathway described in that same paper — a paper that was peer reviewed as well (perhaps, one might speculate, by another A.I.?).

What about when A.I. is used in one of its intended ways — to assist with writing? Recently, there was an uproar when it became obvious that simple searches of scientific databases returned phrases like “As an A.I. language model” in places where authors relying on A.I. had forgotten to cover their tracks. If the same authors had simply deleted those accidental watermarks, would their use of A.I. to write their papers have been fine?

What’s going on in science is a microcosm of a much bigger problem. Post on social media? Any viral post on X now almost certainly includes A.I.-generated replies, from summaries of the original post to reactions written in ChatGPT’s bland Wikipedia-voice, all to farm for follows. Instagram is filling up with A.I.-generated models, Spotify with A.I.-generated songs. Publish a book? Soon after, on Amazon there will often appear A.I.-generated “workbooks” for sale that supposedly accompany your book (which are incorrect in their content; I know because this happened to me). Top Google search results are now often A.I.-generated images or articles. Major media outlets like Sports Illustrated have been creating A.I.-generated articles attributed to equally fake author profiles. Marketers who sell search engine optimization methods openly brag about using A.I. to create thousands of spammed articles to steal traffic from competitors.

Then there is the growing use of generative A.I. to scale the creation of cheap synthetic videos for children on YouTube. Some example outputs are Lovecraftian horrors, like music videos about parrots in which the birds have eyes within eyes, beaks within beaks, morphing unfathomably while singing in an artificial voice, “The parrot in the tree says hello, hello!” The narratives make no sense, characters appear and disappear randomly, and basic facts like the names of shapes are wrong. After I identified a number of such suspicious channels on my newsletter, The Intrinsic Perspective, Wired found evidence of generative A.I. use in the production pipelines of some accounts with hundreds of thousands or even millions of subscribers.

As a neuroscientist, this worries me. Isn’t it possible that human culture contains within it cognitive micronutrients — things like cohesive sentences, narrations and character continuity — that developing brains need? Einstein supposedly said: “If you want your children to be intelligent, read them fairy tales. If you want them to be very intelligent, read them more fairy tales.” But what happens when a toddler is consuming mostly A.I.-generated dream-slop? We find ourselves in the midst of a vast developmental experiment.

There’s so much synthetic garbage on the internet now that A.I. companies and researchers are themselves worried, not about the health of the culture, but about what’s going to happen with their models. As A.I. capabilities ramped up in 2022, I wrote on the risk of culture’s becoming so inundated with A.I. creations that when future A.I.s are trained, the previous A.I. output will leak into the training set, leading to a future of copies of copies of copies, as content became ever more stereotyped and predictable. In 2023 researchers introduced a technical term for how this risk affected A.I. training: model collapse. In a way, we and these companies are in the same boat, paddling through the same sludge streaming into our cultural ocean.

With that unpleasant analogy in mind, it’s worth looking to what is arguably the clearest historical analogy for our current situation: the environmental movement and climate change. For just as companies and individuals were driven to pollute by the inexorable economics of it, so, too, is A.I.’s cultural pollution driven by a rational decision to fill the internet’s voracious appetite for content as cheaply as possible. While environmental problems are nowhere near solved, there has been undeniable progress that has kept our cities mostly free of smog and our lakes mostly free of sewage. How?

Before any specific policy solution was the acknowledgment that environmental pollution was a problem in need of outside legislation. Influential to this view was a perspective developed in 1968 by Garrett Hardin, a biologist and ecologist. Dr. Hardin emphasized that the problem of pollution was driven by people acting in their own interest, and that therefore “we are locked into a system of ‘fouling our own nest,’ so long as we behave only as independent, rational, free-enterprisers.” He summed up the problem as a “tragedy of the commons.” This framing was instrumental for the environmental movement, which would come to rely on government regulation to do what companies alone could or would not.

Once again we find ourselves enacting a tragedy of the commons: short-term economic self-interest encourages using cheap A.I. content to maximize clicks and views, which in turn pollutes our culture and even weakens our grasp on reality. And so far, major A.I. companies are refusing to pursue advanced ways to identify A.I.’s handiwork — which they could do by adding subtle statistical patterns hidden in word use or in the pixels of images.
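Those “subtle statistical patterns hidden in word use” are often described in the research literature as green-list watermarks: the generator is nudged toward a pseudorandom subset of words, and a detector checks whether that subset is over-represented. The following is a toy, hypothetical Python sketch of the detection side only; the hashing rule, the 50 percent baseline, and the sample sentence are invented for illustration and do not reflect any company's actual scheme.

```python
import hashlib

# Toy illustration of "green list" statistical watermarking as described in the
# research literature. This is NOT any vendor's real scheme; the hashing rule,
# baseline, and sample text below are invented for illustration.

GREEN_FRACTION = 0.5  # expected share of "green" words in unwatermarked text

def is_green(prev_word: str, word: str) -> bool:
    """Pseudorandomly assign each (previous word, word) pair to the green list."""
    digest = hashlib.sha256(f"{prev_word}|{word}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION

def green_rate(text: str) -> float:
    """Fraction of words that land on the green list, given their predecessor."""
    words = text.lower().split()
    if len(words) < 2:
        return 0.0
    hits = sum(is_green(words[i - 1], words[i]) for i in range(1, len(words)))
    return hits / (len(words) - 1)

if __name__ == "__main__":
    sample = "the study was meticulous commendable and intricate in every way"
    # Ordinary human text should hover near GREEN_FRACTION; text from a generator
    # nudged toward green words would sit well above it.
    print(f"green-word rate: {green_rate(sample):.2f} (baseline ~ {GREEN_FRACTION})")
```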

A common justification for inaction is that human editors can always fiddle around with whatever patterns are used if they know enough. Yet many of the issues we’re experiencing are not caused by motivated and technically skilled malicious actors; they’re caused mostly by regular users’ not adhering to a line of ethical use so fine as to be nigh nonexistent. Most would be uninterested in advanced countermeasures to statistical patterns enforced into outputs that should, ideally, mark them as A.I.-generated.

That’s why the independent researchers were able to detect A.I. outputs in the peer review system with surprisingly high accuracy: They actually tried. Similarly, right now teachers across the nation have created home-brewed output-side detection methods, like adding hidden requests for patterns of word use to essay prompts that appear only when copied and pasted.

In particular, A.I. companies appear opposed to any patterns baked into their output that can improve A.I.-detection efforts to reasonable levels, perhaps because they fear that enforcing such patterns might interfere with the model’s performance by constraining its outputs too much — although there is no current evidence this is a risk. Despite public pledges to develop more advanced watermarking, it’s increasingly clear that the companies are dragging their feet because it goes against the A.I. industry’s bottom line to have detectable products.

To deal with this corporate refusal to act we need the equivalent of a Clean Air Act: a Clean Internet Act. Perhaps the simplest solution would be to legislatively force advanced watermarking intrinsic to generated outputs, like patterns not easily removable. Just as the 20th century required extensive interventions to protect the shared environment, the 21st century is going to require extensive interventions to protect a different, but equally critical, common resource, one we haven’t noticed up until now since it was never under threat: our shared human culture.

Erik Hoel is a neuroscientist, a novelist and the author of The Intrinsic Perspective newsletter.



Hiring? Job-Seeking? Beware Of Artificial Education, Not AI


Employers, job seekers and career changers can no longer broadly, naively nor generously trust university credentialing.

The AI-era whirlwind, soaring tuition blowback, a long-predicted enrollment cliff, staggering administrative bloat and decades of standards dilution imperil higher education, devalue degrees and warrant sharpened stakeholder diligence.

Ironically, overly dramatic ivory-tower pearl clutching about how AI undermines academic integrity, shortcuts learning and subverts curricula reveals, rather than conceals, widespread campus unwillingness, inability and disincentives to adapt or outright abandon industrial-revolution style pedagogy for digital world demands.

Geoffrey Collier’s seminal 2013 editorial, “We Pretend to Teach, They Pretend to Learn,” pinpointed multiple, strong incentives to “maintain low standards” and abdicate “the rigor that the credentials putatively represent.” A decade later, higher ed’s harsh, open secrets even more urgently call for meaningful exposés, overdue operational audits and independent assessments that will corroborate comfort conveniences’ firm dominance over educational excellence and spur overhaul.

That’s the reality that AI-era recruiters and applicants need to comprehend — before staking precious time, fortune and reputation.

Poseurs Palace

Since business schools naturally exemplify institutional challenges in managing the real, underlying tensions inherent in escalated competency-or-credential tradeoffs, here are seven straightforward, minimal time-investment ideas to start the higher ed due diligence that hiring and career advancement choices warrant and deserve.

1. Start with a 2030 vision exercise. Many rising professionals openly question (and later lament) draining post-college savings, bypassing career opportunities and sacrificing personal time for unfulfilling graduate studies. B-schools still reliant on antiquated case studies, coding assignments and whiteboard formula scrawling struggle to convince students that their degrees will hold their value over the next half-decade.

That prescient article questioned how many MBA programs were ready, willing and able to leverage AI to deliver personalized learning, meaningful simulations, novel on-demand virtual teaching tools, augmented experiences and future-proofed skills. As time passes, b-schools fall even further behind AI’s rapid advancement. Campus leaders who consider banning GPTs or obsess about students mindlessly acing cut-and-paste assignments are far better off seriously addressing how “read-memorize-repeat” credentialing masquerades as high-priced education. Graduates’ readiness to succeed as AI-era difference makers urgently hinges on credible plans and action.

2. Ask recent graduates about educational highlights, not coursework. Individuals with high-quality learning experiences will quickly and enthusiastically describe meaningful academic components of their studies. Others will stammer mightily to evidence traditional schoolwork’s relevance. Course titles, concentrations and other transcript details are meaningless if the educational experience does not convey indispensable digital-era adaptive problem-solving skills.

(Non)answers to a few simple questions, such as the following, can be quite telling. Which courses best enhanced your business acumen? Name three classmates you valued most. Which elective courses were most meaningful? What are the most critical problems companies face in the next decade, and how can you help solve them? Can you explain cash flows’ importance? Why is industry insight vital? What is due diligence? What makes a business case convincing? How is AI your co-pilot?

3. Check for news stories about financial distress. Favoring credentialing, relaxing standards and defending overstaffing are often reflexive administrative responses to mounting financial pressures. For instance, the University of Delaware recently disclosed an estimated $20-40 million budget shortfall due to an unanticipated health insurance cost spike. As reported, university president Dennis Assanis responded with “perfect storm, pause button and moment to breathe” clichés, “aggressive” enrollment growth plans and a remote learning push. These quality-dilutive austerity steps rarely touch top campus executives’ pay (Assanis and chief fundraiser Keith Walter draw over $2.5 million combined annually) or tap endowments (such as Delaware’s nearly $2 billion), but they inevitably seek alumni rescue. The consequences ultimately lie hidden in a more costly “race to the bottom” aftermath.

4. Observe the classroom experience. The debate over in-person versus online education quality is a false binary choice. Exceptional educational experiences, as well as “academic anesthesia,” exist in both modalities. Recruiters and admission candidates should visit in-person and virtual classrooms. Wander hallways or request temporary online course access. Assess interaction, student energy and content. Is the institution playing school or business? The differences will be palpable.

At the top, Indiana University’s Kelley School of Business prioritizes competencies. Its perennially and universally top-ranked (by U.S. News and World Report, Fortune, Princeton Review and Poets & Quants) online MBA program is also very selective. Now in its 26th year of distance delivery, IU admits 27% of applicants, expects a decade of work experience and boasts a near-perfect 98% retention rate. Kelley staffs the program exclusively with full-time faculty, broadcasts from the well-funded, state-of-the-art Jellison Studios, delivers individualized job coaching, offers lifetime career services and fosters an intellectual curiosity culture.

Contrast such excellence with the desperate post-pandemic rise of online credentialing via modern-day correspondence courses rooted in long-outdated curricula, asynchronous narrated PowerPoint lectures and rewards for rote memorization. Many (if not most) online MBA programs self-report porous admission rates over 75%, with some approaching 100% acceptance.

5. Assess program partnerships and seals of approval. Is executive education contemporary, robust and thriving? Such trailblazing hallmarks the best. As Randy Bean detailed on Forbes, Northeastern University’s novel Institute for Experiential AI has quickly grown to “over 65 staff, 80+ core faculty members, and 30+ affiliate faculty members” and works with industry leaders, including NASA, Sanofi and Verizon, to deliver high-level experiential learning, host AI career fairs and “work on AI-driven innovation and transformation problems and solutions.”

Alternatively, other institutions resort to dangerous “metric management” to game rankings and appease accreditors. “Assurance of learning” has morphed into an expensive, cottage-industry quality-control paper drill led by compliance tabulators who, in a glaring conflict of interest, also often manage and dress school rankings data. Senior campus leaders’ treacherous fixation on underlings’ manicured data can fuel strategic blind spots and program obsolescence. Far worse, data malfeasance has sparked administrator scandals, resignations and even federal convictions.

6. Scrub campus websites far beyond landing-page sparkle. Replicate a lender’s preliminary due diligence. Do marketing pitches promote speed or substance? Check class schedule sites for actual enrollment numbers. Are electives abundant? Are in-person classes well subscribed? How much face time does each course require? Examine faculty profile pages closely. Do the sites have an antiquated “GeoCities look” or appealing external validation (e.g., intriguing publications, experiential learning, videos, podcasts, company visits and external awards)? Further, non-profits’ Form 990 filings unmask spending priorities, financial viability and compensation hierarchy. Collectively, these indicators litmus-test how well espoused claims hold up.
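
For the Form 990 piece of that checklist, the arithmetic is easy to sketch. The snippet below is a minimal illustration in Python, using hypothetical placeholder figures rather than numbers from any actual filing; the point is only which spending-priority ratios to compute once a filing is in hand.

    # Hedged sketch: rough Form 990 spending-priority ratios for preliminary due diligence.
    # All figures are hypothetical placeholders, not drawn from any real filing.
    filing = {
        "total_expenses": 900_000_000,     # total functional expenses reported on the filing
        "instruction_spend": 310_000_000,  # hypothetical teaching/program-services slice
        "top_executive_comp": 6_500_000,   # sum of highest-compensated officers
        "fundraising_spend": 45_000_000,
    }

    instruction_share = filing["instruction_spend"] / filing["total_expenses"]
    exec_comp_share = filing["top_executive_comp"] / filing["total_expenses"]
    fundraising_share = filing["fundraising_spend"] / filing["total_expenses"]

    print(f"instruction share of spending:    {instruction_share:.1%}")  # ~34.4%
    print(f"top-executive compensation share: {exec_comp_share:.2%}")    # ~0.72%
    print(f"fundraising share of spending:    {fundraising_share:.1%}")  # ~5.0%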

University staffing has grown exponentially in recent decades.

7. Calculate student-to-staff ratios. Universities eagerly report student-to-faculty ratios as an educational quality proxy. Administrative bloat is a better metric.

According to the Education Data Initiative, while enrollments have declined from a 2010 peak of approximately 21 million students, tuition has since grown at nearly 12% annually. U.S. college costs now hover near $37,000 per year, inclusive of all expenses, with private schools topping $60,000 and some over six figures yearly. Overstaffing and compounded cost mismanagement drive much of that inflation.

A recent Review of Social Economy study found that over the past five decades, U.S. colleges and universities increased full-time faculty by 92% and student enrollment by 78%. Yet, “full-time administrators and other professionals employed by those institutions increased by 164% and 452%, respectively.” Further, “the proliferation of part-time and adjunct faculty reduced the percentage of full-time faculty from 67% to 54%, while the percentage of full-time administrators increased from 96% to 97%.”

Progressive Policy Institute senior fellow Paul Weinstein, Jr., summarizing “How to Cut Administrative Bloat at U.S. Colleges,” wrote, “on average, the top 50 schools have one faculty per eleven students whereas the same institutions have one non-faculty employee per four students. There are now three times as many administrators and other professionals, as there are faculty (on a per student basis).” He added, “This does not include consultants and contractors, many of whom work with university admissions and marketing offices to boost enrollments.”
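
Weinstein’s per-student figures translate directly into a staff-per-faculty multiple. A minimal sketch, assuming a hypothetical campus of 20,000 students that matches his averages of one faculty member per eleven students and one non-faculty employee per four students:

    # Minimal ratio arithmetic behind point 7; the campus size is a hypothetical illustration.
    def per_student_ratios(students, faculty, non_faculty_staff):
        """Return student-to-faculty, student-to-staff and staff-per-faculty ratios."""
        student_to_faculty = students / faculty
        student_to_staff = students / non_faculty_staff
        staff_per_faculty = non_faculty_staff / faculty  # non-faculty employees per faculty member
        return student_to_faculty, student_to_staff, staff_per_faculty

    students = 20_000
    faculty = students // 11   # ~1,818 faculty (one per eleven students)
    staff = students // 4      # 5,000 non-faculty employees (one per four students)

    s2f, s2s, multiple = per_student_ratios(students, faculty, staff)
    print(f"student-to-faculty ratio: {s2f:.1f}")      # ~11.0
    print(f"student-to-staff ratio:   {s2s:.1f}")      # ~4.0
    print(f"staff per faculty member: {multiple:.1f}") # ~2.8, roughly three times as many

The exact decimals matter less than the method: the per-student view makes the “three times as many” claim easy to check against any school’s own published headcounts.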

Outsiders can only wonder what undue and outsized influence such entrenched bureaucratic bloat exerts on student expectations, curriculum and grade inflation.

Overdue diligence

Worthy business degrees qualify recipients to conduct due diligence. Applying such frameworks to AI-era schooling will quickly reveal what is truly artificial on campus: repackaged, outdated and pricey education. That is an eerie, imminent prospect. Or, more optimistically, it may be the exact blueprint that “stuck in the middle” institutions need. Who’s paying the real price?

Noah Barsky
