A massive crowd-sourced study using more than 25,000 questions from accounting assessments at 186 institutions has found that students outperform ChatGPT overall.
The study also found that the artificial intelligence tool sometimes fabricated facts, made nonsensical errors such as adding two numbers in a subtraction problem, and often provided descriptive explanations for its answers even when they were incorrect.
The study’s 328 co-authors from around the world, including University of Auckland accounting and finance academics Ruth Dimes and David Hay, entered assessment questions into ChatGPT-3 and evaluated the accuracy of its responses between December 2022 and January 2023.
Ruth Dimes, who directs the Business School’s Business Masters programme, used two recent exams from the ‘analysing financial statements’ course.
“I entered the exam questions into ChatGPT and recorded how it performed compared to the students’ grades. My findings were consistent with the study overall and I was surprised that ChatGPT didn’t perform as well as I thought it might have,” she says.
Meanwhile, David Hay, Professor of Auditing, used exam and test questions from the auditing course and found that the bot performed slightly better in auditing courses than in financial accounting courses, but still not as well as the students.
The study, led by Professor David Wood of Brigham Young University in Utah, includes 25,817 questions (25,181 of them gradable by ChatGPT) drawn from 869 different class assessments, as well as 2,268 questions from textbook test banks. The questions cover topics such as accounting information systems (AIS), auditing, financial accounting, managerial accounting, and tax.
The co-authors evaluated ChatGPT’s answers to the questions they entered and determined whether they were correct, partially correct, or incorrect. The results indicate that across all assessments, students scored an average of 76.7 percent, while ChatGPT scored 47.4 percent based on fully correct answers. However, when given some credit for partially correct answers, ChatGPT averaged 56.5 percent overall, enough to scrape through many courses.
The study also revealed differences in ChatGPT’s performance based on the topic area of the assessment. Specifically, the chatbot performed relatively better on AIS and auditing assessments compared to tax, financial, and managerial assessments.
Dimes says she’s interested in seeing how newer versions of ChatGPT and other AI tools would perform if a similar study were undertaken at another point in time.
“These tools will perform better over time and the study highlights the importance of thinking carefully about what universities assess and how. Are we assessing critical thinking as opposed to something that can be rote learned and regurgitated?
“ChatGPT has already changed how we teach and learn. Many teaching staff run our assessments through the tool so we’re aware of what it might come up with.”
Dimes says the study, believed to be the first of its kind in the accounting field, was a unique experience.
“One of the most interesting parts of this for me was the process of gathering the data. It was amazing to see the speed at which researchers all over the world collated their data and trusted in the process. It was a really collaborative and effective way to do research.”
The study, ‘The ChatGPT Artificial Intelligence Chatbot: How Well Does It Answer Accounting Assessment Questions?’, is forthcoming in Issues in Accounting Education, published by the American Accounting Association. https://aaahq.org/portals/0/documents/publications/issues-2023-013.pdf