Comparing and Integrating Information from Corpora and AI/LLMs
I recently finished a large-scale investigation (https://www.english-corpora.org/ai-llms/) of how well the predictions about linguistic variation from two Large Language Models (GPT and Gemini) match actual data from corpora such as COCA, COHA, GloWbE, NOW, iWeb, and the TV and Movies corpora (all from English-Corpora.org), as well as Sketch Engine. I will talk about the strengths and weaknesses of LLMs for linguistic research (especially regarding lexical issues). In addition, I will discuss how LLMs can be used to augment corpus data, for example through the semantic categorization, grouping, and labeling of collocates and phrases; comparisons between words (via collocates); and the analysis of differences between genres, historical periods, and dialects. I am currently updating the architecture and interface of English-Corpora.org to use API requests to LLMs, which will occur “behind the scenes” and will expose all of this “linguistic knowledge” to end users. The updates at English-Corpora.org will be publicly available in Summer 2025.
Biodata
Mark Davies is Professor Emeritus of Linguistics at Brigham Young University in Provo, Utah, USA. He is the author of six books and 90 articles; he has been the keynote speaker at many international conferences; and he is the recipient of several large research grants. All of these research activities deal with creating corpora and using corpus data for research and teaching, especially in terms of genre-based, historical, and dialectal variation in English. Perhaps most importantly, he is the (sole) creator of most of the corpora from English-Corpora.org, which are probably the most widely used corpora for teaching, learning, and research.