November 22
Towards multilingual and linguistically diverse Large Language Models
Ben Bergen
(reporting on work with Tyler Chang, Catherine Arnett, and James Michaelov)
Department of Cognitive Science, University of California, San Diego
If Large Language Models (LLMs) are to have their broadest possible scientific and social benefits, they must reflect the world’s linguistic diversity. Yet to date, a small number of languages (particularly English and Mandarin) have enjoyed most of the attention and investment in LLM development, and when LLMs are multilingual, this is usually by accident rather than by design. I will discuss several lines of recent work in my lab that aim to better understand, first, how multilingual LLMs work and, second, how to build LLMs for under-resourced languages. We find that multilingual LLMs encode shared multilingual representations for abstract grammatical structures, as well as language-specific ones. We test this by administering a cross-linguistic structural priming task, in which LLMs produce behavioral effects similar to those of human multilinguals. We also find that learning multiple languages influences how models learn each individual language. For under-resourced languages with relatively little available training data, training LLMs on other languages as well can produce better outcomes, depending on a variety of factors, including the size of the model, the size of the training sets, and the similarity between the languages. Finally, we tackle the finding that LLMs seem to perform better for some types of languages (like fusional languages) than for others (like agglutinative languages). We find a surprising explanation for this difference that turns out to have relatively little to do with language typology and more to do with typography.
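
To give a concrete sense of how a cross-linguistic structural priming measure can be operationalized with a language model, the sketch below scores an English passive target after a Spanish passive versus active prime. This is not the authors' code: the model name, prime and target sentences, and helper function are illustrative assumptions; a higher log-probability of the target after the structurally matching prime would count as a cross-linguistic priming effect.

# Minimal sketch of a cross-linguistic structural priming probe.
# Assumptions: an off-the-shelf multilingual causal LM, and that the prime's
# tokenization is a prefix of the prime+target tokenization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "bigscience/bloom-560m"  # any multilingual causal LM would do
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def target_logprob(prime: str, target: str) -> float:
    """Sum of log-probabilities of the target tokens, conditioned on the prime."""
    prime_ids = tok(prime, return_tensors="pt").input_ids
    full_ids = tok(prime + " " + target, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-probability of each token given its left context
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(2, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # keep only the positions belonging to the target
    n_prime = prime_ids.shape[1]
    return token_lp[0, n_prime - 1 :].sum().item()

# Spanish primes (passive vs. active), English passive target
passive_prime = "El libro fue escrito por la profesora."
active_prime = "La profesora escribió el libro."
target = "The ball was kicked by the boy."

effect = target_logprob(passive_prime, target) - target_logprob(active_prime, target)
print(f"Priming effect (log-prob difference): {effect:.3f}")

In practice, an effect like this would be averaged over many prime-target pairs and compared against matched controls, mirroring the design of human structural priming experiments.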