Welcome to the Center for Research in Language (CRL)

CRL brings together faculty, students and research associates who share an interest in the nature of language, the processes by which language is acquired and used, and the mediation of language in the human brain.

CRL is housed in the Cognitive Science Building on the Thurgood Marshall Campus at the University of California, San Diego and boasts an interdisciplinary academic staff composed of specialists in a wide variety of fields:

  • Cognitive science
  • Communication
  • Communication disorders
  • Computer science
  • Developmental psychology
  • Linguistics
  • Neurosciences
  • Pediatrics
  • Psycholinguistics

CRL Talks

November 22

Towards multilingual and linguistically diverse Large Language Models

Ben Bergen
(reporting on work with Tyler Chang, Catherine Arnett, and James Michaelov)

Cognitive Science Department at the University of California, San Diego

If Large Language Models (LLMs) are to have their broadest possible scientific and social benefits, they must reflect the world’s linguistic diversity. Yet to date, a small number of languages (particularly English and Mandarin) have enjoyed the most attention and investment in LLM development, and when LLMs are multilingual, this is usually by accident rather than by design. I will discuss several lines of recent work in my lab that aim to better understand, first, how multilingual LLMs work and, second, how to build LLMs for under-resourced languages. We find that multilingual LLMs encode shared multilingual representations for abstract grammatical structures, as well as language-specific ones. We test this by administering a cross-linguistic structural priming task, in which LLMs produce behavioral effects similar to those of human multilinguals. We also find that learning multiple languages influences how models learn each language. For under-resourced languages with relatively little available training data, training LLMs on other languages can produce better outcomes, depending on a variety of factors, including the sizes of the model and the training sets, and the similarity between the languages. Finally, we tackle the finding that LLMs seem to perform better for some types of languages (like fusional languages) than others (like agglutinative languages). We find a surprising explanation for this difference that turns out to have relatively little to do with language typology, and more to do with typography.
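
As a rough illustration of how a cross-linguistic structural priming comparison can be run against a causal language model (a hypothetical sketch, not the specific setup used in the work reported in this talk), the Python snippet below scores a fixed target sentence after a prime that shares its syntactic structure and after one that does not. The model name, example sentences, and helper function are placeholders chosen for illustration.

    # Hypothetical sketch: cross-linguistic structural priming with a causal LM.
    # Compare the model's log-probability of a target sentence after a prime
    # that shares its structure vs. one that does not.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; any causal LM could be substituted
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def target_logprob(prime: str, target: str) -> float:
        """Sum of log-probabilities the model assigns to `target` given `prime`.
        Assumes the prime's tokenization is a prefix of the full sequence."""
        prime_ids = tokenizer(prime, return_tensors="pt").input_ids
        full_ids = tokenizer(prime + " " + target, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        # Logits at position i predict the token at position i + 1.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        next_tokens = full_ids[0, 1:]
        token_logps = log_probs[torch.arange(next_tokens.size(0)), next_tokens]
        n_prime = prime_ids.size(1)
        # Keep only the positions where target tokens are being predicted.
        return token_logps[n_prime - 1:].sum().item()

    # Toy item: a Spanish passive prime vs. an active prime, English passive target.
    congruent = target_logprob("El libro fue leído por la niña.",
                               "The truck was followed by the car.")
    incongruent = target_logprob("La niña leyó el libro.",
                                 "The truck was followed by the car.")
    print(congruent, incongruent)

Across many such item pairs, a reliable advantage for structurally congruent primes would be the model-side analogue of the priming effects observed in human multilinguals.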