Data analysis and language join forces for Scottish Gaelic

Many people think of cute seals or a vanishing rainforest when hearing the word “endangered.” Sophie Brown ’25, Mark Liberko ’26, and Assistant Professor of Statistics Tyler George, however, think about the Scottish Gaelic language and how they can help.

Scottish Gaelic is a low-resource, endangered language, with approximately 60,000 speakers in Scotland. These Cornellians, along with collaborators Peter Barclay and Alistair Lawson of Edinburgh Napier University, are on a mission to use data science to help preserve the heritage and cultural identity of Scotland through a revival of its native Celtic language.

Since computers can analyze text much faster than humans, the students are using computer programs to gather, store, break down, and analyze Gaelic. This process helps create much-needed resources, such as a library of Gaelic words, that will help preserve the language. Their analysis will be shared with scholars around the globe who can help expand the project.

Photo of Edinburgh streets taken during a Fall 2024 trip to Scotland, where Brown and Liberko gave a presentation on their CSRI Scottish Gaelic project. Photo provided by Sophie Brown '25.

Brown and Liberko got to work during the 2024 Cornell Summer Research Institute (CSRI), and work on the project will continue this summer during CSRI 2025. The two researchers say that not knowing the language was not a problem.

“Since we didn’t know the language, it was easier to not try and look at the meaning of the words and just focus on the analysis,” Brown said. “Of course, I think we picked up some in the process, and I was also just curious on my own; I started the course on Duolingo for that.”

Liberko approached the issue through the lens of individual parts of speech. For him, it wasn’t about what the words meant. He needed to focus on breaking the language down into its base parts, such as nouns and verbs, and then programming a computer to identify them.

“We started with 60,000 words,” Liberko said, “and then we had to break the words down into different spaces and rules. I found what was basically a grammar textbook, written in English for Scottish people, and I read through that. It was very, very dense, but I just took note of every single rule and special exception. Then I found a way to put those exceptions into code.”
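For readers curious what "putting exceptions into code" can look like, here is a minimal sketch in Python. It is not the students' actual code. The word list, the single suffix rule, and the function names `tag_word` and `tag_sentence` are illustrative assumptions, and the grammar is heavily simplified. The idea is that known words and special cases are looked up first, and general rules apply only when no exception matches.

```python
# Hypothetical, hand-made table of known words and special cases.
# These are checked before any general rule is applied.
EXCEPTIONS = {
    "agus": "CCONJ",  # "and"
    "tha": "VERB",    # "is"
    "an": "DET",      # definite article
    "mòr": "ADJ",     # "big"
}

# Simplified, illustrative suffix rules. A real rule set would be drawn
# from a grammar reference and would be far larger.
SUFFIX_RULES = [
    ("adh", "NOUN"),  # common verbal-noun ending, treated as a noun here
]


def tag_word(word: str) -> str:
    """Return a part-of-speech label for one word, or "X" if unknown."""
    w = word.lower()
    if w in EXCEPTIONS:               # special cases win over general rules
        return EXCEPTIONS[w]
    for suffix, tag in SUFFIX_RULES:  # otherwise try each general rule in order
        if w.endswith(suffix):
            return tag
    return "X"                        # no exception or rule matched


def tag_sentence(sentence: str) -> list[tuple[str, str]]:
    """Split on whitespace and label each word."""
    return [(token, tag_word(token)) for token in sentence.split()]
```

Calling `tag_sentence("tha an cù mòr")` labels "tha", "an", and "mòr" from the exception table and marks "cù" as unknown, since this toy table does not include it.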

After spending 40 hours a week over the eight weeks of CSRI, they had built their database of the Scottish Gaelic language, written code for labeling the base parts of language, and shared that code in spaCy, an open-source library of language-processing code that anyone can access, use, or even alter. Both students were excited that they were able to not only make progress with the project but also leave something behind that would carry on beyond their summer of hard work. George says there’s enough work for several more summers of CSRI.

A group photo of the Cornell students who visited Scotland in the fall of 2024, including Sophie Brown '25 (front left) and Mark Liberko '26 (front row; third from right). Photo provided by Assistant Professor of Statistics Tyler George.

“We’re hoping this summer to put together a manuscript about the project thus far, to help get the word out that people are working on this,” George said. “There could be predictive models, technology for learning and using the language, and even a branch of statistical analysis of the language that we haven’t touched on yet. This year’s students will take vital next steps to make text analysis with Scottish Gaelic possible in practice.”

During CSRI 2025, students will remove non-Gaelic words from previously collected texts, build a program to reduce words to their dictionary form, and build large language models (LLMs) to perform various tasks.