Understanding Lost Languages using ML – Artificial Intelligence news

In recent Artificial Intelligence news, researchers at MIT Computer Software and Artificial Intelligence Lab have created a machine learning algorithm to help decode and understand ancient languages. Many such languages are lost in the unknown as there is no way to decipher them. Language is one of many mediums to pass centuries-old knowledge from one generation to another. Understanding such languages mean discovering a whole lot of knowledge.  But some of these languages are quite hard to work on. Some have no relative language and some have no traditional dividers like whitespaces etc. Such languages hinder linguists as they have no grammar or vocabulary to guide them.

This goal of deciphering knowledge motivated Professor Regina Barzilay and her team to work on an algorithm capable enough to decipher a language without any need for a relative language. This algorithm is advanced enough to even guess the relationship between languages.

The researchers have made use of many linguistic principles in devising this algorithm. They studied how languages have usually evolved over time. How pronunciations have changed with time from language to language. These insights helped them in formulating a credible algorithm.

This algorithm embeds language phonics into a multi-dimensional space where distance in vectors relates to the difference in the pronunciations. In this way, the algorithm figures out patterns in language, finds out words, and relates them to corresponding words in the relative languages.

This project is an extension of research Barzilay and Luo did last year. They deciphered Ugaritic and Linear B languages. It has taken 100s of years of linguists to decipher Linear B. However, for these two languages, researchers knew that these belong to the Hebrew and Greek families.

Understanding Lost Languages using ML - Artificial Intelligence news

When this algorithm was tested on different languages to predict their correlation, it surprisingly gave correct results. For example, researchers were interested in knowing language related to Iberian. For this, they used Basque and even less related languages like Romance, Germanic, and Turkic language families. Basque and Latin were identified to be more related to Iberian although they are too different to be considered.

In future work, researchers aim to identify the semantic meanings of the words too. Up till now, the work was limited to connect words to known words in a related known language. This approach is called cognate-based decipherment. But as seen for Iberian, this is not always the case. Language may not have a related language for decipherment.

Researchers, therefore, are motivated to expand the research methods for a better understanding of lost languages. Commenting on this Artificial Intelligence news Barzilay says, “For instance, we may identify all the references to people or locations in the document which can then be further investigated in light of the known historical evidence,” says Barzilay. “These methods of ‘entity recognition’ are commonly used in various text processing applications today and are highly accurate, but the key research question is whether the task is feasible without any training data in the ancient language.”

Leave a Reply

Your email address will not be published. Required fields are marked *