Teresa Berndtsson / Better Images of AI / Letter Word Text Taxonomy / Licenced by CC-BY 4.0.
The vast majority of natural language processing (NLP) datasets and research at present focus on a small number of high-resource languages, with studies on English dominating the field. Clearly, such an imbalance is undesirable, putting those who don’t use English at a disadvantage.
In this article, we highlight some of the work and initiatives being carried out on low-resource languages.
Lanfrica
Africa is one of the most linguistically diverse regions in the world. Despite this, African languages are barely represented in technology and research. Lanfrica aims to mitigate the difficulty encountered in the discovery of African language resources by creating a centralised hub. The team at Lanfrica have built a language-focused search engine that makes it fast and easy to find information on the internet about resources relating to African languages. Now with more than 1000 resources, their aim is to catalogue and connect all African language resources, one record at a time.
As well as this platform, Lanfrica also hosts regular online talks where you can hear from researchers in the field. This talk series provides a platform for anyone to share or showcase their efforts (research, projects, software, applications, datasets, models, initiatives, etc.) in NLP.
Masakhane
Masakhane is a grassroots organisation whose mission is to strengthen and spur NLP research in African languages. The organisation is currently engaged in a number of projects.
Urdu
In this paper, Maaz Amjad, Sabur Butt, Hamza Imam Amjad, Alisa Zhila, Grigori Sidorov and Alexander Gelbukh outline their approach when participating in the shared task UrduFake@FIRE2021, which centred on fake news detection in Urdu. This shared task aimed to attract and encourage researchers working in different NLP domains to address the automatic fake news detection task and help to mitigate the proliferation of fake content on the web.
The team have also looked into tweets in Urdu, in their paper Threatening Language Detection and Target Identification in Urdu Tweets.
Indian regional languages
B. S. Harish and R. Kasturi Rangan present a comprehensive survey on Indian regional language processing, covering tasks such as machine translation, named entity recognition, sentiment analysis and part-of-speech tagging.
Bengali
Md. Rajib Hossain and Mohammed Moshiul Hoque study Bengali word embedding in their paper Towards Bengali Word Embedding: Corpus Creation, Intrinsic and Extrinsic Evaluations. They present three embedding techniques with different hyperparameters, implemented on a Bengali corpus which consists of 180 million words.
Indigenous languages of the Americas
Introducing QuBERT: A Large Monolingual Corpus and BERT Model for Southern Quechua, by Rodolfo Zevallos et al., introduces a large combined corpus for deep learning of Quechua. The authors also provide a public, pre-trained BERT model called QuBERT. They have tested their corpus and its corresponding BERT model on two major tasks: (1) named-entity recognition (NER) and (2) part-of-speech (POS) tagging.
In this paper you can read about the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas. Manuel Mager et al. report on the 214 submissions from eight teams, which focussed on 10 different languages: Asháninka, Aymara, Bribri, Guarani, Nahuatl, Otomí, Quechua, Rarámuri, Shipibo-Konibo, and Wixarika.
Axolotl: a Web Accessible Parallel Corpus for Spanish-Nahuatl, by Ximena Gutierrez-Vasques, Gerardo Sierra and Isaac Hernandez Pompa, presents a project which comprises a Spanish-Nahuatl parallel corpus and its search interface.
Gina Bustamante, Arturo Oncevay and Roberto Zariquiey introduce monolingual corpora for four indigenous and endangered languages from Peru (Shipibo-konibo, Ashaninka, Yanesha and Yine) in their paper No Data to Crawl? Monolingual Corpus Creation from PDF Files of Truly Low-Resource Languages in Peru.
Dysarthric speech recognition
Karima Kadaoui is researching how to help speech-impaired people communicate. Part of her project is to build an application to “translate” speech which may be unclear. She talks about the inspiration behind her work, and what she plans to achieve, in this video.
Sign language
Steven Kolawole created a dataset for Nigerian sign language with the help of a TV sign language broadcaster and two schools. Using this dataset, he built a sign-to-speech model for the language. You can find out more in this interview.
In their position paper, Including Signed Languages in Natural Language Processing, Kayo Yin, Amit Moryossef, Julie Hochgesang, Yoav Goldberg, and Malihe Alikhani call on the NLP community to include signed languages as a research area with high social and scientific impact. They discuss the linguistic properties of signed languages, review the limitations of current sign language processing models, and identify the open challenges to extend NLP to signed languages.
In her paper Approaches to the Anonymisation of Sign Language Corpora, Amy Isard considers the state of the art for the anonymisation of sign language corpora. She explores the motivations behind anonymisation, and details the processes which can be used to anonymise both the video and the annotations belonging to a corpus.
Lucy Smith, Managing Editor for AIhub.