University of Groningen
Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages
Dhar, Prajit; Bisazza, Arianna; van Noord, Gertjan
Published in:
Proceedings of the 8th Workshop on Asian Translation (WAT2021)
IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from
it. Please check the document version below.
Document Version
Publisher's PDF, also known as Version of record
Publication date:
2021
Citation for published version (APA):
Dhar, P., Bisazza, A., & van Noord, G. (2021). Optimal Word Segmentation for Neural Machine Translation
into Dravidian Languages. In T. Nakazawa, H. Nakayama, I. Goto, H. Mino, C. Ding, R. Dabre, A.
Kunchukuttan, S. Higashiyama, H. Manabe, W. Pa Pa, S. Parida, O. Bojar, C. Chu, A. Eriguchi, K. Abe, Y.
Oda, K. Sudoh, S. Kurohashi, & P. Bhattacharyya (Eds.), Proceedings of the 8th Workshop on Asian
Translation (WAT2021) (pp. 181-190). Association for Computational Linguistics (ACL).
Optimal Word Segmentation for
Neural Machine Translation into Dravidian Languages
Prajit Dhar, Arianna Bisazza, Gertjan van Noord
University of Groningen
{p.dhar, a.bisazza, g.j.m.van.noord}@rug.nl
Abstract

Dravidian languages, such as Kannada and Tamil, are notoriously difficult to translate by state-of-the-art neural models. This stems from the fact that these languages are morphologically very rich as well as being low-resourced. In this paper, we focus on subword segmentation and evaluate Linguistically Motivated Vocabulary Reduction (LMVR) against the more commonly used SentencePiece (SP) for the task of translating from English into four different Dravidian languages. Additionally, we investigate the optimal subword vocabulary size for each language. We find that SP is the overall best choice for segmentation, and that larger subword vocabulary sizes lead to higher translation quality.

1 Introduction

Dravidian languages are an important family of languages spoken by about 250 million people, primarily located in Southern India and Sri Lanka (Steever, 2019). Kannada (KN), Malayalam (MA), Tamil (TA) and Telugu (TE) are the four most spoken Dravidian languages with approximately 47, 34, 71 and 79 million native speakers, respectively. Together, they account for 93% of all Dravidian language speakers. While Kannada, Malayalam and Tamil are classified as South Dravidian languages, Telugu is part of the South-Central Dravidian languages. All four languages are SOV (Subject-Object-Verb) languages with free word order. They are highly agglutinative and inflectionally rich languages. Additionally, each language has a different writing system. Table 1 presents an English sentence example and its Dravidian-language translations.

The highly complex morphology of the Dravidian languages under study is illustrated if we compare translated sentence pairs. The analysis of our parallel datasets (section 4.1, Table 3) shows, for instance, that an average English sentence contains almost ten times as many words as its Kannada equivalent. For the other three languages, the ratio is a bit smaller but the difference with English remains considerable. This indicates why it is important to consider word segmentation algorithms as part of the translation system.

In this paper we describe our work on Neural Machine Translation (NMT) from English into the Dravidian languages Kannada, Malayalam, Tamil and Telugu. We investigated the optimal translation settings for the pairs and in particular looked at the effect of word segmentation. The aim of the paper is to answer the following research questions:

• Does LMVR, a linguistically motivated word segmentation algorithm, outperform the purely data-driven SentencePiece?
• What is the optimal subword dictionary size for translating from English into these Dravidian languages?

In what follows, we review the relevant previous work (Sect. 2), introduce the two segmenters (Sect. 3), describe the experimental setup (Sect. 4), and present our answers to the above research questions (Sect. 5).

2 Previous Work

2.1 Translation Systems

Statistical Machine Translation. One of the earliest automatic translation systems for English into a Dravidian language was the English→Tamil system by Germann (2001). They trained a hybrid rule-based/statistical machine translation system on only 5k English-Tamil parallel sentences. Ramasamy et al. (2012) created SMT systems (phrase-based and hierarchical) which were trained on a dataset of 190k parallel
Proceedings of the 8th Workshop on Asian Translation, pages 181–190
Bangkok, Thailand (online), August 5-6, 2021. ©2021 Association for Computational Linguistics
EN: He was born in Thirukkuvalai village in Nagapattinam District on 3rd June, 1924.
KN: [Kannada script garbled in extraction]
    avaru nāgapattanam jilleya tirukkuvalay grāmadalli 1924ra jūn 3randu janisiddaru.
ML: [Malayalam script garbled in extraction]
    1924l nāgapattanam jillayile tirukkuvalai grāmattilān addēham janiccat.
TA: [Tamil script garbled in extraction]
    nāgappattinam māvattam tirukkuvalaik kirāmattil avar 1924-ām āntu jūn mātam 3-ām tēti pirantār.
TE: [Telugu script garbled in extraction]
    āyana nāgapattanam jillā tirukkuvālai grāmanlō 1924 jūn 3na janmincāru.
Table 1: Example sentence in English along with its translation and transliteration in the four Dravidian languages.
sentences (henceforth referred to as UFAL). They also reported that applying pre-processing steps involving morphological rules based on Tamil suffixes improved the BLEU score of the baseline model to a small extent (from 9.42 to 9.77). For the Indic languages multilingual tasks of WAT-2018, the Phrasal-based SMT system of Ojha et al. (2018) achieved a BLEU score of 30.53.

Subsequent papers also focused on SMT systems for Malayalam and Telugu, with some notable work including (Anto and Nisha, 2016; Sreelekha and Bhattacharyya, 2017, 2018) for Malayalam and (Lingam et al., 2014; Yadav and Lingam, 2017) for Telugu.

Neural Machine Translation. On the neural machine translation (NMT) side, there have been a handful of NMT systems trained on English→Tamil. On the aforementioned Indic languages multilingual tasks of WAT-2018, Sen et al. (2018) and Dabre et al. (2018) reported only 11.88 and 18.60 BLEU scores, respectively, for English→Tamil. The poor performance of these systems compared to the 30.53 BLEU score of the SMT system (Ojha et al., 2018) showed that those NMT systems were not yet suitable for translating into the morphologically rich Tamil.

However, the following year, Philip et al. (2019) outperformed Ramasamy et al. (2012) on the UFAL dataset with a BLEU score of 13.05 (the previous best score on this test set was 9.77). They report that techniques such as domain adaptation and back-translation can make training NMT systems on low-resource languages possible. Similar findings were also reported by Ramesh et al. (2020) for Tamil and Dandapat and Federmann (2018) for Telugu.

To the best of our knowledge and as of 2021, there has not been any scientific publication involving translation to and from Kannada, except for Chakravarthi et al. (2019). One possible reason for this could be the fact that sizeable corpora involving Kannada (i.e. in the order of magnitude of at least a thousand sentences) have been readily available only since 2019, with the release of the JW300 Corpus (Agić and Vulić, 2019).

Multilingual NMT. Since 2018 several studies have presented multilingual NMT systems that can handle English → Malayalam, Tamil and Telugu translation (Dabre et al., 2018; Choudhary et al., 2020; Ojha et al., 2018; Sen et al., 2018; Yu et al., 2020; Dabre and Chakrabarty, 2020). In particular, Sen et al. (2018) presented results where the BLEU score improved when comparing monolingual and multilingual models. Conversely, Yu et al. (2020) found that NMT systems that were multi-way (Indic ↔ Indic) performed worse than English ↔ Indic systems.

To our knowledge, no work so far has explored the effect of the segmentation algorithm and dictionary size on the four languages: Kannada, Malayalam, Tamil and Telugu.

3 Subword Segmentation Techniques

Prior to the emergence of subword segmenters, translation systems were plagued with the issue of
Name Domain Available in: Kannada Malayalam Tamil Telugu
Bible Religion 18 1 14
ELRC COVID-19 <1 <1 <1
GNOME Technical <1 <1 <1 <1
JW300 Religion 70 45 52 45
KDE Technical 1 <1 <1 <1
NLPC General <1
OpenSubtitles Cinema 26 3 3
CVIT-PIB Press 5 10 10
PMIndia Politics 10 4 3 8
Tanzil Religion 18 9
Tatoeba General <1 <1 <1 <1
Ted2020 General <1 <1 <1 1
TICO-19 COVID-19 <1
Ubuntu Technical <1 <1 <1 <1
UFAL Mixed 11
Wikimatrix General <1 10 18
Wikititles General 1
Table 2: Composition of training corpora. The numbers indicate the relative size (in percentages) of the corresponding part for that language.
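Section 3 describes Byte Pair Encoding as iteratively merging frequently observed pairs of character sequences until a predetermined dictionary size is attained. The listing here is a minimal sketch of that merge loop in plain Python, given only for illustration. It is not the SentencePiece implementation used in the experiments, and the function names (`count_pairs`, `merge_pair`, `learn_bpe`, `segment`), the stopping rule based on the size of the symbol inventory, and the toy word frequencies are all assumptions introduced for the example.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, target_size):
    """Merge the most frequent pair until the symbol inventory reaches target_size."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    symbols = {s for word in vocab for s in word}  # start from single characters
    merges = []
    while len(symbols) < target_size:
        pairs = count_pairs(vocab)
        if not pairs:  # every word is already a single symbol
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        vocab = merge_pair(best, vocab)
        merges.append(best)
        symbols.add(best[0] + best[1])
    return merges, symbols


def segment(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair(pair, {symbols: 1})))
    return list(symbols)


# Toy word frequencies (hypothetical, not taken from the training corpora).
toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, symbols = learn_bpe(toy_freqs, target_size=13)
print(merges)
print(segment("lowest", merges))
```

On this toy input the frequent ending "est" is merged into a single symbol early on, so an unseen word such as "lowest" is split into known pieces instead of becoming an out-of-vocabulary token. The `target_size` argument plays the role of the predetermined dictionary size mentioned in the text.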
out-of-vocabulary (OOV) tokens. This was particularly an issue for translations involving agglutinative languages such as Turkish (Ataman and Federico, 2018) or Malayalam (Manohar et al., 2020). Various segmentation algorithms were brought forward to circumvent this issue and, in turn, improve translation quality.

Perhaps the most widely used algorithm in NMT to date is the language-agnostic Byte Pair Encoding (BPE) by Sennrich et al. (2016). Initially proposed by Gage (1994), BPE was repurposed by Sennrich et al. (2016) for the task of subword segmentation, and is based on a simple principle whereby pairs of character sequences that are frequently observed in a corpus get merged iteratively until a predetermined dictionary size is attained. In this paper we use a popular implementation of BPE, called SentencePiece (SP) (Kudo and Richardson, 2018).

While purely statistical algorithms are able to segment any token into smaller segments, there is no guarantee that the generated tokens will be linguistically sensible. Unsupervised morphological induction is a rich area of research that also aims at learning a segmentation from data, but in a linguistically motivated way. The most well-known example is Morfessor with its different variants (Creutz and Lagus, 2002; Kohonen et al., 2010; Grönroos et al., 2014). An important obstacle to applying Morfessor to the task of NMT is the lack of a mechanism to determine the dictionary size.

To address this, Ataman et al. (2017) proposed a modification of Morfessor FlatCat (Grönroos et al., 2014), called Linguistically Motivated Vocabulary Reduction (LMVR). Specifically, LMVR imposes an extra condition on the cost function of Morfessor FlatCat so as to favour vocabularies of the desired size. In a comparison of LMVR to BPE, Ataman et al. (2017) reported a +2.3 BLEU improvement on the English-Turkish translation task of WMT18.

Given the encouraging results reported on the agglutinative Turkish language, we hypothesise that translation into Dravidian languages may also benefit from a linguistically motivated segmenter, and evaluate LMVR against SP across varying vocabulary sizes.

4 Experimental Setup

4.1 Training Corpora

The parallel training data is mostly taken from the datasets available for the MultiIndicMT task from WAT 2021. If a certain dataset was not available from the MultiIndicMT training repository, we resorted to extracting that dataset from OPUS (Tiedemann, 2012) or WMT20. Table 2 reports on the datasets that we used along with their domain and their source.

After extracting and cleaning the data (see below), approximately 8 million English tokens and their corresponding target language tokens are selected as our training corpora. We fixed the number of source tokens across language pairs in or-