Introducing Sanskrit Wordnet
Malhar Kulkarni
Department of Humanities and Social Sciences, Indian Institute of Technology Bombay
malhar@iitb.ac.in

Chaitali Dangarikar
Center for Indian Language Technology, Indian Institute of Technology Bombay
chaitali.dangarikar@gmail.com

Irawati Kulkarni
Center for Indian Language Technology, Indian Institute of Technology Bombay
irawatikulkarni@gmail.com

Abhishek Nanda
Center for Indian Language Technology, Indian Institute of Technology Bombay
abhi.nanda@gmail.com

Pushpak Bhattacharyya
Center for Indian Language Technology, Indian Institute of Technology Bombay
pb@cse.iitb.ac.in
Abstract

How does one build the wordnet of a language that has a rich lexical tradition spanning millennia? The sheer volume of words and their nuances, the rich, deep and diverse grammatical tradition, and the pressure of modern developments on the language: all these factors and more combine to pose unique challenges in creating lexical resources for such languages. This paper describes the construction of the Sanskrit wordnet, which is being built using the expansion approach. It presents the processes and challenges involved in this task, which purports to uncover the intimate linkage that underlies Indian languages, most of which have speaker populations numbering 20 to 500 million.

1 Introduction

Sanskrit is historically an Indo-Aryan language (Deshpande, 1992) and one of the 22 official languages of India. It has a vast literature, and the interest in analyzing and translating these texts is always on the rise worldwide.

Specifically, our motivation for building the Sanskrit wordnet arises from the following facts:

1. For all languages in the Indo-European family in India, the roots can be traced to Sanskrit. A large part of the vocabulary of these languages is derived from Sanskrit, which can therefore provide the pivot resource for many Indian languages. The speaker population of these languages ranges from 10 million (Konkani) to 500 million (Hindi/Urdu).
2. Sanskrit being a heritage language, there is a need to digitize and preserve ancient texts in Sanskrit. This activity is greatly helped by word lists. An Optical Character Recognition (OCR) system for Sanskrit, for example, would need spelling correction after scanning, and this would need an exhaustive lexicon.
3. Similarly, there is a real need for translating ancient texts to preserve traditional culture and knowledge. An online wordnet would no doubt be a great help to a translator.
4. Machine-aided translation (MAT) is maturing fast, and automatic translation of Sanskrit text is a challenging problem that needs a wordnet.
5. There is an enormous amount of Sanskrit text which should be available in keyword-based searchable form. Text search is greatly helped by wordnets.
6. The tradition of developing lexical resources is very old in Sanskrit. There are diverse koshas (traditional and rich monolingual dictionaries) in Sanskrit (see Section 1.2 below). The Sanskrit wordnet will serve as the single reference point representing and pointing to all these resources.

1.1 Sanskrit language

The Indian subcontinent is inhabited by a very large population who speak languages belonging to 4 major families: Indo-Aryan (a sub-family of Indo-European), Dravidian, Tibeto-Burman and Austro-Asiatic. Sanskrit is the oldest member of the Indo-Aryan language
family, a sub-branch of Indo-Iranian, which in turn is a branch of the Indo-European language family.

There is a traditional fourfold division of the lexical units of Indian languages into:

1. tatsama¹: words having their origin in Sanskrit and accepted in the modern Indo-Aryan languages without any change in their phonology.
2. tadbhava²: words which have their origin in Sanskrit but whose phonological forms are changed as per the rules of the modern Indo-Aryan languages.
3. desh: words which are the native words of the particular language.
4. videsh: words borrowed from foreign languages.

The links to tatsama and tadbhava words, in particular, will be a great pan-Indian linguistic resource for computational purposes. Table 1 below lists some examples of Sanskrit words in the Hindi wordnet³.

HWN synset                     | Tatsama word                   | English meaning of HWN synset
[Devanagari not recoverable]   | [Devanagari not recoverable]   | basil
[Devanagari not recoverable]   | [Devanagari not recoverable]   | eyebrow, brow, supercilium
[Devanagari not recoverable]   | [Devanagari not recoverable]   | muscle, musculus
[Devanagari not recoverable]   | [Devanagari not recoverable]   | eggplant, aubergine, mad_apple

Table 1: Tatsama words in the HWN

These representative examples show that the synsets in the Hindi wordnet contain 60-70% tatsama (directly borrowed from Sanskrit) words.

1.2 Rich lexical tradition of Sanskrit

Sanskrit has a rich tradition of creating lexica (Kulkarni, 2008). Nighantu⁴ (700 BC), on which Yaska is believed to have written a commentary called Nirukta, is the oldest known treatise that arranged lexical material from the point of view of synonymy as well as homonymy, and this tradition continued into the Pali⁵ tradition as well. The first and foremost popular lexicon in classical Sanskrit is Amarasimha's Amarakosha (6th century AD) (Oka, 1913). The Catalogus Catalogorum lists at least 40 commentaries on the Amarakosha alone, which shows how important and popular this dictionary of synonyms was in ancient India.

There were many other lexica created more or less in the style of the Amarakosha; 11 of them are given in Appendix A.

The first modern-day dictionary of Sanskrit was the Sanskrit-English Dictionary compiled by Professor H.H. Wilson and published in 1819 (Wilson, 1819). Two Indian dictionaries came out soon after, namely the Shabdakalpadruma⁶ (Deb, 1988) of Pt. Sir Raja Radhakanta Dev and the Vacaspatyam⁷ (Bhattacharya, 2003) compiled by Pt. Taranatha Tarkavacaspati.

1 Tatsama Shabda Kosha (Tatsama words dictionary) was published by Kendriya Hindi Nideshalaya, Shiksha Vibhaga, Manava Samsadhana Vikasa Mantralaya, Bharata Sarakara in 1988.
2 See Hindi ki Tadbhava Shabdavali (Sarma, 1968).
3 www.cfilt.iitb.ac.in/wordnet/webhwn
4 Nighantu is the Sanskrit term for a collection of words grouped into thematic categories with brief annotations.
5 Pali is a Middle Indo-Aryan language (or Prakrit) of India. It is best known as the language of the earliest extant Buddhist scriptures.
6 Shabdakalpadruma is the first monolingual Sanskrit dictionary arranged on modern alphabetical principles. It gives full quotations and definitions from the original koshas, which were unavailable in print at that time. Sets of synonymous words from the traditional koshas are arranged under the headword, followed by a brief gloss. Each entry in the lexicon includes the headword, its category, its meaning, and usages in Sanskrit texts.
7 Vacaspatyam is a modern monolingual Sanskrit lexicon. It arranges words in Sanskrit alphabetical order and gives grammatical information with word derivations as per traditional Sanskrit grammar. It contains about 46970 unique words. Each entry in the lexicon includes the headword, its category, its meaning, a set of synonymous words, usages and some other information.
8 The online dictionaries available for Sanskrit are: (1) Monier Williams dictionary <http://webapps.uni-koeln.de/tamil/>, (2) Apte's Sanskrit-English Dictionary <http://www.aa.tufs.ac.jp/~tjun/sktdic/>, (3) Apte's English-Sanskrit Dictionary <http://www.sanskrit-lexicon.uni-koeln.de/aequery/index.html> and (4) Spoken Sanskrit Dictionary, an online hypertext dictionary for Sanskrit-English and English-Sanskrit <http://spokensanskrit.de/>. Apart from these, various scanned versions of the printed dictionaries prepared by European scholars are available at <http://www.sanskrit-lexicon.uni-koeln.de/>.

So far, the electronic lexical resources available for Sanskrit are mainly online dictionaries⁸. The linguistic resources like Shabdakalpadruma
and Vacaspatyam are vast. For example, a comparison of the entries for the word war in these electronic dictionaries with the synsets of the same word in the Sanskrit Wordnet is a good indicator of the richness of this lexical tradition in Sanskrit.

1. Spoken Sanskrit Dictionary (7 words): [Devanagari word list not recoverable]
2. Apte's Sanskrit-English Dictionary (7 words): [Devanagari word list not recoverable]
3. Monier Williams Dictionary (56 words): [Devanagari word list not recoverable]
4. Sanskrit Wordnet: [Devanagari word list not recoverable; the word count is missing in the source]

1.3 The process of building the Sanskrit wordnet

There are two methods to develop a wordnet: (1) the expand method and (2) the merge method (Vossen, 2002). In the first method, a wordnet is constructed based on an existing wordnet. In the second method, sub-wordnets for specific domains are built and later merged. For the Sanskrit wordnet, the Hindi wordnet is taken as the source resource. Though it is expanded from the Hindi wordnet, care was taken to ensure that the Sanskrit wordnet captures the real lexical structure of the Sanskrit language.

1.4 Expansion approach for Indian language wordnets

Wordnet construction activities in India started in 2000, and the Hindi wordnet⁹ (Narayan et al., 2002) is the first one, released on the Web in 2006. It was built ab initio using words from available lexical resources of Hindi. The design of the Hindi wordnet follows the famous English WordNet¹⁰.

While following the expand method, the Sanskrit wordnet follows the hierarchy preservation principle (HPP) (Tufis et al., 2008). In the hierarchy of the Hindi wordnet, if synset H2 is a hyponym of synset H1, and the translation equivalents in the Sanskrit wordnet for H1 and H2 are S1 and S2 respectively, then in the hierarchy of the Sanskrit wordnet S2 should be a hyponym of synset S1. Thus, in the expansion approach lexicographers are spared the task of establishing afresh the semantic relations for the synsets of the Sanskrit wordnet. Appendix 2 describes and shows screenshots of the lexicographers' interface for creating the Sanskrit wordnet.

1.5 Synset creation in Sanskrit wordnet

Domains: Initially, the Sanskrit wordnet started by creating synsets for randomly chosen synsets from the Hindi wordnet. Later on, lists of important Sanskrit words were acquired from different sources. The University of Hyderabad provided a list of the most frequent words in their Sanskrit corpus. It consisted of 8338 words. Another word list, available on the Indology forum¹¹, contains 127796 unique words from two major epics of Sanskrit literature: the Ramayana¹² and the Mahabharata¹³. The third list is based on the lexicon called Bharatiya Vyavahara Kosha (Naravane, 1961). Table 2 shows the part-of-speech distribution of Naravane's lexicon. It contains 2766 words which are used for 1969 concepts related to day-to-day life. Table 3 shows a comparison between the lists of Sanskrit words gleaned from the various sources mentioned above.

9 www.cfilt.iitb.ac.in/wordnet/webhwn
10 wordnet.princeton.edu
11
12 Ramayana is an ancient Sanskrit epic. The Valmiki Ramayana is published in 7 volumes, Baroda: University of Baroda Oriental Institute, 1960-1975.
13 Mahabharata is one of the two important epics of India. The Critical Edition of the Mahabharata was prepared by the Bhandarkar Oriental Institute, Pune, from April 1919 to September 1966. It has 19 volumes consisting of 18 Parvans, 89000+ verses in the Constituted Text, and an elaborate Critical Apparatus.
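The hierarchy preservation principle of Section 1.4 can be illustrated with a minimal Python sketch. It assumes that each wordnet stores hyponymy as a set of (hyponym, hypernym) synset-ID pairs and that the translation equivalents are available as a plain dictionary; the function name `project_hyponymy` and the IDs are hypothetical and are not taken from the Hindi or Sanskrit wordnet tools.

```python
from typing import Dict, Set, Tuple

# A hyponymy relation as a set of (hyponym_id, hypernym_id) pairs (assumed layout).
Relation = Set[Tuple[str, str]]


def project_hyponymy(source_relation: Relation,
                     equivalents: Dict[str, str]) -> Relation:
    """Copy each source hyponymy link to the target wordnet.

    A link is copied only when both of its synsets already have a
    translation equivalent; other links are skipped until the missing
    target synset has been created.
    """
    projected: Relation = set()
    for hyponym, hypernym in source_relation:
        if hyponym in equivalents and hypernym in equivalents:
            projected.add((equivalents[hyponym], equivalents[hypernym]))
    return projected


# Illustrative IDs only: H2 and H3 are hyponyms of H1 in the source wordnet.
hindi_relation = {("H2", "H1"), ("H3", "H1")}
# H3 has no Sanskrit synset yet, so its link cannot be projected.
equivalents = {"H1": "S1", "H2": "S2"}

sanskrit_relation = project_hyponymy(hindi_relation, equivalents)
print(sanskrit_relation)  # {('S2', 'S1')}
```

Under these assumptions the lexicographer only supplies the synset-to-synset mapping; the relation between the target synsets follows from the source hierarchy.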
The above mentioned words are organized into 52 domains¹⁴. Omitting function words, a core set of concepts was prepared, and by Sept. 2009 synsets for all these core concepts were created¹⁵.

Nouns | Verbs | Adjectives | Adverbs
1512  | 225   | 180        | 52

Table 2: POS distribution of the synsets created (core concepts)

List            | Source                                                                           | Number of words
Sanskrit List 1 | University of Hyderabad list of most frequent words in Sanskrit (Amba Kulkarni) | 8338
Sanskrit List 2 | Sanskrit word list based on the Ramayana and Mahabharata                         | 127796
Sanskrit List 3 | Words in Naravane's Bhasha Vyavahar Kosh                                         | 2766
Hindi List 1    | Hindi wordnet, total number of unique words                                      | 105157

Table 3: Sanskrit word list

While creating synsets, the following considerations are kept in mind:

Inserting concepts or glosses in the Sanskrit wordnet: A combination of the glosses given in dictionaries like Shabdakalpadruma and the translation of the gloss of the Hindi wordnet synset is used to create the Sanskrit synset glosses. While writing the gloss, complicated sandhis¹⁶ and samasas (compounds) are avoided. Whenever lengthy compounds (having 5-6 members) became necessary, the members of the compounds were invariably joined with the hyphen symbol (-), as in anya-sthAna-saMyogAnukUla-vyApAraH, meaning 'the activity that is helpful in reaching a place', where the members of the compound are anya, sthAna, saMyoga, anukUla and vyApAra¹⁷, and they are indicated by inserting hyphens. For example, the gloss of a verb in Sanskrit is generally created using technical terms like vyApAra 'action', janya 'produced', anukUla 'helpful', etc.¹⁸

2 Problems faced in the expansion approach

In this section we enumerate the challenges faced in creating the synsets of the Sanskrit wordnet in consonance with those of Hindi.

14 These domains are: 1) Grains and Cereals, 2) Limbs of Humans, 3) Medical treatment, 4) Tools & implements, 5) Worms & Insects, 6) Minerals, 7) Food and Drinks, 8) Games & sports, 9) Ornaments & Trinkets, 10) Household articles, 11) Limbs of animals, 12) Post office, 13) Vegetables, 14) Directions, 15) Country, 16) Religion, 17) Court, 18) Birds, 19) Trees & plants, 20) Dress, 21) Nature, 22) Animals, 23) Fruits, 24) Flowers, 25) Young-ones of animals, 26) Amusement, 27) Spices, 28) Weights & measures, 29) Colours, 30) Relatives, 31) Diseases, 32) Reptiles, 33) Conveyances, 34) Occupations, 35) Education, 36) Time, 37) Government, 38) Verbs, 39) Adverbs, 40) Abstract nouns, 41) Adjectives, 42) Prepositions, 43) Numerals, 44) Conjunctions, 45) Collective words, 46) Pronouns, 47) Ordinals, 48) Feminines, 49) Interjections, 50) War, 51) House, and 52) Miscellaneous.
15 From this time the Sanskrit Wordnet became a part of the IndoWordNet activity, which provided a common platform for the lexicographers working on various Indian language wordnets.
16 Phonological conjoining.
17 This way of giving definitions is typical of the Sanskritic tradition, which used to strongly emphasise precision. The long compound simply defines the act of going.
18 So, using these expressions, the Hindi Wordnet (HWN) gloss is adapted into the Sanskrit Wordnet (SWN) gloss in the following ways (given in transliteration):
(1) {ronA, rudana karanA, AMsu bhAnA, krandana karanA}
    HWN: AMkha se AMsu girAnA
    SWN: sukha-duHkhayoH bhAvanAvegAt netrAbhyAm aZrupatan-rUpaH vyApAraH
(2) {mAranA, pITanA, prahAra karanA, ThokanA, piTAI karanA, dhunanA, dhunAI karanA, tADanA, pratADanA, rasIda karanA}
    HWN: kisi par kisI vastu Adi se AghAta karanA
    SWN: kasmin api kena api vastunA Ahanana-pUrvakaH vyApAraH
(3) {kharIdanA, kraya karanA, mola lenA, lenA}
    HWN: paise Adi dekar kisI dukAna, vyakti Adi se kuch saudA mol lenA
    SWN: ApaNe vastu tathA cha tanmUlyam etayoH AdAna-pradAnAtmakaH vyApAraH
(4) {rUThanA, ruSTa honA, anakhanA, rUsanA, risAnA, phUlanA, anasAnA, anakhAnA}
    HWN: aprasanna hokara udAsIna, cupa yA alaga ho jAnA
    SWN: aprasannatAhetujanyaH viyogarUpaH audAsInya-phalajanakaH vA vyApAraH
(5) {AnaA, pahuMcanA, pahucanA, padhAranA, avanA, AgamanA}
    HWN: eka stAna se Akara dUsare stAna para upasthita honA
    SWN: anya-sthAna-viyoga-pUrvakaH anya-sthAna-saMyogAnukUla-vyApAraH
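The hyphenation convention for long compounds in glosses lends itself to a simple mechanical check. The following Python sketch, offered only as an illustration, lists the hyphen-marked members of each compound in a transliterated gloss; the function name `marked_members` is hypothetical, and the sketch assumes glosses are whitespace-separated transliterated strings. Members that are fused by sandhi without a hyphen (such as saMyoga and anukUla in saMyogAnukUla) stay together, because only the hyphens written by the lexicographer are visible to the code.

```python
from typing import List


def marked_members(gloss: str) -> List[List[str]]:
    """Return the hyphen-separated members of every compound in a gloss.

    Tokens without a hyphen are ignored; they are either simple words or
    compounds whose members the lexicographer did not mark.
    """
    compounds = []
    for token in gloss.split():
        if "-" in token:
            # Drop empty strings caused by stray leading or trailing hyphens.
            compounds.append([part for part in token.split("-") if part])
    return compounds


# SWN gloss (5) from footnote 18, in the transliteration used in the text.
gloss = "anya-sthAna-viyoga-pUrvakaH anya-sthAna-saMyogAnukUla-vyApAraH"
for members in marked_members(gloss):
    print(len(members), members)
# 4 ['anya', 'sthAna', 'viyoga', 'pUrvakaH']
# 4 ['anya', 'sthAna', 'saMyogAnukUla', 'vyApAraH']
```

A check of this kind could help an editor spot long unhyphenated tokens in a gloss, although deciding whether such a token is really a compound still needs a lexicographer.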