Granularities of Tokenization Through Semantics for Twitter Datasets

Research Article
Rashmi H Patil and Siddu P Algur
DOI: 
http://dx.doi.org/10.24327/ijrsr.2019.1004.3327
Subject: 
science
Keywords: 
Multi-Word Expressions (MWE), Granularities, DRUID, Fine- and Coarse-Grained Tokenization, Parts-of-Speech (POS) Tagging.
Abstract: 

Tokenization, which is currently based on low-level identification of tokens, has to be extended to the identification of meaningful and useful language units. This extended notion of tokenization involves both splitting single words into their meaningful parts (decompounding) and combining multiple words that act as a single semantic unit into Multiword Expressions (MWEs). This paper introduces unsupervised, knowledge-free methods for this token identification task. The methods rely primarily on distributional similarity, for which we consider two realizations: a sparse count-based and a neural distributional semantic model. Evaluation on MWE-annotated data sets in two languages and on newly extracted evaluation data sets for 32 languages shows that the proposed method, DRUID, compares favorably to previous methods that do not utilize distributional information. By considering the keyword 'Accident' in several Twitter tweets, we analyze the semantics of different token granularities. In our experiments, we show how both decompounding and MWE information can be used in information retrieval, where the best results are obtained when combining word information with MWEs and compound parts in a bag-of-words retrieval set-up. This paves the way to the automatic detection of lexical units beyond standard tokenization, without language-specific preprocessing steps such as POS tagging.
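To make the distributional-similarity basis concrete, the following is a minimal sketch of the sparse count-based model mentioned above: each term is represented by counts of its neighbouring context words, and two terms are compared with cosine similarity over those count vectors. The toy corpus, window size, and function names are illustrative assumptions, not the implementation evaluated in the paper.

# Minimal sketch of a sparse, count-based distributional model:
# represent each term by counts of its neighbouring context words
# and compare terms with cosine similarity. The corpus, window size,
# and example terms are illustrative assumptions only.

from collections import Counter, defaultdict
import math

corpus = [
    "a car accident blocked the road",
    "the crash blocked the road",
    "a traffic accident on the road",
]

def context_vectors(sentences, window=2):
    vecs = defaultdict(Counter)
    for s in sentences:
        toks = s.split()
        for i, w in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if j != i:
                    vecs[w][toks[j]] += 1
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = context_vectors(corpus)
print(cosine(vecs["accident"], vecs["crash"]))  # similar contexts -> high score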
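The abstract does not spell out how DRUID ranks candidate units, so the sketch below captures only the underlying intuition in a hedged form: a candidate multiword unit is a plausible MWE if many of its distributionally most similar terms are single words. The uniqueness function, the top-n cut-off, and the toy neighbour lists are hypothetical, not the published scoring.

# Hypothetical sketch of a DRUID-style uniqueness score.
# Assumption: each candidate already has a ranked list of its
# distributionally most similar terms (e.g. from a count-based or
# neural distributional semantic model, as the abstract mentions).

from typing import Dict, List

def uniqueness(candidate: str, neighbours: List[str], top_n: int = 10) -> float:
    """Fraction of the top-n distributionally similar terms that are
    single words. High values suggest the candidate behaves like one
    lexical unit and is a good MWE candidate."""
    top = neighbours[:top_n]
    if not top:
        return 0.0
    single = sum(1 for term in top if " " not in term)
    return single / len(top)

# Toy, invented neighbour lists for illustration only.
similar_terms: Dict[str, List[str]] = {
    "car accident": ["crash", "collision", "wreck", "car crash", "pileup"],
    "accident the": ["accident a", "accident this", "crash the"],
}

for cand, nbrs in similar_terms.items():
    print(f"{cand!r}: uniqueness = {uniqueness(cand, nbrs):.2f}")
# 'car accident' scores high (mostly single-word neighbours), while
# 'accident the' scores low, so only the former is kept as an MWE.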
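Finally, a minimal sketch of the bag-of-words retrieval set-up described above, in which plain word tokens are combined with MWE units and compound parts at indexing time. The MWE list, the decompounding table, and the overlap scoring are simplifying assumptions; in the described experiments these analyses would come from DRUID itself.

# Minimal sketch of a bag-of-words retrieval set-up that combines plain
# word tokens with MWE units and compound parts. The MWES set and
# COMPOUND_PARTS table stand in for the output of MWE detection and
# decompounding; both are illustrative assumptions.

from collections import Counter
from typing import List

MWES = {("car", "accident")}                        # assumed MWE detection output
COMPOUND_PARTS = {"motorbike": ["motor", "bike"]}   # assumed decompounding output

def tokenize(text: str) -> List[str]:
    words = text.lower().split()
    tokens = list(words)
    # Add detected MWEs as additional single units.
    for i in range(len(words) - 1):
        if (words[i], words[i + 1]) in MWES:
            tokens.append(words[i] + "_" + words[i + 1])
    # Add compound parts alongside the full compound.
    for w in words:
        tokens.extend(COMPOUND_PARTS.get(w, []))
    return tokens

def score(query: str, doc: str) -> int:
    """Simple bag-of-words overlap between query and document tokens."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum((q & d).values())

tweets = ["motorbike crash on highway", "car accident near the bridge"]
for t in tweets:
    print(t, "->", score("bike accident", t))
# The first tweet matches via the compound part 'bike', the second via
# the shared word 'accident'; neither would match on both granularities
# with standard tokenization alone.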