Baseer, F. and Jaafar, J. and Aziz, I.B.A. and Habib, A. (2020) Refined Urdu Lexicon Development K-Means Clustering Based Computational Model Using Colloquial Romanized Urdu Dataset. In: UNSPECIFIED.
Full text not available from this repository.

Abstract
Urdu is among the most widely used languages in the world for verbal and written communication. Due to the lack of optimized and user-friendly native Urdu-script support on various platforms, it is mostly written in Romanized script in electronic form. In our research, we have developed a refined Urdu lexicon using the tokens with the highest frequency of occurrence in the data set. This data set is a raw corpus of colloquial Urdu written in Romanized script. The corpus was collected from volunteer participants who used this language as a mode of communication on the Internet and in text messaging. The raw corpus is passed through a series of steps, such as preprocessing, tokenization, and annotation, before being passed to the computationally intensive subsequent steps. Edit distance and K-means clustering techniques are used to identify candidate tokens and to select them for potential inclusion in the refined lexicon. We have also identified the most commonly used tokens, candidate tokens, and other linguistic attributes from the data collected. Based on this analysis, we have proposed a computational model for refined colloquial Romanized Urdu lexicon development. © 2020 IEEE.
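
The abstract gives no implementation details, so the following is only a minimal sketch of the frequency and edit-distance part of such a pipeline, not the authors' model. The function names (`tokenize`, `edit_distance`, `group_variants`), the `max_distance` threshold, and the sample sentence are all assumptions made for illustration. The K-means step is left out because the abstract does not say which features are clustered.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def edit_distance(a, b):
    """Levenshtein distance, computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute ca -> cb
        prev = curr
    return prev[-1]


def group_variants(tokens, max_distance=1):
    """Visit tokens from most to least frequent; attach each one to the first
    (most frequent) existing head within max_distance, else make it a new head."""
    counts = Counter(tokens)
    groups = {}
    for token, _ in counts.most_common():
        for head in groups:
            if edit_distance(token, head) <= max_distance:
                groups[head].append(token)
                break
        else:
            groups[token] = [token]
    return groups


# Hypothetical sample of colloquial Romanized Urdu, not taken from the paper's corpus.
sample = "kya haal hai kia haal hai kya scene hai kiya haaal"
for head, variants in group_variants(tokenize(sample)).items():
    print(head, variants)
```

In this sketch the most frequent spelling of each group acts as the candidate lexicon entry. Edit distance alone can also merge short, unrelated words that happen to differ by one character, so a threshold like this would need tuning or an additional grouping criterion on real data.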
Item Type: | Conference or Workshop Item (UNSPECIFIED) |
---|---|
Impact Factor: | Cited by 0 |
Uncontrolled Keywords: | Computation theory; Computational methods; Intelligent computing, Computational model; Edit distance; K-means clustering techniques; Potential selection; Tokenization; Urdu lexicon; User friendly; Written communications, K-means clustering |
Depositing User: | Ms Sharifah Fahimah Saiyed Yeop |
Date Deposited: | 25 Mar 2022 02:58 |
Last Modified: | 25 Mar 2022 02:58 |
URI: | http://scholars.utp.edu.my/id/eprint/29859 |