Effectiveness of Methods and Tools used for Collection of Roman Urdu and English Multilingual Corpus

Sajadul Hassan Kumhar, Wasim Qadir, Mudasir M Kirmani, Haroon Bashir, Mudasir Hassan
Keywords: Corpus, Tweepy, PhantomJS, Selenium, Roman Urdu, English

Abstract

Corpus (data) acquisition is a critical step in the development of any natural language processing system, and a well-collected corpus aids in the development of more precise systems. Collecting a mixed multilingual corpus is a time-consuming and difficult process that yields a limited amount of data. This research work collects only mixed Roman Urdu and English multilingual text. Tweepy, PhantomJS, Selenium, urllib and Amazon Mechanical Turk were used to collect the mixed multilingual Roman Urdu and English corpus. The results show that the collected text forms a good-quality corpus.

Published
2021-09-23
How to Cite
Kumhar, S. H., Qadir, W., Kirmani, M. M., Bashir, H., & Hassan, M. (2021). Effectiveness of Methods and Tools used for Collection of Roman Urdu and English Multilingual Corpus. Design Engineering, 13428-13438. Retrieved from http://www.thedesignengineering.com/index.php/DE/article/view/4607
Section
Articles