Effectiveness of Methods and Tools used for Collection of Roman Urdu and English Multilingual Corpus
Abstract
Corpus acquisition is a critical step in the development of any natural language processing system, and a well-collected corpus supports the development of more accurate systems. Collecting a mixed multilingual corpus is a time-consuming and difficult process that yields only a limited amount of data. This work collects only mixed Roman Urdu and English multilingual text. Tweepy, PhantomJS, Selenium, urllib, and Amazon Mechanical Turk were used to collect the corpus. The results show that the collected text forms a good-quality corpus.