December 18, 2017

LARMAS – Language Resource Management System

Please find the source code for this project at github.com/tino1b2be/LARMAS

Abstract

Natural Language Processing (NLP) is a major area of artificial intelligence concerned with the ability of a computer to recognise, translate and understand human language. Over the past 70 years, NLP research has moved from the simple matching of words and rules in machine translation to statistical models and neural networks. Modern methods of machine translation and automatic speech recognition (major fields of study within NLP) rely on machine learning algorithms and neural networks, which require large amounts of speech and text training data to produce accurate results. There is therefore a need for a way to efficiently collect, store and access language resources for NLP.

LARMAS (Language Resource Management System) is a client-server web application designed to efficiently collect language resources for NLP and to manage easy access to these resources for third-party NLP engines and projects. The project was broken down into three phases. The first phase lasted 6 weeks and was devoted to research into machine translation and automatic speech recognition to determine what data needed to be collected. Once this was determined, the relevant literature, tools and web technologies were surveyed and compared to inform the overall design of the system. LARMAS stores prompts which are distributed to users, who record and annotate them before sending their contributions back to the system. Users can also translate these prompts into any of the languages they speak.
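As a rough illustration of that contributor workflow (a sketch, not code taken from the repository), the snippet below shows how a client might fetch a prompt, upload a recording and submit a translation; every endpoint, field and credential here is a hypothetical placeholder.

    # Sketch of the contributor workflow; all endpoints and field names
    # are hypothetical placeholders, not the actual LARMAS API.
    import requests

    BASE = 'https://larmas.example.com/api'
    auth = ('alice', 'secret')  # placeholder credentials

    # 1. Fetch a prompt to work on.
    prompt = requests.get(BASE + '/prompts/next/', auth=auth).json()

    # 2. Upload a recording of the prompt being read aloud.
    with open('recording.wav', 'rb') as f:
        requests.post(BASE + '/recordings/',
                      data={'prompt': prompt['id']},
                      files={'audio': f},
                      auth=auth)

    # 3. Submit a translation of the same prompt.
    requests.post(BASE + '/translations/',
                  data={'prompt': prompt['id'],
                        'language': 'sn',
                        'text': '...translated text...'},
                  auth=auth)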

A relational database schema was designed to facilitate managing users and authentication, storing prompts, annotations and translations, and creating the relations that allow parallel text to be extracted. Several application architectures were investigated, and a layered Model-View-Controller (MVC) architecture was chosen as the main application architecture for LARMAS, borrowing some concepts from two of the other architectures investigated, namely the space-based and microkernel architectures. Object storage is used for the media files uploaded to LARMAS by contributors. A REST API was created as the interface through which third-party applications interact with LARMAS. The endpoints are documented on the home page of LARMAS, and a browsable API was also created so that developers can exercise the API in a browser without writing code. Different server architectures for deploying LARMAS were investigated, and the recommended architecture is highly scalable, both vertically and horizontally.
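To make the schema concrete, here is a minimal sketch of how such a design might look as Django models; the model names, fields and relations are illustrative assumptions, not the actual LARMAS schema.

    # Illustrative Django models for a schema like the one described
    # above; names and fields are assumptions, not LARMAS code.
    from django.conf import settings
    from django.db import models

    class Prompt(models.Model):
        text = models.TextField()
        language = models.CharField(max_length=8)  # e.g. an ISO 639 code

    class Translation(models.Model):
        # Linking each translation to its prompt means parallel text
        # (prompt.text, translation.text) can be extracted with a join.
        prompt = models.ForeignKey(Prompt, on_delete=models.CASCADE)
        user = models.ForeignKey(settings.AUTH_USER_MODEL,
                                 on_delete=models.CASCADE)
        language = models.CharField(max_length=8)
        text = models.TextField()

    class Recording(models.Model):
        # The audio itself would live in object storage; a FileField
        # backed by a remote storage class keeps only a reference here.
        prompt = models.ForeignKey(Prompt, on_delete=models.CASCADE)
        user = models.ForeignKey(settings.AUTH_USER_MODEL,
                                 on_delete=models.CASCADE)
        audio = models.FileField(upload_to='recordings/')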

The second phase of the project lasted 4 weeks, during which a prototype of LARMAS was built in Django using Python 3.5. The development process was modelled on the Scrum framework, with work split into small “actions” lasting 3 to 6 days (instead of the 2-3 week sprints typical of Scrum). Development was rapid and test-driven, using TravisCI for continuous integration and Git for version control.
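As an example of the test-driven style (again a sketch, not a test taken from the repository), a typical API test in Django REST Framework might look like this, assuming a hypothetical /api/prompts/ endpoint:

    # Hypothetical API test in the test-driven style described above.
    from django.contrib.auth.models import User
    from rest_framework.test import APITestCase

    class PromptAPITest(APITestCase):
        def setUp(self):
            self.user = User.objects.create_user('alice', password='secret')
            self.client.force_authenticate(self.user)

        def test_list_prompts_returns_200(self):
            response = self.client.get('/api/prompts/')
            self.assertEqual(response.status_code, 200)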

The last two weeks of the project were spent on tests and experiments on LARMAS. These covered both the functional and non-functional requirements and included comparing databases, measuring response times and scalability, and comparing load-balancing algorithms. After a discussion of the test results, conclusions were drawn addressing the project objectives laid out in the introduction.
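For a sense of how a response-time experiment might be run, the sketch below times repeated requests against a hypothetical endpoint; the actual experiments may well have used dedicated load-testing tools instead.

    # Minimal response-time measurement against a hypothetical endpoint.
    import statistics
    import time
    import requests

    URL = 'https://larmas.example.com/api/prompts/'
    times = []
    for _ in range(100):
        start = time.perf_counter()
        requests.get(URL)
        times.append(time.perf_counter() - start)

    print('median response time: %.1f ms'
          % (1000 * statistics.median(times)))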
