UPFMT

Unified Processing Framework for raw Multilingual Text

UPFMT is a lightweight, easy-to-use tool that converts raw text into Universal Dependencies annotations (CoNLL-U format). It supports all major languages from the UD corpus (http://universaldependencies.org/).

Those in a hurry to get started should first go through the Prerequisites section and then go directly to the Quick start guide. Building and tuning your own system, however, takes time and effort; we provide full technical details as well as a guide to training our system on your own data.

Prerequisites

Python 2.7 is included in major Linux distributions and is easy to install on Windows or OSX-based systems. If your OS does not include Python 2.7, check https://wiki.python.org/moin/BeginnersGuide/Download for installation instructions. Java/OpenJDK can be installed through major package management systems such as yum and apt, or by downloading the binary distribution from Oracle (https://www.oracle.com/java/index.html).

Pretrained models are already included in the standard repository, and the DyNET installation is covered in the quick start guide.

Quick start guide

First, make sure pip is installed with your Python 2.7 distribution. If it is not, install it for Debian/Ubuntu

$ sudo apt-get install python-pip

or for Redhat/CentOS

$ sudo yum install python-pip

Next, install DyNET:

$ pip install git+https://github.com/clab/dynet#egg=dynet
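Before moving on, it can help to confirm that the DyNET Python module is visible to the interpreter you plan to use. The snippet below is a minimal sketch, not part of UPFMT: the helper name module_available is hypothetical, and the check only verifies that the module imports, not that it was built with any particular options.

```python
from __future__ import print_function

import importlib


def module_available(name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # "dynet" is the module name installed by the pip command above.
    print("dynet importable:", module_available("dynet"))
```

If this reports False, re-run the pip command with the same Python 2.7 installation that will later run main.py.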

Next, get UPFMT by downloading the ZIP archive or by cloning this repository using Git:

$ cd ~
$ git clone https://github.com/dumitrescustefan/UPFMT.git

You can now do a dry run of the system to see if everything is set up correctly. In the folder where you cloned or downloaded and extracted this repo, type:

$ cd UPFMT
$ mkdir -p test/in test/out
$ echo "This is a simple test." > test/in/input.txt
$ python2 main.py --input=test/in --output=test/out --param:language=en

If everything worked fine, after the last command you should have a file with your results in the test/out folder:

$ cat test/out/input.conllu
1   This    this    PRON    DT  Number=Sing|PronType=Dem    0   -   _   _
2   is  be  AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   -   _   _
3   a   a   DET DT  Definite=Ind|PronType=Art   0   -   _   _
4   simple  simple  ADJ JJ  Degree=Pos  0   -   _   _
5   test    test    NOUN    NN  Number=Sing 0   -   _   SpaceAfter=No
6   .   .   PUNCT   .   _   0   -   _   _
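To consume the results programmatically, the output can be read back into Python. The following is a minimal sketch rather than a UPFMT API: it assumes the file follows the standard CoNLL-U layout (ten tab-separated columns per token, a blank line between sentences, comment lines starting with "#"), and the names parse_conllu, read_conllu and CONLLU_FIELDS are introduced here for illustration only.

```python
from __future__ import print_function

import io

# Column names as defined by the CoNLL-U format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # skip comment / metadata lines
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(
                "expected 10 tab-separated columns, got %d" % len(columns))
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


def read_conllu(path):
    """Read a CoNLL-U file (assumed UTF-8) and parse it."""
    with io.open(path, encoding="utf-8") as handle:
        return parse_conllu(handle.read())
```

For example, read_conllu("test/out/input.conllu") would return one sentence for the dry run above, and each token's word form and part-of-speech tag are available as token["form"] and token["upos"].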

Advanced

The instructions above cover the one-line installation of DyNET. It is sufficient if you only want to run the software and not train your own models. However, building DyNET from source can noticeably speed up both runtime and training. We therefore recommend following the instructions at https://github.com/clab/dynet and building DyNET with support for Intel’s Math Kernel Library (MKL).