UPFMT
Unified Processing Framework for raw Multilingual Text
UPFMT is a lightweight, easy-to-use tool for converting raw text into Universal Dependencies annotations (the CoNLL-U format). It supports all major languages from the UD corpus (http://universaldependencies.org/).
Those in a hurry to get started should first go through the Prerequisites section and then go directly to the Quick start guide. Building and tuning your own system, however, takes time and effort; we provide full technical details as well as a guide to training our system on your own data.
Prerequisites
- Python 2.7
- Java > 1.8
- DyNET (https://github.com/clab/dynet)
- Pretrained models (included) or data from the UD corpus (http://universaldependencies.org/)
Python 2.7 is included in major Linux distributions and is easy to install on Windows or OSX-based systems. If your OS does not include Python 2.7, check https://wiki.python.org/moin/BeginnersGuide/Download for installation instructions. Java/OpenJDK can be installed through major package management systems such as yum and apt, or by downloading the binary distribution from Oracle (https://www.oracle.com/java/index.html).
Pretrained models are already included in the standard repository, and the DyNET installation is covered in the Quick start guide.
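Before moving on, it can help to confirm that the prerequisites listed above are actually visible from your shell. The snippet below is a minimal sketch, not part of UPFMT: the helper names `find_on_path` and `check_prerequisites` are made up for illustration, and it only checks that `java` and `pip` are on the `PATH` and that the `dynet` module can be imported (it does not verify version numbers). It is written to run under Python 2.7 as well as Python 3.

```python
import os
import sys


def find_on_path(name):
    """Return the full path of an executable called `name`, or None.

    Hypothetical helper: a small stand-in for the shell's `which`.
    """
    extensions = [""]
    if os.name == "nt":
        # On Windows, executables carry suffixes such as .EXE or .BAT.
        extensions += os.environ.get("PATHEXT", ".EXE").split(os.pathsep)
    for folder in os.environ.get("PATH", "").split(os.pathsep):
        for ext in extensions:
            candidate = os.path.join(folder, name + ext)
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                return candidate
    return None


def check_prerequisites():
    """Collect a small report about the tools named in Prerequisites."""
    report = {
        "python": "%d.%d" % sys.version_info[:2],  # interpreter running this script
        "java": find_on_path("java"),              # None if java is not on PATH
        "pip": find_on_path("pip"),                # None if pip is not on PATH
    }
    try:
        import dynet  # noqa: F401 (only checking that the import succeeds)
        report["dynet"] = True
    except ImportError:
        report["dynet"] = False
    return report
```

Printing the result of `check_prerequisites()` gives a quick overview; a `None` or `False` entry points at the item that still needs to be installed.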
Quick start guide
First, make sure pip is installed with your Python 2.7 distribution.
If not, install it for Debian/Ubuntu with
$ sudo apt-get install python-pip
or for Redhat/CentOS
$ yum install python-pip
Next, install DyNET:
$ pip install git+https://github.com/clab/dynet#egg=dynet
Next, get UPFMT by downloading the ZIP archive or by cloning this repository with Git:
$ cd ~
$ git clone https://github.com/dumitrescustefan/UPFMT.git
You can now do a dry run of the system to see if everything is set up correctly. In the folder where you cloned or downloaded and extracted this repo, type:
$ cd UPFMT
$ mkdir test; mkdir test/in; mkdir test/out
$ echo "This is a simple test." > test/in/input.txt
$ python2 main.py --input=test/in --output=test/out --param:language=en
If everything worked, the last command should have produced a file with your results in the test/out folder:
$ cat test/out/input.conllu
1 This this PRON DT Number=Sing|PronType=Dem 0 - _ _
2 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 - _ _
3 a a DET DT Definite=Ind|PronType=Art 0 - _ _
4 simple simple ADJ JJ Degree=Pos 0 - _ _
5 test test NOUN NN Number=Sing 0 - _ SpaceAfter=No
6 . . PUNCT . _ 0 - _ _
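To work with this output programmatically, the file can be read back into plain Python structures. The following is a minimal sketch rather than anything shipped with UPFMT: the function names `parse_conllu` and `parse_feats` are illustrative, and the column names follow the ten standard CoNLL-U fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). CoNLL-U files are normally tab-separated; the sketch falls back to splitting on whitespace so that it also accepts the listing exactly as printed above.

```python
# The ten standard CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment lines (e.g. sentence metadata) are skipped.
            continue
        # Prefer tabs; fall back to whitespace for space-separated listings.
        cols = line.split("\t") if "\t" in line else line.split()
        if len(cols) != len(FIELDS):
            raise ValueError("expected %d columns, got %d: %r"
                             % (len(FIELDS), len(cols), line))
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


def parse_feats(feats):
    """Turn 'Number=Sing|PronType=Dem' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))
```

Assuming the dry run produced the file shown above, `parse_conllu(open("test/out/input.conllu").read())` should return a single sentence whose tokens can be inspected by column name, for example `token["lemma"]` or `parse_feats(token["feats"])`.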
Advanced
The instructions above cover the one-line installation of DyNET, which is sufficient if you only want to run the software and not train your own models. However, building DyNET from source gives good speedups in both runtime and training time. We therefore recommend following the instructions at https://github.com/clab/dynet and building DyNET with support for Intel’s Math Kernel Library (MKL).