Author:

Status:

released

OS:

Linux, Windows, OS X

License:

Mozilla Public License 2.0

Tags:

Taggers, Tools

MorphoDiTa

Introduction
Online Web Application and Web Service
Release
- 3.1. Download
  - 3.1.1. Language Models
- 3.2. License
MorphoDiTa Installation
MorphoDiTa User's Manual
MorphoDiTa API Tutorial
MorphoDiTa API Reference
Contact
Acknowledgements

1. Introduction

MorphoDiTa: Morphological Dictionary and Tagger is an open-source tool for morphological analysis of natural language texts. It performs morphological analysis, morphological generation, tagging and tokenization and is distributed as a standalone tool or a library, along with trained linguistic models. In the Czech language, MorphoDiTa achieves state-of-the-art results with a throughput around 10-200K words per second. MorphoDiTa is a free software under Mozilla Public License 2.0 and the linguistic models are free for non-commercial use and distributed under CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions. MorphoDiTa is versioned using Semantic Versioning.

2. Online Web Application and Web Service

MorphoDiTa Web Application is available at http://lindat.mff.cuni.cz/services/morphodita/ using LINDAT/CLARIN infrastructure.

MorphoDiTa REST Web Service is also available, with the API documentation available at http://lindat.mff.cuni.cz/services/morphodita/api-reference.php.

3. Release

3.1. Download

MorphoDiTa releases are available on GitHub, both as source code and as a pre-compiled binary package. The binary package contains Linux, Windows and OS X binaries, Java bindings binary, C# bindings binary, and source code of MorphoDiTa and all language bindings). While the binary packages do not contain compiled Python or Perl bindings, packages for those languages are available in standard package repositories, i.e. on PyPI and CPAN.

3.1.1. Language Models

To use MorphoDiTa, a language model is needed. The language models are available from LINDAT/CLARIN infrastructure and described further in the MorphoDiTa User's Manual. Currently the following language models are available:

Czech MorfFlex2+PDT-C: czech-morfflex2.0-pdtc1.0-220710 (requires MorphoDiTa 1.9, documentation)
Czech MorfFlex+PDT: czech-morfflex-pdt-161115 (requires MorphoDiTa 1.9, documentation); older versions: czech-morfflex-pdt-160310 (documentation), czech-morfflex-pdt-131112 (documentation)
Slovak MorfFlex+PDT: slovak-morfflex-pdt-170914 (requires MorphoDiTa 1.9, documentation)
English Morphium+WSJ: english-morphium-wsj-140407 (documentation)

3.2. License

MorphoDiTa is an open-source project and is freely available for non-commercial purposes. The library is distributed under Mozilla Public License 2.0 and the associated models and data under CC BY-NC-SA, although for some models the original data used to create the model may impose additional licensing conditions.

If you use this tool for scientific work, please give credit to us by referencing MorphoDiTa website and Straková et al. 2014.

4. MorphoDiTa Installation

MorphoDiTa Installation on separate page.

5. MorphoDiTa User's Manual

MorphoDiTa User's Manual on separate page.

6. MorphoDiTa API Tutorial

MorphoDiTa API Tutorial on separate page.

7. MorphoDiTa API Reference

MorphoDiTa API Reference on separate page.

8. Contact

Authors:

MorphoDiTa website.

MorphoDiTa LINDAT/CLARIN entry.

9. Acknowledgements

This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).

Acknowledgements for individual language models are listed in MorphoDiTa User's Manual page.

9.1. Publications

(Straková et al. 2014) Straková Jana, Straka Milan and Hajič Jan. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13-18, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
(Spoustová et al. 2009) Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab, Miroslav Spousta. 2009. Semi-Supervised Training for the Averaged Perceptron POS Tagger. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 763-771, Athens, Greece, March. Association for Computational Linguistics.

9.2. Bibtex for Referencing

@InProceedings{strakova14,
  author    = {Strakov\'{a}, Jana  and  Straka, Milan  and  Haji\v{c}, Jan},
  title     = {Open-{S}ource {T}ools for {M}orphology, {L}emmatization, {POS} {T}agging and {N}amed {E}ntity {R}ecognition},
  booktitle = {Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland},
  publisher = {Association for Computational Linguistics},
  pages     = {13--18},
  url       = {http://www.aclweb.org/anthology/P/P14/P14-5003.pdf}
}

9.3. Persistent Identifier

If you prefer to reference MorphoDiTa by a persistent identifier (PID), you can use http://hdl.handle.net/11858/00-097C-0000-0023-43CD-0.

Screenshot:

Institute of Formal and Applied Linguistics

Charles University, Czech Republic
Faculty of Mathematics and Physics

Search form