Jan Hajič
Curriculum Vitae
My current CV in English (in "Euro" format) with a list of selected publications.
Selected Bibliography
- Google Scholar
- ORCID: 0000-0002-3503-7730
- Scopus ID: 6602292051
- Researcher ID: D-3429-2017
(Major, Recent) News
2024/09/10 Our project called "Linguistics, artificial intelligence and language and speech technologies: from research to innovations", in which we are teaming up with the Speech@FIT group at Brno University of Technology and four Czech companies interested in our results for use in the commercial world (Phonexia, MamaAI, Phrase and Seznam.cz), has scored first in the Humanities and Social Sciences Area of the call, to be funded from the Structural Funds ("OP JAK"). We have been invited to formally enter project negotiations (which might still take some time given the unique combination of academia and industry participation).
2024/03/01 Three infrastructural projects have been awarded to broad consortia in which LINDAT/CLARIAH-CZ also participates. While I am not involved personally in any of them, it is a big success of the Research Infrastructure to take part and to, in some cases, represent the CLARIN community: EVERSE for proper open source research software archiving, distribution and long-term preservation, ATRIUM for digitization and preservation of archaeological resources Europe-wide, and FIDELIS (as a sort of EUDAT continuation).
2024/01/01 Two new projects have started: the infrastructural EOSC CZ National Repository Platform project, to which we are contributing the LINDAT/CLARIAH-CZ updated repository platform, and an equipment upgrade project for LINDAT/CLARIAH-CZ itself, which will allow us to work with bigger data and improve our language tools and services.
2023/11/09 I have been invited to speak at and moderate the panel on Generative AI: opportunities, risks and challenges at the European Parliament's STOA committee. With other panelists from the EC, industry and academia, we have agreed that continued funding is needed in the EU for further research, most importantly for basic research in language(s) and language technology, to fundamentally understand how language works, which would in turn help to use the latest AI technology for both economic and societal benefits, not only in the EU.
2023/07/14 At the just-ended ACL 2023 in Toronto, Canada, our paper on "superhuman" performance (What’s the Meaning of Superhuman Performance in Today’s NLU?) has been awarded an "Outstanding paper" prize.
2023/06/01 The project European Language Equality II has ended. Using several contributions from external partners funded by the FSTP mechanism, we were able to put together enough material for a Springer-published book, update the language technology scores in the dashboards, and see new EU calls for projects in the Computational Linguistics and Language Technology area.
2023/03/10 My Adjunct Full Professor appointment at University of Colorado Boulder (0%) has been renewed until the end of August 2025. Budget allowing, I will continue to visit UCB for consultations on the LUSyD and UMR projects, as well as on lexical semantic and valency issues in general. If another opportunity arises, I will teach in the Summer semester the successful 2019 UCB Master's course on Multilingual NLP, which I have developed there.
2023/03/01 The Uniform Meaning Representation project for Czech (UMR) has started today. Funded by the Ministry of Education, Youth and Sports under the INTER-EXCELLENCE programme, it aims at fostering long-term cooperation between my team working on semantic annotation and the U.S.-based UMR project consortium, led by Brandeis University.
Research Interests, Grants
My research interest evolved from morphology and tagging of inflective languages (lexicons, analysis and generation tools - now reimplemented by Milan Straka as MorphoDiTa) to machine translation (French-English while at IBM and Czech-English; also, Czech-Russian and other closely related languages). I am also interested in parsing (see e.g. the CLSP Workshop on parsing Czech) and generation. However, in the past 10 years, I devoted most of my research time to creating linguistic resources, such as the Prague Dependency Treebank family of projects (Czech, English, Arabic) and managing new research projects, mainly funded by the EU (see below for a complete list). I am also involved in the Universal Dependencies project, led by Joakim Nivre of Uppsala University and hosted by LINDAT/CLARIAH-CZ as the official UD repository. My management responsibilities include the LINDAT/CLARIAH-CZ Research Infrastructure and also being the Deputy Chair of the Institute.
I am also interested in spoken language understanding. I participated in the now finishing project Malach, on the language modeling part (for ASR), on thesaurus translation and on the Czech IR test collection.
I work closely not only with my students, but also with other Czech and foreign teams, such as the University of West Bohemia in the Czech Republic, Center for Speech and Language Processing at the JHU, the CLEAR lab at CU Boulder, Linguistic Data Consortium, the European Language Resources Association (ELRA/ELDA), and several European Universities on EU projects (see below).
I am or have been the PI, or the national PI of several major Czech, EU and NSF (US) research projects. The list of current projects (or of those finished within the last 10 years) is below.
Projects
2021-2022 | European Language Equality project, by EP/EC, PPP Action. Preparation of Language-centric AI Strategic Research Agenda for Digital Language Equality by 2030. |
2020-2023 | Humane-AI-Net / AI Centre of Excellence, Call 48 H2020 project (Co-PI for Charles University) |
2019-2022 | LINDAT/CLARIAH-CZ, Large infrastructural grant for digital humanities, language resources, data access and distribution and related research, project LM2018101 of the Ministry of Education of the Czech Republic (continuation and extension to digital humanities of LM2010013 and LM2015071); complemented by the Structural Funds project "OP VVV VI 2 LINDAT/CLARIAH-CZ" for equipment and computing facilities extensions and renewal (PI) |
2020-2024 | Language Understanding: from Syntax to Discourse (LUSyD). Grant Agency of the Czech Republic, large grant from the EXPRO programme (PI). |
2019-2022 | European Language Grid (ELG), EC Call 27 project for building a Language Technology platform with a host of resources and LT services for both commercial and research use. I am the Co-PI and lead of the Charles University team providing 600+ services and all resource metadata to the ELG. I also supervise the effort to fund ELG Pilot Projects which are selected in Open Calls throughout Europe, distributing almost EUR 2M through the FSTP financial mechanism. |
2010-2015, extended to 2019 | LINDAT/CLARIN, Large infrastructural grant for language resources, data access and distribution and related research, project LM2010013 and since 2016 as LM2015071 of the Ministry of Education of the Czech Republic; now complemented by the Structural Funds project "OP VVV LINDAT/CLARIN" for equipment and computing facilities extensions and renewal (PI) |
2017-2018 | "Document Access" (sub)project for Prague's Mayor's Office (PI) |
2016-2020 | Mellon Foundation grant, with Brandeis Univ. (coordinator), Vassar Univ., Univ. of Tübingen - harmonization of access to language resources and tools between LAPPS Grid and Clarin |
2016-2019 | VIADAT, Virtual Assistant for Access to Oral History Archives, with the Institute of Contemporary History of the Academy of Sciences of the Czech Republic and the National Film Archive of the Czech Republic (PI) |
2015-2017 | CRACKER, Cracking the Language Barrier: Coordination, Evaluation and Resources for European MT Research. H2020 CSA, PI of the Czech partner, Charles University in Prague. Under negotiation. Coordinated by Hans Uszkoreit, DFKI Berlin, Germany. |
2015-2018 | HimL, Health in my Language. H2020 Innovation Action. PI of the Czech partner, Charles University in Prague. Under negotiation. Coordinated by Barry Haddow, University of Edinburgh, Scotland. |
2015-2018 | QT21, Quality Translation 21. H2020 Research and Innovation Action. PI of the Czech partner, Charles University in Prague. Under negotiation. Coordinated by Barry Haddow, University of Edinburgh, Scotland. |
2013-2016 | QTLeap, Quality Translation by Deep Language Engineering Approaches. FP7 STREP project. PI of the Czech partner, Charles University in Prague. Coordinated by Antonio Branco, FCUL, Lisbon, Portugal. |
2011-2015 | AMALACH, Access to multilingual archives, with ZCU in Pilsen and USC, Los Angeles, USA, a Czech Ministry of Culture applied project. PI of the grant. |
Before that, I was the PI or Co-PI of many other projects, such as 6th and 7th Framework Projects by the EU (including large ones, such as the "Companions" or "Khresmoi" projects), the Czech Grant-Agency supported highly collaborative, nation-wide Czech National Corpus project (2003-2006), and several collaborative grants with the U.S. - for example the MALACH project for language technology support for the Shoah Foundation (now USC) Archives, and for mutual visits to/from U.S. institutions (Johns Hopkins University, University of Pennsylvania, Univ. of Colorado), and of several smaller subcontracting grants (such as the U.S.-based GALE project). In the 90s, I was the Czech PI of several collaborative EU projects specifically aimed at the former Soviet Bloc countries (EU project STEEL, EU project CEGLEX).
I have been working on some other grants as a researcher as well, such as the predecessor Center for Computational Linguistics (2000-2004), the Laboratory for Linguistic Data (1996-2000), Czech-English MT project supported by the Czech Grant Agency MATRACE (1993-1995), and many smaller projects.
Several industrial projects have got my attention as well, such as the Czech Grammar Checker project and certain lexicon(s) for Microsoft, morphological databases for companies like IBM, Xerox, Lotus, Morphologic, Zi Corp., Lernout & Hauspie, and cooperation on product development for several Czech companies, such as ASPI (legal information system using natural language-based search), Oracle (the Oracle Context product) and morphological dictionary development for the Czech and Slovak portals centrum.cz and centrum.sk. I now continue to be engaged in negotiations with national as well as international companies regarding licensing of language resources and/or providing services, such as secure machine translation and others.
Short Bio
2003- | Institute of Formal and Applied Linguistics, School of Computer Science, Faculty of Mathematics and Physics, Charles University in Prague. Vice-director (2012-). LINDAT/CLARIN infrastructural project director/coordinator (2010-). Director (2003-2011). Acting director (2001-2003, 2011-2012). |
2017/8 | Fellow at Norwegian Academy of Sciences, SymSem group at the Center for Advanced Studies |
2016- | Department of Computer Science, University of Colorado in Boulder. Adjunct Professor. Teaching 2019 Summer Term b (Multilingual Natural Language Processing, CS/LING 7800) |
2008- | Full Professor of the Charles University in Prague |
2003-2007 | Associate Professor of the Charles University in Prague |
2002 | Team Leader, CLSP JHU Summer Workshop, Generation in the Context of Machine Translation |
1999-2000 | Visiting Assistant Professor, Computer Science Dept. and Center for Speech and Language Processing, Johns Hopkins University, Baltimore, MD, USA. Teaching "Introduction to NLP" (two semesters) and "Data Structures" |
1998 | Team Leader, CLSP JHU Summer Workshop, Core Natural Language Processing Technology Applicable to Multiple Languages |
1995 | PhD ("Dr.") in Computational Linguistics, Faculty of Mathematics and Physics, Charles University in Prague. Topic: Computational Morphology of Czech. |
1993-2003 | Researcher, Assistant Professor, Institute of Formal and Applied Linguistics, School of Computer Science, Faculty of Mathematics and Physics, Charles University in Prague. |
1991-1993 | Visiting Scientist, IBM T.J.Watson Research Center, Yorktown Heights, NY, USA. Project: Candide (Statistical Machine Translation French -> English, project head(s): Robert Mercer, Peter Brown) |
1990,1991 | Visiting Scientist at ISSCO, Univ. of Geneva, Switzerland. Project: Multilingual Morphological Analysis. |
1984-1991 | Researcher, Research Institute of Mathematical Machines, Prague. Project: Machine Translation Czech -> Russian (software documentation). |
1979-1984 | Bc. & Master Degree study, Faculty of Mathematics and Physics, Charles University in Prague (high honors, RNDr. 1984, thesis topic: Natural Language Robot Control). |
Publications
My complete list of publications as recorded in our Institute's bibliography system is here. The full publication database of our Institute can be found here.
Some pre-2000 publications can be missing from the above system. For a complete list of my publications published before 2008 please see this PDF.
Teaching
I am now teaching an adapted version of the "Introduction to (statistical) NLP" course which I developed while at JHU. The current course is divided into two parts: NPFL067 and NPFL068. Please see also my Hopkins' archive web pages for more information and the complete set of foils in html form.
Service
General Conference Chair
2010 | ACL'10, Uppsala, Sweden |
Program Committee Chair, Co-chair
2018 | Treebanks and Linguistic Theories 17, Oslo, Norway, with S. Oepen, M. Candito, K. Gerdes, S. Kübler |
2018 | Treebanks and Linguistic Theories 16, Prague, Czech Republic, with S. Oepen, S. Kübler |
2017 | Treebanks and Linguistic Theories 15, Bloomington, Indiana, USA, with S. Kübler, M. Dickinson and A. Przepiórkowski |
2014 | Coling 2014, Dublin, Ireland; Programme Committee co-chair, with Jun-ichi Tsujii. |
2012 | META-RESEARCH Workshop on Advanced Treebanking, LREC 2012, Istanbul, Turkey (with Koenraad De Smedt, Antonio Branco and Marko Tadić). |
2007 | TLT'07 (Treebanks and Linguistic Theories), Bergen, Norway |
2006 | TLT'06 (Treebanks and Linguistic Theories), Prague, Czech Rep. |
2003 | EACL'03 (European ACL Conference), Budapest, Hungary |
2002 | EMNLP'02 (Empirical Methods in NLP), Philadelphia, PA, USA |
1999 | Thematic Session on "Parsing inflective and free word order languages" ACL '99, June 1999, College Park, MD, USA |
Program Committee Area Chair, Full PC Member
2004 | EMNLP'04, Barcelona, Spain |
2004 | EAMT Workshop, Valletta, Malta |
2002 | ACL'02, Philadelphia, PA, USA |
1995 | EACL'95, Dublin, Ireland |
2003- | Text, Speech and Dialog Conference, Czech Rep., (standing) PC (SC) Member |
I have also served as a reviewer at an additional 94 conferences or workshops (between 1994 and 2024).
Organization or co-organization of conferences and workshops
2021 | INTERSPEECH 2021, Brno, Czech Republic, Plenary program co-chair |
2020 | The Second International Workshop on Designing Meaning Representations (DMR 2020), at Coling 2020 |
2020 | Cross-Framework Meaning Representation 2nd Parsing Shared Task 2020, at EMNLP/CoNLL 2020 |
2019 | Cross-Framework Meaning Representation Parsing 1st Shared Task 2019, co-organization and Publication Chair |
2018 | Treebanks and Linguistic Theories 16, Prague, Czech Republic, with an associated Data Provenance Workshop by M. Butt |
2018 | CoNLL 2018 Second Shared Task on Multilingual Parsing Universal Dependencies, at CoNLL 2018, Brussels, Belgium |
2017 | CoNLL 2017 First Shared Task on Multilingual Parsing Universal Dependencies, at ACL/CoNLL 2017, Vancouver, Canada |
2014 | Fred Jelinek JHU Summer Workshop for Speech and Language Processing, July 2014, Prague, Czech Rep. (in cooperation with Johns Hopkins Univ., Baltimore, MD, USA) |
2012 | META-RESEARCH Workshop on Advanced Treebanking, LREC 2012, Istanbul, Turkey. |
2007 | ACL'07 and EMNLP'07, Prague, Czech Republic (Local Coordinator) |
2006 | TLT'06, Prague, Czech Republic |
2006-2010 | Vilém Mathesius Courses (Schools), Prague, Czech Republic |
Committees, Boards
2023-2026 | Member of the Scientific Council of the Czech Science Foundation (Grantová agentura ČR) |
2018- | Member of the Scientific Council of Charles University in Prague |
2015-2024 | Executive Board of META-NET, chair. |
2015-2021 | Member (external) of the Scientific Council of the Faculty of Electrical Engineering, Czech Technical University |
2015- | Member (external) of the Scientific Council of the Czech Institute for Informatics, Robotics and Cybernetics, Czech Technical University |
2013- | Member of the joint Clarin DE / Dariah DE Technical Advisory Board (Germany). |
2012-2014 | Member of the International Advisory Board, Clarin NL (Netherlands). |
2012- | Member of the International Committee for Computational Linguistics. |
2013-2017 | Member of the Management Committee (for Czech Republic) for the COST IC1207 Action of the ESF, within the 7th FP EU (PARSEME, IC1207). |
2012-2018 | Member of the Standing Committee for CLARIN Technical Centres (SCCTC), of the EU-wide language resource infrastructure Clarin ERIC (1st and 2nd term). |
2012-2024 | Member of the Scientific Council of the Faculty of Mathematics and Physics, Charles University in Prague |
2012-2021 | Member of the Council of the core research PRVOUK project, awarded to the Computer Science School by the Charles University in Prague; continuing in the PROGRESS Q48 and Q18 project boards |
2011-2019 | Research Council of the Technology Agency of the Czech Republic, member (2 terms) |
2011-2012 | Expert panel of the Coordinating Committee on the strategy of applied research in the Czech Republic ("Priorities 2030") of the Council for Science, Research and Innovations of the Czech Republic |
2011- | Steering Committee for the establishment of the Transactions of the Association for Computational Linguistics journal; head of search committee |
2011-2012 | Scientific Council of the Faculty of Mathematics and Physics, Charles University in Prague (1st term) |
2010-2014 | Subcommittee for social sciences and humanities, Council for Science, Research and Innovations of the government of the Czech Republic |
2008-2012 | Computational Linguistics, Editorial Board Member |
2003- | NSF Panels (ITR, HLT) |
2002- | ACL SIGDAT Advisory Board member |
1999-2002 | TEI Consortium Board of Directors Member, ACL Representative |
1998-1999 | TEI Steering Committee Member, ACL Representative |
1997- | EU Evaluation Committee(s), Research Projects |
1996- | Grant Agency of the Czech Republic, reviewer (Linguistic and Computer Science Programs) |
1995-1996 | European Chapter of the ACL Advisory Board Member |
1990- | Czech National Corpus Founding Member, member of CNC Advisory Board (2016-) |
Awards
2022 | Donatio Prize of Charles University |
2020 | Silver Medal of the Charles University in Prague |
2012 | Silver Medal I of the Faculty of Mathematics and Physics, Charles University in Prague. |
2009 | Award of the Academy of Sciences of the Czech Republic for the best research project in the programme "Information Society" 2005-2009 (Project: "From natural language to the semantic web") |
2005 | Co-author of a best student paper at EMNLP 2005, Vancouver, with Ryan McDonald, Fernando Pereira and Kiril Ribarov: "Non-projective Dependency Parsing using Spanning Tree Algorithms" |
2001 | Silver Medal of the Charles University in Prague (as a member of the Czech National Corpus team) |
Membership
I am a member of the ACL, ISCA, IEEE, CLAIRE, CzADH/EADH, AICzechia and the Prague Linguistic Circle.
Other Web Page(s)
You might also want to visit our Institute's pages at http://ufal.mff.cuni.cz