Neural networks – purely numerical models – are used to process text every day. This requires converting the text into a numerical format, a step called tokenization. The seemingly simple approach of splitting text into word-like tokens has severe drawbacks caused by out-of-vocabulary words. Widely adopted solutions such as Byte-Pair Encoding (BPE) or SentencePiece produce subwords, which can still carry some lexical meaning (unlike splitting down to individual characters). This subword segmentation is learned statistically from data, and the resulting subwords usually do not align with morphemes. Frustratingly, subwords that respect morphology (often obtained at additional computational cost) simply do not improve performance on downstream tasks. In this talk I will present our recent quixotic adventures in pursuit of subword segmentations that would finally "make sense".
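For illustration, the following is a minimal, self-contained Python sketch of the greedy BPE training loop on an invented toy corpus (it is not the speaker's implementation; the corpus, frequencies, and number of merges are made up). It shows that merges are chosen purely by pair frequency, so the resulting subwords need not coincide with morphemes such as un + happy + ness:

from collections import Counter

# Toy corpus: word types split into characters, with invented frequencies.
corpus = {
    ("u", "n", "h", "a", "p", "p", "i", "n", "e", "s", "s"): 3,
    ("h", "a", "p", "p", "y"): 5,
    ("h", "a", "p", "p", "i", "e", "r"): 2,
    ("u", "n", "d", "o"): 4,
}

def train_bpe(corpus, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        merged = {}
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        corpus = merged
    return merges, corpus

merges, segmented = train_bpe(corpus, num_merges=6)
for word in segmented:
    print(" ".join(word))

Because the merge order is driven only by corpus statistics, "unhappiness" typically ends up segmented into pieces like "un happ i ness" or similar, with a boundary inside the morpheme "happy" – exactly the mismatch between statistical subwords and morphemes that the talk addresses.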
*** The talk will be delivered in person (MFF UK, Malostranské nám. 25, 4th floor, room S1) and will be streamed via Zoom. For details on how to join the Zoom meeting, please write to sevcikova et ufal.mff.cuni.cz ***