The500Feed.Live

Everything going on in AI - updated daily from 500+ sources

← Back to The 500 Feed
📄 ResearchMay 21, 2026

Tokenization with Split Trees

We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vo...

Read Original Article →

Source

http://arxiv.org/abs/2605.22705v1