Thanks Mohammad!
The current heuristics will never do a transformation “<unk>” to “<unk> <unk>”. However, I’m imposing an extra rule that the original word is not in the vocabulary, because I would like to avoid splits like “damage” => “dam age” or “input” => “in put” .
Otherwise, I totally agree that Byte Pair Encoding or Subword Regularization could be used instead. In my use case I’d prefer to transform the entire text with subwords, instead of just the <unk> words. Unfortunately, there are downstream processes which are not prepared to handle this shift in input, so I’d like to keep the text as close as possible to the original.
Did you have a specific text segmentation algorithm in mind?