Martin Boyanov
1 min readDec 6, 2018

--

Thanks Mohammad!

The current heuristics will never do a transformation “<unk>” to “<unk> <unk>”. However, I’m imposing an extra rule that the original word is not in the vocabulary, because I would like to avoid splits like “damage” => “dam age” or “input” => “in put” .

Otherwise, I totally agree that Byte Pair Encoding or Subword Regularization could be used instead. In my use case I’d prefer to transform the entire text with subwords, instead of just the <unk> words. Unfortunately, there are downstream processes which are not prepared to handle this shift in input, so I’d like to keep the text as close as possible to the original.

Did you have a specific text segmentation algorithm in mind?

Sign up to discover human stories that deepen your understanding of the world.

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

--

--

Martin Boyanov
Martin Boyanov

Written by Martin Boyanov

Data Scientist passionate about NLP and Graph Modeling

No responses yet

Write a response