I made a neat little program that scans any text file and fixes typos based on the words I have defined and their correct replacements. As I've been building the wordlist, I've realized that what I'm actually building is a Hive Dictionary for ASR transcripts.
This significantly increases the quality of the transcripts, making it possible to generate higher-quality data (like summaries) based on them.
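For anyone curious what a wordlist-based fixer like this might look like, here is a minimal sketch in Python. It is not the actual program: the entries in `CORRECTIONS` and the names `build_pattern` and `fix_transcript` are made up for illustration, and it assumes whole-word, case-insensitive matching, which the real tool may or may not use.

```python
import re

# Hypothetical sample entries; the real wordlist is not shown in this thread.
CORRECTIONS = {
    "in leo": "InLeo",
    "leo glossary": "LeoGlossary",
    "hive block chain": "Hive blockchain",
}


def build_pattern(corrections):
    """Compile one regex that matches any misspelling as a whole word or phrase."""
    # Longest keys first so a longer phrase wins over a shorter one it contains.
    keys = sorted(corrections, key=len, reverse=True)
    alternation = "|".join(re.escape(key) for key in keys)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def fix_transcript(text, corrections):
    """Return the corrected text and the number of replacements made."""
    lookup = {key.lower(): value for key, value in corrections.items()}
    pattern = build_pattern(corrections)
    # subn returns both the new string and how many substitutions happened.
    fixed, count = pattern.subn(lambda m: lookup[m.group(0).lower()], text)
    return fixed, count


sample = "Welcome to in leo. The Leo Glossary explains the hive block chain."
fixed, count = fix_transcript(sample, CORRECTIONS)
print(fixed)
print(count)
```

To run it over a file, you would read the file into a string, pass it through `fix_transcript`, and write the result back out. The returned count is one way to tally how many words got fixed across a batch of transcripts.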
You are going to need to elaborate on this. Perhaps either here or in an article.
Good idea! I've been lagging behind on my posting anyway.
A short version here, though: based on the 107 entries I've created in my dictionary so far, I was just able to replace 26k misspelled words with their properly spelled Hive words/names, found throughout the approximately 20 million words transcribed so far.
Now if you can figure out how to put LeoGlossary links in there. LOL
Here's the write-up you asked for: https://inleo.io/@mightpossibly/increased-data-quality-with-hive-asr-dictionary
Very cool. Thanks for that.