Simon 0.4.90 beta released
The second version (0.4.90) towards Simon 0.5.0 is out in the wilds. Please download the source code, test it and send us feedback.
What we changed since the alpha release:
- Bugfix: The download of Simon base models works again flawlessly (BUG: 377968)
- Fix detection of utterid APIs in Pocketsphinx
You can get it here:
https://download.kde.org/unstable/simon/0.4.90/simon-0.4.90.tar.xz.mirrorlist
Also in the works is an AppImage version of Simon for easy testing. We hope to deliver one soon.
Known issues with Simon 0.4.90 are:
- Some Scenarios available for download don't work anymore (BUG: 375819)
- Simon can't add Arabic or Hebrew words (BUG: 356452)
We hope to fix these bugs soon and look forward to your feedback and bug reports - and maybe to seeing you at the next Simon IRC meeting: Tuesday, 23rd of May, at 10pm (UTC+2) in #kde-accessibility on freenode.net.
About Simon
Simon is an open source speech recognition program that can replace your mouse and keyboard. The system is designed to be as flexible as possible and will work with any language or dialect. For more information take a look at the Simon homepage.
Simon 0.4.80 alpha released
The first version (0.4.80) towards Simon 0.5.0 is out in the wilds. Please download the source code, test it and send us feedback.
Some new features are:
- MPRIS Support (Media Player control)
- macOS port (thanks to René there is now a first MacPorts script)
- A series of bug fixes.
You can get it here:
https://download.kde.org/unstable/simon/0.4.80/src/simon-0.4.80.tar.xz.mirrorlist
Also in the works is an AppImage version of Simon for easy testing. We hope to deliver one for the beta release coming soon.
Known issues with Simon 0.4.80 are:
- Base model download doesn't work as expected. You can search for some base models though (BUG: 377968)
- Some Scenarios available for download don't work anymore (BUG: 375819)
- Simon can't add Arabic or Hebrew words (BUG: 356452)
We hope to fix these bugs soon and look forward to your feedback and bug reports - and maybe to seeing you at the next Simon IRC meeting: Tuesday, 4th of April, at 10pm (UTC+2) in #kde-accessibility on freenode.net.
About Simon
Simon is an open source speech recognition program that can replace your mouse and keyboard. The system is designed to be as flexible as possible and will work with any language or dialect. For more information take a look at the Simon homepage.
New life in Simon speech recognition
As my blog as FSFE Fellow No. 1 is temporarily not aggregated on planet.kde.org, and my private blog about woodwork (German only) currently only covers a wooden staircase I'm building (but will soon return to wooden jewelry), I found a new place for my KDE (non-Randa) related stuff: KDE Blogs. Thanks to the KDE Sysadmin team for the quick setup!
Since the beginning of this year there has been new activity in and around Simon speech recognition. We have held several weekly IRC meetings (logs: W02, W03, W04, W05, W06, W07 and W08) and there is a workboard with tasks. Our plan for the near future is to release a last kdelibs4- and Qt4-based version of Simon. Afterwards we will focus on the KDE Frameworks 5 and Qt5 port, and then we might have the time and capacity to look at new feature development, e.g. Lera or the integration of the Kaldi speech recognition framework. There is parallel work as well, like creating Scenarios, working on speech (acoustic) and language models, and documenting all of this.
So to reach this first goal of a last kdelibs4/Qt4-based version of Simon (the last stable release of Simon happened back in 2013 and there are some commits waiting to be released), we need your help. If you would like to work on checking documentation, compiling first alpha versions of the new release, or just writing about Simon or showcasing it in videos, please get in contact with us via email, IRC (#kde-accessibility on freenode.net) or the KDE Forums.
And if you'd like to start right away you'll find us tomorrow (Tuesday, 14th of March) at 10pm (CEST) in #kde-accessibility on freenode.net. Looking forward to meeting you!
PS: Something different and how times change: Just bought a dishwasher and got a printed copy of the GNU GPL ;-).
Simon on OS X
Let's make this short and sweet: Starting today, the Simon development version officially supports Mac OS X (10.6+).
If you want to try it out, you can find instructions on KDE's userbase. Enjoy!
Open Academy
Back in 2012, Facebook and Stanford University introduced their "Open Academy" program. The aim was and still is simple: Give University students an opportunity to work on real open source projects in exchange for University credit - and a ton of valuable experience.
This year, KDE has joined as a mentoring organization with a total of 11 students assigned to work on 3 different projects. One of those projects is Simon's upcoming natural language dialog manager: A system building on the current "Dialog" plugin to enable the creation of advanced spoken dialogs like the ones made popular by Apple's Siri.
Kickoff event at Facebook's headquarters
Three students from the University of Texas at Austin are rising to the challenge: Ben, Eddie and Tom. Guided by both Professor Downing and myself, they will be working on bringing a natural language dialog system to Simon.
Throughout the development process, they will post status updates on their respective blogs, which have been aggregated on Planet KDE, so do watch out for updates!
ReComment: A speech-based Recommender System
Most of you will probably know that as my "day job", I am a student currently pursuing my master's degree in computer science. This, of course, also entails some original research.
In this blog post, I will describe both one of these efforts and a practical use case of Simon's upcoming dictation features, all conveniently rolled up into one project: ReComment.
A recommender system tries to aid users in selecting e.g., the best product, the optimal flight, or, in the case of a dating website, even the ideal partner - all specifically tailored to the user's needs. Most of you have probably already used a recommender system at some point: Who hasn't ever clicked on one of the products in the "Customers Who Bought This Item Also Bought..." section on Amazon?
The example from Amazon uses what is conventionally called a "single-shot" approach: based on the system's information about the user, a set of products is suggested. In contrast, "conversational" recommender systems actively interact with the user, thereby incrementally refining their understanding of the user's preferences.
Such conversational recommender systems have been shown to work really well in finding great items, but obviously require more effort from any single user than a single-shot system. Many different interaction methods have been proposed to keep this effort to a minimum while still finding optimal products in reasonable time. However, these two goals (low user effort and fast convergence) are often contradictory: the less information the user provides, the less information is available to the recommendation strategy.
In our research we sidestep this problem, which is traditionally combated with increasingly complex recommendation strategies, and instead make it easier for the user to provide rich information to the system: ReComment is a speech-based approach to building a more efficient conversational recommender system.
What this means exactly is probably best explained with a short video demonstration.
(The experiment was conducted in German to find more native speaking testers in Austria; be sure to turn on subtitles!)
Powering ReComment is Simond with the SPHINX backend, using a custom-built German speech model. The NLP layer uses relatively straightforward keyword spotting to extract meaning from user feedback.
A pilot study was conducted with 11 users to confirm and extend the choice of recognized keywords and grammar structures. The language model was modified to heavily favor keywords recognized by ReComment during decoding. Recordings of the users from the pilot study were manually annotated and used to adapt the acoustic model to the local dialect.
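Conceptually, the keyword spotting layer is simple; a minimal Python sketch might look like the following (the keyword table and attribute names are invented for illustration and are not ReComment's actual vocabulary):

```python
# Minimal keyword-spotting sketch: map recognized keywords to preference critiques.
# The keyword table and attribute names are illustrative, not ReComment's real ones.
KEYWORDS = {
    "cheaper": ("price", -1),   # lower price preferred
    "lighter": ("weight", -1),  # lower weight preferred
    "bigger":  ("screen", +1),  # larger screen preferred
}

def extract_critiques(recognized_text):
    """Return (attribute, direction) pairs spotted in the recognition result."""
    return [KEYWORDS[w] for w in recognized_text.lower().split() if w in KEYWORDS]

print(extract_critiques("I would like something cheaper and lighter"))
# -> [('price', -1), ('weight', -1)]
```

The spotted critiques are then fed into the recommendation strategy to refine the next suggestion.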
ReComment itself is built in pure Qt to run on Linux and the Blackberry PlayBook.
Results and Further Information
To evaluate the performance of ReComment, we conducted an empirical study with 80 participants, comparing the speech-based interface to a traditional mouse-based interface. Users not only reported higher overall satisfaction with the speech-based system, but also found better products in significantly fewer interaction cycles.
The research was published at this year's ACM Recommender Systems conference. You can find the presentation and the full paper as a PDF in the publications section on my homepage.
The code for the developed prototype, including both the speech-based and the mouse-based interface, has been released as well.
Launching the Open Speech Initiative
Over the course of the summer, I have been working on bringing dictation capabilities to Simon. Now, I'm trying to build up a network of developers and researchers that work together to build high accuracy, large vocabulary speech recognition systems for a variety of domains (desktop dictation being just one of them).
Building such systems using free software and free resources requires a lot of work in many different areas (software development, signal processing, linguistics, etc.). In order to facilitate collaboration and to establish a sustainable community between volunteers of such diverse backgrounds, I am convinced that the right organizational structure is crucial to ensuring continued long-term success.
With this in mind, I am pleased to introduce the new Open Speech Initiative under the KDE umbrella: A team of developers looking to bring first class speech processing to the world of free software.
Team
The current team consists of Simon, a German programmer getting into speech technology; Adam Nash, a Simon GSoC graduate whom I'm very happy to welcome back; Mario Fux, well known for - among other things - the legendary Randa meetings; Jon Lederman, co-founder of SonicCloud (a cloud telephony platform); and myself.
If you are interested in joining or actually already are working on a project that also deals with speech processing, please feel free to get in touch with us.
Infrastructure
Website: speech.kde.org (currently under construction)
IRC: #kde-speech on Freenode
Mailing list: kde-speech@kde.org
Projects: https://projects.kde.org/projects/extragear/speech
Right now, we're mostly working on the lower levels, setting up infrastructure and automatic systems to help us build better speech models quicker in the future.
However, we also have some end-user applications planned that range from dictation software to automatic subtitling.
Watch the Open Speech Initiative's website for updates!
Open Source Dictation: Wrapping up
Ten days ago, I completed the dictation prototype - just in time for this year's Akademy conference.
Akademy
At Akademy, I gave a talk about open source speech recognition and demoed the dictation prototype.
The slides and the video of the talk are both already available. If you've seen the talk, please consider leaving me some feedback - it's always appreciated.
On Tuesday, I held a two-hour BoF session dedicated to open source speech recognition: first, I quickly re-established the core pillars of a large vocabulary continuous speech recognition (LVCSR) system and explained how all those components fit together. Then we talked about where one could potentially source additional material for building higher quality language and acoustic models, and discussed some applications of speech recognition technology relevant to the larger KDE community.
As a side note: this year's Akademy was certainly one of the best conferences I've been to thus far. The talks and BoF sessions were great, the atmosphere inspiring and the people - as always - just awesome. A special thanks also to the local team and all the organizers, who put together a program that was simply sublime.
Where's the code?
When I started, I told you that I would share all data created during the course of this project. As promised:
- Updated dictation plugin: Check out the "dictation" branch from Simon Git
- Language model
- Acoustic model
(I decided to share the unadapted acoustic model instead of the final, adapted one I used in the video because the latter is specifically tailored to my own voice, and I suppose that is not really useful for anyone but me. If you're really interested in the adapted model for the sake of reproducibility, I'm of course happy to share it as well.)
As I mentioned repeatedly, this is "just" a prototype and absolutely not intended for end-user consumption. Even with all the necessary data files, setting up a working system is anything but trivial. And I can't stress this enough: if you're looking for a ready-to-use system, Simon is not (yet) it!
Where to go from here?
As many of you will have noticed, the project was partly also intended to find potentially interested contributors to join me in building open source speech recognition systems. In this regard, I'm happy to report that in the last ten days, quite a few people have contacted me and asked how to get involved.
I'll hold an IRC meeting in the coming week to discuss possible tasks and how to get started. If you're interested in joining the meeting, please get in touch.
Open Source Dictation: Demo Time
Over the last couple of weeks, I've been working towards a demo of open source speech recognition. I did a review of existing resources and managed to improve both the acoustic and the language model. That left turning Simon into a real dictation system.
Making Simon work with large-vocabulary models
First of all, I needed to hack Simond a bit to accept and use an n-gram based language model instead of the scenario grammar whenever the former is available. With this little bit of trickery, Simon was already able to use the models I built over the last weeks.
Sadly, I immediately noticed a big performance issue: up until now, Simon basically recorded one sample until the user stopped speaking and only then started recognizing. While not a problem when the "sentences" are constrained to simple, short commands, this causes significant lag as the length of the sentences, and therefore the time required for recognition, increases. Even when recognizing faster than real time, this essentially meant that you had to wait for about 2 seconds after saying a roughly 3 second sentence.
To keep Simon snappy, I implemented continuous recognition in Simond (for pocketsphinx): Simon now feeds data to the recognizer engine as soon as the initial buffer is filled, making the whole system much more responsive.
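For illustration, this is roughly what incremental decoding looks like with the pocketsphinx Python bindings; Simond does the equivalent in C++ against the C API. Model paths and the chunk size are placeholders, and the exact configuration call may differ between binding versions:

```python
# Sketch of continuous (incremental) decoding with pocketsphinx:
# audio is fed to the decoder in small chunks as soon as it is available,
# instead of recording the whole utterance first.
from pocketsphinx import Decoder

# Placeholder model paths; a real setup points these at actual model files.
decoder = Decoder(hmm="acoustic-model-dir", lm="language-model.lm", dict="pronunciation.dic")

decoder.start_utt()
with open("recording.raw", "rb") as audio:   # 16 kHz, 16 bit, mono PCM
    while True:
        chunk = audio.read(4096)
        if not chunk:
            break
        decoder.process_raw(chunk, False, False)
        if decoder.hyp() is not None:
            print("partial:", decoder.hyp().hypstr)   # partial hypothesis so far
decoder.end_utt()
print("final:", decoder.hyp().hypstr if decoder.hyp() else "")
```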
Even before this project started, Simon already had a "Dictation" command plugin. Basically, this plugin would just write out everything that Simon recognizes. But that's far from everything there is to dictation from a software perspective.
First of all, I needed to take care of replacing the special words used for punctuation, like ".period", with their associated signs. To do that, I implemented a configurable list of string replacements in the dictation plugin.
An already existing option to add a given text at the end of a recognition result takes care of adding spaces after sentences if configured to do so. I also added the option to uppercase the first letter of every new spoken sentence.
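Put together, the post-processing chain amounts to something like the following sketch (the replacement table is an example configuration, not Simon's shipped defaults):

```python
# Sketch of the dictation post-processing: replace spoken punctuation words,
# uppercase the start of the sentence, append a trailing space.
REPLACEMENTS = {".period": ".", ",comma": ",", ":colon": ":", "?question-mark": "?"}

def postprocess(raw_result, uppercase_first=True, append=" "):
    text = raw_result
    for spoken, sign in REPLACEMENTS.items():
        # Remove the space before the spoken token, then substitute the sign.
        text = text.replace(" " + spoken, sign).replace(spoken, sign)
    text = text.strip()
    if uppercase_first and text:
        text = text[0].upper() + text[1:]
    return text + append

print(repr(postprocess("this is a test .period")))   # -> 'This is a test. '
```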
Then, I set up some shortcut commands that would be useful while dictating ("Go to the end of the document" for ctrl+end or "Delete that" for backspace, for example).
To deal with incorrect recognition results, I also wanted to be able to modify already written text. To do that, I made Simon aware of the currently focused text input field by using AT-SPI 2. I then implemented a special "Select x" command that searches through the current text field and selects the text "x" if found. This enables the user to select the offending word(s) to either remove them or simply dictate the correction.
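The selection logic itself boils down to a substring search over the focused field's text; the function below is an illustrative stand-in (Simon obtains the field text and applies the selection through AT-SPI 2):

```python
# Illustrative sketch of the "Select x" logic: given the text of the focused
# field, find the spoken fragment and return the character range to select.
def find_selection(field_text, spoken_fragment):
    """Return (start, end) offsets of the fragment in the field, or None."""
    start = field_text.lower().find(spoken_fragment.lower())
    if start == -1:
        return None
    return start, start + len(spoken_fragment)

print(find_selection("I cannot hear you", "hear"))   # -> (9, 13)
```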
Demonstration
So without much ado, this is the end result:
What's next?
Of course, this is just the beginning. If we want to build a real, competitive open source speech recognition offering, we have to tackle - among others - the following challenges:
- Turning the adaption I did manually into an integrated, guided setup procedure for Simon (enrollment).
- Continuing to work towards better language- and acoustic models in general. There's a lot to do there.
- Improving the user interface for the dictation: We should show off the current (partial) hypothesis even while the user is speaking. That would make the system feel even more responsive.
- Better accounting for spontaneous input: Simon should be aware of (and ignore) filler words, support mid-sentence corrections, false starts, etc.
- Integrating semantic logic into the language model. For example, in the current prototype, recognizing "Select x" is pretty tricky because e.g., "Select hear" is not a sentence that makes sense according to the language model - it does in the application, though (select the text "hear" in the written text for correction / deletion).
- Better integrating dictation with traditional command & control: When not dictating texts, we should still exploit the information we do have (available commands) to keep recognition accuracy as high as it is for the limited-vocabulary use case we have now. A mixture of (or switching between) grammar and language model should be explored.
- Better integration in other apps: The AT-SPI information used for correcting mistakes is sadly not consistent across toolkits and widgets. Many KDE widgets are in fact not accessible through AT-SPI (e.g. the document area of Calligra Words does not report to be a text field). This is mostly down to the fact that no other application currently requires the kind of information Simon does.
Even this rather long list is just a tiny selection of what I can think of right off the top of my head - and I'm not even touching on improvements in e.g. CMU SPHINX.
There's certainly still a lot left to do, but all of it is very exciting and meaningful work.
I'll be at the Akademy conference for the coming week where I'll also be giving a talk about the future of open source speech recognition. If you want to get involved in the development of an open source speech recognition system capable of dictation: Get in touch with me - either in person, or - if you can't make it to Akademy - write me an email!
Open Source Dictation: Acoustic Model
After working a bit on the language model last week, I spent some time on improving the used acoustic model which, simply put, is the representation of how spoken words actually sound.
Improving the general model
So far, the best acoustic model for my test set was Nickolay's Voxforge 0.4 model built from the Voxforge database.
The Voxforge corpus is available under the terms and conditions of the GPL which means that I was free to try to improve upon that model. Sadly, I quickly realized that given the tight time constraints for this project, it was computationally infeasible to run a lot of experiments: The training procedure takes around 24 hours to complete on my laptop - the fastest machine at my disposal.
Because of that, I was not able to try some very interesting approaches like vocal tract length normalization (which tries to account for differences in the resonance properties of varyingly long vocal tracts of different speakers) or MMIE training, although they have been shown to improve word error rates. I was also not able to fine-tune the number of senones used or to clean the training database with forced alignment. Such experiments will have to wait until after the completion of this project - there's definitely quite a bit of low-hanging fruit.
However, I was still able to boost recognition rates simply by rebuilding the existing Voxforge model to incorporate all the new training data submitted since the last model was created in 2010.
Acoustic model | Dictionary | Language model | WER |
---|---|---|---|
Voxforge 0.4 | Ensemble 65k | Ensemble 65k | 29.31 % |
Voxforge new | Ensemble 65k | Ensemble 65k | 27.79 % |
This also nicely shows what an impact a growing database of recordings has on recognition accuracy. If you want to help drop that WER score further, help today!
Adapting the model to my voice
Of course, when building a dictation system for myself, it would be foolish not to adapt this general acoustic model to my own voice. Model adaption is a fairly sure-fire way to dramatically improve recognition accuracy.
To this end, I recorded about 2 hours worth of adaption data (1500 recordings). Thanks to Simon's power training feature this only took a single afternoon - despite taking frequent breaks.
I then experimented with MLLR and MAP adaption with a range of parameters. Although I fully expected this to make a big difference, the actual result is astonishing: the word error rate on the test set drops to roughly half - about 15 %.
Acoustic model | Dictionary | Language model | WER |
---|---|---|---|
Voxforge new | Ensemble 65k | Ensemble 65k | 27.79 % |
Voxforge new; MAP adapted to my voice | Ensemble 65k | Ensemble 65k | 15.42 % |
Because I optimized the adaption parameters to achieve the lowest possible error rate on the test set, I could have potentially found a configuration that performs well on the test set but not in the general case.
To ensure that this is not the case, I also recorded an evaluation set consisting of 42 sentences from a blog post from the beginning of this series, an email and some old chat messages I wrote on IRC. In contrast to the original test set, this time I am also using verbalized punctuation in the recordings I'm testing - simulating the situation where I would use the finished dictation system to write these texts. This also better matches what the language model was built for. The end result? A 13.30 % word error rate on the evaluation set:
what speech recognition application are you most looking forward to ?question-mark
with the rising popularity of speech recognition in cars and mobile devices it's not hard to see that we're on the cost of making speech recognition of first class input method or across all devices .period
however ,comma it shouldn't be forgotten that what we're seeing in our smart phones on laptops today is merely the beginning .period
I am convinced that we will see much more interesting applications off speech recognition technologies in the future .period
so today ,comma I want to ask :colon
what application of speech recognition technology an id you're looking forwards to the most ?question-mark
for me personally ,comma I honestly wouldn't know where to begin .period
[...]
Open Source Dictation: Language Model
A language model defines word succession probabilities: for example, "now a daze" and "nowadays" are pronounced exactly the same, but because of context we know that "Now a daze I have a smartphone" is far less likely than "Nowadays I have a smartphone". To model such contextual information, speech recognition systems usually use an n-gram model that encodes how likely a specific word is, given the context of the sentence.
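As a toy illustration of the idea (counts invented, no smoothing), a bigram model scores the two hypotheses very differently:

```python
# Toy bigram model illustrating how context disambiguates "nowadays" vs "now a daze".
# Counts are invented; real models are estimated from large corpora and smoothed.
from collections import defaultdict

bigram_counts = defaultdict(int, {
    ("nowadays", "i"): 120, ("i", "have"): 500, ("have", "a"): 800,
    ("a", "smartphone"): 40, ("now", "a"): 90, ("a", "daze"): 1,
})
unigram_counts = defaultdict(int, {
    "nowadays": 150, "i": 5000, "have": 2000, "a": 9000, "now": 700, "daze": 2,
})

def sentence_score(words):
    """Product of P(w_i | w_{i-1}); unseen transitions score 0 without smoothing."""
    score = 1.0
    for prev, cur in zip(words, words[1:]):
        if unigram_counts[prev] == 0:
            return 0.0
        score *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return score

print(sentence_score("nowadays i have a smartphone".split()))        # small but non-zero
print(sentence_score("now a daze i have a smartphone".split()))      # far less likely
```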
When comparing different existing speech models earlier, the Gigaword language model (with a 64,000 word vocabulary) outperformed all the other language models. I decided to try to improve upon that model.
Existing Language Models
To see what's what, I again set up a test set. This time, we only want to look at the language model performance, with no influence from other components.
This is done by measuring the perplexity of the language model given an input text. The perplexity value basically tells you how "confused" the language model was when seeing the test text. Think about it like this: "The weather today is round" is certainly more confusing than "The weather is good". (For the more mathematically inclined: the perplexity is two to the power of the entropy of the test set.) Ideally, word successions that make sense would yield low perplexity and sentences that don't, very high ones. That way, the recognizer is discouraged from outputting hypotheses like "Now a daze I have a smartphone".
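In code, that definition is just a few lines; the per-word probabilities below are made up purely to show the contrast between a "confident" and a "confused" model:

```python
# Perplexity of a test set under a language model: 2 to the power of the
# average negative log2-probability per word.
import math

def perplexity(word_probs):
    """word_probs: per-word probabilities assigned by the language model."""
    entropy = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** entropy

# A model that finds the text plausible (high per-word probabilities) ...
print(perplexity([0.2, 0.1, 0.3, 0.25]))          # low perplexity (~5)
# ... versus one that is "confused" by it.
print(perplexity([0.001, 0.002, 0.001, 0.004]))   # high perplexity (~600)
```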
Additionally, we'll also be looking at "out of vocabulary" words. Naturally, a language model only contains information about a certain amount of words. Words that are not in the LM cannot be recognized. Therefore, one might be tempted to use ever-growing vocabulary sizes to mitigate the issue. However, this makes the recognizer not only slower but also less accurate: because the language model encodes more transitions, the gap between common and very rare transitions becomes smaller, increasing perplexity. Moreover, implementation details in CMU SPHINX discourage the use of vocabularies much larger than roughly 64,000 words.
Because I didn't need to record this test set, I elected to use a bigger one to get more accurate results. The used test set therefore consists of 880 sentences from 495 chat messages, 169 email fragments, 175 sentences from scientific texts and 30 sentences from various news sources. The extracted sentences were not cleaned of e.g., names or specialized vocabulary. Because we're aiming for dictation, I modified the used corpora to use what's called "verbalized punctuation": punctuation marks are replaced with pseudo-words like ".period" or "(open-parenthesis" that represent what a user should say in the finished system.
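The corpus preprocessing for verbalized punctuation is a straightforward token substitution; a sketch (with an example mapping following the conventions above) could look like this:

```python
# Sketch of the corpus preprocessing step: replace punctuation marks with the
# pseudo-words the user is expected to say. The mapping below is an example.
import re

VERBALIZED = {".": ".period", ",": ",comma", ":": ":colon",
              "?": "?question-mark", "(": "(open-parenthesis"}

def verbalize(sentence):
    tokens = re.findall(r"\w+|[.,:?()]", sentence)
    return " ".join(VERBALIZED.get(token, token) for token in tokens)

print(verbalize("Nowadays, I have a smartphone."))
# -> "Nowadays ,comma I have a smartphone .period"
```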
To sum up: We are looking for a language model with around 64 thousand words that has the lowest possible perplexity score and the lowest amount of out-of-vocabulary words on our given test set.
So, let's first again compare what's currently out there given our new test set.
Language model | OOVs [%] | Perplexity |
---|---|---|
HUB 4 (64k) | 15.03% | 506.9 |
Generic (70k) | 14.51% | 459.7 |
Gigaword, (64k) | 9.40% | 458.5 |
From this, we can already see why the Gigaword corpus performed much better than the other two language models in our earlier test. However, it still has almost 10 % out-of-vocabulary words on our test set. To understand why, we have to look no further than what the corpus is built from: various newswire sources that are about a decade old by now. Given our problem domain for this experiment, this is obviously not ideal.
Can we do better?
Building a language model isn't exceptionally hard, but you need to find an extensive amount of data that closely resembles what you want the system to recognize later on.
After a lot of experimenting (tests of all data sets below are available upon request), I settled on the following freely available data sources:
- English Wikipedia: Solid foundation over a diverse set of topics. Encyclopaedic writing style.
- U.S. Congressional Record (2007): Somber discussions over a variety of topics using sophisticated vocabulary.
- Corpus of E-Mails of Enron Employees: Mixture of business and colloquial messages between employees.
- Stack Exchange (split between Stack Overflow and all other sites): Questions and answers from experts over a variety of domains, many of which are technical (fitting our problem domain).
- OpenSubtitles.org (dump graciously provided upon request): Everyday, spoken speech.
- Newsgroups (alt.* with a few exceptions): Informal conversations.
I built separate models from each of these corpora, which were then combined into one large "ensemble" model with mixture weights optimized for the perplexity score on the test set. These mixture weights are visualized in the graph below.
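Conceptually, the ensemble model is a linear interpolation of the per-corpus models. The sketch below uses invented weights and stub models purely to show the mechanics; in practice an LM toolkit estimates the weights by minimizing perplexity on held-out text:

```python
# Linear interpolation of several corpus-specific language models into one
# "ensemble" model. Weights and model stubs are invented for illustration.
MIXTURE_WEIGHTS = {"wikipedia": 0.25, "stackexchange": 0.25, "subtitles": 0.20,
                   "enron": 0.15, "newsgroups": 0.10, "congress": 0.05}

def ensemble_prob(word, context, models):
    """models: corpus name -> callable returning P(word | context) for that corpus."""
    return sum(w * models[name](word, context) for name, w in MIXTURE_WEIGHTS.items())

# Stub component models that each assign a flat probability.
toy_models = {name: (lambda word, context: 0.01) for name in MIXTURE_WEIGHTS}
print(ensemble_prob("smartphone", ("have", "a"), toy_models))   # -> 0.01 (weights sum to 1)
```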
For each of the data sets, I also calculated word counts, selected the top 20,000 to 35,000 words (depending on the variability of the corpus) and removed duplicates to end up with a word list of about 136,000 common words across the above corpora. I then further pruned this word list with a large dictionary of valid English words (more than 400,000 entries) and manually removed a few entries, e.g. foreign names, to arrive at a list of around 65,000 common English words to which I limited the ensemble language model.
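The word list construction can be sketched in a few lines as well (the thresholds and toy data below are illustrative):

```python
# Sketch of the vocabulary construction: take the most frequent words of each
# corpus, merge them, and keep only entries found in a reference list of
# valid English words.
from collections import Counter

def build_vocabulary(corpora, valid_words, per_corpus_top=30000):
    vocab = set()
    for text in corpora:
        counts = Counter(text.lower().split())
        vocab.update(word for word, _ in counts.most_common(per_corpus_top))
    return sorted(w for w in vocab if w in valid_words)

corpora = ["the weather is good today", "the recognizer decodes speech qqqzx"]
valid = {"the", "weather", "is", "good", "today", "recognizer", "decodes", "speech"}
print(build_vocabulary(corpora, valid))   # "qqqzx" is pruned away
```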
The end result is a model with significantly fewer out of vocabulary words and lower perplexity on our test set than the Gigaword corpus.
Language model | OOVs [%] | Perplexity |
---|---|---|
Gigaword, (64k) | 9.40% | 458.5 |
Ensemble (65k) | 4.53% | 327.8 |
In order to perform recognition, we also need a phonetic dictionary. Of the 65k words in the ensemble language model, about 55k were already in the original CMU dictionary. The pronunciations for the remaining 10k words were (mostly) automatically synthesized with the CMU SPHINX g2p framework. While I was at it, I also applied the casing of the (conventional) dictionary to the (otherwise all uppercase) phonetic dictionary and language model. While a bit crude, this takes care of e.g., uppercasing names, languages, countries, etc.
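Transferring the casing is simple string work; a hypothetical sketch (with a toy stand-in for the real dictionary):

```python
# Sketch of transferring the casing of a conventional dictionary onto the
# otherwise all-uppercase word list (so "GERMANY" becomes "Germany", etc.).
cased_dictionary = {"germany": "Germany", "linux": "Linux", "weather": "weather"}

def restore_casing(uppercase_word):
    return cased_dictionary.get(uppercase_word.lower(), uppercase_word.lower())

print([restore_casing(w) for w in ["GERMANY", "WEATHER", "UNKNOWNWORD"]])
# -> ['Germany', 'weather', 'unknownword']
```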
So how does the language model we just created perform compared to the next best thing we tested? On the same test set, with the same acoustic model, we decreased the word error rate by almost 2 percentage points - a more than 5 percent relative improvement.
Acoustic model | Dictionary | Language model | WER |
---|---|---|---|
Voxforge 0.4 (cont) | cmudict 0.7 | Gigaword, 64k | 31.02 % |
Voxforge 0.4 (cont) | Ensemble 65k | Ensemble 65k | 29.31 % |
Open Source Dictation: Scoping out the Problem
Today I want to start with the first "process story" of creating a prototype of an open source dictation system.
Project scope
Given around a week's worth of time, I'll build a demonstrative prototype of a continuous speech recognition system for the task of dictating texts such as emails, chat messages or reports, using only open resources and technologies.
Dictation systems are usually developed for a target user group and then adapted to a single user (the one who'll be using the system). For this prototype, the target user group is "English speaking techies" and I myself will be the end user to whom the system will be adapted. The software to process and handle the recognition result will be Simon. Any additions or modifications to the software will be made public.
During the course of the project, I'll be referencing different data files and resources. Unless otherwise noted, those resources are available to the public under free licenses. If you need help to find them or would like more information (including any developed models), please contact me.
Evaluating existing models
I started by developing a sensible test case for the recognizer by selecting a total of 39 sentences of mixed complexity from various sources, including a review of "Man of Steel", a couple of news articles from CNN and Slashdot, and some blog posts right here on PlanetKDE. This, I feel, represents a nice cross-section of different writing styles and topics that is in line with what the target user group would probably intend to write.
I then recorded these sentences myself (speaking rather quickly and without pauses) and ran recognition tests with PocketSphinx and various existing acoustic and language models to see how they'd perform.
Specifically, I measured what is called the "word error rate" or "WER", which basically tells you the percentage of words the system got wrong when comparing the perfect (manual) transcription to the one created by the recognizer. You can find more information on Wikipedia. Lower WER is better.
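For reference, the WER is computed from a word-level edit distance between the reference transcript and the recognizer's hypothesis; a compact sketch:

```python
# Word error rate: (substitutions + deletions + insertions) / number of reference words,
# computed via a standard dynamic-programming edit distance over words.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("now I have a smartphone",
                      "now a daze I have a smartphone"))   # -> 0.4
```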
Acoustic model | Dictionary | Language model | WER |
---|---|---|---|
HUB4 (cont) | HUB4 (cmudict 0.6a) | HUB4 | 53.21 % |
HUB4 (cont) | cmudict 0.7 | Generic | 58.32 % |
HUB4 (cont) | HUB4 (cmudict 0.6a) | Gigaword, 64k | 49.62 % |
WSJ (cont) | HUB4 (cmudict 0.6a) | HUB4 | 42.81 % |
WSJ (cont) | cmudict 0.7 | Generic | 50.69 % |
WSJ (cont) | cmudict 0.7 | Gigaword, 64k | 41.07 % |
HUB4 (semi) | HUB4 (cmudict 0.6a) | HUB4 | 38.23 % |
HUB4 (semi) | cmudict 0.7 | Generic | 56.64 % |
HUB4 (semi) | cmudict 0.7 | Gigaword, 64k | 36.18 % |
Voxforge 0.4 (cont) | HUB4 (cmudict 0.6a) | HUB4 | 32.67 % |
Voxforge 0.4 (cont) | cmudict 0.7 | Generic | 42.5 % |
Voxforge 0.4 (cont) | cmudict 0.7 | Gigaword, 64k | 31.02 % |
So, what can we take away from these tests? Overall, the scores are fairly low, and any system based on those models would be almost unusable in practice. There are several reasons for this: firstly, I am not a native English speaker, so my accent definitely plays a role here. Secondly, many sentences I recorded for the test corpus are purposefully complex (e.g., "Together they reinvent the great granddaddy of funnybook strongmen as a struggling orphan whose destined for greater things.") to make the comparisons between different models more meaningful. And thirdly: the used models are nowhere near perfect.
For comparison, I also analyzed the results of Google's public speech recognition API which managed to score a surprisingly measly 32.72 % WER on the same test set. If you compare that with the values above, it actually performed worse than the best of the open source alternatives. I re-ran the test twice and I can only assume that either their public API is using a simplified model for computational reasons or that their system really doesn't like my accent.
All things considered then, 31.02 % WER for a speaker independent dictation task on a 64k word vocabulary is still a solid start and a huge win for the Voxforge model!
If you're a researcher trying to find the best acoustic model for your own decoding task, you should definitely do your own comparison; it's really easy and definitely worth your while.
Simon 0.4.1
Simon 0.4.1 was just released to the public and can now be downloaded from the Simon homepage.
This release includes only bug fixes, a full list of which can be found in the official Changelog.
Besides addressing some crucial and many minor problems related to, among other areas, network synchronization and the SPHINX model management backend, the released package also includes new and updated translations.
Results are in: One Open Source Dictation System Coming Up
Thanks to everyone who participated in the poll last week about what speech recognition project you'd most like to see.
The week is over, and Dictation has emerged as a clear winner!
As promised, I'll now try to build a proof-of-concept prototype of such a system in time for this year's Akademy.
With a system as complex as continuous dictation, there are obviously a wide range of challenges.
Here's just a few of the problems I'll need to tackle in the next two weeks:
- Acoustic model: The obvious elephant in the room: Any good speech recognition system needs an accurate representation of how it expects users to pronounce the words in its dictionary.
- Language model: "English" is simply not good enough - or when have you last tried to write "Huzzah!"? We not only need to restrict vocabulary to a sensible subset but also gain a pretty good understanding what a user might intend to write. This is not only important to avoid computationally prohibitive vocabulary sizes but to differentiate i.e. "@ home" and "at home".
- Dictation application: Even given a perfect speech recognition system, dictation is still a bit off. You also need software that handles the recognition result, applies formatting (casing, etc.) and allows users to correct recognition mistakes, change the structure, etc.
Obviously, I won't be able to solve all these issues in this short time frame but I'll do my very best to show off a presentable prototype that addresses all these areas. Watch this blog for updates over the coming weeks!
You pick, I work: Dictation, Assistant or Translator?
A little while ago, I mentioned that I'll be giving a talk about the current state of open source speech recognition at this year's Akademy.
As part of that talk, I want to show off a tech-demo of a moonshot use case of open source speech recognition to not only demonstrate what is already possible, but also show off the limits of the current state of the art.
So a couple of days ago, I asked what application of speech recognition technology would be most interesting for you, and many of you responded. I extracted the three options that broadly cover all suggestions: Dictation (like Dragon Naturally Speaking), a virtual assistant (like Siri) and simultaneous translation (like Star Trek's universal translator).
You now get to pick one of those three from the poll below.
After the poll closes (a week from now), I'll take the idea that received the most votes and devote about a week to building a prototype based on currently available open source language processing tools. This prototype will then be demonstrated at this year's Akademy.
What speech recognition application are you most looking forward to?
With the rising popularity of speech recognition in cars and mobile devices it's not hard to see that we're on the cusp of making speech recognition a first-class input method across our devices.
However, it shouldn't be forgotten that what we're seeing in our smart phones or laptops today is merely the beginning. I am convinced that we will see much more interesting applications of speech recognition technologies in the future.
So today, I wanted to ask: What application of speech recognition technology are you looking forward to the most?
For me personally, I honestly wouldn't know where to begin.
First, I'd probably go for a virtual assistant. Yes, there's Google Now and Siri already, but those are still obviously not as good as an actual assistant. Siri especially suffers from being constrained to the same interaction method as an actual assistant. A virtual assistant can arguably be of much greater value when it takes a more pro-active role, exploiting the vast amount of information it has access to, to become more like e.g., Iron Man's JARVIS.
Secondly, there is the domain of automatic, simultaneous translation, which I find fascinating. While early implementations already exist from industry greats like Microsoft and Google, there is obviously a lot of room to grow.
And of course from computer-aided memory of real-life conversations to finally understanding the announcements from PA systems in trains - everything is up for grabs.
So given an infinite budget: what speech recognition application would you pick? Please let me know what you think in the comments!
Coming up: Simon 0.4.1
I've been quietly fixing various bugs and annoyances since the release of Simon 0.4.0 and I think this warrants a small maintenance release before diving into new features for the next major Simon version.
So without much ado, I'd like to announce Simon 0.4.1 coming to a mirror near you on 24th of July, 2013.
Akademy 2013: A breakdown of FLOSS speech recognition efforts
I'm happy to announce that I'll once again be attending this year's Akademy!
In Bilbao, I'll be talking about the current status of open source speech recognition systems, why they're apparently all still "stuck" in the year 2000 and what we can do to change that. You can find more about my talk on the official schedule.
Simone, meet OVI and BlackBerry World
I'm proud to announce that the initial versions of Simone for the Nokia N9 and Simone for the BlackBerry PlayBook are now both available in their respective app store.
With Simone you can use the stellar audio hardware in your mobile devices as a microphone for your existing Simon installation. Both apps support push to talk as well as automatic voice activity detection.
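Automatic voice activity detection in its simplest form is just an energy threshold on incoming audio frames; the sketch below shows that basic idea (threshold and frame contents are invented, and Simone's actual implementation is part of its Qt/C++ code base):

```python
# Minimal energy-based voice activity detection sketch on 16-bit mono PCM frames.
import struct

def rms(frame_bytes):
    samples = struct.unpack("<%dh" % (len(frame_bytes) // 2), frame_bytes)
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def is_speech(frame_bytes, threshold=500):
    """Return True if the frame's energy exceeds the (illustrative) threshold."""
    return rms(frame_bytes) > threshold

quiet = struct.pack("<4h", 10, -12, 8, -9)            # synthetic silence
loud = struct.pack("<4h", 9000, -8500, 9200, -8800)   # synthetic speech
print(is_speech(quiet), is_speech(loud))               # -> False True
```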
As a small gimmick, the N9 app also has some basic voice commands built in. This was mainly included to show off the recognition without requiring users to first configure anything else, and works by making Simone connect to a public Simond server by default. Samples that are anonymously collected from users of the public server will be used to improve the recognition accuracy of free speech models (privacy policy).
Simon Gets a New Homepage
These days, it's rather hard to point someone interested in Simon to a website as most of the information is strewn across different sites of the KDE infrastructure. Especially for people outside of KDE, it's very hard to find e.g. the forum or the bug tracker.
With that in mind, I want to announce simon.kde.org, the new home for all things Simon.
It's a small landing page that gives users a short overview of the project and collects all the various resources on a single, easily sharable, website.
As always, feedback is appreciated.