Archive for October, 2009


Command recognition

Have you noticed something about speech recognition? It is not here. It is not on my laptop and probably not on yours. I hear it works reasonably well with Dragon NaturallySpeaking and in some games, but it requires training.

Worse, commanding your computer by voice is not here either. Vista has something, and Mac OS had something for a while. I doubt they work.

I would like free software that works, does not require speaker databases, and can command your computer. That is, you train words with command classes, and it can execute these. Shouldn’t be too complicated, should it? Saying “next” or “weiter” to get the next slide in a presentation would be useful. Or whistling. I believe it would also be helpful for impaired computer users.

So here is how I thought it could be done (and it looks good so far):

  1. make time slices (around 0.01 s long)
  2. an FFT of each slice gives a characteristic power curve P(f) at that time t

So for one word/command/sequence you have a P(t, f).
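The two steps above can be sketched as follows. This is only a minimal illustration, not the actual implementation: the function names `power_spectrum` and `spectrogram` are made up here, and a naive DFT stands in for a real FFT so that the example needs nothing beyond the Python standard library.

```python
import cmath
import math

def power_spectrum(frame):
    """Power at each non-negative frequency bin of one time slice (naive DFT)."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        powers.append(abs(coeff) ** 2)
    return powers

def spectrogram(samples, sample_rate, slice_seconds=0.01):
    """Cut the signal into consecutive slices and return P[t][f]."""
    size = int(sample_rate * slice_seconds)
    return [power_spectrum(samples[start:start + size])
            for start in range(0, len(samples) - size + 1, size)]

# Usage: a 1000 Hz tone sampled at 8000 Hz. Each 0.01 s slice has 80 samples,
# so bin k corresponds to k * 100 Hz and the peak should sit in bin 10.
rate = 8000
tone = [math.sin(2 * math.pi * 1000 * i / rate) for i in range(rate // 10)]
P = spectrogram(tone, rate)
peak_bin = max(range(len(P[0])), key=lambda k: P[0][k])
```

For real audio you would swap the naive DFT for a proper FFT routine, since the loop above is quadratic in the slice length.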

Next, you treat P(t, f) as an instance of a Gaussian distribution N(mean_freq, sigma_freq). From several recordings of the same command you can determine an experimental mean and sigma for P(t, f). This is the command template N(t, f, mean_freq, sigma_freq).
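One possible reading of this, sketched below, is to compute a mean and a standard deviation per (t, f) cell over several training recordings. That reading is an assumption, as are the name `build_template` and the `min_sigma` floor (added only to avoid dividing by zero later); the sketch also assumes all recordings have the same number of slices and bins.

```python
import math

def build_template(examples, min_sigma=1e-6):
    """Per-cell mean and standard deviation over several P[t][f] examples."""
    n = len(examples)
    slices, bins = len(examples[0]), len(examples[0][0])
    mean = [[sum(ex[t][f] for ex in examples) / n for f in range(bins)]
            for t in range(slices)]
    sigma = [[max(min_sigma,
                  math.sqrt(sum((ex[t][f] - mean[t][f]) ** 2
                                for ex in examples) / n))
              for f in range(bins)]
             for t in range(slices)]
    return mean, sigma

# Usage: two tiny "recordings" with 2 slices and 2 frequency bins each.
examples = [
    [[1.0, 0.0], [4.0, 2.0]],
    [[3.0, 0.0], [4.0, 6.0]],
]
mean, sigma = build_template(examples)
```

Cells that never vary across the recordings end up with the floor value as sigma, which makes them very strict; a larger floor would be more forgiving.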

Checking whether a curve fits can be done by multiplying the probabilities of P(t, f) occurring under N(t, f, mean_freq, sigma_freq) and requiring the product to stay above a lower limit.
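A sketch of that check, with a few liberties that are not part of the description above: it sums log-densities instead of multiplying probabilities (the product of many small numbers underflows quickly), it uses Gaussian densities as a stand-in for probabilities, and the names `log_gauss`, `slice_log_prob` and `matches` are invented for illustration.

```python
import math

def log_gauss(x, mean, sigma):
    """Log of the Gaussian density N(mean, sigma) at x."""
    return (-0.5 * ((x - mean) / sigma) ** 2
            - math.log(sigma * math.sqrt(2 * math.pi)))

def slice_log_prob(observed, mean, sigma):
    """Sum over frequency bins, i.e. the log of the product of densities."""
    return sum(log_gauss(x, m, s) for x, m, s in zip(observed, mean, sigma))

def matches(P, mean, sigma, log_limit):
    """Return (fits, score) for a whole P[t][f] against a template."""
    total = sum(slice_log_prob(P[t], mean[t], sigma[t])
                for t in range(len(mean)))
    return total >= log_limit, total

# Usage: a one-slice template with two bins.
mean = [[1.0, 0.0]]
sigma = [[0.5, 0.5]]
good_fit, good_score = matches([[1.0, 0.0]], mean, sigma, log_limit=-2.0)
bad_fit, bad_score = matches([[3.0, 0.0]], mean, sigma, log_limit=-2.0)
```

With a large sigma the squared term shrinks, so a deviation in that bin costs little, which is how high-variance frequencies end up mattering less.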

This method filters out noise because sigma will be very high at noisy frequencies, so deviations there barely lower the probability.

Tackling offsets/shifts

The starting point of a command is obviously unknown. So far, we can only detect the probability of the current curve fitting one slice of a command template.

So how can a command be detected? By continuously following the probabilities (i.e. following the potential command in time, comparing with each command template), multiplying probabilities, and dropping out those that fall below the probability limit.
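Here is a rough sketch of such a tracker. Every incoming slice may be the start of any command, so a fresh hypothesis is opened for each template on each slice, and existing hypotheses are advanced by one template slice. The class name `CommandTracker`, the log-score formulation and the per-slice limit (the allowed total grows with the number of slices consumed) are all assumptions made for this example.

```python
import math

def slice_log_prob(observed, mean, sigma):
    """Log of the product of Gaussian densities over all frequency bins."""
    return sum(-0.5 * ((x - m) / s) ** 2
               - math.log(s * math.sqrt(2 * math.pi))
               for x, m, s in zip(observed, mean, sigma))

class CommandTracker:
    def __init__(self, templates, log_limit_per_slice):
        self.templates = templates  # name -> (mean[t][f], sigma[t][f])
        self.limit = log_limit_per_slice
        self.active = []  # (name, next template slice, accumulated log score)

    def feed(self, observed):
        """Consume one time slice; return the commands completed by it."""
        candidates = self.active + [(name, 0, 0.0) for name in self.templates]
        self.active, detected = [], []
        for name, t, score in candidates:
            mean, sigma = self.templates[name]
            score += slice_log_prob(observed, mean[t], sigma[t])
            if score < self.limit * (t + 1):
                continue  # hypothesis fell below the limit, drop it
            if t + 1 == len(mean):
                detected.append(name)  # reached the end of the template
            else:
                self.active.append((name, t + 1, score))
        return detected

# Usage: a two-slice "next" command embedded in a stream of four slices.
templates = {"next": ([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]])}
tracker = CommandTracker(templates, log_limit_per_slice=-1.0)
stream = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
results = [tracker.feed(s) for s in stream]
```

The number of live hypotheses is bounded by the total number of template slices, so the cost per incoming slice stays constant.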

Tackling frequency shifts and frequency and time dilatation

Obviously, frequency shifts and variations in frequency and time can occur. This can be assumed to be of first or second order:

f_real = f_shift + f_orig + f_increase * t

Starting with f_shift = 0 and f_increase = 0, we can change these parameters (while staying within a certain limit) as long as the command-fitting probability improves.
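A greedy search of this kind might look like the sketch below. It is illustrative only: frequencies are measured in whole bins, the offset f_shift + f_increase * t is rounded to the nearest bin, bins shifted outside the spectrum are treated as zero power, and the names `shifted_log_prob` and `fit_shift` as well as the step sizes and limits are made up.

```python
import math

def shifted_log_prob(P, mean, sigma, f_shift, f_increase):
    """Score P against the template moved by f_shift + f_increase * t bins."""
    total = 0.0
    for t, row in enumerate(P):
        offset = round(f_shift + f_increase * t)
        for f_orig in range(len(mean[t])):
            f_real = f_orig + offset
            x = row[f_real] if 0 <= f_real < len(row) else 0.0
            m, s = mean[t][f_orig], sigma[t][f_orig]
            total += (-0.5 * ((x - m) / s) ** 2
                      - math.log(s * math.sqrt(2 * math.pi)))
    return total

def fit_shift(P, mean, sigma, max_shift=3, max_increase=1.0, step=0.5):
    """Change the parameters one step at a time while the score improves."""
    best = (0, 0.0)
    best_score = shifted_log_prob(P, mean, sigma, *best)
    improved = True
    while improved:
        improved = False
        shift, inc = best
        for cand in [(shift + 1, inc), (shift - 1, inc),
                     (shift, inc + step), (shift, inc - step)]:
            if abs(cand[0]) > max_shift or abs(cand[1]) > max_increase:
                continue  # stay within the allowed limits
            score = shifted_log_prob(P, mean, sigma, *cand)
            if score > best_score:
                best, best_score, improved = cand, score, True
    return best, best_score

# Usage: the template peaks at bin 2, the observation at bin 4.
mean = [[0.0, 0.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]] * 2
sigma = [[0.5] * 8] * 2
P = [[0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0]] * 2
best, best_score = fit_shift(P, mean, sigma)
```

A search like this can get stuck when the spectral peaks are very narrow, because a one-bin step then does not improve the score; smoothing the spectra first would help.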

The same can be done for time dilatation.
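For time, one simple option is to resample the slices with a stretch factor and then search over that factor just like over the frequency parameters. The sketch below shows only the resampling, using nearest-neighbour lookup; `stretch_time` and `factor` are names invented for this example.

```python
def stretch_time(P, factor):
    """Resample P[t][f] along time by nearest-neighbour lookup.

    factor > 1 makes the command longer (slower speech), factor < 1 shorter.
    """
    new_len = max(1, round(len(P) * factor))
    return [P[min(len(P) - 1, int(i / factor))] for i in range(new_len)]

# Usage: four slices with one frequency bin each.
P = [[1.0], [2.0], [3.0], [4.0]]
slow = stretch_time(P, 2.0)
fast = stretch_time(P, 0.5)
```

Interpolating between neighbouring slices would be smoother, at the cost of a little more code.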

This is very similar to squeezing and shifting a 3D-surface to compare it with another 3D-surface. The best fit still has some non-fitting areas. A weighted integral gives the difference and is compared to a maximal difference.
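As a rough sketch of such a weighted integral, the code below sums squared differences over all (t, f) cells and weights each cell by 1 / sigma². The choice of weights, the name `weighted_difference` and the value of `max_difference` are assumptions for illustration only.

```python
def weighted_difference(P, mean, sigma):
    """Weighted mean squared difference between P and the template surface.

    Cells are weighted by 1 / sigma**2, so unreliable cells count less.
    """
    num = den = 0.0
    for row, m_row, s_row in zip(P, mean, sigma):
        for x, m, s in zip(row, m_row, s_row):
            w = 1.0 / (s * s)
            num += w * (x - m) ** 2
            den += w
    return num / den

# Usage: bin 0 is reliable (small sigma), bin 1 is noisy (large sigma).
mean = [[1.0, 0.0]]
sigma = [[0.5, 2.0]]
max_difference = 1.0  # hypothetical acceptance limit
d_noisy_bin_off = weighted_difference([[1.0, 2.0]], mean, sigma)
d_reliable_bin_off = weighted_difference([[3.0, 0.0]], mean, sigma)
```

The same deviation of 2.0 is accepted in the noisy bin and rejected in the reliable one.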

Of course everything has to be normalized, etc.
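One example of such a normalization, shown here as an assumption and not as the scheme actually used: divide each slice by its total power, so that a quiet and a loud utterance of the same sound give the same curve. The name `normalize_slice` and the `floor` parameter are made up.

```python
def normalize_slice(powers, floor=1e-12):
    """Scale one power curve P(f) so that its bins sum to 1."""
    total = sum(powers)
    if total < floor:
        return [0.0] * len(powers)  # near-silent slice: nothing to scale
    return [p / total for p in powers]

# Usage: the same spectral shape at two different volumes.
quiet = normalize_slice([1.0, 3.0])
loud = normalize_slice([100.0, 300.0])
silent = normalize_slice([0.0, 0.0])
```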

How does speech recognition work today?

Recognizers used to use time variation, but then they went on to hidden Markov models. Phonemes are modeled and compared against example speakers for each language. More in the wiki article.

This is obviously a technically superior approach to just comparing frequencies, but at the moment speech recognition is not here. Systems like Sphinx are not easy to install and train.

I think people would prefer a

  • system they have to train themselves
  • that always works
  • but only in the scenarios/environments they need it (which it was trained for)

to

  • a system they can only partially train themselves,
  • that does not work reliably,
  • but is meant for all scenarios (general purpose)

The former is what I’m playing with; the latter describes the available systems.


Whitepaper

Whitepaper, also available in A4.
