Network-based Recognition
The speech recognition architecture that will win in a client/server,
distributed application world (including any telephone- or
internet-based speech recognition service) will have this kind of
structure:
The client does lightweight DSP tasks only: detecting the
presence/absence of speech, an FFT, mel-frequency energy bands, and a
cosine transform (resulting in the standard high-performance spectral
parameters for speech recognition, known as Mel Frequency Cepstral
Coefficients, or MFCCs), followed by 12-bit to 14-bit
vector-quantization-based compression, which reduces the data to 1200
to 1400 bits per second for sending through the network. This data is
derived from clean, high-bandwidth audio, which ensures maximum speech
recognition accuracy. That is possible because the client-side device
can contain a high-quality A/D converter.
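
The client-side steps above can be sketched roughly as follows. This is
an illustrative sketch only, not the actual front end: the sampling
rate, the number of mel bands, the number of cepstral coefficients, the
energy threshold, and every function name are assumptions, and a single
nearest-neighbour codebook is just one plausible reading of "12-bit
vector quantization" (a 12-bit index selects one of 4096 codewords).

```python
import cmath
import math

SAMPLE_RATE = 16000  # assumed; the text only says "high-bandwidth audio"
NUM_BANDS = 20       # assumed number of mel-frequency energy bands
NUM_CEPS = 12        # assumed number of cepstral coefficients per vector


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def is_speech(frame, threshold=1e-4):
    """Crude energy detector; the threshold is an illustrative assumption."""
    return sum(s * s for s in frame) / len(frame) > threshold


def power_spectrum(frame):
    """Hamming window, then a direct DFT (a real client would use an FFT)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))) ** 2
            for k in range(n // 2 + 1)]


def mel_band_energies(spectrum):
    """Log energies from triangular filters spaced evenly on the mel scale."""
    n_bins = len(spectrum)
    nyquist = SAMPLE_RATE / 2.0
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * i / (NUM_BANDS + 1)) for i in range(NUM_BANDS + 2)]
    bins = [e / nyquist * (n_bins - 1) for e in edges]
    energies = []
    for b in range(NUM_BANDS):
        lo, mid, hi = bins[b], bins[b + 1], bins[b + 2]
        total = 0.0
        for k in range(n_bins):
            if lo < k <= mid:
                total += spectrum[k] * (k - lo) / (mid - lo)
            elif mid < k < hi:
                total += spectrum[k] * (hi - k) / (hi - mid)
        energies.append(math.log(total + 1e-10))  # floor avoids log(0)
    return energies


def cepstrum(log_energies):
    """Cosine transform (DCT-II) of the log band energies, skipping c0."""
    n = len(log_energies)
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n)
                for j, e in enumerate(log_energies))
            for c in range(1, NUM_CEPS + 1)]


def quantize(vector, codebook):
    """Index of the nearest codeword (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(vector, codebook[i])))


def encode_frame(frame, codebook):
    """Return a codebook index for a speech frame, or None for silence."""
    if not is_speech(frame):
        return None
    return quantize(cepstrum(mel_band_energies(power_spectrum(frame))), codebook)
```

Calling encode_frame once per analysis frame yields one small integer
per frame, which is all the client needs to transmit. The codebook
would be trained offline and shared by client and server; how that is
done is outside this sketch.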
On the server side, the compressed data vectors are decompressed
and fed into the recognizer, which uses those spectral parameters to
determine the most likely word sequence.
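
As a rough illustration of the server-side decompression step, the
sketch below packs 12-bit codebook indices into bytes on the client and
turns them back into spectral vectors on the server by codebook lookup.
The wire format (big-endian bit packing, with the index count sent
alongside the payload) and all function names are assumptions made for
this example; the recognizer itself is not shown.

```python
def pack_indices(indices, bits=12):
    """Client side: pack fixed-width codebook indices into bytes."""
    acc, nbits, out = 0, 0, bytearray()
    for idx in indices:
        acc = (acc << bits) | idx
        nbits += bits
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the last byte
    return bytes(out)


def unpack_indices(payload, count, bits=12):
    """Server side: recover `count` fixed-width indices from the payload."""
    acc, nbits, out = 0, 0, []
    for byte in payload:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= bits and len(out) < count:
            nbits -= bits
            out.append((acc >> nbits) & ((1 << bits) - 1))
            acc &= (1 << nbits) - 1
    return out


def decompress(payload, count, codebook, bits=12):
    """Look each index up in the shared codebook to get spectral vectors."""
    return [codebook[i] for i in unpack_indices(payload, count, bits)]
```

The count is passed explicitly because the final byte may contain
padding bits; the resulting list of vectors is what would be fed into
the recognizer.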
This architecture has the following advantages:
- Lightweight client-side processing:
  - little memory is required on the client
  - integer-only CPUs are workable on the client
  - high-quality 16-bit A/Ds can be used on the client
- Extremely narrow bandwidth requirements: as low as 1200 bps!
- Server-side processing can be as complex as you want, for
  large-vocabulary dictation tasks and natural dialog systems
  conducting business transactions through speech.
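
To put the bandwidth figure in perspective, the arithmetic can be
sketched as below. The rate of 100 vectors per second is inferred from
the figures in the text (12 bits per vector giving 1200 bps); the 16 kHz
sampling rate used for the uncompressed-audio comparison is an
assumption, as are the function names.

```python
def vq_bit_rate(bits_per_vector, vectors_per_second=100):
    """Bit rate of the vector-quantized feature stream, in bits per second."""
    return bits_per_vector * vectors_per_second


def pcm_bit_rate(sample_rate_hz, bits_per_sample=16):
    """Bit rate of uncompressed single-channel PCM audio, in bits per second."""
    return sample_rate_hz * bits_per_sample


print(vq_bit_rate(12))                         # 1200
print(vq_bit_rate(14))                         # 1400
print(pcm_bit_rate(16000))                     # 256000
print(pcm_bit_rate(16000) // vq_bit_rate(12))  # 213
```

Under these assumptions, sending quantized features takes roughly 200
times less bandwidth than sending the raw 16-bit audio itself.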
This architecture is the one that will win for businesses providing
distributed recognition-based services. For further information, to
discuss your business or application, or to discuss licensing, please
contact info at sprex dot com.