Real-time Spoken Digit Classification (0-9)
A real-time audio digit classification system that recognizes spoken digits (0-9) from a live microphone stream. It offers several classification approaches, including a speech-to-text (STT) API, Fourier analysis, MFCC, and Mel spectrogram features, with performance benchmarking and inference time tracking.
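
The inference time tracking mentioned above can be approximated with a small timing helper. This is a minimal sketch, not the project's actual benchmarking code: `measure_latency_ms` and `dummy_predict` are hypothetical names, and the 8,000-sample input is only a placeholder for one recorded clip.

```python
import statistics
import time


def measure_latency_ms(predict, sample, runs=50, warmup=5):
    """Return (median, worst) latency of predict(sample) in milliseconds."""
    for _ in range(warmup):
        predict(sample)  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings), max(timings)


def dummy_predict(samples):
    # Hypothetical stand-in for a real classifier: always answers 0.
    return 0


median_ms, worst_ms = measure_latency_ms(dummy_predict, [0.0] * 8000)
```

Reporting the median rather than the mean keeps a single slow call (for example, a garbage-collection pause) from distorting the figure.
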
1. Click "Start Recording" and clearly say a digit (0-9)
2. Select a processing method from the cabinets below
3. Watch the real-time audio visualization
4. Compare inference times and accuracy across methods
5. Use the Robustness toggle to test recognition with added noise
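
One plausible way to implement a noise test like the Robustness toggle in step 5 is to mix white Gaussian noise into the recording at a chosen signal-to-noise ratio. The sketch below assumes that approach; the function name `add_noise`, the 10 dB setting, and the 8 kHz test tone are illustrative choices, not details taken from the project.

```python
import math
import random


def add_noise(samples, snr_db, seed=None):
    """Return a copy of samples with white Gaussian noise at the given SNR (dB)."""
    if not samples:
        raise ValueError("samples must not be empty")
    rng = random.Random(seed)
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


# Example: a 440 Hz tone, 1 second at an assumed 8 kHz sample rate.
clean = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
noisy = add_noise(clean, snr_db=10, seed=0)
```

Lower `snr_db` values mean louder noise relative to the speech, so sweeping it downward shows where each method starts to fail.
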
Mel Frequency Cepstral Coefficients
Mel Spectrogram Convolutional Neural Network
Raw Waveform Convolutional Neural Network
Whisper API (Disabled)

Mel Frequency Cepstral Coefficients

| Property | Value |
|---|---|
| Architecture | 13 MFCC → Dense(128) → Dense(64) → Dense(10) |
| Test Accuracy | 98.52% |
| Validation Accuracy | 97.89% |
| Parameters | 10,314 |
| Training Time | 3.2 minutes |
| Inference Time | ~1-2 ms |
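
The 13 MFCC inputs in the table above are conventionally obtained by applying a discrete cosine transform to log mel-band energies and keeping the first few coefficients. The sketch below shows only that last step, using an unnormalized DCT-II. The 26 band energies are made-up values, and the project's actual feature extraction (library, normalization, band count) is not stated in the source.

```python
import math


def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II, keeping the first n_coeffs coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n_coeffs)
    ]


# Made-up log mel-band energies for one audio frame (26 bands assumed).
log_mel_energies = [math.log(1.0 + band) for band in range(26)]
mfcc = dct_ii(log_mel_energies, 13)
```

Keeping only the low-order coefficients retains the smooth spectral envelope and discards fine detail, which is why such a small input vector can feed a compact dense network.
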

Mel Spectrogram Convolutional Neural Network

| Property | Value |
|---|---|
| Architecture | 2D CNN → MaxPool → Dense(128) → Dense(10) |
| Test Accuracy | 97.22% |
| Validation Accuracy | 96.45% |
| Parameters | 45,782 |
| Training Time | 8.7 minutes |
| Inference Time | ~3-5 ms |
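
A Mel spectrogram groups FFT bins into bands that are evenly spaced on the mel scale instead of in hertz. The sketch below uses one common conversion formula (the HTK-style one); audio libraries differ in which variant they default to, and the 40 bands and 4 kHz upper limit are assumptions chosen for illustration.

```python
import math


def hz_to_mel(hz):
    """HTK-style conversion from hertz to mels."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands, f_min, f_max):
    """Centre frequencies (Hz) of n_bands bands equally spaced in mels."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]


centers = mel_band_centers(40, 0.0, 4000.0)
```

The resulting centres are densely packed at low frequencies and spread out at high ones, which is what lets a Mel representation stay much smaller than a linear-frequency spectrogram.
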

Raw Waveform Convolutional Neural Network

| Property | Value |
|---|---|
| Architecture | 1D CNN → Conv1D → GlobalMaxPool → Dense(10) |
| Test Accuracy | 91.30% |
| Validation Accuracy | 89.67% |
| Parameters | 28,954 |
| Training Time | 12.1 minutes |
| Inference Time | ~5-8 ms |
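
To make the Conv1D → GlobalMaxPool part of the architecture above concrete, here is a toy, framework-free sketch of a single filter sliding over a waveform followed by global max pooling. The kernel and waveform values are invented, and ReLU is an assumed activation; the real model learns many filters and its layer settings are not given in the source.

```python
def conv1d(signal, kernel, stride=1):
    """Valid (no padding) 1D cross-correlation, as used in CNN layers."""
    k = len(kernel)
    return [
        sum(signal[start + j] * kernel[j] for j in range(k))
        for start in range(0, len(signal) - k + 1, stride)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def global_max_pool(values):
    """Collapse a feature map of any length into one number."""
    return max(values)


waveform = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
rising_edge_kernel = [-1.0, 0.0, 1.0]  # responds to rising segments
feature_map = relu(conv1d(waveform, rising_edge_kernel))
feature = global_max_pool(feature_map)
```

Because global max pooling returns one value per filter regardless of input length, the final dense layer sees a fixed-size vector even when recordings differ in duration.
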

Whisper API (Disabled)

| Property | Value |
|---|---|
| Model | Whisper (Hugging Face) |
| Parameters | ~39M (External) |
| Language | English Speech Recognition |
| Connection | HTTPS API Call |
| Latency | ~1-3 seconds |
| Test Accuracy | Variable (Network dependent) |
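
A general speech recognition model returns free text rather than a class label, so a digit classifier built on it needs a step that maps the transcript to 0-9. The sketch below is one hypothetical way to do that; `transcript_to_digit` is not a function from the project, and the example transcripts are invented rather than real API responses.

```python
import re

WORD_TO_DIGIT = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
}


def transcript_to_digit(text):
    """Return the first single digit found in a transcript, or None."""
    for token in re.findall(r"[a-z]+|\d+", text.lower()):
        if token.isdigit():
            if len(token) == 1:
                return int(token)
            continue  # multi-digit numbers such as "10" are out of range
        if token in WORD_TO_DIGIT:
            return WORD_TO_DIGIT[token]
    return None
```

Returning `None` for unrecognized output lets the caller count such cases as errors instead of silently guessing a digit.
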

STFT Spectrogram CNN (Not Implemented)

| Property | Value |
|---|---|
| Architecture | STFT → 2D CNN → Dense Layers |
| Status | Not Implemented |
| Reason | High Dimensionality |
| Issue | Memory intensive (~64k features) |
| Alternative | Mel-scale features used instead |
| Estimated Performance | Similar to Mel CNN but slower |
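
The dimensionality concern in the table above can be checked with simple arithmetic. The source does not state the STFT settings, so the values below (1 second of audio at 16 kHz, a 1024-point FFT, a hop of 128 samples, 40 mel bands) are assumptions picked because they land near the quoted ~64k figure.

```python
def frame_count(n_samples, hop_length):
    # Assumes centred framing: one frame per hop, plus one.
    return 1 + n_samples // hop_length


def stft_feature_count(n_samples, n_fft, hop_length):
    # A real-input FFT of size n_fft yields n_fft // 2 + 1 frequency bins.
    return (n_fft // 2 + 1) * frame_count(n_samples, hop_length)


def mel_feature_count(n_samples, n_mels, hop_length):
    return n_mels * frame_count(n_samples, hop_length)


n_samples = 16000  # assumed: 1 s at 16 kHz
stft_size = stft_feature_count(n_samples, n_fft=1024, hop_length=128)  # 513 * 126
mel_size = mel_feature_count(n_samples, n_mels=40, hop_length=128)     # 40 * 126
```

Under these assumptions the raw STFT input is 64,638 values per clip versus 5,040 for the Mel representation, roughly a 13-fold reduction.
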

MFCC Support Vector Machine (Not Implemented)

| Property | Value |
|---|---|
| Architecture | 13 MFCC → Support Vector Machine |
| Status | Not Implemented |
| Estimated Accuracy | 85-90% |
| Advantages | Lightweight, Fast Training |
| Parameters | ~1,000 support vectors |
| Inference Time | ~0.5 ms (Estimated) |
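
The link between the support vector count and inference time follows from how a kernel SVM scores an input: one kernel evaluation per support vector. The sketch below shows a binary decision function with an RBF kernel. The kernel choice is an assumption (the source names none), the two support vectors are toy values, and a 10-digit classifier would combine several such binary decisions.

```python
import math


def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def decision_value(x, support_vectors, dual_coefs, bias, gamma):
    """Signed score; cost grows linearly with the number of support vectors."""
    return bias + sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )


# Toy model: one support vector per class, 2 features instead of 13.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
dual_coefs = [1.0, -1.0]
score_near_first = decision_value([0.1, 0.0], support_vectors, dual_coefs, 0.0, 0.5)
score_near_second = decision_value([2.0, 1.9], support_vectors, dual_coefs, 0.0, 0.5)
```

With about 1,000 support vectors and 13 MFCC features, each prediction costs on the order of 13,000 multiply-adds per binary decision, which is consistent with a sub-millisecond estimate on typical hardware.
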