Streaming Digit Classifier (Lightweight)

Real-time Spoken Digit Classification (0-9)

A real-time audio digit classification system that recognizes spoken digits (0-9) from live microphone streaming. It offers several classification approaches, including an external speech-to-text API, Fourier analysis, MFCC, and Mel features, with performance benchmarking and inference-time tracking.

Backend API: https://paranoiid-streaming-digit-classifier.hf.space

How to Use

1. Click "Start Recording" and clearly say a digit (0-9)

2. Select a processing method from the cards below

3. Watch the real-time audio visualization

4. Compare inference times and accuracy across methods

5. Use the Robustness toggle to test with noise
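
The page does not say how the Robustness setting injects noise. As a minimal sketch, assuming it adds zero-mean Gaussian noise whose standard deviation equals the slider value, the effect on a clip could be approximated as follows. The names `add_noise` and `snr_db` are hypothetical helpers, not part of the project.

```python
import math
import random

def add_noise(samples, noise_level, seed=None):
    """Return a copy of `samples` with Gaussian noise added, clipped to [-1, 1].

    Assumption: `noise_level` is the noise standard deviation on a
    [-1, 1] amplitude scale; 0.0 leaves the signal unchanged.
    """
    rng = random.Random(seed)
    noisy = []
    for s in samples:
        n = rng.gauss(0.0, noise_level) if noise_level > 0 else 0.0
        noisy.append(max(-1.0, min(1.0, s + n)))
    return noisy

def snr_db(clean, noisy):
    """Signal-to-noise ratio in dB between a clean clip and its noisy copy."""
    signal_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum((n - s) ** 2 for s, n in zip(clean, noisy)) / len(clean)
    if noise_power == 0:
        return math.inf
    return 10 * math.log10(signal_power / noise_power)

# Stand-in for a recording: 0.1 s of a 440 Hz tone sampled at 8 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 8000) for t in range(800)]
noisy_tone = add_noise(tone, noise_level=0.05, seed=0)
print(f"SNR: {snr_db(tone, noisy_tone):.1f} dB")
```

Feeding the same clip to each pipeline at increasing noise levels is one way to compare how quickly their accuracy degrades.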

Select Processing Pipeline

Best Performance

MFCC + Dense NN

Mel Frequency Cepstral Coefficients

Testing Acc: 98.52%

Mel CNN (2D)

Mel Spectrogram Convolutional Neural Network

Testing Acc: 97.22%

Raw CNN (1D)

Raw Waveform Convolutional Neural Network

Testing Acc: 91.30%

External API

Whisper API (Disabled)

Deployment: Frontend: GitHub Pages • Backend API: Hugging Face Spaces • ML Models: PyTorch CPU

Model Architecture & Training Metrics

MFCC + Dense NN

Architecture: 13 MFCC → Dense(128) → Dense(64) → Dense(10)
Test Accuracy: 98.52%
Validation Accuracy: 97.89%
Parameters: 10,314
Training Time: 3.2 minutes
Inference Time: ~1-2ms
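
To show how a parameter count follows from a fully connected architecture, here is a minimal sketch that counts weights and biases for a plain 13 → 128 → 64 → 10 stack. The helper `dense_param_count` is hypothetical. Under this reading the count is 10,698 with biases and 10,496 without, close to but not equal to the 10,314 listed above, so the deployed model presumably differs in some detail not shown on this page.

```python
def dense_param_count(layer_sizes, bias=True):
    """Count trainable parameters in a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out          # weight matrix
        if bias:
            total += n_out             # one bias per output unit
    return total

# Layer widths as listed in the architecture line (assumed plain dense stack).
sizes = [13, 128, 64, 10]
print(dense_param_count(sizes))              # 10698
print(dense_param_count(sizes, bias=False))  # 10496
```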

Mel CNN (2D)

Architecture: 2D CNN → MaxPool → Dense(128) → Dense(10)
Test Accuracy: 97.22%
Validation Accuracy: 96.45%
Parameters: 45,782
Training Time: 8.7 minutes
Inference Time: ~3-5ms
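
Mel features warp the frequency axis so that bands are narrow at low frequencies and wide at high ones. The sketch below uses the widely used HTK-style mel formula to place band centers. The band count and frequency range are illustrative assumptions, since the page does not state the settings this model uses.

```python
import math

def hz_to_mel(hz):
    """HTK-style mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centers(n_bands, f_min, f_max):
    """Center frequencies (Hz) of n_bands bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]

# Illustrative values only: 8 bands between 0 Hz and 4 kHz.
centers = mel_band_centers(n_bands=8, f_min=0.0, f_max=4000.0)
print([round(c) for c in centers])
```

The printed centers are spaced more tightly at the low end, which is where much of the energy that distinguishes spoken digits lies.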

Raw CNN (1D)

Architecture: 1D CNN → Conv1D → GlobalMaxPool → Dense(10)
Test Accuracy: 91.30%
Validation Accuracy: 89.67%
Parameters: 28,954
Training Time: 12.1 minutes
Inference Time: ~5-8ms
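
The Conv1D → GlobalMaxPool part of this architecture can be illustrated in a few lines of plain Python. This is a toy sketch with made-up kernels, not the trained model. It shows why global max pooling yields one value per filter regardless of clip length, which lets the final Dense(10) layer accept recordings of different durations.

```python
def conv1d(signal, kernel, stride=1):
    """Valid (unpadded) 1D cross-correlation, as CNN 'convolution' layers compute it."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]

def relu(xs):
    return [max(0.0, x) for x in xs]

def global_max_pool(feature_maps):
    """Collapse each feature map to its single largest activation."""
    return [max(fm) for fm in feature_maps]

# Hypothetical kernels; a trained model would learn these.
kernels = [[1.0, -1.0], [0.5, 0.5, 0.5]]

short_clip = [0.0, 1.0, 0.0, -1.0, 0.0]
long_clip = short_clip + [0.0] * 5  # same content followed by silence

for clip in (short_clip, long_clip):
    maps = [relu(conv1d(clip, k)) for k in kernels]
    print(len(maps[0]), global_max_pool(maps))
# 4 [1.0, 0.5]
# 9 [1.0, 0.5]
```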

External API

Model: Whisper (Hugging Face)
Parameters: ~39M (External)
Language: English Speech Recognition
Connection: HTTPS API Call
Latency: ~1-3 seconds
Test Accuracy: Variable (Network dependent)

Raw Spectrogram (Dropped)

Architecture: STFT → 2D CNN → Dense Layers
Status: Not Implemented
Reason: High Dimensionality
Issue: Memory intensive (~64k features)
Alternative: Mel-scale features used instead
Estimated Performance: Similar to Mel CNN but slower
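
To make the dimensionality argument concrete, the sketch below computes the size of an STFT magnitude spectrogram and of a mel spectrogram over the same frames. All settings (1 s at 16 kHz, `n_fft=512`, `hop_length=64`, 40 mel bands) are hypothetical, chosen only because they land near the ~64k figure quoted above. The page does not state the actual configuration.

```python
def stft_shape(n_samples, n_fft, hop_length, center=True):
    """Return (frequency bins, time frames) of an STFT magnitude spectrogram.

    With center=True the signal is assumed padded by n_fft // 2 on each side,
    the convention of common centered-STFT implementations.
    """
    n_bins = n_fft // 2 + 1
    if center:
        n_frames = 1 + n_samples // hop_length
    else:
        n_frames = 1 + (n_samples - n_fft) // hop_length
    return n_bins, n_frames

# Hypothetical settings: 1 second of audio at 16 kHz.
n_samples = 16000
bins, frames = stft_shape(n_samples, n_fft=512, hop_length=64)
print(bins, frames, bins * frames)   # 257 251 64507

# The same frames reduced to an assumed 40 mel bands.
n_mels = 40
print(n_mels * frames)               # 10040
```

Under these assumptions the mel representation is roughly six times smaller than the raw spectrogram, which is consistent with the reason given for dropping the latter.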

MFCC + SVM (Alternative)

Architecture: 13 MFCC → Support Vector Machine
Status: Not Implemented
Estimated Accuracy: 85-90%
Advantages: Lightweight, Fast Training
Parameters: ~1000 Support Vectors
Inference Time: ~0.5ms (Estimated)

Performance Monitor

Total Predictions
Fastest Method
Session Time (mm:ss)
API Latency (ms)
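
The counters in this panel could be maintained with a small tracker like the one sketched here. `PerformanceMonitor` and its methods are hypothetical names, and the latencies recorded in the usage lines are illustrative values taken from within the ranges listed earlier, not measurements.

```python
import time

class PerformanceMonitor:
    """Tracks prediction count, per-method latency, and session duration."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.total_predictions = 0
        self._latencies_ms = {}  # method name -> list of inference times

    def record(self, method, inference_ms):
        """Log one prediction and its inference time in milliseconds."""
        self.total_predictions += 1
        self._latencies_ms.setdefault(method, []).append(inference_ms)

    def fastest_method(self):
        """Method with the lowest mean inference time, or None if empty."""
        if not self._latencies_ms:
            return None
        return min(
            self._latencies_ms,
            key=lambda m: sum(self._latencies_ms[m]) / len(self._latencies_ms[m]),
        )

    def session_time(self):
        """Elapsed session time formatted as mm:ss."""
        elapsed = int(self._clock() - self._start)
        return f"{elapsed // 60:02d}:{elapsed % 60:02d}"

monitor = PerformanceMonitor()
monitor.record("MFCC + Dense NN", 1.5)
monitor.record("Mel CNN (2D)", 4.0)
monitor.record("Raw CNN (1D)", 6.5)
print(monitor.total_predictions, monitor.fastest_method(), monitor.session_time())
```

Using a monotonic clock avoids negative or jumping session times if the system clock is adjusted mid-session.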