ASLive: A Real-Time American Sign Language Recognition System
Using MediaPipe and Random Forest Classification
Aayan Abhay Patil and Hrishikesh Vivek Khandade
Lab of Future
Dubai, UAE
 asliveapp08@gmail.com

ABSTRACT

This paper presents ASLive, a real-time American Sign Language (ASL) recognition system built using computer vision and machine learning techniques. The system leverages Google's MediaPipe Hands framework for skeletal landmark extraction, combined with a Random Forest classifier trained on user-collected gesture data to classify hand signs into alphanumeric characters. The implementation incorporates text-to-speech synthesis, session-based sentence construction, and an optimized inference pipeline designed for low-latency performance on standard consumer hardware. Experimental results demonstrate near-real-time recognition at approximately 30 FPS with high per-class accuracy on a custom-recorded ASL fingerspelling dataset. This paper outlines the system architecture, methodology, model training, performance optimizations, real-world applications, and future research directions.

Keywords: American Sign Language, hand gesture recognition, MediaPipe, Random Forest, computer vision, accessibility, text-to-speech.


INTRODUCTION

Communication is a fundamental human need. For the approximately 70 million deaf individuals worldwide, sign language serves as a primary mode of communication. American Sign Language (ASL) is the predominant sign language used in the United States and Canada, with an estimated 500,000 signers. Despite its prevalence, the gap between ASL users and the hearing population remains a significant accessibility challenge.

Traditional solutions involve human interpreters, which are costly and unavailable in many everyday contexts. Automated sign language recognition (SLR) systems powered by computer vision offer a scalable, low-cost alternative capable of operating in real time without specialized hardware.

ASLive was developed to bridge this gap using widely available tools: a standard webcam, Google's MediaPipe Hands library, and a lightweight Random Forest classifier. The design philosophy prioritizes accessibility, speed, and user customizability — enabling users to record and train new gestures without requiring deep learning expertise.

Primary Contributions

The primary contributions of this work are: (1) a complete, real-time ASL fingerspelling recognition pipeline running on CPU-only consumer hardware; (2) an interactive in-session data collection and retraining workflow; (3) a temporal consistency filter that substantially reduces false-positive character insertions; and (4) a non-blocking text-to-speech integration for immediate audio feedback.

BACKGROUND AND RELATED WORK

Sign Language Recognition

Sign language recognition has been an active research area for over three decades. Early approaches relied on data gloves and inertial measurement units (IMUs). Depth cameras, such as the Microsoft Kinect, later provided 3D spatial data without physical sensors. Deep learning methods using convolutional neural networks (CNNs) and recurrent neural networks (RNNs) now dominate SLR benchmarks, but demand large labeled datasets and significant computational resources inaccessible to many researchers.

MediaPipe Hands

Google's MediaPipe Hands infers 21 3D landmarks of the hand from a single RGB frame using a two-stage pipeline: a palm detector followed by a hand landmark model. It delivers real-time performance on mobile and desktop platforms. The Lite variant (model_complexity=0) reduces inference time while retaining sufficient accuracy for gesture classification tasks.

Random Forest Classification

Random Forest is an ensemble learning method that constructs multiple decision trees and aggregates their predictions by majority vote. It handles high-dimensional feature spaces effectively, is robust to overfitting, and provides per-class probability estimates. For the structured 63-dimensional landmark feature vectors used in this work, Random Forest achieves competitive accuracy with sub-second training times.


SYSTEM ARCHITECTURE

ASLive operates as a real-time pipeline with five core stages: video capture, hand detection and landmark extraction, feature normalization, gesture classification, and output rendering. The system is implemented entirely in Python (~120 lines) and runs in a single process, with multi-threading used only for text-to-speech synthesis.

Video Capture

Video is captured from the default system camera using OpenCV's VideoCapture API, forced to 640×480 resolution to minimize per-frame processing overhead. Each frame is horizontally flipped to provide a natural mirror-view experience consistent with user interaction expectations.

Landmark Extraction

The MediaPipe Hands model is initialized in Lite mode with a minimum detection and tracking confidence of 0.5. For each detected hand, it outputs 21 landmarks with normalized (x, y, z) coordinates. Hand skeleton connections are drawn on every frame for visual feedback, regardless of the frame-skip state.

Feature Normalization

Raw landmark coordinates depend on the hand's absolute position in the frame, making them unsuitable as direct classifier inputs. Normalization is achieved by computing offsets relative to Landmark 0 (the wrist joint), yielding a translation-invariant 63-dimensional feature vector:

fᵢ = (xᵢ − x₀, yᵢ − y₀, zᵢ − z₀), i = 0, 1, ..., 20

where f₀ = (0, 0, 0) by construction, so all 21 offset triplets (63 values) are retained as classifier input.
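The wrist-relative normalization can be sketched as follows; the landmark list stands in for MediaPipe's 21 (x, y, z) outputs, and the function name and structure are illustrative rather than the authors' actual code.

```python
def normalize_landmarks(landmarks):
    """Convert 21 (x, y, z) landmarks into a 63-dim wrist-relative vector."""
    x0, y0, z0 = landmarks[0]  # Landmark 0 is the wrist joint
    features = []
    for (x, y, z) in landmarks:
        features.extend([x - x0, y - y0, z - z0])
    return features

# Shifting the whole hand by a constant offset leaves the vector unchanged,
# which is exactly the translation invariance the normalization provides.
hand = [(0.5, 0.5, 0.0), (0.75, 0.25, -0.125)] + [(0.5, 0.25, 0.0)] * 19
shifted = [(x + 0.25, y - 0.25, z) for (x, y, z) in hand]
```

Because the wrist maps to (0, 0, 0), the first three features carry no information, but keeping them preserves the simple 21 × 3 layout.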

Frame-Skip Optimization

MediaPipe inference and Random Forest prediction are executed only on every second frame (FRAME_SKIP=2). On skipped frames, the previous inference results are reused and redisplayed. This halves the AI computational load while maintaining visual responsiveness, as gesture states persist across consecutive frames.
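The frame-skip pattern can be illustrated with a small wrapper; FRAME_SKIP matches the paper's setting, while the function names are stand-ins, not the ASLive source.

```python
FRAME_SKIP = 2

def make_skipper(infer):
    """Wrap an inference function so it only runs on every FRAME_SKIP-th frame."""
    state = {"count": 0, "last": None}
    def step(frame):
        if state["count"] % FRAME_SKIP == 0:
            state["last"] = infer(frame)   # fresh inference on this frame
        state["count"] += 1
        return state["last"]               # cached result on skipped frames
    return step

calls = []
step = make_skipper(lambda f: calls.append(f) or f)
results = [step(i) for i in range(6)]
# inference ran on frames 0, 2, 4 only; frames 1, 3, 5 reused the cache
```

Rendering still happens on every frame; only the expensive detection and classification steps are gated.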


DATA COLLECTION AND MODEL TRAINING

Interactive Data Collection

ASLive uses an in-session data collection workflow. The user presses 'N' to assign a gesture label and 'R' to toggle recording. While recording, one CSV row is appended per processed frame: [label, f₁, …, f₆₃]. This enables rapid vocabulary bootstrapping — functional recognition is achievable with as few as 20–50 samples per class. The system is not limited to ASL; any repeatable hand gesture can be labeled and trained.
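The per-frame logging described above amounts to appending one labeled row per processed frame; this sketch uses an in-memory buffer in place of the on-disk data file, and the label strings are illustrative.

```python
import csv
import io

def append_sample(writer, label, features):
    """Write one training row: [label, f1, ..., f63]."""
    writer.writerow([label] + list(features))

buf = io.StringIO()          # stands in for the CSV data file
w = csv.writer(buf)
append_sample(w, "A", [0.0] * 63)
append_sample(w, "SPACE", [0.1] * 63)
rows = buf.getvalue().strip().splitlines()
```

Each row is self-describing (label first, then the 63 features), so the file can be re-read and retrained at any time without a separate schema.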

Random Forest Training

The classifier is trained using scikit-learn's RandomForestClassifier with n_estimators=30 and n_jobs=−1. Training is triggered automatically at startup if the data file contains at least five samples. On a modern multi-core processor, training on several hundred samples completes in well under one second.
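A minimal sketch of the training step under the stated settings (n_estimators=30, n_jobs=-1); the two-class synthetic data is purely illustrative, standing in for the user-recorded 63-dimensional feature rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# 40 synthetic 63-dim samples: class "A" clustered near 0, class "B" near 1
X = np.vstack([rng.normal(0.0, 0.05, (20, 63)),
               rng.normal(1.0, 0.05, (20, 63))])
y = ["A"] * 20 + ["B"] * 20

clf = RandomForestClassifier(n_estimators=30, n_jobs=-1, random_state=0)
clf.fit(X, y)

# predict_proba supplies the per-class confidence used later for thresholding
probs = clf.predict_proba(X[:1])[0]
pred = clf.classes_[probs.argmax()]
```

The per-class probabilities, not just the argmax label, are what make the confidence-threshold filter in the next section possible.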

Confidence Thresholding and Temporal Filtering

Two complementary mechanisms reduce false positives. First, a confidence threshold (τ = 0.96) filters predictions: only predictions where the highest class probability exceeds τ are acted upon. Second, a temporal history buffer (deque of length N=4) requires four consecutive identical predictions before confirming a gesture. A 1.0-second cooldown between accepted characters prevents repeated insertion from sustained poses.
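The two mechanisms compose as follows; the constants match the paper (τ = 0.96, N = 4, 1.0 s cooldown), while the class name is illustrative and time is passed in explicitly so the logic is deterministic.

```python
from collections import deque

TAU, N, COOLDOWN = 0.96, 4, 1.0

class GestureFilter:
    def __init__(self):
        self.history = deque(maxlen=N)
        self.last_accept = -COOLDOWN   # allow the very first acceptance

    def update(self, label, prob, now):
        """Return the label only once it is confirmed, else None."""
        if prob < TAU:                 # confidence gate: ignore weak predictions
            self.history.clear()
            return None
        self.history.append(label)
        stable = len(self.history) == N and len(set(self.history)) == 1
        if stable and now - self.last_accept >= COOLDOWN:
            self.last_accept = now     # start the cooldown window
            self.history.clear()
            return label
        return None

f = GestureFilter()
out = [f.update("A", 0.99, t * 0.1) for t in range(4)]
# only the 4th consecutive confident "A" is accepted
```

Clearing the history on any low-confidence frame means a single noisy prediction restarts the count, which is what suppresses transient misclassifications during hand motion.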


OUTPUT AND USER INTERACTION

Sentence Construction

Confirmed gestures are appended to a running sentence string displayed at the bottom of the video feed. Special labels enable editing: SPACE appends a whitespace character; DELETE removes the last character; and the keyboard shortcut 'C' clears the entire sentence. This allows users to construct arbitrary text through sign language input alone.
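The special-label handling reduces to a small dispatch on the confirmed label; the label strings match the paper, while the function itself is a sketch.

```python
def apply_gesture(sentence, label):
    """Apply a confirmed gesture label to the running sentence string."""
    if label == "SPACE":
        return sentence + " "
    if label == "DELETE":
        return sentence[:-1]          # drop the last character
    return sentence + label           # ordinary character, e.g. "A"

s = ""
for g in ["H", "I", "DELETE", "I", "SPACE"]:
    s = apply_gesture(s, g)
```

Because DELETE is itself a trained gesture rather than a keyboard action, the entire editing loop remains hands-on-camera.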

Text-to-Speech Synthesis

Each confirmed gesture triggers a non-blocking text-to-speech announcement via pyttsx3, running in a daemon thread to ensure the main video loop is never blocked. The speech rate is set to 180 WPM for clear, rapid feedback. A deduplication check prevents repeated announcements when the hand remains stationary between characters.
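The daemon-thread pattern can be sketched with a queue-fed worker; the speak() stand-in replaces the pyttsx3 engine call so the example runs anywhere, and the queue-based structure is one reasonable realization of the non-blocking design, not necessarily the authors' exact one.

```python
import queue
import threading

spoken = []
def speak(text):
    spoken.append(text)               # placeholder for engine.say()/runAndWait()

q = queue.Queue()
def tts_worker():
    while True:
        text = q.get()
        if text is None:              # sentinel: shut the worker down
            break
        speak(text)                   # the blocking call lives off the main thread

# daemon=True ensures a stuck TTS backend can never prevent process exit
t = threading.Thread(target=tts_worker, daemon=True)
t.start()
q.put("A")
q.put("HELLO")
q.put(None)
t.join(timeout=2.0)
```

The main video loop only ever pays the cost of a queue put, so frame timing is unaffected by speech synthesis latency.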

User Interface

The UI is rendered as OpenCV overlays directly on the video frame. A black header bar displays the current AI prediction and confidence percentage. A dark footer bar shows the running sentence. A status indicator distinguishes between live recognition mode and active recording mode for a specific gesture label.


RESULTS AND PERFORMANCE

The system achieves approximately 30 FPS on a mid-range laptop CPU (Intel Core i5, 8 GB RAM) with FRAME_SKIP=2. With a representative dataset of approximately 50 samples per class across 28 classes (A–Z, SPACE, DELETE), the Random Forest achieves high per-class accuracy under controlled lighting. Table I summarizes the key system parameters.

TABLE I — KEY SYSTEM PARAMETERS

Parameter                       Value
Feature vector dimension        63 (21 landmarks × 3 offsets)
Classifier                      Random Forest (n_estimators=30)
Confidence threshold τ          0.96
Temporal buffer N               4 frames
Character cooldown              1.0 second
Frame skip                      Every 2nd frame
Input resolution                640 × 480 px
Inference device                CPU only
TTS speech rate                 180 WPM



REAL-WORLD APPLICATIONS

Accessibility Communication Aid

The most immediate application is as a real-time communication tool for deaf and hard-of-hearing individuals in settings where a human interpreter is unavailable — medical appointments, customer service counters, or administrative offices. The system can run on a standard laptop, allowing a signing user to communicate with a hearing interlocutor via transcribed text or TTS audio output.

Educational Tool for ASL Learners

ASLive serves as an interactive learning aid for individuals studying ASL. The confidence percentage displayed in the UI provides quantitative feedback on sign precision, enabling learners to identify and correct errors without requiring an instructor.

Smart Home and IoT Control

Custom gestures can be mapped to home automation commands such as 'lights on' or 'volume up.' This is particularly valuable in hygienic environments (hospitals, clean rooms) or for users with motor disabilities who cannot operate physical controls.

Healthcare and Rehabilitation

In occupational therapy, tracking hand landmark configurations can support assessment of motor recovery. Target hand positions can be labeled as gestures, and the system can score a patient's ability to replicate prescribed poses, providing quantitative rehabilitation metrics.

Extended Reality and Gaming

The landmark-based approach integrates naturally with VR/AR frameworks to enable hand interaction without controllers. Gaming applications can map gestures to in-game actions for an immersive, controller-free control modality.


LIMITATIONS AND FUTURE WORK

The current implementation classifies static hand poses only. Dynamic signs involving motion trajectories (e.g., ASL letters J and Z) cannot be captured by single-frame landmark snapshots. Two-handed signs, which comprise a substantial portion of ASL vocabulary, are also unsupported. The 96% confidence threshold, while reducing false positives, may reject valid signs under suboptimal lighting or oblique viewing angles.

Proposed future enhancements include:

  • Temporal gesture modeling via LSTM or Transformer architectures for dynamic signs
  • Dual-hand support for full ASL vocabulary coverage
  • Continuous sign segmentation to detect gesture boundaries in fluent signing
  • Mobile deployment via TensorFlow Lite or ONNX for Android/iOS
  • Word-level language model integration for predictive text from fingerspelled sequences
  • Fine-grained confidence calibration to improve robustness in low-light conditions

CONCLUSION

This paper presented ASLive, a real-time ASL recognition system demonstrating that effective sign language translation is achievable with a minimal software stack on commodity hardware. By combining MediaPipe's robust hand landmark estimation with a Random Forest classifier and interactive user-driven training, the system delivers a flexible, privacy-preserving, and highly accessible communication tool.

The modular architecture — separating capture, feature extraction, classification, and output — makes individual components independently upgradeable. The system's success demonstrates that meaningful accessibility technology can be built with open-source tools and standard hardware, without large research budgets or cloud infrastructure.



ACKNOWLEDGMENT

The authors thank the open-source communities behind MediaPipe, OpenCV, scikit-learn, and pyttsx3 for making high-quality computer vision and machine learning tools freely available.

REFERENCES

  1. National Institute on Deafness and Other Communication Disorders, "American Sign Language," NIH Publication, 2021.
  2. J. L. Hernandez-Rebollar, N. Kyriakopoulos, and R. W. Lindeman, "A new instrumented approach for translating American Sign Language into sound and text," Proc. IEEE Int. Conf. Autom. Face Gesture Recognit., pp. 547–552, 2002.
  3. C. Lugaresi et al., "MediaPipe: A framework for perceiving and processing reality," Workshop on Perception for AR/VR at CVPR, 2019.
  4. L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
  5. pyttsx3 Contributors, "pyttsx3: Text-to-speech library for Python," [Online]. Available: https://pyttsx3.readthedocs.io/, 2023.
  6. O. Koller, "Quantitative survey of the state of the art in sign language recognition," arXiv:2008.09918, 2020.
  7. F. Pedregosa et al., "Scikit-learn: Machine learning in Python," J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.

FIGURES

The following figures document ASLive's real-time operation across different system modes, illustrating MediaPipe landmark detection, the data recording interface, the full translation UI, and a live application demonstration.


Fig. 1. MediaPipe hand landmark detection with 21 keypoints overlaid on live camera feed.


Fig. 2. Data recording mode showing active gesture label (REC: SPACE) and live landmark extraction.


Fig. 3. Full system UI showing AI prediction (SPACE, 88%), confidence score, MODE: TRANSLATE, and sentence output reading "ASLIVE".


Fig. 4. Application demonstration showing sign-to-text output in a real-world usage scenario — sentence reads "HOW ARE YOU".