LipSight

AI-powered lip reading tool — transcribes speech from silent video

Python

Delivery

Source-first

Browse code, README, and release notes on GitHub.

Primary lane

Python lane

The clearest adjacent context for this project inside the portfolio.

Freshness

May 26, 2026

Updated May 26, 2026

Latest release

No tag yet

README is the clearest project overview right now.

Source at github.com/SysAdminDoc/LipSight.

README

Cached at build time, cleaned up for in-site reading, and linked back to the canonical GitHub source.

Contents

LipSight
Quick Start
Features
How It Works
Prerequisites
Usage
Configuration
Accuracy Notes
Models & Research
FAQ / Troubleshooting
License

LipSight

AI-powered lip reading tool that transcribes speech from silent video using state-of-the-art visual speech recognition models.

Quick Start

git clone https://github.com/SysAdminDoc/LipSight.git
cd LipSight
python LipSight.py  # Auto-installs all dependencies on first run

Features

Feature	Description
🧠 Auto-AVSR Inference	Cloud inference via Replicate API using the state-of-the-art Auto-AVSR model (~80% word accuracy)
👁️ Face/Mouth Detection	Real-time MediaPipe face mesh with mouth ROI visualization and open/close ratio tracking
🎬 Smart Segmentation	Automatic speech segment detection via mouth movement analysis — timestamps estimated per segment
📹 Video Preview	Frame-by-frame scrubbing with annotated face landmarks and speaking/silent status overlay
💾 Multi-Format Export	Export results as SRT subtitles, timestamped TXT, or structured JSON
🌐 Custom Endpoints	Support for self-hosted inference servers alongside Replicate API
🎨 Dark Theme	Professional Catppuccin Mocha dark interface
⚡ Threaded Processing	All heavy operations run on background threads — GUI never locks
🔧 Zero Configuration	Auto-bootstraps all dependencies (PyQt6, OpenCV, MediaPipe, Replicate)
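The open/close ratio tracking mentioned in the table can be illustrated with a small geometric sketch. The README does not say which face mesh landmarks LipSight reads or how it scales the value, so the function name `mouth_open_ratio`, the four input points, and the percentage scaling are all assumptions made for illustration.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def mouth_open_ratio(upper_lip: Point, lower_lip: Point,
                     left_corner: Point, right_corner: Point) -> float:
    """Return the vertical lip gap as a percentage of mouth width.

    Hypothetical helper: the points would come from face mesh landmarks,
    but the exact landmarks and scaling LipSight uses are not documented.
    """
    width = math.dist(left_corner, right_corner)
    if width == 0:
        return 0.0  # degenerate detection; treat the mouth as closed
    gap = math.dist(upper_lip, lower_lip)
    return 100.0 * gap / width
```

Dividing by mouth width keeps the value roughly stable when the speaker moves closer to or farther from the camera, which is one reason a ratio is easier to threshold than a raw pixel distance.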

How It Works

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Video Input    │────>│  Face Analysis   │────>│  Segmentation   │────>│   Inference      │
│                  │     │                  │     │                  │     │                  │
│  MP4/MOV/AVI/   │     │  MediaPipe Face  │     │  Mouth movement  │     │  Auto-AVSR via   │
│  MKV/WebM       │     │  Mesh detection  │     │  based splitting │     │  Replicate API   │
└─────────────────┘     └─────────────────┘     └─────────────────┘     └────────┬─────────┘
                                                                                  │
                         ┌─────────────────┐     ┌─────────────────┐              │
                         │   Export         │<────│   Results        │<─────────────┘
                         │                  │     │                  │
                         │  SRT / TXT /     │     │  Timestamped     │
                         │  JSON / Clipboard│     │  transcription   │
                         └─────────────────┘     └─────────────────┘
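The segmentation stage in the diagram could look roughly like the sketch below. It assumes one mouth-open value per frame and groups frames above a threshold into timestamped spans. The function name, the `max_gap_frames` and `min_frames` parameters, and their defaults are hypothetical; only the default threshold of 6 is taken from the Configuration table.

```python
from typing import List, Sequence, Tuple


def find_speech_segments(ratios: Sequence[float], fps: float,
                         threshold: float = 6.0,
                         max_gap_frames: int = 5,
                         min_frames: int = 3) -> List[Tuple[float, float]]:
    """Group frames whose mouth-open value exceeds `threshold` into
    (start_seconds, end_seconds) spans.

    Short closures of up to `max_gap_frames` frames stay inside a span,
    and spans shorter than `min_frames` are discarded as noise.
    """
    segments: List[Tuple[float, float]] = []
    start = None        # index of the first active frame in the current span
    last_active = None  # index of the most recent active frame

    def close_span() -> None:
        if last_active - start + 1 >= min_frames:
            segments.append((start / fps, (last_active + 1) / fps))

    for i, ratio in enumerate(ratios):
        if ratio > threshold:
            if start is None:
                start = i
            last_active = i
        elif start is not None and i - last_active > max_gap_frames:
            close_span()
            start = None

    if start is not None:
        close_span()  # video ended while a span was still open
    return segments
```

Tolerating short gaps matters because lips close briefly on sounds such as p, b, and m; without it, a single sentence would be split into many tiny clips.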

Prerequisites

Python 3.8+
Replicate API token — get one free at replicate.com/account/api-tokens
ffmpeg (optional) — enables faster video segment extraction. Falls back to OpenCV if not present.
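The optional ffmpeg dependency suggests a check-then-fall-back pattern. The following is a minimal sketch of that pattern, not LipSight's actual code: `pick_extractor` and `build_ffmpeg_cut_command` are hypothetical names, and the command uses only standard ffmpeg options (`-y`, `-i`, `-ss`, `-to`, `-an`).

```python
import shutil
from typing import List


def pick_extractor() -> str:
    """Return "ffmpeg" when the binary is on PATH, otherwise "opencv"."""
    return "ffmpeg" if shutil.which("ffmpeg") else "opencv"


def build_ffmpeg_cut_command(src: str, start: float, end: float,
                             dst: str) -> List[str]:
    """Build an ffmpeg argument list that cuts [start, end] seconds from
    `src` into `dst`, dropping the audio track (lip reading is video-only).
    """
    return [
        "ffmpeg", "-y",       # overwrite the output file without prompting
        "-i", src,
        "-ss", f"{start:.3f}",
        "-to", f"{end:.3f}",
        "-an",                # no audio in the extracted clip
        dst,
    ]
```

The argument list can be passed to `subprocess.run(cmd, check=True)` when `pick_extractor()` returns `"ffmpeg"`; otherwise the caller would read and write frames with OpenCV instead.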

Usage

Launch LipSight.py
Go to Settings tab → paste your Replicate API token → Save Settings
Click Load Video and select a video file (MP4, MOV, AVI, MKV, or WebM)
(Optional) Click Analyze Segments to detect speech regions via mouth movement
Click Lip Read to begin transcription
Export results as SRT, TXT, JSON, or copy to clipboard
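For the SRT export in the last step, the sketch below shows how timestamped segments map onto the SubRip layout (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, then text). The function names and the `(start, end, text)` tuple shape are assumptions for illustration; the README does not describe LipSight's internal result structure.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3661.5 -> 01:01:01,500."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start_seconds, end_seconds, text) tuples as SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(cues)  # a blank line separates consecutive cues
```

Calling `to_srt([(1.0, 2.0, "hello")])` yields a single cue numbered 1 that runs from 00:00:01,000 to 00:00:02,000.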

Configuration

Settings are persisted to ~/.lipsight/config.json (or %APPDATA%\.lipsight\config.json on Windows).

Setting	Description	Default
Replicate API Token	Authentication for cloud inference	—
Custom Endpoint URL	Self-hosted VSR server URL	—
Auto-Segment	Split video by mouth movement before processing	Enabled
Mouth Open Threshold	Sensitivity for speech detection (3–15)	6
Inference Backend	Replicate API or Custom Endpoint	Replicate
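As a rough illustration of how these settings might be persisted, the sketch below resolves the config path described above and merges a saved file over the table's defaults. The JSON key names (`replicate_api_token`, `mouth_open_threshold`, and so on) are guesses; the real file may use different keys.

```python
import json
import os
import sys
from pathlib import Path

# Defaults mirror the table above; the key names themselves are hypothetical.
DEFAULTS = {
    "replicate_api_token": "",
    "custom_endpoint_url": "",
    "auto_segment": True,
    "mouth_open_threshold": 6,
    "inference_backend": "replicate",
}


def config_path() -> Path:
    """Return ~/.lipsight/config.json, or the %APPDATA% variant on Windows."""
    if sys.platform == "win32":
        base = Path(os.environ.get("APPDATA", str(Path.home())))
    else:
        base = Path.home()
    return base / ".lipsight" / "config.json"


def load_config(path: Path) -> dict:
    """Load settings, falling back to defaults for missing or unreadable files."""
    config = dict(DEFAULTS)
    try:
        config.update(json.loads(path.read_text(encoding="utf-8")))
    except (FileNotFoundError, json.JSONDecodeError):
        pass  # keep defaults
    # Keep the threshold inside the documented 3-15 range.
    config["mouth_open_threshold"] = max(3, min(15, config["mouth_open_threshold"]))
    return config


def save_config(config: dict, path: Path) -> None:
    """Write settings as JSON, creating the .lipsight directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
```

Merging over defaults means a config file written by an older version, with fewer keys, still loads cleanly.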

Accuracy Notes

Visual-only lip reading is fundamentally limited by the homophene problem: many sounds (p/b/m, k/g, f/v) look identical on the lips. The current state of the art achieves ~80% word accuracy on benchmark data, and only under ideal conditions:

Frontal face view, single speaker
Good lighting, no obstructions
Clear lip movement

Real-world accuracy varies. This tool is best suited for getting the gist of speech when audio is unavailable — not for precise transcription.

Models & Research

LipSight uses the Auto-AVSR model, which represents the current state of the art in deployable visual speech recognition. It is listed here with related research for context:

Auto-AVSR — Apache 2.0, ~20% WER on LRS3
VALLR — Latest research (ICCV 2025), 18.7% WER using LLaMA integration
AV-HuBERT — Meta's self-supervised visual encoder
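The WER figures above and the ~80% word accuracy quoted elsewhere describe the same measurement from two sides: word accuracy is approximately 1 minus WER. The sketch below computes WER as word-level edit distance divided by reference length, which is the standard definition; the function name is illustrative and this is not the evaluation code used by any of the listed models.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)
```

A single homophene confusion in a five-word sentence, such as "bat" for "pat", gives a WER of 0.2, which corresponds to 80% word accuracy.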

FAQ / Troubleshooting

Q: The transcription is inaccurate
A: Ensure the video has a clear, frontal view of the speaker's face with good lighting. Lip reading AI currently achieves ~80% accuracy at best, significantly below audio-based transcription.

Q: Processing is slow
A: Cloud inference depends on Replicate's queue. Each segment takes ~10–30 seconds. Consider analyzing segments first and processing fewer, targeted clips.

Q: No face detected
A: The video needs a clearly visible face. Check the video preview: green landmarks should appear on the mouth region.

License

MIT

Read on GitHub → github.com/SysAdminDoc/LipSight
