Stock-Video-Collector

Updated May 26, 2026

Latest release

No tag yet

README is the clearest project overview right now.

Source at github.com/SysAdminDoc/Stock-Video-Collector.

Contents

Stock Video Collector
Quick Start
Features
Multi-Site Crawling
Browser Automation & Anti-Detection
Video Discovery
Database & Search
Asset Management
Download Manager
Export Formats
GUI
Keyboard Shortcuts
Usage
Basic Workflow
Configuration
Filename Templates
How It Works
Troubleshooting
License
Contributing

Stock Video Collector

Headless browser crawler with a dark-themed PyQt6 desktop GUI for discovering, cataloging, and downloading stock video clips from multiple sites — with full metadata extraction, FTS5 keyword search, and a concurrent download manager.

Quick Start

git clone https://github.com/SysAdminDoc/Stock-Video-Collector.git
cd Stock-Video-Collector
python artlist_scraper.py  # Auto-installs all dependencies on first run

That's it. The script bootstraps everything automatically:

Installs Python packages (PyQt6, playwright, etc.)
Downloads Chromium via Playwright
Launches the GUI
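
A first-run bootstrap of this kind can be sketched as follows. This is a minimal illustration, not the script's actual code: the `REQUIRED` mapping and the function names are assumptions, and only the `pip install` and `playwright install chromium` commands are taken from standard tooling.

```python
import importlib.util
import subprocess
import sys

# Hypothetical mapping of import name -> pip package name.
REQUIRED = {"PyQt6": "PyQt6", "playwright": "playwright"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be imported."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


def bootstrap_commands(missing):
    """Build the commands a bootstrap step would run, without running them."""
    commands = []
    if missing:
        commands.append([sys.executable, "-m", "pip", "install", *missing])
    # Playwright ships its own CLI for fetching browser binaries.
    commands.append([sys.executable, "-m", "playwright", "install", "chromium"])
    return commands


def bootstrap():
    """Install whatever is missing, then make sure Chromium is present."""
    for command in bootstrap_commands(missing_packages()):
        subprocess.check_call(command)
```

Calling `bootstrap()` before importing the GUI modules would reproduce the three steps listed above; building the command list separately keeps the logic easy to inspect without touching the network.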

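Requirements: Python 3.9+ is the only prerequisite for running the script. Works on Windows, Linux, and macOS. Downloads that need HLS→MP4 conversion additionally require ffmpeg on your PATH (see Troubleshooting).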

Features

Multi-Site Crawling

Site	Video Types	Metadata	Pagination
Artlist	M3U8 HLS streams	Clip ID, resolution, duration, FPS, camera, formats, creator, collection, tags	Infinite scroll
Pexels	MP4 direct (SD/HD/UHD via Canva CDN)	OpenGraph + JSON-LD, URL slug titles	Load More button (up to 15 clicks)
Pixabay	MP4, WebM	OpenGraph + JSON-LD	Infinite scroll
Storyblocks	M3U8, MP4, WebM	OpenGraph + JSON-LD	Infinite scroll
Generic	M3U8, MP4, WebM, DASH, MOV	Auto-detect (OG, JSON-LD, DOM)	Infinite scroll

The Generic profile is not tied to a specific site: it intercepts all video network requests and extracts whatever metadata the page exposes.

Browser Automation & Anti-Detection

Feature	Description
Stealth mode	Hides `navigator.webdriver` flag, spoofs plugin array and WebGL vendor/renderer
Challenge detection	Auto-detects Cloudflare, CAPTCHA, and challenge pages
Manual solve mode	Switches to visible browser for CAPTCHA solving, resumes automatically on clearance
Persistent profile	Browser session cookies, localStorage, and tokens persist across runs
Request interception	Blocks heavy HLS `.ts` segments during crawl to save bandwidth
Configurable delays	Page delay, scroll delay, M3U8 wait, timeout — all adjustable per-run
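
Two of the rows above, request interception and challenge detection, reduce to simple predicates. The sketch below is an assumption about how such checks could look; the marker strings and function names are hypothetical and would need tuning per site.

```python
from urllib.parse import urlparse

# Hypothetical marker strings; not taken from the project's source.
CHALLENGE_MARKERS = (
    "just a moment",
    "cf-challenge",
    "captcha",
    "verify you are human",
)


def should_block(url: str) -> bool:
    """True for HLS media segments (.ts); the playlist URL is enough for cataloging."""
    path = urlparse(url).path.lower()
    return path.endswith(".ts")


def looks_like_challenge(html: str) -> bool:
    """True when the page text contains any known challenge marker."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

With Playwright, a predicate like `should_block` would typically be called from a `page.route(...)` handler that chooses between `route.abort()` and `route.continue_()`.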

Video Discovery

The crawler uses four complementary strategies to find video URLs on every page:

┌───────────────────────────────────────────────────────────────────┐
│                         Page Load                                 │
├───────────────┬─────────────────┬─────────────┬───────────────────┤
│  XHR/Fetch    │   DOM Observer  │  Response   │   HTML Regex      │
│  Intercept    │   (MutationObs) │  Body Scan  │   Fallback        │
│               │                 │             │                   │
│  Hooks into   │  Watches for    │  Scans all  │  Regex sweep for  │
│  XMLHttpReq & │  <video src>    │  HTTP resp  │  M3U8/MP4/WebM    │
│  fetch() API  │  injections     │  bodies     │  + Canva partner  │
│               │                 │             │  links (Pexels)   │
└───────┬───────┴────────┬────────┴──────┬──────┴────────┬──────────┘
        │                │               │               │
        └────────────────┴───────┬───────┴───────────────┘
                                 ▼
                    ┌─────────────────────┐
                    │  Quality Comparison  │
                    │  UHD > HD > SD       │
                    │  Dedup by clip ID    │
                    └──────────┬──────────┘
                               ▼
                    ┌─────────────────────┐
                    │   SQLite Database    │
                    │   + FTS5 Index       │
                    └─────────────────────┘
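
The last two stages of the diagram, the HTML regex fallback and the quality comparison with deduplication, can be illustrated with the standard library alone. The regular expression, the `QUALITY_RANK` table, and the tuple layout are assumptions made for this sketch.

```python
import re

# Matches absolute URLs ending in a known video extension, with an optional query string.
VIDEO_URL_RE = re.compile(
    r"""https?://[^\s"'<>]+?\.(?:m3u8|mp4|webm)(?:\?[^\s"'<>]*)?""",
    re.IGNORECASE,
)

# Higher rank wins when the same clip is seen at several qualities.
QUALITY_RANK = {"sd": 0, "hd": 1, "uhd": 2}


def find_video_urls(html):
    """Regex sweep over raw HTML; returns URLs in the order they appear."""
    return VIDEO_URL_RE.findall(html)


def merge_candidates(candidates):
    """Keep one URL per clip ID, preferring UHD over HD over SD.

    `candidates` is an iterable of (clip_id, quality, url) tuples.
    """
    best = {}
    for clip_id, quality, url in candidates:
        rank = QUALITY_RANK.get(quality.lower(), -1)
        if clip_id not in best or rank > best[clip_id][0]:
            best[clip_id] = (rank, url)
    return {clip_id: url for clip_id, (rank, url) in best.items()}
```

Results from all four discovery strategies could be fed through `merge_candidates` before anything is written to the database, so each clip ID is stored once at its best known quality.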

Database & Search

Feature	Description
SQLite with WAL mode	Concurrent reads, crash-safe writes
FTS5 full-text search	Search across title, creator, collection, tags, resolution, camera, duration
AND/OR search modes	Toggle between matching all search terms (AND) and matching any of them (OR)
Column filters	Filter by source site, resolution, creator, collection — all combinable with text search
Duration filter	Quick filter by clip length range
Saved searches	Save and recall frequent search + filter combos
FTS index rebuild	One-click repair if search results drift out of sync
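
The AND/OR toggle maps directly onto FTS5 query syntax. The sketch below uses a plain FTS5 table with three columns for brevity (the project describes an external content table with more columns), assumes the bundled SQLite was built with FTS5, and uses hypothetical table and function names.

```python
import sqlite3


def build_match(terms, mode="AND"):
    """Quote each term and join the terms with an FTS5 boolean operator."""
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    quoted = ['"' + t.strip().replace('"', '""') + '"' for t in terms if t.strip()]
    return f" {mode} ".join(quoted)


def search(conn, terms, mode="AND"):
    """Return matching titles, sorted alphabetically."""
    query = build_match(terms, mode)
    if not query:
        return []
    rows = conn.execute(
        "SELECT title FROM clips_fts WHERE clips_fts MATCH ? ORDER BY title",
        (query,),
    )
    return [row[0] for row in rows]


# Small in-memory index with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE clips_fts USING fts5(title, creator, tags)")
conn.executemany(
    "INSERT INTO clips_fts (title, creator, tags) VALUES (?, ?, ?)",
    [
        ("Sunset Beach", "JohnDoe", "nature ocean"),
        ("City Sunset", "JaneRoe", "urban"),
        ("Forest Walk", "JohnDoe", "nature"),
    ],
)
```

Quoting every term keeps user input from being interpreted as FTS5 operators, which matters once search text comes straight from a GUI field.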

Asset Management

Feature	Description
Star ratings	1–5 star rating per clip
Favorites	Quick-toggle favorite flag for any clip
Notes	Free-text notes per clip
User tags	Custom tag system independent of source tags
Collections	Organize clips into named collections with color coding
Bulk operations	Context menu actions on any card in the grid

Download Manager

Feature	Description
Concurrent downloads	Configurable parallel download workers (default: 2)
ffmpeg HLS→MP4	Automatic M3U8-to-MP4 conversion via ffmpeg
Retry with backoff	Exponential backoff retry (configurable max attempts)
Speed & ETA tracking	Real-time download speed and estimated completion time
Bandwidth limiting	Optional download speed cap
Filename templates	Customizable output filenames: `{title}`, `{clip_id}`, `{creator}`, `{collection}`, `{resolution}`
Sidecar metadata	JSON metadata file written alongside each downloaded MP4
Thumbnail extraction	Auto-extracts a thumbnail frame from downloaded videos
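
Retry with exponential backoff is the most reusable piece of this table. The following is a generic sketch of the technique; the base delay, factor, and cap are assumed values, and "max retries" is read here as the number of retries after the first attempt.

```python
import time


def backoff_delays(max_retries, base=1.0, factor=2.0, cap=60.0):
    """Delay before each retry: base, base*factor, base*factor^2, ... up to cap."""
    return [min(cap, base * factor ** i) for i in range(max_retries)]


def retry(func, max_retries=3, base=1.0, sleep=time.sleep):
    """Call func(); on failure wait and try again, up to max_retries retries."""
    delays = backoff_delays(max_retries, base)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])
```

Passing `sleep` as a parameter makes the wait injectable, so the logic can be exercised in tests without real delays.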

Export Formats

Format	Contents
`.txt`	Plain list of M3U8/MP4 URLs
`.json`	Full metadata for all clips (title, creator, tags, URLs, timestamps)
`.m3u`	Media player playlist — uses local path if downloaded, M3U8 URL otherwise
`.csv`	Spreadsheet-ready with all metadata columns
Batch	Export all four formats at once
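
As an illustration of the `.m3u` and `.csv` rows, the sketch below writes both formats from a list of clip dictionaries. The dictionary keys (`title`, `m3u8_url`, `local_path`) are assumptions for this example, not the project's actual schema.

```python
import csv
import io


def export_m3u(clips):
    """Build an extended M3U playlist, preferring the local file when one exists."""
    lines = ["#EXTM3U"]
    for clip in clips:
        lines.append(f"#EXTINF:-1,{clip['title']}")
        lines.append(clip.get("local_path") or clip["m3u8_url"])
    return "\n".join(lines) + "\n"


def export_csv(clips, fields):
    """Build CSV text with one row per clip, restricted to the given columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(clips)
    return buffer.getvalue()
```

Returning strings rather than writing files directly would let a batch export reuse the same functions for every output path.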

GUI

Feature	Description
Dark theme	Catppuccin-inspired deep dark palette
Card grid view	Visual thumbnail grid with configurable card sizes (S/M/L)
Hover video preview	Mouse-over any card to preview the video inline
Detail panel	Always-visible side panel with full metadata, ratings, notes, tags, collections
System tray	Minimize to tray, continue crawling/downloading in background
Toast notifications	Non-blocking status notifications
Live crawl log	Real-time scrolling log with verbose/quiet toggle
Clipboard monitor	Opt-in URL detection from clipboard (auto-fills crawl URL input)

Keyboard Shortcuts

Key	Action
`Ctrl+F`	Focus search bar
`F5`	Refresh search results
`Ctrl+1` through `Ctrl+6`	Switch between tabs

Usage

Basic Workflow

1. Select a site profile: check one or more profiles in the Crawl tab (Artlist, Pexels, Pixabay, Storyblocks, or Generic).
2. Set the start URL: it is auto-populated per profile, or paste any URL for Generic mode.
3. Configure crawl settings: batch size, depth, delays, headless mode.
4. Start crawling: the crawler discovers pages, extracts metadata, and intercepts video URLs.
5. Browse results: switch to the Library tab to search, filter, rate, tag, and organize clips.
6. Download: select clips and download them with the built-in manager, or export URL lists for external tools.

Configuration

All settings persist automatically in a JSON config file. Key options:

Setting	Default	Description
Batch size	50	Pages per crawl batch
Page delay	2s	Wait between page loads
Scroll delay	1s	Wait between scroll steps
M3U8 wait	5s	Time to wait for video URLs to appear
Scroll steps	10	Number of scroll-down actions per page
Timeout	30s	Page load timeout
Max pages	0 (unlimited)	Stop after N pages
Max depth	3	Link-following depth
Headless	On	Run browser without visible window
Concurrent DLs	2	Parallel download workers
Max retries	3	Download retry attempts
Bandwidth limit	0 (unlimited)	Download speed cap in KB/s
Clipboard monitor	Off	Auto-detect URLs from clipboard
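
Persisting settings in JSON while still honoring the defaults above can be done by merging stored values over a defaults dictionary. In this sketch only `clipboard_monitor` is a key name confirmed elsewhere in this README; every other key name is a guess made for illustration.

```python
import json
from pathlib import Path

# Defaults mirror the table above; key names other than
# "clipboard_monitor" are hypothetical.
DEFAULTS = {
    "batch_size": 50,
    "page_delay": 2,
    "scroll_delay": 1,
    "m3u8_wait": 5,
    "scroll_steps": 10,
    "timeout": 30,
    "max_pages": 0,            # 0 = unlimited
    "max_depth": 3,
    "headless": True,
    "concurrent_downloads": 2,
    "max_retries": 3,
    "bandwidth_limit_kbps": 0,  # 0 = unlimited
    "clipboard_monitor": False,
}


def load_config(path):
    """Read the JSON file if it exists and overlay it on the defaults."""
    path = Path(path)
    stored = {}
    if path.exists():
        stored = json.loads(path.read_text(encoding="utf-8"))
    return {**DEFAULTS, **stored}


def save_config(path, config):
    """Write the full configuration back as pretty-printed JSON."""
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")
```

Merging over defaults means a config file written by an older version still loads cleanly when new settings are added later.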

Filename Templates

Customize download filenames using template variables:

{title}                      → Beautiful_Sunset.mp4
{clip_id}_{title}            → abc123_Beautiful_Sunset.mp4
{creator}/{collection}/{title} → JohnDoe/Nature/Beautiful_Sunset.mp4

Available variables: {title}, {clip_id}, {creator}, {collection}, {resolution}
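
Template expansion of this kind can be sketched with `str.format_map`. The sanitizing rule (replace runs of characters other than letters, digits, underscore, and hyphen with `_`) and the `unknown` placeholder are assumptions chosen to reproduce the examples above; the project's real rules may differ.

```python
import re


def sanitize(value):
    """Make a metadata value safe to use as a single path component."""
    cleaned = re.sub(r"[^\w\-]+", "_", str(value)).strip("_")
    return cleaned or "unknown"


class _SafeValues(dict):
    """Dictionary that substitutes a placeholder for variables the clip lacks."""

    def __missing__(self, key):
        return "unknown"


def render_filename(template, clip, ext=".mp4"):
    """Expand {variables} in the template using sanitized clip metadata."""
    values = _SafeValues({key: sanitize(val) for key, val in clip.items()})
    return template.format_map(values) + ext
```

Because each value is sanitized before substitution, a `/` can only enter the result through the template itself, so metadata cannot introduce unexpected directories.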

How It Works

┌─────────────────────────────────────────────────────────────────────────┐
│                            PyQt6 GUI                                    │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐ │
│  │  Crawl   │  │  Library  │  │  Detail   │  │ Download │  │  Export  │ │
│  │  Tab     │  │  Tab      │  │  Panel    │  │  Tab     │  │  Tab    │ │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬────┘ │
└───────┼──────────────┼──────────────┼──────────────┼──────────────┼─────┘
        │              │              │              │              │
        ▼              ▼              ▼              ▼              ▼
┌──────────────┐  ┌──────────────────────────┐  ┌──────────────────────┐
│   Crawler    │  │      SQLite + FTS5       │  │   Download Worker    │
│   Worker     │──│                          │──│                      │
│  (QThread)   │  │  clips, crawl_queue,     │  │  ThreadPoolExecutor  │
│              │  │  crawled_pages,           │  │  + ffmpeg HLS→MP4   │
│  Playwright  │  │  collections,            │  │  + retry backoff     │
│  Chromium    │  │  saved_searches           │  │  + speed tracking    │
└──────────────┘  └──────────────────────────┘  └──────────────────────┘

Crawler Worker — Runs Playwright in an async event loop on a dedicated QThread. Navigates pages, injects JavaScript hooks for XHR/fetch/DOM video interception, extracts metadata via regex selectors + OpenGraph + JSON-LD, and manages the crawl queue with depth/priority.
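
A crawl queue with depth and priority limits can be modeled as a heap plus a seen-set. This is a simplified in-memory sketch under assumed semantics (lower number means higher priority, `max_pages=0` means unlimited); the project keeps its queue in SQLite, which is not shown here.

```python
import heapq
from itertools import count


class CrawlQueue:
    """Priority queue of URLs that enforces depth and page limits."""

    def __init__(self, max_depth=3, max_pages=0):
        self.max_depth = max_depth
        self.max_pages = max_pages  # 0 means unlimited
        self._heap = []
        self._seen = set()
        self._order = count()       # tie-breaker keeps insertion order stable
        self.popped = 0

    def push(self, url, depth=0, priority=0):
        """Queue a URL unless it is too deep or already known."""
        if depth > self.max_depth or url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, next(self._order), depth, url))
        return True

    def pop(self):
        """Return (url, depth) for the next page, or None when done."""
        if not self._heap:
            return None
        if self.max_pages and self.popped >= self.max_pages:
            return None
        _priority, _seq, depth, url = heapq.heappop(self._heap)
        self.popped += 1
        return url, depth
```

Links found on a page popped at depth `d` would be pushed back with depth `d + 1`, which is what makes the "Max depth" setting take effect.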

Database Layer — Thread-safe SQLite with WAL mode and a dedicated threading.Lock. FTS5 external content table indexes title, creator, collection, tags, resolution, camera, and duration. Quality-aware M3U8 URL upgrades prefer UHD over HD over SD.
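
The combination of WAL mode, a single lock, and quality-aware upgrades might look like the sketch below. The `ClipStore` class, its table layout, and the quality labels are assumptions for illustration; the FTS5 index is left out to keep the example short.

```python
import sqlite3
import threading

QUALITY_RANK = {"SD": 0, "HD": 1, "UHD": 2}


class ClipStore:
    """One shared connection guarded by a lock, with WAL journaling."""

    def __init__(self, path):
        self._lock = threading.Lock()
        # check_same_thread=False lets worker threads share the connection;
        # the lock serializes every access to it.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self.journal_mode = self._conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS clips ("
            "clip_id TEXT PRIMARY KEY, quality TEXT, m3u8_url TEXT)"
        )
        self._conn.commit()

    def upsert(self, clip_id, quality, url):
        """Store the URL unless an equal or better quality is already saved."""
        with self._lock:
            row = self._conn.execute(
                "SELECT quality FROM clips WHERE clip_id = ?", (clip_id,)
            ).fetchone()
            new_rank = QUALITY_RANK.get(quality, -1)
            if row is not None and new_rank <= QUALITY_RANK.get(row[0], -1):
                return False
            self._conn.execute(
                "INSERT OR REPLACE INTO clips (clip_id, quality, m3u8_url) VALUES (?, ?, ?)",
                (clip_id, quality, url),
            )
            self._conn.commit()
            return True

    def get(self, clip_id):
        """Return (quality, url) for a clip, or None if it is unknown."""
        with self._lock:
            return self._conn.execute(
                "SELECT quality, m3u8_url FROM clips WHERE clip_id = ?", (clip_id,)
            ).fetchone()
```

Holding the lock across the read and the write keeps the compare-then-update step atomic when the crawler and the GUI touch the same clip at once.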

Download Worker — Persistent queue on a QThread with a ThreadPoolExecutor for concurrent downloads. Handles M3U8→MP4 conversion via ffmpeg, exponential backoff retry, real-time speed/ETA calculation, sidecar JSON metadata, and thumbnail extraction.
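
The concurrency, conversion, and speed-tracking parts of the download worker can be approximated as below. Function names are hypothetical; the ffmpeg flags shown (`-y`, `-i`, `-c copy`, `-bsf:a aac_adtstoasc`) are a common way to remux HLS into MP4, not necessarily the exact command the project runs.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ffmpeg_hls_command(m3u8_url, output_path):
    """Argument list that remuxes an HLS playlist into MP4 without re-encoding."""
    return [
        "ffmpeg", "-y",
        "-i", m3u8_url,
        "-c", "copy",
        "-bsf:a", "aac_adtstoasc",
        output_path,
    ]


def speed_and_eta(bytes_done, bytes_total, elapsed_seconds):
    """Return (bytes per second, seconds remaining or None if unknown)."""
    if elapsed_seconds <= 0 or bytes_done <= 0:
        return 0.0, None
    speed = bytes_done / elapsed_seconds
    return speed, (bytes_total - bytes_done) / speed


def run_downloads(jobs, download_one, max_workers=2):
    """Run download_one(job) for each job in parallel and collect the outcomes."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, job): job for job in jobs}
        for future in as_completed(futures):
            job = futures[future]
            try:
                results[job] = ("ok", future.result())
            except Exception as exc:
                results[job] = ("failed", str(exc))
    return results
```

In practice `download_one` would pass the list from `ffmpeg_hls_command` to `subprocess.run` for M3U8 sources; catching exceptions per future means one failed job does not stop the others.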

Troubleshooting

"Chromium not found" — Click the "Install Browser" button on the Crawl tab. This runs playwright install chromium automatically.

Search results seem wrong or incomplete — Click the "🔄 Rebuild Index" button on the Crawl tab to rebuild the FTS5 search index from scratch.

Bot challenge / CAPTCHA detected — Uncheck "Headless" mode and restart the crawl. The browser will open visibly so you can solve the challenge manually. The crawler pauses and resumes automatically once the challenge clears.

Downloads fail repeatedly — Check that ffmpeg is installed and on your PATH. The scraper auto-detects ffmpeg in common locations, but if it can't find it, downloads that require HLS→MP4 conversion will fail.

Clipboard monitor not working — The clipboard monitor is opt-in. Enable it in your config by adding "clipboard_monitor": true, or toggle it programmatically. On Linux/Wayland, clipboard access may require additional permissions.

License

MIT License — see LICENSE for details.

Contributing

Issues and PRs welcome. If you add support for a new site, submit it as a SiteProfile.register() block with documented selectors and test URLs.
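
The signature of `SiteProfile.register()` is not shown in this README, so the following is only a stand-in to show the general shape a profile registration could take. Every field name, the `profile_for` helper, and the `examplestock.test` domain are hypothetical; check the source for the real interface before contributing.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

_REGISTRY = {}  # profile name -> SiteProfile (hypothetical module-level registry)


@dataclass(frozen=True)
class SiteProfile:
    """Stand-in profile record; the real class may carry different fields."""

    name: str
    domains: tuple
    start_url: str
    pagination: str = "infinite_scroll"
    test_urls: tuple = ()

    @classmethod
    def register(cls, **kwargs):
        """Create a profile and make it discoverable by name."""
        profile = cls(**kwargs)
        _REGISTRY[profile.name] = profile
        return profile


def profile_for(url):
    """Pick the profile whose domain matches the URL, else fall back to Generic."""
    host = urlparse(url).hostname or ""
    for profile in _REGISTRY.values():
        if any(host == d or host.endswith("." + d) for d in profile.domains):
            return profile
    return _REGISTRY.get("Generic")


SiteProfile.register(name="Generic", domains=(), start_url="")
SiteProfile.register(
    name="ExampleStock",
    domains=("examplestock.test",),
    start_url="https://www.examplestock.test/videos",
    pagination="load_more",
    test_urls=("https://www.examplestock.test/videos/sunset-123",),
)
```

Keeping test URLs on the profile itself, as the contribution note asks, gives reviewers a quick way to confirm that a new profile still matches the site.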
