Automating Subtitle Extraction with DVBSub2Text — Workflow and Scripts
Overview
A reliable automated pipeline converts the image-based DVB subtitles in a transport stream to editable text (SRT/VTT) by: 1) extracting the subtitle streams, 2) rendering the bitmap subtitles to images, 3) running OCR, and 4) reconstructing timings, cleaning up the text, and packaging the result. Below is a practical, ready-to-run workflow with example scripts using common tools.
Required tools (assumed installed)
- ffmpeg / ffprobe
- dvbsnoop or ProjectX (subtitle stream extraction)
- BDSup2Sub (or similar) to produce images (PNG) from DVB subtitles
- Tesseract OCR with appropriate language data
- Python 3 for orchestration and post-processing
- optional: ImageMagick for preprocessing, jq for JSON handling
High-level workflow
- Detect subtitle PID(s) from the TS file.
- Extract DVB subtitle PES packets to a dump.
- Convert DVB subtitle dump to per-frame PNGs.
- Preprocess PNGs (deskew, threshold, crop).
- Run Tesseract OCR on cleaned PNGs to produce text lines.
- Reconstruct timings into SRT/VTT and perform text cleanup (spellcheck, dictionary).
- Package SRT/VTT and optionally mux back into TS or store alongside media.
- (Optional) Batch mode: iterate over directory, apply templates, and parallelize OCR.
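The steps above can be orchestrated from Python. The sketch below only builds the per-stage command lines and runs them in order; the tool flags mirror the examples in this article and are assumptions to adjust for your installed versions (preprocessing and per-image OCR stages are omitted for brevity, and the dvbsnoop stage would additionally need its stdout redirected to the dump file).

```python
import subprocess

def stage_commands(ts_file, pid, base):
    # Build the command line for each pipeline stage. Flags mirror the
    # examples in this article and are assumptions, not a tested set.
    return [
        ['ffprobe', '-v', 'error', '-select_streams', 's',
         '-show_entries', 'stream=index,id', '-of', 'json', ts_file],
        ['dvbsnoop', '-s', 'ts', '-b', '-if', ts_file, pid],
        ['srt2dvbsub', '--png-only', '--png-dir', f'{base}_png',
         '--input', f'{base}.dump'],
        ['python3', 'ocr_to_srt.py', '--png-dir', f'{base}_proc',
         '--out', f'{base}.srt'],
    ]

def run_pipeline(ts_file, pid, base):
    # Run the stages in order; abort on the first failing stage.
    for cmd in stage_commands(ts_file, pid, base):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes each stage easy to log, dry-run, or swap out.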
Example commands (single-file, basic)
- Detect PIDs (ffprobe)
Code
ffprobe -v error -select_streams s -show_entries stream=index,id,codec_name:stream_tags=language -of json input.ts > streams.json
- Extract subtitle packets (dvbsnoop)
Code
dvbsnoop -s ts -b -if input.ts 0x1200 > subtitles.dump
(Adjust PID from streams.json.)
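As an alternative to jq, the streams.json output can be parsed in Python. For MPEG-TS input, ffprobe reports each stream's PID in its `id` field as a hex string, provided `id` is included in -show_entries; the sample JSON below is illustrative:

```python
import json

def subtitle_pids(streams_json_text, language=None):
    # Return the PIDs (ffprobe's "id" field, a hex string for TS input)
    # of subtitle streams, optionally filtered by language tag.
    data = json.loads(streams_json_text)
    pids = []
    for s in data.get('streams', []):
        if language and s.get('tags', {}).get('language') != language:
            continue
        if 'id' in s:
            pids.append(s['id'])
    return pids

# Illustrative ffprobe-shaped JSON for a TS file with one DVB subtitle track:
sample = ('{"streams":[{"index":3,"id":"0x1200",'
          '"codec_name":"dvb_subtitle","tags":{"language":"eng"}}]}')
print(subtitle_pids(sample))         # ['0x1200']
print(subtitle_pids(sample, 'eng'))  # ['0x1200']
```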
- Convert DVB subtitles to images (BDSup2Sub or srt2dvbsub/png-only mode)
Code
# If using srt2dvbsub's PNG-only mode:
srt2dvbsub --png-only --png-dir ./pngout --input subtitles.dump
Or use an existing toolchain (ProjectX -> BDSup2Sub) to produce PNG frames.
- Preprocess images (ImageMagick)
Code
mkdir -p processed
for f in pngout/*.png; do
  convert "$f" -resize 2000x -colorspace Gray -sharpen 0x1 -threshold 70% "processed/$(basename "$f")"
done
- OCR with Tesseract (per-image -> HOCR or plain text)
Code
for f in processed/*.png; do
  tesseract "$f" "${f%.png}" -l eng --psm 6
done
- Stitch OCR output into timed SRT (Python script)
- Read PNG filenames (include timestamps encoded by converter or extracted from stream)
- Group lines by frame ranges, merge consecutive identical text, and assign start/end times
- Output .srt with standard format
Minimal Python outline (run as ocr_to_srt.py)
Code
import glob
import os

def png_timestamp(name):
    # Assumes filenames like frame_000123_12345678.png where the last
    # underscore-separated field is a timestamp in milliseconds.
    parts = os.path.splitext(os.path.basename(name))[0].split('_')
    return int(parts[-1]) / 1000.0

pngs = sorted(glob.glob('processed/*.png'), key=png_timestamp)

entries = []
for i, p in enumerate(pngs):
    txtfile = p.rsplit('.', 1)[0] + '.txt'
    if not os.path.exists(txtfile):
        continue
    text = open(txtfile, encoding='utf-8').read().strip()
    if not text:
        continue
    start = png_timestamp(p)
    # A frame's text is shown until the next frame; pad the last one.
    end = png_timestamp(pngs[i + 1]) if i + 1 < len(pngs) else start + 2.0
    entries.append((start, end, text))

# Merge consecutive entries with identical text into a single cue.
merged = []
for s, e, t in entries:
    if merged and merged[-1][2] == t:
        merged[-1] = (merged[-1][0], e, t)
    else:
        merged.append((s, e, t))

def fmt(ts):
    # SRT timestamps are HH:MM:SS,mmm
    ms = int(round(ts * 1000))
    h, rem = divmod(ms, 3600000)
    m, rem = divmod(rem, 60000)
    s, ms = divmod(rem, 1000)
    return f'{h:02d}:{m:02d}:{s:02d},{ms:03d}'

with open('output.srt', 'w', encoding='utf-8') as out:
    for i, (s, e, t) in enumerate(merged, 1):
        out.write(f'{i}\n{fmt(s)} --> {fmt(e)}\n{t}\n\n')
Batch/Automation script (bash)
- Detect subtitle PID with ffprobe (jq), extract, convert, OCR, run Python stitcher. Parallelize OCR with GNU parallel. Sketch:
Code
#!/bin/bash
set -euo pipefail
for f in /path/to/input/*.ts; do
  base=$(basename "$f" .ts)
  # ffprobe reports each TS stream's PID in its "id" field (hex string).
  pid=$(ffprobe -v error -select_streams s -show_entries stream=id -of json "$f" \
        | jq -r '.streams[0].id')
  dvbsnoop -s ts -b -if "$f" "$pid" > "${base}.dump"
  srt2dvbsub --png-only --png-dir "${base}_png" --input "${base}.dump"
  mkdir -p "${base}_proc"
  for p in "${base}_png"/*.png; do
    convert "$p" -resize 2000x -colorspace Gray -threshold 70% "${base}_proc/$(basename "$p")"
  done
  ls "${base}_proc"/*.png | parallel tesseract {} {.} -l eng --psm 6
  python3 ocr_to_srt.py --png-dir "${base}_proc" --out "${base}.srt"
done
Accuracy & post-processing tips
- Use language-specific traineddata and dictionaries to improve OCR.
- Apply morphological cleanup: remove artifacts, normalize punctuation, fix common OCR confusions (0/O, l/1).
- Use heuristics for timing: extend end-time if same text appears across multiple frames; drop very short spurious entries (<140 ms).
- Consider a spellcheck pass (aspell or hunspell) targeted at the subtitle language.
- For multi-language tracks, set per-track language for Tesseract.
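The cleanup tips above can be sketched as small helper functions. The substitution rules are illustrative assumptions to be extended for your language and font:

```python
import re

def clean_line(text):
    # Fix common OCR confusions: "0" misread between letters -> "O",
    # lone "l" before a digit -> "1", "|" misread for capital "I".
    text = re.sub(r'(?<=[A-Za-z])0(?=[A-Za-z])', 'O', text)
    text = re.sub(r'\bl(?=\d)', '1', text)
    text = re.sub(r'[|]', 'I', text)
    # Collapse runs of whitespace left behind by OCR artifacts.
    return re.sub(r'\s+', ' ', text).strip()

def drop_spurious(cues, min_ms=140):
    # Discard cues shorter than min_ms (matches the heuristic above);
    # cues are (start_seconds, end_seconds, text) tuples.
    return [(s, e, t) for s, e, t in cues if (e - s) * 1000 >= min_ms]
```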
Packaging and muxing back
- To mux the SRT into an MKV: mkvmerge -o out.mkv input.ts output.srt (mkvmerge accepts the SRT file directly as an additional input track; use --language 0:eng output.srt to tag its language).
- To re-generate DVB subtitles for broadcast, use srt2dvbsub to encode SRT back into DVB subtitle stream and mux with ffmpeg/tsmuxer.
Troubleshooting
- No subtitle PID found: inspect PMT with tsduck or dvbsnoop to locate subtitle descriptors.
- Poor OCR: improve preprocessing (contrast, denoise), try different page segmentation modes (tesseract --psm), or train/tune tesseract language data.
- Timing drift: use dvbsnoop/ffprobe timestamps or align via audio/video cues (cross-correlation).
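For the constant-offset case of timing drift, a small helper can shift every timestamp in a finished SRT file by a fixed number of milliseconds (a sketch; progressive drift needs rescaling rather than shifting):

```python
import re

TS = re.compile(r'(\d{2}):(\d{2}):(\d{2}),(\d{3})')

def _shift(match, offset_ms):
    # Decompose HH:MM:SS,mmm, apply the offset, clamp at zero.
    h, m, s, ms = (int(g) for g in match.groups())
    total = max(0, h * 3600000 + m * 60000 + s * 1000 + ms + offset_ms)
    h, rem = divmod(total, 3600000)
    m, rem = divmod(rem, 60000)
    s, ms = divmod(rem, 1000)
    return f'{h:02d}:{m:02d}:{s:02d},{ms:03d}'

def shift_srt(text, offset_ms):
    # Apply a constant offset (positive or negative) to every SRT timestamp.
    return TS.sub(lambda m: _shift(m, offset_ms), text)
```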
Possible next steps:
- A tested end-to-end script tailored to a specific environment (Linux/macOS/Windows WSL) and sample filenames.
- A Dockerfile that bundles the toolchain for reproducible runs.