Automating Subtitle Extraction with DVBSub2Text — Workflow and Scripts
Overview
A reliable automated pipeline converts the image-based DVB subtitles in a transport stream to editable text (SRT/VTT) by: 1) extracting the subtitle streams, 2) rendering the bitmap subtitles to images, 3) running OCR, and 4) reconstructing timings, cleaning up the text, and packaging the result. Below is a practical, ready-to-run workflow with example scripts using common tools.
Required tools (assumed installed)
- ffmpeg / ffprobe
- dvbsnoop or ProjectX (subtitle stream extraction)
- BDSup2Sub (or similar) to produce images (PNG) from DVB subtitles
- Tesseract OCR with appropriate language data
- Python 3 for orchestration and post-processing
- optional: ImageMagick for preprocessing, jq for JSON handling
High-level workflow
- Detect subtitle PID(s) from the TS file.
- Extract DVB subtitle PES packets to a dump.
- Convert DVB subtitle dump to per-frame PNGs.
- Preprocess PNGs (deskew, threshold, crop).
- Run Tesseract OCR on cleaned PNGs to produce text lines.
- Reconstruct timings into SRT/VTT and perform text cleanup (spellcheck, dictionary).
- Package SRT/VTT and optionally mux back into TS or store alongside media.
- (Optional) Batch mode: iterate over directory, apply templates, and parallelize OCR.
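The steps above can be orchestrated from Python. The sketch below only builds the per-stage command lines and runs them in order; the tool flags mirror the examples in this article and are assumptions to adjust for your installed versions (preprocessing and per-image OCR stages are omitted for brevity, and the dvbsnoop stage would additionally need its stdout redirected to the dump file).

```python
import subprocess

def stage_commands(ts_file, pid, base):
    # Build the command line for each pipeline stage. Flags mirror the
    # examples in this article and are assumptions, not a tested set.
    return [
        ['ffprobe', '-v', 'error', '-select_streams', 's',
         '-show_entries', 'stream=index,id', '-of', 'json', ts_file],
        ['dvbsnoop', '-s', 'ts', '-b', '-if', ts_file, pid],
        ['srt2dvbsub', '--png-only', '--png-dir', f'{base}_png',
         '--input', f'{base}.dump'],
        ['python3', 'ocr_to_srt.py', '--png-dir', f'{base}_proc',
         '--out', f'{base}.srt'],
    ]

def run_pipeline(ts_file, pid, base):
    # Run the stages in order; abort on the first failing stage.
    for cmd in stage_commands(ts_file, pid, base):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes each stage easy to log, dry-run, or swap out.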
Example commands (single-file, basic)
- Detect PIDs (ffprobe)
Code
ffprobe -v error -select_streams s -show_entries stream=index,id,codec_name:stream_tags=language -of json input.ts > streams.json
- Extract subtitle packets (dvbsnoop)
Code
dvbsnoop -s ts -b -if input.ts 0x1200 > subtitles.dump
(Adjust PID from streams.json.)
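As an alternative to jq, the streams.json output can be parsed in Python. For MPEG-TS input, ffprobe reports each stream's PID in its `id` field as a hex string, provided `id` is included in -show_entries; the sample JSON below is illustrative:

```python
import json

def subtitle_pids(streams_json_text, language=None):
    # Return the PIDs (ffprobe's "id" field, a hex string for TS input)
    # of subtitle streams, optionally filtered by language tag.
    data = json.loads(streams_json_text)
    pids = []
    for s in data.get('streams', []):
        if language and s.get('tags', {}).get('language') != language:
            continue
        if 'id' in s:
            pids.append(s['id'])
    return pids

# Illustrative ffprobe-shaped JSON for a TS file with one DVB subtitle track:
sample = ('{"streams":[{"index":3,"id":"0x1200",'
          '"codec_name":"dvb_subtitle","tags":{"language":"eng"}}]}')
print(subtitle_pids(sample))         # ['0x1200']
print(subtitle_pids(sample, 'eng'))  # ['0x1200']
```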
- Convert DVB subtitles to images (BDSup2Sub or srt2dvbsub/png-only mode)
Code
# If using srt2dvbsub's PNG-only mode:
srt2dvbsub --png-only --png-dir ./pngout --input subtitles.dump
Or use an existing toolchain (ProjectX -> BDSup2Sub) to produce PNG frames.
- Preprocess images (ImageMagick)
Code
mkdir -p processed
for f in pngout/*.png; do
  convert "$f" -resize 2000x -colorspace Gray -sharpen 0x1 -threshold 70% "processed/$(basename "$f")"
done
- OCR with Tesseract (per-image -> HOCR or plain text)
Code
for f in processed/*.png; do
  tesseract "$f" "${f%.png}" -l eng --psm 6
done
- Stitch OCR output into timed SRT (Python script)
- Read PNG filenames (include timestamps encoded by converter or extracted from stream)
- Group lines by frame ranges, merge consecutive identical text, and assign start/end times
- Output .srt with standard format
Minimal Python outline (run as ocr_to_srt.py)
Code
import glob
import os

def png_timestamp(name):
    # Assumes filenames like frame_000123_12345678.png where the last
    # underscore-separated field is a timestamp in milliseconds.
    parts = os.path.splitext(os.path.basename(name))[0].split('_')
    return int(parts[-1]) / 1000.0

pngs = sorted(glob.glob('processed/*.png'), key=png_timestamp)

entries = []
for i, p in enumerate(pngs):
    txtfile = p.rsplit('.', 1)[0] + '.txt'
    if not os.path.exists(txtfile):
        continue
    text = open(txtfile, encoding='utf-8').read().strip()
    if not text:
        continue
    start = png_timestamp(p)
    # A frame's text is shown until the next frame; pad the last one.
    end = png_timestamp(pngs[i + 1]) if i + 1 < len(pngs) else start + 2.0
    entries.append((start, end, text))

# Merge consecutive entries with identical text into a single cue.
merged = []
for s, e, t in entries:
    if merged and merged[-1][2] == t:
        merged[-1] = (merged[-1][0], e, t)
    else:
        merged.append((s, e, t))

def fmt(ts):
    # SRT timestamps are HH:MM:SS,mmm
    ms = int(round(ts * 1000))
    h, rem = divmod(ms, 3600000)
    m, rem = divmod(rem, 60000)
    s, ms = divmod(rem, 1000)
    return f'{h:02d}:{m:02d}:{s:02d},{ms:03d}'

with open('output.srt', 'w', encoding='utf-8') as out:
    for i, (s, e, t) in enumerate(merged, 1):
        out.write(f'{i}\n{fmt(s)} --> {fmt(e)}\n{t}\n\n')
Batch/Automation script (bash)
- Detect subtitle PID with ffprobe (jq), extract, convert, OCR, run Python stitcher. Parallelize OCR with GNU parallel. Sketch:
Code
#!/bin/bash
set -euo pipefail
for f in /path/to/input/*.ts; do
  base=$(basename "$f" .ts)
  # ffprobe reports each TS stream's PID in its "id" field (hex string).
  pid=$(ffprobe -v error -select_streams s -show_entries stream=id -of json "$f" \
        | jq -r '.streams[0].id')
  dvbsnoop -s ts -b -if "$f" "$pid" > "${base}.dump"
  srt2dvbsub --png-only --png-dir "${base}_png" --input "${base}.dump"
  mkdir -p "${base}_proc"
  for p in "${base}_png"/*.png; do
    convert "$p" -resize 2000x -colorspace Gray -threshold 70% "${base}_proc/$(basename "$p")"
  done
  ls "${base}_proc"/*.png | parallel tesseract {} {.} -l eng --psm 6
  python3 ocr_to_srt.py --png-dir "${base}_proc" --out "${base}.srt"
done
Accuracy & post-processing tips
- Use language-specific traineddata and dictionaries to improve OCR.
- Apply morphological cleanup: remove artifacts, normalize punctuation, fix common OCR confusions (0/O, l/1).
- Use heuristics for timing: extend end-time if same text appears across multiple frames; drop very short spurious entries (<140 ms).
- Consider a spellcheck pass (aspell or hunspell) targeted at the subtitle language.
- For multi-language tracks, set per-track language for Tesseract.
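The cleanup tips above can be sketched as small helper functions. The substitution rules are illustrative assumptions to be extended for your language and font:

```python
import re

def clean_line(text):
    # Fix common OCR confusions: "0" misread between letters -> "O",
    # lone "l" before a digit -> "1", "|" misread for capital "I".
    text = re.sub(r'(?<=[A-Za-z])0(?=[A-Za-z])', 'O', text)
    text = re.sub(r'\bl(?=\d)', '1', text)
    text = re.sub(r'[|]', 'I', text)
    # Collapse runs of whitespace left behind by OCR artifacts.
    return re.sub(r'\s+', ' ', text).strip()

def drop_spurious(cues, min_ms=140):
    # Discard cues shorter than min_ms (matches the heuristic above);
    # cues are (start_seconds, end_seconds, text) tuples.
    return [(s, e, t) for s, e, t in cues if (e - s) * 1000 >= min_ms]
```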
Packaging and muxing back
- To mux the SRT into an MKV: mkvmerge -o out.mkv input.ts output.srt (mkvmerge accepts the SRT file directly as an additional input track; use --language 0:eng output.srt to tag its language).
- To re-generate DVB subtitles for broadcast, use srt2dvbsub to encode SRT back into DVB subtitle stream and mux with ffmpeg/tsmuxer.
Troubleshooting
- No subtitle PID found: inspect PMT with tsduck or dvbsnoop to locate subtitle descriptors.
- Poor OCR: improve preprocessing (contrast, denoise), try different page segmentation modes (tesseract --psm), or train/tune tesseract language data.
- Timing drift: use dvbsnoop/ffprobe timestamps or align via audio/video cues (cross-correlation).
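For the constant-offset case of timing drift, a small helper can shift every timestamp in a finished SRT file by a fixed number of milliseconds (a sketch; progressive drift needs rescaling rather than shifting):

```python
import re

TS = re.compile(r'(\d{2}):(\d{2}):(\d{2}),(\d{3})')

def _shift(match, offset_ms):
    # Decompose HH:MM:SS,mmm, apply the offset, clamp at zero.
    h, m, s, ms = (int(g) for g in match.groups())
    total = max(0, h * 3600000 + m * 60000 + s * 1000 + ms + offset_ms)
    h, rem = divmod(total, 3600000)
    m, rem = divmod(rem, 60000)
    s, ms = divmod(rem, 1000)
    return f'{h:02d}:{m:02d}:{s:02d},{ms:03d}'

def shift_srt(text, offset_ms):
    # Apply a constant offset (positive or negative) to every SRT timestamp.
    return TS.sub(lambda m: _shift(m, offset_ms), text)
```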
Possible next steps:
- A tested end-to-end script tailored to a specific environment (Linux/macOS/Windows WSL) and sample filenames.
- A Dockerfile that bundles the toolchain for reproducible runs.