How to Convert DVBSub2Text: Step-by-Step Guide for Accurate Subtitle Extraction

Automating Subtitle Extraction with DVBSub2Text — Workflow and Scripts

Overview

A reliable automated pipeline converts DVB image-based subtitles in transport streams into editable text (SRT/VTT) by: 1) extracting the subtitle streams, 2) decoding the image subtitles to PNG frames, 3) running OCR, and 4) reconstructing timings, cleaning up the text, and packaging the result. Below is a practical, ready-to-run workflow with example scripts using common tools.

Required tools (assumed installed)

  • ffmpeg / ffprobe
  • dvbsnoop or ProjectX (subtitle stream extraction)
  • BDSup2Sub (or similar) to produce images (PNG) from DVB subtitles
  • Tesseract OCR with appropriate language data
  • Python 3 for orchestration and post-processing
  • optional: ImageMagick for preprocessing, jq for JSON handling

High-level workflow

  1. Detect subtitle PID(s) from the TS file.
  2. Extract DVB subtitle PES packets to a dump.
  3. Convert DVB subtitle dump to per-frame PNGs.
  4. Preprocess PNGs (deskew, threshold, crop).
  5. Run Tesseract OCR on cleaned PNGs to produce text lines.
  6. Reconstruct timings into SRT/VTT and perform text cleanup (spellcheck, dictionary).
  7. Package SRT/VTT and optionally mux back into TS or store alongside media.
  8. (Optional) Batch mode: iterate over directory, apply templates, and parallelize OCR.

Example commands (single-file, basic)

  1. Detect PIDs (ffprobe)

Code

ffprobe -v error -select_streams s -show_entries stream=index,id,codec_name:stream_tags=language -of json input.ts > streams.json
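A short sketch of how the resulting streams.json can be parsed to pick out DVB subtitle PIDs, assuming ffprobe was asked to report the stream `id` field (for MPEG-TS input this is the PID, as a hex string) alongside `codec_name` and language tags:

```python
import json

def subtitle_pids(ffprobe_json):
    """Return (pid, language) pairs for dvb_subtitle streams."""
    streams = json.loads(ffprobe_json).get("streams", [])
    pids = []
    for s in streams:
        if s.get("codec_name") == "dvb_subtitle":
            # ffprobe reports the TS PID as a hex string such as "0x1200"
            pid = int(s.get("id", "0x0"), 16)
            lang = s.get("tags", {}).get("language", "und")
            pids.append((pid, lang))
    return pids

# Illustrative sample of ffprobe's JSON shape:
sample = ('{"streams":[{"index":3,"codec_name":"dvb_subtitle",'
          '"id":"0x1200","tags":{"language":"eng"}}]}')
print(subtitle_pids(sample))  # [(4608, 'eng')]
```

Feeding the PID list into the extraction step avoids hard-coding 0x1200.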
  2. Extract subtitle packets (dvbsnoop)

Code

dvbsnoop -s pes -if input.ts -b 0x1200 > subtitles.dump

(Adjust PID from streams.json.)

  3. Convert DVB subtitles to images (BDSup2Sub, or srt2dvbsub in PNG-only mode)

Code

# If using srt2dvbsub with PNG-only output
srt2dvbsub --png-only --png-dir ./pngout --input subtitles.dump

Or use existing toolchain (ProjectX -> BDSup2Sub) to produce PNG frames.

  4. Preprocess images (ImageMagick)

Code

mkdir -p processed
for f in pngout/*.png; do
  convert "$f" -resize 2000x -colorspace Gray -sharpen 0x1 -threshold 70% "processed/$(basename "$f")"
done
  5. OCR with Tesseract (per-image -> HOCR or plain text)

Code

for f in processed/*.png; do
  tesseract "$f" "${f%.png}" -l eng --psm 6
done
  6. Stitch OCR output into timed SRT (Python script)
  • Read PNG filenames (include timestamps encoded by converter or extracted from stream)
  • Group lines by frame ranges, merge consecutive identical text, and assign start/end times
  • Output .srt with standard format

Minimal Python outline (save as ocr_to_srt.py)

python

import argparse, glob, os

ap = argparse.ArgumentParser()
ap.add_argument('--png-dir', default='processed')
ap.add_argument('--out', default='output.srt')
args = ap.parse_args()

def png_timestamp(name):
    # assume filenames like frame_000123_456789.png, where the last
    # underscore-separated field is the presentation time in ms
    parts = os.path.splitext(os.path.basename(name))[0].split('_')
    return int(parts[-1]) / 1000.0

pngs = sorted(glob.glob(os.path.join(args.png_dir, '*.png')), key=png_timestamp)
entries = []
for i, p in enumerate(pngs):
    txtfile = p.rsplit('.', 1)[0] + '.txt'
    if not os.path.exists(txtfile):
        continue
    text = open(txtfile, encoding='utf-8').read().strip()
    if not text:
        continue
    start = png_timestamp(p)
    end = png_timestamp(pngs[i + 1]) if i + 1 < len(pngs) else start + 2.0
    entries.append((start, end, text))

# merge consecutive identical texts into one cue
merged = []
for s, e, t in entries:
    if merged and merged[-1][2] == t:
        merged[-1] = (merged[-1][0], e, t)
    else:
        merged.append((s, e, t))

def fmt(ts):
    # SRT timestamps use the form HH:MM:SS,mmm
    h, rem = divmod(ts, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}".replace('.', ',')

with open(args.out, 'w', encoding='utf-8') as out:
    for i, (s, e, t) in enumerate(merged, 1):
        out.write(f"{i}\n{fmt(s)} --> {fmt(e)}\n{t}\n\n")
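For reference, each cue the stitcher emits should follow the standard SRT layout: a sequence number, a timing line with a comma before the milliseconds and `-->` between start and end, then the text and a blank line:

```
1
00:00:01,000 --> 00:00:03,500
Hello, world.
```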

Batch/Automation script (bash)

  • Detect subtitle PID with ffprobe (jq), extract, convert, OCR, run Python stitcher. Parallelize OCR with GNU parallel. Sketch:

Code

#!/bin/bash
set -euo pipefail
for f in /path/to/input/*.ts; do
  base=$(basename "$f" .ts)
  # "id" is the MPEG-TS PID, reported by ffprobe as a hex string
  pid=$(ffprobe -v error -select_streams s -show_entries stream=id -of json "$f" \
        | jq -r '.streams[0].id')
  dvbsnoop -s pes -if "$f" -b "$pid" > "${base}.dump"
  srt2dvbsub --png-only --png-dir "${base}_png" --input "${base}.dump"
  mkdir -p "${base}_proc"
  mogrify -path "${base}_proc" -resize 2000x -colorspace Gray -threshold 70% "${base}_png"/*.png
  ls "${base}_proc"/*.png | parallel tesseract {} {.} -l eng --psm 6
  python3 ocr_to_srt.py --png-dir "${base}_proc" --out "${base}.srt"
done

Accuracy & post-processing tips

  • Use language-specific traineddata and dictionaries to improve OCR.
  • Apply morphological cleanup: remove artifacts, normalize punctuation, fix common OCR confusions (0/O, l/1).
  • Use heuristics for timing: extend end-time if same text appears across multiple frames; drop very short spurious entries (<140 ms).
  • Consider a spellcheck/pass (aspell or hunspell) targeted to subtitle language.
  • For multi-language tracks, set per-track language for Tesseract.
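The cleanup and timing heuristics above can be sketched as a small post-processing pass; the confusion patterns and the 140 ms threshold are illustrative and should be tuned per language and source material:

```python
import re

# Common OCR confusions inside words (illustrative, English-oriented)
CONFUSIONS = [
    (re.compile(r"(?<=[A-Za-z])0(?=[A-Za-z])"), "O"),  # 0 inside a word -> O
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),        # 1 between lowercase -> l
    (re.compile(r"\|"), "I"),                          # pipe misread as I
]

def clean_entries(entries, min_dur=0.14):
    """entries: list of (start_s, end_s, text); returns a cleaned list."""
    out = []
    for s, e, t in entries:
        if e - s < min_dur:        # drop spurious, very short flashes
            continue
        for pat, rep in CONFUSIONS:
            t = pat.sub(rep, t)
        out.append((s, e, t))
    return out

print(clean_entries([(0.0, 0.05, "x"), (1.0, 3.0, "C0ME fi1m |t")]))
# [(1.0, 3.0, 'COME film It')]
```

A spellcheck pass (aspell/hunspell) can then be run over the surviving cue texts.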

Packaging and muxing back

  • To mux the SRT into an MKV: mkvmerge -o out.mkv input.ts --language 0:eng output.srt
  • To re-generate DVB subtitles for broadcast, use srt2dvbsub to encode SRT back into DVB subtitle stream and mux with ffmpeg/tsmuxer.

Troubleshooting

  • No subtitle PID found: inspect PMT with tsduck or dvbsnoop to locate subtitle descriptors.
  • Poor OCR: improve preprocessing (contrast, denoise), try other page segmentation modes (tesseract --psm), or train/tune tesseract language data.
  • Timing drift: use dvbsnoop/ffprobe timestamps or align via audio/video cues (cross-correlation).

