Subtitle encoding and mojibake, explained

When subtitles turn into Ã©, Ð·Ð´, or rows of �, the timing is fine; the text is being read with the wrong character encoding. Here's why it happens, language by language, and how to fix it for good.

Reference · updated June 2026

A text file doesn't store letters; it stores bytes. An encoding is the lookup table that turns those bytes back into characters. Get the table wrong and you get gibberish — even though every byte is intact and the timing is perfect. This is the single most common reason non-English subtitles look broken.

Bytes, code pages and UTF-8

For the first 128 values (plain English letters, digits, punctuation), almost every encoding agrees — that's ASCII. The trouble starts above 127. For decades, each region used its own code page to map those high bytes: Windows-1252 for Western Europe, Windows-1251 for Cyrillic, Windows-1256 for Arabic, ISO-8859-7 for Greek, and so on. Each maps the same byte to a different letter.
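One way to see this is to push a single high byte through several tables. The sketch below uses Python's built-in codecs; the helper name `decode_with` is made up for illustration.

```python
def decode_with(raw: bytes, codecs: list[str]) -> dict[str, str]:
    """Decode the same bytes once per code page and collect the results."""
    return {name: raw.decode(name) for name in codecs}


# The single byte 0xE9 read through three regional code pages.
table = decode_with(b"\xe9", ["cp1252", "cp1251", "iso8859_7"])
for name, letter in table.items():
    print(name, "->", letter)
# cp1252    gives é (Western European)
# cp1251    gives й (Cyrillic)
# iso8859_7 gives ι (Greek)
```

The byte never changes; only the table used to interpret it does.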

UTF-8 replaced this mess. It can represent every character in Unicode and is now the default on the web and in most modern software. The catch: SRT files carry no declaration of their own encoding, so many modern players simply assume UTF-8. Feed one an old code-page file and it reads the bytes through the wrong table.
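Because the file itself says nothing about its encoding, about the only cheap check available is whether the bytes are well-formed UTF-8. A minimal sketch, assuming Python and a hypothetical helper named `is_valid_utf8`:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 in strict mode."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'
legacy_bytes = "café".encode("cp1252")   # b'caf\xe9'

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(legacy_bytes))  # False: a lone 0xE9 is not valid UTF-8
```

Passing this check does not prove the file was meant as UTF-8 (pure ASCII passes too), but failing it is a strong hint that an old code page is involved.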

The two faces of broken text

There are two distinct failure modes, and they look different:

  • Mojibake — readable-looking but wrong characters. UTF-8 Cyrillic read as Windows-1252 turns здрав into Ð·Ð´Ñ€Ð°Ð²; a French é becomes Ã©. The bytes decoded to something, just the wrong something.
  • The replacement character � — appears when a byte sequence is invalid for the assumed encoding, so the decoder gives up on it. Rows of � usually mean a multi-byte file (like Chinese GBK) read as a single-byte encoding, or vice versa.
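Both failure modes are easy to reproduce. The following sketch, in Python, deliberately decodes the same word through the wrong table in two different ways; the variable names are illustrative only.

```python
original = "здрав"

# Failure mode 1, mojibake: UTF-8 bytes decoded through the Windows-1252 table.
# Every byte maps to *some* character, so nothing is flagged as an error.
mojibake = original.encode("utf-8").decode("cp1252")

# Failure mode 2, replacement characters: Windows-1251 bytes decoded as UTF-8.
# The bytes do not form valid UTF-8 sequences, so the decoder substitutes U+FFFD.
replaced = original.encode("cp1251").decode("utf-8", errors="replace")

print(mojibake)   # Ð·Ð´Ñ€Ð°Ð²
print(replaced)   # a row of U+FFFD replacement characters
```

In the first case the information is still recoverable by reversing the decode; in the second, the replaced text has lost the original bytes, so the fix has to start again from the source file.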

Why it breaks, language by language

  • Cyrillic (Russian, Ukrainian, Bulgarian…) — old files are typically Windows-1251 or KOI8-R. Read as UTF-8 they turn into rows of �; the long strings of Ð and Ñ pairs appear in the opposite case, when a UTF-8 Cyrillic file is read as Windows-1252.
  • Arabic / Persian — Windows-1256. Misread, it produces scattered Latin accents and symbols; the right-to-left direction can also make a correctly-decoded file look odd if the player lacks RTL support, which is a separate issue.
  • Greek — ISO-8859-7 or Windows-1253. Easily confused with Cyrillic by automatic detectors because both remap the same byte ranges to letters.
  • Central European (Polish, Czech, Hungarian…) — Windows-1250. The accented letters (ł, ő, ě) are the ones that corrupt.
  • Chinese, Japanese, Korean — multi-byte encodings (GBK, Big5, Shift-JIS, EUC-KR). When misread as a single-byte encoding you get a mix of garbage and �, because the byte pairs don't line up.

The BOM

A byte-order mark is an optional invisible marker at the very start of a file (the bytes EF BB BF for UTF-8) that announces its encoding. It helps some Windows players detect UTF-8, but a few older players display it as a stray ï»¿ at the start of the first subtitle, or refuse the file. The safe default is UTF-8 without a BOM; add one only if a specific player needs it. The encoding fixer lets you choose.
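Adding or removing the BOM is a three-byte operation. Here is a minimal sketch in Python, using the standard `codecs.BOM_UTF8` constant; the two helper names are hypothetical and are not taken from the encoding fixer.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


def add_utf8_bom(raw: bytes) -> bytes:
    """Prepend a UTF-8 BOM unless the data already starts with one."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw
    return codecs.BOM_UTF8 + raw


with_bom = add_utf8_bom("1\n00:00:01,000 --> 00:00:02,000\nHello\n".encode("utf-8"))

# What a BOM-unaware player shows if it falls back to Windows-1252:
print(with_bom[:3].decode("cp1252"))     # ï»¿

# Python's "utf-8-sig" codec drops the BOM while decoding.
print(with_bom.decode("utf-8-sig")[:1])  # 1
```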

The sneaky one: double-encoded UTF-8

A particularly confusing failure: text that was already UTF-8 gets read as Windows-1252 and saved again as UTF-8. Now é shows as Ã©, an em dash (—) as â€”, and a curly apostrophe (’) as â€™. The file is technically valid UTF-8, so naïve detectors declare it fine and leave it broken. The cure is to reverse the extra layer: encode the text back to Windows-1252 bytes and decode those bytes as UTF-8, once. Our tool detects this pattern automatically and fixes it.
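The reversal is a two-call round trip. The sketch below shows the idea in Python; `undo_double_encoding` is a hypothetical helper, not the tool's actual implementation, and it simply leaves the text alone when the round trip fails.

```python
def undo_double_encoding(text: str) -> str:
    """Reverse one layer of 'UTF-8 read as Windows-1252' damage, if present."""
    try:
        # Turn the wrong characters back into the bytes they came from,
        # then decode those bytes the way they should have been read.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Windows-1252, or the bytes are not valid UTF-8:
        # the text was probably not double-encoded, so leave it as it is.
        return text


print(undo_double_encoding("Ã©"))     # é
print(undo_double_encoding("â€™"))    # ’
print(undo_double_encoding("café"))   # café (already correct, left unchanged)
```

One caveat: Python's `cp1252` codec leaves five byte values undefined, so text that passed through a more lenient decoder may not round-trip and will be returned unchanged by this sketch.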

Why automatic detection is hard

Detecting a single-byte encoding from bytes alone is genuinely ambiguous: the same bytes are valid Cyrillic and valid Arabic and valid Greek — each just a different alphabet. Good detectors use letter frequency to guess which language the result most resembles, but short files don't give much to go on. That's why a trustworthy encoding fixer offers a live preview and a manual override: the machine guesses, you confirm with your eyes. Our encoding fixer ranks the candidates, shows the before-and-after per cue, and lets you pick the encoding if the guess is off.
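To make the letter-frequency idea concrete, here is a deliberately crude sketch in Python. It is not how the encoding fixer works: the candidate list, the hand-picked "common letters" strings, and the function name `rank_encodings` are all assumptions made for illustration, and a real detector would use proper frequency tables per language.

```python
# Hypothetical, hand-picked common letters for the language behind each code page.
COMMON_LETTERS = {
    "cp1251": "оеаинтсрвл",     # Russian
    "iso8859_7": "αοιετσνηυρ",  # Greek
    "cp1252": "etaoinshrd",     # English
}


def rank_encodings(raw: bytes) -> list[tuple[str, float]]:
    """Decode raw with each candidate and score it by its share of common letters."""
    scores = []
    for codec, common in COMMON_LETTERS.items():
        try:
            text = raw.decode(codec).lower()
        except UnicodeDecodeError:
            continue  # some byte is undefined in this code page: drop the candidate
        hits = sum(ch in common for ch in text)
        scores.append((codec, hits / max(len(text), 1)))
    # Highest score first; this is a ranking for a human to confirm, not a verdict.
    return sorted(scores, key=lambda item: item[1], reverse=True)


sample = "привет, как дела".encode("cp1251")
for codec, score in rank_encodings(sample):
    print(f"{codec}: {score:.2f}")
```

With only a handful of letters to count, the scores for a short cue can end up close together, which is why the ranking is best treated as a suggestion.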

Fixing it for good

  1. Open the garbled file in the encoding fixer.
  2. Check the preview. If the text reads correctly, you're done; if not, choose the encoding from the dropdown until it does.
  3. Download the result — clean UTF-8, which every modern player reads.
  4. If the file also has structural faults, follow up with the SRT repair tool.

Once a file is saved as UTF-8, the problem is gone permanently — there's no regional guesswork left for a player to get wrong.
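For anyone who prefers to script the conversion once the source encoding is known, the same steps can be sketched in a few lines of Python. The function names and the sample cue are hypothetical; the source encoding has to come from a confirmed guess, as described above.

```python
from pathlib import Path


def to_utf8(raw: bytes, source_encoding: str, bom: bool = False) -> bytes:
    """Re-encode subtitle bytes from a known source encoding to UTF-8."""
    # Raises UnicodeDecodeError if the bytes are invalid in source_encoding.
    text = raw.decode(source_encoding)
    # "utf-8-sig" prepends the BOM; plain "utf-8" writes none (the safe default).
    return text.encode("utf-8-sig" if bom else "utf-8")


def convert_file(src: Path, dst: Path, source_encoding: str, bom: bool = False) -> None:
    """Read src as bytes, convert, and write dst; line endings are left untouched."""
    dst.write_bytes(to_utf8(src.read_bytes(), source_encoding, bom))


legacy = "1\n00:00:01,000 --> 00:00:02,500\nЗдравствуйте\n".encode("cp1251")
fixed = to_utf8(legacy, "cp1251")
print(fixed.decode("utf-8"))
```

Working on bytes rather than opening the file in text mode avoids any silent newline translation, so only the encoding changes.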
