The subtle art of subtitles and captions

Matthijs Langendijk
7 min read · Aug 1, 2023

That’s right! Subtitles are an art form. You can have incredibly good subtitles, and terrible ones. It comes down to whether they are formatted properly, use the right wording, and are timed correctly to the video. Oh, and captions are important, too! Wait, you didn’t know they were different things? Let me explain!

Context: The specifications and requirements differ quite a bit depending on the medium in which captions and subtitles are used. For this blog, I’m mainly going to focus on the styling requirements in the context of OTT (or digital media). Other contexts, like traditional analogue television, will not be covered here.

The difference between subtitles and captions

While both subtitles and captions follow a similar principle (a textual version of the video), they serve very distinct purposes. Both subtitles and captions are so-called ‘timed text applications’. As the name suggests, a piece of text is displayed based on the current playback time of the video, to provide textual context for the picture and audio being shown. And that’s actually already where the distinction begins, too.
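
To make the ‘timed text’ idea concrete, here is a minimal sketch in Python of how a player could pick the text to show for the current playback time. The `Cue` class and the `active_cue` function are hypothetical names made up for this illustration, and the sketch assumes cues are sorted by start time and never overlap, which real files don’t always guarantee.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cue:
    start: float  # seconds from the start of the video
    end: float    # seconds; the cue is hidden from this moment on
    text: str


def active_cue(cues: list[Cue], current_time: float) -> Optional[Cue]:
    """Return the cue to show at current_time, or None between cues.

    Assumption: cues are sorted by start time and do not overlap.
    """
    starts = [cue.start for cue in cues]
    # Index of the last cue that started at or before current_time.
    index = bisect_right(starts, current_time) - 1
    if index >= 0 and current_time < cues[index].end:
        return cues[index]
    return None


cues = [
    Cue(0.0, 2.5, "Welcome to the Subtitle Blog!"),
    Cue(3.0, 6.0, "In this blog, we explore the fascinating world of subtitles."),
]
print(active_cue(cues, 1.0))  # the first cue
print(active_cue(cues, 2.7))  # None, because 2.7s falls in the gap between cues
```

A real player would run a lookup like this on every time update of the video and re-render only when the result changes.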

Subtitles are primarily used for translating spoken words in video. Their main purpose lies in providing access to video content for people who don’t understand the language spoken in the video. Any time you’re watching your favourite English-language series on Netflix with Dutch, French or Spanish text below it, you’re likely reading subtitles.

Captions, however, serve a different purpose, one that focuses a lot more on accessibility. Being created for the deaf and hard-of-hearing audience, captions generally provide a lot more textual detail about the video. For example, captions will explain that music is playing, or that a door has been closed with a hard ‘thonk’ sound, in addition to showing any spoken words. In short, captions fully transcribe the dialogue, sound effects and music in the video.

Caption style requirements

As is typical of how the world works, different countries (and even different organisations within countries) have different requirements. Whereas the United States of America follows the CTA-708 standard based on FCC regulations, the United Kingdom (and many other European countries) uses the EBU-TT-D standard from the EBU, the European Broadcasting Union. Both standards define various requirements and options for how to display captions and subtitles. Luckily, most standards focus on the same topics in principle, and often even have exactly the same requirements. I would argue that the FCC has somewhat stricter requirements than the BBC does. If you’re implementing subtitles or captions, following the FCC requirements will likely mean you cover most other countries’ requirements as well.

So what do all of these requirements actually focus on? Generally, they define a set of options that users must be able to choose from. The FCC’s rules under the Twenty-First Century Communications and Video Accessibility Act of 2010 (CVAA) define such a set of style and display options. Summing them up, users must be able to change the appearance of their subtitles in several ways: character colour, size and opacity; display font; caption background colour and opacity; window colour and opacity; and of course language. You can find similar requirements on the website of the BBC, for example.

What’s good about all of these standards is that they define a default style. And that style actually is what most people prefer, too: white text on a black background, in a readable default font (often a sans-serif font). Still, the requirement usually stands that users need to be able to change the appearance, as per the options listed above.
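
As an illustration, the options listed above could be modelled as a small settings object with the common default (white text on black) filled in. This is only a sketch: the `CaptionStyle` class, its field names and the `to_css` helper are assumptions made up for this example, not part of any standard or platform API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CaptionStyle:
    """Hypothetical container for the user-adjustable caption options."""
    font_family: str = "sans-serif"
    font_size_percent: int = 100
    text_color: tuple[int, int, int] = (255, 255, 255)      # white text
    text_opacity: float = 1.0
    background_color: tuple[int, int, int] = (0, 0, 0)      # black box behind the text
    background_opacity: float = 1.0
    window_color: tuple[int, int, int] = (0, 0, 0)
    window_opacity: float = 0.0                             # window hidden by default


def rgba(color: tuple[int, int, int], opacity: float) -> str:
    """Format a colour plus opacity as a CSS rgba() value."""
    red, green, blue = color
    return f"rgba({red}, {green}, {blue}, {opacity})"


def to_css(style: CaptionStyle) -> dict[str, dict[str, str]]:
    """Translate the style into CSS declarations for the text and the window."""
    return {
        "text": {
            "font-family": style.font_family,
            "font-size": f"{style.font_size_percent}%",
            "color": rgba(style.text_color, style.text_opacity),
            "background-color": rgba(style.background_color, style.background_opacity),
        },
        "window": {
            "background-color": rgba(style.window_color, style.window_opacity),
        },
    }


default_style = CaptionStyle()
# A user who prefers larger, yellow text only overrides those two options.
large_yellow = replace(default_style, text_color=(255, 255, 0), font_size_percent=150)
print(to_css(large_yellow)["text"])
```

Keeping every option in one immutable object makes it straightforward to store the user’s choice and to fall back to the default for anything they haven’t changed.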

Subtitle file formats

As with many standards in the world, I’ll always gladly refer to the xkcd comic ‘Standards’, in which an attempt to unify 14 competing standards simply results in 15 competing standards.

While it’s not as bad as with, for example, JavaScript frameworks, caption and subtitle file formats definitely come with competing and overlapping standards that need to be taken into account when displaying captions.

SubRip Text (SRT)

SRT, or SubRip Text, is the most widely supported format across the world wide web because of its simplicity. Despite all the different display options now being used for subtitles (like colours, positions, and different styles for different speakers), SRT provides no (official) support for any of them. SRT is a really simple, human-readable format that follows a simple counter- and time-based structure:

1
00:00:00,000 --> 00:00:02,500
Welcome to the Subtitle Blog!

2
00:00:03,000 --> 00:00:06,000
In this blog, we explore
the fascinating world of subtitles.

3
00:00:07,000 --> 00:00:10,000
Discover the importance of
accurate and well-timed subtitles.

4
00:00:11,000 --> 00:00:14,000
Learn about the technology
behind creating subtitles.
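
Because the structure is so simple, reading an SRT file takes very little code. Below is a rough sketch in Python; `parse_srt`, `to_seconds` and the `Cue` class are hypothetical names for this example. It treats the counter line as optional and ignores the unofficial styling tags some SRT files contain.

```python
import re
from dataclasses import dataclass

# Matches "00:00:03,000 --> 00:00:06,000"; a dot is tolerated as separator too.
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


def to_seconds(hours: str, minutes: str, seconds: str, millis: str) -> float:
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    cues = []
    normalised = content.replace("\r\n", "\n").strip()
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalised):
        lines = block.splitlines()
        for position, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                groups = match.groups()
                # Everything after the timing line is the subtitle text.
                text = "\n".join(lines[position + 1:])
                cues.append(Cue(to_seconds(*groups[:4]), to_seconds(*groups[4:]), text))
                break
    return cues


sample = """1
00:00:00,000 --> 00:00:02,500
Welcome to the Subtitle Blog!

2
00:00:03,000 --> 00:00:06,000
In this blog, we explore
the fascinating world of subtitles.
"""
for cue in parse_srt(sample):
    print(cue.start, cue.end, repr(cue.text))
```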

Web Video Text Tracks Format (WebVTT)

Similar in file structure to SRT is WebVTT, a fully web-focused subtitle format with blocks or ‘cues’ much like SRT’s, but with additions for styling and positioning. You’ll recognise the file structure: a similar time display, but with added positioning and style settings:


WEBVTT

STYLE
::cue {
  background-color: rgba(0, 0, 0, 0.7);
  color: #fff;
  font-size: 18px;
  padding: 8px;
  border-radius: 5px;
}

00:00:00.000 --> 00:00:02.500 line:0 position:20% size:60% align:start
Welcome to the Subtitle Blog!

00:00:03.000 --> 00:00:06.000 line:0 position:20% size:60% align:start
In this blog, we explore the fascinating world of subtitles.

00:00:07.000 --> 00:00:10.000 line:0 position:20% size:60% align:start
Discover the importance of accurate and well-timed subtitles.

00:00:11.000 --> 00:00:14.000 line:0 position:20% size:60% align:start
Learn about the technology behind creating subtitles.
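
Since the two formats are so close, a basic SRT-to-WebVTT conversion is mostly a matter of adding the `WEBVTT` line that every WebVTT file starts with and swapping the comma in the timestamps for a dot. The Python sketch below shows that idea; `srt_to_webvtt` is a hypothetical helper name, and it assumes a well-formed SRT input without unofficial styling tags. The SRT counters can stay, because WebVTT accepts a line before the timing line as a cue identifier.

```python
import re

# An SRT timing line such as "00:00:03,000 --> 00:00:06,000".
SRT_TIMING = re.compile(
    r"^(\d+:\d{2}:\d{2}),(\d{3}) --> (\d+:\d{2}:\d{2}),(\d{3})",
    re.MULTILINE,
)


def srt_to_webvtt(srt: str) -> str:
    """Convert SRT text to WebVTT, touching only the timing lines."""
    body = srt.replace("\r\n", "\n").strip()
    # WebVTT uses a dot, not a comma, before the milliseconds.
    body = SRT_TIMING.sub(r"\1.\2 --> \3.\4", body)
    # The file must start with the WEBVTT signature followed by a blank line.
    return "WEBVTT\n\n" + body + "\n"


srt = """1
00:00:00,000 --> 00:00:02,500
Welcome to the Subtitle Blog!

2
00:00:03,000 --> 00:00:06,000
In this blog, we explore
the fascinating world of subtitles.
"""
print(srt_to_webvtt(srt))
```

Commas inside the subtitle text itself are left alone, because the pattern only matches complete timing lines.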

XML-based (the rest)

While SRT and WebVTT are plain-text files, the majority of subtitle formats are actually XML-based. Not only that, they are generally based on (or extend, or are a ‘profile’ of) the TTML, or Timed Text Markup Language, standard. Formerly known as DFXP (Distribution Format Exchange Profile), TTML is an XML standard for displaying subtitles on the web. It provides options similar to WebVTT, but moves away from plain-text files in favour of more descriptive XML-based files.

<?xml version="1.0" encoding="utf-8"?>
<tt xmlns="http://www.w3.org/ns/ttml" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" xmlns:tts="http://www.w3.org/ns/ttml#styling" xmlns:ttm="http://www.w3.org/ns/ttml#metadata" ttp:timeBase="media" ttp:frameRate="24" xml:lang="en">
  <head>
    <metadata><ttm:title>Sample TTML</ttm:title></metadata>
    <styling>
      <style xml:id="s1" tts:textAlign="center" tts:fontFamily="Arial" tts:fontSize="100%"/>
    </styling>
    <layout>
      <region xml:id="bottom" tts:displayAlign="after" tts:extent="80% 40%" tts:origin="10% 50%"/>
      <region xml:id="top" tts:displayAlign="before" tts:extent="80% 40%" tts:origin="10% 10%"/>
    </layout>
  </head>
  <body><div>
    <p xml:id="subtitle1" begin="00:00:00.000" end="00:00:02.500" style="s1" region="top">Welcome to the Subtitle Blog!</p>
    <p xml:id="subtitle2" begin="00:00:03.000" end="00:00:06.000" style="s1" region="bottom">In this blog, we explore the fascinating world of subtitles.</p>
    <p xml:id="subtitle3" begin="00:00:07.000" end="00:00:10.000" style="s1" region="top">Discover the importance of accurate and well-timed subtitles.</p>
    <p xml:id="subtitle4" begin="00:00:11.000" end="00:00:14.000" style="s1" region="bottom">Learn about the technology behind creating subtitles.</p>
  </div></body>
</tt>

As you can see in the example above, it uses an XML-based setup to define styling, positioning and of course the subtitle entries themselves.
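
Reading such a file is a job for any XML parser. As a rough sketch, the Python below pulls the subtitle entries out of a TTML document using the standard library’s `xml.etree.ElementTree`. The `parse_ttml` and `parse_clock_time` helpers are hypothetical names for this example, and the sketch makes some assumptions: the document uses the standard TTML namespace `http://www.w3.org/ns/ttml`, times are written as `hh:mm:ss.fff` (TTML allows other notations too), and nested elements such as line breaks or spans get no special treatment.

```python
import xml.etree.ElementTree as ET

TTML_NS = "http://www.w3.org/ns/ttml"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace behind the xml: prefix


def parse_clock_time(value: str) -> float:
    """Convert an hh:mm:ss.fff clock time into seconds."""
    hours, minutes, seconds = value.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_ttml(document: str) -> list[dict]:
    """Extract id, timing, region and text of every <p> element."""
    root = ET.fromstring(document)
    cues = []
    for paragraph in root.iter(f"{{{TTML_NS}}}p"):
        cues.append({
            "id": paragraph.get(f"{{{XML_NS}}}id"),
            "start": parse_clock_time(paragraph.get("begin")),
            "end": parse_clock_time(paragraph.get("end")),
            "region": paragraph.get("region"),
            # itertext() also collects text inside nested elements.
            "text": "".join(paragraph.itertext()).strip(),
        })
    return cues


sample = """<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
  <body>
    <div>
      <p xml:id="subtitle1" begin="00:00:00.000" end="00:00:02.500" region="top">Welcome to the Subtitle Blog!</p>
      <p xml:id="subtitle2" begin="00:00:03.000" end="00:00:06.000" region="bottom">In this blog, we explore the fascinating world of subtitles.</p>
    </div>
  </body>
</tt>"""

for cue in parse_ttml(sample):
    print(cue["id"], cue["start"], cue["end"], cue["region"], cue["text"])
```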

Now, as I mentioned before, there are many different formats based on TTML. You might call them ‘profiles’, each an extension of TTML. The difference from plain TTML is that they add requirements as defined by, for example, the FCC with CTA-708 and the EBU with EBU-TT-D. In the case of CTA-708, you can use the SDP-US TTML profile to be fully covered for the requirements set forth by the FCC. Similarly for EBU-TT-D, the IMSC1 profile covers all of those requirements (funnily, IMSC1 itself builds on a collection of profiles: EBU-TT, SMPTE-TT and CFF-TT. Confusing stuff!). Most important to note with these profiles is that you can use most (if not all) of the same XML formatting. Different names, similar beast.

Manual styles versus system styles

Because hard requirements are imposed on the makers of electronic devices like phones, televisions and other devices capable of displaying video, you often have the opportunity to simply retrieve subtitle styling from the system settings. That’s however not the case for all devices. The majority of televisions don’t seem to offer those system settings, for example (at least in my personal experience; do let me know if you see something different on your end!).

Taking a look at some of the different system settings from the likes of Android, iOS, Roku and Amazon FireTV, you’ll see that they all offer styling options that cover the use cases of the FCC. So in the case of these platforms, all you have to do is retrieve the styles and display your subtitles accordingly. With TVs from, for example, Samsung and LG, you’ll also have to build the user interface for changing the subtitle styles yourself.
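
In an app that targets both kinds of devices, that choice can be captured in one small function. The Python sketch below is only an illustration of the decision, not a real platform API: `resolve_caption_style`, `DEFAULT_STYLE` and the dictionary keys are all made-up names, and how you actually read the system settings differs per platform.

```python
from typing import Optional

# Fallback used when neither the system nor the user specifies an option.
DEFAULT_STYLE = {
    "text_color": "#ffffff",
    "background_color": "#000000",
    "font_family": "sans-serif",
    "font_size_percent": 100,
}


def resolve_caption_style(system_style: Optional[dict], app_style: Optional[dict]) -> dict:
    """Pick the caption style to render with.

    system_style: options read from the platform's settings, or None when the
                  platform exposes no caption settings (assumed here for many TVs).
    app_style:    options the user picked in the app's own settings screen.
    """
    if system_style is not None:
        chosen = system_style          # the platform settings win when they exist
    else:
        chosen = app_style or {}       # otherwise use the app's own settings
    # Any option that was not specified falls back to the default style.
    return {**DEFAULT_STYLE, **chosen}


# A phone-like platform that exposes system settings:
print(resolve_caption_style({"text_color": "#ffff00"}, None))
# A TV-like platform without system settings, using the in-app choice:
print(resolve_caption_style(None, {"font_size_percent": 150}))
```

Letting the system settings take priority is a design choice in this sketch; it means users who already configured captions on their device don’t have to do it again per app.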

Roku caption styling options
Amazon FireTV caption styling options
Android caption styling options


Subtitles and captions are incredibly useful and provide great accessibility options for many people. Whether it’s different languages, or options for deaf or hard-of-hearing folks, subtitles and captions massively improve the experience of watching video content.

There are however a lot of different requirements that device and app makers have to deal with when it comes to displaying, styling and positioning subtitles. Different rules depending on the country, different file formats to implement, and even differences depending on the device you’re building an app for. While it might not be the easiest thing to do, I’m 100% certain your users will thank you for it when they get a massively improved experience from a clean caption or subtitle implementation.