A Guide to Closed Captioning for Web, Mobile, and Connected TV

Captioning is coming to Internet video. Legislation goes into effect in the US during 2012 and 2013 that mandates closed captioning on certain categories of online content – see Zencoder's post for details on the legislation. But even apart from this legislation, closed captioning is a good thing for accessibility and usability, and is yet another milestone as Internet video marches towards maturity.

Unfortunately, closed captioning is not a single technology or “feature” of video that can be “turned on”. There are a number of formats, standards, and approaches, ranging from good to bad to ugly. Closed captioning is kind of a mess, just like the rest of digital video, and is especially challenging for multiscreen publishers.

How Closed Captions Work
The first thing to understand is how closed captions are delivered, stored, and read. There are two main approaches today:

  • Embedded within a video: CEA-608, CEA-708, DVB-T, DVB-S, WST. These caption formats are written directly in a video file, either as a data track or embedded into the video stream itself. Broadcast television uses this approach, as does iOS.

  • Stored as a separate file: DFXP, SAMI, SMPTE-TT, TTML, EBU-TT (XML), WebVTT, SRT (text), SCC, EBU-STL (binary). These formats pass caption information to a player alongside a video, rather than being embedded in the video itself. This approach is usually used for browser-based video playback (Flash, HTML5).

What about subtitles? Are they the same thing as closed captions? It turns out that there are three main differences:
  • Goals: Closed captions are an accessibility feature, making video available to the hard of hearing, and may include cues about who is speaking or about what sounds are happening: e.g. “There is a knock at the door”. Subtitles are an internationalization feature, making video available to people who don’t understand the spoken language. In other words, you would use captions to watch a video on mute, and you would use subtitles to watch a video in a language that you don’t understand. (Note that this terminological distinction holds in North America, but much of the world does not distinguish between closed captions and subtitles.)

  • Storage: Historically, captions have been embedded within video, and subtitles have been stored externally. This makes sense conceptually, because captions should always be provided along with a video; 100% accessibility for the hard of hearing is mandated by legislation. Subtitles, on the other hand, are only sometimes needed: a German-language video broadcast in Germany doesn’t need German subtitles, but the same video broadcast in France would.

  • Playback: Since captions are passed along with the video and interpreted/displayed by a TV or other consumer device, viewers can turn them on and off at any time using the TV itself, but rarely have options for selecting a language. When subtitles are added to broadcast content for translation purposes, they are generally hard subtitles and thus cannot be disabled. However, when viewing DVD/Blu-ray/VOD video, the playback device controls whether subtitles are displayed, and in which language.

Formats and Standards
There are dozens of formats and standards for closed captioning and subtitles. Here is a rundown of the most important ones for Internet video:
  • CEA-608 (also called Line 21) captions are the NTSC standard, used by analog television in the United States and Canada. Line 21 captions are encoded directly into a hidden area of the video stream by broadcast playout devices. If you’ve ever seen white bars and dots at the top of a program, that’s Line 21 captioning.

  • An SCC file contains captions in Scenarist Closed Caption format. The file contains SMPTE timecodes with the corresponding encoded caption data as a representation of CEA-608 data.
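
    To illustrate the structure only (the timecodes and hex words below are hypothetical, not taken from a real broadcast), an SCC file starts with a version header and then lists caption events, each pairing a SMPTE timecode with hex-encoded CEA-608 byte pairs:

    Scenarist_SCC V1.0

    00:00:01:00	9420 9420 94ae 94ae c8e5 ecec efa1 942f 942f

    00:00:04:00	942c 942c

    The byte pairs are CEA-608 control codes and character data; roughly, the first line loads and displays a caption ("Hello!") and the second clears it, though real files also include positioning codes.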

  • CEA-708 is the standard for closed captioning for ATSC digital television (DTV) streams in the United States and Canada. There is currently no standard file format for storing CEA-708 captions apart from a video stream.

  • TTML stands for Timed Text Markup Language. TTML describes the synchronization of text and other media such as audio or video. See the W3C TTML Recommendation for more.

    Example:
    <tt xml:lang="" xmlns="http://www.w3.org/ns/ttml">
    <head>
    <styling xmlns:tts="http://www.w3.org/ns/ttml#styling">
    <style xml:id="s1" tts:color="white" />
    </styling>
    </head>
    <body>
    <div>
    <p xml:id="subtitle1" begin="0.76s" end="3.45s">
    Trololololo
    </p>
    <p xml:id="subtitle2" begin="5.0s" end="10.0s">
    lalala
    </p>
    <p xml:id="subtitle3" begin="10.0s" end="16.0s">
    Oh-hahaha-ho
    </p>
    </div>
    </body>
    </tt>

  • DFXP is a profile of TTML defined by W3C. DFXP files contain TTML that defines when and how to display caption data. DFXP stands for Distribution Format Exchange Profile. DFXP and TTML are often used synonymously.

  • SMPTE-TT (Society of Motion Picture and Television Engineers – Timed Text) is an extension of the DFXP profile that adds three extensions for caption data and informational items found in other captioning formats but not in DFXP: #data, #image, and #information. See the SMPTE-TT standard for more.

    SMPTE-TT is also the FCC Safe Harbor format – if a video content producer provides captions in this format to a distributor, they have satisfied their obligation to provide captions in an accessible format. However, video content producers and distributors are free to agree upon a different format.

  • SAMI (Synchronized Accessible Media Interchange) is based on HTML and was developed by Microsoft for products such as Microsoft Encarta Encyclopedia and Windows Media Player. SAMI is supported by a number of desktop video players.
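
    A minimal SAMI sketch (the class name, styling and timings here are invented for illustration) pairs millisecond-based SYNC blocks with HTML-like markup:

    <SAMI>
    <HEAD>
    <STYLE TYPE="text/css">
    <!--
    P { font-family: Arial; }
    .ENUSCC { Name: English; lang: en-US; SAMIType: CC; }
    -->
    </STYLE>
    </HEAD>
    <BODY>
    <SYNC Start=760>
    <P Class=ENUSCC>Trololololo</P>
    </SYNC>
    <SYNC Start=3450>
    <P Class=ENUSCC>&nbsp;</P>
    </SYNC>
    </BODY>
    </SAMI>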

  • EBU-STL is a binary subtitle format standardized by the EBU, stored in separate .STL files. See the EBU-STL specification for more.

  • EBU-TT is a newer format supported by the EBU, based on TTML. EBU-TT is a strict subset of TTML, which means that EBU-TT documents are valid TTML documents, but some TTML documents are not valid EBU-TT documents because they include features not supported by EBU-TT. See the EBU-TT specification for more.

  • SRT is a format created by SubRip, a Windows-based open source tool for extracting captions or subtitles from a video. SRT is widely supported by desktop video players.
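
    For comparison with the TTML example above (and the WebVTT example below), the same hypothetical cues in SRT form consist of a sequence number, a timing line with a comma as the decimal separator, the text, and a blank line:

    1
    00:00:00,760 --> 00:00:03,450
    Trololololo

    2
    00:00:05,000 --> 00:00:10,000
    lalala

    3
    00:00:10,000 --> 00:00:16,000
    Oh-hahaha-ho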

  • WebVTT is a text format that is similar to SRT. The Web Hypertext Application Technology Working Group (WHATWG) has proposed WebVTT as the standard for HTML5 video closed captioning.

    Example:

    WEBVTT

    00:00.760 --> 00:03.450
    <v Eduard Khil>Trololololo

    00:05.000 --> 00:10.000
    lalala

    00:10.000 --> 00:16.000
    Oh-hahaha-ho


  • Hard subtitles (hardsubs) are, by definition, not closed captioning. Hard subtitles are overlaid text that is encoded into the video itself, so that they cannot be turned on or off, unlike closed captions or soft subtitles. Whenever possible, soft subtitles or closed captions are generally preferred, but hard subtitles can be useful when targeting a device or player that does not support closed captioning.

Captioning for Every Device
Which formats are used by which devices and players?
  • Flash video players can be written to parse external caption files. For example, JW Player supports captions in SRT and DFXP format.

  • HTML5 captions are not yet widely supported by browsers, but that will change over time. There are two competing standards: TTML, proposed by W3C, and WebVTT, proposed by WHATWG. At the moment, Chrome has limited support for WebVTT; Safari, Firefox, and Opera are all working on WebVTT support; and Internet Explorer 10 supports both WebVTT and TTML.

    Example:
    <video width="1280" height="720" controls>
    <source src="video.mp4" type="video/mp4" />
    <source src="video.webm" type="video/webm" />
    <track src="captions.vtt" kind="captions" srclang="en" label="English" />
    </video>

    Until browsers support a format natively, an HTML5 player framework like Video.js can support captions through JavaScript, by parsing an external file. (Video.js currently supports WebVTT captions.)

  • iOS takes a different approach: it uses CEA-608 caption data embedded in the video stream, carried using a modified version of CEA-708/ATSC legacy encoding. This means that, unlike Flash and HTML5, captions must be added at the time of transcoding. Zencoder can add captions to HTTP Live Streaming videos for iOS.

  • Android video player support is still fragmented and problematic. Caption support will obviously depend on the OS version and the player used. Flash playback on Android should support TTML, though very little information is available.

  • Some other mobile devices have no support for closed captions at all, and hard subtitles may be the only option.

  • Roku supports captions through external SRT files.

  • Some other connected TV platforms do not support closed captioning yet. But they will soon enough. Every TV, console, cable box, and Blu-Ray player on the market today wants to stream Internet content, and over the next year and a half, closed captioning will become a requirement. So Sony, Samsung, Vizio, Google TV, et al will eventually make caption support a part of their application development frameworks. Unfortunately, it isn’t yet clear what formats will be used. Most likely, different platforms will continue to support a variety of incompatible formats for many years to come.

Closed Captioning for Internet Video: 2012 Edition
The landscape for closed captioning will change and mature over time, but as of 2012, here are the most common requirements for supporting closed captioning on common devices:
  • A web player (Flash, HTML5, or both) with player-side controls for enabling and disabling closed captioning.

  • An external file with caption data, probably using a format like WebVTT, TTML, or SRT. More than one file may be required – e.g. SRT for Roku and WebVTT for HTML5.

  • A transcoder that supports embedded closed captions for HTTP Live Streaming for iPad/iPhone delivery, like Zencoder. Zencoder can accept caption information in a variety of formats, including TTML, so publishers could use a single TTML file for both web playback and as input to Zencoder for iOS video.

Beyond that, things get difficult. Other input formats may be required for other devices, and hard subtitles are probably necessary for 100% compatibility across legacy devices.

Source: Zencoder

Internet TV Systems and Coding

Today's TV viewers want more content from an increasing number of sources, and that means that Internet delivery is a growing phenomenon. With hybrid technologies emerging, it is reasonable to expect that television broadcast will increasingly use the Internet to expand throughput beyond that afforded by a single RF channel. But there are limitations to the Internet that must be understood in order to capitalize on this commodity, and some of those constraints are being overcome by new technologies.

Streaming Can Now Provide a High Quality of Service
In general, Internet TV is a means of providing streamed video content to a PC, STB or Internet-connected TV over an Internet connection. Internet Protocol Television, or IPTV, refers to a special case where a full-time TV subscriber connection is established by means of a dedicated line (and channel) to the telephone system central office. It is envisioned, however, that many Internet TV viewers will get their content through their Internet connection, and as such, receive an OTT video service that shares bandwidth with other Internet traffic.

This sharing of bandwidth creates a QoS challenge for Internet TV service: While a terrestrial channel has a fixed bandwidth (e.g., about 19.4Mb/s for ATSC broadcasts in the U.S.), an Internet TV service must share the bandwidth, both locally (e.g., within a viewer's household) and regionally (e.g., with other subscribers). This means the bandwidth available to a receiver can vary continuously over a wide range, and different subscribers may have different levels of guaranteed service, as well. Lowering the video bit rate to the least common denominator would result in poor video quality for everyone; to deal with this, several technologies are available.

Progressive Download vs. Streaming
The simplest way to deliver video over the Internet is to use progressive download, sometimes called “HTTP streaming.” This is simply a bulk download of a video file to the viewer's terminal (i.e., Internet-connected TV, STB, PC, etc.). A temporary copy of the file is stored on the user's device, typically on a hard drive, and playback can start after a sufficient amount of the file has been downloaded. This means that content will always incur a considerable delay before it is available to be viewed, which makes a live service rather difficult to implement. However, because the files are downloaded using TCP, there can be a nearly 100 percent assurance that every single bit was transferred correctly.

True streaming, on the other hand, opens up a handshaking connection between the server and client using a set of Internet protocols to deliver streams, such as Real Time Streaming Protocol (RTSP), Real Time Messaging Protocol (RTMP) and Microsoft Media Services (MMS). A streaming connection delivers a video stream with minimal buffering, allowing a nearly real-time presentation of the source content. In this respect, streaming has an advantage over progressive download, as continuous delivery is the goal, but the associated downside is that corrupted or missing packets are not retransmitted. The consequence is that audio and video can exhibit ongoing glitches when the network is congested.

Adaptive Bit Rate Streaming
To solve the QoS issue, Adaptive Bit Rate (ABR) streaming has been developed. ABR allows each device to determine the quality of its connection and then use that metric to select the best-coded stream from a number of different quality streams. At the server end, a series of encoders encode a set of multiple streams at different bit rates, and these streams are then sliced up into segments or “chunks.” An ABR client in the viewer terminal detects the incoming stream bandwidth on the fly and uses this, along with a model based on the device's CPU capability, to select a segment among the various streams.

A special manifest file precedes the first segment, providing the client with a list of URLs from which each segment can be accessed. As each segment is received, the client progresses to the next segment in that stream, or it can jump to a parallel segment in one of the other streams if the channel bandwidth changes because of congestion, etc. In principle, a handful of streams will provide enough granularity so that the viewer does not detect a change in picture quality.

Note that ABR provides high transmission bandwidth efficiency when a unicast transmission (i.e., one-to-one) is used, but it can also work well with multicast and broadcast scenarios depending on how well the Internet infrastructure distributes bandwidth to users. ABR has the potential to deliver an audio/visual experience that we have come to expect from linear transmission: low delay, fast start time and a consistent experience across viewers.

Several manufacturers have developed different solutions for ABR streaming. Adobe HTTP Dynamic Streaming (HDS) uses a fragmented format called F4F to deliver Flash video over HTTP (Adobe's earlier RTMP-based dynamic streaming serves a similar purpose). Apple HTTP Live Streaming (HLS) was developed for the iPhone and iPad, and is implemented using HTTP, H.264 and MPEG-2 Transport Streams, with a manifest file called M3U8. Microsoft Internet Information Services (IIS) Smooth Streaming is used within Silverlight and on Windows Phone 7, and incorporates fragmented MP4 (fMP4) encapsulation, again with H.264 for video compression.
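
To make the manifest concept concrete, here is a minimal HLS master playlist of the kind an ABR client downloads first; each EXT-X-STREAM-INF entry advertises one rendition, and the client chooses among the variant playlists as bandwidth changes (the URLs and bit rates are invented for illustration):

    #EXTM3U
    #EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=400000,RESOLUTION=480x270
    low/index.m3u8
    #EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=800000,RESOLUTION=640x360
    mid/index.m3u8
    #EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1500000,RESOLUTION=1280x720
    high/index.m3u8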

With these different enterprise systems, an interoperability problem exists because of proprietary protocols and manifest structures. Multiple ABR systems mean that different devices must either pick and choose which systems to support, leading to service-constrained devices, or must include all at increased cost. This situation has motivated companies and experts around the world to propose a single, standard ABR system.

DASH-ing to the Rescue: a Universal ABR System
MPEG-DASH (Dynamic Adaptive Streaming over HTTP) is a newly standardized method for defining Stream Segments and Manifest Files for the purpose of ABR streaming. The specification (ISO/IEC 23009-1) defines a Media Presentation Description (MPD) that formalizes the stream manifest, which includes Segment timing, URLs and media characteristics such as video resolution and bit rates. While compatible Segments can contain any media data — with arbitrary compression — two types of containers are exemplified in the standard: MPEG-4 file format and MPEG-2 Transport Stream.
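
As a sketch of what an MPD can look like (the element and attribute names follow the published schema, but the durations, URLs and bit rates are invented for illustration), a simple presentation with two video representations might be described like this:

    <MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static"
         mediaPresentationDuration="PT120S"
         profiles="urn:mpeg:dash:profile:isoff-live:2011">
      <Period>
        <AdaptationSet mimeType="video/mp4" segmentAlignment="true">
          <SegmentTemplate timescale="90000" duration="360000"
                           initialization="video_$RepresentationID$_init.mp4"
                           media="video_$RepresentationID$_$Number$.m4s" />
          <Representation id="low" codecs="avc1.42c01e" width="640" height="360" bandwidth="800000" />
          <Representation id="high" codecs="avc1.42c01f" width="1280" height="720" bandwidth="1500000" />
        </AdaptationSet>
      </Period>
    </MPD>

Each Representation corresponds to one of the parallel streams described above, and the SegmentTemplate tells the client how to construct the URL of each four-second Segment.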


MPEG-DASH defines a standard set of Media Presentation Description and Segment Formats
that enable adaptive bit rate IP video streaming.


Because it is a standard system, MPEG-DASH is quickly deployable on the existing Internet infrastructure, using widely deployed standard HTTP servers/caches for scalable delivery. Generic encoders can be reused, with additional descriptive metadata for better client functionality, and legacy manifest files can be converted easily to MPD format, as well as sent in parallel for backward compatibility with low overhead. In addition, existing content and production equipment supporting legacy ABR streaming systems can be used for MPEG-DASH by means of a set of standard Profiles. Apple HLS content can be used with the DASH M2TS Main profile, and Microsoft IIS Smooth Streaming content is suitable for the DASH ISO BMFF (Base Media File Format) Live profile.

Vendors are now proposing integrated workflow and delivery systems supporting ABR with multiple source formats and protocols, on multiple devices. While encoding latency can be an issue for live streams, MPEG-DASH includes a profile optimized for live encoding that can achieve a latency of a few seconds by encoding and immediately delivering short Segments.

In addition to delivering any type of multimedia content, MPEG-DASH supports a broad range of use cases, including live, VOD, time shifting (nPVR), ad insertion and dynamic program updates. MPEG-DASH also addresses the problem of repurposing content for multiple devices with widely ranging capabilities. In principle, an MPEG-DASH-controlled stream can be targeted simultaneously at both large and small screens, fixed and mobile.

Internet Quickly Becoming Viable for Long-Form Content
RF transmission's once-exclusive claim to providing the highest-quality content consumption experience is being challenged by streaming services. But new technologies and business models are giving broadcasters the tools to compete with new service entities, and that's where content distribution is headed.

By Aldo Cugnini, Broadcast Engineering

First Live MPEG-DASH Large Scale Demonstration

During the 2012 London Olympics, VRT offered its audience the chance to experience the Olympic Games broadcast on their personal devices via MPEG-DASH. The public trial allowed a maximum of 1000 concurrent viewers to watch their favourite sport events on a laptop, smartphone or tablet.

The commercial deployment of the MPEG-DASH (Dynamic Adaptive Streaming over HTTP) standard is one step closer with the first live public trial, presented by Belgian public broadcaster VRT. The trial was supported by a number of DASH Promoters Group members: encoding was provided by Elemental, Harmonic and Media Excel; streaming origins were courtesy of Wowza and CodeShop, which also provided encryption; web clients for PC and Android were supplied by Adobe; and BuyDRM provided applications for iOS and Android that incorporate its DRM solution.

This proof of concept was initiated by the European Broadcasting Union, which strongly supports the development of MPEG-DASH as it is a key enabler allowing broadcasters to use a single file and streaming format to deliver content to multiple devices on multiple platforms.

Supported Devices
VRT offered users the following choices for viewing the London Games:

  • PCs and Macs running Adobe Flash.

  • Web browser for Android provided by Adobe.

  • iPhones and iPads running iOS 4.3 or later, with a special app from the iTunes Store (the app is currently pending Apple approval).

  • Android smartphones running version 4.0 or later, via the Sporza Olympic Games app available from the Google Play store.

Used MPEG-DASH Profile
The demonstration featured a live video stream encoded using the MPEG-DASH ISO Base Media File Format Live Profile, delivered through Belgacom’s Content Delivery Network to a range of device categories including tablets, smartphones and PCs running iOS, Android and Windows operating systems. This represents the first large-scale multivendor deployment of MPEG-DASH.

The demonstration was based on an early version of the DASH-264 interoperability guidelines, specifically developed by the DASH Promoters Group for interoperable deployment of the MPEG-DASH standard. DASH-264 provides a general interoperability framework aligned with the HbbTV 1.5 specification and other consortia recommendations. HbbTV 1.5 will be widely used by European broadcasters for interactive services on connected televisions.

DASH details:
  • File Format: ISO BMFF
  • Profile: Live
  • Template: Time based
  • Codecs: H.264 Baseline (ABR) / Audio AAC-LC (SBR)

Encoding settings:

The video consists of six different streams that the player can choose between to adapt playout automatically to the available bandwidth. The highest video quality is 1500 kb/s, and an audio-only stream is also available for when the Internet connection is too slow for video.

The settings of the various adaptive switchable streams in this proof of concept were not chosen for optimal audiovisual quality. Simplicity was the key driver in selecting the H.264 Baseline profile and the 64 kb/s audio bitrate limit, which ensures that switching between the different streams runs smoothly.



Logic Flow
The logical flow of online distribution based on MPEG-DASH is very similar to what is currently deployed in adaptive streaming systems. One needs only an encoded/packaged stream and an HTTP server to get the job done and play out video to a player that supports MPEG-DASH natively.

The simple part of the workflow is demonstrated by the chain starting with the Elemental Live encoder, which captures the audiovisual content from the SDI feed at the VRT premises. The Apache server located at Belgacom's CDN picks up the data packages from the encoder via HTTP GET and makes them available at a URL, together with an MPD file describing how the packages should be interpreted by the player. The Adobe player reads the MPD, buffers the packages and plays out the video on the end user's device.

It gets more complicated when dedicated applications are involved to play out video on devices that do not yet support MPEG-DASH natively. Additionally, to secure premium content, content protection using DRM technology needs to be implemented in both the server and the client. In this trial, a group of equipment suppliers are working together to make this happen.

The Harmonic ProMedia Live encoder encodes the IP feeds. The Wowza server, also located at the VRT premises, acts as an origin server, making the packages available for the Wowza cache server in the CDN. This content is then played out by the BuyDRM Android and iOS players.

To also showcase Common Encryption support for MPEG-DASH with Microsoft’s PlayReady DRM, another distribution chain is set up in which the Media Excel HERO encoder works with CodeShop’s Unified Streaming server acting as an origin to produce live encrypted content. For proof-of-concept purposes, the BuyDRM Android and iOS applications in this distribution chain switch seamlessly between protected and unprotected content.



Source: DASH Promoters Group

New EBU Subtitling Specification Published

The EBU has published EBU-TT part 1, with TT standing for Timed Text. It's a follow-up to the widely used EBU STL specification, which was originally published in 1991. The new format is XML-based, which makes it “human readable” and more suited to modern integrated file-based production methods.

EBU-TT is a simplified version of the W3C Timed Text specification, which means it fits well into the broad family that includes W3C TTML and SMPTE-TT (the latter being more focused on the US environment and on distribution). EBU-TT was developed by the EBU's XML Subtitles group, chaired by Andreas Tai of IRT.

Part 1 (Tech 3350) has been published, defining a structure for the interchange and archiving of subtitles; part 2, now being drafted, will provide mapping guidance for users who want to migrate from EBU STL to EBU-TT. Work has also started towards a specification for live subtitling, with a workshop scheduled for 9 August.

Source: EBU

What is Timed Text?

A nice video introduction to Timed Text by Bruce Devlin.