Multi-Screen IP Video Delivery

Online and mobile viewing of widely-available, high-quality video content including TV programming, movies, sports events, and news is now poised to go mainstream. Driven by the recent availability of low-cost, high-resolution desktop/laptop/tablet PCs, smart phones, set-top boxes and now Ethernet-enabled TV sets, consumers have rapidly moved through the ‘novelty’ phase of acceptance into expectation that any media should be available essentially on any device over any network connection. Whether regarded as a disruption for cable TV, telco or satellite TV providers, or an opportunity for service providers to extend TV services onto the web for on-demand, time-shifted and place-shifted programming environments – often referred to as ‘three screen delivery’ or ‘TV Anywhere’ – this new video delivery model is here to stay.

While tremendous advancements in core and last-mile bandwidth have been achieved around the world in the last decade – primarily driven by web-based data consumption – video traffic represents a quantum leap in bandwidth requirements. Coupled with the fact that the Internet at large is not a managed quality-of-service environment, this means that new methods of video transport must be considered to provide the quality of video experience, across any device and network, that we have come to expect from managed TV-delivery networks.

The evolution of video delivery transport has led to a new set of de facto standard adaptive delivery protocols from Apple, Microsoft and Adobe that are now positioned for broad adoption. Consequently, networks must now be equipped with servers that can take high-quality video content from its source – live stream or file – and ‘package’ it for transport to devices ready to accept these new delivery protocols.

Video Delivery Background

The Era of Stateful Protocols

For many years, stateful protocols including Real Time Streaming Protocol (RTSP), Adobe’s Real Time Messaging Protocol (RTMP), and Real Networks' RTSP over Real Data Transport (RDT) protocol were utilized to stream video content to desktop and mobile clients. Stateful protocols require that from the time a client connects to a streaming server until the time it disconnects, the server tracks client state. If the client needs to perform any video session control commands like start, stop, pause or fast-forward it must do so by communicating state information back to the streaming server.

Once a session between the client and the server has been established, the server sends media as a stream of small packets typically representing a few milliseconds of video. These packets can be transmitted over UDP or TCP. TCP overcomes firewall blocking of UDP packets, but may also incur increased latency as packets are sent, and resent if not acknowledged, until received at the far end.

These protocols served the market well, particularly during the era when desktop and mobile viewing experiences were limited by session frequency, quality and duration, by small screen/window sizes and resolutions, and by the constrained processor, memory and storage capabilities of mobile devices.

However, these factors have all changed dramatically in the last few years, and that has exposed a number of stateful protocol implementation weaknesses:

  • Stateful media protocols have difficulty getting through firewalls and routers.

  • Stateful media protocols require special proxies/caches.

  • Stateful media protocols cannot react quickly or gracefully to rapidly fluctuating network conditions.

  • Stateful media client server implementations are vendor-specific, and thus require the purchase of vendor-specific servers and licensing arrangements – which are also more expensive to operate and maintain.

The Era of the Stateless Protocol – HTTP Progressive Download

A newer type of media delivery is HTTP progressive download. Progressive download (as opposed to ‘traditional’ file download) pulls a file from a web server using HTTP and allows the video file to start playing before the entire file has been downloaded. Most media players, including Adobe Flash, Windows Media Player and Apple QuickTime, support progressive download. Further, most video hosting websites use progressive download extensively, if not exclusively.

HTTP progressive download differs from traditional file download in one important respect. Traditional files have audio and video data separated in the file. At the end of the file, a record of the location and structure of the audio and video tracks (track data) is provided. Progressively downloadable files have track data at the beginning of the file and interleave the audio and video data. A player downloading a traditional file must wait until the end of the file is reached in order to understand track data. A player downloading a progressively downloadable file gets track data immediately and can, therefore, play back audio/video as it is received.
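
To make the distinction concrete, the sketch below – an illustration with assumed file names and simplified error handling, not part of any player – walks the top-level boxes of an MP4/QuickTime-style file and reports whether the track metadata (‘moov’) precedes the media data (‘mdat’), i.e., whether the file is laid out for progressive download:

    import struct

    def moov_before_mdat(path):
        """Return True if the 'moov' box precedes 'mdat' in the file's top-level boxes."""
        order = []
        with open(path, "rb") as f:
            while True:
                header = f.read(8)
                if len(header) < 8:
                    break
                size, box_type = struct.unpack(">I4s", header)
                order.append(box_type.decode("ascii", errors="replace"))
                if size == 1:                      # 64-bit "largesize" follows the header
                    size = struct.unpack(">Q", f.read(8))[0]
                    f.seek(size - 16, 1)
                elif size == 0:                    # box extends to end of file
                    break
                else:
                    f.seek(size - 8, 1)            # skip to the next top-level box
        if "moov" in order and "mdat" in order:
            return order.index("moov") < order.index("mdat")
        return False

    # print(moov_before_mdat("movie.mp4"))         # True => progressive-download friendly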

Unfortunately, progressive download files cannot be created efficiently from a live stream. The audio and video track data must be 1) computed after the entire file has been created and then 2) written to the front of the file. Thus, it isn’t possible to deliver a live stream using progressive download, because the track data cannot be available until after the entire file has been created.

Even so, HTTP Progressive Download greatly improves upon its stateful protocol predecessors as a result of the following:
  • No issue getting through firewalls and routers as HTTP traffic is passed through Port 80 unfettered.

  • Utilizes the same web download infrastructure utilized by CDNs and hosting providers to provide web data content – making it much easier and less expensive to deliver rich media content.

  • Takes advantage of newer desktop and mobile clients’ formidable processing, memory and storage capabilities to start video playback quickly, maintain flow, and preserve a high-quality experience.

The Modern Era – Adaptive HTTP Streaming

Adaptive HTTP Streaming takes HTTP video delivery several steps further. In this case, the source video, whether a file or a live stream, is encoded into segments – sometimes referred to as "chunks" – using a desired delivery format, which includes a container, video codec, audio codec, encryption protocol, etc. Segments typically represent two to ten seconds of video. Each segment is sliced at video Group of Pictures (GOP) boundaries beginning with a key frame, giving the segment complete independence from previous and successive segments. Encoded segments are subsequently hosted on a regular HTTP web server.
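
As a rough illustration of the segmentation step – the frame model and timing here are assumptions, not any vendor’s implementation – a segmenter only starts a new chunk on a key frame and closes a chunk once it reaches the target duration:

    from dataclasses import dataclass

    @dataclass
    class Frame:
        pts: float          # presentation time in seconds
        is_keyframe: bool
        data: bytes

    def segment(frames, target_seconds=10.0):
        chunks, current, chunk_start = [], [], None
        for frame in frames:
            if chunk_start is None and not frame.is_keyframe:
                continue                      # wait for the first key frame
            if frame.is_keyframe and (chunk_start is None or
                                      frame.pts - chunk_start >= target_seconds):
                if current:
                    chunks.append(current)    # close the previous chunk
                current, chunk_start = [], frame.pts
            current.append(frame)
        if current:
            chunks.append(current)
        return chunks                         # every chunk begins with a key frame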

Clients request segments from the web server, downloading them via HTTP. As the segments are downloaded to the client, the client plays back the segments in the order received. Since the segments are sliced along GOP boundaries with no gaps between, video playback is seamless – even though it is actually just a file download via a series of HTTP GET requests.

Adaptive delivery enables a client to ‘adapt’ to fluctuating network conditions by selecting video file segments encoded to different bit rates. As an example, suppose a video file had been encoded to 11 different bit rates from 500 Kbps to 1 Mbps in 50 Kbps increments, i.e., 500 Kbps, 550 Kbps, 600 Kbps, etc. The client then observes the effective bandwidth throughout the playback period by evaluating its buffer fill/depletion rate. If a higher quality stream is available, and network bandwidth appears able to support it, the client will switch to the higher-quality bit rate segment. If a lower quality stream is available, and network bandwidth appears too limited to support the currently used bit rate segment flow, the client will switch to the lower quality bit rate segment flow. The client can choose between segments encoded at different bit rates every few seconds.

This delivery model works for both live- and file-based content. In either case, a manifest file is provided to the client, which defines the parameters of each segment. In the case of an on-demand file request, the manifest is sent at the beginning of the session. In the case of a live feed, updated ‘rolling window’ manifest files are sent as new segments are created.

Since the web server can typically send data as fast as its network connection will allow, the client can evaluate its buffer conditions and make forward-looking decisions on whether future segment requests should be at a higher or lower bit rate to avoid buffer overrun or starvation. Each client will make this decision based on trying to select the highest possible bit rate for maximum quality of playback experience, but not so great that it starves its own buffer of the next needed segments.
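
A minimal sketch of such a client-side decision follows; the thresholds and the bit rate ladder are illustrative assumptions (real players use more elaborate bandwidth estimators):

    def choose_bitrate(current_bps, buffer_seconds, measured_throughput_bps,
                       ladder=(500_000, 750_000, 1_000_000, 1_500_000, 2_500_000)):
        """Pick the bit rate for the next segment; current_bps is assumed to be in the ladder."""
        ladder = sorted(ladder)
        i = ladder.index(current_bps)
        # Buffer is draining or throughput no longer covers the current rate: step down.
        if buffer_seconds < 5 or measured_throughput_bps < current_bps:
            return ladder[max(i - 1, 0)]
        # Buffer is healthy and throughput comfortably exceeds the next rung: step up.
        if (buffer_seconds > 15 and i + 1 < len(ladder)
                and measured_throughput_bps > 1.2 * ladder[i + 1]):
            return ladder[i + 1]
        return current_bps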

A number of advantages accrue with this delivery protocol approach:
  • Lower infrastructure costs for content providers by replacing specialty streaming servers with the generic HTTP caches/proxies already in place for HTTP data serving.

  • Content delivery is dynamically adapted to the weakest link in the end-to-end delivery chain, including highly varying last mile conditions.

  • Subscribers no longer need to statically select a bit rate on their own, as the client can now perform that function dynamically and automatically.

  • Subscribers enjoy fast start-up and seek times as playback control functions can be initiated via the lowest bit rate and subsequently ratcheted up to a higher bit rate.

  • Annoying user experience shortcomings including long initial buffer time, disconnects, and playback start/stop are virtually eliminated.

  • Client can control bit rate switching – with no intelligence in the server – taking into account CPU load, available bandwidth, resolution, codec, and other local conditions.

  • Simplified ad insertion accomplished by file substitution.

Encoding/Transcoding

The transcoder (or encoder, if the input is not already compressed) is responsible for ingesting the content, encoding to all necessary outputs, and preparing each output for advertising readiness and delivery to the packager for segmentation. The transcoder must perform the following functions for multi-screen adaptive delivery – and at high concurrency, in real time and with high video quality output.

Video Transcoding
  • Transcode the output video to a progressive format, which requires the transcoder to support input de-interlacing.

  • Transcode the input to each required output profile – where a given profile will have its own resolution and bit rate parameters – including scaling to resolutions suitable for each client device. Because the quality of experience of the client depends on having a number of different profiles, it is necessary to encode a significant number of output profiles for each input. Deployments may use anywhere from 4 to 16 output profiles per input; a hypothetical profile ladder is sketched after this list.


  • GOP-align each output profile such that client playback (shifting between different bit rate ‘chunks’ created for each profile) is continuous and smooth.
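
For illustration only, an output-profile ladder might be configured as follows; the actual resolutions, bit rates and profile count are assumptions and would be chosen per service and device mix:

    OUTPUT_PROFILES = [
        # (name,         width, height, video_kbps, audio_kbps)
        ("pc_hd",         1280,    720,       2500,        128),   # full-resolution PC/tablet
        ("pc_sd",          848,    480,       1200,        128),
        ("mobile_high",    640,    360,        700,         64),
        ("mobile_mid",     480,    270,        400,         64),
        ("mobile_low",     320,    180,        200,         48),   # poor 3G conditions
    ]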

Audio Transcoding

Transcode audio into AAC – the codec used by adaptive delivery protocols from Apple, Microsoft and Adobe.

Ad Insertion

Add IDR frames at ad insertion points, so that the video is ready for SCTE 35 ad insertion. It is also possible to align chunk boundaries with ad insertion points so that ad insertion can be done via chunk substitution rather than traditional stream splicing.
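
As a simple sketch of that alignment – the frame rate and cue times are hypothetical – the transcoder can convert SCTE 35 cue timestamps into forced-IDR frame positions so that chunk boundaries coincide with ad insertion points:

    def forced_idr_frames(cue_times_seconds, frame_rate=29.97):
        """Frame indices at which an IDR frame should be forced for ad-break alignment."""
        return sorted({round(t * frame_rate) for t in cue_times_seconds})

    print(forced_idr_frames([30.0, 330.0, 630.0]))   # e.g. ad breaks every five minutes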

Ingest Fault Tolerance

The transcoding system needs to allow two different transcoders that ingest the same input to create identically IDR-aligned output – contributing to strong fault tolerance. This can be used to create a redundant backup of encoded content in such a way that any failure of the primary transcoder is seamlessly backed up by the secondary transcoder.

Packaging

To realize the benefits of HTTP adaptive streaming, a ‘packager’ function – sometimes referred to as a ‘segmenter’, ‘fragmenter’ or ‘encapsulator’ – must take each encoded video output from the transcoder and ‘package’ the video for each delivery protocol. To perform this function, the packager must be able to:

Ingest

Ingest live streams or files, depending on whether the work flow is live or on-demand.

Segmentation

Segment the encoded streams into chunks according to the proprietary delivery protocols specified by Microsoft Smooth Streaming, Apple HTTP Live Streaming (HLS), and Adobe HTTP Dynamic Streaming.

Encryption
  • Encrypt segments on a per delivery protocol basis (in a format compatible with each delivery protocol) as they are packaged, enabling content rights to be managed on an individual session basis. For HLS, this is file-based AES-128 encryption. For Smooth Streaming, it is also AES-128, but with PlayReady compatible signaling. Adobe HTTP Dynamic Streaming uses Adobe Flash Access for encryption.

  • Integrate with third party key management systems to retrieve necessary encryption information.

Note: Third party key management servers manage and distribute the keys to clients. If the client is authorized, it can retrieve decryption keys from a location designated in the manifest file. Alternatively, depending on the protocol used, key location can be specified within each segment. Either way, the client is responsible for retrieving decryption keys, which are normally served after the client request is authenticated. Once the keys are received, the client is able to decrypt the video and display it.
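
As an illustration of the segment-encryption step, the sketch below applies AES-128 in CBC mode with PKCS#7 padding, the scheme HLS uses for file-based encryption. The key source, IV handling and key-server URL are simplified assumptions (a real packager retrieves keys from the key-management system), and the Python cryptography package is assumed to be available:

    import os
    from cryptography.hazmat.primitives import padding
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def encrypt_segment(segment_bytes, key, iv):
        # AES-128 CBC with PKCS#7 padding, as used for HLS segment encryption
        padder = padding.PKCS7(128).padder()
        padded = padder.update(segment_bytes) + padder.finalize()
        encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
        return encryptor.update(padded) + encryptor.finalize()

    key = os.urandom(16)              # in practice, obtained from the key-management system
    iv = (1).to_bytes(16, "big")      # HLS commonly derives the IV from the media sequence number
    # ciphertext = encrypt_segment(open("segment_00001.ts", "rb").read(), key, iv)
    #
    # The playlist then points authorized clients at the key server, for example:
    # #EXT-X-KEY:METHOD=AES-128,URI="https://keys.example.com/key/123"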

Delivery

The final step in this process is the actual delivery of segments to end clients – the aforementioned desktop/laptop/tablet PCs, smart phones, IP-based set-top boxes and now Internet-enabled television sets. Optimal delivery network design must take into consideration content type, device type, delivery protocol and DRM options.

Live vs. File Delivery

In the case of live delivery, it is possible to serve segments directly from the packager when the number of clients is relatively small. However, the typical use case involves feeding the segments to a CDN, either via a reverse proxy ‘pull’ or via a ‘push’ mechanism, such as HTTP POST. The CDN is then responsible for delivering the chunks and playlist files to clients.
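
A minimal sketch of the ‘push’ model – the ingest URL, naming scheme and authentication are assumptions, and the Python requests package is assumed available – might look like this:

    import requests

    CDN_INGEST = "https://ingest.cdn.example.com/live/channel1/"

    def push(filename, content_type):
        """POST a freshly produced chunk or playlist to the CDN ingest endpoint."""
        with open(filename, "rb") as f:
            r = requests.post(CDN_INGEST + filename, data=f,
                              headers={"Content-Type": content_type})
        r.raise_for_status()

    # push("segment_00042.ts", "video/MP2T")
    # push("index.m3u8", "application/vnd.apple.mpegurl")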

The same delivery model can also be utilized in video-on-demand (VOD), but VOD also offers the alternative of delivering directly from the packager, even to a large number of users. However, with VOD delivery, it is sometimes desirable to distribute one file (or a small number of files) that contains all the chunks together; referred to as an aggregate format for the content. Distributing one file allows service providers to easily preposition content in the CDN without having to distribute and manage thousands of individual chunks per piece of content. When a client makes a request, the aggregate file is segmented ‘on the fly’ for that client, using the client’s requested format. The tradeoff is that while the CDN and file management is simpler, more packagers are required – ‘centralized’ packagers that create and aggregate the chunks and ‘distributed, edge-located’ packagers that segment the aggregation format (on demand) into actual chunks delivered to clients.

Output Profile Selection

The optimal number of profiles, bit rates and resolutions to use is very service-specific. However, there are a number of generally applicable guidelines. First, what are the end devices and what is the last-mile network? The end devices drive the output resolutions. It is desirable to have one or two profiles that serve the highest-quality experience for each device, encoded at the full resolution of the target device. For PCs, that’s typically 720p30.

Looking at the delivery network, for mobile distribution it is typical to use very low bandwidth profiles. Even 3G mobile networks, which have relatively high peak bandwidths of several hundred kbps, may fall back to much lower sustained bandwidths than video streaming requires. WiFi networks have higher capacity, but also suffer potential degradation depending on the distance to the base station or the composition of walls between transmitter and receiver. DSL distribution to PCs also varies widely in bandwidth capacity. And almost all last-mile networks suffer bandwidth reduction caused by aggregation, for example at a cable node or at the DSLAM. The number of output profiles therefore varies by scenario, as illustrated below.
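
As a rough, purely illustrative mapping – these counts are assumptions consistent with the 4-to-16 range mentioned earlier, not recommendations – a service might plan its output profiles as follows:

    PROFILE_COUNT_BY_SCENARIO = {
        "mobile_3g_only":          4,    # low, closely spaced bit rates
        "mobile_plus_wifi":        6,
        "pc_over_dsl_or_cable":    8,
        "multi_screen_full_range": 12,   # phones, tablets, PCs and connected TVs
    }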



Protocol Selection

Which of Apple HLS, Microsoft Silverlight Smooth Streaming or Adobe HTTP Dynamic Streaming is the optimal choice for a service provider? Each protocol has its own appeal, and so service providers must carefully consider the following in making a delivery protocol selection:
  • Adobe has a huge installed client base on PCs. For service operators that want to serve PCs and do not want to distribute a client, this is a big benefit. The availability of Adobe’s server infrastructure, including backwards compatibility with RTMP and Adobe Access, may also be appealing to service operators.

  • Apple HLS uses MPEG-2 transport stream files as chunks. The existing infrastructure for testing and analyzing TS files makes this protocol easy to deploy and debug. It also allows for the type of signaling that TS streams already carry, such as SCTE 35 cues for ad insertion points, multiple audio streams, EBIF data, etc.

  • Microsoft Smooth Streaming has a very convenient aggregate format and provides an excellent user experience that can adapt to changes in bandwidth rapidly, as it makes use of short chunks and doesn’t require repeated downloads of a playlist. Smooth Streaming is also an obvious choice when content owners require the use of PlayReady DRM.

Redundancy and Failover

Transcoder redundancy is typically managed using an N:M redundancy scheme in which a management system loads one of M standby transcoders with the configuration of a failed transcoder in a pool of N transcoders. The packager component can be managed similarly, but it can also be managed in a 1:1 scheme by having time-outs at the CDN root trigger failover to the secondary packager. Avoiding outages in these scenarios involves making sure that the primary and backup packagers are synchronized, that is, they create identical chunks.
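
Conceptually, the N:M scheme reduces to a management loop like the sketch below; the health-check and configuration-loading interfaces are assumptions:

    def handle_failures(active, standbys, is_healthy, load_config):
        """active: dict of transcoder_id -> config; standbys: list of idle unit ids."""
        for unit_id, config in list(active.items()):
            if not is_healthy(unit_id) and standbys:
                replacement = standbys.pop(0)
                load_config(replacement, config)   # standby takes over the failed unit's job
                active[replacement] = config
                del active[unit_id]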

DRM Integration

DRM integration remains challenging. Broadly, there are two approaches:
  • The first uses unique encryption keys for every client stream. In this case, CDN caching provides no value. Every user view is a unicast connection back to the center of the network; network load is high; but the content is as secure as possible.

  • The second approach uses shared keys for all content, but keys are only distributed to authenticated clients. CDN caching can then lead to significant bandwidth savings, but key management still requires a unique connection to the core for each client. Fortunately, these connections are far lower bandwidth than the video streams.

Different DRM vendors provide different solutions, and interoperability between vendors for client authentication doesn’t exist:
  • Adobe uses Adobe Access to restrict access to streams, giving a unified, single vendor work flow.

  • Apple HLS provides a description of the encryption mechanism, but leaves client authentication as an implementation decision.
  • Microsoft’s PlayReady is a middle ground. Client authentication is well specified, but the interfaces between the key management server and the packager component are not. This means some integration is typically required to create a fully deployed system.

Apple HTTP Live Streaming (HLS)

HTTP Live Streaming (HLS) allows you to stream live and on-demand video and audio to an iPhone, iPad, or iPod Touch. HLS is similar to Smooth Streaming in Microsoft’s Silverlight platform architecture, and can be thought of as a successor to both RTSP and HTTP Progressive Download (HTTP PD), although both of those video options serve a purpose and likely won’t be going away anytime soon.

HLS was originally unveiled by Apple with the introduction of iPhone OS 3.0 in mid-2009. Prior to iPhone OS 3.0, no streaming protocols were supported natively on the iPhone, leaving developers to wonder what Apple had in mind for native streaming support. Apple proposed HLS as a standard to the IETF, and the draft is now in its sixth iteration.

As an adaptive streaming protocol, HLS has several advantages including multiple bit rate encoding for different devices, HTTP delivery, and segmented stream chunks suitable for delivery of live streams over widely available HTTP CDN infrastructure.

How HLS Works

HLS lets you send streaming video and audio to any supported Apple product, including Macs with a Safari browser. HLS works by segmenting video streams into 10-second chunks; the chunks are stored using a standard MPEG-2 Transport Stream file format. Optionally, chunks are created using several bit rates, allowing a client to dynamically switch between different bit rates depending on network conditions.

How does a stream get into HLS format? There are three main steps:
  • An input stream is encoded/transcoded. The input can be a satellite feed or any other typical input. The video and audio source is encoded (or transcoded) to an MPEG-2 Transport Stream container, with H.264 video and AAC audio, which are the codecs Apple devices currently support.

  • Output profiles are created. Typically a single input stream will be transcoded to several output resolutions/bit rates, depending on the types of client devices that the stream is destined for. For example, an input stream of H.264/AAC at 7 Mbps could be transcoded to four different profiles with bit rates of 1.5 Mbps, 750 Kbps, 500 Kbps and 200 Kbps. These would be suitable for devices and network conditions ranging from high-end to low-end, such as an iPad, iPhone 4, iPhone 3, and a low bit rate version for bad network conditions.

  • The streams are segmented. The streams contained within the profiles all need to be segmented and made available for delivery to an origin web server or directly to a client device over HTTP. The software or hardware device that does the segmenting (the segmenter) also creates an index file which is used to keep track of the individual video/audio segments.

Optionally, the segmenter might also encrypt the stream (each individual chunk) and create a key file.



What Does the Client Do?

The client downloads the index file via a URL that identifies the stream. The index file tells the client where to get the stream chunks (each with its own URL). For a given stream, the client then fetches each stream chunk in order. Once the client has enough of the stream downloaded and buffered, it displays it to the user. If encryption is used, the URLs for the decryption keys are also given in the index file.

If multiple profiles (that is, bit rates and resolutions) are available, the index file is different in that it contains a specially tagged list of variant index files for the different stream profiles. In that case, the client downloads the primary index file first, and then downloads the index file for the bit rate it wants to play back. The bit rates and resolutions of the variant streams are specified in the main index file, but precise handling of the variants offered is left up to the client implementation.
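
For illustration, a master (variant) index file might look like the following; the URIs, bandwidths and resolutions are hypothetical:

    #EXTM3U
    #EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1500000,RESOLUTION=1280x720
    high/prog_index.m3u8
    #EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=500000,RESOLUTION=640x360
    mid/prog_index.m3u8
    #EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=200000,RESOLUTION=416x234
    low/prog_index.m3u8

One of the variant (media) index files, for a live stream with a rolling window, might then contain:

    #EXTM3U
    #EXT-X-VERSION:3
    #EXT-X-TARGETDURATION:10
    #EXT-X-MEDIA-SEQUENCE:147
    #EXTINF:10.0,
    segment147.ts
    #EXTINF:10.0,
    segment148.ts
    #EXTINF:10.0,
    segment149.ts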

Typical playback latency for HLS is around 30 seconds. This is a consequence of the chunk size (10 seconds) and the need for the client to buffer several chunks – commonly around three – before it starts displaying the video, i.e., roughly 3 × 10 seconds.

An odd curiosity about HLS is that it doesn’t make use of Apple’s QuickTime MOV file format, which is the basis for the ISO MPEG file format. Apple thought that the TS format was more widely used and better understood by the broadcasters who would ultimately use HLS, and so they focused on the MPEG-2 TS format. Ironically, Microsoft’s Smooth Streaming protocol does make use of ISO MPEG files (MPEG-4 Part 12).

For more information on how HLS works, good resources include Apple’s Live Streaming overview documentation.

Microsoft Smooth Streaming (SS)

Smooth Streaming was announced by Microsoft in October 2008 as part of the Silverlight architecture. That year they demonstrated a prototype version of Smooth Streaming by delivering live and on-demand streaming content such as the Beijing Olympics and Democratic National Convention.

Smooth Streaming has all of the typical characteristics of adaptive streaming. The video content is segmented into small chunks, it is delivered over HTTP, and usually multiple bit rates are encoded so that the client can choose the best video bit rate to deliver an optimal viewing experience based on network conditions.

Adaptive streaming is valuable for many reasons including low web-based infrastructure costs, firewall compatibility and bit rate switching. Microsoft is definitely a believer in those benefits as it is making a strong push to adaptive streaming technology with Microsoft Silverlight, Smooth Streaming and Mediaroom.

Smooth Streaming vs. Apple HLS

Microsoft has chosen to implement adaptive streaming in unique ways, however. There are several key differences between Silverlight Smooth Streaming and HLS:
  • HLS makes use of a regularly updated “moving window” metadata index file that tells the client which chunks are available for download. Smooth Streaming uses time codes in the chunk requests, so the client doesn’t have to repeatedly download an index file (an example request follows this list).

  • Because HLS requires a download of an index file every time a new chunk is available, it is desirable to run HLS with longer duration chunks to minimize the number of index file downloads. The recommended chunk duration with HLS is therefore 10 seconds, while with Smooth Streaming it is 2 seconds.

  • The “wire format” of the chunks is different. Both formats use H.264 video encoding and AAC audio encoding, but HLS makes use of MPEG-2 Transport Stream files, while Smooth Streaming makes use of “fragmented” ISO MPEG-4 files. The “fragmented” MP4 file is a variant in which not all the data in a regular MP4 file is included in the file. Each of these formats has some advantages and disadvantages. MPEG-2 TS files have a large installed analysis toolset and have pre-defined signaling mechanisms for things like data signals (e.g. specification of ad insertion points). Fragmented MP4 files are very flexible and can easily accommodate all kinds of data, for example decryption information, that MPEG-2 TS files don’t have defined slots to carry.
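
As an example of the time-code addressing mentioned above, a Smooth Streaming client requests an individual fragment with a URL of roughly the following form; the host name, bit rate and offset are hypothetical, and the time offset is expressed in 100-nanosecond units (200000000 corresponds to the fragment starting at 20 seconds):

    http://media.example.com/video.ism/QualityLevels(1500000)/Fragments(video=200000000)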

For a more in-depth overview of how Microsoft Smooth Streaming works, a good resource is Microsoft’s Smooth Streaming Technical overview whitepaper.

Additionally, for a slightly biased but still accurate representation of some of the key differences between Smooth Streaming, HLS, and Adobe Flash Dynamic Streaming, check out this adaptive streaming comparison matrix.

Adobe HTTP Dynamic Streaming (HDS)

HTTP Dynamic Streaming was announced by Adobe in late 2009 as “project Zeri” and was delivered in June 2010. HDS is more similar to Microsoft Smooth Streaming than it is to Apple HLS, primarily because it uses a single aggregate file from which fragmented MP4 container segments are extracted and delivered, rather than HLS-like individual chunks. The implications of that design are discussed below.

Characteristics of Adobe HDS

HTTP Dynamic Streaming supports both live and on-demand content using a standard MP4 fragment format (F4F). Video/audio codec support includes VP6/MP3 and H.264/AAC, however as with HLS and SS, the predominant video/audio codecs are H.264/AAC.

Similar to other adaptive streaming protocols, at the start of the stream the client or CDN/origin server downloads the manifest file (in this case, an F4M file), which provides all the information needed to play back the content, including fragment format, available bit rates, Flash Access license server location, and metadata.

Files representing either live or VOD workflows are sent to an HTTP origin server. The origin server is responsible for receiving segment requests from the client over HTTP and returning the appropriate segment from the file. Standard origin servers such as Apache can serve the content using Adobe’s HTTP Origin Module, while client playback is typically built on Adobe’s Open Source Media Framework (OSMF).

Differences Between Adobe HDS and Apple HLS

There are several key differences between Adobe HDS and Apple HLS:
  • HLS makes use of a regularly updated “moving window” metadata index (manifest) file that tells the client which chunks are available for download. Adobe HDS uses sequence numbers in the chunk requests and thus the client doesn’t have to repeatedly download a manifest file.

  • In addition to the manifest, there is a bootstrap file, which in the live case gives the updated sequence numbers and is equivalent to the repeatedly downloaded HLS playlist.

  • Because HLS requires a download of a manifest file as often as every time a new chunk is available, it is desirable to run HLS with longer duration chunks, thus minimizing the number of manifest file downloads. More recent Apple client versions appear to now check how many segments are in the playlist and only re-fetch the manifest when the client runs out of segments. Nevertheless, the recommended chunk duration with HLS is 10 seconds, while with Adobe HDS it is usually 2-5 seconds.

  • The “wire format” of the chunks is different. Both formats use H.264 video encoding and AAC audio encoding, but HLS makes use of MPEG-2 Transport Stream files, while Adobe HDS (and Microsoft SS) make use of “fragmented” ISO MPEG-4 files.

As with HLS, Adobe Flash clients first request a manifest file. The manifest contains information about what streams are available, bit rates, codecs, etc. and the streams are represented by a URL. Using a contiguous file creates two significant changes in the client/server architecture:
  • The client reads the manifest and can request chunks by a URL with a sequence number rather than a specific chunk name.

  • The server must calculate exact byte range offsets within the aggregate file by translating the URL requests, and then delivers the appropriate chunk.
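
A conceptual sketch of that translation step follows; the index structure and request handling are illustrative assumptions, not Adobe’s actual F4F/F4X layout:

    def read_fragment(aggregate_path, index, sequence_number):
        """Serve one fragment by reading only its byte range from the aggregate file."""
        offset, length = index[sequence_number]        # byte range of this fragment
        with open(aggregate_path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    # Hypothetical index: sequence number -> (offset, length)
    index = {1: (0, 1_048_576), 2: (1_048_576, 1_002_430), 3: (2_051_006, 990_112)}
    # fragment = read_fragment("show_1500k.f4f", index, 2)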

For a more in-depth overview of how Adobe HDS works, a good resource is Adobe’s HTTP Dynamic Streaming technical whitepaper.

By Andy Salo, RGB Networks