Netflix Uses the Cloud to Do the Heavy Lifting for Video Transcoding
For Netflix vice-president of digital supply chain Kevin McEntee, the US-based video streaming company's shift to using the cloud for transcoding its massive content library comes down to a modern take on the fable of the tortoise and the hare. Or, as he told the audience at the AWS re:Invent conference in Las Vegas last week, it's like a choice between moving a room full of people to another city by using expensive high-performance Ferraris or a fleet of somewhat more humble Toyota Priuses.
When it began its shift away from DVD rentals to Internet-based video streaming, Netflix initially employed a 'Ferrari' approach to dealing with the computationally intensive task of encoding movies and TV shows in a format that could be streamed to client devices.
In 2006/2007, when Netflix began the move to streaming, it found that the video processing technology typically employed in Hollywood centred on ensuring minimal latency. It was built for scenarios such as a single video editor mastering a Blu-ray image of a movie: "[It was] optimised for the expensive time of that one operator; essentially the artist sitting there doing that mastering."
"Back in 2006/2007 we hired people out of this industry and we ended up building out a data centre that [was] very Ferrari-like," McEntee said, with custom, GPU-based encoding hardware; "boxes that had custom GPUs that were built specifically to dump video very fast." It was expensive and constrained by the fixed footprint of Netflix's data centre.
However, the limitations of this approach became evident in late 2008, when Netflix set out to launch new video players for PCs and Macs, and jumped onto TVs by launching a player on the Xbox in November of that year.
"There was such an amazing lack of standardisation around video streaming those days, and there still is today, that we had to create new formats for those players," McEntee said.
"And at the same time [as] we were innovating in the player space and therefore causing the need for new formats, our content team in LA was licensing more and more content, so our content library during the course of that project also doubled in size. And so we set out [to] re-encode using the hardware farm that we had built."
Unfortunately for Netflix, the hardware didn't deal well with the load, and the company encountered frequent hardware failures. Fans on the custom GPUs being used were too small and "boxes were melting", McEntee said.
"It was really a very frustrating experience and in fact that catalogue re-encode was late and we failed. Basically we launched these players and the catalogue was not complete."
Reflecting on this experience prompted Netflix to move its transcoding to the Amazon Web Services cloud. "If you jump forward a year, we had made the jump to move our transcoding farm into AWS. And we had seen the opportunity in fall 2009 for launching a video player on the [Sony PlayStation 3] so this was our first 100 per cent AWS transcode."
"The player developers again realised they had to rely on a new format; they had to transcode the entire library," McEntee said. The new format was not finalised until three or four weeks before the launch of the new player, but Netflix was "able to spin up enough instances in EC2 to transcode the entire library in about three weeks" and managed to meet the deadline.
This is where the Ferrari versus Prius metaphor comes in. In McEntee's (somewhat elaborate) analogy, individual Ferraris offer great performance, but are expensive to buy, expensive to repair and available in limited numbers; whereas the hypothetical Prius fleet won't be quite so swift, but can be rented for a lot less than it would cost to buy Ferraris, repairs are someone else's problem and they're available in large numbers.
As McEntee explained, "By moving to the cloud while ... one encode was slower the overall throughput of the whole system was much, much faster." It's a question of "thinking horizontal, not vertical," he said, with an architecture that isn't optimised for latency but for overall throughput.
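The trade-off McEntee describes can be illustrated with some simple arithmetic. The sketch below is not based on Netflix's figures: the library size, per-title encode times and machine counts are all hypothetical, chosen only to show how a large fleet of slower workers can finish a whole catalogue sooner than a small fleet of fast ones.

```python
import math


def wall_clock_hours(titles, hours_per_title, workers):
    """Hours to encode every title when each worker encodes one title at a time."""
    batches = math.ceil(titles / workers)  # rounds of work the fleet has to run
    return batches * hours_per_title


# Hypothetical numbers for illustration only (not from Netflix).
ferrari_hours = wall_clock_hours(titles=10_000, hours_per_title=1.0, workers=20)
prius_hours = wall_clock_hours(titles=10_000, hours_per_title=4.0, workers=2_000)

print(f"Few fast machines:  {ferrari_hours:.0f} hours")
print(f"Many slow machines: {prius_hours:.0f} hours")
```

In this made-up scenario each "Prius" encode takes four times as long as a "Ferrari" encode, yet the complete job finishes far sooner because so many more encodes run side by side.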
Netflix "haven't really missed deadlines" since the shift away from relying on in-house hardware for transcoding, he said. But, "even more than not missing deadlines, this change has actually created opportunities for the business."
His favourite example is from February 2010, when Apple approached Netflix about the impending iPad launch. Cupertino told Netflix it wanted the company to be part of the launch - which meant yet another video format had to be supported. Using its cloud-based approach to transcoding meant that Netflix was able to have its entire content library available for the April iPad launch.
"This is an opportunity we didn't anticipate when we set out to do the AWS project, but what we found is that having this ability to scale the whole system quickly without doing any purchasing or building out a data centre ourselves really just made the business very nimble and you really can't put a price on nimble, especially in a business that's moving as fast as Netflix," McEntee said.
Netflix's expansion into non-US territories - Canada in 2010, Latin American countries in 2011, and a number of European countries earlier this year - involved building up new content catalogues specific to each licensing territory, meaning a lot more transcoding using the cloud in order to meet fixed launch deadlines.
Netflix currently uses a media processing pipeline dubbed Matrix. Content partners such as movie studios deliver content to Netflix, with the video streaming company employing Aspera's "Direct-to-S3" service to house it in Amazon's Simple Storage Service (S3).
Netflix then uses technology from start-up eyeIO and Amazon's EC2 service to transcode the source material received from the studios into multiple formats that can be streamed to the range of devices supported by the company. The results are stored in S3, before being sent to Netflix's CDN for streaming; Netflix creates multiple versions of each movie or TV show episode to stream to devices ranging from TVs to tablets to gaming consoles. The transcoding farm uses 6000-6500 EC2 instances.
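The fan-out from one source file to many device formats can be pictured as a list of independent jobs, one per title and target format. The following is a minimal sketch only: the profile names, file names and output naming scheme are hypothetical and do not describe how Matrix actually works.

```python
from itertools import product


def plan_jobs(source_keys, profiles):
    """Build one independent transcode job per (source, profile) pair."""
    return [
        {"source": source, "profile": profile, "output": f"{source}.{profile}"}
        for source, profile in product(source_keys, profiles)
    ]


# Hypothetical inputs: two studio masters and three device profiles.
jobs = plan_jobs(
    ["movie-a.master", "movie-b.master"],
    ["tv", "tablet", "console"],
)
print(len(jobs), "jobs planned")
```

Because no job depends on another, each one could be handed to a separate machine, which is what makes this kind of workload suit a large pool of rented instances.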
The company is currently working on a successor for Matrix, dubbed Maple. Instead of using Matrix's approach of processing an entire piece of content at once, Maple will break videos up into five-minute chunks, each of which will be processed by a separate EC2 instance. McEntee said that the advantages include being more fault-tolerant - currently a job may fail mid-way through transcoding and have to be restarted from scratch - and the ability to deliver content faster in those cases where Netflix has an agreement to begin screening content the day after it first aired.
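The fault-tolerance benefit of chunking can be sketched in a few lines. This is an illustration under stated assumptions, not Maple's design: `transcode_chunk` is a hypothetical stand-in for a real encoder, a local thread pool stands in for separate EC2 instances, and the retry limit is arbitrary. The point is that a failure costs one five-minute chunk rather than the whole title.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SECONDS = 5 * 60  # five-minute chunks, as described for Maple


def split_into_chunks(duration_seconds, chunk_seconds=CHUNK_SECONDS):
    """Return (start, end) offsets in seconds that cover the whole title."""
    return [
        (start, min(start + chunk_seconds, duration_seconds))
        for start in range(0, duration_seconds, chunk_seconds)
    ]


def transcode_chunk(chunk):
    """Hypothetical placeholder for a real encoder call."""
    start, end = chunk
    return f"encoded[{start}-{end}]"


def transcode_with_retry(chunk, encode=transcode_chunk, max_attempts=3):
    """Retry a single failed chunk instead of restarting the whole title."""
    for attempt in range(1, max_attempts + 1):
        try:
            return encode(chunk)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up on this chunk after the final attempt


def transcode_title(duration_seconds, encode=transcode_chunk, workers=8):
    """Encode all chunks in parallel; results come back in chunk order."""
    chunks = split_into_chunks(duration_seconds)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: transcode_with_retry(c, encode), chunks))


print(transcode_title(700))
```

A real system would also need to stitch the encoded chunks back together, which this sketch leaves out.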
Netflix is also working on a 'digital vault' that can house video masters and secondary assets, such as audio in different languages, that could be delivered to both its systems and those of its competitors in the video streaming space.