#Webrtc Codec vs Media Engines: Implementation Status and why you should care.

Which codecs, and which flavours of those codecs, are supported by which browsers? This is a tricky question for every product manager who wants to define product expectations and a roadmap. Should I support only VP8 / H264? Should I wait for VP9? What are the multi-stream, simulcast, and SVC versions of those codecs? The real question behind all of these is: when can I tell my customers that I support desktop browsers, Android, and iOS? And if I can’t, should I have a native app instead? This blog post aims at putting everything back in perspective.

I. Codecs and bandwidth

When speaking about codecs, people traditionally speak mainly about quality vs bandwidth footprint, and leave out any reference to networks or CPU footprint. When speaking about encoding of pre-recorded media, that makes a lot of sense.

First, you have all the time in the world to encode your content, so all optimisations, however costly in CPU and in time, are fair. Most of the best encoders use a multi-pass algorithm:

  1. pass #1: cut the video in homogeneous chunks (same background, fast or slow scene, …)
  2. pass #2: compute some statistics to help the encoder later, as well as some inter-frame values (those require several frames to be computed).
  3. pass #3: encode each chunk separately, with the stats from pass #2

For real-time streaming, obviously, you will not wait until the end of the conference, or the show, to start streaming it. It’s the “live” aspect and the interaction that bring value to a real-time stream, more than the quality itself.

Then, if distributed on traditional physical media like DVDs or Blu-ray discs, your size budget is fixed and cannot be extended, so you want the best quality for a fixed size. If distributed over the network (streamed in the Netflix/YouTube sense), your main cost, and the main bottleneck for viewers, is bandwidth usage, so you want the best quality possible for a given bandwidth (when bandwidth is limited), and the least bandwidth possible for a given “quality” (in this case quality loosely refers to resolution, e.g. full-HD, 4K, …) to reduce the cost. All the glitches caused by the network side of things can be taken care of with intelligent buffering.

For real-time streaming, there cannot be a physical storage format. Moreover, when streaming real-time content (as in YouTube Live), since interaction is what matters most, any kind of buffering, with its added latency, would be unacceptable.

Finally, for a very long time, codec experts have assumed that only one video stream at a time would ever be rendered on a computer (whether streamed or read from a DVD), and that it could always be offloaded to the GPU. So the complexity of the codec did not really matter. Then mobile happened. Then WebRTC and SFUs happened.

Those little distinctions are the reason why discussions about codecs, or comparisons of codecs, in the context of real-time media do not make sense if you do not take into account the extreme sensitivity to both real-time constraints AND network quality. Those who only compare the maximum achievable compression ratio are off-topic, and unfortunately that is what I see cited most often. Even during the original discussion about VP8 vs H264, one of the most contentious points was whether the encoder settings were realistic or not. “Pure Codec” people would state that the network should not be taken into account when benchmarking; everybody else would argue that without being adaptive, and without accounting for packet loss, jitter, propagation delay, etc., the results would not be practical.

II. Codecs: So where do we stand?

The goal here is not to provide a definitive answer, but rather a rule of thumb to help product managers make a decision.

  • H.264 and VP8 are basically on par when it comes to quality vs bandwidth.
  • VP9 and H.265 seem to be practically on par when it comes to quality vs bandwidth, and exhibit an average 30% gain over VP8/H.264 at a cost of around 20% extra CPU footprint.
  • AV1, a mix of VP10, Daala and Thor, exhibits more or less the same gain and cost over VP9/H.265 as those had over VP8/H.264.
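To make the rule of thumb concrete, here is a small arithmetic sketch. It assumes the 30% bandwidth gain and 20% CPU cost quoted above simply compound from one codec generation to the next, which is a simplification; the 1000 kbps starting point and the `project` helper are made up for illustration.

```python
# Rule-of-thumb figures quoted in the list above; rough averages, not measurements.
BITRATE_GAIN = 0.30   # each codec generation needs ~30% less bandwidth
CPU_COST = 0.20       # ...and costs ~20% more CPU

def project(generations: int, base_kbps: float = 1000.0, base_cpu: float = 1.0):
    """Return (kbps, relative CPU) after the given number of codec generations."""
    kbps = base_kbps * (1 - BITRATE_GAIN) ** generations
    cpu = base_cpu * (1 + CPU_COST) ** generations
    return kbps, cpu

# Hypothetical 1000 kbps VP8/H.264 stream used as the baseline.
for name, gen in [("VP8/H.264", 0), ("VP9/H.265", 1), ("AV1", 2)]:
    kbps, cpu = project(gen)
    print(f"{name:10s} ~{kbps:.0f} kbps, ~{cpu:.2f}x CPU")
```

Under those assumptions, two generations roughly halve the bandwidth while adding a bit under half again to the CPU bill.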

However, the best-kept secret is that none of this truly matters when it comes to real-time media user experience. What matters is how good a network citizen your codec is. Nowadays, in real-time media, nobody cares about the codec alone; people care about the media engine: capturer + en/decoder + packetizer (RTP), and its capacity to handle the network (bandwidth fluctuation and bandwidth quality). If your encoder stops working as soon as your bandwidth goes below a threshold, if your decoder stops working when there is a single packet loss, or if your encoder cannot encode at a minimum of 20fps, your real-time media solution is worthless, however good the compression ratio of your codec is.

It is no surprise that Google sponsored Stanford Ph.D. research on better codec / network coupling. It’s the future.

III. Codecs to media engine: a logical disruption

Real-time media has almost the same goals as normal media, but with a very different priority order (main goals):

  1. I need to maintain 30fps (speed)
  2. I need to maintain 30fps with interactivity (latency)

and since you can’t assume anything about the public internet, if you stream over it, you have additional constraints (public internet goals):

  1. I need to accommodate small bandwidth fluctuations
  2. I need to accommodate huge bandwidth fluctuations
  3. I need to accommodate jitter (out-of-order arrival of packets)
  4. I need to accommodate packet loss

III.1 Media Engine: Main goals

Maintaining a throughput of 30fps means that you have 33ms to capture, encode (codec), packetise (RTP), encrypt (SRTP) and send (UDP/TCP) on the sender side, and to do the reverse on the receiver side. It’s usually more difficult on the sender side, since encoding is usually more complicated and slower than decoding.

You could maintain a throughput of 30fps while inducing delay. For example, the temptation is high to keep a frame buffer with multiple frames in order to compute some of the encoding optimisations (inter-frame prediction, motion vectors, …) I spoke about earlier. That would in turn reduce your bandwidth usage. Unfortunately, waiting for 5 frames means you accumulate the delay of capturing 5 frames before you start any encoding. The encoding itself is also slower, which adds further delay to your system. Your throughput is 30fps, but your end-to-end latency (from camera capture on the sending side to screen rendering on the receiving side) is more than 33ms. A lot of websites, intentionally or not, deceive their readers by reporting end-to-end latency as the time taken from the output of the UDP/TCP socket on the sender side to the receiving side, conveniently omitting to measure encoding and decoding time, and any additional delay introduced on the client side. Needless to say, their measurements do not correlate with user experience.
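The difference between throughput and latency is easy to check with a few lines of arithmetic. The sketch below only counts the time spent waiting for frames to be captured; encoding, network and decoding time would come on top. The function names are illustrative.

```python
FPS = 30

def frame_interval_ms(fps: float) -> float:
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps

def lookahead_delay_ms(fps: float, buffered_frames: int) -> float:
    """Delay added purely by waiting for `buffered_frames` frames to be captured."""
    return buffered_frames * frame_interval_ms(fps)

print(f"per-frame budget at {FPS} fps: {frame_interval_ms(FPS):.1f} ms")
print(f"capture delay with a 5-frame buffer: {lookahead_delay_ms(FPS, 5):.1f} ms")
```

At 30fps the per-frame budget is about 33.3 ms, while a 5-frame buffer already costs about 166.7 ms before a single frame has been encoded.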

The solution? Drop every codec subcomponent and sub-algorithm that induces too much delay. Basically, revert to almost a frame-by-frame approach. While this was originally an outrage, sorry, a disgrace, or even a blasphemy in the codec community, nowadays most of the new codecs have a “Real-time Mode”, i.e. a set of parameters where latency is prioritised over anything else, whereas traditionally there were only “best-quality” or “minimum-size” modes and timing did not really matter.

To be thorough, in the new codecs you also have a Screensharing mode, since the content of screen sharing is particular (high spatial resolution, low temporal resolution, lossless …).

III.2 Media Engine: public internet goals

small bandwidth fluctuations

Old codecs could not change their bitrate, i.e. once started they would assume that a certain bandwidth would always be available to them, and if not, they would fail (miserably). That was the time when codecs and audio/video streaming were thought of as an add-on to network equipment, and thus security and bandwidth were handled by the network equipment. Dedicated ports, dedicated bandwidth. No public internet.

The first change was to make codecs “bitrate adaptive”. In any codec you can change certain parameters. Some changes are obvious to the human eye, like changing the (spatial) resolution; some a little bit less, like changing the temporal resolution (30fps to 25fps); and some are almost invisible, like changing the quantization. The quantization parameter is the number of shades a given colour can have. If you use 256 (often the default), you will have smooth transitions from white to black; if you reduce it, the transitions will be less smooth, but in most cases your eyes will not see the difference. Traditionally, encoders use the quantization parameter (QP) as a knob to achieve bitrate adaptation without too much impact on the visual quality of the video.
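As a rough illustration of the QP knob, here is a minimal sketch of a feedback loop. It assumes a QP scale where a higher value means coarser quantization (and therefore fewer bits), which is how many encoders expose it; the bounds, the tolerance and the `next_qp` function are hypothetical and not taken from any real encoder.

```python
QP_MIN, QP_MAX = 10, 51  # hypothetical bounds; real ranges depend on the codec

def next_qp(qp: int, measured_kbps: float, target_kbps: float,
            tolerance: float = 0.10) -> int:
    """Nudge the quantizer so the encoder output drifts toward the target bitrate."""
    if measured_kbps > target_kbps * (1 + tolerance):
        qp += 1   # coarser quantization -> fewer bits
    elif measured_kbps < target_kbps * (1 - tolerance):
        qp -= 1   # finer quantization -> more bits, better quality
    return max(QP_MIN, min(QP_MAX, qp))

# The encoder overshoots a 1000 kbps target, so the quantizer gets coarser.
print(next_qp(30, measured_kbps=1200, target_kbps=1000))  # 31
```

A real rate controller is considerably more elaborate, but the principle is the same: small, frequent adjustments that the viewer barely notices.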

Of course, you need to be able to compute available bandwidth, and provide feedback. Those mechanisms are in a media engine, but not in the codec.

huge bandwidth fluctuations

Bitrate adaptation is nice. Bitrate adaptation is automatic. However, it cannot accommodate large bandwidth changes. Let’s say your bandwidth is divided by two: even with a bitrate-adaptive codec, you won’t survive.

In those cases, you need to reduce the spatial resolution or the temporal resolution. The temporal resolution is usually the first target, for two reasons. One, the human eye is slightly less sensitive to frame rate changes than it is to resolution changes (within reason). Two, it is simple to do: one usually just drops one frame out of 2 or 3 (30 fps => 15fps => 10 fps). In most cases though, you need to do this on the sender side, and if your sender is connected to a media server which relays the stream to multiple remote peers, all the remote peers would be impacted. Basically, they would all receive a stream that has been adapted for the worst configuration / network among all the remote peers.
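Dropping frames on the sender side can be sketched in a few lines. The `decimate` helper below is illustrative only and works on raw, not yet encoded frames; dropping already-encoded frames is a different problem because of inter-frame dependencies.

```python
from typing import Iterable, Iterator, TypeVar

Frame = TypeVar("Frame")

def decimate(frames: Iterable[Frame], keep_every: int) -> Iterator[Frame]:
    """Yield one frame out of every `keep_every` (2 turns 30 fps into 15 fps)."""
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    for index, frame in enumerate(frames):
        if index % keep_every == 0:
            yield frame

one_second = list(range(30))                 # stand-in for 30 captured frames
print(len(list(decimate(one_second, 2))))    # 15
print(len(list(decimate(one_second, 3))))    # 10
```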

If you control the sender side, and are using an SFU, but have a bandwidth limitation on (one of) the receiver sides, a better approach is to use simulcast. The sender side will encode the same stream at three different resolutions, and depending on the capacity of a remote peer at a given time, the SFU or the receiving client will decide which resolution of the original stream to consume. Now you can accommodate each remote peer individually, at the cost of encoding the same stream three times (CPU overhead) and sending all three (bandwidth usage overhead on the sender side). Note that any codec can be used in simulcast mode, if a corresponding implementation exists. It’s not an intrinsic codec feature, it’s external.
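The per-receiver decision can be sketched as follows. The three-layer ladder, its bitrates and the `pick_layer` function are assumptions made for illustration; they are not libwebrtc defaults, and a real SFU would also add hysteresis to avoid flapping between layers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimulcastLayer:
    name: str
    height: int
    kbps: int  # hypothetical bitrate the sender encodes this layer at

# Illustrative three-layer ladder; the numbers are assumptions.
LAYERS = [
    SimulcastLayer("low", 180, 150),
    SimulcastLayer("mid", 360, 500),
    SimulcastLayer("high", 720, 1500),
]

def pick_layer(estimated_kbps: float, layers=LAYERS) -> SimulcastLayer:
    """Return the highest-bitrate layer that fits the receiver's estimated bandwidth."""
    ordered = sorted(layers, key=lambda layer: layer.kbps)
    fitting = [layer for layer in ordered if layer.kbps <= estimated_kbps]
    # Fall back to the lowest layer rather than sending nothing.
    return fitting[-1] if fitting else ordered[0]

for receiver_kbps in (2000, 600, 100):
    print(receiver_kbps, "->", pick_layer(receiver_kbps).name)
```

Each remote peer gets its own call to `pick_layer`, which is exactly what makes simulcast friendlier than adapting the single sender stream to the worst receiver.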

SVC, a.k.a. layered codecs, achieve the same thing: the capacity to choose within the SFU which resolution to relay to each remote peer, but in a smarter way. There is only one encoder instead of one per resolution. It allows for around 20% saved bandwidth over simulcast for the same resolutions. There is only one encoded bitstream, within which the resolutions are “layered” or interlaced, which simplifies lip-sync, port management, and other practical details. It also simplifies things on the SFU, since each packet is now marked, and changing resolution practically boils down to dropping packets, which takes no time, as opposed to switching between simulcast streams, which requires some rewriting of packets and waiting for a full frame.
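The "changing resolution boils down to dropping packets" idea can be sketched like this. The `Packet` type and its `spatial_id` field are hypothetical stand-ins for whatever layer marking the real payload format carries; the sketch assumes layer 0 is the base layer and higher layers depend on lower ones.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass(frozen=True)
class Packet:
    seq: int
    spatial_id: int   # 0 = base layer; hypothetical marking carried with each packet
    payload: bytes = b""

def forward_for_receiver(packets: Iterable[Packet], max_spatial_id: int) -> List[Packet]:
    """Keep only packets at or below the layer the receiver can afford."""
    return [p for p in packets if p.spatial_id <= max_spatial_id]

# Two frames, each split across three spatial layers.
stream = [Packet(seq=i, spatial_id=i % 3) for i in range(6)]
print([p.seq for p in forward_for_receiver(stream, max_spatial_id=1)])  # [0, 1, 3, 4]
```

No packet is rewritten and no keyframe is needed: the SFU simply stops forwarding the layers a given receiver cannot afford.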

jitter (out-of-order arrival of packets) and packet loss

Those are the most difficult things to deal with. Jitter is the easier of the two: you create a buffer, and since all the packets are numbered, you put them back in order. In real-time you do not want the jitter buffer to be too deep, otherwise you potentially wait too long and break your real-time constraint (33ms end-to-end). Using a bigger jitter buffer could help, but would basically mean buffering.
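A minimal reordering buffer might look like the sketch below. It is a toy under stated assumptions: sequence numbers never wrap around (real RTP sequence numbers are 16-bit and do wrap), depth is counted in packets rather than milliseconds, and the `JitterBuffer` class is not modelled on any particular implementation.

```python
import heapq
from typing import List, Tuple

class JitterBuffer:
    """Reorders packets by sequence number, holding at most `depth` of them."""

    def __init__(self, depth: int, first_seq: int = 0) -> None:
        self.depth = depth
        self.next_seq = first_seq
        self._heap: List[Tuple[int, bytes]] = []

    def push(self, seq: int, payload: bytes) -> List[Tuple[int, bytes]]:
        """Insert a packet and return every packet that is now ready to play."""
        if seq >= self.next_seq:          # packets that arrive too late are dropped
            heapq.heappush(self._heap, (seq, payload))
        ready = []
        while self._heap:
            head_seq = self._heap[0][0]
            if head_seq < self.next_seq:
                heapq.heappop(self._heap)  # duplicate of something already released
            elif head_seq == self.next_seq or len(self._heap) > self.depth:
                # Either the expected packet arrived, or the buffer is full and
                # we stop waiting for the missing one (it is treated as lost).
                seq_out, data = heapq.heappop(self._heap)
                ready.append((seq_out, data))
                self.next_seq = seq_out + 1
            else:
                break
        return ready

buf = JitterBuffer(depth=3)
for seq in (0, 2, 1, 3):                  # packets 1 and 2 arrive swapped
    print(seq, "->", [s for s, _ in buf.push(seq, b"")])
```

The `depth` parameter is the trade-off discussed above: a deeper buffer tolerates more reordering but makes the receiver wait longer before giving up on a missing packet.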

Packet loss is usually taken care of by retransmitting the packet (RTX). If the time it takes to go from the sender to the SFU and back (RTT) is short enough with respect to our 33ms constraint, we have time to retransmit the packet before running out of time. If this is not enough and you have a bad network (more than 0.01 % packet loss), you need to implement more advanced error correction algorithms like FEC (forward error correction).
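To give an idea of what FEC does, here is a sketch of the simplest possible scheme: one XOR parity packet protecting a group of packets. It is only an illustration of the principle, assuming equal-length payloads and at most one loss per group; it is not the FEC scheme used by any particular WebRTC stack, and the function names are made up.

```python
from functools import reduce
from typing import List, Optional

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length payloads."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(packets: List[bytes]) -> bytes:
    """XOR all packets of a group together (equal-length payloads assumed)."""
    return reduce(xor_bytes, packets)

def recover(received: List[Optional[bytes]], parity: bytes) -> List[Optional[bytes]]:
    """Rebuild a group in which at most one packet (marked None) was lost."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("a single XOR parity packet can repair only one loss")
    repaired = list(received)
    if missing:
        present = [p for p in received if p is not None]
        # XOR of the parity with every surviving packet yields the lost one.
        repaired[missing[0]] = reduce(xor_bytes, present, parity)
    return repaired

group = [b"abcd", b"efgh", b"ijkl"]
parity = make_parity(group)
print(recover([b"abcd", None, b"ijkl"], parity) == group)  # True
```

The receiver repairs the loss without waiting a full round trip, at the cost of sending one extra packet per group whether or not anything was lost.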

A better approach, here again, is to use SVC codecs. Because of the way the layers are interlaced, and because only the base layer is really needed for the call to go on, in practice the time window you get to retransmit a packet belonging to the base layer is several times the RTT. It means that just retransmitting packets is usually enough to compensate for very bad network conditions (1%+ packet loss) without loss of continuity in the call. While simulcast was just a solution for bandwidth management with an SFU, SVC codecs are a solution to both bandwidth fluctuation and network quality problems.

IV. Current status of Browsers

Firefox and Safari follow Google when it comes to the media engine. They only update their internal copy of libwebrtc once in a while though, and do not follow the Google Chrome release schedule (every 6 weeks). They might be out of sync at some point, but they eventually catch up, with the exception of VP8 in Safari (don’t ask).

Then, you can take a look at the table below for completeness, but the analysis is simple, since most discard Edge right away. Today you have to choose between supporting iOS Safari and having good quality. On one hand, iOS Safari only supports H.264; on the other hand, libwebrtc only implements simulcast (with temporal scalability) with VP8, and SVC with VP9.

How important is simulcast support when we already have normal H.264 support in iOS? Well, in most cases, clients will not forgive you for compromising quality for interoperability. If you want to support iOS with the same level of quality, go native for now. A few cherry-picked examples:

  • Highfive has had an Electron client (desktop native) with support for H264 simulcast for video for more than two years (and enhanced audio codecs from Dolby).
  • Atlassian has refused to deliver a client without support for simulcast, as it would not be of good enough quality. They support iOS through a React Native client in which they have support for simulcast through the embedded libwebrtc.
  • Symphony has an Electron desktop native client, and React Native clients for iOS and Android, to be able to support simulcast, implement double encryption in libwebrtc, and comply with bank regulations.
  • Tokbox has had VP8 with temporal scalability and full simulcast support for at least 4 years now in their mobile SDK (using a modified libvpx in libwebrtc) to achieve better quality in their mobile-to-mobile video calls.

So first, you can trust that they know what they are doing. Then you can trust your consultant. Now if you don’t, you have to wonder how you are going to compete with the rest of the ecosystem with an inferior technology.

V. The future

It’s pretty clear that VP8 will not be available in Safari. The same can be said, without a lot of risk, about VP9.

While early on Apple seemed to support H.265 for inclusion in WebRTC, since it already supports it for HLS anyway, its recent joining of the Alliance for Open Media, and a few other small things like scrapping any mention of H.265 on the outside of the iPhone, lead me to think that AV1 might be the next thing. Unlike the rest of this post, this is just an opinion.

In any case, the reference AV1 bitstream has just been frozen (the specs are complete, if you want), but the reference encoder is still far, far away from real-time, at 0.3 fps on reference hardware. While that might not be such a problem for pre-recorded content (you have all the time in the world for encoding), it definitely is a no-go for real-time media. It will take at least a year for it to reach a stage where it is fast enough to be usable in RTC.

In the meantime, in non-RTC use cases, you can already enjoy playing pre-encoded AV1 files in Firefox thanks to Bitmovin’s (high-latency, non-WebRTC) streaming technology. The same Bitmovin whose founder invented MPEG-DASH and which just announced raising 30 million to prepare for the next generation of video infrastructure …

[Update 1] – 23 APR 2018 – Lorenzo Miniero pointed our attention to the fact that enabling simulcast for VP8 was implicitly enabling temporal scalability for both Chrome and Firefox. Article has been modified accordingly.

[Update 2] – 23 APR 2018 – The team behind the simulcast support at Tokbox reached out, and gave us some more details on their implementation. Article has been modified accordingly.

2 thoughts on “#Webrtc Codec vs Media Engines: Implementation Status and why you should care.”

  1. In the case of a simulcast switch, I understand why we need to wait for a full frame, but why and what do we need to rewrite, i.e. what does “require some rewriting of packets” mean?

    1. It’s simple. With SVC you have only one stream, so the SSRC is unique, whatever the layer is.
      With simulcast, each incoming stream has its own SSRC. The SFU will only relay one of the incoming simulcast streams to the viewer, choosing only one SSRC for this stream from the viewer’s perspective. Whether it uses one of the original SSRCs or a new, temporary one is up to the SFU implementation. When you change which stream you relay to the viewer, you need to change the SSRC information in the RTP headers of the new stream to match the one you used in the first place, so you can reuse the same connection with a different bitstream. Hope this helps. There is a little bit more to it, but this is the part I was referring to in the post.
