Many things happened this week, both during IETF sessions and in the ecosystem, that troubled me enough to write a dedicated blog post. People from the streaming industry and from the WebRTC industry alike are approaching OTT media in general, and WebRTC specifically, the wrong way. It works, but it does not work well enough. It connects, it streams in perfect conditions, but it does not stream with good quality, at scale, and in real-time. What is missing?
I have written several blog posts on the subject, and talked about it again last year: https://www.slideshare.net/alexpiwi5/streaming-media-west-webrtc-the-future-of-low-latency-streaming.
Most of the streaming industry uses a model where the media is completely separated from the transport: codec on one side, HLS/MPEG-DASH (with or without CMAF) on the other, and nothing in between. Of course, that does not work well, and multiple workarounds need to be added in both the servers and the players (chunking, transcoding, ABR, …), which makes both players and infrastructure a critical part of the experience and a must-have.
Some people come to WebRTC with the same approach, and some WebRTC stack implementations present their capacity to support any encoder, decoder, or external framework, *without coupling with the RTP layer*, as a feature.
Let’s dig again into that and into the notion of a Media Engine.

If you don’t dig too deep, the capturer feeds the encoder with raw frames, the encoder feeds the RTP packetizer with encoded frames, the RTP layer feeds the crypto layer with RTP packets, which feeds the ICE transport with SRTP packets. It looks like a linear pipeline, all is good. On the receiving side, you do the opposite: receive SRTP, remove the encryption, reconstruct full encoded frames, push them to the decoder, all is good. Of course it’s not that simple, and the devil is in the details.
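As a purely illustrative sketch (all the type and function names below are hypothetical, not any real WebRTC API), the naive view of the sending side really does look like straightforward function composition; the catch is everything this picture leaves out, which is exactly the feedback described next.

```typescript
// Purely illustrative sketch of the "naive" linear view of the pipeline.
// All type and function names here are hypothetical, not a real WebRTC API.
type RawFrame = { width: number; height: number; data: Uint8Array };
type EncodedFrame = { keyframe: boolean; data: Uint8Array };
type RtpPacket = Uint8Array;
type SrtpPacket = Uint8Array;

declare function capture(): RawFrame;
declare function encode(frame: RawFrame): EncodedFrame;
declare function packetize(frame: EncodedFrame): RtpPacket[];
declare function protect(packet: RtpPacket): SrtpPacket; // SRTP encryption
declare function sendOverIce(packet: SrtpPacket): void;

function naiveSendOneFrame(): void {
  // Looks linear: capturer -> encoder -> RTP packetizer -> SRTP -> ICE.
  // What this picture hides is everything flowing the other way:
  // RTCP NACK/PLI, receiver reports, and bandwidth estimates feeding back
  // into the encoder's target bitrate and keyframe decisions.
  packetize(encode(capture())).map(protect).forEach(sendOverIce);
}
```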
You can see on slide 42 of this presentation that the RTP layer and the encoder have a feedback loop and are deeply entangled. All the media quality mechanisms actually depend on those feedback loops and on additional mechanisms in RTP, like the jitter buffer. If you want to do real-time media, you shall NOT buffer, you shall NOT delay, and you shall NOT spend more time processing a frame than it takes to acquire one (30 fps => 33 ms, 60 fps => 16 ms).
If you take a closer look, you will see, for example, that the decoder on the receiving side has a feedback loop to the sender-side encoder through special RTCP packets: NACK and PLI, which allow it to request retransmission of a lost packet or, if it is too late for that, to request a full frame.
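If you want to see that feedback channel being negotiated, a quick browser-side check (a minimal sketch using the standard RTCPeerConnection API with a throwaway offer) is to look for the a=rtcp-fb lines in the SDP:

```typescript
// Sketch: the NACK/PLI feedback channel shows up in the offer SDP as
// "a=rtcp-fb" lines attached to each video codec.
async function listNegotiatedRtcpFeedback(): Promise<void> {
  const pc = new RTCPeerConnection();
  pc.addTransceiver('video', { direction: 'sendonly' });
  const offer = await pc.createOffer();
  const feedbackLines = (offer.sdp ?? '')
    .split('\r\n')
    .filter((line) => line.startsWith('a=rtcp-fb:'));
  // Expect entries such as "a=rtcp-fb:96 nack" and "a=rtcp-fb:96 nack pli".
  // An external encoder that never sees these messages cannot react to them.
  console.log(feedbackLines);
  pc.close();
}
```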

This is not the only feedback mechanism available in RTP/RTCP; there is actually a whole collection of reports (SR, RR) and messages that help control bandwidth and other network parameters that are critical to media quality.
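In a browser, the sender-side view of those receiver reports surfaces through getStats() as remote-inbound-rtp entries. A minimal sketch, assuming pc is an already connected RTCPeerConnection that is sending media:

```typescript
// Sketch: reading the values libwebrtc derives from RTCP receiver reports.
async function logReceiverReports(pc: RTCPeerConnection): Promise<void> {
  const report = await pc.getStats();
  report.forEach((stats) => {
    if (stats.type === 'remote-inbound-rtp') {
      // These fields are computed from the RTCP RRs sent back by the receiver.
      const s = stats as RTCStats & Record<string, unknown>;
      console.log({
        roundTripTime: s.roundTripTime,
        fractionLost: s.fractionLost,
        jitter: s.jitter,
        packetsLost: s.packetsLost,
      });
    }
  });
}
```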

Many quality-related features are negotiated through RTP header extensions, especially for audio.
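To see which header extensions were actually negotiated, one option (a sketch, assuming pc is an RTCPeerConnection whose offer/answer exchange has completed) is to read them off the senders’ parameters:

```typescript
// Sketch: listing the RTP header extensions negotiated for each sender.
function logNegotiatedHeaderExtensions(pc: RTCPeerConnection): void {
  for (const sender of pc.getSenders()) {
    const { headerExtensions } = sender.getParameters();
    // For audio, you would typically expect URIs such as
    // "urn:ietf:params:rtp-hdrext:ssrc-audio-level" among the entries.
    console.log(sender.track?.kind, headerExtensions?.map((ext) => ext.uri));
  }
}
```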

And that’s even before we start getting into much more advanced topics like Forward Error Correction (FEC).
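As a quick sanity check, in browsers that expose it, RTCRtpSender.getCapabilities('video') lists FEC-related payloads such as red, ulpfec, or flexfec; a small sketch:

```typescript
// Sketch: checking whether the local WebRTC stack advertises FEC at all.
function listVideoFecCapabilities(): string[] {
  const caps = RTCRtpSender.getCapabilities('video');
  return (caps?.codecs ?? [])
    .map((codec) => codec.mimeType)
    .filter((mime) => /red|ulpfec|flexfec/i.test(mime));
}
```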

All of those mechanisms are also the foundation of bandwidth estimation, ramp-up time, bandwidth adaptation, simulcast and SVC layer switching, and so on and so forth. These all depend on the feedback mechanisms being implemented correctly.
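Simulcast is a good illustration: the sender declares several layers, but which layer actually goes out at any given moment is the bandwidth estimator’s decision, and the estimator is fed entirely by that RTCP feedback. A minimal sketch, assuming pc and videoTrack already exist:

```typescript
// Sketch: declaring three simulcast layers on a sender. The layers that are
// actually transmitted are chosen by the bandwidth estimator, which lives
// off the RTCP feedback described above.
function addSimulcastVideo(pc: RTCPeerConnection, videoTrack: MediaStreamTrack): void {
  pc.addTransceiver(videoTrack, {
    direction: 'sendonly',
    sendEncodings: [
      { rid: 'q', scaleResolutionDownBy: 4, maxBitrate: 150_000 },
      { rid: 'h', scaleResolutionDownBy: 2, maxBitrate: 500_000 },
      { rid: 'f', maxBitrate: 1_500_000 },
    ],
  });
}
```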
When some propose to use an external capturer and pass the generated video frames to WebRTC for encoding and onward, it makes sense.
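That case is easy precisely because WebRTC still owns the encoder, so all the feedback loops stay intact. A small sketch, assuming pc is an existing RTCPeerConnection and using a canvas as the external frame source:

```typescript
// Sketch of the "external capturer" case: the application produces raw frames
// (here, by drawing on a canvas) and hands them to WebRTC, which still owns
// the encoder and therefore keeps all of its feedback loops.
function sendCanvasAsVideo(pc: RTCPeerConnection, canvas: HTMLCanvasElement): void {
  const stream = canvas.captureStream(30); // 30 fps capture of the canvas
  const [track] = stream.getVideoTracks();
  pc.addTrack(track, stream);              // encoding stays inside the stack
}
```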
When someone proposes to use an external encoder, or already encoded frames like the ones you get from some IP cameras, I’m always worried, and I try to tell them: make sure you have the feedback mechanisms in place, or else … Alas, most people kill the messenger. It’s a feature, they say. All the other protocols do that, they say.
There is a very easy way to check whether you have those mechanisms in place: add some latency, jitter, or packet loss to your network, and see what happens. Without feedback, it will crash or create gruesome artefacts in the video. It is as simple as replicating that experiment from the Jitsi team, or the Agora comparative study, with your solution.
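While the impairment is active, you can also watch the feedback counters directly. A sketch, assuming pc is the sending RTCPeerConnection (the exact stats fields available vary a little between browsers):

```typescript
// Sketch: while injecting loss/jitter with your favourite impairment tool,
// poll getStats() on the sender and watch the feedback counters move.
// If nackCount/pliCount stay at zero under heavy loss, the feedback loop is
// most likely not wired up to your external encoder.
function watchFeedbackCounters(pc: RTCPeerConnection): void {
  setInterval(async () => {
    const report = await pc.getStats();
    report.forEach((stats) => {
      const s = stats as RTCStats & Record<string, unknown>;
      if (stats.type === 'outbound-rtp' && s.kind === 'video') {
        console.log({
          nackCount: s.nackCount,
          pliCount: s.pliCount,
          retransmittedPacketsSent: s.retransmittedPacketsSent,
        });
      }
    });
  }, 1000);
}
```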
For the others, using RTCP messages, good feedback, and a robust bandwidth estimation algorithm (choose your flavour, there are quite a few implemented in WebRTC, as illustrated below), life should be good and the stream should remain viewable.

After all, Google still spent an extra $68M on the GIPS acquisition even though they had just acquired On2 for their codecs. Why would they do that? Because encoders are only half of the story when you stream over the public internet.
Happy hacking.