In their latest blog post, Wowza does a great job of explaining latency in simple words, along with the use cases that could benefit from sub-500ms, a.k.a. “real-time”, latency. However, the section about streaming protocols somewhat confused me. This blog post is an attempt to put those protocols back into perspective for a fair comparison.
Signalling path vs Media path
In most modern systems, whether video conferencing or streaming, there is a separation between the signalling path and the media path. The signalling path is used for discovery and handshake.
Discovery is the act of connecting the parties that should start sending media to each other, whether both parties are individuals in the p2p case, or an individual and a server/infrastructure in the publisher/viewer case.
The handshake is the act of exchanging information between parties prior to, and for the purpose of, establishing a media path and streaming media.
While the two paths are explicitly separated in VoIP (SIP) and WebRTC, they were not in RTMP (Flash), which often led actors from the Flash world to mistake one for the other and compare things that should not be compared. Both signalling and media have their own protocol, and an underlying transport protocol.
Signalling protocol vs Signalling transport protocol
The signalling protocol defines the format and content of the signals exchanged during discovery and the handshake. It can be SIP for discovery in VoIP or WebRTC, with SDP O/A for the handshake; it can be RTMP/AMF for Flash; it can be JSEP for WebRTC; etc.
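To make the split concrete, here is a minimal sketch of a JSEP-style signalling message. The field names and truncated SDP blob are illustrative, not a normative wire format — the point is that the signalling protocol defines this payload, while the signalling transport only carries the bytes:

```python
import json

# Hypothetical JSEP-style offer: a "type" plus an SDP blob produced by
# the handshake layer. Real SDP lists codecs, ICE candidates, DTLS
# fingerprints, etc.; this one is truncated for illustration.
offer = {
    "type": "offer",
    "sdp": "v=0\r\no=- 46117 2 IN IP4 127.0.0.1\r\ns=-\r\nt=0 0\r\n",
}

# The transport (WebSocket, TCP, ...) sees only opaque bytes.
wire_bytes = json.dumps(offer).encode("utf-8")
decoded = json.loads(wire_bytes)
```

Any transport that delivers the bytes intact works equally well here, which is exactly why signalling protocol and signalling transport protocol should be discussed separately.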
The signalling transport protocol is the underlying protocol used to transport the signalling from one party to the other. Traditionally it was either TCP or UDP (Flash, SIP), but more recently protocols like WebSocket have also been used (SIP, WebRTC).
Note that WebSocket is a TRANSPORT protocol, comparable with TCP and UDP, and should not be compared as-is to media streaming protocols like WebRTC, SIP, RTMP, or others, which are more complex and belong to a higher level of abstraction. WebSocket has been especially popular for web apps because you can open a WebSocket connection from within the browser, while you cannot access a “raw” TCP or UDP socket in the same context.
As examples of full signalling solutions, you can find “SIP over WebSocket”, “JSEP over WebSocket”, as well as (a subset of) RTMP, or the older “SIP over TCP/UDP”, ….
All media streaming protocols assume that an encoded version of the media to be streamed is available. This is called a media “bitstream”. Most media streaming protocols are codec-specific or support only a limited list of codecs.
Real-time streaming protocols like WebRTC will take the frames directly from a video capturer (webcam, screen, ..) and encode them on-the-fly to avoid extra latency.
Moreover, bitrate adaptation, needed to compensate for bandwidth fluctuations or poor network quality (jitter and packet loss), is done on-the-fly by adapting the encoder settings from frame to frame, depending on feedback from the media streaming protocol when available (RTP/RTCP) and on bandwidth estimation (REMB, Transport-CC, TMMBR).
When such feedback is not available (RTMP), the available bandwidth needs to remain above a certain threshold for streaming to work.
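As a toy illustration of frame-to-frame adaptation, here is a sender-side controller sketch driven by loss feedback. The thresholds, multipliers, and bounds are made up for the example; real congestion controllers (e.g. the one in WebRTC stacks) are far more sophisticated:

```python
def adapt_bitrate(current_kbps: int, loss_fraction: float) -> int:
    """Toy frame-to-frame encoder target adaptation from loss feedback."""
    if loss_fraction > 0.10:                   # heavy loss: back off fast
        return max(100, int(current_kbps * 0.7))
    if loss_fraction < 0.02:                   # clean network: probe up slowly
        return min(4000, int(current_kbps * 1.05))
    return current_kbps                        # moderate loss: hold steady

# Simulate a few feedback reports: the encoder target tracks the network.
bitrate = 1000
for loss in (0.0, 0.0, 0.15, 0.15, 0.01):
    bitrate = adapt_bitrate(bitrate, loss)
```

Because the new target applies from the very next encoded frame, adaptation happens within milliseconds rather than waiting for a whole pre-encoded segment to drain.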
Media streaming protocol
The media streaming protocol defines how the media is cut into smaller chunks to be handed over to the media transport protocol.
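A minimal sketch of that chunking step, assuming RTP-like behaviour: the bitstream is split into payloads that fit under the network MTU, each tagged with a sequence number (the sizes and the tuple layout here are illustrative, not any real packet header):

```python
# 1200 bytes of payload is a common choice under a 1500-byte Ethernet MTU,
# leaving room for IP/UDP and protocol headers.
MTU_PAYLOAD = 1200

def packetize(bitstream: bytes, payload_size: int = MTU_PAYLOAD):
    """Cut an encoded bitstream into (sequence_number, payload) chunks."""
    return [
        (seq, bitstream[i : i + payload_size])
        for seq, i in enumerate(range(0, len(bitstream), payload_size))
    ]

# A 3000-byte encoded frame becomes three packets: 1200 + 1200 + 600.
packets = packetize(b"\x00" * 3000)
```

The sequence numbers are what make reordering and loss detection possible on the receiver side, which the next paragraph relies on.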
Moreover, reliable protocols add mechanisms to compensate for poor network quality (jitter or packet loss). Jitter is usually dealt with on the receiver side by adding a small buffer to reorder the packets. Packet loss is dealt with in real time by retransmission (RTX), redundancy (RED), or Forward Error Correction (FEC).
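The reordering part of a receive-side jitter buffer can be sketched in a few lines. This toy version only shows reordering by sequence number; real jitter buffers also time the playout and interact with loss recovery:

```python
import heapq

class JitterBuffer:
    """Toy jitter buffer: hold a few packets, release them in seq order."""

    def __init__(self, depth: int = 3):
        self.depth = depth
        self.heap = []  # (seq, payload) pairs, kept ordered by seq

    def push(self, seq: int, payload: bytes):
        heapq.heappush(self.heap, (seq, payload))
        # Once the buffer is "full enough", release the oldest packet.
        if len(self.heap) > self.depth:
            return heapq.heappop(self.heap)
        return None

# Packets arrive out of order (2 before 1) but leave in order.
buf = JitterBuffer(depth=2)
out = [p for p in (buf.push(s, b"") for s in (2, 1, 3, 4)) if p]
```

The cost of this smoothing is exactly the buffer depth: a deeper buffer tolerates more jitter but adds latency, which is why real-time protocols keep it small.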
Media transport protocol
Once the media bitstream has been cut into smaller chunks, they need to be transported using a transport protocol, not unlike the signalling before it, just over a different path. The transport protocols are more or less the same as for signalling, but since the load is quite different, some are more appropriate than others.
As a transport protocol, TCP is reliable while UDP is not. However, TCP’s reliability comes at a cost in terms of both bandwidth and latency. Since most media streaming protocols already include a reliability mechanism (RTX/RED/FEC), UDP is in theory a much better choice. In practice, the UDP ports might be blocked.
A port also needs to be chosen to attach the socket to. Finding an open port-and-protocol pair can be tedious, and in older media streaming protocols it is hardcoded. Newer streaming protocols like WebRTC use Interactive Connectivity Establishment (ICE) to automatically and dynamically choose which port and which transport protocol to use.
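The idea behind ICE can be sketched as trying candidate (protocol, port) pairs in priority order until one connects. The candidates, priorities, and the `can_connect` oracle below are hypothetical stand-ins for real connectivity checks:

```python
# Hypothetical candidates: (protocol, port, priority). Real ICE computes
# priorities from candidate type (host, server-reflexive, relayed).
candidates = [
    ("udp", 50000, 110),  # host UDP candidate, highest priority
    ("udp", 3478, 100),   # server-reflexive (STUN) candidate
    ("tcp", 443, 50),     # TCP fallback, e.g. through a relay
]

def pick_transport(candidates, can_connect):
    """Try candidates from highest to lowest priority; keep the first that works."""
    for proto, port, _prio in sorted(candidates, key=lambda c: -c[2]):
        if can_connect(proto, port):
            return proto, port
    return None

# Pretend a firewall blocks all UDP: the selection falls back to TCP on 443.
chosen = pick_transport(candidates, lambda proto, port: proto == "tcp")
```

This is what replaces the hardcoded port of older protocols: the same session can end up on UDP when the network allows it, or degrade to TCP when it does not.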
QUIC is a new transport protocol being discussed within the IETF standards committee. It is built on top of UDP, and has several other advantages in terms of both speed and reliability over both TCP and raw UDP.
Media streaming protocols like MPEG-DASH or HLS use HTTP as their media transport protocol and should see possible improvements coming from the HTTP/2 standard in the making.
Some media streaming engines encrypt their data for added protection. Either the media itself is encrypted (at the codec or payload level), or the chunks are encrypted (SRTP). The exchange of the encryption keys has its own protocols, the two most often met being SDES (VoIP) and DTLS (WebRTC). The latter has the advantage over the former that the exchange of the key itself is always secure.
Points that confused me
WebSocket and QUIC being “pure” transport protocols (agnostic to whether they transport media or not), it is surprising to see them put at the same level as WebRTC, Flash, or HLS, which focus solely on media streaming. One would need, and I can only assume that this is what Wowza does, to handle the encoding and the chunking separately before using WebSocket or QUIC directly. Note that WebRTC (libwebrtc/Chrome) and ORTC have an implementation of their stack using QUIC as a transport.
Equally surprising is the lack of mention of HTTP/2 as an optimisation for HTTP-based protocols like HLS or MPEG-DASH. CMAF seems to be a file format that could be used by HLS and MPEG-DASH, not a replacement for them.
Finally, SRT is also only a transport protocol. While it seems to bring to the table things that were missing from file-based protocols like HLS and MPEG-DASH, those added features seem to be already present in RTMP or WebRTC. SRT appears to assume a separately encoded bitstream, which removes the opportunity to couple the network feedback (packet loss, jitter) to the encoder. While network reliability seems to be addressed at the packet level, bandwidth fluctuations that would require encoder bitrate adaptation likely are not.
Note: Bandwidth fluctuations are addressed in file-based protocols like HLS through multiple parallel encodings, which adds latency, while they are addressed by on-demand encoder setting modifications in real-time protocols like WebRTC for minimal latency. In HLS, you need to be done reading a chunk (which can be 10s long) before you can change resolution and adapt. In WebRTC, you only need to be done with one frame (33ms at 30fps) before you can change the encoder settings, greatly reducing the adaptation time.
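The back-of-the-envelope arithmetic behind that note, using the example numbers above (a classic 10s HLS segment, 30fps video), looks like this:

```python
# Adaptation cannot happen faster than the unit you must finish consuming:
# one whole segment for HLS, one frame for WebRTC. Numbers are the
# illustrative ones from the note, not universal constants.
hls_segment_s = 10.0          # classic HLS segment duration, in seconds
webrtc_frame_s = 1.0 / 30.0   # one frame at 30 fps, ~33 ms

ratio = hls_segment_s / webrtc_frame_s  # how much sooner WebRTC can react
```

With these numbers the real-time protocol can react roughly 300 times sooner; shorter HLS segments shrink the gap, but each segment still has to finish before a switch.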
I certainly made a lot of mistakes and approximations in this post. I’ll put that on the jet lag 🙂 Please correct me wherever I have been inaccurate.