In this article, we’re going to examine the details of how WebRTC architecture actually works so that a layperson can understand it.
WebRTC is an open-source project that connects devices through peer-to-peer, interactive web apps. If you’ve had an in-browser video call or played a real-time game in a web browser, WebRTC was probably the technology behind it.
Table of contents
- Architecture options
- P2P (Peer to peer/mesh)
- SFU (Selective Forwarding Unit)
- MCU (Multipoint Control Unit)
At the core of every video conferencing solution sits the architecture for sending and receiving the participants’ video/audio streams. For example, if there are N participants in a video conference, each of them needs to see/hear the video/audio of all other N-1 participants.
This can be implemented in different ways, but there are three main architectures which are used in practice:
- P2P (Peer to peer/mesh),
- SFU (Selective Forwarding Unit),
- MCU (Multipoint Control Unit).
A hybrid approach is also possible - to use different kinds of architecture depending on the number of participants in the conference. That is more of an optimization and will be covered at the end of the article.
P2P (Peer to peer)
Peer-to-peer (P2P), occasionally referred to as mesh architecture, is the most basic of the three designs and is straightforward to conceptualise. In a conference, each individual is a peer that sends their video and audio directly to every other peer over a dedicated peer connection.
Below is a peer-to-peer architecture diagram illustrating P2P with four participants:
In the absence of intermediate media servers, privacy, facilitated through end-to-end encryption, is inherently present. However, while this seems advantageous, there is a significant limitation with P2P: it does not utilise upload bandwidth efficiently.
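For example, if there are N participants in the call, each participant needs to establish N-1 peer connections and send their video/audio N-1 times. Across the whole call that adds up to N*(N-1) outgoing streams, carried over N*(N-1)/2 peer connections (each connection is shared by the two peers it links).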
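The arithmetic can be checked with a few lines of Python. This is only a counting sketch: the function name `mesh_load` and the dictionary keys are made up for illustration. It counts each connection once, since a connection is shared by its two endpoints, which gives N*(N-1)/2 connections, while the number of outgoing streams across the call is N*(N-1).

```python
def mesh_load(n: int) -> dict:
    """Connection and stream counts for a full-mesh call with n participants."""
    if n < 1:
        raise ValueError("a call needs at least one participant")
    per_peer = n - 1  # every peer talks to every other peer
    return {
        "connections_per_peer": per_peer,
        "uploads_per_peer": per_peer,
        # Each connection links two peers, so count it once.
        "total_connections": n * per_peer // 2,
        # Every peer uploads one copy of its stream per remote peer.
        "total_outgoing_streams": n * per_peer,
    }


for n in (2, 4, 8):
    print(n, mesh_load(n))
```

Going from 4 to 8 participants more than quadruples the number of outgoing streams, which is why mesh calls stop scaling so quickly.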
Still, many homes have asymmetrical internet connections - e.g. ADSL (Asymmetric Digital Subscriber Line), where the upload speed is severely limited compared to the download speed. And even if you have a good upload speed, there will still be an issue in an office setting where many people are sharing the same internet connection.
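To get a feel for how quickly a limited uplink is saturated, here is a rough estimate. The figures in the usage lines (a 1,000 kbps uplink and 500 kbps per video stream) are illustrative assumptions, not measurements, and the function `max_mesh_participants` is a hypothetical helper. It also ignores audio, protocol overhead and other traffic on the same connection.

```python
def max_mesh_participants(uplink_kbps: int, stream_kbps: int) -> int:
    """Largest mesh call size whose (n - 1) outgoing copies fit in the uplink."""
    if uplink_kbps < 0 or stream_kbps <= 0:
        raise ValueError("uplink must be >= 0 and stream bitrate must be > 0")
    copies_that_fit = uplink_kbps // stream_kbps
    return copies_that_fit + 1  # the sender plus one peer per copy


# Assumed numbers: an ADSL-like 1,000 kbps uplink, 500 kbps per video stream.
print(max_mesh_participants(1000, 500))
# The same stream on an assumed 10,000 kbps uplink.
print(max_mesh_participants(10000, 500))
```

Under these assumptions the slower uplink is already full with three people in the call.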
In reality, P2P (peer-to-peer) architecture makes sense mostly for 1-1 calls where 2 people participate in the conference. In that scenario, P2P is still optimal because each of the 2 participants sends their video/audio only once.
Advantages:
- Privacy is easily achieved. All video/audio streams are E2EE (end-to-end encrypted) by default since there is no intermediate infrastructure which could spy on the video/audio.
- Complexity and hosting costs are low too since there are no intermediate media servers to host/implement and care about.
Disadvantages:
- Upload bandwidth is not used wisely and can be easily saturated even with a small number of participants in the conference.
- CPU (Central Processing Unit) usage will be significantly higher on the client side because the browser needs to encode the video N-1 times to send it to N-1 other participants. Unless you have a really powerful machine, the performance will be easily affected.
The above disadvantages mean that the P2P architecture does not scale and is suitable mostly for 1-1 calls. In practice, you won’t often see a video conferencing provider using P2P architecture if there are more than 3 participants in the conference.
SFU (Selective Forwarding Unit)
This architecture has become the preferred option in contemporary video conferencing solutions. Central SFU (Selective Forwarding Unit) media servers act as intermediaries, receiving the incoming streams and then distributing them unaltered to the other participants.
Although this approach introduces additional complexity to the server side, it is a significant enhancement over P2P architecture. It addresses the issue of limited upload bandwidth and improves scalability, which are notable challenges with P2P.
The technique of simulcast is frequently employed in SFU video conferencing. Each participant transmits multiple streams at varying qualities to the SFU. The SFU then selects the appropriate stream quality to forward to each receiver; for instance, it may send streams of lower quality to participants with weaker internet connections. Conversely, it can route the high-definition version of a stream to those who are displaying it prominently on their local system.
That way a large amount of downlink bandwidth can be saved and many participants can be displayed in the same grid even if participants have an average internet connection.
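The selection step can be sketched as a small function. This is a simplified illustration, not how any particular SFU implements it: the three-layer ladder, its resolutions and bitrates, and the names `Layer`, `LAYERS` and `pick_layer` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    height: int        # vertical resolution in pixels
    bitrate_kbps: int  # approximate bitrate of this encoding


# Hypothetical simulcast ladder, ordered from lowest to highest quality.
LAYERS = (
    Layer("low", 180, 150),
    Layer("mid", 360, 500),
    Layer("high", 720, 1500),
)


def pick_layer(budget_kbps: int, tile_height: int, layers=LAYERS) -> Layer:
    """Choose which encoding of one publisher to forward to one subscriber.

    budget_kbps: downlink the subscriber can spend on this stream.
    tile_height: height in pixels of the tile the stream is displayed in.
    """
    chosen = layers[0]  # always forward at least the lowest layer
    for layer in layers[1:]:
        fits_budget = layer.bitrate_kbps <= budget_kbps
        # Only step up while the current choice is smaller than the tile.
        needs_more_pixels = chosen.height < tile_height
        if fits_budget and needs_more_pixels:
            chosen = layer
    return chosen


print(pick_layer(budget_kbps=2000, tile_height=720).name)  # large tile, good link
print(pick_layer(budget_kbps=2000, tile_height=180).name)  # small grid tile
print(pick_layer(budget_kbps=300, tile_height=720).name)   # weak connection
```

The same publisher can therefore be forwarded in high quality to one subscriber and in low quality to another, without the SFU re-encoding anything.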
In SFU video conferencing, as illustrated in the above diagram, each participant sends their stream to the SFU media server a single time and, in turn, receives the streams of all other participants.
Advantages:
- Participants publish exactly once to the SFU server. This makes SFU an upload bandwidth-friendly architecture, unlike P2P (mesh).
- Much more scalable than the P2P architecture. Simulcast can be used, where different stream versions are routed depending on each participant’s available CPU and bandwidth.
- Flexible layout. Participants can decide which streams they want to receive, where to display them and in what quality.
Disadvantages:
- There are intermediate media servers, which increase costs/complexity on the server side.
- E2EE (full privacy) is not achieved by default, since the intermediate media server has access to the raw stream bytes when forwarding. This is a disadvantage compared to the P2P architecture but can be mitigated by encrypting the stream bytes with a custom key before sending them to the SFU (Selective Forwarding Unit) media server. Of course, that means custom decryption would be needed on the receiving side.
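The bandwidth trade-off behind these points can be put into numbers. The sketch below assumes a single 500 kbps stream per participant and no simulcast; both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def client_upload_kbps(n: int, stream_kbps: int, architecture: str) -> int:
    """Upload bandwidth one participant needs in a call of n people."""
    if n < 2:
        raise ValueError("a call needs at least two participants")
    if architecture == "p2p":
        return (n - 1) * stream_kbps  # one copy per remote peer
    if architecture == "sfu":
        return stream_kbps            # published once to the server
    raise ValueError(f"unknown architecture: {architecture}")


def sfu_server_egress_kbps(n: int, stream_kbps: int) -> int:
    """Bandwidth the SFU sends out: each of n streams goes to n - 1 receivers."""
    if n < 2:
        raise ValueError("a call needs at least two participants")
    return n * (n - 1) * stream_kbps


n, stream_kbps = 6, 500  # assumed call size and per-stream bitrate
print("p2p client upload:", client_upload_kbps(n, stream_kbps, "p2p"))
print("sfu client upload:", client_upload_kbps(n, stream_kbps, "sfu"))
print("sfu server egress:", sfu_server_egress_kbps(n, stream_kbps))
```

The fan-out has not disappeared; it has moved from every participant's uplink to the server's, where bandwidth is usually easier to provision.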
SFU (Selective Forwarding Unit) is the most popular architecture deployed today in video conferencing.
SFU uses upload bandwidth much more efficiently than P2P and scales better.
Also, while users still need to download and decode each of the other participants’ streams, simulcast makes it possible to display up to around 50 participants in a grid on an average connection and machine.
MCU (Multipoint Control Unit)
In the MCU (Multipoint Control Unit) architecture, every participant publishes their stream only once, to a central server. But unlike an SFU, the MCU central server has the role of a mixer - combining all received streams into one stream.
Then all participants consume this one mixed stream instead of subscribing individually to the stream of every other participant.
Advantages:
- Every participant subscribes to only one stream - the combined layout of all other participants. This is very CPU and bandwidth-efficient on the client side (even more efficient than SFU) since only one stream is being decoded by the browser.
- Since only one stream is consumed on the client side, it is easy to reason about the architecture and integrate/debug it in the front end. On the other hand, complexity is mainly being moved to the backend.
Disadvantages:
- The layout is generally not flexible - the central server determines a fixed layout that all participants see. For example, a participant cannot reorder the streams or enlarge another participant’s stream to see it in high quality.
- CPU usage and complexity on the server side are much higher compared to SFU due to the mixing of all streams. Scaling a room is mostly vertical - upgrading the CPU to handle the mixing of more and more participants. And of course, vertical scaling has its downsides, because it is hard and expensive to find more and more powerful and reliable machines.
- Mixing all streams into one result stream introduces a slightly larger delay compared to SFU since SFU only relays streams. Also, if there is an error in the MCU layout, everyone will be affected by it.
The decoding/encoding and mixing are much more taxing than just routing/relaying streams like SFU. And since companies generally cannot afford to spend at least 10 times more money on the server side, SFU is the reasonable compromise which wins in most cases.
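The difference between the three architectures can be summarised by counting streams. The sketch below is a rough model rather than a benchmark: the names `Load` and `load` are made up for illustration, and the MCU row assumes the server produces one shared mixed stream for the whole room.

```python
from typing import NamedTuple


class Load(NamedTuple):
    client_uploads: int    # streams each participant sends
    client_downloads: int  # streams each participant receives and decodes
    server_decodes: int    # streams the media server must decode
    server_encodes: int    # streams the media server must encode


def load(architecture: str, n: int) -> Load:
    """Stream counts for a call with n participants."""
    if n < 2:
        raise ValueError("a call needs at least two participants")
    others = n - 1
    if architecture == "p2p":
        return Load(others, others, 0, 0)  # no media server at all
    if architecture == "sfu":
        return Load(1, others, 0, 0)       # server forwards, never transcodes
    if architecture == "mcu":
        return Load(1, 1, n, 1)            # assumes one shared mixed stream
    raise ValueError(f"unknown architecture: {architecture}")


for arch in ("p2p", "sfu", "mcu"):
    print(arch, load(arch, 10))
```

Only the MCU row has non-zero server decode and encode counts, which is where the extra CPU cost comes from.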
In the hybrid approach, different architectures are used depending on the number of participants in the call. Very often P2P is used for 1-1 calls and the application switches to SFU after a third person enters the call.
That way some server bandwidth/resources are spared during 1-1 calls, which could be a non-negligible saving, since 1-1 calls are pretty popular among people - according to our statistics and research around 50% of the calls are 1-1.
Of course, that percentage can vary depending on the product focus - obviously, products targeted towards large webinars won’t have as many 1-1 calls.
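The switching rule described above can be sketched in a few lines. The threshold of two participants follows the text; the function names are hypothetical, and switching back to P2P when people leave is a design choice made for this example, not something every product does.

```python
def choose_architecture(participants: int, p2p_limit: int = 2) -> str:
    """Use P2P up to p2p_limit participants, SFU beyond that."""
    if participants < 1:
        raise ValueError("a call needs at least one participant")
    return "p2p" if participants <= p2p_limit else "sfu"


def transitions(counts):
    """Yield (participant_count, architecture) each time the choice changes."""
    current = None
    for count in counts:
        architecture = choose_architecture(count)
        if architecture != current:
            current = architecture
            yield count, architecture


# Participant count over the life of one call: people join, then leave.
print(list(transitions([1, 2, 3, 4, 2])))
```

Each yielded transition is a point where a real application would have to renegotiate media mid-call.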
Advantages: Combining several architectures leads to benefiting from the advantages of all architectures depending on the situation. Using P2P for 1-1 calls is saving server resources since there are no intermediate servers.
Disadvantages: Combining several architectures into the same application increases code complexity and maintenance costs. Smoothly transitioning between P2P and another architecture (SFU/MCU) in the middle of a running call is not trivial.
See a modern WebRTC application in action
In this article, we have explored the different architecture options that drive WebRTC technology and enable seamless video conferencing experiences. Now, let's take a closer look at how Digital Samba leverages WebRTC on the back end to provide a cutting-edge live video conferencing solution.
Digital Samba is a leading provider of GDPR-compliant video conferencing API and SDK, offering a comprehensive platform for embedding video conferencing capabilities into software products or websites. Our solution is powered by WebRTC, an open-source project that facilitates peer-to-peer interactive web applications.
By integrating Digital Samba's video conferencing API and SDK into your platform, you can unlock the power of WebRTC and provide your users with high-quality, real-time video communication. Our solution is designed to be GDPR-compliant, ensuring the privacy and security of user data. With our EU-hosted infrastructure and end-to-end encryption, you can trust that sensitive information shared during video conferences is protected.
Whether you're building a remote collaboration tool, an online tutoring platform, or a virtual classroom, Digital Samba's video conferencing solution enables seamless communication and collaboration among participants. The WebRTC architecture allows for direct peer-to-peer connections, reducing latency and ensuring a smooth video conferencing experience.
With Digital Samba, you can leverage the advantages of both P2P and SFU architectures. For 1-1 calls, our solution utilises P2P, optimising server resources and maximising efficiency. As the number of participants increases, the architecture seamlessly transitions to SFU, leveraging the scalability and bandwidth efficiency it offers. This hybrid approach ensures optimal performance and cost-effectiveness for your video conferencing solution.
Digital Samba's WebRTC-powered live video conferencing also supports advanced features such as screen sharing, file sharing, interactive whiteboarding, and more. These features enhance collaboration and enable interactive learning experiences for virtual classrooms, remote training sessions, and online meetings.
Experience the power of Digital Samba's WebRTC-powered live video conferencing solution. Contact our sales team today to learn more and get started on enhancing your video conferencing platform.