P2P, SFU and MCU - WebRTC Architectures Explained

Written by Digital Samba | November 14, 2022

In this article, we’re going to examine the details of how WebRTC architecture actually works so that a layperson can understand it.

WebRTC is an open-source project that connects devices through peer-to-peer, interactive web applications. If you’ve had an in-browser video call or played a real-time game in a web browser, WebRTC was probably the technology behind it.

Table of contents

Architecture options
P2P (Peer to peer/mesh)
SFU (Selective Forwarding Unit)
MCU (Multipoint Control Unit)
Hybrid architectures

Architecture options

At the core of every video conferencing solution sits the architecture for sending and receiving the participants’ video/audio streams. For example, if there are N participants in a video conference, each of them needs to see/hear the video/audio of all the other N-1 participants.

This can be implemented in different ways, but there are three main architectures which are used in practice:

P2P (Peer to peer/mesh),
SFU (Selective Forwarding Unit),
MCU (Multipoint Control Unit).

A hybrid approach is also possible - to use different kinds of architecture depending on the number of participants in the conference. That is more of an optimization and will be covered at the end of the article.

P2P (Peer to peer/mesh)

Peer-to-peer (P2P) is an application architecture that is also occasionally referred to as mesh architecture. It is the simplest of the three designs and is straightforward to conceptualise. In the context of a conference, each participant is a peer that sends their video and audio directly to every other peer over a dedicated peer connection.

Below is a peer-to-peer architecture diagram illustrating P2P with four participants:

Because there are no intermediate media servers, privacy comes built in: the streams are end-to-end encrypted. However, P2P has a significant limitation: it does not use upload bandwidth efficiently.

For example, if there are N participants in the call, each participant needs to establish N-1 peer connections and send their video/audio N-1 times. Across the whole call that adds up to N*(N-1) outgoing streams, carried over N*(N-1)/2 bidirectional peer connections.
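
To make the growth concrete, here is a small sketch of that arithmetic. The function name `mesh_counts` is made up for illustration. Counting each bidirectional connection once gives N*(N-1)/2 connections, while counting each direction separately gives N*(N-1) one-way streams.

```python
def mesh_counts(n: int) -> dict:
    """Connection and stream counts for a full-mesh (P2P) call with n participants."""
    if n < 1:
        raise ValueError("n must be at least 1")
    per_peer = n - 1  # connections (and uploads) each participant maintains
    return {
        "connections_per_peer": per_peer,
        "total_connections": n * per_peer // 2,  # each bidirectional link counted once
        "total_one_way_streams": n * per_peer,   # each direction counted separately
    }


if __name__ == "__main__":
    for n in (2, 4, 10):
        print(n, mesh_counts(n))
```

With 4 participants this gives 3 connections per peer, 6 connections in total and 12 one-way streams. With 10 participants it is already 9, 45 and 90.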

Upload capacity is often the bottleneck: many homes have asymmetric internet connections - e.g. ADSL (Asymmetric Digital Subscriber Line), where the upload speed is severely limited compared to the download speed. And even with a good upload speed, there can still be an issue in an office setting where many people share the same internet connection.
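
As a rough illustration of how quickly an uplink fills up in a mesh call, the sketch below estimates the upload a single peer needs and the largest call that fits a given uplink. The function names and the bitrate figures are hypothetical. Real calls also carry audio and protocol overhead, and browsers adapt the bitrate, so treat the result as an upper bound rather than a prediction.

```python
def uplink_needed_kbps(participants: int, stream_kbps: float) -> float:
    """Upload bandwidth one peer needs in a mesh call: one copy per other participant."""
    if participants < 1:
        raise ValueError("participants must be at least 1")
    return (participants - 1) * stream_kbps


def max_mesh_participants(uplink_kbps: float, stream_kbps: float) -> int:
    """Largest mesh call a peer can join before its uplink is saturated."""
    if uplink_kbps <= 0 or stream_kbps <= 0:
        raise ValueError("bandwidth values must be positive")
    copies = int(uplink_kbps // stream_kbps)  # how many outgoing copies fit in the uplink
    return copies + 1                         # the copies go to everyone except the sender


if __name__ == "__main__":
    # Assumed figures: a 1000 kbps uplink and a 500 kbps video stream.
    print(uplink_needed_kbps(4, 500))
    print(max_mesh_participants(1000, 500))
```

Under these assumed numbers, a 4-person call needs 1500 kbps of upload per peer, while a 1000 kbps uplink only has room for a 3-person call.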

In practice, P2P (peer-to-peer) architecture makes sense mostly for 1-1 calls, where only 2 people participate in the conference. In that scenario, P2P is optimal because each of the 2 participants sends their video/audio only once.

Advantages:

Privacy is easily achieved. All video/audio streams are E2EE (end-to-end encrypted) by default since there is no intermediate infrastructure which could spy on the video/audio.
Complexity and hosting costs are low too since there are no intermediate media servers to host/implement and care about.

Disadvantages:

Upload bandwidth is not used wisely and can be easily saturated even with a small number of participants in the conference.
CPU (Central Processing Unit) usage will be significantly higher on the client side because the browser needs to encode the video N-1 times to send it to N-1 other participants. Unless you have a really powerful machine, the performance will be easily affected.

The above disadvantages mean that the P2P architecture is reliable mostly for 1-1 calls and does not scale. In practice, while P2P architecture works well for small-scale sessions, a central media server (an SFU or an MCU) is preferred when more participants are involved, since it offers centralised management of streams.

SFU (Selective Forwarding Unit)

This architecture has become the preferred option in contemporary video conferencing solutions. Central SFU (Selective Forwarding Unit) media servers act as intermediaries, receiving the incoming streams and then distributing them unaltered to the other participants.

Although this approach introduces additional complexity to the server side, it is a significant enhancement over P2P architecture. It addresses the issue of limited upload bandwidth and improves scalability, which are notable challenges with P2P.

The technique of simulcast is frequently employed in SFU video conferencing. Each participant transmits multiple streams at varying qualities to the SFU unit. The SFU then selects the appropriate stream quality to forward; for instance, it may send streams of lower quality to participants with weaker internet connections. Conversely, it can route the high-definition version of a stream to those who are displaying it prominently on their local system.

That way a large amount of downlink bandwidth can be saved, and many participants can be displayed in the same grid even if they have an average internet connection. The SFU server also keeps its own load low, because it only forwards streams without decoding them.
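
The selection step can be sketched as follows. This is a simplified illustration and not how any particular SFU implements it: the three-layer ladder, its bitrates and the `pick_layer` rule are all assumptions. The idea is to choose the highest layer that fits the receiver's bandwidth budget without exceeding what the receiver's on-screen tile actually needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    height: int        # vertical resolution in pixels
    bitrate_kbps: int  # approximate bitrate of this simulcast layer


# Hypothetical simulcast ladder published by each participant.
LAYERS = [
    Layer("low", 180, 150),
    Layer("mid", 360, 500),
    Layer("high", 720, 1500),
]


def pick_layer(layers, budget_kbps: int, tile_height: int) -> Layer:
    """Pick the layer to forward to one receiver for one sender's video."""
    ordered = sorted(layers, key=lambda layer: layer.bitrate_kbps)
    choice = ordered[0]  # always fall back to the lowest layer
    for layer in ordered[1:]:
        fits_budget = layer.bitrate_kbps <= budget_kbps
        already_sharp_enough = choice.height >= tile_height
        if fits_budget and not already_sharp_enough:
            choice = layer  # step up one layer
        else:
            break
    return choice


if __name__ == "__main__":
    print(pick_layer(LAYERS, budget_kbps=2000, tile_height=720).name)  # large tile, good link
    print(pick_layer(LAYERS, budget_kbps=600, tile_height=720).name)   # large tile, weak link
    print(pick_layer(LAYERS, budget_kbps=2000, tile_height=180).name)  # small thumbnail
```

With this ladder the three calls select "high", "mid" and "low" respectively: the weak link is capped by bandwidth, and the thumbnail is capped by its display size even though bandwidth is plentiful.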

In SFU video conferencing, as illustrated in the above diagram, each participant sends their stream to the SFU media server a single time and, in turn, receives the streams of all other participants.

Advantages:

Participants publish exactly once to the SFU server. This makes SFU an upload bandwidth-friendly architecture, unlike the P2P (mesh).
Much more scalable than the P2P architecture. Simulcast can be used, where different stream versions are routed depending on each participant’s available CPU and bandwidth.
Flexible layout. Participants can decide which streams they want to receive, where to display them and in what quality.

Disadvantages:

There are intermediate media servers which increase costs/complexity on the server side.
E2EE (full privacy) is not achieved by default, since the intermediate media server has access to the raw stream bytes when forwarding. This is a disadvantage compared to the P2P architecture but can be mitigated by encrypting the stream bytes with a custom key before sending them to the SFU (Selective Forwarding Unit) media server. Of course, that means custom decryption would be needed on the receiving side.
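
The mitigation described above can be sketched in a few lines. This is a toy model only: the keystream below (HMAC-SHA256 in counter mode) is a stand-in chosen so the example runs with the standard library, it provides no authentication, and it is not suitable for real use, where a vetted authenticated cipher such as AES-GCM would be the usual choice. All function names are hypothetical. The point it illustrates is that the SFU relays bytes it cannot read, because only the participants hold the key.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (HMAC-SHA256 in counter mode). Illustration only, not production crypto."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])


def encrypt_frame(key: bytes, payload: bytes) -> bytes:
    """Sender side: encrypt an encoded media frame before it leaves for the SFU."""
    nonce = os.urandom(12)  # fresh nonce per frame, sent in the clear
    stream = _keystream(key, nonce, len(payload))
    return nonce + bytes(a ^ b for a, b in zip(payload, stream))


def decrypt_frame(key: bytes, frame: bytes) -> bytes:
    """Receiver side: custom decryption with the shared key."""
    nonce, body = frame[:12], frame[12:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


def sfu_forward(frame: bytes) -> bytes:
    """The SFU relays the bytes unchanged. It never holds the key."""
    return frame


if __name__ == "__main__":
    shared_key = os.urandom(32)  # assumed to be exchanged between participants out of band
    sent = encrypt_frame(shared_key, b"encoded video frame")
    print(decrypt_frame(shared_key, sfu_forward(sent)))
```

How the participants agree on the shared key without the server learning it is the hard part of any such scheme and is outside the scope of this sketch.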

SFU (Selective Forwarding Unit) is the most popular architecture deployed today in video conferencing.

SFU uses upload bandwidth far more efficiently than P2P and scales much better.

Also, while users still need to download and decode each of the other participants’ streams, the simulcast technique can be applied to allow a display of up to circa 50 participants in a grid on an average connection and machine.

MCU (Multipoint Control Unit)

In the MCU (Multipoint Control Unit) architecture, every participant publishes their stream only once, to a central server. But unlike SFU, the MCU (Multipoint Control Unit) central server has the role of a mixer - combining all received streams into one stream.

Then all participants consume this one mixed stream instead of subscribing individually to the stream of every other participant.
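
One part of the mixing job is deciding where each participant's video goes in the combined picture. The sketch below computes a simple near-square grid. It is only an illustration of the layout step: the function name, the 1280x720 canvas and the grid rule are assumptions, and a real MCU also has to decode, scale, composite and re-encode the video, which is where most of its cost comes from.

```python
import math


def grid_layout(n: int, width: int = 1280, height: int = 720) -> list:
    """Return one (x, y, w, h) tile per stream, arranged in a near-square grid."""
    if n < 1:
        raise ValueError("n must be at least 1")
    cols = math.ceil(math.sqrt(n))  # as square as possible
    rows = math.ceil(n / cols)
    tile_w, tile_h = width // cols, height // rows
    return [
        ((i % cols) * tile_w, (i // cols) * tile_h, tile_w, tile_h)
        for i in range(n)
    ]


if __name__ == "__main__":
    for tile in grid_layout(4):
        print(tile)
```

For 4 participants on a 1280x720 canvas this yields four 640x360 tiles in a 2x2 grid. Every participant then receives the same composed picture.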

Advantages:

Every participant subscribes to only one stream - the combined layout of all other participants. This is very CPU and bandwidth-efficient on the client side (even more efficient than SFU) since only one stream is being decoded by the browser.
Since only one stream is consumed on the client side, it is easy to reason about the architecture and integrate/debug it in the front end. On the other hand, complexity is mainly being moved to the backend.

Disadvantages:

The layout is generally not flexible - the central server determines a fixed layout that all participants see. For example, a participant cannot reorder the streams or maximise another participant’s stream in high quality.
CPU usage and complexity on the server side are much higher compared to SFU due to the mixing of all streams. Scaling a room is mostly vertical - upgrading the CPU to handle the mixing of more and more participants. And of course, vertical scaling has its downsides, because it is hard and expensive to find more and more powerful and reliable machines.
Mixing all streams into one result stream introduces a slightly larger delay compared to SFU, since SFU only relays streams. Also, if there is an error in the MCU layout, everyone will be affected by it.

While MCU (Multipoint Control Unit) is the best architecture if the only concern is client-side resource usage, in practice MCU loses to SFU because it is at least 10 times more expensive to deploy it on the server side.

The decoding/encoding and mixing are much more taxing than just routing/relaying streams like SFU. And since companies generally cannot afford to spend at least 10 times more money on the server side, SFU is the reasonable compromise which wins in most cases.
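
The client-side trade-offs of the three architectures can be summarised with a small sketch. The function name is made up, and the counts are the simplified per-participant figures used in this article. They ignore simulcast, where an SFU publisher sends several encodings of its one stream.

```python
def client_load(arch: str, n: int) -> dict:
    """Simplified per-participant stream counts for a call with n participants."""
    if n < 2:
        raise ValueError("a call needs at least 2 participants")
    others = n - 1
    if arch == "p2p":
        # One copy sent to, and one stream received from, every other peer.
        return {"uploads": others, "downloads": others, "decodes": others}
    if arch == "sfu":
        # Publish once, receive every other participant individually.
        return {"uploads": 1, "downloads": others, "decodes": others}
    if arch == "mcu":
        # Publish once, receive a single mixed stream.
        return {"uploads": 1, "downloads": 1, "decodes": 1}
    raise ValueError(f"unknown architecture: {arch}")


if __name__ == "__main__":
    for arch in ("p2p", "sfu", "mcu"):
        print(arch, client_load(arch, 10))
```

For a 10-person call, P2P needs 9 uploads per client, SFU needs 1 upload but still 9 decodes, and MCU needs 1 of each. What the sketch does not show is the server side, where the MCU pays for that client-side saving.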

Hybrid architectures

In the hybrid approach, different architectures are used depending on the number of participants in the call. Initially, P2P (Peer-to-Peer) is used for 1-1 calls, and as more participants join the call, the architecture switches to SFU (Selective Forwarding Unit) to accommodate the growing number of participants.

This approach helps optimise server resources, particularly during smaller 1-1 calls, where no intermediate media servers are required. Using P2P for smaller calls helps to save bandwidth and processing power. As soon as a third participant joins the call, the system transitions to SFU to handle the increased load efficiently.
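
The switching rule itself is simple and can be sketched as below. The function names are hypothetical, and falling back to P2P when the call shrinks to two participants again is an assumption, since a real system might prefer to stay on the SFU to avoid a second switch. The hard part in practice is renegotiating the media connections during a live call, which this sketch does not cover.

```python
def choose_topology(participants: int) -> str:
    """P2P for 1-1 calls, SFU as soon as a third participant joins."""
    if participants < 1:
        raise ValueError("participants must be at least 1")
    return "p2p" if participants <= 2 else "sfu"


def on_participant_change(current: str, participants: int) -> tuple:
    """Return (target topology, whether a switch is needed) after a join or leave."""
    target = choose_topology(participants)
    return target, target != current


if __name__ == "__main__":
    topology = "p2p"
    for count in (2, 3, 4, 2):
        topology, switched = on_participant_change(topology, count)
        print(count, topology, "switch" if switched else "no change")
```

Running the example walks through a call that grows from 2 to 4 participants and shrinks back to 2: it switches to SFU when the third participant joins and back to P2P at the end.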

This hybrid approach can be visualised through a WebRTC architecture diagram, which clearly shows the transition from P2P to SFU as the participant count grows. The diagram illustrates how the architecture evolves from direct peer connections in P2P to the use of an SFU media server, forwarding streams to multiple participants.

Advantages:

Optimised server resources: Using P2P for 1-1 calls saves server bandwidth, as there are no intermediate servers.
Scalable for larger calls: As participants increase, the system automatically shifts to SFU, which efficiently handles the growing number of streams.
Cost-effective: The hybrid architecture offers a good balance of performance and cost, using P2P for small calls and SFU for larger ones.

Disadvantages:

Increased complexity: Implementing a hybrid architecture requires managing multiple systems and smooth transitions between them, which can add complexity to the backend.
Maintenance: Maintaining a hybrid architecture can increase code complexity and maintenance costs, particularly as you have to ensure that the transition between P2P and SFU (and possibly MCU) is seamless during a live call.

See a modern WebRTC application in action

In this article, we have explored the different architecture options that drive WebRTC technology and enable seamless video conferencing experiences. Now, let's take a closer look at how Digital Samba leverages WebRTC on the back end to provide a cutting-edge live video conferencing solution.

Digital Samba is a leading provider of GDPR-compliant video conferencing API and SDK, offering a comprehensive platform for embedding video conferencing capabilities into software products or websites. Our solution is powered by WebRTC, an open-source project that facilitates peer-to-peer interactive web applications.

By integrating Digital Samba Embedded into your platform, you can unlock the power of WebRTC and provide your users with high-quality, real-time video communication. Our solution is designed to be GDPR-compliant, ensuring the privacy and security of user data. With our EU-hosted infrastructure and end-to-end encryption, you can trust that sensitive information shared during video conferences is protected.

Whether you're building a remote collaboration tool, an online tutoring platform, or a virtual classroom, Digital Samba's video conferencing solution enables seamless communication and collaboration among participants. The WebRTC architecture allows for direct peer-to-peer connections, reducing latency and ensuring a smooth video conferencing experience.

With Digital Samba, video conferencing is powered entirely by an SFU (Selective Forwarding Unit) architecture, designed for scalability, stability, and bandwidth efficiency. Unlike P2P or MCU-based systems, our SFU-based approach ensures consistent performance across all call sizes, from one-on-one meetings to large group sessions. This setup minimises latency, optimises bandwidth usage, and provides a reliable foundation for embedding video into any application.

Digital Samba's WebRTC-powered live video conferencing also supports advanced features such as screen sharing, file sharing, interactive whiteboarding, and more. These features enhance collaboration and enable interactive learning experiences for virtual classrooms, remote training sessions, and online meetings.

Experience the power of Digital Samba's WebRTC-powered live video conferencing solution. Contact our sales team today to learn more and get started on enhancing your video conferencing platform.

Common questions about WebRTC architectures

1. What is the difference between P2P and SFU in WebRTC?

P2P (Peer-to-Peer) directly connects participants without using a central server, which is suitable for small calls but can strain bandwidth and processing power as the number of participants grows. SFU (Selective Forwarding Unit) routes media streams through a server, optimising bandwidth and scalability, making it more suitable for larger calls.

2. How does SFU improve video conferencing scalability?

SFU improves scalability by reducing the amount of data participants need to send. Each participant only sends their media stream to the SFU server once, and the server forwards the stream to all other participants, reducing upload bandwidth and CPU usage on the client side.

3. When should I use MCU in WebRTC?

MCU (Multipoint Control Unit) is used when there’s a need to mix all participants' streams into a single stream for each participant. This can be useful when you need a consistent layout and minimal client-side processing, but it’s less scalable and more resource-intensive than SFU.

4. What are the main advantages of using WebRTC for video conferencing?

WebRTC enables real-time communication directly between browsers without the need for plugins. It’s cost-effective, supports high-quality video and audio, and encrypts all media in transit, with end-to-end encryption by default in P2P calls. Additionally, WebRTC is widely supported across platforms and browsers.

5. Can WebRTC handle large-scale video conferences?

While WebRTC can handle large conferences, it becomes less efficient with more participants due to the increased strain on network bandwidth and processing power. SFU and MCU architectures help manage larger calls more efficiently, but the scalability depends on the infrastructure used.
