P2P, SFU and MCU - WebRTC Architectures Explained

November 14, 2022

In this article, we’re going to examine the details of how WebRTC architecture actually works so that a layperson can understand it.

WebRTC is an open-source project that connects devices for real-time, peer-to-peer communication in web apps. If you’ve had an in-browser video call or played a real-time game through a web browser, WebRTC was probably the technology behind it.

Table of contents

  1. Architecture options
  2. P2P (Peer to peer/mesh)
  3. SFU (Selective Forwarding Unit)
  4. MCU (Multipoint Control Unit)
  5. Hybrid architectures

Architecture options

At the core of every video conferencing solution sits the architecture for sending and receiving the participants’ video/audio streams. For example, if there are N participants in a video conference, each of them needs to see and hear the other N-1 participants.

This can be implemented in different ways, but there are three main architectures which are used in practice:

  • P2P (Peer to peer/mesh),
  • SFU (Selective Forwarding Unit),
  • MCU (Multipoint Control Unit).

A hybrid approach is also possible - to use different kinds of architecture depending on the number of participants in the conference. That is more of an optimization and will be covered at the end of the article.

P2P (Peer to peer)

P2P (peer-to-peer) is also sometimes called mesh architecture. This is the most basic architecture and easy to reason about. Each participant in the conference is a peer and sends their video/audio to all other peers by establishing a peer connection with each of them.

Figure: P2P (peer-to-peer) architecture with 4 participants.

There are no intermediate media servers, so privacy (end-to-end encryption) is achieved by default. While this sounds great, P2P architecture has one very significant drawback - upload bandwidth is not used wisely.

For example, if there are N participants in the call, each participant needs to establish N-1 peer connections and send their video/audio N-1 times. Across the whole call that adds up to N*(N-1)/2 peer connections (each one is shared by two peers) and N*(N-1) outgoing media streams.

This matters because many homes have asymmetric internet connections - e.g. ADSL (Asymmetric Digital Subscriber Line), where the upload speed is severely limited compared to the download speed. And even with a good upload speed, there will still be an issue in an office setting where many people share the same internet connection.
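The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration: the function name `mesh_load` and the 1500 kbps bitrate are assumptions chosen for the example, not measured values. Counting each bidirectional peer connection once gives N*(N-1)/2 connections, while counting one-way media streams gives N*(N-1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshLoad:
    connections_per_peer: int  # peer connections each participant maintains
    total_connections: int     # bidirectional connections in the whole call
    total_upstreams: int       # one-way media streams sent across the call
    upload_kbps_per_peer: int  # upload bandwidth each participant needs


def mesh_load(participants: int, stream_kbps: int) -> MeshLoad:
    """Estimate the cost of a full-mesh (P2P) call.

    stream_kbps is an assumed bitrate for one outgoing audio/video stream.
    """
    if participants < 2:
        raise ValueError("a call needs at least 2 participants")
    per_peer = participants - 1
    return MeshLoad(
        connections_per_peer=per_peer,
        total_connections=participants * per_peer // 2,
        total_upstreams=participants * per_peer,
        upload_kbps_per_peer=per_peer * stream_kbps,
    )


if __name__ == "__main__":
    # Assumed bitrate of 1500 kbps per stream, purely for illustration.
    for n in (2, 4, 8):
        print(n, mesh_load(n, stream_kbps=1500))
```

Under these assumptions, the upload requirement per participant grows linearly with the size of the call, which is why a modest uplink saturates quickly.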

In reality, P2P (peer-to-peer) architecture makes sense mostly for 1-1 calls where 2 people participate in the conference. In that scenario, P2P is still optimal because each of the 2 participants sends their video/audio only once.

Advantages:

  • Privacy is easily achieved - all video/audio streams are E2EE (end-to-end encrypted) by default since there is no intermediate infrastructure which could spy on the video/audio. 
  • Complexity and hosting costs are low too since there are no intermediate media servers to host/implement and care about.

Disadvantages:

  • Upload bandwidth is not used wisely and can be easily saturated even with a small number of participants in the conference.
  • CPU (Central Processing Unit) usage will be significantly higher on the client side because the browser needs to encode the video N-1 times to send it to N-1 other participants. Unless you have a really powerful machine, the performance will be easily affected.

Conclusion:

The above disadvantages mean that P2P architecture does not scale and is a reliable choice mostly for 1-1 calls. In practice, you won’t often see a video conferencing provider using P2P architecture if there are more than 3 participants in the conference.

SFU (Selective Forwarding Unit)

This architecture is the main choice in recent video conferencing solutions. There are central SFU media servers which receive the published streams and then route them without modifying them to the other participants.

Although some of the complexity shifts to the server side, this is a huge improvement over P2P because it solves the upload bandwidth and scalability issues from which P2P suffers.

Simulcast is often used with SFU: each participant publishes several versions of their stream to the SFU, each with a different quality. The SFU can then route the low-quality versions to participants who have a poor internet connection, or deliver the high-quality version of a stream only when a participant has maximised it locally.

That way a large amount of downlink bandwidth can be saved and many participants can be displayed in the same grid even if participants have an average internet connection.
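The routing decision described above can be sketched as a small selection function. This is a minimal illustration, not how any particular SFU implements it: the `Layer` type, the layer names and the bitrates are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Layer:
    name: str  # hypothetical label for one simulcast version
    kbps: int  # assumed bitrate of that version


# Assumed example layers; real bitrates depend on codec and resolution.
LAYERS = (Layer("low", 150), Layer("mid", 500), Layer("high", 1500))


def pick_layer(layers: Sequence[Layer], budget_kbps: int, maximised: bool) -> Layer:
    """Choose which simulcast version to forward to one subscriber.

    The highest layer is only eligible when the subscriber has maximised
    the stream. Among eligible layers, the best one that fits the
    subscriber's bandwidth budget wins; if none fits, fall back to the
    lowest layer.
    """
    if not layers:
        raise ValueError("at least one layer is required")
    ordered = sorted(layers, key=lambda layer: layer.kbps)
    eligible = ordered if maximised or len(ordered) == 1 else ordered[:-1]
    chosen = eligible[0]
    for layer in eligible:
        if layer.kbps <= budget_kbps:
            chosen = layer
    return chosen


if __name__ == "__main__":
    print(pick_layer(LAYERS, budget_kbps=2000, maximised=True).name)
    print(pick_layer(LAYERS, budget_kbps=2000, maximised=False).name)
    print(pick_layer(LAYERS, budget_kbps=200, maximised=True).name)
```

In this sketch the budget is given per stream; a real server would also have to split a subscriber's total downlink across all the streams shown in the grid.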

Figure: SFU (Selective Forwarding Unit) architecture.

As the diagram shows, each participant publishes once to the SFU media server and receives each of the other participants’ streams from it.

Advantages:

  • Participants publish exactly once - to the SFU server. This makes SFU an upload bandwidth-friendly architecture, unlike the P2P (mesh).
  • Much more scalable than the P2P architecture. Simulcast can be used, where different stream versions are routed depending on each participant’s available CPU and bandwidth.
  • Flexible layout - participants can decide which streams they want to receive, where to display them and in what quality.

Disadvantages:

  • There are intermediate media servers which increase costs/complexity on the server side.
  • E2EE (full privacy) is not achieved by default, since the intermediate media server has access to the raw stream bytes when forwarding. This is a disadvantage compared to the P2P architecture, but it can be mitigated by encrypting the stream bytes with a custom key before sending them to the SFU media server. Of course, that means custom decryption is needed on the receiving side.

Conclusion: SFU (Selective Forwarding Unit) is the most popular architecture deployed today in video conferencing.

SFU uses upload bandwidth much more efficiently than P2P and scales much better.

Also, while users still need to download and decode each of the other participants’ streams, the simulcast technique can be applied to allow a display of up to circa 50 participants in a grid on an average connection and machine.

MCU (Multipoint Control Unit)

In the MCU (Multipoint Control Unit) architecture, every participant publishes their stream only once, to a central server. But unlike an SFU, the MCU central server has the role of a mixer - combining all received streams into one stream.

Then all participants consume this one mixed stream instead of subscribing individually to the stream of every other participant.
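One part of the mixing step is deciding where each incoming stream goes in the combined frame. The sketch below computes a simple near-square grid; it is a hypothetical illustration of a fixed server-side layout (the function name `grid_layout` and the 1280x720 canvas are assumptions), and it leaves out the actual decoding, compositing and re-encoding.

```python
import math
from typing import List, Tuple

Tile = Tuple[int, int, int, int]  # x, y, width, height in pixels


def grid_layout(streams: int, width: int, height: int) -> List[Tile]:
    """Place `streams` tiles on a width x height canvas, row by row.

    Uses the smallest near-square grid that fits all streams. Tile sizes
    are rounded down, so a few pixels at the right or bottom edge may
    stay unused.
    """
    if streams < 1:
        raise ValueError("at least one stream is required")
    cols = math.ceil(math.sqrt(streams))
    rows = math.ceil(streams / cols)
    tile_w, tile_h = width // cols, height // rows
    return [
        ((i % cols) * tile_w, (i // cols) * tile_h, tile_w, tile_h)
        for i in range(streams)
    ]


if __name__ == "__main__":
    for tile in grid_layout(4, 1280, 720):
        print(tile)
```

Because the server computes this layout once for everyone, every participant receives the same arrangement.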

Figure: MCU (Multipoint Control Unit) architecture.

Advantages:

  • Every participant subscribes to only one stream - the combined layout of all other participants. This is very CPU and bandwidth efficient on the client side (even more efficient than SFU) since only one stream is being decoded by the browser.
  • Since only one stream is consumed on the client side, it is easy to reason about the architecture and integrate/debug it in the front end. On the other hand, complexity is mainly being moved to the backend.

Disadvantages:

  • The layout is generally not flexible - the central server determines a fixed layout that all participants see. For example, a participant cannot reorder the streams or maximise another participant’s stream in high quality.
  • CPU usage and complexity on the server side are much higher compared to SFU due to the mixing of all streams. Scaling a room is mostly vertical - upgrading the CPU to handle the mixing of more and more participants. And of course, vertical scaling has its downsides, because it is hard and expensive to find more and more powerful and reliable machines.
  • Mixing all streams into one result stream introduces a slightly larger delay compared to SFU since SFU only relays streams. Also if there is an error in the MCU layout, everyone will be affected by it.

Conclusion: While MCU (Multipoint Control Unit) is the best architecture if the only concern is client-side resource usage, in practice MCU loses to SFU because it is at least 10 times more expensive to deploy on the server side.

Decoding, mixing and re-encoding streams is much more taxing than just routing/relaying them as an SFU does. And since companies generally cannot afford to spend at least 10 times more money on the server side, SFU is the reasonable compromise which wins in most cases.
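To put the three architectures side by side, the following sketch counts how many streams each client sends and receives, based on the descriptions in the sections above. The names `ClientLoad` and `client_load` are made up for this example, and server-side cost is deliberately left out.

```python
from typing import NamedTuple


class ClientLoad(NamedTuple):
    uploads: int    # streams each client encodes and sends
    downloads: int  # streams each client receives and decodes


def client_load(architecture: str, participants: int) -> ClientLoad:
    """Per-client stream counts for a call with `participants` people."""
    if participants < 2:
        raise ValueError("a call needs at least 2 participants")
    others = participants - 1
    if architecture == "p2p":
        # One copy sent to, and one stream received from, every other peer.
        return ClientLoad(uploads=others, downloads=others)
    if architecture == "sfu":
        # Publish once to the server, receive every other participant.
        return ClientLoad(uploads=1, downloads=others)
    if architecture == "mcu":
        # Publish once, receive the single mixed stream.
        return ClientLoad(uploads=1, downloads=1)
    raise ValueError(f"unknown architecture: {architecture}")


if __name__ == "__main__":
    for arch in ("p2p", "sfu", "mcu"):
        print(arch, client_load(arch, participants=10))
```

The client-side numbers favour MCU, which is why the server-side cost discussed above is what decides between MCU and SFU in practice.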

Hybrid architectures

In the hybrid approach, different architectures are used depending on the number of participants in the call. Very often P2P is used for 1-1 calls and the application switches to SFU after a third person enters the call.

That way some server bandwidth/resources are spared during 1-1 calls. This can be a non-negligible saving, since 1-1 calls are very common: according to our statistics and research, around 50% of calls are 1-1.

Of course, that percentage can vary depending on the product focus - obviously, products targeted towards large webinars won’t have as many 1-1 calls.
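The switching rule described above (P2P for 1-1 calls, SFU once a third person joins) can be sketched as follows. This is an illustrative model only: the `Call` class and its methods are hypothetical, and the real work of renegotiating media during a switch is not shown.

```python
def choose_architecture(participants: int, p2p_limit: int = 2) -> str:
    """Return "p2p" up to p2p_limit participants, otherwise "sfu"."""
    if participants < 0:
        raise ValueError("participants cannot be negative")
    return "p2p" if participants <= p2p_limit else "sfu"


class Call:
    """Tracks the participant count and the architecture currently in use."""

    def __init__(self) -> None:
        self.participants = 0
        self.architecture = choose_architecture(0)

    def _update(self, delta: int) -> bool:
        # Returns True when the change requires switching architecture.
        self.participants = max(0, self.participants + delta)
        target = choose_architecture(self.participants)
        switched = target != self.architecture
        self.architecture = target
        return switched

    def join(self) -> bool:
        return self._update(+1)

    def leave(self) -> bool:
        return self._update(-1)


if __name__ == "__main__":
    call = Call()
    for _ in range(3):
        switched = call.join()
        print(call.participants, call.architecture, switched)
```

Whether to switch back to P2P when the call drops to two people again is a design choice; this sketch does, but an application might prefer to stay on the SFU to avoid a second transition.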

Advantages: Combining several architectures lets the application benefit from the advantages of each one depending on the situation. Using P2P for 1-1 calls saves server resources since there are no intermediate servers.

Disadvantages: Combining several architectures in the same application increases code complexity and maintenance costs. Smoothly transitioning between P2P and another architecture (SFU/MCU) in the middle of a running call is not trivial.

See a modern WebRTC application in action

WebRTC is a critical technology for the modern internet. Now that you have learned about the architecture options that drive it, wouldn’t you like to see an example of WebRTC technology in action?

What if you could also have an expert guide you through the application and give you a tour of modern live-video software?

Digital Samba is a cutting-edge live video conferencing solution powered by WebRTC on the back end. Contact our sales team to get started.
