Video Call Quality Metrics for SDK Integrations: A Developer Guide

March 23, 2023

Video call quality metrics in SDK integrations are the performance indicators that tell you whether your embedded video is working correctly or heading towards a failure your users will notice. The six core metrics are: one-way latency (target under 150 ms), jitter (under 20 ms optimal, under 30 ms acceptable), packet loss (under 1%), frame rate (30 fps for standard calls, never below 24 fps), bitrate (matched to available bandwidth), and CPU and memory consumption. Tracking all six lets you catch degradation before users experience it.

Most video quality problems are invisible until they affect a real session. By the time a user reports frozen video or distorted audio, the underlying metrics have already been deteriorating for minutes. For developers building on a video SDK, proactive monitoring is the only reliable way to maintain a consistent experience, particularly in sessions with varied network conditions, devices, and participant counts.

Unlike consumer video apps where the vendor handles infrastructure, SDK integrations put you in the path of quality management. You choose the deployment region, control the codec settings where the SDK allows, and handle the UX when something goes wrong. Knowing what each metric means (and what normal looks like) is what lets you diagnose and fix problems quickly.

This guide covers the six core metrics and their benchmark thresholds, what good SDK performance looks like in numbers, how to implement monitoring in your integration, and the best practices for keeping video quality consistently high.

Table of contents

  1. The six core video call quality metrics
  2. Video SDK performance benchmarks: what good looks like
  3. How to monitor video call quality in your SDK integration
  4. What affects video call quality in SDK implementations
  5. Best practices for optimising video call quality
  6. FAQ

The six core video call quality metrics

Latency and round-trip time

One-way latency is the time a media packet takes to travel from sender to receiver. Round-trip time (RTT) measures the full loop: sender to receiver and back. RTT is available from the getStats() API via the remote-inbound-rtp stat type (RTCP-based) or the candidate-pair stat type (STUN-based); it does not appear on inbound-rtp.

Benchmark thresholds:

  • Under 150 ms one-way (roughly 300 ms RTT on a symmetric path): excellent, no perceptible delay
  • 150–300 ms one-way: acceptable for most business video calls
  • Above 300 ms one-way: noticeable delay, affects conversation rhythm

For interactive use cases such as medical consultations, education, and live support, aim for under 150 ms. High latency above 300 ms degrades turn-taking and makes real-time conversation feel broken.
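
Because getStats() reports RTT rather than one-way latency, a rough one-way figure can be estimated as half the RTT. The sketch below assumes a symmetric network path, which real networks do not guarantee, and the function names are illustrative rather than part of any SDK.

```javascript
// Estimate one-way latency from RTT, assuming a symmetric network path
// (an approximation; real paths are often asymmetric).
function estimateOneWayLatencyMs(rttSeconds) {
  return (rttSeconds * 1000) / 2;
}

// Map a one-way latency in milliseconds onto the bands listed above.
function classifyLatency(oneWayMs) {
  if (oneWayMs < 150) return 'excellent';
  if (oneWayMs <= 300) return 'acceptable';
  return 'noticeable delay';
}
```

The roundTripTime value from getStats() is in seconds, which is why the helper converts to milliseconds before halving.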

Jitter

Jitter is the variation in packet arrival timing. Packets that leave the sender at regular intervals but arrive at irregular intervals introduce jitter. The receiver's jitter buffer absorbs some of this variation, but high jitter drains the buffer faster than it can refill, causing audio gaps and video freezes.

Benchmark thresholds:

  • Under 20 ms: optimal for voice and video
  • 20–30 ms: acceptable, minor variation
  • Above 30 ms: high risk of choppy audio or call instability
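
For intuition about where a jitter figure comes from, the sketch below follows the interarrival jitter estimator described in RFC 3550, which smooths the change in transit time between consecutive packets with a gain of 1/16. The packet shape ({ sentMs, receivedMs }) is an assumption for illustration; real receivers work in RTP timestamp units rather than milliseconds.

```javascript
// Running interarrival jitter estimate in the style of RFC 3550.
// packets: array of { sentMs, receivedMs } in arrival order (hypothetical shape).
function interarrivalJitter(packets) {
  let jitter = 0;
  let prevTransit = null;
  for (const { sentMs, receivedMs } of packets) {
    const transit = receivedMs - sentMs;
    if (prevTransit !== null) {
      // d is how much the transit time changed compared with the previous packet
      const d = Math.abs(transit - prevTransit);
      jitter += (d - jitter) / 16;
    }
    prevTransit = transit;
  }
  return jitter;
}
```

A stream whose packets all take the same time to arrive produces zero jitter regardless of how high that latency is, which is why the two metrics are tracked separately.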

Packet loss

Packet loss is the percentage of RTP packets that don't arrive. WebRTC can partially recover from low packet loss through retransmission (NACK) and forward error correction (FEC), but loss above 1% starts to visibly degrade quality.

Benchmark thresholds:

  • Under 0.5%: excellent
  • 0.5–1%: acceptable
  • 1–5%: degraded, expect visible artefacts and audio dropouts
  • Above 5%: unacceptable, trigger an alert

For more detail on how packet loss affects WebRTC streams specifically and what recovery mechanisms exist, see the Digital Samba guide to packet loss in WebRTC.

Frame rate and resolution

Frame rate determines how smooth video appears. Resolution determines sharpness. Both are constrained by available bandwidth.

Benchmark thresholds:

  • 30 fps at 720p: standard for video calls
  • 60 fps at 1080p: high quality, suitable for screen-sharing-heavy sessions
  • Below 15 fps: video becomes visibly choppy

Resolution should scale down gracefully when bandwidth drops, rather than maintaining resolution at the cost of frame rate. Most well-designed SDKs handle this automatically through adaptive bitrate control.

Bitrate and codec efficiency

Bitrate is the amount of data transmitted per second. Higher bitrates produce better quality but require more bandwidth. The relationship between bitrate and quality depends heavily on the codec.

Typical bitrate ranges for video calls:

  • Standard definition (480p, 30 fps): 300–600 Kbps (for real-time conferencing with efficient codecs such as VP8 or VP9; general video delivery typically requires more)
  • High definition (720p, 30 fps): 1.5–2.5 Mbps
  • Full HD (1080p, 30 fps): 3–5 Mbps
  • Audio only: 32–128 Kbps
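
The lower bound of each range above can serve as a rough floor when choosing a video preset from an estimated available bandwidth. This is a minimal sketch: the preset names and the function are hypothetical, and most SDKs make this decision internally through adaptive bitrate control.

```javascript
// Presets ordered from highest to lowest quality.
// minKbps is the lower bound of the matching range listed above.
const VIDEO_PRESETS = [
  { name: '1080p', minKbps: 3000 },
  { name: '720p', minKbps: 1500 },
  { name: '480p', minKbps: 300 },
];

// Return the best preset the estimated bandwidth can sustain,
// falling back to audio only when even 480p does not fit.
function pickVideoPreset(availableKbps) {
  const preset = VIDEO_PRESETS.find(p => availableKbps >= p.minKbps);
  return preset ? preset.name : 'audio-only';
}
```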

For the difference between encoding approaches and what they mean for SDK integration, the Digital Samba explainer on video encoding vs video transcoding covers the core trade-offs.

CPU and memory usage

High CPU or memory consumption from the video session affects your entire application. Hardware acceleration offloads codec processing from the CPU; without it, high-resolution video can spike CPU to the point where the rest of the application becomes unresponsive.

Watch for CPU usage above 80% during active sessions and memory leaks that grow over long calls. Both are symptoms of either codec inefficiency or too many parallel streams being processed client-side.

Video SDK performance benchmarks: what good looks like

Benchmarks give you a baseline for evaluating whether an SDK implementation is performing within normal parameters. Here are the metrics that well-optimised SDK deployments consistently achieve:

Metric               Acceptable   Target
One-way latency      <300 ms      <150 ms
Jitter               <30 ms       <20 ms
Packet loss          <1%          <0.5%
Session join time    <5 s         <2 s
Frame rate (720p)    ≥24 fps      30 fps
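
The benchmark values translate directly into a lookup structure. The sketch below is one way to grade a metrics sample against them; the object shape and key names are assumptions for illustration, not an SDK API, and frame rate is left out because it is a "higher is better" metric.

```javascript
// Thresholds copied from the benchmarks above; "lower is better" metrics only.
const BENCHMARKS = {
  oneWayLatencyMs: { target: 150, acceptable: 300 },
  jitterMs: { target: 20, acceptable: 30 },
  packetLossPercent: { target: 0.5, acceptable: 1 },
  joinTimeSeconds: { target: 2, acceptable: 5 },
};

// Returns 'target', 'acceptable', or 'poor' for each known metric in the sample.
function gradeSample(sample) {
  const grades = {};
  for (const [metric, limits] of Object.entries(BENCHMARKS)) {
    const value = sample[metric];
    if (typeof value !== 'number') continue; // skip metrics the sample lacks
    if (value < limits.target) grades[metric] = 'target';
    else if (value < limits.acceptable) grades[metric] = 'acceptable';
    else grades[metric] = 'poor';
  }
  return grades;
}
```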

Join time (the interval from a user clicking "join" to their video appearing in the room) is one of the most visible quality indicators, even though it doesn't appear in most metric dashboards. Implementations using a Selective Forwarding Unit (SFU) can achieve joins in under two seconds under normal network conditions. Implementations using a Multipoint Control Unit (MCU) are often slower because they decode and mix streams server-side before forwarding.

At scale, the key reliability target is session completion rate: the percentage of sessions that start, run, and end without a platform-caused failure. Well-maintained infrastructure should achieve above 99.5%.
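
Session completion rate is simple arithmetic once each session record carries an outcome. The sketch assumes a hypothetical record shape with a platformFailure boolean; how a failure gets classified as platform-caused depends on your own logging.

```javascript
// sessions: array of { platformFailure: boolean } (hypothetical record shape).
// Returns the completion rate as a percentage, or null when there is no data.
function sessionCompletionRate(sessions) {
  if (sessions.length === 0) return null;
  const completed = sessions.filter(s => !s.platformFailure).length;
  return (completed / sessions.length) * 100;
}
```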

How to monitor video call quality in your SDK integration

The most direct method for client-side quality monitoring in WebRTC-based SDKs is the getStats() API. It returns a snapshot of the active peer connection's statistics: jitter, packet loss, bitrate, frame rate, and resolution from inbound-rtp; RTT from remote-inbound-rtp or candidate-pair. Polling every 2–5 seconds gives you sufficient granularity without adding meaningful overhead.

// Poll WebRTC getStats() every 3 seconds
// peerConnection is the RTCPeerConnection instance from your SDK

// Previous cumulative counters per inbound stream, keyed by stat.id,
// so deltas stay correct when several remote video streams are active
const prevCounters = new Map();

const poll = async (pc) => {
  const report = await pc.getStats().catch(() => null);
  if (!report) return;

  report.forEach(stat => {
    if (stat.type === 'inbound-rtp' && stat.kind === 'video') {
      // packetsLost and packetsReceived are cumulative; delta two samples to get a rate
      const prev = prevCounters.get(stat.id) || { lost: 0, received: 0 };
      // packetsLost can decrease when late or duplicate packets arrive, so clamp at zero
      const lostDelta = Math.max(0, stat.packetsLost - prev.lost);
      const receivedDelta = stat.packetsReceived - prev.received;
      const total = lostDelta + receivedDelta;
      const lossPercent = total > 0 ? (lostDelta / total) * 100 : 0;

      prevCounters.set(stat.id, { lost: stat.packetsLost, received: stat.packetsReceived });

      // stat.jitter is in seconds per the W3C WebRTC spec; multiply by 1000 for ms
      const jitterMs = stat.jitter * 1000;

      console.log({
        packetLossPercent: lossPercent.toFixed(2),
        jitterMs: jitterMs.toFixed(1),
        framesPerSecond: stat.framesPerSecond,
      });
    }

    // RTT is not on inbound-rtp; read it from remote-inbound-rtp (RTCP-based)
    if (stat.type === 'remote-inbound-rtp' && stat.kind === 'video') {
      const rttMs = stat.roundTripTime != null ? stat.roundTripTime * 1000 : null;
      if (rttMs !== null) {
        console.log({ rttMs: rttMs.toFixed(1) });
      }
    }
  });
};

setInterval(() => poll(peerConnection), 3000);
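
The polling example reads loss, jitter, and frame rate, but bitrate has to be derived: inbound-rtp exposes a cumulative bytesReceived counter and every stat carries a timestamp in milliseconds, so the rate is the byte delta over the time delta. A minimal sketch; the helper name is illustrative.

```javascript
// prev and curr are two inbound-rtp stats for the same stream (same stat.id),
// taken on consecutive polls.
function inboundBitrateKbps(prev, curr) {
  const elapsedMs = curr.timestamp - prev.timestamp;
  if (elapsedMs <= 0) return 0; // same or out-of-order sample, nothing to measure
  const bits = (curr.bytesReceived - prev.bytesReceived) * 8;
  return bits / elapsedMs; // bits per millisecond equals kilobits per second
}
```

Comparing the result against the bitrate the call is expected to sustain at its current resolution is one way to spot a stream that has been throttled.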

For session-level events rather than real-time stream statistics, SDK webhooks give you a structured view of what happened: when a session started, when participants joined or left, when recording began, and when the session ended. These are useful for detecting patterns in connection problems across your user base, for example a particular region or device type consistently dropping sessions.
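
Pattern detection over webhook data can start as a simple group-by. The sketch below computes the share of dropped sessions per region; the event fields (region, endedReason) and the 'connection_lost' value are hypothetical, since webhook payloads differ between SDKs.

```javascript
// events: array of { region, endedReason } built from session-ended webhooks
// (hypothetical shape). Returns an object mapping region to drop rate in percent.
function dropRateByRegion(events) {
  const totals = new Map();
  for (const { region, endedReason } of events) {
    const entry = totals.get(region) || { sessions: 0, dropped: 0 };
    entry.sessions += 1;
    if (endedReason === 'connection_lost') entry.dropped += 1;
    totals.set(region, entry);
  }
  const rates = {};
  for (const [region, { sessions, dropped }] of totals) {
    rates[region] = (dropped / sessions) * 100;
  }
  return rates;
}
```

The same grouping works for device type or browser by swapping the key.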

If your SDK exposes an MOS (Mean Opinion Score) metric, this is a useful quality signal. MOS combines latency, jitter, and packet loss into a single score on a 1–5 scale. Scores above 4 represent a good call experience, 3–4 an acceptable one, and below 3 a noticeably poor one. Note that a computed MOS is primarily an audio quality metric, typically estimated with the ITU-T E-model (G.107); it does not capture video-specific degradation such as freeze events or resolution collapse, which require separate metrics.
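
When an SDK does not expose MOS, a rough estimate can be derived from the same three inputs. In the sketch below, the R-factor to MOS conversion is the mapping from ITU-T G.107, but the R-factor calculation itself is a widely circulated simplification rather than the full E-model, and its constants should be treated as assumptions. Function names are illustrative.

```javascript
// Convert an E-model R-factor to MOS using the ITU-T G.107 mapping.
function rFactorToMos(r) {
  if (r <= 0) return 1;
  if (r >= 100) return 4.5;
  return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r);
}

// Heuristic R-factor from network stats. This is a simplification,
// NOT the full G.107 computation; the constants are assumptions.
function estimateMos({ latencyMs, jitterMs, packetLossPercent }) {
  // Jitter is weighted double because it costs extra buffering delay
  const effectiveLatency = latencyMs + 2 * jitterMs + 10;
  let r = effectiveLatency < 160
    ? 93.2 - effectiveLatency / 40
    : 93.2 - (effectiveLatency - 120) / 10;
  r -= 2.5 * packetLossPercent;
  return rFactorToMos(r);
}
```

Treat the output as a trend indicator for comparing sessions rather than as a calibrated score.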

Set alerts on threshold breaches, not averages. A session can have an acceptable average packet loss of 0.8% while experiencing 30-second spikes to 8% that break the call. Alerting on sustained threshold breaches (for example, packet loss above 1% for more than 10 consecutive seconds) is more useful than monitoring averages alone.
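
A sustained-breach rule like the one described here needs only a small amount of state: the time at which the current run of out-of-range samples began. The sketch below is one way to express it; the function name and signature are illustrative.

```javascript
// Creates a detector that reports a breach only after the value has stayed
// above `threshold` for at least `minDurationMs` without interruption.
function createSustainedBreachDetector(threshold, minDurationMs) {
  let breachStartedAt = null;
  return function check(value, nowMs) {
    if (value <= threshold) {
      breachStartedAt = null; // any in-range sample resets the streak
      return false;
    }
    if (breachStartedAt === null) breachStartedAt = nowMs;
    return nowMs - breachStartedAt >= minDurationMs;
  };
}
```

For the packet loss example, a detector built with createSustainedBreachDetector(1, 10000) would be fed the loss percentage and Date.now() on every poll, and would return true once loss has stayed above 1% for 10 seconds.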

What affects video call quality in SDK implementations

Several factors outside the SDK itself affect the quality metrics your implementation achieves:

  • Network connection. Network quality is the dominant variable. High latency, elevated packet loss, and jitter all flow from a poor underlying network. Mobile connections and congested public Wi-Fi are the most common culprits.
  • Device performance. Older devices with slower CPUs or limited RAM struggle with video codec processing, particularly at HD resolutions without hardware acceleration. This shows up as elevated CPU usage and reduced frame rates.
  • Application features. Screen sharing, file transfer, and in-session recording all consume additional bandwidth and processing. The more features running in parallel, the greater the pressure on each quality metric.
  • User behaviour. A participant on a congested network with several other applications open will experience higher latency and bandwidth contention than a participant on a wired connection using only the video call.
  • Server capacity and routing. Under-provisioned servers or poor regional routing increase latency and packet loss for participants far from the infrastructure. SFU-based architectures handle concurrency more efficiently than MCU-based ones because they route streams without media decoding or mixing.

Best practices for optimising video call quality

  1. Use an SFU, not an MCU, for interactive calls. SFU architectures forward RTP packets without media decoding or mixing, which keeps latency lower and reduces infrastructure load. Note that standard SRTP encryption terminates at the SFU: the server does not decode video or audio frames, but it decrypts and re-encrypts packets in transit, so the media payloads are not protected from the server itself. True end-to-end encryption requires an additional layer such as SFrame. MCU architectures are sometimes used for broadcast scenarios but add latency for interactive calls.

  2. Choose your codec for your audience. VP8 and VP9 are the most widely supported codecs in browser-based WebRTC implementations and require no server-side transcoding in SFU architectures. Note that VP9 support on iOS Safari is less consistent than on desktop browsers; VP8 is the safest universal baseline when iOS participant support matters. H.264 has strong hardware acceleration support on many devices and is preferred in hardware-endpoint enterprise environments. AV1 offers superior compression efficiency but has higher encoding overhead. For most SDK integrations targeting browser-based participants, VP8 or VP9 with adaptive bitrate is the practical choice.

  3. Implement adaptive bitrate control. Allow the SDK to reduce resolution or frame rate dynamically when network conditions degrade. Users will accept a temporary drop to 360p far better than they will accept a frozen screen.

  4. Enable hardware acceleration. When encoding and decoding are offloaded to the device's GPU, CPU usage drops significantly and thermal throttling is less likely to affect call quality in longer sessions.

  5. Monitor at two levels: client and session. Client-side getStats() gives real-time, per-stream diagnostics. Session-level webhooks and server-side metrics give you aggregate patterns. Both are necessary; neither alone gives the full picture.

If you want to test these principles in a production-grade integration, Digital Samba's Embedded SDK includes several quality tools built in: MOS scoring, bandwidth management that auto-adjusts video quality per connection, connection error detection and correction, and system diagnostics for pre-join device testing. The platform runs on a VP8-only real-time stack over a Janus-based SFU, so streams are forwarded without server-side media decoding or mixing. HD video presets support 720p and 1080p at 1.5–5.5 Mbps, and audio goes up to 256 Kbps (exceeding the standard conferencing ceiling of 128 Kbps) with noise suppression and echo cancellation included. SDK events such as appError and mediaPermissionsFailed are subscribable, so you can build custom quality-failure handling directly into your UX. You can explore the full API surface in the Digital Samba Embedded SDK documentation or start testing on the free plan.

FAQ

What is acceptable latency for a video call?

One-way latency under 150 ms is the target for interactive video calls: medical consultations, education, and live support. Latency up to 300 ms one-way is acceptable for standard business calls but introduces a noticeable delay in turn-taking. Above 300 ms, conversation rhythm breaks down.

What causes packet loss in video conferencing?

Packet loss in video calls is most commonly caused by network congestion, poor Wi-Fi signal strength, or insufficient bandwidth for the number of active streams. It can also result from server-side capacity problems or poor routing between participants and the infrastructure. The Digital Samba guide to packet loss in WebRTC covers causes and recovery mechanisms in detail.

How do I measure video call quality in a WebRTC SDK integration?

The standard approach is the WebRTC getStats() API, which returns per-stream statistics including jitter, packet loss, frame rate, and resolution via inbound-rtp, and RTT via remote-inbound-rtp or candidate-pair. Poll it every 2–5 seconds during active sessions. For session-level visibility, use your SDK's webhook events to log when sessions start, participants join, and connections fail. If your SDK exposes MOS scoring, that gives you a useful audio-quality signal across the session.

What frame rate is good for video conferencing?

30 fps at 720p is the standard benchmark for video conferencing calls. Screen-sharing-heavy sessions benefit from higher frame rates to keep motion smooth. Below 15 fps, video becomes visibly choppy. Adaptive bitrate systems should reduce resolution automatically under bandwidth pressure, which is preferable to maintaining resolution at the cost of fluidity.

What is MOS scoring in video calls?

MOS (Mean Opinion Score) is a standardised measure of perceived call quality on a scale of 1 to 5. Scores above 4 represent good call quality, 3–4 is acceptable, and below 3 is noticeably poor. MOS combines latency, jitter, and packet loss into a single score, making it easier to identify sessions that degraded without needing to cross-reference multiple raw metrics. Because a computed MOS is typically estimated with the ITU-T E-model, which was designed for audio, it reflects the quality of the audio channel; video-specific issues such as freeze events or resolution drops require separate tracking.

What is the difference between jitter and latency in video calls?

Latency is the delay from sender to receiver: the time a packet takes to travel the network path. Jitter is the variation in that delay from packet to packet. Both affect call quality, but in different ways: latency affects conversation turn-taking and synchronisation; jitter causes audio gaps and video freezes because packets arrive late, out of order, or in bursts, which starves or overflows the receiver's jitter buffer.