In January 2024, a finance employee at Arup, the multinational engineering firm, received what looked like a routine video call invitation. The invitation followed a phishing email about a secret transaction, which the employee had found suspicious. Rather than raise the alarm, he joined the video call, and what he saw there dissolved his doubts: the CFO was on screen, several familiar colleagues were present, and there was an urgent wire transfer request on the agenda. Everything looked normal. Everything sounded normal.
None of it was real.
Every person on that call was a deepfake: the CFO, the colleagues, the entire meeting. All of it was AI-generated synthetic video fed in real time. The attacker had not bypassed any platform access controls; the employee joined the call voluntarily after being brought there through social engineering. By the time the fraud was discovered, HK$200 million (approximately US$25 million) had been transferred out of the company's accounts. It remains the largest confirmed case of deepfake video call fraud against a corporate target.
The Arup case didn't just make headlines. It changed how security professionals think about video conferencing. If a trained finance professional can be deceived into authorising a $25 million transfer by a synthetic video call, the question is no longer whether your organisation could fall victim to this kind of attack. The question is whether your video platform and your processes are built to stop it.
This article breaks down how deepfake threats work in video environments, why your current defences may have a critical gap, and what genuinely effective protection looks like in 2026 and beyond.
The Arup case wasn't an isolated incident. It was a preview.
The deepfake detection market tells the story in numbers. Valued at $5.5 billion, it is projected to reach $15.7 billion by 2026, a 42 per cent compound annual growth rate, according to figures Deloitte cited in a November 2024 analysis. That level of investment doesn't happen unless the threat is real and growing.
The human side of the equation is more alarming. Research from Keepnet found that people correctly identify deepfakes only 24.5 per cent of the time. That's worse than a coin flip, and it means your employees are the wrong last line of defence against a deepfake fraud video call.
Enterprise exposure has accelerated sharply. Resemble AI tracked 980 corporate infiltration cases involving synthetic media in Q3 2025 alone, drawn from global media monitoring across that period. These weren't phishing emails or smishing attacks; they were coordinated attempts to infiltrate businesses through AI-generated personas on video calls. Meanwhile, Gartner has projected that by 2027, 50 per cent of enterprises will be investing in disinformation security products and strategies, up from less than 5 per cent just a few years ago, having recognised that traditional defences don't hold up against generative AI.
If your organisation runs video calls for onboarding, executive approvals, financial authorisations, or compliance sign-offs, that threat is directly relevant to you.
Can you fake a video call? The uncomfortable answer in 2026 is yes. You can do it convincingly, in real time, and at relatively low cost.
There are three primary attack vectors in a deepfake video call environment:

- Real-time face swapping, where an attacker's live webcam feed is overlaid frame by frame with a target's synthesised face.
- Voice cloning, where a model trained on short samples of a target's speech reproduces their voice live on the call.
- Injected synthetic streams, where a fully AI-generated audio-video feed is inserted directly into the video pipeline, typically via a virtual camera, bypassing the physical webcam entirely.
These capabilities power several categories of real-world attack:

- Executive impersonation fraud, where a synthetic 'CFO' or other senior leader pressures staff into urgent wire transfers, as in the Arup case.
- Synthetic personas in hiring and onboarding, where AI-generated candidates infiltrate organisations through remote interviews.
- Impersonation of colleagues or vendors to extract credentials, approvals, or sensitive information on routine-looking calls.
Video calls are uniquely vulnerable to all of this for a simple reason: we've been trained to trust what we see and hear on a video call in a way we never would with an email. A suspicious email gets scrutinised. A confident, visually convincing 'CFO' on screen gets believed, especially when the request is framed as urgent and confidential.
Many organisations, after reading about these threats, immediately think about their encryption posture. End-to-end encryption, TLS in transit, AES-256 at rest. Surely that covers it?
Encryption protects the channel. It does not verify who is on the other end of it.
Think of it this way: a sealed envelope guarantees that nobody opened the letter in transit. But it tells you nothing about whether the person who sent it is who they claim to be. In video conferencing, encryption prevents a third party from intercepting your call. It does nothing to prevent an attacker who has already synthesised the CFO's face from participating in that call as an authenticated participant.
This is the authentication gap, and it's where most enterprise video security postures have a real blind spot.
Two broad approaches have emerged to close it:

- AI-based detection, which analyses audio and video during the call for signs of synthetic generation.
- Cryptographic identity verification, which confirms before and during the call that each participant holds a credential tied to a verified identity.
The strongest security postures combine both. But if you're choosing where to invest first, the cryptographic layer is the more reliable foundation.
A category of dedicated deepfake detection tools has emerged to address the real-time identification problem. These platforms analyse audio and video streams during a call and flag artefacts characteristic of synthetic generation.
Zoom has also been rolling out built-in deepfake detection as part of its Workplace platform, including an integration with Pindrop for contact centre use cases announced in early 2026.
These tools are improving rapidly, but they carry inherent limitations. Detection accuracy degrades as generation quality improves. They typically require additional integration into existing conferencing workflows, and they generate false positives that create friction for legitimate participants, which is a real concern in regulated environments where executive calls cannot afford interruption.
As one layer in a defence stack, they add real value. As your primary control, they're not sufficient.
Solutions built around cryptographic identity verification address a different part of the problem. Rather than analysing what someone looks like during a call, cryptographic verification confirms that the person joining has already passed a verified identity check and holds a valid, unforgeable session credential.
This is implemented through token-based authentication systems where identity is asserted before the call begins. A participant cannot join a session without a cryptographically signed token issued to a verified identity. If someone attempts to impersonate a colleague using a synthetic face, they won't have that token, and they won't get in.
Token authentication has a clear limit, though. It verifies the credential at entry, not the face on screen during the call. Once a legitimately credentialled participant has joined, the token layer cannot detect a face-swap running on their device. An insider with a valid token, or an attacker who has obtained one through social engineering, could still conduct an in-session impersonation. Token authentication is a strong first control; it is not the complete answer on its own.
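To make the token model concrete, here is a minimal sketch of signed session credentials using only the Python standard library. The function names, claim fields, and HMAC scheme are illustrative assumptions for this post, not any vendor's actual API; production systems would typically use an established format such as JWT with asymmetric keys.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"server-side signing key"  # held only by the token issuer, never the client

def issue_token(verified_user_id, room, role, ttl=300):
    """Sign a short-lived session credential for an already-verified identity."""
    payload = json.dumps({
        "sub": verified_user_id,
        "room": room,
        "role": role,
        "exp": int(time.time()) + ttl,
    }).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(payload + sig).decode()

def verify_token(token, room):
    """Admit a participant only if the signature, room, and expiry all check out."""
    raw = base64.urlsafe_b64decode(token.encode())
    payload, sig = raw[:-32], raw[-32:]  # SHA-256 HMAC is always 32 bytes
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None  # forged or tampered credential
    claims = json.loads(payload)
    if claims["room"] != room or claims["exp"] < time.time():
        return None  # wrong session or expired
    return claims
```

The point of the design is that an attacker who has synthesised a face but never passed the issuer's identity check simply has no valid credential to present, regardless of how convincing their video feed is.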
The Coalition for Content Provenance and Authenticity (C2PA) standard is backed by founding members including Adobe, Arm, BBC, Intel, Microsoft, and Truepic. It provides a framework for cryptographically signing media at the point of capture, creating a verifiable chain of provenance that links a video stream back to a specific, authenticated device. Applied to video conferencing, this would allow platforms to attest that a stream originated from a genuine device rather than a synthetic generator.
C2PA adoption in live video conferencing is still at an early stage. C2PA 2.3, released in December 2025, extended the standard to live streaming, but implementation in conferencing clients remains experimental. There is also a known limitation: many platforms strip embedded metadata during transcoding, which can break the provenance chain. These are solvable problems, and C2PA represents the most promising long-term architectural direction for deepfake video call detection at scale.
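The provenance idea behind C2PA can be illustrated with a heavily simplified hash chain: each media segment is bound to its predecessor and signed with a key held by the capture device. This is a sketch of the concept only; real C2PA manifests use X.509 certificates and a defined claim format, not the HMAC stand-in shown here.

```python
import hashlib
import hmac

DEVICE_KEY = b"capture-device key"  # stands in for a device attestation key

def sign_segment(prev_sig, segment_bytes):
    """Chain each media segment to its predecessor, so any splice breaks the chain."""
    digest = hashlib.sha256(prev_sig + segment_bytes).digest()
    return hmac.new(DEVICE_KEY, digest, hashlib.sha256).digest()

def verify_stream(segments, signatures):
    """Re-derive the chain; True only if every segment is intact and in order."""
    prev = b"\x00" * 32  # agreed initial value for the first segment
    for seg, sig in zip(segments, signatures):
        if not hmac.compare_digest(sig, sign_segment(prev, seg)):
            return False
        prev = sig
    return True
```

Because every signature depends on all preceding segments, substituting a synthetic segment mid-stream invalidates the rest of the chain, which is exactly the property a provenance-aware conferencing client would check, provided transcoding does not strip the signatures along the way.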
Liveness detection systems require participants to perform random physical actions (tracking a moving object, turning their head to a specific angle, blinking on cue) that generative models cannot anticipate and synthesise in real time. Combined with challenge-response protocols, liveness detection raises the cost of AI video call impersonation attacks.
That said, liveness detection is most effective against presentation attacks, where someone holds a photo or plays a video to the camera. It is weaker against the injected-stream attacks described earlier in this post, where a synthetic feed is inserted directly into the video pipeline and can be engineered to respond to challenges. Treat it as one useful layer, not a standalone defence.
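The challenge-response pattern can be sketched in a few lines: the server picks an unpredictable action and measures how quickly the participant complies. The challenge list and latency threshold below are illustrative assumptions; real systems also analyse the video itself to confirm the action was performed.

```python
import secrets
import time

CHALLENGES = ["turn head left", "blink twice", "look up", "raise right hand"]

def issue_challenge():
    """Pick an unpredictable action so it cannot be pre-rendered by a generator."""
    return secrets.choice(CHALLENGES), time.monotonic()

def check_response(issued_at, responded_at, max_latency=3.0):
    """A genuine participant reacts within a tight window; a pipeline that must
    synthesise the requested action frame by frame tends to respond late."""
    return (responded_at - issued_at) <= max_latency
```

The timing bound is what raises the attacker's cost: a human reacts in well under a second, while a rendering pipeline adds generation latency on top of the network round trip.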
The Zero Trust principle, 'never trust, always verify', translates directly to video conferencing security. A Zero Trust video identity framework means:

- No participant is trusted by default; every join requires a verified identity and a valid credential, regardless of network or device.
- Roles and permissions are granted at the minimum level required and enforced server-side, not client-side.
- Session integrity is verified continuously, not just at the moment of entry.
Video call identity verification at Digital Samba is built on a fundamentally different model from AI-based detection. The approach is architectural: prevent unverified participants from joining in the first place, rather than attempting to identify synthetic media after it has appeared on screen.
Digital Samba's end-to-end encryption implementation includes security verification codes, which are short cryptographic fingerprints derived from the session's encryption keys. When two participants compare their verification codes out-of-band (by voice, by message, or visually), they can confirm cryptographically that no man-in-the-middle is present and that both parties are genuinely connected to the same encrypted session.
That's not AI video call analysis. It's a mathematical proof. If the codes match, the session is authentic. The check cannot be spoofed by a synthetic video feed, because the attacker would need to compromise the cryptographic keys to generate a matching code, not just replicate someone's face.
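The principle behind such verification codes can be shown in a few lines: both parties derive a short, human-comparable code from the shared session key and compare it out-of-band. The derivation below is a simplified illustration, not Digital Samba's actual scheme.

```python
import hashlib

def verification_code(session_key, digits=8):
    """Derive a short, human-comparable fingerprint from the shared E2EE key.

    Both participants compute this from their own copy of the key. A
    man-in-the-middle holds different keys with each side, so the codes
    the two victims read out would not match.
    """
    digest = hashlib.sha256(b"verify-code|" + session_key).digest()
    num = int.from_bytes(digest[:8], "big") % (10 ** digits)
    return f"{num:0{digits}d}"
```

Matching codes are evidence about the keys themselves, which is why no amount of synthetic video can forge the check: the attacker would have to break the key exchange, not the camera.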
Every Digital Samba session can be configured to require a signed authentication token for entry. These tokens are issued by the platform to participants who have been pre-verified by the host application. A participant without a valid, unexpired token simply cannot join.
In practice, deepfake defence starts at your user management layer. Whoever issues the token controls who gets in. If your onboarding, HR, or financial systems issue tokens only to verified identities, synthetic participants cannot obtain the credentials needed to join your calls. This does assume your identity management layer is secure upstream; token authentication is as strong as the issuance process behind it.
Digital Samba's RBAC system is enforced server-side. Participants join with a specific role (host, moderator, or participant) and cannot escalate their permissions through client-side manipulation. This matters in AI impersonation scenarios where an attacker might try to gain host or moderator privileges to manipulate meeting content, remove legitimate participants, or access sensitive shared resources.
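Server-side enforcement boils down to one rule: authorisation decisions read the role from the server-verified credential and ignore anything the client asserts about itself. The permission table and function below are a hypothetical sketch of that pattern, not Digital Samba's implementation.

```python
# Hypothetical permission table: role -> allowed actions
PERMISSIONS = {
    "host": {"mute_others", "remove_participant", "share_screen", "speak"},
    "moderator": {"mute_others", "share_screen", "speak"},
    "participant": {"speak"},
}

def authorise(token_claims, requested_action, client_claimed_role=None):
    """Decide from the server-verified token only. Whatever role the client
    claims for itself is deliberately ignored, so it cannot escalate
    privileges through client-side manipulation."""
    role = token_claims.get("role", "participant")  # never client_claimed_role
    return requested_action in PERMISSIONS.get(role, set())
```

An impersonator who talks their way onto a call as a participant still cannot remove people or take over screen sharing, because the server checked the signed role, not the persona on camera.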
Digital Samba runs all AI-powered features (transcription, live captions, meeting summaries) only on self-hosted models. No meeting audio, video, or content is sent to third-party AI providers for processing.
For security-conscious organisations, this matters for data containment: platforms that route meeting content through external AI services create exposure to infrastructure you don't control or audit. Digital Samba's approach keeps meeting data within the platform's own infrastructure, and the same principle will apply to any future AI-based identity verification features as the capability matures.
A video call deepfake scam targeting your organisation is unlikely to be stopped by any single control. The most resilient approach is layered:

- Cryptographic access control: signed, short-lived tokens issued only to verified identities, so unverified participants never join.
- Session integrity: end-to-end encryption with out-of-band verification codes to rule out man-in-the-middle interference.
- Server-side role enforcement: permissions that cannot be escalated from the client.
- Real-time detection and liveness checks as a supplementary in-call layer.
- Human protocols: mandatory out-of-band callback verification for any high-value or unusual request, no matter how convincing the person on screen appears.
The video call deepfake scam threat will not diminish. Generation technology is becoming faster, cheaper, and more accessible every month. The organisations that will be resilient are those that treat video call identity as a security domain, not just a technical convenience.
The Arup case established a proof of concept that the security community cannot ignore: a convincing enough deepfake video call can deceive even trained professionals into authorising catastrophic financial decisions. The technology that enabled it has only become more accessible and more convincing since.
The answer is not to distrust video calls, because they're too valuable to abandon. The answer is to secure them the same way you secure any other high-stakes communication channel: with verified identity at the point of access, cryptographic session integrity, and layered controls that don't depend on human visual perception alone.
Digital Samba's approach is based on token authentication before joining, E2EE with cryptographic verification codes, server-side RBAC, and self-hosted AI processing. Together, these address the platform layer. Paired with clear human protocols for out-of-band verification, they cover both the technical and process failures the Arup case exposed.
Download our Security Whitepaper for full technical architecture details including encryption specifications, access control implementation, and audit logging.
Talk to our team to discuss your organisation's video security requirements and see these features in action.