The AI in Business Podcast

Why Ensemble Architectures Win Against Real-Time Voice Risk - with Mike Pappas of Modulate

29 min
Mar 20, 2026
Summary

Mike Pappas, CEO of Modulate, discusses how contact centers have become major fraud surfaces where traditional text-based AI systems miss critical voice signals. He explains how ensemble listening models (ELMs) using 100+ specialized models can detect real-time fraud through voice analysis, emotion detection, and deepfake identification that general-purpose LLMs cannot capture.

Insights
  • Real-time fraud detection in voice calls requires analyzing audio signals that are lost when conversations are reduced to text transcripts
  • Ensemble architectures with specialized models outperform general-purpose LLMs for fraud detection by preserving voice nuances like emotion, pauses, and background audio
  • The cost of fraud prevention includes hidden expenses like agent attrition, regulatory penalties, and user friction from overly cautious security measures
  • Transparency in AI decision-making is crucial for building trust with fraud analysts and meeting regulatory requirements
  • Organizations must evaluate voice AI solutions based on adaptability to evolving fraud techniques, not just current performance metrics
Trends
  • Contact centers evolving from service channels to active fraud surfaces requiring real-time protection
  • Fraudsters using sophisticated social engineering techniques including fake background audio and deepfake voices
  • Shift from post-incident fraud detection to real-time prevention during live interactions
  • Growing regulatory scrutiny requiring explainable AI decisions in fraud detection systems
  • Ensemble AI architectures replacing monolithic models for specialized use cases
  • Voice biometric storage creating new regulatory compliance challenges
  • AI systems designed for adversarial detection rather than helpful assistance
  • Cost optimization through specialized smaller models versus large general-purpose models
Companies
Modulate
Builds voice intelligence systems for real-time fraud detection using ensemble architectures
People
Mike Pappas
CEO and co-founder of Modulate, expert in voice AI and real-time fraud detection systems
Quotes
"The worst harms tend to happen when you only notice the fraud after the fact and the transaction has been completed."
Mike Pappas
"LLMs are sycophantic. They're not designed to scrutinize. They're designed to be supportive."
Mike Pappas
"If you can't hear the audio, you're never going to pick up on it when someone genuinely calling says the baby's crying but that baby's voice is artificial."
Mike Pappas
"At this point, Modulate's ELM consists of over 100 different models that are looking in different ways at that original voice content and connecting the dots."
Mike Pappas
Full Transcript
3 Speakers
Speaker B

Welcome everyone to the Emerj AI in Business podcast. Today's guest is Mike Pappas, CEO and co-founder at Modulate. Modulate builds voice intelligence systems designed for real-time and large-scale deployment, with an emphasis on traceable outputs and enterprise governance requirements. Its core approach draws on ensemble architectures that preserve more of what's unique about voice beyond transcripts alone, so organizations can detect and act on conversational risk signals with low latency and clearer operational accountability. Mike joins us to explain why the contact center has become a major point of fraud and why text-based systems miss the signals that surface in real voice. He also outlines how a multimodal approach improves real-time detection and reduces the downstream costs tied to missed fraud, agent strain and regulatory exposure. The conversation gives leaders a clear lens for evaluating voice AI investments based on accuracy, speed and the ability to adapt as threat patterns evolve. Today's episode is sponsored by Modulate. Before we begin, a quick note for our executive listeners. Emerj invites enterprise leaders who are driving meaningful AI initiatives to share what they're learning with a peer audience. If you're moving real projects forward and want to be part of that conversation, you can learn more at go.emerj.com/expert. That's go.emerj.com/expert. Now, the conversation with Mike.

0:12

Speaker A

Mike, welcome to the show. It's great to have you here.

1:41

Speaker C

Thanks so much for having me, Nick. Excited to chat.

1:43

Speaker A

Mike, in customer service environments, most risk detection still happens after the interaction, after the fact: once the funds are gone, accounts are compromised or disputes are filed. But as the tactics of fraud practitioners, if I could put it that way, become ever more sophisticated and social engineering is coming into play, the contact center itself is increasingly this point of exploitation, as opposed to us always seeing it as a place of resolution. I guess this shift is raising a whole bunch of new questions, new complexities about where risk is actually starting to materialize in real time and how visible it is when it does. So I guess to start off, when fraud, social engineering, as I mentioned, or abuse happens inside a live call, where do you see the first real business impact? Is it on losses, agent churn, compliance, or something else entirely?

1:48

Speaker C

Yeah, I appreciate you teeing that up, Nick. As you say, there's been a lot of evolution in this space in the last few years, both in terms of the techniques that fraudsters can use, but also in terms of what the technology has become capable of for those of us who are trying to protect against these bad outcomes. I will say I don't generally think about which harm happens first so much as sort of where are the worst harms?

3:00

Speaker A

Right, right.

3:25

Speaker C

It is important to be thinking about how you can catch this in the act, because the worst harms tend to happen when you only notice the fraud after the fact and the transaction has been completed. You know, they've escaped with the money, and now you as an institution are potentially liable, or it was your own money that you just gave away. So there's a lot of good reasons why it's very important to be detecting in real time. You can say, oh well, it's better to detect fraud after one minute than after two minutes. And that does reduce your handle times and get your agents back into the flow. There is some benefit there. But for me, I'd say the bigger sort of business impact is thinking just about: did you prevent it, or did you only notice and do something about it? If you are a retail institution and you are yourself being defrauded, obviously failing to prevent it can have very immediate negative repercussions for you. It also is extremely exhausting for agents who are trained to try and create a pleasant customer experience. These fraudsters are exploiting that training. So it's part of what makes it such high attrition to work in that kind of a call center space. And that has its own sort of hidden costs as you try and keep those folks or train up their replacements. If you're a financial sector provider, like a bank, you're in the trust business.

3:26

Speaker A

Right.

4:51

Speaker C

And again, if you allow someone to defraud your tellers and access your funds, it doesn't matter if you say, hey, I'm going to make you whole. At that point you've lost the end user's trust that you are the person they can rely on to protect their funds. And you're still likely to see sort of a loss of clientele from those kinds of incidents. So that's what to me it looks like when you say, hey, it's really important to catch this stuff early. It's catch it in the act more than anything.

4:52

Speaker A

Right, right. And I appreciate your framing of, rather than focusing on the first sign of risk or fraud, where is it having the most impact? What is the most... valuable is the wrong term here, but what is the costliest of risks? And I'm sure, of course, these vary depending on what vertical, what industry you're in, et cetera. But if you could perhaps cast your mind to things that you've seen more recently, where would you say some of those kind of costlier fraud... what term would you use here? Fraud incidents, I guess, recurring fraud incidents,

5:21

Speaker C

greater cost of fraud as we can just.

6:20

Speaker A

Right, right. So where would you say, if you've perhaps seen a pattern or something like that, where are those kind of costlier surfaces where fraud is occurring?

6:22

Speaker C

Yeah, I mean, I, at a high level, I do kind of break it down into some of these categories. So there's the immediate financial cost of fraud, which is fairly obvious. There's subtler costs, like I mentioned, of, you know, the cost of lost trust, the cost of agent attrition. There's regulatory costs. Sometimes if you allow this fraud to get through, you actually get hit twice because then the regulators are coming in and penalizing you.

6:33

Speaker A

Worst-case scenario. Right, exactly.

6:59

Speaker C

But there's also sort of the flip side of the cost of prevention. If you're using the wrong tools or if you haven't sort of orchestrated them in the right way, you can actually create excess costs on yourself because you're so paranoid about trying to reduce these kinds of fraud. That can be a regulatory challenge in and of itself. But it also is something that we see as a user friction challenge. So going back to things like user trust and everything, if you're introducing all of these additional checks, let's say you want to use voice as an ID check to prevent fraud. Well, before we even talk about deepfakes and the ability to fake those voices, now you are potentially storing biometric data. So you've exposed yourself to much greater regulatory risk. You've introduced a new step in the call process that possibly adds up to a minute of extra friction into that call flow. So you have greater sort of drop-offs, you have greater risk of the handle time sort of extending out. That now means that fraudsters, not even by being on the call, are now making it so that you can't handle as many calls as fluidly. And so you need to increase the staffing or you need to increase other costs of that. There's all those sort of hidden costs too. But if you want to say, you know, hey, I just got defrauded, how is this going to hurt me the most? The answer there, aside from the immediate monetary one, I would say, is in that regulatory side: now you need to be able to justify what you are doing differently. And if you're not already building a system that is designed to prevent fraud, well, it can be very hard to say, here's how we're going to learn from that.

7:02

Speaker A

Right, right. I mean, across industries and leaders that we speak to, it often comes back to that: are your foundations strong? Do you have the infrastructure built in a manner that is going to serve not just the organization and not just your employees, but that kind of, I guess, cooperative meshing of all three into, hopefully, seamless workflows and fast resolutions and all of that wonderful stuff. But, you know, we hear a lot of the naysayers who are moaning that AI is all hype, or it launched with a kick and it's taking too long. Your answer right now perfectly illustrates why these things are taking a while. And I mean, objectively, they're going at lightning speed, so I don't see how anyone can be complaining. But the complexity involved just means that these processes are time consuming. Soul searching is maybe the wrong term, but for a company to do its soul searching and truly get to the heart of things and ensure they're doing things right is of utmost importance.

8:43

Speaker C

Yeah, I mean, from both a user perspective and a platform perspective, if you've deployed a preventative solution that you don't actually trust, then you haven't sped anything up, no matter how fast that thing is, because you're still just going to go do that secondary check. I think a caller hearing, oh, you're going to run me through this process? Well, that's not going to help. That frustrates them that they're wasting their time. You as a platform implementing these things might end up saying, hey, that actually only catches some percentage of these instances. So we need to add additional layers and before you know it, you've actually extended the friction far beyond what you realize.

10:01

Speaker A

Right. If we think of the contact center becoming, I guess, an active risk surface, not just this service channel as it's kind of always been to a certain degree, it obviously puts new pressures on the tech interpreting those interactions in real time and doing a million other calculations. A lot of organizations are looking to general-purpose LLMs to support both the service side of things and the risk side of things. But live voice introduces constraints that don't exist in text-based environments. So with that in mind, when latency, scale and reliability, for example, are non-negotiable, what are the most common points of failure that you're seeing today?

10:40

Speaker C

Yeah, I mean these LLMs are not always the most reliable, it's true, but I wouldn't say that itself is the core problem. The core problem, as you mentioned, is sort of reliability and accuracy. If you are trying to understand whether someone is fraudulent, first of all, you are by definition in an adversarial relationship where they're trying to disguise themselves from you. And you need to see the subtle signs that what they're saying is not actually legitimate. That is not what most AI systems were built for in the first place. They were built to be a home assistant where you're trying to explain exactly what you want and they're helpful. And you see this in LLMs, right? LLMs are sycophantic. You say, oh, I'm wondering what color is the sky? And they'll trip over themselves to tell you what an amazing, wonderful question it was. That is, all these systems are designed to be supportive. They're not designed to scrutinize: why are you asking me this thing? When you're using an LLM, first of all, you're using a system that does not have the introspection ability to say, is this a legitimate actor? Or could this be a sign of fraud? Then on top of that, because it's an LLM, and large language models are text based, you're reducing this down to text. You're losing the emotion, the nuance, all the other stuff happening in voice. And that's often where the giveaways are. There's a bunch of examples that I could cite, but an easy one is someone genuinely calling an IT help line saying, hey, I need you to reset my password. No, I can't get there, the baby's crying. I'm trying to deal with all this. And if you know what you're doing, you can tell that that baby's voice is artificial. That is a recording of a baby from a completely different environment. But if you can't hear the audio, you're never going to pick up on it.

11:36

Speaker A

Sure, sure.

13:26

Speaker C

So using LLMs can work for saying, are you following the script? Are you adhering to policy? Though sometimes they'll hallucinate your policy too.

13:28

Speaker A

Sure.

13:37

Speaker C

But it's something they can at least in theory handle. Trying to say, hey, can I tell when someone who is trying to make me think they're a real actor is actually not? For that, you just do not have access to the data and the insight that you need to ask that question.

13:37

Speaker A

Well, yeah, absolutely. I have this conversation regularly with my own colleagues about the email that everyone is losing their minds about, but if it was spoken in voice, one would realize, oh, this isn't particularly urgent, or no, that person is not angry with you, et cetera. So yeah, the example you've given here of a fake baby crying and things like that speaks to, not that it's particularly sophisticated, but the brazenness and the boldness of the threats today, which is super fascinating. Obviously these are the problems that you're trying to tackle, right?

13:55

Speaker C

Yeah, exactly.

14:39

Speaker A

Yeah. So I guess this brings us towards questions of design and how one goes about developing something that takes all of this complexity into account. If a single general-purpose model struggles to meet the performance and reliability demands of real-time voice, for example, I guess one could say that the architecture itself, not just the model, needs to be improved, needs to be looked at in a slightly different way. It starts to look different in production environments. We're seeing in many different places new approaches to developing models that are catering to ever more specific niches. What's the simplest way that you could put into words for business leaders the idea of an ensemble listening model, how that differs from the kind of traditional monolithic model, and why that matters in production and day-to-day usage?

14:41

Speaker C

Yeah, so this idea of an ensemble listening model, or ELM, is an architecture that Modulate developed because our whole business is understanding the real meaning in voice. And as we just talked about, traditional tools like LLMs just don't actually get to that detail and that meaning appropriately. So the initial idea of the ELM was: what if you could combine the kind of text-related intelligence of an LLM with deep understanding of the emotion that someone was expressing those words with, with the intimate pauses that reveal dissatisfaction or frustration, with timbre analysis that tells you things about the speaker that you can infer, all these different extra elements: deepfake detection, interruptive analysis. The list gets longer and longer as you really start to think about all of the hidden things that we as humans take for granted in a voice conversation. At this point, Modulate's ELM, Velma, consists of over 100 different models that are looking in different ways at that original voice content and connecting the dots. So you might have had in the past sentiment analysis tools deployed in your call center. And so you could in theory have said, hey, I have a transcript of what was said and I have the sentiment analysis tool telling me that this line was sarcastic, but there's actually not really anything out there saying: since that line was sarcastic and the words were "nice job," the meaning was clearly that they were not satisfied by the job. Actually connecting those things and then using that to answer intelligent questions or provide a summary of the call or talk about trends of what's happening in the call in a way that's informed by those connected dots, that's just actually a totally different problem. So the idea of the ELM is: where an LLM is designed for analyzing text really well natively, the ELM is designed to truly natively understand the ins and outs of audio.
And so for any application where the audio matters, and as we just talked about, that's very true for fraud in particular, ELMs are going to be able to perform a lot more accurately. There's two other benefits of ELMs that actually fell out of this that were not the original design goal. Interesting. And it's that it's more transparent and it's actually way more cost effective than using... The transparency is a consequence of this idea of the ensemble of hundreds of different models. When we say, hey, we think this is fraud, we can tell you that we think that because the deepfake detector is at 93%, the sense of urgency that we got is at 82% for it being sort of illegitimate in some way, and we've noticed three separate attempts to bypass policy. And we can actually lay that logic out because each of these different models is analyzing it independently, whereas with a single model like an LLM, you can never really get that clear explanation of why it made the determination.

16:01
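
Editor's sketch: the transparency point above (a fraud flag that arrives with the per-model evidence behind it) can be illustrated with a small Python example. This is only a minimal sketch under stated assumptions, not Modulate's implementation. The detector functions, the feature names in the `call` dictionary, the plain averaging rule, and the 0.7 threshold are all hypothetical; only the example figures (93%, 82%, three bypass attempts) come from the conversation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Signal:
    name: str     # which specialized model produced this signal
    score: float  # 0.0 to 1.0, higher means more suspicious
    detail: str   # human-readable evidence for the analyst


# Hypothetical stand-ins for specialized models. Real detectors would
# analyze audio; these only read precomputed features from a dict.
def deepfake_detector(call: Dict) -> Signal:
    s = call["synthetic_voice_prob"]
    return Signal("deepfake", s, f"synthetic voice likelihood {s:.0%}")


def urgency_detector(call: Dict) -> Signal:
    s = call["urgency_prob"]
    return Signal("urgency", s, f"illegitimate urgency likelihood {s:.0%}")


def policy_bypass_detector(call: Dict) -> Signal:
    n = call["bypass_attempts"]
    # Assumed scaling: three or more attempts saturate the score at 1.0.
    return Signal("policy_bypass", min(n / 3, 1.0), f"{n} attempts to bypass policy")


def assess(call: Dict, detectors: List[Callable[[Dict], Signal]],
           flag_at: float = 0.7) -> Dict:
    """Run every detector independently, then combine and explain."""
    signals = [detect(call) for detect in detectors]
    # Assumed combination rule: an unweighted mean of the scores.
    risk = sum(s.score for s in signals) / len(signals)
    # Each detector keeps its own evidence, so the flag can be justified.
    reasons = [s.detail for s in signals if s.score >= flag_at]
    return {"risk": round(risk, 2), "flagged": risk >= flag_at, "reasons": reasons}


call = {"synthetic_voice_prob": 0.93, "urgency_prob": 0.82, "bypass_attempts": 3}
result = assess(call, [deepfake_detector, urgency_detector, policy_bypass_detector])
print(result["flagged"], result["reasons"])
```

The design choice being illustrated is that each signal keeps its own score and description, so an analyst sees which detectors drove the flag rather than a single opaque number. A production system would presumably combine signals with something more careful than a plain mean.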

Speaker A

Yeah. So does that help with, I mean, the black box issue? So now you do have a bit of insight into why something was flagged and the nuances of what's actually gone on in this particular incident.

19:07

Speaker C

Yeah, it helps in two ways. It helps the actual fraud analysts or even, you know, live agents, if we're providing that agent assist, because it gives them something better to ground to. Instead of "this is suspicious, be afraid," we're saying, like, hey, here are the specific things you need to interrogate. But it's also helpful in a way I mentioned before, because you need the platform to trust that our tool is doing a good job. And if we're a black box that sometimes spits out "it's fraud," you can do some statistics on that and say, how accurate is it? But it's really hard for you or your agents to individually trust in the moment that that thing is making a good determination. Whereas since we are laying this all out, we can actually much more quickly build that trust and get a platform to feel comfortable deploying our solution at scale. And by the way, that transparency also helps on the regulatory side. If you're worried about things with the AI Act, it's very valuable to be able to say here's why we determined this was a fraudster. It was not secure. Less. So there's all those benefits. And then the cost piece that I mentioned is simply, even though there's 100 models, they can all be highly specialized, so they can not only be more accurate than a single generic model, but they can each be much smaller because they know exactly what they need to do. An LLM needs to know how to build a boat and about color theory and physics, and you just don't need all those things in this kind of detector. So you end up being able to shave off all that extra compute and run the ELM much more cost-effectively.

19:20

Speaker A

Yeah, yeah, that's super, super impressive. And I'm just kind of thinking from a practical standpoint. You've mentioned a couple of these big wins provided by the ensemble listening model, or at least that approach. If we're thinking about that kind of specialization, where are the biggest performance gains coming from? Is it kind of across everything? Are there particular avenues in which you're seeing huge gains, perhaps less in a different sphere, or again, is it that kind of, everything improves now?

20:55

Speaker C

Yeah, I mean, it depends what kind of fraud you're looking at. Right. If you narrow your scope to deepfake fraud, we've got the best deepfake detector in the world. And if you use that, you don't need too much of the other stuff to detect these things pretty well. The only real use case in that scenario is that not every synthetic voice is actually fraudulent. There are legitimate reasons someone might use a synthetic voice if they can't speak normally or in a context where it's not appropriate. And so in those settings, you do want to be able to look at those other pieces and say, is this a deepfake being used for fraud or just a synthetic voice?

21:37

Speaker A

Yeah.

22:15

Speaker C

In contrast, if you're saying, hey, I want to watch for social engineering, then you might see deepfake voices, but that's certainly not the dead giveaway. You want to analyze the behavioral characteristics. You want to look for that sense of urgency. You want to listen to the background noise and say, hey, is that background noise consistent or is it a recording of a baby? So that's where you need a lot of different pieces, all connecting the dots. On the other hand, if you just want policy violations, you can get that pretty well from the text alone. You still want really good transcription to make sure that you're analyzing that in a noisy environment. But again, you do want to have an understanding of why that policy is potentially being violated in this case. Is the agent hearing something authentic in that voice? And you want to be able to talk to the agent, again through agent assist in real time, and say not just, hey, we see that you're choosing to exercise this exception, but: we see you're choosing to exercise this exception, and we do want to flag that these three indicators make this more high risk. So you should be prepared to defend your decision if you do go through with this.

22:16

Speaker A

Yeah, yeah. Super fascinating. And just so many considerations, it kind of boggles the mind. If we do take a step back, though, we've talked a little bit about where real-time risk shows up, why these general-purpose models might not be fit for purpose in the kind of live environment, and then a little bit about the ensemble approach and how it's addressing those gaps architecturally. The next question, I guess, becomes a little bit less about the technological design and goes more to the higher end of leadership and investment judgment. So given what you guys are seeing on the ground, given your experience, if you were advising a leadership team evaluating their voice AI investment today, what decision criteria do you think should matter most beyond demos and pilot results? How should we be thinking about deploying in a way that is effective and resilient?

23:19

Speaker C

Yeah, I mean, ultimately, if you're in this business and trying to deploy a solution, there are actually only really three criteria that you care about. What does it cost, how accurate is it, and how quickly can it find me these instances in the first place? There's a lot of things, as we've talked about, that flow into questions of what does it cost? There's the potential regulatory cost if you can't report on that appropriately, and all of these different issues. I don't want to make it sound like it's so monotone, but ultimately those are the three questions you care about. What I would advise anyone who's trying to assess those things is, first of all, when you're assessing accuracy, you've got to have faith that it will keep up with fraudsters. So, yes, run your statistics, make sure that you have a good understanding of its performance today on the fraud that you're seeing today, but also be aware that you're going to see different fraud tomorrow because the techniques will evolve. And make sure that part of what you're assessing is: do I have trust, whether because of transparency or because of the partnership with the vendor in some way, that this solution will be able to keep up and evolve and will not just be a point-in-time solution that falls behind? That's probably the biggest thing I see people missing. And I think it's just because it's scary and nebulous to ask what is going to happen next. And I'm not saying you need to be a tech expert, but you do need to have an understanding of whether this solution is multifaceted and dynamic enough that you can imagine a way that it could adapt to new kinds of threats. If it's too narrowly focused on specific kinds of threats today, like many solutions that are just deploying deepfake detectors and saying good, we've solved fraud, then yes, that works well on that kind of fraud.
But if people find different techniques like, you know, good old-fashioned social engineering, you're suddenly going to start missing all that stuff.

24:28

Speaker A

Absolutely. Yeah, Mike, I mean really, really great insights, and just providing a bit of a framework for leadership and our audience to think about how they can go about reacting to all of the latest threat vectors, all of the latest model developments. We really appreciate your insight and would love to have you back on the show someday. I think we could have many a long conversation about the nuances of all of this. This was a great one. I think our audience are going to get a lot of value out of it. Have a fantastic day, and we look forward to seeing you next time, perhaps.

26:30

Speaker C

Same to you. I enjoyed the conversation, happy to be a part of it. So please call me back in anytime.

27:17

Speaker B

Wrapping up today's episode, I think there are three key takeaways from our conversation with Mike. First, real-time voice interactions have become a major fraud surface, and text-only systems miss the signals that matter. Second, specialized audio-native models can surface risks earlier by detecting cues general-purpose models overlook. Finally, leaders evaluating voice AI should prioritize accuracy, adaptability to evolving threats and clear reasoning behind every risk determination. For our solutions. Bottom Business Position your brand alongside the Fortune 500 leaders defining the enterprise AI roadmap. For the opportunity to showcase your solution to the executives currently funding and scaling global initiatives, partner with Emerj. Secure your partnership at go.emerj.com/partner. That's go.emerj.com/partner. For further executive-level analysis and to join the network of leaders delivering workflow impact with AI, visit emerj.com. On behalf of the team at Emerj, we'll see you on the next episode.

27:22