How Can Healthcare Organizations Reduce AI Hallucinations and Bias?

Estenda Solutions

Jun 29, 2026

Generative AI has moved from novelty to budget line in barely two years, and healthcare leaders are now under pressure to deploy it before they have fully worked out how to govern it. That order of operations is exactly backwards, and it is where most of the risk lives. A model that drafts a confident but wrong patient message, or a screening algorithm that performs worse for part of your population, does not just create a clinical problem. It creates a trust problem that is hard to walk back.

At Estenda, we help healthcare, medtech, life sciences, and digital health organizations build software, data, and AI solutions that solve real-world problems. Our COO and co-founder, RJ Kedziora, recently joined the Healthcare Theory podcast to discuss what responsible AI looks like in healthcare today.

How Can Healthcare Organizations Reduce AI Hallucinations and Bias?

The short version is that there is no single switch. Reducing hallucinations and bias takes a set of practices that runs from how you write policy, through how you source and code your data, to how you watch a model after it ships. The five practices below are the ones RJ kept returning to, because they change outcomes rather than just check a compliance box.

Keep a human in the loop on every clinical output

RJ is blunt about what generative AI is and is not ready to do on its own: "It does have a tendency to hallucinate and make things up. So there should always be a human as part of that component."

He grounds that with a specific example. Pointing to research on AI features now built into major electronic medical record systems, he recalled a study of generated patient messages at a single institution: "I think it was, like, a 113 messages they looked at. And in those messages, they found seven mistakes or hallucinations."

The lesson is not that AI-drafted communication is too dangerous to use. The draft is a starting point, never the finished product. When a model writes a patient message, summarizes a chart, or proposes a next step, a qualified person has to review it before it reaches anyone.

For healthcare, medtech, and life sciences teams, that means designing the workflow around review from day one, not bolting it on after a near miss. The point where the human checks the output is not friction to be optimized away. It is the safety mechanism. Research on large language model responses to patient messages has flagged the same failure modes RJ describes, including confabulated content and factual inaccuracy, which is why review belongs in the design, not the disclaimer.

Measure AI errors against humans, not against perfection

One of RJ's sharpest points is about how we judge these systems. The research that flags AI mistakes, he argues, often forgets to ask the obvious comparison question: "I find it fascinating in a lot of these articles and research studies that look at the problems and challenges of generative AI... they don't compare to humans. Humans make mistakes too."

A tired clinician working through a full inbox makes errors too. If you hold AI to a standard of zero mistakes that no human meets, you will reject tools that would actually raise your average quality. He pairs that with a finding that surprises a lot of providers: patients often prefer the AI-assisted version. As he put it, "patients time and time again in research studies are saying the generated message is more empathetic. They like it better than what their provider is responding to."

For decision-makers, this reframes the evaluation. The question is not "is this model perfect?" but "does this model, with a human reviewing it, beat our current human-only baseline on accuracy, empathy, and turnaround?" Set that benchmark honestly and you make better adoption calls. This is the kind of evidence that our healthcare and software research team builds into a deployment, so the comparison is measured rather than assumed.
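To make that comparison concrete, here is a minimal sketch of how a team might put error rates side by side with uncertainty attached. It is an illustration, not a method described in the podcast. The only real figures are the 7 mistakes in 113 generated messages that RJ cites, which works out to about 6.2%, with a 95% interval of roughly 3% to 12% given the small sample. The human-only and reviewed-draft counts are hypothetical placeholders, and the function names and the choice of a Wilson score interval are our assumptions.

```python
import math


def wilson_interval(errors: int, total: int, z: float = 1.96):
    """Wilson score interval for an error proportion (about 95% when z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = errors / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def summarize(label: str, errors: int, total: int) -> dict:
    """Bundle the observed error rate with its interval for reporting."""
    low, high = wilson_interval(errors, total)
    return {"label": label, "rate": errors / total, "low": low, "high": high}


# From the study RJ cites: 7 mistakes in 113 generated messages.
ai_drafts = summarize("AI draft, before review", 7, 113)

# HYPOTHETICAL placeholder counts. Replace with your own audited samples.
human_only = summarize("Human-only baseline (placeholder)", 9, 120)
ai_reviewed = summarize("AI draft + human review (placeholder)", 3, 120)

for row in (ai_drafts, human_only, ai_reviewed):
    print(
        f"{row['label']}: {row['rate']:.1%} "
        f"(95% CI {row['low']:.1%} to {row['high']:.1%})"
    )
```

With samples this small the intervals are wide, which is itself useful: it shows how many reviewed messages you would need before claiming one workflow beats another.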

Start with data governance, not AI policy

When clients come to RJ asking what their AI policy should say, he tends to redirect the conversation before they get far: "It's interesting as we have these AI conversations, I quickly start asking questions about data."

RJ added: "Okay, what AI policies do we need in place? Well, what data policy is in place? How are you managing your data?" His point is that hallucinations and bias are rarely born in the model. They are inherited from the data underneath it. An AI policy written on top of disorganized, poorly understood, or inconsistently coded data is a policy with no foundation.

So the first move is not to draft AI principles. It is to audit how information actually flows through your organization: how it is captured, how it is structured, how clinicians code it, and where the gaps and inconsistencies sit. Get that right and a lot of downstream AI risk simply never forms. This is why we treat data analytics and data readiness as the groundwork for any AI program rather than a later phase. For teams thinking about where AI fits in their broader plans, our perspective on responsible AI in clinical and patient-focused research walks through how governance and data decisions connect.

Use representative data and actively audit for known bias gaps

RJ does not treat bias as an abstract worry. He treats it as a property of data that humans created, which means it is already in the building: "You do absolutely have to be aware of bias in the data. And this is the big challenge in the world because humans are biased."

He gives two concrete examples that land hard with anyone building for a real population. The first is historical and still unresolved: "Women weren't really included in health care research until, like, the nineties." That gap persists today.

A 2025 systematic review of more than a thousand cardiovascular trials found women still significantly underrepresented across major cardiac conditions, which means a model trained on that data inherits the skew.

His second example is hardware. Talking about the optical sensors in popular wearables, the same technology behind many consumer health devices, he noted: "It's been demonstrated that if you have darker colored skin, they don't work as well." That is not a hypothetical. Studies of photoplethysmography sensors have shown reduced accuracy on darker skin tones because melanin absorbs part of the light the sensor relies on.

The action item is twofold. Confirm that your training data represents the people the tool will actually serve, and build in deliberate checks for the gaps you already know exist, by sex, by skin tone, by age, by anything where the historical record is thin. Assuming a dataset is neutral is how bias survives.
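Both checks can start very simply. The sketch below, which is our illustration and not a tool described by RJ, reports each subgroup's share of a dataset and its accuracy, then flags groups that are thinly represented or that perform noticeably worse than the overall figure. The field names (`sex`, `label`, `prediction`), the thresholds, and the toy records are all assumptions chosen for the example.

```python
from collections import defaultdict


def audit_subgroups(records, group_key, min_share=0.10, max_accuracy_gap=0.05):
    """Report per-group share and accuracy, flagging thin or weak groups.

    records: iterable of dicts with group_key, "label", and "prediction".
    min_share and max_accuracy_gap are illustrative thresholds, not standards.
    """
    counts = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        counts[group] += 1
        correct[group] += int(rec["prediction"] == rec["label"])

    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to audit")
    overall_accuracy = sum(correct.values()) / total

    report = {}
    for group, n in counts.items():
        share = n / total
        accuracy = correct[group] / n
        report[group] = {
            "n": n,
            "share": share,
            "accuracy": accuracy,
            "underrepresented": share < min_share,
            "underperforming": (overall_accuracy - accuracy) > max_accuracy_gap,
        }
    return report


# Toy data, invented for illustration: 8 male records (all correct)
# and 2 female records (1 correct).
toy = [{"sex": "male", "label": 1, "prediction": 1} for _ in range(8)]
toy += [
    {"sex": "female", "label": 1, "prediction": 1},
    {"sex": "female", "label": 1, "prediction": 0},
]

report = audit_subgroups(toy, "sex", min_share=0.25)
for group, stats in report.items():
    print(group, stats)
```

Run the same audit for each attribute where you already suspect a gap, such as sex, skin tone, or age band, and compare the shares against the population the tool will actually serve rather than against the dataset itself.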

Validate locally and monitor for drift instead of treating deployment as one and done

The last practice is the one teams most often skip, because it costs ongoing effort. RJ frames it around two ideas: drift and local validation.

On drift, he warns that a model is not a finished artifact: "You also have to worry about drift... you develop an algorithm, you put it into production, and as the data changes over time, your algorithm might not work as well anymore."

He draws this straight from Estenda's own work in retinal imaging, the kind of diabetic retinopathy screening we have built for real deployments. As he explained, "early camera systems only had a very small field of view. Newer cameras have a much larger field of view, so you're getting more information from the newer images. Well, the older algorithms don't work as well anymore, so you have to retrain them on current data." Change the camera, change the coding standard, change the population, and yesterday's reliable model quietly degrades.

That degradation also shows up across sites. RJ contrasts a Boston teaching hospital, where students and supervisors may code data meticulously, with a rural practice where "one doctor that needs to see 50 patients that day is massively overwhelmed, they don't code their data the same way because they're constrained by reality. So if you try and implement that same system there, it's not going to perform as well." An algorithm born in one environment is not guaranteed to work in yours.

RJ also added: "It's not a one-and-done. It's like, oh, let's create an algorithm, put it out there, and solve the world's problems. You need to continually monitor and improve on it."

For your roadmap, that means budgeting for the after-launch life of a model: local validation before you trust a borrowed algorithm, monitoring once it is live, and a plan to retrain when the data shifts. That ongoing care is exactly what implementation and support is for, and it is the difference between a model that ages well and one that fails silently.
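As one illustration of what monitoring can look like, the sketch below computes a Population Stability Index (PSI), a common way to quantify how far a live input distribution has moved from the data a model was validated on. RJ did not name a specific metric, so PSI is our choice for the example. The bin count, the epsilon floor, and the sample values are assumptions, and the often-quoted rule of thumb that a PSI above about 0.25 signals a significant shift should be tuned locally rather than taken as given.

```python
import math


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one feature.

    Bins are equal-width over the reference range; current values outside
    that range are counted in the first or last bin.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Invented example values: a feature at validation time versus later,
# after the inputs have shifted toward the upper half of the range.
validation_values = [i / 100 for i in range(100)]
live_values = [0.5 + i / 200 for i in range(100)]

print("PSI, same data:   ", population_stability_index(validation_values, validation_values))
print("PSI, shifted data:", population_stability_index(validation_values, live_values))
```

A check like this only covers input drift on a single feature. It does not replace local validation against labeled outcomes, and a new camera or coding standard may call for retraining even when a summary statistic looks stable.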

Build AI Healthcare Solutions You Can Actually Trust

At Estenda, we work with medtech, life sciences, and digital health organizations to build AI-powered healthcare solutions designed to hold up against real clinical and operational pressure, including the hallucination and bias risks that can quietly undermine an otherwise promising tool. Our experience spans healthcare strategy, custom software development, AI and machine learning, and advanced healthcare data analytics.

Book your free 30-minute consultation today or contact Estenda at info@estenda.com to start building healthcare AI solutions that create measurable impact.