Why idempotency is so important for health-tech companies!
July 2025, Sam Moreland
One of the most important things that gets missed when building data pipelines is idempotency. Now this isn’t a problem that a supplement can fix; it’s a silent bug in the system that can cause serious issues without you knowing about it.
The mathematical definition for idempotency is:
Idempotency: denoting an element of a set which is unchanged in value when multiplied or otherwise operated on by itself.
Now this is a bit vague, but when turned into data engineering terminology it means that if you re-run your data pipeline with a specified input, you get the same output. I prefer to think about it as being able to wind back the clock and view the state of the system, exactly as it was, at any point in time.
Traditionally idempotency has not been too much of an issue with medical products because they have fundamentally been hardware based. Because hardware device manufacturers are quite mature, their supply chain management is quite sophisticated and non-conformities can be detected and fixed. But with the rise of AI and data products being used in healthcare, the same level of rigour that applies in hardware supply chains needs to be applied to data pipelines.
The classic example of a non-idempotent step is computing age from a date of birth. Say you store a user's date of birth and the first part of your pipeline turns it into an age, e.g. 34. If you run that pipeline again in 5 years' time, their age would become 39. If you are using age as a feature in a model (where it is a very powerful predictor), re-running the pipeline on historical data would calculate it incorrectly and skew your model.
Another example is an error while handling a piece of data in the pipeline. The data has caused a notification to be sent to the end user, but the pipeline fails before recording that it was sent. When the data is re-run through the system, it creates another notification, which gets sent again.
Now these are very basic examples, but they happen all the time and can have serious effects on model accuracy and customer satisfaction.
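A minimal sketch of the difference, assuming each record carries its own reference date (the function names and the `measurement_date` value are hypothetical, chosen only for illustration). Computing age against the wall clock gives a different answer on every re-run, while computing it against a date stored with the record does not:

```python
from datetime import date


def age_on(date_of_birth: date, as_of: date) -> int:
    """Age in whole years on the reference date `as_of`."""
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month, date_of_birth.day)
    return as_of.year - date_of_birth.year - (0 if had_birthday else 1)


def age_today(date_of_birth: date) -> int:
    """Non-idempotent: the result depends on the day the pipeline happens to run."""
    return age_on(date_of_birth, date.today())


dob = date(1991, 3, 14)
measurement_date = date(2025, 7, 1)  # hypothetical date stored alongside the record

# Idempotent: the same stored inputs always give the same age.
print(age_on(dob, measurement_date))   # 34
# Re-running "as of" five years later reproduces the drift described above.
print(age_on(dob, date(2030, 7, 1)))   # 39
```

The design choice is simply to make time an explicit input rather than something the pipeline reads from its environment.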
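One common way to guard against this is an idempotency key: a value derived from the triggering event that is recorded when the notification is sent and checked before any send. The sketch below is illustrative only. `NotificationSender`, its in-memory set and the `event_id` field are assumptions, and a real system would need a durable store and would have to make the send and the record atomic (for example with a transactional outbox).

```python
import hashlib


class NotificationSender:
    """Hypothetical sender that refuses to send the same event twice."""

    def __init__(self) -> None:
        # Stand-in for a durable store, e.g. a table with a unique constraint.
        self.sent_keys: set[str] = set()
        self.outbox: list[tuple[str, str]] = []

    def notify(self, patient_id: str, event_id: str, message: str) -> bool:
        # The key depends only on the triggering event, never on run time.
        key = hashlib.sha256(f"{patient_id}:{event_id}".encode()).hexdigest()
        if key in self.sent_keys:
            return False  # already sent on a previous run, so skip it
        self.outbox.append((patient_id, message))  # stand-in for the real send
        self.sent_keys.add(key)
        return True


sender = NotificationSender()
first = sender.notify("patient-1", "evt-42", "Reading out of range")
replayed = sender.notify("patient-1", "evt-42", "Reading out of range")
print(first, replayed, len(sender.outbox))  # True False 1
```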
What does a non-idempotent pipeline do to our company?
Idempotency impacts your ability to respond to complaints and errors in reasonable timeframes and adapt. At best a non-idempotent pipeline will cause annoyance for your customers as you are unable to fix issues. At worst you may have to recall your product and face fines from the FDA.
A lack of idempotency can manifest in many different and serious ways:
Incorrect diagnosis / measurements: Predictions made with models built on incorrect data. Depending on the indications for use and usage environment, this could affect a patient's life.
Skewed KPIs: Double counted data causing poor analytics.
Maintenance Overhead: Engineers spend more time diagnosing problems because of inconsistent information, rather than building new features.
Lack of traceability and auditability: If you are required to perform root cause analysis and prevention (which in the healthcare space you are), you will be unable to trace errors correctly.
Duplicate notifications / alarms (alarm fatigue): Notifications and data can be double counted causing multiple alarms.
All of these are critical in the healthcare space, and a lack of idempotency leads to a huge issue with the regulatory (and moral) requirement of CAPAs.
The role of CAPAs
For those who are new to healthtech, a CAPA is a Corrective and Preventive Action. This is defined in the ISO 13485 standard and the FDA requirements in Title 21 CFR Part 820, Subpart J (it's as exciting as it sounds). It is the internal process for identifying non-conformities (errors) in your product, fixing them in a competent way and adjusting your company so that those kinds of non-conformities can no longer occur. Every complaint or error found needs to be handled through the CAPA process. It is not an optional extra. Your CAPA process is not only core to your quality management system (needed to be regulatory compliant), but will be core to you being a successful company.
If you have a non-idempotent pipeline, you will find it very difficult, if not impossible, to correctly diagnose issues that occur in data-based products. These issues might be minor, such as false notifications, which can cause customer dissatisfaction. But your product may be fundamentally flawed and unable to meet its indications for use. At this point the FDA may require you to remove your product from sale and do a recall.
You should design your data pipelines to be auditable in a CAPA process from day 1. It may be more painful initially but it can stop much worse problems down the line.
How does non-idempotency creep in?
The biggest issues will arise in new products/companies coming to market, although I have seen these issues in well established med-tech companies. There's a rush to market, where systems are designed to scale, not to be auditable. This is where non-idempotent behaviour creeps in. The company is built to serve a product to the market, not built to serve a product in the market.
It is hard to accept that building the initial product is the “easier” part when initial development timelines run 3-5 years. What happens after you go live can boost or kill your company. Things will go wrong, I promise you. You need to build not just for data in / data out, but for real time triage of the entire system. You can go further and develop systems to proactively track issues and respond to them before you get a complaint.
Two of the biggest causes of these issues are microservices architecture and data lake paradigms (I will not be covering lower level causes such as data duplication etc).
Microservices require extra overhead in communication. With direct service calls (linear calls) and single tenancy this can be less of an issue. But when there are lots of services called, the levels of complexity increase dramatically. Because of the increased complexity it can be hard to recreate the exact states within the system (especially if dealing with late data), leading to a reduced capability to enforce idempotency.
Data lakes are also an issue. As distributed compute has blown up, flexibility in the use of data to build models has been put at a premium. This has led to a move from normalized data tables into large data lakes as the core store of information. Unfortunately this has led to issues with schema enforcement, consistency and ACID transactions. This is why the rise of Apache Iceberg and data lineage tools has been so important.
But this is just the historical picture; the rise of generative AI will make idempotency even harder. One “feature” of generative AI models is their stochastic nature (if you don’t control the temperature). This means they can be inherently non-idempotent. Substantial logging will be needed, but this can also conflict with HIPAA and GDPR data protection principles, so it could potentially create a data privacy problem.
How to build idempotent systems.
Now there are lots of low-level things you can do in your data pipelines that should just be good practice:
Append-only writes with deduplication
Event sourcing with periodic snapshots
Ensure exactly-once semantics in streaming pipelines
Version data schemas & transformation logic
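As a rough sketch of the first and last items, the snippet below shows an append-only store that deduplicates on a stable event identifier and stamps each row with a schema version. `Reading`, `AppendOnlyLog` and all field names are hypothetical, and the in-memory list stands in for whatever table format you actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    event_id: str        # stable identifier assigned at the source
    patient_id: str
    recorded_at: str     # ISO 8601 timestamp
    value: float
    schema_version: int = 1  # bump when the record layout changes


class AppendOnlyLog:
    """Rows are never updated or deleted; replays are ignored by event_id."""

    def __init__(self) -> None:
        self._rows: list[Reading] = []
        self._seen: set[str] = set()

    def append(self, reading: Reading) -> bool:
        if reading.event_id in self._seen:
            return False  # duplicate delivery or pipeline re-run
        self._seen.add(reading.event_id)
        self._rows.append(reading)
        return True

    def rows(self) -> tuple[Reading, ...]:
        return tuple(self._rows)  # read-only view of the log


log = AppendOnlyLog()
batch = [
    Reading("evt-1", "patient-1", "2025-07-01T09:00:00Z", 72.0),
    Reading("evt-2", "patient-1", "2025-07-01T09:05:00Z", 75.0),
]
for _ in range(2):  # simulate the same batch being processed twice
    for reading in batch:
        log.append(reading)
print(len(log.rows()))  # 2
```

Processing the batch twice leaves the log unchanged, which is the property that keeps KPIs and alarms from being double counted.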
But there is one fundamental question to ask yourself when viewing your system: “Can I rewind the clock and view the state of a single patient in the system?”. If you can’t do that, then your pipeline may not be idempotent.
If you're interested in help building out your data pipelines and ensuring they’re compliant, please reach out to us here.
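One way to make that question answerable is to keep an immutable event log and rebuild state by replaying it up to a chosen moment. The sketch below assumes a simple event shape (`Event`, `attribute` and `recorded_at` are hypothetical names) and filters on the time the system recorded each event, so the result reflects what the system knew at that moment, even if late data arrived afterwards.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class Event:
    patient_id: str
    recorded_at: datetime  # when the system wrote the event to the log
    attribute: str
    value: Any


def state_as_of(events: list[Event], patient_id: str, as_of: datetime) -> dict[str, Any]:
    """Replay one patient's events up to `as_of`; the latest value per attribute wins."""
    relevant = sorted(
        (e for e in events if e.patient_id == patient_id and e.recorded_at <= as_of),
        key=lambda e: e.recorded_at,
    )
    state: dict[str, Any] = {}
    for event in relevant:
        state[event.attribute] = event.value
    return state


events = [
    Event("patient-1", datetime(2025, 1, 10), "weight_kg", 80.0),
    Event("patient-1", datetime(2025, 2, 1), "medication", "drug-a"),
    Event("patient-1", datetime(2025, 3, 5), "weight_kg", 78.0),
    Event("patient-2", datetime(2025, 1, 15), "weight_kg", 65.0),
]

print(state_as_of(events, "patient-1", datetime(2025, 2, 15)))
# {'weight_kg': 80.0, 'medication': 'drug-a'}
```

Because the log is never mutated, calling `state_as_of` with the same arguments always returns the same answer, however many events are appended later.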