Does Your Step Counter Undercount Women? A Bayesian Audit of a Frequentist Failure to Reject the Null Hypothesis

Jun 3

A Bayesian audit of a wearable algorithm, and the honest non-answer it gave back.

I wear an Oura ring every day, and at some point I started wondering whether it was quietly lying to me. Not maliciously, just structurally. Most wearables count steps the same way: take the accelerometer signal from a single sensor on your body, and trip a counter every time the acceleration crosses a threshold tuned to what a "typical" stride looks like. The trouble is that stride dynamics scale with body geometry. Smaller bodies produce smaller acceleration peaks. Set one universal threshold and you will, in principle, miss more steps for smaller-statured people.
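To make the failure mode concrete, here is a toy sketch of a fixed-threshold counter. Everything in it is an illustrative assumption: the half-sine "stride" shape, the peak heights, and the threshold value are made up, and no real wearable is this simple.

```python
import math

def count_steps(signal, threshold):
    """Count upward crossings of a fixed threshold in an acceleration trace."""
    steps = 0
    above = False
    for a in signal:
        if a > threshold and not above:
            steps += 1       # rising edge: register one step
            above = True
        elif a <= threshold:
            above = False    # re-arm once the signal drops back down
    return steps

def synthetic_walk(n_steps, peak, samples_per_step=50):
    """Toy trace: one half-sine bump per step, scaled to height `peak`."""
    trace = []
    for _ in range(n_steps):
        for i in range(samples_per_step):
            trace.append(peak * math.sin(math.pi * i / samples_per_step))
    return trace

THRESHOLD = 1.0  # hypothetical universal threshold, arbitrary units

larger_body = synthetic_walk(20, peak=1.4)   # peaks clear the threshold
smaller_body = synthetic_walk(20, peak=0.9)  # peaks stay just under it

print(count_steps(larger_body, THRESHOLD))   # 20
print(count_steps(smaller_body, THRESHOLD))  # 0
```

Both toy walkers take 20 steps. The counter sees all of them for one walker and none for the other, purely because of where the threshold sits relative to the peak height.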

Bodies are sexed, and women are on average smaller than men. So a bias that is really about geometry lands asymmetrically on women. That is Caroline Criado Perez's Invisible Women thesis turned into a testable claim about one specific class of algorithm. I wanted to test it with a model instead of a hunch, so I made it my final project for Georgia Tech's Bayesian Statistics course.

The setup

You cannot audit an algorithm against a better algorithm, because then you are just measuring the disagreement between two guesses. You need ground truth. The Kuopio Gait Dataset has it: 47 people walking barefoot at three speeds in a Finnish biomechanics lab, recorded simultaneously by body-worn IMUs, optical motion capture, and floor-embedded force plates. The force plates are the gold standard. They feel the foot hit the ground. The anthropometrics are caliper-measured, not self-reported, so leg length and hip width are real numbers rather than proxies.

I ran a deliberately naive threshold step-detector on the IMU signal, ran an independent heel-strike detector on the force-plate signal, converted both to cadence (steps per minute), and took the difference as the thing to model:

cadence_error = cadence_imu - cadence_plate

Then the question becomes: does that error drift systematically with sex and body geometry?
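As a rough illustration of the cadence conversion described above, the sketch below turns two lists of detected step times into steps per minute and takes the difference. The timestamps are invented, and the "intervals divided by elapsed time" definition of cadence is an assumption; the report on the repo is the authority on how the project actually computes it.

```python
def cadence_spm(step_times_s):
    """Steps per minute from a sorted list of step timestamps (seconds)."""
    if len(step_times_s) < 2:
        raise ValueError("need at least two detected steps")
    duration = step_times_s[-1] - step_times_s[0]
    intervals = len(step_times_s) - 1
    return 60.0 * intervals / duration

# Hypothetical detections over the same 3-second window.
plate_times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]  # force plate: every heel strike
imu_times = [0.0, 0.5, 1.0, 2.0, 2.5, 3.0]         # IMU: the step at 1.5 s is missed

cadence_plate = cadence_spm(plate_times)  # 120.0
cadence_imu = cadence_spm(imu_times)      # 100.0
cadence_error = cadence_imu - cadence_plate
print(cadence_error)  # -20.0
```

With this sign convention, a negative error means the IMU detector is undercounting relative to the force plates.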

Why Bayesian

47 subjects, 17 women and 30 men. That is a small, lopsided sample, and it is exactly the regime where frequentist subgroup tests fall apart and start producing the contradictory findings that litter the step-counting literature. Hierarchical Bayesian partial pooling is the right tool here. Each subject's estimate borrows strength from the population, the sex imbalance is handled honestly instead of being swept under the rug, and the output is a posterior credible interval on the bias rather than a yes/no verdict. A tight interval around zero is a real answer, not a failure to find one.
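"Borrowing strength" has a simple closed form in the textbook normal-normal case, which the sketch below uses. It assumes the within-subject noise `sigma` and the between-subject spread `tau` are known, which a real hierarchical model would estimate instead; all numbers are made up for illustration.

```python
def partial_pool(subject_mean, n_obs, sigma, pop_mean, tau):
    """Posterior mean of one subject's effect under a normal-normal model.

    sigma: within-subject noise sd (assumed known here)
    tau:   between-subject sd of true effects (assumed known here)
    """
    precision_data = n_obs / sigma**2   # how much the subject's own data says
    precision_pop = 1.0 / tau**2        # how much the population prior says
    weight = precision_data / (precision_data + precision_pop)
    return weight * subject_mean + (1.0 - weight) * pop_mean

# A subject whose raw mean error is -4.0 spm, in a population centred on 0.
few_trials = partial_pool(-4.0, n_obs=4, sigma=2.0, pop_mean=0.0, tau=1.0)
many_trials = partial_pool(-4.0, n_obs=36, sigma=2.0, pop_mean=0.0, tau=1.0)
print(few_trials)   # -2.0: pulled halfway toward the population mean
print(many_trials)  # about -3.6: the subject's own data dominates
```

The fewer observations a subject contributes, the harder their estimate is pulled toward the population mean, which is why extreme values from a sparse subgroup do not get to dominate the result.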

I fit four nested models, from a single pooled error distribution up to one that decomposes the sex effect into measured leg length, hip width, and mass, and compared them with leave-one-out cross-validation.
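The logic of comparing nested models by leave-one-out can be shown with a brute-force toy. This is a plug-in stand-in, not the PSIS-LOO that ArviZ computes from posterior draws, and the two "models" (one pooled mean versus one mean per group) and the data are hypothetical simplifications of the four models in the project.

```python
import math
import statistics

def normal_logpdf(x, mu, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd**2) - (x - mu) ** 2 / (2 * sd**2)

def loo_log_score(errors, groups, by_group):
    """Sum of held-out log densities; higher means better prediction.

    by_group=False: predict each point from all other points (pooled model).
    by_group=True:  predict each point only from others in its own group.
    """
    total = 0.0
    for i in range(len(errors)):
        train = [
            e for j, e in enumerate(errors)
            if j != i and (not by_group or groups[j] == groups[i])
        ]
        mu = statistics.mean(train)
        sd = statistics.stdev(train)  # needs at least two training points
        total += normal_logpdf(errors[i], mu, sd)
    return total

# Made-up cadence errors (spm) with no built-in group difference.
errors = [-1.0, 0.5, 0.2, -0.4, 0.3, -0.6, 0.8, -0.2, 0.1, -0.5]
groups = ["F"] * 4 + ["M"] * 6

pooled = loo_log_score(errors, groups, by_group=False)
grouped = loo_log_score(errors, groups, by_group=True)
print(grouped - pooled)  # positive favours the group-aware model
```

A difference near zero or below it says the extra grouping layer is not buying any out-of-sample accuracy, which is the sense in which a richer model can "add no predictive value".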

What the math said

No detectable demographic effect above the noise in this cohort.

The female-minus-male cadence-error gap came out at -0.32 steps per minute, with a 95% credible interval of [-2.03, +1.17]. Adjusting for body geometry widened the gap slightly rather than explaining it away (a suppression pattern, not the mediation my pre-registered hypothesis predicted), and every anthropometric coefficient's interval straddled zero. Cross-validation agreed: the demographic layers added no predictive value over a purely structural baseline.
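For readers new to posterior summaries, this is roughly how an interval like the one above falls out of a set of posterior draws. The draws here are synthetic stand-ins generated from a normal distribution, not the project's posterior, and the equal-tailed definition is an assumption (ArviZ, for instance, reports highest-density intervals by default).

```python
import random

def credible_interval(draws, mass=0.95):
    """Equal-tailed credible interval from posterior draws."""
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - mass) / 2.0
    lower = ordered[round(tail * n)]              # cut off the lowest 2.5%
    upper = ordered[round((1.0 - tail) * n) - 1]  # cut off the highest 2.5%
    return lower, upper

random.seed(0)
# Synthetic "posterior" for a group gap in steps per minute (illustrative only).
gap_draws = [random.gauss(0.5, 1.0) for _ in range(4000)]

low, high = credible_interval(gap_draws)
p_negative = sum(d < 0 for d in gap_draws) / len(gap_draws)

print(f"95% interval: [{low:.2f}, {high:.2f}]")
print(f"P(gap < 0) = {p_negative:.2f}")
```

The interval and the tail probability come from the same draws, so the summary can say both how big the gap plausibly is and how much posterior mass sits on either side of zero.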

So the audit rules out effects bigger than roughly 1.5 steps per minute per standard deviation of body geometry, and leaves the door open for smaller ones that a bigger cohort could resolve. Not the dramatic headline I half-expected. But a clean, quantified "we can bound this, and it is small" is genuinely useful, and it is the kind of answer Bayesian inference hands you naturally.

The honest version

This is a 47-person barefoot lab study in Finland. It is not the last word on whether your watch undercounts you on a real sidewalk in real shoes. What it is: a fully reproducible, end-to-end pipeline from 23 GB of raw motion-capture archives to four diagnosed models and a writeup that lands whether the effect is there or not, because the contribution is the method.

The code, the notebooks, and the full report (every number, diagnostic, and limitation) are on the repo. Phase two is an interactive tool that takes your demographics and returns the posterior bias for your profile. The interesting part of fairness work is not finding the scandal. It is being able to say, with calibrated honesty, how big the thing is and how sure you are.

Code and report: github.com/meeshmg/wearable-calibration-bayes. Built with PyMC and ArviZ. More at bizzib.ai.

Michelle Griffith