The Least Reliable Measurement in Endocrinology

A man walks into a clinic in Mexico City with fatigue, low libido, and brain fog. His doctor orders a testosterone test. The lab reports 285 ng/dL — below the 300 threshold his urologist uses — and he's diagnosed with hypogonadism. Treatment begins.

Had he walked into the lab next door, using a different assay with a different reference range, his result might have read 340. Normal. No diagnosis. No treatment. Same blood. Different future.

This isn't a hypothetical. In 2025, Gonzalez-Carranza, Morgentaler, and Reyes-Vallejo surveyed 134 laboratories in Mexico City. They found 27 different sets of reference ranges. The lower limit for "normal" testosterone ranged from 84 to 470 ng/dL, a more than fivefold difference. The same sample could be catastrophically low at one lab and perfectly normal at another.

The testosterone measurement is the single number that determines whether a man receives a diagnosis, starts lifelong treatment, or gets told his symptoms are in his head. It is also the least reliable measurement in clinical endocrinology.

Two Technologies, Two Realities

The testosterone in your blood can be measured two fundamentally different ways: immunoassay or liquid chromatography–tandem mass spectrometry (LC-MS/MS). The difference matters more than most clinicians realize.

Immunoassays — the workhorses of routine clinical labs — use antibodies to bind testosterone and estimate its concentration. They're fast, cheap, and automated. They're also unreliable, particularly at low concentrations where diagnostic decisions are made. In a 2022 College of American Pathologists (CAP) proficiency survey, 1,558 labs used immunoassay while only 40 used mass spectrometry. Some immunoassay medians differed from the CDC reference concentration by up to 44%.

LC-MS/MS measures the actual molecule by mass, which avoids antibody cross-reactivity and interference from structurally similar steroids. It is the reference method. But in 2025, Narinx et al. reported a survey of 124 European laboratories across 27 countries: only 25% use LC-MS/MS for total testosterone. Three-quarters of the surveyed labs still run immunoassay.

The CDC's Hormone Standardization (HoSt) program has certified 20 assays — 16 are LC-MS/MS, and only 4 are immunoassays (all from a single manufacturer, Siemens). The acceptable bias criterion is ±6.4%, based on biological variability data. Most immunoassays can't meet it. Yet most men's testosterone is measured by immunoassay.
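To make the ±6.4% criterion concrete, here is a minimal sketch of a bias check against a reference concentration. The function names are hypothetical, and comparing a single value is a simplification: it illustrates the arithmetic only, not the HoSt certification procedure itself.

```python
# Acceptable bias quoted in the text for CDC HoSt certification, in percent.
HOST_BIAS_LIMIT_PCT = 6.4


def percent_bias(measured: float, reference: float) -> float:
    """Signed bias of a measured value relative to the reference, in percent."""
    if reference <= 0:
        raise ValueError("reference concentration must be positive")
    return (measured - reference) / reference * 100.0


def within_host_limit(measured: float, reference: float,
                      limit_pct: float = HOST_BIAS_LIMIT_PCT) -> bool:
    """True if the absolute bias does not exceed the limit (illustrative check only)."""
    return abs(percent_bias(measured, reference)) <= limit_pct


if __name__ == "__main__":
    # An assay median 44% above the reference, as in the CAP survey's worst case.
    print(within_host_limit(144.0, 100.0))
    # A reading 2% above the reference.
    print(within_host_limit(306.0, 300.0))
```

A 44% deviation is nearly seven times the allowed bias, which is why the gap between certified and uncertified assays matters at the diagnostic threshold.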

This creates a quiet crisis: the clinical thresholds we use — 300 ng/dL, 264, 346 — were established in studies that may have used different assays than the one measuring your blood. A number generated on one platform cannot be assumed equivalent to the same number on another.

Nine Guidelines, Nine Numbers

Even if every lab in the world used the same assay, we still couldn't agree on what the number means.

In 2025, Tsampoukas et al. in IJIR compared diagnostic thresholds across nine major guideline bodies. The range was striking:

Guideline Body	Total T Threshold	Free T Used?
AACE	200 ng/dL	—
ISA	230 ng/dL	—
Endocrine Society	264 ng/dL	Yes (borderline)
AUA / FDA	300 ng/dL	No mention
BSSM	231–346 ng/dL	Yes (total T 8–12 nmol/L)
EAU (2025)	346 ng/dL	Yes (total T 8–12 nmol/L)
Italian Societies	~280–320 ng/dL	Formally incorporated
ICSM (2024)	Symptom-dependent	Yes
Society for Endocrinology	~230 ng/dL	Skeptical

As Pozzi and Ramasamy commented in IJIR: the AUA makes no mention of free testosterone. The Endocrine Society recommends it when total T is borderline. The EAU and BSSM advocate for it when total testosterone falls between 8 and 12 nmol/L. Italian societies formally incorporate it. The Society for Endocrinology is skeptical. They can't even agree on whether to measure a second number, let alone what the first one means.
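A short sketch shows how the same result lands on different sides of the line depending on the guideline. It uses only the single-number thresholds from the table above; the conversion factor (about 28.84 ng/dL per nmol/L, from testosterone's molar mass of roughly 288.4 g/mol) and the function names are my additions, not part of any guideline.

```python
# Conversion factor: 1 nmol/L of testosterone is about 28.84 ng/dL
# (molar mass ~288.4 g/mol). Added here as an assumption, not from the guidelines.
NG_DL_PER_NMOL_L = 28.84

# Single-number total testosterone thresholds from the table, in ng/dL.
THRESHOLDS_NG_DL = {
    "AACE": 200,
    "ISA": 230,
    "Endocrine Society": 264,
    "AUA / FDA": 300,
    "EAU (2025)": 346,
}


def nmol_l_to_ng_dl(value_nmol_l: float) -> float:
    """Convert total testosterone from nmol/L to ng/dL."""
    return value_nmol_l * NG_DL_PER_NMOL_L


def bodies_calling_low(total_t_ng_dl: float) -> list[str]:
    """Guideline bodies whose threshold the result falls below."""
    return [body for body, cutoff in THRESHOLDS_NG_DL.items()
            if total_t_ng_dl < cutoff]


if __name__ == "__main__":
    for result in (285, 340):
        print(result, bodies_calling_low(result))
```

The two results from the opening example, 285 and 340 ng/dL, are both "low" under the EAU threshold and both "normal" under the AACE, ISA, and Endocrine Society thresholds. The conversion also confirms that the 8–12 nmol/L window is the same interval as the 231–346 ng/dL range in the table.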

The clinical consequence: reported prevalence of male hypogonadism ranges from 2% to 39% depending on which diagnostic criteria are applied. The same population. Different definitions. Prevalence swings by a factor of twenty.

The Free Testosterone Disaster

If total testosterone is unreliably measured, free testosterone — the biologically active fraction — is worse. Only 1–3% of circulating testosterone is unbound. Measuring it directly requires equilibrium dialysis (ED), a laborious and expensive reference method unavailable in routine clinical practice. So we calculate it.

Three major equations compete. In 2018, Fiers et al. in JCEM tested all three against equilibrium dialysis with LC-MS/MS in 146 men and 183 women. The results were devastating:

The Ly-Handelsman equation, a simple empirical formula from 2005, matched equilibrium dialysis almost perfectly: median ratio 1.00. The Vermeulen equation, the standard clinical calculation used by most labs, overestimated by 19%. And the Zakharov allosteric model, the most theoretically sophisticated approach, which models cooperative binding of testosterone to the SHBG dimer, overestimated by a factor of 2.05. A man with a true free T of 50 pg/mL would read as roughly 103 on the Zakharov model.
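The arithmetic behind those figures can be sketched directly from the median ratios reported above. This is not an implementation of any of the three equations. It applies a population-level median ratio to a single value, which is a simplification, and the function names are hypothetical.

```python
# Median ratio of calculated free T to equilibrium dialysis, as quoted in the text.
MEDIAN_RATIO_TO_ED = {
    "Ly-Handelsman": 1.00,
    "Vermeulen": 1.19,
    "Zakharov": 2.05,
}


def expected_reading(true_free_t: float, equation: str) -> float:
    """Reading an equation would typically give for a known true free T."""
    return true_free_t * MEDIAN_RATIO_TO_ED[equation]


def implied_true_value(reported_free_t: float, equation: str) -> float:
    """True free T implied by a reported value, under the same median ratio."""
    return reported_free_t / MEDIAN_RATIO_TO_ED[equation]


if __name__ == "__main__":
    for name in MEDIAN_RATIO_TO_ED:
        print(name, round(expected_reading(50.0, name), 1))
```

The same ratio works in reverse: a Zakharov-based report can be divided by 2.05 to estimate what equilibrium dialysis would likely have shown, with the caveat that individual samples scatter around the median.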

This isn't merely academic. The Zakharov model underpins the TruT calculator, a patent-protected commercial tool. The Vermeulen equation is built into most laboratory information systems. The Ly equation, the best performer, is the least commonly implemented. The field defaulted to the wrong answer.

And it gets worse. Direct analog free testosterone assays, condemned by the Endocrine Society since 2007 as "totally inaccurate," are now used by just 7% of labs (CAP 2018). But Labcorp, the largest US reference laboratory, still offers one. As Andrea Dunaif put it: "This assay has been found to be totally inaccurate and should never be used." It's still commercially available.

The Timing Trap

Even with the right assay and the right equation, the wrong timing can invalidate the result.

Testosterone follows a circadian rhythm, peaking in the early morning and dropping 10–25% by afternoon. The EAU's 2025 guidelines specify 10:00 AM as the latest acceptable draw time. Non-fasting blood can underestimate testosterone by more than 30%. Yet the Narinx European survey found that fewer than half of labs recommend adequate morning sampling time and/or fasting.

The repeat-testing data is equally sobering. Studies consistently show that approximately 30% of men whose initial testosterone reads as "low" will normalize on a second measurement. The Endocrine Society, AUA, and EAU all recommend at least two separate morning fasting draws before diagnosis. In practice, many clinicians diagnose on a single result — and 25% of men starting testosterone replacement therapy have never had their testosterone measured at all.
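The sampling rules described above (fasting, drawn by 10:00 AM, at least two separate draws) can be written as a simple pre-diagnosis check. This is a sketch under assumptions: the `Draw` record and function names are hypothetical, "separate" is interpreted as different calendar days, and the threshold is left as a parameter because the guidelines disagree on it.

```python
from dataclasses import dataclass
from datetime import datetime, time

LATEST_DRAW_TIME = time(10, 0)  # latest acceptable draw time per the EAU 2025 guidance
MIN_SEPARATE_DRAWS = 2          # minimum number of qualifying draws before diagnosis


@dataclass(frozen=True)
class Draw:
    drawn_at: datetime
    fasting: bool
    total_t_ng_dl: float


def qualifying_draws(draws: list[Draw]) -> list[Draw]:
    """Keep only fasting draws taken at or before the latest acceptable time."""
    return [d for d in draws
            if d.fasting and d.drawn_at.time() <= LATEST_DRAW_TIME]


def low_on_repeat(draws: list[Draw], threshold_ng_dl: float) -> bool:
    """True only if qualifying draws span at least two days and all read low."""
    valid = qualifying_draws(draws)
    distinct_days = {d.drawn_at.date() for d in valid}
    return (len(distinct_days) >= MIN_SEPARATE_DRAWS
            and all(d.total_t_ng_dl < threshold_ng_dl for d in valid))
```

A single afternoon, non-fasting result never satisfies this check, however low it reads, and a man whose second qualifying draw normalizes is not flagged.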

The Body's Own Noise

Here's the fact that makes the entire diagnostic framework feel fragile: even if you control the assay, the timing, the fasting state, and the calculation — the body itself introduces significant variation.

The European Biological Variation Study (EuBIVAS), studying 38 men with weekly draws over 10 weeks, found a within-person coefficient of variation (CVi) of approximately 10% for total testosterone. This means:

18–28%: expected difference between two measurements, half the time
−38% to +60%: the Reference Change Value; anything smaller may be noise
"Marginally useful": the EuBIVAS conclusion on population reference intervals

Between-person differences so outweigh within-person variation that population-based reference intervals are, in the study's own words, "marginally useful" for individual diagnosis. A man's testosterone can halve, from 500 to 250, and still fall within the "normal" range. The reference interval tells you what's normal for a population. It tells you almost nothing about what's normal for this man.
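For readers who want to see where an asymmetric Reference Change Value comes from, here is a minimal sketch using the standard log-normal RCV formula. The analytical CV is an assumed input that depends on the assay, so this sketch does not attempt to reproduce the specific figures quoted above.

```python
import math


def reference_change_values(cv_within: float, cv_analytical: float,
                            z: float = 1.96) -> tuple[float, float]:
    """Asymmetric RCV (decrease, increase) as fractions, log-normal approach.

    cv_within and cv_analytical are coefficients of variation as fractions
    (0.10 means 10%). z = 1.96 corresponds to a two-sided 95% level.
    """
    if cv_within < 0 or cv_analytical < 0:
        raise ValueError("coefficients of variation must be non-negative")
    # Combine within-person and analytical variation on the log scale.
    sigma = math.sqrt(math.log(cv_within ** 2 + 1)
                      + math.log(cv_analytical ** 2 + 1))
    # The sqrt(2) accounts for comparing two measurements, each with its own error.
    spread = z * math.sqrt(2) * sigma
    return math.exp(-spread) - 1, math.exp(spread) - 1


if __name__ == "__main__":
    # 10% within-person CV from the text; 5% analytical CV is an assumption.
    down, up = reference_change_values(0.10, 0.05)
    print(f"{down:+.1%} to {up:+.1%}")
```

The upward limit is always larger in magnitude than the downward one because the calculation is multiplicative: a change that doubles a value is mirrored by one that halves it, not by one that takes it to zero.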

The Clinical Cascade Is Broken at Every Step

The measurement problem doesn't exist in isolation. It cascades through the entire diagnostic and treatment pathway.

At every step in the diagnostic pathway — from whether the test is ordered, to which assay is used, to when the blood is drawn, to what threshold is applied, to how free testosterone is calculated — the system introduces error, disagreement, or failure. The compounding effect is that the same man, with the same biology, can receive fundamentally different diagnoses depending on which clinic he walks into.

What This Means for the Evidence

The measurement problem doesn't just affect individual patients. It silently degrades the evidence base that every clinical decision rests on.

To its credit, the TRAVERSE trial — the largest cardiovascular safety trial for testosterone — used LC-MS/MS at a central laboratory, with two fasting morning draws before enrollment. This is methodologically sound. But it means TRAVERSE's inclusion threshold of <300 ng/dL on mass spectrometry cannot be directly applied to the majority of clinical labs that use immunoassay. A man reading 290 on immunoassay might be 320 on mass spec — or 260. The trial's rigor doesn't translate to the real world where its results are applied.

In 2025, Arun et al. at Yale (ADLM/Clinical Chemistry) made a provocative argument: the original ~12% hypogonadism prevalence was established when most labs used immunoassay with a 300 ng/dL cutoff. When they applied LC-MS/MS data, they found that 264 ng/dL on mass spectrometry reproduces the same ~11–14% prevalence. In other words, 300 ng/dL on mass spec overdiagnoses relative to the original clinical intent. The apparent rise in hypogonadism diagnoses may be, in part, an assay migration artifact.

This has implications for every meta-analysis that pools studies across decades and assay platforms. The four cardiovascular meta-analyses I covered in my post-TRAVERSE analysis, among them Corona's 106 RCTs, García-Becerra's 41 RCTs, and Hudson's 35 Lancet studies, aggregate data from trials that used different assays, different thresholds, and different timing protocols. They are pooling measurements that may not be measuring the same thing.

The same caveat applies to the secular testosterone decline: the NHANES data showing a 25% decline in young men used immunoassay throughout, which controls for within-study bias. But the absolute numbers, and the clinical significance of any given level, remain assay-dependent.

Even the saturation model, which I've described as one of the most important conceptual breakthroughs in andrology, depends on measurement. Morgentaler's observation that intraprostatic DHT saturates at ~250 ng/dL of serum testosterone is a biological insight — but the precision of that number is constrained by the assay that generated it. With ±20% variability, the saturation point might be anywhere from 200 to 300. The concept holds; the precision is approximate.

"I'm sure if health care professionals were aware of these issues, they would insist on the correct assay."

— Andrea Dunaif, Mount Sinai / ENDO 2025 initiative for updated measurement standards

The Overdiagnosis–Underdiagnosis Paradox

The measurement problem creates a strange paradox: it simultaneously causes both overdiagnosis and underdiagnosis, depending on where the errors land.

If a man's true testosterone is 310 ng/dL and his afternoon, non-fasting draw reads 275, he gets diagnosed and treated for a disease he doesn't have. If another man's true level is 240 and a positively biased immunoassay reads 320, he's told he's fine when he isn't. If a third man's free testosterone is calculated with the Zakharov model at 10 pg/mL (true value: 4.9 pg/mL), his clinician sees a normal number and stops investigating.

The ADLM/Clinical Chemistry 2025 team proposed a way forward: a probabilistic approach. Instead of a binary threshold (above = normal, below = disease), report the probability that the true testosterone concentration falls below a clinically meaningful level, given the measurement's known uncertainty. This is how mature measurement sciences handle diagnostic ambiguity. Endocrinology hasn't adopted it.
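To give a feel for what a probabilistic report could look like, here is a deliberately simple sketch. It is not the ADLM team's method. It assumes the measurement error is roughly normal with a known total coefficient of variation and no prior information about the patient, so the true value is treated as normally distributed around the measured one. The function name and the 10% CV in the example are assumptions.

```python
from statistics import NormalDist


def probability_true_below(measured_ng_dl: float, threshold_ng_dl: float,
                           total_cv: float) -> float:
    """Probability that the true value lies below the threshold.

    Simplifying assumption: true value ~ Normal(measured, total_cv * measured).
    total_cv is a fraction (0.10 means 10%) and should combine analytical
    and within-person variation for the assay in question.
    """
    if measured_ng_dl <= 0 or total_cv <= 0:
        raise ValueError("measured value and total_cv must be positive")
    uncertainty = NormalDist(mu=measured_ng_dl, sigma=total_cv * measured_ng_dl)
    return uncertainty.cdf(threshold_ng_dl)


if __name__ == "__main__":
    # The two results from the opening example, against a 300 ng/dL threshold.
    for result in (285.0, 340.0):
        p = probability_true_below(result, 300.0, total_cv=0.10)
        print(f"{result:.0f} ng/dL: P(true < 300) = {p:.2f}")
```

Under these assumptions, neither the "low" result nor the "normal" one yields a probability near 0 or 1, which is exactly the ambiguity a binary threshold hides.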

The price transparency data adds another layer. DeMasi et al. (IJIR 2025) found that nearly half of US hospitals don't publicly report testosterone test prices. Among those that do, there is significant price variability. In Mexico City, costs ranged from $9 to $160 for the same test. When even the cost is opaque, patients can't make informed decisions about a measurement that's already unreliable.

The Path Forward

The problem is real but not unsolvable. Several efforts are converging:

CDC HoSt certification is the most concrete step. Labs that calibrate LC-MS/MS assays against certified reference materials achieve <5% inter-lab bias — a dramatic improvement over immunoassay. But only 20 assays are certified, and most clinical labs haven't adopted them. The Endocrine Society and AUA are jointly pushing for wider adoption, but progress is slow against the inertia of established immunoassay platforms.

The ADLM probabilistic framework is intellectually the most promising. Rather than arguing about whether the line should be at 264 or 300, it reframes the question: given this measurement, with this assay, at this time of day, what is the probability that this man's true testosterone is clinically low? This absorbs measurement uncertainty into the diagnostic process rather than pretending it doesn't exist.

The Mount Sinai/ENDO 2025 initiative, led by Andrea Dunaif, is pushing for updated measurement standards and reference intervals through the Endocrine Society. A CDC meeting in August 2024 specifically addressed standardized testosterone reference intervals. The work is underway — but it's years from changing clinical practice.

Until then, clinicians who understand the measurement problem can at least mitigate it: use LC-MS/MS where available, draw fasting samples before 10 AM, repeat at least twice before diagnosing, use the Ly-Handelsman equation where the lab implements it and otherwise the Vermeulen equation (imperfect but better than Zakharov or analog assays), and treat the number as one input among many, not as the sole arbiter of diagnosis. The EAU's 2025 update captures this best: the decision to treat should be based on signs and symptoms in addition to serum testosterone measurements.

The number matters. But the number is broken. And before we debate what it should be — before we argue about 264 versus 300 versus 346 — we need to be able to measure it.

Sources & Further Reading

Gonzalez-Carranza HR, Morgentaler A, Reyes-Vallejo LA. Testosterone threshold, assay and costs among laboratories for hypogonadism diagnosis. Transl Androl Urol. 2025.
Tsampoukas G et al. / Pozzi C, Ramasamy R. Variations in diagnostic criteria for male hypogonadism. IJIR. 2025.
Fiers T et al. Reassessing free-testosterone calculation by LC-MS/MS direct equilibrium dialysis. JCEM. 2018.
Arun AS et al. Recalibrating the testosterone threshold. Clinical Chemistry (ADLM). 2025.
Narinx N et al. European laboratory survey on testosterone measurement. CCLM. 2025.
DeMasi M et al. Testosterone testing in the United States: limited price transparency. IJIR. 2025.
College of American Pathologists (CAP). Proficiency testing surveys Y-B 2022.
CDC Hormone Standardization (HoSt) Program. Testosterone certification criteria.
EAU Guidelines on Male Hypogonadism. 2025 update.
Zakharov MN et al. Allosteric SHBG model. Mol Cell Endocrinol. 2015.
Ly LP, Handelsman DJ. Empirical estimation of free testosterone. Eur J Endocrinol. 2005.
European Biological Variation Study (EuBIVAS). Within-person testosterone variation.
Rosner W et al. Endocrine Society position statement on testosterone measurement. 2007.