MedDeviceGuide

Sample Size Calculation for Medical Device Clinical Investigations: Practical Methods and Examples

Practical guide to sample size calculation and justification for medical device clinical investigations — ISO 14155:2026 requirements, EU MDR Annex XV expectations, FDA IDE statistical guidance, superiority vs non-inferiority vs equivalence designs, Bayesian approaches, adaptive sample size re-estimation, pilot study sizing, and worked examples for orthopedic, cardiovascular, and diagnostic accuracy studies.

Ran Chen
Global MedTech Expert | 10× MedTech Global Access
2026-04-24 · 28 min read

Why Sample Size Matters for Medical Device Trials

Sample size is not a bureaucratic checkbox. It is the structural foundation upon which every clinical conclusion rests. An underpowered study cannot reliably distinguish a real treatment effect from random variation, and an oversized study exposes more patients than necessary to investigational risk. Both scenarios raise ethical concerns and regulatory objections.

The EU Medical Device Regulation (EU) 2017/745 makes this explicit in Annex XV Chapter III, which requires "an adequate number of observations to guarantee the scientific validity of the conclusions." ISO 14155:2026, the newly published edition of the international standard for clinical investigations of medical devices, reinforces this by requiring that the evaluation account for sample size, study population, and indication specific to the investigation. On the US side, the FDA expects a detailed statistical justification in every Investigational Device Exemption (IDE) application and expects sponsors to justify that the proposed number of subjects is adequate for the study objectives.

Getting sample size right is therefore not optional. It is a regulatory, scientific, and ethical imperative that directly affects whether a clinical investigation is approved, whether its results are accepted by reviewing authorities, and ultimately whether patients are protected.

Regulatory Framework

Sample size justification does not exist in a vacuum. Multiple overlapping regulations, standards, and guidance documents define what regulators expect. Understanding the framework is essential before performing any calculation.

EU MDR Annex XV Chapter III

Annex XV Chapter III of the MDR, titled "Clinical Investigation Plan," requires that the investigation plan include "the justification of the number of subjects" and specify the statistical methods used. For investigational devices, the plan must demonstrate that the chosen sample size provides adequate statistical power for the primary endpoint. The regulation does not prescribe a specific formula or power level, but Notified Bodies and ethics committees expect the justification to follow established statistical practice.

ISO 14155:2026

ISO 14155:2026 ("Clinical investigation of medical devices for human subjects — Good clinical practice") was published in early 2026 and represents a significant update. Key changes relevant to sample size include:

  • Annex K introduces the estimand framework aligned with ICH E9(R1), requiring sponsors to define the treatment effect of interest considering intercurrent events and missing data strategies before determining sample size.
  • The standard explicitly requires documentation of the statistical hypothesis, significance level, power, effect size, and variability estimates used in the sample size calculation.
  • ISO 14155:2026 now cross-references ICH E9(R1) principles, bringing device clinical investigations closer to the statistical rigor long expected in pharmaceutical trials.

FDA Statistical Guidance for Medical Devices

The FDA has published multiple guidance documents that inform sample size expectations for medical device submissions:

  • "Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests" provides the framework for diagnostic accuracy study sizing, including confidence interval approaches for sensitivity and specificity.
  • "Design Considerations for Pivotal Clinical Investigations for Medical Devices" addresses sample size planning for pivotal studies, including multiplicity adjustments and the relationship between study objectives and powering.
  • The FDA expects the IDE application to include a complete statistical section with the sample size derivation, all assumed parameters, and supporting references for every assumption.

ICH E9(R1) Estimand Framework

ICH E9(R1), "Addendum on Estimands and Sensitivity Analysis in Clinical Trials," was finalized in 2019 and has been progressively adopted by device regulators. While originally written for drug trials, ISO 14155:2026 now explicitly references it for devices. The estimand framework requires sponsors to define, before calculating sample size:

  • The population to which the result applies
  • The endpoint (or variable) to be analyzed
  • How intercurrent events (e.g., device explant, concomitant therapy, death) are handled
  • The population-level summary (e.g., difference in means, risk difference)

Defining the estimand first prevents the common mistake of calculating sample size for an analysis that does not match the regulatory question being asked.

MHRA Statistical Guidance

The UK MHRA published its guidance on "Statistical Considerations for Clinical Investigations of Medical Devices," which provides practical recommendations including minimum expected power levels, non-inferiority margin justification requirements, and multiplicity adjustment expectations. This document is widely referenced in both UK and EU submissions and aligns closely with FDA expectations.

The Five Essential Inputs for Sample Size Calculation

Every sample size calculation, regardless of study design, requires five fundamental inputs. Missing or poorly justified inputs produce unreliable results.

1. Statistical Hypothesis

The null hypothesis (H0) and alternative hypothesis (H1) define what the trial is testing. The formulation determines the type of calculation:

  • Superiority: H0: treatment effect = 0 vs. H1: treatment effect is not 0
  • Non-inferiority: H0: treatment effect is worse than margin -delta vs. H1: treatment effect is at least as good as -delta
  • Equivalence: H0: treatment effect is outside +/- delta vs. H1: treatment effect is within +/- delta

2. Significance Level (alpha)

The probability of a Type I error — rejecting the null hypothesis when it is true. Typical values:

  • 0.05 two-sided for superiority (equivalent to 0.025 one-sided)
  • 0.025 one-sided for non-inferiority
  • 0.05 two-sided for equivalence (testing both bounds)

The significance level directly affects sample size: halving alpha (for example, from 0.05 to 0.025) requires roughly a 20% increase in sample size, all else equal.

3. Statistical Power (1 - beta)

The probability of correctly rejecting the null hypothesis when the alternative is true. Regulatory expectations:

  • 80% power is the generally accepted minimum
  • 90% power is expected for pivotal trials, especially for Class III devices
  • Power below 80% is rarely accepted by regulators or ethics committees

Increasing power from 80% to 90% requires approximately a 30% increase in sample size.

4. Effect Size (Clinically Meaningful Difference)

The smallest treatment effect that would be considered clinically relevant. This is not a statistical parameter — it is a clinical judgment. Effect size must be justified based on clinical input, published literature, and regulatory precedent. A common mistake is powering on the effect size expected or observed in earlier data (which risks an underpowered trial if the true effect is smaller) rather than on the minimum clinically meaningful effect.

5. Variability

The expected standard deviation (for continuous endpoints) or event rate (for binary endpoints). Variability estimates should come from prior studies, literature, or pilot data and must be documented in the statistical analysis plan.

How Each Input Affects Sample Size

Input | Change | Effect on Sample Size
Significance level (alpha) | Decrease from 0.05 to 0.01 | Approximately 50% increase
Power (1 - beta) | Increase from 80% to 90% | Approximately 30% increase
Effect size (d) | Decrease by half | Approximately 4x increase (quadratic relationship)
Standard deviation (sigma) | Increase by 50% | Approximately 2.25x increase (quadratic relationship)
Dropout rate | Increase from 10% to 20% | Approximately 12.5% increase in enrolled subjects
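These approximate multipliers can be checked directly. The sketch below (standard-library Python; the function name and illustrative inputs are ours, not from any cited tool) recomputes the ratios from the two-sample formula for continuous endpoints given later in this article:

```python
from statistics import NormalDist

def n_per_group(alpha, power, sigma, d):
    """Two-sample normal-approximation sample size per group (two-sided alpha)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return 2 * (z_a + z_b) ** 2 * sigma ** 2 / d ** 2

base = n_per_group(0.05, 0.80, 20, 8)
print(round(n_per_group(0.01, 0.80, 20, 8) / base, 2))  # alpha 0.05 -> 0.01
print(round(n_per_group(0.05, 0.90, 20, 8) / base, 2))  # power 80% -> 90%
print(round(n_per_group(0.05, 0.80, 20, 4) / base, 2))  # effect size halved
print(round(n_per_group(0.05, 0.80, 30, 8) / base, 2))  # sigma increased by 50%
```

The printed ratios (about 1.49, 1.34, 4.0, and 2.25) line up with the table's approximations.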

Calculation Methods by Study Objective

Superiority Trials

Superiority trials test whether the investigational device is better than the control. This is the most common design for novel devices claiming a clinical advantage.

Formula for continuous endpoints (two-sample t-test, equal allocation):

n per group = 2 x (Z_{1-alpha/2} + Z_{1-beta})^2 x sigma^2 / d^2

Where sigma is the pooled standard deviation and d is the clinically meaningful difference.

Formula for binary endpoints (two-proportion test):

n per group = (Z_{1-alpha/2} x sqrt(2 x p_bar x (1 - p_bar)) + Z_{1-beta} x sqrt(p1 x (1-p1) + p2 x (1-p2)))^2 / (p1 - p2)^2

Where p1 and p2 are the expected event rates in each group, and p_bar is the average of p1 and p2.

Worked example: Orthopedic superiority trial

A sponsor is investigating a new total knee replacement implant against the standard of care. The primary endpoint is the change in VAS pain score at 12 months (continuous). Based on literature, the standard deviation is 20 mm. The clinically meaningful difference is 8 mm.

  • alpha = 0.05 (two-sided)
  • Power = 90%
  • sigma = 20
  • d = 8
  • Z_{0.975} = 1.96, Z_{0.90} = 1.282
n per group = 2 x (1.96 + 1.282)^2 x 20^2 / 8^2
           = 2 x 10.507 x 400 / 64
           = 2 x 65.67
           = 131.3

Round up to 132 per group. With a 15% dropout rate, enroll 132 / 0.85 = 156 per group, totaling 312 subjects.
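The same numbers fall out of a short standard-library Python sketch (function name ours):

```python
import math
from statistics import NormalDist

def superiority_n_continuous(alpha, power, sigma, d):
    """Per-group n for a two-sided superiority test on means (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_a + z_b) ** 2 * sigma ** 2 / d ** 2)

n = superiority_n_continuous(0.05, 0.90, sigma=20, d=8)
enrolled = math.ceil(n / (1 - 0.15))  # inflate for 15% dropout
print(n, enrolled)  # 132 per group, 156 enrolled per group
```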

Non-Inferiority Trials

Non-inferiority trials test whether the investigational device is not worse than the control by more than a pre-specified margin (delta). This design is common for devices that offer other advantages (easier use, lower cost, fewer complications) but are not expected to be superior in efficacy.

Formula for binary endpoints (non-inferiority, one-sided):

n per group = (Z_{1-alpha} + Z_{1-beta})^2 x (p1 x (1-p1) + p2 x (1-p2)) / (p1 - p2 + delta)^2

Where delta is the non-inferiority margin (a positive value), p1 is the control event rate, and p2 is the investigational event rate. For conservative planning, the true difference p1 - p2 is assumed to be 0 or a small value favoring the control.

Worked example: Cardiovascular non-inferiority trial

A sponsor is testing a new drug-eluting stent against an approved comparator. The primary endpoint is target lesion failure (TLF) at 12 months. The expected event rate for both devices is 8%. The non-inferiority margin is 3.5 percentage points.

  • alpha = 0.025 (one-sided)
  • Power = 80%
  • p1 = p2 = 0.08
  • delta = 0.035
  • Z_{0.975} = 1.96, Z_{0.80} = 0.842
n per group = (1.96 + 0.842)^2 x (0.08 x 0.92 + 0.08 x 0.92) / (0 + 0.035)^2
           = (2.802)^2 x 0.1472 / 0.001225
           = 7.851 x 0.1472 / 0.001225
           = 1.156 / 0.001225
           = 943.7

Round up to 944 per group. With a 10% dropout, enroll 944 / 0.90 = 1049 per group, totaling 2,098 subjects.
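A quick cross-check of the stent example as a standard-library Python sketch (p1 is the control rate, p2 the investigational rate; the function name is ours):

```python
import math
from statistics import NormalDist

def ni_n_binary(alpha_one_sided, power, p1, p2, delta):
    """Per-group n for a one-sided non-inferiority test on proportions,
    equal allocation; p1 = control rate, p2 = investigational rate."""
    z_a = NormalDist().inv_cdf(1 - alpha_one_sided)
    z_b = NormalDist().inv_cdf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * var / (p1 - p2 + delta) ** 2)

n = ni_n_binary(0.025, 0.80, p1=0.08, p2=0.08, delta=0.035)
enrolled = math.ceil(n / (1 - 0.10))  # inflate for 10% dropout
print(n, enrolled)  # 944 per group, 1049 enrolled per group
```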

Non-inferiority margin justification is the most scrutinized aspect of this design. The margin must be justified both clinically (what magnitude of inferiority is acceptable given the device's other benefits) and statistically (the margin should be no larger than the smallest effect the active control has consistently demonstrated vs. placebo, often using the "95-95 method" or the synthesis method). The FDA expects a dedicated section in the statistical analysis plan addressing margin selection.

Equivalence Trials

Equivalence trials test whether the investigational device is neither superior nor inferior to the control within symmetric margins. This design is used when the goal is to demonstrate that two devices produce clinically comparable results — common for generic or biosimilar-type device comparisons.

Formula (two one-sided tests procedure):

n per group = 2 x (Z_{1-alpha} + Z_{1-beta/2})^2 x sigma^2 / (delta - |true difference|)^2

Where delta is the equivalence margin and the true difference is assumed to be 0.

Equivalence requires demonstrating both that the investigational device is not inferior and not superior to the control. Because both bounds must be tested, the sample size is typically larger than for a non-inferiority trial with the same margin, particularly if the true difference is not exactly zero.
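The TOST formula above can be sketched in Python as follows; the margin, standard deviation, and function name here are illustrative assumptions, not drawn from a specific trial:

```python
import math
from statistics import NormalDist

def equivalence_n(alpha, power, sigma, delta, true_diff=0.0):
    """Per-group n for equivalence via two one-sided tests (TOST),
    normal approximation with a symmetric margin delta."""
    z_a = NormalDist().inv_cdf(1 - alpha)            # each one-sided test at alpha
    z_b = NormalDist().inv_cdf(1 - (1 - power) / 2)  # beta split across the two bounds
    return math.ceil(2 * (z_a + z_b) ** 2 * sigma ** 2 / (delta - abs(true_diff)) ** 2)

# Hypothetical inputs: sigma = 10, equivalence margin = 5, true difference 0
print(equivalence_n(0.05, 0.80, sigma=10, delta=5))  # 69 per group
```

Note how a nonzero assumed true difference shrinks the denominator and inflates n, as described above.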

Single-Arm and One-Sample Studies

Single-arm studies compare outcomes to a fixed performance target (performance goal or objective performance criterion, OPC) rather than to a concurrent control group. These are common for devices treating rare conditions or where randomized controlled trials are impractical.

Precision-based sizing for diagnostic accuracy:

For a diagnostic accuracy study estimating sensitivity (or specificity), the sample size is determined by the desired precision of the confidence interval:

n = Z_{1-alpha/2}^2 x Se x (1 - Se) / W^2

Where Se is the expected sensitivity and W is the desired half-width of the confidence interval.

This formula assumes normal approximation. For small sample sizes or extreme proportions (near 0 or 1), exact methods (Clopper-Pearson) are preferred, and the sample size should be verified using exact calculations.

Worked example: Diagnostic accuracy study

A new imaging device is expected to achieve 90% sensitivity for detecting a specific pathology. The sponsor wants a 95% confidence interval with a half-width of 5 percentage points.

  • alpha = 0.05 (two-sided)
  • Se = 0.90
  • W = 0.05
  • Z_{0.975} = 1.96
n = 1.96^2 x 0.90 x 0.10 / 0.05^2
  = 3.842 x 0.09 / 0.0025
  = 0.3457 / 0.0025
  = 138.3

Round up to 139 positive cases. If the disease prevalence is 30%, the total sample needed is 139 / 0.30 = 464 subjects.
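The precision-based calculation can be reproduced with a few lines of standard-library Python (function name ours; as noted above, small-n or extreme-proportion results should be verified with exact methods):

```python
import math
from statistics import NormalDist

def precision_n(alpha, se, half_width):
    """Positive cases needed for a CI on sensitivity with the target half-width
    (normal approximation; verify small-sample results with exact methods)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(z ** 2 * se * (1 - se) / half_width ** 2)

pos = precision_n(0.05, se=0.90, half_width=0.05)
total = math.ceil(pos / 0.30)  # divide by the 30% disease prevalence
print(pos, total)  # 139 positive cases, 464 subjects
```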

Study Design Considerations

Parallel Group vs. Crossover Designs

Parallel group designs assign each subject to one arm and are the most common design for device trials. Crossover designs expose each subject to both treatments in sequence, with a washout period in between.

For devices, crossover designs are feasible when the condition is chronic and stable, the treatment effect is temporary, and a washout period can be defined. When appropriate, crossover designs require roughly half the sample size of a parallel group design because each subject serves as their own control, eliminating between-subject variability from the treatment comparison.

However, many device trials cannot use crossover designs: implants cannot be swapped, surgical procedures cannot be reversed, and carryover effects are difficult to quantify for devices that cause permanent anatomical changes.

Cluster-Randomized Trials

In cluster-randomized trials, groups of subjects (e.g., hospitals, clinics) rather than individual subjects are randomized. This design is used when individual randomization is impractical or when the intervention operates at the group level (e.g., a new surgical protocol or training program).

Sample size must be inflated by the design effect:

Design effect = 1 + (m - 1) x ICC

Where m is the average cluster size and ICC is the intraclass correlation coefficient. For example, with an average of 20 subjects per site and an ICC of 0.05, the design effect is 1.95 — nearly doubling the required sample size compared to individual randomization.
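As a minimal sketch (the individually-randomized n of 400 is a hypothetical input):

```python
def design_effect(m, icc):
    """Variance inflation factor from cluster randomization:
    1 + (m - 1) * ICC, with m the average cluster size."""
    return 1 + (m - 1) * icc

n_individual = 400  # hypothetical sample size under individual randomization
deff = design_effect(m=20, icc=0.05)
print(round(deff, 2), round(n_individual * deff))  # 1.95, 780
```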

Multi-Arm Studies and Multiplicity Adjustments

When a trial has more than two arms or more than one primary endpoint, the chance of a false positive increases. Multiplicity adjustments are required to control the overall Type I error rate.

Common approaches include:

  • Bonferroni correction: Divide alpha by the number of comparisons. Simple but conservative.
  • Hierarchical (fixed-sequence) testing: Pre-specify the order of testing. Each hypothesis is tested at the full alpha level, but only if all prior hypotheses were rejected. This approach is preferred when a clear ordering of importance exists.
  • Hochberg procedure: A less conservative step-up procedure that can provide more power than Bonferroni.
  • Gatekeeping procedures: For trials with primary and secondary endpoints, gatekeeping controls the familywise error rate across families of hypotheses.

The FDA and EU regulators expect the multiplicity adjustment strategy to be pre-specified in the statistical analysis plan. Post-hoc adjustments are not accepted.
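To make the Bonferroni/Hochberg contrast concrete, here is a small sketch (the p-values are hypothetical, and the function names are ours):

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i if p_i <= alpha / k (k comparisons)."""
    k = len(pvals)
    return [p <= alpha / k for p in pvals]

def hochberg(pvals, alpha=0.05):
    """Step-up procedure: scan p-values from largest to smallest; the first
    p_(i) meeting alpha / (k - i + 1) rejects itself and all smaller p-values."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i], reverse=True)
    reject = [False] * k
    for rank, i in enumerate(order):          # rank 0 = largest p, threshold alpha/1
        if pvals[i] <= alpha / (rank + 1):
            for j in order[rank:]:
                reject[j] = True
            break
    return reject

p = [0.012, 0.030, 0.041]
print(bonferroni(p))  # only 0.012 passes 0.05/3
print(hochberg(p))    # all rejected, since the largest p (0.041) <= 0.05
```

This illustrates why Hochberg can provide more power than Bonferroni at the same familywise error rate.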

Stratification Factors

Stratification ensures balance between treatment groups on key prognostic variables (e.g., disease severity, site, age group). While stratification does not directly change sample size, it improves the efficiency of the analysis and can reduce the variance of the treatment effect estimate. Stratified randomization should be considered when strong prognostic factors are known, and the analysis should account for stratification factors to preserve the efficiency gain.

Bayesian Approaches for Medical Devices

The FDA's 2010 guidance, "Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials," opened the door for Bayesian methods in device submissions. Unlike drug trials where sample sizes are typically large, device trials often involve smaller populations, iterative device modifications, and the availability of prior information from earlier device versions or literature — all conditions where Bayesian approaches excel.

Key Principles

Bayesian methods combine prior information (from previous studies, registries, or expert opinion) with current trial data to produce a posterior distribution for the treatment effect. The sample size needed depends on both the strength of the prior and the observed data.

Borrowing from Historical Data

  • Power priors: Discount historical data by raising the likelihood to a power between 0 and 1. A power of 0 ignores historical data entirely; a power of 1 treats it as equally informative as the current trial.
  • Hierarchical models: Borrow strength across related studies or device iterations, with the degree of borrowing determined by the between-study variance. This is the preferred approach when combining data from multiple prior studies.
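A toy illustration of a power prior with conjugate beta-binomial updating — all counts, the discount factor a0, and the function names here are hypothetical, and a real submission would justify the prior and verify operating characteristics by simulation:

```python
import random

def power_prior_posterior(x, n, x_hist, n_hist, a0):
    """Beta posterior for a success rate with a power prior that discounts
    historical data by a0 in [0, 1] (a0=0 ignores history; a0=1 pools fully).
    Starts from a flat Beta(1, 1) prior."""
    a = 1 + a0 * x_hist + x
    b = 1 + a0 * (n_hist - x_hist) + (n - x)
    return a, b

def prob_exceeds(a, b, target, sims=200_000, seed=1):
    """Monte Carlo posterior probability that the success rate exceeds target."""
    rng = random.Random(seed)
    return sum(rng.betavariate(a, b) > target for _ in range(sims)) / sims

# Hypothetical: 80/100 successes historically, 42/50 in the current study,
# with the history discounted to half weight
a, b = power_prior_posterior(x=42, n=50, x_hist=80, n_hist=100, a0=0.5)
print(round(prob_exceeds(a, b, target=0.75), 3))
```

Note how a0 controls the effective sample size contributed by the historical study: at a0 = 0.5, the 100 historical patients count as 50.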

Bayesian Adaptive Designs with Sample Size Re-estimation

Bayesian adaptive designs allow the sample size to be determined during the trial based on accumulating data. The FDA has accepted Bayesian adaptive designs for devices where:

  • Pre-specified decision rules govern when to stop early (for success or futility) or continue enrollment
  • Operating characteristics (Type I error, power) are demonstrated through simulation
  • Prior information is well-justified and its influence is transparent

The FDA has approved more than 40 devices using Bayesian approaches. A notable example is the SENATA trial (P170030) for a coronary stent that used a Bayesian non-inferiority design with borrowing from historical control data, reducing the required sample size while maintaining regulatory rigor.

When Bayesian Is Preferred

  • Rare populations: When the eligible patient population is too small for a frequentist trial with adequate power
  • Iterative device modifications: When a device undergoes sequential modifications and data from earlier versions can inform the current study
  • Slow enrollment: When enrollment is slow and pre-specified interim analyses can enable earlier decision-making
  • Strong prior data available: When well-characterized historical data from predicate devices or registries exists

Pilot and Feasibility Study Sizing

Pilot studies are not powered for efficacy. Their purpose is to test procedures, estimate parameters (especially variability), assess feasibility of recruitment, and refine the protocol for the pivotal trial. Powering a pilot study for a treatment effect is a fundamental design error.

Recommended Approaches

Steven Julious, in his influential 2005 paper on pilot trial sample sizes, recommended 12 subjects per group as a minimum for feasibility studies. The rationale is practical: 12 per group provides a standard deviation estimate precise enough to plan the main trial, and the gain in precision per additional subject drops off sharply beyond roughly 12 subjects (11 degrees of freedom) per group.

Practical guidelines:

  • 12 per group is the minimum for estimating variance with adequate precision for planning a main trial
  • 12 to 20 per group provides progressively better variance estimation
  • The sample size should be justified based on the specific feasibility objectives, not on statistical power
  • If the pilot is also intended to estimate event rates for a binary endpoint, larger samples may be needed to achieve useful precision

Justification-Based Approach

Rather than a power calculation, the pilot sample size should be justified by describing what parameter estimates it will produce and how those estimates will be used in the pivotal trial design. For example: "A pilot of 15 subjects per group will provide a standard deviation estimate with a coefficient of variation of approximately 20%, sufficient to inform the pivotal trial sample size calculation with reasonable accuracy."
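The "coefficient of variation of approximately 20%" figure can be sanity-checked with the standard normal-theory approximation CV(s) ≈ 1/sqrt(2(n - 1)) for a sample standard deviation (a sketch; the function name is ours):

```python
import math

def sd_estimate_cv(n):
    """Approximate coefficient of variation of a sample SD from n observations,
    CV(s) ~ 1 / sqrt(2 * (n - 1)), assuming normally distributed data."""
    return 1 / math.sqrt(2 * (n - 1))

for n in (12, 15, 20):
    print(n, round(sd_estimate_cv(n), 3))  # about 0.21, 0.19, 0.16
```

This is consistent with both the Julious minimum of 12 per group and the worked justification above.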

Adjustments and Inflation

Dropout Rate Adjustment

No trial retains every enrolled subject. The sample size must be inflated to account for expected attrition:

n_enrolled = n_calculated / (1 - dropout_rate)

Typical dropout rates for device trials range from 5% for short-term studies with objective endpoints to 20% or more for long-term studies with subjective endpoints. The assumed dropout rate must be justified by literature or prior experience.

Multiple Comparison Adjustments

As discussed in the multiplicity section, each additional comparison inflates the effective sample size required. For k independent comparisons using Bonferroni, the adjusted alpha is alpha/k, which substantially increases the per-comparison sample size.

Interim Analysis Spending Functions

When interim analyses are planned, the overall alpha must be preserved. Spending functions allocate portions of the total alpha to each interim look:

  • O'Brien-Fleming: Very conservative at early looks, spending little alpha until the final analysis. Most of the alpha is reserved for the end. This is the most commonly used approach for device trials because it penalizes early looks minimally and preserves nearly full alpha for the final analysis.
  • Pocock: Spends alpha equally across all looks. This makes early rejection easier but leaves less alpha for the final analysis, resulting in a higher critical value at the end.

The choice of spending function affects the sample size only modestly (typically a 2-10% increase over a fixed-sample design), but the operating characteristics differ significantly and must be pre-specified.
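The contrast between the two spending functions can be seen numerically with the standard Lan-DeMets approximations (a sketch; the function names are ours):

```python
import math
from statistics import NormalDist

Z = NormalDist()

def obrien_fleming_spend(t, alpha=0.05):
    """Lan-DeMets O'Brien-Fleming-type cumulative alpha spent at
    information fraction t in (0, 1], two-sided overall alpha."""
    return 4 * (1 - Z.cdf(Z.inv_cdf(1 - alpha / 4) / math.sqrt(t)))

def pocock_spend(t, alpha=0.05):
    """Lan-DeMets Pocock-type cumulative alpha spent at information fraction t."""
    return alpha * math.log(1 + (math.e - 1) * t)

for t in (0.25, 0.5, 1.0):
    print(t, round(obrien_fleming_spend(t), 4), round(pocock_spend(t), 4))
```

At the halfway look, the O'Brien-Fleming-type function has spent only about 0.003 of the 0.05 total, while the Pocock-type function has already spent about 0.031 — which is why O'Brien-Fleming preserves nearly full alpha for the final analysis.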

Subgroup Analysis Considerations

Subgroup analyses are often requested by regulators to evaluate consistency of the treatment effect across demographic or clinical subgroups. Subgroup analyses are typically underpowered, and regulators expect them to be exploratory unless the trial is specifically powered for a subgroup-by-treatment interaction test. If a subgroup is a primary analysis population, the sample size must be calculated for that subgroup independently.

Software Tools for Sample Size Calculation

Several commercial and free tools are available for sample size calculation. The choice depends on the complexity of the design, the need for documented output, and budget.

Tool | Type | Strengths | Limitations | Best For
nQuery (Statistical Solutions) | Commercial | Comprehensive; FDA-accepted documentation; wide range of designs including Bayesian, adaptive, and survival | Expensive | Pivotal trials requiring detailed statistical documentation
PASS (NCSS) | Commercial | Large library of tests; good for complex designs; detailed output | Windows only; expensive | Multi-arm, cluster-randomized, and dose-finding studies
G*Power | Free | Easy to use; covers standard designs; well-documented | Limited to frequentist methods; fewer advanced designs | Academic and early-stage planning
SAS PROC POWER | Commercial (SAS license) | Integrated with SAS workflow; programmable; reproducible | Requires SAS license; steeper learning curve | Organizations already using SAS for analysis
R packages (pwr, powerSurvEpi, sampleSize4ClinicalTrials) | Free | Flexible; reproducible; can be customized | Requires R programming; no GUI | Statisticians comfortable with R

For regulatory submissions, nQuery and PASS are the most widely accepted because they produce detailed output documents that can be included directly in the statistical analysis plan. G*Power and R packages are acceptable if the calculation methodology is well-documented in the protocol.


Practical Examples

Example 1: Orthopedic Implant Superiority Trial (Continuous Endpoint)

A sponsor is developing a novel lumbar fusion cage and wants to demonstrate superiority over the standard titanium cage. The primary endpoint is the change in Oswestry Disability Index (ODI) score at 24 months.

Parameter | Value | Source
Expected mean improvement (investigational) | 30 points | Pilot study
Expected mean improvement (control) | 22 points | Published literature
Clinically meaningful difference (d) | 8 points | Clinical advisory board
Standard deviation (sigma) | 18 points | Pooled estimate from literature
Significance level (alpha) | 0.05 two-sided | Standard
Power (1 - beta) | 90% | Pivotal trial standard
Allocation ratio | 1:1 | Balanced design
Expected dropout rate | 15% | Literature for 24-month spine trials

Calculation:

n per group = 2 x (1.96 + 1.282)^2 x 18^2 / 8^2
           = 2 x 10.507 x 324 / 64
           = 2 x 53.20
           = 106.4

Round up to 107 per group. With 15% dropout: 107 / 0.85 = 126 per group. Total enrollment: 252 subjects.

Example 2: Cardiovascular Device Non-Inferiority Trial (Binary Endpoint)

A new transcatheter aortic valve replacement (TAVR) device is being tested against an approved valve. The primary endpoint is a composite of all-cause mortality, stroke, and moderate/severe paravalvular leak at 30 days.

Parameter | Value | Source
Expected event rate (control) | 15% | Pivotal trial of predicate
Expected event rate (investigational) | 13% | Planning assumption
Non-inferiority margin (delta) | 6 percentage points | Based on historical placebo-active margin and clinical judgment
Significance level (alpha) | 0.025 one-sided | Standard for non-inferiority
Power (1 - beta) | 85% | Regulatory expectation for Class III
Allocation ratio | 2:1 (investigational:control) | To gain more safety data on the new device
Expected dropout rate | 5% | Short follow-up

Calculation (with unequal allocation, r = 2 investigational per control):

n_control = (Z_{1-alpha} + Z_{1-beta})^2 x (p_t(1-p_t)/r + p_c(1-p_c)) / (p_c - p_t + delta)^2

With r = 2, p_c = 0.15, p_t = 0.13, delta = 0.06, Z_{0.975} = 1.96, Z_{0.85} = 1.036:

n_control = (1.96 + 1.036)^2 x (0.13 x 0.87 / 2 + 0.15 x 0.85) / (0.15 - 0.13 + 0.06)^2
          = 8.976 x (0.0566 + 0.1275) / 0.0064
          = 8.976 x 0.1841 / 0.0064
          = 258.2

Round up to 259 in the control arm. Investigational arm: 2 x 259 = 518. With 5% dropout: control = 259 / 0.95 = 273, investigational = 518 / 0.95 = 546. Total enrollment: 819 subjects.

This example illustrates how sensitive non-inferiority designs with binary endpoints are to the margin and the assumed true difference: the required sample scales with the inverse square of the distance between the assumed difference and the margin, so tighter margins quickly push enrollment into the thousands — a reality that drives many sponsors toward Bayesian borrowing or adaptive designs.

Example 3: Diagnostic Accuracy Study (Single-Arm)

A new AI-based diagnostic algorithm is being validated for detecting diabetic retinopathy from retinal images. The primary endpoint is sensitivity compared to a reference standard (expert ophthalmologist grading).

Parameter | Value | Source
Target sensitivity | 92% | Performance goal
Null hypothesis sensitivity | 85% | Minimum acceptable sensitivity
Significance level (alpha) | 0.05 one-sided | Standard for a one-sided performance-goal comparison
Power (1 - beta) | 80% | Standard
Disease prevalence in study population | 40% | Screening population

Calculation (exact binomial):

Using the exact binomial test, the sample size is determined by finding the smallest n such that if the true sensitivity is 92%, the probability of rejecting H0 (sensitivity less than or equal to 85%) is at least 80%.

Using normal approximation for planning:

With Z_{0.95} = 1.645 and Z_{0.80} = 0.842:

n_positive_cases = (Z_{1-alpha} x sqrt(p0(1-p0)) + Z_{1-beta} x sqrt(p1(1-p1)))^2 / (p1 - p0)^2
                 = (1.645 x sqrt(0.85 x 0.15) + 0.842 x sqrt(0.92 x 0.08))^2 / (0.92 - 0.85)^2
                 = (1.645 x 0.357 + 0.842 x 0.271)^2 / 0.0049
                 = (0.587 + 0.228)^2 / 0.0049
                 = 0.665 / 0.0049
                 = 135.8

Round up to 136 positive cases. Total subjects needed: 136 / 0.40 = 340 subjects (verified by exact calculation).
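The normal-approximation planning step mirrors this short standard-library Python sketch (function name ours; as the text notes, the final size should be confirmed against an exact binomial design):

```python
import math
from statistics import NormalDist

def one_sample_binary_n(alpha_one_sided, power, p0, p1):
    """Positive cases needed to show sensitivity p1 exceeds performance goal p0
    (normal approximation; confirm with an exact binomial calculation)."""
    z_a = NormalDist().inv_cdf(1 - alpha_one_sided)
    z_b = NormalDist().inv_cdf(power)
    num = (z_a * math.sqrt(p0 * (1 - p0)) + z_b * math.sqrt(p1 * (1 - p1))) ** 2
    return math.ceil(num / (p1 - p0) ** 2)

pos = one_sample_binary_n(0.05, 0.80, p0=0.85, p1=0.92)
print(pos, math.ceil(pos / 0.40))  # 136 positive cases, 340 subjects at 40% prevalence
```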

Example 4: PMCF Survey Sample Size (Precision-Based with Finite Population Correction)

A manufacturer needs to conduct a PMCF survey for a Class IIa surgical instrument with an installed user base of 500 surgeons across Europe. The objective is to estimate the proportion of surgeons who report satisfactory performance.

Parameter | Value | Source
Population size (N) | 500 | Sales records
Expected satisfaction rate | 85% | Previous survey data
Desired confidence interval half-width | 5% | PMCF plan specification
Confidence level | 95% | Standard

Calculation (with finite population correction):

Step 1 — Calculate sample size for infinite population:

n_infinite = Z_{1-alpha/2}^2 x p x (1-p) / W^2
           = 1.96^2 x 0.85 x 0.15 / 0.05^2
           = 3.842 x 0.1275 / 0.0025
           = 195.9

Round up to 196.

Step 2 — Apply finite population correction:

n_adjusted = n_infinite x N / (n_infinite + N - 1)
           = 196 x 500 / (196 + 500 - 1)
           = 98000 / 695
           = 141.01

Round up to 142. Required sample: 142 surgeons (out of 500, or 28.4% of the population).

This finite population correction significantly reduces the required sample when the population is small relative to the calculated sample size — a common situation in PMCF surveys for specialized devices.
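Both steps fit in one small helper (a sketch; the function name is ours):

```python
import math
from statistics import NormalDist

def survey_n(alpha, p, half_width, population=None):
    """Precision-based survey sample size; applies the finite population
    correction when a finite population size N is given."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n = math.ceil(z ** 2 * p * (1 - p) / half_width ** 2)
    if population is not None:
        n = math.ceil(n * population / (n + population - 1))
    return n

print(survey_n(0.05, p=0.85, half_width=0.05))                  # infinite population: 196
print(survey_n(0.05, p=0.85, half_width=0.05, population=500))  # with FPC: 142
```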

Common Mistakes and Regulatory Findings

Using Effect Sizes from Overly Optimistic Pilot Data

Pilot studies often produce inflated effect sizes due to small sample variability and selection bias. Using these estimates directly in the pivotal sample size calculation leads to underpowered trials. The effect size for sample size calculation should be the clinically meaningful difference, not the effect size observed in the pilot. If pilot data is used, consider discounting the observed effect by 20-30% or using the lower bound of the confidence interval.

Ignoring Clustering in Multi-Center Trials

When outcomes are correlated within sites (which is common in surgical device trials), the effective sample size is reduced by the design effect. Failing to account for clustering results in underpowered studies and potentially misleading confidence intervals. The intraclass correlation coefficient should be estimated from prior data or literature and incorporated into the sample size calculation.

Not Justifying the Non-Inferiority Margin

The non-inferiority margin is the most critical design parameter in a non-inferiority trial and the most frequently questioned by regulators. A margin that is too wide makes the trial easy to pass but clinically meaningless; a margin that is too narrow makes the trial impractically large. The margin must be justified using clinical reasoning (what degree of inferiority is acceptable given the new device's advantages) and statistical reasoning (the margin must preserve a fraction of the active control's effect over placebo, demonstrated through historical meta-analysis).

Failing to Account for Intercurrent Events

Under the ICH E9(R1) estimand framework, intercurrent events such as treatment discontinuation, device removal, or use of rescue therapy must be addressed in the estimand definition. If the primary analysis is a per-protocol analysis excluding subjects with intercurrent events, the sample size must account for the proportion of subjects expected to be excluded. If the analysis uses a treatment-policy strategy (including all subjects regardless of intercurrent events), the effect size may be attenuated relative to an idealized scenario, requiring a larger sample.
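For the per-protocol case, the enrollment inflation is simple division; a sketch with a hypothetical 15% expected exclusion rate:

```python
from math import ceil

def inflate_for_exclusions(n_analyzable: int, p_excluded: float) -> int:
    """Enrollment needed so that, after excluding the expected proportion of
    subjects with intercurrent events, n_analyzable remain: n / (1 - p)."""
    return ceil(n_analyzable / (1 - p_excluded))

# Hypothetical per-protocol analysis needing 200 analyzable subjects,
# with 15% expected to be excluded (device removal, rescue therapy).
print(inflate_for_exclusions(200, 0.15))  # 236 enrolled
```

The same helper covers dropout inflation; the key is that the denominator reflects every pre-specified reason a subject can leave the analysis set.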

Underpowered Safety Endpoints

Most sample size calculations are based on the primary efficacy endpoint. However, regulators also expect adequate data on safety — particularly for known or anticipated adverse events. If the primary endpoint sample size provides fewer than 3-5 expected events for a key safety endpoint, the safety data will be uninformative. Consider inflating the sample size or extending follow-up if critical safety signals cannot be reliably characterized with the efficacy-based sample size.
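A quick back-of-the-envelope check, with hypothetical inputs (n = 150 from the efficacy calculation, a 1% true event rate), shows why such a sample is uninformative for rare safety events:

```python
# Can an efficacy-driven sample characterize a 1% safety event rate?
n, rate = 150, 0.01

expected_events = n * rate             # 1.5 expected events -- below the 3-5 threshold
p_at_least_one = 1 - (1 - rate) ** n   # probability of observing any event at all
rule_of_three = 3 / n                  # ~95% upper bound on the rate if 0 events are seen

print(expected_events, round(p_at_least_one, 3), round(rule_of_three, 3))
```

With 1.5 expected events there is a better-than-1-in-5 chance of observing no events at all, and even a clean run only bounds the rate below 2% (the "rule of three"), far too coarse for most risk-benefit discussions.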

Other Frequent Findings

  • Unjustified variability assumptions: Using a standard deviation from a different population, device type, or endpoint without adjustment
  • Ignoring multiplicity: Conducting multiple primary comparisons without adjusting alpha, then cherry-picking significant results
  • Inadequate documentation: Failing to present the complete sample size derivation with all inputs, formulas, and references in the statistical analysis plan
  • Post-hoc sample size revision: Changing the sample size after unblinded interim results without pre-specified adaptive design rules

Emerging Approaches: AI-Assisted Sample Size Planning

Artificial intelligence is beginning to augment traditional sample size estimation. The ClinicalReTrial framework (Xing et al., 2026) uses AI agents to iteratively analyze trial protocols and suggest modifications — including sample size optimization — to improve success probability. TrialBench (Nature Scientific Data, 2025) provides multi-modal AI-ready datasets enabling machine learning-based power estimation from historical trial data.

For medical device sponsors, these tools are not yet a replacement for formal biostatistical analysis, but they signal a direction of travel. The FDA, MHRA, and Health Canada's "Good Machine Learning Practice" principles already include provisions addressing sample size adequacy for AI/ML-based endpoints. A 2025 analysis in The Lancet Digital Health noted that most AI studies in healthcare still lack formal sample size justification — an important gap as AI-derived endpoints become more common in device submissions.

FDA's 2025 guidance on efficacy evaluation in small patient populations also endorses novel trial designs — including Bayesian borrowing and n-of-1 designs — supported by biomarkers and natural history data, rather than requiring large randomized controlled trials. For devices targeting rare conditions or narrow biomarker-defined populations, this represents a meaningful pathway to smaller, more efficient studies.


Key Takeaways

  • Sample size is a regulatory and ethical requirement, not a suggestion. EU MDR Annex XV, ISO 14155:2026, and FDA IDE guidance all require documented statistical justification. Underpowered studies risk regulatory rejection and raise ethical concerns about exposing patients to investigational risk without the possibility of meaningful conclusions.

  • Define the estimand before calculating sample size. Following the ICH E9(R1) framework, specify the population, endpoint, handling of intercurrent events, and population-level summary before choosing a formula. This prevents the mismatch between the regulatory question and the statistical analysis.

  • Choose the design first, then the formula. Superiority, non-inferiority, equivalence, and single-arm designs require fundamentally different sample size calculations. Selecting the wrong formula is an irrecoverable error.

  • Justify every input clinically and statistically. The significance level, power, effect size, and variability must all be supported by clinical rationale and literature references. The non-inferiority margin requires particular scrutiny and a dedicated justification section.

  • Account for real-world realities. Inflate for dropout. Adjust for multiplicity. Consider clustering in multi-center trials. Incorporate spending functions for interim analyses. A calculation that works on paper but ignores these factors will fail in practice.

  • Bayesian methods can reduce sample size when prior data is strong. The FDA has approved more than 40 devices using Bayesian approaches. Consider this option when historical data is available, populations are rare, or adaptive designs are appropriate.

  • Pilot studies are for feasibility, not efficacy. Size pilot studies based on precision of parameter estimates (12-20 per group), and use the results to inform — not dictate — the pivotal trial sample size calculation.
