Developing diagnostic tests

In many clinical decisions, the most ready source of additional information is diagnostic testing. Diagnostic tests include not only laboratory tests, but other sources of information about diagnosis, such as history and physical examination. Patients (and indeed, many physicians), however, do not understand how diagnostic tests are developed or how to determine the value of the information they provide.

Suppose that we want to know if someone has an active H. pylori infection without forcing them to undergo invasive endoscopic testing. Instead, we perform a ¹³C-urea breath test (UBT) that measures the change in the ratio of ¹³CO₂ to ¹²CO₂ (denoted by δ) exhaled by the patient 30 minutes after ingestion of ¹³C-urea compared with the ratio before ingestion. Because H. pylori hydrolyzes ¹³C-urea to ¹³CO₂, the resulting change value, Δδ, tends to be higher in patients with H. pylori infection than in the average uninfected patient, but there’s natural variation in both groups, and it’s not impossible that a healthy patient could have a high Δδ. The figure below illustrates the situation.

Figure: Distribution of Δδ for infected (bell curve on the right of the figure) and uninfected (bell curve on the left of the figure) patients. Both infected and uninfected patients may have Δδ as low as 3 or as high as 15. Adapted from Figure 1 of Herold and Becker (2002), BMC Gastroenterology 2:12, with permission of the BioMed Central Open Access license agreement (http://www.biomedcentral.com/info/authors/license).

The UBT is thus an imperfect test, because there is the potential for error. A test which defines the disease (in the way that, for example, bacterial growth on a culture defines bacterial infection) and thus has, in principle, perfect discrimination, is often referred to as a reference standard or gold standard test.

Test thresholds

As the Δδ gets higher, however, the person is more likely to be infected, and as it gets lower, they’re more likely to be healthy. And our goal is to treat the sick differently than the healthy; for example, to prescribe a proton pump inhibitor and antibiotics if we are sufficiently convinced that the patient is, in fact, suffering from H. pylori infection. This implies that we need a criterion, or threshold Δδ, above which we will act in one way (e.g., start drug therapy), and below which we will act in a different way (e.g., watch and wait). A threshold would be graphically represented by drawing a vertical line at the threshold Δδ; patients whose Δδ is higher than the threshold are treated as infected, and those whose Δδ is lower are treated as healthy.

Because of the overlap between the distributions of Δδ (that is, because of the imperfect discriminative power of Δδ in this example), no threshold can accurately classify every patient. Whatever criterion we set for calling someone infected or healthy based on this test, there will be some people rightly classified as infected or healthy, and some people wrongly classified as infected or healthy.
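To make the four possible outcomes concrete, here is a minimal Python sketch of applying a threshold and tallying the results. The patient values, the threshold of 8, and the names `classify`, `tally_outcomes`, and `PATIENTS` are all made up for illustration; none of them come from the study behind the figure.

```python
from collections import Counter


def classify(delta_change, threshold):
    """Call a patient infected when the change value exceeds the threshold."""
    return delta_change > threshold


def tally_outcomes(patients, threshold):
    """Count right and wrong classifications for (change value, truly infected) pairs."""
    counts = Counter()
    for delta_change, truly_infected in patients:
        called_infected = classify(delta_change, threshold)
        if called_infected and truly_infected:
            counts["rightly called infected"] += 1
        elif called_infected:
            counts["wrongly called infected"] += 1
        elif truly_infected:
            counts["wrongly called healthy"] += 1
        else:
            counts["rightly called healthy"] += 1
    return counts


# Made-up (change value, truly infected?) pairs, for illustration only.
PATIENTS = [
    (1.0, False), (4.0, False), (9.0, False),
    (6.0, True), (12.0, True), (20.0, True),
]

if __name__ == "__main__":
    for label, count in sorted(tally_outcomes(PATIENTS, threshold=8.0).items()):
        print(f"{label}: {count}")
```

With these invented values and a threshold of 8, one healthy patient is wrongly called infected and one infected patient is wrongly called healthy; moving the threshold changes which of the two kinds of mistake occurs.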

The threshold determines the kind of error we are likely to make. The higher the Δδ threshold we require in order to call someone infected (graphically, the farther to the right we draw the vertical line), the more we’ll wrongly classify infected people as healthy; these errors are called false negatives. For example, if we set the threshold at Δδ=18, we will almost never wrongly classify a healthy person, but about half of those infected will be misclassified as false negatives (and potentially remain untreated for their infection).

Conversely, the lower the Δδ we require, the more we’ll wrongly classify healthy people as infected; these are false positives. For example, if we set the threshold at Δδ=2, we will almost never wrongly classify an infected person, but almost half of those not infected will be misclassified as false positives (and potentially undergo unnecessary treatment).

Changing the threshold will always either increase false positives and decrease false negatives or vice versa. Only improving the discriminative power can lower both false positives and false negatives.
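The tradeoff can be sketched numerically. The Python snippet below assumes, purely for illustration, that Δδ follows a normal distribution in each group; the means and standard deviations are hypothetical values picked so that thresholds of 2 and 18 behave roughly as in the examples above, not estimates from the Herold and Becker data. The function name `error_rates` is likewise just a label for this sketch.

```python
from statistics import NormalDist

# Hypothetical distributions of the change value in each group (assumed, not measured).
UNINFECTED = NormalDist(mu=2.0, sigma=4.0)
INFECTED = NormalDist(mu=18.0, sigma=6.0)


def error_rates(threshold):
    """Return (false positive rate, false negative rate) for a given threshold.

    False positives: uninfected patients whose value falls above the threshold.
    False negatives: infected patients whose value falls at or below it.
    """
    false_positive_rate = 1.0 - UNINFECTED.cdf(threshold)
    false_negative_rate = INFECTED.cdf(threshold)
    return false_positive_rate, false_negative_rate


if __name__ == "__main__":
    for threshold in (2, 6, 10, 14, 18):
        fp, fn = error_rates(threshold)
        print(f"threshold={threshold:>2}  false positives={fp:.3f}  false negatives={fn:.3f}")
```

Under these assumptions, each step up in the threshold lowers the false positive rate and raises the false negative rate; no threshold drives both to zero, because the two curves overlap.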

Choosing thresholds

Unfortunately, we generally can’t improve the discriminative power of a given test; we have to develop new and more discriminative tests (or variations of tests). But the choice of the threshold is arbitrary, in the sense that it is not fixed by the test itself. A threshold may be recommended by the test developer or by guidelines on the use of the test. These thresholds should be chosen on the basis of the purpose of the test and the consequences of false positives and false negatives.

For example, consider a rapid strep antigen test for strep throat. A false positive on this test results in a patient receiving an unnecessary dose of antibiotics for a few days; a false negative results in a patient with an untreated bacterial infection for a few days (until the results of the throat culture, a gold standard test, are available). The general consensus among physicians has been that a few days of unnecessary antibiotics is generally preferable to missing a bacterial infection for a few days, but not so preferable that antibiotics should be routinely started in all patients. Accordingly, the kits are developed to have a relatively low, but not very low, threshold for positive results.

When the cost of a false negative is much greater, tests may have a very low threshold. A 17-year-old with unknown vaccination history who presents to the emergency department with high fever and possible neck stiffness is very likely to receive immediate presumptive treatment for meningitis. Although the probability of bacterial meningitis is quite low, the consequences of missing a case are so severe that a marginally positive finding on a test with low discrimination (neck stiffness) is sufficient to warrant the relatively benign treatment.

In general, when noninvasive and inexpensive tests are used to screen a population for a serious condition, the goal of testing is to broadly identify individuals who may be at higher risk for the condition and refer them for confirmatory testing or other evaluation. Screening tests, therefore, are usually designed to have very few false negatives, accepting a larger number of false positives in order to ensure that high-risk cases are not missed.

On the other hand, when the treatment is invasive and the cost of the disease is low or when the primary aim of the test is to provide reassurance that a patient does not have a serious condition, false positives may be a much greater concern than false negatives. A high threshold is required of a test for carpal tunnel syndrome if the treatment contemplated is open carpal tunnel release surgery.
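One way to formalize the idea that the threshold should reflect the consequences of each kind of error is to pick the threshold with the lowest expected cost per patient tested. The Python sketch below is only an illustration of that reasoning: the normal distributions, the prevalence, and the relative costs are all hypothetical, and `expected_cost` and `best_threshold` are names invented here rather than an established method from the text.

```python
from statistics import NormalDist

# Hypothetical distributions of the test value in each group (assumed, not measured).
UNINFECTED = NormalDist(mu=2.0, sigma=4.0)
INFECTED = NormalDist(mu=18.0, sigma=6.0)


def expected_cost(threshold, prevalence, cost_fp, cost_fn):
    """Average cost of misclassification per patient tested at this threshold."""
    false_positive_rate = 1.0 - UNINFECTED.cdf(threshold)
    false_negative_rate = INFECTED.cdf(threshold)
    return ((1.0 - prevalence) * false_positive_rate * cost_fp
            + prevalence * false_negative_rate * cost_fn)


def best_threshold(prevalence, cost_fp, cost_fn, candidates=None):
    """Pick the candidate threshold with the lowest expected cost (simple grid search)."""
    if candidates is None:
        # Assumed search grid: -10.0 to 30.0 in steps of 0.1.
        candidates = [t / 10 for t in range(-100, 301)]
    return min(candidates, key=lambda t: expected_cost(t, prevalence, cost_fp, cost_fn))


if __name__ == "__main__":
    # Screening-like setting: a missed case is assumed 20 times as costly as a false alarm.
    print("costly false negatives:", best_threshold(0.3, cost_fp=1.0, cost_fn=20.0))
    # Invasive-treatment setting: a false alarm is assumed 20 times as costly as a miss.
    print("costly false positives:", best_threshold(0.3, cost_fp=20.0, cost_fn=1.0))
```

Under these assumptions, the setting with costly false negatives yields a lower threshold than the setting with costly false positives, mirroring the screening and surgery examples above. In practice the costs are rarely known this precisely, so a calculation like this is an aid to judgment, not a replacement for it.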

Making Medical Decisions

The blog for the forthcoming book "Medical Decision Making: A Physician's Guide" by Alan Schwartz and George Bergus (Cambridge University Press, 2008)
