3.3. TB skin tests and interferon gamma release assays for the diagnosis of TB disease

Recommendation

The Guideline Development Group (GDG) concluded that both the sensitivity and the specificity of IGRAs in detecting active TB among individuals presumed to have TB were suboptimal, and that the quality of the evidence was low. The group also recommended that these tests not be used as a replacement for conventional microbiological diagnosis of pulmonary and extrapulmonary TB.

The Guideline Development Group noted that current evidence did not support the use of IGRAs or the TST as part of the diagnostic work-up of adults presumed to have active TB, irrespective of HIV status. This recommendation placed a high value on avoiding the consequences of unnecessary treatment (owing to a high number of false positive results), given the low specificity of IGRAs and the TST in these settings.

Evidence base

A systematic, structured, evidence-based process for TB diagnostic policy generation was followed. The first step comprised systematic reviews and meta-analyses of available data (published and unpublished), using standard methods appropriate for diagnostic accuracy studies. The second step involved convening a GDG to evaluate the strength of the evidence base, weigh the risks and benefits of using IGRAs in LMIC, and identify gaps to be addressed in future research. Based on the GDG findings, the third and final step involved the development of WHO policy guidance, with eventual dissemination to WHO Member States for implementation.

The GRADE system, adopted by WHO for all policy and guideline development, was used by the GDG. Given the absence of studies evaluating patient-important outcomes among TB suspects randomized to treatment based on IGRA results, reviews were focused on the diagnostic accuracy of IGRAs versus the TST in detecting TB infection or TB disease. Recognizing that test results may be surrogates for patient-important outcomes, the GDG evaluated the accuracy of IGRAs while also drawing inferences on the likely impact of these tests on patient outcomes, as reflected by false negatives (i.e. cases of TB infection missed) or false positives.

Systematic reviews were undertaken following detailed protocols with predefined questions relevant to the individual topics. Summaries of methodologies followed for each topic are given in the relevant sections below.

PICO questions

What is the diagnostic accuracy of commercial IGRAs for pulmonary TB in adult pulmonary TB suspects and confirmed TB cases in LMIC as compared with microbiological (culture or smear-microscopy) or clinical diagnosis of pulmonary TB?

Hierarchy of reference standards

Studies evaluating the performance of IGRAs are hampered by the lack of a gold standard to distinguish the presence or absence of TB infection. Since diagnostic accuracy for TB infection could not be directly assessed, a hierarchy of reference standards was developed and agreed beforehand with the systematic reviewers, to evaluate the role of IGRAs, depending on the individual topic (i.e. not all systematic reviews necessarily used the hierarchy). Primary outcomes were predefined for each systematic review as relevant; for example, the predictive value of IGRAs for development of active TB, the sensitivity of IGRAs in individuals with culture-confirmed active TB (as a surrogate reference standard for TB infection), and the correlation between IGRA and TST results. In addition to primary outcomes, specific characteristics of IGRAs that could influence their overall utility were evaluated where relevant. Examples are the proportion of indeterminate IGRA results (i.e. results that cannot be interpreted, owing to either a high IFN-γ response in the negative control or a low IFN-γ response in the positive control); the impact of HIV-related immunosuppression (i.e. CD4+ cell count) on test performance, where available; and the correlation of IGRA results with an exposure gradient (typically used in contact and outbreak investigations).

Studies search, selection and quality assessment

All studies evaluating IGRAs published up to the end of May 2010 were reviewed using predefined data search strings. In addition to database searches, bibliographies of reviews and guidelines were reviewed, citations of all included studies were screened, and experts in the field as well as IGRA manufacturers were contacted to identify additional studies (published, unpublished and ongoing). Pertinent information not reported in the original publications was requested from the primary authors of all studies included by the systematic reviewers.

Studies that evaluated the performance of currently available commercial IGRAs, published in all languages and in all LMIC, were reviewed by individual topic. Only studies evaluating IGRA performance in LMIC were included in this analysis. Excluded were studies that evaluated noncommercial (i.e. in-house) IGRAs, older generation IGRAs (i.e. PPD-based IGRAs) and IGRAs performed in specimens other than blood; studies that were focused on the effect of anti-TB treatment on the IGRA response; studies including fewer than 10 individuals; studies reporting insufficient data to determine diagnostic accuracy measures; conference abstracts and letters without original data; and reviews.

Study quality was assessed by relevant standardized methods, depending on the topic. For primary outcomes focused on test accuracy, quality was appraised using a subset of relevant criteria from QUADAS, a validated tool for diagnostic accuracy studies. For studies of the predictive value of IGRAs, quality was appraised with a modified version of the Newcastle-Ottawa Scale (NOS) for longitudinal or cohort studies. Conflicts of interest are a known concern in TB diagnostic studies; therefore, the systematic reviews added a quality item about involvement of commercial test manufacturers in published studies; they also reported whether IGRA manufacturers had any involvement with the design or conduct of each study, including donation of test materials, provision of monetary support, work or financial relationships with study authors, and participation in data analysis.

Data synthesis and meta-analysis

A standardized overall approach was specified a priori for each systematic review, to account for the significant heterogeneity in results expected between studies. First, data were synthesized separately for each commercial IGRA and by the World Bank country income classification (LMIC versus high-income countries) as a surrogate for TB incidence. Second, heterogeneity was visually assessed using forest plots, and the variation in study results attributable to heterogeneity was characterized (I-squared statistic) and statistically tested (chi-squared test). Third, pooled estimates were calculated using random-effects modelling, which provides more conservative estimates than fixed-effects modelling when heterogeneity is present. For each individual study, all outcomes for which data were available were assessed. Forest plots were generated to display the individual study estimates and their 95% CIs. Pooled estimates were calculated when at least four studies were available in any subgroup, and individual study results were summarized when fewer than four studies were available. Standard statistical packages were used for analyses.
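
As an illustration of the pooling approach described above, the following minimal sketch pools study-level proportions (e.g. sensitivity) with a random-effects model and reports the I-squared statistic. The text does not name the estimator or software used by the reviewers; the DerSimonian-Laird method, the logit transformation, the continuity correction of 0.5 and the study counts shown are all assumptions chosen for illustration only.

```python
import math

def pool_random_effects(events, totals):
    """Pool proportions on the logit scale (DerSimonian-Laird; an assumed method)."""
    # A 0.5 continuity correction (assumption) keeps logits finite at 0% or 100%.
    logits, variances = [], []
    for e, n in zip(events, totals):
        a, b = e + 0.5, n - e + 0.5
        logits.append(math.log(a / b))
        variances.append(1.0 / a + 1.0 / b)

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, logits)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, logits))
    df = len(logits) - 1

    # I-squared: share of total variation attributable to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Between-study variance (tau-squared), truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau-squared to each within-study variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled_logit = sum(wi * yi for wi, yi in zip(w_re, logits)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    def inv_logit(x):
        return 1.0 / (1.0 + math.exp(-x))

    return {
        "pooled": inv_logit(pooled_logit),
        "ci_low": inv_logit(pooled_logit - 1.96 * se),
        "ci_high": inv_logit(pooled_logit + 1.96 * se),
        "q": q,
        "df": df,
        "i2": i2,
        "tau2": tau2,
    }

# Hypothetical counts: IGRA-positive results among culture-confirmed cases.
hypothetical_events = [45, 20, 40, 15]
hypothetical_totals = [50, 50, 50, 50]
summary = pool_random_effects(hypothetical_events, hypothetical_totals)
print({key: round(value, 3) for key, value in summary.items()})
```

In this sketch, Cochran's Q would be compared against a chi-squared distribution with `df` degrees of freedom to test for heterogeneity. When the between-study variance is greater than zero, the random-effects interval is wider than the fixed-effect interval, which is the sense in which the estimates are more conservative.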

Use of IGRAs in the diagnosis of active TB

Studies included were those that evaluated the performance of the technologies of interest for the diagnosis of TB disease among adults (>15 years) with presumed TB or people with TB in LMIC.

The initial search yielded 789 citations. After full-text review of 185 papers evaluating IGRAs for the diagnosis of active TB, 22 were determined to meet eligibility criteria, covering 33 unique evaluations of one or more IGRAs (hereafter referred to as studies) in 19 published and three unpublished reports. Of the 33 studies, 10 (30%) were from low-income countries and 23 (70%) were from middle-income countries. Seventeen studies (52%) included PLHIV (n=1057), and 27 studies (82%) involved ambulatory subjects (outpatients as well as hospitalized patients). IGRAs were performed in people suspected of having active TB in 19 studies (58%) and in people with known active TB in 14 studies (42%). Because of the focus on diagnostic accuracy for active TB and the high prevalence of TB infection in high TB burden settings, IGRA specificity was estimated exclusively among studies enrolling TB suspects where the diagnostic work-up ultimately showed no evidence of active disease.

The results demonstrated the following in LMIC:

  • The sensitivity of IGRAs in detecting active TB among people suspected of having TB ranged from 73% to 83% and specificity from 49% to 58%. Therefore, on average, about one in four patients with culture-confirmed active TB could be expected to be IGRA-negative in LMIC, with serious consequences for patients in terms of morbidity and mortality.
  • There was no evidence that IGRAs have added value beyond conventional microbiological tests for the diagnosis of active TB. Among studies that enrolled TB suspects (i.e. patients with diagnostic uncertainty), both IGRAs demonstrated suboptimal “rule-out” values for TB disease.
  • Even though data were limited, the sensitivity of both IGRAs was lower among PLHIV (about 60–70%), suggesting that nearly one in three PLHIV with active TB would be IGRA-negative.
  • There was no consistent evidence that either of the two IGRAs was more sensitive than the TST for active TB diagnosis, although comparisons with pooled estimates of TST sensitivity were difficult to interpret owing to substantial heterogeneity.
  • The few available head-to-head comparisons between QFT-GIT and T-Spot demonstrated higher sensitivity for the T-Spot platform, although this difference did not reach statistical significance.
  • The specificity of both IGRAs for active TB was low, regardless of HIV status, and results suggested that one in two patients without active TB would be IGRA-positive, with adverse consequences for patients because of unnecessary therapy for TB and a missed differential diagnosis.
  • Two unpublished reports found no incremental or added value of IGRA test results combined with important baseline patient characteristics (e.g. demographics, symptoms or chest radiograph findings). Thus, these reports did not support a meaningful contribution of IGRAs for the diagnosis of active TB beyond readily available patient data and conventional tests.
  • The systematic review focused on the use of IGRAs to diagnose active pulmonary TB, given that data for extrapulmonary TB were lacking; nevertheless, the GDG consensus was that recommendations for pulmonary TB could reasonably be extrapolated to extrapulmonary TB.
  • Industry involvement was unknown in 18% of studies and acknowledged in 27% of studies, including donation of IGRA kits as well as work or financial relationships between authors and IGRA manufacturers.
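
The link between the accuracy ranges above and the "one in four" and "one in two" statements can be shown with simple arithmetic. The following minimal sketch converts sensitivity and specificity into expected false negative and false positive counts, and into predictive values, for a cohort of people evaluated for TB. The sensitivity and specificity values are the range limits reported above; the cohort size of 1000 and the 20% prevalence of active TB are hypothetical values chosen only for illustration, not figures from the systematic review.

```python
def expected_outcomes(sensitivity, specificity, prevalence, cohort=1000):
    """Expected misclassifications and predictive values for a tested cohort."""
    with_tb = cohort * prevalence
    without_tb = cohort - with_tb

    true_pos = with_tb * sensitivity
    false_neg = with_tb - true_pos        # people with TB who test negative
    true_neg = without_tb * specificity
    false_pos = without_tb - true_neg     # people without TB who test positive

    return {
        "false_negatives": false_neg,
        "false_positives": false_pos,
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
    }

# Hypothetical prevalence of active TB among those evaluated (assumption).
ASSUMED_PREVALENCE = 0.20

# Range limits for sensitivity and specificity as reported in the text.
for label, sens, spec in [("lower limits", 0.73, 0.49),
                          ("upper limits", 0.83, 0.58)]:
    result = expected_outcomes(sens, spec, ASSUMED_PREVALENCE)
    print(label, {key: round(value, 2) for key, value in result.items()})
```

The false negative and false positive fractions depend only on sensitivity and specificity, whereas the predictive values change with the assumed prevalence, so the "rule-out" value of a negative result would differ between settings.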

Strengths and limitations of the evidence base

Strengths and limitations were as follows:

  • Heterogeneity was substantial for the primary outcomes of sensitivity and specificity. The measures taken to limit its effect were empirical random-effects weighting, exclusion of studies contributing fewer than 10 eligible individuals, and separate synthesis of data for currently manufactured IGRAs.
  • No standard criteria exist for defining high TB incidence countries, and the World Bank income classification is an imperfect surrogate for national TB incidence; nevertheless, results were fundamentally unchanged when restricted to countries with an arbitrarily chosen annual TB incidence of at least 50 per 100 000 population.
  • It is possible that ongoing studies were missed, despite systematic searching. It is also possible that studies that found poor IGRA performance were less likely to be published. Given the lack of statistical methods to account for publication bias in diagnostic meta-analyses, it would be prudent to assume that the accuracy estimates are overestimated to some degree because of publication bias.
  • The systematic review focused on test accuracy (i.e. sensitivity and specificity) and indirect assessment of patient impact (false positive and false negative results). None of the studies reviewed provided information on patient-important outcomes (i.e. showing that IGRAs used in a given situation resulted in a clinically relevant improvement in patient care or outcomes). In addition, no information was available on the values and preferences of patients.

Data synthesis was structured around the preset PICO question, as outlined above. Web Annex I provides additional information on evidence synthesis and analysis.

Operational aspects of the use of IGRAs

Operational aspects of the use of IGRAs were as follows:

  • Four studies mentioned the cost of IGRAs; all stated that the assays are too expensive and that this is a limitation to their use.
  • Only one study addressed reproducibility of T-Spot by assessing inter-observer agreement; it showed excellent correlation. No other study mentioned the issue of test reproducibility.
  • Twelve studies reported on accepted transport times of samples to the laboratory, which were mainly less than 6 hours (i.e. within the limit accepted by the test manufacturers). One study accepted a transport time of 16 hours and another 24 hours. None reported on the impact of transport time (i.e. the delay between drawing the blood and initiating the IGRA test) on IGRA test results or performance.
  • No study reported on time-to-result for IGRAs.
  • Four studies reported on the impact of IGRAs on TB therapy. In two studies, IGRA results were reported to clinicians; one study did not discuss the consequences, and in the other study QFT-positive children and adolescents received preventive chemotherapy. The other two studies commented on the reduced number of patients that would require preventive therapy if IGRAs were part of the diagnostic algorithm.
  • The following aspects related to the feasibility of IGRAs were highlighted:
    • blood amounts required may be an issue; however, tests were performed with less than 2 mL of blood (T-Spot) in some studies;
    • a strong interferon response in negative control tubes (high background results) in QFT may reflect the influence of other coincident diseases;
    • standardization and generation of automated, quantitative results should render IGRAs more objective than the TST; and
    • a well-equipped laboratory, expensive equipment and training are required for IGRA test performance, which may cause logistical problems.

Research priorities

Targeted further research to identify IGRAs with improved accuracy is strongly encouraged. Such research should be based on adequate study design, including quality principles such as representative suspect populations, prospective follow-up, and adequate and explicit blinding. It is also strongly recommended that proof-of-principle studies be followed by evidence produced from prospectively implemented and well-designed evaluation and demonstration studies, including assessment of patient impact.
