NIH Toolbox Cognition Battery Tested in Individuals with Disabilities

In 2020, Neurology published an article titled “Validation of the NIH Toolbox Cognitive Battery in intellectual disability.” Findings from this work demonstrate the feasibility, test-retest reliability, and convergent validity of the NIH Toolbox Cognition Battery among children and young adults with fragile X syndrome, Down syndrome, and other intellectual disabilities. This important work will pave the way for future research on the possible benefits of drugs, specialized educational programs, and other treatments for people with intellectual disabilities.

Substantial qualitative and quantitative evidence supports the validity of NIH Toolbox® measures.

Have the NIH Toolbox measures been validated?

Though measures are often described as being “valid” or having been “validated,” what is most important is thinking about and understanding how the measure will be used (its purpose for the user). It is the use of the scores, not the measure itself, which must be considered when evaluating whether a measure is valid for its intended purpose. As a simple example, scores on a test of math computation would not likely be valid for diagnosing a reading disability, but they could be valid in identifying students who could benefit from math tutoring. The math test has been “validated,” but only for one of the two purposes listed. The same can be said for NIH Toolbox measures. The role of validity analyses is to collect a body of evidence that reveals for what and with whom the scores on measures are valid.

What evidence is there for the validity of NIH Toolbox measures?

Substantial qualitative and quantitative evidence has been gathered that supports the validity of NIH Toolbox measures. It is important to remember that the NIH Toolbox encompasses a broad set of measures across many functional domains, including Cognition, Motor, Sensation, and Emotion. Validity evidence for these measures has been gathered in ways that are appropriate for the type and mode of measurement.

Content Validity

As Anastasi stated, “content validity is built into a test from the outset through the choice of appropriate items.”1 Although the NIH Toolbox comprises a variety of measures covering a broad range of functioning and using many unique types of questions and performance tasks, a common approach to establishing content validity was used. For each NIH Toolbox measure across every domain and content area, panels of experts were convened to make recommendations about the appropriate content to be assessed. Moreover, a systematic review of the literature was conducted in each domain and content area to identify relevant and appropriate constructs for measurement with the NIH Toolbox. Once this content “blueprint” was in place, the expert panels developed new (or identified existing) items or tasks to build each measure. Additional external subject matter experts were sought during the measure development process to vet the items and tasks for quality, for their match to the intended constructs, and for their appropriateness for important subgroups (e.g., young children, older adults), and to ensure that items did not unintentionally disadvantage any population subgroups or raise sensitivity concerns. Once items/tasks for measures were selected or developed, each measure underwent significant additional evaluation for validity. Details on content validity for each NIH Toolbox measure have been published.

1. Anastasi A. (1988). Psychological Testing. New York: Macmillan Publishing Company, pp. 122-127.

Construct Validity Evidence

In selected cases, NIH Toolbox adopted an existing measure for a specific content area. In such cases, strong evidence of the measure’s validity for use in large-scale research or clinical trials had already been established and published. In other cases, where NIH Toolbox content was newly developed or significantly adapted from existing items, formal studies were conducted to evaluate each measure’s construct validity. Concurrent validity is typically established by comparing a carefully drawn sample’s performance on a new measure (the “experimental” measure) with the same sample’s performance on other, well-established measures of the same construct(s) (sometimes referred to as “gold standard” measures because of their common use and acceptance). If one can establish that the new measure is sufficiently correlated with a “gold standard,” one can reasonably assume that the new measure can also be used effectively with the population(s) on which the well-established measure was used. NIH Toolbox measures performed well in such validity evaluations.
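The comparison described above can be illustrated with a minimal Python sketch. It is not the analysis used in any published NIH Toolbox validation study: the helper name pearson_r, the variable names, and the scores are all invented for illustration, and actual studies may rely on other statistics and far larger samples. The sketch shows only the basic idea of correlating the same participants' scores on a new measure and on a "gold standard" measure.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores.

    Hypothetical helper for illustration only.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for eight hypothetical participants (not real data).
# Each position holds one participant's score on both measures.
new_measure_scores = [92, 101, 85, 110, 97, 104, 88, 115]
gold_standard_scores = [95, 99, 88, 108, 94, 107, 90, 112]

r = pearson_r(new_measure_scores, gold_standard_scores)
print(f"Correlation between new and gold-standard scores: r = {r:.2f}")
```

A high positive correlation in such a comparison is one piece of evidence for concurrent validity. What counts as "sufficiently correlated" depends on the construct, the sample, and the intended use of the scores.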

Additional construct validity evidence for NIH Toolbox measures varies by measure and domain, but includes factor analytic studies and comparisons of group performance by age (to ensure expected trajectories of performance). Detailed descriptions of NIH Toolbox validation studies have been published for all domains and content areas. In addition, a number of NIH Toolbox measures have additional published documentation of utility and validation for different ages in the normal, community-dwelling population. Examples of measures for which such evidence has been collected include the NIH Toolbox Grip Strength Test, the NIH Toolbox Flanker Inhibitory Control and Attention Test, the NIH Toolbox Dimensional Change Card Sort Test, and the NIH Toolbox Dynamic Visual Acuity Test.

Clinical Validity Evidence

Some initial studies have gathered additional validity evidence with clinical or rehabilitation population samples for a number of NIH Toolbox measures. These studies are important to ensure that scores can be reasonably interpreted in these target populations as well. For example, significant work has been done to validate the NIH Toolbox for use with people who have sustained a traumatic brain injury. A number of studies have been published that establish the validity of specific NIH Toolbox measures in special, targeted groups.

View the NIH Toolbox publications page.

Making the case for using NIH Toolbox

If you are considering one or more NIH Toolbox measures (or domain batteries) for use in a clinic or a study, there is substantial evidence to evaluate. The questions you ask about that evidence can also serve as the framework for supporting your choice to others (e.g., administrators, granting agencies). Here are some questions you should consider:

  • Why is it important to measure this construct or these constructs in my study or clinic? Describe the relevance of the symptom or outcome to the population of interest.
  • What psychometric evidence has accumulated when this measure was used in my targeted population? If you are not able to find a study in your population, weigh the evidence that exists for the measure across populations.
  • What are the alternatives to the NIH Toolbox measures? State clearly why you believe an NIH Toolbox measure is a good choice, particularly for your population and for your particular purpose. Remember, validity resides in the use of the scores.