How to Evaluate a Test

To evaluate a test, one must understand two basic concepts: test reliability and validity. Reliability is about how consistently a test produces scores when measuring the construct in question, whereas validity is about how well the test measures what it purports to measure in a particular context. A good test has both high reliability and high validity.

Reliability

In daily life, reliability usually refers to something positive, as when we say that a person is reliable or that an airline's flight schedule is reliable. In the psychometric context, reliability is about consistency, not necessarily consistently good or bad. It is defined as "the degree of dependability, consistency, or stability of scores on a measure used in selection research" (Gatewood, Feild, & Barrick, 2008). In other words, it is about how consistently the test measures a particular characteristic: if a person takes the same test twice, will the results be similar or different? A reliable test should give similar results when it is retaken within a reasonable period of time.

When we use a test to measure a construct, e.g., numerical reasoning ability, conscientiousness, or interpersonal communication skills, we are trying to obtain the closest possible estimate of the true score. A "true" score is the *true* numerical reasoning ability (for example) of an individual, which we want to know but cannot directly observe. This true ability largely determines how the individual performs on numerical reasoning tasks that we can measure. However, numerous other factors, such as having had enough sleep or being unusually stressed, can also affect the individual's performance. The objective of testing is to estimate a psychological construct that we cannot directly observe (in this case, numerical reasoning ability) by assessing behaviors that we can measure, e.g., performance on a numerical reasoning test, while eliminating factors that affect performance but are unrelated to the construct of interest. The factors that distort our estimate of the true score are called errors of measurement, and no selection tool is ever free of them. Errors of measurement can come from the test takers, the examiner, and the environment: the test takers' psychological state, the examiner's attitude, or the temperature of the examination hall can all play a role. The smaller the errors of measurement, the more reliable the test.
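The true-score idea can be made concrete with a short simulation, a hypothetical sketch in Python rather than anything from the source. Under this model, each observed score is a true score plus random error, and reliability works out to the share of observed-score variance that comes from true scores. All numbers below are invented for illustration.

```python
import random

random.seed(42)

# Hypothetical illustration of the true-score model:
# observed score = true score + random error of measurement.
true_scores = [random.gauss(100, 15) for _ in range(10_000)]
errors = [random.gauss(0, 5) for _ in range(10_000)]  # sleep, stress, room temperature...
observed = [t + e for t, e in zip(true_scores, errors)]

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

# Reliability is the proportion of observed-score variance that is due to
# true-score variance; here it should come out near 15**2 / (15**2 + 5**2) = .90.
reliability = variance(true_scores) / variance(observed)
print(round(reliability, 2))
```

This is also why reducing sources of error (standardized instructions, a quiet examination room) raises reliability: it shrinks the error variance in the denominator.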

There are several methods of evaluating the reliability of a test. One is to compare scores recorded at two different points in time. When a math ability test is taken by the same group of people on two occasions, a highly reliable test should yield similar scores from one administration to the next; this is called test-retest reliability. The more similar the scores, the more reliable the test. There are other ways to estimate reliability as well, such as split-half reliability, in which we compare scores from the first half of the test with scores from the second half, and parallel-forms reliability, in which we compare scores from two similar sets of items (or forms of the test). Lastly, internal consistency reliability reflects how the items in a scale relate to one another: the higher a scale's internal consistency, the more confidently we can say the items are measuring the same construct.
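All of these estimates rest on correlations between sets of scores. The sketch below is illustrative Python with invented scores; the Spearman-Brown correction and Cronbach's alpha formulas are standard, but every data value is made up.

```python
def pearson_r(xs, ys):
    """Pearson correlation between two lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Test-retest: the same eight people take the test on two occasions.
time1 = [12, 15, 9, 18, 14, 11, 16, 13]
time2 = [13, 14, 10, 17, 15, 10, 16, 12]
test_retest = pearson_r(time1, time2)

# Split-half: correlate the two halves of a single administration, then
# apply the Spearman-Brown correction to estimate full-length reliability.
first_half = [6, 8, 4, 9, 7, 5, 8, 7]
second_half = [6, 7, 5, 9, 7, 6, 8, 5]
r_halves = pearson_r(first_half, second_half)
split_half = 2 * r_halves / (1 + r_halves)

# Internal consistency (Cronbach's alpha): each row is one item,
# each column is one test taker.
def sample_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

items = [
    [3, 4, 2, 5, 4, 3],
    [2, 4, 3, 5, 3, 3],
    [3, 5, 2, 4, 4, 2],
]
totals = [sum(person) for person in zip(*items)]
k = len(items)
alpha = k / (k - 1) * (1 - sum(sample_var(i) for i in items) / sample_var(totals))
```

With these invented scores all three coefficients land in the .85 to .95 range, which is what a well-behaved test should produce.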

Thus, when you begin to choose a psychological test, one question you should ask the test provider is how reliable the test is. Test reliability is expressed as a reliability coefficient ranging from 0 (not reliable at all) to +1 (perfect reliability). The reliability coefficient is usually represented by the letter "r" and reported as a decimal, such as r = .78. The closer the number is to +1, the more reliable the test. As a rule of thumb, when r = .90 or above, we say that the test has excellent reliability; r = .80 to .89 is good reliability; r = .70 to .79 is adequate reliability; and when r is below .70, the test may have limited applicability (O*Net).
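The O*Net rule of thumb can be captured in a small helper function; this is an illustrative sketch, not part of any published tool.

```python
def interpret_reliability(r: float) -> str:
    """O*Net rule-of-thumb label for a reliability coefficient."""
    if r >= 0.90:
        return "excellent reliability"
    if r >= 0.80:
        return "good reliability"
    if r >= 0.70:
        return "adequate reliability"
    return "may have limited applicability"

print(interpret_reliability(0.78))  # adequate reliability
```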

How high does the reliability coefficient need to be? "The more critical the decision to be made, the greater the need for precision of the measure on which the decision will be based, and the higher the required reliability coefficient" (as quoted in Gatewood, Feild, & Barrick, 2008, p. 138). However, you should not choose a test solely on the basis of its reliability; you should also look for other information in the test manual, such as the type of reliability the test developer used, how the reliability study was conducted, and the characteristics of the sample group.

(Based on Gatewood, Feild, & Barrick, 2008, Chapter 4)

Validity

Validity is a judgment or estimate of how well a test measures what it is intended to measure in a particular context. It is defined as "the degree to which available evidence supports inferences made from scores on selection measures" (Gatewood, Feild, & Barrick, 2008). Validity tells you whether the test measures something related to job qualifications and requirements, such as social skills or communication skills, and it tells you what the test scores mean, so that you can draw conclusions or make predictions about someone based on his or her scores. For example, when a mathematics ability test predicts students' math exam scores at school, it is said to be a valid test. In the human resources context, we are most interested in how well a test supports inferences about the test takers' job performance; in other words, how well the test score predicts a person's job performance once he or she is hired.

The validity of a test is established by drawing inferences from a specific sample group, so a test can be valid in one situation and not in another. For example, while a test may be valid in predicting managers' general ability, it may not be valid in predicting the general ability of clerical workers. Thus, when you are choosing a test, you must make sure the reference group has characteristics similar to your target group's (e.g., occupation, cultural background, language skills) so that appropriate inferences can be drawn.

Validation is the process of gathering evidence of how well a test measures the criteria it is meant to measure, and there are several methods. One is criterion-related validation, which demonstrates the relatedness of test scores and a criterion such as job performance. The criterion can be anything from the score on a certain examination to the quality of customer service provided by a customer service representative in a hotel; if the test is valid, individuals who score high on it should perform better on the job. Another method is construct-related validation, in which one demonstrates that the test measures what it is intended to measure and that the measured characteristic is related to job performance. That is, it concerns how appropriate it is to draw inferences about the construct from the test score. One way to establish construct-related validity is to compare test takers' scores with their scores on other, similar tests and also on dissimilar tests. For example, we expect integrity to be similar to honesty but to have little to do with open-mindedness. If an integrity test has high construct-related validity, test takers who score high on it should also score high on an honesty test, while their integrity scores should have little association with their scores on an open-mindedness test. Lastly, face validity concerns the appearance of the test: whether test takers think the test appears to measure what it is meant to measure. Lack of face validity can reduce test takers' motivation to complete the test, which in turn affects the results.
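The integrity example can be sketched numerically with made-up scores: under construct-related validation we expect integrity scores to correlate strongly with an honesty test (convergent evidence) and only weakly with an open-mindedness test (discriminant evidence).

```python
def pearson_r(xs, ys):
    """Pearson correlation between two lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented scores for eight test takers on three tests.
integrity = [70, 85, 60, 90, 75, 65, 80, 72]
honesty = [72, 83, 63, 88, 76, 62, 81, 70]    # similar construct
openness = [76, 75, 69, 68, 80, 74, 66, 68]   # dissimilar construct

convergent = pearson_r(integrity, honesty)     # should be strong
discriminant = pearson_r(integrity, openness)  # should be near zero
print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
```

A strong convergent correlation alongside a weak discriminant one is exactly the pattern that supports a construct-related validity claim.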

Among the various validation approaches, criterion-related validation is the most common. Criterion-related validity, like reliability, is reported as a validity coefficient, a number from 0 (not valid at all) to +1 (perfect validity); the larger the number, the more valid the test is said to be. The validity coefficient is also denoted by the letter "r", here the correlation between the test scores and the criterion, and should not be confused with the r used for reliability. Unlike reliability, a validity coefficient rarely exceeds .40, because any one test alone is seldom sufficient to predict total job performance: every job is a complex combination of different job-related behaviors. When multiple tests are used, you get a better estimate of the different aspects of the job in question, and thus higher validity. In practice, a validity coefficient is typically .21 to .35 (O*Net). A rule of thumb for interpreting validity coefficients is that when r is above .35, the test is very beneficial in predicting job performance; when r is .21 to .35, the test is likely to be useful; when r is .11 to .20, its validity depends on the circumstances; and when r is below .11, the test is unlikely to be useful (O*Net).
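A criterion-related validity coefficient is simply the correlation between test scores and a criterion measure, read against the O*Net bands. The sketch below uses invented selection-test scores and supervisor performance ratings.

```python
def pearson_r(xs, ys):
    """Pearson correlation between two lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented data: selection test scores and later supervisor ratings (1-5).
test_scores = [55, 70, 62, 80, 58, 75, 66, 85, 60, 72]
performance = [3, 2, 4, 3, 2, 4, 3, 4, 3, 2]
validity = pearson_r(test_scores, performance)

def interpret_validity(r: float) -> str:
    """O*Net rule-of-thumb label for a criterion-related validity coefficient."""
    if r > 0.35:
        return "very beneficial"
    if r >= 0.21:
        return "likely to be useful"
    if r >= 0.11:
        return "depends on circumstances"
    return "unlikely to be useful"

print(f"r = {validity:.2f}: {interpret_validity(validity)}")
```

Note how noisy the criterion is even for a usable test: a coefficient of around .3 is typical precisely because job performance depends on much more than any single measured ability.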

For more information regarding the validity of a particular test, you should refer to the test manual. A test manual should provide validation evidence, including a detailed explanation of how the validation studies were carried out and the characteristics of the sample groups. (Based on Gatewood, Feild, & Barrick, 2008, Chapter 5)

In choosing a test, it is important to ensure that it is both reliable and valid for your situation. Apart from reading the test manual, you can also consult independent reviews of each test.