Test reliability refers to an instrument’s degree of internal consistency (or content homogeneity). Quite simply, it indicates the extent to which the items in an instrument measure the same thing. Reliability is therefore a measure of the confidence with which scores obtained with an instrument can be regarded. Inversely related to an instrument’s reliability is its Standard Error of Measurement (SEM). The difference between a candidate’s true score (the score that he or she would have obtained with a perfect test) and the candidate’s observed score (the score obtained on the actual test) is the measurement error. An instrument’s Standard Error of Measurement gives an indication of the degree of inconsistency associated with the instrument.
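The usual relationship between these quantities is SEM = SD × √(1 − r), where SD is the standard deviation of the observed scores and r the reliability coefficient. A minimal sketch (the function name and the example values are illustrative, not taken from this study):

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r): the expected spread of a candidate's
    observed scores around his or her true score."""
    return sd * math.sqrt(1 - reliability)

# Illustrative values: SD = 15, reliability = 0.91
print(round(standard_error_of_measurement(15, 0.91), 2))  # ≈ 4.5
```

Note that as reliability approaches 1, the SEM approaches 0: a perfectly reliable instrument would reproduce the true score exactly.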

Reliability is relative and is influenced by the group the instrument is administered to. A more heterogeneous group will result in a higher reliability coefficient than a homogeneous group. The reason for this is that a more heterogeneous group will have a larger range of scores. In addition, there are likely to be more extreme cases. For example, in the case of an ability test used on a diverse group, there are likely to be candidates who get virtually all items correct and candidates who get virtually all items incorrect. This will increase the estimated reliability (which is a measure of the test’s consistency). It is therefore important to determine whether a test publisher’s stated reliability estimates were obtained on a population that is equivalent to the one you intend to use the instrument on.

There are general guidelines regarding the levels of reliability that are acceptable. Cognitive ability instruments generally have higher reliability coefficients, while lower levels are considered acceptable for personality and behavioural styles instruments. Experimental tests may have lower reliabilities than commercial instruments.

Reliability is usually estimated with either the Cronbach’s Alpha or the KR-20 coefficient (in some cases split-half and KR-21 reliability estimates may also be used). As mentioned, the more homogeneous the items are, the higher the reliability coefficient (because the items tend to measure the same underlying construct). Up to a point it is good that an instrument has a substantial degree of unity and coherence. However, in the field of ability and intelligence measurement, it is sometimes desirable that an instrument has some spread of items and mixes in different constructs. This allows the instrument to measure a somewhat wider range of abilities, thereby allowing the measurement of a more general ability. Conversely, an exclusive preoccupation with internal consistency may result in the narrowing of the scope of the instrument.
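The two coefficients can be sketched as follows (a minimal illustration, not the scoring code used in this study; population variances are used throughout, and some texts use sample variances instead, which shifts the result slightly):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's Alpha for an (n_persons, n_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / total-score variance)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0).sum()   # sum of per-item variances
    total_var = scores.sum(axis=1).var()   # variance of the total scores
    return k / (k - 1) * (1 - item_vars / total_var)

def kr20(scores):
    """KR-20 for dichotomous (0/1) items: algebraically Cronbach's
    Alpha with p*(1-p) as each item's variance."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    p = scores.mean(axis=0)                # proportion correct per item
    total_var = scores.sum(axis=1).var()
    return k / (k - 1) * (1 - (p * (1 - p)).sum() / total_var)
```

On dichotomous data the two formulas coincide, which is why KR-20 is often described as the special case of Alpha for right/wrong scored items.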

It is for this reason that, in some circumstances, maximising reliability can limit predictive validity. A balance therefore needs to be struck between a reliability high enough to allow us to be confident that we are measuring what we say we are measuring, and an instrument that samples ability or behaviour in a broad enough domain to allow it to make effective real-world predictions. Unless scores are reasonably consistent, they cannot be interpreted with any degree of confidence, but at the same time heterogeneity of items is important for adequate prediction of external criteria.


Selected scales were evaluated. The KR-20 reliability coefficient was used for the dichotomously scored indices, while coefficient Alpha was used for the interval-scaled scales. The standard error of measurement was calculated for each of the scales.


In general, the various scales seem to have respectable reliability coefficients, on the indices used, for the specific populations included in this study.