What is Reliability?

For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods.

On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct.

Validity is the extent to which the scores from a measure represent the variable they are intended to. We have already considered one factor that researchers take into account: reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers.

Reliability in statistics and psychometrics is the overall consistency of a measure. A measure is said to have high reliability if it produces similar results under consistent conditions. A second kind of reliability is internal consistency, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other.

Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that he or she thinks positive thoughts about exercising, feels good about exercising, and actually exercises. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects.

Reliability theory shows that the variance of obtained scores is the sum of the variance of true scores and the variance of errors of measurement. Errors of measurement are composed of both random error and systematic error. They represent the discrepancies between scores obtained on tests and the corresponding true scores. The goal of estimating reliability is to determine how much of the variability in test scores is due to errors in measurement and how much is due to variability in true scores. Test-retest reliability assesses the degree to which test scores are consistent from one test administration to the next.
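The variance decomposition can be checked with a small simulation. This is only a sketch under stated assumptions: the sample size, the means, and the two standard deviations are made-up illustrative values, and the simulated error is purely random (independent of the true scores), so systematic error is not modelled here.

```python
import random
import statistics

# Assumed, illustrative values: none of these numbers come from the text.
N_PEOPLE = 20000
TRUE_SD = 10.0   # spread of true scores
ERROR_SD = 5.0   # spread of random measurement error

rng = random.Random(0)  # fixed seed so the run is repeatable

true_scores = [rng.gauss(100.0, TRUE_SD) for _ in range(N_PEOPLE)]
errors = [rng.gauss(0.0, ERROR_SD) for _ in range(N_PEOPLE)]
observed = [t + e for t, e in zip(true_scores, errors)]

var_true = statistics.pvariance(true_scores)
var_error = statistics.pvariance(errors)
var_observed = statistics.pvariance(observed)

# Reliability as the share of observed-score variance due to true scores.
reliability = var_true / var_observed

print(f"var(true) + var(error) = {var_true + var_error:.1f}")
print(f"var(observed)          = {var_observed:.1f}")
print(f"reliability            = {reliability:.3f}")
```

With these assumed standard deviations the two variance figures should be close to each other, and the reliability ratio should land near 100 / (100 + 25) = 0.8. The match is approximate because a finite sample leaves a small chance covariance between true scores and errors.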

Like test-retest reliability, internal consistency can only be assessed by collecting and analyzing data. One approach is to look at a split-half correlation. This involves splitting the items into two sets, such as the first and second halves of the items or the even- and odd-numbered items. Then a score is computed for each set of items, and the relationship between the two sets of scores is examined.
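The odd/even split described above might be sketched as follows. The response matrix is invented for illustration (six hypothetical respondents answering ten items on a 1 to 4 scale), and `pearson` and `split_half_correlation` are helper names introduced here, not functions from any particular package.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_half_correlation(responses):
    """Correlate each respondent's odd-item total with their even-item total."""
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return pearson(odd_totals, even_totals)

# Hypothetical data: one row per respondent, one column per item.
responses = [
    [4, 4, 3, 4, 4, 3, 4, 4, 3, 4],
    [2, 1, 2, 2, 1, 2, 2, 1, 2, 2],
    [3, 3, 3, 2, 3, 3, 3, 3, 2, 3],
    [1, 2, 1, 1, 2, 1, 1, 1, 2, 1],
    [4, 3, 4, 4, 3, 4, 3, 4, 4, 4],
    [2, 2, 3, 2, 2, 2, 3, 2, 2, 2],
]

r = split_half_correlation(responses)
print(f"split-half correlation = {r:+.2f}")
```

Because these made-up respondents answer very consistently, the correlation comes out high. Real questionnaire data would usually be noisier, and a different split (for example first half versus second half) would generally give a somewhat different value.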

One reason face validity counts for so little is that it is based on people’s intuitions about human behaviour, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity.


Perhaps the most common measure of internal consistency is Cronbach’s α (the Greek letter alpha). Conceptually, α is the mean of all possible split-half correlations for a set of items. For example, there are 252 ways to choose five items out of a set of 10 to form one half (126 distinct splits, because each split is counted once from either side). Note that this is not how α is actually computed, but it is a correct way of interpreting the meaning of this statistic. Again, a value of +.80 or greater is generally taken to indicate good internal consistency.
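As for how α is computed in practice, one common form of Cronbach’s α compares the sum of the individual item variances with the variance of the total scores. The sketch below assumes that formula and uses invented responses; `cronbach_alpha` is a helper name introduced here for illustration.

```python
import math
import statistics

def cronbach_alpha(responses):
    """Cronbach's alpha for rows of respondents and columns of items.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_columns = list(zip(*responses))
    item_variances = [statistics.variance(col) for col in item_columns]
    total_variance = statistics.variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical data: six respondents, ten items scored 1 to 4.
responses = [
    [4, 4, 3, 4, 4, 3, 4, 4, 3, 4],
    [2, 1, 2, 2, 1, 2, 2, 1, 2, 2],
    [3, 3, 3, 2, 3, 3, 3, 3, 2, 3],
    [1, 2, 1, 1, 2, 1, 1, 1, 2, 1],
    [4, 3, 4, 4, 3, 4, 3, 4, 4, 4],
    [2, 2, 3, 2, 2, 2, 3, 2, 2, 2],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")

# Number of ways to pick five of ten items for one half of a split.
n_ways = math.comb(10, 5)
print(f"ways to choose 5 of 10 items = {n_ways}")
```

The call to `math.comb(10, 5)` reproduces the count of 252 quoted in the text. The sample variance (`statistics.variance`) is used for both the items and the totals; using the population variance for both would give the same α, since the correction factor cancels in the ratio.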

In the test-retest method, measurements are gathered from a single rater who uses the same methods or instruments and the same testing conditions. It is just one of the ways that can be used to determine the reliability of a measurement. Other techniques that can be used include inter-rater reliability, internal consistency, and parallel-forms reliability.
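A test-retest check usually comes down to correlating the scores from the two administrations. The sketch below uses invented scores for eight hypothetical people. Applying the +.80 rule of thumb to test-retest correlations is an assumption made here for illustration, and `pearson` and `GOOD_RELIABILITY` are names introduced for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for the same eight people, measured twice.
time_1 = [22, 25, 17, 20, 28, 15, 24, 19]
time_2 = [23, 24, 18, 19, 27, 16, 25, 20]

GOOD_RELIABILITY = 0.80  # assumed rule-of-thumb threshold

test_retest_r = pearson(time_1, time_2)
print(f"test-retest correlation = {test_retest_r:+.2f}")
print("meets threshold:", test_retest_r >= GOOD_RELIABILITY)
```

The pairs must be kept in the same order for both administrations, so that each person's first score is matched with their own second score.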

Face validity is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. By contrast, the items “I enjoy detective or mystery stories” and “The sight of blood doesn’t frighten me or make me sick” have no obvious connection to aggression, yet both measure the suppression of aggression.

As well as reliability, it’s also important that an assessment is valid, i.e. that it measures what it is supposed to. To use a kitchen scale as a metaphor, a scale might consistently show the wrong weight; in such a case, the scale is reliable but not valid. To learn more about validity, see my earlier post “Six tips to increase content validity in competence tests and exams”. With internal consistency, you are essentially comparing test items that measure the same construct to determine the test’s internal consistency. Discriminant validity, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct.

Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimetre longer than another’s would indicate nothing about which one had higher self-esteem.

A reliability estimate indicates how much random error might be in the scores around the true score. It was well known to classical test theorists that measurement precision is not uniform across the scale of measurement. Tests tend to distinguish better for test-takers with moderate trait levels and worse among high- and low-scoring test-takers. Item response theory extends the concept of reliability from a single index to a function called the information function.

If there is high internal consistency, i.e. the results for the two sets of questions are similar, then each version of the test is likely to be reliable. The test-retest method involves two separate administrations of the same instrument, while internal consistency measures two different versions at the same time. Researchers may use internal consistency to develop two equivalent tests to later administer to the same group. Some examples of the methods to estimate reliability include test-retest reliability, internal consistency reliability, and parallel-test reliability. Each method comes at the problem of figuring out the source of error in the test somewhat differently.

Internal validity dictates how an experimental design is structured and encompasses all of the steps of the scientific research method. External validity is the process of examining the results and questioning whether there are any other possible causal relationships. The split-half method assesses the internal consistency of a test, such as psychometric tests and questionnaires. It measures the extent to which all parts of the test contribute equally to what is being measured.

The IRT information function is the inverse of the squared conditional standard error of measurement at any given score level. The goal of reliability theory is to estimate errors in measurement and to suggest ways of improving tests so that errors are minimized. While reliability does not imply validity, reliability does place a limit on the overall validity of a test. A test that is not perfectly reliable cannot be perfectly valid, either as a means of measuring attributes of a person or as a means of predicting scores on a criterion.
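The limit that reliability places on validity can be made concrete. In classical test theory, the correlation between a test and a criterion cannot exceed the square root of the product of their reliabilities. The sketch below assumes that bound; `max_validity` is a helper name introduced here, and the reliability values passed to it are illustrative.

```python
import math

def max_validity(reliability_test, reliability_criterion=1.0):
    """Upper bound on the test-criterion correlation under classical test theory."""
    for value in (reliability_test, reliability_criterion):
        if not 0.0 <= value <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    return math.sqrt(reliability_test * reliability_criterion)

# A test with reliability .81 against a perfectly measured criterion.
print(round(max_validity(0.81), 2))

# The same idea when the criterion is itself measured with error.
print(round(max_validity(0.64, 0.81), 2))
```

A test with reliability below 1 therefore has a validity ceiling below 1, which is the sense in which an imperfectly reliable test cannot be perfectly valid. The bound says nothing about how close a given test actually comes to that ceiling.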

This is as true for behavioural and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants’ bets were consistently high or low across trials. With the parallel test model it is possible to develop two forms of a test that are equivalent in the sense that a person’s true score on form A would be identical to their true score on form B. If both forms of the test were administered to a number of people, differences between scores on form A and form B may be due to errors in measurement only.
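The parallel test model can be illustrated by simulation: if both forms share the same true score and differ only in independent random error, the correlation between the forms should approximate the reliability of either form. All numbers below (sample size, means, standard deviations, seed) are assumptions chosen for illustration, and `pearson` is a helper introduced here.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Assumed, illustrative values.
N_PEOPLE = 20000
TRUE_SD = 10.0
ERROR_SD = 5.0

rng = random.Random(1)  # fixed seed so the run is repeatable

# Each person has one true score; each form adds its own random error.
true_scores = [rng.gauss(50.0, TRUE_SD) for _ in range(N_PEOPLE)]
form_a = [t + rng.gauss(0.0, ERROR_SD) for t in true_scores]
form_b = [t + rng.gauss(0.0, ERROR_SD) for t in true_scores]

parallel_forms_r = pearson(form_a, form_b)
expected = TRUE_SD ** 2 / (TRUE_SD ** 2 + ERROR_SD ** 2)

print(f"correlation between forms = {parallel_forms_r:.3f}")
print(f"true / observed variance  = {expected:.3f}")
```

In this idealized setup the two printed values should be close. With real tests the forms are rarely perfectly parallel, so the correlation between forms is better read as an estimate of reliability than as an exact value.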

A key question to ask yourself is ‘How congruent are the findings with reality?’ A credible project is one that adopts established research methods, such as random sampling, a range of different research methods, iterative questioning and frequent client debriefing.

If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead. A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. Content validity is the extent to which a measure “covers” the construct of interest.

Internal Reliability and Personality Tests

While a reliable test may provide useful valid information, a test that is not reliable cannot possibly be valid. That is, a reliable measure that is measuring something consistently is not necessarily measuring what you want to be measured. For example, while there are many reliable tests of specific abilities, not all of them would be valid for predicting, say, job performance. Inter-method reliability assesses the degree to which test scores are consistent when there is a variation in the methods or instruments used.

Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct. Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to.

As you can see from their definitions, validity and reliability are both key points you need to examine in any research study. For a study to be reliable, the same experiment conducted under the same conditions must generate the same results.

Just as you can count on the consistency of your friend, when something is reliable in science this indicates some level of consistency. In science, validity refers to accuracy; if something is not accurate, it is not valid. Just as reliability applies at multiple levels of the scientific process, so too does validity. One of the key criteria is that of internal validity, in which researchers seek to ensure that their study measures or tests what is actually intended.

For example, Figure 5.3 shows the split-half correlation between several university students’ scores on the even-numbered items and their scores on the odd-numbered items of the Rosenberg Self-Esteem Scale. A split-half correlation of +.80 or greater is generally considered good internal consistency. If the scores are 100, 111, 132 and 150, then the validity and reliability are also low. However, the distribution of these scores is slightly better than above, since it surrounds the true score instead of missing it entirely. Reliability is a property of any measure, tool, test or sometimes of a whole experiment.

What do you mean by reliable?

Reliable, infallible, trustworthy apply to persons, objects, ideas, or information that can be depended upon with confident certainty. Reliable suggests consistent dependability of judgment, character, performance, or result: a reliable formula, judge, car, meteorologist.