Reliability and Validity of Measurement
Learning Objectives
- Define reliability, including the different types and how they are assessed.
- Define validity, including the different types and how they are assessed.
- Describe the kinds of evidence that would be relevant to assessing the reliability and validity of a particular measure.
Again, measurement involves assigning scores to individuals so that they represent some characteristic of the individuals. But how do researchers know that the scores really represent the characteristic, especially when it is a construct like intelligence, self-esteem, depression, or working memory capacity? The answer is that they conduct research using the measure to confirm that the scores make sense based on their understanding of the construct being measured. This is an extremely important point. Psychologists do not simply assume that their measures work. Instead, they collect data to demonstrate that they work. If their research does not demonstrate that a measure works, they stop using it.
As an informal example, imagine that you have been dieting for a month. Your clothes seem to be fitting more loosely, and several friends have asked if you have lost weight. If at this point your bathroom scale indicated that you had lost 10 pounds, this would make sense and you would continue to use the scale. But if it indicated that you had gained 10 pounds, you would rightly conclude that it was broken and either fix it or get rid of it. In evaluating a measurement method, psychologists consider two general dimensions: reliability and validity.
Reliability
Reliability refers to the consistency of a measure. Psychologists consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (inter-rater reliability).
Test-Retest Reliability
When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time. Test-retest reliability is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.
Assessing test-retest reliability requires using the measure on a group of people at one time, using it again on the same group of people at a later time, and then looking at the test-retest correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing the correlation coefficient. Figure 4.2 shows the correlation between two sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered twice, a week apart. The correlation coefficient for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.
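As a concrete illustration, here is a minimal sketch in Python of how one might compute a test-retest correlation. The scores are invented for illustration, not data from an actual administration of the scale.

```python
# A minimal sketch: test-retest correlation between two administrations
# of the same measure. The scores below are invented for illustration.
import numpy as np

week1 = np.array([22, 25, 18, 30, 27, 15, 24, 28, 20, 26])  # first administration
week2 = np.array([23, 24, 17, 29, 28, 16, 25, 27, 21, 25])  # same people, one week later

r = np.corrcoef(week1, week2)[0, 1]  # Pearson correlation coefficient
print(f"Test-retest correlation: {r:+.2f}")  # +.80 or greater suggests good reliability
```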
Once again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.
Internal Consistency
Another kind of reliability is internal consistency, which is the consistency of people's responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people's scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people's responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioral and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants' bets were consistently high or low across trials.
Like test-retest reliability, internal consistency can only be assessed by collecting and analyzing data. One approach is to look at a split-half correlation. This involves splitting the items into two sets, such as the first and second halves of the items or the even- and odd-numbered items. Then a score is computed for each set of items, and the relationship between the two sets of scores is examined. For example, Figure 4.3 shows the split-half correlation between several university students' scores on the even-numbered items and their scores on the odd-numbered items of the Rosenberg Self-Esteem Scale. The correlation coefficient for these data is +.88. A split-half correlation of +.80 or greater is generally considered good internal consistency.
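A split-half correlation is straightforward to compute. The sketch below, with randomly generated responses standing in for real questionnaire data, scores the odd- and even-numbered items separately and correlates the two sets of totals.

```python
# A minimal sketch of a split-half correlation, assuming `responses` is a
# (participants x items) array of item scores. The data are randomly
# generated here, so the resulting correlation will be near zero; real
# scale data would be expected to produce a much higher value.
import numpy as np

rng = np.random.default_rng(0)
responses = rng.integers(1, 5, size=(30, 10))  # 30 people, 10 items, ratings 1-4

odd_scores = responses[:, 0::2].sum(axis=1)   # items 1, 3, 5, ... (columns 0, 2, 4, ...)
even_scores = responses[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...

split_half_r = np.corrcoef(odd_scores, even_scores)[0, 1]
print(f"Split-half correlation: {split_half_r:+.2f}")
```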
Perhaps the most common measure of internal consistency used by researchers in psychology is a statistic called Cronbach's α (the Greek letter alpha). Conceptually, α is the mean of all possible split-half correlations for a set of items. For example, there are 252 ways to split a set of 10 items into two sets of five. Cronbach's α would be the mean of the 252 split-half correlations. Note that this is not how α is actually computed, but it is a correct way of interpreting the meaning of this statistic. Again, a value of +.80 or greater is generally taken to indicate good internal consistency.
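In practice, α is computed directly from the item variances and the variance of the total scores. Here is a minimal sketch of that standard formula; the function name is ours, and the input is assumed to be a participants-by-items array like the one in the previous sketch.

```python
# A minimal sketch of Cronbach's alpha, computed from the standard formula
# rather than by averaging all possible split-half correlations.
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """responses: a (participants x items) array of item scores."""
    k = responses.shape[1]                         # number of items
    item_vars = responses.var(axis=0, ddof=1)      # variance of each item
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```

Applied to the random `responses` array from the previous sketch, `cronbach_alpha(responses)` would return a value near zero, again because the items are unrelated; responses to a coherent scale would yield a much higher value.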
Interrater Reliability
Many behavioral measures involve significant judgment on the part of an observer or a rater. Inter-rater reliability is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students' social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student's level of social skills. To the extent that each participant does, in fact, have some level of social skills that can be detected by an attentive observer, different observers' ratings should be highly correlated with each other. Inter-rater reliability would also have been measured in Bandura's Bobo doll study. In this case, the observers' ratings of how many acts of aggression a particular child committed while playing with the Bobo doll should have been highly positively correlated. Inter-rater reliability is often assessed using Cronbach's α when the judgments are quantitative or an analogous statistic called Cohen's κ (the Greek letter kappa) when they are categorical.
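For categorical judgments, Cohen's κ compares the observed proportion of agreement between two raters to the agreement expected by chance. Here is a minimal sketch; the ratings are invented, as if two observers had each coded the same six behaviors.

```python
# A minimal sketch of Cohen's kappa for two raters making categorical
# judgments. The ratings below are invented for illustration.
import numpy as np

def cohens_kappa(rater_a, rater_b) -> float:
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    p_observed = np.mean(a == b)  # proportion of exact agreements
    # Agreement expected by chance if the raters' judgments were independent:
    categories = np.union1d(a, b)
    p_expected = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

rater1 = ["aggressive", "not", "aggressive", "not", "aggressive", "not"]
rater2 = ["aggressive", "not", "aggressive", "aggressive", "aggressive", "not"]
print(f"Cohen's kappa: {cohens_kappa(rater1, rater2):.2f}")  # prints 0.67
```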
Validity
Validity is the extent to which the scores from a measure represent the variable they are intended to. But how do researchers make this judgment? We have already considered one factor that they take into account—reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people's index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people's index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person's index finger is a centimeter longer than another's would indicate nothing about which one had higher self-esteem.
Discussions of validity usually divide it into several distinct "types." But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure. Here we consider three basic kinds: face validity, content validity, and criterion validity.
Face Validity
Face validity is the extent to which a measurement method appears "on its face" to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.
Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people's intuitions about human behavior, which are often wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of over 567 different statements applies to them—where many of the statements do not have any obvious relationship to the construct that they measure. For example, the items "I enjoy detective or mystery stories" and "The sight of blood doesn't frighten me or make me sick" both measure the suppression of aggression. In this case, it is not the participants' literal answers to these questions that are of interest, but rather whether the pattern of the participants' responses to a series of questions matches those of individuals who tend to suppress their aggression.
Content Validity
Content validity is the extent to which a measure "covers" the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that he or she thinks positive thoughts about exercising, feels good about exercising, and actually exercises. So to have good content validity, a measure of people's attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.
Criterion Validity
Criterion validity is the extent to which people's scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with. For example, people's scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people's scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people's test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.
A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People's scores on this measure should be correlated with their participation in "extreme" activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have "predicted" a future outcome).
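Mechanically, assessing criterion validity amounts to computing the correlation between scores on the measure and each criterion and checking that the signs and magnitudes match expectations. The sketch below uses simulated test anxiety scores and two invented criteria purely to illustrate the pattern one would look for.

```python
# A minimal sketch of checking criterion validity: correlate scores on a
# measure with each criterion and inspect the signs. All data are simulated.
import numpy as np

rng = np.random.default_rng(1)
test_anxiety = rng.normal(50, 10, size=40)  # simulated scores on the new measure
criteria = {
    "exam performance": -0.6 * test_anxiety + rng.normal(0, 8, size=40),  # expect negative r
    "general anxiety": 0.5 * test_anxiety + rng.normal(0, 8, size=40),    # expect positive r
}
for name, scores in criteria.items():
    r = np.corrcoef(test_anxiety, scores)[0, 1]
    print(f"{name}: r = {r:+.2f}")
```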
Criteria can also include other measures of the same construct. For example, one would expect new measures of test anxiety or physical risk taking to be positively correlated with existing established measures of the same constructs. This is known as convergent validity.
Assessing convergent validity requires collecting data using the measure. Researchers John Cacioppo and Richard Petty did this when they created their self-report Need for Cognition Scale to measure how much people value and engage in thinking (Cacioppo & Petty, 1982)[1]. In a series of studies, they showed that people's scores were positively correlated with their scores on a standardized academic achievement test, and that their scores were negatively correlated with their scores on a measure of dogmatism (which represents a tendency toward obedience). In the years since it was created, the Need for Cognition Scale has been used in literally hundreds of studies and has been shown to be correlated with a wide variety of other variables, including the effectiveness of an advertisement, interest in politics, and juror decisions (Petty, Briñol, Loersch, & McCaslin, 2009)[2].
Discriminant Validity
Discriminant validity, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people's scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not actually measuring self-esteem; it is measuring mood instead.
When they created the Need for Cognition Scale, Cacioppo and Petty also provided evidence of discriminant validity by showing that people's scores were not correlated with certain other variables. For example, they found only a weak correlation between people's need for cognition and a measure of their cognitive style—the extent to which they tend to think analytically by breaking ideas into smaller parts or holistically in terms of "the big picture." They also found no correlation between people's need for cognition and measures of their test anxiety and their tendency to respond in socially desirable ways. All these low correlations provide evidence that the measure is reflecting a conceptually distinct construct.
Key Takeaways
- Psychological researchers do not simply assume that their measures work. Instead, they conduct research to show that they work. If they cannot show that they work, they stop using them.
- There are two distinct criteria by which researchers evaluate their measures: reliability and validity. Reliability is consistency across time (test-retest reliability), across items (internal consistency), and across researchers (inter-rater reliability). Validity is the extent to which the scores actually represent the variable they are intended to.
- Validity is a judgment based on various types of evidence. The relevant evidence includes the measure's reliability, whether it covers the construct of interest, and whether the scores it produces are correlated with other variables they are expected to be correlated with and not correlated with variables that are conceptually distinct.
- The reliability and validity of a measure is not established by any single study but by the pattern of results across multiple studies. The assessment of reliability and validity is an ongoing process.
Exercises
- Practice: Ask several friends to complete the Rosenberg Self-Esteem Scale. Then assess its internal consistency by making a scatterplot to show the split-half correlation (even- vs. odd-numbered items). Compute the correlation coefficient too if you know how.
- Discussion: Think back to the last college exam you took and think of the exam as a psychological measure. What construct do you think it was intended to measure? Comment on its face and content validity. What data could you collect to assess its reliability and criterion validity?
Source: https://opentext.wsu.edu/carriecuttler/chapter/reliability-and-validity-of-measurement/