What are Confidence Judgments Made of? Students’ Explanations for their Confidence Ratings and What that Means for Calibration
Daniel L. Dinsmore and Meghan M. Parkinson
University of Maryland
Daniel L. Dinsmore, Department of Human Development, University of Maryland; Meghan M. Parkinson, Department of Human Development, University of Maryland.
Correspondence concerning this article should be addressed to Daniel L. Dinsmore, Department of Human Development, 3304 Benjamin Building, College Park, MD 20742.
Thanks to Gregory Hancock for his helpful comments regarding this manuscript.
Although calibration has been widely studied, questions remain about how best to capture confidence ratings, how to calculate continuous variable calibration indices, and on what exactly students base their reported confidence ratings. Undergraduates in a research methods class completed a prior knowledge assessment, two sets of readings and posttest questions, and rated their confidence in their responses to each posttest item. Students also wrote open-ended responses explaining why they marked their confidence as they did. Students provided confidence ratings on a 100-mm line for one of the passages and through magnitude scaling for the other counterbalanced passage. Calibration was calculated using a rho coefficient. No within-subject differences were found between 100-mm line responses and magnitude scaling responses, p = .54. Open-ended responses revealed that students base their confidence ratings on prior knowledge, characteristics of the text, characteristics of the item, guessing, and combinations of these categories. Future studies including calibration should carefully consider implicit assumptions about students’ sources of confidence and how those sources theoretically relate to calibration.
What are Confidence Judgments Made of? Students’ Explanations for their Confidence Ratings and What that Means for Calibration
Students take many tests throughout their academic careers. Successful students develop a set of expectations about what will be on the test and either memorize the appropriate information or develop knowledge schema (Crisp, Sweiry, Ahmed, & Pollitt, 2008). At a certain point in the process of studying, students must determine whether they have adequately learned the concepts on which they will be tested, called a judgment of learning (Dunlosky, Serra, Matvey, & Rawson, 2005). The same type of judgment occurs again when answering test questions and students determine whether they have adequately answered each question, called a confidence judgment (Schraw, 2009). The accuracy of students’ judgments of learning and confidence judgments is extremely important because it affects self-regulation of learning (Labuhn, Zimmerman, & Hasselhorn, 2010). Calibration is thus the relation between the degree of confidence students have about their performance and their actual performance (Fischhoff, Slovic, & Lichtenstein, 1977; Glenberg, Sanocki, Epstein, & Morris, 1987).
Since calibration is so crucial to learning, it is not surprising that cognitive, developmental, and educational psychologists have long studied the phenomenon (e.g., Fischhoff, Slovic, & Lichtenstein, 1977; Glenberg & Epstein, 1985; Koriat, 2011). Thorndike first brought attention to the need to move away from simply measuring how precisely one could estimate physical differences and move towards investigations of how one judges differences in comparison to a mental standard of some sort (Woodworth & Thorndike, 1900). He argued that this was of utmost importance in understanding how we navigate everyday situations. After over a century of research into calibration there are still many questions as to how it relates to the everyday situation of successful or unsuccessful learning. It is the purpose of this study to systematically address these questions through precise theoretical framing, manipulated methods of measurement, and students’ self-reported reasons for their responses to the confidence measures.
A recent review of the contemporary literature on calibration found that while just over half of studies explicitly defined what was meant by calibration, very few grounded that definition in a theoretical framework (Parkinson, Dinsmore, & Alexander, 2010). What it means to be a poorly calibrated versus a well-calibrated learner is somewhat in doubt without conceptual assumptions from a model or theory of some sort. Furthermore, measurement of the confidence ratings upon which calibration is calculated should be congruent with the chosen theoretical framework. If, for example, calibration is theoretically thought to be domain-specific, the corresponding measurement should reflect domain-specific items. Theory should also guide interpretation of students’ bases for judging their confidence since confidence ratings are necessary for calculating calibration. Previous research has suggested that when students answer questions after reading a passage they tend to judge their understanding and subsequent confidence based on their familiarity with the domain rather than what they learned from the specific text (Glenberg, Sanocki, Epstein, & Morris, 1987). According to the Model of Domain Learning (Alexander, Murphy, Woods, & Duhon, 1997) this finding would not only be explainable, but also expected.
The current study is guided by Bandura’s (1986) model of reciprocal determinism. The model explains that individuals must regulate the reciprocal relations between personal influences, behavioral influences, and environmental influences in order to learn. Personal influences could be metacognitive, such as prior knowledge and experience (i.e., metacognitive knowledge or experiences; Flavell, 1979), or motivational, such as goals, interest, and self-efficacy. Metacognitive experiences, according to Flavell (1979), impact goal-setting and activate strategy usage and tend to be conscious when a learner is faced with a complex or highly demanding task. As with most conceptions of the relation between metacognition and self-regulation, we conceptualize metacognition as a cognitive aspect of self-regulation, which subsumes it (Dinsmore, Alexander, & Loughlin, 2008). Since the larger frame of self-regulation is thought to be domain-specific in nature (Alexander, Dinsmore, Parkinson, & Winters, 2011), it is important to consider that students will have different metacognitive experiences across different domains, and their calibration would therefore be expected to be better or worse depending on the interaction between their metacognitive experience and the particular task. Given this consideration, students in the current study were asked to complete a set of tasks specific to their research methods class.
Personal factors both influence and are influenced by environmental factors, such as the task, achievement outcomes, or the classroom climate. Evaluations of the task are expected to contribute to students’ confidence ratings, and students’ judgment accuracy is in turn expected to shape perceptions of task difficulty. The final corner of the triangle is behavioral factors. Bandura described three types of behavior: self-observation, self-judgment, and self-reaction. Self-observation refers to attention to a task. Confidence ratings are a type of self-judgment because students compare their performance to pre-established goals, or Woodworth and Thorndike’s mental standard. Self-reaction is the resulting behavior from self-observation and self-judgment. Positioning calibration in the behavioral portion of the model implies that not only do personal factors and environmental factors influence calibration (as has oft been studied), but that calibration itself influences students’ metacognition, goals, and ability to confront the task. Therefore, if students are poorly calibrated, their performance is not expected to be as high as that of students who are well calibrated.
In order to understand the role of calibration in the frameworks mentioned previously, we must be able to make valid inferences about the measurement of confidence and performance, and ultimately, the calculations of calibration used. Currently, there appears to be little consensus on what methods to use to calculate calibration (e.g., Parkinson, Dinsmore, & Alexander, 2010). Typically, respondents are asked to complete a multiple-choice recall measure and then to indicate for each item whether they were “confident” or “not confident”. The Hamann or gamma coefficient would then be used to calculate the likelihood that a participant exhibited more correct judgments than incorrect judgments (Schraw, 2009). However, this dichotomous measure of both performance and confidence is a problematic one. As Thorndike and Gates (1929) pointed out, there is likely a distribution of people that fall along a continuum of any variable (in this case performance and confidence) rather than a true dichotomy. While it may be the case that dichotomization of the construct makes discussing and calculating calibration easier, we may in fact be invalidating any inferences we can make about an individual’s calibration in the framework of self-regulation.
Fortunately, there are measures of both performance and confidence that can be used to increase the validity of the inferences made from calculations of calibration. With regard to knowledge, it is clear that a correct response on a multiple-choice question does not indicate total knowledge of a concept and, conversely, an incorrect response on a multiple-choice item does not indicate no knowledge of a concept. One way to expand this dichotomy is to use a graduated response model (Alexander, Murphy, & Kulikowich, 1998). In this response model, ordered categories are used to represent varying levels of knowledge about a domain or topic.
With regard to confidence, it is also unlikely that an individual would either be completely confident or not confident at all. This study explores two different options for measuring confidence on a continuous scale. The first of these is to measure confidence on a 100-millimeter line (e.g., Schraw, Potenza, & Nebelsick-Gullet, 1993). Using the 100-mm line, participants are free to mark anywhere on the line from “not confident” to “very confident” to describe their level of confidence about an item, with possible scores ranging from 0 to 100. The other option explored in this study is that of magnitude scaling (for more information about magnitude scaling see Lodge, 1981). Magnitude scaling not only provides interval-level data, but additionally allows individuals to rate their confidence at a ratio level by comparing their confidence in one item to their confidence in an anchor item. Knowing the magnitude of their confidence increases the validity of the interpretations one can make about that confidence, and thus about the calibration calculated for each individual.
Although extending confidence and performance beyond dichotomized measures allows for greater validity of interpretation, it also brings with it some challenges. Namely, how do we calculate calibration? The extension of confidence and performance measures to continuous data will require us to go beyond the use of the Hamann and the gamma coefficients, to coefficients that consider ordinal data with underlying continuities (i.e., the graduated response model) and continuous data (i.e., the 100-mm and magnitude scales).
Lastly, looking at the model of reciprocal determinism (Bandura, 1986) it seems apparent that many studies of calibration manipulate environmental influences, such as feedback (Labuhn, Zimmerman, & Hasselhorn, 2010) and task difficulty (Pulford & Colman, 1997), and measure behavioral influences, such as judgments of learning (Van Overschelde & Nelson, 2006) or help-seeking behavior (Stavrianopoulos, 2007), but less is known about the role of personal factors in calibration. The current study asks students to self-report how they chose their confidence rating for particular items through an open-ended written response. The aim in gathering this information is to determine which factors students are considering when they make their confidence judgments, and subsequently which factors are cited by poorly and highly calibrated individuals. In addition to the valuable contribution of self-reported bases for judgments, the following questions are under investigation in the current study.
Can the rho coefficient provide valid inferences about the relation between performance and confidence (i.e., calibration)? It is hypothesized that the rho coefficient will be appropriate for calculating calibration since it correlates ordinal data with an underlying continuous distribution (Cohen, Cohen, West, & Aiken, 2003).
Does the type of confidence scale used (i.e., 100-mm versus magnitude scale) affect the validity of inferences drawn from the calculation of calibration (i.e., the rho coefficient) within subjects and in relation to an external measure (e.g., prior subject-matter knowledge)? It is predicted that magnitude scaling will provide more valid inferences since these are ratio-level data instead of interval-level data.
What do students report considering when making their confidence judgments? It is anticipated that students will report personal influences suggested by Bandura (1986) such as self-efficacy or interest, and aspects of metacognition described by Flavell (1979) such as metacognitive knowledge and metacognitive experiences. Students may also consider aspects of the task, such as item difficulty and answer choices, when making their confidence judgments. The current study utilized open-ended written responses to capture as much variability in self-reported responses as possible.
The participants for the current study (11 males, 61 females, M_age = 20.9 years, age range: 19-25) were recruited from a human development research methods class at a large public mid-Atlantic university. These participants were ethnically diverse (58.3% Caucasian, 19.4% African American, 15.3% Asian, and 6.9% Hispanic) and came from a wide variety of majors (the most prevalent major was psychology at 33.3%). Participants were sophomores, juniors, and seniors (M_credits = 87.8, M_GPA = 3.3).
The materials for this study consisted of two text passages adapted from a research methods textbook (Cozby, 2007). These passages dealt with the topics of scientific approach (SA; Appendix A) and statistical inference (SI; Appendix B). The passages were chosen to be equivalent with regard to both readability and difficulty. Both passages were three pages single spaced (1669 words and 1672 words, respectively), had comparable Flesch Reading Ease scores (42.7 and 45.0, respectively), and comparable Flesch-Kincaid Grade Level scores (both passages 11.5). Further, there were no significant differences detected between the passage recall measures for each passage described below (M_diff = 1.25, SE_diff = 6.63, p = .11).
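The text does not say which tool produced the readability scores. As a rough sketch, the standard Flesch formulas can be computed from word, sentence, and syllable counts; the counts below are invented for illustration, and syllable counting in particular varies between tools, so results from this sketch would not necessarily match the values reported above.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Invented counts for a short illustrative text (not the study passages).
fre = flesch_reading_ease(words=100, sentences=5, syllables=150)
fk = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(round(fre, 3), round(fk, 2))
```

Comparing both indices across passages, as done above, guards against the passages differing in sentence length or word complexity alone.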
The measures for this study consisted of a subject-matter knowledge measure, two passage recall measures, and two item confidence measures.
Subject-matter knowledge. The subject-matter knowledge measure was a 12-item graduated multiple-choice test. These 12 items constituted a variety of topics from the research methods course in which the students were enrolled and were verified as important topics in the course by the instructor of record.
The answer choices for each item were on a graduated response scale (Alexander, Murphy, & Kulikowich, 1998) representing four categories. The first of these four categories was the in-topic correct response, which was scored a 4. The second category was the in-topic incorrect response, which was scored a 2. This response was not correct; however, it was within the same topic as the correct response. The third category was the in-domain incorrect response, which was scored a 1. This response was also incorrect and was from a different topic than the correct response, but was still in the domain of research methods. The fourth category was the popular lore option, which was scored a 0. This incorrect response was one that a participant with little or no subject-matter knowledge would choose and was not associated with the domain of research methods. An example of one of these items follows.
A researcher runs a time-series design testing the effects of his magic memory pill. Ironically, after taking his Magic Memory Pill, the participants in his study forget to come to the subsequent observations. The main threat to internal validity here is:
These items and graduated responses were validated by the instructor of record for the research methods course. Possible scores for this measure could range from 0 to 48. The current sample had a mean score for this measure of 33.74 (SD = 5.77). The reliability for this measure was low (α = .30), possibly due to the low prior knowledge of many of these participants.
Passage recall measures. For each of the text passages (SA and SI) there was a 10-item graduated multiple-choice measure. The ten items for each of these measures (the SA passage recall and SI passage recall) were based on concepts taken directly from the respective passages. Similar to the subject-matter knowledge measure, the response choices represented a graduated response scale consisting of four categories. The first category was the in-passage correct response and was scored a 4. The second category was the in-passage closely related incorrect response and was scored a 2. This response was closely conceptually related to the correct response and was located in close proximity to the correct answer in the passage. The third category was the in-passage further-related incorrect response and was scored a 1. This response was in the passage but was not closely related to the correct response and was located further from the correct response in the passage. The last category was the popular lore response and was scored a 0. This response was not located anywhere in the text passage. An example of one of these items follows.
To ensure research with major flaws will not become part of the scientific literature, scientists rely on:
reliability estimates (1)
verification of ideas (2)
peer review (4)
Validity for these items was established by their correspondence to the text passages and validated by the course instructor of record. Possible scores for these measures could range from 0 to 40. The current sample had a mean score for the SA passage recall measure of 33.40 (SD = 6.45) and for the SI passage recall measure of 32.15 (SD = 5.63). The reliabilities for these measures were .67 and .62 for the SA and SI measures, respectively.
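The graduated scoring described above can be sketched as a simple lookup. The category labels are hypothetical shorthand introduced here for the four response types, and the responses are invented; only the point values (4, 2, 1, 0) and the 0 to 40 range come from the text.

```python
# Hypothetical labels for the four graduated response categories;
# the point values are those stated in the text.
SCORE_BY_CATEGORY = {
    "correct": 4,
    "closely_related": 2,
    "further_related": 1,
    "popular_lore": 0,
}


def score_recall(responses):
    """Return (item_scores, total) for a list of response-category labels."""
    item_scores = [SCORE_BY_CATEGORY[r] for r in responses]
    return item_scores, sum(item_scores)


# Invented responses for one participant on a 10-item recall measure.
example = ["correct"] * 7 + ["closely_related", "further_related", "popular_lore"]
item_scores, total = score_recall(example)
print(item_scores, total)
```

The item-level scores, rather than the total, are what would be paired with item-level confidence ratings when calibration is computed.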
Item confidence measures. Following each item on both the SA and SI passage recall measures, participants were asked how confident they were that their response was correct. This was done using either a 100-mm line (Schraw et al., 1993) or magnitude scaling (Lodge, 1981).
100-mm line confidence measure. For this measure, participants were instructed, “After each item please indicate how confident you are with the answer you circled by making a slash mark on the line (or ends) indicating your confidence from not confident to very confident.” A sample item follows.
How confident are you that your response to the above item is correct?
Each participant’s response was measured on the 100 millimeter line and recorded using a standard ruler. Possible scores on the 100-mm line confidence scale ranged from 0 to 100.
Magnitude scaling confidence measure. For this measure, the directions were:
After each item please indicate how confident you are with the answer you circled by comparing your confidence in that item with the anchor item which has been assigned a value of 10. If you are more confident for a particular item than the anchor item (which had a value of 10) you should write a number higher than 10 in the brackets. For example, if you are three times as confident in that particular item than you were with the anchor statement, you would write a 30 (i.e. three times as much as 10) for your confidence judgment of that item. If you are less confident for a particular item than the anchor item (which had a value of 10) you should write a number lower than 10 in the brackets. For example, if you are half as confident in that particular item than you were with the anchor statement, you would write a 5 (i.e. half of 10) for your confidence judgment of that item. No negative numbers please.
The anchor item immediately followed the instructions and consisted of a multiple-choice item of the same type as the recall items that was of about medium difficulty (item easiness = .53) in an earlier pilot sample. A sample confidence question follows.
How confident are you that your response to the above item is correct compared to the anchor item? [ ]
Possible scores on the magnitude judgments ranged from 0 to ∞. These raw scores were then converted to log (base 10) scores (Lodge, 1981) so that a transformed value of one was equal to the anchor (initially 10), values under one indicated less confidence than the anchor, and values above one indicated more confidence than the anchor.
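A minimal sketch of this transformation follows. The function name is hypothetical, and returning None for a raw score of 0 is purely a placeholder: the base-10 logarithm is undefined at 0, and the text does not state how such responses, if any occurred, were handled.

```python
import math

ANCHOR_RAW = 10  # value assigned to the anchor item in the directions


def to_log_score(raw):
    """Convert a raw magnitude judgment to a base-10 log score.

    The anchor value of 10 maps to 1.0. A raw score of 0 has no logarithm,
    so None is returned as a placeholder (an assumption, not the study's rule).
    """
    if raw <= 0:
        return None
    return math.log10(raw)


# Raw judgments from the directions: half, equal to, and three times the anchor.
for raw in (5, ANCHOR_RAW, 30):
    print(raw, round(to_log_score(raw), 3))
```

On the log scale, judgments of half and double the anchor sit at equal distances on either side of the anchor's value of one.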
In addition to the measures described above, two measurements were taken during the experiment, namely, participants’ ability to summarize the passages they had read and open-ended questions about how they were judging their confidence.
Passage summaries. After reading the passage, participants were asked to, “Please summarize this passage, giving both the overall main idea and the major points addressed.” Participants were given the front and back of the page to write their response. These summaries will be scored using a rubric currently under development.
Open-ended confidence questions. Following two of the confidence items participants were asked to, “Please explain how you arrived at or what you considered when making your confidence judgment.” Participants were given a 3-inch space to write their response.
The open-ended responses were then coded for instances of confidence judgments being arrived at through five a priori categories: prior knowledge, characteristics of the text, characteristics of the item, guessing, and other considerations. Considerations of prior knowledge in confidence judgments were instances where participants cited their level of prior knowledge about the concept in question. For example, one participant responded to an item, “this question I decided based on past knowledge and understanding.” Characteristics of the text included instances where participants cited things they had read in the text. For example, a participant responded, “I remember seeing adoption in the passage earlier.” Characteristics of the item included instances where participants cited either the stem or responses for the item as reasons for their confidence level. For example, one participant responded, “All of the answer choices seem to contribute to the soundness of research (i.e., expertise of the peer reviewers matters). However, peer review is the formalized process by which research becomes part of the scientific literature.” Guessing included instances where participants indicated their confidence level was influenced by guessing. For example, one participant responded, “I just guessed.” The other category included statements that had none of the elements described previously (i.e., prior knowledge, characteristics of the text, characteristics of the item, or guessing). For example, one participant responded, “What I considered when making my confidence judgment is that personal judgment is like a gut feeling of one person not a group of people. Everyone can/may rely on intuition based on experience.”
It was also possible that an open-ended response could include more than one of these elements. In other words, a response could explain how both prior knowledge and item characteristics influenced the participant’s confidence level. For example, a participant with this combination responded, “prior knowledge (definitions) and relation of confidence to other answers.”
The first and second authors concurrently coded responses for 10 randomly selected participants (40 responses in all) to ensure the coding scheme was useful. Then the first and second authors independently coded the responses of the remaining participants (62 participants, 248 responses). Exact agreement for codes (including combinations of codes) was 92.5%.
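Exact agreement of this kind can be sketched as follows, assuming each response is represented as the set of codes a coder assigned, so that a combination counts as a match only when both coders chose the same combination. The code labels and data are invented for illustration.

```python
def exact_agreement(codes_a, codes_b):
    """Percentage of responses to which two coders assigned identical code sets."""
    if len(codes_a) != len(codes_b):
        raise ValueError("Both coders must code the same number of responses.")
    matches = sum(
        1 for a, b in zip(codes_a, codes_b) if frozenset(a) == frozenset(b)
    )
    return 100 * matches / len(codes_a)


# Invented codes for four responses (labels are hypothetical shorthand).
coder_1 = [{"prior"}, {"text"}, {"prior", "item"}, {"guess"}]
coder_2 = [{"prior"}, {"text"}, {"prior"}, {"guess"}]
print(exact_agreement(coder_1, coder_2))
```

Under this definition a partial overlap, as in the third response, counts as a disagreement. Exact agreement does not correct for chance agreement.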
The study was conducted during regular class time for the participants’ research methods class. The first session consisted of completing the informed consent form, a demographic survey, and the subject-matter knowledge measure. Additionally, students were randomly assigned to read and respond to one of the two text passages (either SA or SI) and to give their confidence ratings using either the 100-mm scale or the magnitude scale. Passage and confidence rating scale were counterbalanced. After reading the passage, participants completed the 10 recall items, the confidence items, and the passage summary. During the second session, participants read the other passage (either SA or SI) and completed the 10 recall items, confidence items, and passage summary for that passage. Participants were also randomly assigned two items from each passage for which to complete the open-ended response.
Calibration for each participant was calculated using the rho coefficient which is appropriate for ordinal level data with an underlying continuous distribution (Cohen, Cohen, West, & Aiken, 2003). To examine whether the rho coefficient gave valid inferences about an individual’s calibration we began by inspecting the scatterplots for each individual’s raw performance and confidence ratings. Figures 1 and 2 present scatterplots for individuals that have a range of rho coefficient values for the 100-mm and magnitude confidence scales respectively.
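As an illustration, and assuming the rho coefficient here is Spearman's rank correlation, one participant's calibration could be computed as below. The data are invented, and average ranks are used for ties, which are unavoidable when performance takes only the values 0, 1, 2, and 4.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks; None if either variable is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return None
    return sxy / (sxx * syy) ** 0.5


# Invented item scores and 100-mm confidence ratings for one participant.
performance = [4, 4, 2, 4, 1, 0, 4, 2, 4, 1]
confidence = [90, 85, 40, 70, 30, 10, 95, 55, 80, 20]
rho = spearman_rho(performance, confidence)
print(rho)
```

The None branch matters in practice: a participant who gives the same confidence rating to every item has no variance in confidence, so the coefficient is undefined for that person.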
From these scatterplots, it appears that the rho coefficient gives some indication of calibration: the more strongly confidence and performance are linearly related, the higher the rho coefficient (i.e., the more highly calibrated the individual). One issue that we encountered, and that is evident in the scatterplots for the 100-mm line, is how much of the line participants used. Specifically, the mean of the standard deviations of participants’ ten responses on the 100-mm confidence line was 17.68 (SD = 12.88) with a maximum standard deviation of 71.56 and a minimum standard deviation of 0. Clearly, it then becomes difficult to ascertain what differences the participants ascribe to different locations on the 100-mm line.
Type of Confidence Scale
The issue of interpreting the different locations on the 100-mm line described previously brings us to the next question. Namely, does the type of confidence scale (100-mm line versus magnitude judgments) matter? We return to the interpretation issues raised above, but first present an analysis of within-subjects differences between the 100-mm and magnitude scales. Since there were no passage differences in recall performance (analysis previously presented in the methods section), we can inspect the differences between participants’ calibration (rho coefficient) for the two scales. A within-subjects analysis indicated that there were no significant differences between individuals’ calibration using the 100-mm scale and the magnitude scale, F(1, 68) = .375, p = .54. There were also no differences for order of measurement scale, F(1, 68) = .186, p = .67. The means and standard deviations for this analysis can be found in Table 1.
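The core of a within-subjects comparison like this one can be sketched as a paired test on each participant's two rho values. This is a simplification under stated assumptions: it ignores the order factor included in the reported analysis, it is not the authors' procedure, and the rho values are invented. With only two conditions, the squared paired t statistic equals the F ratio of a one-factor repeated-measures analysis.

```python
from statistics import mean, stdev


def paired_t(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for two within-subject scores.

    Assumes the paired differences are not all identical (nonzero SD).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / n ** 0.5
    return mean(diffs) / standard_error, n - 1


# Invented calibration (rho) values for four participants on each scale.
rho_line = [0.6, 0.3, 0.7, 0.5]
rho_magnitude = [0.5, 0.4, 0.5, 0.5]
t, df = paired_t(rho_line, rho_magnitude)
print(round(t ** 2, 3), df)  # t squared corresponds to F(1, df)
```

Participants whose rho is undefined on either scale would have to be dropped before such a comparison.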
Further, the distributions of rho for the two passages were similar. The distributions for the SA and SI passages were both negatively skewed (-.64 and -.53, respectively) but had different levels of kurtosis (-.02 and -.34, respectively). The distributions of rho for these passages are presented in Figure 3 for the SA passage and Figure 4 for the SI passage.
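For readers who want to check distribution shape on their own data, skewness and excess kurtosis can be sketched with the simple moment-based formulas below. This is an assumption about method: the text does not say which estimator was used, and statistical packages often apply small-sample corrections, so values will not match exactly. The rho values are invented.

```python
def skew_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


# Invented rho values with a long left tail, as in a negatively skewed sample.
rhos = [0.9, 0.8, 0.85, 0.7, -0.2]
skew, kurt = skew_and_excess_kurtosis(rhos)
print(round(skew, 2), round(kurt, 2))
```

A negative skew of this kind arises when most participants have moderate to high rho values and a few have very low or negative ones.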
Although there were no differences in terms of the magnitude of their calibration (i.e., rho coefficient), we still feel that the validity of the interpretations of the calibration coefficient is stronger for the magnitude scale. In other words, given an anchor item, individuals rate their confidence by indicating its magnitude for one item relative to the anchor item; hence the scale is known and fixed to the anchor item. For the 100-mm line, it is unknown whether an item marked at 50 millimeters reflects twice the confidence of an item marked at 25 millimeters.
We had hoped to examine whether the strength of the relation between calibration (rho) and subject-matter knowledge differed between the 100-mm line and the magnitude scale; however, due to the low reliability of the subject-matter knowledge measure, we chose not to analyze it. We will, however, code the passage summaries and examine whether the correlation between passage summary scores and calibration differs between the two scales.
Open-ended Participant Reports
Finally, we present data on what students reported they used to make their confidence ratings. The five categories discussed in the methods section (prior knowledge, characteristics of the text, item characteristics, guessing, and other) were all present in the raters’ codings. Additionally, participants’ responses also fell into the following joint categories: prior knowledge and text characteristics; prior knowledge and item characteristics; prior knowledge and guessing; text characteristics and item characteristics; text characteristics and guessing; item characteristics and other; prior knowledge, text characteristics, and item characteristics; text characteristics, item characteristics, and guessing; and finally, prior knowledge, item characteristics, and guessing. The number of responses that fell into each of these categories by passage is presented in Table 2.
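A tally of this kind, counting single and joint categories separately for each passage, might be sketched as follows. The code labels and the coded responses are invented for illustration and do not reproduce the counts in Table 2.

```python
from collections import Counter


def tally_codes(coded_responses):
    """Count responses per (passage, combination of codes).

    Each response is a (passage, codes) pair; a joint category is simply
    a code set with more than one member.
    """
    counts = Counter()
    for passage, codes in coded_responses:
        counts[(passage, frozenset(codes))] += 1
    return counts


# Invented coded responses (labels are hypothetical shorthand).
coded = [
    ("SA", {"prior"}),
    ("SA", {"prior", "item"}),
    ("SA", {"item", "prior"}),
    ("SI", {"guess"}),
]
counts = tally_codes(coded)
for (passage, codes), n in sorted(counts.items(), key=lambda kv: (kv[0][0], sorted(kv[0][1]))):
    print(passage, " + ".join(sorted(codes)), n)
```

Using a set for each combination means the order in which a participant mentioned the factors does not create spurious extra categories.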
It is clear from these data that participants were taking into account multiple factors when rating their confidence. Not only do these data support using Bandura’s (1986) model of reciprocal determinism, but they further indicate that inferences made about participants’ calibration must be carefully considered. For instance, many individuals used only one factor for making their confidence judgments, while others used different factors for different items. Further, some individuals were able to incorporate multiple factors when making a single confidence judgment.
How can the preceding findings from this exploratory study help guide exploration of calibration in different learning situations? We suggest that these data inform research into calibration with regard to three issues: theoretical framing and the importance of knowing what individuals base their confidence judgments on, the measurement of confidence and performance, and the calculation of calibration.
These findings indicate that students’ confidence ratings do include elements of the person (i.e., prior knowledge) and the task (i.e., text characteristics and item characteristics in this study). Further, it was clear that students were basing their confidence on different parts of the model of reciprocal determinism (i.e., person characteristics or environment characteristics) or a combination of both (i.e., person and environment characteristics). These findings highlight that these data fit the model of reciprocal determinism, and that further investigation into how students choose which of these aspects to consider when making a confidence judgment is warranted. For instance, simply asking students to rate their general confidence may not provide enough information for valid inferences related to the nature of calibration. Perhaps for some items or tasks it would be more telling to ask how confident one is based on person factors, environment factors, and some combination of both. Regardless of how confidence is measured, it is critical to frame it in some sort of theory or model to provide explanatory power about calibration inside and outside of the classroom.
Second, in order to make valid inferences about calibration, the measurements of confidence and performance must match the underlying theoretical distribution. For both confidence and performance, these underlying continuous distributions need to be measured using continuous, not dichotomous, variables. For confidence judgments this can be done relatively easily. The 100-mm lines provided a “quick and dirty” method of measuring confidence on an interval scale. While these scales meet the criteria for interval-level data, where students rate their confidence on the line (both the mean location and the variability of these responses) is difficult to interpret and likely differs across individuals. The magnitude scales, on the other hand, provide ratio-level data in which we can directly interpret the magnitude of students’ responses relative to a given anchor statement. The difficulty here is deciding what the anchor item should be. For this study, we chose an item that was of moderate difficulty for a small pilot sample, but other considerations about the item and the individuals in the sample may need to be taken into account.
Performance, on the other hand, is more difficult to measure in a continuous fashion. We have used ordinal-level data with the graduated response model. One possible extension of this response model would be to use more response options, upwards of seven, which may produce a distribution of item responses that, over enough items, is approximately normal. Monte Carlo simulations such as those of Schraw, Kuch, and Roberts (this symposium) may help elucidate some of these issues.
Lastly, the issue of measurement has a direct effect on the calculation of calibration. Whatever coefficient or value is used for calibration must meet the assumptions of the underlying data. In this case we used rho because it was a conservative coefficient for our data. Another option for these data would be the polyserial correlation, which relates an ordinal variable with an underlying continuous distribution to a continuous variable. However, the polyserial coefficient poses computational difficulties (i.e., maximum-likelihood estimation) that are hard to overcome when each student responds to only a small number of items, as in the current study.
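As an illustration of the per-participant calculation described above, the following is a minimal sketch that assumes rho refers to Spearman's rank-order coefficient computed across one participant's items, with tied values given average ranks. The function names and the example ratings are hypothetical and are not taken from the study's data or analysis scripts.

```python
from math import sqrt
from typing import List, Optional, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(confidence: Sequence[float],
                 performance: Sequence[float]) -> Optional[float]:
    """Spearman's rho for one participant: Pearson correlation of the ranks.

    Returns None when rho is undefined, i.e. fewer than two items or no
    variability in either variable (e.g., every item scored in one category).
    """
    if len(confidence) != len(performance):
        raise ValueError("confidence and performance must have equal length")
    n = len(confidence)
    if n < 2:
        return None
    rank_c = average_ranks(confidence)
    rank_p = average_ranks(performance)
    mean_c = sum(rank_c) / n
    mean_p = sum(rank_p) / n
    s_cp = sum((c - mean_c) * (p - mean_p) for c, p in zip(rank_c, rank_p))
    s_cc = sum((c - mean_c) ** 2 for c in rank_c)
    s_pp = sum((p - mean_p) ** 2 for p in rank_p)
    if s_cc == 0 or s_pp == 0:
        return None
    return s_cp / sqrt(s_cc * s_pp)


# Hypothetical participant: confidence in mm on the line, performance scored 1-4.
example_rho = spearman_rho([20, 80, 50, 65], [1, 4, 2, 2])
```

Returning None rather than zero for a participant with no variability keeps an undefined coefficient from being read as poor calibration.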
One avenue of further investigation would be to see whether an index of calibration could be constructed using the standard error in linear regression. If confidence and performance were expected to have a standardized slope of 1, participants’ confidence ratings and performance could be used to obtain a standard error of these points from this predicted slope of 1. However, since continuous measures of performance are difficult to obtain and not common in the literature, other types of regression need to be investigated. One type of regression that might be useful here is ordinal regression, which is a type of logistic regression (Pedhazur, 1997). Violations of assumptions would first need to be overcome. For example, with these data a few participants only had responses in one outcome category (e.g., 4). Without a minimum number of responses in each category, violations of assumptions and poor model fit may inhibit its use.
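The slope-of-1 idea can be sketched as follows. This is one possible reading only: confidence and performance are standardized against reference means and standard deviations (assumed here to come from the whole sample), and the index is the root-mean-square vertical distance of a participant's points from the line on which standardized performance equals standardized confidence. The function names, the choice of n as divisor, and the example values are assumptions, not the authors' procedure.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def standardize(values: Sequence[float], center: float, spread: float) -> List[float]:
    """Convert raw scores to z-scores using a supplied mean and standard deviation."""
    if spread <= 0:
        raise ValueError("standard deviation must be positive")
    return [(v - center) / spread for v in values]


def slope_one_error(confidence: Sequence[float],
                    performance: Sequence[float],
                    conf_ref: Tuple[float, float],
                    perf_ref: Tuple[float, float]) -> float:
    """Root-mean-square deviation from the line z_performance = z_confidence.

    conf_ref and perf_ref are (mean, standard deviation) pairs, assumed to be
    estimated from the full sample. Zero means every point falls on the
    predicted slope-1 line; larger values mean poorer calibration.
    """
    if len(confidence) != len(performance) or len(confidence) == 0:
        raise ValueError("need equal-length, non-empty sequences")
    z_conf = standardize(confidence, *conf_ref)
    z_perf = standardize(performance, *perf_ref)
    residuals = [p - c for c, p in zip(z_conf, z_perf)]
    # No slope or intercept is estimated here, so n is used as the divisor.
    return sqrt(sum(r * r for r in residuals) / len(residuals))


# Hypothetical participant whose standardized points fall exactly on the line.
perfect = slope_one_error([25, 50, 75], [1, 2, 3], (50.0, 25.0), (2.0, 1.0))
```

Because the sketch treats performance as continuous, it would not resolve the ordinal-data concern raised above; it only makes the proposed index concrete.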
Because calibration is such an important consideration in self-regulation and learning, innovations in its measurement continue to be of utmost importance to researchers and educators alike. This study sought to answer three critical questions remaining in the use and interpretation of students’ calibration data. First, two types of confidence rating response formats were contrasted, and some restriction in the range of responses to the 100-mm lines was found. This finding draws attention to the need to consider the nature of the data being interpreted. Second, a rho coefficient was used to calculate calibration scores with continuous data, and this approach is suggested for use with non-dichotomous responses. Finally, a theoretical examination of students’ rationales for their confidence ratings was undertaken. Students consider various and sometimes multiple sources when making their confidence judgments. Therefore, it is imperative for educators to consider the cues provided by these sources in order to better understand why some students can accurately monitor their learning, while others do not even notice the need to adjust their strategies and goals for learning.
Alexander, P. A., Dinsmore, D. L., Parkinson, M. M., & Winters, F. I. (2011). Self-regulated learning in academic domains. In B. J. Zimmerman & D. Schunk (Eds.), Handbook of self-regulation of learning and performance. New York: Routledge.
Alexander, P. A., Murphy, P. K., & Kulikowich, J. M. (1998). What responses to domain-specific analogy problems reveal about emerging competence: A new perspective on an old acquaintance. Journal of Educational Psychology, 90, 397-406.
Alexander, P. A., Murphy, P. K., Woods, B. S., & Duhon, K. E. (1997). College instruction and concomitant changes in students’ knowledge, interest, and strategy use: A study of domain learning. Contemporary Educational Psychology, 22, 125-146.
Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory. Englewood Cliffs, NJ: Prentice-Hall.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. Mahwah, NJ: Lawrence Erlbaum Associates.
Cozby, P. C. (2007). Methods in behavioral research. New York: McGraw-Hill.
Crisp, V., Sweiry, E., Ahmed, A., & Pollitt, A. (2008). Tales of the expected: The influence of students’ expectations on question validity and implications for writing exam questions. Educational Research, 50, 95-115.
Dinsmore, D. L., Alexander, P. A., & Loughlin, S. M. (2008). Focusing the conceptual lens on metacognition, self-regulation, and self-regulated learning. Educational Psychology Review, 20, 391-409.
Dunlosky, J., Serra, M. J., Matvey, G., & Rawson, K. A. (2005). Second-order judgments about judgments of learning. Journal of General Psychology, 132, 335-346.
Fischhoff, B., Slovic, P., & Lichtenstein, S. (1977). Knowing with certainty: The appropriateness of extreme confidence. Journal of Experimental Psychology: Human Perception and Performance, 3, 552-564.
Flavell, J. H. (1979). Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry. American Psychologist, 34, 906-911.
Glenberg, A. M., & Epstein, W. (1985). Calibration of comprehension. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11, 702-718.
Glenberg, A. M., Sanocki, T., Epstein, W., & Morris, C. (1987). Enhancing calibration of comprehension. Journal of Experimental Psychology: General, 116, 119-136.
Koriat, A. (2011). Subjective confidence in perceptual judgments: A test of the self-consistency model. Journal of Experimental Psychology: General, 140, 117-139.
Labuhn, A. S., Zimmerman, B. J., & Hasselhorn, M. (2010). Enhancing students’ self-regulation and mathematics performance: The influence of feedback and self-evaluative standards. Metacognition and Learning, 5, 173-194.
Lodge, M. (1981). Magnitude scaling: Quantitative measurement of opinions. Newbury Park, CA: Sage Publications.
Parkinson, M. M., Dinsmore, D. L., & Alexander, P. A. (2010). Calibrating calibration: Towards conceptual clarity and agreement in calculation. Paper presented at the annual meeting of the American Educational Research Association, Denver.
Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd ed.). Orlando, FL: Harcourt Brace.
Pulford, B. D., & Colman, A. M. (1997). Overconfidence: Feedback and item difficulty effects. Personality and Individual Differences, 23, 125-133.
Schraw, G. (2009). A conceptual analysis of five measures of metacognitive monitoring. Metacognition and Learning, 4, 33-45.
Schraw, G., Kuch, F., & Roberts, R. (2011). Bias in the gamma coefficient: A Monte Carlo study. In P. Alexander (Chair), Calibrating calibration: Conceptualization, measurement, calculation, and context. Symposium presented at the annual meeting of the American Educational Research Association, New Orleans.
Schraw, G., Potenza, M. T., & Nebelsick-Gullet, L. (1993). Constraints on the calibration of performance. Contemporary Educational Psychology, 18, 455-463.
Stavrianopoulos, K. (2007). Adolescents’ metacognitive knowledge monitoring and academic help seeking: The role of motivation orientation. College Student Journal, 41, 444-453.
Thorndike, E. L., & Gates, A. I. (1929). Elementary principles of education. New York: The Macmillan Company.
van Overschelde, J. P., & Nelson, T. O. (2006). Delayed judgments of learning cause both a decrease in absolute accuracy (calibration) and an increase in relative accuracy (resolution). Memory & Cognition, 34, 1527-1538.
Woodworth, R. S., & Thorndike, E. L. (1900). Judgments of magnitude by comparison with a
Number of Responses for each Confidence Code Category by Passage
Note. P = prior knowledge; T = text characteristic; I = item characteristic; G = guessing; O = other; PT = prior knowledge and text characteristics; PI = prior knowledge and item characteristics; PG = prior knowledge and guessing; TI = text characteristics and item characteristics; TG = text characteristics and guessing; IO = item characteristics and other; PTQ = prior knowledge, text characteristics, and item characteristics; TIG = text characteristics, item characteristics, and guessing; PIG = prior knowledge, item characteristics, and guessing.
Scatterplots for Ranges of Participants’ Calibration (rho) for the 100-mm Scales
Scatterplots for Ranges of Participants’ Calibration (rho) for the Magnitude Scales
Distribution of Participants’ Calibration for the Scientific Approach Passage
Distribution of Participants’ Calibration for the Statistical Inference Passage