First Facet: Physicians – True Variance
The first facet, which reflects differences between physicians, reveals that most scales measure a considerable amount of variance that can be attributed to the physicians (see Table 4). Together with the interaction between physicians and items, the first facet represents the true variance in which we are primarily interested. Since the quality of measurement of the traits had already been secured by Rasch analyses, the amount of variance that can be attributed to the first facet is not of primary significance to us. We can nevertheless see that the scale Communication Skills has poor measurement properties, because the variance component attributed to physicians appears to be very small.
Second Facet: Observers
The second facet, reflecting differences between observers, reveals varying degrees of observer influence on the variance in the MAAS-MI G-scales. This implies that differences between observers in the overall severity of their grading standards will produce variation in the estimates of a physician’s competence. Saal argues that a significant observer main effect, especially one that explains a sizable proportion of the rating variance, has to be interpreted as the traditional leniency effect, generally defined as the tendency of observers to assign a higher or lower rating than is warranted by a subject’s behavior (Saal et al, 1980).
Leniency refers to the phenomenon that some observers will indicate that behavior occurred, whereas others will state that it occurred insufficiently or only partly, although all of them observed the same behavior.
The influence of observers is lowest in the scales History-taking and Interpersonal Skills, moderate in Exploring Reasons for Encounter, Presenting Solutions and Structuring, and greatest in the scale Communication Skills. Scales referring to larger units of interview behavior which are difficult to define and for which the criteria for scoring are less well described are more likely to be affected by a leniency effect. It is remarkable that the scale Interpersonal Skills is scarcely influenced by the leniency effect. Stricter definitions and training sessions in which observers reach agreement are ways to reduce the leniency effect in the affected scales.
Third Facet: Items
The third facet, reflecting differences between items, shows that a considerable amount of variance is induced by the items which form the scales. The interpretation of the item facet posed problems for the researchers because of the difficulty resulting from systematic variance being induced by items which are identical for each observation situation. We attributed the differences between items to a different source of variation in our study: namely, the combination of simulated patient and medical problem. Considerable differences on the third facet between the two cases are observed for the scales Exploring Reasons for Encounter, History-taking and Presenting Solutions (respectively 11.9%, 17.8% and 17.1% for myocardial infarction, and 48.2%, 4.0% and 36.0% for diabetes mellitus). Dissimilarity in case presentation by the simulated patients and differences between the medical problems are considered to be responsible for these effects.
Fourth Facet: Physicians X Observers
The fourth source of variation, reflecting the interaction between physicians and observers, influences the scores on the scales in varying degrees. Once again, the scale Communication Skills is affected most significantly but this influence also acts upon Structuring the interview and Interpersonal Skills.
The interaction between physicians and observers is known in the literature as the halo-effect, defined as an observer’s failure to discriminate between conceptually distinct and potentially independent aspects of a subject’s behavior (Saal et al, 1980).
The occurrence of halo-effects suggests that one characteristic of a subject will influence an observer’s opinion on a variety of items. Halo-effects are likely to emerge when the behavior under study is not well defined or when a substantial degree of judgment is involved in answering an item. Halo-effects were expected to occur to a certain degree in the scales Structuring, Interpersonal Skills and Communication Skills because these scales either measure larger units of difficult-to-define interview behavior or require observers to indicate their personal opinion of a physician’s skills. However, halo-effects were not expected to contribute so strongly to the measurement of Communication Skills because, beforehand, we had considered the interviewing skills in this scale to be defined in more behavioral terms and to be more easily measured than interpersonal skills. Apparently, an observer’s impression of the skill to communicate is reduced to a more global judgment, probably because the scale measures interview behavior which takes place more than once and, sometimes, even many times in the course of a consultation. No differences between the two cases were observed with regard to the impact of halo-effects on the MAAS-MI G-scales.
To diminish the influence of halo-effects, items should be well-defined, a difficult requirement with regard to these scales. Items in the scales Structuring and Interpersonal Skills are already described as behaviorally as possible. We consider that the rewording of these items to achieve more behavioral descriptions of interviewing skills would seriously impair the validity of the scales. With regard to Communication Skills, we have earlier proposed the classification of every utterance of the physician. A quite different approach might be the selection for measurement purposes of those observers who are known to evoke only low degrees of halo-effects. Nevertheless, it is our opinion that, even with this approach, halo-effect has to be accepted as a measurement feature of these scales.
Fifth Facet: Physicians X Items – True Variance
The fifth source of variation, reflecting the interaction between physicians and items, is considered in the literature as indicating true variance in addition to the physician facet (Thorndike, 1982). This source of variation refers to differences between physicians with regard to their interviewing styles. The figures show that interviewing styles are most pronounced in Exploring Reasons for Encounter and History-taking, less so in Presenting Solutions and Structuring the interview, and scarcely present at all in Interpersonal Skills and Communication Skills. Considerable differences between the two cases were observed in the size of the interaction between physicians and items, especially in Exploring Reasons for Encounter, History-taking and Presenting Solutions (respectively 30.7%, 25.6% and 13.3% for the myocardial infarction case, and 10.2%, 43.4% and 6.2% for the inception of diabetes mellitus case).
Sixth Facet: Observers X Items
The sixth source of variation, reflecting the interaction between observers and items, is strongest for Communication Skills, weaker for Interpersonal Skills, Presenting Solutions and Structuring, and almost absent for Exploring Reasons for Encounter and History-taking. The interaction between observers and items refers to differences between observers in interpreting the meaning of items and the criteria for scoring. The figures reveal that single acts of operationally defined interview behavior are interpreted very similarly, whereas situations in which the observer has to match the occurrence of interview behavior with MAAS-MI G-items and their definitions tend to induce differences in interpretation. The item Explains diagnosis or problem definition understandably, for example, is difficult to score because the observer has to decide which part of everything said by the physician to the patient pertains to the diagnosis, whether it provides an explanation and whether it is presented understandably. This source of error variance is assumed to be minimized by the use of additional resources such as a manual, instructions and articles which increase observers’ understanding of the behavior under study. Moreover, the training of observers by means of group observations of videotaped interviews is likely to reduce this effect. Since four of the six observers are experts as far as the MAAS-MI G is concerned, having participated actively in its construction, the figures presented here are considered as reflecting the upper limits of agreement that can be achieved among observers on the interpretation of MAAS-MI G-items. Moreover, only minor differences between the two cases are observed.
Seventh Facet: Error
The size of the seventh source of variation shows that the interaction between physicians, observers and items, including error, forms a considerable source of variation in our measurements. This suggests that, in addition to the known and controlled sources of variation, other influences can act upon the variation of our measurements. Error is one of these influences. Physicians’ interview behavior is considered to be affected by conditions such as fatigue, motivation, willingness to participate, etc., whereas observers’ ratings are likely to be influenced by fatigue, mood, inaccuracy and pressure of time. Moreover, the simulated patients who play opposite the physicians, and the cases they present, contribute to the patient-physician communication and will therefore induce variation in our measurements. Differences between the two cases with regard to the size of the seventh source of variation are observed for the scales Exploring Reasons for Encounter and Presenting Solutions.
In conclusion, physicians, observers and items are all capable of inducing variance in MAAS-MI G measurements of medical interviewing skills, as is evidenced by the size of the different variance components presented in Table 4. Furthermore, the cases and/or simulated patients appear to elicit different amounts of true variance and induce differences in the facet items, the interaction of physicians and items, and the interaction of physicians, items and observers including error. This phenomenon undermined the design of the generalizability study to some extent and posed interpretation problems for the researchers. This issue is therefore elaborated upon below.
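The seven sources of variation discussed in this section correspond to the standard variance decomposition for a fully crossed physicians x observers x items design. As a sketch in conventional generalizability-theory notation (the symbols p, o and i for physicians, observers and items are our own shorthand and do not appear in Table 4), the variance of a single observed score can be written as:

```latex
\sigma^2(X_{poi}) = \sigma^2_{p} + \sigma^2_{o} + \sigma^2_{i}
                  + \sigma^2_{po} + \sigma^2_{pi} + \sigma^2_{oi}
                  + \sigma^2_{poi,e}
```

Here the last term confounds the three-way interaction with residual error, which is why the seventh facet cannot be separated into its two parts. The percentages quoted in the text presumably express each component as a share of this total.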
Remedy of errors induced by observers
The generalizability study was conducted to determine, in one study, the impact of physicians, observers and items on MAAS-MI G measurements of medical interviewing skills and to calculate generalizability coefficients which were expected to provide information about the effect of strategies to remedy error induced by observers. This is of importance because the number of observers participating in data recording is the only source of variation that can be manipulated by the researchers.
The generalizability coefficients for one observer (see Table 5) reveal that an acceptable level of reliability is achieved for the scale History-taking, whereas the other scales display low levels of reliability (Mitchell, 1979). Adding a second observer increases the reliability of Exploring Reasons for Encounter, History-taking, and Presenting Solutions to a moderate or even high level. Structuring barely reaches an acceptable level, whereas Interpersonal Skills and Communication Skills gain the least advantage from an increase in the number of observers to enhance reliability; this is probably because these scales do not measure a large component of true variance. It is clear that an increase in the number of observers mitigates error induced by observers and diminishes differences between observers, halo-effects and different interpretations of items. Reliability is most effectively increased by adding one or two observers. The addition of even more observers to the process of measurement will not add substantially to reliability.
In conclusion, we observe that it is possible to alleviate the influence of observers on our measurements of medical interviewing skills by incorporating one or two additional observers in the process of measurement.
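The diminishing return from adding observers can be illustrated with a small sketch. It assumes, in line with the interpretation given above, that the physician component and the physician x item interaction count as true variance, and that the physician x observer interaction and the residual are the error terms averaged over observers. The function name, the dictionary keys and the variance components are hypothetical illustrations; they are not the values of Table 4, and the formula is a simplified textbook form rather than the exact computation behind Table 5.

```python
def g_coefficient(components, n_observers, absolute=False):
    """Generalizability coefficient for the mean rating of n_observers.

    Assumption: true variance = p + pi (items treated as a fixed facet);
    relative error = (po + poi_e) / n_observers. With absolute=True the
    observer main effect and the observer x item interaction are added
    to the error, as for absolute (criterion-referenced) decisions.
    """
    true_var = components["p"] + components["pi"]
    error = components["po"] + components["poi_e"]
    if absolute:
        error += components["o"] + components["oi"]
    error /= n_observers
    return true_var / (true_var + error)


# Hypothetical variance components (percent of total), NOT taken from Table 4.
example = {"p": 20.0, "o": 5.0, "po": 10.0, "pi": 25.0, "oi": 5.0, "poi_e": 35.0}

for n in (1, 2, 3, 4, 6):
    print(n, round(g_coefficient(example, n), 2))
```

With these illustrative numbers the coefficient rises from 0.5 for one observer to about 0.67 for two and 0.75 for three, after which each further observer adds less. A scale with a small true-variance share starts lower and needs more observers to reach the same level, which matches the pattern described for Interpersonal Skills and Communication Skills.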
Interrater-reliability of MAAS-MI General can be enhanced by adding a second observer
Reliability in case of interactional data
The reliability of the MAAS-MI G measurement was studied by observing medical interviews of 20 physicians, each of whom talked with one of five simulated patients presenting one of two cases. We decided to study reliability over two different cases in order to enhance external validity. Although the design was completely crossed for physicians, observers and items, simulated patients and cases were not included in the completely crossed design.
Since the question can be raised of whether this procedure is acceptable, we studied the impact on our measurements of medical interviewing skills of:
- The case as a medical problem;
- Simulated patients as human beings interacting with their physician;
- Simulated patients as performers of certain tasks.
Swanson et al (1981) studied the stability of medical interviewing skills over different cases and concluded that patients and cases seem to differ greatly and that the physician’s adaptation to these differences is complex and difficult to measure.
The mean scores of 40 physicians on the scales in the MAAS-MI G for both cases, which are displayed in Table 7, reveal significant differences for the scales Exploring Reasons for Encounter, History-taking and Presenting Solutions. Beforehand, we had expected differences to show up only for the scale History-taking because the cases that were presented differed considerably with regard to the amount of information necessary to solve the medical problem. The mean scores reveal that, in accordance with expectation, almost no questions were asked in the diabetes mellitus case, whereas several questions were asked in order to solve the myocardial infarction case. The generalizability coefficients for the scale History-taking are, however, considerable and almost identical for both cases. This suggests that, given a certain medical problem, differences in History-taking skills are determined solely by differences between physicians and – of course – observers. The case will not interfere with the measurement of physicians’ interviewing skills during History-taking.
The situation is rather different for Exploring Reasons for Encounter and Presenting Solutions, because the medical problem is not conceived as significantly influencing these interviewing skills. The simulated patients, however, are expected to have a considerable impact, especially because these phases of the interview enable them to voice their concerns and to obtain information about and treatment for their complaints. The simulated patients in the diabetes mellitus case were trained to worry about the imminent disease and to insist on obtaining information, whereas the patients in the myocardial infarction case were much less demanding. The generalizability coefficients presented in Table 6 show considerable differences between the cases in Exploring Reasons for Encounter, Presenting Solutions (after diminishing interfering observer influences) and Communication Skills. The results suggest that the goals which patients try to achieve in the interview interfere – a better term might be interact – with the measurement of differences between physicians with regard to the quality of their interviewing skills.
Goals patients try to achieve interfere with the measurement of interviewing skills of physicians
An alternative explanation might be that differences between the simulated patients as private persons will influence the interaction between physician and patient. To study this alternative explanation, generalizability coefficients for the scale Exploring Reasons for Encounter were computed for each of the five simulated patients. Since chance capitalization was likely to occur because the number of physicians was very small, these results have to be treated cautiously. For the patients in the myocardial infarction case, the coefficients were respectively .51, .35, and .56, and for the diabetes mellitus case, .06 and .18. The results support the hypothesis that patients, as private persons, elicit different amounts of true variance because different coefficients were reported within each case but, more importantly, the coefficients were low for the diabetes case and moderate for the myocardial infarction case. Since the coefficients were rather stable within each case, they also support the hypothesis that the goals which patients try to achieve have a considerable impact on physicians’ interviewing skills.
During the construction of the Patient Satisfaction with Communication Checklist, identical problems were encountered because patients did not distinguish between the dimensions Insight and Providing Information. It is clear that, during an interaction, both participants influence each other and constitute together the reality of a medical interview. Unfortunately, our methods of measurement are only able to measure either the physician’s interviewing skills or the patient’s opinion about the consultation and they are unable to deal with the interaction.
Given that we now know that a medical interview forms a dialogue between physician and patient and that both contribute to the communication, the following question is raised: What is the reliability of the MAAS-MI G? Physician and patient influence each other, and the interviewing skills which are displayed by the physician during a medical consultation are therefore partly dependent upon his interviewing style and partly dependent upon the patient’s contribution. Reliability, on the other hand, is defined as the ratio of true variance to observed variance. True variance in a physician’s medical interviewing skills is assumed to reflect the differences in style and quality that are attributed to the physician, but it is clear that a physician’s skills are also dependent upon the patient’s contribution to the communication and upon the interaction between physician and patient. An example will clarify this issue.
The generalizability coefficient for two observers for Exploring Reasons for Encounter was .65 (moderate) over both cases, .60 (moderate) over the myocardial infarction case and .25 (low) over the diabetes mellitus case. Is reliability moderate or low? This question cannot be answered because both are true. We therefore conclude that the reliability of interactional data should be studied cautiously and that results should be interpreted in a comparative rather than an absolute way, as we have done in the discussion of the estimates of variance components.
Demanding patients interfere in the interaction and yield less information about physician’s interview ability
Our study reveals, furthermore, that demanding patients who insist on discussing specific topics yield less information about a physician’s or student’s ability to perform a medical interview, an important finding for teaching and evaluation situations. The patient’s influence is most pronounced during the Exploration of Reasons for Encounter and the Presentation of Solutions, whereas History-taking is influenced considerably by case differences. Future studies are necessary to examine this issue in greater depth.