The generalizability coefficients for physicians, with observers treated as a random source of variation and items as fixed, are generally low for 1 observer, moderate for 2 observers and reasonable to good for 6 observers. Mitchell (1979) considers the range of 0.50 to 0.80 as moderate to good when three main sources of variance and their interaction components are taken into account.
Generalizability of MAAS-MI Mental Health scales is generally moderate for 2 observers
The coefficients are reasonable for the scales Exploration of the Reasons for Encounter, History-taking, Psychiatric Examination and Socio-emotional Exploration. The scales Presenting Solutions, Structuring the interview and Interpersonal Skills have moderate coefficients, whereas the coefficient for Communicative Skills is low.
When we scrutinize the variance components of the various sources (Table 4), it becomes clear why some generalizability coefficients are low or moderate.
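For orientation, the seven sources of variation discussed in the following paragraphs match the usual decomposition of observed-score variance in a fully crossed physicians x observers x items design. The notation below is an assumption added for illustration and is not taken from Table 4; the last term confounds the three-way interaction with residual error.

```latex
\sigma^2(X_{poi}) = \sigma^2_{p} + \sigma^2_{o} + \sigma^2_{i}
                  + \sigma^2_{po} + \sigma^2_{pi} + \sigma^2_{oi}
                  + \sigma^2_{poi,e}
```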
True Variance in Physicians’ Interviewing Skills
The first source of variation, physician (p), together with the fifth source of variation, the interaction between physicians and items (p x i), represents the true variance of the physician’s interviewing ability in the eight theoretical dimensions represented by the MAAS-MI MH scales (see also Thorndike, 1982). These variance components are generally low, in particular in the scales Presenting Solutions, Structuring the interview, Interpersonal and Communicative Skills. Since the quality of measurement of the traits has already been secured by the Rasch analyses, the magnitude of this first source of variation is not that important in itself.
Observer Influences on Variance
The second source of variation, observers (o), concerns the variance caused by systematic differences in scoring by observers. Strict or lenient observers, for example, cause systematic variations in the assessment of the physicians’ interviewing skills. According to Saal et al. (1980), a significant proportion of this observers’ variance should be interpreted as the traditional leniency effect, defined as the systematic tendency by observers to assign a higher or lower rating than is warranted by the subject’s interviewing skills. The scales Presenting Solutions, Structuring the interview and Communicative Skills seem to be most vulnerable to the leniency effect, whereas the scales History-taking and Psychiatric Examination are less vulnerable. However, none of these effects is very strong.
The background of this leniency effect may be threshold problems in the scoring of more complex interviewing behavior. For instance, observers may know and agree upon the criteria for scoring an item, such as Discussion of the pros and cons of the proposed help. In observing an interview, observers may disagree on whether such a discussion took place completely or only partly. Observers holding the latter opinion will decide that the threshold for positive scoring has not been reached.
Variance Induced by Our Choice of Items
The third source of variation, items (i), reflects the variation in the items used during the scoring of the MAAS-MH. The variation in which items are scored positively depends on two factors.
- First, the physician’s interviewing style reacts more or less flexibly to the mode of self-presentation of the patient. This flexibility requires a variation in the skills needed for the process of interviewing. Examples of item variation due to differences in style are found in the relatively high scores of Presenting Solutions and, in particular, Interpersonal Skills.
- Second, the variation in mental health problems requires different content elements in the interview. Examples of the influence of this factor are found in the relatively high components in the scales Exploration of the Reasons for Encounter, History-taking and Socio-emotional Exploration.
This frequently considerable source of variation reflects in general the difficulties of translating medical interviewing skills into items.
Halo-effects
The fourth source of variation is the interaction between physicians and observers (p x o), known in the literature as the halo-effect. It is defined as the observer’s failure to discriminate among conceptually distinct and potentially independent aspects of a subject’s behavior (Saal et al., 1980). In general, halo-effects are not very high in the MAAS-MH (variance components of about 5%), except for the scales Structuring the interview and, to a lesser extent, Communicative Skills.
We expected halo-effects to be high in the scales Presenting Solutions and Interpersonal Skills as well, because these scales measure larger units of difficult-to-define interview behavior. Moreover, they require observers to indicate their personal opinion of the physicians’ skills. It is commonly acknowledged that halo-effects are high under such conditions. It is not clear whether the high halo-effect in the scale Structuring the interview should be explained by the complex interviewing behavior causing unreliable scoring or by the low validity of the underlying concept.
The latter argument is supported by the fact that, in primary mental health care, the distinction between the phases Exploration of the Reasons for Encounter and History-taking is not clear because, in both phases, much patient-centered information is collected. The structuring of the interview in phases is thus difficult to measure.
Physicians respond to circumstances
The fifth source of variation, the interaction between physicians and items (p x i), is considered in the literature as true variance and is therefore added to the physician’s facet (p) (Thorndike, 1982). This source of variance is highest in the scales measuring skills characteristic of the three phases of the medical interview. Content elements play a major role here. Depending on the specific content aspects of a case, the physician will ask certain questions which might well not be posed in other cases. Parallel to the discussion of items as a source of variation, one should conclude that different cases yield different patterns of scored items. In the scales which measure process aspects of the interview that are less dependent on the case, the physician and items interaction component is less prominent. These findings suggest a degree of case-specificity, which we address further in the following paragraphs.
Observers interpret items differently
The sixth source of variance, the interaction between observers and items (o x i), refers to the differences between observers in interpreting the meaning of items and the criteria for scoring. This variance component is greatest for the scale Communicative Skills, and it may well be responsible for the low generalizability coefficient of this scale. Apparently, items like confrontation, concretization and conveying information in small units are liable to differences in interpretation because the criteria for scoring may be confusing. The combination of qualitative criteria (how adequately accomplished) and quantitative criteria (how many times accomplished) might be too intricate for appropriate scoring. Improvements will be made by splitting items and by introducing more unequivocal criteria.
There is more than we can measure
The seventh source of variation is the interaction between physicians, observers and items, which also includes an error component (p x o x i + error). This considerable variance component suggests that influences other than the above-mentioned sources of variation, such as error, also affect the scores. The observers’ ratings may be influenced by fluctuating attention, mood, fatigue and pressure of time. All scales of the MAAS-MH are considerably affected by this variance component (ranging from 31.4 to 47.0).
Although the error in this source of variation is not distinguishable from the p x o x i component, we have to conclude that much of the physician’s interviewing style is not covered by MAAS-MI MH items. This is the price paid for our operationalization of the physician’s interviewing behavior into clearly defined, teachable skills. In particular, the exclusion of much non-verbal behavior from the MAAS may be a reason for this considerable, unexplained component of variance.
The observer’s influence on the MAAS-scores becomes clear from the comparison of the generalizability coefficients based on measurement by one, two and six observers. Rising from one to six observers, the generalizability coefficients increase from low to good. This finding suggests a marked observers’ influence on the scores.
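The effect of adding observers can be illustrated with a minimal sketch of a decision-study calculation. It assumes a relative generalizability coefficient in which p and p x i together form the true variance (as stated above) and in which the p x o and p x o x i + error components are divided by the number of observers. The function name `g_coefficient`, the dictionary keys and all numeric values are hypothetical; they are not the components reported in Table 4, and the exact formula used in the study may differ.

```python
def g_coefficient(var, n_observers):
    """Relative G coefficient with observers random and items fixed.

    Assumed form (not confirmed by the source):
        true  = var_p + var_pi
        error = (var_po + var_poi_e) / n_observers
    """
    true_var = var["p"] + var["pi"]
    error_var = (var["po"] + var["poi_e"]) / n_observers
    return true_var / (true_var + error_var)


# Hypothetical variance components (percent of total variance),
# chosen for illustration only; NOT taken from Table 4.
components = {
    "p": 10.0,      # physicians
    "o": 3.0,       # observers (leniency); not used in the relative coefficient
    "i": 15.0,      # items; not used in the relative coefficient
    "po": 5.0,      # physicians x observers (halo)
    "pi": 20.0,     # physicians x items (treated as true variance)
    "oi": 7.0,      # observers x items; not used in the relative coefficient
    "poi_e": 40.0,  # three-way interaction confounded with error
}

for n in (1, 2, 6):
    print(n, round(g_coefficient(components, n), 2))
```

With these made-up values the coefficient rises as observers are added, because only the observer-related error terms shrink with the number of observers while the true variance stays the same.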
For summative evaluation, adding a second observer mitigates observer-induced variance and improves reliability
Despite this conclusion, for the forthcoming validity studies with the MAAS-MH we use scores summed over two observers to control for the observers’ effect.