4.2 Scalable, Reliable, Generalizable

MAAS Medical Interview Mental Health is well-suited to measuring young practitioners’ competence when it comes to the mental health medical interview. A portfolio of around 20 cases allows supervisors to gain unique insights into young professionals’ interviewing skills in mental health.

Each of the eight MAAS Medical Interview Mental Health scales measures one fundamental clinical competence. Together, all eight scales measure the complex yet integrated competence of medical interviewing in mental health.

Here, we investigate the scalability, reliability and generalizability of MAAS Medical Interview Mental Health.

Kraan, H. F., & Crijnen, A. A. M. (1987). Scalability and reliability of the MAAS-Mental Health. In H. F. Kraan & A. A. M. Crijnen (Eds.), The Maastricht History-taking and Advice Checklist – studies of instrumental utility (pp. 249–278). Lundbeck, Amsterdam.

In the present chapter, results of an investigation into the scalability, reliability and generalizability of the MAAS-MI MH are reported. 

Scalability

Scalability is attained when items that aim to measure one underlying theoretical dimension fit the assumptions of the Rasch model well. To ascertain scalability, we determine the fit of the 8 MAAS-MI MH scales to the Rasch model. In addition, we check whether each scale measures a different dimension. This study is useful because, in forthcoming studies, such as validity research, we shall use the 8 scales as indices for important theoretical concepts such as Exploration of the Reasons for Encounter, Psychiatric Examination, etc. 

Reliability

Furthermore, the reliability of the MAAS-MI MH is investigated. We address the question of whether measurement with the MAAS-MI MH is stable and consistent when subjects, observers and mental health problems vary. Reliability is studied at the level of the 8 scales of the MAAS-MI MH, using the summed scores of each scale. Reliability is also studied for each MAAS-MI MH item; this subject is treated in the next chapter, where results are presented from studying item reliability in a content validity perspective.

Generalizability over observers and scales

Generalizability analyses are used to gain insight into the amount of true variance (i.e., the physician's essential interviewing ability), the agreement among observers, measurement biases caused by the instrument itself, and errors of measurement. 

Generalizability over cases

A special question of reliability is that of inter-case reliability: the stability of interviewing skills over different mental health problems. The impact of the case on the MAAS scores is also studied by means of generalizability analysis. 

We refer the reader to Instrumental Utility for clarifying remarks on the objectives and the theoretical background of the methodology of these studies.

Scalability

A first step in scale construction of a measurement instrument is the assembly of a set of items which all measure the latent trait we wish to measure. The latent trait in our study is the physician’s ability to perform initial interviews in primary mental health care.

In scalability, we examine whether items measure the interviewing skills we intend to measure.

We are interested in whether:

  • The 8 theoretically assembled scales have one underlying trait;
  • The items of these scales are able to differentiate between competency levels on this latent trait;
  • These scales measure different underlying traits or whether there is also uni-dimensionality between the different scales.

Probabilistic Scale Models

In this thesis, we have chosen to use probabilistic scale models, especially the one-parameter logistic (Rasch) model, to support the process of scale construction because of its attractive, though demanding, features. We have extensively described the characteristics of this model in Instrumental Utility.

We decided to use the Rasch model, a probabilistic scale model, because of its attractive – though demanding – features, in which the difficulty of an item defines a subject's position on the ability scale and the probability of endorsing an item

  • In the Rasch model, the items should fit a logistic function relating the probability of a successful score on an item to the subject's position on the ability scale. The shape of this curve is almost indistinguishable from the normal ogive (Hambleton et al., 1977).
  • The fitting of items in this item-characteristic curve has two important consequences:
    • In the first place, the probability of an individual subject providing a correct score on an item (when a certain interviewing skill is present) is independent of the distribution of the subject’s ability in the population of subjects of interest. The probability of a correct score for a physician does not depend on how many other physicians are located at the same point on the ability continuum (or at a different point).
    • In the second place, all items in the Rasch model are assumed to have equal discriminating power, varying only in difficulty. In our study, difficulty refers to the level of ability a physician needs to apply a certain interviewing skill.
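The item-characteristic curve described above can be sketched as a short numerical example; the function name and the example values are ours, for illustration only:

```python
import math

def rasch_probability(theta, b):
    """Probability of a positive score on an item under the Rasch model.

    theta: the subject's ability (here, interviewing ability)
    b: the item difficulty, i.e. the ability level at which the
       probability of a positive score is exactly 0.5
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# At theta == b the subject has a 50% chance of a positive score;
# this is the point reported per item as the difficulty.
print(rasch_probability(0.0, 0.0))  # 0.5
# A subject higher on the ability scale endorses the same item with
# higher probability, regardless of how other subjects are distributed.
print(rasch_probability(1.5, 0.0))
```

Because all items share the same slope, they differ only in where this curve is centred on the ability scale.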

Method

Subjects, research setting, and analyses are presented here.


Interviews of 102 (future) physicians, each talking with a patient simulating a major depression, were scored with the MAAS-MI MH. This sample consists of two subsamples: 40 residents in general practice and 62 psychiatric clerks in the 6th and final year of their medical curriculum. The characteristics of the residents group and the experimental conditions of this study have been described in Convergent & Divergent Validity.

The simulated patients presented a case of a middle-aged woman with a long-standing, under-treated depression (Diagnostic and Statistical Manual of Mental Disorders – major depression), after migration from her native village to a neighbouring town. 

The forty resident interviews were scored live with the MAAS-MI, whereas the 62 clerk interviews were scored after being videotaped. 

Analysis 

Over the 102 interviews, a Rasch analysis was carried out for each of the 8 MAAS scales using the PML program (Gustafsson, 1977; see also Scalable, Reliable, Generalizable). 

The following five steps were taken. 

  • Firstly, the item scores of the scales Interpersonal and Communicative Skills were dichotomised. On theoretical grounds, three constructors of the MAAS dichotomized each three-point scale by determining whether the second scale point was to be considered as good or bad interviewing. Their agreement was high (80-90%). With respect to controversial items, a consensus was attained after discussion. 
  • Secondly, descriptive statistical analyses were carried out, revealing no missing data. 
  • Thirdly, the binomial test and Allerup's graphical test were used to select the items not fitting the Rasch model (Molenaar, 1981). 
  • Fourthly, the fit of the scales to the Rasch model was ascertained by means of the Martin-Löf chi-square test (Martin-Löf, 1973). 
  • Finally, the uni-dimensionality of each MAAS scale was tested by determining whether pairs of Rasch homogeneous scales considered to measure distinct traits could be combined into one common scale.

Results

The Rasch analysis of the 8 MAAS-MH scales is given in Table 1.

  • Firstly, it is shown that the 8 scales fit the assumptions of the Rasch model after elimination of only a few items by the binomial and Allerup's graphical tests (first and second columns of Table 1). From the scales Psychiatric Examination, Socio-emotional Exploration, Presenting Solutions and Structuring the Interview, several items are excluded in order to fit the scales to the Rasch model.
  • In the scale Psychiatric Examination, some items are excluded because in the cases the simulated patients presented, certain interviewing skills of this scale were not applicable.
  • Secondly, the results of the Martin-Löf tests (columns three to five of Table 1) provide empirical evidence that the 8 scales fit the Rasch model. The higher the probability of the chi-square test, the better the fit of the scale to the assumptions of the Rasch model.
  • Thirdly, the internal consistency (KR-20) of the 8 scales is presented as a comparison of the probabilistic scale model with the classical test model. The alphas of the 8 scales are given in column 5 of Table 1. To compare them, their values have been corrected for a test length of 20 items by means of the Spearman-Brown formula (Guilford et al., 1982). The alphas are moderate, except for the scale Interpersonal and Communicative Skills, where they are low.
  • Fourthly, the item difficulties have been calculated; they reflect the point on the ability scale at which subjects have a 50% chance of scoring an item positively. They are presented in column 5 of Table 2.
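The Spearman-Brown correction mentioned above can be written out as follows; the scale length and alpha in the example are hypothetical, not values from Table 1:

```python
def spearman_brown(alpha, old_len, new_len=20):
    """Project an internal-consistency coefficient (e.g. KR-20) to a
    common test length, so scales of different lengths can be compared."""
    n = new_len / old_len  # lengthening factor
    return (n * alpha) / (1.0 + (n - 1.0) * alpha)

# A hypothetical 10-item scale with alpha = 0.50, corrected to 20 items:
print(round(spearman_brown(0.50, 10), 2))  # 0.67
```

Doubling a scale's length raises its projected consistency; projecting a long scale down to 20 items lowers it, which is what makes the corrected alphas comparable across scales.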

In Table 3, the subjects' scores on the scales are pairwise compared by means of the Martin-Löf chi-square test to examine whether the scales measure the same underlying dimension. The higher the probability (p), the stronger the evidence for uni-dimensionality of the two scales. Other statistics provided by the Martin-Löf tests, such as chi-squares and Pearson correlations, are also given (Molenaar, 1981).

What Demanding Criteria Reveal: The Hierarchical Texture of Interviewing Skills in Mental Health

It is notable that most of the items of the 8 original scales fit in a Rasch homogeneous scale despite the strictness of the Rasch-model.

MAAS-MI Mental Health: all scales meet the demanding criteria of the Rasch scaling model disclosing the hierarchical texture of these complex interviewing skills

This fact might be explained by the homogeneity of our sample. It is striking that the scales fit the Rasch model so well, given that the internal consistencies are in general only moderate. The reason for this discrepancy might be that the variance in the MAAS scores is rather restricted because of the homogeneity of our sample; this relatively low variance suppresses the reliability figures. This indicates a fallibility of classical test theory: reliability figures calculated from rather homogeneous samples turn out to be relatively low because of the low variances in the subjects' scores on the scales. In Instrumental Utility we have already discussed weaknesses of classical test theory resulting from sample dependency. 

We now take a closer look at the loss of items from the scales Psychiatric Examination, Socio-emotional Exploration and Presenting Solutions, caused by misfit to the Rasch model (see Table 2, column 5):

  • The loss of items from the scale Psychiatric Examination is considerable.
    • This loss concerns those items in particular which pertain to the exploration of disturbances in consciousness and orientation, memory, perceptual and thought disturbances. It is explained by the zero or near-zero variances of the item scores in our experiments. Possibly these items fit in the Rasch-model when tested in interviews that yield higher variances in these item scores.
    • The loss of three items pertaining to the psychiatric examination of anxiety can only be partly attributed to low variances in item scores, but is mainly due to a poor operationalization of the underlying theoretical concepts of anxiety in the items.
  • The item loss in the scale Socio-emotional Exploration is not severe. Only two items are lost, although the item on perspectives and ambitions in life is important, especially for depressive disorders.
  • Finally, the unexpected exclusion of two important items from the scale Presenting Solutions, Conveying concrete information about the execution of a given advice and Making appointments for further follow-up, is a considerable loss in a theoretical sense.

In sum, all MAAS-MH scales fit the Rasch model. In addition, they all share one underlying dimension according to the Martin-Löf test for uni-dimensionality.

Each MAAS-MI Mental Health scale measures one of eight underlying clinical competences; together, the scales measure one complex – but integrated – competence: medical interviewing in Mental Health

They differ in a content validity respect (see next chapter). However, some caution in the interpretation of these results is necessary pertaining to the application of these Rasch homogeneous scales to different populations and to the influence of case specificity.

Although our population of 6th year psychiatric clerks and residents in general practice is rather broadly defined, our Rasch homogeneous scales might not be applicable in a population of – for example – experienced general practitioners or inexperienced undergraduate students.

This restriction is made for three reasons:

  • In the first place, the patterns of interviewing ability in other populations might be so different that the present scale would no longer fit. For such a population, similar research should be carried out to ascertain anew the Rasch homogeneity of the scales.
  • Secondly, the distribution of item parameters within scales is often unequal and narrowly spaced, so some item-characteristic curves fall within the confidence bands of others. According to Birnbaum (1974), the informativeness of such narrowly spaced items about the subject's ability level is relatively low. In simpler terms, some adjacent items are interchangeable.
  • Thirdly – and perhaps the most likely cause of the latter observation – the numbers of subjects used in these analyses are relatively low. This is partly reflected in the relatively low number of observations in too many cells of the bivariate distributions of score groups and of numbers of subjects within these score groups. Therefore, the probability of capitalization on chance might be rather high.

Another critical remark pertains to the restrictive influence of case specificity. As shown previously (e.g. in Validity, reliability, scalability), case specificity, which can be attributed to differences in mental health problems and to differences in presentation by patients, might restrain the generalizability of the physician's interviewing skills from one "case" to another. Although these findings, based on classical test theory, do not hold automatically for probabilistic scaling, they call for some caution. The use of only one case of a depressive patient means that not all items of the MAAS-MH are used to a sufficient degree, as witnessed by the scoring pattern on the scale Psychiatric Examination. For instance, items concerning anxiety and disturbances in perception and thinking are not used. 

Conclusions

All scales of the MAAS-MI MH fit the Rasch model without losing items of great theoretical importance, except for some items in the scales Psychiatric Examination and Presenting Solutions. The scales all share one underlying dimension.

Inter-observer reliability

Measuring at Scale Level

Reliability is analyzed on the scale level by means of a generalizability study. The scale level has been chosen because each MAAS-MI MH scale is constructed around a theoretical dimension of medical interviewing skills and because each scale fits in the Rasch model. An additional reason for the use of the scales as the unit of analysis is the fact that the sum scores of items within each scale are taken as indices for the underlying theoretical dimensions of medical interviewing in the studies of validity. 

Include all sources of variance

We selected generalizability analysis as a measure of reliability because it is the most appropriate approach to analyzing the different sources of variation in a set of scores such as the one studied here. The sources of variation and their interactions may be rooted in differences in agreement between observers, in the physicians' interviewing styles and ability, and in measurement properties of the method.

Sources of variation may be rooted in differences in agreement between observers, in physicians’ interviewing styles and ability, and in measurement properties of the method  

For each scale, different coefficients of generalizability are calculated as reliability measurements. Generalizability coefficients are calculated for measurement situations where one, two and six observers are used. These coefficients are scrutinized in order to investigate how the observers, the physician’s interviewing ability and the measurement properties of the method itself all influence the reliability of the MAAS-scores. 

Method

Sample, instruments and analyses are described here.


This reliability study was carried out in a sample similar to that described in other chapters, which is briefly presented here. 

Subjects

In order to secure optimal conditions for measurement, comparability and control, we created a consultation hour in which 40 residents in general practice interviewed four different simulated patients. The characteristics of this group of 40 residents in general practice have already been described, as has the experimental situation (see Convergent & Divergent Validity).

Two of the four simulated patients presented a mental health problem to the physician.

  • One patient represented a middle-aged woman with a long-standing, under-treated depression (Diagnostic and Statistical Manual of Mental Disorders: major depression) after migration from her native village to a neighbouring town.
  • The second patient represented a middle-aged man who, after losing his job, developed a panic disorder (Diagnostic and Statistical Manual of Mental Disorders) with insomnia, mild depression and feelings of shame resulting in family tensions. 

In the present studies, we use the interview which the 40 residents held with both patients simulating the mental disorders. A pool of six trained observers rated the interviews with the MAAS-MI MH (live scores). The interviews were videotaped simultaneously. 

From this collection of 80 videotaped interviews, we randomly selected 20 interviews, 10 of each case. Each of these 20 interviews was scored by the whole pool of 6 trained observers using the MAAS-MI MH. The database was composed in this way for two reasons: to secure a good impression of the observers' influence, we raised their number to six; and to alleviate the burden of scoring, we decreased the number of residents to 20. 

The scores of these 120 (6×20) rated interviews were submitted to a three-way analysis of variance where physicians, observers and items were the sources of variation. 

In this way, we may generalize from our 20 subjects, 6 observers and 104 items to the universe of subjects, observers and items. We fixed the variance component of items, however, because the items of each scale have been carefully selected in an attempt to encompass the entire domain of the underlying theoretical dimension. Such a universe of interviewing skills, pertaining to one dimension, is not endless; we therefore assume that most interviewing skills are covered by the MAAS-MI items. 

The analyses of variance were carried out for each MAAS-MI scale by means of the General Mixed Model Analysis of Variance with Equal Cell Sizes (BMDP-program P8V; Dixon and Brown, 1979). The size of the variance components was estimated for the facets physicians, observers and items, and their interactions. These estimates were used to calculate the percentage of the total variance induced by each component. 
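The last step, expressing each estimated component as a percentage of the total variance, is straightforward; the component names and values below are invented for illustration and are not those of Table 4:

```python
def percent_of_total(components):
    """Express each estimated variance component as a percentage of
    the total variance, as reported per scale in Table 4."""
    total = sum(components.values())
    return {name: 100.0 * value / value.__class__(total) for name, value in components.items()}

# Hypothetical variance-component estimates for one scale:
estimates = {"p": 0.6, "o": 0.3, "i": 0.9, "p x o": 0.15,
             "p x i": 0.45, "o x i": 0.3, "p x o x i + error": 1.2}
for name, share in percent_of_total(estimates).items():
    print(f"{name}: {share:.1f}%")
```

The percentages always sum to 100, which is what makes the components of different scales directly comparable.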

Coefficients of generalizability were then calculated for scoring situations with respectively one, two and six observers. These coefficients are the ratio between the expected true variance (or universe variance) and the observed variance (see also Instrumental Utility). 

To study the observers' influence on the MAAS-MI scores more closely, we considered the effects on the generalizability coefficients of generalizing from research situations with one, two or six observers. These effects can be determined in the formula of the generalizability coefficient by dividing the observer-related variance components by the number of observers used (Thorndike, 1981). This procedure was carried out for each of the 8 scales. 
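The effect of the number of observers on the generalizability coefficient can be sketched as follows: the coefficient is the ratio of universe (true) variance to observed variance, with the observer-linked error variance averaged over the observers used. The variance components in the example are hypothetical, not those of Table 4:

```python
def g_coefficient(true_var, observer_error_var, n_observers):
    """Generalizability coefficient for a score averaged over n_observers.

    true_var: universe-score variance (the physician components).
    observer_error_var: observer-linked error variance for one observer.
    """
    return true_var / (true_var + observer_error_var / n_observers)

# Hypothetical components: averaging over more observers shrinks the
# observer-induced error, raising the coefficient from low to good.
for n in (1, 2, 6):
    print(n, round(g_coefficient(0.30, 0.70, n), 2))
```

With these invented components the coefficient climbs from 0.30 with one observer to about 0.72 with six, mirroring the pattern reported in Table 5.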

Results

The analyses of variance over the eight Rasch homogeneous scales can be obtained from the authors. The estimates of the variance components pertaining to the sources of physicians, observers, items and their interaction effects are given in Table 4 for each scale of the MAAS-MI MH. The coefficients of generalizability of the Rasch homogeneous scales in research situations, using one, two and six observers are given in Table 5.

Untangling Variance Components

The generalizability coefficients for physicians, with the source of variation observers random and items fixed, are in general low for 1 observer, moderate for 2 observers, and reasonable to good for 6 observers. Mitchell (1979) considers the range of 0.50 to 0.80 as moderate to good when three main sources of variance and their interaction components are taken into account.

Generalizability of MAAS-MI Mental Health scales is generally moderate for 2 observers 

The coefficients are reasonable for the scales Exploration of the Reason for Encounter, History-taking, Psychiatric Examination and Socio-emotional Exploration. The scales Presenting Solutions, Structuring the Interview and Interpersonal Skills have moderate coefficients, whereas the coefficient for Communicative Skills is low. 

When we scrutinize the variance components of the various sources (Table 4), it becomes clear why some generalizability coefficients are low or moderate. 

True Variance in Physicians’ Interviewing Skills

The first source of variation, physician (p), together with the fifth source of variation, the interaction between physicians and items (p x i), represents the true variance of the physician’s interviewing ability in the eight theoretical dimensions represented by the MAAS-MI MH scales (see also Thorndike, 1982). These variance components are generally low, in particular in the scales Presenting Solutions, Structuring the interview, Interpersonal and Communicative Skills. Since the quality of measurement of the traits has already been secured by the Rasch analyses, the magnitude of this first source of variation is not that important in itself. 

Observer Influences on Variance

The second source of variation, observers (o), concerns the variance caused by systematic differences in scoring by observers. Strict or lenient observers, for example, cause systematic variations in the assessment of the physicians' interviewing skills. According to Saal et al. (1980), a significant proportion of this observer variance should be interpreted as the traditional leniency effect, defined as the systematic tendency of observers to assign a higher or lower rating than is warranted by the subject's interviewing skills. The scales Presenting Solutions, Structuring the Interview and Communicative Skills seem to be the most vulnerable to the leniency effect, whereas the scales History-taking and Psychiatric Examination are less vulnerable. None of these effects, however, is very strong. 

The background of this leniency effect may be threshold problems in the scoring of more complex interviewing behavior. For instance, observers may know and agree upon the criteria for scoring an item, such as Discussion of the pros and cons of the proposed help. In observing an interview, observers may disagree on whether such a discussion took place completely or only partly. Observers holding the latter opinion will decide that the threshold for positive scoring has not been reached. 

Variance Induced by Our Choice of Items

The third source of variation, items (i), reflects the variation in the items used during the scoring of the MAAS-MH. The variation in which items are scored positively depends on two factors.

  • First, the physician’s interviewing style responds more or less flexibly to the patient’s mode of self-presentation. This flexibility requires a variation in the skills needed for the process of interviewing. Examples of item variation due to differences in style are found in the relatively high scores of Presenting Solutions and – in particular – Interpersonal Skills.
  • Second, the variation in mental health problems requires different content elements in the interview. Examples of the influence of this factor are found in the relatively high components in the scales Exploration of the Reasons for Encounter, History-taking and Socio-emotional Exploration.

This frequently considerable source of variation reflects in general the difficulties of translating medical interviewing skills into items. 

Halo-effects

The fourth source of variation is the interaction between physicians and observers (p x o), known in the literature as the halo effect. It is defined as the observer’s failure to discriminate among conceptually distinct and potentially independent aspects of a subject’s behavior (Saal et al., 1980). In general, halo effects are not very high in the MAAS-MH (variance components of about 5%), except for the scales Structuring the Interview and, to a lesser extent, Communicative Skills.

We expected halo-effects to be high in the scales Presenting Solutions and Interpersonal Skills as well, because these scales measure larger units of difficult-to-define interview behavior. Moreover, they require observers to indicate their personal opinion of the physicians’ skills. It is commonly acknowledged that halo-effects are high under such conditions. It is not clear whether the high halo-effect in the scale Structuring the interview should be explained by the complex interviewing behavior causing unreliable scoring or by the low validity of the underlying concept. 

The latter argument is supported by the fact that, in primary mental health care, the distinction between the phases Exploration of the Reasons for Encounter and History-taking is not clear because, in both phases, much patient-centered information is collected. The structuring of the interview in phases is thus difficult to measure. 

Physicians respond to circumstances

The fifth source of variation, the interaction between physicians and items (p x i), is considered in the literature as true variance and is therefore added to the physician’s facet (p) (Thorndike, 1982). This source of variance is the highest in the scales measuring skills characteristic for the three phases of the medical interview. Content elements play a major role here. Depending on the specific content aspects of a case, the physician will ask certain questions which might well not be posed in other cases. Parallel to the discussion of items as source of variation, one should conclude that different cases yield different patterns of scored items. In the scales which measure process aspects of the interview that are less dependent on the case, the physician and items interaction component is less prominent. These findings suggest a degree of case-specificity to which we address ourselves further in the next paragraphs. 

Differences in item interpretation

The sixth source of variance, the interaction between observers and items (o x i), refers to differences between observers in interpreting the meaning of items and the criteria for scoring. This variance component is greatest for the scale Communicative Skills, and it is clear that it may be responsible for the low generalizability coefficient of this scale. Apparently, items like confrontation, concretization and conveying information in small units are liable to differences in interpretation because the criteria for scoring may be confusing. The combination of qualitative criteria (how adequately accomplished) and quantitative criteria (how many times accomplished) might be too intricate for appropriate scoring. Improvement can be made by splitting items and by introducing more unequivocal criteria. 

There is more than we can measure

The seventh source of variation is the interaction between physicians, observers and items, which also includes an error component (p x o x i + error). This considerable variance component suggests that influences other than the above-mentioned sources, such as error, also impinge on the variation of the scores. The observers’ ratings may be influenced by fluctuating attention, mood, fatigue and pressure of time. All scales of the MAAS-MH are considerably affected by this variance component (ranging from 31.4% to 47.0% of the total variance).

Although the error is not distinguishable from the p x o x i component in this source of variation, we have to conclude that much of the physician’s interviewing style is not covered by MAAS-MI MH items. This is the price paid for operationalizing the physician’s interviewing behavior into clearly defined, teachable skills. In particular, the exclusion of much non-verbal behavior from the MAAS may be a reason for this considerable, unexplained component of variance. 

The observers’ influence on the MAAS scores becomes clear from the comparison of the generalizability coefficients based on measurement by one, two and six observers. When the number of observers rises from one to six, the coefficients increase from low to good. This finding suggests a marked observer influence on the scores. 

For summative evaluation, adding a second observer mitigates observer induced variance and improves reliability

Despite this conclusion, in the forthcoming validity studies with the MAAS-MH we use scores summed over two observers to control the observers’ effect.

Inter-case Reliability

The question treated in this section concerns the sensitivity of the MAAS-MH to differences in cases. Instruments measuring medical competence, including medical interviewing skills, are always susceptible to case influence (Swanson, et al., 1981). What does case influence mean? 

The physician’s interviewing style is influenced by the case, in the sense of the nature of the medical problem (its complexity, severity, aetiology etc.), but also by the way the patient presents it. Self-presentation by the patient is determined by personality traits (intro-extroversion, etc.), affective states (anxiety, mood), attitudes, social norms, locus of control etc. 

In Validity, reliability, scalability, we attempted to disentangle different aspects of this case influence. The results suggest that the influence caused by the patient’s self-presentation explains most of the variance, especially in the scales Exploration of the Reasons for Encounter and Presenting Solutions. The case influence in History-taking is caused by the case as a medical problem. 

However, our design did not allow us to replicate this study because simulated patients and cases are not completely crossed (simulated patients are irregularly nested within cases). We therefore relied once more upon generalizability analysis to assess the combined, not disentangled, case influences in the variance of the MAAS-MI MH scores. 

Method

Subjects and analyses are described below.


The subjects for this study were once more the 40 residents in general practice, each interviewing two patients simulating respectively a major depression and a panic disorder. The interviews were scored live on the MAAS-MI MH (in its “Rasch version”) by trained observers. This database has been described before. From this database, we took the 80 “live” scored interviews (40 residents x 2 cases). We again applied a three-way analysis of variance to the item scores of these interviews. Our purpose was to generalize to physicians with the source of variation “cases” random and “items” fixed. To avoid a more difficult-to-interpret four-way analysis of variance, we excluded the observers’ variation from our generalizability analysis; this observer effect was thus absorbed into the error term of the variance components. A coefficient of generalizability was then calculated for physicians, with “cases” random and “items” fixed, for each scale of the MAAS-MI MH. 

To gain insight into the impact of case influences on the MAAS scores, we estimated the generalizability coefficients when the number of cases was raised to 20 and to 40. These estimates were obtained by substituting these numbers of cases into the generalizability equation previously calculated for two cases. 
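A minimal sketch of this extrapolation, assuming a simple physicians-by-cases design in which the physician-by-case interaction and residual error are pooled into one term; the variance components below are hypothetical placeholders, not the chapter’s actual estimates:

```python
def g_coefficient(var_phys, var_case_error, n_cases):
    """Generalizability coefficient for physicians with 'cases' random.

    var_phys       -- universe-score variance (true differences between physicians)
    var_case_error -- pooled physician-by-case interaction plus residual error
    n_cases        -- number of cases each physician is observed on
    """
    return var_phys / (var_phys + var_case_error / n_cases)

# Hypothetical variance components, chosen only for illustration.
for n in (2, 20, 40):
    print(n, round(g_coefficient(0.20, 0.80, n), 2))
```

Raising the number of cases divides the case-linked error variance by that number, which is why coefficients that are low at two cases can become reasonable at twenty.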

Results 

The case influence on the MAAS-MI MH scores was studied in a generalizability analysis. Its results, the generalizability coefficients for the 8 scales, are presented in Table 6.

In general, these generalizability coefficients are low, with the exception of Socio-emotional Exploration. These results indicate a lack of inter-case reliability of the MAAS-MI MH.

Patients as Person and as Problem Contribute to Variance

As we noticed low inter-case reliability in our measures of residents’ interviewing skills, we raised the number of cases to twenty, which made the coefficients moderate to very reasonable, except for the scales History-taking and Interpersonal Skills. These scales appear to be very susceptible to case influences: in the first scale the influence is due to differences in mental health problems, whereas in the second the impact of the patient’s self-presentation is noticeable.

The patient and their problems induce variance in physicians’ interviewing skills – thereby reducing reliability – just as patient-centered and physician-centered interviewing both contribute to a consultation

Low inter-case reliability due to susceptibility to differences in mental health problems can only be avoided if items pertaining to content elements are formulated more abstractly, at the level of grouped topics (see MAAS Medical Interview Construction). For example, items could be constructed that assess the exploration of the patient’s social functioning in general instead of its distinctive aspects. Were this not possible, all content elements would have to be removed from the MAAS-MI MH, which is, of course, absurd. 

Moreover, the low inter-case reliability due to self-presentation by patients can be improved by using better-trained simulated patients whose self-presentation has been standardized. This is, in fact, only possible in test situations in medical education.

Conclusions from the generalizability analyses of the MAAS-MH scales

In generalizability analyses, the influence on the physician’s interviewing skills is determined:

  • Of observers;
  • Of the method of measurement (items);
  • Of cases.

To gain an impression of the observers’ influence, generalizability coefficients for physicians are calculated with the source of variation observers kept random and the source of variation items kept fixed.

We conclude that the coefficients are generally low when one observer is used, moderate with two observers, and satisfactory with six. A moderate to reasonable inter-rater reliability may be concluded from these findings.
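The same averaging logic can be sketched for the observer facet: averaging ratings over more observers divides the observer-linked error variance. The components below are made-up values chosen only to reproduce the low/moderate/satisfactory pattern, not the estimates reported in Table 5:

```python
def g_over_observers(var_phys, var_obs_error, n_observers):
    """G coefficient when scores are averaged over n_observers observers.

    var_phys      -- universe-score variance (true skill differences)
    var_obs_error -- pooled observer-linked interaction plus residual error
    """
    return var_phys / (var_phys + var_obs_error / n_observers)

# Made-up components for illustration; Table 5 holds the real estimates.
for n in (1, 2, 6):
    print(n, round(g_over_observers(0.30, 0.70, n), 2))
```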

For MAAS-MI Mental Health, inter-rater reliability is moderate to reasonable 

Regarding the performance of the scales, we notice that Exploration of the Reasons for Encounter, History-taking, Psychiatric Examination and Socio-emotional Exploration are satisfactory. However, Presenting Solutions and Interpersonal Skills have moderate coefficients, and Communicative Skills even has low ones. 

Causes of unreliability are studied by inspecting the main and interaction components from the analysis of variance. 

The variance of the most important component, the physician’s interviewing skills, is substantial, yet rather low compared to the other variance components. 

Leniency effects, i.e. systematically high or low rating by the observers, are not high in general, but notable in the scales Presenting Solutions, Structuring the interview and Communicative Skills. It is apparent that the threshold for considering an interviewing skill present or absent is an important source of these differences between observers. 

Halo effects in scoring originate from characteristics of subjects (appearance, attitude, etc.) which are not related to their interviewing skills, but which lead observers to systematically high or low ratings. These effects are seen in the scales Structuring the interview and Communicative Skills, indicating poor operationalization and/or low validity of the underlying concepts. 

The high item component is characteristic of the measurement of skills. In the measurement of interviewing skills, this phenomenon is due to the method’s susceptibility to different mental health problems and, in particular, to differences in self-presentation by patients. Finally, it is notable that a considerable variance component remains unexplained (the phy x obs x item + error component). Although this component is generally high in measurement because of the high number of degrees of freedom, it may also be due to our exclusion of non-verbal interviewing behavior, which is not easily teachable. 

Case influence turns out to be a high variance component in MAAS-scores. It is marked in particular in the scales History-taking and Interpersonal Skills. In the analysis, where we generalized to the universes of physicians and cases, it could not be clearly discerned whether this inter-case unreliability should be attributed to real differences in medical problems or to the influence of the way patients present their problem to the physician. Evidence from the MAAS-MI General points to the latter of the two causes. Reducing these case influence effects requires about twenty cases. 

MAAS-MI Mental Health Useful in Training Young Professionals

Is the MAAS-MI MH usable in the light of these reliability considerations? 

In general, causes of unreliability, like halo effects and non-systematic measurement error, should be addressed by simplifying items on the basis of more clearly defined and delineated interviewing behavior, and by better-described scoring criteria. However, these interventions imply a further selection of items, perhaps entailing a simpler and more reliable but less valid item domain. This price may be too high for a measurement method used in research into medical interviewing skills. 

Leniency effects, a typical observers’ characteristic, may be diminished by better training and – sometimes – by selection of observers. 

In test situations using one observer, there will be a considerable observer effect. This can be compensated for by observer training, or by the use of two or more observers. 

Case influence may be reduced by standardizing simulated patients in their presentation of cases. 

In research situations, where interviews are videotaped, the MAAS-MI MH is very useful. In these situations, at least two observers and more cases can be used to reduce the observer and case influences respectively.

MAAS-MI Mental Health is well-suited to revealing the stability of young professionals’ competence in the medical interview in mental health, such as in supervision

Moreover, observing residents across a number of clients over a period of time would overcome the issue of inter-case reliability and reveal the stability of a resident’s competence in the medical interview in mental health. This condition is met in the supervision of young professionals.

With a portfolio of around 20 cases, supervisors obtain insight into young professionals’ interviewing skills in mental health

Table 1 -- Rasch analysis of the eight MAAS-MI MH scales
Table 2 -- Generalizability coefficients of MAAS-MI Mental Health items for one, two and six observers; variance components of item scores attributed to observers; and item difficulties (p-values of items) computed during Rasch analyses
Table 2 Cont'd -- History-taking
Table 2 Cont'd -- Psychiatric Examination
Table 2 Cont'd -- Socio-emotional Exploration
Table 2 Cont'd -- Presenting Solutions
Table 2 Cont'd -- Structuring
Table 2 Cont'd -- Interpersonal Skills
Table 2 Cont'd -- Communicative Skills
Table 3 -- Results of the Martin-Löf chi-square tests for uni-dimensionality of the MAAS-MI Mental Health scales
Table 4 -- Variance components in the sum scores of MAAS-MI Mental Health scales after three-way analyses of variance (in percentages of the total variance)
Table 5 -- Generalizability coefficients of the MAAS-MI Mental Health scales based on 1, 2 and 6 observers
Table 6 -- Generalizability coefficients of the MAAS-MI Mental Health scales for cases and physicians’ interviewing skills (physicians and cases random; items fixed) and estimated coefficients for 20 and 40 cases

References

Birnbaum A. Some latent trait models and their use in inferring an examinee’s ability. In: Lord FM, Novick MR (Eds.). Statistical theories of mental test scores. Addison-Wesley, Reading, 1974 (2nd ed.). 

Dixon WJ, Brown MB. BMDP-79: biomedical computer programs, P-series. Univ. Calif. Press, London, 1979. 

Diagnostic and Statistical Manual of Mental Disorders, third edition (DSM-III). American Psychiatric Association, 1980. 

Guilford JP, Fruchter B. Fundamental statistics in psychology and education. McGraw-Hill, Auckland, 1982 (6th ed.). 

Gustafsson JE. The Rasch model for dichotomous items: theory, applications and a computer program. Reports from the Institute of Education, Univ. of Göteborg, nr. 85. 

Hambleton RK, Cook LL. Latent trait models and their use in the analysis of educational test data. Journal of Educational Measurement, 1977; 14: 75-96. 

Martin-Löf P. Statistical models. Notes from seminars 1969-1970 by Rolf Sundberg. Institute for försäkringsmatematik och matematik vid Stockholms Universitet, Stockholm, 1973. 

Mitchell SK. Inter-observer agreement, reliability and generalizability of data collected in observational studies. Psychological Bulletin, 1979; 86: 376-390. 

Molenaar IW. Programma-beschrijving van PML (versie 3.1) voor het Rasch model. Heymans Bulletins, Vakgroep Statistiek en Meettheorie, Universiteit van Groningen, Groningen, 1981. 

Saal FE, Downey RG, Lahey MA. Rating the ratings: assessing the psychometric quality of rating data. Psychological Bulletin, 1980; 88: 413-428. 

Swanson DB, Mayewski RJ, Norsen L, Baran G, Mushlin AI. A psychometric study of measures of medical interviewing skills. Proceedings of the 20th Annual Conference on Research in Medical Education, 1981: 3-8. 

Thorndike RL. Applied psychometrics. Houghton Mifflin Co., Boston, 1982.