3.1 Scalable, Reliable, Generalizable

We confirmed the usefulness of the MAAS Medical Interview by carefully assessing its scalability, reliability and validity.

Analysis shows that MAAS has good measurement properties and can be applied in both medical education and clinical practice. MAAS complies with the demanding criteria of the Rasch model. Observer influence can be alleviated by adding one or two extra observers to the process.

MAAS unravels interviewing-skills development, meaning that trainers and assessors can better incorporate it into their programmes.

Crijnen, A. A. M., & Kraan, H. F. (1987). Scalability and reliability of the Maastricht History-taking and Advice Checklist – General. In H. F. Kraan & A. A. M. Crijnen (Eds.), The Maastricht History-taking and Advice Checklist – studies of instrumental utility (pp. 173–202). Lundbeck, Amsterdam.

Instrumental Utility

In this chapter, we elaborate on the issues of scalability and reliability with regard to the MAAS-MI General. We decided to study scalability of our measurements first and then reliability, because the groups of items constituting the final scales formed the starting point for reliability studies and further validity studies.

Research setting

Scalability and reliability were studied in the same research setting described more extensively in Convergent & Divergent Validity.


During simulated consultation hours, 40 residents in General Practice interviewed several simulated patients (Crijnen et al, 1986). Residents were asked to behave as if they had taken charge of a colleague’s practice and to perform a complete medical consultation. 

Simulated patients presented two difficult but frequently occurring somatic problems (myocardial infarction and inception of diabetes mellitus). Reliability and scalability studies are based on MAAS-G-recordings of these (videotaped) interviews. 

The first extension was formed by independent observations of 20 randomly chosen medical interviews (10 diabetes mellitus cases and 10 myocardial infarction cases) by six observers, which provided the data for assessing the reliability of the MAAS.

The second extension increased the number of original interviews between residents and patients simulating the myocardial infarction case from 40 to a total of 100 interviews, which enabled us to examine the scalability of the MAAS-G by means of Rasch analyses. Since Rasch analyses are independent of the sample of subjects studied but dependent on the population, we decided to compile a heterogeneous group of physicians and medical students.

The total of 100 observations was achieved by combining:

  • 40 videotaped consultations of residents in General Practice who interviewed the patient as part of the simulated consultation hours;
  • 29 residents in General Practice who interviewed the patient as part of the residency educational program;
  • 7 general practitioners from the staff of the residency program who took part in the simulated consultation hours;
  • 24 third-year medical students who interviewed the simulated patient during the examination of their medical skills at Maastricht Medical School.

All interviews were observed by experienced observers: four of them had actively participated in the process of MAAS-G-construction.

Scalability of the MAAS-MI General

The central issue during the process of scale construction is to bring together a set of items, all of which measure to a satisfactory degree the trait of interest and which collectively reflect different levels of possession of this trait. The theoretical considerations of their content are described in MAAS-MI & Related Skills, and the construction of the six MAAS-G scales in Constructing MAAS-MI. The study of the scalability of the six MAAS-scales is elaborated below.

Scale construction brings together items that all measure the underlying skill but reflect different levels of expertise in performing it

Scale construction is guided scientifically by scaling models that formalize the relation between a subject’s responses on items and indices representing the subject’s ability on the intended dimension. In the literature, two families of scaling models are known: those based on latent trait models and those based on classical test theory.

Rasch-model

In this thesis, we rely heavily on the Rasch-model, a one-parameter logistic model, which is the most demanding but also the most attractive latent trait model. The advantages of the Rasch-model are that item parameters are invariant across samples of subjects chosen from the population of interest, and that a subject’s ability can be estimated on the same ability scale from any subset of items that have been fitted in the model. The Rasch-model assumes that all items have equal discriminating power and that the items vary only in terms of their difficulty. Moreover, it assumes that each item in the scale measures only one latent trait, and that a subject’s performance on one item will not affect performance on other items.
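The model described above can be stated compactly: with ability θ and item difficulty b, the probability of a positive score depends only on their difference. A minimal sketch of this one-parameter logistic function (symbols and values are illustrative, not taken from the MAAS data):

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """One-parameter logistic (Rasch) model: the probability that a
    subject with ability theta scores an item of difficulty b
    positively depends only on the difference theta - b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Equal discriminating power: items differ only in difficulty b, so
# the curves for different items are shifted copies of one another.
p_easy = rasch_probability(0.0, -1.0)   # easier item (b below ability)
p_hard = rasch_probability(0.0, 1.0)    # harder item (b above ability)
```

A subject whose ability equals an item's difficulty has exactly a 50% chance of a positive score, which is how the item difficulties in Table 2 are to be read.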

Method & Results

Method

To provide data for Rasch analyses, 100 interviews between physicians/medical students and simulated patients who presented complaints accompanying a myocardial infarction were observed and scored on the MAAS-MI G-scales by experienced observers.


Since Rasch analyses require dichotomized data, variables in the scales Structuring, Interpersonal Skills and Communicative Skills were dichotomized according to predetermined criteria. 
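The dichotomization step can be illustrated with a toy rule; the actual predetermined criteria for the Structuring, Interpersonal Skills and Communicative Skills scales are not reproduced here, so the cut-off below is purely hypothetical:

```python
def dichotomize(scores, cutoff):
    """Map polytomous item scores to the 0/1 data that Rasch analyses
    require, using a predetermined cut-off (hypothetical rule; the
    actual criteria per scale are described in the thesis)."""
    return [1 if s >= cutoff else 0 for s in scores]

ratings = [1, 3, 4, 2, 5]          # e.g. 5-point observer ratings
binary = dichotomize(ratings, 3)   # -> [0, 1, 1, 0, 1]
```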

Rasch analyses were carried out using the PML program (Gustafsson, 1977).

A sequence of analyses was carried out:

  • First of all, items that did not fit in the Rasch-model were selected by means of the Binomial test and Allerup’s Graphical test (Molenaar, 1981).
  • Secondly, the scalability of item groups was analyzed by means of the Martin Löf chi-square test, which assesses the fit between the observed proportion of positive answers for a specific item at each ability level and the probability of a positive answer estimated under the assumptions of the Rasch-model. Construction of Rasch homogeneous scales was secured by eliminating items which diminished the fit between the probabilities estimated by the Rasch-model and the empirical pattern of responses to the group of items constituting each scale.
  • Thirdly, the uni-dimensionality of each MAAS-scale was tested by determining whether pairs of Rasch homogeneous scales, which were considered to measure distinct traits, could be positioned on one scale. This was done by means of a second type of Martin Löf chi-square test, which establishes uni-dimensionality by estimating the differences in person parameters for pairs of scales.
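The logic behind the item-fit checks in the first two steps can be sketched as follows: per ability level, the observed proportion of positive scores on an item is set against the probability the Rasch model predicts. This is only an illustration of the comparison; the actual estimation and testing were done with the PML program, and the grouping rule here is a stand-in.

```python
import math
from collections import defaultdict

def rasch_p(theta, b):
    """Rasch probability of a positive answer."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit_table(item_responses, abilities, b_item):
    """For each (rounded) ability level, pair the observed proportion
    of positive answers with the Rasch-predicted probability; large
    gaps between the two flag items that do not fit the model."""
    groups = defaultdict(list)
    for x, theta in zip(item_responses, abilities):
        groups[round(theta)].append(x)
    return {level: (sum(xs) / len(xs), rasch_p(level, b_item))
            for level, xs in sorted(groups.items())}
```

A graphical test such as Allerup's plots exactly these observed-versus-expected pairs per ability level.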

Results

Fit of Item Group  Statistics from the Martin Löf chi-square tests for the six MAAS-MI G scales, such as chi-square, degrees of freedom, probability and number of items, are presented in Table 1. A high level of probability indicates a good fit of the item group to the assumptions of the Rasch model.

Table 1 -- Martin Löf Chi-square Tests to Test the Fit of the Item-group to the Assumptions of the Rasch Model

Item-difficulties  The item-difficulties of items that appeared to fit in the Rasch homogeneous scales are presented in the right-hand column of Table 2 as a summarizing statistic. Item-difficulties reflect the point on the ability-scale where subjects have a 50% chance of scoring an item positively.
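This reading of the item difficulty follows directly from the one-parameter logistic model: setting the probability of a positive score to one half and solving for ability recovers the item difficulty (standard notation, with θ for ability and b for difficulty):

```latex
P(X = 1 \mid \theta, b) \;=\; \frac{e^{\theta - b}}{1 + e^{\theta - b}} \;=\; \frac{1}{2}
\quad\Longleftrightarrow\quad e^{\theta - b} = 1
\quad\Longleftrightarrow\quad \theta = b .
```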

Table 2.1 -- Content Scales MAAS-MI General -- Reasons for Encounter: Generalizability Coefficients of MAAS-MI General Items for One, Two and Six Observers, and Item Difficulties Computed during Rasch Analyses
Table 2.1 Cont'd -- Content Scales MAAS-MI General -- History-taking
Table 2.1 Cont'd -- Content Scales MAAS-MI General -- Presenting Solutions
Table 2.2 -- Process Scales MAAS-MI General -- Structuring: Generalizability Coefficients of MAAS-MI General Items for One, Two and Six Observers, and Item Difficulties Computed during Rasch Analyses
Table 2.2 Cont'd -- Process Scales MAAS-MI General -- Interpersonal Skills
Table 2.2 Cont'd -- Process Scales MAAS-MI General -- Communication Skills

Uni-dimensionality of Scales  Statistics from the Martin Löf chi-square test for uni-dimensionality, such as chi-square, degrees of freedom, probability and the Pearson product-moment correlation between pairs of scales, are presented in Table 3. A low level of probability on the Martin Löf chi-square test for uni-dimensionality indicates that scales are measuring distinct dimensions.

Table 3 -- Results of Martin Löf Test for Uni-dimensionality of the Scales in MAAS-MI General

Discussion

In the following paragraphs, we discuss the scalability of the six MAAS-MI G scales determined by means of Rasch analyses. 

Exploring Reasons for Encounter

The scale Exploring Reasons for Encounter fits in the Rasch-model after elimination of one item. The items occupy a broad range of item difficulties. The scale thus has adequate measurement properties meaning that it is able to distinguish levels of ability in the skill to explore reasons for encounter. The eliminated item is, unfortunately, very important, as it addresses the issue of a patient’s interpersonal relations and social support system. The fact that it is eliminated may suggest that it just does not fit in the Rasch-model or that it measures a different trait. Since we made no theoretical distinction between this item and the remaining items, we do not think that it is measuring a different trait. The test for uni-dimensionality reveals furthermore that the skill to explore reasons for encounter is not different from any of the other medical interviewing skills; this suggests that they are all measuring interviewing skills. 

History-taking

The scale History-taking fits in the Rasch-model and the items cover a broad range of item difficulties. The scale thus displays adequate measurement properties. To increase the fit of the scale in the Rasch-model, the number of items was reduced considerably, from 23 to 11. Since both the included and excluded items refer to the physician’s questioning behavior about medical topics, and since we made no theoretical distinctions between the items (see also MAAS-MI & Related Skills and Constructing MAAS-MI), it is unlikely that the excluded items measure a different trait. The test for uni-dimensionality discloses, furthermore, that history-taking is not different from any of the other types of interviewing skills, although it is differentiated slightly from communication skills.

Presenting Solutions

The scale Presenting Solutions fits in the model after exclusion of one item. The items display a wide range of item difficulties. The scale thus has attractive measurement properties. The item referring to the negotiation between physician and patient about problem-definition and treatment was eliminated from the original scale during the process of scale construction. Since the remaining items pertain to the exchange of information, we presume that the item on negotiation measures a different trait. Furthermore, the ability to present solutions is essentially not different from other interviewing skills. 

Structuring

The scale Structuring the interview corresponds well to the Rasch model, although two items had to be eliminated. Since the eliminated items are theoretically not different from the remaining items in the scale, we do not expect them to measure a different trait. The items in this scale cover only a limited range on the continuum of item difficulties, which means that the confidence intervals of the estimates of item parameters will show some overlap. The measurement properties of this scale are thus considered less optimal than those of the other scales, and violations of the Rasch-model are to be expected. The skill to structure the medical interview is identical to other medical interviewing skills according to the test for uni-dimensionality.

Interpersonal Skills

The scale Interpersonal Skills fits well in the Rasch-model and the items appear to display a broad range of item difficulties which enhances their measurement properties. During the process of scale construction, two items were excluded pertaining to the physician’s ability to handle negative emotions directed at himself and to the quality of eye-contact. Since no theoretical distinction was made between these items and the other components of interpersonal skills, we do not expect them to measure a distinct trait. Moreover, observers often experienced difficulties in scoring the quality of eye-contact because we were not able to define strict criteria. The Interpersonal Skills scale, which measures the physician’s ability to establish a warm human relationship with the patient, is not different from any of the other scales measuring medical interviewing skills. 

Communication Skills

The scale Communication Skills fits only marginally in the Rasch model, and it is interesting to note that the item referring to the quality of closed-ended questions, which is generally regarded as essential, is eliminated from this scale. Although the items display a wide range of item difficulties, the scale therefore does not have optimal measurement properties. The fact that Communication Skills are not really different from Exploring Reasons for Encounter, Structuring and Interpersonal Skills, whereas rather low probabilities are obtained for History-taking and Presenting Solutions on the test for uni-dimensionality, is attributed for the moment to the low reliability of this scale. Further research with reliable instruments is necessary to establish the dimensionality of this type of skill.

Advantages

Latent trait models have strong advantages when compared to classical test models (Hambleton and Cook, 1977). Since the MAAS-MI G scales fit in the Rasch-model, they can take advantage of the characteristics of this demanding latent trait model.

One advantage is that it is possible to estimate the quality of a physician’s medical interviewing skills on each of the MAAS-MI G scales from any subset of items that fit the model. The estimates of item difficulty are free from the influence of the ability distribution in the calibrating sample of subjects. This is an important advantage for the construction of tests used in examination or evaluation situations: small tests can be constructed which are less demanding for the observers; parallel forms of tests can be developed and compared; and scales endorsed by subjects from different classes, institutions or even different nations can be compared.
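The subset property can be made concrete with a short sketch: given any set of calibrated item difficulties and a subject's 0/1 responses to just those items, the ability can be estimated by maximum likelihood. This is an illustrative Newton-Raphson sketch, not the procedure used in the thesis, and it assumes a non-extreme score pattern (all-positive or all-negative patterns have no finite estimate):

```python
import math

def estimate_ability(responses, difficulties, iters=50):
    """Maximum-likelihood Rasch ability estimate from 0/1 responses to
    any subset of calibrated items, via Newton-Raphson on the
    log-likelihood (illustrative; requires a non-extreme raw score)."""
    theta = 0.0
    for _ in range(iters):
        ps = [1 / (1 + math.exp(-(theta - b))) for b in difficulties]
        grad = sum(x - p for x, p in zip(responses, ps))  # score residual
        info = sum(p * (1 - p) for p in ps)               # test information
        theta += grad / info
    return theta
```

Because the difficulties sit on a common scale, estimates obtained from different item subsets are directly comparable, which is what makes short or parallel test forms possible.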

A second advantage is that the estimates of item parameters are independent of the sample of students/physicians used for the statistical analyses, but dependent on the population. In our situation, data for computations were obtained on a rather heterogeneous sample of subjects, ranging from inexperienced third year medical students to residents in general practice and even to general practitioners with years of experience in daily practice who all interviewed simulated patients. Item and scale characteristics are therefore assumed to be stable for a population of medical students, residents in general practice and experienced general practitioners while they are interviewing simulated patients.

MAAS-MI General items and scales measure distinct interviewing skills robustly over a wide array of expertise, ranging from medical students to experienced practitioners

The limits of the population are less well defined. We are unable to say whether the same results hold for surgeons, psychiatrists or cardiologists, etc. Moreover, we are unable to defend the hypothesis that the fit of the scales in the Rasch-model holds true for interviews between physicians and real patients in daily practice, although the general practitioners and residents considered the simulated consultation hours to be an appropriate simulation of real life (Crijnen et al, 1986).

A third advantage of the Rasch-model is that it treats each of these six types of medical interview behavior as a hierarchy of skills: physicians who are able to perform the more difficult skills well are also likely to perform the less difficult skills well, whereas physicians who experience difficulties with the less difficult skills are not able to perform the more difficult skills well. It would be productive for educationalists to take a look at the item difficulties and apply them to their educational programs.

Through MAAS-MI General, the hierarchical texture of interviewing skill development is unravelled – developers and assessors can better incorporate these insights into their programs

To summarize, the MAAS-MI General scales, with the exception of Communication Skills, fit well in the Rasch-model and are considered to have adequate measurement properties. Furthermore, only one dimension is distinguished in our measurements of medical interview behavior: all scales measure medical interviewing skills.

Items & Inter-observer Reliability

In situations where the process of measurement is heavily dependent upon human observers, major problems with regard to the standardization of measurement are likely to occur. The quality of measurement in terms of inter-observer reliability has therefore to be studied in greater depth.


In the following paragraphs, we report on the inter-observer reliability of the individual MAAS-G-items determined by means of generalizability studies (Cohen, 1960; Cronbach et al, 1972; Shrout et al, 1979; Guilford et al, 1981). Observers are usually regarded as potential sources of error in the measurement of observational data. Psychometricians have therefore developed a variety of indices which all purport to reflect inter-observer agreement and reliability (Tinsley et al, 1975; Mitchell, 1979).

  • Cohen’s kappa determines the proportion of agreement between observers corrected for chance (Cohen, 1960).
  • Intra-class correlations reflect the ratio of the variance of interest over the sum of the variance of interest plus error (Shrout et al, 1979).
  • Finally, Cronbach et al (1972) developed the theory of generalizability which makes use of estimates of variance components to determine several reliability coefficients for a variety of situations about which a researcher wishes to generalize.
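As a concrete illustration of the first of these indices, Cohen's kappa for two observers' dichotomous item scores can be computed directly from their agreement and marginal proportions (a generic sketch with made-up scores, not the MAAS data):

```python
def cohens_kappa(rater_a, rater_b):
    """Proportion of agreement between two observers corrected for
    chance agreement, for dichotomous scores (illustrative sketch)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # chance agreement from each rater's marginal proportion of 1-scores
    pa1 = sum(rater_a) / n
    pb1 = sum(rater_b) / n
    chance = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (observed - chance) / (1 - chance)
```

Intra-class correlations and generalizability coefficients extend the same idea by partitioning variance components instead of counting agreements.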

We applied all three procedures to our data and compared the results. The generalizability coefficients for one observer were similar to Cohen’s kappa and intra-class correlations for one observer, whereas the generalizability coefficients for six observers resembled strongly the average intra-class correlations. Furthermore, generalizability coefficients provided information about the reliability of items when pairs of observers were used. We therefore decided to report only generalizability coefficients and to leave out Cohen’s kappa and intra-class correlations.

Method

Generalizability coefficients can be computed in several ways. As it is our intention to provide information about the reliability of MAAS-G items for random samples of observers, estimates of variance components were computed by means of the General Mixed Model Analysis of Variance with Equal Cell Sizes (BMDP program P8V – Dixon et al., 1979), in which the physician and observer facets were considered to be random, and each item was considered to be fixed. Data for computations were provided by twenty videotaped medical interviews, all independently observed by six well-trained observers who scored all the MAAS-MI G items.

Results

Three types of generalizability coefficients were calculated using the formula described by Thorndike (1982):

  • The first type, presented in the left-hand column of Table 2, provides information about the typical reliability of a single observer’s scores for each item, a very common observation situation. The generalizability coefficient for one observer has to be interpreted as the correlation between scores of two randomly chosen observers after observing the same sample of videotaped interviews. For example, the correlation between scores of two observers on item 1 will be .34.
  • The second type of generalizability coefficient, presented in the second column of Table 2, provides information on the reliability of an item for pairs of observers. The final score of a subject on an item is the summed score of both observers. This generalizability coefficient is interpreted as the correlation between summed scores for pairs of observers with summed scores of a different pair of randomly chosen observers who observed the same medical interviews. Generalizability coefficients for pairs of observers were assessed because computations for the validity studies elaborated in Interviewing Skills & Clinical Competency and Convergent & Divergent Validity, and future studies about the nomological net are based on sumscores of observations by continually changing pairs of observers. Adding a second observer appears to be a powerful remedy against unreliability attributed to observer influences because the relative importance of errors of measurement made by observers is reduced.
  • The third generalizability coefficient, presented in the third column of Table 2, provides information on the reliability for a group of six observers who all observed the same interviews. Since an increase in the number of observers enhances reliability, and since it is almost impossible to base analyses on larger groups of observers due to limited resources, the generalizability coefficients for six observers for each item can be considered to display the upper limits of reliability achieved by MAAS-MI-items.
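The three coefficients above are tied together by one relation: averaging over k randomly chosen observers divides the observer-linked error variance by k. A sketch under that relation, with purely hypothetical variance components (the single-observer value is chosen to echo the .34 example for item 1; the actual components per item are in Table 2):

```python
def g_coefficient(var_person, var_error, n_observers):
    """Generalizability coefficient for the mean score of n randomly
    chosen observers in a persons x observers design: person variance
    over itself plus the observer-linked error variance divided by n."""
    return var_person / (var_person + var_error / n_observers)

# hypothetical components giving a single-observer coefficient of .34
coeffs = {k: round(g_coefficient(0.34, 0.66, k), 2) for k in (1, 2, 6)}
```

With these made-up components the coefficient rises from .34 for one observer to .51 for two and .76 for six, which illustrates why adding a second observer is such a powerful remedy against observer-induced unreliability.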

Discussion

Exploring Reasons for Encounter

Items in the scale Exploring Reasons for Encounter display moderate generalizability coefficients. Adding a second observer significantly diminishes unreliability due to observer influences. The coefficients for six observers are very high, with the exception of item 2, which measures questioning behavior about the emotional impact of the complaint. The moderate reliability of this item is probably due to the fact that this issue is often elicited by a reflection of the patient’s emotions rather than by questioning behavior, and that it is often difficult to discern which are the main and which the related complaints.

History-taking

Items in the scale History-taking show moderate to high levels of inter-observer reliability. Adding a second observer has a considerable impact on the reliability of items with moderate generalizability coefficients. In general, history-taking skills refer to single acts of interview behavior. They are therefore more easily defined, and subsequently more easily recognized by observers, than other medical interviewing skills. Items in this scale display the upper limits of reliability to be achieved by observational measurements of medical interviewing skills. Moderate reliability is reported for items referring to factors that appear to influence the complaints. This moderate reliability is attributed to observers’ confusion when a physician, instead of asking a general question about factors which trigger or increase the problem, asks questions about specific factors recognized in the literature as influencing complaints, such as physical exercise or smoking. Reliability is expected to be enhanced by sharper definitions. Moreover, low reliability is reported for the exploration of both somatic and psychological determinants, mainly a threshold problem: observers generally agree on the topics elicited, but disagree on the depth of exploration.

Presenting Solutions

Items in the scale Presenting Solutions show low to moderate levels of inter-observer reliability. Adding a second observer considerably enhances reliability. Low reliability is reported for the explanation of the diagnosis, probably because explanations presented in lay terminology are often mixed up with an explanation of the cause of the problem. Low reliabilities for items pertaining to the patient’s intention to comply and the check on the patient’s understanding are probably due to the often implicit performance of this interview behavior. Sharpening definitions by requiring explicit interview behavior is likely to enhance reliability.

Structuring

Items in the scale Structuring the interview show low to moderate levels of inter-observer reliability. Adding a second observer enhances reliability, although still only moderate levels of reliability are achieved. A feature of this scale is that it attempts to measure phases of the interview and the quality of transitions. Observers are asked to recognize the phases and to qualify the transitions, a fairly difficult task because demarcations in a medical interview are, apart from at the beginning and the end, never that clear. We expect sharpening of definitions to enhance reliability at the cost of a decrease in validity. When, for example, a topic stemming from the patient’s frame of reference is brought forward and discussed in a few sentences during history-taking, despite the fact that the physician has adequately explored the reasons for encounter, the question is raised of whether item 48 Explores the reason for encounter before history-taking has to be scored Yes or No. The burden of taking this decision rests on the observer. Strict observation of interview behavior is not possible with regard to this kind of interviewing skill, because a considerable degree of interpretation is always involved. The sizes of the generalizability coefficients adequately reflect the difficulties observers encounter during observation and scoring of medical interviewing skills which refer to the structuring of the interview. 

Interpersonal Skills

Items in the scale Interpersonal Skills show low generalizability coefficients. Adding a second observer almost doubles the size of the coefficients, although still only low to moderate levels are achieved. Generalizability coefficients for six observers, representing the upper limit of reliability, even reach only a low to moderate level. Achievement of high reliability is hampered by our inability to define interpersonal skills as single acts of easily observed interview behavior. Some items require observers to place themselves empathically in the communication and to indicate whether they feel comfortable in it. Other items are phrased in what we call non-behavioral terms, such as the item on meta-communication, which tries to assess whether, in the case of inhibited communication, physicians fail to make meta-communicative comments. Inter-observer reliability is unlikely to be enhanced by sharpening of definitions, since we have already tried to construct and define the items as behaviorally as possible. The question is how the low to moderate reliability of this scale should be treated: should we abandon the scale, or should we accept this level of reliability because the scale measures – or at least attempts to measure – an important quality of the physician’s medical interview behavior? In this thesis, items that fitted in the Rasch-model are applied during the validity studies. For future studies, we recommend reconstruction of the scale Interpersonal Skills. Many items referring to a variety of aspects of interpersonal skills should be included in the process of scale construction; the items should be described in behavioral terms in order to increase reliability, and should be well defined in an accompanying manual. Reconstruction of this scale will not be easy: researchers should be wary of impairing the validity of the scale by wording items too behaviorally.

Communicative Skills

The scale Communicative Skills displays low levels of inter-observer reliability, with the exception of the item on the quality of summaries. Increasing the number of observers has no impact on reliability. The low reliability is, in our opinion, due to the fact that the observers are required to make a summarizing judgment about frequently occurring interview behavior that is in itself already difficult to assess, such as closed-ended questions and the provision of information in small units. We do not expect sharpening of definitions and better training of observers to increase reliability. A better approach for future studies might be to apply a different method of measurement in which the nature and quality of each utterance of the physician is determined and coded separately. With regard to this scale, the question arises of whether it can be accepted as part of the MAAS-G. Because the scale attempts to measure a theoretically important quality of medical interviewing skills, and because the data are available, we included the scale in the validity studies presented in this thesis.

We have analyzed and discussed here the inter-observer reliability for each of the MAAS-MI G items separately. High levels of reliability are seen when items are worded in behavioral terms. Moderate reliability is observed when larger units of interviewing behavior are measured. Low reliability is reported in items which require interpretation by the observers. High inter-observer reliability is seen for the scale History-taking, moderate reliability for Exploring Reasons for Encounter, Presenting Solutions, and Structuring, whereas low reliability is observed for the scales Interpersonal Skills and Communicative Skills. Inter-observer reliability is enhanced significantly when a second observer is added.

Scales & Inter-observer Reliability

Since the summed scores of items in each scale are treated in our study as indices of the quality of six distinct types of medical interviewing skills in the MAAS-MI General, we decided to analyze the reliability of these measurements on scale level.

Generalizability Theory

Generalizability theory appears to be the most appropriate approach for analyzing the size of a number of sources of variation in our measurements in one and the same analysis. Analysis of variance and the subsequent estimation of variance components provides information about the contribution of each of the sources of variation. Based on these figures, generalizability coefficients can be computed which quantify the reliability of the measurements under a variety of situations (Cronbach et al., 1972; Mitchell, 1979; Thorndike, 1982; Nunnally, 1982).

Method

Analysis of variance and generalizability theory ideally requires data to be gathered in a completely crossed, multidimensional design.


In our research situation, 20 videotaped interviews between physicians and a total of 5 simulated patients, who presented complaints of either a myocardial infarction or the inception of diabetes mellitus, were all observed by six observers on all the items of the MAAS-G. The research situation is described more extensively in Convergent & Divergent Validity. Analyses of variance were carried out for each MAAS-G scale by means of the General Mixed Model Analysis of Variance with Equal Cell Sizes (BMDP program P8V – Dixon and Brown, 1979), in which the physician and observer facets were considered to be random, and the item facet was considered to be fixed.

The size of the variance components was estimated for the facets physicians, observers and items, and for the interaction between physicians and observers, the interaction between physicians and items, the interaction between observers and items, and the interaction between physicians, observers and items including error. These estimates of variance components were used to calculate the percentage of the total variance induced by each component and the estimates provided the necessary information to compute generalizability- coefficients.
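The estimation step can be illustrated for the simplest fully crossed design, persons x observers, where the variance components fall out of the expected mean squares. The thesis used BMDP P8V for the full three-facet persons x observers x items design, so this is only a sketch of the principle:

```python
def variance_components_po(data):
    """Estimate person, observer and residual variance components from
    a fully crossed persons x observers table of scores, via expected
    mean squares (one-facet sketch of the BMDP P8V computation)."""
    n_p, n_o = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n_p * n_o)
    p_means = [sum(row) / n_o for row in data]
    o_means = [sum(data[i][j] for i in range(n_p)) / n_p for j in range(n_o)]
    ss_p = n_o * sum((m - grand) ** 2 for m in p_means)
    ss_o = n_p * sum((m - grand) ** 2 for m in o_means)
    ss_tot = sum((data[i][j] - grand) ** 2
                 for i in range(n_p) for j in range(n_o))
    ss_res = ss_tot - ss_p - ss_o
    ms_p = ss_p / (n_p - 1)
    ms_o = ss_o / (n_o - 1)
    ms_res = ss_res / ((n_p - 1) * (n_o - 1))
    # expected mean squares: MS_p = sigma2_res + n_o * sigma2_p, etc.
    return {"person": (ms_p - ms_res) / n_o,
            "observer": (ms_o - ms_res) / n_p,
            "residual": ms_res}
```

The person component is the variance of interest; the observer and residual components feed into the error term of the generalizability coefficients.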

Results

The results of the analyses of variance, with a summary of the estimates of the variance components for the respective scales, are presented in Table 4; the results for each scale can be obtained from the authors. The generalizability coefficients, in which the item component is considered to be fixed, were calculated by means of the formulae given by Thorndike (1982).

The items facet became a fixed facet rather than a facet that is sampled randomly from a larger universe of items, because the sets of items were selected carefully to constitute together the domain of interviewing skills in which we are interested. The universe of interviewing skills pertaining to each dimension is limited and it is the researchers’ impression after reviewing the literature and after observing many interviews, that most interview behavior is covered by one of the MAAS-GP items. Moreover, the sets of items constituting the scales reflect the universe about which we wish to generalize because the groups of items used in our scalability, reliability and validity studies are always identical. Generalizability coefficients for increasing numbers of observers are presented in Table 5.
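As a sketch of how such coefficients follow from the variance components when the item facet is fixed: one standard form (following Thorndike, 1982, and later treatments of mixed designs; the authors' exact formulae may differ in detail) folds the physician-by-item component into the universe-score variance and divides the observer-linked components by the number of observers. The component values below are hypothetical.

```python
def g_coefficient(vc, n_observers, n_items):
    """Generalizability coefficient for a scale score averaged over a
    fixed set of n_items items and n_observers randomly sampled observers.
    vc maps variance-component names to their estimates."""
    universe = vc["p"] + vc["pi"] / n_items                         # true variance
    error = vc["po"] / n_observers + vc["poi,e"] / (n_observers * n_items)
    return universe / (universe + error)

# Hypothetical components for a 10-item scale:
vc = {"p": 4.0, "pi": 2.0, "po": 1.0, "poi,e": 2.0}
one = g_coefficient(vc, 1, 10)    # about .78 with one observer
two = g_coefficient(vc, 2, 10)    # about .88 with two observers
```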

Unfortunately, we did not extend our generalizability study to a completely crossed design which also included the simulated patients and the cases. Out of curiosity, and because we wished to increase our understanding of the influence of simulated patients and cases on MAAS-G measurements of medical interviewing skills, analyses of variance and subsequent generalizability coefficients were computed for each of the two cases and each of the five simulated patients. The generalizability coefficients for the two cases are presented in Table 6, whereas the estimates of variance components are given in brackets in the discussion section where applicable.

To compare physicians' responses on the scales for each case, mean scores and t-tests over the scale totals were computed (Table 7).
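The comparison reported in Table 7 rests on a standard pooled two-sample t-test. A minimal sketch (not the authors' actual computation):

```python
import math

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance; df = len(a) + len(b) - 2."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)     # sum of squared deviations, sample a
    ssb = sum((x - mb) ** 2 for x in b)     # sum of squared deviations, sample b
    sp2 = (ssa + ssb) / (na + nb - 2)       # pooled variance estimate
    se = math.sqrt(sp2 * (1 / na + 1 / nb))
    return (ma - mb) / se, na + nb - 2
```

With the scale totals of 40 physicians per case, this yields the df = 78 reported in Table 7.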

Table 4 -- Percentage of Variance of Each MAAS-MI G Scale Attributed to Different Sources of Variance
Table 5 -- Generalizability Coefficients for Increasing Numbers of Observers of the MAAS-MI G (over Rasch Homogeneous Scales)
Table 6 -- Generalizability Coefficients over MAAS-MI General Scales for Myocardial Infarction and Inception of Diabetes Mellitus Cases for One and Two Observers
Table 7 -- Mean Scores and Standard Deviation for Both Cases on MAAS-MI General Rasch-homogeneous Scales and T-tests Between the Cases (df=78)

Discussion

First Facet: Physicians – true variance

The first facet, which reflects differences between physicians, reveals that most scales measure a considerable amount of variance attributable to the physicians (see Table 4). Together with the interaction between physicians and items, this facet represents the true variance in which we are primarily interested. Since the quality of measurement of the traits had already been secured by Rasch analyses, the amount of variance attributable to the first facet is not of major significance to us. Still, we can see that the scale Communication Skills has poor measurement properties, because the variance component attributed to physicians appears to be very small.

Second Facet: Observers

The second facet, reflecting differences between observers, reveals varying degrees of observer influence on the variance in the MAAS-MI G scales. This implies that differences between observers with regard to the overall severity of their grading standards will produce variation in the estimates of a physician's competence. Saal argues that a significant observer main effect, especially one that explains a sizable proportion of the rating variance, has to be interpreted as the traditional leniency effect, generally defined as the tendency of observers to assign a higher or lower rating than is warranted by a subject's behavior (Saal et al., 1980).

Leniency refers to the phenomenon that some observers will indicate that behavior occurred, whereas others will state that it occurred insufficiently or only partly, although all of them observed the same behavior

This influence is lowest in the scales History-taking and Interpersonal Skills, moderate in Exploring Reasons for Encounter, Presenting Solutions and Structuring, and greatest in the scale Communication Skills. Scales referring to larger units of interview behavior which are difficult to define, and for which the criteria for scoring are less well described, are more likely to be affected by a leniency effect. It is remarkable that the scale Interpersonal Skills is scarcely influenced by the leniency effect. Stricter definitions and training sessions in which observers reach agreement are ways to reduce the leniency effect in affected scales.

Third Facet: Items

The third facet, reflecting differences between items, shows that a considerable amount of variance is induced by the items which form the scales. The interpretation of this facet posed problems for the researchers: the items are identical in every observation situation, yet they induce systematic variance. We therefore attributed the differences between items to a different source of variation in our study: namely, the combination of simulated patient and medical problem. Considerable differences on the third facet between the two cases are observed for the scales Exploring Reasons for Encounter, History-taking and Presenting Solutions (respectively 11.9%, 17.8% and 17.1% for myocardial infarction, and 48.2%, 4.0% and 36.0% for diabetes mellitus). Dissimilarity in case presentation by the simulated patients and differences between the medical problems are considered to be responsible for these effects.

Fourth Facet: Physicians X Observers

The fourth source of variation, reflecting the interaction between physicians and observers, influences the scores on the scales in varying degrees. Once again, the scale Communication Skills is affected most strongly, but this influence also acts upon Structuring the interview and Interpersonal Skills.

The interaction between physicians and observers is known in the literature as the halo-effect and is defined as an observer's failure to discriminate between conceptually distinct and potentially independent aspects of a subject's behavior (Saal et al., 1980)

Occurrence of halo-effects suggests that one characteristic of a subject will influence an observer's opinion on a variety of items. Halo-effects are likely to emerge when the behavior under study is not well defined or when a substantial degree of judgment is involved in answering an item. Halo-effects were expected to occur to a certain degree in the scales Structuring, Interpersonal Skills and Communication Skills, because these scales either measure larger units of difficult-to-define interview behavior or require observers to give their personal opinion of a physician's skills. However, halo-effects were not expected to contribute so significantly to the measurement of Communication Skills, because beforehand we had considered the interviewing skills in this scale to be defined in more behavioral terms and therefore easier to measure than interpersonal skills. Apparently, an observer's impression of the skill to communicate is reduced to a more global judgment, probably because the scale measures interview behavior which takes place more than once, and sometimes even many times, in the course of a consultation. No differences between the two cases were observed with regard to the impact of halo-effects on the MAAS-G scales.

To diminish the influence of halo-effects, items should be well-defined, a difficult requirement with regard to these scales. Items in the scales Structuring and Interpersonal Skills are already described as behaviorally as possible. We consider that the rewording of these items to achieve more behavioral descriptions of interviewing skills would seriously impair the validity of the scales. With regard to Communication Skills, we have earlier proposed the classification of every utterance of the physician. A quite different approach might be the selection for measurement purposes of those observers who are known to evoke only low degrees of halo-effects. Nevertheless, it is our opinion that, even with this approach, halo-effect has to be accepted as a measurement feature of these scales. 

Fifth facet: Physicians X Items – True Variance

The fifth source of variation, reflecting the interaction between physicians and items, is considered in the literature as indicating true variance in addition to the physician’s facet (Thorndike, 1982). This source of variation refers to differences between physicians with regard to their interviewing styles. The figures show that interviewing styles are most pronounced in Exploring Reasons for Encounter and History-taking, less in Presenting Solutions and Structuring the interview, and scarcely present at all in Interpersonal Skills and Communication Skills. Considerable differences in impact on the interaction between physicians and items were observed between the two cases, especially in Exploring Reasons for Encounter, History-taking and Presenting Solutions (respectively 30.7%, 25.6% and 13.3% for myocardial infarction case and 10.2%, 43.4% and 6.2% for inception of diabetes mellitus case). 

Sixth Facet: Observers X items

The sixth source of variation, reflecting the interaction between observers and items, is strongest for Communication Skills, less for Interpersonal Skills, Presenting Solutions and Structuring, and almost absent for Exploring Reasons for Encounter and History-taking. The interaction between observers and items refers to differences between observers in interpreting the meaning of items and the criteria for scoring. The figures reveal that single acts of operationally defined interview behavior are interpreted very similarly, whereas situations in which the observer has to match the occurrence of interview behavior with MAAS-MI G items and their definitions are inclined to induce differences in interpretation. The item Explains diagnosis or problem definition understandably, for example, is difficult to score because the observer has to decide what of everything said by the physician to the patient pertains to the diagnosis, whether it provides an explanation and whether it is presented understandably. This source of error variance is assumed to be minimized by the use of additional resources such as a manual, instructions and articles which increase observers' understanding of the behavior under study. Moreover, the training of observers by means of group observations of videotaped interviews is likely to reduce this effect. Since four of the six observers are experts on the MAAS-MI G, having participated actively in its construction, the figures presented here are considered to reflect the upper limits of agreement that can be achieved among observers on the interpretation of MAAS-MI G items. Moreover, only minor differences between the two cases are observed.

Seventh Facet: Error

The size of the seventh source of variation reflects that the interaction between physicians, observers and items, including error, forms a considerable source of variation in our measurements. This suggests that, in addition to the known and controlled sources of variation, other influences can act upon the variation of our measurements. Error is one of these influences. Physicians' interview behavior is considered to be affected by conditions such as fatigue, motivation and willingness to participate, whereas observers' ratings are likely to be influenced by fatigue, mood, inaccuracy and pressure of time. Moreover, the simulated patients who play opposite the physicians, and the cases they are presenting, will contribute to the patient-physician communication and therefore will induce variation in our measurements. Differences between the two cases with regard to the size of the seventh source of variation are observed for the scales Exploring Reasons for Encounter and Presenting Solutions.

In conclusion, physicians, observers and items are all capable of inducing variance in MAAS-MI G measurements of medical interviewing skills, as is evidenced by the size of the different variance components presented in Table 4. Furthermore, the cases and/or simulated patients appear to elicit different amounts of true variance and induce differences in the item facet, the interaction of physicians and items, and the interaction of physicians, items and observers including error. This phenomenon undermined the design of the generalizability study to some extent and posed interpretation problems for the researchers. This issue is therefore elaborated upon below.

Remedy of errors induced by observers

The generalizability study was conducted to determine, in one study, the impact of physicians, observers and items on MAAS-MI G measurements of medical interviewing skills and to calculate generalizability coefficients which were expected to provide information about the effect of strategies to remedy error induced by observers. This is of importance because the number of observers participating in data recording is the only source of variation that can be manipulated by the researchers.

The generalizability coefficients for one observer (see Table 5) reveal that an acceptable level of reliability is achieved for the scale History-taking, whereas the other scales display low levels of reliability (Mitchell, 1979). Adding a second observer increases the reliability of Exploring Reasons for Encounter, History-taking, and Presenting Solutions to a moderate or even high level. Structuring barely reaches an acceptable level, whereas Interpersonal Skills and Communication Skills gain the least advantage from an increase in the number of observers to enhance reliability; this is probably because these scales do not measure a large component of true variance. It is clear that an increase in the number of observers mitigates error induced by observers and diminishes differences between observers, halo-effects and different interpretations of items. Reliability is most effectively increased by adding one or two observers. The addition of even more observers to the process of measurement will not add substantially to reliability.
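The diminishing return from adding observers can be illustrated with the Spearman-Brown prophecy formula. This is only an approximation to the generalizability coefficients of Table 5, which come from the full variance-component model, but it shows why the first extra observer helps most:

```python
def spearman_brown(r1, k):
    """Projected reliability when the ratings of k parallel observers are
    averaged, given the single-observer coefficient r1."""
    return k * r1 / (1 + (k - 1) * r1)

# Starting from a single-observer coefficient of .50:
# 2 observers -> .67, 3 -> .75, 5 -> .83, 10 -> .91.
```

The jump from one to two observers is larger than all subsequent gains combined until far down the series.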

In conclusion, we observe that it is possible to alleviate the influence of observers on our measurements of medical interviewing skills by incorporating one or two additional observers in the process of measurement.

Interrater-reliability of MAAS-MI General can be enhanced by adding a second observer 

Reliability in case of interactional data

The reliability of the MAAS-MI G measurement was studied by observing medical interviews of 20 physicians who talked with one out of five simulated patients who presented one out of two cases. We decided to study reliability over two different cases in order to enhance external validity. Although this design was completely crossed for physicians, observers and items, we did not include simulated patients and cases in the completely crossed design.

Since the question can be raised of whether this procedure is acceptable, we studied the impact on our measurements of medical interviewing skills of:

  • The case as a medical problem;
  • Simulated patients as human beings interacting with their physician;
  • Simulated patients as performers of certain tasks.

Swanson et al. (1981) studied the stability of medical interviewing skills over different cases and concluded that patients and cases seem to differ greatly and that the physician's adaptation to these differences is complex and difficult to measure.

The mean scores of the 40 physicians on the MAAS-MI G scales for both cases, displayed in Table 7, reveal significant differences for the scales Exploring Reasons for Encounter, History-taking and Presenting Solutions. Beforehand, we had expected differences to show up only for the scale History-taking, because the cases presented differed considerably with regard to the amount of information necessary to solve the medical problem. The mean scores reveal that, in accordance with expectation, almost no questions were asked in the diabetes mellitus case, whereas several questions were asked in order to solve the myocardial infarction case. The generalizability coefficients for the scale History-taking are, however, considerable and almost identical for both cases. This suggests that, given a certain medical problem, differences in History-taking skills are determined solely by differences between physicians and, of course, observers. The case will not interfere with the measurement of physicians' interviewing skills during History-taking.

The situation is rather different for Exploring Reasons for Encounter and Presenting Solutions, because the medical problem is not conceived as significantly influencing these interviewing skills. The simulated patients, however, are expected to have a considerable impact, especially because these phases of the interview enable them to voice their concerns and to obtain information about and treatment for their complaints. The simulated patients in the diabetes mellitus case were trained to worry about the imminent disease and to insist on obtaining information, whereas the patients in the myocardial infarction case were much less demanding. The generalizability coefficients presented in Table 6 show considerable differences between the cases in Exploring Reasons for Encounter, Presenting Solutions (after diminishing interfering observer influences) and Communication Skills. The results suggest that the goals which patients try to achieve in the interview interfere – a better term might be interact – with the measurement of differences between physicians with regard to the quality of their interviewing skills.

Goals patients try to achieve interfere with the measurement of interviewing skills of physicians

An alternative explanation might be that differences between the simulated patients as private persons influence the interaction between physician and patient. To study this alternative explanation, generalizability coefficients over the scale Exploring Reasons for Encounter were computed for each of the five simulated patients. Since chance capitalization was likely to occur because the number of physicians was very small, these results have to be treated cautiously. For the patients in the myocardial infarction case, the coefficients were respectively .51, .35 and .56, and for the diabetes mellitus case, .06 and .18. The results support the hypothesis that patients, as private persons, elicit different amounts of true variance, because different coefficients were reported within each case; more importantly, the coefficients were low for the diabetes case and moderate for the myocardial infarction case. Since the coefficients were rather stable within each case, they also support the hypothesis that the goals which patients try to achieve have a considerable impact on physicians' interviewing skills.

During the construction of the Patient Satisfaction with Communication Checklist, identical problems were encountered because patients did not distinguish between the dimensions Insight and Providing Information. It is clear that, during an interaction, both participants influence each other and constitute together the reality of a medical interview. Unfortunately, our methods of measurement are only able to measure either the physician’s interviewing skills or the patient’s opinion about the consultation and they are unable to deal with the interaction. 

Given that we now know that a medical interview forms a dialogue between physician and patient and that both contribute to the communication, the following question is raised: What is the reliability of the MAAS-MI G? Physician and patient influence each other and the interviewing skills which are displayed by the physician during a medical consultation are therefore partly dependent upon his interviewing style and partly dependent upon the patient’s contribution. Reliability, on the other hand, is defined as the ratio of true variance and observed variance. True variance in a physician’s medical interviewing skills is assumed to reflect the differences in style and quality that are attributed to the physician, but it is clear that a physician’s skills are also dependent upon the patient’s contribution to the communication and upon the interaction between physician and patient. An example will clarify this issue. 

The generalizability coefficient for two observers for Exploring Reasons for Encounter was .65 (moderate) over both cases, .60 (moderate) over the myocardial infarction case, and .25 (low) over the diabetes mellitus case. Is reliability moderate or low? The question cannot be answered, because both are true. We therefore conclude that the reliability of interactional data should be studied cautiously and that results should be interpreted in a comparative way rather than absolutely, as we have done in the discussion of the estimates of variance components.
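Expressed in the ratio definition of reliability, the contrast above amounts to the patient's contribution shrinking the variance attributable to physicians. The components below are hypothetical, chosen only to echo the .60 versus .25 contrast:

```python
def reliability(true_var, error_var):
    """Reliability as the ratio of true variance to observed variance."""
    return true_var / (true_var + error_var)

# Same observer error in both hypothetical cases, but the demanding patient
# leaves less room for physician differences to show:
moderate = reliability(1.2, 0.8)   # 0.60 (myocardial-infarction-like case)
low = reliability(0.3, 0.8)        # about 0.27 (diabetes-mellitus-like case)
```

The same error variance yields a moderate or a low coefficient depending solely on how much true variance the interaction lets through.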

Demanding patients interfere in the interaction and yield less information about physician’s interview ability

Our study reveals, furthermore, that demanding patients who insist on discussing specific topics yield less information about a physician's or student's ability to perform a medical interview, an important finding for teaching and evaluation situations. The patient's influence is most pronounced during the Exploration of Reasons for Encounter and the Presentation of Solutions, whereas History-taking is influenced considerably by case differences. Future studies are needed to examine this issue in greater depth.

Conclusion

In this chapter, the scalability and reliability of the MAAS-MI General have been studied.

Scalability

The scalability was assessed by means of Rasch analysis which determines whether all items of a scale are, to a satisfactory degree, measuring the trait of interest and whether the group of items collectively reflect different levels of possession of this trait. MAAS-G-scales of medical interviewing skills fit well in the Rasch model with the exception of Communication Skills. The scales are considered as having adequate measurement properties. Additional analyses reveal that our scales of medical interviewing skills are measuring only one dimension. 

Inter-observer Reliability

Since the process of measuring medical interviewing skills is heavily dependent upon human observers, inter-observer reliability was studied. High levels of inter-observer reliability are seen when items are worded in behavioral terms. Moderate reliability is observed when larger units of interview behavior are measured. Low reliability is reported for items which require considerable interpretation by observers. High inter-observer reliability is seen for the scale History-taking, moderate reliability for Exploring Reasons for Encounter, Presenting Solutions, and Structuring, whereas low reliability is observed for the scales Interpersonal Skills and Communication Skills. Inter-observer reliability can be enhanced significantly by adding a second observer. 

Generalizability Studies

Additional generalizability studies reveal that leniency, halo-effects, and differences in interpretation impair the quality of measurement to varying degrees. These studies reveal, furthermore, a complex interaction between physicians’ interviewing skills and the goals patients try to achieve during a medical consultation. Simulated patients who insist on discussing certain topics yield less information about a physician’s interviewing skills during the Exploration of Reasons for Encounter and the Presentation of Solutions. We conclude, therefore, that the reliability of interactional data on scale level should be studied cautiously and that results should be interpreted comparatively rather than absolutely.

References

Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960; 20: 37-46. 

Cronbach LJ, Gleser GC, Nanda H, Rajaratnam N. The dependability of behavioral measurements: theory of generalizability for scores and profiles. John Wiley and Sons, New York, 1972.

Crijnen AAM, Thiel J van, Kraan HF. Evaluatie van consultvoering: een spreekuur nagebootst (Evaluation of a medical consultation: simulated consultation hours). Huisarts en Wetenschap, 1986; 29: 316-318.

Dixon WJ, Brown MB. Biomedical Computer Programs, P-series. University of California Press, Berkeley, 1979.

Guilford JP, Fruchter B. Fundamental statistics in psychology and education. McGraw-Hill, London, 1981.

Gustafsson JE. The Rasch-model for dichotomous items: theory, applications and a computer program. Reports from the Institute of Education, University of Göteborg, no. 63, Sweden, 1977.

Hambleton RK, Cook LL. Latent trait models and their use in the analysis of educational test data. Journal of Educational Measurement, 1977; 14: 75-96.

Molenaar IW. Programma beschrijving van PML voor het Rasch-model (Description of the PML program for the Rasch model, version 3.1). Heymans Bulletin, Vakgroep Statistiek en Meettheorie, Universiteit van Groningen, Groningen, 1981.

Nunnally JC. Reliability of measurement. In: The encyclopedia of educational research. Macmillan and Free Press, New York, 1982.

Saal FE, Downey RG, Lahey MA. Rating the ratings: assessing the psychometric quality of rating data. Psychological Bulletin, 1980; 88: 413-428.

Shrout PE, Fleiss JL. Intraclass correlation: uses in assessing rater reliability. Psychological Bulletin, 1979; 86: 420-428. 

Swanson DB, Mayewski RJ, Norsen L, Baran G, Mushlin AI. A psychometric study of measures of medical interviewing skills. In: Proceedings of the 20th Annual Conference on Research in Medical Education, Washington, 1981.

Tinsley HEA, Weiss DJ. Interrater reliability and agreement of subjective judgments. Journal of Counseling Psychology, 1975; 22: 358-376.

Thorndike RL. Applied psychometrics. Houghton Mifflin Company, Boston, 1982.