2.1 Validity, reliability, scalability

After constructing the MAAS as a measurement system for medical interviewing skills, we wanted to assess its instrumental utility.

Previously, we combined personal experience with opinions from professionals and research evidence to construct MAAS. This allowed us to develop and test the medical interview framework.

Now, it was time to study its instrumental utility. Validity, reliability, scalability and practicability are the components of instrumental utility. Here, we provide a systematic overview of the studies on the instrumental utility of the MAAS medical interview.

Crijnen, A. A. M., & Kraan, H. F. (1987). Assessing instrumental utility: issues of validity, reliability and scalability. In H. F. Kraan & A. A. M. Crijnen (Eds.), The Maastricht History-taking and Advice Checklist: studies of instrumental utility (pp. 119–145). Lundbeck, Amsterdam.

Instrumental utility

The question now raised is: how can the instrumental utility of our measures of medical interviewing skills be established in terms of validity, reliability, scalability and practicability?

A group of variables constituting a measurement of the concept under study is assumed:

  • To represent the intended concept acceptably and adequately;
  • To measure it with considerable precision and efficiency.

The study of validity addresses the question of the quantitative and empirical relationship between the concept under study and the variables which represent this concept (de Groot, 1972).

Measures of interviewing skills are assumed to represent these skills acceptably and adequately, and to measure them with precision and efficiency

Reliability and scalability contribute to the quality of measurement in terms of precision and efficiency and, subsequently, to the assessment of validity.

Practicability refers to the convenience with which the instrument can be applied in research or educational settings. 

In this chapter, in order to develop adequate research designs, we review methodological and psychometric theories of validity, reliability and scalability; they are described in the following paragraphs. Subsequently, an overview is given of the research settings that were used to answer the formulated questions. The practicability of our measurements will not be analyzed formally, but we report some of our experiences below.

Validity

Recently, Van Berkel (1984) drew up an inventory of 77 distinct types of validity and classified them into four major categories (Cronbach et al, 1955; Cronbach, 1970; de Groot, 1972; Cook et al, 1979; Thorndike, 1982).

These four categories of validity are: 

  1. Criterion-orientated validity, which correlates results of a test with a criterion outside the test situation;
  2. Content validity, which refers to how adequately the content of the test represents the universe that the test intends to measure;
  3. Construct validity, which analyses the meaning of test scores in terms of psychological constructs;
  4. Experimental validity, which studies the generalizability of conclusions derived from experiments to situations outside the experimental setting. 

Inferences from the first three validity types are based on what a subject achieves on the pertinent and related tests, whereas inferences from the last validity type are based on a critical appraisal of the design of the test setting.

Philipsen (1984) approached the issue of validity slightly differently by differentiating between two dimensions in validity studies.

  • Firstly, he recognized the goals researchers try to achieve, ranging in hierarchical order from face validity through content validity to construct validity.
  • Secondly, he differentiated between the procedures which can be applied: predictive validity, discriminant validity or concurrent validity.

By combining both dimensions, nine types of validity are discerned. Philipsen’s contribution emphasizes the depth of analysis with regard to each procedure. As textbooks are organized around the four major types of validity mentioned by van Berkel, they are elaborated on in the following paragraphs, but we keep in mind that the depth of analysis can vary for each procedure. 

Criterion-orientated validity

In criterion-orientated validity, the question is studied of how well test scores are able to predict criterion performance. Criterion-orientated validity is sometimes called concurrent validity when no time has elapsed between the measurements, or predictive validity when a criterion is predicted for the future.

Criterion-orientated validity is primarily applied:

  • To tests which are used to select or classify subjects, such as students, patients or employees;
  • To tests which are used to decide what treatment should be given to subjects;
  • And to tests which are used as a substitute for a more cumbersome assessment procedure (Cronbach, 1970).

Criterion-orientated validity is operationalized by the correlation between test performance and future criterion performance:

  • High or modest correlations confirm the criterion-orientated validity of a test.
  • Criterion-orientated validity only provides firm evidence of validity when the measurements are intended as predictors in a specific research setting with a specified criterion which is measured validly in itself (de Groot, 1972).

The construction of criterion measurements forms the greatest problem for predictive validity since it is difficult, but vital, to obtain suitable, valid measurements of the criterion. Difficulties may arise when the criterion behavior is multi-dimensional, vague or equivocal. Theoretical considerations of the relationship between predictor and criterion are essentially unimportant in the determination of criterion-orientated validity. 

MAAS-MI

Although we acknowledge the importance of criterion-orientated validity, especially for selection or treatment purposes, we do not apply it to the MAAS. An adequate research design for the determination of the criterion-orientated validity or, more precisely, the predictive validity, of the MAAS, would be to record students’/future physicians’ medical interviewing skills by means of the MAAS at this moment and then correlate them with measurements of their interviewing skills recorded after the physicians have had several years of experience in daily practice (Crijnen et al, 1984). The strength of the correlations would indicate MAAS’s criterion-orientated validity. Since the time-lag necessary to obtain the future criterion measurements is beyond the scope of our project, we have been unable to study the predictive validity of the MAAS. Determination of the MAAS’s criterion-orientated validity should certainly be carried out in the future as the MAAS is already used to classify and select medical students. 

The establishment of criterion-orientated validity in terms of concurrent validity is elaborated in the sections on construct validity since theoretical considerations are strongly taken into account. 

Content validity

Content validity is studied when researchers evaluate whether the items of the test adequately represent the universe the test intends to measure. Determination of content validity is especially required when tests are designed to measure the degree of mastery of some domain of knowledge or skill.

Content validity was enhanced during construction of the Maastricht History-taking and Advice Checklist through:

  • Participation of a large group of physicians and psychologists who continually scrutinized the content of the MAAS-MI;
  • Attempts to define as clearly as possible:
    • The context in which we are interested (initial medical consultations);
    • The behavior we have tried to measure (six different categories of medical interview behavior);
    • The task given to the subjects (perform a medical interview);
  • Explicit formulation of the theoretical considerations of medical interviewing skills and empirical evidence which were used to construct the scales.

Content validity is determined by judgment of experts on how adequately the domain of interest is represented in the test:

  • A prerequisite for the assessment of content validity is a clear, detailed and explicit definition of the universe a researcher wishes to measure (see also Building Blocks for MAAS-MI). This definition ought to cover the kind of tasks and situations covered by the universe; the kinds of responses the observer wishes to count; and the instruction to the subjects (Cronbach, 1970);
  • Unfortunately, objective determination of a test’s content validity is difficult since few attempts are made to develop quantitative indices of content validity.

MAAS-MI

Since content validity was secured during test construction, we made one additional systematic effort to objectify content validity of the MAAS (see also Content Validity MAAS-Mental Health). 

Construct validity

In 1955, Cronbach and Meehl recommended that the construct validity of new tests should be established in addition to the criterion-orientated validation procedure which was used at that time but was severely criticized.

They defined construct validity as the analysis of the meaning of test scores in terms of psychological concepts or constructs. In interpreting test scores, researchers have to face the question: What constructs account for variance in test performance? Constructs were seen as ‘some postulated attributes of people assumed to be reflected in test performance’ (Cronbach et al, 1955). The concept of constructs was developed to describe or account for certain recurring characteristics of a subject’s behavior (Thorndike, 1982). 

What constructs account for variance in test scores?

Theories About a Construct  Although constructs cannot be assessed directly, researchers develop a theory of the construct to a certain level of sophistication. They know how a construct will express itself, what sub-groups in the population possess it to a high or low degree, what conditions favor or inhibit expression of the construct, what test-tasks elicit the construct, etc. The theoretical considerations form an essential part of construct validity, since they suggest kinds of evidence that are relevant for assessing how well a measurement depends upon the construct.

Cronbach (1970) described a general outline for establishing construct validity that was based on these theoretical considerations:

  • First of all, researchers have to suggest what constructs might account for test performance;
  • Secondly, testable hypotheses are derived from the theory surrounding the construct;
  • Finally, researchers carry out studies to test the hypotheses empirically.

Theoretical reflections on the behavior of the construct under study underlie all procedures for the investigation of construct validity. Over the course of time, several procedures have been elaborated to evidence a measure's construct validity. We confine ourselves here to the four procedures described by Thorndike (1982), which rely heavily on the original work of Cronbach and Meehl (1955). 

1. Comparison of test tasks with conception of the attribute

The first question to ask about a method of measurement is: Do the items and the test task appear to call for the construct in question? Is the content reasonable for eliciting the construct we wish to measure? Congruence between the assumed construct and item content forms a first indicator of the essential nature of our method of measurement, but is in no way conclusive. Unfortunately, no precise methods are available for properly outlining the item or variable domain of a construct (Nunnally, 1967). This matter is left entirely to the researcher’s understanding of the construct. This procedure comes close to establishing content validity (de Groot, 1972). 

2. Correlational evidence of construct validity

This procedure comes closest to the meaning of construct validity. It states very generally that:

  • Measurements should show substantial correlations with different measurements of the same construct, as well as with measurements of theoretically related constructs;
  • Whereas low correlations with measurements of other, theoretically unrelated attributes are expected.

This type of construct validity is elaborated in two distinct directions:

  • Firstly, the assessment of the nomological network;
  • Secondly, the assessment of convergent and divergent validity (Cronbach et al, 1955; Campbell et al, 1959; Thorndike, 1982). 

Nomological Network  Construct validity of a test is underscored when the relations in the nomological network, defined as the interlocking system of laws which constitute a theory, are supported by empirical evidence.

The nomological network describes the systems of laws which constitute a theory

The nomological network of a theory contains a theoretical model, related hypotheses and predictions which include empirical references, and empirical evidence stemming from previous validity studies (de Groot, 1972). Based on the nomological network, researchers develop hypotheses about the nature and strength of relations between the constructs under study and other constructs. They make judgments about the nature of certain activities and the skills required to perform them successfully.

In construct validity, these judgments are tested. When the predicted relations appear empirically, the construct validity of the measurements of the construct is supported. The relations predicted by the nomological network should be able to explain the strength of the correlations. An additional advantage of this procedure is that the researcher’s understanding of the coherence in daily-life is increased.

A fortunate side-effect of studying construct validity is that we gain a better understanding of the coherence of interviewing skills with their effects on patient and physician

When the relations fail to appear, the nomological network and/or the construct validity of the measurements of the construct are questioned. The uncertainty about the interpretation of negative results led Nunnally (1967) to discredit the idea that sufficient evidence for construct validity is brought forward when the supposed measurements of a construct behave as expected. He stated that all that can be tested is the correlation between measurements of constructs, whereas researchers draw conclusions about both the theory which surrounds the test and the construct validity of the measurements. Studies of construct validity are only safe, according to Nunnally, when, firstly, a supposed measurement of a construct is related to a particular observable variable of which the domain is well defined and, secondly, the assumed relationship between the two constructs is unarguable. Moreover, Nunnally warned researchers against assuming that constructs have objective reality. He proposed that a construct's name could act as a useful way of labelling a particular set of observable variables. Validity would then be indicated by the extent to which the name accurately communicates the kind of observables that are being studied. 

The relation between a physician's interview behavior, as measured by MAAS-MI G and MAAS-MI MH, and several outcomes of the interview, such as patient satisfaction and the quality of diagnosis and treatment plan, is studied here to determine the nomological net of both instruments. Studies are carried out in a simulated consultation hour and in consultation hours in which general practitioners interview real patients. See also: Evidence Base – MAAS-MI G and Evidence Base – MAAS-MI MH.

Convergent and Divergent Validity  The assessment of a test’s convergent and divergent validity by means of the multitrait-multimethod matrix has been recommended by several authors as an appropriate way of assessing the identifiability of a proposed construct (Campbell et al, 1959; Fiske, 1971; Cronbach, 1972; Kerlinger, 1981; Thorndike, 1982).

Convergent validity refers to the assessment of the same construct by means of different methods, whereas divergent validity refers to the assessment of distinct constructs by means of the same and/or other methods. Campbell and Fiske (1959) approached the assessment of a test’s convergent and divergent validity systematically by applying each of several methods of measurement to each of several constructs.

They proposed the examination of the resulting matrix of correlations according to four criteria which refer to:

  • Convergence of the methods with regard to a pertinent construct;
  • Divergence of constructs and of methods, respectively;
  • A general pattern of correlations among the constructs.

This classical approach to convergent and divergent validity brings evidence to bear on the quality of the representation of the construct by the content of the test and it brings to light systematic variance introduced by the method of measurement. 
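
To make the Campbell and Fiske criteria concrete, the following Python sketch builds a small multitrait-multimethod matrix for two hypothetical constructs measured by two hypothetical methods; all names and numbers are illustrative and do not come from our data.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 40                                    # hypothetical number of subjects

# Two hypothetical constructs and two shared method factors.
exploring = rng.normal(size=n)            # construct 1
structuring = rng.normal(size=n)          # construct 2
m_checklist = 0.3 * rng.normal(size=n)    # variance shared by all checklist scores
m_selfrating = 0.6 * rng.normal(size=n)   # variance shared by all self-ratings

def score(construct, method_factor, error_sd=0.5):
    return construct + method_factor + rng.normal(scale=error_sd, size=n)

scores = pd.DataFrame({
    "exploring_checklist": score(exploring, m_checklist),
    "structuring_checklist": score(structuring, m_checklist),
    "exploring_selfrating": score(exploring, m_selfrating),
    "structuring_selfrating": score(structuring, m_selfrating),
})

mtmm = scores.corr().round(2)             # the multitrait-multimethod matrix
print(mtmm)

# Campbell-Fiske pattern: convergent (same construct, different methods) correlations
# should be substantial and should exceed the heterotrait correlations.
print("convergent, exploring:    ", mtmm.loc["exploring_checklist", "exploring_selfrating"])
print("heterotrait, monomethod:  ", mtmm.loc["exploring_checklist", "structuring_checklist"])
print("heterotrait, heteromethod:", mtmm.loc["exploring_checklist", "structuring_selfrating"])

In such a matrix, high values on the validity diagonal combined with clearly lower heterotrait values support convergent and divergent validity, whereas inflated heterotrait-monomethod values point to shared method variance.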

We study the convergent and divergent validity of MAAS-MI G and MAAS-MI MH by constructing a traditional multitrait-multimethod matrix. See also: Conv. & Div. Validity MAAS-MI G and Conv. & Div. Validity MAAS-MI MH. 

Clinical Competency  In addition to the classical approach, several methodologists have described a less elaborated procedure to determine convergent and divergent validity by stating that the theory of a construct should be able to explain what other variables are correlated or uncorrelated with the measurements of the construct (Kerlinger, 1981; Thorndike, 1982). This procedure fails to provide information about the influence of shared method variance, but enables researchers to describe the content of the constructs more effectively. In this way, it is closely related to the assessment of the nomological net. 

We apply this procedure to establish the relations between measurements of medical interviewing skills (MAAS-MI G) and other dimensions of medical competency. See also: Interviewing Skills & Clinical Competency

3. Group differences as evidence of construct validity

If the understanding of a construct leads researchers to expect that distinct groups of subjects will respond differently to their measurements, this hypothesis can be tested. Evidence of construct validity is obtained when the hypothesis that the groups differ on the specific issue is proved by the data (Cronbach et al, 1955; Thorndike, 1982). When researchers apply this kind of construct validity, they have to be aware that they simultaneously test their understanding of and theory about the differences between the groups and the construct validity of their measurements. Positive results affirm both; negative results may stem from a shortcoming in one or both of them. 

4. Treatment effects as evidence of construct validity

Any experimentally introduced intervention or any naturally occurring change in conditions that might be expected to influence the construct under study can be used to study construct validity (Cronbach et al, 1955; Thorndike, 1982). Construct validity is supported when scores change in the predicted direction. When two measurements are affected similarly by a variety of treatments, the suggestion is raised that they are measuring much the same trait, which is a slightly different way of assessing convergent validity (Nunnally, 1967). Whether the degree of stability is encouraging or discouraging for the proposed interpretation depends upon the theory defining the construct (Cronbach et al, 1955). Furthermore, Thorndike (1982) remarked that measurements of states, as contrasted with measurements of traits, are especially sensitive to interventions. Traits are expected to be relatively insensitive to manipulations of conditions. The impact of an intervention on a pertinent construct provides useful information about the construct. 

We have studied the growth of medical students’ interviewing skills during medical school. Results supported the construct validity of MAAS-MI G measurements of interviewing skills. See also: Growth in Interviewing Skills over Medical School.

5. Conclusions about construct validity

It is evident that Cronbach and Meehl’s thoughts on construct validity form a fruitful contribution to the study of validity. Construct validity is essentially based on two important notions. Firstly, researchers formulate a nomological network from which testable hypotheses may be derived on the relation between the construct under study and other constructs. Secondly, researchers confirm or refute the hypotheses based upon empirical evidence stemming from a variety of test situations. Moreover, construct validity forces researchers to be very explicit about the theory which surrounds their constructs. 

The approaches to construct validity provided a useful frame of reference for the development of procedures for determining the construct validity of the MAAS-MI General and MAAS-MI Mental Health.

A more extensive description of how the theoretical notions of construct validity were used to develop research settings, procedures and additional instruments will be provided further in this chapter. 

Validity from experiments

The fourth type of validity described here differs from those addressed in the preceding paragraphs, because it concerns the justification of generalizing conclusions drawn from the results of experiments to situations outside the experiment.

With regard to this issue, Campbell and Stanley (1966) and Cook and Campbell (1979) invoked two types of validity called internal and external validity.

  • Internal validity refers to the inferences made by researchers that a relationship between two variables is causal or that the absence of a relationship implies the absence of cause. The concepts of covariation, time sequence and confounding variables are important in internal validity.
  • External validity refers to the validity from which researchers infer that the presumed causal relationship can be generalized to and across different types of persons, settings and times. Matters of internal and external validity are of importance to the study of the validity of the MAAS and are discussed in the pertinent chapters. 

With regard to the external validity of the MAAS-MI G and MAAS-MI MH, influences of different physicians, different simulated patients, real patients, different cases, different groups of subjects, etc., have to be taken into account in our measurements of interviewing skills. Some of these issues will be elaborated on in this thesis/site, such as case influences or the influence of simulated patients, whereas other influences have to be reserved for future studies. To enhance the external validity of the MAAS, the study of physicians' interviewing skills while they are talking with real patients during their daily practice is important, but was only partially carried out.

Scalability

A central question in the construction of scales is how researchers go from responses on items by subjects to indices on the dimension of interest. Scales consist of groups of items which are all intended to measure various aspects of one common property or dimension. The number of items that are scored positively is usually taken as a measurement of the underlying dimension. Scaling models formalize the relationship between responses to a group of items and indices which represent the underlying dimension, also called the latent trait. In the literature, several scaling models are described which are based on either classical test theory or latent trait models.

The place of scalability is not always clear since it is positioned somewhere between reliability and validity. The study of a test’s scalability provides information about the concurrent validity of the items and it determines whether all the items are measuring only one dimension. Scalability therefore combines issues of reliability and validity. 

Scaling models in classical test theory

A hallmark of classical test theory is the decomposition of a subject’s test-score into a true-score-component and an error-score-component which is uncorrelated with the true-score. This decomposition permits assessment of the reliability of a test by means of correlating equivalent forms of a test, by correlating results from two testings and by studying the consistency of performance over items (internal consistency). The process of item selection and the subsequent scale construction is primarily based on these indices of reliability. 

Classical test theory has been criticized over the last 15 years for its use of a linear model for the 'number of items correct' score, and for the restricted generalizability of the test conclusion. According to the critics, classical test theory neglects the discrete character of each item and its score by using sums of zero-or-one scores as indices of a subject's position on the dimension of interest. Important test parameters, such as item difficulties, item-test correlations and internal consistency, depend heavily on the sample of subjects tested and the specific items used. Arbitrary elements therefore impinge on the final conclusions about persons and items. Although classical test theory assumes that the scaling problem is solved, critics state that the scaling model is dependent on the particular test and the tested population. 

Latent trait models

An alternative class of scaling models, called latent trait models or item-response models, has recently gained in popularity (Wright, 1977; Thorndike, 1982). The latent trait models specify the relationship between observable test performance and the unobservable traits or dimensions that are assumed to underlie test performance.

Latent trait models are characterized by three fundamental notions (Hambleton et al, 1977): 

  • The first notion refers to the unidimensionality of the latent space, which assumes that all the items in a test are homogeneous in the sense of measuring only one single ability or latent trait. The unidimensionality can be secured either by the formulation of a sound theory or through a factor-analysis of the test items.
  • The second notion is the assumption of local independence which states that the test item responses of a given subject are statistically independent. This means that a subject’s performance on one item does not affect his performance on other items in the test. In effect, no other ability besides the ability under study is common to the items.
  • The third notion refers to the item-characteristic curve, which is a mathematical function that relates the probability of success on an item to the ability of a subject measured by the test. The number of parameters required to describe an item-characteristic curve depends on the particular latent trait model. The Rasch model, for example, is the most demanding latent trait model and requires only one parameter per item (its difficulty) to be estimated; a minimal sketch of its item-characteristic curve is given directly after this list.
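
As an illustration of the third notion, the Rasch item-characteristic curve can be written down directly. The short Python sketch below uses hypothetical ability and difficulty values on the common logit scale; it is only meant to show the shape of the one-parameter model, not our actual scale construction.

import math

def rasch_probability(ability, difficulty):
    """One-parameter (Rasch) item-characteristic curve."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical values: the probability of a positive item score rises with the
# difference between subject ability and item difficulty, and equals 0.5 when they match.
for ability in (-2.0, 0.0, 2.0):
    for difficulty in (-1.0, 0.0, 1.0):
        p = rasch_probability(ability, difficulty)
        print(f"ability {ability:+.1f}, difficulty {difficulty:+.1f} -> P = {p:.2f}")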

Advantages  Latent trait models have several advantages, the most important being that it is possible to estimate a subject’s ability on the same ability scale from any subset of items in the domain of items that have been fitted in the model. This enables tailored testing and the construction of tests measuring a similar dimension with different items.

MAAS-MI G and MAAS-MI MH comply with the demanding criteria of the one-parameter Rasch-model allowing for scales to be constructed independent of the studied population

A second advantage of latent trait models is that the shape of the item characteristic curve is invariant across subgroups of subjects chosen from the studied population. As a result, scales can be constructed independent of the specific sample of subjects on which the data were obtained. Furthermore, statistical models and computer programs are available that estimate the fit between the latent trait model and subjects’ responses to the items (Gustafsson, 1977; Molenaar, 1981). These programs enable researchers to select and eliminate items until the fit between the data and the model is optimal. 

In Conclusion  We observe that the latent trait models overcome the disadvantages of the scaling model in classical test theory. In the present study, the Rasch model is used, because of its attractive though demanding features, as the methodology for constructing the scales of the MAAS-MI G, the MAAS-MI MH and the Patient Satisfaction with Communication Checklist.

Reliability

Some error is involved in any type of measurement whether the subject of measurement is a person’s blood pressure or medical interviewing skills. Reliability concerns the extent to which measurements are stable over a variety of conditions in which essentially the same results should be obtained (Nunnally, 1982).

A reliable measure is one with small errors of measurement

The need for reliable measuring instruments is generally recognized and researchers are remarkably uniform about the definition of reliability. A reliable instrument is one with small errors of measurement, one that shows stability, consistency and dependability of scores for individuals on the trait, characteristic, or behavior being assessed (Mitchell, 1979).

There are at least three approaches to the reliability of observational data: observer agreement, the classical psychometric theory of reliability and generalizability theory (Mitchell, 1979). 

Observer agreement

The percentage agreement between observers is a commonly used index of the quality of data collected in observational studies, but it is not recommended because it has several shortcomings: it is insensitive to degrees of agreement; some degree of agreement can be expected on the basis of chance alone; and behaviors with very high or low frequencies will have extremely high chance levels of agreement. 

Several alternative coefficients have been developed to overcome these shortcomings. Cohen (1960) proposed, as a coefficient of agreement for nominal scales, the proportion of agreement corrected for chance, the so-called kappa. Kappa is almost never reported in scientific communications about medical interviewing skills, despite the fact that its use is recommended (Sanson-Fisher et al, 1981). Intra-class correlation, which is based on analysis of variance, forms a different index of agreement among observers (Shrout et al, 1979; Guilford et al, 1981). In intra-class correlation, the ratio of the variance of interest over the sum of the variance of interest plus error is determined and interpreted as the correlation between observers, which is taken as an indication of the reliability of the observations. Moreover, intra-class correlation enables determination of the reliability of the mean of several observers' ratings, which is important because averaging reduces the relative influence of errors of measurement and thereby enhances the relationships of interest. 
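
To illustrate how kappa removes chance agreement, the sketch below computes raw percentage agreement and Cohen's kappa for two observers scoring one dichotomous item; the 2x2 counts are hypothetical, not MAAS data.

# Hypothetical 2x2 agreement table for two observers scoring one dichotomous item
# over 100 interviews: rows = observer A (yes, no), columns = observer B (yes, no).
table = [[40, 10],
         [5, 45]]

n = sum(sum(row) for row in table)
observed = (table[0][0] + table[1][1]) / n           # raw percentage agreement

# Agreement expected on the basis of chance alone, from the marginal proportions.
a_yes = sum(table[0]) / n
b_yes = (table[0][0] + table[1][0]) / n
expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)

kappa = (observed - expected) / (1 - expected)       # Cohen (1960)
print(f"observed agreement = {observed:.2f}")        # 0.85
print(f"chance agreement   = {expected:.2f}")        # 0.50
print(f"kappa              = {kappa:.2f}")           # 0.70

An intra-class correlation would instead be computed from the variance components of an analysis of variance over subjects and observers, as sketched in the paragraph on generalizability theory below.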

Mitchell (1979) recommends the assessment of inter-observer agreement as a necessary part of the development and use of observational measures. 

Classical psychometric theory of reliability

The classical test theory views a test score as consisting of two components, the true-score and the error-score.

  • The true score reflects the presence or extent of some trait or behavior attributed to stable differences among individuals.
  • The error-score is independent of the true-score and includes real error due to random fluctuations and influences of other sources of variation.

Determination of the reliability of a test is operationalized by the correlation between two scores on the test, which enables assessment of the size of the true-score and error-score components (Mitchell, 1979; Thorndike, 1982). 

In classical test theory, three procedures are commonly used to determine reliability.

  • Firstly, intra-observer or inter-observer reliability is obtained by the correlation between separate scorings of the same instrument by one or two observers. The true score reflects real differences between subjects, whereas the error-score reflects either inconsistencies in the observer or differences between observers in their use of the instrument along with random error.
  • Secondly, split-half or alternate-forms reliability is obtained by scores on two parts of the same instrument or on two very similar instruments. The true-score reflects consistent individual differences among subjects, whereas the error-score includes random fluctuations and real differences in subject behavior between the observed subdivisions.
  • Thirdly, test-retest reliability is obtained from scores on two separate administrations of the same instrument. The true-score is assumed to reflect a stable trait or behavior. The error-score consists of changes in behavior that occur between the two test administrations in addition to random fluctuations. 
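
To make the true-score and error-score logic concrete, the sketch below simulates two parallel scorings of the same subjects and estimates reliability as the correlation between them; the numbers are invented and serve only to show that this correlation approximates the proportion of true-score variance.

import numpy as np

rng = np.random.default_rng(1)
n = 40

true_score = rng.normal(loc=50, scale=10, size=n)       # stable differences between subjects
scoring_1 = true_score + rng.normal(scale=5, size=n)    # first scoring: true score + error
scoring_2 = true_score + rng.normal(scale=5, size=n)    # second, parallel scoring

reliability = np.corrcoef(scoring_1, scoring_2)[0, 1]
print(f"estimated reliability = {reliability:.2f}")

# Under the classical model this correlation estimates
# var(true) / (var(true) + var(error)) = 100 / (100 + 25) = 0.80.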

It is evident that the constituents of true-score and error-score are dependent on the research setting in which reliability is determined. Mitchell (1979) concluded, therefore, that there is no perfect reliability coefficient, nor is there one that can be generally regarded as the best. She recommended a more inclusive theory to assess reliability; this is described in the following section. 

Generalizability theory

The generalizability theory assumes that test-scores are the result of a number of sources of variation such as subjects, observers, items or conditions. These sources of variation are called facets and a particular combination of facets makes up the universe about which test scores may be generalized. By analysis of variance, estimations are made about the contribution of each facet to the overall variation and the size of the different variance components can be established (Cronbach et al, 1972; Mitchell, 1979). The size of the components can be compared and subsequently combined to form a ratio, called generalizability coefficient, that represents the proportion of variance attributable to individual differences for a particular universe of conditions. 

The ideal data-gathering and analysis design for determination of the variance components seems to be a completely crossed multidimensional analysis of variance (Cronbach et al, 1972).

The ideal data-gathering and analysis design seems a completely crossed multi-dimensional analysis of variance

Researchers thus have to identify the different facets that are likely to be sources of variation. After data-collection, an analysis of variance design provides independent estimations of the contribution of each facet to the overall variation in the test scores. The conventional F-statistic establishes whether each facet makes a significant contribution to the scores. More importantly, the size of the variance components can be computed and compared. Finally, different generalizability coefficients can be computed depending on the universe about which researchers wish to generalize. The cost of the extra information provided by each facet is that the number of observations required is multiplied by the number of conditions sampled in the facet. 
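
The sketch below illustrates this procedure for a fully crossed subjects x observers design: mean squares from a two-way analysis of variance without replication are converted into variance components, which are then combined into a generalizability coefficient for the mean rating of several observers. The design, the simulated scores and the facet sizes are hypothetical and only mimic the kind of data discussed here.

import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_observers = 20, 6          # e.g. 20 interviews, each scored by 6 observers

# Simulated scores: subject effect + observer leniency + residual error.
subject_effect = rng.normal(scale=4.0, size=(n_subjects, 1))
observer_effect = rng.normal(scale=1.0, size=(1, n_observers))
scores = 50 + subject_effect + observer_effect + rng.normal(scale=2.0, size=(n_subjects, n_observers))

grand = scores.mean()
ss_subjects = n_observers * np.sum((scores.mean(axis=1) - grand) ** 2)
ss_observers = n_subjects * np.sum((scores.mean(axis=0) - grand) ** 2)
ss_residual = np.sum((scores - grand) ** 2) - ss_subjects - ss_observers

ms_subjects = ss_subjects / (n_subjects - 1)
ms_observers = ss_observers / (n_observers - 1)
ms_residual = ss_residual / ((n_subjects - 1) * (n_observers - 1))

# Estimated variance components for each facet.
var_residual = ms_residual
var_subjects = (ms_subjects - ms_residual) / n_observers
var_observers = (ms_observers - ms_residual) / n_subjects

# Generalizability coefficient for the mean rating of k observers (relative decisions).
k = n_observers
g_coefficient = var_subjects / (var_subjects + var_residual / k)
print(f"variance components: subjects {var_subjects:.2f}, "
      f"observers {var_observers:.2f}, residual {var_residual:.2f}")
print(f"generalizability coefficient for the mean of {k} observers = {g_coefficient:.2f}")

In this simulation the true variances are 16 (subjects), 1 (observers) and 4 (residual), so the expected coefficient for the mean of six observers is about 16 / (16 + 4/6), roughly 0.96, against 0.80 for a single observer, which illustrates why averaging over a panel of observers yields dependable scores.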

Generalizability studies are recommended by Mitchell (1979) and Thorndike (1982) for three reasons: the procedures are conservative, so the coefficients can be regarded as lower limits of the true dependability of the observational data; the generalizability study provides much additional information; and the researcher's attention is focused on the influence of other factors on behavior. See also: MAAS-MI G and MAAS-MI MH.

The practicability of MAAS-MI G and MAAS-MI MH

The issue of practicability refers to the tension between the costs of using an instrument and the goals that can be achieved. Since no procedures are available for determining the practicability of an instrument, we describe below some of our experiences with regard to the practicability of the MAAS-MI G in particular. 

The costs of developing instruments such as the MAAS-MI G and the MAAS-MI MH are high. During scale construction, much manpower is required from researcher-physicians who have both research experience and clinical experience.

The costs of developing MAAS-MI General and MAAS-MI Mental Health, in terms of blending research and clinical experience, are high

Moreover, psychologists and psychometricians often have to be consulted to provide additional expertise on methodology and psychometrics. Once developed, the costs of applying this type of instrument are low. Observers are capable of scoring the MAAS-MI G after two to three training sessions of 3 hours in which they discuss the content of the items and observe several demonstration videotapes under supervision.

However, the costs of applying MAAS-MI in medical education and clinical practice are low

Faculty, including non-medical professionals, and medical students appear to be able to observe a consultation and to score a physician’s interviewing skills on the MAAS-MI G. During their first observations, observers are generally overwhelmed by the number of items, but this diminishes after they become acquainted with the content of the items and the position of the items in the checklist. Observers are able to score approximately three interviews of 15 minutes’ duration with simulated patients in one hour and 18 to 20 interviews in one day. Almost no evidence is available on the observation of consultations in general practice. 

It is also our impression that the use of the MAAS is not limited to the Dutch situation. Our colleagues from the Ben Gurion University of the Negev, Beer Sheva, Israel, have applied the MAAS to videotaped consultations of 4th year medical students (Maoz and Katz, personal communication). We have, moreover, observed and assessed several consultations of North-American physicians without experiencing difficulties. 

Several goals can be achieved by observing medical consultations with the MAAS-MI G or MAAS-MI MH:

  • Firstly, in medical education, detailed feedback can be given to students on their interview behavior during the consultation.
  • Secondly, the quality of students'/physicians' medical interviewing skills can be assessed. The MAAS-MI G has been applied for both purposes in the undergraduate medical curriculum at Maastricht Medical School.
  • Thirdly, research into the relationship between the process of a consultation and the outcomes can be carried out.

See also Tools and Explanation.

Instrumental utility – a summary

In the preceding paragraphs, we have addressed the question of how to assess the instrumental utility of the MAAS-Medical Interview. Instrumental utility has been separated into three elements: validity, scalability and reliability.

  • Validity can be differentiated into four types: content validity, criterion-orientated validity, construct validity and the validity of experiments. To establish the validity of a test, several research settings and procedures have been developed and described in the literature.
  • The scalability of measurements can be studied by means of two classes of scaling models, which formalize the relationship between responses of subjects to a group of items and indices that represent the latent trait. The latent trait models have attractive measurement properties compared with classical test theory, although they are very demanding.
  • Determination of the reliability of observational data can be approached in three different ways: namely, by means of observer agreement, by the classical theory of reliability, and by the generalizability theory. Several methodologists recommend the application of generalizability theory because the coefficients do not inflate reliability and do provide much additional information.

The question of the instrumental utility of the MAAS-Medical Interview can be answered by separating the question into the afore-mentioned aspects. Each aspect provides the conceptual framework for the development and application of specific research settings, procedures and statistical analyses. The research settings are elaborated on below. 

From theory to practice

After reviewing the constituents of instrumental utility, the question is raised: What research has to be conducted to assess the instrumental utility of the MAAS-Medical Interview?

The review has made clear that one research setting will not suffice to answer all these questions. It also suggested the research settings that should be developed and carried through to assess the usefulness of our measurements of medical interviewing skills. In the following paragraphs, we describe the research settings and accompanying extensions that were originally conceived for the purpose of answering questions pertaining to the validity, reliability and scalability of the MAAS (Crijnen et al, 1984; Kraan et al, 1984). Subsequently, an overview is provided of the studies that have been carried out, the chapters in which they are described and the studies which have to be carried out in the near future. 

Research setting 1: simulated consultation hours with 40 residents in family medicine

This research setting, together with some small extensions, forms the core of our study because it enabled us to answer the major research questions.

Simulated Consultation Hours  A simulated consultation hour was organized in which 40 residents in Family Medicine interviewed four simulated patients who presented different cases (Crijnen et al, 1986). Each resident was asked to behave as if he had taken charge of a colleague’s practice and had to perform a complete medical consultation with each simulated patient. Two cases represented difficult yet frequently occurring somatic problems (myocardial infarction, inception of diabetes mellitus), whereas two other cases represented psychological problems (major depression, anxiety states). 

The following research questions were studied:

  • Reliability  Reliability was determined by means of a generalizability study by analyzing observations of a group of 6 observers who observed videotaped interviews of 20 somatic case presentations for MAAS-G and 20 psychological case presentations for MAAS-MH. The studies are described in MAAS-MI G and MAAS-MI MH.
  • Scalability  Scalability of the scales in MAAS-G and MAAS-MH was studied by means of Rasch-analyses on 100 interviews of physicians/medical students with patients who presented the myocardial infarction case (MAAS-MI G) and on 100 interviews of physicians/medical students with patients who presented a dysthymic disorder (MAAS-MI MH). The number of observations was achieved by extending the 40 observations obtained during the simulated-consultation hour to a total of 100 observations. The studies are described in MAAS-MI G and MAAS-MI MH.
  • Convergent and divergent validity  Convergent and divergent validity of the MAAS-G and the MAAS-MH were studied as part of construct validity by means of two multitrait-multimethod matrices that were constructed by correlating four different methods of measurement of the physician’s medical interviewing skills. Data were obtained during the simulated-consultation hour. Studies are described in MAAS-MI G and MAAS-MI MH.
  • Nomological Network  The nomological network of medical interviewing skills was studied as part of construct validity by measuring the physician’s medical interviewing skills by means of MAAS-MI G and MAAS-MI MH, by the assessment of diagnosis and treatment plan and by the assessment of the patient’s satisfaction with the quality of the communication. To facilitate determination of the nomological net, the Patient Satisfaction with Communication Checklist was constructed. Data were obtained during the simulated consultation hours. The nomological net of MAAS-MI G has been analyzed although not yet described. The study of the nomological net of MAAS-MI MH has been analyzed and is described in MAAS-MI MH.
  • Patient Satisfaction with Communication  The construction of the Patient Satisfaction with Communication Checklist was accomplished by extending the 160 checklists which were filled in during the simulated consultation hour with 117 checklists filled in by real patients after they had consulted their own physician. This procedure anchored the PSCC in reality and enabled us to apply Rasch-analyses during scale construction. Scale construction of this checklist is described in Patient Satisfaction with Communication.

Research setting 2: interviewing skills and medical competence

This research setting enabled us to establish the MAAS-MI G’s convergent and divergent validity as elaborated by Kerlinger (1981) and Thorndike (1982). In this setting, physicians’ medical interviewing skills during a consultation with a simulated patient were measured together with their medical knowledge, interpersonal skills, care and concern for the patient, and medical problem-solving skills. The study is described in Interviewing Skills & Clinical Competency

Research setting 3: growth of interviewing skills during medical school

In this research setting, treatment effects were considered to provide evidence of construct validity. By assessing the quality of interviewing skills of all 563 undergraduate students at Maastricht Medical School at one point in time, we had the opportunity to reveal the patterns of growth of interviewing skills during medical school. In addition to the evidence on construct validity, this study provided information on the susceptibility of interviewing skills to influences from the medical curriculum. The study is presented in Growth in Interviewing Skills

Research setting 4: medical interviewing skills during consultation hours of general practitioners

The purpose of this study is to enhance external validity and to provide information about the nomological net of the MAAS-MI G and MAAS-MI MH. Medical consultations of 30 General Practitioners with 600 patients will be recorded and observed with MAAS-MI G and MAAS-MI MH. Moreover, physicians' perception of the quality of the communication and the established diagnosis will be recorded. Patients will be asked to fill in the Patient Satisfaction with Communication Checklist and the General Health Questionnaire, which determines whether patients experience psychological problems. This study has not yet been carried out. 

It is clear that the study of the instrumental utility of the MAAS-MI G and MAAS-MI MH can be separated into many smaller studies. The interested reader will surely be able to conceive of additional studies to test the instrumental utility of our measurements of medical interviewing skills. We have confined ourselves here to studies that were originally developed to assess reliability, scalability and validity of both instruments. A summary of studies on the instrumental utility of MAAS-MI G and MAAS-MI MH is given in Table 1 and Table 2.

Table 1 -- Establishing the instrumental utility of the MAAS-MI - General
Table 2 -- Establishing the instrumental utility of the MAAS-MI - Mental Health

The validity of simulated patients

Since the studies presented in this thesis rely heavily on the use of simulated patients, we reviewed the – not too abundant – literature about the validity of simulated patients and the communication between physicians and simulated patients. Simulated patients have several attractive characteristics which make them a valuable tool in medical education and in research into the medical interview: they are available at times appropriate to the programme; they can be trained to present a wide array of problems; they can be interviewed repeatedly, which yields the advantage of standardizing the clinical situation; and they are able to provide critical feedback. 

Several criteria were formulated for the use of simulated patients in educational settings that also hold true for research situations (Norman et al, 1982): 

  • Credibility or face-validity  Several studies reveal that simulated patients are almost indistinguishable from real patients;
  • Comprehensiveness  In comparison with written types of simulation or computer problems, simulated patients can be used to simulate virtually all aspects of the physician-patient encounter;
  • Precision  Well-trained simulated patients appear to provide a consistent clinical picture from one encounter to the next;
  • Validity  The differences in interviewing styles of physicians who talked with real and simulated patients have been studied several times. No differences were observed in the number of questions on history-taking and physical examination between interviews with real and simulated patients (Norman et al, 1982). No differences were observed in the level of empathy expressed by medical students who interacted with real and simulated patients presenting psychological problems (Sanson-Fisher et al, 1980). Our own study revealed that simulated patients presented their roles very naturally according to the judgments made by general practitioners and residents in primary care (Crijnen et al, 1986). 

We conclude, therefore, that simulated patients form an accurate and valid representation of real patients because their performance resembles the behavior of real patients. Furthermore, physicians do not seem to have different interview styles when talking to real or simulated patients.

References

Selected Reading

Campbell DT, Fiske DW. Convergent and discriminant validation by the multi-trait multi-method matrix. Psychological Bulletin, 1959; 56: 81-105.

Groot AD de. Methodologie (Methodology). Mouton, ‘s-Gravenhage, 1972.

Guilford JP, Fruchter B. Fundamental statistics in psychology and education. McGraw-Hill, London, 1981. 

Gustafsson JE. The Rasch-model for dichotomous items: theory, applications and a computer program. Reports from the Institute of Education, No. 64, University of Goteborg, Sweden, 1977. 

Mitchell SK. Interobserver agreement, reliability and generalizability of data collected in observational studies. Psychological Bulletin, 1979; 86: 376-390. 

All References


Berkel HJM van. De diagnose van toetsvragen (The diagnosis of test items – dissertation). Universiteit van Amsterdam, Amsterdam, 1984. 

Campbell DT, Fiske DW. Convergent and discriminant validation by the multi-trait multi-method matrix. Psychological Bulletin, 1959; 56: 81-105.

Campbell DT, Stanley JC. Experimental and quasi-experimental designs for research. Rand McNally, Chicago, 1966. 

Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960; 20: 37-46. 

Cook TD, Campbell DT. Quasi-experimentation. Rand McNally, Chicago, 1979. 

Cronbach LJ, Meehl PE. Construct validity in psychological tests. Psychological Bulletin, 1955; 52: 281-302. 

Cronbach LJ. Essentials of psychological testing. Harper and Row, New York, 1970. 

Cronbach LJ, Gleser GC, Nanda H, Rajaratnam N. The dependability of behavioral measurements: theory of generalizability for scores and profiles. John Wiley and Sons, New York, 1972. 

Crijnen AAM, Kraan HF. Reliability and validity of the Maastricht History-taking and Advice Checklist in General Practice (research proposal). Department of Social Psychiatry, Rijksuniversiteit Limburg, Maastricht, 1984. 

Crijnen AAM, Thiel J van, Kraan HF. Evaluatie van consultvoering: een spreekuur nagebootst (Evaluation of a medical consultation: simulating consultation hours). Huisarts en Wetenschap, 1986; 29: 316- 318. 

Fiske DW. Measuring the concepts of personality. Aldine Publishing Company, Chicago, 1971. 

Groot AD de. Methodologie (Methodology). Mouton, ‘s-Gravenhage, 1972.

Guilford JP, Fruchter B. Fundamental statistics in psychology and education. McGraw-Hill, London, 1981. 

Gustafsson JE. The Rasch-model for dichotomous items: theory, applications and a computer program. Reports from the Institute of Education, No. 64, University of Goteborg, Sweden, 1977. 

Hambleton RK, Cook LL. Latent trait models and their use in the analysis of educational test data. Journal of Educational Measurement, 1977; 14: 75-96. 

Kerlinger FN. Foundations of behavioral research. Holt, Rinehart and Winston Inc., New York, 1981. 

Kraan HF, Crijnen AAM, DeVries M, Zuidweg J, Imbos T. Are medical interviewing skills teachable? Perspectief, 1986; 4: 29-51. 

Kraan HF, Crijnen AAM. Reliability and validity of the Maastricht History-taking and Advice Checklist in Primary Mental Health Care (research proposal). Department of Social Psychiatry, Rijksuniversiteit Limburg, Maastricht, 1984. 

Mitchell SK. Interobserver agreement, reliability and generalizability of data collected in observational studies. Psychological Bulletin, 1979; 86: 376-390. 

Molenaar IW. Programma beschrijving van PML voor het Rasch model (Description of the PML program for the Rasch model – version 3.1). Heymans Bulletin, Vakgroep Statistiek en Meettheorie, Universiteit van Groningen, Groningen, 1981. 

Nunnally JC. Psychometric theory. McGraw-Hill, London, 1967. 

Nunnally JC. Reliability of measurement. In: The encyclopedia of educational research. Macmillan and Free Press, New York, 1982. 

Norman GR, Tugwell P, Feightner JW. A comparison of resident performance on real and simulated patients. Journal of Medical Education, 1982; 57: 708-715. 

Philipsen H. Onderzoek als datamatrix (Research as datamatrix- samenvatting college). Rijksuniversiteit Limburg, Maastricht, 1984. 

Sanson-Fisher R, Fairbairn S, Maguire P. Teaching skills in communication to medical students – a critical review of the methodology. Medical Education, 1981; 15: 33-37. 

Sanson-Fisher RW, Poole AD. Simulated patients and the assessment of medical students' interpersonal skills. Medical Education, 1980; 14: 249-253. 

Shrout PE, Fleiss JL. Intraclass correlation: uses in assessing rater reliability. Psychological Bulletin, 1979; 86: 420-428. 

Thorndike RL. Applied psychometrics. Houghton Mifflin Cie., Boston, 1982. 

Wright BD. Solving measurement problems with the Rasch model. Journal of Educational Measurement, 1977; 14: 97-116.