Abstract
Background: Multisource feedback (MSF) is increasingly being used as one of the components in revalidation and recertification processes to guide physicians' continuing professional development. Data provided by co-workers (e.g., nurses, pharmacists, technicians) are recognized as integral for assessing a physician's communication, teamwork and interprofessional abilities. The purpose of this study was to examine both the reliability of co-worker scores and the association between co-worker familiarity and physician ratings as both affect perceptions of the quality of feedback and the likelihood that recipients will take their feedback seriously.
Method: MSF data from 9674 co-workers of 1341 Alberta physicians across 9 specialty groups were analyzed. Internal consistency analyses and generalizability theory (G and D studies) were used to assess reliability. The association between co-worker familiarity and the MSF scores co-workers provided to physicians was assessed using ANOVA.
Results: Cronbach's alpha for all co-worker tools was > 0.90. Generalizability coefficients (EP2) varied by specialty and ranged from 0.56 to 0.72. D studies revealed that a minimum of 11 co-workers is necessary to achieve stability (i.e., EP2 > 0.70). Co-worker familiarity exerted a significant (p < .001) positive main effect on physician performance scores across all specialty groupings.
Conclusions: This study confirms the reliability of co-worker scores and provides evidence that co-worker MSF data are stable and consistent for the purposes of providing physicians with feedback for professional development. Attention, however, needs to be paid to co-worker/physician familiarity, as this relationship may favourably bias physician performance scores.
AUTHOR DETAILS
Gregg C. Trueman, PhD, NP is a nurse practitioner and associate professor in the School of Nursing and Midwifery, Faculty of Community and Health Studies, Mount Royal University: Y348, 4825 Mount Royal Gate SW, Calgary, Alberta, Canada. This research was conducted as part of his doctoral program under the supervision of Dr. Jocelyn Lockyer at the University of Calgary.
Jocelyn M. Lockyer, PhD is professor and Senior Associate Dean-Education in the Faculty of Medicine at the University of Calgary.
KEYWORDS
multisource feedback; workplace-based assessment; collaborative practice; co-worker feedback.
CO-WORKER FAMILIARITY AND PHYSICIAN MULTISOURCE FEEDBACK
MSF is a reliable workplace-based assessment strategy[1] used to collect data on the performance of practicing physicians, and has been used extensively to assess physicians and guide continuing professional development in Canada,[2,3] the USA,[4,5] the UK[6,7] and Europe.[8,9] Generally, in an MSF assessment, patients, co-workers (e.g., nurses, technicians, pharmacists), and medical colleagues provide data. There may also be a self-assessment questionnaire. In some cases, the co-worker and medical colleague data are combined.[10] In other practice settings, two different questionnaires with different items for medical colleagues and non-physician co-workers are used to provide data.[11] When one questionnaire is used for both physician and non-physician assessors, and the feedback report is an aggregate of these data, the unique perspective provided by non-physician co-workers is lost.
The College of Physicians and Surgeons of Alberta (CPSA) Physician Achievement Review (PAR) Program (www.par-program.org) provides a unique opportunity to examine MSF data provided by co-workers across nine specialty groups. The instruments, adapted and implemented over a 12-year period, were first developed for family physicians[12] and later adapted for eight other specialty groups to inform professional development. In the PAR program, the co-worker is a source distinct from the medical colleague, and co-worker assessment questionnaire (CAQ) items differ from items on the medical colleague questionnaire. While each co-worker instrument varies by specialty, the focus is on interprofessional teamwork (i.e., communication, professionalism and collaboration) and not on medical expertise. This is in direct contrast to UK programs where co-workers complete a colleague instrument that includes both MD and non-MD respondents.[13,14] In part, the CPSA sought to ensure that its tools captured different aspects of the physician's work; different instruments with distinct foci, completed by multiple groups, would help ensure a broad range of feedback. Each physician participates in the PAR program every five years and identifies the medical colleagues and co-workers who will provide the data. Reliability speaks to the reproducibility of feedback data, something that is particularly relevant when fewer co-workers are available to provide feedback. Thus, uptake of co-worker data for professional development purposes is predicated on reliable instrumentation.[15]
The PAR instruments were psychometrically assessed with an examination of evidence for validity[16-19] and reliability[20-25] when they were first developed. While aspects of reliability have been re-examined for physician groups that have participated in PAR on more than one occasion,[12] there has not been a comprehensive examination of the co-worker instruments across all PAR specialties, specifically of their reliability, a key component of validity. Of particular concern in the present study are characteristics of MSF design (e.g., the number of co-workers providing information) known to influence the stability of MSF feedback.[26] Given the multiple sources of error in MSF data, internal consistency measures - while necessary - are insufficient to establish instrument reliability. Generalizability theory expands on alpha by using two types of studies - generalizability (G) and decision (D) studies - to quantify the amount of variance associated with different factors or facets, and to provide reliability evidence for measurement protocols (i.e., optimal numbers of items and assessors) in the fully nested, unbalanced behavioural measures that constitute PAR data.[27]
Also of interest is the influence of familiarity - between the co-worker and the index physician - on co-worker performance scores.[11] Initial MSF work did not find a definitive association that would preclude physicians selecting their own co-workers.[16,28] More recently, however, familiarity has been correlated with more favourable feedback in UK studies.[29] In a study of 68 underperforming physicians referred to the UK's National Clinical Assessment Service, physicians who selected their own co-worker assessors were more likely to obtain higher scores than physicians whose assessors were selected for them.[30] Research involving postgraduate trainees also found that the length of the working relationship (i.e., familiarity) between assessor and trainee influenced MSF scores.[13]
This study was undertaken to address the following questions: (1) what is the reliability of the MSF scores provided by PAR co-worker questionnaires?; and (2) what is the association between familiarity and the scores provided by physician-selected co-workers? It was believed that this information would inform instrument revision and the uptake of co-worker feedback for professional development purposes.
METHODS
Pivotal Research Inc, a company that administers the PAR program on behalf of the CPSA, created an anonymous dataset of co-worker data, collected between January 2006 and April 2011, for 150 physicians from each of the nine specialty groups (i.e., anesthesia, diagnostic imaging, episodic medicine, family medicine, laboratory medicine, medical specialists, pediatrics, psychiatry and surgery). The dataset represented the most recent assessments of physicians in each specialty grouping. For each physician, responses from up to eight co-worker respondents were provided. Depending upon the specialty grouping, the co-worker questionnaire contained between 17 and 22 items on a five-point Likert scale, with a sixth "unable to assess" option. The co-worker's self-reported familiarity with the physician was also provided on a five-point Likert scale [i.e., 1 = "not at all"; 2 = "not well"; 3 = "somewhat"; 4 = "well"; and 5 = "very well" familiar]. Data describing the physician - sex, medical school (Canadian or international), years since graduation, location of practice (urban, regional, or rural), and number of times the physician had participated in PAR - were available for physicians in each specialty group. No data describing the co-workers providing feedback were available for analysis.
ANALYSIS
Descriptive statistics were calculated for all physician socio-demographic variables. At the instrument level, the number of questionnaires and items, the mean number of co-workers per physician, and the mean score with its standard deviation and range were calculated. Cronbach's alpha was calculated to examine each tool's internal consistency. MSF designs can threaten instrument reliability with several sources of error variance in that the data are uncrossed (i.e., co-workers rate the physician on only one occasion), unbalanced (i.e., there are different numbers of co-workers providing input for each physician), and fully nested (i.e., co-workers for each physician are unique to that physician). A G study uses repeated-measures analysis of variance to simultaneously quantify the variance embedded in each facet; these facets interact to create error that is not captured in the inter-item correlation matrices underpinning internal consistency reliability.[31] G studies calculate variance components expressed as a coefficient (EP2), where EP2 > 0.70 is generally considered the minimum threshold suitable for MSF instruments of similar intent.[32] For all co-worker tools, G studies were performed to estimate the variance associated with different facets: the physician, the co-worker, the questionnaire item, and residual error (i.e., measurement artifact). D studies then used the variance components derived from each G study to estimate how changes to the measurement protocol used to collect physician feedback (e.g., the number of questionnaire items or the number of co-workers) would affect the stability of that feedback.
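For readers less familiar with the approach, one standard formulation of the relative generalizability coefficient for a design in which co-workers (raters, r) are nested within physicians (p) and crossed with items (i) is sketched below; it is illustrative only, as the exact composition of the error term depends on the facets estimated in a given G study:

E\rho^2 = \frac{\sigma^2_{p}}{\sigma^2_{p} + \frac{\sigma^2_{r:p}}{n_r} + \frac{\sigma^2_{ri:p,e}}{n_r n_i}}

where \sigma^2_{p} is the physician (object of measurement) variance, \sigma^2_{r:p} is the rater-nested-within-physician variance, \sigma^2_{ri:p,e} is the rater-by-item-within-physician variance confounded with residual error, and n_r and n_i are the numbers of co-workers and items. A D study re-evaluates this expression with alternative values of n_r and n_i.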
To assess the influence co-worker familiarity exerted on physician performance scores, a one-way analysis of variance (ANOVA) with planned comparisons was used to identify differences between group means across the five levels of familiarity.[33] Effect sizes were calculated to determine the direction and magnitude of the effect of familiarity on performance scores and to differentiate statistical from clinical significance.[34]
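As an illustration of this analytic approach, the minimal sketch below runs a one-way ANOVA, an eta-squared effect size, and a planned comparison of more familiar versus less familiar co-workers on hypothetical simulated scores; it does not use the PAR dataset, and the original analyses may have been run in different software with different contrast specifications.

# Illustrative sketch only: hypothetical simulated data, not the PAR dataset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical mean item scores (1-5 scale) grouped by the five familiarity levels
groups = {
    "not at all": rng.normal(4.2, 0.5, 40).clip(1, 5),
    "not well":   rng.normal(4.3, 0.5, 60).clip(1, 5),
    "somewhat":   rng.normal(4.4, 0.5, 200).clip(1, 5),
    "well":       rng.normal(4.6, 0.4, 400).clip(1, 5),
    "very well":  rng.normal(4.7, 0.4, 300).clip(1, 5),
}

# Omnibus one-way ANOVA across the five familiarity levels
f_stat, p_value = stats.f_oneway(*groups.values())

# Eta-squared effect size: between-group sum of squares over total sum of squares
all_scores = np.concatenate(list(groups.values()))
grand_mean = all_scores.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
ss_total = ((all_scores - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

# Planned comparison: familiar (well, very well) vs. less familiar (the rest)
familiar = np.concatenate([groups["well"], groups["very well"]])
less_familiar = np.concatenate([groups["not at all"], groups["not well"], groups["somewhat"]])
t_stat, p_contrast = stats.ttest_ind(familiar, less_familiar, equal_var=False)

print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}, eta^2 = {eta_squared:.3f}")
print(f"Planned comparison (familiar vs. less familiar): t = {t_stat:.2f}, p = {p_contrast:.4f}")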
Ethics approval for this research was provided by the University of Calgary Conjoint Health Research Ethics Board.
RESULTS
Data from 9674 co-workers, provided to 1341 physicians, were analyzed. Physician socio-demographic characteristics are summarized in Table 1. The majority of physicians were male, practiced in an urban setting, and had graduated from Canadian medical schools a mean of 22.6 years earlier [range = 8 to 49 years]. With the exception of diagnostic imaging and laboratory medicine, most physicians had received PAR feedback on a previous occasion. At the level of the questionnaire, the mean number of co-workers per physician ranged from 6.32 for family physicians to 7.48 for diagnostic imaging (Table 2). Overall, co-workers' ratings were negatively skewed, leading to a restriction of range at the upper end of the scale and toward favourable views of physician performance (Table 2). Mean item scores (standard deviations) ranged from a low of 4.04 (0.94) for psychiatry to a high of 4.81 (0.49) for diagnostic imaging.
RELIABILITY
Cronbach's alpha values across the collection of questionnaires were very high, ranging from 0.90 (episodic medicine) to 0.96 (diagnostic imaging), providing evidence for each tool's internal consistency reliability (Table 3). G coefficients ranging from 0.56 to 0.72 indicated a different picture of reliability across the collection of tools, providing new evidence for how behavioural measurement can introduce unique sources of variance that influence MSF reliability (Table 3). The facet with the greatest variance was rater by item nested within physician, ranging from 22.62% (anesthesia) to 43.41% (diagnostic imaging), indicating significant variation within the co-worker/physician relationship. Additional error variance across the CAQ, ranging from 19.49% (anesthesia) to 49.27% (laboratory medicine), was attributed to rater nested within physician. G coefficients, and their associated standard errors of measurement, ranged from a low of 0.56 (0.020) and 0.56 (0.022) for the diagnostic imaging and episodic medicine instruments, respectively, to a high of 0.72 (0.024) for the medical specialist CAQ.
D studies were conducted to determine whether the number of questionnaire items or the number of co-workers was sufficient to produce stable feedback for the individual physician. The G studies demonstrated a relatively small variance component attributable to CAQ items across specialty groupings. As such, increasing the number of items resulted in little change in the G coefficient. However, increasing the number of co-workers providing feedback did produce higher G coefficients across the entire collection of co-worker tools (Table 4). With the exception of the laboratory medicine tool, our review found that a minimum of 11 co-workers was required to provide stable data with a coefficient > 0.70.
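To make the D-study logic concrete, the short sketch below projects EP2 for increasing numbers of co-workers using hypothetical variance components chosen only to mirror the pattern reported here; the actual PAR components were estimated with urGENOVA and are not reproduced in this illustration.

# Illustrative D-study projection with hypothetical variance components;
# the actual PAR components are not shown here.

def g_coefficient(var_p, var_r_p, var_ri_p, n_raters, n_items):
    # Relative G coefficient for raters nested within physicians, crossed with items:
    # physician variance over physician variance plus relative error variance.
    relative_error = var_r_p / n_raters + var_ri_p / (n_raters * n_items)
    return var_p / (var_p + relative_error)

# Hypothetical variance components (physician; rater:physician; rater-by-item:physician + residual)
var_p, var_r_p, var_ri_p = 0.05, 0.20, 0.30
n_items = 20  # PAR co-worker questionnaires contain 17-22 items

for n_raters in (6, 8, 10, 11, 12, 15):
    ep2 = g_coefficient(var_p, var_r_p, var_ri_p, n_raters, n_items)
    print(f"co-workers = {n_raters:2d}  ->  projected EP2 = {ep2:.2f}")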
FAMILIARITY
Overall, 72.4% [range 68% - 78%] of co-workers reported their degree of familiarity with the physician ratee as part of their responses (Table 5). MSF scores were significantly different between familiarity groups (p < .001), producing a moderate to moderately large effect size across all specialty groupings. These data indicate that, regardless of specialty grouping, as co-worker familiarity with the physician increased, so too did the performance scores co-workers assigned. Furthermore, planned comparison analyses showed that co-worker familiarity (i.e., well and very well) was associated with higher PAR scores compared with co-workers who were less familiar (i.e., not at all, not well, and somewhat) with the physician they were assessing. Though these findings are inconclusive as to magnitude, they nonetheless identify a direct, linear relationship between co-worker familiarity and the performance scores assigned to the physician.
DISCUSSION
This study examined non-physician co-worker feedback provided to physicians in nine medical specialties, with up to 150 physicians per specialty. The PAR co-worker tools were developed over several years but had not been reviewed as a unified collection of workplace-based assessments. Using data collected over a five-year period, this study enabled a comparison of co-worker feedback across specialty groupings, a review of the tools' internal consistency and generalizability coefficients, and an examination of the influence of assessor familiarity on performance scores.
All co-worker tools, including their respective sub-scales, demonstrated high internal consistency, with alpha scores > 0.90 and small standard errors of measurement, providing evidence for the reliability of the CAQ across specialty groupings. While the range of EP2 approached 0.70 with eight co-workers, the observed reliability coefficients were lower across all specialty groupings, and our D studies suggested that six of nine PAR specialties require a minimum of 11 co-workers for stable data, a finding similar to a UK study.[10] These findings suggest that while current co-worker data are reliable for providing formative professional development information, our generalizability coefficients do not support using co-worker feedback alone for high-stakes practice decisions, a finding supported by UK research.[10]
G studies have informed MSF research and influenced data collection procedures concerning the number of raters required to provide reliable data and the number of items needed on MSF questionnaires. However, G studies have not considered how the data collection process itself may influence the feedback provided to individual physicians. In fact, the variance components drawn from MSF G studies are rarely reported or published in research studies.[35]
Our study examined the variance components associated with multisource feedback [e.g., rater by item nested within the physician (ri:p) and rater nested within the physician (r:p)]. In this study, the "rater nested within physician" and "rater by item nested within physician" facets contributed the greatest amount of variance. Our study therefore points to the need to consider the influence that data collection processes have on the quality of co-worker ratings in the workplace setting. A review of the quality assurance processes associated with MSF may need to include strategies that guide physicians in how, and whom, they select as co-worker assessors. It may also require a more robust understanding of how co-workers make assessment decisions about physician practice. Even knowing the value that physicians place on specific items addressed in co-worker questionnaires would be valuable in designing or revising MSF instrumentation.
CONCLUSION
This study confirms the reliability of co-worker scores and provides evidence that co-worker MSF data are stable and consistent for the purposes of physicians' continuing professional development. Indeed, the co-worker instruments that are part of the suite of PAR instruments have been adopted in two other Canadian jurisdictions as part of their processes for physician appraisal and professional development.[36] Our study provides further insight regarding the role that co-worker familiarity plays in physician assessment. Unfortunately, the absence of co-worker socio-demographic information precluded a more robust exploration of this relationship. Nevertheless, our findings suggest that a closer look is indicated at how physicians select co-workers and at other factors (e.g., how co-workers make scoring decisions) that may influence PAR performance scores.
ACKNOWLEDGEMENTS
I would like to thank Dr. Pamela Nordstrom, Dr. Tanya Beran and Dr. Kent Hecker, who supported my doctoral research at the University of Calgary and provided feedback on the development of this manuscript. Special thanks to Dr. David Keane at McMaster University, who consulted on the urGENOVA software.
REFERENCES