Let’s face it: we’re all biased. For example, in studies of letters of evaluation and recommendation in academia, men are more often described with words considered “standout”1 (e.g., excellent, superb, exceptional) or related to “ability”2 (e.g., talent, intelligent, capacity, analytical), and with an emphasis on “research”,3 while women are more often described using words considered “grindstone”1, 2 (e.g., hardworking, conscientious, diligent), and with an emphasis on “teaching”;3 other studies have found that women are more commonly described with “communal” words (e.g., compassionate, caring, empathetic).3

In our educational system, descriptive evaluations of trainees have been a hallmark of assessment in the clinical setting. Gingerich et al.4 explored differing perspectives on the role of assessors (e.g., faculty) as trainable, fallible (biased), or meaningfully idiosyncratic. In discussing fallibility, the authors note that humans are not (and cannot be) passive observers; we activate cognitive processes to remember and make sense of our observations, yet these same processes are the sources of bias.

Two articles in this issue of JGIM address bias (gender and race) in the assessment of medical students and residents. Rojek and colleagues5 conducted an observational study of over 87,000 core clerkship evaluations of medical students at two geographically distinct medical schools. They employed natural language processing to identify differences between the text of evaluations of women and men, and between evaluations of students underrepresented in medicine (URIM) and non-URIM students. Their analysis demonstrated significant differences in narrative language associated with gender and URIM status, even among students receiving the same grade. Female students and URIM students were significantly more likely to be described by personal attributes rather than competency-based behaviors. Specifically, of the words that differed between men and women, 62% represented personal attributes and were found more commonly for women, while 19% represented competency-related behaviors and were applied more commonly to men. Of the words that differed between URIM and non-URIM students, comments about personal attributes were more common for URIM students, while competency-based comments were more frequent in evaluations of non-URIM students. The study also identified the ten most commonly used words that did not differ between the groups, suggesting that these words may convey little meaningful information about a student. The findings highlight the presence of bias within the descriptive evaluation of medical students, raise questions about the validity of these assessments in the evaluation process, including the Medical Student Performance Evaluation (MSPE), and serve as a call to action for concerted efforts to address and eliminate bias in all forms across the learning environment.
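For readers curious about the mechanics of such word-level comparisons, the sketch below illustrates one common approach: count how many evaluations in each group contain a given word and test whether those proportions differ. This is a minimal illustration only, not the pipeline Rojek and colleagues actually used; the snippet texts, tokenizer, and choice of Fisher’s exact test are assumptions made for brevity.

```python
# Illustrative sketch: flag words whose document frequency differs between
# two groups of narrative evaluations. NOT the authors' actual method;
# the texts, tokenizer, and statistical test are assumptions.
import re
from collections import Counter

from scipy.stats import fisher_exact


def doc_frequencies(texts):
    """Count, for each word, how many documents in `texts` contain it."""
    counts = Counter()
    for text in texts:
        counts.update(set(re.findall(r"[a-z']+", text.lower())))
    return counts


# Hypothetical evaluation snippets (real corpora run to tens of thousands).
group_a = [
    "She is hardworking and compassionate with patients.",
    "A caring, diligent student who is always smiling.",
]
group_b = [
    "He is an exceptional student with strong analytical skills.",
    "An excellent, standout performer and a natural leader.",
]

freq_a, freq_b = doc_frequencies(group_a), doc_frequencies(group_b)
n_a, n_b = len(group_a), len(group_b)

results = []
for word in set(freq_a) | set(freq_b):
    a, b = freq_a[word], freq_b[word]
    # 2x2 table: documents containing the word vs. not, by group.
    _, p = fisher_exact([[a, n_a - a], [b, n_b - b]])
    results.append((p, word, a, b))

# With a realistic corpus one would apply a significance threshold and a
# multiple-comparison correction; here we simply list the most skewed words.
for p, word, a, b in sorted(results)[:5]:
    print(f"{word!r}: group A {a}/{n_a}, group B {b}/{n_b}, p = {p:.2f}")
```

In practice, studies of this kind also normalize for evaluation length, group words into validated categories (e.g., standout, grindstone, communal), and adjust for grade and rotation before drawing conclusions.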

Klein et al.6 performed a systematic review of the literature from 1998 to 2018 examining the presence and influence of gender bias in resident assessment. Five of the nine included studies uncovered evidence of gender bias, with women receiving lower ratings in some performance domains and qualitatively different comments. In the largest study,7 of emergency medicine (EM) residents, the differences in ratings were equivalent to adding several months of training for women to “catch up” to their male counterparts. In a related study8 of the qualitative comments made by faculty in these same EM programs, the “ideal” trainee had more stereotypically masculine traits; feedback to struggling trainees differed by gender (men received more consistent feedback, while women received more discordant feedback, especially with regard to autonomy and assertiveness); and comments about a tension between autonomy and openness to feedback were found only in narratives about women. In other included studies, the qualitative comments women received from faculty focused more often on communal or warmth-based descriptors and less often on agentic (related to agency) or competency-related descriptors. The authors concluded that gender bias poses a potential threat to the integrity of resident assessment and that there is a need to identify and understand the source(s) of the bias (e.g., in the assessment, the learner, the faculty), as well as its longer-term consequences.

Since that review was conducted, two further articles have addressed gender bias in graduate medical education (GME). In an observational study of one academic internal medicine training program,9 women scored higher in two domains (medical interviewing, and interpersonal and communication skills), although the absolute differences were small (effect sizes of approximately 0.12–0.15). A second, qualitative study10 of the narrative comments on resident trainees across nine surgical specialties at one institution found that men were consistently described with more positive comments than women. The authors noted that in their study, “…women were often described as having a potential to succeed while men were simply expected to succeed.”10 In addition, some phrases were written only about women, notably in the area of disposition and humanism (e.g., “always smiling”). Finally, while there were no differences in the relative use of communal, grindstone, and ability words, men were more likely to receive “standout” words, and the word “leader” was used three times more often for men than for women. Medical students experience similar differences in the use of standout words and in feedback about assertiveness and speaking with confidence.11

Similarly, there are differences in how medical trainees of different racial groups are described. In a study of MSPEs12 for 6000 applicants from 134 medical schools applying to 16 residency programs at one institution, White students were more likely to be described using “standout” or “ability” words, while Black students were more likely to be described as “competent.” In another single-institution study, representing 4655 medical students from 123 US medical schools, Black students were less likely than White students to be members of the Alpha Omega Alpha Honor Medical Society, even after controlling for other measures of achievement.13 In a qualitative study of 27 underrepresented (Black, Hispanic, Native American) residents in GME training,14 themes emerged around microaggressions and bias, being tasked to be ambassadors, and challenges of personal and professional identity formation. The practice of routinely citing race (about which we typically don’t ask patients) in introductory statements15 (“This is a 65-year-old African-American woman…”) reflects the difficulty of shifting views of race from a long-held and perpetuated biologic perspective to that of a social construct.16 Given such deeply ingrained, specious beliefs, it is not surprising, but very concerning, to see these influences in the descriptions of trainees.

While it would be naïve of us to believe we will eliminate all bias in assessment programs, we must commit ourselves to action. We must address the gender and racial bias evident in the descriptive evaluations faculty provide about trainees, because fairness and equity in the process of assessment (fairness to trainees, faculty, and society) are essential to the integrity of our learning environment. What does this mean for our educational and assessment processes, programs, and leaders? The answer: Let’s face it. And then we have to change it.

Morgan et al.17 outlined a series of steps for addressing gender-based inequity that are equally applicable to other forms of bias. The papers in this issue help address the first step: being explicit and naming the systematic nature of the bias. For those leading and participating in educational programs, understanding and highlighting the words people use in their assessments is part of making bias more explicit, as is empowering trainees to come forward when they read these words about themselves. Implementing educational interventions that focus on dialogue among participants is another means of making underlying bias explicit.18 Other steps the authors suggest include advancing efforts to promote equity, incorporating explicit evidence-based training, and increasing transparency on issues such as academic promotion.

We would not recommend eliminating or marginalizing the descriptive evaluation of trainees, as that would not address the underlying issue of bias (e.g., gender, race). Instead, we should guide faculty on what to say in their comments. We can illustrate which words help explain their observations (and which, like the ten common words identified in the Rojek study,5 may be of limited value). But we should not mislead ourselves into thinking that a new form or a new online training program will solve the issue. Forms help communicate goals, but dialogue and conversation lead to understanding. If decisions about which comments to include in a final narrative (course, clerkship, rotation, MSPE) are to be made by someone other than the author, we need to ask: Who will decide what to keep, delete, or change? How will the faculty member be given feedback? How do we know the person editing is not influenced by their own bias(es)? Do we have a responsibility to inform readers of the narrative when changes have been made? Should personal attributes be eliminated from comments, or could they be desirable to more fully describe an individual? Do we risk swinging the pendulum too far, yielding descriptions that become formulaic and impersonal?

Collectively, we have to help one another to say what matters. Our best chance of doing this is to find and take the time to talk with one another. Ultimately, it is in our daily dialogue and through naming the problem that we will mitigate bias and not perpetuate it.