Lake Wobegon - Failing to Fail

Failure to Fail Part 1: Why faculty evaluation may not identify a failing learner

In Education & Quality Improvement by Nadim Lalani

I recently gave a talk to fellow faculty on the phenomenon of “failure to fail” in emergency medicine. I am no expert, but I have tried to synthesize the details in a useful way. I have broken it down into three parts. Part 1 deals with the phenomenon of Failure to Fail. In two follow-up posts I will introduce some forms of evaluator bias and then provide a prescription for more effective learner assessment in the ER.


There are several truths that we can take for granted in evaluation:

  1. We expect our medical trainees to acquire the fundamental clinical skills.
  2. We expect them to evolve from novice to expert.
  3. We aim to graduate cadres of competent physicians who will serve their communities safely, effectively, and conscientiously.

The current model of medical training is a blend of didactic teaching, clinical learning, simulation, and self-directed endeavors. We then try to evaluate the learners formatively and summatively using written exams and standardized clinical scenarios. However, we are learning that the best tool to evaluate learners is direct observation in the work context. This requires four things:

  1. Deliberate practice by the trainee
  2. Intentional observation by the faculty
  3. Specific actionable feedback from the faculty to the trainee
  4. Action planning of additional learning / practice / observation / feedback

Transitioning to this type of evaluation will place even more emphasis on direct faculty oversight. For this to be feasible we need to coach our faculty to:

  • Perform direct observation
  • Perform a valid [i.e., repeatable] evaluation of skills
  • Provide effective feedback

Current State of Trainee Evaluation

FACT: The current model sometimes fails to discriminate between learners and to fail those who underperform.

This occurs despite observed unsatisfactory performance, faculty being confident in their ability to discriminate, and faculty acknowledgement that this is a significant problem. In an effort to understand why, I discovered the Wobegon Effect.

The Wobegon Effect is named for a fictional town on a radio show – a town where:

All the women are strong, all the men are good looking, and all the children are above average

It is an allegory for the human tendency to overestimate one’s achievements and capabilities in relation to others (also known as illusory superiority). Grossly inflated performance ratings have been found practically everywhere in North America, including business (by both employees and managers), academia (where university professors overestimate their excellence [gulp]), and driving (where no one believes they are bad). Unsurprisingly, this phenomenon is equally pervasive in medicine, where:

  • Faculty struggle to provide honest feedback and consistent [valid] evaluations. [In one study, raters rated 66% of trainees “above average.” This is simply not possible! (1)]
  • Fazio et al. (2) demonstrated that 40% of internal medicine (IM) clerks should have failed, yet were passed… FORTY PERCENT!!!
  • A study by Harasym et al. (3) showed that OSCE examiners are more likely to err on the side of passing than failing students.
  • Residents [in particular the lowest-performing ones] overestimate their competency compared with their ratings by faculty, peers, and nurses (4).
  • Moreover, the biggest overestimations lie in the so-called “soft skills”: communication, teamwork, and professionalism. These are often the problems that give faculty and colleagues headaches with a particular learner.
  • One reason might be that soft skills are hard to quantify, unlike suturing skills, where incompetence is quickly identified.

The end result is a culture of “failure to fail” where graduates do not acquire the required skill-set and patient needs are not served (safety and patient satisfaction are reduced while diagnostic error is increased), resulting in a negative fallout besmirching the entire profession and eroding public trust. We cannot succeed in our work without the public having confidence in what we do.

Why is there a failure to fail learners?

There are many barriers to providing adequate evaluation:

Learner factors

  • Learners are all different. Moreover, the same learner will vary in skills through time as they grow and develop.
  • We all have good and bad days.
  • There exists a phenomenon called “Group-work effect” where medical teams can mask deficiencies of individual learners.

Institution factors

  • Evaluation tools are flawed: some evaluation forms are poorly designed to discriminate between learners.
  • We all work in the current culture of “too busy to teach.”
  • There is an incredible amount of work needed to change this culture.

Faculty Factors

  • Faculty feel confident in their ability when polled.
  • Faculty feel a sense of responsibility to patient, profession, and learner BUT …
  • Raters themselves are the largest source of variance in ratings of learners:
    • Examiners account for 40% of the variance in scores on OSCEs
    • Examiners’ ratings are FOUR times more varied than the “true” scores of students
    • Some tend to be more strict (“Hawks”) … some are more lenient (“Doves”)
    • Gender, ethnicity, and age/experience have a negligible effect [though one UK study found that hawks are more likely to be ethnic minority and older (5)]
  • Clinical competence of faculty members is also correlated with better evaluations (6)
    • In one interesting study, faculty took an OSCE themselves and then rated students. The results showed that:
      • Faculty use their own practice style as a frame of reference to rate learners
      • Better performers on the OSCE were more insightful and attentive evaluators

A convenience sample of U of S EM faculty polled in 2013 identified three top reasons for failing to fail: a fear of being perceived as unfair, lacking confidence in the evidence supporting failure, and uncertainty about how to identify specific behaviors in learners.

What does the literature say about failure to fail?

The literature identifies multiple factors that result in a failure to fail:

  1. Competing demands: The conflict between clinical and educational responsibilities means that education suffers.
  2. Lack of Documentation: Preceptors fail to record day-to-day performance, so when it comes to the end-of-rotation evaluation there is often not enough evidence (7).
  3. The Interpersonal Conflict Model: This model describes the following phenomenon: faculty try to be gentle (to protect learner self-esteem and maintain an alliance with them) by softening feedback to ensure that it is not seen as a personal attack. This creates tension when they are forced to be negative or provide critical feedback. The emotional component of giving negative feedback makes it even more difficult. As a result, we overemphasize the positives and send a mixed or diluted message to our learners.
  4. Lack of Self-Efficacy: There’s a lack of knowledge of what specifically to document. Faculty don’t know what type of information to jot down and struggle to identify specific behaviors associated with failure. The reported low self-confidence during evaluations is actually a product of our training [or rather lack thereof]. No one teaches us how to navigate minefields in evaluation. This is particularly evident for soft skills. Staff often think that their judgments are subjective interpretations.
  5. Anticipating an [arduous] Appeal Process: The additional responsibility of having to defend one’s actions/comments, and the fear of escalation (e.g. legal action), make failing a learner even harder.
  6. Lack of Remediation Options: There is a lack of faculty support for remediating learners. This makes faculty unsure about what to do/advise for remediation after diagnosing a problem.


We have seen that the current model of medical training is failing to identify and fail underperforming learners. There are several reasons why, but faculty themselves play a large role in this culture of “Failure to Fail.” In my next post I will highlight some biases that we encounter when judging learners and provide a prescription for more effective learner evaluation.

Acknowledgement: Dr Jason Frank @drjfrank for pointing me towards relevant literature.
Note: This post was originally published on the ERMentor Blog. It was revised by Stephanie Zhou and Brent Thoma and reposted on CanadiEM on June 16th, 2016.


1. Paget M, Wu C, McIlwrick J, Woloschuk W, Wright B, McLaughlin K. Rater variables associated with ITER ratings. Adv Health Sci Educ Theory Pract. 2013;18(4):551-557.
2. Fazio S, Papp K, Torre D, Defer T. Grade inflation in the internal medicine clerkship: a national survey. Teach Learn Med. 2013;25(1):71-76.
3. Harasym P, Woloschuk W, Cunning L. Undesired variance due to examiner stringency/leniency effect in communication skill scores assessed in OSCEs. Adv Health Sci Educ Theory Pract. 2008;13(5):617-632.
4. Lipsett P, Harris I, Downing S. Resident self-other assessor agreement: influence of assessor, competency, and performance level. Arch Surg. 2011;146(8):901-906.
5. McManus I, Thompson M, Mollon J. Assessment of examiner leniency and stringency (’hawk-dove effect’) in the MRCP(UK) clinical examination (PACES) using multi-facet Rasch modelling. BMC Med Educ. 2006;6:42.
6. Kogan J, Hess B, Conforti L, Holmboe E. What drives faculty ratings of residents’ clinical skills? The impact of faculty’s own clinical skills. Acad Med. 2010;85(10 Suppl):S25-8.
7. Dudek N, Marks M, Regehr G. Failure to fail: the perspectives of clinical supervisors. Acad Med. 2005;80(10 Suppl):S84-7.
Nadim is an emergency physician at the South Health Campus in Calgary, Alberta. He is passionate about online learning and recently made a transition into human performance coaching. He is currently working on introducing the coaching model into medical education.