Bad Economics: Student Evaluations Can’t Be Used to Assess Professors. They’re Discriminatory Against Women. (says study with n=2)
This is part of a series explaining common economics misconceptions.

TL;DR: A professor publishes a terrible, terrible paper. I tear it apart.

Intended readership: I assume you took some undergrad statistics.
In a recent Slate article, a Texas Tech professor argues that, because teaching evaluations are biased against women, they’re unacceptable as a measure of employee performance. While I could easily be convinced there’s gender bias in teaching evaluations, the paper the author uses to back up her claims is a heap of methodological garbage, which is what we’ll talk about today.
The authors based their study on MacNell et al. (2015), which actually does something pretty cool: it switches the names of male and female TAs (who only interact with students through forum posts) in online courses as a way to create a natural experiment. This is similar to the way racial discrimination is studied in the labor market by randomly switching the names on resumes sent out to employers.
The problem is that instead of replicating MacNell et al. (2015), the authors simply collected the evaluations of two online courses, one taught by a male professor and one by a female professor.[footnote]Extra bad: the appendix doesn’t say whether the lectures are actually the same, only that the “content” is the same. This ambiguity is concerning.[/footnote] So the entire methodological design goes out the window: we’re now just comparing the evaluations of two people who happen to be of different genders.
In addition to that, the authors use an entirely subjective methodology (based on no cited prior research) to analyze the topics in the reviews. I’ve never seen something this blatant: they make up topics and barely explain the criteria for assigning a review to a topic.[footnote]I’m serious. Look at page 2 of the supplementary material.[/footnote] They might as well make up the numbers in Tables 1 and 2 and save everyone time.
The worst part is that there is a way to do this sort of topic analysis more rigorously: you can run latent Dirichlet allocation (LDA) on the words in each review and split the resulting topics by the (perceived) gender of the instructor. In fact, running LDA on data scraped from RateMyProfessors, splitting it by gender, and controlling for the course taught and for fixed effects of the individual professors would probably be the next best way to study the question after the MacNell et al. (2015) kind of methodology.
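To make that concrete, here’s a minimal sketch of what the LDA approach could look like, assuming reviews have already been scraped into (text, perceived gender) pairs. Everything here, from the variable names to the number of topics, is illustrative, and it leaves out the course controls and professor fixed effects a real study would need:

```python
# A sketch of the LDA approach, not the paper's method. The reviews,
# variable names, and number of topics are all placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical scraped data: (review text, perceived gender of instructor)
reviews = [
    ("Great lecturer, very organized and clear", "male"),
    ("She was so helpful and caring in office hours", "female"),
    # ... in a real study, thousands of scraped reviews would go here
]
texts = [text for text, _ in reviews]
genders = np.array([gender for _, gender in reviews])

# Bag-of-words representation of the review text
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# Fit LDA with an arbitrary number of topics
lda = LatentDirichletAllocation(n_components=5, random_state=0)
topic_weights = lda.fit_transform(X)  # rows: reviews, columns: topic shares

# Compare the average weight of each topic across perceived gender
for topic in range(lda.n_components):
    male_mean = topic_weights[genders == "male", topic].mean()
    female_mean = topic_weights[genders == "female", topic].mean()
    print(f"topic {topic}: male={male_mean:.3f}, female={female_mean:.3f}")
```

The point of this design is that the topics come out of the data instead of being invented by the researchers, so a systematic gender gap in, say, personality-related topics would be something a reader can check rather than take on faith.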
Extra pedantry: comparing ordinal values like 5-point scale ratings by t-testing their means is bad methodology. The difference between a 5 and a 4 is not necessarily the same as the difference between a 1 and a 2, so comparing means makes no sense. They should have used something like a Mann-Whitney U test, or modeled the data with an ordered probit regression.
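For illustration, here’s what the Mann-Whitney version would look like on made-up ratings (the arrays below are placeholders, not the paper’s data); the ordered probit alternative would instead model the probability of each rating category as a function of instructor gender:

```python
# Sketch of the suggested rank-based test on hypothetical 5-point ratings
import numpy as np
from scipy.stats import mannwhitneyu

# Placeholder ratings for the two instructors
ratings_male = np.array([5, 4, 4, 5, 3, 4, 5, 2])
ratings_female = np.array([4, 3, 5, 2, 3, 4, 3, 2])

# The Mann-Whitney U test compares the two rating distributions using ranks,
# so it never pretends the 1-5 scale is interval-valued
stat, p_value = mannwhitneyu(ratings_male, ratings_female, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3f}")
```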