Drugs Fun, Heroin Still Dangerous

Carl Hart is a parent, Columbia professor, and five-year-running recreational heroin user, reports The Guardian. “I do not have a drug-use problem,” he says, “Never have. Each day, I meet my parental, personal and professional responsibilities. I pay my taxes, serve as a volunteer in my community on a regular basis and contribute to the global community as an informed and engaged citizen. I am better for my drug use.”

Hart makes it pretty clear he thinks drug use is a good thing. Good not only for himself, but for people in general. “Most drug-use scenarios cause little or no harm,” he says, “and some reasonable drug-use scenarios are actually beneficial for human health and functioning.” He supports some basic safeguards, including an age limit and possibly an exam-based competency requirement, “like a driver’s licence.” But otherwise, he thinks that most people can take most drugs safely.

The article mentions Hart’s research in passing, but doesn’t describe it. Instead, these claims seem to be based largely on Hart’s personal experiences with drugs. He’s been using heroin for five years and still meets his “parental, personal and professional responsibilities”. He likes to take amphetamine and cocaine “at parties and receptions.” He uses MDMA as a way to reconnect with his wife.

When Hart wondered why people “go on [Ed.: yikes] about heroin withdrawal”, he conducted an ad hoc study on himself, first upping his heroin dose and then stopping (it’s not clear for how long). He describes going through an “uncomfortable” night of withdrawal, but says he didn’t “feel the need or desire to take more heroin” and was never in any real danger.

This is fascinating, but it seems like there’s a simple individual differences explanation — people differ (probably genetically) in how destructive and addictive they find certain substances, and Hart is presumably just very lucky and doesn’t find heroin (or anything else) all that addictive. This is still consistent with heroin being a terrible drug that ruins people’s lives for the average user.

Let’s imagine a simplified system where everyone either is resistant to a drug and can enjoy it recreationally, or finds it addictive and it ends up destroying their life. For alcohol, maybe 5% of people find it addictive (and become alcoholics) and the other 95% of us can enjoy it without any risk. In this case, society agrees that alcohol is safe for most people and we keep it legal. 

But for heroin, maybe 80% of people would find it addictive if they tried it. Even if 20% of people would be able to safely enjoy recreational heroin, you don’t know if it will destroy your life or not until you try it, so it’s a very risky bet. As a result, society is against heroin use and most people make the reasonable decision to not even try it.

Where that ruins-your-life-percentage (RYLP) stands for different drugs matters a lot for the kinds of drugs we want to accept as a society. Certainly a drug with a 0% RYLP should be permitted recreationally, and almost as certainly, a drug that ruined the lives of 100% of first-time users should be regulated in some way. The RYLP for real drugs will presumably lie somewhere in between. While we might see low-RYLP drugs as being worth the risk (our society’s current stance on alcohol), a RYLP of just ten or twenty percent starts looking kind of scary. A drug that ruins the lives of one out of every five first-time users is bad enough — you don’t need a RYLP of 80% for a drug to be very, very dangerous.
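To make the toy model concrete, here’s a quick sketch of what different RYLPs mean at population scale (the function and all numbers are ours, purely illustrative):

```python
# Toy model: each first-time user independently either enjoys the drug
# safely or has their life ruined, with probability RYLP.
# All numbers here are illustrative, not real estimates.

def expected_ruined(rylp: float, first_time_users: int) -> float:
    """Expected number of first-time users whose lives are ruined."""
    return rylp * first_time_users

# Even a "low" RYLP adds up fast across a population:
for rylp in [0.00, 0.05, 0.20, 0.80]:
    print(f"RYLP {rylp:>4.0%}: {expected_ruined(rylp, 1_000_000):>9,.0f} "
          f"ruined lives per million first-time users")
```

Even at the “scary but not extreme” RYLP of 20%, that’s two hundred thousand ruined lives per million people who try the drug once.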

Listen, we also believe in the right to take drugs. We take drugs. Drugs good. Most drugs — maybe all drugs — should be legal. But this is very different from pretending that many drugs are not seriously, often dangerously addictive for a large percentage of the population. 

As far as we know, drugs like caffeine and THC aren’t seriously addictive and don’t ruin people’s lives. There’s even some fascinating evidence, from Reuven Dar, that nicotine isn’t addictive (though there may be other good reasons to avoid nicotine). But drugs like alcohol and yes, heroin, do seem to be seriously addictive, and recognizing this is important for allowing adults to make informed choices about how they want to get high off their asses.

Hart’s experience with withdrawal, and how he chooses to discuss it, seems particularly clueless. It’s possible that Hart really is able to quit heroin with minimal discomfort, but it’s confusing and kind of condescending that he doesn’t recognize it might be harder for other people. When people say things like, “I find heroin very addictive and withdrawal excruciating,” a good start is to take their reports seriously, not to turn around and say, “well withdrawal was a cakewalk FOR ME.”

This seems to be yet another example of the confusing trend in medicine and biology, where everyone seems to assume that all people are identical and there are no individual differences at all. If an exercise program works for me, it will work equally well for everyone else. If a dietary change cures my heartburn, it will work equally well for everyone’s heartburn. If a painkiller works well for me when I have a headache, it will work equally well for the pain from your chronic illness. The assumption seems to be that people’s bodies (and minds) are made up of a single undifferentiated substance which is identical across all people. But of course, people are different, and this should be neither controversial nor difficult to understand. This is why if you’re taking drugs it’s important to experiment — you need to figure out what works best for you.

This is kind of embarrassing for Carl Hart. He is a professor of neuroscience and psychology. His specialty is neuropsychopharmacology. He absolutely has the statistical and clinical background necessary to understand this point. At the risk of being internally redundant, different people are different from each other. They will have different needs. They will have different responses to the same drugs. Sometimes two people will have OPPOSITE reactions to the SAME drug! Presumably Carl Hart has heard of paradoxical reactions — he should be aware of this.

On the other hand, anyone who sticks their finger in Duterte’s eye is our personal hero. We should cut Hart some slack for generally doing the right thing around a contentious subject, even if we think he is dangerously wrong about this point.

Less slack should be cut for the article itself. This is very embarrassing for The Guardian. Hart is the only person they quote in the entire article. They don’t seem to have interviewed any other experts to see if they might disagree with or qualify Hart’s statements. This is particularly weird because other experts are clearly interested in commenting and the author clearly knows that they might disagree with Hart. They might have asked for a comment from Yale Professor, physician, and (statistically speaking) likely marijuana user, Nicholas Christakis, who would have been happy to offer a counterbalancing opinion. The Guardian was happy to print that Hart is critical of the National Institute on Drug Abuse (NIDA), “in particular of its director, Nora Volkow”, but there’s no indication that they so much as reached out to NIDA or to Volkow for comment (incidentally, here’s what Volkow has to say on the subject).

We can’t be sure, but it’s even possible they somewhat misrepresented Hart’s actual position. It’s disappointing but not surprising when a newspaper doesn’t understand basic statistics, and it would be unfair to hold them to the same standard we hold for Carl Hart. But it is fair to hold them accountable for the basics of journalistic practice, and it seems to us like they dropped the bong on this one.

Investigation: Hypobaric Hypoxia Causes Body Weight Reduction by Lippl et al. (2010)


One of the mysterious aspects of obesity is that it is correlated with altitude. People tend to be leaner at high altitudes and fatter near sea level. Colorado is the highest-altitude US state and also the leanest, with an obesity rate of only 22%. In contrast, low-altitude Louisiana has an obesity rate of about 36%. This is pretty well documented in the literature, and isn’t just limited to the United States. We see the same thing in countries around the world, from Spain to Tibet.

A popular explanation for this phenomenon is the idea that hypoxia, or lack of oxygen, leads to weight loss. The story goes that because the atmosphere is thinner at higher altitudes, the body gets less oxygen, and this ends up making people leaner.

One paper claims to offer final evidence in favor of this theory: Hypobaric Hypoxia Causes Body Weight Reduction in Obese Subjects by Lippl, Neubauer, Schipfer, Lichter, Tufman, Otto, & Fischer in 2010. Actually, the webpage says 2012, but the PDF and all other sources say 2010, so whatever.

This paper isn’t terribly famous, but as of this writing it’s been cited 171 times, and it was covered by WIRED magazine in 2010, so let’s take a look.

This study focused on twenty middle-aged obese German men (mean age 55.7, mean BMI 33.7), all of whom normally lived at a low altitude — 571 ± 29 meters above sea level. Participants were first given a medical exam in Munich, Germany (530 meters above sea level) to establish baseline values for all measures. A week later, all twenty of the obese German men, as well as (presumably) the researchers, traveled to “the air‐conditioned Environmental Research Station Schneefernerhaus (UFS, Zugspitze, Germany)”, a former hotel in the Bavarian Alps (2,650 meters above sea level). The hotel/research station “was effortlessly reached by cogwheel train and cable car during the afternoon of day 6.”

Patients stayed in the Schneefernerhaus research station for a week, where they “ate and drank without restriction, as they would have at home.” Exercise was “restricted to slow walks throughout the station: more vigorous activity was not permitted.” They note that there was slightly less activity at the research station than there was at low altitudes, “probably due to the limited walking space in the high‐altitude research station.” Sounds cozy.

During this week-long period at high altitude, the researchers continued collecting measurements of the participants’ health. After the week was through, everyone returned to Munich (530 meters above sea level). At this point the researchers waited four weeks (it’s not clear why) before conducting the final health examinations, at which point the study concluded. We’re not sure what to say about this study design, except that it’s clear the film adaptation should be directed by Wes Anderson.

Schneefernerhaus Research Station. Yes, really.


While this design is amusing, the results are uninspiring. 

To begin with, the weight loss was minimal. During the week they spent at 2,650 meters, patients lost an average of 3 pounds (1.5 kg). They were an average of 232 lbs (105.1 kg) to begin with, so this is only about 1% of their body weight. Going from 232 lbs (105.1 kg) to 229 lbs (103.6 kg) doesn’t seem clinically relevant, or even all that noticeable. The authors, surprisingly, agree: “the absolute amount of weight loss was so small.”

More importantly, we’re not convinced that this tiny weight loss result is real, because the paper suffers from serious multiple comparison problems. Multiple comparisons, one of the main ingredients of what is now called p-hacking or “questionable research practices”, are a problem because they make false positives very likely. If you run one statistical test, there’s a small chance you will get a false positive, but as you run more tests, false positives get more and more likely. If you run enough tests, you are virtually guaranteed to get a false positive, or many false positives. If you try running many different tests, or try running the same test many different ways, and only report the best one, it’s possible to make pure noise look like a strong finding.
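The arithmetic here is simple. Assuming independent tests at the conventional alpha of .05, and all null hypotheses true, the chance of at least one false positive grows quickly with the number of tests:

```python
# Familywise error rate: the probability of at least one false positive
# across n independent tests, each run at significance level alpha,
# when every null hypothesis is actually true.

def familywise_error(n_tests: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** n_tests

for n in [1, 10, 74]:  # 74 is our count of the tests in this paper
    print(f"{n:>3} tests: P(at least one false positive) = "
          f"{familywise_error(n):.2f}")
```

At 74 tests, the chance of at least one spurious “significant” result is about 98%, even if nothing real is going on.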

We see evidence of multiple comparisons in the paper. They collect a lot of measures and run a lot of tests. The authors report eight measures of obesity alone, as well as many other measures of health.

The week the patients spent at 2,650 meters — Day 7 to Day 14 — is clearly the interval of interest here, but they mostly report comparisons of Day 1 to the other days, and they tend to report all three pairs (D1 to D7, D1 to D14, and D1 to D42), which makes for three times the number of comparisons. It’s also confusing that there are no measures for D21, D28, and D35. Did they not collect data those days, or just not report it? We think they just didn’t collect data, but it’s not clear.

The authors also use a very unusual form of statistical analysis — for each test, first they conducted a nonparametric Friedmann procedure. Then, if that showed a significant rank difference, they did a Wilcoxon signed‐rank test. It’s pretty strange to run one test conditional on another like this, especially for such a simple comparison. It’s also not clear what role the Friedmann procedure is playing in this analysis. Presumably they are referring to the Friedman test (we assume they don’t mean this procedure for biodiesel analysis) and this is a simple typo, but it’s not clear why they want to rank the means. In addition, the Wilcoxon signed‐rank test seems like a slightly strange choice. The more standard analysis here would be the humble paired t-test.

Even if this really were best practice, there’s no way to know that they didn’t start by running paired t-tests, throwing those results out when they found that they were only trending in the right direction. And in fact, we noticed that if we compare body weight at D7 to D14 using a paired t-test, we find a p-value of .0506, instead of the p < .001 they report when comparing D1 to D14 with a Wilcoxon test. We think that this is the more appropriate analysis, and as you can see, it’s not statistically significant.
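For readers who want to see the mechanics, here’s a hand-rolled paired t-test. The before/after weights below are made up for illustration; this is not the study’s data:

```python
# A minimal paired t-test, the simpler analysis suggested above.
# Illustrative weights only, not taken from Lippl et al.
import math
import statistics

def paired_t(before, after):
    """Return the t statistic for paired samples (df = len(before) - 1)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

before = [105.1, 98.4, 110.2, 95.7, 102.3]   # kg, illustrative
after  = [104.0, 98.0, 109.1, 95.9, 101.2]   # kg, illustrative
t = paired_t(before, after)
# Compare |t| to the critical value for df = 4 at alpha = .05 (2.776).
print(f"t = {t:.2f}")
```

In practice you would use a library routine (e.g. `scipy.stats.ttest_rel`) to get an exact p-value, but the point stands: the paired t-test is the boring, standard tool for this comparison.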

Regardless, the whole analysis is called into question by the number of tests they ran. By our count they conducted at least 74 tests in this paper, which is a form of p-hacking and makes the results very hard to interpret. It’s also possible that they conducted even more tests that weren’t reported in the paper. This isn’t really their fault — p-hacking wasn’t described until 2011 (and the term itself wasn’t invented until a few years later), so like most people they were almost certainly unfamiliar with issues of multiple comparisons when they did their analysis. While we don’t accuse the authors of acting in bad faith, we do think this seriously undermines our ability to interpret their results. When we ran the single test that we think was most appropriate, we found that it was not significant. 

And of course, the sample size was only 20 people, though perhaps there wasn’t room for many more people in the research station. On one hand this is pretty standard for intensive studies like this, but it reduces the statistical power. 

There appear to be about 68 statistical tests in this table alone. Every little star (*) indicates a significant test against the number in D1. It’s hard to tell exactly how many tests they performed here (due to their very weird procedure), but it could be as many as 68.


The authors claim to show that hypoxia causes weight loss, but this is overstating their case. They report that people brought to 2,650 meters lost a small amount of weight and had lower blood oxygen saturation [1], but we think the former result is noise and the latter result is unsurprising. Obviously if you bring people to 2,650 meters they will have lower blood oxygen, and there’s no evidence linking that to the reported weight loss. 

Even more concerning is the fact that there’s no control group, which means that this study isn’t even an experiment. Without a control group, there can be no random assignment, and with no random assignment, a study isn’t an experiment. As a result, the strong causal claim the authors draw from their results is pretty unsubstantiated. 

There isn’t an obvious fix for this problem. A control group that stayed in Munich wouldn’t be appropriate, because oxygen is confounded with everything else about altitude. If there were a difference between the Munich group and the Schneefernerhaus group, there would be no way to tell if that was due to the amount of oxygen or any of the other thousand differences between the two locations. A better approach would be to bring a control group to the same altitude, and give that control group extra oxygen, though that might introduce its own confounds — for example, the supplemental-oxygen group would all be wearing masks and carrying canisters. Perhaps the best way to do this would be to bring both groups to the Alps, give both of them canisters and masks, but put real oxygen in the canisters for one group and placebo oxygen (nitrogen?) in the canisters for the other.

We’re sympathetic to inferring causal relationships from correlational data, but the authors don’t report a correlation between blood oxygen saturation and weight loss, even though that would be the relevant test given the data that they have. Probably they don’t report it because it’s not significant. They do report, “We could not find a significant correlation between oxygen saturation or oxygen partial pressure, and either ghrelin or leptin.” These are tests that we might expect to be significant if hypoxia caused weight loss — which suggests that it does not. 

Unfortunately, the authors report no evidence for their mechanism and probably don’t have an effect to explain in the first place. This is too bad — the study asks an interesting question, and the design looks good at first. It’s only on reflection that you see that there are serious problems.

Thanks to Nick Brown for reading a draft of this post. 

[1] One thing that Nick Brown noticed when he read the first draft of this post is that the oxygen saturation percentages reported for D7 and D14 seem to be dangerously low. We’ve all become more familiar with oxygen saturation measures because of COVID, so you may already know that a normal range is 95-100%. Guidelines generally suggest that levels below 90% are dangerous, and should be cause to seek medical attention, so it’s a little surprising that the average for these 20 men was in the mid-80s during their week at high altitude. We found this confusing so we looked into it, and it turns out that this is probably not an issue. Not only are lower oxygen saturation levels normal at higher altitudes, the levels can apparently be very low by sea-level standards without becoming dangerous. For example, in this study of residents of El Alto in Bolivia (an elevation of 4018 m), the mean oxygen saturation percentages were in the range of 85-88%. So while this is definitely striking, it’s probably not anything to worry about.

Investigation: Ultra-Processed Diets by Hall et al. (2019)

[This is Part One of a two-part analysis in collaboration with Nick Brown. Part Two is on Nick’s blog.]


Recently we came across a 2019 paper called Ultra-Processed Diets Cause Excess Calorie Intake and Weight Gain: An Inpatient Randomized Controlled Trial of Ad Libitum Food Intake, by Kevin D. Hall and colleagues. 

Briefly, Hall et al. (2019) is a metabolic ward study on the effects of “ultra-processed” foods on energy intake and weight gain. The participants were 20 adults with a mean age of 31.2 years. They had a mean BMI of 27, so on average participants were slightly overweight, but not obese.

Participants were admitted to the metabolic ward and randomly assigned to one of two conditions. They either ate an ultra-processed diet for two weeks, immediately followed by an unprocessed diet for two weeks — or they ate an unprocessed diet for two weeks, immediately followed by an ultra-processed diet for two weeks. The study was ad libitum, so whether they were eating an unprocessed or an ultra-processed diet, participants were always allowed to eat as much as they wanted — in the words of the authors, “subjects were instructed to consume as much or as little as desired.”

The authors found that people ate more on the ultra-processed diet and gained a small amount of weight, compared to the unprocessed diet, where they ate less and lost a small amount of weight.

We’re not in the habit of re-analyzing published papers, but we decided to take a closer look at this study because a couple of things in the abstract struck us as surprising. Weight change is one main outcome of interest for this study, and several unusual things about this measure stand out immediately. First, the two groups report the same amount of change in body weight, the only difference being that one group gained weight and the other group lost it. In the ultra-processed diet group, people gained 0.9 ± 0.3 kg (p = 0.009), and in the unprocessed diet group, people lost 0.9 ± 0.3 kg (p = 0.007). (Those ± values are standard errors of the mean.) It’s pretty unlikely for the means of both groups to be identical, and it’s very unlikely that both the means and the standard errors would be identical.

It’s not impossible for these numbers to be the same (and in fact, they are not precisely equal in the raw data, though they are still pretty close), especially given that they’re rounded to one decimal place. But it is weird. We ran some simple simulations which suggest that this should only happen about 5% of the time — but this is assuming that the means and SDs of the two groups are both identical in the population, which itself is very unlikely.
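Here’s a rough version of the kind of simulation we mean. The population parameters are our guesses, chosen to resemble the reported values (mean change near 0.9 kg, SE near 0.3 kg with n = 20); this is a sketch, not our exact simulation code:

```python
# Simulate: draw two groups of n = 20 from the SAME normal population,
# and check how often their means AND standard errors both match after
# rounding to one decimal place. Parameters are illustrative guesses.
import math
import random
import statistics

def one_sim(n=20, mu=0.9, sd=1.34):  # sd chosen so SE is near 0.3
    g1 = [random.gauss(mu, sd) for _ in range(n)]
    g2 = [random.gauss(mu, sd) for _ in range(n)]
    def rounded_mean_se(g):
        return (round(statistics.mean(g), 1),
                round(statistics.stdev(g) / math.sqrt(len(g)), 1))
    return rounded_mean_se(g1) == rounded_mean_se(g2)

random.seed(42)
trials = 10_000
rate = sum(one_sim() for _ in range(trials)) / trials
print(f"Both rounded mean and SE match in ~{rate:.1%} of simulations")
```

Note that this already assumes the two population distributions are identical, which is itself the unlikely part; under any real difference between diets, a coincidence like this gets rarer still.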

Another test of interest reported in the abstract also seemed odd. They report that weight changes were highly correlated with energy intake (r = 0.8, p < 0.0001). This correlation coefficient struck us as surprising, because it’s pretty huge. There are very few measures that are correlated with one another at 0.8 — these are the types of correlations we tend to see between identical twins, or repeated measurements of the same person. As an example, in identical twins, BMI is correlated at about r = 0.8, and height at about r = 0.9.

We know that these points are pretty ticky-tacky stuff. By themselves, they’re not much, but they bothered us. Something already seemed weird, and we hadn’t even gotten past the abstract.

Even the authors found these results surprising, and have said so on a couple of occasions. As a result, we decided to take a closer look. Fortunately for us, the authors have followed best practices and all their data is available on the OSF.

To conduct this analysis, we teamed up with Nick Brown, with additional help from James Heathers. We focused on one particular dependent variable of this study, weight change, while Nick took a broader look at several elements of the paper.


Because we were most interested in weight change, we decided to begin by taking a close look at the file “deltabw”. In mathematics, delta usually means “change” or “the change in”, and “bw” here stands for “body weight”, so this title indicates that the file contains data for the change in participants’ body weights. On the OSF this is in the form of a SAS .sas7bdat file, but we converted it to a .csv file, which is a little easier to work with.

Here’s a screenshot of what the deltabw file looks like:

In this spreadsheet, each row tells us about the weight for one participant on one day of the 4-week-long study. These daily body weight measurements were performed at 6am each morning, so we have one row for every day. 

Let’s also orient you to the columns. “StudyID” is the ID for each participant. Here we can see that in this screenshot we are looking just at participant ADL001, or participant 01 for short. The “Period” variable tells us whether the participant was eating an ultra-processed (PROC) or an unprocessed (UNPROC) diet on that day. Here we can see that participant 01 was part of the group who had an unprocessed diet for the first two weeks, before switching to the ultra-processed diet for the last two weeks. “Day” tells us which day in the 28-day study the measurement is from. Here we show only the first 20 days for participant 01. 

“BW” is the main variable of interest, as it is the participant’s measured weight, in kilograms, for that day of the study. “DayInPeriod” tells us which day they are on for that particular diet. Each participant goes 14 days on one diet then begins day 1 on the other diet. “BaseBW” is just their weight for day 1 on that period. Participant 01 was 94.87 kg on day one of the unprocessed diet, so this column holds that value as long as they’re on that diet. “DeltaBW” is the difference between their weight on that day and the weight they were at the beginning of that period. For example, participant 01 weighed 94.87 kg on day one and 94.07 kg on day nine, so the DeltaBW value for day nine is -0.80.

Finally, “DeltaDaily” is a variable that we added, which is just a simple calculation of how much the participant’s weight changed each day. If someone weighed 82.85 kg yesterday and they weigh 82.95 kg today, the DeltaDaily would be 0.10, because they gained 0.10 kg in the last 24 hours.
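Computing DeltaDaily is just a diff over consecutive days. For example, with some made-up daily weights:

```python
# Computing the DeltaDaily column from a list of daily weights,
# as described above (illustrative weights, not the study's data).
weights = [94.87, 94.77, 94.87, 94.97, 94.87]  # kg, one per day

delta_daily = [round(today - yesterday, 2)
               for yesterday, today in zip(weights, weights[1:])]
print(delta_daily)  # day-over-day weight changes in kg
```

We rounded to two decimal places because that’s the precision the weights themselves are reported at; the properties of these differences are what the rest of this post digs into.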

To begin with, we were able to replicate the authors’ main findings. When we don’t round to one decimal place, we see that participants on the ultra-processed diet gained an average of 0.9380 (± 0.3219) kg, and participants on the unprocessed diet lost an average of 0.9085 (± 0.3006) kg. That’s only a difference of 0.0295 kg in absolute values in the means, and 0.0213 kg for the standard errors, which we still find quite surprising. Note that this is different from the concern about standard errors raised by Drs. Mackerras and Blizzard. Many of the standard errors in this paper come from GLM analysis, which assumes homogeneity of variances and often leads to identical standard errors. But these are independently calculated standard errors of the mean for each condition, so it is still somewhat surprising that they are so similar (though not identical).  

On average these participants gained and lost impressive, but not shocking amounts of weight. A few of the participants, however, saw weight loss that was very concerning. One woman lost 4.3 kg in 14 days which, to quote Nick Brown, “is what I would expect if she had dysentery” (evocative though perhaps a little excessive). In fact, according to the data, she lost 2.39 kg in the first five days alone. We also notice that this patient was only 67.12 kg (about 148 lbs) to begin with, so such a huge loss is proportionally even more concerning. This is the most extreme case, of course, but not the only case of such intense weight change over such a short period.

The article tells us that participants were weighed on a Welch Allyn Scale-Tronix 5702 scale, which has a resolution of 0.1 lb or 100 grams (0.1 kg). This means it should only display data to one decimal place. Here’s the manufacturer’s specification sheet for that model. But participant weights in the file deltabw are all reported to two decimal places; that is, with a precision of 0.01 kg, as you can clearly see from the screenshot above. Of the 560 weight readings in the data file, only 55 end in zero. It is not clear how this is possible, since the scale apparently doesn’t display this much precision. 

To confirm this, we wrote to Welch Allyn’s customer support department, who confirmed that the model 5702 has 0.1 kg resolution.

We also considered the possibility that the researchers measured people’s weight in pounds and then converted to kilograms, in order to use the scale’s better precision of 0.1 pounds (45.4 grams) rather than 100 grams. However, in this case, one would expect to see that all of the changes in weight were multiples of (approximately) 0.045 kg, which is not what we observe.
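This check is easy to run: see whether each change sits close to a multiple of 0.1 lb once converted to kilograms. The sample values below are illustrative, not from the data file:

```python
# If weights had been measured to 0.1 lb and converted to kg, every
# daily change should be close to a multiple of 0.1 lb = ~0.04536 kg.
LB_STEP = 0.45359237 / 10  # 0.1 lb expressed in kg

def is_multiple_of_step(change_kg: float, tol: float = 0.005) -> bool:
    """True if the change is within tol of a multiple of 0.1 lb."""
    remainder = abs(change_kg) % LB_STEP
    return min(remainder, LB_STEP - remainder) < tol

print(is_multiple_of_step(0.09))   # close to 2 x 0.04536 kg
print(is_multiple_of_step(0.10))   # not a clean multiple of 0.1 lb
```

Running this over the observed daily changes shows they are not clustered on multiples of 0.045 kg, which is what rules out the pounds-then-converted explanation.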


As we look closer at the numbers, things get even more confusing. 

As we noted, Hall et al. report participant weight to two decimal places in kilograms for every participant on every day. Kilograms to two decimal places should be pretty sensitive, but there are many cases where the exact same weight appears two or even three times in a row. For example, participant 21 is listed as having a weight of exactly 59.32 kg on days 12, 13, and 14, participant 13 is listed as having a weight of exactly 96.43 kg on days 10, 11, and 12, and participant 06 is listed as having a weight of exactly 49.54 kg on days 23, 24, and 25. 

Having the same weight for two or even three days in a row may not seem that strange, but it is very remarkable when the measurement is in kilograms precise to two decimal places. After all, 0.01 kg (10 grams) is not very much weight at all. A standard egg weighs about 0.05 kg (50 grams). A shot of liquor weighs a little less than that, usually a bit more than 0.03 kg (30 grams). A tablespoon of water is about 0.015 kg (15 grams). This suggests that people’s weights are varying by less than the weight of a tablespoon of water over the course of entire days, and sometimes over multiple days. This uncanny precision seems even more unusual when we note that body weight measurements were taken at 6 am every morning “after the first void”, which suggests that participants’ bodily functions were precise to 0.01 kg on certain days as well.

The case of participant 06 is particularly confusing, as 49.54 kg is exactly one kilogram less, to two decimal places, than the baseline for this participant’s weight when they started, 50.54 kg. Furthermore, in the “unprocessed” period, participant 06 only ever seems to lose or gain weight in full increments of 0.10 kilograms. 

We see similar patterns in the data from other participants. Let’s take a look at the DeltaDaily variable. As a reminder, this variable is just the difference between a person’s weight on one day and the day before. These are nothing more than daily changes in weight. 

Because these numbers are calculated from the difference between two weight measurements, both of which are reported to two decimal places of accuracy, these numbers should have two places of accuracy as well. But surprisingly, we see that many of these weight changes are in full increments of 0.10.

Take a look at the histograms below. The top histogram is the distribution of weight changes by day. For example, a person might gain 0.10 kg between days 15 and 16, and that would be one of the observations in this histogram. 

You’ll see that these data have an extremely unnatural hair-comb pattern of spikes, with only a few observations in between. This is because the vast majority (~71%) of the weight changes are in exact multiples of 0.10, despite the fact that weights and weight changes are reported to two decimal places. That is to say, participants’ weights usually changed in increments like 0.20 kg, -0.10 kg, or 0.40 kg, and almost never in increments like -0.03 kg, 0.12 kg, or 0.28 kg. 

For comparison, on the bottom is a sample from a simulated normal distribution with identical n, mean, and standard deviation. You’ll see that there is no hair-comb pattern for these data.

As we mentioned earlier, there are several cases where a participant stays at the exact same weight for two or three days in a row. The distribution we see here is the cause. As you can see, the most common daily change is exactly zero. Now, it’s certainly possible to imagine why some values might end up being zero in a study like this. There might be a technical incident with the scale, a clerical error, or a mistake when recording handwritten data on the computer. A lazy lab assistant might lose their notes, resulting in the previous day’s value being used as the reasonable best estimate. But since a change of exactly zero is the modal response, a full 9% of all measurements, it’s hard to imagine that these are all omissions or technical errors.
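The multiples-of-0.10 check itself is straightforward. Here’s a sketch (the sample changes are made up; in the actual data file, about 71% of the 540 changes land on a multiple of 0.10 kg):

```python
# What fraction of daily weight changes are exact multiples of 0.10 kg?
# Sample changes below are illustrative, not from the data file.

def fraction_on_tenths(changes):
    # A change like 0.20 or -0.10 has a 0 in its 0.01 place.
    on_tenth = [c for c in changes if round(abs(c) * 100) % 10 == 0]
    return len(on_tenth) / len(changes)

changes = [0.20, -0.10, 0.00, 0.40, -0.03, 0.12, 0.28, 0.10]
print(fraction_on_tenths(changes))
```

(We multiply by 100 and round before taking the remainder to sidestep floating-point representation issues with values like 0.28.)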

In addition, there’s something very strange going on with the trailing digits:

On the top here we have the distribution of digits in the 0.1 place. For example, a measurement of 0.29 kg would appear as a 2 here. This roughly follows the distribution we would expect, though there are a few more 1’s and fewer 0’s than usual. 

The bottom histogram is where things get weird. Here we have the distribution of digits in the 0.01 place. For example, a measurement of 0.29 kg would appear as a 9 here. As you can see, 382/540 of these observations have a 0 in their 0.01’s place — this is the same as that figure of 71% of measured changes being in full increments of 0.10 kg that we mentioned earlier. 

The rest of the distribution is also very strange. When the trailing digit is not a zero, it is almost always a 1 or a 9, occasionally a 2 or an 8, and almost never anything else. Of 540 observed weight changes, only 3 have a trailing digit of 5.
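For readers who want to check this kind of tally themselves, here is a minimal sketch of how trailing digits can be extracted from a list of weight changes (the values below are made up for illustration, not the study data):

```python
from collections import Counter

def digit_distributions(deltas):
    """Tally the 0.1-place and 0.01-place digits of weight changes
    reported to two decimal places."""
    tenths, hundredths = Counter(), Counter()
    for d in deltas:
        cents = int(round(abs(d) * 100))  # e.g. 0.29 kg -> 29
        tenths[cents // 10 % 10] += 1     # the "2" in 0.29
        hundredths[cents % 10] += 1       # the "9" in 0.29
    return tenths, hundredths

# Made-up example values, not the study data
tenths, hundredths = digit_distributions([0.20, -0.10, 0.40, -0.03, 0.12, 0.29])
```

Run on the real DeltaDaily column, `hundredths[0]` is the count behind the 382-out-of-540 figure above.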

We can see that this is not what we would expect from (simulated) normally distributed data:

It’s also not what we would expect to see if they were measuring to one decimal place most of the time (~70%), but to two decimal places on occasion (~30%). As we’ve already mentioned, this doesn’t make sense from a methodological standpoint, because all daily weights are reported to two decimal places. But even if it somehow were a measurement accuracy issue, we would expect a roughly uniform distribution across all the other digits besides zero, like this:

This is certainly not what we see in the reported data. The fact that 1 and 9 are the most likely trailing digits after 0, and that 2 and 8 are the next most likely, is especially strange.


When we first started looking into this paper, we approached Retraction Watch, who said they considered it a potential story. After completing the analyses above, we shared an early version of this post with Retraction Watch, and with our permission they approached the authors for comment. The authors were kind enough to offer feedback on what we had found, and their explanation cleared up a number of our points of confusion. 

The first thing they shared with us was this erratum from October 2020, which we hadn’t seen before. The erratum reports that they noticed an error in the documented diet order of one participant. This is an important note but doesn’t affect the analyses we present here, which have very little to do with diet conditions.

Kevin Hall, the first author on this paper, also shared a clarification on how body weights were calculated:

I think I just discovered the likely explanation about the distribution of high-precision digits in the body weight measurements that are the main subject of one of the blogs. It’s kind of illustrative of how difficult it is to fully report experimental methods! It turns out that the body weight measurements were recorded to the 0.1 kg according to the scale precision. However, we subtracted the weight of the subject’s pajamas that were measured using a more precise balance at a single time point. We repeated subtracting the mass of the pajamas on all occasions when the subject wore those pajamas. See the example excerpted below from the original form from one subject who wore the same pajamas (PJs) for three days and then switched to a new set. Obviously, the repeating high precision digits are due to the constant PJs! 😉

This matches what is reported in the paper, where they state, “Subjects wore hospital-issued top and bottom pajamas which were pre-weighed and deducted from scale weight.” 

Kevin also included the following image, which shows part of how the data was recorded for one participant: 

If we understand this correctly, the first time a participant wore a set of pajamas, the pajamas were weighed to three decimal places of precision. Then, that measurement was subtracted from the participant’s weight on the scale (“Patient Weight”) each subsequent morning, to calculate the participant’s body weight. For reasons that are unclear, the result was recorded to two decimal places, rather than the one decimal place given by the scale, or the three decimal places given by the PJ weights. When the participant switched to a new set of pajamas, the new set was weighed to three decimal places, and that number was used to calculate participant body weight until they switched to yet another new set of pajamas, etc.

We assume that the measurement for the pajamas is given in kilograms, even though they write “g” and “gm” (“qm”?) in the column. I wish my undergraduate lab TAs were as forgiving as the editors at Cell Metabolism.

This method does account for the fact that participant body weights were reported to two decimal places of precision, despite the fact that the scale only measures weight to one decimal place of precision. Even so, there were a couple of things that we still found confusing.

The variable that interests us the most is the DeltaDaily variable. We can easily calculate that variable for the provided example, like so:

We can see that whenever a participant doesn’t change their pajamas on consecutive days, there’s a trailing zero. In this way, the pajamas can account for the fact that 71% of the time, the trailing digits in the DeltaDaily variable were zeros. 

The trailing digits also let us identify when a participant changed their pajamas: every trailing digit that isn’t zero marks a pajama change. Note, of course, that about ten percent of the time a change in pajamas will itself produce a trailing digit of zero. So every nonzero trailing digit is a pajama change, but a small number of the zeros will also be “hidden” pajama changes.
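To make the arithmetic concrete, here is a quick sketch using the two pajama weights from Kevin Hall’s example (0.418 kg, then 0.376 kg); the scale readings are hypothetical:

```python
# Hypothetical scale readings (0.1 kg precision); the pajamas switch
# from 0.418 kg to 0.376 kg on the last day, as in Hall's example.
scale   = [81.6, 81.3, 81.1, 81.4]
pajamas = [0.418, 0.418, 0.418, 0.376]

# Recorded body weight: scale weight minus pajama weight, to two decimals
weights = [round(s - p, 2) for s, p in zip(scale, pajamas)]
# -> [81.18, 80.88, 80.68, 81.02]

# The DeltaDaily variable: day-to-day differences of the recorded weights
deltas = [round(b - a, 2) for a, b in zip(weights, weights[1:])]
# -> [-0.3, -0.2, 0.34]
```

While the pajamas stay the same, the daily changes end in zero; on the day of the switch, the 0.042 kg difference between the two pairs shows up as a trailing 4.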

In any case, we can use this to make inferences about how often participants changed their pajamas, and those inferences are rather confusing. Participants often change their pajamas every day for multiple days in a row, or go long stretches without apparently changing their pajamas at all, and sometimes these are the same participants. It’s possible that these long stretches without any apparent change of pajamas are the result of the “hidden” changes we mentioned, since about 10% of the time a change would happen without the trailing digit changing, but it’s still surprising.

For example, participant 05 changes their pajamas on day 2, day 5, and day 10, and then apparently doesn’t change their pajamas again until day 28, going more than two weeks without a change in PJs. Participant 20, in contrast, changes pajamas at least 16 times over 28 days, including every day for the last four days of the study. The record for this, however, has to go to participant 03, who at one point appears to have switched pajamas every day for at least seven days in a row. Participant 03 then goes eight days in a row without changing pajamas before switching pajamas every day for three days in a row. 

Participant 08 (the participant from the image above) seems to change their pajamas only twice during the entire 28-day study, once on day 4 and again on day 28. Certainly this is possible, but it doesn’t look like the pajama-wearing habits we would expect. It’s true that some people probably want to change their pajamas more than others, but this doesn’t seem like it can be entirely attributed to personality, as some people don’t change pajamas at all for a long time, and then start to change them nearly every day, or vice versa.

We were also unclear on whether the pajamas adjustment could account for the most confusing pattern we saw in the data for this article, the distribution of digits in the .01 place for the DeltaDaily variable:

The pajamas method can explain why there are so many zeros — any day a participant didn’t change their pajamas, there would be a zero, and it’s conceivable that participants only changed their pajamas on 30% of the days they were in the study. 

We weren’t sure if the pajamas method could explain the distribution of the other digits. For the trailing digits that aren’t zero, 42% of them are 1’s, 27% of them are 9’s, 9% of them are 2’s, 8% of them are 8’s, and the remaining digits account for only about 3% each. This seems very strange.

You’ll recall that the DeltaDaily values record the changes in participant weights between consecutive days. Because the scale is only precise to 0.1 kg, the digit in the 0.01 place records information about the difference between two different pairs of pajamas. For illustration, in the example Kevin Hall provided, the participant switched between a pair of pajamas weighing 0.418 kg and a pair weighing 0.376 kg. These differ by 0.042 kg, so when the result is rounded to two decimal places, the DeltaDaily value on the day of the switch has a trailing digit of 4. 

We wanted to know if the pajama adjustment could explain why this trailing digit (the digit in the 0.01’s place of the difference between the weights of two pairs of pajamas) is 14x more likely to be a 1 than a 6, or 9x more likely to be a 9 than a 3. 

Verbal arguments quickly got very confusing, so we decided to run some simulations. We simulated 20 participants, for 28 days each, just like the actual study. On day one, simulated participants were assigned a starting weight, which was a random integer between 40 and 100. Every day, their weight changed by an amount between -1.5 and 1.5 in increments of 0.1 (-1.5, -1.4, -1.3 … 1.4, 1.5), with each increment having an equal chance of occurring. 

The important part of the simulation was, of course, the pajamas. Participants were assigned a pajama weight on day 1, and each day they had a 35% chance of changing pajamas and being assigned a new pajama weight. The real question was how to generate a reasonable distribution of pajama weights. We didn’t have much to go on, just the two values in the image that Kevin Hall shared with us. But we decided to give it a shot with just that information. Weights of 418 g and 376 g have a mean of just under 400 g and a standard deviation of about 30 g, so we decided to sample our pajama weights from a normal distribution with those parameters.
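Here is a sketch of that simulation in Python (a reconstruction for illustration, not our original script; the parameters are the ones described above):

```python
import random
from collections import Counter

def simulate_trailing_digits(n_participants=20, n_days=28,
                             pj_mean=0.400, pj_sd=0.030, p_change=0.35):
    """Simulate the pajama adjustment: scale weights precise to 0.1 kg,
    pajama weights to 0.001 kg, recorded body weights rounded to 0.01 kg.
    Returns the distribution of 0.01-place digits of the daily changes."""
    digits = Counter()
    for _ in range(n_participants):
        scale = float(random.randint(40, 100))       # starting scale weight, kg
        pj = round(random.gauss(pj_mean, pj_sd), 3)  # current pajama weight, kg
        prev = round(scale - pj, 2)                  # recorded body weight
        for _ in range(n_days - 1):
            scale += random.randint(-15, 15) / 10    # daily change in 0.1 kg steps
            if random.random() < p_change:           # sometimes switch pajamas
                pj = round(random.gauss(pj_mean, pj_sd), 3)
            weight = round(scale - pj, 2)
            delta = round(weight - prev, 2)          # the DeltaDaily variable
            digits[int(round(abs(delta) * 100)) % 10] += 1
            prev = weight
    return digits

random.seed(0)  # make the run reproducible
digits = simulate_trailing_digits()
```

With 20 participants and 27 daily changes each, this yields 540 simulated DeltaDaily values, and zero is reliably the most common trailing digit, just as in the reported data.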

When we ran this simulation, the resulting distribution of digits in the 0.01 place didn’t show the saddle shape we saw in the data from the paper:

We decided to run some additional simulations, just to be sure. To our surprise, when the SD of the pajamas is smaller, in the range of 10-20 g, you can sometimes get saddle-shaped distributions just like the ones we saw in data from the paper. Here’s an example of what the digits can look like when the SD of the pajamas is 15 g:

It’s hard for us to say whether a standard deviation of 15 g or of 30 g is more realistic for hospital pajamas, but it’s clear that under certain circumstances, pajama adjustments can create this kind of distribution (we propose calling it the “pajama distribution”).

While we find this distribution surprising, we conclude that it is possible given what we know about these data and how the weights were calculated.


When we took a close look at these data, we originally found a number of patterns that we were unable to explain. Having communicated with the authors, we now think that while there are some strange choices in their analysis, most of these patterns can be explained when we take into account the fact that pajama weights were deducted from scale weights, and the two weights had different levels of precision.

While these patterns can be explained by the pajama adjustment described by Kevin Hall, there are some important lessons here. The first, as Kevin notes in his comment, is that it can be very difficult to fully record one’s methods. It would have been better to include the full history of this variable in the data files, including the pajama weights, instead of performing the subtractions by hand and recording only the result. 

The second is a lesson about combining data of different levels of precision. The hair-comb pattern that we observed in the distribution of DeltaDaily scores was truly bizarre, and was reason for serious concern. It turns out that this kind of distribution can occur when a measure with one decimal place of precision is combined with another measure with three decimal places of precision, and the result is rounded to two decimal places. In the future, researchers should avoid combining data in this way, so as not to create artifacts like these. While it may not affect their conclusions, it is strange for the authors to claim that someone’s weight changed by (for example) 1.27 kg, when they have no way to measure the change to that level of precision.

There are some more minor points that this explanation does not address, however. We still find it surprising how consistent the weight change was in this study, and how extreme some of the weight changes were. We also remain somewhat confused by how often participants changed (or didn’t change) their pajamas. 

This post continues in Part Two over at Nick Brown’s blog, where he covers several other aspects of the study design and data.

Thanks again to Nick Brown for comparing notes with us on this analysis, to James Heathers for helpful comments, and to a couple of early readers who asked to remain anonymous. Special thanks to Kevin Hall and the other authors of the original paper, who have been extremely forthcoming and polite in their correspondence. We look forward to ongoing public discussion of these analyses, as we believe the open exchange of ideas can benefit the scientific community.

Statistics is an Excellent Servant and a Bad Master


Imagine a universe where every cognitive scientist receives extensive training in how to deal with demand characteristics. 

(Demand characteristics describe any situation in a study where a participant either figures out what a study is about, or thinks they have, and changes what they do in response. If the participant is friendly and helpful, they may try to give answers that will make the researchers happy; if they have the opposite disposition, they might intentionally give nonsense answers to ruin the experiment. This is a big part of why most studies don’t tell participants what condition they’re in, and why some studies are run double-blind.)

In the real world, most students get one or two lessons about demand characteristics when they take their undergrad methods class. When researchers are talking about a study design, sometimes we mention demand, but only if it seems relevant.

Let’s return to our imaginary universe. Here, things are very different. Demand characteristics are no longer covered in undergraduate methods courses — instead, entire classes are exclusively dedicated to demand characteristics and how to deal with them. If you major in a cognitive science, you’re required to take two whole courses on demand — Introduction to Demand for the Psychological Sciences and Advanced Demand Characteristics.

Often there are advanced courses on specific forms of demand. You might take a course that spends a whole semester looking at the negative-participant role (also known as the “screw-you effect”), or a course on how to use deception to avoid various types of demand. 

If you apply to graduate school, how you did in these undergraduate courses will be a major factor determining whether they let you in. If you do get in, you still have to take graduate-level demand courses. These are pretty much the same as the undergrad courses, except they make you read some of the original papers and work through the reasoning for yourself. 

When presenting your research in a talk or conference, you can usually expect to get a couple of questions about how you accounted for demand in your design. Students are evaluated based on how well they can talk about demand and how advanced the techniques they use are.

Every journal requires you to include a section on demand characteristics in every paper you submit, and reviewers will often criticize your manuscript because you didn’t account for demand in the way they expected. When you go up for a job, people want to know that you’re qualified to deal with all kinds of demand characteristics. If you have training in dealing with an obscure subtype of demand, it will help you get hired.

It would be pretty crazy to devote such a laser focus to this one tiny aspect of the research process. Yet this is exactly what we do with statistics.


Science is all about alternative explanations. We design studies to rule out as many stories as we can. Whatever stories remain are possible explanations for our observations. Over time, we whittle this down to a small number of well-supported theories. 

There’s one alternative explanation that is always a concern. For any relationship we observe, there’s a chance that what we’re seeing is just noise. Statistics is a set of tools designed to deal with this problem. This holds a special place in science because “it was noise” is a concern for every study in every field, so we always want to make sure to rule it out.    

But of course, there are many alternative explanations that we need to be concerned with. Whenever you’re dealing with human participants, demand characteristics will also be a possible alternative. Despite this, we don’t jump down people’s throats about demand. We only bring up these issues when we have a reason to suspect that it is a problem for the design we’re looking at.

There will always be more than one way to look at any set of results. We can never rule out every alternative explanation — the best we can do is account for the most important and most likely alternatives. We decide which ones to account for by using our judgement, by taking some time to think about what alternatives we (and our readers) will be most concerned about. 

The right answer will look different for different experiments. But the wrong answer is to blindly throw statistics at every single study. 

Statistics is useful when a finding looks like it could be the result of noise, but you’re not sure. For example, let’s say we’re testing a new treatment for a disease. We have a group of 100 patients who get the treatment and a control group of 100 people who don’t get the treatment. If 52/100 people recover when they get the treatment, compared to 42/100 recovering in the control group, does that mean the treatment helped? Or is the difference just noise? I can’t tell with just a glance, but a simple chi-squared test can tell me that p ≈ .16, meaning noise alone would produce a difference at least this large about one time in six, so on its own this result shouldn’t convince us of much.

That’s helpful, but it would be pointless to run a statistical test if we saw 43/100 people recover with the treatment, compared to 42/100 in the control group. I can tell that this is very consistent with noise (p > .50) just by looking at it. And it would be pointless to run a statistical test if we saw 98/100 people recover with the treatment, compared to 42/100 in the control group. I can tell that this is very inconsistent with noise (p < .00000000000001) just by looking at it. If something passes the interocular trauma test (the conclusion hits you between the eyes), you don’t need to pull out another test.
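The chi-squared arithmetic here is easy to reproduce without a statistics package; as a sketch, here is a minimal version for 2×2 tables (Pearson’s test, no continuity correction):

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]];
    returns (chi2, two-sided p) using the chi-squared distribution with 1 df."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-squared, df = 1
    return chi2, p

# 52/100 recover with treatment vs. 42/100 in the control group
chi2, p = chi2_2x2(52, 48, 42, 58)  # chi2 ≈ 2.0, p ≈ .16
```

The same function gives p > .50 for the 43-vs-42 table and p far below .00000000000001 for the 98-vs-42 table, matching the eyeball judgments above.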

This might sound outlandish today, but you can do perfectly good science without any statistics at all. After all, statistics is barely more than a hundred years old. Sir Francis Galton came up with the concept of the standard deviation in the 1860s, and the story with the ox didn’t happen until 1907. It took until the 1880s to dream up correlation. Karl Pearson was born in 1857 but didn’t do most of his statistics work until around the turn of the century. Fisher wasn’t even born until 1890. He introduced the term variance for the first time in 1918, but both that term and the ANOVA didn’t gain popularity until the publication of his book in 1925.

This means that Galileo, Newton, Kepler, Hooke, Pasteur, Mendel, Lavoisier, Maxwell, von Helmholtz, Mendeleev, etc. did their work without anything that resembled modern statistics, and that Einstein, Curie, Fermi, Bohr, Heisenberg, etc. etc. did their work in an age when statistics was still extremely rudimentary. We don’t need statistics to do good research.

This isn’t an original idea, or even a particularly new one. When statistics was young, people understood this point better. For an example, we can turn to Sir Austin Bradford Hill. He was trained by Karl Pearson (who, among other things, invented the chi-squared test we used earlier), was briefly president of the Royal Statistical Society, and was sometimes referred to as the world’s leading medical statistician. In the 1940s, he pioneered the introduction of the randomized clinical trial in medicine. As far as opinions on statistics go, the man was pretty qualified. 

While you may not know his name, you’re probably familiar with his work. He was one of the researchers who demonstrated the connection between cigarette smoking and lung cancer, and in 1965 he gave a speech about his work on the topic. Most of the speech was a discussion of how one can infer a causal relationship from largely correlational data, as he had done with the smoking-lung cancer connection, a set of considerations that came to be known as the Bradford Hill criteria.

But near the end of the speech, he turns to a discussion of tests of significance, as he calls them, and their limitations:

No formal tests of significance can answer [questions of cause and effect]. Such tests can, and should, remind us of the effects that the play of chance can create, and they will instruct us in the likely magnitude of those effects. Beyond that they contribute nothing to the ‘proof’ of our hypothesis. 

Nearly forty years ago, amongst the studies of occupational health that I made for the Industrial Health Research Board of the Medical Research Council was one that concerned the workers in the cotton-spinning mills of Lancashire (Hill 1930). … All this has rightly passed into the limbo of forgotten things. What interests me today is this: My results were set out for men and women separately and for half a dozen age groups in 36 tables. So there were plenty of sums. Yet I cannot find that anywhere I thought it necessary to use a test of significance. The evidence was so clear cut, the differences between the groups were mainly so large, the contrast between respiratory and non-respiratory causes of illness so specific, that no formal tests could really contribute anything of value to the argument. So why use them?

Would we think or act that way today? I rather doubt it. Between the two world wars there was a strong case for emphasizing to the clinician and other research workers the importance of not overlooking the effects of the play of chance upon their data. Perhaps too often generalities were based upon two men and a laboratory dog while the treatment of choice was deducted from a difference between two bedfuls of patients and might easily have no true meaning. It was therefore a useful corrective for statisticians to stress, and to teach the needs for, tests of significance merely to serve as guides to caution before drawing a conclusion, before inflating the particular to the general. 

I wonder whether the pendulum has not swung too far – not only with the attentive pupils but even with the statisticians themselves. To decline to draw conclusions without standard errors can surely be just as silly? Fortunately I believe we have not yet gone so far as our friends in the USA where, I am told, some editors of journals will return an article because tests of significance have not been applied. Yet there are innumerable situations in which they are totally unnecessary – because the difference is grotesquely obvious, because it is negligible, or because, whether it be formally significant or not, it is too small to be of any practical importance. What is worse, the glitter of the t-table diverts attention from the inadequacies of the fare. Only a tithe, and an unknown tithe, of the factory personnel volunteer for some procedure or interview, 20% of patients treated in some particular way are lost to sight, 30% of a randomly-drawn sample are never contacted. The sample may, indeed, be akin to that of the man who, according to Swift, ‘had a mind to sell his house and carried a piece of brick in his pocket, which he showed as a pattern to encourage purchasers.’ The writer, the editor and the reader are unmoved. The magic formulae are there. 

Of course I exaggerate. Yet too often I suspect we waste a deal of time, we grasp the shadow and lose the substance, we weaken our capacity to interpret the data and to take reasonable decisions whatever the value of P. And far too often we deduce ‘no difference’ from ‘no significant difference.’ Like fire, the chi-squared test is an excellent servant and a bad master.


We grasp the shadow and lose the substance. 

As Dr. Hill notes, the blind use of statistical tests is a huge waste of time. Many designs don’t need them; many arguments don’t benefit from them. Despite this, we have long disagreements about which of two tests is most appropriate (even when both of them will be highly significant), we spend time crunching numbers when we already know what we will find, and we demand that manuscripts have their statistics arranged just so — even when it doesn’t matter.

This is an institutional waste of time as well as a personal one. It’s weird that students get so much training in statistics. Methods are almost certainly more important, but most students are forced to take multiple stats classes, while only one or two methods classes are even offered. This is also true at the graduate level. Methods and theory courses are rare in graduate course catalogs, but there is always plenty of statistics.

Some will say that this is because statistics is so much harder to learn than methods. Because it is a more difficult subject, it takes more time to master. Now, it’s true that students tend to take several courses in statistics and come out of them remembering nothing at all about statistics. But this isn’t because statistics is so much more difficult. 

We agree that statistical thinking is very important. What we take issue with is the neurotic focus on statistical tests, which are of minor use at best. The problem is that our statistics training spends multiple semesters on tests, while spending little to no time at all on statistical thinking. 

This also explains why students don’t learn anything in their statistics classes. Students can tell, even if only unconsciously, that the tests are unimportant, so they have a hard time taking them seriously. They would also do poorly if we asked them to memorize a phone book — so much more so if we asked them to memorize the same phone book for three semesters in a row.

Understanding these tests requires statistical thinking, but we don’t teach students that. We’ve become anxious around the tests, and so we devote more and more of the semester to them. But this is like becoming anxious about planes crashing and devoting more of your pilot training time to the procedure for making an emergency landing. If the pilots get less training in the basics, there will be more emergency landings, leading to more anxiety and more training, etc. — it’s a vicious cycle. If you just teach students statistical thinking to begin with, they can see why it’s useful and will be able to easily pick up the less-important tests later on, which is exactly what I found when I taught statistics this way.

The bigger problem is turning our thinking over to machines, especially ones as simple as statistical tests.

Your new overlord.

Sometimes a test is useful, sometimes it is not. We can have discussions about when a test is the right choice and when it is the wrong one. Researchers aren’t perfect, but we have our judgement and damn it, we should be expected to use it. We may be wrong sometimes, but that is better than letting the p-values call all the shots. 

We need to stop taking tests so seriously as a criterion for evaluating papers. There’s a reason, of course, that we are on such high alert about these tests — the concept of p-hacking is only a decade old, and questionable statistical practices are still being discovered all the time. 

But this focus on statistical issues tends to obscure deeper problems. We know that p-hacking is bad, but a paper with perfect statistics isn’t necessarily good — the methods and theory, even the basic logic, can be total garbage. In fact, this is part of how we got in the p-hacking situation in the first place: by using statistics as the main way of telling if a paper is any good or not! 

Putting statistics first is how we end up with studies with beautifully preregistered protocols and immaculate statistics, but deeply confounded methods, on topics that are unimportant and frankly uninteresting. This is what Hill meant when he said that “the glitter of the t-table diverts attention from the inadequacies of the fare”. Confounded methods can produce highly significant p-values without any p-hacking, but that doesn’t mean the results of such a study are of any value at all. 

This is why I find proposals to save science by revising statistics so laughable. Surrendering our judgement to Bayes factors instead of p-values won’t do anything to solve our problems. Changing the threshold of significance from .05 to .01, or .005, or even .001 won’t make for better research. We shouldn’t try to revise statistics, we should use it less often. 

Thanks to Adam Mastroianni, Grace Rosen, and Alexa Hubbard for reading drafts of this piece.

You Make My Head Hurt

“Catastrophic failure [of the unhelmeted skull] during testing…experiencing a maximum load of 520 pounds of force,” says the Journal of Neurosurgery: Pediatrics.

According to NASA, the average push strength of an adult male is about 220 lbs of force, with a standard deviation of 68 lbs. If Gregor Clegane were three standard deviations from the mean (the top 0.1%), he would be able to produce about 424 lbs of force, which is not quite enough. He would need to be about 4.4 standard deviations above average to crush a skull with his bare hands.

This is pretty extreme, but if strength is normally distributed in Westeros, Gregor would only be about 1 in 195,000. Another way of saying this is that if one baby were born every day, a man as strong as this would come around about every 530 years. Since birth rates are much higher than that, it’s not impossible.

This is also consistent with what we know about Gregor in general. He’s described as being nearly eight feet tall, or 96 inches. The average height of men in the United States is about 70 inches, with a standard deviation of 4 inches. This means that Gregor is about 6.5 standard deviations taller than average. It seems likely that he would be similarly above average in terms of his strength.
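This back-of-the-envelope calculation is easy to check with the standard normal tail function (using NASA’s 220 lb mean and 68 lb standard deviation from above):

```python
import math

def normal_sf(z):
    """Upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z_needed = (520 - 220) / 68        # SDs above the mean to reach 520 lbs (~4.4)
rarity = 1 / normal_sf(z_needed)   # about 1 in 195,000
```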

Verdict – it is statistically possible that someone strong enough to crush a skull with their hands exists in Westeros, and Ser Gregor is a good candidate for the role.

Hindsight is Stats 2020, Part III: Final-First Exams

[This is Part III of a retrospective on teaching statistics over summer 2020. Part I and Part II.]

Exams were my white whale for this course.

My design goals were clear. Someone who knows their stuff should be able to prove what they know and walk out of the class. Students should be encouraged to learn as fast as they can, and they should be rewarded for getting ahead of the class if they want to. And there should be almost no consequences for failure, so that students can experiment without torpedoing their grade.

But exams are famously plagued with problems. Rescheduling exams for students who are sick or have to miss a day. Deciding who gets to do make-up exams. The endless questions about exam format — “professor, will this be on the final?” Somehow, we complain about all this but take it for granted. Why not come up with a way to make these problems a thing of the past?

1. Final-First Exams

These days, professors have gotten more comfortable experimenting with exam formats. Lots of exams are open notes, open book, or even take-home. Some classes let you drop your lowest exam score. I’ve even heard of professors giving five exams and dropping your worst two.

Dropping tests is cool, because it fixes some of the classic problems. Have to miss an exam? No problem, just drop that one. No need for make-up exams. If you bomb an exam, just drop it.

This is the right direction, but we can do better. What else can we tinker with, to make exams even better?

I thought back to the traditional cumulative format, and why it doesn’t work for teaching. Why save all the material for one cumulative exam at the end? Doesn’t that just serve to obscure your expectations? My class format was fractal, so that students could see what’s coming and know what’s expected of them. Why not use this approach with exams, too?

Dropping one exam isn’t cool. You know what’s cool? Dropping ALL the exams.

I call the format Final-First, because your first exam is a final exam. In fact, every exam is a final exam, meaning every exam covers all of the material covered in the whole course. The exams have nearly identical formats, differing only in the particulars. I swap out the numbers and some of the details on the questions, but once you’ve seen one final, you have a pretty good sense of all of them.

This course was six weeks long, and I gave them a final exam at the end of every week. This means they had a final exam at the end of Week 1, at the end of Week 2, at the end of Week 3, and so on…

Since these were all final exams, I didn’t expect that most of them would do very well on the first one. But that’s ok, because we dropped all their exam scores except for the best one. The exam component of their course grade was based entirely on their best exam; the other exam grades didn’t contribute at all.

If a student gets a 90% on the third final, it doesn’t matter how they did on the first two. Why should a student suffer if they get a 10% on the first exam but manage to nail it with a 90% later on? Clearly that student has done a great job and learned all the material we wanted them to, even though they struggled at first. In fact, isn’t that more impressive?

This format has some great features, which are beautifully in line with my design goals:

  • Good Incentives: If you understand the material quickly, you should be rewarded. Students who succeed are rewarded with more freedom. No one who has mastered the material should be forced to go through the motions. If you get a grade you’re happy with, you can choose to skip the rest of the exams with no downside.
  • Safety Net: Each exam offers a new chance to set a minimum threshold for your grade. Once you get an 85 on one exam, you can rest easy that your grade won’t go any lower. With this design there are no consequences for failure. You can bomb (or miss) as many exams as you want without any risk to your final grade.
  • Low Anxiety: Students who are able to get a good grade on one of the early exams will be able to worry about things other than cramming for the next exam. Maybe they’ll use it to study more, or maybe they’ll just go to the beach. I don’t care. If you can get an 80 on the final exam in week two of a six-week class, you deserve to go to the beach.
  • Transparency: With this format, there’s no more need for, “what will be on the test?” Once you have taken the first final, you will know (approximately) the format of all the other finals. This has the added benefit of:
  • Context: Seeing all the material at once will allow you to begin building a tapestry of ideas in your head. You will never be blindsided by new material, things you didn’t realize were expected of you. Once you’ve seen one final exam, you’ve seen them all, and being exposed to all the material early on will help you learn it better.
  • Feedback: You will be able to tell what skills you have mastered and which you need to work on. This will allow you to spend your study time wisely. Previous exams become a great tool for review. You can go over your performance with the TA or professor and be able to see exactly what you need to work on for the next exam, because the next exam is so similar.

I was really happy with this design. It hit all of my design goals, and it resolves a lot of the classic problems with exams.

Other people liked the idea too. I was on a date with a PhD student and we were talking about teaching, so I told her about this design. She said, “that sounds a bit insane upfront, but not so much when you think about it.”

Now there was nothing to do but try it out. For this class, I made the exam 50% of the final grade. Normally, making a single evaluation a huge chunk of the grade is unfair. But with this format, the exams are the best one of six evaluations, and besides, the exams test what I really want them to know.

1.1 The Results

Final-First exams worked really, really well.

I was worried that students would be confused by the format, or would be terrified when they failed the first exam, but I actually got very few questions about it. Students seemed to understand what I was trying to do.

It really did solve all the usual exam problems. No one ever asked me for a makeup exam. Only once did I have to clarify what would be on the exam. When students wanted to meet to go over their answers, we were able to make real progress, because it was immediately clear to me what parts of the material they had mastered and what they were still struggling with. In many cases we could look back over two or three different exams and see the same thing tripping them up every time over multiple weeks.

Most people improved steadily over time. The average grade went from 60% on Exam 1 (this was by design; see below) to 85% on Exam 6. Students took the exams pretty freely. Some of them took every exam, but on average they took only 4 of the 6 exams.

A few students actually got their best grade quite early on. On the first final, at the end of the first week of class, the highest grade was an incredible 88% (!!!). This student kept taking exams, though, and was able to eventually beat her record with a 92.5% on Exam 5.

The student who got the second-highest score on Exam 1 got an 84%, again very high for having taken only three classes. This student chose to skip most of the other exams. He did take Exam 5, but only got a 75.5%, so in the end his final grade was actually based on his exam score from the first week of class!

I was a little surprised that more students didn’t try to get a great grade early on. When I think about this format, one of the most exciting things to me is the idea that you can teach yourself all the material, get ahead of the class, get a great exam grade halfway through, and not have to show up to class anymore. But while a few students got great scores on Exams 3 and 4, that was the exception. It might be different in a semester-long class. Six weeks is just not much time to teach yourself, even if you really commit to it!

These are extreme cases of the safety net working as intended, but the design worked equally well for students with less extreme grades. To my surprise, only 26 of the 39 students took Exam 6, the final final exam. I think this means that by the end of the class, many of them were satisfied enough with their exam grade that they chose not to take this last final. Of those who did take Exam 6, only 18 got a better grade on the final final than on any previous final, which means that 8 people didn’t improve their grade at all on the final final.

The best exam grade in the entire course, a 97.5%, was actually earned on Exam 5. Perhaps unsurprisingly, that student chose not to take Exam 6.

These grades are really impressive, because the exams were not easy. I came in with specific expectations of what a student should know by the end of intro stats. These expectations were reasonable, but they were also pretty high. We expect too little of undergrads, and we underestimate what they are capable of doing and understanding.

I didn’t change my expectations at all during this course. Every student who earned a 90% on an exam met my expectations, and every student who did better than that exceeded my expectations. In my opinion, a good grade means that they mastered the material.

1.2 Student Opinion

Students really liked the exams. Some of the most positive feedback was about this part of the class. Take a look:

“This was one of my favorite aspects of the course because it genuinely did relieve a lot of stress. My biggest fears for this course revolved around completing it and not only doing poorly, but also learning nothing. I think the weekly exams allowed me to continually refresh and apply what we had reviewed without the anxiety of failing the course.”

“I thought the idea of getting graded based on the best exam was exceptional since we learn more as we continue taking the class.”

“To be honest, this is the best [exam] format I’ve ever taken! It really gives me the motivation to study harder each time without getting too stressed out.”

Other comments were much the same. As you’ll notice, the experience students had with the format was exactly the experience I was aiming for. A few other notes of interest were:

“I found myself studying ahead of time to supplement the material I have not learned yet”

“Towards the end it was fine, but the first few were pretty stressful for me.”

The one complaint, which I did see a few times, was that the exams tested them with questions they didn’t recognize and hadn’t seen before. But of course, this was by design, because I wanted to see if they really understood the concepts.

Some students seemed to understand this, with one noting, “[Jeff] helped us prepare as best as we could without actually giving us the answers.” And once again I’ll point to their excellent exam grades as proof that the difference in format wasn’t actually a problem.

2. Exam Design

This format is certainly the most interesting part of the exams. But the design of the exams and the exam questions is worth discussing as well.

The Final-First exam format doesn’t work if you don’t pay close attention to the design of the exams. Exams need to be nearly identical, so that students always know what’s coming on the next one. But they can’t be too similar, or else students will memorize them by rote. You need to keep mixing it up.

I had a plan for the exams going in. As I argued in What You Want from Tests, exams should be used to test the knowledge that students carry around in their heads, the bits that an expert will internalize. That’s what I was aiming for in this class. Research reports would cover their ability to actually do stats, and exams would cover their memory and intuition for the most important concepts.

Then, of course, the whole course was forced online. Immediately I knew that this meant that exams would de facto be open book, open notes, and really, open Google. So I knew that I would have to pivot away from my original plans. I couldn’t just focus on internalized knowledge.

(I never explicitly told students that the exams were open notes, but I never told them not to look things up either.)

I actually think this ended up improving the exams. I stand by what I said in What You Want from Tests, but it can be more complicated than I imply in that essay.

2.1 Exam Structure

The structure of the exams mirrored the structure of the course — after all, every exam was a final. Each exam was 50 points in total: 15 points covered basic data skills, 15 went to descriptive statistics, and 15 went to the use and interpretation of inferential statistics, the same three sub-topics the course was divided into.

The remaining 5 points went to what I called “advanced topics”. These were questions about things we mentioned in lecture but were slightly outside the scope of the class, more complex questions about the use of core concepts, or questions that tested their intuitions in ways that we had hinted at, but hadn’t explicitly discussed.

An interesting feature of this is that a student who mastered all the core material, but hadn’t yet achieved that deeper understanding, would only get a 90% on the exam, because the advanced section was the last 10% of the exam grade. A grade of higher than 90% means that a student understood not only all of the material at the expected level, but was making progress into understanding it more completely.

This is why I am so confident that the students who got above a 90% on their exam grade not only met my standards, they exceeded them. That last ten percent came from questions that were, by design, more difficult than an intro stats student should be able to answer.

2.2 Exam Difficulty

Maybe other teachers already know this, but I had never realized how much control a teacher has over the difficulty curve of an exam. I knew that a professor could make an exam more or less difficult, but I didn’t appreciate how much you can shape the distribution of scores.

This was particularly important for a class using the Final-First exam format. In this system, most students take a final exam in Week 1, and of course most of them will bomb it. There’s a big difference in morale, however, between bombing an exam with 50% and bombing it with 5%!

I wanted to encourage students to do well. I wanted to make sure they felt like they could succeed from the very beginning. To make this happen, I designed the exam so that it was easy to get a decent score, but hard to get a great score. (For those of you who are statistically inclined, compare item response theory.)

(This is also how I asked Liz to grade the research reports. Make it easy to get a decent grade but hard to get a perfect grade, I said.)

I had already decided that 15 points, or 30% of the exam, was devoted to data skills. This stuff is pretty easy, and so I knew that most students would be getting a good chunk of points from this section right from the start. In the other two sections, I made sure to include a couple easy questions, to keep the baseline grade relatively high.

The fact that the average score on Exam 1 was 60% shows that I was successful. In fact, even in Week 1, the lowest exam grade was a 40%. That doesn’t sound like much, but considering that we were only 17% of the way through the class, I think it’s pretty good.

I used some other tricks for this as well. One was that the exam was almost entirely multiple-choice. A classic problem with multiple-choice questions is that students always have a decent chance of getting the right answer by just guessing. For example, a student guessing on a multiple-choice question with four answers will get the right answer 25% of the time. An exam with nothing but 4-answer multiple-choice questions has a baseline grade of 25%. It’s even worse for an exam that’s all true/false, which has a baseline of 50%. This is why, up until 2016, the SAT (which had five answer choices per question) took off 1/4 of a point for each wrong answer. Statistically, it meant that a student who did nothing but guess would expect a score of about zero.

But we can turn this same force to our advantage. To adjust the baseline score, I can change the number of answers I include for my multiple choice questions. This is exactly what I did. For the Data section, which I wanted to be a score-booster, all the multiple choice questions had only a few answers each. For the Advanced section, where I wanted students to earn points only if they really knew their stuff, most of the multiple choice questions had 8 or more response options! And for the other sections, which I wanted to land somewhere in between, I included a mix.
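As a sketch of how this tuning works, here is the expected guessing baseline for a hypothetical exam blueprint (the 15/15/15/5 section totals match the ones described above; the per-section question counts and option counts are invented for illustration):

```python
# Hypothetical exam blueprint. The 15/15/15/5 section totals are real;
# the per-section question counts and option counts are invented.
# Each entry: (number of questions, points each, answer options each).
blueprint = {
    "Data":        (5, 3, 3),   # few options: guessing earns real points
    "Descriptive": (5, 3, 5),
    "Inferential": (5, 3, 5),
    "Advanced":    (5, 1, 8),   # many options: guessing barely helps
}

def guessing_baseline(sections):
    """Expected fraction of total points for a student guessing every question."""
    total = sum(n * pts for n, pts, _ in sections.values())
    expected = sum(n * pts / opts for n, pts, opts in sections.values())
    return expected / total

print(f"Baseline from pure guessing: {guessing_baseline(blueprint):.0%}")
```

Shrinking the option counts in the Data section raises the floor; inflating them in the Advanced section keeps those points honest.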

Of course, there are limits to how lenient we want to be. In particular, true/false questions seem too easy — a baseline of 50% just from guessing is way too high. One idea that I really like is True / False / Can’t Tell questions. At a shallow level, these are just true/false questions with three options instead of two. But at a deeper level, this encourages students to engage with the question in a new way. Instead of just determining which answer is right, they have to think about whether they even have enough information to make that call. It literally adds another dimension to the question. This is especially well-suited to statistics, which is all about making informed guesses based on limited information.

I used a similar approach in some of my short answer questions. I’ve noticed that in class, students are often much more comfortable telling you why something is wrong than trying to give you the right answer themselves. I translated this into “What’s wrong with…” questions. Students would be given a short paragraph that described some statistics. In each case I had inserted an error into the paragraph. For example, sometimes I would say that a variable wasn’t skewed, but I would report a mean and median that were strikingly different. Students would have to pick out the mistake and tell me why it was wrong.

This is a really important skill in real life. A big part of the practice of using stats as a scientist is noticing when something is wrong in an analysis, whether you’re checking your own analysis or looking over someone else’s work.

I included one of these questions in the Data section for almost every exam, since they are a good way to ask about data features like skew and range without just asking students to regurgitate the definitions. I also included a few in the Descriptive Statistics sections, and I think that added some nice variety. You know a student doesn’t understand correlation when you report r = 1.2 and they don’t catch it.

I realize now that I never included any of these questions about inferential statistics. This was a mistake, since catching errors in the reporting of tests is something that comes up all the time. If I taught this class again, I would put “What’s wrong with…” questions in all three sections of the exam.

Another way to control exam difficulty is with paired questions. You include two questions about the same topic, but one is easy, and one is harder. For example, in my descriptive statistics sections, I always included two questions where I described some data and asked students what plot or chart they should use to represent that data. By design, the first of these was always pretty easy, and the second was, while not exactly hard, a more sincere test of their understanding.

This has some great features. First, it helps raise their baseline score. A student who understands the idea even a little will usually get the first question right, and this will boost their grade. They essentially get partial credit on that concept, even though the question is multiple choice. (They say you can’t give partial credit on multiple choice questions, but what do they know?) But a student only gets full credit if they can answer the more challenging question. Again we see that the design makes it easy to get a decent grade, but hard to get a perfect grade.

Second, it helps with feedback. For any topic on the exam, if a student gets neither question right, they clearly do not understand the topic at all. If they get the easy one right but not the harder one, they understand the basics but haven’t quite got the whole idea. And if they get both right, it’s clear they understand it at the level I want them to. If they somehow get the hard question right and the easy question wrong, this tells you that they were probably guessing. You can look at the exam and see exactly how students are doing with each of the core skills.
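The paired-question diagnostic above is simple enough to write down as a lookup table (a toy illustration, not code I actually used):

```python
# The paired-question diagnostic as a lookup table.
# Keys: (got the easy question right, got the hard question right).
DIAGNOSIS = {
    (False, False): "does not understand the topic",
    (True,  False): "understands the basics only",
    (True,  True):  "understands at the expected level",
    (False, True):  "probably guessing",
}

def diagnose(easy_correct, hard_correct):
    return DIAGNOSIS[(easy_correct, hard_correct)]

print(diagnose(True, False))   # prints "understands the basics only"
```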

2.3 Difficulty Over the Course of the Class

As important as the difficulty curve within an exam is, it’s also worth thinking about the difficulty curve over time. Part of the reason to make an exam easy to pass but hard to ace is that this is good for student morale, while still being an accurate measure of ability. With the Final-First format, the same thinking applies across the whole sequence of exams.

Students shouldn’t get a good grade on the first final unless they really know their stuff. Early on, exam grades should be pretty low. But if exam grades go down with every exam, or even if they fail to go up, that’s bad for morale. It tells the students that they aren’t learning anything from the class. That shouldn’t be true, and even if it is, you shouldn’t be telling them that!

My recommendation is that your hardest exam should go first, and your easiest exam (still staying true to what you want them to get out of the class) should go last, with the other exams in order of difficulty in between. And of course, for the reasons described above, your hardest exam should still be designed so that on average students do decently on it. If the average score on the first final is less than 50%, you’ve probably done something wrong.

One thing that I would like to do someday is create a way to generate exams automatically. These exams are formulaic by design, so it would be relatively easy to write a script that would mix & match components and spit out as many exams as you want. Not only could this make the exams more fair and regular, you could do things like share multiple practice exams with your students.
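A rough sketch of what such a generator might look like (the question templates here are invented examples, not my actual exam questions):

```python
import random

# Each question is a template that gets filled with fresh numbers.
# The templates below are invented examples, not the actual exam questions.
TEMPLATES = [
    lambda rng: (
        f"A variable has mean {rng.randint(45, 60)} and median "
        f"{rng.randint(10, 25)}. What's wrong with calling it unskewed?"
    ),
    lambda rng: (
        f"You compute r = {rng.choice([1.2, -1.4, 2.0])} between two "
        f"variables. What's wrong with this result?"
    ),
]

def generate_exam(seed):
    """Produce one exam variant; the same seed always yields the same exam."""
    rng = random.Random(seed)
    questions = [make(rng) for make in TEMPLATES]
    rng.shuffle(questions)
    return questions

for i, question in enumerate(generate_exam(seed=1), start=1):
    print(f"{i}. {question}")
```

Seeding makes each variant reproducible, so the same script can stamp out graded exams and practice exams alike.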

3. Exams Online

As with everything else, I was worried about exams being online. There were the concerns around cheating, as I mentioned above, and also just around giving an exam remotely.

I was wrong. Holding exams online is one of the best things I’ve ever done for a class. It was so easy that I am seriously considering using online exams for in-person classes in the future.

I ended up running all my exams through Qualtrics, a survey software I use in my research. Qualtrics is flexible and it has a lot of nice features that are helpful for exams, but I suspect you could run online exams with other survey platforms.

Exams were run every week. Since my students were located all around the world, and since many of them had jobs or other responsibilities, I opened the exam for a full 24 hours. Lectures were Monday / Tuesday / Wednesday, and every week the exam was open from 5:00 pm Eastern on Thursday to 5:00 pm Eastern on Friday. Using the survey software, it was easy to have it open all day and let them drop in whenever they wanted. I also liked how this didn’t cut into class time.

Qualtrics automatically records the time when a session is opened and when it is submitted, so I used that to time their exams. The exam would begin as soon as a student clicked on the link, since that prompted Qualtrics to record the session start. I recommended that they time themselves to ensure that they didn’t go over. We compared their start and their submit times to see if they followed directions. Some of them did go over by a little, but we were lenient, and graded those exams too. To my surprise, no one tried to sneak in a much longer exam session.
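The timing check can be sketched in a few lines (the timestamp format, the field layout, and the five-minute grace period here are assumptions for illustration; the actual Qualtrics export fields differ):

```python
from datetime import datetime, timedelta

# Sketch of the start/submit check. The timestamp format, field layout,
# and five-minute grace period are assumptions for illustration.
LIMIT = timedelta(minutes=45)
GRACE = timedelta(minutes=5)    # we were lenient about small overruns
FMT = "%Y-%m-%d %H:%M:%S"

sessions = [
    ("student_a", "2020-07-09 18:02:11", "2020-07-09 18:44:50"),
    ("student_b", "2020-07-09 20:15:00", "2020-07-09 21:07:30"),
]

for name, start, end in sessions:
    elapsed = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    if elapsed > LIMIT + GRACE:
        print(f"{name}: over by {elapsed - LIMIT}, review before grading")
    else:
        print(f"{name}: ok ({elapsed})")
```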

After some pilot testing with my sister, I ended up making the exam only 45 minutes long. This isn’t much time, but I figured it would be easy to add time later if I had to. I was worried that students would complain, and fully expected that I would have to bump it up to 60 minutes after the first few exams. But this worry ended up being unfounded too. I didn’t get any complaints about the exam length — students never mentioned it! — and so I kept it 45 minutes long for the whole course.

Short exams also fit my design goals. There’s no need to belabor an examination. As long as it’s accurate, it should be as short as possible. Once again, I imagined how it would be if, through some horrible clerical error, I was forced to take the class myself. I knew I would be able to ace the exam in about 15 minutes, so I wouldn’t be forced to waste more than a tiny amount of time. That’s how it should be.

Running exams online also gave us huge benefits on the backend. Exams were incredibly simple to grade. Once all the responses were in, I would take the exam myself, putting in all the right answers and writing ANSWER KEY in the name field at the end. Then, when Liz downloaded all the responses for grading, she could just use Excel functions to compare each student’s answers to the responses I put for the answer key, and automatically assign points that way. There were always a few short-answer questions to grade by hand, but the majority of the grading, for every single student, could be accomplished in just a few minutes.
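The grading step Liz did in Excel can be sketched in Python (the column names and responses here are invented; the logic, comparing each row to the ANSWER KEY row, is the same):

```python
import csv
import io

# The real grading used Excel formulas on the Qualtrics download; this is the
# same logic in Python. Column names and responses are invented. One row per
# student, plus a row whose name field reads "ANSWER KEY".
export = io.StringIO("""name,q1,q2,q3
ANSWER KEY,B,D,True
alice,B,D,False
bob,B,A,True
""")

rows = list(csv.DictReader(export))
key = next(r for r in rows if r["name"] == "ANSWER KEY")
questions = [col for col in key if col != "name"]

for r in rows:
    if r["name"] == "ANSWER KEY":
        continue
    score = sum(r[q] == key[q] for q in questions)
    print(f'{r["name"]}: {score}/{len(questions)}')
```

With one comparison formula per question, the same trick works in any spreadsheet.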

And unlike working with scantron or paper forms, there is no headache when it comes to digitizing the results. Answers and scores were in a spreadsheet from the beginning.

It was easy to make answer keys for the same reason. Admittedly I didn’t know this at first — all the credit goes to Liz. It turns out that you can make Qualtrics generate a PDF of all the answers given by a specific person, so all we had to do was get it to spit out the ANSWER KEY responses and, surprise, there was the answer key. Again your mileage may vary, but online systems can be very powerful.

The online format does offer students the opportunity to cheat. But as I already mentioned, I don’t think they did, and I don’t think it would matter either way. There are things you could do to help prevent this, if you were worried, like giving a narrower exam window or putting out multiple versions of the exam to prevent crosstalk, the sorts of things we already do in the classroom. You could make projects a bigger part of their grade. But I think it’s to everyone’s advantage to trust the students.

With a well-designed exam, it will be easier to learn the material than it will be to cheat. The same goes for open notes. If you make a good exam, it will actually be quicker for students to leave their notes closed.

4. What I Didn’t Get To

I got to put almost everything I wanted to in this course, but there were a few things I missed.

I’ve always wanted there to be a bigger role for teams, but the teams in this class didn’t work very well. It seems like there should be ways to encourage students to help one another out, reward them for working together. But all the ideas that come to mind, like giving students bonus points for helping their teammates, have obvious problems. So while I want to incentivize teamwork and peer support, I haven’t come up with a way to make it happen yet.

Students would also really benefit from giving and watching presentations. I was able to do this for my RA, and it’s clear to me that she gained a lot from making the presentations and from getting feedback. Criticizing presentations and giving feedback is also good practice for statistical literacy, and it might be less intimidating for the average student.

But it would be difficult to have every student give a presentation. It’s probably impossible for large class sizes, and it doesn’t seem like it would work well online. During the semester, you might be able to do it in recitation, either for extra credit, or in small teams.

But the real problem is that giving a single presentation is like answering a single math problem. It’s just not that much practice. Unless the class size were very small, you probably couldn’t set it up so that every student got to present multiple times. This might be better suited to an advanced course. The breakout room activities, given that they include small and regular “presentations”, might be the best we can do here.

5. Concluding Remarks

I’ve heard a lot about the things you can and can’t do when teaching stats. I’ve heard that you can’t get students to pay attention. That you can’t make them care about the subject. That they’re all cheating on their assignments. That they aren’t smart enough to learn how to use statistical software on their own.

Things are bad in education today, but they’re not bad because of lack of funding, or because students are unmotivated. Things are bad because educators lack vision.

What else do you call it when everyone knows what the problems are, but no one manages to dream up solutions? We have the ability to make education work for us, and nothing special is required, just careful thought and patient experimentation.

In particular, there are huge gains to be had in developing approaches that let students and teachers stress less over the material and waste less time. This may free them to spend more time learning, but it may also free them to have a life outside the classroom. A class with more hours of homework, longer tests, and more fiendish questions is not a better class. In most cases it is a worse one.

What could be better than learning more, with less effort, and in less time? Let us celebrate academic laziness. Perfection comes not when there are no more assignments to add, but when there are no more assignments to take away.

Students have almost no control, of course, but it’s confusing how teachers continue to design classes with backbreaking grading loads for themselves. Just give fewer assignments, shorter assignments, assignments that are easier to grade. You can do this without making your class worse. In fact, you can do it while making your class better.

So many teachers teach classes that they themselves would hate. If you wouldn’t want to take your class, if you wouldn’t find it easy, then what are you doing? It seems unnecessarily cruel to me. Make your classes enjoyable. If you can’t make them enjoyable, at least make them easy. If you can’t make them easy, at least make sure they’re not a huge pain.

So many teachers are paranoid about students cheating, collaborating, or doing too well on tests. Are you a teacher, or a mall cop? When classes are fair, students don’t cheat. Even when classes are rigged, most students still refuse to cheat. Taking this approach creates a system where the most honest students are the ones who have the most to lose. I have seen too many honest students fail what should have been an easy class.

It’s August as I’m writing this, and online I have seen many examples of college professors sharing heavy-handed “how to be ok pages” or “COVID pages” that they plan to attach to their syllabi for the fall semester. These pages contain assurances that you can come to the professor with anything, that you can get extra time when you need it, and so on. Professors love these pages because it makes them feel like they’re doing something to make a difference. But these promises are hot air and all your students know it. If the structure of your class is cruel, this kind of statement becomes a sick joke. And if the structure of your class is kind, then you don’t need a page at the front of your syllabus trumpeting it. It’s the fundamental rule of communication: show, don’t tell. Put your good intentions in the structure of your class or not at all.

Just make a class that doesn’t suck.

Hindsight is Stats 2020, Part II: Design Goals & Grades

[This is Part II of a retrospective on teaching statistics over summer 2020. Part I is here.]

Grades are stupid. But at the end of the day, my university forces me to give everyone a final grade. And you do want to evaluate your students based on something, so they can know what they mastered and how they can still improve.

1. Design Goals

To begin with, I tried to work out my design goals. I started by thinking about the ways that classes normally fail and decided to work backwards from there.

One of the most blatant failures in the education system is when students are forced to take a class that they’ve already taken, or on a subject they already know. So my first goal was that someone who really knows the topic should be able to get a 100 with very little effort. There’s an easy way to check if this works: the course should be designed so that if, as the professor, I were to take it, I would ace it easily.

And not just ace it. Someone who really knows the material should, after demonstrating their knowledge, be able to walk out of the course entirely and never have to come back. Once you know the material, you shouldn’t be forced to waste your time regurgitating it.

A related problem is forcing students to waste time on concepts they already understand; or, conversely, moving on to new material before a student is ready. This is tricky because students really do learn things at different speeds. We can’t tailor the lectures to every student, but we can do things to help. Students should be given freedom to focus on the problems they find challenging. Once a student has mastered something, we should try not to bother them about it.

Similarly, most classes don’t incentivize students to learn things on their own. There’s no point getting ahead of the rest of the class. You’ll just be bored, and it might even hurt you, since it takes time away from cramming the old material. This is a perverse incentive. If a student is ready to go further on their own, we should let them.

Basically, if a student wants to speedrun my class, who am I to complain? Let them do it.

Another classic way that classes screw up is by making students afraid of failure. With traditional grading, students have no room to experiment with different ways of learning, understanding, and studying. The class format requires them to obsess about every evaluation, and encourages them to do the minimum amount required to get the grade, to take no risks. If they try something interesting and fail, their GPA plummets. This leads students to obsess over pointless minutiae like what precisely is on the test and exactly how to word their answers.

I wanted to save them the time they spend thinking about this nonsense. If they choose to spend that saved time studying, so much the better. If they don’t, then all we are losing is their anxiety. Either way, we should reward students for taking risks and attempting to go deeper with the material, not punish them.

In the end I came up with three ways to evaluate student progress.

First, I had a system to replace class participation and attendance, based on small team activities, which counted for 30% of the final grade.

Second, I had students independently analyze two simple datasets of their choice, and write up a report about each. Together the two reports counted for 20% of the final grade.

Third, I invented a new exam format (covered in the next post), which counted for 50% of the final grade.
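Spelled out as arithmetic, the scheme is trivial. Here’s a minimal Python sketch (the component names and helper function are mine, just for illustration):

```python
# Sketch of the grading weights described above; names are illustrative.
WEIGHTS = {"team": 0.30, "reports": 0.20, "exams": 0.50}

def final_grade(scores):
    """Combine per-component scores (each 0-100) into a weighted final grade."""
    assert set(scores) == set(WEIGHTS), "need a score for every component"
    return round(sum(WEIGHTS[part] * scores[part] for part in WEIGHTS), 2)

print(final_grade({"team": 100, "reports": 95, "exams": 88}))  # → 93.0
```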

2. Teams & Breakout Rooms

I really hate attendance.

Taking attendance is undignified. It’s disrespectful of students, who are assumed to be incapable of making informed decisions about their education, and of the professor, who is implicitly supporting that assumption. If students are sick, have a family emergency, or need to go to the dentist, they should be able to do so without worrying about their grade. They shouldn’t have to send me an email with a doctor’s note. I don’t like getting those emails—just stay home if you’re sick—and I’m sure students don’t like sending them.

All of this is doubly true of online teaching. All the lectures are recorded. Students can watch and re-watch my presentations as many times as they want. Why should any of us care about them being “in class” when that means almost nothing in a virtual classroom?

When I taught Introduction to Psychology last summer, I tried using a participation-based system. Rather than taking attendance, I had my TA mark down when students spoke in class. The idea was that this would encourage them not just to show up, but to participate in class discussions. I also hoped it would encourage them to do the assigned reading, which we discussed each day.

This didn’t work. Students would speak up even when they had nothing to add, just to get the grade. The quality of discussion suffered for it. Some very shy students didn’t speak at all, and lost points despite the fact that they were doing great in the class otherwise. It was a huge pain for my TA to keep track of it all. This system didn’t do anything I hoped it would, and I think it was a failure.

We could just chuck attendance altogether. But on the other hand, it’s good to have some kind of incentive for students to show up to class. Recorded lectures are about as good as live ones, but when students do show up, they can ask questions and I can get a sense of what they do and don’t understand. It would be good to encourage most of them to be there most of the time. Can we come up with a way to make this happen?

2.1 Enter the Zoom Room

One of the things that everyone learned early on in the pandemic is that video calls suck. Jumping onto a Zoom call is excruciating, and afterwards you feel drained of all will to live. Turning off your camera helps, but not by much.

At first this seemed universal. People speculated that it was something inherent to the Zoom platform. There were theories that even subtle video latency was unnatural and jarring. But over time, I noticed two exceptions. The first was direct calls, with smaller groups. Hanging out with one or two friends over Zoom, while not as much fun as hanging out in person, didn’t make me want to tear my eyes out the way a Zoom call with several people did.

The other exception was playing virtual trivia. Early on in the pandemic, my friend Liz from my PhD cohort set up a virtual trivia night for students in our program. We would all gather in one Zoom room to start off. For each round, teams would be sent off into individual breakout rooms for 10-15 minutes to answer questions, and then we would all come back to the main room for scoring. We’d repeat this for each of the couple of rounds we played each night.

This was infinitely better than every other group call I had been on, and it wasn’t just that we were a group of PhD students drinking late at night. The breakout rooms were just as relaxed as being on a small call, and they broke up the evening in a way that made the main room much more fun, even though the full group was pretty large.

When I started thinking about how to run an online class, I knew I would have to include something like this.

(Liz also happened to be my TA for the stats course!)

I had been wanting to incorporate something about teams for a while, and this seemed like the perfect way to do it. Instead of sending teams off for rounds of trivia, I would send them off to do breakout room activities, and call them back to discuss the answers.

These activities took different formats depending on the topic we were covering each day, but most of them worked something like this. I put up a question or a task on the slides, and then sent the students into breakout rooms for about 10 or 15 minutes. When they came back, I randomly chose a couple teams to share their answers.

Getting the correct answer wasn’t the point. If the group provided an answer that seriously engaged with the activity, the group got credit for that activity, even if their answer was incorrect. The only way to get no credit was to not engage with the question or to give no answer at all. If I didn’t call on a team, that activity didn’t affect their grade.

This seemed to be the perfect replacement for attendance. At least one member of every group would need to be there every day, while individual members could come and go if they needed to. But part of their individual success would come from helping to make sure that the whole team was successful, so it was still in their interest to show up and help out whenever possible. I didn’t need to keep track of who was there, I just needed to give activities and ask them for their answers. And I didn’t even need to grade their responses, just record if they made an attempt.
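The bookkeeping involved is tiny. A hedged Python sketch of the rule (the function and data shapes are my invention, purely to illustrate it):

```python
import random

def record_activity(gradebook, teams, attempted, n_called=2, rng=random):
    """Illustrative sketch of the breakout-activity rule described above.

    gradebook: {team: list of 0/1 credits}
    attempted: {team: True if the team seriously engaged with the question}
    Only the randomly called teams are graded; everyone else is unaffected.
    A serious attempt earns full credit even if the answer is wrong.
    """
    called = rng.sample(teams, n_called)
    for team in called:
        gradebook.setdefault(team, []).append(1 if attempted.get(team) else 0)
    return called

def team_grade(credits):
    """A team's grade is the fraction of its graded activities it engaged with."""
    return 100 * sum(credits) / len(credits)
```

At the end of the course, each team’s participation grade is just the average of whatever credits happened to be recorded for it.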

I also hoped that this would give them some level of social support for the class — the kind of friendship they would normally get from the students sitting next to them, and people to go to if they needed help or support.

Another benefit was that this broke up the huge lectures into smaller chunks. Intermissions had already broken the 2.75-hour classes into two sessions of about 1 hour 15 minutes. With breakout room activities, days could end up being four sessions of about 30 minutes each, with activities and an intermission in between. That’s a lot better.

This was also meant to be a grade boost. A whopping 30% of their final grade came from their team grade, and because all you had to do was show up and try to answer the questions, I expected most teams to get 100%. I included this grade boost because I didn’t want them to worry about their final grade too much. This way, they would still have to work to get an excellent grade, but a student who did a decent job wouldn’t have to worry about failure. (As I mentioned earlier, I think that grades are kind of a joke.)

I shared a brief stats experience survey with my students the week before class, and I assigned them to teams based on their responses. I wanted to make sure that each team had a diverse collection of skills — that there was at least one student in every group who was comfortable with public speaking, at least one with decent math skills, and so on. The idea was that every team would have the skills they needed to succeed, and they would all have someone to turn to for help on any subject. I ended up with eight teams of five students each.
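One simple way to get that kind of skill spread, sketched in Python (this is just the general idea, not the actual procedure I used): group respondents by their strongest skill and deal each group out across the teams round-robin.

```python
from collections import defaultdict

def assign_teams(students, n_teams=8):
    """Hypothetical skill-balanced assignment.

    students: list of (name, strongest_skill) pairs from the pre-class survey.
    Dealing each skill group out round-robin spreads every skill across
    as many teams as possible.
    """
    by_skill = defaultdict(list)
    for name, skill in students:
        by_skill[skill].append(name)
    teams = [[] for _ in range(n_teams)]
    i = 0
    for skill in sorted(by_skill):  # deterministic order
        for name in by_skill[skill]:
            teams[i % n_teams].append(name)
            i += 1
    return teams
```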

2.2 How did Breakout Rooms Work?

The grading worked just as planned. Seven of the eight teams got perfect marks on their breakout room activities. The other group missed one day (none of them showed up) and got about 90% on the team grade. But in general this provided exactly the padding I intended.

Or, almost. In retrospect, 30% was way too much. Students got really good grades anyways, and it wasn’t all thanks to the team grade — remember, more than 50% got an A! Making the team grades only 20% or even only 10% wouldn’t have changed their grades by very much, because they were all doing so well on other parts of the class. Mostly, I think it should have counted for less than 30% because it’s a shame that so much of their grade came from something unrelated to their understanding of the material. I am very happy so many of them got a 95 — I just think it would be better for them to get a 95 from nailing the assignments and exams than showing up and participating! It’s something I would do differently next time.

The activities worked really well. Lectures can be, let’s face it, pretty boring, and I think having these class exercises helped keep students from falling asleep. There’s also no better way to learn something than doing it yourself, and so following each lesson with an exercise was a good idea. And it was nice on my end to take a quick break, wait a few minutes, and see how they had done when they came back.

You do have to be careful with the activities, though. Activities work well if they are a simple problem, something the students couldn’t do when they signed on, but can do now that they’ve seen the day’s lecture. This helps the lesson stick in memory, and demonstrates why what they just learned is actually useful. Activities can also take a “don’t take my word for it, see for yourself” approach, and I liked this when I was able to use it.

No matter what though, the activities have to be easy. They aren’t a challenge or an exam; they exist to round out the lecture and serve as a teaching aid. It’s ok if students struggle with the details; it can be good for them to get a sense of their own limitations. But if they get stuck, can’t do the activity, or reach a dead end, then they don’t learn anything. The implicit message is that they can’t handle it, and that’s not the right message to send them. They can handle things that you’ve prepared them for; don’t give them assignments you haven’t prepared them for.

Students had mixed opinions of the teams. I got feedback like, “there was zero accountability for the breakout rooms … Most of the time, my teammates wouldn’t show up” and “as the days progressed, my group became unresponsive to the point where I was simply doing the work and presenting it on my own.” A few of them did have positive things to say about the teams, but clearly that was the minority opinion.

Most students liked the breakout room activities, though. “I was able to apply the material and then receive feedback (if called on) instantly. The breakout rooms presented a great opportunity to work through what was being discussed,” one student said. Another wrote, “Breakout rooms really allowed me to understand the application of concepts. I don’t think I would have been able to work through the research reports (or the finals) with as much ease had we not gone through related work individually and then as a class.”

The only complaint I saw about these activities was that I gave students too much time to work on them. I find this confusing, because I assumed students would be happy to have an extra 5-minute break to go and make a sandwich or something. Either way, I mark this idea as another success. It does seem like it helped the concepts and skills really stick with them.

Some students suggested that the activities be designed to more directly prepare them for the exams — basically, to have the activities be examples of the kind of questions that appeared on the exams. I can see why they proposed this, but I don’t like it. The exams are designed to try to see if students can generalize stats concepts to new situations. (And from their grades, it’s clear that by the end they could!) If I give them practice with questions of a similar format, I think that would defeat the purpose.

Clearly, then, the problem is the teams themselves, not the activities, and it’s not clear to me what the solution is. Students suggested that I could have them do the work as a team but then call on individual students for the answers. That’s a little too invasive for my taste. One reason to have teams is to help less confident students — you know, the kind who would hate being called on.

I could imagine making the teams larger, maybe groups of 7-10. With more students, it’s more likely that some of them would show up. I could also make the teams smaller, maybe just 2 or 3 people per team. This would lead to less diffusion of responsibility. In either case, I’m sure there would still be slackers. Students don’t like having slackers on their team, but if everyone is getting a 100% on their team grades anyways, I don’t mind if there are a couple freeloaders. Maybe teaching this in person, if that ever happens, would change the dynamic and solve the whole problem.

If I were to teach this in a classroom rather than online, I would have them do more class activities, but have each activity be smaller/shorter. Sending people to breakout rooms on Zoom is a bit of a commitment. It takes a minute to send them out and to re-orient on coming back, so you want them to get their money’s worth. But teaching in person, it would be better to just give them more diverse tasks. Rather than giving them a 10-minute worksheet, I would do something like throw three histograms up on the board and give them 3 minutes to tell me what values they could and could not reject from each.

3. Research Reports

About a year ago, I wrote an essay called What You Want from Tests, where I outline two kinds of knowledge that you need to have mastery over a skill. The first is the sort of things that every expert carries around inside their head, and this is what I argue you should try to examine with exams and quizzes. The other kind of knowledge is the ability to actually use the skill. Without the ability to use the skill, any knowledge is just trivia. You’re not an expert, you’re just a fan.

Statistics is a skill-based course, so the second kind of knowledge is really important. I didn’t just want my students to memorize a bunch of facts about statistics, I wanted them to learn how to actually use statistics.

A few years ago I was working with an undergraduate who had volunteered to be my research assistant. She was an exceptionally bright and curious student, who always asked remarkably insightful questions. She was also very diligent, and had already taken several stats classes before she started working with me. She had even taken some MA-level stats courses, which is unusual and impressive for an undergrad.

Despite all this, I discovered that she did not really understand stats. She had a hard time conducting even basic analyses. She didn’t understand many of the concepts. Despite her excellent grades, almost nothing from the classes had stuck with her.

I already knew that she was gifted, and I was aware of the shortcomings of the usual stats education approaches, so I reassured her that it was not her fault, and I offered to help her do something about it.

At this point I had already done a lot of thinking about how to do a better job teaching stats, and I realized that people always forget to teach this practical side of the skill, even though the practical side is what actually matters. Now, there’s no mystery about how to teach skills. I learned stats by struggling through real analyses for projects that were actually important to me, and everyone agrees that working on a project you genuinely care about is the best way to pick up a new skill.

But this doesn’t work in every situation. Even for me, it was a struggle, and this sink-or-swim approach is too harsh for the classroom. It’s also inefficient for beginners, because real data is messy and confusing. If students bring in a real problem, the correct approach might be too advanced for an intro class. And scale makes it impossible. Do we expect every student in an intro course to be able to bring in a project they’re thrilled about? They don’t know anything about the topic yet, so they don’t know what a good project would be.

I realized that all these problems could be fixed by using fake datasets. It’s easy enough to generate data, and you can make it look however you want. And unlike a real project, you can introduce concepts one at a time so that the student is always ready for them.

So that summer, I made a bunch of practice datasets for my RA to work with. I wrote a set of R functions that would automatically generate datasets to my specifications. At the start of each day, I would give my RA a short lesson on a stats concept, and then send her a couple datasets. Naturally, most of the datasets would be in some way related to that day’s lesson. She would work on them all morning, prepare some slides, and at noon, before we broke for lunch, she would give us a presentation on what she found out. I let my other RAs give feedback first (giving critique is great training as well), and then I would ask questions and give her feedback.

The first datasets were extremely simple, and they gave her no trouble at all. Once she was comfortable with conducting simple analyses on her own, I introduced complications, the sort of wrinkles one would expect to find in a real dataset. First I introduced the concept of statistical power, and gave her some critically underpowered studies, so she could learn to interpret those null results as inconclusive. Then we had a discussion of outliers, when and when not to exclude them, and the datasets for that day included different kinds of outliers. We covered causal inference, interactions, p-hacking, and many other concepts in the same way. The concepts in these lessons were cumulative. Once we had covered outliers, for example, I would sometimes put outliers in the datasets later on.

The datasets at the start of the semester were really easy. The datasets by the end were almost as tricky as real-world data. But at no point did my RA work on anything that was too hard for her. Each new complication was just one step up from something she had already mastered, so she was always prepared to tackle it.
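My generators were R functions, but the idea is easy to sketch in any language. Here’s a hypothetical Python version (the function and its parameters are illustrative, not the original code): a two-group dataset where you can dial the true effect down, shrink the sample to make the study underpowered, or inject outliers.

```python
import random

def make_dataset(n_per_group=30, effect=0.5, n_outliers=0, seed=None):
    """Illustrative two-group dataset generator (not the original R code).

    effect:      true mean difference in SD units (0 = a true null)
    n_per_group: small values give the underpowered studies mentioned above
    n_outliers:  extreme points injected into the treatment group
    """
    rng = random.Random(seed)
    control = [rng.gauss(0, 1) for _ in range(n_per_group)]
    treatment = [rng.gauss(effect, 1) for _ in range(n_per_group)]
    for _ in range(n_outliers):
        treatment.append(rng.gauss(effect, 1) + rng.choice([-8, 8]))
    return control, treatment

# An underpowered study: a real effect, but too few subjects to detect it.
control, treatment = make_dataset(n_per_group=8, effect=0.4, seed=1)
```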

3.1 Class Projects

I knew I wanted to do something similar for my class, to give them the same kind of practice with the practical side of things. In particular, I like this approach because for each dataset, you have to figure out what statistical test to run on the data. This is one of the stats skills you use most often in the real world, and it’s often the first question you ask when thinking about an analysis. Yet somehow, intro stats classes almost never teach this skill. At best, students get handed an extremely confusing flowchart. I knew I could do better.

Unfortunately the approach I used with my RA doesn’t exactly scale. I couldn’t give them the same kind of step-by-step training. I couldn’t have them all give a presentation on every dataset, and of course, many students are terrified of presenting to begin with.

Still, I figured I could come up with something that captured most of the benefits. I took several of the simpler datasets that I had made for my RA and I put them in a folder on the class website. Rather than having to analyze all of them, students were required to pick two of these datasets and write a research report about each of them. They could do these two reports at any point during the class, but since they weren’t taught how to do most analyses until about halfway through, I expected most of them to do these assignments during the second half of the course.

Students are taught to write long. This is a bad habit, especially when working with such simple datasets. I limited research reports to a maximum of one page, including any graphs and tables. Students should learn to be concise, and besides, I didn’t want Liz to have to sift through dozens of extra pages when grading.

Each research report was 10% of the final grade, so these assignments were 20% of their grade in total. They were free to analyze the data however they wanted, but in particular we thought that R, SPSS, and Excel/Google Sheets were good choices, so I included one session for each of those approaches in the lectures. This wasn’t much training, to be sure. A lot of people might have seen this as a big risk — you’re expecting them to use R or SPSS with barely more than an hour of training each? But I wasn’t worried about it. Somehow I knew that they were up to the task.

Fig. 1: “Burgers Have Cheese???.png”, an example created in the course of instruction on the use of Google Sheets.

Originally, I was planning to let students do up to two additional research reports for extra credit. But in the week before class, one of the students suggested that instead of doing research reports for extra credit, we could let them re-do research reports that they weren’t satisfied with. This basically translated to “do 4 research reports, get your grade from the best two”.

I liked this for a couple of reasons. First, it let them make mistakes on early research reports without huge consequences, which was one of my design goals for the class. Second, students who were struggling would be encouraged to do additional reports, which would give them the extra practice they need, while students who didn’t need additional help wouldn’t be bothered.

I implemented this change, with the requirement that the do-overs would have to be on new datasets. Students would get feedback from Liz about how to do better, but they would have to apply those lessons in a new context. I limited them to two of these do-overs at most. I wanted them to be able to learn from their mistakes, but also I didn’t want each of them doing 10 reports.
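The resulting rule (“do up to four, keep the best two”) is easy to state precisely. A sketch, with an illustrative function name:

```python
def report_grade(scores, keep=2, cap=4):
    """'Best two of up to four' rule: scores are the grades on each
    submitted report. Returns the mean of the best `keep` of them."""
    assert keep <= len(scores) <= cap, "submit between 2 and 4 reports"
    best = sorted(scores, reverse=True)[:keep]
    return sum(best) / keep

print(report_grade([72, 85, 98, 99]))  # → 98.5
```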

The research reports were not really about the grades. They weren’t so much intended as evaluations. Really, they were more like practice, or lessons. What I really wanted them to get out of the research reports was, “I can do this and it’s not scary”, because I think it will help set them up to be confident when using these skills in real life (and on the Exams). It wasn’t about challenging or testing them, it was about giving them the opportunity to try things for themselves.

About halfway through the course, one student emailed me to ask for more guidance on how to format the reports. At the very least, she said, I should give them an example of what one would look like. I told her:

This assignment is designed to mimic what doing analysis is like in the real world. Data is emailed to you in a confusing format, and the file is poorly organized. The people who have hired you to conduct the analysis don’t know exactly what they want and can’t tell you what kind of test to conduct; after all, that’s what they hired you for. I’m trying to give you a controlled version of this experience — not nearly so confusing as real life, but where you are asked to exercise your judgment and the knowledge we’ve covered in class. Giving you any more guidance on how to conduct the analysis or write the report would defeat the purpose of the assignment.

To this student’s credit, she totally understood my point and ended up getting a 98 on both research reports.

A final reason to like the research reports is that they capture my “walk out of class once you’ve mastered the material” goal. If you already took stats but you were for some reason forced to take my class, or if you decide to teach yourself all the material in the first week, then you can just throw together two one-page reports, get an A+ on both of them, and forget about this part of the class entirely.

3.2 How did they do?

Students really surprised me on the research reports. When I first looked at the grades, I thought that maybe Liz had been too lenient. Almost all of them had gotten A’s! But when I looked closer, I saw that the students had earned them. The reports weren’t perfect, but they showed serious critical thinking and really creative engagement with the datasets. All very impressive for a subject they had been studying for less than six weeks!

When I looked back, I saw that on their first submissions, many students had gotten B’s and C’s. Liz wasn’t being too lenient at all. In fact, her feedback was intensely detailed! But this helped the students enormously. It’s clear that the students took that feedback and turned it around for their do-overs, and that’s what ended up earning them those A’s.

Some students, I was happy to see, didn’t need the do-overs. One student did her first two, got a 98 and a 99, and unsurprisingly, chose not to submit any more. Another student, who had said in class that she was terrible at math, gave it a shot and to her great surprise earned a 93 and a 90. She decided that was good enough for her, and didn’t send in another. The system works.

I especially liked how diverse the reports were. Students used all sorts of weird charts and phrased their results in all sorts of unusual ways. Not wrong per se, just the sort of thing an expert would never do. I think this demonstrates real understanding. Rather than just copying someone else’s approach, they had come up with their own, often slightly bizarre perspective, and then applied it. That’s what mastery looks like, folks.

How about the software? Some of them came to me or to Liz for help, but honestly, not as many as you might expect. For the most part they seem to have taught themselves.

When I was looking through the reports, I saw that most of them chose to use R for their research reports, and almost all of them did a solid job of it. This was a big surprise, but it’s very encouraging.

In conversations about how to teach stats, I’ve often heard, “It would be great if we could teach the students R or python. But you just can’t teach the average student a programming language in only one semester. It would take up too much of the lecture, and there would be too many questions for the TAs to handle. We should stick to SPSS worksheets and formulas for now, that’s the sort of thing that students can deal with.” I’m happy to have evidence that, in my opinion, proves this entirely false. Apparently students can learn the basics of R with almost no instruction, and in less than six weeks, as long as you give them the right environment for it.

I’m pretty happy with the research reports. Is there anything I would do differently next time? Well, one thing Liz pointed out to me is that while I gave them 24 different datasets, most of the reports were on the same 4 or 5 options. These were some of the most straightforward datasets, and most of them were analyses of correlation between two variables.

Now, as I said before, the research reports are not really about challenging students. I’m fine with them doing two easy reports, since doing any independent report at all is great for intro stats. But conducting correlation tests both times does slightly defeat the purpose of doing two reports.

A better system would be to break the research reports up into different bundles. Bundle A could be the easy ones and Bundle B could be more challenging. Bundle A could include one set of tests and Bundle B could include the others, so that every student would have to use at least two different tests. You could maybe include a Bundle C of advanced datasets. These could either give you extra points just for attempting them, or they could be strictly for extra credit. In any case, adding some more structure to the research reports would probably improve them.

Hindsight is Stats 2020, Part I: Fractal Course Design

This summer (2020) I taught Statistics for the Behavioral Sciences.

The course was unusual for a number of reasons. I’ve wanted to teach stats for a long time, so I came into this class with a collection of unorthodox ideas that I’ve been sitting on for a few years.

Things went really well. I had high expectations, but more than half (!) of my students got an A or higher. I didn’t shift my expectations, or make the class easier halfway through. These grades mean that most of the students either mastered the material to my satisfaction or came very close to doing so. This approach worked and I would definitely recommend it.

1. Course Format

1.1 Being Online

The big curveball for this class was the pandemic, which made it necessary to teach the class online. I’d never taken a course online, and I never expected to teach one that way. Going into this, I had almost no experience with online classes. When we transitioned to online instruction in March, I was TA’ing for a class, so I got to see how that went. But that was about it.

I’m confident in my skills, but there were a few things in particular that I was worried about.

One of the really rewarding parts of teaching is getting to know your students. But Zoom isn’t that great, so I was worried that there might be no personal connection. Partly I was worried that the class would be less enjoyable. People like making friends and knowing that the instructor cares about them. But part of it was also practical. Without that sense of the classroom and knowledge of the students, I was concerned that I wouldn’t be able to tell when students didn’t understand the material. Maybe I wouldn’t be able to explain things as well when they had questions.

The other major concern I had was cheating. I knew that in the transition towards online classes brought on by the pandemic, many schools forced students to install unsettling exam-monitoring software on their personal devices. This sort of thing is pretty evil. While I would never consider spying on my students, it did make me worry about cheating on exams. With online exams, it seems like it could be a real problem. But I also know from being a TA that students cheat a lot less than professors think they do. In the end I took no special steps to prevent cheating. I don’t really care about or believe in grades, and I decided to trust the students.

1.2 Personal Connection

It turns out that both of these concerns were unfounded.

Admittedly, there was very little personal connection. I didn’t get to know most of my students. I would recognize their names, but I never even saw most of their faces.

But no one seemed to suffer for it. In the end we still developed the rapport that you need for good teaching. In their evaluations, students said things like:

“Jeff was a great teacher! He clearly loved the subject, and wanted to try and teach it in a more accessible way”

“Jeff specifically explained things very well and was so real. It was nice hearing examples in ‘layman’s terms’ that were more approachable”

“I really felt as if this teacher wanted us to do well, and helped us learn as much as possible in the clearest way possible. … Great great teacher!”

“Jeff is cool”

This experience has changed my mind about classroom engagement, and makes me doubt some of the common wisdom about teaching.

Is getting to know your students a reasonable expectation? Certainly it's possible. But is it even appropriate? Students aren't in your class to be your friends, and you're not there to be their pal. People are in the classroom to, hopefully, learn something.

“Personal connection” often seems to be used as a proxy for respecting your students and treating them like human beings. But — surprise! — you can respect your students and treat them like human beings without necessarily having a friendship with them, or even knowing their names. Students are sensitive to this difference. They care about being treated with respect, but don’t seem to care about the other stuff.

A cynical take would be that professors use the excuse of “getting to know their students” to push students into having an unnecessarily friendly relationship. But pretending to be equals when you are in a position of power over someone is at best dishonest, and at worst is a way of denying that you have a responsibility to them.

I do think there are things you can do to drive engagement. But I don't know if it really matters. My students got really good grades and displayed surprisingly deep understanding of the material, so the lack of connection didn't hurt their education. And many of them told me that this was one of the most enjoyable courses they had ever taken, so it didn't seem to make learning any less fun.

1.3 Cheating

I was even more wrong about cheating. I didn’t see any evidence of cheating on exams or assignments, and there was plenty of evidence that they weren’t cheating. Students made lots of simple mistakes, which they could have avoided if they were cheating. Exam scores improved incrementally over time, just as you would expect from honest learning. Their assignments and answers on the tests were idiosyncratic, not the carbon copies you might expect if they were sharing answers. If students were cheating, they didn’t leave any trace of it, and so I’m inclined to believe that they didn’t.

The lack of cheating is a little weird. When I was a TA, I would catch students cheating all the time. They usually did a bad job of it — they forgot that I was a student not too long ago, and didn't realize that I knew most of the tricks. So the fact that I didn't see any of the classic signs is strong evidence that there wasn't any cheating.

So why didn’t they cheat on my class, when they do cheat during the semester? I think it has to do with trust. In the exit survey for the class, one student wrote down, “no feeling of being ‘cheated’ by the prof”. Another student wrote, “My biggest fears for this course revolved around completing it and not only doing poorly, but also learning nothing.”

Students tend to stoop to cheating when they think, often correctly, that there is no other way to do well in the course. When professors are unclear about expectations, or make examinations needlessly difficult, the students feel cheated by the professor, and will cheat in turn. When you see an exam filled with trick questions, it's hard not to feel like the game is rigged. But to their credit, even in this situation, most students still won't cheat.

Teachers have a lot to learn about cheating. If you don’t cheat your students, most of them won’t cheat on your assignments. It’s about trust. Not your trusting that they won’t cheat on assignments — their trusting that you won’t cheat them in their education.

This all makes it especially disappointing that, during this pandemic, so many schools are engaging in unethical surveillance of their students in the name of academic honesty. Students just don’t cheat all that much, even when they definitely could get away with it.

2. Course Content

So much for the course format. What was I actually teaching?

2.1 What’s Wrong with Stats?

Statistics education is pretty terrible, and everyone knows it. All the professors who teach stats agree: students come into class, usually manage to pass, and retain almost nothing.

Everyone is looking for the magic bullet. But even so, no one thinks it’s a great mystery. Professors and TAs will all tell you the same thing: the problem is motivational. The majority of students, they say, simply aren’t interested in learning this esoteric form of math. As a result, most of the proposed solutions are motivational as well: find a way to make it fun and interesting, or at least find the right set of rewards and punishments.

But when I was a TA for intro stats, I noticed that this didn’t match what I saw at all. The students in my recitations were engaged, and really wanted to understand stats. They asked insightful and sophisticated questions, and were always pestering me for more detail. Yet somehow they seemed to come back every week having forgotten everything we discussed the week before. This isn’t the behavior of students who are checked out — this is the behavior of students who are trying, and repeatedly failing, to build a model of what is going on around them.

Even if I had been wrong about most students, there were a few of them who were clearly both able and motivated. These students got perfect scores on multiple tests and assignments, regularly came to my office hours, and discussed many of the concepts in great detail. They showed me the extensive, meticulous notes they had taken in lecture. But when it came to answering simple questions about the material in a new context, they always came up blank.

These students weren’t lacking in motivation or intelligence. So it must be external; something about the class was failing them. Even if everyone in the class were as motivated as these high-achievers, we would still be having trouble with comprehension and retention.

2.2 Driver’s Ed

I think the motivation story is all wrong. The problem is that the subject is taught at the wrong level.

Imagine you are taking a driver’s ed course, and have just shown up to the first day of class. The professor gets up and says, “Hi everyone, in this class you’re going to learn all about cars. Cars are really amazing. Some people use cars to get to work. Some people use them to get to school. Some people use them to go on vacation! There are a lot of kinds of cars. The big ones are called trucks. Those ones carry things like fruit and gravel. In this course you’ll learn all the different kinds and their uses, and we’ll talk a bit about the history of cars.”

You raise your hand, “Excuse me, professor. I’m here because I want to learn how to drive. I didn’t come here to learn about the types or history of automobiles. I’m sure that knowledge will come in handy in some ways, but it’s really not my focus. How do you actually drive?”

“Worry not,” he says, “To drive, move the wheel back and forth.”

So you leave that course and you sign up for a different one. You show up to the new class, and the professor gets up and says, “Hi everyone, in this class we’re going to learn all about cars. We’re going to be starting with the drivetrain. It’s important that you be able to describe and identify all the parts. Look at this diagram. Here’s the gearbox (which you can see is constant-mesh), clutch mechanism, the flywheel, the differential…” You get up and walk out of the room.

Neither of these classes will teach you how to drive. And sadly, this is a pretty good metaphor for how statistics is usually taught. Some statistics courses give students an overview of probability theory and a brief sense of the history, without teaching them how to actually conduct an analysis. Others throw the equations right on the board and start discussing the terms without any context. All too often, a single class will try to include both of these approaches. This is probably worse than either of them alone.

Students don’t want to learn a list of tests, the life history of Ronald Fisher, or the exact meanings of the terms in the formula for the pooled standard deviation. All these are things one naturally picks up over time, but none of it is useful without the core knowledge. Students want to learn what statistics is and how we actually use it. But somehow they seem to come away from our courses without having been taught either of these things.

Driver’s ed focuses on the point of contact: how to use the car. Similarly, the main goal of this class was statistical skills and how to use them.

I wanted students to become statistically literate. Most students won’t end up being researchers or statisticians in the same way that most people who take driver’s ed won’t end up being auto mechanics or engineers for GM. We still benefit from knowing what a car is and how to operate it. Similarly, students benefit from knowing what statistics is and how to use it. For those students who do want to go on to use statistics professionally, this will still give them a strong foundation. Auto mechanics don’t suffer from having taken driver’s ed in high school.

The focus was limited and practical. Students were taught how to recognize different kinds of variables and data, interpret standard plots and graphs, read and understand statistical reports, and conduct basic analyses using statistical software. I alluded to other subjects of interest in lectures, but in the lessons and the evaluations, I focused on these basic skills.
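As an illustration of the kind of basic analysis students learned to conduct and read, here is a short sketch in Python. The data, and the choice of Python itself, are my own assumptions for the example; the course may well have used different software.

```python
import math
import statistics

# Two small hypothetical samples, e.g. scores from two groups
# (numbers made up purely for illustration)
group_a = [82, 75, 78, 90, 71, 84, 77, 88]
group_b = [70, 74, 68, 79, 72, 65, 75, 69]

# Descriptive statistics: summarize each variable on its own
mean_a, sd_a = statistics.mean(group_a), statistics.stdev(group_a)
mean_b, sd_b = statistics.mean(group_b), statistics.stdev(group_b)

# Inferential statistics: a two-sample t-test (equal-variance form)
n_a, n_b = len(group_a), len(group_b)
pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
t = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
print(f"mean A = {mean_a:.1f}, mean B = {mean_b:.1f}, t = {t:.2f}")
```

The point is the shape of the workflow, not the arithmetic: recognize the variables, describe each sample, then run a test and interpret the output.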

We can also talk a little bit about what I didn’t want to cover. The history of stats is interesting, but most of the time it doesn’t help you be a better statistician. The most important thing to know about the history is that these tests and concepts were just invented by a few guys not all that different from you and me. Anyone can make up a concept or design a new test. You assign it a Greek letter and suddenly it sounds official, but for all we know, Fisher came up with it while sitting in the bathtub. Besides that, most of the details don’t matter. Aside from a couple of helpful examples, I didn’t teach them anything about the history of statistics.

You do need to know a few symbols to be able to interpret tests, but I didn't want to cover much in the way of formatting. I don't care if students report a number as 0.02 or .02 or 0.0212; I don't care if they write “p-value” or “p value”. Time is limited, and I don't want to waste their time or my time going over this nonsense. If by the end of the class they know the concepts but not the formatting, then I have succeeded. If they know the formatting but not the concepts, I have definitely failed. So I decided to focus on the concepts and, as much as possible, ignore the formatting.

2.3 Fractal

So that’s what I wanted to teach. How do you actually teach something like this?

Most courses take a cumulative approach. You start with the basics, and the material slowly becomes more and more complex. Each lesson builds on all the previous lessons. At the end you finally tackle the most advanced material. Then you take the final.

In my experience, this falls apart by the second week of class. Students who miss even a single lecture are cut adrift, left to founder or drown. Even if you make it to every class, your safety isn’t guaranteed. If you don’t understand the explanation they give in lecture, you’re out of luck, because the class is never going to come back to that topic again.

Rather than being cumulative, my approach was fractal. A fractal is a figure or function where every part has the same character as the whole; every part contains copies of the whole thing. That's how I structured the course: every part of the course contained the whole course in miniature.

A photo of my stats course from space. jk it’s fractal broccoli

You could be the best teacher who ever lived, with the most beautiful slides imaginable. It doesn’t matter — students just can’t learn something in one go. This is especially true in statistics. The classic learning pattern for the subject is brief flashes of insight, a feeling of sudden understanding, and then losing your hold on it and slipping back into confusion. This is normal.

For some reason, people don’t understand this. Everyone thinks there is going to be a shortcut explanation for these ideas, but we don’t think that way about other skills. We don’t think that painters will master three-point perspective in a single session, and we don’t expect programming students to master for loops in a single day. Maybe you can get the gist after the first introduction, but really understanding these topics takes time. Somehow we see stats differently. In particular, there is a whole genre of articles and blog posts all about how to explain p-values. These assume that the concept can be distilled into a single statement, or a single lesson. But that’s crazy. You can’t understand p-values in one hour, no matter how good the explanation is.

I think of statistics as really being three closely-related topics: a language for talking about data in general, descriptive statistics for talking about individual variables, and inferential statistics for making educated guesses about the world on the basis of limited samples.
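To make the descriptive/inferential split concrete, here is a minimal sketch (the sample data and the rough 95% interval are my own illustration, not from the course): a mean describes the sample you have, while a confidence interval makes an educated guess about the population it came from.

```python
import math
import statistics

# A hypothetical sample of 25 measurements (made up for illustration)
sample = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4, 5.1, 5.0,
          4.9, 5.2, 5.3, 4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 5.1,
          4.8, 5.3, 5.0, 4.9, 5.2]

# Descriptive: facts about this sample and nothing more
mean = statistics.mean(sample)
sd = statistics.stdev(sample)

# Inferential: an educated guess about the wider population,
# here a rough 95% confidence interval (using 1.96, the large-sample
# normal approximation, instead of the exact t critical value)
se = sd / math.sqrt(len(sample))
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"sample mean = {mean:.2f}, 95% CI roughly ({low:.2f}, {high:.2f})")
```

Same numbers, two very different kinds of claims: the first is about the data in hand, the second is a hedged statement about data you never collected.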

The structure was built around these topics. The first day of class was an overview of the entire course, introducing all three topics in very general terms. Days 2 and 3 were another microcosm: again we covered the whole course, this time in slightly more detail.

Week 2 covered data in more detail. Week 3 covered descriptive statistics. Weeks 4 and 5 covered inferential statistics. Finally, in week 6, we went even deeper into inferential statistics, exposing exactly how the math behind the tests works.

This means that students see every single topic many times before the end of the course. The two-sample t-test, for example, appears a total of six times in the lectures: first on day one, during the complete overview, again in the lectures for day three, and then again in weeks three, four, five, and six.

It doesn’t matter if you don’t understand the two-sample t-test the first time, or the second time, or even the third time you see it. It doesn’t matter if you miss a few classes. It doesn’t matter if one of the examples I use doesn’t make sense to you. We will come back to this concept again, in a new context, with new examples. By the end of the class, you will get to see it from every angle.

These things take time. Mastery of a subject comes only when you return to an idea over and over, seeing it in new situations and becoming more familiar with it, building your own understanding. The structure of the class needs to support this, or students won’t be able to learn a damn thing.

2.4 Context

My influences in this were the Snowflake Method and Progressive Rendering from It’s Time For An Intuition-First Calculus Course. Both of these perspectives emphasize understanding the gist of an idea before getting stuck in the details. To quote the reasoning from It’s Time For An Intuition-First Calculus Course:

The “start-to-finish” approach seems official. Orderly. Rigorous. And it doesn’t work.

What, exactly, do you know when you’ve seen the first 20% of a portrait in full resolution? A forehead? Do you even know the gender? The age? The teacher has forgotten that you’ve never seen the full picture and likely can’t appreciate that you’re even seeing a forehead!

Progressive rendering (blurry-to-sharp) gives a full overview, a rough approximation of what the expert sees, and gets you curious about more. After the overview, we start filling in the details. And because you have an idea of where you’re going, you’re excited to learn. What’s better: “Let’s download the next 10% of the forehead”, or “Let’s sharpen the picture”?

Let’s admit it: we forget the details of most classes. If we’ll have a hazy memory anyway, shouldn’t it be of the entire picture? That has the best shot of enticing us to sharpen the details later on.

Sometimes I think of this course as Intuition-First Statistics. “Intuition-first” doesn’t mean our goal is to teach good statistical intuitions, though hopefully students do get some of that. It means that we should start by working with intuitions, and that everything else will follow from that. Because, although it may sound surprising, students actually have pretty strong statistical intuitions.

The problem is context. The cumulative or start-to-finish approach makes perfect sense to the instructor, but only because they already know what is coming. They can see the context: how everything is connected.

The students don’t have any of that. They just get hit in the face with new material that they never saw coming. Every day it’s some new bullshit. They have no idea what is up next, what it means, or how it all is related. They’re always being knocked off-balance by new topics you didn’t prepare them for, and they never have time to figure out how it’s all connected.

Your Students

This is a huge problem, because context really matters for comprehension and memory. A great example comes from research by Bransford & Johnson (1972). In their studies, participants heard a paragraph like the one below. Take a look at this passage and see if you can figure out what it is all about:

The procedure is actually quite simple. First you arrange things into different groups. Of course, one pile may be sufficient depending on how much there is to do. If you have to go somewhere else due to lack of facilities that is the next step, otherwise you are pretty well set. It is important not to overdo things. That is, it is better to do too few things at once than too many. In the short run this may not seem important but complications can easily arise. A mistake can be expensive as well. At first the whole procedure will seem complicated. Soon, however, it will become just another facet of life. It is difficult to foresee any end to the necessity for this task in the immediate future, but then one never can tell. After the procedure is completed one arranges the materials into different groups again. Then they can be put into their appropriate places. Eventually they will be used once more and the whole cycle will then have to be repeated. However, that is part of life.

One third of the participants heard the paragraph without any context. It didn’t make much sense to them, and they had trouble recalling what they had heard.

The next third of the participants, before hearing the paragraph, were told that it was about doing laundry. To these participants, the paragraph made perfect sense, and they had very little trouble recalling the details.

The final third learned the topic only after they’d heard the entire paragraph. These participants also found the paragraph confusing, and even having been given the context, weren’t able to recall much about it. Context alone isn’t enough; you need to see the context up front.

Something similar happens in class. Without context, even the most motivated students have trouble remembering the material. They have a hard time memorizing tests or equations because they don’t understand what a test is used for, let alone how it works. I don’t have trouble with the equations, but only because I understand what the tests were created to do. It’s easy to put things into their proper categories if you have a good grasp of the system; it’s impossible if you don’t even know what categories there are.

The fractal approach solves this problem. The first two or three times I went over the material, I didn’t expect them to remember any of it. We covered all the material early on because being introduced to everything at a shallow level prepares students to understand it in depth once it comes back around again.