When it comes to consuming media on the internet, the Wadsworth Constant is your friend. You should have skipped the first clause of this paragraph. Maybe the second half of the paragraph really.
You’ve probably heard of The Dyatlov Pass incident, where nine hikers died in the Ural Mountains under mysterious circumstances. Well, the mystery appears to have been solved using computer simulation methods developed to animate the Disney film Frozen, combined with data from research at General Motors, where they “played rather violently with human corpses”.
Back near the beginning of the pandemic, Ada Palmer wrote a post over at her blog Ex Urbe, in response to the question, “If the Black Death caused the Renaissance, will COVID also create a golden age?” This piece is perhaps even more interesting now that we’ve seen how the first year of the pandemic has played out. If you haven’t read it yet, you should!
We love this piece of speculative burrito fiction. The fiction is speculative, not the burritos. At least, we’re pretty sure it’s fiction.
Exciting new developments in education: NYU cognitive science professor Todd Gureckis, in an effort to learn how to make more engaging video lectures, studies the masters: YouTubers. The results are already pretty impressive, and we suspect they will get even more engaging over time.
[Morpheus Voice] You think it’s the year 2021, when in fact it’s still early fall 1993. We have only bits and pieces of information, but what we know for certain is that in the early ’90s, AOL opened the doors of usenet, trapping the internet in Eternal September. Whatever date you think it is, the real date (as of this writing) is Sunday September 10043, 1993.
In a recent post, we argued that the US Senate is actually a pretty liberal institution, even when it happens to be controlled by conservatives, because it helps safeguard minority opinions against the majority.
That said, there’s a lot to dislike about Congress. The big sticking point in our book is gridlock. If a bill is popular in the House and the Senate doesn’t like it, that bill is never making it to graduation. The same is true for a bill popular in the Senate that the House can’t stand. There are too many opportunities for individual officials to obstruct the process. If the Speaker of the House or the Senate Majority Leader decides to stomp on a bill that everyone else likes, there’s not much anyone can do about it.
Right now, every legislature in the world is either unicameral (a single group of legislators makes all the laws) or bicameral (two groups or chambers of legislators each write and pass laws and then duke it out).
Unicameral legislatures are pretty straightforward, but they have a number of drawbacks. There are many ways to elect legislators, but you’ll probably have reason to regret your decision, whatever you choose. If your legislators are all noblemen (as was once the case in countries like the UK) most of your population doesn’t have representation. If your legislators are elected proportionally, based on population (as is the case in the US House of Representatives), you end up with a populist system that makes it easy for the majority to abuse any minority. Regardless of how you determine who gets to be a legislator, if you only have one chamber, there’s always the chance that something will go wrong in that chamber, and they’ll pass some really stupid laws.
Having two chambers mitigates these issues, especially when the chambers are elected in different ways. A populist chamber that tries to pass anti-minority laws can be blocked by a chamber that better represents minorities. A chamber of elites that tries to pass a discriminatory law can be blocked by a populist chamber. This system of checks and balances has obvious benefits compared to letting a single chamber run everything, and this is a large part of why many legislatures are bicameral. Unfortunately, this leads to political gridlock.
So why stop at two?
One chamber is tyranny. A second chamber introduces checks and balances, but leads to political gridlock. A third chamber can introduce further checks and balances, and break the deadlock to boot.
In the two-chamber system, a bill can be introduced by either chamber, but it needs the approval of both chambers to be sent to the executive branch to be signed into law. This creates a bottleneck.
In a three-chamber system, we can relax that requirement. Any of the three chambers can propose a bill, and if they can get one of the other chambers to approve it, then it goes to the executive. If we have three chambers, A, B, and C, a law can be passed by A & B, B & C, or A & C working together. That way if one of the chambers is deadlocked on an issue, or one powerful legislator is trying to keep a bill from being passed, the other two chambers of congress can just work around it.
Currently in the US government, a presidential veto can only be overridden by a 2/3 majority vote in both the House and the Senate. In a three-chamber system, there could be more than one option. A veto could be overridden by a 2/3 majority vote in the two chambers that originally approved it, or by a simple majority vote in all three chambers.
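The passage and override rules described above are simple enough to sketch in a few lines of code. This is only an illustration, not a constitutional design: the chamber names A, B, and C come from the text, while the function names `passes` and `overrides_veto` and the way votes are represented are our own hypothetical choices.

```python
from itertools import combinations

CHAMBERS = ("A", "B", "C")


def passes(approvals):
    """A bill goes to the executive if any two of the three chambers approve it.

    `approvals` maps chamber name -> True/False (a hypothetical representation).
    """
    return any(approvals[x] and approvals[y] for x, y in combinations(CHAMBERS, 2))


def overrides_veto(yes_share, original_pair):
    """Check a veto override under the two options floated above.

    `yes_share` maps chamber name -> fraction of members voting to override.
    `original_pair` is the pair of chambers that originally approved the bill.
    """
    # Option 1: a 2/3 majority in both chambers that originally approved the bill.
    two_thirds_in_pair = all(yes_share[c] >= 2 / 3 for c in original_pair)
    # Option 2: a simple majority in all three chambers.
    simple_in_all = all(yes_share[c] > 0.5 for c in CHAMBERS)
    return two_thirds_in_pair or simple_in_all
```

The point of the sketch is that no single chamber appears in every passing pair, so no single chamber (or single legislator within it) can block a bill on its own.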
You could extend these principles to a system with 5, 7, or even more chambers, but three is complicated enough.
The main distinction between the three chambers would be in how they are elected.
Let’s imagine how this could play out in an ideal scenario. One of the chambers should have proportional representation, like in the US House of Representatives. We also think it’s a good idea for one of the chambers to have intentionally disproportionate representation, like the US Senate, though it doesn’t necessarily need to be by state.
How should the third chamber be organized? We have some ideas.
As we argued in our earlier piece on the Senate, disproportionate representation can be good because it can protect minority groups from the whims of the majority. This doesn’t make all that much sense when the representation is by state — residents of North Dakota are not what we normally mean by “minority” — but maybe we can take this and run with it.
When we talk about minorities, we usually mean ethnic or religious minorities. So one thing we could do is make a chamber where every ethnic group or every religious group gets equal representation. For example, in a religious legislature, there might be 10 seats for Christians, 10 seats for Muslims, 10 seats for Sikhs, 10 seats for atheists, and so on. This would mean that even if, for example, the majority of a country were atheists, religious minorities would still have a stake in writing the laws of the country.
This approach protects minorities in a very structural way, which is great. There are a couple reasons to dislike this idea, however. First of all, the boundaries between different ethnic groups and different religions are pretty unclear. Do Catholics and Protestants count as “the same religion” for the purposes of allocating seats in this new chamber? How about Mormons? Do atheists and agnostics get separate representation? What if someone invents a new religion? How many representatives go to the Church of the Flying Spaghetti Monster? Will the chamber be taken over by Satanists?
Second, building a legislative body around something like race or religion puts the power to decide these questions in the hands of the government. We like government pretty ok, but we don’t think it should be deciding which religions “count” as separate religions.
Third, we think that dividing people up this way supports ideas like racial essentialism, which are basically a form of racist pseudoscience.
Ultimately it seems that this is a system that might work in some places (including Lebanon), but is politically dicey overall.
If there’s one group in America that really deserves more representation, it’s Native Americans.
This chamber would be really simple. There are 574 federally recognized tribal governments in the United States, and each government would get a seat, electing their representative however they wanted to.
We do notice that 231 of these tribes are located in Alaska, however. Many of them seem to have populations of only a couple hundred people, which seems a little bit disproportionate even by the standards of disproportionate representation. Probably the thing to do would be to give each of the twelve Alaska Native Regional Corporations the same level of representation.
Native American tribes are, at least nominally, sovereign nations, and maybe it sounds weird to give a sovereign government a say in writing our laws. Well…
There’s another old joke about making Israel the 51st state, which is that they would object on the grounds that they’d only get 2 votes in the Senate. Good idea, why not.
America should take its role as a superpower more seriously. American laws influence the whole world, so maybe we should give the rest of the world some say in our laws. They already get ambassadors, so why not representatives. What’s more, America is home to millions of immigrants, who can’t vote and don’t get any representation.
In this chamber, each of the 195 or so countries of the world would send one representative, who would vote on laws just like any normal member of Congress. Naturally, the USA would get one representative in this chamber as well, and it would probably make sense to have that representative serve as the president of that chamber, as the Vice President does for the Senate.
This may seem like handing over the reins to foreign powers, but this chamber can’t pass any laws on its own. To pass a law, it would have to work with either the House or the Senate, so everything still has to be approved by an all-American body. And anyways, the President can still veto any law they want.
In addition, this chamber gives us a surprisingly strong carrot & stick for international relations. Any country would jump at the chance to have a say in setting American policy. Offering a seat, or threatening to take it away, is something that foreign powers would take seriously. If we threatened to kick Russia out, they would pay attention, and it’s interesting to think what North Korea might agree to if we hinted we might give them a seat in Congress. If this chamber existed today, I would say we should kick Myanmar out right away.
This system does seem like it might be a little hackable. After all, what’s to stop a country from breaking up into many smaller countries to get more votes? On reflection, however, if other world governments want to balkanize themselves to get more representatives, that’s ok with us, especially since plenty of regions already want to do this.
A problem with most governments is that they are the tyranny of the old over the young. This is perverse because young people will live to see the full consequences of the laws passed today, while old people may not. Old men declare war, but it’s young men who have to fight and die. Again, this seems pretty unfair.
I’ve heard that taxation without representation is bad, and while we could in theory be doing better, the fact of the matter is Americans ages 18-29 have only one representative in the House and no representation at all in the Senate.
In our third chamber, representation could be by age bracket. Decades seems like the natural breakdown here, so we might assign 10 seats to people in their 20’s, 10 seats to people in their 30’s, and so on. Representatives would be elected for 2-year terms, and the main qualification would be that you would need to be in the proper age bracket on election day. If I ran as a 29-year-old and won, I would serve out my term as a representative for 20-somethings. Once my term was up I could run for the same chamber, but because I would then be 31, I would have to run as a representative for 30-somethings.
It’s not clear how far we should take this: the elderly deserve representation too, but we probably don’t want to mandate a voting bloc of representatives in their 90’s. Because other chambers tend to skew old, maybe it would make sense to have a cutoff at retirement age. Let’s imagine we raise the retirement age to 70 (possibly a good idea on its own), and so the final bracket in this chamber would be 70+. If an 80-year-old ran for a seat in this chamber, she would be running to represent everyone age 70 and up.
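To make the bracket rule concrete, here is a minimal sketch of how a candidate’s bracket might be determined from their age on election day. The function name `bracket_for` is hypothetical, and two details are our own assumptions rather than part of the proposal: candidates must be at least 18, and 18- and 19-year-olds are folded into the 20s bracket.

```python
def bracket_for(age):
    """Return the bracket a candidate would run in, given their age on election day.

    Assumptions (not from the proposal): candidates must be at least 18,
    and 18- and 19-year-olds are folded into the 20s bracket.
    """
    if age < 18:
        raise ValueError("candidate must be an adult")
    if age >= 70:
        return "70+"  # final bracket, using the retirement-age cutoff
    decade = max(20, (age // 10) * 10)
    return f"{decade}s"
```

Under this sketch, the 29-year-old from the example runs in the 20s bracket, and two years later, at 31, would have to run in the 30s bracket instead.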
Legal scholar @tinybaby proposes a similar system, saying, “each state should have an allocation of 100 years of age for their senators. you can send two 50 year olds but if you want dianne feinstein in the senate you gotta send a 13 year old too”. All we can say is that this is an even more creative solution.
With some of these systems, it’s hard to know how to slice and dice the population for representation. This is how we end up with things like gerrymandering, which no one likes except politicians.
A solution to this is to use metrics that are easy to measure and hard to fake. For example, we could have a chamber composed entirely of the 100 tallest people in the country. Wait, that’s just basketball players. Ok, why not the NBA (& WNBA) winning teams for that year? At least we’ll know that these representatives are good at something.
People already complain about politicians spending too much time dunking on each other, and not enough time on policy. If this is going to happen anyways, let’s at least make the dunking literal.
One explanation for this standstill is the constant partisan bickering. According to this explanation, our representatives are too busy dunking on each other (see above) and scoring points for their parties to actually do their jobs.
If this is the case, one solution would be a chamber that forces the two parties to work together. This chamber would look a lot like the Senate — each state would elect and send two representatives. The twist is that every state would have to send both a Republican and a Democrat, so the chamber would always be 50 Democrats and 50 Republicans.
We’ll take another page out of Lebanon’s playbook here. In Lebanon, all elections are by universal suffrage, so a Christian candidate needs to court votes from Muslim voters in their district, and vice-versa. The same would happen for this chamber — representatives would be elected in a general election, not a primary, so they would benefit from getting votes from members of the other party. Like every state, Massachusetts would have to send one Republican, and the Republican they chose would undoubtedly be the one who best figured out how to appeal to liberal voters.
This precludes all the usual fights over who controls what branch of government, at least for this chamber. In this chamber, no party is ever in control — it’s always 50/50.
If the votes in this chamber were by simple majority, then because both parties have 50 seats, either party would be able to pass any law they wanted to. That’s not what we want, so in this chamber, laws can only be passed by a two-thirds majority. As a result, you literally cannot pass a law in this chamber unless it’s strongly supported by both parties.
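The arithmetic behind that claim is quick to check. This is a rough sketch that assumes “two-thirds majority” means at least two-thirds of all 100 seats, with every member voting; the variable names are ours.

```python
import math

SEATS_PER_PARTY = 50
TOTAL_SEATS = 2 * SEATS_PER_PARTY

# Smallest whole number of votes that is at least two-thirds of the chamber.
votes_needed = math.ceil(2 * TOTAL_SEATS / 3)

# Even if one party votes as a bloc, it still needs this many votes from the other side.
crossover_needed = votes_needed - SEATS_PER_PARTY
```

With 100 seats this works out to 67 votes, so a perfectly united party would still need 17 members of the other party, about a third of them, to sign on.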
No more fighting over the most controversial issues of the nation. Leave that to the House and the Senate. In this chamber, there’s no point trying to pass a law unless you think it has a chance at broad appeal — and the hope is, this would encourage them to aim for the policies that most Americans already want.
There are lots of humans in America, but there are even more animals. They deserve representation too, from the greatest moose to the lowliest grasshopper. They deserve some kind of… animal house. That’s the whole joke, sorry.
There’s another old problem with democracy, which is that you can only elect people who are running for office. These people are naturally power-hungry; you can tell because they’re running for office. Who would want to vote for that guy? As Douglas Adams put it:
“The major problem — one of the major problems, for there are several — one of the many major problems with governing people is that of whom you get to do it; or rather of who manages to get people to let them do it to them. To summarize: it is a well-known fact that those people who must want to rule people are, ipso facto, those least suited to do it. To summarize the summary: anyone who is capable of getting themselves made President should on no account be allowed to do the job.”
The problem with elections is you’re electing someone who decided to run. They want power, and that means you can’t trust them.
There are also many people who we like and trust, but who wouldn’t want to run for Senate or wouldn’t do a good job if they did, people like (depending on your political views) Adam Savage, John Oliver, Oprah (or Oprah for men, Joe Rogan), Cory Doctorow, or Ursula K. Le Guin’s ghost. Not all of these people are living American citizens, but you get the idea. I don’t know if many of these people would make good legislators. I think most of them would even turn down the job. But there’s some political savvy there that we’re letting go to waste.
So let’s cut the Gordian knot on this one by combining these two issues into a solution. We want to find people who would be good at governing but who would never think of running, and we want to take advantage of the political savvy of people who wouldn’t make good legislators or wouldn’t even take the job in the first place.
So what we do is we elect clever people as nominators, and let them nominate people they think would be good for the job. In this chamber, you can’t run for a position. What you can do is run to nominate people to this chamber. If you win, you get to nominate people for a few — let’s say three — positions. Assuming they accept the nomination, the people you choose serve out a normal term in congress.
To keep this from being politics-as-usual, there would need to be some limits on who could run as a nominator and who could be nominated. The whole benefit of this system is to bring in genius from outside the normal political world, people who would normally be too humble to run. We can’t have this system filled up with career politicians, because the whole point is that constantly bringing in new people is a good way to flush out the swamp.
To begin with, nominators can’t have been elected to any state or federal position. If they win, and nominate people to this chamber, they are barred from politics, and can’t run for any position or hold any office in the future. This keeps them from being consistent kingmakers.
Similarly, we want the members of this chamber to be intelligent outsiders, so nominees can’t have been previously elected to any federal position. We think state and lower elected positions might be all right, as this would be a good way to elevate someone gifted from local politics to the national stage. Unlike nominators, however, nominees should be able to run for office in the future. They can’t be nominated to this chamber again, but if they do a really great job and everyone agrees they’re a wonderful politician, it’s clear they would be able to do well in the House or Senate, and they should be allowed to run for other offices.
This does sound a little strange, but many of our most important federal officials are already appointed. In fact, one of the main powers of the president is appointing officials like ambassadors, cabinet positions, federal judges, and many others. In some ways, the president is really just a super-nominator position. This chamber is simply taking the good idea of appointing officials, making it more democratic, and extending it to Congress.
One of our favorite things about the idea of nomination is that it’s a way to get normal people, who can’t or wouldn’t run for office, involved in government directly. We liked this so much that it got us thinking about other ways you might be able to get normal people involved in government.
Originally we played around with different ideas about how to find normal people who could be elevated to office. A chamber composed of the most viral tiktokers that year? A chamber of the mods of the 20 subreddits with the most subscribers? A chamber of the best dolphin trainers? These are very democratic ideas, but we couldn’t find a way to make them work. All these ideas would be immediately captured by people seeking to gain office, which defeats the porpoise, and reddit would be in charge of our electoral security, which seems like a big ask. There’s also surprisingly strong circumstantial evidence that Ghislaine Maxwell was the moderator of several major subreddits, so maybe this isn’t such a great idea after all.
These people are more normal than politicians, but reddit mods and twitch streamers are not exactly normal either. They’re regular people, but they’re not all that representative.
So instead, why not a chamber elected by lottery? In this chamber, representatives are randomly drawn from the population of all adult citizens, or maybe all registered voters.
Most of these people would have no experience in government. To account for this, each member of this chamber would be elected for terms of 6 years, but with the terms staggered, so that every two years only one-third of the members would be replaced by lottery. (This is exactly how the Senate does it.) This means that while every one of these representatives was randomly drawn from the population, at any given time one-third of them would have at least four years of experience, one-third of them would have at least two years of experience, and one-third of them would be incoming freshmen.
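The staggered lottery can be sketched as a small simulation. Everything here is illustrative: the function names, the class size, and the use of integer IDs to stand in for citizens are our own assumptions. We also assume sitting members are excluded from the draw, so that nobody holds two seats at once, and that the chamber starts with three classes already in place.

```python
import random


def draw_class(population, seats, sitting, rng):
    """Draw one incoming class uniformly at random from people not currently sitting."""
    eligible = [person for person in population if person not in sitting]
    return rng.sample(eligible, seats)


def simulate(population, seats_per_class, n_elections, seed=0):
    """Keep three staggered classes; each election replaces the oldest one by lottery."""
    rng = random.Random(seed)
    classes = []
    # Bootstrapping assumption: fill all three classes before the first regular election.
    for _ in range(3):
        sitting = {m for c in classes for m in c}
        classes.append(draw_class(population, seats_per_class, sitting, rng))
    for _ in range(n_elections):
        classes.pop(0)  # the class that has served six years steps down
        sitting = {m for c in classes for m in c}
        classes.append(draw_class(population, seats_per_class, sitting, rng))
    return classes  # oldest class first, incoming freshmen last
```

Calling something like `simulate(range(1000), 10, 5)` returns three classes of ten members each, where only the last class is brand new after any given election.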
A member could theoretically be elected to this chamber more than once, but because election is by lottery, the chances of this happening are pretty damn slim (though given how weird the world is, it would probably still happen at some point). However, like the nomination idea above, these randomly-elected legislators might do such a good job that they would go on to be directly elected to other branches of government.
While you might be concerned about a chamber filled with randomly selected Americans, this chamber still can’t pass laws without the help of either the House or the Senate. It’s true their views will probably be more diverse than the views held in the House and the Senate, but you say that like it’s a bad thing.
In some ways, this encompasses many of the other ideas we described above, and does a better job of it. A chamber elected by true lottery will not only be balanced in terms of demographics, it will actually be representative. The distribution of gender, age, race, education, religion, profession, and so on in this chamber will all be nearly identical to the United States in general. Gerrymandering literally can’t affect it, since it’s a random sample. It’s hard to imagine a better way of getting diverse voices in politics.
One of the mysterious aspects of obesity is that it is correlated with altitude. People tend to be leaner at high altitudes and fatter near sea level. Colorado is the highest-altitude US state and also the leanest, with an obesity rate of only 22%. In contrast, low-altitude Louisiana has an obesity rate of about 36%. This is pretty well documented in the literature, and isn’t just limited to the United States. We see the same thing in countries around the world, from Spain to Tibet.
A popular explanation for this phenomenon is the idea that hypoxia, or lack of oxygen, leads to weight loss. The story goes that because the atmosphere is thinner at higher altitudes, the body gets less oxygen, and this ends up making people leaner.
This study focused on twenty middle-aged obese German men (mean age 55.7, mean BMI 33.7), all of whom normally lived at a low altitude — 571 ± 29 meters above sea level. Participants were first given a medical exam in Munich, Germany (530 meters above sea level) to establish baseline values for all measures. A week later, all twenty of the obese German men, as well as (presumably) the researchers, traveled to “the air‐conditioned Environmental Research Station Schneefernerhaus (UFS, Zugspitze, Germany)”, a former hotel in the Bavarian Alps (2,650 meters above sea level). The hotel/research station “was effortlessly reached by cogwheel train and cable car during the afternoon of day 6.”
Patients stayed in the Schneefernerhaus research station for a week, where they “ate and drank without restriction, as they would have at home.” Exercise was “restricted to slow walks throughout the station: more vigorous activity was not permitted.” They note that there was slightly less activity at the research station than there was at low altitudes, “probably due to the limited walking space in the high‐altitude research station.” Sounds cozy.
During this week-long period at high altitude, the researchers continued collecting measurements of the participants’ health. After the week was through, everyone returned to Munich (530 meters above sea level). At this point the researchers waited four weeks (it’s not clear why) before conducting the final health examinations, at which point the study concluded. We’re not sure what to say about this study design, except that it’s clear the film adaptation should be directed by Wes Anderson.
While this design is amusing, the results are uninspiring.
To begin with, the weight loss was minimal. During the week they spent at 2,650 meters, patients lost an average of 3 pounds (1.5 kg). They were an average of 232 lbs (105.1 kg) to begin with, so this is only about 1% of their body weight. Going from 232 lbs (105.1 kg) to 229 lbs (103.6 kg) doesn’t seem clinically relevant, or even all that noticeable. The authors, surprisingly, agree: “the absolute amount of weight loss was so small.”
More importantly, we’re not convinced that this tiny weight loss result is real, because the paper suffers from serious multiple comparison problems. Also known as p-hacking or “questionable research practices”, multiple comparisons are a problem because they can make it very likely to get a false positive. If you run one statistical test, there’s a small chance you will get a false positive, but as you run more tests, false positives get more and more likely. If you run enough tests, you are virtually guaranteed to get a false positive, or many false positives. If you try running many different tests, or try running the same test many different ways, and only report the best one, it’s possible to make pure noise look like a strong finding.
We see evidence of multiple comparisons in the paper. They collect a lot of measures and run a lot of tests. The authors report eight measures of obesity alone, as well as many other measures of health.
The week the patients spent at 2,650 meters — Day 7 to Day 14 — is clearly the interval of interest here, but they mostly report comparisons of Day 1 to the other days, and they tend to report all three pairs (D1 to D7, D1 to D14, and D1 to D42), which makes for three times the number of comparisons. It’s also confusing that there are no measures for D21, D28, and D35. Did they not collect data those days, or just not report it? We think they just didn’t collect data, but it’s not clear.
The authors also use a very unusual form of statistical analysis — for each test, first they conducted a nonparametric Friedmann procedure. Then, if that showed a significant rank difference, they did a Wilcoxon signed‐rank method test. It’s pretty strange to run one test conditional on another like this, especially for such a simple comparison. It’s also not clear what role the Friedmann procedure is playing in this analysis. Presumably they are referring to the Friedman test (we assume they don’t mean this procedure for biodiesel analysis) and this is a simple typo, but it’s not clear why they want to rank the means. In addition, the Wilcoxon signed‐rank test seems like a slightly strange choice. The more standard analysis here would be the humble paired t-test.
Even if this really were best practice, there’s no way to know that they didn’t start by running paired t-tests, throwing those results out when they found that they were only trending in the right direction. And in fact, we noticed that if we compare body weight at D7 to D14 using a paired t-test, we find a p-value of .0506, instead of the p < .001 they report when comparing D1 to D14 with a Wilcoxon test. We think that this is the more appropriate analysis, and as you can see, it’s not statistically significant.
Regardless, the whole analysis is called into question by the number of tests they ran. By our count they conducted at least 74 tests in this paper, which is a form of p-hacking and makes the results very hard to interpret. It’s also possible that they conducted even more tests that weren’t reported in the paper. This isn’t really their fault — p-hacking wasn’t described until 2011 (and the term itself wasn’t invented until a few years later), so like most people they were almost certainly unfamiliar with issues of multiple comparisons when they did their analysis. While we don’t accuse the authors of acting in bad faith, we do think this seriously undermines our ability to interpret their results. When we ran the single test that we think was most appropriate, we found that it was not significant.
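To get a feel for why 74 tests is a problem, here is a back-of-the-envelope sketch. It assumes the conventional threshold of .05 and, unrealistically, that all the tests are independent; the tests in the paper are surely correlated, so treat this as a rough guide rather than a description of the authors’ analysis. The function names are ours.

```python
ALPHA = 0.05   # conventional per-test significance threshold (an assumption)
N_TESTS = 74   # our count of the tests reported in the paper


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive among n independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests


fwer = familywise_error_rate(ALPHA, N_TESTS)
threshold = bonferroni_threshold(ALPHA, N_TESTS)
```

Under these assumptions, the chance of at least one false positive across 74 tests is roughly 98%, and a Bonferroni correction would put the per-test threshold at about .0007.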
And of course, the sample size was only 20 people, though perhaps there wasn’t room for many more people in the research station. On one hand this is pretty standard for intensive studies like this, but it reduces the statistical power.
The authors claim to show that hypoxia causes weight loss, but this is overstating their case. They report that people brought to 2,650 meters lost a small amount of weight and had lower blood oxygen saturation, but we think the former result is noise and the latter result is unsurprising. Obviously if you bring people to 2,650 meters they will have lower blood oxygen, and there’s no evidence linking that to the reported weight loss.
Even more concerning is the fact that there’s no control group, which means that this study isn’t even an experiment. Without a control group, there can be no random assignment, and with no random assignment, a study isn’t an experiment. As a result, the strong causal claim the authors draw from their results is pretty unsubstantiated.
There isn’t an obvious fix for this problem. A control group that stayed in Munich wouldn’t be appropriate, because oxygen is confounded with everything else about altitude. If there were a difference between the Munich group and the Schneefernerhaus group, there would be no way to tell if that was due to the amount of oxygen or any of the other thousand differences between the two locations. A better approach would be to bring a control group to the same altitude, and give that control group extra oxygen, though that might introduce its own confounds: for example, the supplemental-oxygen group would all be wearing masks and carrying canisters. I guess the best way to do this would be to bring both groups to the Alps, give both of them canisters and masks, but put real oxygen in the canisters for one group and placebo oxygen (nitrogen?) in the canisters for the other group.
We’re sympathetic to inferring causal relationships from correlational data, but the authors don’t report a correlation between blood oxygen saturation and weight loss, even though that would be the relevant test given the data that they have. Probably they don’t report it because it’s not significant. They do report, “We could not find a significant correlation between oxygen saturation or oxygen partial pressure, and either ghrelin or leptin.” These are tests that we might expect to be significant if hypoxia caused weight loss — which suggests that it does not.
Unfortunately, the authors report no evidence for their mechanism and probably don’t have an effect to explain in the first place. This is too bad — the study asks an interesting question, and the design looks good at first. It’s only on reflection that you see that there are serious problems.
Thanks to Nick Brown for reading a draft of this post.
One thing that Nick Brown noticed when he read the first draft of this post is that the oxygen saturation percentages reported for D7 and D14 seem to be dangerously low. We’ve all become more familiar with oxygen saturation measures because of COVID, so you may already know that a normal range is 95-100%. Guidelines generally suggest that levels below 90% are dangerous, and should be cause to seek medical attention, so it’s a little surprising that the average for these 20 men was in the mid-80s during their week at high altitude. We found this confusing so we looked into it, and it turns out that this is probably not an issue. Not only are lower oxygen saturation levels normal at higher altitudes, the levels can apparently be very low by sea-level standards without becoming dangerous. For example, in this study of residents of El Alto in Bolivia (an elevation of 4018 m), the mean oxygen saturation percentages were in the range of 85-88%. So while this is definitely striking, it’s probably not anything to worry about.
New decade just dropped. Eli Dourado is optimistic in detail. Feeling good about energy, electric cars, and filtering your blood to keep you from aging. Definitely cheaper than bathing in the blood of virgins. Don’t try this last one at home… yet 😉
Briefly, Hall et al. (2019) is a metabolic ward study on the effects of “ultra-processed” foods on energy intake and weight gain. The participants were 20 adults, an average of 31.2 years old. They had a mean BMI of 27, so on average participants were slightly overweight, but not obese.
Participants were admitted to the metabolic ward and randomly assigned to one of two conditions. They either ate an ultra-processed diet for two weeks, immediately followed by an unprocessed diet for two weeks — or they ate an unprocessed diet for two weeks, immediately followed by an ultra-processed diet for two weeks. The study was ad libitum, so whether they were eating an unprocessed or an ultra-processed diet, participants were always allowed to eat as much as they wanted — in the words of the authors, “subjects were instructed to consume as much or as little as desired.”
The authors found that people ate more on the ultra-processed diet and gained a small amount of weight, compared to the unprocessed diet, where they ate less and lost a small amount of weight.
We’re not in the habit of re-analyzing published papers, but we decided to take a closer look at this study because a couple of things in the abstract struck us as surprising. Weight change is one main outcome of interest for this study, and several unusual things about this measure stand out immediately. First, the two groups report the same amount of change in body weight, the only difference being that one group gained weight and the other group lost it. In the ultra-processed diet group, people gained 0.9 ± 0.3 kg (p = 0.009), and in the unprocessed diet group, people lost 0.9 ± 0.3 kg (p = 0.007). (Those ± values are standard errors of the mean.) It’s pretty unlikely for the means of both groups to be identical, and it’s very unlikely that both the means and the standard errors would be identical.
It’s not impossible for these numbers to be the same (and in fact, they are not precisely equal in the raw data, though they are still pretty close), especially given that they’re rounded to one decimal place. But it is weird. We ran some simple simulations which suggest that this should only happen about 5% of the time — but this is assuming that the means and SDs of the two groups are both identical in the population, which itself is very unlikely.
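A simulation along these lines can be sketched as follows. This is an illustrative sketch, not the exact code behind the 5% figure: the function name, the use of independent normal draws, and the population SD of 1.4 kg (chosen so that the standard error with n = 20 comes out near 0.3) are all assumptions.

```python
import random
import statistics


def matching_rate(n=20, pop_mean=0.9, pop_sd=1.4, sims=5000, seed=1):
    """Fraction of simulated studies in which both groups report the same
    absolute mean AND the same standard error after rounding to one decimal."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(sims):
        # Assumption: one group gains and the other loses the same amount
        # on average, with identical spread in the population.
        gain = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        loss = [rng.gauss(-pop_mean, pop_sd) for _ in range(n)]
        mean_gain = round(abs(statistics.mean(gain)), 1)
        mean_loss = round(abs(statistics.mean(loss)), 1)
        se_gain = round(statistics.stdev(gain) / n ** 0.5, 1)
        se_loss = round(statistics.stdev(loss) / n ** 0.5, 1)
        if mean_gain == mean_loss and se_gain == se_loss:
            matches += 1
    return matches / sims


if __name__ == "__main__":
    print(matching_rate())
```

Because the population parameters are set to be identical for both groups, the rate this returns is best read as an upper bound on how often the match would happen by chance.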
Another test of interest reported in the abstract also seemed odd. They report that weight changes were highly correlated with energy intake (r = 0.8, p < 0.0001). This correlation coefficient struck us as surprising, because it’s pretty huge. There are very few measures that are correlated with one another at 0.8 — these are the types of correlations we tend to see between identical twins, or repeated measurements of the same person. As an example, in identical twins, BMI is correlated at about r = 0.8, and height at about r = 0.9.
We know that these points are pretty ticky-tacky stuff. By themselves, they’re not much, but they bothered us. Something already seemed weird, and we hadn’t even gotten past the abstract.
To conduct this analysis, we teamed up with Nick Brown, with additional help from James Heathers. We focused on one particular dependent variable of this study, weight change, while Nick took a broader look at several elements of the paper.
Because we were most interested in weight change, we decided to begin by taking a close look at the file “deltabw”. In mathematics, delta usually means “change” or “the change in”, and “bw” here stands for “body weight”, so this title indicates that the file contains data for the change in participants’ body weights. On the OSF this is in the form of a SAS .sas7bdat file, but we converted it to a .csv file, which is a little easier to work with.
Here’s a screenshot of what the deltabw file looks like:
In this spreadsheet, each row tells us about the weight for one participant on one day of the 4-week-long study. These daily body weight measurements were performed at 6am each morning, so we have one row for every day.
Let’s also orient you to the columns. “StudyID” is the ID for each participant. Here we can see that in this screenshot we are looking just at participant ADL001, or participant 01 for short. The “Period” variable tells us whether the participant was eating an ultra-processed (PROC) or an unprocessed (UNPROC) diet on that day. Here we can see that participant 01 was part of the group who had an unprocessed diet for the first two weeks, before switching to the ultra-processed diet for the last two weeks. “Day” tells us which day in the 28-day study the measurement is from. Here we show only the first 20 days for participant 01.
“BW” is the main variable of interest, as it is the participant’s measured weight, in kilograms, for that day of the study. “DayInPeriod” tells us which day they are on for that particular diet. Each participant goes 14 days on one diet then begins day 1 on the other diet. “BaseBW” is just their weight for day 1 on that period. Participant 01 was 94.87 kg on day one of the unprocessed diet, so this column holds that value as long as they’re on that diet. “DeltaBW” is the difference between their weight on that day and the weight they were at the beginning of that period. For example, participant 01 weighed 94.87 kg on day one and 94.07 kg on day nine, so the DeltaBW value for day nine is -0.80.
Finally, “DeltaDaily” is a variable that we added, which is just a simple calculation of how much the participant’s weight changed each day. If someone weighed 82.85 kg yesterday and they weigh 82.95 kg today, the DeltaDaily would be 0.10, because they gained 0.10 kg in the last 24 hours.
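For readers who prefer code to column descriptions, here is a minimal sketch of how the two derived columns relate to the raw weights. The function names are ours, made up for illustration; the example values are the ones quoted above.

```python
def delta_bw(period_weights_kg):
    """DeltaBW: each day's weight minus the weight on day 1 of that diet period."""
    base = period_weights_kg[0]
    return [round(w - base, 2) for w in period_weights_kg]


def delta_daily(weights_kg):
    """DeltaDaily: each day's weight minus the previous day's weight."""
    return [round(today - yesterday, 2)
            for yesterday, today in zip(weights_kg, weights_kg[1:])]


if __name__ == "__main__":
    print(delta_bw([94.87, 94.07]))     # day one vs. day nine for participant 01
    print(delta_daily([82.85, 82.95]))  # the 0.10 kg example
```

Rounding to two decimals just removes floating-point noise; the inputs already carry two decimal places.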
To begin with, we were able to replicate the authors’ main findings. When we don’t round to one decimal place, we see that participants on the ultra-processed diet gained an average of 0.9380 (± 0.3219) kg, and participants on the unprocessed diet lost an average of 0.9085 (± 0.3006) kg. That’s only a difference of 0.0295 kg in absolute values in the means, and 0.0213 kg for the standard errors, which we still find quite surprising. Note that this is different from the concern about standard errors raised by Drs. Mackerras and Blizzard. Many of the standard errors in this paper come from GLM analysis, which assumes homogeneity of variances and often leads to identical standard errors. But these are independently calculated standard errors of the mean for each condition, so it is still somewhat surprising that they are so similar (though not identical).
On average these participants gained and lost impressive but not shocking amounts of weight. A few of the participants, however, saw weight loss that was very concerning. One woman lost 4.3 kg in 14 days which, to quote Nick Brown, “is what I would expect if she had dysentery” (evocative though perhaps a little excessive). In fact, according to the data, she lost 2.39 kg in the first five days alone. We also notice that this patient was only 67.12 kg (about 148 lbs) to begin with, so such a huge loss is proportionally even more concerning. This is the most extreme case, of course, but not the only case of such intense weight change over such a short period.
The article tells us that participants were weighed on a Welch Allyn Scale-Tronix 5702 scale, which has a resolution of 0.1 lb or 100 grams (0.1 kg). This means it should only display data to one decimal place. Here’s the manufacturer’s specification sheet for that model. But participant weights in the file deltabw are all reported to two decimal places; that is, with a precision of 0.01 kg, as you can clearly see from the screenshot above. Of the 560 weight readings in the data file, only 55 end in zero. It is not clear how this is possible, since the scale apparently doesn’t display this much precision.
To confirm this, we wrote to Welch Allyn’s customer support department, who confirmed that the model 5702 has 0.1 kg resolution.
We also considered the possibility that the researchers measured people’s weight in pounds and then converted to kilograms, in order to use the scale’s better precision of 0.1 pounds (45.4 grams) rather than 100 grams. However, in this case, one would expect to see that all of the changes in weight were multiples of (approximately) 0.045 kg, which is not what we observe.
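To make the pounds hypothesis concrete, this small sketch lists the weight changes (in kg, rounded to two decimals) that correspond to the first few 0.1 lb steps. The function name is hypothetical, and real data would be a little messier, since each day’s weight would be converted and rounded separately before taking the difference.

```python
KG_PER_LB = 0.45359237  # exact, by definition of the international pound


def kg_steps_if_weighed_in_pounds(max_steps=5):
    """Weight changes in kg corresponding to 1, 2, ... steps of 0.1 lb."""
    return [round(k * 0.1 * KG_PER_LB, 2) for k in range(1, max_steps + 1)]


if __name__ == "__main__":
    print(kg_steps_if_weighed_in_pounds())  # [0.05, 0.09, 0.14, 0.18, 0.23]
```

Under this hypothesis, changes would cluster around values like 0.05, 0.09, and 0.14 kg rather than around round numbers like 0.10 or 0.20 kg.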
As we look closer at the numbers, things get even more confusing.
As we noted, Hall et al. report participant weight to two decimal places in kilograms for every participant on every day. Kilograms to two decimal places should be pretty sensitive, but there are many cases where the exact same weight appears two or even three times in a row. For example, participant 21 is listed as having a weight of exactly 59.32 kg on days 12, 13, and 14, participant 13 is listed as having a weight of exactly 96.43 kg on days 10, 11, and 12, and participant 06 is listed as having a weight of exactly 49.54 kg on days 23, 24, and 25.
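Runs like these are easy to find mechanically. Here is a minimal sketch, assuming the daily weights for one participant are available as a list ordered by day; the function name and the example list (apart from the 59.32 kg value) are made up for illustration.

```python
from itertools import groupby


def repeated_runs(weights_kg, min_length=2):
    """Return (weight, first_day, last_day) for each run of identical
    consecutive weights at least `min_length` days long. Days count from 1."""
    runs = []
    day = 1
    for weight, group in groupby(weights_kg):
        length = len(list(group))
        if length >= min_length:
            runs.append((weight, day, day + length - 1))
        day += length
    return runs


if __name__ == "__main__":
    # Hypothetical five-day excerpt containing a three-day run.
    print(repeated_runs([59.10, 59.32, 59.32, 59.32, 59.50]))
```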
Having the same weight for two or even three days in a row may not seem that strange, but it is very remarkable when the measurement is in kilograms precise to two decimal places. After all, 0.01 kg (10 grams) is not very much weight at all. A standard egg weighs about 0.05 kg (50 grams). A shot of liquor is a little less, usually a bit more than 0.03 kg (30 grams). A tablespoon of water is about 0.015 kg (15 grams). This suggests that people’s weights are varying by less than the weight of a tablespoon of water over the course of entire days, and sometimes over multiple days. This uncanny precision seems even more unusual when we note that body weight measurements were taken at 6 am every morning “after the first void”, which suggests that participants’ bodily functions were precise to 0.01 kg on certain days as well.
The case of participant 06 is particularly confusing, as 49.54 kg is exactly one kilogram less, to two decimal places, than the baseline for this participant’s weight when they started, 50.54 kg. Furthermore, in the “unprocessed” period, participant 06 only ever seems to lose or gain weight in full increments of 0.10 kilograms.
We see similar patterns in the data from other participants. Let’s take a look at the DeltaDaily variable. As a reminder, this variable is just the difference between a person’s weight on one day and the day before. These are nothing more than daily changes in weight.
Because these numbers are calculated from the difference between two weight measurements, both of which are reported to two decimal places of accuracy, these numbers should have two places of accuracy as well. But surprisingly, we see that many of these weight changes are in full increments of 0.10.
Take a look at the histograms below. The top histogram is the distribution of weight changes by day. For example, a person might gain 0.10 kg between days 15 and 16, and that would be one of the observations in this histogram.
You’ll see that these data have an extremely unnatural hair-comb pattern of spikes, with only a few observations in between. This is because the vast majority (~71%) of the weight changes are in exact multiples of 0.10, despite the fact that weights and weight changes are reported to two decimal places. That is to say, participants’ weights usually changed in increments like 0.20 kg, -0.10 kg, or 0.40 kg, and almost never in increments like -0.03 kg, 0.12 kg, or 0.28 kg.
For comparison, on the bottom is a sample from a simulated normal distribution with identical n, mean, and standard deviation. You’ll see that there is no hair-comb pattern for these data.
As we mentioned earlier, there are several cases where a participant stays at the exact same weight for two or three days in a row. The distribution we see here is the cause. As you can see, the most common daily change is exactly zero. Now, it’s certainly possible to imagine why some values might end up being zero in a study like this. There might be a technical incident with the scale, a clerical error, or a mistake when recording handwritten data on the computer. A lazy lab assistant might lose their notes, resulting in the previous day’s value being used as the reasonable best estimate. But since a change of exactly zero is the modal response, a full 9% of all measurements, it’s hard to imagine that these are all omissions or technical errors.
In addition, there’s something very strange going on with the trailing digits:
On the top here we have the distribution of digits in the 0.1 place. For example, a measurement of 0.29 kg would appear as a 2 here. This follows about the distribution we would expect, though there are a few more 1’s and fewer 0’s than usual.
The bottom histogram is where things get weird. Here we have the distribution of digits in the 0.01 place. For example, a measurement of 0.29 kg would appear as a 9 here. As you can see, 382/540 of these observations have a 0 in their 0.01’s place — this is the same as that figure of 71% of measured changes being in full increments of 0.10 kg that we mentioned earlier.
The rest of the distribution is also very strange. When the trailing digit is not a zero, it is almost certainly a 1 or a 9, possibly a 2 or an 8, and almost never anything else. Of 540 observed weight changes, only 3 have a trailing digit of 5.
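The digit counts behind these histograms take only a few lines to compute. This is a sketch with function names of our own choosing; it assumes the daily changes are given in kg to two decimal places, and it ignores the sign of the change.

```python
from collections import Counter


def tenths_digit(change_kg):
    """Digit in the 0.1 place, e.g. 2 for a change of 0.29 kg."""
    return round(abs(change_kg) * 100) // 10 % 10


def hundredths_digit(change_kg):
    """Digit in the 0.01 place, e.g. 9 for a change of 0.29 kg."""
    return round(abs(change_kg) * 100) % 10


def digit_counts(changes_kg):
    """Count how often each trailing (0.01 place) digit appears."""
    return Counter(hundredths_digit(c) for c in changes_kg)


if __name__ == "__main__":
    examples = [0.20, -0.10, 0.40, -0.03, 0.12, 0.28]
    print(digit_counts(examples))
```

Multiplying by 100 and rounding before taking the digit avoids floating-point surprises such as 0.29 * 100 evaluating to 28.999999999999996.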
We can see that this is not what we would expect from (simulated) normally distributed data:
It’s also not what we would expect to see if they were measuring to one decimal place most of the time (~70%), but to two decimal places on occasion (~30%). As we’ve already mentioned, this doesn’t make sense from a methodological standpoint, because all daily weights are to two decimal places. But even if it somehow were a measurement accuracy issue, we would expect an equal distribution across all the other digits besides zero, like this:
This is certainly not what we see in the reported data. The fact that 1 and 9 are the most likely trailing digit after 0, and that 2 and 8 are most likely after that, is especially strange.
When we first started looking into this paper, we approached Retraction Watch, who said they considered it a potential story. After completing the analyses above, we shared an early version of this post with Retraction Watch, and with our permission they approached the authors for comment. The authors were kind enough to offer feedback on what we had found, and when we examined their explanation, we found that it clarified a number of our points of confusion.
The first thing they shared with us was this erratum from October 2020, which we hadn’t seen before. The erratum reports that they noticed an error in the documented diet order of one participant. This is an important note but doesn’t affect the analyses we present here, which have very little to do with diet conditions.
Kevin Hall, the first author on this paper, also shared a clarification on how body weights were calculated:
I think I just discovered the likely explanation about the distribution of high-precision digits in the body weight measurements that are the main subject of one of the blogs. It’s kind of illustrative of how difficult it is to fully report experimental methods! It turns out that the body weight measurements were recorded to the 0.1 kg according to the scale precision. However, we subtracted the weight of the subject’s pajamas that were measured using a more precise balance at a single time point. We repeated subtracting the mass of the pajamas on all occasions when the subject wore those pajamas. See the example excerpted below from the original form from one subject who wore the same pajamas (PJs) for three days and then switched to a new set. Obviously, the repeating high precision digits are due to the constant PJs! 😉
This matches what is reported in the paper, where they state, “Subjects wore hospital-issued top and bottom pajamas which were pre-weighed and deducted from scale weight.”
Kevin also included the following image, which shows part of how the data was recorded for one participant:
If we understand this correctly, the first time a participant wore a set of pajamas, the pajamas were weighed to three decimals of precision. Then, that measurement was subtracted from the participant’s weight on the scale (“Patient Weight”) on every consecutive morning, to calculate the participant’s body weight. For an unclear reason, this was recorded to two decimals of precision, rather than the one decimal of precision given by the scale, or the three decimals of precision given by the PJ weights. When the participant switched to a new set of pajamas, the new set was weighed to three decimals of precision, and that number was used to calculate participant body weight until they switched to yet another new set of pajamas, etc.
We assume that the measurement for the pajamas is given in kilograms, even though they write “g” and “gm” (“qm”?) in the column. I wish my undergraduate lab TAs were as forgiving as the editors at Cell Metabolism.
This method does account for the fact that participant body weights were reported to two decimal places of precision, despite the fact that the scale only measures weight to one decimal place of precision. Even so, there were a couple of things that we still found confusing.
The variable that interests us the most is the DeltaDaily variable. We can easily calculate that variable for the provided example, like so:
We can see that whenever a participant doesn’t change their pajamas on consecutive days, there’s a trailing zero. In this way, the pajamas can account for the fact that 71% of the time, the trailing digits in the DeltaDaily variable were zeros.
We also see that whenever the trailing digit is not zero, that lets us identify when a participant has changed their pajamas. Note of course that about ten percent of the time, a change in pajamas will also lead to a trailing digit of zero. So every trailing digit that isn’t zero is a pajama change, though a small number of the zeros will also be “hidden” pajama changes.
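The inference rule described here can be written down directly. This is a sketch under the assumptions in the text: the scale reads in 0.1 kg steps, so a nonzero digit in the 0.01 place of a daily change can only come from a change of pajamas. The function name and the example weights are hypothetical.

```python
def inferred_pajama_change_days(weights_kg):
    """Days (counting from 1) on which the daily change has a nonzero digit
    in the 0.01 place, which implies a new set of pajamas that day.
    Changes that happen to leave a trailing zero are missed."""
    days = []
    for day, (prev, curr) in enumerate(zip(weights_kg, weights_kg[1:]), start=2):
        change_hundredths = round(abs(curr - prev) * 100)
        if change_hundredths % 10 != 0:
            days.append(day)
    return days


if __name__ == "__main__":
    # Hypothetical weights: only the change into day 3 (-0.26 kg) is flagged.
    print(inferred_pajama_change_days([79.58, 79.88, 79.62, 79.72]))
```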
In any case, we can use this to make inferences about how often participants change their pajamas, which we find rather confusing. Participants often change their pajamas every day for multiple days in a row, or go long stretches without apparently changing their pajamas at all, and sometimes these are the same participants. It’s possible that these long stretches without any apparent change of pajamas are the result of the “hidden” changes we mentioned, because about 10% of the time changes would happen without the trailing digit changing, but it’s still surprising.
For example, participant 05 changes their pajamas on day 2, day 5, and day 10, and then apparently doesn’t change their pajamas again until day 28, going more than two weeks without a change in PJs. Participant 20, in contrast, changes pajamas at least 16 times over 28 days, including every day for the last four days of the study. The record for this, however, has to go to participant 03, who at one point appears to have switched pajamas every day for at least seven days in a row. Participant 03 then goes eight days in a row without changing pajamas before switching pajamas every day for three days in a row.
Participant 08 (the participant from the image above) seems to change their pajamas only twice during the entire 28-day study, once on day 4 and again on day 28. Certainly this is possible, but it doesn’t look like the pajama-wearing habits we would expect. It’s true that some people probably want to change their pajamas more than others, but this doesn’t seem like it can be entirely attributed to personality, as some people don’t change pajamas at all for a long time, and then start to change them nearly every day, or vice-versa.
We were also unclear on whether the pajamas adjustment could account for the most confusing pattern we saw in the data for this article, the distribution of digits in the .01 place for the DeltaDaily variable:
The pajamas method can explain why there are so many zeros — any day a participant didn’t change their pajamas, there would be a zero, and it’s conceivable that participants only changed their pajamas on 30% of the days they were in the study.
We weren’t sure if the pajamas method could explain the distribution of the other digits. For the trailing digits that aren’t zero, 42% of them are 1’s, 27% of them are 9’s, 9% of them are 2’s, 8% of them are 8’s, and the remaining digits account for only about 3% each. This seems very strange.
You’ll recall that the DeltaDaily values record the changes in participant weights between consecutive days. Because the scale is only precise to 0.1 kg, the data in the 0.01 place records information about the difference between two different pairs of pajamas. For illustration, in the example Kevin Hall provided, the participant switched between a pair of pajamas weighing 0.418 kg and a pair weighing 0.376 kg. These are different by 0.042 kg, so when they rounded it to two digits, the difference we see in the DeltaDaily has a trailing digit of 4.
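Here is that arithmetic as a short sketch. The two pajama weights come from the example Kevin Hall shared; the scale readings of 80.0 and 80.3 kg are hypothetical, chosen only so the subtraction has something to work with.

```python
def body_weight(scale_kg, pajamas_kg):
    """Scale reading (0.1 kg steps) minus pajama weight (0.001 kg steps),
    recorded to two decimal places."""
    return round(scale_kg - pajamas_kg, 2)


if __name__ == "__main__":
    day1 = body_weight(80.0, 0.418)  # first set of pajamas
    day2 = body_weight(80.0, 0.376)  # switched to the lighter set
    day3 = body_weight(80.3, 0.376)  # same pajamas, scale reading up 0.3 kg

    print(round(day2 - day1, 2))  # pajama switch: trailing digit is 4
    print(round(day3 - day2, 2))  # no switch: trailing digit is 0
```

When the pajamas stay the same, their weight cancels out of the difference, and only the 0.1 kg steps of the scale remain.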
We wanted to know if the pajama adjustment could explain why the difference (for the digit in the 0.01’s place) between the weights of two pairs of pajamas is 14x more likely to be a 1 than a 6, or 9x more likely to be a 9 than a 3.
Verbal arguments quickly got very confusing, so we decided to run some simulations. We simulated 20 participants, for 28 days each, just like the actual study. On day one, simulated participants were assigned a starting weight, which was a random integer between 40 and 100. Every day, their weight changed by an amount between -1.5 and 1.5 by increments of 0.1 (-1.5, -1.4, -1.3 … 1.4, 1.5), with each increment having an equal chance of occurring.
The important part of the simulation was the pajamas, of course. Participants were assigned a pajama weight on day 1, and each day they had a 35% chance of changing pajamas, and being assigned a new pajama weight. The real question was how to generate a reasonable distribution of pajama weights. We didn’t have much to go off of, just the two values in the image that Kevin Hall shared with us. But we decided to give it a shot with just that information. Weights of 418 g and 376 g have a mean of just under 400 g and a standard deviation of 30 g, so we decided to sample our pajama weights from a normal distribution with those parameters.
When we ran this simulation, we found a distribution of digits in the 0.01 place that didn’t show the same saddle-shaped distribution as in the data from the paper:
We decided to run some additional simulations, just to be sure. To our surprise, when the SD of the pajamas is smaller, in the range of 10-20 g, you can sometimes get saddle-shaped distributions just like the ones we saw in data from the paper. Here’s an example of what the digits can look like when the SD of the pajamas is 15 g:
It’s hard for us to say whether a standard deviation of 15 g or of 30 g is more realistic for hospital pajamas, but it’s clear that under certain circumstances, pajama adjustments can create this kind of distribution (we propose calling it the “pajama distribution”).
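For anyone who wants to play with this, here is a minimal re-implementation sketch of the simulation described above. It is not the original simulation code, and several details are assumptions: the starting weight is treated as the scale reading, pajama weights are rounded to whole grams, rounding to two decimals is half-up, and the default mean of 397 g is the mean of the two known pajama weights. Everything is done in integer grams to avoid floating-point artifacts.

```python
import random
from collections import Counter


def simulate_digits(pj_sd_g, n_participants=20, n_days=28,
                    pj_mean_g=397, p_change=0.35, seed=0):
    """Count the digit in the 0.01 kg place of simulated daily weight changes."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_participants):
        scale_g = rng.randint(40, 100) * 1000         # starting weight, whole kg
        pj_g = round(rng.gauss(pj_mean_g, pj_sd_g))   # pajamas weighed to 1 g
        prev = None
        for day in range(n_days):
            if day > 0:
                # Daily change from -1.5 to +1.5 kg in 0.1 kg steps, equal odds.
                scale_g += rng.randint(-15, 15) * 100
                if rng.random() < p_change:
                    pj_g = round(rng.gauss(pj_mean_g, pj_sd_g))
            # Body weight in units of 10 g, i.e. kg to two decimal places.
            body_10g = (scale_g - pj_g + 5) // 10
            if prev is not None:
                counts[abs(body_10g - prev) % 10] += 1
            prev = body_10g
    return counts


if __name__ == "__main__":
    for sd in (30, 15):
        counts = simulate_digits(sd)
        print(sd, [counts[d] for d in range(10)])
```

Running it with different seeds and SDs gives a feel for how much the shape of the digit distribution depends on how similar the pajama weights are to one another.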
While we find this distribution surprising, we conclude that it is possible given what we know about these data and how the weights were calculated.
When we took a close look at these data, we originally found a number of patterns that we were unable to explain. Having communicated with the authors, we now think that while there are some strange choices in their analysis, most of these patterns can be explained when we take into account the fact that pajama weights were deducted from scale weights, and the two weights had different levels of precision.
While these patterns can be explained by the pajama adjustment described by Kevin Hall, there are some important lessons here. The first, as Kevin notes in his comment, is that it can be very difficult to fully record one’s methods. It would have been better to include the full history of this variable in the data files, including the pajama weights, instead of recording the weights and performing the relevant comparisons by hand.
The second is a lesson about combining data of different levels of precision. The hair-comb pattern that we observed in the distribution of DeltaDaily scores was truly bizarre, and was reason for serious concern. It turns out that this kind of distribution can occur when a measure with one decimal of precision is combined with another measure with three decimals of precision, with the result being rounded to two decimals of precision. In the future researchers should try to avoid combining data in this way to avoid creating such artifacts. While it may not affect their conclusions, it is strange for the authors to claim that someone’s weight changed by (for example) 1.27 kg, when they have no way to measure the change to that level of precision.
There are some more minor points that this explanation does not address, however. We still find it surprising how consistent the weight change was in this study, and how extreme some of the weight changes were. We also remain somewhat confused by how often participants changed (or didn’t change) their pajamas.
This post continues in Part Two over at Nick Brown’s blog, where he covers several other aspects of the study design and data.
Thanks again to Nick Brown for comparing notes with us on this analysis, to James Heathers for helpful comments, and to a couple of early readers who asked to remain anonymous. Special thanks to Kevin Hall and the other authors of the original paper, who have been extremely forthcoming and polite in their correspondence. We look forward to ongoing public discussion of these analyses, as we believe the open exchange of ideas can benefit the scientific community.
Imagine a universe where every cognitive scientist receives extensive training in how to deal with demand characteristics.
(Demand characteristics describe any situation in a study where a participant either figures out what a study is about, or thinks they have, and changes what they do in response. If the participant is friendly and helpful, they may try to give answers that will make the researchers happy; if they have the opposite disposition, they might intentionally give nonsense answers to ruin the experiment. This is a big part of why most studies don’t tell participants what condition they’re in, and why some studies are run double-blind.)
In the real world, most students get one or two lessons about demand characteristics when they take their undergrad methods class. When researchers are talking about a study design, sometimes we mention demand, but only if it seems relevant.
Let’s return to our imaginary universe. Here, things are very different. Demand characteristics are no longer covered in undergraduate methods courses — instead, entire classes are exclusively dedicated to demand characteristics and how to deal with them. If you major in a cognitive science, you’re required to take two whole courses on demand — Introduction to Demand for the Psychological Sciences and Advanced Demand Characteristics.
Often there are advanced courses on specific forms of demand. You might take a course that spends a whole semester looking at the negative-participant role (also known as the “screw-you effect”), or a course on how to use deception to avoid various types of demand.
If you apply to graduate school, how you did in these undergraduate courses will be a major factor determining whether they let you in. If you do get in, you still have to take graduate-level demand courses. These are pretty much the same as the undergrad courses, except they make you read some of the original papers and work through the reasoning for yourself.
When presenting your research at a talk or conference, you can usually expect to get a couple of questions about how you accounted for demand in your design. Students are evaluated based on how well they can talk about demand and how advanced the techniques they use are.
Every journal requires you to include a section on demand characteristics in every paper you submit, and reviewers will often criticize your manuscript because you didn’t account for demand in the way they expected. When you go up for a job, people want to know that you’re qualified to deal with all kinds of demand characteristics. If you have training in dealing with an obscure subtype of demand, it will help you get hired.
It would be pretty crazy to devote such a laser focus to this one tiny aspect of the research process. Yet this is exactly what we do with statistics.
Science is all about alternative explanations. We design studies to rule out as many stories as we can. Whatever stories remain are possible explanations for our observations. Over time, we whittle this down to a small number of well-supported theories.
There’s one alternative explanation that is always a concern. For any relationship we observe, there’s a chance that what we’re seeing is just noise. Statistics is a set of tools designed to deal with this problem. This holds a special place in science because “it was noise” is a concern for every study in every field, so we always want to make sure to rule it out.
But of course, there are many alternative explanations that we need to be concerned with. Whenever you’re dealing with human participants, demand characteristics will also be a possible alternative. Despite this, we don’t jump down people’s throats about demand. We only bring up these issues when we have a reason to suspect that it is a problem for the design we’re looking at.
There will always be more than one way to look at any set of results. We can never rule out every alternative explanation — the best we can do is account for the most important and most likely alternatives. We decide which ones to account for by using our judgement, by taking some time to think about what alternatives we (and our readers) will be most concerned about.
The right answer will look different for different experiments. But the wrong answer is to blindly throw statistics at every single study.
Statistics is useful when a finding looks like it could be the result of noise, but you’re not sure. For example, let’s say we’re testing a new treatment for a disease. We have a group of 100 patients who get the treatment and a control group of 100 people who don’t get the treatment. If 52/100 people recover when they get the treatment, compared to 42/100 recovering in the control group, does that mean the treatment helped? Or is the difference just noise? I can’t tell with just a glance, but a simple chi-squared test can tell me that p ≈ .16, meaning there’s about a 16% chance that we would see a difference at least this large from noise alone.
That’s helpful, but it would be pointless to run a statistical test if we saw 43/100 people recover with the treatment, compared to 42/100 in the control group. I can tell that this is very consistent with noise (p > .50) just by looking at it. And it would be pointless to run a statistical test if we saw 98/100 people recover with the treatment, compared to 42/100 in the control group. I can tell that this is very inconsistent with noise (p < .00000000000001) just by looking at it. If something passes the interocular trauma test (the conclusion hits you between the eyes), you don’t need to pull out another test.
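If you want to check these three cases yourself, here is a minimal sketch of the test in Python. It assumes a plain Pearson chi-squared test on a 2x2 table with no continuity correction, and the function name `chi_squared_2x2` is ours, made up for illustration. It only uses the standard library.

```python
import math

def chi_squared_2x2(recovered_a, n_a, recovered_b, n_b):
    """Pearson chi-squared test (no continuity correction) for two groups.

    Returns (statistic, p_value) for a 2x2 table of recovered / not recovered.
    """
    observed = [
        [recovered_a, n_a - recovered_a],
        [recovered_b, n_b - recovered_b],
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [recovered_a + recovered_b, total - recovered_a - recovered_b]

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if recovery were unrelated to group membership.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# The three scenarios from the text: treatment group vs. 42/100 in control.
for treated in (52, 43, 98):
    stat, p = chi_squared_2x2(treated, 100, 42, 100)
    print(f"{treated}/100 vs 42/100: chi2 = {stat:.2f}, p = {p:.3g}")
```

Under those assumptions, 52 vs. 42 comes out around p = .16, 43 vs. 42 comes out around p = .89, and 98 vs. 42 lands far below .00000000000001. A continuity correction would nudge the first two numbers up a little without changing the picture.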
This might sound outlandish today, but you can do perfectly good science without any statistics at all. After all, statistics is barely more than a hundred years old. Sir Francis Galton came up with the concept of the standard deviation in the 1860s, and the story with the ox didn’t happen until 1907. It took until the 1880s to dream up correlation. Karl Pearson was born in 1857 but didn’t do most of his statistics work until around the turn of the century. Fisher wasn’t even born until 1890. He introduced the term variance for the first time in 1918, but both that term and the ANOVA didn’t gain popularity until the publication of his book in 1925.
This means that Galileo, Newton, Kepler, Hooke, Pasteur, Mendel, Lavoisier, Maxwell, von Helmholtz, Mendeleev, etc. did their work without anything that resembled modern statistics, and that Einstein, Curie, Fermi, Bohr, Heisenberg, etc. etc. did their work in an age when statistics was still extremely rudimentary. We don’t need statistics to do good research.
This isn’t an original idea, or even a particularly new one. When statistics was young, people understood this point better. For an example, we can turn to Sir Austin Bradford Hill. He was trained by Karl Pearson (who, among other things, invented the chi-squared test we used earlier), was briefly president of the Royal Statistical Society, and was sometimes referred to as the world’s leading medical statistician. As early as the 1940s, he was pioneering the introduction of the randomized clinical trial in medicine. As far as opinions on statistics go, the man was pretty qualified.
While you may not know his name, you’re probably familiar with his work. He was one of the researchers who demonstrated the connection between cigarette smoking and lung cancer, and in 1965 he gave a speech about his work on the topic. Most of the speech was a discussion of how one can infer a causal relationship from largely correlational data, as he had done with the smoking-lung cancer connection, a set of considerations that came to be known as the Bradford Hill criteria.
But near the end of the speech, he turns to a discussion of tests of significance, as he calls them, and their limitations:
No formal tests of significance can answer [questions of cause and effect]. Such tests can, and should, remind us of the effects that the play of chance can create, and they will instruct us in the likely magnitude of those effects. Beyond that they contribute nothing to the ‘proof’ of our hypothesis.
Nearly forty years ago, amongst the studies of occupational health that I made for the Industrial Health Research Board of the Medical Research Council was one that concerned the workers in the cotton-spinning mills of Lancashire (Hill 1930). … All this has rightly passed into the limbo of forgotten things. What interests me today is this: My results were set out for men and women separately and for half a dozen age groups in 36 tables. So there were plenty of sums. Yet I cannot find that anywhere I thought it necessary to use a test of significance. The evidence was so clear cut, the differences between the groups were mainly so large, the contrast between respiratory and non-respiratory causes of illness so specific, that no formal tests could really contribute anything of value to the argument. So why use them?
Would we think or act that way today? I rather doubt it. Between the two world wars there was a strong case for emphasizing to the clinician and other research workers the importance of not overlooking the effects of the play of chance upon their data. Perhaps too often generalities were based upon two men and a laboratory dog while the treatment of choice was deduced from a difference between two bedfuls of patients and might easily have no true meaning. It was therefore a useful corrective for statisticians to stress, and to teach the needs for, tests of significance merely to serve as guides to caution before drawing a conclusion, before inflating the particular to the general.
I wonder whether the pendulum has not swung too far – not only with the attentive pupils but even with the statisticians themselves. To decline to draw conclusions without standard errors can surely be just as silly? Fortunately I believe we have not yet gone so far as our friends in the USA where, I am told, some editors of journals will return an article because tests of significance have not been applied. Yet there are innumerable situations in which they are totally unnecessary – because the difference is grotesquely obvious, because it is negligible, or because, whether it be formally significant or not, it is too small to be of any practical importance. What is worse, the glitter of the t-table diverts attention from the inadequacies of the fare. Only a tithe, and an unknown tithe, of the factory personnel volunteer for some procedure or interview, 20% of patients treated in some particular way are lost to sight, 30% of a randomly-drawn sample are never contacted. The sample may, indeed, be akin to that of the man who, according to Swift, ‘had a mind to sell his house and carried a piece of brick in his pocket, which he showed as a pattern to encourage purchasers.’ The writer, the editor and the reader are unmoved. The magic formulae are there.
Of course I exaggerate. Yet too often I suspect we waste a deal of time, we grasp the shadow and lose the substance, we weaken our capacity to interpret the data and to take reasonable decisions whatever the value of P. And far too often we deduce ‘no difference’ from ‘no significant difference.’ Like fire, the chi-squared test is an excellent servant and a bad master.
We grasp the shadow and lose the substance.
As Dr. Hill notes, the blind use of statistical tests is a huge waste of time. Many designs don’t need them; many arguments don’t benefit from them. Despite this, we have long disagreements about which of two tests is most appropriate (even when both of them will be highly significant), we spend time crunching numbers when we already know what we will find, and we demand that manuscripts have their statistics arranged just so — even when it doesn’t matter.
This is an institutional waste of time as well as a personal one. It’s weird that students get so much training in statistics. Methods are almost certainly more important, but most students are forced to take multiple stats classes, while only one or two methods classes are even offered. This is also true at the graduate level. Methods and theory courses are rare in graduate course catalogs, but there is always plenty of statistics.
Some will say that this is because statistics is so much harder to learn than methods. Because it is a more difficult subject, it takes more time to master. Now, it’s true that students tend to take several courses in statistics and come out of them remembering nothing at all about statistics. But this isn’t because statistics is so much more difficult.
We agree that statistical thinking is very important. What we take issue with is the neurotic focus on statistical tests, which are of minor use at best. The problem is that our statistics training spends multiple semesters on tests, while spending little to no time at all on statistical thinking.
This also explains why students don’t learn anything in their statistics classes. Students can tell, even if only unconsciously, that the tests are unimportant, so they have a hard time taking them seriously. They would also do poorly if we asked them to memorize a phone book — so much more so if we asked them to memorize the same phone book for three semesters in a row.
The understanding of these tests is based on statistical thinking, but we don’t teach them that. We’ve become anxious around the tests, and so we devote more and more of the semester to them. But this is like becoming anxious about planes crashing and devoting more of your pilot training time to the procedure for making an emergency landing. If the pilots get less training in the basics, there will be more emergency landings, leading to more anxiety and more training, etc. — it’s a vicious cycle. If you just teach students statistical thinking to begin with, they can see why it’s useful and will be able to easily pick up the less-important tests later on, which is exactly what I found when I taught statistics this way.
The bigger problem is turning our thinking over to machines, especially ones as simple as statistical tests.
Sometimes a test is useful, sometimes it is not. We can have discussions about when a test is the right choice and when it is the wrong one. Researchers aren’t perfect, but we have our judgement and damn it, we should be expected to use it. We may be wrong sometimes, but that is better than letting the p-values call all the shots.
We need to stop taking tests so seriously as a criterion for evaluating papers. There’s a reason, of course, that we are on such high alert about these tests — the concept of p-hacking is only a decade old, and questionable statistical practices are still being discovered all the time.
But this focus on statistical issues tends to obscure deeper problems. We know that p-hacking is bad, but a paper with perfect statistics isn’t necessarily good — the methods and theory, even the basic logic, can be total garbage. In fact, this is part of how we got in the p-hacking situation in the first place: by using statistics as the main way of telling if a paper is any good or not!
Putting statistics first is how we end up with studies with beautifully preregistered protocols and immaculate statistics, but deeply confounded methods, on topics that are unimportant and frankly uninteresting. This is what Hill meant when he said that “the glitter of the t-table diverts attention from the inadequacies of the fare”. Confounded methods can produce highly significant p-values without any p-hacking, but that doesn’t mean the results of such a study are of any value at all.
This is why I find proposals to save science by revising statistics so laughable. Surrendering our judgement to Bayes factors instead of p-values won’t do anything to solve our problems. Changing the threshold of significance from .05 to .01, or .005, or even .001 won’t make for better research. We shouldn’t try to revise statistics, we should use it less often.
Thanks to Adam Mastroianni, Grace Rosen, and Alexa Hubbard for reading drafts of this piece.
In The Witcher 3, some abilities increase your fast attack damage by 5%. Boooooring. When I kill that cockatrice, I have no idea if the 5% extra damage saved my ass, but my guess is that it didn’t make the difference between success and failure.
Game designers often have the natural instinct to push their systems to be more granular, i.e. using smaller and smaller pieces. It seems to offer a way to meet several design goals. Sometimes we want to capture aspects of real life that other systems have left out. RPGs are abstractions of real life, and because it’s so easy to notice what is missing, it can be tempting to write those parts back in. But this can get ridiculous pretty quickly. “D&D is ok, but it’s strange how getting a really good night’s sleep doesn’t improve performance, the way it does in the real world. I want my characters to get a bonus if they’ve slept well the night before. A +1 bonus on a d20 roll is too high. Wait! What if I made it d100? Then I could give them a +1 for each two hours of sleep, up to +5!”
This has two major problems. First, keeping track of systems like this is unbearable. “Wait! I forgot to add my bonus from using my whetstone every night, and that other bonus for fighting opponents who are using bludgeoning weapons. Can I re-roll that last attack?” Actually doing the calculations is even worse.
You can partially avoid this if a computer keeps track of all the bonuses and does all the math for you. This is why video games, like The Witcher 3, are more likely to use granular systems and more likely to be successful when they do so. But of course, I was just complaining about The Witcher. Fixing the math doesn’t make granularity work.
The second problem is that small changes are, well, small. There’s a flavor aspect — double damage feels like a much bigger deal than 5% more damage — but flavor isn’t the main issue. The real concern is that small effects don’t make for a strategic system. If an ability doubles my damage, then I want to take that into account, and so do my enemies. If an ability increases my damage by 5%, I will do the same thing I was going to do in the first place. I’ll deal a little more damage, which probably won’t come close to tipping the balance of the battle, and it will be a huge headache keeping track of it all.
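To put rough numbers on this, here is a small sketch. The enemy hit points and base damage are hypothetical values we picked for illustration, not figures from The Witcher 3 or any other game, and `hits_to_kill` is a made-up helper.

```python
import math

def hits_to_kill(enemy_hp, base_damage, multiplier=1.0):
    """Number of hits needed to bring enemy_hp to zero or below."""
    return math.ceil(enemy_hp / (base_damage * multiplier))

ENEMY_HP = 100     # hypothetical enemy health
BASE_DAMAGE = 20   # hypothetical damage per fast attack

for label, multiplier in [
    ("no bonus", 1.0),
    ("+5% damage", 1.05),
    ("double damage", 2.0),
]:
    hits = hits_to_kill(ENEMY_HP, BASE_DAMAGE, multiplier)
    print(f"{label}: {hits} hits")
```

With these made-up numbers, the 5% bonus still takes five hits, exactly the same as no bonus at all, while double damage drops the fight to three hits. Other numbers will sometimes let a 5% bonus shave off a hit, but only when you were already right at a threshold.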
(There are some secondary problems as well — a system like this is much more likely to end up with game-breaking hacks that come from combining modifiers, and you probably won’t be able to iron all of them out before release.)
The solution is to make your game less granular, by using the biggest pieces possible. Players don’t care about multipliers below x2 — and neither should you. Toss them out.
Even better are effects that go beyond multipliers. In The Witcher 3, Geralt can also unlock an ability that lets him deflect arrows and crossbow bolts. That change is more fundamental, because it actually changes how the mechanics work.
Agency is what makes games enjoyable — I did that, I made it happen. My choices were relevant. To make attributions of agency, you need to be able to easily determine causality — if I can’t point to the major factor(s) responsible for an outcome, I can’t tell if my choices mattered. That ruins the game.
Basketball is back, which means an unending stream of bickering about who is the GOAT. Only one man, however, has performed the double-double to end all double doubles. In 1921 William Howard Taft became the Chief Justice of the United States after serving as the President from 1909-1913. Take that, LeBron.
According to nature, crabs are the most perfect form. You may not like it, but 🦀 is what peak performance looks like. This sacred knowledge inspired us so much, we even made a meme. We think this is the first step in the process of memes themselves evolving to be more crab-like.
Looking to spice things up in the bedroom? Why not try history’s most mysterious sex position, first described in Aristophanes’ classic comedy Lysistrata in 411 B.C. “The women are very reluctant, but the deal is sealed with a solemn oath around a wine bowl, Lysistrata choosing the words and Calonice repeating them on behalf of the other women. It is a long and detailed oath, in which the women abjure all their sexual pleasures, including the Lioness on the Cheese Grater (a sexual position).”
The basic argument against the Senate is that it’s undemocratic. Senators aren’t elected proportionally, and so some senators represent more people than others. If democracy is all about giving a voice to the people, it seems pretty perverse to give more of a voice to some people than to others.
But it turns out that disproportionate representation isn’t just compatible with democracy, it’s one of the most important safeguards of a liberal society.
It’s not just that every person deserves a vote. Liberalism also says that every way of life deserves to exist, as long as it doesn’t infringe on someone else’s way of life (e.g. no cannibals). After all, America isn’t a melting pot, it’s more of a patchwork quilt. I’m not into Lutheranism, extreme body modification, or small yappy dogs, but I think that people who are into these things deserve to be able to live how they want and celebrate these aspects of their lifestyle.
The basic argument against democracy is the old saying that democracy is two wolves and a sheep voting on what’s for dinner (no, it wasn’t Benjamin Franklin). With 1/3 of the vote, the sheep always gets eaten. In a country with 49 sheep and 51 wolves, as long as we have strict proportional representation, all the sheep still get eaten. If the vice president is a wolf, even a 50/50 split isn’t safe.
If the sheep all live in Sheepsylvania, however, they have a better chance to stand up for themselves. They may be outnumbered, but they still get two votes in the Senate. If they also have friends in Elkowa, Beavermont, and Llamassouri, that provides even more protection. It may not be enough to save them, but they will still do a lot better than they would with proportional representation. Disproportionate representation allows them to protect themselves even when they are enormously outnumbered.
States don’t correspond perfectly to different ways of life, and this is a fair criticism of the system. Disproportionate representation might work even better if we explicitly tied representation to specific minority groups. But states do have some correspondence to different ways of life.
Most people these days think about disproportionate representation in terms of liberal versus conservative. But really, the differences in disproportionate representation today are urban versus rural. It happens that most rural states are also conservative, but low population density comes with being a rural area, not with voting Republican. There are plenty of rural voters who are very liberal but still prefer to live in the woods. It’s not hard to imagine that urban voters (who are already more privileged in terms of wealth and education) might accidentally or even intentionally pass laws that would destroy a rural way of life for millions of people. For just one example, consider how decisions made in major cities can impact rural schooling. It’s important to have a political system that allows minorities to protect themselves.
Your state doesn’t even have to be all that rural to begin with. The Senate benefits the interests of pretty much anyone not living in California (11.9% of the population), Texas (8.7%), Florida (6.5%), or New York (5.8%). If you’re from Virginia, Hawaii, Iowa, Louisiana, Maryland, etc., and you don’t want California and Texas telling you how to live your life, then the Senate is acting in your favor.
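For a sense of scale, here is a quick back-of-the-envelope sketch using the population shares quoted above, plus the fact that each state gets 2 of the 100 Senate seats. The variable names are ours.

```python
# Percent of the US population, as quoted in the text.
population_share = {
    "California": 11.9,
    "Texas": 8.7,
    "Florida": 6.5,
    "New York": 5.8,
}

SENATE_SEATS = 100
SEATS_PER_STATE = 2

# Share of the population living in these four states.
combined_population = sum(population_share.values())

# Share of Senate seats those same four states hold.
combined_senate = 100 * len(population_share) * SEATS_PER_STATE / SENATE_SEATS

print(f"Share of population: {combined_population:.1f}%")
print(f"Share of Senate seats: {combined_senate:.1f}%")
```

By these figures, the four biggest states hold about a third of the population (32.9%) but only 8% of the Senate, which is the gap everyone else benefits from.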
We’d like to take this opportunity to remind you of Bernie Sanders.
The state of Vermont has a very unusual but, we think, excellent way of life. Anarcho-socialist-libertarian-progressivism isn’t a way of life shared by most Americans, but we think it has a lot going for it. If representation were proportionate, we could maybe send Bernie, as an independent, to the House of Representatives, where he would be just one voice among 435. But with disproportionate representation, we’ve sent Bernie to the US Senate, where we can punch above our weight. Bernie can work to protect our way of life, and he can help to bring our values (flannel, maple syrup, and Ben & Jerry’s) to the rest of the country. You’re welcome, America.
When you’re poor, the sad truth is that the de jure government probably doesn’t care about you much. There probably aren’t a lot of legitimate jobs in your area; you can’t afford to move away; even if you’re very talented, someone with better connections or a fancier-sounding degree will probably beat you out when competing for the few good jobs available.
This is especially true for marginalized groups, in particular when they’re targeted by law enforcement. In a legal system like ours, there are so many pointless and mutually contradictory laws that everyone is guilty of something. If the police watch you for long enough, they will eventually find something that they can arrest you for. (Obviously it’s even worse if they’re willing to lie or plant evidence, but the point is that it can happen even without this.)
Even if they only put you away for a few weeks, a criminal record will probably kill your chances of getting a legitimate job in the future. If you want to serve your community, or even just put food on the table, your only choice may be an illegal job.
But “criminal” doesn’t mean “evil”. Modern governments criminalize lots of things they really shouldn’t. If I couldn’t get a legal job, I would be pretty happy selling weed. I don’t think weed should be illegal, and there would be plenty of satisfied customers, so I would be open to sticking my thumb in the government’s eye over this issue if I didn’t have any other option. A similar argument can be made for other drugs, prescription medications, etc., and even for giving medical care without a license. If none of these work for you, then remember that during Prohibition, the government criminalized alcohol. Ask yourself how guilty you would feel selling booze in the 1920s, if you had no other job prospects.
Since criminal activity is often the only way for the very poor to make their way in the world, criminal organizations are often the only local institutions around. And because the official government doesn’t really care about these neighborhoods (they may even be actively antagonistic), criminal organizations often end up being the only thing protecting the poor.
The affluent have a hard time understanding all of this, and for many people, a reasoned argument can’t shake the scary image of the criminal or gang member as an uncultured, unreasoning thug.
The good news is that this is what art is for. Fiction can give us, even if only distantly, the sense of what life is like for people who are different from us.
So let’s imagine what a movie to flip this script would look like. Our hero is a young black man who grew up in a poor but respectable suburb of a major American city. He’s talented, but there aren’t many opportunities in his hometown. Like many young men with few options, he joins the Army. He’s quickly recognized as a crack shot and natural leader, gets recruited to the Green Berets, and receives multiple commendations. He also makes some very close friends. Once he returns to civilian life, one of his best friends from the Army runs for mayor, wins, and the protagonist spends the next few years helping his friend try to make things better in their city. He makes some money, starts dating a woman from a well-respected family, and begins thinking of settling down.
But when war is declared in the Middle East, his friend the mayor returns to military service to serve his country, and our hero joins him. They’re shipped overseas, and see a few years of intense combat. His friend the mayor goes missing in battle, presumed captured; our hero is injured and, after recuperating, is honorably discharged from service.
He’s sent home, only to find that things are worse than ever. The new mayor neglects social services in favor of pursuing a “tough on crime” agenda popular with the middle class. The police are encouraged to make lots of high-profile arrests, and they quickly grow fat on civil forfeiture. Constant harassment by the police leads anyone with the means to try to leave the poor parts of town. As money flows out of the neighborhood, so do most businesses, taking with them the last few legitimate jobs.
Soon, almost no one can make a living without turning to some kind of crime. Often this is only opening a hairdresser’s without a license, or running a restaurant in your living room, but the cops crack down on these businesses just the same — and legally, they’re in the right.
The first thing our hero sees when he gets home is the police arresting a kid who tried to steal food from a gas station. He steps in to try to help, but the cops pull their guns on him. With his Special Forces training, he’s able to disarm the cops, free the kid, and make his escape, all without hurting anyone. What he doesn’t know is that he’s made a powerful enemy. One of the cops he embarrassed was the new sheriff, a close ally of the corrupt mayor, who recognizes him. The sheriff puts out an APB, and soon our hero finds that cops are crawling all over the city looking for him, and no one will take him in.
He eventually finds shelter with a preacher at a local church, who has seen enough of police brutality. The preacher had shut down the church and begun to turn to drinking, but seeing someone stand up to the cops and get away with it has given him new hope for the future.
It’s not just the preacher who takes note. Our hero starts attracting followers. First it’s his young cousin, a flashy dresser and accomplished boxer, hot-headed but idealistic. Next it’s a real beast of a man, a former bouncer who’s out of work and goes by “Lil’ Jon” or something, who impresses our hero by first beating him in a fight, and then throwing him off a bridge and into the river. Soon, more than a dozen people are hanging out in the basement of the abandoned church.
Our hero can’t get legal work — in the eyes of the law, he’s a wanted criminal, who assisted the escape of a thief, resisted arrest, and assaulted officers. For one reason or another, neither can any of his followers. Even if he turned himself in, now that he knows he personally embarrassed the sheriff, he’s not confident he would survive to make it to trial.
But just because he can’t get legal work doesn’t mean he can’t make a difference for his community. The cops have been stealing property, cars, even cash from anyone they want, so he decides to steal it back. Under cover of night, the new band of friends break into the impound lot and take as many cars as they can drive away, leaving the guards hogtied but otherwise unharmed.
With this success under their belt, the group grows bolder. They find the location of a multi-millionaire CEO’s summer home up in the hills, break in, and take everything of value. Next they knock over an armored car on its way to a bank, taking everything and even recruiting the driver to join their cause. With so much money on hand, the preacher helps them launder it, distributing the money to those in need and making it appear to come from the church.
Their fame, or maybe infamy, grows. When the cops try to arrest people on trumped-up charges, our hero intervenes, and many of the people he saves (now considered criminals, whether they like it or not) decide to join him. His fiancée escapes from her controlling parents and finds him hiding in the urban jungle. When bounty hunters are sent to track him down, more often than not, they end up being convinced by his cause and joining him instead. Even some of the cops on the force throw away their badges and turn outlaw. The sheriff and the mayor stop calling him a “violent wanted criminal” and start calling him a “notorious gang leader”.
The rest of the movie is dedicated to all of the tricks they pull. They place a call on an anonymous tip line, “revealing” that the gang headquarters is in an abandoned mall. Half of the cop cars in the city converge on the mall, leaving the gang to heist a shipment of insulin, which they distribute for free to the needy. Our hero disguises himself and poses as a bounty hunter, joining the hunt for his own gang. He crashes a fundraiser at the mayor’s house and tells the rich what he really thinks of them. He gets captured and the rest of the gang has to break him out of jail. Eventually, his friend the mayor is released in a prisoner-of-war exchange, comes back, wins election once again, and pardons them all. The wicked mayor and the sheriff are exposed for their crimes and held accountable, and our hero finally marries his sweetheart. And of course, you’d call it Hood Robin.