A better title would be, Were Polish Aristocrats in the 1890s really that Obese?, because the chapter makes a number of striking claims about rates of overweight and obesity in Poland around the turn of the century, especially among women, and especially especially among the upper classes.
Budnik & Henneberg draw on data from historical sources to estimate height and body mass for men and women in different classes. The data all come from people in Poland in the period 1887-1914, most of whom were from Warsaw. From height and body mass estimates they can estimate average BMI for each of these groups. (For a quick refresher on BMI, a value under 18.5 is underweight, over 25 is overweight, and over 30 is obese.)
They found that BMIs were rather high; somewhat high for every class but quite high for the middle class and nobility. Peasants and working class people had average BMIs of about 23, while the middle class and nobles had average BMIs of just over 25.
This immediately suggests that more than half of the nobles and middle class were overweight or obese. The authors also estimate the standard deviation for each group, which they use to estimate the percentage of each group that is overweight and obese. The relevant figure for obesity is this:
As you can see, the figure suggests that rates of obesity were rather high. Many groups had rates of obesity around 10%, while about 20% of middle- and upper-class women were obese.
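For the curious, here's how you get from a mean and a standard deviation to an obesity rate, assuming BMI is roughly normally distributed within each group. The mean of 25.2 and SD of 3.5 below are our own illustrative guesses, not figures from the chapter:

```python
from statistics import NormalDist

def share_above(mean_bmi, sd_bmi, cutoff):
    """Fraction of a normal(mean, sd) BMI distribution above a cutoff."""
    return 1 - NormalDist(mean_bmi, sd_bmi).cdf(cutoff)

# Illustrative guesses, not the chapter's figures:
mean_bmi, sd_bmi = 25.2, 3.5
obese = share_above(mean_bmi, sd_bmi, 30)                # ~8.5% past the obesity cutoff
overweight_or_obese = share_above(mean_bmi, sd_bmi, 25)  # just over half
```

The point is that once the mean sits at 25 or above, even a modest SD pushes a sizable tail past the obesity cutoff, so the chapter's estimates rise and fall with both the means and the SDs.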
This is pretty striking. One in five Polish landladies and countesses were obese? Are you sure?
To begin with, it contradicts several other sources on what baseline human weight would be during this period. The first is a sample of Union Army veterans examined by the federal government between 1890 and 1900. The Civil War was several decades before, so these men were in their 40s, 50s, and 60s. This is almost the exact same period, and this sample of veterans was Caucasian, just like the Polish sample, but the rate of obesity in this group was only about 3%.
Of course, the army veterans were all men, and not a random sample of the population. But we have data from hunter-gatherers of both genders that also suggests the baseline obesity rate should be very low. As just one example, the hunter-gatherers on Kitava live in what might be called a tropical paradise. They have more food than they could ever eat, including potatoes, yams, fruits, seafood, and coconuts, and don’t exercise much more than the average westerner. Their rate of obesity is 0%. It seems weird that Polish peasants, also eating lots of potatoes, and engaged in backbreaking labor, would be so much more obese than these hunter-gatherers.
On the other hand, if this is true, it would be huge for our understanding of the history of obesity, so we want to check it out.
Because this seems so weird, we decided to do a few basic sanity checks. For clarity, we refer to the Polish data as reported in the chapter by Budnik & Henneberg as the Warsaw data, since most (though not all) of these data come from Warsaw.
The first sanity check is comparing the obesity rates in the Warsaw data to the obesity rates in modern Poland. Obesity rates have been rising since the 1890s, so people should be more obese now than they were back then.
The Warsaw data suggests that men at the time were somewhere between 0% and 12.9% obese (mean of categories = 7.3%) and women at the time were between 8.8% and 20.9% obese (mean of categories = 16.2%). In comparison, in data from Poland in 1975, 7% of men were obese and 13% of women were obese. This suggests that obesity rates were flat (or perhaps even fell) between 1900 and 1975, which seems counterintuitive, and kinda weird.
In data from Poland in 2016, 24% of men were obese and 22% of women were obese. This also seems weird. It took until 2016 for the average woman in Poland to be as obese as a middle-class Polish woman from 1900? This seems like a contradiction, and since the more recent data is probably more accurate, it may mean that the Warsaw data is incorrect.
There’s another sanity check we can make. Paintings and photographs from the time period in question provide a record of how heavy people were at the time. If the Warsaw data is correct, there should be lots of photographs and paintings of obese Poles from this era. We checked around to see if we could find any, focusing especially on trying to get images of Poles from Warsaw.
We found a few large group photographs and paintings, and some pictures of individuals, and no way are 20% of them obese.
We begin with Sokrates Starynkiewicz, who was president of Warsaw from 1875 to 1892. He looks like a very trim gentleman, and if we look at this photograph of his funeral from 1902, we see that most of the people involved look rather trim as well:
In addition, a photograph of a crowd from 1895:
And here’s a Warsaw street in 1905:
People in these photographs do not look very obese. But most of the people in these photographs are men, and the Warsaw data suggests that rates of obesity for women were more than twice as high.
We decided to look for more photographs of women from the period, and found this list from the Krakow Post of 100 Remarkable Women from Polish History, many of whom seem to have been decorated soldiers (note to self: do not mess with Polish women). We looked through all of the entries for individuals who were adults during the period 1887-1914. There are photographs and/or portraits for many of them, but none of them appear to be obese. Several of them were painters, but none of the subjects of their paintings appear obese either. (Unrelatedly, one of them dated Charlie Chaplin and also married a Count and a Prince.)
If rates of obesity were really 20% for middle and upper class women, then there should be photographic evidence, and we can’t find any. What we have found is evidence that Polish women are as beautiful as they are dangerous, which is to say, extremely.
If we’re skeptical of the Warsaw data, we have to wonder if there’s something that could explain this discrepancy. We can think of three possibilities.
The first is clothing: we have a hard time imagining that whoever collected this data got all these 19th-century Poles to agree to be weighed totally naked. If they were wearing all of their clothes, or any of their clothes, that could explain the whole thing. (It might also explain the large gender and class effects.)
Clothing weighed a lot back then. Just as one example, a lady’s dolman could weigh anywhere between 6 and 12 pounds, and a skirt could weigh another 12 pounds by itself. We found another source that suggested a lady’s entire outfit in the 1880s (though not Poland specifically) would weigh about 25 lbs.
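A quick back-of-the-envelope calculation shows how much 25 lbs of clothing could matter to a BMI estimate. The height and weight below are invented for illustration:

```python
# Rough sketch: how much does ~25 lbs of clothing inflate a BMI estimate?
# Height and clothed weight below are invented for illustration.
LB_TO_KG = 0.4536

height_m = 1.60                  # hypothetical height
clothed_kg = 64.5                # hypothetical mass, weighed fully dressed
clothing_kg = 25 * LB_TO_KG      # ~11.3 kg of 1880s-style clothing

bmi_clothed = clothed_kg / height_m ** 2               # ~25.2, "overweight"
bmi_nude = (clothed_kg - clothing_kg) / height_m ** 2  # ~20.8, normal
```

That's the entire gap between "overweight" and the middle of the normal range, from clothing alone.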
As far as we can tell, there’s no mention of clothes, clothing, garments, shoes, etc. in the chapter, so it’s quite possible they didn’t account for clothing at all. All the original documents seem to be in Polish and we don’t speak Polish, so it’s possible the original authors don’t mention it either. (If you speak Polish and are interested in helping unravel this, let us know!)
Also, how did you even weigh someone in 1890s Poland? Did they carry around a bathroom scale? We found one source that claims the first “bathroom” scale was introduced in 1910, but they must have been using something in 1890.
Sir Francis Galton, who may have come up with the idea of weighing human beings, made some human body weight measurements in 1884 at London’s International Health Exhibition. He invited visitors to fill out a form, walk through his gallery, and have their measurements taken along a number of dimensions, including colour-sense, depth perception, sense of touch, breathing capacity, “swiftness of blow with fist”, strength of their hands, height, arm span, and weight. (Galton really wanted to measure the size of people’s heads as well, but wasn’t able to, because it would have required ladies to remove their bonnets.) In the end, they were given a souvenir including their measurements. To take people’s weights, Galton describes using “a simple commercial balance”.
Galton also specifically says, “Overcoats should be taken off, the weight required being that of ordinary indoor clothing.” This indicates he was weighing people in their everyday clothes (minus only overcoats), which suggests that the Polish data may also include clothing weight. “Stripping,” he elaborates, “was of course inadmissible.”
Also of interest may be Galton’s 1884 paper, The Weights of British Noblemen During the Last Three Generations, which we just discovered. “Messrs. Berry are the heads of an old-established firm of wine and coffee merchants,” he writes, “who keep two huge beam scales in their shop, one for their goods, and the other for the use and amusement of their customers. Upwards of 20,000 persons have been weighed in them since the middle of last century down to the present day, and the results are recorded in well-indexed ledgers. Some of those who had town houses have been weighed year after year during the Parliamentary season for the whole period of their adult lives.”
Naturally these British noblemen were not being weighed in a wine and coffee shop totally naked, and Galton confirms that the measurements should be, “accepted as weighings in ‘ordinary indoor clothing’.” This seems like further evidence that the Warsaw data likely included the weight of individuals’ clothes.
Another explanation has to do with measurements and conversions. Poland didn’t switch to the metric system until after these measurements were made (various sources say 1918, 1919, 1925, etc.), so some sort of conversion from outdated units has to be involved. This chapter does recognize that, and mentions that body mass was “often measured in Russian tsar pounds (1 kg = 2.442 pounds).”
We have a few concerns. First, if it was “often” measured in these units, what was it measured in the rest of the time?
Second, what is a “Russian tsar pound”? We can’t find any other references for this term, or for “tsar pound”, but we think it refers to the Russian funt (фунт). We’ve confirmed that the conversion rate for the Russian funt matches the rate given in the chapter (409.5 g, which comes out to a rate of 2.442 in the opposite direction), which indicates this is probably the unit that they meant.
But we’ve also found sources that say the funt used in Warsaw had a different weight, equivalent to 405.2 g. Another source gives the Polish funt as 405.5 g. In any case, the conversion rate they used may be wrong, and that could also account for some of the discrepancy.
The height measurements might be further evidence of possible conversion issues. The authors remark on being surprised at how tall everyone was — “especially striking is the tallness of noble males” — and this could be the result of another conversion error. Or it could be another side effect of clothing, if they were measured with their shoes on, since men’s shoes at the time tended to have a small heel. (Galton measured height in shoes, then the height of the heel, and subtracted the one from the other, but we don’t know if the Polish anthropometers thought to do this.)
A third possibility is that the authors estimated the standard deviation of BMI incorrectly. To figure out how many people were obese, they needed not only the mean BMI of the groups, they needed an estimate of how much variation there was. They describe their procedure for this estimation very briefly, saying “standard deviations were often calculated from grouped data distributions.” (There’s that vague “often” again.)
What is this technique? We don’t know. To support this they cite Jasicki et al. (1962), which is the book Zarys antropologii (“Outline of Anthropology”). While we see evidence this book exists, we can’t find the original document, and if we could, we wouldn’t be able to read it since we don’t speak Polish. As a result, we’re concerned they may have overestimated how much variation there was in body weights at the time.
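We can at least guess at what "calculated from grouped data distributions" might mean. A standard approach is to treat every observation as sitting at the midpoint of its bin. Here's a sketch with invented bins and counts (we don't know what the Polish sources actually used):

```python
import math

# (lo, hi, count) bins in kg -- invented for illustration; we don't know
# what bins the original Polish sources used.
bins = [(40, 50, 8), (50, 60, 30), (60, 70, 40), (70, 80, 15), (80, 90, 7)]

n = sum(c for _, _, c in bins)
mids = [(lo + hi) / 2 for lo, hi, _ in bins]
mean = sum(m * c for m, (_, _, c) in zip(mids, bins)) / n
var = sum(c * (m - mean) ** 2 for m, (_, _, c) in zip(mids, bins)) / n
sd = math.sqrt(var)                          # ~10.1 kg

# Sheppard's correction: midpoint grouping with bin width h overstates
# the variance by about h**2 / 12.
sd_corrected = math.sqrt(var - 10 ** 2 / 12)  # ~9.7 kg
```

Note that midpoint grouping with wide bins tends to overstate the true SD (that's what Sheppard's correction adjusts for), so depending on how coarse the original bins were, some overestimation of variation is plausible.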
These three possibilities seem sufficient to explain the apparently high rates of obesity in the Warsaw data. We think the Warsaw data is probably wrong, and our best guess for obesity rates in the 1890s is still in the range of 3%, rather than 10-20%.
We cannot more highly recommend Tim Carroll’s pen-and-paper game Lineage, “a game about telling the story of a Royal family through the ages.” You and up to five friends (or enemies, I don’t judge) play as historians, piecing together the records of several generations of a great and powerful family. The resulting stories are prone to hilarity and tragedy, and are also “a handy world building tool for game masters, authors, and admirers of the sorts of diagrams that lurk in the appendices of thick fantasy novels.” While it was developed with royal families in mind, Tim notes that royals are not the only ones with insane family trees, and it could also be used to tell the stories of other kinds of families. Also interesting is that it was developed in the context of thinking about one-player role playing games.
Apparently Regina George from Mean Girls was based on Alec Baldwin’s character in Glengarry Glen Ross. I knew she seemed familiar. Also, why is she named “Queen George”? Is Mean Girls an allegory for the Revolutionary War?
I come from one of the most rural states in America, and I lived in a town of 200 people for a couple of years. And I think there is not an appreciation of rural America or the values of rural America, the sense of community that exists in rural America. And somehow or another, the intellectual elite does have, in some cases, a contempt for the people who live in rural America. I think we’ve got to change that attitude and start focusing on the needs of people in rural America, treat them with respect, and understand there are areas there are going to be disagreements, but we can’t treat people with contempt.
The dead but amazing HistoryHouse.com was the home of some of the best history writing of all time. A gem from the archives: Rah Rah, Rasputin. Here’s the pull quote. While their comrade was preparing to poison Rasputin: “…Felix’s friends were upstairs listening to Yankee Doodle Dandy, the only record they owned, over and over and over. Really.”
“Zoom Escaper is a tool to help you escape Zoom meetings and other videoconferencing scenarios. It allows you to self-sabotage your audio stream, making your presence unbearable to others.” We do not officially endorse Zoom Escaper, but we do think you should know about it. Please enjoy responsibly. There’s also Zoom Deleter, which does what it says on the tin.
Trust in the media is at an all-time low. Unusual proposal for a solution: replace journalists with CEOs, who are (apparently) much more trusted than journalists, especially the CEO of the company respondents work for. We appreciate how original this take is — we’ll also note that, if they want to go by this metric, scientists are even more trusted as a group, by at least 10 points.
Of course, scientists have their own limitations. In Questionable Practices by Researchers and Teenage Wizards, psychologist Sacha Epskamp compares Questionable Research Practices (QRPs) to his experiences as a teenager cheating at Magic the Gathering (MTG) in order to beat his older brother. I also sometimes cheated at card games when I was very young (by high school I knew better) — could that be part of why I find open science issues so intuitive?
Update from last month: NYU professor Todd Gureckis continues to impress with his attempts to improve video lectures based on insights gleaned from watching YouTubers. If any YouTubers read this, we would be very interested to hear what you think.
The clear winner was electing representatives by lottery, drawing them randomly from the pool of all adult citizens or all voters, for a fixed term (formally known as sortition). Since election is by random selection, as long as the chamber has enough members, it’s guaranteed to be largely representative in terms of gender, race, religion, age, profession, and so on. Representatives would be ordinary people, instead of career politicians.
Now, while it’s very fun to sit in our armchairs and speculate about political science, the truth is that we don’t have much influence on how the branches of government are organized. The United States will not be switching to a Tricameral system or electing representatives by sortition any time soon. Neither will any other country in the world, is our guess. Despite centuries of research on various voting systems, lots of countries are still using first-past-the-post voting. It’s hard to imagine this will be much different.
We don’t have the power to make this happen. But we do have the power to set up a website.
So today we’d like to introduce a little idea we call The People’s Bill. Why do we trust politicians, lowest of the low, to write our laws for us? We’re Americans, by God. We can write our own laws.
The idea is pretty simple. We could set up a website, with a text form that anyone could edit, and the People could write whatever bill they want.
If you’re concerned that only Americans should be able to write American laws, then we could limit editing privileges to IP addresses from within the US. But there are ways around this, of course, and why not take good ideas from the rest of the world?
To keep it from getting obscenely long, as bills often do, we would set up a character limit. As a red-blooded American I obviously want to set the limit to 1,776 characters, but that’s probably not long enough (by this parenthetical, this post has already passed 1,776 characters). Setting it up to be 100 tweets long would also be amusing, but that’s only 2,800 characters. But we notice that the Declaration of Independence is about 8,000 characters long, depending on version, so let’s go with that.
People would have a month to debate and draft as much as they want, within those limits. Then, at the end of every month, the bill would be finalized, and closed to editing. A permanent snapshot would be taken, and automatically emailed to all 100 senators and 435 representatives, with instructions that this is the Will of the People et cetera et cetera.
If you have any experience with online assignments, you know that closing an assignment at midnight can get pretty crazy. To help prevent a furious final dash to make edits at 11:59 PM the night before, we wouldn’t take the final snapshot at midnight. Instead, we would randomly select a time on the day in question, keep that random deadline a secret, and take the final snapshot then.
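Picking the secret deadline could be as simple as something like this (a sketch, not a spec; the function name is ours):

```python
import random
from datetime import datetime, timedelta

def secret_snapshot_time(year, month, day):
    """Pick a uniformly random second during the given day."""
    start = datetime(year, month, day)
    return start + timedelta(seconds=random.randrange(24 * 60 * 60))
```

The chosen time would stay server-side; editors would only ever learn that the snapshot has already happened.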
With this system, there’s no question what bills people want passed. Every member of Congress gets an email about it every month, containing a bill that the People wrote and that contains a curated list of what they want passed into law. It may not end up being the bill the country needs. But it’s hard to imagine it won’t end up being the bill we deserve.
You may be feeling skeptical that people can coordinate on the internet, let alone coordinate to produce anything of value. But we think there’s reason to believe that this isn’t such a problem.
First of all, open-source software is an unqualified success. Linux was started by one Finn at the tender age of 21, and thanks to decades of collaborative writing from the community, now contains several million lines of code. Apache is free, open-source, and serves about 25% of all websites. If you’re reading this, there’s a good chance your browser is one of these success stories — Firefox is fully open-source and the open-source Chromium project forms the base for both Google Chrome and Microsoft Edge.
Of course, all of these had some kind of central leadership. Linus Torvalds coordinated the development of Linux at some level, even if he didn’t write all the code himself.
But people are perfectly capable of coordinating themselves, given the chance. Consider Reddit’s 2017 April Fools’ Day project, called Place. This project started with a canvas 1000 pixels wide by 1000 pixels tall, for a cool one million pixels total. For 72 hours, Reddit users could place a new pixel every 5-20 minutes, in any of sixteen different colors. Despite there being no top-down organization or authority, the redditors soon organized themselves and the canvas into stunning displays of coordination. The final canvas included dozens of national flags, logos, memes, a rainbow road, a Windows 95 taskbar, a recreation of the Mona Lisa (though she appears to be flipping us the bird), and a complete rendition of The Tragedy of Darth Plagueis The Wise, courtesy of r/prequelmemes. You can see the final canvas in all its maddening glory here, a timelapse of its evolution here, and the Wikipedia page for the project here.
Of course, Wikipedia itself may be the greatest of all crowdsourced endeavors. It is the largest encyclopedia in the world, and probably the best. In high school, our teachers told us not to cite Wikipedia as a source, as it was too unreliable. Today, media giants like Facebook and YouTube use Wikipedia entries in the fight against fake news. All courtesy of any random yahoo with an internet connection.
All right, so ordinary people can make open-source software, collaborate to create giant pixel-art renditions of copypastas and Renaissance masterpieces, and can even create the largest encyclopedia in all of history. Can ordinary people really write laws, though? Laws aren’t like pixel art, or even encyclopedia pages, right?
Indeed they’re not. First of all, making good pixel art is really hard, probably harder than writing laws. Second, history shows us that normal people can write perfectly good laws — you don’t need to be a lawyer or career politician. Why would you?
On our tricameralism post, one commenter mentioned Ezra Klein’s interview of Hélène Landemore (NYT, archive.is), a political scientist. We’ll have to resist quoting it at length here; seriously, give it a read or a listen.
Particularly interesting were the stories she told about ordinary citizens writing laws for themselves. Here’s one:
Iceland decided to rewrite its constitution in 2010. And they decided to use a very innovative, inclusive, participatory method. They started with a national forum of 950 randomly-selected citizens that were tasked with coming up with the main values and ideas that they wanted to see entrenched in the new document.
And then they had an election to choose 25 constitution drafters, if you will, among a pool of nonprofessional politicians, because they had been convinced, after the 2008 crisis, that they were all corrupt. So by law, they were excluded from participating in this election. And those 25 decided to work with the larger public by publishing their drafts at regular intervals, putting them online and collecting some feedback through a crowdsourced sort of process. And then they put the resulting proposal to a nationwide referendum. Two-thirds of the voting population approved, and then parliament killed it and never turned it into a bill.
Will the People’s Bill fix our current political system? Honestly, we doubt it. Like the constitution drafters in Iceland, we fully expect Congress will kill the People’s Bill every time it comes around. Most months it will probably never get proposed; if it ever is proposed, most of the ideas in the bill will probably never make it into law.
There are reasons to try this idea anyways. First of all, if Congress ignores the suggestions of the People, this will be another way of making it clear where their priorities lie (as if you needed any more convincing, but still).
Second, writing our own bill every month, even if it never becomes law, gets people involved in democracy. It’s a chance for people to discover they can write laws that are just as good as the laws written in Albany, Austin, Tallahassee, Denver, and Washington. Will the People’s Bill be messier than bills written by politicians? Yes, but it will also be more original, and more creative. Will the People’s Bill contain allusions to The Tragedy of Darth Plagueis The Wise? Almost certainly.
Laws on the books are unclear and poorly written; perhaps at times intentionally so. If you have no experience writing laws yourself, you might be tempted to assume the problem is with you. But if you’ve written parts of the People’s Bill for three years running, maybe you’ll look at a piece of federal legislation and say, “You call this a law? My grandmother could write a better law than this!” And maybe you’d be right, because maybe she helps write the People’s Bill too.
If we’re lucky, this new and well-deserved confidence will inspire more ordinary people to run for office, question the legal status quo, and so on. It lowers the barriers to entry, and encourages people to open their minds to new approaches to governance.
Third, it will help loosen the grip of the laws on our mind. It’s one thing to understand intellectually that laws were written by morons just like you and me, but it’s a whole other thing to actually be one of the morons writing the law. The US legal code wasn’t handed down on Mount Sinai. People wrote those laws, and sometimes they will be flawed, backwards, or just plain stupid. Americans used to have a much healthier disrespect for the law, and it’s time we brought that back.
Fourth, and finally, there are a lot of great policy ideas out there, but as a country we tend to discuss the same few ideas over and over again — like an otter chasing its tail, only much less cute. The People’s Bill would be an opportunity to discuss great policy ideas that aren’t even on the radar right now. To discover good ideas that are considered normal in other places and times, but that aren’t on the docket here. Some of them will sound crazy, and some of them might even be crazy. But some ideas that sound crazy right now will end up being policy twenty years from now. Whatever else you might think of the idea, it seems like a safe bet that randos on the internet can beat United States Senators at coming up with out-of-the-box ideas.
We haven’t set up the website for The People’s Bill yet, because in the true democratic spirit of the project, we want to get suggestions about how to set it up; how the website should be structured, what software we should consider, and so on. Below are a few of our thoughts, but we’re sure you will have other suggestions, and we really want to hear your ideas.
A very straightforward option would be to set this up using MediaWiki. Wikis have talk pages, edit histories, and make it relatively simple to manage users and permissions. Every month we could set up a new page for that month’s bill, and lock the page at the end of the month. This would probably be the easiest way to set up the site.
However, a wiki wouldn’t allow people to draft different bills in parallel, and wouldn’t make it easy to compare different drafts of the same bill. There wouldn’t be any way to figure out which version of the bill has the most popular support — whoever edited the wiki most recently would always have the final say. So another option would be to use something like a forum. Different bills could be in different threads, and users could vote on which bills they like. At the end of the month, the top thread would be sent to Congress, and the rest would be locked, starting the cycle all over again. You could literally just use a subreddit for this, or you could build some kind of custom forum setup.
We could dream up more esoteric options too, though they would probably require more effort. You could link up Git to some kind of forum interface, allowing people to both vote on and branch bills as they saw fit, with all branches appearing as their own posts on the forum, complete with comments, vote tallies, and so on.
Some of these systems are more chaotic than others. A single wiki page, for example, would be sort of maddening. Anyone could wander in at any time and change the entire bill. Anyone could wander in at any time and revert the bill to a previous version. In contrast, a more forum-like approach might force the bill into reasonable sections and subsections, which could be clearly debated. This has obvious benefits, but this country already has a method for writing bills in the normal way — it’s called Congress.
We might even want to make The People’s Bill as chaotic as possible. Some months the bill might end up being the first 8,000 characters of the script of Bee Movie, but that’s a risk we’re willing to take. Just imagine Ted Cruz getting that bill in his email.
Whatever we use, we want to make sure that it’s easily accessible. It should be easy to use and easy to join — we literally want your grandma to help draft these bills. Anyone with an internet connection should be able to join up, without too much trouble. So while esoteric systems might have some nice features, we have to balance that against wide engagement. It’s only democratic.
Thanks to Taylor Hadden and Casey Jamieson for giving feedback on a draft of this post.
Carl Hart is a parent, Columbia professor, and five-year-running recreational heroin user, reports The Guardian. “I do not have a drug-use problem,” he says, “Never have. Each day, I meet my parental, personal and professional responsibilities. I pay my taxes, serve as a volunteer in my community on a regular basis and contribute to the global community as an informed and engaged citizen. I am better for my drug use.”
Hart makes it pretty clear he thinks drug use is a good thing. Good not only for himself, but for people in general. “Most drug-use scenarios cause little or no harm,” he says, “and some reasonable drug-use scenarios are actually beneficial for human health and functioning.” He supports some basic safeguards, including an age limit and possibly an exam-based competency requirement, “like a driver’s licence.” But otherwise, he thinks that most people can take most drugs safely.
The article mentions Hart’s research in passing, but doesn’t describe it. Instead, these claims seem to be based largely on Hart’s personal experiences with drugs. He’s been using heroin for five years and still meets his “parental, personal and professional responsibilities”. He likes to take amphetamine and cocaine “at parties and receptions.” He uses MDMA as a way to reconnect with his wife.
When Hart wondered why people “go on [Ed.: yikes] about heroin withdrawal”, he conducted an ad hoc study on himself, first upping his heroin dose and then stopping (it’s not clear for how long). He describes going through an “uncomfortable” night of withdrawal, but says “he doesn’t feel the need or desire to take more heroin and never [felt] in any real danger.”
This is fascinating, but it seems like there’s a simple individual differences explanation — people differ (probably genetically) in how destructive and addictive they find certain substances, and Hart is presumably just very lucky and doesn’t find heroin (or anything else) all that addictive. This is still consistent with heroin being a terrible drug that ruins people’s lives for the average user.
Let’s imagine a simplified system where everyone either is resistant to a drug and can enjoy it recreationally, or finds it addictive and it ends up destroying their life. For alcohol, maybe 5% of people find it addictive (and become alcoholics) and the other 95% of us can enjoy it without any risk. In this case, society agrees that alcohol is safe for most people and we keep it legal.
But for heroin, maybe 80% of people would find it addictive if they tried it. Even if 20% of people would be able to safely enjoy recreational heroin, you don’t know if it will destroy your life or not until you try it, so it’s a very risky bet. As a result, society is against heroin use and most people make the reasonable decision to not even try it.
Where that ruins-your-life-percentage (RYLP) stands for different drugs matters a lot for the kinds of drugs we want to accept as a society. Certainly a drug with a 0% RYLP should be permitted recreationally, and almost as certainly, a drug that ruined the lives of 100% of first-time users should be regulated in some way. The RYLP for real drugs will presumably lie somewhere in between. While we might see low-RYLP drugs as being worth the risk (our society’s current stance on alcohol), an RYLP of just ten or twenty percent starts looking kind of scary. A drug that ruins the lives of one out of every five first-time users is bad enough — you don’t need an RYLP of 80% for a drug to be very, very dangerous.
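To see why even a modest RYLP makes trying a drug a bad bet, consider the expected value of a first try under this all-or-nothing model. The 100-to-1 cost ratio below is our assumption, feel free to argue with it:

```python
# Expected value of trying a drug once: benefit b if you turn out resistant,
# cost c if it ruins your life, probability p = RYLP of the bad outcome.
# Trying has positive expected value only when (1 - p) * b > p * c,
# which rearranges to p < b / (b + c).
def breakeven_rylp(benefit, cost):
    """Highest RYLP at which a first try still breaks even."""
    return benefit / (benefit + cost)

# If ruining your life is 100x worse than the recreational upside
# (our assumption), the breakeven RYLP is under 1%:
threshold = breakeven_rylp(1, 100)  # ~0.0099
```

In other words, unless the downside is mild, the RYLP has to be tiny before trying a drug is a rational gamble, which is roughly the intuition behind society's different stances on alcohol and heroin.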
Listen, we also believe in the right to take drugs. We take drugs. Drugs good. Most drugs — maybe all drugs — should be legal. But this is very different from pretending that many drugs are not seriously, often dangerously addictive for a large percentage of the population.
As far as we know, drugs like caffeine and THC aren’t seriously addictive and don’t ruin people’s lives. There’s even some fascinating evidence, from Reuven Dar, that nicotine isn’t addictive (though there may be other good reasons to avoid nicotine). But drugs like alcohol and yes, heroin, do seem to be seriously addictive, and recognizing this is important for allowing adults to make informed choices about how they want to get high off their asses.
Hart’s experience with withdrawal, and how he chooses to discuss it, seems particularly clueless. It’s possible that Hart really is able to quit heroin with minimal discomfort, but it’s confusing and kind of condescending that he doesn’t recognize it might be harder for other people. When people say things like, “I find heroin very addictive and withdrawal excruciating,” a good start is to take their reports seriously, not to turn around and say, “well withdrawal was a cakewalk FOR ME.”
This seems to be yet another example of the confusing trend in medicine and biology, where everyone seems to assume that all people are identical and there are no individual differences at all. If an exercise program works for me, it will work equally well for everyone else. If a dietary change cures my heartburn, it will work equally well for everyone’s heartburn. If a painkiller works well for me when I have a headache, it will work equally well for the pain from your chronic illness. The assumption seems to be that people’s bodies (and minds) are made up of a single undifferentiated substance that is identical across all people. But of course, people are different, and this should be neither controversial nor difficult to understand. This is why if you’re taking drugs it’s important to experiment — you need to figure out what works best for you.
This is kind of embarrassing for Carl Hart. He is a professor of neuroscience and psychology. His specialty is neuropsychopharmacology. He absolutely has the statistical and clinical background necessary to understand this point. At the risk of being internally redundant, different people are different from each other. They will have different needs. They will have different responses to the same drugs. Sometimes two people will have OPPOSITE reactions to the SAME drug! Presumably Carl Hart has heard of paradoxical reactions — he should be aware of this.
On the other hand, anyone who sticks their finger in Duterte’s eye is my personal hero. We should cut Hart some slack for generally doing the right thing around a contentious subject, even if we think he is dangerously wrong about this point.
Less slack should be cut for the article itself. This is very embarrassing for The Guardian. Hart is the only person they quote in the entire article. They don’t seem to have interviewed any other experts to see if they might disagree with or qualify Hart’s statements. This is particularly weird because other experts are clearly interested in commenting and the author clearly knows that they might disagree with Hart. They might have asked for a comment from Yale professor, physician, and (statistically speaking) likely marijuana user, Nicholas Christakis, who would have been happy to offer a counterbalancing opinion. The Guardian was happy to print that Hart is critical of the National Institute on Drug Abuse (NIDA), “in particular of its director, Nora Volkow”, but there’s no indication that they so much as reached out to NIDA or to Volkow for comment (incidentally, here’s what Volkow has to say on the subject).
We can’t be sure, but it’s even possible they somewhat misrepresented Hart’s actual position. It’s disappointing but not surprising when a newspaper doesn’t understand basic statistics, and it would be unfair to hold them to the same standard we hold for Carl Hart. But it is fair to hold them accountable for the basics of journalistic practice, and it seems to us like they dropped the bong on this one.
When it comes to consuming media on the internet, the Wadsworth Constant is your friend. You should have skipped the first clause of this paragraph. Maybe the second half of the paragraph really.
You’ve probably heard of the Dyatlov Pass incident, where nine hikers died in the Ural Mountains under mysterious circumstances. Well, the mystery appears to have been solved using computer simulation methods developed to animate the Disney film Frozen, combined with data from research at General Motors, where they “played rather violently with human corpses”.
Back near the beginning of the pandemic, Ada Palmer wrote a post over at her blog Ex Urbe, in response to the question, “If the Black Death caused the Renaissance, will COVID also create a golden age?” This piece is perhaps even more interesting now that we’ve seen how the first year of the pandemic has played out. If you haven’t read it yet, you should!
We love this piece of speculative burrito fiction. The fiction is speculative, not the burritos. At least, we’re pretty sure it’s fiction.
Exciting new developments in education: NYU cognitive science professor Todd Gureckis, in an effort to learn how to make more engaging video lectures, studies the masters: YouTubers. The results are already pretty impressive, and we suspect they will get even more engaging over time.
[Morpheus Voice] You think it’s the year 2021, when in fact it’s still early fall 1993. We have only bits and pieces of information, but what we know for certain is that in the early ’90s, AOL opened the doors of usenet, trapping the internet in Eternal September. Whatever date you think it is, the real date (as of this writing) is Sunday September 10043, 1993.
In a recent post, we argued that the US Senate is actually a pretty liberal institution, even when it happens to be controlled by conservatives, because it helps safeguard minority opinions against the majority.
That said, there’s a lot to dislike about Congress. The big sticking point in our book is gridlock. If a bill is popular in the House and the Senate doesn’t like it, that bill is never making it to graduation. The same is true for a bill popular in the Senate that the House can’t stand. There are too many opportunities for individual officials to obstruct the process. If the Speaker of the House or the Senate Majority Leader decides to stomp on a bill that everyone else likes, there’s not much anyone can do about it.
Right now, every legislature in the world is either unicameral — a single group of legislators makes all the laws — or bicameral — two groups or chambers of legislators each write and pass laws and then duke it out.
Unicameral legislatures are pretty straightforward, but they have a number of drawbacks. There are many ways to elect legislators, but you’ll probably have reason to regret your decision, whatever you choose. If your legislators are all noblemen (as was once the case in countries like the UK) most of your population doesn’t have representation. If your legislators are elected proportionally, based on population (as is the case in the US House of Representatives), you end up with a populist system that makes it easy for the majority to abuse any minority. Regardless of how you determine who gets to be a legislator, if you only have one chamber, there’s always the chance that something will go wrong in that chamber, and they’ll pass some really stupid laws.
Having two chambers mitigates these issues, especially when the chambers are elected in different ways. A populist chamber that tries to pass anti-minority laws can be blocked by a chamber that better represents minorities. A chamber of elites that tries to pass a discriminatory law can be blocked by a populist chamber. This system of checks and balances has obvious benefits compared to letting a single chamber run everything, and this is a large part of why many legislatures are bicameral. Unfortunately, this leads to political gridlock.
So why stop at two?
One chamber is tyranny. A second chamber introduces checks and balances, but leads to political gridlock. A third chamber can introduce further checks and balances, and break the deadlock to boot.
In the two-chamber system, a bill can be introduced by either chamber, but it needs the approval of both chambers to be sent to the executive branch to be signed into law. This creates a bottleneck.
In a three-chamber system, we can relax that requirement. Any of the three chambers can propose a bill, and if they can get one of the other chambers to approve it, then it goes to the executive. If we have three chambers, A, B, and C, a law can be passed by A & B, B & C, or A & C working together. That way if one of the chambers is deadlocked on an issue, or one powerful legislator is trying to keep a bill from being passed, the other two chambers of congress can just work around it.
Currently in the US government, a presidential veto can only be overridden by a 2/3 majority vote in both the House and the Senate. In a three-chamber system, there could be more than one option. A veto could be overridden by a 2/3 majority vote in the two chambers that originally approved it, or by a simple majority vote in all three chambers.
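The passage and override rules sketched above can be written down as a couple of toy functions. This is our reading of the proposal, with one simplification: for the first override route we let any two chambers at 2/3 suffice, rather than tracking which two originally approved the bill:

```python
def bill_passes(approvals):
    """A bill goes to the executive if at least two of the three
    chambers approve it: A & B, B & C, or A & C all count."""
    return sum(approvals.values()) >= 2

def veto_overridden(votes):
    """votes maps each chamber to the share of its members voting
    to override. Two routes, as in the text: a 2/3 vote in two
    chambers (simplified from 'the two that approved the bill'),
    or a simple majority in all three."""
    shares = sorted(votes.values(), reverse=True)
    two_thirds_route = shares[0] >= 2 / 3 and shares[1] >= 2 / 3
    all_three_route = all(v > 0.5 for v in votes.values())
    return two_thirds_route or all_three_route

print(bill_passes({"A": True, "B": False, "C": True}))     # True
print(veto_overridden({"A": 0.70, "B": 0.68, "C": 0.20}))  # True
```

The structural point is visible in the code: no single chamber (and no single chamber leader) sits on the critical path, because every pair of chambers is a valid route to passage.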
You could extend these principles to a system with 5, 7, or even more chambers, but three is complicated enough.
The main distinction between the three chambers would be in how they are elected.
Let’s imagine how this could play out in an ideal scenario. One of the chambers should have proportional representation, like in the US House of Representatives. We also think it’s a good idea for one of the chambers to have intentionally disproportionate representation, like the US Senate, though it doesn’t necessarily need to be by state.
How should the third chamber be organized? We have some ideas.
As we argued in our earlier piece on the Senate, disproportionate representation can be good because it can protect minority groups from the whims of the majority. This doesn’t make all that much sense when the representation is by state — residents of North Dakota are not what we normally mean by “minority” — but maybe we can take this and run with it.
When we talk about minorities, we usually mean ethnic or religious minorities. So one thing we could do is make a chamber where every ethnic group or every religious group gets equal representation. For example, in a religious legislature, there might be 10 seats for Christians, 10 seats for Muslims, 10 seats for Sikhs, 10 seats for atheists, and so on. This would mean that even if, for example, the majority of a country were atheists, religious minorities would still have a stake in writing the laws of the country.
This approach protects minorities in a very structural way, which is great. There are a couple reasons to dislike this idea, however. First of all, the boundaries between different ethnic groups and different religions are pretty unclear. Do Catholics and Protestants count as “the same religion” for the purposes of allocating seats in this new chamber? How about Mormons? Do atheists and agnostics get separate representation? What if someone invents a new religion? How many representatives go to the Church of the Flying Spaghetti Monster? Will the chamber be taken over by Satanists?
Second, building a legislative body around something like race or religion puts the power to decide these questions in the hands of the government. We like government pretty ok, but we don’t think it should be deciding which religions “count” as separate religions.
Third, we think that dividing people up this way supports ideas like racial essentialism, which are basically a form of racist pseudoscience.
Ultimately it seems that this is a system that might work in some places (including Lebanon), but is politically dicey overall.
If there’s one group in America that really deserves more representation, it’s Native Americans.
This chamber would be really simple. There are 574 federally recognized tribal governments in the United States, and each government would get a seat, electing their representative however they wanted to.
We do notice that 231 of these tribes are located in Alaska, however. Many of them seem to have populations of only a couple hundred people, which seems a little bit disproportionate even by the standards of disproportionate representation. Probably the thing to do would be to give each of the twelve Alaska Native Regional Corporations the same level of representation.
Native American tribes are, at least nominally, sovereign nations, and maybe it sounds weird to give a sovereign government a say in writing our laws. Well…
There’s another old joke about making Israel the 51st state: they would object on the grounds that they’d only get 2 votes in the Senate. Good idea, why not.
America should take its role as a superpower more seriously. American laws influence the whole world, so maybe we should give the rest of the world some say in our laws. They already get ambassadors, so why not representatives. What’s more, America is home to millions of immigrants, who can’t vote and don’t get any representation.
In this chamber, each of the 195 or so countries of the world would send one representative, who would vote on laws just like any normal member of Congress. Naturally, the USA would get one representative in this chamber as well, and it would probably make sense to have that representative serve as the president of that chamber, as the Vice President does for the Senate.
This may seem like handing over the reins to foreign powers, but this chamber can’t pass any laws on its own. To pass a law, it would have to work with either the House or the Senate, so everything still has to be approved by an all-American body. And anyways, the President can still veto any law they want.
In addition, this chamber gives us a surprisingly strong carrot & stick for international relations. Any country would jump at the chance to have a say in setting American policy. Offering a seat, or threatening to take it away, is something that foreign powers would take seriously. If we threatened to kick Russia out, they would pay attention, and it’s interesting to think what North Korea might agree to if we hinted we might give them a seat in Congress. If this chamber existed today, I would say we should kick Myanmar out right away.
This system does seem like it might be a little hackable. After all, what’s to stop a country from breaking up into many smaller countries to get more votes? On reflection, however, if other world governments want to balkanize themselves to get more representatives, that’s ok with us, especially since plenty of regions already want to do this.
A problem with most governments is that they are the tyranny of the old over the young. This is perverse because young people will live to see the full consequences of the laws passed today, while old people may not. Old men declare war, but it’s young men who have to fight and die. Again, this seems pretty unfair.
I’ve heard that taxation without representation is bad, and while we could in theory be doing better, the fact of the matter is that Americans ages 18-29 have only one representative in the House and no representation at all in the Senate.
In our third chamber, representation could be by age bracket. Decades seems like the natural breakdown here, so we might assign 10 seats to people in their 20s, 10 seats to people in their 30s, and so on. Representatives would be elected for 2-year terms, and the main qualification would be that you would need to be in the proper age bracket on election day. If I ran as a 29-year-old and won, I would serve out my term as a representative for 20-somethings. Once my term was up I could run for the same chamber, but because I would then be 31, I would have to run as a representative for 30-somethings.
It’s not clear how far we should take this — the elderly deserve representation too, but we probably don’t want to mandate a voting block of representatives in their 90s. Because other chambers tend to skew old, maybe it would make sense to have a cutoff at retirement age. Let’s imagine we raise the retirement age to 70 (possibly a good idea on its own), and so the final bracket in this chamber would be 70+. If an 80-year-old ran for a seat in this chamber, she would be running to represent everyone age 70 and up.
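As a sketch, the bracket rule described above — one bracket per decade, with everyone at or past a hypothetical retirement-age cutoff of 70 sharing a single bracket — is just a small function:

```python
def age_bracket(age, cutoff=70):
    """Which bracket a candidate runs in: one bracket per decade,
    with everyone at or past the cutoff sharing the final bracket."""
    if age >= cutoff:
        return f"{cutoff}+"
    return f"{(age // 10) * 10}s"

print(age_bracket(29))  # 20s
print(age_bracket(31))  # 30s
print(age_bracket(80))  # 70+
```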
Legal scholar @tinybaby proposes a similar system, saying, “each state should have an allocation of 100 years of age for their senators. you can send two 50 year olds but if you want dianne feinstein in the senate you gotta send a 13 year old too”. All we can say is that this is an even more creative solution.
With some of these systems, it’s hard to know how to slice and dice the population for representation. This is how we end up with things like gerrymandering, which no one likes except politicians.
A solution to this is to use metrics that are easy to measure and hard to fake. For example, we could have a chamber composed entirely of the 100 tallest people in the country. Wait, that’s just basketball players. Ok, why not the NBA (& WNBA) winning teams for that year? At least we’ll know that these representatives are good at something.
People already complain about politicians spending too much time dunking on each other, and not enough time on policy. If this is going to happen anyways, let’s at least make the dunking literal.
One explanation for this standstill is the constant partisan bickering. According to this explanation, our representatives are too busy dunking on each other (see above) and scoring points for their parties to actually do their jobs.
If this is the case, one solution would be a chamber that forces the two parties to work together. This chamber would look a lot like the Senate — each state would elect and send two representatives. The twist is that every state would have to send both a Republican and a Democrat, so the chamber would always be 50 Democrats and 50 Republicans.
We’ll take another page out of Lebanon’s playbook here. In Lebanon, all elections are by universal suffrage, so a Christian candidate needs to court votes from Muslim voters in their district, and vice-versa. The same would happen for this chamber — representatives would be elected in a general election, not a primary, so they would benefit from getting votes from members of the other party. Like every state, Massachusetts would have to send one Republican, and the Republican they chose would undoubtedly be the one who best figured out how to appeal to liberal voters.
This precludes all the usual fights over who controls what branch of government, at least for this chamber. In this chamber, no party is ever in control — it’s always 50/50.
If the votes in this chamber were by simple majority, then with the seats split 50/50, either party could pass any law it wanted just by peeling off a single member of the other side. That’s not what we want, so in this chamber, laws can only be passed by a two-thirds majority. As a result, you literally cannot pass a law in this chamber unless it’s strongly supported by both parties.
No more fighting over the most controversial issues of the nation. Leave that to the House and the Senate. In this chamber, there’s no point trying to pass a law unless you think it has a chance at broad appeal — and the hope is, this would encourage them to aim for the policies that most Americans already want.
There are lots of humans in America, but there are even more animals. They deserve representation too, from the greatest moose to the lowliest grasshopper. They deserve some kind of… animal house. That’s the whole joke, sorry.
There’s another old problem with democracy, which is that you can only elect people who are running for office. These people are naturally power-hungry; you can tell because they’re running for office. Who would want to vote for that guy? As Douglas Adams put it:
“The major problem — one of the major problems, for there are several — one of the many major problems with governing people is that of whom you get to do it; or rather of who manages to get people to let them do it to them. To summarize: it is a well-known fact that those people who must want to rule people are, ipso facto, those least suited to do it. To summarize the summary: anyone who is capable of getting themselves made President should on no account be allowed to do the job.”
The problem with elections is you’re electing someone who decided to run. They want power, and that means you can’t trust them.
There are also many people who we like and trust, but who wouldn’t want to run for Senate or wouldn’t do a good job if they did, people like (depending on your political views) Adam Savage, John Oliver, Oprah (or Oprah for men, Joe Rogan), Cory Doctorow, or Ursula K. Le Guin’s ghost. Not all of these people are living American citizens, but you get the idea. I don’t know if many of these people would make good legislators. I think most of them would even turn down the job. But there’s some political savvy there that we’re letting go to waste.
So let’s cut the Gordian knot by combining these two problems into one solution. We want to find people who would be good at governing but who would never think of running, and we want to take advantage of the political savvy of people who wouldn’t make good legislators or wouldn’t even take the job in the first place.
So what we do is we elect clever people as nominators, and let them nominate people they think would be good for the job. In this chamber, you can’t run for a position. What you can do is run to nominate people to this chamber. If you win, you get to nominate people for a few — let’s say three — positions. Assuming they accept the nomination, the people you choose serve out a normal term in congress.
To keep this from being politics-as-usual, there would need to be some limits on who could run as a nominator and who could be nominated. The whole benefit of this system is to bring in genius from outside the normal political world, people who would normally be too humble to run. We can’t have this system filled up with career politicians, because the whole point is that constantly bringing in new people is a good way to flush out the swamp.
To begin with, nominators can’t have been elected to any state or federal position. If they win, and nominate people to this chamber, they are barred from politics, and can’t run for any position or hold any office in the future. This keeps them from being consistent kingmakers.
Similarly, we want the members of this chamber to be intelligent outsiders, so nominees can’t have been previously elected to any federal position. We think state and lower elected positions might be all right, as this would be a good way to elevate someone gifted from local politics to the national stage. Unlike nominators, however, nominees should be able to run for office in the future. They can’t be nominated to this chamber again, but if they do a really great job and everyone agrees they’re a wonderful politician, it’s clear they would be able to do well in the House or Senate, and they should be allowed to run for other offices.
This does sound a little strange, but many of our most important federal officials are already appointed. In fact, one of the main powers of the president is appointing officials like ambassadors, cabinet secretaries, federal judges, and many others. In some ways, the president is really just a super-nominator position. This chamber is simply taking the good idea of appointing officials, making it more democratic, and extending it to Congress.
One of our favorite things about the idea of nomination is that it’s a way to get normal people, who can’t or wouldn’t run for office, involved in government directly. We liked this so much that it got us thinking about other ways you might be able to get normal people involved in government.
Originally we played around with different ideas about how to find normal people who could be elevated to office. A chamber composed of the most viral tiktokers that year? A chamber of the mods of the 20 subreddits with the most subscribers? A chamber of the best dolphin trainers? These are very democratic ideas, but we couldn’t find a way to make them work. All these ideas would be immediately captured by people seeking to gain office, which defeats the porpoise, and reddit would be in charge of our electoral security, which seems like a big ask. There’s also surprisingly strong circumstantial evidence that Ghislaine Maxwell was the moderator of several major subreddits, so maybe this isn’t such a great idea after all.
These people are more normal than politicians like Ted Cruz, but reddit mods and twitch streamers are not exactly normal either. They’re regular people, but they’re not all that representative.
So instead, why not a chamber elected by lottery? In this chamber, representatives are randomly drawn from the population of all adult citizens, or maybe all registered voters.
Most of these people would have no experience in government. To account for this, each member of this chamber would be elected for terms of 6 years, but with the terms staggered, so that every two years only one-third of the members would be replaced by lottery. (This is exactly how the Senate does it.) This means that while every one of these representatives was randomly drawn from the population, at any given time one-third of them would have at least four years of experience, one-third of them would have at least two years of experience, and one-third of them would be incoming freshmen.
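The staggered-lottery mechanics might look something like this toy sketch. The 99-seat size (three classes of 33) is our invention for illustration, not part of the proposal:

```python
import random

def draw_lottery_class(voter_roll, n_seats, rng):
    """Randomly draw one class (one-third) of the lottery chamber."""
    return rng.sample(voter_roll, n_seats)

# A hypothetical 99-seat chamber, replaced one-third at a time.
rng = random.Random(42)
voter_roll = list(range(1_000_000))  # stand-in for registered voters
chamber = [draw_lottery_class(voter_roll, 33, rng) for _ in range(3)]

def election_cycle(chamber, voter_roll, rng):
    """Every two years, the senior class rotates out and a fresh
    class is drawn by lottery; the other two classes stay on."""
    return chamber[1:] + [draw_lottery_class(voter_roll, 33, rng)]

chamber = election_cycle(chamber, voter_roll, rng)
print(len(chamber), [len(c) for c in chamber])  # 3 classes of 33 seats
```

After any given cycle, two-thirds of the chamber has served at least one prior term, which is the whole point of the staggering.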
A member could theoretically be elected to this chamber more than once, but because election is by lottery, the chances of this happening are pretty damn slim (though given how weird the world is, it would probably still happen at some point). However, like the nomination idea above, these randomly-elected legislators might do such a good job that they would go on to be directly elected to other branches of government.
While you might be concerned about a chamber filled with randomly selected Americans, this chamber still can’t pass laws without the help of either the House or the Senate. It’s true their views will probably be more diverse than the views held in the House and the Senate, but you say that like it’s a bad thing.
In some ways, this encompasses many of the other ideas we described above, and does a better job of it. A chamber elected by true lottery will not only be balanced in terms of demographics, it will actually be representative. The distribution of gender, age, race, education, religion, profession, and so on in this chamber will all be nearly identical to the United States in general. Gerrymandering literally can’t affect it, since it’s a random sample. It’s hard to imagine a better way of getting diverse voices in politics.
One of the mysterious aspects of obesity is that it is correlated with altitude. People tend to be leaner at high altitudes and fatter near sea level. Colorado is the highest-altitude US state and also the leanest, with an obesity rate of only 22%. In contrast, low-altitude Louisiana has an obesity rate of about 36%. This is pretty well documented in the literature, and isn’t just limited to the United States. We see the same thing in countries around the world, from Spain to Tibet.
A popular explanation for this phenomenon is the idea that hypoxia, or lack of oxygen, leads to weight loss. The story goes that because the atmosphere is thinner at higher altitudes, the body gets less oxygen, and this ends up making people leaner.
This study focused on twenty middle-aged obese German men (mean age 55.7, mean BMI 33.7), all of whom normally lived at a low altitude — 571 ± 29 meters above sea level. Participants were first given a medical exam in Munich, Germany (530 meters above sea level) to establish baseline values for all measures. A week later, all twenty of the obese German men, as well as (presumably) the researchers, traveled to “the air‐conditioned Environmental Research Station Schneefernerhaus (UFS, Zugspitze, Germany)”, a former hotel in the Bavarian Alps (2,650 meters above sea level). The hotel/research station “was effortlessly reached by cogwheel train and cable car during the afternoon of day 6.”
Patients stayed in the Schneefernerhaus research station for a week, where they “ate and drank without restriction, as they would have at home.” Exercise was “restricted to slow walks throughout the station: more vigorous activity was not permitted.” They note that there was slightly less activity at the research station than there was at low altitudes, “probably due to the limited walking space in the high‐altitude research station.” Sounds cozy.
During this week-long period at high altitude, the researchers continued collecting measurements of the participants’ health. After the week was through, everyone returned to Munich (530 meters above sea level). At this point the researchers waited four weeks (it’s not clear why) before conducting the final health examinations, at which point the study concluded. We’re not sure what to say about this study design, except that it’s clear the film adaptation should be directed by Wes Anderson.
While this design is amusing, the results are uninspiring.
To begin with, the weight loss was minimal. During the week they spent at 2,650 meters, patients lost an average of 3 pounds (1.5 kg). They were an average of 232 lbs (105.1 kg) to begin with, so this is only about 1% of their body weight. Going from 232 lbs (105.1 kg) to 229 lbs (103.6 kg) doesn’t seem clinically relevant, or even all that noticeable. The authors, surprisingly, agree: “the absolute amount of weight loss was so small.”
More importantly, we’re not convinced that this tiny weight loss result is real, because the paper suffers from serious multiple comparison problems. Also known as p-hacking or “questionable research practices”, multiple comparisons are a problem because they can make it very likely to get a false positive. If you run one statistical test, there’s a small chance you will get a false positive, but as you run more tests, false positives get more and more likely. If you run enough tests, you are virtually guaranteed to get a false positive, or many false positives. If you try running many different tests, or try running the same test many different ways, and only report the best one, it’s possible to make pure noise look like a strong finding.
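You can see how fast this blows up with a quick simulation: run some number of independent tests on pure noise and count how often at least one of them comes back "significant" at the usual 0.05 threshold:

```python
import random

def false_positive_anywhere(n_tests, alpha=0.05, trials=10_000, seed=0):
    """Simulate running n_tests independent tests on pure noise and
    estimate the chance that at least one is a false positive. Under
    the null, each test's p-value is uniform on [0, 1], so a uniform
    draw below alpha stands in for a 'significant' result."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / trials

print(false_positive_anywhere(1))   # about 0.05
print(false_positive_anywhere(74))  # about 0.98
```

With one test the false-positive rate is the nominal 5%, but with 74 tests on pure noise you expect at least one "significant" result about 98% of the time (1 − 0.95^74 ≈ 0.98).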
We see evidence of multiple comparisons in the paper. They collect a lot of measures and run a lot of tests. The authors report eight measures of obesity alone, as well as many other measures of health.
The week the patients spent at 2,650 meters — Day 7 to Day 14 — is clearly the interval of interest here, but they mostly report comparisons of Day 1 to the other days, and they tend to report all three pairs (D1 to D7, D1 to D14, and D1 to D42), which makes for three times the number of comparisons. It’s also confusing that there are no measures for D21, D28, and D35. Did they not collect data those days, or just not report it? We think they just didn’t collect data, but it’s not clear.
The authors also use a very unusual form of statistical analysis — for each test, first they conducted a nonparametric Friedmann procedure. Then, if that showed a significant rank difference, they did a Wilcoxon signed‐rank test. It’s pretty strange to run one test conditional on another like this, especially for such a simple comparison. It’s also not clear what role the Friedmann procedure is playing in this analysis. Presumably they are referring to the Friedman test (we assume they don’t mean this procedure for biodiesel analysis) and this is a simple typo, but it’s not clear why they want to rank the means. In addition, the Wilcoxon signed‐rank test seems like a slightly strange choice. The more standard analysis here would be the humble paired t-test.
Even if this really were best practice, there’s no way to know that they didn’t start by running paired t-tests, throwing those results out when they found that they were only trending in the right direction. And in fact, we noticed that if we compare body weight at D7 to D14 using a paired t-test, we find a p-value of .0506, instead of the p < .001 they report when comparing D1 to D14 with a Wilcoxon test. We think that this is the more appropriate analysis, and as you can see, it’s not statistically significant.
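For readers who want to try this kind of comparison themselves, both analyses are one-liners in scipy. The numbers below are made up for illustration — they are not the study’s data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d7 = rng.normal(80.0, 10.0, size=20)       # hypothetical Day 7 weights (kg)
d14 = d7 - rng.normal(0.3, 0.8, size=20)   # hypothetical Day 14 weights (kg)

t_stat, t_p = stats.ttest_rel(d7, d14)     # the humble paired t-test
w_stat, w_p = stats.wilcoxon(d7, d14)      # Wilcoxon signed-rank test
print(f"paired t-test: p = {t_p:.4f}; Wilcoxon: p = {w_p:.4f}")
```

The two tests answer slightly different questions (means versus signed ranks), which is part of why conditioning one on the other is such an odd choice.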
Regardless, the whole analysis is called into question by the number of tests they ran. By our count they conducted at least 74 tests in this paper — exactly the kind of multiple testing that makes results very hard to interpret. It’s also possible that they conducted even more tests that weren’t reported in the paper. This isn’t really their fault — p-hacking wasn’t described until 2011 (and the term itself wasn’t invented until a few years later), so like most people they were almost certainly unfamiliar with issues of multiple comparisons when they did their analysis. While we don’t accuse the authors of acting in bad faith, we do think this seriously undermines our ability to interpret their results. When we ran the single test that we think was most appropriate, we found that it was not significant.
And of course, the sample size was only 20 people, though perhaps there wasn’t room for many more people in the research station. On one hand this is pretty standard for intensive studies like this, but it reduces the statistical power.
The authors claim to show that hypoxia causes weight loss, but this is overstating their case. They report that people brought to 2,650 meters lost a small amount of weight and had lower blood oxygen saturation, but we think the former result is noise and the latter result is unsurprising. Obviously if you bring people to 2,650 meters they will have lower blood oxygen, and there’s no evidence linking that to the reported weight loss.
Even more concerning is the fact that there’s no control group, which means that this study isn’t even an experiment. Without a control group, there can be no random assignment, and with no random assignment, a study isn’t an experiment. As a result, the strong causal claim the authors draw from their results is pretty unsubstantiated.
There isn’t an obvious fix for this problem. A control group that stayed in Munich wouldn’t be appropriate, because oxygen is confounded with everything else about altitude. If there were a difference between the Munich group and the Schneefernerhaus group, there would be no way to tell if that was due to the amount of oxygen or any of the other thousand differences between the two locations. A better approach would be to bring a control group to the same altitude and give that control group extra oxygen, though that might introduce its own confounds — for example, the supplemental-oxygen group would all be wearing masks and carrying canisters. The best way to do this would probably be to bring both groups to the Alps, give both of them canisters and masks, but put real oxygen in the canisters for one group and placebo oxygen (nitrogen?) in the canisters for the other group.
We’re sympathetic to inferring causal relationships from correlational data, but the authors don’t report a correlation between blood oxygen saturation and weight loss, even though that would be the relevant test given the data that they have. Probably they don’t report it because it’s not significant. They do report, “We could not find a significant correlation between oxygen saturation or oxygen partial pressure, and either ghrelin or leptin.” These are tests that we might expect to be significant if hypoxia caused weight loss — which suggests that it does not.
Unfortunately, the authors report no evidence for their mechanism and probably don’t have an effect to explain in the first place. This is too bad — the study asks an interesting question, and the design looks good at first. It’s only on reflection that you see that there are serious problems.
Thanks to Nick Brown for reading a draft of this post.
One thing that Nick Brown noticed when he read the first draft of this post is that the oxygen saturation percentages reported for D7 and D14 seem to be dangerously low. We’ve all become more familiar with oxygen saturation measures because of COVID, so you may already know that a normal range is 95-100%. Guidelines generally suggest that levels below 90% are dangerous, and should be cause to seek medical attention, so it’s a little surprising that the average for these 20 men was in the mid-80s during their week at high altitude. We found this confusing so we looked into it, and it turns out that this is probably not an issue. Not only are lower oxygen saturation levels normal at higher altitudes, the levels can apparently be very low by sea-level standards without becoming dangerous. For example, in this study of residents of El Alto in Bolivia (an elevation of 4018 m), the mean oxygen saturation percentages were in the range of 85-88%. So while this is definitely striking, it’s probably not anything to worry about.
Briefly, Hall et al. (2019) is a metabolic ward study on the effects of “ultra-processed” foods on energy intake and weight gain. The participants were 20 adults, an average of 31.2 years old. They had a mean BMI of 27, so on average participants were slightly overweight, but not obese.
Participants were admitted to the metabolic ward and randomly assigned to one of two conditions. They either ate an ultra-processed diet for two weeks, immediately followed by an unprocessed diet for two weeks — or they ate an unprocessed diet for two weeks, immediately followed by an ultra-processed diet for two weeks. The study was ad libitum, so whether they were eating an unprocessed or an ultra-processed diet, participants were always allowed to eat as much as they wanted — in the words of the authors, “subjects were instructed to consume as much or as little as desired.”
The authors found that people ate more on the ultra-processed diet and gained a small amount of weight, compared to the unprocessed diet, where they ate less and lost a small amount of weight.
We’re not in the habit of re-analyzing published papers, but we decided to take a closer look at this study because a couple of things in the abstract struck us as surprising. Weight change is one main outcome of interest for this study, and several unusual things about this measure stand out immediately. First, the two groups report the same amount of change in body weight, the only difference being that one group gained weight and the other group lost it. In the ultra-processed diet group, people gained 0.9 ± 0.3 kg (p = 0.009), and in the unprocessed diet group, people lost 0.9 ± 0.3 kg (p = 0.007). (Those ± values are standard errors of the mean.) It’s pretty unlikely for the means of both groups to be identical, and it’s very unlikely that both the means and the standard errors would be identical.
It’s not impossible for these numbers to be the same (and in fact, they are not precisely equal in the raw data, though they are still pretty close), especially given that they’re rounded to one decimal place. But it is weird. We ran some simple simulations which suggest that this should only happen about 5% of the time — but this is assuming that the means and SDs of the two groups are both identical in the population, which itself is very unlikely.
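Our simulation was along these lines (a reconstruction, not our exact code). We assume — generously — that both groups draw their n = 20 weight changes from the same population, with a mean of 0.9 kg and an SD of 1.4 kg (so that the true SEM is about 0.3, near the reported values), and we count how often the rounded means and rounded SEMs both happen to agree:

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 20, 50_000

# both groups drawn from the same population: mean 0.9 kg, SD 1.4 kg
a = rng.normal(0.9, 1.4, size=(trials, n))
b = rng.normal(0.9, 1.4, size=(trials, n))

def sem(x):
    # standard error of the mean for each simulated group
    return x.std(axis=1, ddof=1) / np.sqrt(n)

same_mean = np.round(a.mean(axis=1), 1) == np.round(b.mean(axis=1), 1)
same_sem = np.round(sem(a), 1) == np.round(sem(b), 1)
rate = (same_mean & same_sem).mean()
print(f"rounded means and SEMs both match in {rate:.1%} of trials")
```

The exact rate depends on the population parameters you assume, which is why we treat the ~5% figure as a rough estimate rather than a precise probability.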
Another test of interest reported in the abstract also seemed odd. They report that weight changes were highly correlated with energy intake (r = 0.8, p < 0.0001). This correlation coefficient struck us as surprising, because it’s pretty huge. There are very few measures that are correlated with one another at 0.8 — these are the types of correlations we tend to see between identical twins, or repeated measurements of the same person. As an example, in identical twins, BMI is correlated at about r = 0.8, and height at about r = 0.9.
We know that these points are pretty ticky-tacky stuff. By themselves, they’re not much, but they bothered us. Something already seemed weird, and we hadn’t even gotten past the abstract.
To conduct this analysis, we teamed up with Nick Brown, with additional help from James Heathers. We focused on one particular dependent variable of this study, weight change, while Nick took a broader look at several elements of the paper.
Because we were most interested in weight change, we decided to begin by taking a close look at the file “deltabw”. In mathematics, delta usually means “change” or “the change in”, and “bw” here stands for “body weight”, so this title indicates that the file contains data for the change in participants’ body weights. On the OSF this is in the form of a SAS .sas7bdat file, but we converted it to a .csv file, which is a little easier to work with.
Here’s a screenshot of what the deltabw file looks like:
In this spreadsheet, each row tells us about the weight for one participant on one day of the 4-week-long study. These daily body weight measurements were performed at 6am each morning, so we have one row for every day.
Let’s also orient you to the columns. “StudyID” is the ID for each participant. Here we can see that in this screenshot we are looking just at participant ADL001, or participant 01 for short. The “Period” variable tells us whether the participant was eating an ultra-processed (PROC) or an unprocessed (UNPROC) diet on that day. Here we can see that participant 01 was part of the group who had an unprocessed diet for the first two weeks, before switching to the ultra-processed diet for the last two weeks. “Day” tells us which day in the 28-day study the measurement is from. Here we show only the first 20 days for participant 01.
“BW” is the main variable of interest, as it is the participant’s measured weight, in kilograms, for that day of the study. “DayInPeriod” tells us which day they are on for that particular diet. Each participant goes 14 days on one diet then begins day 1 on the other diet. “BaseBW” is just their weight for day 1 on that period. Participant 01 was 94.87 kg on day one of the unprocessed diet, so this column holds that value as long as they’re on that diet. “DeltaBW” is the difference between their weight on that day and the weight they were at the beginning of that period. For example, participant 01 weighed 94.87 kg on day one and 94.07 kg on day nine, so the DeltaBW value for day nine is -0.80.
Finally, “DeltaDaily” is a variable that we added, which is just a simple calculation of how much the participant’s weight changed each day. If someone weighed 82.85 kg yesterday and they weigh 82.95 kg today, the DeltaDaily would be 0.10, because they gained 0.10 kg in the last 24 hours.
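The calculation itself is trivial — in pandas (just one way to do it), it’s a grouped diff. Here it is on a toy stand-in for the file, not the real data:

```python
import pandas as pd

# toy stand-in for deltabw: two participants, three days each
df = pd.DataFrame({
    "StudyID": ["ADL001"] * 3 + ["ADL002"] * 3,
    "Day": [1, 2, 3] * 2,
    "BW": [94.87, 94.97, 94.87, 82.85, 82.95, 82.95],
})

# DeltaDaily: today's weight minus yesterday's, within each participant
df["DeltaDaily"] = df.groupby("StudyID")["BW"].diff().round(2)
print(df)
```

The first day for each participant has no previous day, so its DeltaDaily is missing.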
To begin with, we were able to replicate the authors’ main findings. When we don’t round to one decimal place, we see that participants on the ultra-processed diet gained an average of 0.9380 (± 0.3219) kg, and participants on the unprocessed diet lost an average of 0.9085 (± 0.3006) kg. That’s a difference of only 0.0295 kg between the means (in absolute value), and 0.0213 kg between the standard errors, which we still find quite surprising. Note that this is different from the concern about standard errors raised by Drs. Mackerras and Blizzard. Many of the standard errors in this paper come from GLM analysis, which assumes homogeneity of variances and often leads to identical standard errors. But these are independently calculated standard errors of the mean for each condition, so it is still somewhat surprising that they are so similar (though not identical).
On average these participants gained and lost impressive, but not shocking amounts of weight. A few of the participants, however, saw weight loss that was very concerning. One woman lost 4.3 kg in 14 days which, to quote Nick Brown, “is what I would expect if she had dysentery” (evocative though perhaps a little excessive). In fact, according to the data, she lost 2.39 kg in the first five days alone. We also notice that this patient was only 67.12 kg (about 148 lbs) to begin with, so such a huge loss is proportionally even more concerning. This is the most extreme case, of course, but not the only case of such intense weight change over such a short period.
The article tells us that participants were weighed on a Welch Allyn Scale-Tronix 5702 scale, which has a resolution of 0.1 lb or 100 grams (0.1 kg). This means it should only display data to one decimal place. Here’s the manufacturer’s specification sheet for that model. But participant weights in the file deltabw are all reported to two decimal places; that is, with a precision of 0.01 kg, as you can clearly see from the screenshot above. Of the 560 weight readings in the data file, only 55 end in zero. It is not clear how this is possible, since the scale apparently doesn’t display this much precision.
To confirm this, we wrote to Welch Allyn’s customer support department, who confirmed that the model 5702 has 0.1 kg resolution.
We also considered the possibility that the researchers measured people’s weight in pounds and then converted to kilograms, in order to use the scale’s better precision of 0.1 pounds (45.4 grams) rather than 100 grams. However, in this case, one would expect to see that all of the changes in weight were multiples of (approximately) 0.045 kg, which is not what we observe.
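The check for this is simple enough to sketch: a converted measurement should sit within rounding error of an integer multiple of 0.0453592 kg (0.1 lb). The 5 g tolerance below is our own arbitrary choice:

```python
KG_PER_TENTH_LB = 0.045359237  # 0.1 lb expressed in kilograms

def is_multiple_of_tenth_pound(delta_kg, tol_kg=0.005):
    """True if delta_kg is (nearly) an integer multiple of 0.1 lb."""
    ratio = delta_kg / KG_PER_TENTH_LB
    return abs(ratio - round(ratio)) * KG_PER_TENTH_LB < tol_kg

# 0.09 kg is close to 2 x 0.1 lb, but 0.10 kg is not close to any multiple
print(is_multiple_of_tenth_pound(0.09), is_multiple_of_tenth_pound(0.10))  # -> True False
```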
As we look closer at the numbers, things get even more confusing.
As we noted, Hall et al. report participant weight to two decimal places in kilograms for every participant on every day. Kilograms to two decimal places should be pretty sensitive, but there are many cases where the exact same weight appears two or even three times in a row. For example, participant 21 is listed as having a weight of exactly 59.32 kg on days 12, 13, and 14, participant 13 is listed as having a weight of exactly 96.43 kg on days 10, 11, and 12, and participant 06 is listed as having a weight of exactly 49.54 kg on days 23, 24, and 25.
Having the same weight for two or even three days in a row may not seem that strange, but it is very remarkable when the measurement is in kilograms precise to two decimal places. After all, 0.01 kg (10 grams) is not very much weight at all. A standard egg weighs about 0.05 kg (50 grams). A shot of liquor is a little less than that, usually a bit more than 0.03 kg (30 grams). A tablespoon of water is about 0.015 kg (15 grams). This suggests that people’s weights are varying by less than the weight of a tablespoon of water over the course of entire days, and sometimes over multiple days. This uncanny precision seems even more unusual when we note that body weight measurements were taken at 6 am every morning “after the first void”, which suggests that participants’ bodily functions were precise to 0.01 kg on certain days as well.
The case of participant 06 is particularly confusing, as 49.54 kg is exactly one kilogram less, to two decimal places, than the baseline for this participant’s weight when they started, 50.54 kg. Furthermore, in the “unprocessed” period, participant 06 only ever seems to lose or gain weight in full increments of 0.10 kilograms.
We see similar patterns in the data from other participants. Let’s take a look at the DeltaDaily variable. As a reminder, this variable is just the difference between a person’s weight on one day and the day before. These are nothing more than daily changes in weight.
Because these numbers are calculated from the difference between two weight measurements, both of which are reported to two decimal places of accuracy, these numbers should have two places of accuracy as well. But surprisingly, we see that many of these weight changes are in full increments of 0.10.
Take a look at the histograms below. The top histogram is the distribution of weight changes by day. For example, a person might gain 0.10 kg between days 15 and 16, and that would be one of the observations in this histogram.
You’ll see that these data have an extremely unnatural hair-comb pattern of spikes, with only a few observations in between. This is because the vast majority (~71%) of the weight changes are in exact multiples of 0.10, despite the fact that weights and weight changes are reported to two decimal places. That is to say, participants’ weights usually changed in increments like 0.20 kg, -0.10 kg, or 0.40 kg, and almost never in increments like -0.03 kg, 0.12 kg, or 0.28 kg.
For comparison, on the bottom is a sample from a simulated normal distribution with identical n, mean, and standard deviation. You’ll see that there is no hair-comb pattern for these data.
As we mentioned earlier, there are several cases where a participant stays at the exact same weight for two or three days in a row. The distribution we see here is the cause. As you can see, the most common daily change is exactly zero. Now, it’s certainly possible to imagine why some values might end up being zero in a study like this. There might be a technical incident with the scale, a clerical error, or a mistake when recording handwritten data on the computer. A lazy lab assistant might lose their notes, resulting in the previous day’s value being used as the reasonable best estimate. But since a change of exactly zero is the modal response, a full 9% of all measurements, it’s hard to imagine that these are all omissions or technical errors.
In addition, there’s something very strange going on with the trailing digits:
On the top here we have the distribution of digits in the 0.1 place. For example, a measurement of 0.29 kg would appear as a 2 here. This roughly follows the distribution we would expect, though there are a few more 1’s and a few fewer 0’s than usual.
The bottom histogram is where things get weird. Here we have the distribution of digits in the 0.01 place. For example, a measurement of 0.29 kg would appear as a 9 here. As you can see, 382/540 of these observations have a 0 in their 0.01’s place — this is the same as that figure of 71% of measured changes being in full increments of 0.10 kg that we mentioned earlier.
The rest of the distribution is also very strange. When the trailing digit is not a zero, it is almost certainly a 1 or a 9, possibly a 2 or an 8, and almost never anything else. Of 540 observed weight changes, only 3 have a trailing digit of 5.
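For the curious, extracting that 0.01-place digit is slightly fiddly because of floating point; here’s roughly how it can be done, on a handful of made-up weight changes:

```python
def hundredths_digit(delta_kg):
    """Digit in the 0.01 place of a weight change, e.g. 0.29 -> 9."""
    return int(round(abs(delta_kg) * 100)) % 10

changes = [0.20, -0.10, 0.40, -0.03, 0.12, 0.29]
digits = [hundredths_digit(d) for d in changes]
print(digits)  # -> [0, 0, 0, 3, 2, 9]
```

Multiplying by 100 and rounding before taking the digit avoids errors from values like 0.29 being stored internally as 0.28999…; from there, a collections.Counter gives the histogram.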
We can see that this is not what we would expect from (simulated) normally distributed data:
It’s also not what we would expect to see if they were measuring to one decimal place most of the time (~70%), but to two decimal places on occasion (~30%). As we’ve already mentioned, this doesn’t make sense from a methodological standpoint, because all daily weights are to two decimal places. But even if it somehow were a measurement accuracy issue, we would expect an equal distribution across all the other digits besides zero, like this:
This is certainly not what we see in the reported data. The fact that 1 and 9 are the most likely trailing digit after 0, and that 2 and 8 are most likely after that, is especially strange.
When we first started looking into this paper, we approached Retraction Watch, who said they considered it a potential story. After completing the analyses above, we shared an early version of this post with Retraction Watch, and with our permission they approached the authors for comment. The authors were kind enough to offer feedback on what we had found, and when we examined their explanation, we found that it clarified a number of our points of confusion.
The first thing they shared with us was this erratum from October 2020, which we hadn’t seen before. The erratum reports that they noticed an error in the documented diet order of one participant. This is an important note but doesn’t affect the analyses we present here, which have very little to do with diet conditions.
Kevin Hall, the first author on this paper, also shared a clarification on how body weights were calculated:
I think I just discovered the likely explanation about the distribution of high-precision digits in the body weight measurements that are the main subject of one of the blogs. It’s kind of illustrative of how difficult it is to fully report experimental methods! It turns out that the body weight measurements were recorded to the 0.1 kg according to the scale precision. However, we subtracted the weight of the subject’s pajamas that were measured using a more precise balance at a single time point. We repeated subtracting the mass of the pajamas on all occasions when the subject wore those pajamas. See the example excerpted below from the original form from one subject who wore the same pajamas (PJs) for three days and then switched to a new set. Obviously, the repeating high precision digits are due to the constant PJs! 😉
This matches what is reported in the paper, where they state, “Subjects wore hospital-issued top and bottom pajamas which were pre-weighed and deducted from scale weight.”
Kevin also included the following image, which shows part of how the data was recorded for one participant:
If we understand this correctly, the first time a participant wore a set of pajamas, the pajamas were weighed to three decimals of precision. Then, that measurement was subtracted from the participant’s weight on the scale (“Patient Weight”) on every consecutive morning, to calculate the participant’s body weight. For an unclear reason, this was recorded to two decimals of precision, rather than the one decimal of precision given by the scale, or the three decimals of precision given by the PJ weights. When the participant switched to a new set of pajamas, the new set was weighed to three decimals of precision, and that number was used to calculate participant body weight until they switched to yet another new set of pajamas, etc.
We assume that the measurement for the pajamas is given in kilograms, even though they write “g” and “gm” (“qm”?) in the column. We wish our undergraduate lab TAs were as forgiving as the editors at Cell Metabolism.
This method does account for the fact that participant body weights were reported to two decimal places of precision, despite the fact that the scale only measures weight to one decimal place of precision. Even so, there were a couple of things that we still found confusing.
The variable that interests us the most is the DeltaDaily variable. We can easily calculate that variable for the provided example, like so:
We can see that whenever a participant doesn’t change their pajamas on consecutive days, there’s a trailing zero. In this way, the pajamas can account for the fact that 71% of the time, the trailing digits in the DeltaDaily variable were zeros.
A nonzero trailing digit, in turn, lets us identify the days when a participant changed their pajamas. Note of course that about ten percent of the time, a change in pajamas will also lead to a trailing digit of zero. So every trailing digit that isn’t zero marks a pajama change, though a small number of the zeros will also be “hidden” pajama changes.
In any case, we can use this to make inferences about how often participants change their pajamas, which we find rather confusing. Participants often change their pajamas every day for multiple days in a row, or go long stretches without apparently changing their pajamas at all, and sometimes these are the same participants. It’s possible that these long stretches without any apparent change of pajamas are the result of the “hidden” changes we mentioned, because about 10% of the time changes would happen without the trailing digit changing, but it’s still surprising.
For example, participant 05 changes their pajamas on day 2, day 5, and day 10, and then apparently doesn’t change their pajamas again until day 28, going more than two weeks without a change in PJs. Participant 20, in contrast, changes pajamas at least 16 times over 28 days, including every day for the last four days of the study. The record for this, however, has to go to participant 03, who at one point appears to have switched pajamas every day for at least seven days in a row. Participant 03 then goes eight days in a row without changing pajamas before switching pajamas every day for three days in a row.
Participant 08 (the participant from the image above) seems to change their pajamas only twice during the entire 28-day study, once on day 4 and again on day 28. Certainly this is possible, but it doesn’t look like the pajama-wearing habits we would expect. It’s true that some people probably want to change their pajamas more than others, but this doesn’t seem like it can be entirely attributed to personality, as some people don’t change pajamas at all for a long time, and then start to change them nearly every day, or vice-versa.
We were also unclear on whether the pajamas adjustment could account for the most confusing pattern we saw in the data for this article, the distribution of digits in the .01 place for the DeltaDaily variable:
The pajamas method can explain why there are so many zeros — any day a participant didn’t change their pajamas, there would be a zero, and it’s conceivable that participants only changed their pajamas on 30% of the days they were in the study.
We weren’t sure if the pajamas method could explain the distribution of the other digits. For the trailing digits that aren’t zero, 42% of them are 1’s, 27% of them are 9’s, 9% of them are 2’s, 8% of them are 8’s, and the remaining digits account for only about 3% each. This seems very strange.
You’ll recall that the DeltaDaily values record the changes in participant weights between consecutive days. Because the scale is only precise to 0.1 kg, the digit in the 0.01 place carries information about the difference between two different pairs of pajamas. For illustration, in the example Kevin Hall provided, the participant switched between a pair of pajamas weighing 0.418 kg and a pair weighing 0.376 kg. These differ by 0.042 kg, so when the result is rounded to two digits, the difference we see in the DeltaDaily has a trailing digit of 4.
We wanted to know if the pajama adjustment could explain why the difference (for the digit in the 0.01’s place) between the weights of two pairs of pajamas are 14x more likely to be a 1 than a 6, or 9x more likely to be a 9 than a 3.
Verbal arguments quickly got very confusing, so we decided to run some simulations. We simulated 20 participants, for 28 days each, just like the actual study. On day one, simulated participants were assigned a starting weight, which was a random integer between 40 and 100. Every day, their weight changed by an amount between -1.5 and 1.5 by increments of 0.1 (-1.5, -1.4, -1.3 … 1.4, 1.5), with each increment having an equal chance of occurring.
The most important part of the simulation was the pajamas, of course. Participants were assigned a pajama weight on day 1, and each day they had a 35% chance of changing pajamas and being assigned a new pajama weight. The real question was how to generate a reasonable distribution of pajama weights. We didn’t have much to go on, just the two values in the image that Kevin Hall shared with us. But we decided to give it a shot with just that information. Weights of 418 g and 376 g have a mean of just under 400 g and a standard deviation of about 30 g, so we decided to sample our pajama weights from a normal distribution with those parameters.
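Concretely, the simulation looked something like this (a reconstruction rather than our exact code; the 400 g mean pajama weight, the pajama SD, and the 35% daily change rate are the assumptions just described):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_trailing_digits(pj_sd_g=30, n_subjects=20, n_days=28, p_change=0.35):
    """Distribution of 0.01-place digits of daily weight changes, when scale
    weights (0.1 kg resolution) have high-precision pajama weights subtracted."""
    digit_counts = np.zeros(10, dtype=int)
    for _ in range(n_subjects):
        scale_kg = float(rng.integers(40, 101))    # starting weight, kg
        pj_kg = rng.normal(0.400, pj_sd_g / 1000)  # current pajamas, kg
        prev_bw = None
        for _ in range(n_days):
            if rng.random() < p_change:            # fresh pajamas today?
                pj_kg = rng.normal(0.400, pj_sd_g / 1000)
            bw = round(scale_kg - pj_kg, 2)        # recorded body weight
            if prev_bw is not None:
                delta = round(bw - prev_bw, 2)
                digit_counts[int(round(abs(delta) * 100)) % 10] += 1
            prev_bw = bw
            scale_kg += rng.integers(-15, 16) / 10  # next day: ±1.5 kg in 0.1 steps
    return digit_counts

print(simulate_trailing_digits(pj_sd_g=30))  # counts for trailing digits 0-9
```

Conveniently, 20 participants over 28 days yields 540 daily changes, the same number as in the real data.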
When we ran this simulation, we found a distribution of digits in the 0.01 place that didn’t show the same saddle-shaped distribution as in the data from the paper:
We decided to run some additional simulations, just to be sure. To our surprise, when the SD of the pajamas is smaller, in the range of 10-20 g, you can sometimes get saddle-shaped distributions just like the ones we saw in data from the paper. Here’s an example of what the digits can look like when the SD of the pajamas is 15 g:
It’s hard for us to say whether a standard deviation of 15 g or of 30 g is more realistic for hospital pajamas, but it’s clear that under certain circumstances, pajama adjustments can create this kind of distribution (we propose calling it the “pajama distribution”).
While we find this distribution surprising, we conclude that it is possible given what we know about these data and how the weights were calculated.
When we took a close look at these data, we originally found a number of patterns that we were unable to explain. Having communicated with the authors, we now think that while there are some strange choices in their analysis, most of these patterns can be explained when we take into account the fact that pajama weights were deducted from scale weights, and the two weights had different levels of precision.
While these patterns can be explained by the pajama adjustment described by Kevin Hall, there are some important lessons here. The first, as Kevin notes in his comment, is that it can be very difficult to fully record one’s methods. It would have been better to include the full history of this variable in the data files, including the pajama weights, instead of recording the weights and performing the relevant comparisons by hand.
The second is a lesson about combining data of different levels of precision. The hair-comb pattern that we observed in the distribution of DeltaDaily scores was truly bizarre, and was reason for serious concern. It turns out that this kind of distribution can occur when a measure with one decimal of precision is combined with another measure with three decimals of precision, and the result is rounded to two decimals of precision. In the future, researchers should avoid combining data in this way, to keep from creating artifacts like these. While it may not affect their conclusions, it is strange for the authors to claim that someone’s weight changed by (for example) 1.27 kg, when they have no way to measure the change to that level of precision.
There are some more minor points that this explanation does not address, however. We still find it surprising how consistent the weight change was in this study, and how extreme some of the weight changes were. We also remain somewhat confused by how often participants changed (or didn’t change) their pajamas.
This post continues in Part Two over at Nick Brown’s blog, where he covers several other aspects of the study design and data.
Thanks again to Nick Brown for comparing notes with us on this analysis, to James Heathers for helpful comments, and to a couple of early readers who asked to remain anonymous. Special thanks to Kevin Hall and the other authors of the original paper, who have been extremely forthcoming and polite in their correspondence. We look forward to ongoing public discussion of these analyses, as we believe the open exchange of ideas can benefit the scientific community.
Imagine a universe where every cognitive scientist receives extensive training in how to deal with demand characteristics.
(Demand characteristics describe any situation in a study where a participant either figures out what a study is about, or thinks they have, and changes what they do in response. If the participant is friendly and helpful, they may try to give answers that will make the researchers happy; if they have the opposite disposition, they might intentionally give nonsense answers to ruin the experiment. This is a big part of why most studies don’t tell participants what condition they’re in, and why some studies are run double-blind.)
In the real world, most students get one or two lessons about demand characteristics when they take their undergrad methods class. When researchers are talking about a study design, sometimes we mention demand, but only if it seems relevant.
Let’s return to our imaginary universe. Here, things are very different. Demand characteristics are no longer covered in undergraduate methods courses — instead, entire classes are exclusively dedicated to demand characteristics and how to deal with them. If you major in a cognitive science, you’re required to take two whole courses on demand — Introduction to Demand for the Psychological Sciences and Advanced Demand Characteristics.
Often there are advanced courses on specific forms of demand. You might take a course that spends a whole semester looking at the negative-participant role (also known as the “screw-you effect”), or a course on how to use deception to avoid various types of demand.
If you apply to graduate school, how you did in these undergraduate courses will be a major factor determining whether they let you in. If you do get in, you still have to take graduate-level demand courses. These are pretty much the same as the undergrad courses, except they make you read some of the original papers and work through the reasoning for yourself.
When presenting your research in a talk or conference, you can usually expect to get a couple of questions about how you accounted for demand in your design. Students are evaluated based on how well they can talk about demand and how advanced the techniques they use are.
Every journal requires you to include a section on demand characteristics in every paper you submit, and reviewers will often criticize your manuscript because you didn’t account for demand in the way they expected. When you go up for a job, people want to know that you’re qualified to deal with all kinds of demand characteristics. If you have training in dealing with an obscure subtype of demand, it will help you get hired.
It would be pretty crazy to devote such a laser focus to this one tiny aspect of the research process. Yet this is exactly what we do with statistics.
Science is all about alternative explanations. We design studies to rule out as many stories as we can. Whatever stories remain are possible explanations for our observations. Over time, we whittle this down to a small number of well-supported theories.
There’s one alternative explanation that is always a concern. For any relationship we observe, there’s a chance that what we’re seeing is just noise. Statistics is a set of tools designed to deal with this problem. Statistics holds a special place in science because “it was noise” is a concern for every study in every field, so we always want to make sure to rule it out.
But of course, there are many alternative explanations that we need to be concerned with. Whenever you’re dealing with human participants, demand characteristics will also be a possible alternative. Despite this, we don’t jump down people’s throats about demand. We only bring up these issues when we have a reason to suspect that it is a problem for the design we’re looking at.
There will always be more than one way to look at any set of results. We can never rule out every alternative explanation — the best we can do is account for the most important and most likely alternatives. We decide which ones to account for by using our judgement, by taking some time to think about what alternatives we (and our readers) will be most concerned about.
The right answer will look different for different experiments. But the wrong answer is to blindly throw statistics at every single study.
Statistics is useful when a finding looks like it could be the result of noise, but you’re not sure. For example, let’s say we’re testing a new treatment for a disease. We have a group of 100 patients who get the treatment and a control group of 100 people who don’t get the treatment. If 52/100 people recover when they get the treatment, compared to 42/100 recovering in the control group, does that mean the treatment helped? Or is the difference just noise? I can’t tell with just a glance, but a simple chi-squared test can tell me that p ≈ .16, meaning that noise alone would produce a difference at least this large about 16% of the time. The test answered a question my eyes couldn’t.
That’s helpful, but it would be pointless to run a statistical test if we saw 43/100 people recover with the treatment, compared to 42/100 in the control group. I can tell that this is very consistent with noise (p > .50) just by looking at it. And it would be pointless to run a statistical test if we saw 98/100 people recover with the treatment, compared to 42/100 in the control group. I can tell that this is very inconsistent with noise (p < .00000000000001) just by looking at it. If something passes the interocular trauma test (the conclusion hits you between the eyes), you don’t need to pull out another test.
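The arithmetic behind these three comparisons is easy to check. Here is a minimal sketch using only the Python standard library; the helper name `chi2_2x2` is ours, and in practice you would probably reach for `scipy.stats.chi2_contingency` instead. For one degree of freedom, the chi-squared p-value reduces to `erfc(sqrt(x / 2))`, so no statistics package is needed:

```python
# Pearson chi-squared test of independence for a 2x2 table, stdlib only.
# Each call below is one of the treatment/control scenarios from the text:
# (recovered, not recovered) for the treatment group, then for the control group.
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Chi-squared statistic (no Yates correction) and p-value for the
    table [[a, b], [c, d]]. With df = 1, P(X > x) = erfc(sqrt(x / 2))."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, erfc(sqrt(stat / 2))

# 52/100 vs 42/100: can't tell by eye; the test says p comes out around 0.16
stat, p = chi2_2x2(52, 48, 42, 58)
print(f"52 vs 42: chi2 = {stat:.2f}, p = {p:.3f}")

# 43/100 vs 42/100: obviously consistent with noise (p > .50)
stat, p = chi2_2x2(43, 57, 42, 58)
print(f"43 vs 42: chi2 = {stat:.2f}, p = {p:.3f}")

# 98/100 vs 42/100: obviously not noise (p is far below .00000000000001)
stat, p = chi2_2x2(98, 2, 42, 58)
print(f"98 vs 42: chi2 = {stat:.2f}, p = {p:.2e}")
```

Only the middle case is a judgment call for the naked eye; for the other two, the test just confirms what the interocular trauma test already told us.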
This might sound outlandish today, but you can do perfectly good science without any statistics at all. After all, statistics is barely more than a hundred years old. Sir Francis Galton came up with the concept of the standard deviation in the 1860s, and the story with the ox didn’t happen until 1907. It took until the 1880s to dream up correlation. Karl Pearson was born in 1857 but didn’t do most of his statistics work until around the turn of the century. Fisher wasn’t even born until 1890. He introduced the term variance for the first time in 1918, but neither that term nor the ANOVA caught on until the publication of his book in 1925.
This means that Galileo, Newton, Kepler, Hooke, Pasteur, Mendel, Lavoisier, Maxwell, von Helmholtz, Mendeleev, etc. did their work without anything that resembled modern statistics, and that Einstein, Curie, Fermi, Bohr, Heisenberg, etc. etc. did their work in an age when statistics was still extremely rudimentary. We don’t need statistics to do good research.
This isn’t an original idea, or even a particularly new one. When statistics was young, people understood this point better. For an example, we can turn to Sir Austin Bradford Hill. He was trained by Karl Pearson (who, among other things, invented the chi-squared test we used earlier), was briefly president of the Royal Statistical Society, and was sometimes referred to as the world’s leading medical statistician. He later pioneered the introduction of the randomized clinical trial in medicine, most famously with the 1948 streptomycin trial. As far as opinions on statistics go, the man was pretty qualified.
While you may not know his name, you’re probably familiar with his work. He was one of the researchers who demonstrated the connection between cigarette smoking and lung cancer, and in 1965 he gave a speech about his work on the topic. Most of the speech was a discussion of how one can infer a causal relationship from largely correlational data, as he had done with the smoking-lung cancer connection, a set of considerations that came to be known as the Bradford Hill criteria.
But near the end of the speech, he turns to a discussion of tests of significance, as he calls them, and their limitations:
No formal tests of significance can answer [questions of cause and effect]. Such tests can, and should, remind us of the effects that the play of chance can create, and they will instruct us in the likely magnitude of those effects. Beyond that they contribute nothing to the ‘proof’ of our hypothesis.
Nearly forty years ago, amongst the studies of occupational health that I made for the Industrial Health Research Board of the Medical Research Council was one that concerned the workers in the cotton-spinning mills of Lancashire (Hill 1930). … All this has rightly passed into the limbo of forgotten things. What interests me today is this: My results were set out for men and women separately and for half a dozen age groups in 36 tables. So there were plenty of sums. Yet I cannot find that anywhere I thought it necessary to use a test of significance. The evidence was so clear cut, the differences between the groups were mainly so large, the contrast between respiratory and non-respiratory causes of illness so specific, that no formal tests could really contribute anything of value to the argument. So why use them?
Would we think or act that way today? I rather doubt it. Between the two world wars there was a strong case for emphasizing to the clinician and other research workers the importance of not overlooking the effects of the play of chance upon their data. Perhaps too often generalities were based upon two men and a laboratory dog while the treatment of choice was deducted from a difference between two bedfuls of patients and might easily have no true meaning. It was therefore a useful corrective for statisticians to stress, and to teach the needs for, tests of significance merely to serve as guides to caution before drawing a conclusion, before inflating the particular to the general.
I wonder whether the pendulum has not swung too far – not only with the attentive pupils but even with the statisticians themselves. To decline to draw conclusions without standard errors can surely be just as silly? Fortunately I believe we have not yet gone so far as our friends in the USA where, I am told, some editors of journals will return an article because tests of significance have not been applied. Yet there are innumerable situations in which they are totally unnecessary – because the difference is grotesquely obvious, because it is negligible, or because, whether it be formally significant or not, it is too small to be of any practical importance. What is worse, the glitter of the t-table diverts attention from the inadequacies of the fare. Only a tithe, and an unknown tithe, of the factory personnel volunteer for some procedure or interview, 20% of patients treated in some particular way are lost to sight, 30% of a randomly-drawn sample are never contacted. The sample may, indeed, be akin to that of the man who, according to Swift, ‘had a mind to sell his house and carried a piece of brick in his pocket, which he showed as a pattern to encourage purchasers.’ The writer, the editor and the reader are unmoved. The magic formulae are there.
Of course I exaggerate. Yet too often I suspect we waste a deal of time, we grasp the shadow and lose the substance, we weaken our capacity to interpret the data and to take reasonable decisions whatever the value of P. And far too often we deduce ‘no difference’ from ‘no significant difference.’ Like fire, the chi-squared test is an excellent servant and a bad master.
We grasp the shadow and lose the substance.
As Dr. Hill notes, the blind use of statistical tests is a huge waste of time. Many designs don’t need them; many arguments don’t benefit from them. Despite this, we have long disagreements about which of two tests is most appropriate (even when both of them will be highly significant), we spend time crunching numbers when we already know what we will find, and we demand that manuscripts have their statistics arranged just so — even when it doesn’t matter.
This is an institutional waste of time as well as a personal one. It’s weird that students get so much training in statistics. Methods are almost certainly more important, but most students are forced to take multiple stats classes, while only one or two methods classes are even offered. This is also true at the graduate level. Methods and theory courses are rare in graduate course catalogs, but there is always plenty of statistics.
Some will say that this is because statistics is so much harder to learn than methods: because it is a more difficult subject, it takes more time to master. It’s true that students tend to take several courses in statistics and come out remembering nothing at all. But that isn’t because statistics is so much more difficult.
We agree that statistical thinking is very important. What we take issue with is the neurotic focus on statistical tests, which are of minor use at best. The problem is that our statistics training spends multiple semesters on tests, while spending little to no time at all on statistical thinking.
This also explains why students don’t learn anything in their statistics classes. Students can tell, even if only unconsciously, that the tests are unimportant, so they have a hard time taking them seriously. They would also do poorly if we asked them to memorize a phone book — so much more so if we asked them to memorize the same phone book for three semesters in a row.
Understanding these tests requires statistical thinking, but we don’t teach students that. We’ve become anxious around the tests, and so we devote more and more of the semester to them. But this is like becoming anxious about planes crashing and devoting more of your pilot training time to the procedure for making an emergency landing. If the pilots get less training in the basics, there will be more emergency landings, leading to more anxiety and more training, etc. — it’s a vicious cycle. If you just teach students statistical thinking to begin with, they can see why it’s useful and will be able to easily pick up the less-important tests later on, which is exactly what I found when I taught statistics this way.
The bigger problem is turning our thinking over to machines, especially ones as simple as statistical tests.
Sometimes a test is useful, sometimes it is not. We can have discussions about when a test is the right choice and when it is the wrong one. Researchers aren’t perfect, but we have our judgement and damn it, we should be expected to use it. We may be wrong sometimes, but that is better than letting the p-values call all the shots.
We need to stop taking tests so seriously as a criterion for evaluating papers. There’s a reason, of course, that we are on such high alert about these tests — the concept of p-hacking is only a decade old, and questionable statistical practices are still being discovered all the time.
But this focus on statistical issues tends to obscure deeper problems. We know that p-hacking is bad, but a paper with perfect statistics isn’t necessarily good — the methods and theory, even the basic logic, can be total garbage. In fact, this is part of how we got in the p-hacking situation in the first place: by using statistics as the main way of telling if a paper is any good or not!
Putting statistics first is how we end up with studies with beautifully preregistered protocols and immaculate statistics, but deeply confounded methods, on topics that are unimportant and frankly uninteresting. This is what Hill meant when he said that “the glitter of the t-table diverts attention from the inadequacies of the fare”. Confounded methods can produce highly significant p-values without any p-hacking, but that doesn’t mean the results of such a study are of any value at all.
This is why I find proposals to save science by revising statistics so laughable. Surrendering our judgement to Bayes factors instead of p-values won’t do anything to solve our problems. Changing the threshold of significance from .05 to .01, or .005, or even .001 won’t make for better research. We shouldn’t try to revise statistics; we should use it less often.
Thanks to Adam Mastroianni, Grace Rosen, and Alexa Hubbard for reading drafts of this piece.