When Data Fails

In the early 1920s, car manufacturers had a big problem on their hands—engine knock. Engine knock is when fuel combusts inside your car’s motor in an uneven manner. This uneven combustion makes an annoying knocking sound and can cause permanent damage to your car’s engine.

To solve this problem, automotive engineers began adding lead (tetraethyl lead) to the gasoline they used in hopes of stabilizing the fuel combustion happening inside the engine. It worked. By adding lead, the engineers were able to increase the octane level of the gasoline, even out the combustion, and reduce engine knock in the process.

Unfortunately, it wasn’t long until the negative side effects of all this added lead began to appear. After 15 workers died of suspected lead poisoning in 1924, the Surgeon General “suspended the production of leaded gasoline and convened a panel to investigate the potential dangers.”

One of the people involved with this panel was the toxicologist Robert Kehoe. Kehoe wanted to understand lead’s toxicity and was ready to go to extreme lengths to do so. As Sharon McGrayne stated in Prometheans in the Lab:

In the years to come, Kehoe conducted a series of elaborate and dangerous experiments to show that lead does not accumulate in the body. He fed and aerated young men with lead for up to five years while measuring the lead in their feces and urine…Everyone he measured contained lead, whether they were Mexican peasants cooking and eating off of lead-glazed pottery or his controls, men who worked in tetraethyl lead plants. As a result, Kehoe decided that some lead pollution was natural and normal in everyone.

The Surgeon General agreed as well, concluding that there were “no good grounds for prohibiting the use of ethyl gasoline…as a motor fuel, provided that its distribution and use are controlled by proper regulations.” This opened the floodgates for leaded gasoline to flow freely into gas tanks across the U.S. And it did. As McGrayne noted, “by 1960, leaded gasoline accounted for nearly 90 percent of all automotive fuel sold [in the United States].”

Thankfully, a geochemist by the name of Clair Patterson knew that all this lead was dangerous to the public. So, in the mid-1960s, he began a public crusade against the use of lead in industrial processes. His efforts pushed Congress to pass the Clean Air Act of 1970, which forced refineries to start removing lead from their gasoline. By 1996, leaded gasoline was officially banned in the U.S. for on-road vehicles.

Looking back today, we can see that Robert Kehoe was wrong in a big way. His data failed him, but why? Because he didn’t have the right data. Though Kehoe’s measurements suggested that lead was not accumulating in the body, he overlooked the place where it was accumulating: the bones. As a result of this omission, we ended up pumping lead into the environment for over 50 years, to our own detriment.

Kehoe’s error is a prime example of how data, when not used properly, can lead us astray. But there are other ways that data can fail us as well. To illustrate this, below I have compiled my favorite examples of when data fails.

It’s What You Don’t See (or Why the Average is Not the Distribution)

One of the most common ways data fails is when it only describes a subset of what you are actually trying to analyze. Similar to Robert Kehoe’s error of omission highlighted above, when we only see a part of our data, we can come to the wrong conclusion. One area where this is common is examining the average instead of the distribution. Sam Savage provides a great example of this in The Flaw of Averages:

An apocryphal example concerns the statistician who drowned while fording a river that was, on average, only three feet deep.

In this case, knowing the distribution of the river’s depth (or the maximum depth) would have been far superior to knowing only the average depth.

But real world cases of this exist as well. Data scientists at Uber correctly identified that average wait time isn’t the best measure of a ride matching algorithm’s usefulness. For example, imagine an algorithm that neglects less populated areas (drastically increasing the wait time for the small number of riders in those areas), but lowers the wait time for all other riders. In this case, the average wait time may have gone down, but it did so at the expense of those living in sparsely populated locations.
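To make this concrete, here is a minimal Python sketch. The wait times below are made up for illustration (they are not Uber’s data), and the names `before` and `after` are hypothetical. It simply compares the average wait with the worst wait under two imagined algorithms.

```python
from statistics import mean

# Hypothetical wait times in minutes: nine urban riders and one rural rider.
before = [6, 6, 6, 6, 6, 6, 6, 6, 6, 10]

# An imagined new algorithm saves each urban rider 2 minutes
# but makes the rural rider wait 15 minutes longer.
after = [4, 4, 4, 4, 4, 4, 4, 4, 4, 25]

avg_before = mean(before)
avg_after = mean(after)
worst_before = max(before)
worst_after = max(after)

print(f"average wait: {avg_before:.1f} -> {avg_after:.1f} minutes")
print(f"worst wait:   {worst_before} -> {worst_after} minutes")
```

In this toy example the average improves (6.4 to 6.1 minutes) while the worst-off rider goes from waiting 10 minutes to waiting 25. Looking at the maximum, or at a high percentile, alongside the average is one simple way to catch this.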

We see the same issue in the financial world when someone chooses an investment because of its high return without considering its volatility. Once again, when you only consider a subset of the information available, you run the risk of a bad outcome.
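A quick sketch shows why the average return alone can mislead. The two return series below are hypothetical, and `final_value` is a helper written just for this illustration. The second investment has the higher average return, but its swings eat into what you actually end up with.

```python
from statistics import mean

# Hypothetical annual returns for two made-up investments.
steady = [0.10, 0.10, 0.10, 0.10]       # averages 10% per year
volatile = [0.70, -0.40, 0.70, -0.40]   # averages 15% per year

def final_value(returns, start=100.0):
    """Compound a starting balance through each year's return."""
    value = start
    for r in returns:
        value *= 1 + r
    return value

for name, returns in [("steady", steady), ("volatile", volatile)]:
    print(
        f"{name}: average return {mean(returns):.0%}, "
        f"$100 grows to ${final_value(returns):.2f}"
    )
```

With these made-up numbers, the steady investment turns $100 into about $146, while the volatile one, despite its higher average return, ends up at about $104.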

Lastly, sometimes “the average” result doesn’t apply to your individual experience. For example, you’ve probably heard that people who spend money on experiences are happier than those who spend money on material goods. But what if this is only true for some portion of the population (e.g., extroverts)? What if this result doesn’t apply to everyone (as this new research suggests)?

This is the problem with using a simple measure such as an average. Sometimes, it’s not what you see that gets you in trouble, it’s what you don’t see.

When Patterns Break

Another common way in which data can be misleading is when that data is derived from a process that changes over time. Anytime the underlying mechanisms that create a data series change, patterns tend to break down.

Consider Benjamin Graham’s early investment style of purchasing net nets. For the uninitiated, net nets are companies that have a net current asset value greater than their market capitalization. In other words, you can buy the entire company for less than its liquidation value. Think about that. If you sold off all the company’s component parts, you would get back more money than what the company is currently selling for. That’s like buying a suitcase with $100,000 in it for only $50,000. It might seem crazy to think that these kinds of situations ever existed, but they did.
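The net net test can be sketched in a few lines of Python. This assumes the common definition of net current asset value as current assets minus total liabilities. The function names and the balance sheet figures are hypothetical, chosen to mirror the suitcase example.

```python
def net_current_asset_value(current_assets, total_liabilities):
    """Assumed definition: current assets minus total liabilities."""
    return current_assets - total_liabilities

def is_net_net(current_assets, total_liabilities, market_cap):
    """A company is a net net if its NCAV exceeds its market capitalization."""
    ncav = net_current_asset_value(current_assets, total_liabilities)
    return ncav > market_cap

# Hypothetical company: $150,000 in current assets, $50,000 in total
# liabilities, and a market capitalization of $50,000.
ncav = net_current_asset_value(150_000, 50_000)
qualifies = is_net_net(150_000, 50_000, market_cap=50_000)

print(f"NCAV: ${ncav:,}")
print(f"Net net? {qualifies}")
```

Here the hypothetical company has a net current asset value of $100,000 but sells for $50,000, so it passes the test.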

Unfortunately, thanks to a better-informed investor class, net nets are now a rarity. This strategy, which was first introduced by Graham in Security Analysis in 1934, doesn’t have the edge it once had. Not only are there more investors analyzing stocks today, but they have more information too. The end result is that net nets appear only on rare occasions or in specific industries.

Another example of a financial pattern breaking is the beloved price-to-earnings ratio (“P/E ratio”). For nearly a decade, the P/E ratio of U.S. stocks has been above its historical average, causing concern for many investors. However, is this historical average still relevant given how much the composition of U.S. stocks has changed over the last century? Not only have new kinds of companies and new ways of doing business been invented, but modifications in accounting rules have also made P/E less comparable over time. This doesn’t mean that the P/E ratio is useless, but it probably should be taken with a grain of salt.

Regardless of how you feel about net nets or the P/E ratio, both are prime examples of how patterns can break when the underlying processes that created them change over time.

Chains of Ignorance

Lastly, data fails when it’s wrong. Just as we stand on the shoulders of giants when relying on correct ideas, we can be imprisoned by the chains of ignorance when relying on incorrect ones. Consider the case of the infamous Reinhart-Rogoff paper, “Growth in a Time of Debt,” which contained a major Excel error. As stated here:

Reinhart and Rogoff’s work showed average real economic growth slows (a 0.1% decline) when a country’s debt rises to more than 90% of gross domestic product (GDP) – and this 90% figure was employed repeatedly in political arguments over high-profile austerity measures.

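However, this 0.1% decline was inaccurate, in part because an Excel formula hadn’t been dragged down properly and left several countries out of the average. After the formula and other data choices flagged by the paper’s critics were corrected, the 0.1% decrease became a 2.2% increase, substantially changing the paper’s conclusion.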
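To see how a formula that stops a few rows short can flip a result, here is a small Python sketch. The growth rates are invented for illustration and are not the paper’s data. The slice `growth[:3]` stands in for a spreadsheet range like AVERAGE(B1:B3) that should have been AVERAGE(B1:B7).

```python
from statistics import mean

# Made-up growth rates (in percent) for seven hypothetical countries,
# one per spreadsheet row. These are NOT the paper's figures.
growth = [-2.0, 1.0, 0.5, 3.0, 2.5, 4.0, 3.5]

# A formula that stops too early only sees the first three rows.
truncated_avg = mean(growth[:3])

# The intended formula covers every row.
full_avg = mean(growth)

print(f"average over first 3 rows: {truncated_avg:.2f}%")
print(f"average over all 7 rows:   {full_avg:.2f}%")
```

With these invented numbers, the truncated range shows a small decline while the full range shows growth. One cheap check is to compare the number of rows a formula covers with the number of rows in the data.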

This error made headlines because of how influential the paper was. According to Google Scholar, it had been cited by more than 4,500 other academic papers. And many of those academic papers were cited by thousands of other academic papers, and so forth. A long chain of ignorance from a single Excel error.

This is why faulty data can be so destructive, especially when it comes from an influential source. The best way to prevent yourself from contributing to such a chain of ignorance is to remain slightly skeptical of recent findings, no matter who they are from.

The Bottom Line

Anytime you work with data you should consider how it could fail you. Is it missing something? Did the process creating it change over time? Is it just plain wrong?

All of these issues (and more) can affect how you interpret information and make decisions. Trust me, I know. As someone who makes arguments with data often, I’ve seen how it can lead us astray. I’m not immune to this either. I’ve made the mistakes listed above before (and may make them again one day). But the point isn’t to strive for perfection. The point is to be a little bit better about what data we rely on now and in the future.

Happy investing and thank you for reading!

This is post 308. Any code I have related to this post can be found here with the same numbering: https://github.com/nmaggiulli/of-dollars-and-data
