A Ten Minute Read
I recently found myself in a debate about, of all things, data. Specifically, “Likert”-style data, and how frequently it is used (poorly) to draw broad conclusions about The Truth of the Matter – a classic example of using math to make things look scientifically rigorous. Statistical abuses like this are what cause people to quote Disraeli (“There are three kinds of lies…”).
We were discussing appropriate measures to drive customer-oriented cultures. The current belle of the ball is “Net Promoter Score”, a metric that “proves” people really like their iPhones and Nordstrom, but not their cable companies. (Millions of dollars spent reaching that conclusion.)
My client was attempting to make significant strategic decisions – impacting marketing, finance, and sales – based on their analysis of NPS and other surveys.
So far, so good.
Trouble was, the client was making mathematically-derived inferences: applying percentages to capital outlays, comparing these “scores” across product lines, and, even worse, benchmarking themselves against their competition.
On the face of it, this feels like “good analysis” – so why isn’t it?
Likert Data
Most everyone has seen a Likert-style survey instrument; these are the questionnaires that ask respondents to answer based on a ranked continuum, such as how strongly they agree or their level of satisfaction.
Before looking at the math involved, there are two things not to like about these surveys:
No Opinion, No Help
“Neutral” is useless. The point of a survey is usually to discover how people feel about something; generally, people aren’t “neutral” about anything. Thanks to a central-tendency bias, though, more people will pick “neutral” than actually feel that way. Good surveys force respondents to make a choice by taking away that “middle” neutral option. (There are some people who disagree with this. Even some who Strongly Disagree. They’re wrong.)
Committee-Vetted Verbiage Obfuscates Obvious Connotations
In other words, we can’t be certain that what’s being answered matches what’s being asked. For instance, a hotel survey once asked me to rate, from “Strongly Disagree” to “Strongly Agree”, a statement along the lines of “My arrival was everything I had hoped it would be.”
There are many truthful ways to answer that. If it was not everything I had hoped it would be, and yet my arrival matched expectations, I could honestly say I “Disagree” – it was not everything I hoped.
(If there were six things I had hoped would be part of my arrival, but I only experienced five … Does that mean I Strongly Disagree that it was “everything I had hoped”? Or do I need to miss two of six experiences for it to be “Strong”? And where I might Disagree because it wasn’t “everything I had hoped”, someone else receiving the exact same experience could “Strongly Agree” – either because they had hoped for less, or because they interpreted the extremes differently.)
Surveys often feature absolutes, e.g., “I always stay with Brand X”. I most frequently stay with Brand X, but I don’t always stay with them. So do I “Disagree” with the “always” bit, or “Agree” to indicate my preference?
I recently received a survey that asked: “How well do you understand SharePoint?” and the answers ranged from “Highly Dissatisfied” to “Highly Satisfied”. I can’t even begin to fathom how I would answer that. (However satisfied I am with SharePoint, my response gives no clarity to how well I understand it! Are they asking how satisfied I am with my own understanding?)
One particularly egregious example of bafflingly bad survey questions is in the “StrengthsFinder” instrument that Gallup – who should know better – charges a lot of money for. I digress a bit on that pool of dog’s breakfast in the epilogue below.
Perhaps I’m being obtuse: Good survey design is difficult. Now that so many online tools make it easy to create and distribute surveys, we ignore that too often (at our peril). If there’s room to interpret the question, the room to interpret the results grows geometrically; soon, even though we have all this “data”, we realize it doesn’t actually mean anything.
But that’s OK – we still have room to analyze it incorrectly.
On to the math!
WARNING: Math Ahead
We generally assign numbers to responses on Likert data. Because we learned in second grade that “math” is what we do to numbers, we try to use “math” on Likert responses – which means we average these numbers and hope it tells a story. But they’re made-up numbers, used to keep track of responses: They don’t mean anything. They’re just a ranking. We could use letters (e.g., “A” through “E”) or roman numerals or vegetable emojis.
The numbers assigned to the answers do not represent real values.
Likert data is “ordinal” data: It ORDERS things. First is better than second. “1” is better than “2” (unless “10” is the best). Ordinal data – these arbitrary numbers – offers no opinion on the size of the gaps.
“Strongly Agreeing” is better than “Agreeing” which is better than “Disagreeing”. But by how much? Is the distance between “Strongly Agree” and “Agree” the same for everyone? (Hint: No.) Is the distance between “Strongly Agree” and “Agree” the same as the distance between “Neutral” and “Strongly Disagree”? Who knows?
When we treat ordinal data like “interval” data, things get messy. First, we have to presume the gaps are the same, and quantifiably so (e.g., the distance between “Strongly Agree” and “Strongly Disagree” is exactly four times the distance between “Strongly Agree” and “Agree”).
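To see why that presumption matters, here’s a quick sketch in Python (the responses and both codings are made up): two groups answer the same question, and two equally order-preserving numberings disagree about which group “scores” higher.

```python
# Made-up responses from two groups, and two numberings that both
# respect the order of the answers (coding_b just stretches the top end).
group_x = ["Strongly Agree", "Strongly Agree", "Strongly Disagree", "Strongly Disagree"]
group_y = ["Agree", "Agree", "Neutral", "Neutral"]

coding_a = {"Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
            "Agree": 4, "Strongly Agree": 5}
coding_b = {"Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
            "Agree": 4, "Strongly Agree": 10}

def mean(responses, coding):
    return sum(coding[r] for r in responses) / len(responses)

print(mean(group_x, coding_a), mean(group_y, coding_a))  # 3.0 3.5 -> Y "wins"
print(mean(group_x, coding_b), mean(group_y, coding_b))  # 5.5 3.5 -> X "wins"
```

Neither numbering is more “correct” than the other; the comparison flips anyway.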
Basic measurement theory tells us that ordinal data cannot be treated as if it were interval data. Full stop.
Look at what can happen when we try.
In the simplest scenario, let’s assume we survey ten people. Five of them vehemently love our product (“Strongly Agree”, a “5”), the other five violently loathe it (“Strongly Disagree”, a “1”).
That yields a “Mean” score of “3”, or “Neutral”.
If we used the Mean, we would interpret this data to represent that no one really cares about our product. This would be very dangerous!
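Here’s that scenario as a quick sanity check (made-up scores, using the usual 1-to-5 coding):

```python
from collections import Counter

scores = [5] * 5 + [1] * 5        # five "Strongly Agree", five "Strongly Disagree"
print(sum(scores) / len(scores))  # 3.0 -> "Neutral" -> "no one cares"
print(Counter(scores))            # Counter({5: 5, 1: 5}) -> two warring camps
```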
Either way, the Marketing team has a lot of work to do – but whether they spend their effort learning what their acolytes appreciate and determining what their apostates detest, or figuring out how to make the masses care, is very different work.
What’s a Picture Worth?
Simply put, do not try to use math to tell the story. If you need detailed statistics to explain your results, you used the wrong tool.
Use the results in graphical form to highlight differences of opinion.
Here are three different data sets – each with the same “average”.
These were lifted from leadership alignment reviews with individual clients. Even though each had the same weighted mean score, each clearly had different needs. (Because, after all, we’re averaging fake numbers.)
By calling it out graphically, we can easily see where there is variability/lack of alignment, and how extreme people may or may not feel about the issues.
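If you want to roll your own version of that picture, a minimal sketch follows (matplotlib, with hypothetical counts standing in for the client data; every panel “averages” to exactly 3.0):

```python
import matplotlib.pyplot as plt

labels = ["Str.\nDisagree", "Disagree", "Neutral", "Agree", "Str.\nAgree"]
datasets = {                      # hypothetical response counts per option
    "Everyone is lukewarm": [0, 0, 10, 0, 0],
    "Love it or loathe it": [5, 0, 0, 0, 5],
    "All over the map":     [2, 2, 2, 2, 2],
}

fig, axes = plt.subplots(1, 3, sharey=True, figsize=(10, 3))
for ax, (title, counts) in zip(axes, datasets.items()):
    ax.bar(labels, counts)        # same "average" (3.0), very different story
    ax.set_title(title)
plt.tight_layout()
plt.show()
```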
Because of the lack of interval information, you may want to consider combining the “positives” with each other and the “negatives” with each other.
Here is how we addressed that with “Bipolar” data – which, like Likert scales, appears to be math-like, but in fact is not:
Note the lack of a “Neutral” option.
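In code, that collapsing step looks something like this (hypothetical questions and counts; if you are stuck with a “Neutral” option, those answers land in neither bucket):

```python
questions = {  # hypothetical counts: [Str. Disagree, Disagree, Neutral, Agree, Str. Agree]
    "We communicate priorities clearly": [4, 6, 2, 5, 3],
    "Decisions get made quickly":        [1, 2, 1, 9, 7],
}

for question, counts in questions.items():
    total = sum(counts)
    negative = counts[0] + counts[1]   # both flavours of "disagree"
    positive = counts[3] + counts[4]   # both flavours of "agree"
    print(f"{question}: {100 * positive / total:.0f}% positive, "
          f"{100 * negative / total:.0f}% negative")
```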
Net Promoter Score fans argue that because the instrument aggregates answers – treating “9” and “10” essentially as one answer (“Promoters”) and “0” through “6” as one (“Detractors”) – all sins should be forgiven. Strictly following this method is, at best, meh (again, the statistical “difference” between an 8 and a 9, or a 6 and a 7, is hard to determine). There is an accepted “correct” NPS calculation, yet people still feel the need to play with Excel: if you start seeing “mean” scores, though, you know you’re either being hoodwinked, or listening to someone short on clues.
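For reference, the accepted calculation is simple enough that nobody needs to play with Excel. A sketch with made-up ratings:

```python
ratings = [10, 9, 9, 8, 7, 7, 6, 5, 3, 0]   # made-up 0-10 responses

promoters  = sum(r >= 9 for r in ratings)   # 9s and 10s
detractors = sum(r <= 6 for r in ratings)   # 0 through 6; passives (7-8) count for neither
nps = 100 * (promoters - detractors) / len(ratings)
print(nps)  # -10.0: counting buckets and taking a difference, no means anywhere
```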
As in all things, though, NPS is just a number on a dashboard – an indicator of how well the pursuit of customer satisfaction is going. Pursuing the metric itself – any metric – is a mistake.
My college roommate’s high school sweetheart (take a moment, I’ll wait) liked to call out people who were “victims of their own education”; she would have a field day with those who like to apply statistical techniques that don’t belong.
Analyzing data is supposed to help us understand our world. Once we start using data in ways it’s not intended, we move into a fictional world. Forget Big Data, it may as well be Fake Data.
At that point, even though we feel better and/or smarter about our rigorous (but completely useless) analysis, we make it impossible for a Man of Action to take the information and do anything with it.
EPILOGUE: The Absolute HorseShit™ That is StrengthsFinder™
The Gallup Organization has made bank selling inspirational plaques based on its “StrengthsFinder™” survey. They market this as an accurate assessment of an individual’s, well, strengths, and how the collective fits together on a team.
The first thing to know: it is complete horseshit. It is a fundamentally unsound instrument.
When people receive assessments – whether it’s StrengthsFinder, Myers-Briggs (less controversial, but still only superficially “accurate”), or “Are You a Rory or a Lorelei?” – they tend to feel they have been accurately described. This is called the “Barnum” effect (named after the “sucker-born-every-minute” guy), or the “acceptance phenomenon”.
Real research backs this up: in essence, people believe what they want to believe, and are blind to what they don’t want to see. So when a survey comes back and says, “You are a disciplined Man of Action who likes to get things done,” well, who’s going to argue with that?
The StrengthsFinder survey features dozens of bipolar questions treated as Likert data. Assessees answer on a scale between the descriptor on the left (“Strongly Describes Me”) and the descriptor on the right (“Strongly Describes Me”), with a Neutral in the middle.
I have had clients where the “top achievers” consistently deliver negative results, but they were clearly very sick organizations. Conversely, are there people who consistently deliver positive results and are not top achievers? That would be a rough place to work.
Aren’t these almost exactly the same thing? Isn’t making a deadline saying you would do something by a certain time… and then doing what you said you would do? Is StrengthsFinder™ trying to imply that “making deadlines” is somehow cutting corners, a cheat, like rolling through Stop signs in South Philly?
The implication, I guess, is that “thinkers” do it on their own, and “readers” piggyback off others… The difference between “experience” and “high falutin’ book learnin’”. Apparently, it’s not possible to figure out how something works by reading about it. Note the lopsided wording – bookworms love what they do; Hard Knox graduates only like to figure things out.
Just remember this when you’re hiring a babysitter; you can’t have both.
What about a guy like me, who trusts that others will cower as I claw my way to the top?
Remember this was written before US Senators challenged Betsy DeVos to Muay Thai kickboxing. I can’t even imagine what was in the passed bottle when this question was written: Are there seriously people who are both “passionate” about enriching the souls of their fellow man and constantly on the lookout for ways to beat the crap out of that same fellow man?
Check your shoes, StrengthsFinder.
Because NASCAR pit crews, neurosurgery teams, and the 2016 New England Patriots all just wing it as they go. Look, the morons that wrote this personality “test” clearly believe – and their output demonstrates – that working as a team equals sloppy, insipid, and spiritually suffocating work… But some of us like work done well, and believe that our peers and colleagues do, too.
If the people who wrote StrengthsFinder were ever invited to a social function – and, you know, showed up and pulled themselves away from the platter of still-slightly-frozen-in-the-middle Trader Joe’s spanakopita – they may discover that being a “good listener” is almost exactly half of what’s required to be a “good conversationalist”.
Apparently, Iron Eyes Cody wrote StrengthsFinder.