Mo data, mo problems? What we shouldn't do

Rule-makers must ensure access to data is balanced between business and society

When too much information can be a bad thing

In 2017, three economists – Oded Netzer and Alain Lemaire, both of Columbia, and Michal Herzenstein of the University of Delaware – looked for ways to predict the likelihood that a borrower would pay back a loan¹. The scholars utilised data from Prosper, a peer-to-peer lending site. On it, potential borrowers write a brief description of why they need a loan and why they are likely to make good on it, and potential lenders decide whether to provide them the money. Overall, about 13 per cent of borrowers defaulted².

It turns out the language that potential borrowers use is a strong predictor of their probability of paying back. And it is an important indicator even if you control for other relevant information lenders were able to obtain about those potential borrowers, including credit ratings and income.

Listed below are 10 phrases the researchers found that are commonly used when applying for a loan. Five of them positively correlate with paying back the loan. Five of them negatively correlate with paying back the loan. In other words, five tend to be used by people you can trust, five by people you cannot. See if you can guess which are which.

Lower interest rate
Will pay
Minimum payment
Thank you
God
Promise
Debt-free
After-tax
Graduate
Hospital

You might think – or at least hope – that a polite, openly religious person who gives his word would be among the most likely to pay back a loan. But in fact this is not the case. This type of person, the data shows, is less likely than average to make good on his debt.

Here are the phrases grouped by the likelihood of paying back.

Terms used in loan applications by people most likely to pay back

Debt-free
Lower interest rate
After-tax
Minimum payment
Graduate

Terms used in loan applications by people most likely to default

God
Promise
Will pay
Thank you
Hospital

Before we discuss the ethical implications of this study, let’s think through, with the help of the study’s authors, what it reveals about people. What should we make of the words in the different categories?

First, let’s consider the language that suggests someone is more likely to make their loan payments. Phrases such as 'lower interest rate' or 'after-tax' indicate a certain level of financial sophistication on the borrower’s part, so it’s perhaps not surprising they correlate with someone more likely to pay their loan back. In addition, a borrower who talks about positive achievements, such as being a college graduate or being debt-free, is also likely to pay back the loan.

Now let’s consider language that suggests someone is unlikely to pay their loans. Generally, if someone tells you he will pay you back, he will not pay you back. The more assertive the promise, the more likely he will break it. If someone writes ‘I promise I will pay back, so help me God,’ he is among the least likely to pay you back. Appealing to your mercy – explaining that he needs the money because he has a relative in the hospital – also means he is unlikely to pay you back.

In fact, mentioning any family member – a husband, wife, son, daughter, mother, or father – is a sign someone will not be paying back. Another word that indicates default is ‘explain’, meaning if people are trying to explain why they are going to be able to pay back a loan, they likely won’t.

The authors did not have a theory for why thanking people is evidence of likely default. In sum, according to these researchers, giving a detailed plan of how he can make his payments and mentioning commitments he has kept in the past are evidence someone will pay back a loan. Making promises and appealing to your mercy are clear signs someone will go into default.

Regardless of the reasons – or what it tells us about human nature that making promises is a sure sign someone will, in actuality, not do something – the scholars found the text was an extremely valuable piece of information in predicting default. Someone who mentions God was 2.2 times more likely to default. This was among the strongest single indicators that someone would not pay back.
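To put the 2.2 multiplier in perspective, it can be combined with the 13 per cent overall default rate quoted earlier. This is only a rough sketch. It assumes the overall rate can stand in for the baseline the multiplier was measured against, which the article does not state, and the function name is invented for illustration.

```python
def default_rate_with_signal(baseline_rate: float, relative_risk: float) -> float:
    """Default rate implied for borrowers who show a given signal.

    Assumption: the multiplier applies directly to the baseline rate.
    The result is capped at 1.0 because a rate cannot exceed 100 per cent.
    """
    return min(baseline_rate * relative_risk, 1.0)


baseline = 0.13       # overall default rate quoted in the article
relative_risk = 2.2   # multiplier reported for applications that mention God

implied = default_rate_with_signal(baseline, relative_risk)
print(f"Implied default rate: {implied:.1%}")
```

Under that assumption, roughly 29 in 100 such borrowers would default, which also means that about 71 in 100 would pay back in full.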

But the authors also believe their study raises ethical questions. While this was just an academic study, some companies do report that they utilise online data in approving loans. Is this acceptable? Do we want to live in a world in which businesses use the words we write to predict whether we will pay back a loan? It is, at a minimum, creepy – and, quite possibly, scary.

A consumer looking for a loan in the near future might have to worry about not merely her financial history but also her online activity. And she may be judged on factors that seem absurd – whether she uses the phrase ‘Thank you’ or invokes God, for example.

Further, what about a woman who legitimately needs to help her sister in a hospital and will most certainly pay back her loan afterwards? It seems awful to punish her because, on average, people claiming to need help for medical bills have often been proven to be lying. A world functioning this way starts to look awfully dystopian.

This is the ethical question: do corporations have the right to judge our fitness for their services based on abstract but statistically predictive criteria not directly related to those services?

Leaving behind the world of finance, let’s look at the larger implications for, say, hiring practices. Employers are increasingly scouring social media when considering job candidates. That may not raise ethical questions if they’re looking for evidence of bad-mouthing previous employers or revealing previous employers’ secrets. There may even be some justification for refusing to hire someone whose Facebook or Instagram posts suggest excessive alcohol use. But what if they find a seemingly harmless indicator that correlates with something they care about?

Researchers at the University of Cambridge and Microsoft gave 58,000 US Facebook users a variety of tests about their personality and intelligence.

They found that Facebook likes are frequently correlated with IQ, extraversion, and conscientiousness³. For example, people who like Mozart, thunderstorms and curly fries on Facebook tend to have higher IQs. People who like Harley-Davidson motorcycles, the country music group Lady Antebellum, or the page ‘I Love Being a Mom’ tend to have lower IQs. Some of these correlations may be due to the curse of dimensionality. If you test enough things, some will randomly correlate. But some interests may legitimately correlate with IQ.
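The point that testing enough things will turn up random correlations can be seen in a small simulation. The sketch below is not the researchers' method. It invents 200 people with random IQ scores and 1,000 pages that each person likes by coin flip, so no page has any real link to IQ, and then counts how many pages still clear a correlation threshold. The sample sizes and the threshold are arbitrary assumptions.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


random.seed(0)      # fixed seed so the run is repeatable
N_PEOPLE = 200      # hypothetical sample size
N_PAGES = 1000      # hypothetical number of pages tested
THRESHOLD = 0.14    # roughly two standard errors for 200 people

# IQ scores drawn at random, independent of everything else.
iq_scores = [random.gauss(100, 15) for _ in range(N_PEOPLE)]

spurious = 0
for _ in range(N_PAGES):
    # Each person likes this page by coin flip, unrelated to IQ.
    likes = [random.randint(0, 1) for _ in range(N_PEOPLE)]
    if abs(pearson(likes, iq_scores)) > THRESHOLD:
        spurious += 1

print(f"{spurious} of {N_PAGES} unrelated pages look 'predictive' of IQ")
```

With a threshold of about two standard errors, around one page in twenty should clear the bar by chance alone, even though none of them carries any information about IQ.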

Nonetheless, it would seem unfair if a smart person who happens to like Harleys couldn’t get a job commensurate with his skills because he was, without realising it, signalling low intelligence.

In fairness, this is not an entirely new problem. People have long been judged by factors not directly related to job performance – the firmness of their handshakes or the neatness of their dress. But a danger of the data revolution is that, as more of our life is quantified, these proxy judgements can get more esoteric yet more intrusive. Better prediction can lead to subtler and more nefarious discrimination.

Better data can also lead to another form of discrimination, what economists call price discrimination. Businesses are often trying to figure out what price they should charge for goods or services. Ideally they want to charge customers the maximum they are willing to pay. This way, they will extract the maximum possible profit.

Most businesses end up picking one price that everyone pays. But sometimes they are aware that the members of a certain group will, on average, pay more. This is why movie theatres charge more to middle-aged customers – at the height of their earning power – than to students or senior citizens, and why airlines often charge more to last-minute purchasers. They price discriminate.
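A toy calculation shows why a business would bother. The numbers below are entirely hypothetical: 100 students willing to pay at most 6 for a ticket and 100 middle-aged customers willing to pay at most 10. The sketch compares the best single price with the best price per group.

```python
def best_single_price(willingness):
    """Return (price, revenue) for the revenue-maximising single price.

    `willingness` lists the most each customer would pay. Only prices equal
    to someone's maximum need checking, since any price in between sells
    the same number of tickets for less money.
    """
    def revenue(price):
        return price * sum(1 for w in willingness if w >= price)

    best = max(sorted(set(willingness)), key=revenue)
    return best, revenue(best)


# Hypothetical customers: the most each would pay for a ticket.
students = [6] * 100
middle_aged = [10] * 100

# One price for everyone.
uniform_price, uniform_revenue = best_single_price(students + middle_aged)

# A separate price for each group.
_, student_revenue = best_single_price(students)
_, middle_aged_revenue = best_single_price(middle_aged)
segmented_revenue = student_revenue + middle_aged_revenue

print(f"One price ({uniform_price}): revenue {uniform_revenue}")
print(f"Price per group: revenue {segmented_revenue}")
```

In this made-up case the single best price is 6, which leaves money on the table from the customers who would have paid 10. Charging each group its own price raises revenue from 1,200 to 1,600.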

Big Data may allow businesses to get substantially better at learning what customers are willing to pay – and thus gouging certain groups of people. Optimal Decisions Group was a pioneer in using data science to predict how much consumers are willing to pay for insurance. How did they do it? They found prior customers most similar to those currently looking to buy insurance – and saw how high a premium they were willing to take on. In other words, they ran a doppelganger search.
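The idea of finding the most similar prior customers can be sketched as a nearest-neighbour lookup. This is a minimal illustration, not the company's actual method. The customer features, the premiums and the function name are all hypothetical, and the features are assumed to be already scaled to a 0 to 1 range.

```python
import math


def doppelganger_estimate(prospect, past_customers, k=2):
    """Average the premiums accepted by the k most similar past customers.

    `prospect` is a tuple of scaled features. `past_customers` is a list of
    (features, accepted_premium) pairs. Similarity is plain Euclidean
    distance, which is an assumption made for this sketch.
    """
    ranked = sorted(past_customers, key=lambda c: math.dist(prospect, c[0]))
    nearest = ranked[:k]
    return sum(premium for _, premium in nearest) / len(nearest)


# Hypothetical past customers: (scaled age, scaled income) and premium accepted.
past = [
    ((0.20, 0.30), 400.0),
    ((0.25, 0.35), 420.0),
    ((0.50, 0.50), 600.0),
    ((0.80, 0.90), 900.0),
    ((0.85, 0.80), 950.0),
]

new_customer = (0.22, 0.32)
estimate = doppelganger_estimate(new_customer, past, k=2)
print(f"Estimated premium this customer would accept: {estimate:.2f}")
```

The two closest matches here accepted premiums of 400 and 420, so the estimate for the new customer is 410. A seller with richer data could run the same lookup over far more features.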

A doppelganger search is entertaining if it helps us predict whether a baseball player will return to his former greatness. A doppelganger search is great if it helps us cure someone’s disease. But if a doppelganger search helps a corporation extract every last penny from you? That’s not so cool. My spendthrift brother would have a right to complain if he got charged more online than tightwad me.

We have a right to fear that better and better use of online data will give insurance companies, lenders and other corporate entities too much power over us.

On the other hand, Big Data has also been enabling consumers to score some blows against businesses that overcharge them or deliver shoddy products.

One important weapon is sites, such as Yelp, that publish reviews of restaurants and other services. A study by economist Michael Luca, of Harvard, has shown the extent to which businesses are at the mercy of Yelp reviews⁴. Comparing those reviews to sales data in the state of Washington, he found that one fewer star on Yelp will make a restaurant’s revenues drop 5 to 9 per cent.

Consumers are also aided in their struggles with business by comparison shopping sites. As discussed in Freakonomics, when an internet site began reporting the prices different companies were charging for term life insurance, these prices fell dramatically. If an insurance company was overcharging, customers would know it and use someone else. The total savings to consumers? $1bn per year.

Data on the internet, in other words, can tell businesses which customers to avoid and which they can exploit. It can also tell customers the businesses they should avoid and who is trying to exploit them. Big Data to date has helped both sides in the struggle between consumers and corporations. We have to make sure it remains a fair fight.

This is an edited extract from Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us about Who We Really Are by Seth Stephens-Davidowitz.


1 Oded Netzer, Alain Lemaire and Michal Herzenstein, ‘When Words Sweat: Identifying Signals for Loan Default in the Text of Loan Applications,’ 2016.

2 Peter Renton, ‘Another Analysis of Default Rates at Lending Club and Prosper,’ 25 October 2012.

3 Michal Kosinski, David Stillwell and Thore Graepel, ‘Private Traits and Attributes Are Predictable from Digital Records of Human Behavior,’ PNAS 110, no.15 (2013).

4 Michael Luca, ‘Reviews, Reputation, and Revenue: The Case of Yelp,’ unpublished manuscript, 2011.