06/07/2018 10:41 AM IST | Updated 06/07/2018 12:20 PM IST

Why This Data Scientist Feels Data Policies Will Shape The Future Of Nations

Data Scientist Vasant Dhar Thinks Data Is The New Oil, But Not The New Water


Vasant Dhar is a data scientist, and a professor at the Stern School of Business and the Center for Data Science at New York University, where his research focuses on artificial intelligence and machine learning. He is also the founder of a machine learning-based hedge fund.

In this interview, he explains why he thinks data policies will determine the future of nation states, why it is crucial to build proper data infrastructure, and why data is the new oil, but not the new water.


You have said that data policies will shape the future of nation states. You distinguish between four such policies: the US model of data monetisation, the EU/Japan model as defined by the newly minted General Data Protection Regulation (GDPR), China's use of data for control, and the India model, where you say data is used for inclusion. Which is the best model?

There isn't a best one. We've seen what happened with the monetisation model: it led to a disaster in terms of people exploiting that platform in ways you had not even envisioned, because there were no restrictions or constraints on that data. The GDPR framework imposes costs because businesses are scared about how to do things. They are like, "We are not even going to offer these services because we might get into trouble."

So you might argue that GDPR is going to introduce friction. The Chinese model – great for innovation, but very risky from the standpoint of misuse and centralised authority.

In India, the problems around data are somewhat different from the US, where there are concerns about people hacking platforms.

In India, the emphasis is on inclusion. The small business guy can't get a loan, and is invisible to the mainstream. He wants his transactions to be visible to the bank, wants his transactions to be visible to the lender for legitimacy.

So anything that facilitates more inclusion will lead to better markets, more efficient markets, better pricing. A guy shouldn't be paying 10% a day to get a loan to put his cart out on the street. Someone should be able to say, "This guy is a pretty good credit risk. He pays his staff at the end of the day, I've seen his transactions." None of that stuff was visible earlier. The fact that it is now visible is a potential data empowerment avenue.

Do we actually have an Indian model? We don't even have a data law.

I'm just calling it the India model. It's probably the least well defined of the four. The reason I call it that is that's where I see it headed. There is so much of a need for inclusion of people into the mainstream, much more so than China. So that will be an important part of how data policy develops in India.

The rest of it will probably be an amalgam of the US and the EU: This is an open democracy, there are going to be concerns about misuse, there are going to be concerns about big brother.

India will probably end up with this inclusion empowerment thing on the one hand, because it's needed and lots of people want it. On the other side, probably the more affluent part of society will say, "Hey, I don't want my data shared." That will probably be driven by what's happening in the US and EU.

Is there a problem with an argument that says poor people don't need privacy as much as rich people?

It is not like they don't need privacy, but they need more 'other stuff', to which privacy takes a back seat. There are other things that are more important to them. Maybe in Maslow's hierarchy, privacy comes a bit higher.

One of the debates around Aadhaar has been the flip-side of inclusion: Exclusion. If the database is paramount, what happens if you are excluded from the database?

I guess the question is — is that a systemic problem or is that a problem of exceptions?

I don't know the answer, but I suspect it is not a systemic problem. I suspect that it is not tens of millions of people who can't get in. I don't know what the numbers are, but I suspect they are small.

If the numbers are small, we should address that problem: why is that happening? Those problems need to be addressed on a case-by-case basis.

Is there a problem with using the Silicon Valley model of "build the plane while you are flying" to roll out essential services?

There are a couple of ways of looking at this. One is that you are always building. As you get data you are constantly improving. So, the process of product improvement never ends.

There is always a risk — you have the problem of Rumsfeldian unknowns. You just don't know what the unknown unknowns are until you actually try it out. Then you can see them, and then you can fix them.

But the alternative to that is paralysis — then you never get off the ground.

We've done a spate of stories on how Indian state governments, like Andhra Pradesh, are inter-linking everything from caste, religion, home addresses, and medical records, to build detailed profiles of citizens.

People are saying, "Hey, this is great, this is a cool idea, we can link this, we can do this, we can do that." That's dangerous, because it does not apply the test of intent.

The test of intent is: Is it reasonable to assume that the person who provided this information would have no problem with this information being used in this way?

Say I buy something online that doesn't require you to ship me anything, and you ask me for my home address. Well, why are you asking me for my address? What's your intent?

And say I gave you my address, and you started using that to guess my sexual orientation or something else. The question the data collector must ask is, "Hey, is this consistent with why this person gave me the data? Would they have no problem with me also inferring their sexual orientation?"

Is it a problem that India still does not have a data privacy law? That anyone can do anything with any form of data?

Yes, it is dangerous because data is a potentially potent form of information that can be used for many things. You don't want to do the wrong thing; you don't want privacy laws that are not well thought out. That's worse than not having a law.

But by now there are sufficient experiences around data use, misuse, and hacks. The internet economy is 20 plus years old now. It isn't new. In this day and age, there should be data privacy laws.

Should government departments enter the data market? Indian Railways is exploring ways of selling user data gathered from its online booking portal, by divesting its shares in IRCTC.

Governments have a different objective function from companies: Companies exist for their shareholders, governments exist for their citizens.

That implies their roles and responsibilities around data should be different. Companies should monetise data. You should expect something very different from the government.

The government should not be in the business of making money from data. They should be in the business of providing better services, and if they do it through data that is well and good, but they should not be in the business of selling that data.

Something about this just rings strange to me: "Well, we'll divest it, and all of this data that was given to us as the custodians is now going to someone, who we don't even know, who is buying it and paying for it."

Everyone now says, "Data is the new oil." Is there a case for saying, "Data is the new water"? That is, that it is a free public good that should not be monetised?

No. Water is a commodity: It is uniform, it is always the same, it always has the same chemical structure.

Different types of data have different kinds of risks associated with them. Someone asked me this question in response to my Hindustan Times editorial, "Shouldn't all government data be public?" My answer is, "No."

You need to assess the risks of making this data public. What if terrorists get it? Do you really want to share your data with everyone? I don't think so. Not all of it. Maybe some of it, where there is no risk. I think there needs to be some consideration of risk in making decisions about data.
