Is there such a thing as Responsible Data Collection?

It’s time the industry adopted the following principles of responsible data collection…

Mr Kaiser Fung from 'Principal Analytics Prep' has come up with 7 Principles of Responsible Data Collection in the face of the Cambridge Analytica disaster. Since his article produces a false 404 error when visited via civil-rights-preserving anonymization tools, I will cite all the relevant bits. Of course, the very article that discusses limits on data collection subjects its readers to no-opt (that's less than opt-out) data collection by Google, comScore, Quantcast, Disqus etc.

First-person not second- or third-person permission

When you create a new Facebook account, you are asked if you’d like to upload a contact list. If you choose not to, Facebook will still have lots of suggested friends for you. How does Facebook know who you know? One source of data is your friends. If your friends agree to upload their contact lists to Facebook, and your name or email or phone number happens to be on those lists, then by a reverse lookup, Facebook knows who your friends are. Such predictions are highly accurate. By uploading their contact lists, your friends have shared your private data without asking your permission – worse, they have given Facebook permission by proxy to take your private data and profit from it. Permission by proxy is dishonest, and should be banned.
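To make the reverse lookup concrete, here is a minimal sketch of how it could work. This is an illustration under stated assumptions, not Facebook's actual implementation: the function name `infer_contacts`, the data layout and the example addresses are all hypothetical.

```python
from collections import defaultdict


def infer_contacts(uploaded_lists):
    """Build a reverse index from uploaded address books.

    uploaded_lists maps an uploader's identifier to the set of
    identifiers found in their address book (hypothetical layout).
    Returns a dict: person -> set of uploaders who listed that person.
    """
    reverse = defaultdict(set)
    for uploader, contacts in uploaded_lists.items():
        for contact in contacts:
            if contact != uploader:
                reverse[contact].add(uploader)
    return dict(reverse)


# Carol never uploaded anything and never gave permission.
uploads = {
    "alice@example.org": {"carol@example.org", "bob@example.org"},
    "bob@example.org": {"carol@example.org"},
}
inferred = infer_contacts(uploads)
print(sorted(inferred["carol@example.org"]))
```

Carol appears in the result with both Alice and Bob as likely acquaintances, although she shared nothing herself. That is the permission by proxy the quote describes.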

I put point 2 first since it's the one we agree upon. For too long it has been considered okay for citizens to trade in data about their social neighborhood, friends and peers. Even the GDPR does not impede social data treason.

Opt Ins not Opt Outs

Currently, for most websites and mobile apps, the default is maximum data collection. Users wanting privacy then figure out how to limit the amount or type of data collected about them. This is an example of opt-out. The default should instead be opt-in: no data collection unless instructed by users. When the default setting is opt-in, businesses have to win over the users’ trust, and so they will have a much stronger incentive to clarify and explain the benefits of the data collection. Say goodbye to the days of hand-waving claims, coercion and trickery.

As we noticed before, people are willing to opt in on data that actually belongs to their friends and peers. If we indeed manage to make that illegal, we still face the problem of a strong imbalance of knowledge between the organization gathering the data and the user, who is supposed to predict which apparently harmless information can later be used against them. Just answering questions on your favorite foods and sports? Didn't expect that data to end up at your health insurance company? In the case of Cambridge Analytica those people thought they were helping a university research project, and then the data ended up being put to the worst possible use: demolishing democracy. So I dare to question the entire notion that people are able to discern which data collection is good for them, let alone how much it is able to tell about their peers.

Oh, another problem: What if companies simply claim they were given permission? How will you prove them wrong? Facebook has been caught doing just that:

Daily Mail, UK, 2018-03-26: How Facebook logs ALL your phone calls and texts - but the social media giant insists the function has always been ‘opt-in only’

Stop mis-direction

I’d like to see strong regulation with heavy penalties for businesses that request permission from users for specific uses of their data but then fail to police their data analysts to curb abuses. For example, many websites collect mobile numbers from users, saying that the two-factor authentication is essential to protect their accounts. Once the phone numbers are stored in the database, there is no telling which data analysts will get a hold of the data. Most data analysts will utilize whatever data they can get their hands on. To prevent mis-direction of data, companies should have a data governance function.

Sounds like a better-than-nothing measure. It describes a symptom of the deeper malady of digital data: you can't trace abuse because all evidence is just data… some log files at best. That doesn't hold up in court, so justice doesn't happen and anyone who doesn't abuse data is at a strategic disadvantage to those who do. The market then deals with what's left of ethics. In a globalized world where competition shapes companies much more than laws or ethics, this is a losing game, and allowing companies to have any such data in the first place is problematic.

Does that mean we shouldn't use digital technology at all? No! Read on, the solution is at the end!

Sunshine Policy

It is technically feasible for Facebook or other companies to keep a log of which third parties have received what data about you from Facebook. If these companies believe that the trading of private data is fundamental to their business models, then they should allow users to inspect how they collected the data, and which entities received the data. Better yet, users should be given the ability to opt out of specific transactions. For example, if Facebook has a deal to sell data to Pfizer, users should have the right to say no, you should not give our data to Pfizer.
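As a rough sketch of what such a log with per-recipient refusal could look like: the class and method names below (`DisclosureLog`, `refuse`, `share`, `report`) are hypothetical and chosen only for illustration, and the recipient "AdCo" is invented. Nothing here reflects how any real company stores this.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Disclosure:
    user: str
    recipient: str
    fields: tuple
    day: date


class DisclosureLog:
    """Hypothetical record of which third party received which data."""

    def __init__(self):
        self._entries = []
        self._refused = set()  # (user, recipient) pairs the user said no to

    def refuse(self, user, recipient):
        self._refused.add((user, recipient))

    def share(self, user, recipient, fields, day):
        # Block the transfer if the user opted out of this recipient.
        if (user, recipient) in self._refused:
            return False
        self._entries.append(Disclosure(user, recipient, tuple(fields), day))
        return True

    def report(self, user):
        # What the user would be allowed to inspect.
        return [e for e in self._entries if e.user == user]


log = DisclosureLog()
log.share("u1", "AdCo", ["email"], date(2018, 3, 1))
log.refuse("u1", "Pfizer")
blocked = not log.share("u1", "Pfizer", ["email", "age"], date(2018, 3, 2))
```

Note that the sketch only works if the company honestly routes every transfer through `share`. The log itself is just data kept by the party being audited.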

So either Facebook has to overwhelm users with opt-in/opt-out choices, or it makes use of the fact that abuse is nearly impossible to prove and simply continues doing things behind users' backs. In the current market situation where companies are in worldwide competition, the ones that disregard laws will win. It's mathematical. The solution is to make laws they cannot disregard. Read on.

Wall off the data

If companies are willing to wall off user data, and not send them to third parties, then users are more likely to share the data.

Again naive… how can they afford not to sell data on a market where everybody does? And how would you know, when it is in their interest to hide the fact and there is no physical evidence?

The right to be forgotten

Europe is ahead of the U.S. on this issue. Companies should be required to delete user data older than say five years. Aggregate statistics older than five years should be allowed. More recent data supersede the older data, so there is negligible value in keeping the old data anyway.

This also builds on the wishful thinking that companies truly delete data, ever. Why throw away money if they can take the old data home and sell it on the darknet? How do you expect to catch anyone on the black data market for as long as Bitcoin is legal and the object offered for sale needs no postal delivery?

Stop the blackmail

One reason for the pervasive data sleaze is the favorite business model of web and mobile companies – free service to all, paid for by advertisers. Users are then barred from using the service unless they sign off on extensive snooping. Sometimes, their signatures are not even required; the websites just claim that usage is taken to imply consent. This policy is about taking the cake and eating it too. The website operators don’t really want to ban any user so as to inflate their user counts (“eyeballs”). This practice creates the perception of dishonesty, and is self-defeating, if the companies actually believe that the data collection benefits their users. If the business model is such that users get free service in exchange for their private data, then they should enforce strict access policies, only serving those who acknowledge the data collection.

Same problem of naivety as the previous proposals. In a worldwide competition, how realistic is it to expect cooperation on this front? If some companies comply, won't they quickly be superseded by competitors who don't?


I had hoped this article would offer an alternative to dismantling big data monopolists, but in my view it doesn't. The collection of big data creates material that is highly valuable, highly inviting for abuse and impossible to protect.

The kind of big data that makes sense to share with corporations is the sort of data that has nothing to do with individuals or groups of human beings. Frequently it makes sense to make it open data then.

For social data, instead, the solution should be to build distributed systems that do not put personal data in the hands of strangers. At all.

Here's how.


Last Change: 2019-03-22

CC-BY-SA, carlo von lynX