Where does your data come from?

Should businesses reveal the data they use to ‘train’ their AI systems?

Would greater transparency build trust in artificial intelligence? And if so, how would we do it?

AI and advanced analytics enable us to aggregate and analyse an unprecedented volume and variety of data. At their best, these tools can drive more accurate, objective and efficient knowledge discovery and decision-making, and deliver outcomes as diverse as more accurate cancer diagnoses and more efficient supply chains. Since many businesses are now data-centric, there is huge corporate appetite for, and optimism about, their potential.

Despite this, it would be naive to assume that more data is always better, or that analytics inevitably improve our ability to select the right course of action. Assumptions that current limitations will necessarily disappear as we collect and analyse more data are also unreasonable. Problematic or flawed insight-generating techniques can worsen over time, turning ripples into waves.

Today’s ‘data ocean’ – filled by the ubiquitous digital platforms and services that define modern life – reflects society as it exists, good and bad. This means that the biases and usage gaps observed in society and social institutions are inevitably reflected in the datasets describing them [1]. For instance, facial recognition models have proven less adept at identifying non-Caucasian physiognomies, partly because the data on which they are trained is overwhelmingly populated by white subjects. Such tools are not designed with explicit racial bias, but they can nonetheless reflect and aggravate these structural inequalities. When applied at scale, for example to train machine-learning models, these biases and usage gaps can be easily and inadvertently reinforced.

Given these dynamics, understanding the history, strengths, weaknesses and blind spots of datasets and analytics is more important than ever. Thankfully, work in the areas of data and process provenance (record-keeping and chronological tracking), along with fairness- and discrimination-aware data mining, aims to deliver more accurate and socially responsible results.

Richer, safer datasets

Trust in AI is essential for it to achieve widespread market adoption, and transparency is often cited as the key to making AI trustworthy. But what exactly does transparency mean? And how can it be achieved in a mutually satisfactory way? Standardised documentation of large datasets and machine-learning models could be a means to enhance trust and accountability between the public, business and researchers. Different forms of documentation have been proposed, and are already beginning to be used in data science and data-intensive industries. Such documentation helps to ensure that essential information about how datasets or models are collected, cleaned, labelled and trained accompanies them as they are passed between data controllers or processors. In some cases, they are also released to the public.

Such practices have several benefits for society, researchers and business. First, they make it easier to identify bias and assess fairness. Second, they encourage trust in the outputs of analytics by establishing that the datasets are well understood and provide reliable grounds for the type of assessment being made. Third, they improve accuracy by highlighting where models have been used inappropriately, or to answer the wrong sort of question. 

Understanding the provenance of information is a growing priority for data-intensive companies. As trained algorithmic models can increasingly be repurposed or even bought ‘off the shelf’ and deployed in new contexts, the risk of emergent, unobserved biases and errors increases. Models often need to be retrained with local data or modified to account for local requirements and historical biases. For example, models used for predictive policing in the US have performed poorly when redeployed in other countries.

As shown by the emergence of initiatives such as the Partnership on AI (comprising Amazon, Apple, Facebook, Google, DeepMind, Microsoft and IBM), data-centric businesses are increasingly expected to be accountable and transparent as global calls for ‘data ethics’ and ‘codes of ethics for AI’ grow.

By taking seriously these demands for standardised documentation, businesses will become more accountable while also benefiting themselves from higher-quality datasets and increased trustworthiness in the eyes of consumers. In short, accountability and transparency can be a market advantage for proactive businesses, and documenting datasets and models is a low-cost, high-reward step towards achieving it.

The flourishing of AI requires dialogue between business, government and society about what is not only possible but also responsible and reasonable. For that dialogue to be productive, the history and limitations of AI datasets and models need to be more widely understood. In my view, data-intensive industries and public bodies that develop and use AI should urgently support and adopt standardised provenance documentation to ensure a well-informed and mutually beneficial societal dialogue about the future of AI.

‘Data nutrition labels’, data explainer sheets and emerging standards

In recent years there have been calls from the machine-learning community, most prominently from the Fairness, Accountability, and Transparency in Machine Learning (FAT-ML) research network, to establish standardised assessment and documentation for the provenance of training datasets and machine-learning models [2]. These are motivated by the methodological and ethical risks that come with the growing usage, sharing and aggregation of the diverse datasets and models necessary for AI. Currently, no universally accepted standard form of documentation is required, although many major technology companies and research bodies are developing their own.

A project sponsored by the AI Now Institute involving Microsoft Research has proposed publishing information sheets, or ‘datasheets’, to accompany datasets. The sheets are completed manually by the people involved in creating a dataset and describe how the data was collected, cleaned and distributed, along with its legal and ethical dimensions. The datasets are labelled with relevant metadata so that they can be interrogated and so that biases or risks can be identified before or during processing. The datasheets are intended to let potential users determine whether a dataset is appropriate for a given task, and to assess its strengths and limitations.
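To make the idea concrete, a datasheet can be pictured as a small structured record that travels with the dataset. The sketch below is purely illustrative: the `Datasheet` class, its field names and the example values are assumptions made for this article, not the template proposed in ‘Datasheets for Datasets’.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Datasheet:
    """Hypothetical, simplified datasheet; field names are illustrative only."""

    dataset_name: str
    collection_process: str          # how the data was gathered
    cleaning_steps: List[str]        # what was done to the raw data
    distribution: str                # how and to whom the data is released
    legal_and_ethical_notes: str     # consent, licensing, sensitive attributes
    known_limitations: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of fields left empty by whoever filled in the sheet."""
        return [name for name, value in asdict(self).items() if not value]

    def to_json(self) -> str:
        """Serialise the sheet so it can be shipped alongside the dataset."""
        return json.dumps(asdict(self), indent=2)


# Invented example values, for illustration only.
sheet = Datasheet(
    dataset_name="example-faces",
    collection_process="Images gathered from public photo-sharing sites",
    cleaning_steps=["removed exact duplicates"],
    distribution="",
    legal_and_ethical_notes="Consent status of subjects unknown",
)

print(sheet.missing_fields())
print(sheet.to_json())
```

A check such as `missing_fields()` shows one practical benefit of standardising the record: gaps in the documentation become visible before the dataset changes hands.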

A team at MIT’s Media Lab and Harvard’s Berkman Klein Center advocate a second approach – that of the ‘dataset nutrition label’. To avoid the need for labour-intensive manual labelling, their approach proposes a centralised, modular infrastructure harnessing probabilistic computing tools that can both generate information about metadata, provenance and variables in a dataset and pre-emptively provide basic statistical analysis of it. Such a standardised approach to pre-processing can help minimise the time between data acquisition, model development and deployment by providing essential information and technical features necessary for the ‘exploratory’ phase immediately after the data is collected.
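As a rough illustration of what automatically generated summary information might look like, the sketch below computes a few basic statistics per column: missing values, distinct values and the share held by the most common value. It is a minimal example using plain Python, not the Dataset Nutrition Label project's tooling and not a probabilistic computing method; the `summarise` function and the sample records are hypothetical.

```python
from collections import Counter
from statistics import mean


def summarise(records):
    """Build a simple per-column summary from a list of dict records."""
    columns = sorted({key for row in records for key in row})
    label = {}
    for col in columns:
        values = [row.get(col) for row in records]
        present = [v for v in values if v is not None]
        counts = Counter(present)
        top_value, top_count = counts.most_common(1)[0] if counts else (None, 0)
        entry = {
            "missing": len(values) - len(present),
            "distinct": len(counts),
            "top_value": top_value,
            # Share of non-missing rows taken by the most common value.
            "top_share": round(top_count / len(present), 2) if present else 0.0,
        }
        # Report a mean only when every non-missing value is numeric.
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        if numeric and len(numeric) == len(present):
            entry["mean"] = mean(numeric)
        label[col] = entry
    return label


# Invented sample records, for illustration only.
records = [
    {"age": 34, "skin_tone": "lighter"},
    {"age": 29, "skin_tone": "lighter"},
    {"age": 41, "skin_tone": "lighter"},
    {"age": None, "skin_tone": "darker"},
]

label = summarise(records)
for column, stats in label.items():
    print(column, stats)
```

Even a summary this simple surfaces the kind of imbalance discussed earlier: a single category dominating a column is a prompt to ask whether the dataset is representative enough for the intended use.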

Such approaches seek to create a standardised set of information to accompany datasets as they are shared and re-used, focusing particularly on potential biases, gaps, proxies and correlations that could be inadvertently picked up and reinforced by machine-learning systems making use of the data. As a secondary effect, they may also drive better data collection practices and raise the profile of contextual and methodological challenges and biases.

Businesses are rightly excited about the power of data as a source of competitive advantage, product improvement and operational efficiency. But advanced analytics can go wrong in significant ways.

Close engagement with data research communities can help ensure companies place due emphasis on how data is curated and labelled, and maintain a keen awareness of emergent problems along the data chain. In short, businesses that take the time to understand the limitations and history of their datasets and models will be best placed to use them accurately – and responsibly.


1. Kate Crawford, 'The Hidden Biases in Big Data', Harvard Business Review, 2013

2. Timnit Gebru et al., 'Datasheets for Datasets', 2018

By Dr Brent Mittelstadt

University of Oxford

Dr Brent Mittelstadt is a senior research fellow in data ethics at the Oxford Internet Institute, a Turing fellow at the Alan Turing Institute, and a former member of the UK National Statistician's Data Ethics Advisory Committee.