Embracing open data is now more important than ever (open data note 2 of 2)

United Kingdom

Imagine a world where any individual for any purpose can freely use, freely modify and freely share non-personal data. This is the concept of ‘open data’ and the world sought by the Open Knowledge Foundation (the “OKF”). In this article we build on the primer of the law laid out in our first article by considering the concept and appropriateness of ‘open data’ and the alternative ways to safeguard data. We conclude the article by discussing possible regulatory intervention, including Europe’s proposed Data Act 2021, and suggesting that meaningful policy intervention to facilitate open data may be best achieved through contractual arrangements with access controls and repealing the database rights granted under Europe’s database directive (which we discussed in our first ‘open data notes’ article).

What is open data?

The OKF created the ‘Open Definition’, which sets out in detail the broad parameters of ‘openness’ in order to grant an extremely wide remit to data use. According to the OKF, ‘openness’ is work:

  • in the public domain or licensed irrevocably for free use, redistribution (including sale), modification (including creating derivatives), separation (and use, distribution and modification of that separate part of the work), compilation (i.e. distribution alongside other distinct works), non-discrimination, propagation and application to any purpose. Licences with conditions on attribution, copyright notices and requirements that distributions of the work remain under the same or similar licence (i.e. ‘share alike’ provisions) are permitted;

  • provided as a whole and at no more than a reasonable one-time reproduction cost;

  • provided in a machine-readable format where the work’s individual elements can be easily accessed and modified; and

  • provided in an open format with no use-restrictions and capable of being fully processed with at least one free/libre/open-source software tool.
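The four limbs of the definition above can be read as a simple checklist. The following is a minimal sketch only: the class and field names are hypothetical labels chosen for illustration and are not taken from the OKF's text, and each limb is reduced to a yes/no answer that in practice would require legal and technical judgement.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class WorkProfile:
    """Hypothetical yes/no answers to the four limbs summarised above."""
    public_domain_or_open_licence: bool
    whole_work_at_reasonable_one_time_cost: bool
    machine_readable: bool
    open_format: bool


def unmet_criteria(work: WorkProfile) -> list[str]:
    """Return the names of the limbs that the work does not satisfy."""
    return [name for name, satisfied in asdict(work).items() if not satisfied]


def is_open(work: WorkProfile) -> bool:
    """A work is treated as 'open' here only if every limb is satisfied."""
    return not unmet_criteria(work)


# Example: an openly licensed report published only as a scanned image
scanned_report = WorkProfile(True, True, False, True)
print(is_open(scanned_report), unmet_criteria(scanned_report))
```

On this reading, a work that fails any single limb (for example, one that is not machine-readable) would not count as open, however permissive its licence.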

Open data benefits

The intrinsic value of open data, proponents argue, is that it promotes innovation and progress, ultimately advancing society through job creation, enhanced efficiencies and economic stimulation. The results from a 2017 European Commission study suggested that the value of the EU data economy (which includes the data market, i.e. where products/services derived from raw data are exchanged, and its resulting economic impacts) including the UK would more than double from almost €300 billion in 2016 to €739 billion under a high growth forecast by 2020 or, excluding the UK, would increase from €238 billion in 2016 to more than €572 billion in 2020. The latest report, published on 6 July 2020, revises these figures down to value the 2020 EU data economy at €355 billion excluding the UK or €443 billion including the UK, suggesting that the UK data economy alone is valued at €88 billion for 2020. Growth in the EU data economy is predicted to continue in a high growth scenario to €827 billion excluding the UK by 2025 (or €1.036 trillion including the UK).

This high growth scenario is predicated on accelerated EU public and private investments in artificial intelligence, advanced robotics, automation and new skills as well as investments in the digital economy and consumer willingness to spend.  These vast forecasts emphasise the correlation between open data and economic growth with the high growth scenario described as a society with ‘a high level of data innovation, low data power concentration, an open and transparent data governance model with high data sharing, and a wide distribution of the benefits of data innovation’.

The vast sums are not limited to the EU. According to the European Commission’s report, the US data market (i.e. just the exchange of products/services derived from raw data, not including the resulting impacts on the economy) was around 2.5 times larger than the EU data market in 2019 (€185 billion in the US compared with €72.3 billion in the EU). A calculation by the Economist last year (involving extrapolating Canada’s estimated data value) suggests that the value of all the data in the US could be between $1.4 trillion and $2 trillion.
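The arithmetic behind the figures quoted above can be checked with a short sketch. The inputs are the numbers cited in this article (in billions of euros). The "implied UK" values are derived here by simple subtraction and are not figures taken from the reports themselves, other than the €88 billion for 2020 already noted above. The variable and function names are illustrative.

```python
# Figures quoted in the article, in billions of euros
EU_2020_EXCL_UK = 355
EU_2020_INCL_UK = 443
EU_2025_EXCL_UK = 827   # high growth scenario
EU_2025_INCL_UK = 1036  # high growth scenario (EUR 1.036 trillion)
US_MARKET_2019 = 185
EU_MARKET_2019 = 72.3


def implied_uk_component(incl_uk: float, excl_uk: float) -> float:
    """Difference between the 'including UK' and 'excluding UK' values."""
    return incl_uk - excl_uk


uk_2020 = implied_uk_component(EU_2020_INCL_UK, EU_2020_EXCL_UK)
uk_2025 = implied_uk_component(EU_2025_INCL_UK, EU_2025_EXCL_UK)
us_to_eu_ratio = US_MARKET_2019 / EU_MARKET_2019

print(f"Implied UK data economy, 2020: EUR {uk_2020} billion")
print(f"Implied UK data economy, 2025 (high growth): EUR {uk_2025} billion")
print(f"US data market relative to EU data market, 2019: {us_to_eu_ratio:.2f}x")
```

The subtraction reproduces the €88 billion figure for 2020, and the ratio of the two 2019 data market values is consistent with the "around 2.5 times" comparison.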

In addition to the clear economic benefits, there are health, science, education, operational and organisational values in open data.  In a joint report by the IDC and Lisbon Council, data held by private companies was considered to ‘offer radical new insight on aspects of life that were never measured before, and could drive an unprecedented understanding of human behaviour as well as natural phenomena’.  It can help map the spread of diseases and pandemics, achieve scientific advancements more efficiently, predict and prevent illnesses including mental health issues (such as eating disorders and suicide) from internet and social media usage, track energy consumption, forecast consumer spending and transport demands, as well as many more possibilities.

‘Digital twins’ of sports players, aircraft engines and manufacturing plants can be created to mirror in real time the real-life versions.  These digital replicas rely on data (such as weather, usage, nutrition, activity and history) to predict faults, diagnose issues, identify physical injuries, detect nutritional deficiencies and target maintenance. If data is openly available, more factors could be inputted to improve the accuracy of the virtual simulation or the recommended adjustments to make to avoid issues, leading to efficiencies.

Reusing data held by private companies is particularly advantageous compared with self-collection, saving organisations considerable sums and administrative burdens. For public bodies funded by taxpayers, this is an important political consideration. It is equally important for private businesses looking to maximise profits for their shareholders. Reusing data increases the quantity and quality of available data whilst decreasing the bias risks associated with self-collected data.

The cross-fertilisation of data across industries allows new value to be unlocked in old data. For instance, a joint report on ‘How the Power of Data Will Drive the EU Economy’ gives the example of data from mobile telecommunication operators being used by these businesses for internal purposes, by EU statistical offices for official statistics on mobility and demography, and by the health industry to control and predict disease outbreaks.

The fallacy of data as the new oil

In 2006, Clive Humby (Tesco’s Clubcard architect) declared that ‘data is the new oil’, ‘it’s valuable, but if unrefined it cannot really be used’.  In May 2017 the Economist wrote that ‘a new commodity spawns a lucrative, fast-growing industry, prompting antitrust regulators to step in … A century ago, the resource in question was oil’.  Now the resource they refer to is data, which they describe as ‘the oil of the digital era’.  That same year, the Economist also wrote that ‘the world’s most valuable resource is no longer oil, but data’.  Similarly, Wired wrote in an article entitled ‘Data Is the New Oil of the Digital Economy’ that ‘data in the 21st Century is like Oil in the 18th Century: an immensely, untapped valuable asset’.  However, the Huffington Post also postulated that year that data as the new oil is an obsolete analogy; instead it likened data to water as water gives life and is abundant, purified, distributed, democratic, fresh and human.

The prevalence and necessity of data as the fuel to power so much of our lives from search engine results to wearable technology, the weather, mobile maps and TV viewing suggestions, makes the comparison with oil seem understandable and inherently natural.  However, equating oil with data is fraught with problems. For instance:

  • oil can only be consumed once by one individual for one purpose whilst the same data can be used and reused multiple times by different individuals.  In economics, data is considered non-rivalrous;

  • oil is also homogeneous whilst data is heterogeneous.  The value of data varies dramatically depending on its attributes, how much it is used and the creativity employed when using it – as the OKF’s founder proclaimed: ‘the coolest thing with your data will be done by someone else’.  Often data’s true value is difficult to ascertain until it is combined with other data or an algorithm goes live;

  • obtaining, verifying and presenting data within databases may require a substantial investment (see our first article), but data gathering, processing and sharing is not necessarily costly; whereas the process of extracting, refining and transporting oil is undoubtedly expensive and time-consuming.  Data is also becoming increasingly accessible with computing advancements;

  • oil cannot squeeze down fibre optic cabling within a system of ducts at speeds of 44.2 terabits per second like data can;

  • whilst data may have a ‘use by date’, it can be endlessly reused and repurposed; oil, on the other hand, is a finite single-use resource.  Data, for example, can be used to reveal individual traits in an athlete for their coach to focus on and improve their performance but can also generate deeper insights into potential exercise benefits for wider society, the limits of the human body and nutritional requirements; and

  • raw data can be any information or characteristics capable of being processed by computers (i.e. text, characters, symbols, quantities, photos, facts, figures, time) but oil in its raw form is simply always crude oil.

These innate differences between data and oil help to explain why in 2013 the Financial Times published a valuation of personal data at under a penny apiece. Crude oil is always useful when refined but value is not guaranteed for raw data that is refined at a particular time, by a particular person and for a particular purpose. This is because data is not typically fungible. In fact, equating data with oil is damaging to societal progress. Just like oil tycoons with their oil reserves, data-rich companies (such as the largest technology companies) enjoy a great source of power from their data. Oil is scarce and can be stockpiled and retain its value, whilst data is plentiful and there is no societal benefit or value in it being stashed away. As explained in our first article, the temporal, ephemeral nature of much data means that it has little value at all if it is hidden away beyond its use by date. To benefit society and reduce the power imbalance created by data-rich companies, making the abundance of data available to all should be actively encouraged.

A better comparison for data, as aptly made by Bernard Marr in Forbes, would be with renewable energy sources, such as solar, tidal and wind. These are plentiful sources that would benefit all if they are made accessible to all. Moreover, last year, the Economist acknowledged that whilst data is in some ways a natural resource (like oil), it must also be used as widely as possible to maximise wealth creation, and that this tension must be recognised, as it is with intellectual property. It likened data to sunlight or other renewable resources, as raw data is not a tradeable product.

Renewable energy sources are highly accessible, with most forms available almost everywhere to everybody.  It would be strange to erect wind shelters or block sunlight with blinds yet simultaneously charge for wind or solar energy.  To achieve the same degree of accessibility with data involves embracing open data.

Open data examples

Many countries have embraced open data initiatives, including the UK, the US, Russia, India, Italy, Brazil, Ghana, Australia, Moldova, Chile, the Philippines and others. The EU Open Data Portal provides free access to EU datasets that can be freely reused (commercial or otherwise). There are also city-wide open data initiatives (such as in London, San Francisco and Buenos Aires) and sector-specific open data initiatives (such as the World Bank Climate Change Data for the environment, OpenStreetMap for geography, OpenPlans for transport and the Global Water Database for water).

Police.uk is an example of an application built using open data.  It provides street-level UK crime and police force data with mapping functionality and crime categorisation. Other examples are DataViva, which uses over 100 million interactive visualisations to make government data on the Brazilian economy freely accessible for all, and Save the Rain, which uses maps to allow users to estimate how much rainwater they could save each year to help mitigate the predicted worldwide reduction in annual rainfall.

Open data barriers

Despite the numerous benefits of open data, the road to achieving open data is not straightforward. Scaling-up using cloud computing infrastructure is necessary to handle large datasets effectively and only a few suppliers offer products with such capabilities. These products could carry costs and risks of vendor lock-in with interoperability and portability limitations. Cyber security and analytical computing skills are also pertinent limitations, particularly as the former is often subject to prominent news coverage and negative publicity.

This lack of suitable IT infrastructure is compounded by the financial costs associated with handling personal data using appropriate technical and organisational measures in line with strict privacy legislation, such as the General Data Protection Regulation (the “GDPR”) in the EU or the California Consumer Privacy Act. This limitation is generally relevant whenever personal data is processed, including the collection, organisation and sharing phases, and navigating conflicting privacy laws across jurisdictions creates further uncertainty for companies.  If personal data must be anonymised or pseudonymised, costs may further increase.  Companies may also be deterred by the high penalties where personal data is processed in breach of the GDPR, which can be as high as €20 million or 4% of annual global turnover, whichever is higher.
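To illustrate the scale of that exposure, the upper-tier GDPR fine ceiling can be sketched as the greater of the fixed amount and the turnover-based amount. This is a simplified illustration only: the function and constant names are hypothetical, the turnover figures in the example are invented, and the actual fine in any case is set by the supervisory authority having regard to the circumstances, with this calculation giving only the maximum.

```python
GDPR_FIXED_CAP_EUR = 20_000_000   # fixed limb of the upper-tier ceiling
GDPR_TURNOVER_RATE_PERCENT = 4    # percentage of total worldwide annual turnover


def max_gdpr_fine(annual_global_turnover_eur: float) -> float:
    """Upper-tier ceiling: the higher of EUR 20 million and 4% of turnover."""
    turnover_based_cap = annual_global_turnover_eur * GDPR_TURNOVER_RATE_PERCENT / 100
    return max(GDPR_FIXED_CAP_EUR, turnover_based_cap)


# Invented turnover figures, for illustration only
for turnover in (100_000_000, 500_000_000, 1_000_000_000):
    print(f"Turnover EUR {turnover:,}: ceiling EUR {max_gdpr_fine(turnover):,.0f}")
```

On these assumptions, the fixed €20 million limb sets the ceiling for any business with annual global turnover below €500 million, and the 4% limb takes over above that level.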

Financial implications are also materially relevant where companies that have invested significant sums in data collection are reluctant to simply give their data away for free.  Moreover, companies may fear liability or costs associated with inadvertently disclosing commercially sensitive or confidential information, affecting the competitiveness of their business. As organisations are not afforded foresight of who accesses their data when it is truly open, they may be dissuaded by the notion of their competitors prying into their information.  Moreover, there is an enduring perceived public relations benefit in only sharing data for certain campaigns or for targeted reasons (such as for public health campaigns).

If companies opt to make data open, they must prepare the data by cleaning it, disaggregating it, merging it with other datasets and making it readily accessible in a common, interoperable standard.  The lack of a preferred or universal method to share data makes this a more cumbersome process.  A 2012 report by the UK’s National Audit Office estimated that staff costs required to make certain standard public sector transparency disclosures of pre-existing data amount to between £53,000 and £500,000.  This is a significant up-front investment required to make data public.

If privately-held data is made open, it often suffers from data coverage bias stemming from an unrepresentative or narrow sample used during the data collection.  Time, skills and costs are required to ensure that the data validly represents the particular market, and a degree of trust is required in organisations sharing and using the data.  Similarly, time- and resource-intensive work is required when manipulating data to extract its value, particularly where raw data is provided in an unusual or incompatible format (i.e. contrary to the above definition of openness). Technicians with appropriate skillsets and knowledge must be deployed for the data’s value to be appreciated.

Open data could also be inhibited by data sovereignty, the concept that data is subject to the laws and governance of the country in which it is collected.  In Germany, for instance, public bodies must protect personal data with regard to access, admissibility and traceability through third parties. The German government also requires its citizens’ data to be stored locally in Germany.  Data localisation goes further than data sovereignty by requiring that the initial collection, processing and storage of a citizen’s data occur within the country’s boundaries. Various countries have data localisation laws including Australia (health records), China (personal business and financial data), India (payments system data) and South Korea (geospatial and map data). Russia requires that collection of its citizens’ personal data must occur in databases located in Russia and, recently, Russia introduced financial penalties for non-compliance with its data localisation laws (previously the sanction was to block service access for non-compliance, which Russia did to LinkedIn in 2016 when it refused to bring its servers into Russia).  Stringent and nationalistic outlooks on data sovereignty and localisation, as opposed to a policy which advocates the free flow of data across borders, run counter to the open data movement.  They pose difficulties for the cloud computing providers who wish to process the data in certain locations for technical and cost reasons.

Given the vast up-front and on-going costs, sensitivities regarding commercial information and increasingly nationalistic outlooks regarding data sovereignty and localisation, the decision for private businesses to make their data open is a strategic and significant one.  However, the typical reservations about open data are often unfounded, with many companies lacking awareness of the actual potential benefits (including financial) and long-term opportunities of open data (as discussed above). In any event, an increased prevalence and adoption of open data through industry co-operation would mean it faces fewer technical barriers (such as differing data standards and formats).

How to achieve open data

To incentivise open data there could be:

  • regulatory intervention by governments and legislators (as discussed further below), which could include (provided adequate safeguards are in place) immunity from data privacy infringements;

  • voluntary initiatives led by companies to champion the open data movement, such as Microsoft’s commitment to make its social impact initiatives ‘open by default’ beginning with sharing broadband access data to help accelerate broadband connectivity developments;

  • collaborative initiatives, such as ‘hackathons’ or the Data Pop Alliance. The latter was developed by Harvard University with the aim of using data and AI to diagnose issues and further social good in developing countries, and has received engagement from mobile telecommunications company Vodafone;

  • financial or tax incentives for private companies that make data open, particularly considering the potential economic and societal benefits (as discussed above) of open data;

  • uniform standards set for data quality, format and accessibility with organisations encouraged to capture more information in a machine-readable electronic format; and

  • increased governmental investments in skills and technologies to facilitate data processing.

Achieving open data through regulatory intervention

To overcome any barriers effectively and incentivise open data, there must be a fair balance between allowing data collectors to benefit financially from their information and permitting society to use data as it wishes (this balance between granting time-limited monopolies for rightsholders and facilitating societal benefits is discussed in the context of intellectual property rights in our first article).

In terms of governmental regulatory intervention to encourage this, a step in the right direction towards open data would be for Europe to repeal the database rights discussed in our first article.  These protections, conceived in 1992, do not provide a fair trade to society and were not intended to apply to the broader data economy. If they do apply, then, as noted in the European Commission’s 2018 evaluation of Directive 96/9/EC (the “Database Directive”), any meaningful policy intervention would need to be substantial.  The Database Directive overprotects data for too long a period.

Directive (EU) 2019/1024, which must be implemented domestically across Europe by July 2021, consolidates EU directives from 2003 and 2013 whilst updating the framework for reusing public sector information by broadening the scope of public entities who must comply; generally preventing public entities from charging higher than marginal costs for data reuse; introducing rules on high-value datasets (such as statistics or geospatial data) to allow them to be freely available across Europe; limiting data lock-in due to contractual arrangements between public entities and private bodies; requiring open access policies to be developed for publicly funded research data; and harmonising reuse obligations for data accessible via central repositories.  Whilst this regulatory intervention may open more public sector data, it does little to encourage open data amongst private companies.

Europe’s proposal for policy intervention was published in February 2020 as the possible Data Act 2021.  This data strategy, according to its inception impact assessment (28 May 2021), proposes that the EU should ensure an “open, but assertive approach towards international data flows” and aims to create a single data market to facilitate data access, which should allow more to benefit from “big data” and machine learning and fairly allocate value within the data economy.

This EU data strategy should be considered alongside the EU’s Data Governance Act (“DGA”) and the Digital Markets Act (“DMA”) proposals. The DGA proposal intends to encourage data-sharing across the EU through, for example, facilitating the re-use of certain public sector data and the altruistic sharing of data by organisations (e.g. for “scientific research purposes or improving public services”). The DMA proposal meanwhile intends to promote competition and prevent the abuse of market dominance by large companies considered the “gatekeepers” in digital markets (such as those controlling social networking platforms or search engines) through, amongst other obligations, requiring these gatekeepers to ensure that business users can access the data generated when using the gatekeeper’s platform (e.g. data portability). The DMA proposal provides the Commission with powers to conduct market investigations and issue fines for non-compliance of up to 10% of the company’s total worldwide annual turnover.

The Data Act 2021 is at the public consultation stage until 3 September 2021 and, according to the initial proposal, it could:

  • foster business to government data sharing for the public interest;

  • support voluntary data sharing between businesses by addressing usage rights for co-generated data, identifying and addressing existing barriers to data sharing and clarifying rules and legal liabilities for data use;

  • make data sharing compulsory under certain specific, appropriate, fair, transparent, reasonable and non-discriminatory circumstances, such as where an unavoidable market failure in a specific sector is identified or foreseen requiring data sharing to alleviate it. Even in these circumstances, the legitimate interests of the data rightsholder would be considered; and/or

  • evaluate the intellectual property framework to enhance data access (including by revising the Database Directive – with the inception impact assessment indicating that the Database Directive “could be amended so that it supports the objectives of this [Data Act 2021] initiative”).

Whilst the limited information currently available on the Data Act 2021 does suggest that it is a meaningful policy intervention, we consider that a better intervention to encourage open data whilst safeguarding the interests of data rightsholders would be through a combination of contractual arrangements, access controls, implied licensing and repealing database rights under the Database Directive.

An alternative future for open data

Encouraging open data, in our view, could be better achieved via contracts.  Contractual arrangements between the original data collector or creator and the first recipient or a central data repository could govern the initial disclosure and safeguard the data collector’s interests. Technology access controls could regulate or restrict access to certain data, particularly where commercially sensitive or personal data is involved.  Such access controls could define who uses the data and how, by requiring the first recipient to identify themselves and agree to certain limited restrictions on use, such as not to further disseminate sensitive or personal information.  A nominal one-off fee could also be catered for if desired (this could, for instance, cover any reasonable costs and remuneration).  In line with OKF’s definition of openness, conditions on attribution and copyright notices could be provided for in these contractual terms. To further the open data movement and facilitate reuse, significant fees and/or use restrictions should be discouraged.

A 2018 study by the European Commission found that 83.3% of certain respondents agreed or strongly agreed that relying on contractual arrangements, rather than the database rights under the Database Directive, provided them with more certainty. The European Commission also found contractual protection to be the most prevalent form of protection, with 72.1% of respondents relying on it, often because of the flexibility of contractual freedom. This combination of certainty and consistency is crucial to efficient business.

An access tiering system could be adopted for the contractual arrangements, such as that outlined by the UK Data Service (which is indirectly funded by the UK Government through the Economic and Social Research Council).  Data without personal or sensitive information could be open, whilst particularly sensitive data would only be available to specifically trained and accredited individuals. Different contractual arrangements would govern any data falling in between these categories.
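The three tiers described above could be sketched as follows. This is an illustrative model only and not the UK Data Service's actual rules: the tier labels, function names and parameters are assumptions made for the example, as is the assumption that the most restricted tier requires both accreditation and acceptance of contractual terms.

```python
from enum import Enum


class AccessTier(Enum):
    OPEN = "open"                # no personal or sensitive information
    SAFEGUARDED = "safeguarded"  # assumed label for the in-between category
    CONTROLLED = "controlled"    # particularly sensitive data


def assign_tier(has_personal_or_sensitive: bool, particularly_sensitive: bool) -> AccessTier:
    """Place a dataset in a tier according to the sensitivity of its contents."""
    if particularly_sensitive:
        return AccessTier.CONTROLLED
    if has_personal_or_sensitive:
        return AccessTier.SAFEGUARDED
    return AccessTier.OPEN


def may_access(tier: AccessTier, *, agreed_terms: bool = False, accredited: bool = False) -> bool:
    """Decide whether a recipient meets the (assumed) conditions for a tier."""
    if tier is AccessTier.OPEN:
        return True
    if tier is AccessTier.SAFEGUARDED:
        return agreed_terms  # contractual terms govern the middle tier
    return agreed_terms and accredited  # trained and accredited individuals only


tier = assign_tier(has_personal_or_sensitive=True, particularly_sensitive=False)
print(tier.value, may_access(tier, agreed_terms=True))
```

The design point is that the contractual and technical conditions scale with the sensitivity of the data, so that the bulk of non-sensitive data carries no access conditions at all.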

Where there is no express contract, implied licences or implied contractual terms can be used to ascertain the intention of the relevant parties, allowing data to be fairly used by the recipient.  In these circumstances, it would be fair for the recipient to ensure the validity of the data and carry the risk if it breaches confidentiality.  Many standard-form licences already exist, such as the Creative Commons licences, written for rightsholders granting licences to their databases.

With the devastating impact of Covid-19 clear, it is more important than ever to achieve open data.  Open data would allow for Covid-19 and other pandemics to be foreseen, predicted, tracked and averted.  The impact and behaviours of pandemics can be better understood and mitigated more quickly and more effectively.  On medRxiv and bioRxiv alone (two free repositories for scientific articles not yet peer-reviewed, with the BMJ and Yale University supporting the former) there were over 7,000 articles relating to Covid-19 produced and uploaded in recent months.  Scientists immediately have access to extensive and considered data across a range of Covid-19-related issues, from prevention to treatment to flattening the curve, allowing them to scrutinise science and accelerate their understanding.  Yet this is not enough.  A group of scientists have written in the journal Science to ‘strongly urge all scientists modelling … the COVID-19 pandemic and its consequences for health and society to rapidly and openly publish their code … so that it is accessible to all scientists around the world’.  These scientists see the open exchange of data as ‘a hallmark of science’ and now as ‘more important than ever’.

The long-term societal benefits and enormous economic potential of embracing open data clearly outweigh the concerns and the monopolistic desires of data collectors.  An open data society, crucial to a competitive economy and evolving society, should be actively encouraged and facilitated.  Contractual arrangements with technology access controls provide sufficient safeguards for data collectors, making any additional or concurrent intellectual property or database rights unnecessary. With legitimate data access rights, individuals can do anything with information, benefiting from the value they can derive from it using their creativity, whilst the data collector benefits from the initial disclosure and from witnessing the cool things done with their data.