Digital technologies have vastly enhanced our capacities to interact with the world and each other. However, these technologies also have many downsides. We focus here on the issue of maintaining trust in the digital world, as our online ecosystems evolve to include many more actors. We also discuss the importance of this trust during a crisis like COVID.

Cookies, cookies, cookies everywhere! Source: On the trail of personal data, SITRA

Recently, I collaborated with Finnish investment fund SITRA on a research project titled On the trail of personal data. Both SITRA and I believe that digital technologies can vastly enhance our lives: they can increase the convenience of existing services, of course, but also offer truly novel and useful options. However, with this transformative power comes a lot of responsibility, which only increases the importance of transparency, accountability and trust.

To test this hypothesis, we wanted to work on the famously opaque online advertising ecosystem: what are the different actors observing our behavior online? What relationships do they have? How much light can we shed on all those activities? Are those efforts welcomed?

SITRA didn’t want this investigation to be “dry”, with just faked or simulated data. Instead, they were more ambitious and wanted to work with real-life data, real interactions of actual human beings. This makes things substantially harder but is methodologically much more valuable to confidently derive insights. SITRA helped recruit six volunteers: diverse individuals who understood the importance of our efforts, and simply wanted to help. We are grateful for their participation!

The study was designed to follow a dual approach:

With this dual setup, we were looking at data both in transit and at rest. Note that the latter should also include the profiling information derived from the data collected.

Our results, which you can read here, are very humbling.

We found a very large number of third parties observing traffic everywhere on the web, and little transparency on the purposes of this monitoring. Even when we helped our test subjects ask the companies directly — as is their right under European law — they did not receive meaningful responses.

Accessing the New Yorker website leaks data to 56 third parties, with at least 22 contributing to profiling of the user. This is a fairly typical situation.

We know (some of) these companies collect a lot of data: we can either see some traffic going to them, or we can directly read in their privacy policies that they collect interaction or geolocation data (or some proxy like WiFi access points), accompanied only by vague statements as to their goals for doing so. We can also deduce some of this by looking at how apps are constructed, and which Software Development Kits (SDKs) are included. These SDKs are libraries built by profiling companies: they help developers build their apps but simultaneously embed tooling that can spy on the eventual app users.
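To make "seeing traffic going to them" a little more concrete, here is a minimal Python sketch of how one might count the third parties contacted while loading a page, given a list of request URLs captured in transit. It is an illustration only, not the tooling used in the SITRA study: the function names, the example domains and the naive "last two labels" rule for grouping hosts are all assumptions (a careful analysis would rely on the Public Suffix List instead).

```python
from urllib.parse import urlsplit


def site_of(host):
    """Naive grouping of a hostname into a 'site': keep the last two labels.

    Assumption for illustration only; this is wrong for suffixes such as
    'co.uk', where the Public Suffix List would be needed.
    """
    return ".".join(host.lower().split(".")[-2:])


def third_parties(page_url, request_urls):
    """Return the set of sites contacted that differ from the visited site."""
    first_party = site_of(urlsplit(page_url).hostname or "")
    found = set()
    for url in request_urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip malformed or relative URLs
        site = site_of(host)
        if site != first_party:
            found.add(site)
    return found


# Hypothetical captured requests for a visit to one page.
requests_seen = [
    "https://static.news.example/app.js",      # same site: first party
    "https://cdn.tracker.example/pixel.gif",   # third party
    "https://sync.tracker.example/match",      # same third party again
    "https://ads.adnetwork.example/bid",       # another third party
]
parties = third_parties("https://www.news.example/article", requests_seen)
print(len(parties), sorted(parties))
```

Counting contacted sites says nothing yet about what each one does with the data, which is exactly why the privacy policies and the access requests mattered in the study.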

We think this lack of accountability and transparency is a real problem: it erodes trust.

And trust is essential, particularly in strange times like ours.

COVID and public interest data

SARS-CoV-2, that tiniest but mightiest of troublemakers, thrives on uncertainties: a person can be infected or even infectious without knowing their condition (asymptomatics) and even if we know a contact event led to an infection, it is sometimes hard to evaluate who might have infected whom (the order in which symptoms appear is only a mild indication).

It is natural, however, that, faced with all this uncertainty, we would try to collect more data to get us out of our predicament. For instance, we could log contact events to at least know when someone was at risk. It would not directly prevent the infection, but would help us more precisely assess who should be put in quarantine.

People seem to get this! In a very interesting study recently published, researchers have shown that American users were more willing to share data in the context of the pandemic.

Rates at which app users opt out of geolocation via a specific SDK. That SDK is embedded in hundreds of apps. The dashed vertical line is the day of national lockdown in the US, and we can see the opt-out rate went down with the lockdown. The data is segmented by city and (colors) by quantile of proximity exposure, with the latter computed based on geolocation data. Source: “Trading Privacy for the Greater Social Good: How Did America React During COVID-19?”

This data suggests that users displayed so-called “prosocial behavior” in sharing their personal data. However, there are reasons to be suspicious of the narrative. More and more NGOs are critical of what they describe as a COVID-washing strategy: exploiting the pandemic to justify practices that would otherwise not be fully accepted by users. In particular, this scientific study is enabled by privileged access to data, granted by a(n unnamed) company producing an SDK of the kind detected in the SITRA study. This data is otherwise being traded, with users kept in the dark about the details of how the data is used in the commercial space.

At the same time, this data is useful and in the public interest, in the context of the pandemic. In a separate study titled Superspreading k-cores at the center of COVID-19 pandemic persistence, scientists have studied transmission dynamics in a Brazilian city when exiting the confinement period.

By matching SDK-tracked geolocation data with published pseudonymized COVID health data, scientists were able to find that clusters of contacts were at the heart of the persistence of COVID in the Brazilian city of Fortaleza, beyond the confinement period.

Again, this data was collected via an SDK (here, the company is named: Grandata). Grandata seems to collect data from all over the world. The main difference for Europe is that the data is not reshared as easily with others.

Many governments would dream of having this amount of data. For better or worse, they would never be trusted with it though.

So in the context of COVID-19, governments had to focus a lot on defining precisely what data they (thought they) needed, and why.

A fragile trust model

The end result of that exercise in restraint is the Google/Apple Exposure Notification framework (GAEN), a protocol proposed/imposed by Apple and Google to governments, in order to build their contact tracing apps. This is the basis for apps such as SwissCovid (Switzerland), Corona-Warn-App (Germany), Koronavilkku (Finland) or NHS COVID-19 (United Kingdom): Bluetooth chirps are exchanged between phones, and their attenuation is used to infer distance, which enables some form of risk calculation. However, these applications were designed with one threat in mind: overeager governments suddenly accumulating a lot of data about their citizens.
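To give a feel for what "attenuation is used to infer distance, which enables some form of risk calculation" can mean in practice, here is a minimal Python sketch. It is a simplified illustration, not the actual GAEN implementation or any national app's configuration: the class and function names, the two attenuation thresholds, the bucket weights and the 15-minute notification threshold are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Sighting:
    """One observed Bluetooth chirp from another phone (hypothetical model)."""
    tx_power_dbm: int    # transmit power announced by the sender
    rssi_dbm: int        # signal strength measured by the receiver
    duration_min: float  # how long the chirp was observed at this level


# Hypothetical tuning values, NOT taken from any real deployment.
NEAR_DB = 55              # attenuation at or below this counts as "near"
FAR_DB = 70               # attenuation above this counts as "far"
WEIGHTS = (1.0, 0.5, 0.0)  # weight for the near, medium and far buckets


def attenuation_db(s):
    """Attenuation is the signal lost on the way: sent power minus received."""
    return s.tx_power_dbm - s.rssi_dbm


def weighted_exposure_minutes(sightings):
    """Sum exposure durations, weighted by how close each sighting seems."""
    total = 0.0
    for s in sightings:
        a = attenuation_db(s)
        if a <= NEAR_DB:
            weight = WEIGHTS[0]
        elif a <= FAR_DB:
            weight = WEIGHTS[1]
        else:
            weight = WEIGHTS[2]
        total += weight * s.duration_min
    return total


def should_notify(sightings, threshold_min=15.0):
    """Flag an exposure once the weighted duration reaches the threshold."""
    return weighted_exposure_minutes(sightings) >= threshold_min


contact = [
    Sighting(tx_power_dbm=-20, rssi_dbm=-70, duration_min=10),  # 50 dB: near
    Sighting(tx_power_dbm=-20, rssi_dbm=-85, duration_min=10),  # 65 dB: medium
    Sighting(tx_power_dbm=-20, rssi_dbm=-95, duration_min=30),  # 75 dB: far
]
print(weighted_exposure_minutes(contact), should_notify(contact))
```

Note that attenuation is only a rough proxy for distance (bodies, walls and pockets all absorb signal), which is why the sketch uses coarse buckets rather than converting decibels into meters.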

As we saw in the SITRA study though, this trust model is potentially fragile. The advertising ecosystem is full of a large variety of actors whose actions are sometimes invisible and unexpected.

While in most countries the app itself is well built and will not be spying on users, other actors in the ecosystem might. The Google/Apple Exposure Notification framework mostly ignores this threat.

This is the topic of recent research I conducted with Joel Reardon (UCalgary/AppCensus): Proximity Tracing in an Ecosystem of Surveillance Capitalism. We cannot exclude that malevolent actors could conduct attacks on the protocol, leveraging apps and SDKs as vectors to attack governmental apps based on GAEN. In fact, we show that it would be surprisingly easy and cheap, and find specific actors who are in such a position of control that they could start conducting such an attack tomorrow (important note: there is no indication that they intend to do so). Some of those findings are backed by research conducted by the cybersecurity sub-unit of ArmaSuisse, the Swiss Defence Procurement office.

Our finding is really foreshadowed by all the research of the past year: the SITRA study shows us that multiple actors in the mobile ecosystem have access to our phones (via operating system permissions), Joel and I show that the Bluetooth chirp can be collected by them, often in combination with GPS data, and the Brazilian study shows there is tremendous value in combining health and GPS data. In our paper, Joel and I circle back to the potential motivations for such an attack: geopolitical disruption, of course, but also economic gain (a hedge fund trying to do population-level epidemiology, to predict economic disruption).

What would be the consequences of such an attack? I certainly think it would be disastrous in the context of this pandemic. At a time when public trust in governments is eroding (efficacy of COVID measures, motivations around data collection, intent around vaccines), this threat exposes us all to high risks: not just that the national app fails or is compromised against us, but even worse that it would invalidate efforts by our governments to lead and be bold in the digital space.

Deploying software as a government, with goals of safe, large and inclusive usage, is a tremendously difficult endeavor that requires an extremely robust foundation of trust. This trust requires expert confidence in how commercial actors actually operate in the digital space. This confidence is simply not there: as the SITRA research shows, it is extremely difficult to know what is possible and what is being done with our data.

It really shouldn’t be that way.

Mathematician. Co-founder of PersonalData.IO. Free society by bridging ideas. #bigdata and its #ethics, citizen science