Thoughts about the handling of PEPP-PT proximity history IDs

Disclaimer: I am not affiliated to ePP-PT e.V. i.Gr nor am I an expert in the fields of virology or epidemiology. I have spent the larger part of my professional life with thinking about and building end-to-end distributed solutions and recently sold my company to Vodafone. From what I understand so far, I am supporting the objective of ePP-PT e.V. i.Gr and have engaged in a dialog with some of its members. This article appears on my personal blog and is part of a planned, multi-part series. Find me on Twitter.

With the Coronavirus disease forcing millions of people to practice social distancing or going into complete lockdown, experts of many fields started to contemplate solutions to help slowing down the spreading of the virus.

PEPP-PT (short for “Pan-European Privacy-Preserving Proximity Tracing”) aims at providing a complete framework and reference implementation to “[…] interrupt new chains of SARS-CoV-2 transmission rapidly and effectively by informing potentially exposed people“. Naturally, this raises tons of potential privacy concerns, which is why PEPP-PT aims at putting privacy front and center. I am not going to repeat the basic concepts in this article since a lot has already been published elsewhere. It is, however, important to understand that PEPP-PT is about proximity tracing, not contact or location tracking.

So far, PEPP-PT has not released any technical specifications or reference code. My understanding of the suggested implementation is therefore based only on what’s publicly available and what others have publicly stated before.

With reports emerging about an imminent launch of a “Corona App” as early as Easter, I seriously hope, that a thorough specification of all components of such a solution will be made available publicly within the next days. We are discussing a complex solution, with a decentralized, distributed backend and applications for at least two mobile operating systems. In short: You don’t want to rush out a release without giving the community at large appropriate time to review each and every element of it independently.

In this article, I focus on one specific key aspect of the concept: The handling of IDs.

The core concept of PEPP-PT centres around two aspects:

Any user’s device (e.g. smartphone) generates a random, anonymous and unique ID and continuously broadcasts the ID via Bluetooth Low Energy. A new ID is generated in defined intervals, for example every 30 minutes, replacing the previous one. A complete list of all IDs the user has ever broadcasted is stored securely and locally on the user’s device, for the sake of clarity, let’s call it Anonymised Device IDs History (ADIH).
Any user’s device continuously listens for IDs being broadcasted in close proximity and those IDs are stored securely and locally on the user’s device. This establishes a Proximity History (PH) on a participants device. “In close proximity” determines which IDs are getting added to the Proximity History and the boundaries of what can be achieved vie Bluetooth Low Energy and various smartphones are currently being researched as part of the project.

It is immediately obvious, that the handling of these IDs will be crucial for protecting the privacy of all participants against attacks on various levels of the system. Hence, I focus solely on the scenario, where one of the participants is confirmed infected and everyone who was in epidemiologically relevant proximity would ideally rapidly get informed.

Before I discuss my thoughts around some technological challenges this system creates, let’s try to find out more about what is publicly known. Once again, as far as I know, no detailed specification has been made available yet, so please take all of this with a grain of salt.

In a document circulated by PEPP-PT to various parties, the mechanism is described as follows (Version A):

Once a PEPP-PT user A has been deemed SARS-CoV-2 positive, the health authorities will contact them by phone or other means, depending on a country’s implementation of the system. The health authority will convey an activation code to the PEPP-PT user A, with which the encrypted proximity history is transmitted – in encrypted form – to a national trust center in order to trace the contact PEP-CT user A to other PEPP-PT users, e.g. PEPP-PT user B.
[…]
When a proximity history is uploaded to the trusted server, the server can match that with proximity histories uploaded in the past. When a match is found, the probability is calculated that via that proximity event the transmission occurred. Given that, the measurement parameters associated with the event can be used to estimate a risk of exposure using machine learning. This way, the system can become more precise while being deployed.

The envisioned mechanism is described differently on PEPP-PT’s website as of April 5th, 2020 (Version B):

If the user of phone A has been confirmed to be SARS-CoV-2 positive, the health authorities will contact user A and provide a TAN code to the user that ensures potential malware cannot inject incorrect infection information into the PEPP-PT system. The user uses this TAN code to voluntarily provide information to the national trust service that permits the notification of PEPP-PT apps recorded in the proximity history and hence potentially infected. Since this history contains anonymous identifiers, neither person can be aware of the other’s identity.
[…]
If an anonymous ID of phone B is identified as being associated with another country than phone A, information associated with the anonymous ID of phone B is transmitted to the national trust service of the other country. This transmission is fully encrypted and digitally signed. Further processing is done by the national trust service of the country that issued the app.

I assume, that the online version (Version B) is a newer revision. Notably, it is less specific than Version A.

Version A was explicitly stating the upload of the (complete) proximity history and even a matching algorithm executed on the server, e.g. not on the device. This immediately appears to be a bad idea.

In Version B, the wording has changed from uploading the proximity history to the less specific

providing information associated with the anonymous ID of phone B

and the server side matching has been replaced by

permitting the notification of PEPP-PT apps recorded in the proximity history.

Nothing is said about how exactly these two functionalities are supposed to be implemented.

For the remainder of this article let’s assume, that we want to achieve an implementation, where we completely rule out the possibility of any centralised entity being able to match proximity histories with infected users, e.g on a server. We always want those matches to be identified locally, on a user’s device.

In general, there are two main approaches:

The infected user provides a complete list of every ID she has ever broadcasted to the system.
The infected user provides a complete list of her Proximity History to the system.

Note: Experts on my team have correctly pointed out, that instead of creating and storing a long list of completely random IDs, one would very likely implemented a solution, where subsequent IDs are derived from their previous IDs using a device specific key. It would be this key, which gets replaced e.g. every three weeks so that given the root ID of the current epoch plus the associated key, a system could compute the complete list of IDs for the current epoch. (See also DP3T).

I tend to lean towards the first approach, since it seems to be more in line with Data Economy guidelines (“Datensparsamkeit”) and sends no data related to IDs of others than myself into the system. With approach 1, at any time the system will always only contain IDs of devices from confirmed positive users and not any other, e.g. non-infected, participants’ devices. I would also argue, that once I’ve been tested and confirmed positive by the authorities, I have already – to a certain degree – given up anonymity to at least those who were involved with the testing.

But please note: In a follow-up article, I’ll drill down for approach 2, which has some advantages, too. For example, if implemented right, it would be almost impossible for a potential attacker, to check whether any given ID X belongs to a device of a user who got infected. An attacker would at most be able to check whether her own IDs are on a list of any infected person, but it would be impossible to link that list back to a specific user’s device. On second thought, this approach might have more positive aspects over the other, but we will explore and see. The final verdict is still open!

But for now, let us look at the first strategy: Uploading my own ADIH to a central system. Note: In reality, it will very likely be a distributed system, but it will still be much more centralised as opposed to the millions of devices which are clients of the system. Also, we have to assume that at least in some countries, this system will be exposed to governments and organizations with relatively low data security and privacy standards.

The next question which arises is: Should the ADIH be uploaded as is or is there an argument for hashing the individual IDs in that list prior to uploading it to the system?

I am not talking about transport layer security here. I take those measurements for granted. For all of the communication between devices and the system, I implicitly assume state-of-the-art application of well understood security primitives.

I would argue in favour of locally hashing each ID in the ADIH with the TAN provided by the health authorities or even use a more advanced concept like pubkey(pepp-pt, server-id+id).

Why? An important aspect of protecting all participants’ privacy is that at no point anybody including a malicious attacker and except the health authorities (they know the positively infected person anyway), can link IDs to persons or personal data. If the device would upload a non-hashed version of the ADIH, somebody could potentially gain access to the IDs. This risk would be mitigated to a certain degree, if the list would be asymmetrically encrypted with the public key of the health authorities, but it would then potnetially still exist on the distributed servers spread across potentially hundreds of countries and organisations.

For the sake of illustration, let’s make this attack vector a bit more concrete: Somebody (a multinational chain or a corrupt government organisation) could set up Bluetooth devices in highly frequented areas (shops, bars, etc.) and start recording IDs along with e.g. credit card transactions which happen in close proximity (spatial or date/time proximity). They would essentially create a list of IDs mapping to individuals. While this scenario might suggest a rather large portion of criminal intend, other apps could very likely do the same, as long as they run in foreground or find other ways of circumventing mobile operating system level imposed restrictions. What if I create a fitness tracker companion app, which asks users to log in and starts recording IDs?

There very likely is more than one solution to this problem, but for the time being, I’d go for not having the plain IDs of infected user’s devices available on distributed servers.

So far so good. The system is now able to store and maintain a list of hashes of all IDs the infected users has ever broadcasted. Each generated hash can only be reconstructed if somebody has the ID and the corresponding TAN. Over time, the system will have a growing datastore of hashed ADIHs and their corresponding TANs – let’s call this the Infected Hashes Dataset (IHD). (I know, these names sound super ridiculous, please bear with me.)

Unfortunately, this also creates a new problem: On the server(s), this solution now effectively clusters IDs which have been randomly generated in intervals on the device for the sole sake of not allowing the mapping of multiple IDs to a single user. This problem does not arise, if the Proximity History is transferred to the system as opposed to the ADIHs (approach 2).

Next is another super crucial step of the solution. To quote from he PEPP-PT website again:

the notification of PEPP-PT apps recorded in the proximity history

While PEPP-PT is not yet going into more detail as to how exactly users get notified, I assume we can at least agree on the following:

All matching must happen locally, e.g. IDs do not get matched on any of the systems’ servers.
The system does not actually actively send out any targeted notifications. (Don’t think in terms of Push Notifications or Google Cloud Messages).
At no stage, any user is ever required to submit any personal detail to any entity of the system in order to actively query, whether she is infected.

Given these principles, the local app somehow has to regularly check, whether the IDs of its own, local ADIH are part of the global IHD. If there’s a match, this means the device’s user has likely been in epidemiologically relevant proximity of an infected person.

Fulfilling this second requirement, creates a ton of interesting additional challenges.

The goal of this second phase is simple: Finding out whether any of my own IDs is part of the large dataset of IDs coming from infected users.

In specific intervals, for example twice a day, the participating devices would request an update of their locally stored IHD, providing the decentralized system with an individual token (the salt). The decentralized system will use this salt to respond with an IHD but hash the IDs using the salt of the requesting device. This means, while each participant will ultimately end up storing the complete IHD, it will look different on each device. The decentralized system will also reveal the TANs.

A device would then take each of their own IDs, hash it with the first TAN, hash it with the salt using to request the update and check whether the resulting hash matches with any of the received hashes. This guarantees the following: All of the matching happens on the device. Nobody can reconstruct the initial IDs from their copy of the IHD.

If implemented straight forward, it might create a problem in terms of size of the dataset which needs to be distributed to each participating device. I’ve done a rough calculation where

Each hash is 32 Bytes in size.
The ID is rotated in 15 minute intervals and broadcast 24 hours / day.
We ring buffer 3 months of proximity history.
The number of infected who reported back to the system is 500.000.
The IHD would be roughly 140 GB in size.

1 million infected and a 5 minutes interval would lead to ~ 850 GB.

These numbers, of course, are based on a very straight forward implementation, without optimizations (e.g. compressing data, using Keyed HMAC algorithms so that subsequent IDs can be computed instead of being permanently stored etc.). However, I do believe no matter how much transparent optimization one can achieve, we might need additional ingredients. Even if, for example, a Keyed HMAC algorithm is used, where only the key would change every ~ three weeks, devices would have to “unpack” the proximity graph locally, which might impose a problem for more low-powered/low-storage smartphones.

Some ideas which came to mind during a lengthy discussion between me and Tim Buchwaldt, one of our engineering team leads at grandcentrix:

Effective sharding, where hashes will be concatenated with “bucket identifiers”
Use of bitwise trees to reduce the amount of data which needs to be distributed
Merkle trees to facilitate validation of dataset membership without the need to provide the complete dataset

I am planning on diving deeper into these ideas in a follow-up article, likely co-authored by Tim. Probably, PEPP-PT will publish more details about these mission critical aspects of the solution anytime soon to foster the discussion.

I’m also super curious about your feedback to all of this. Since there are a lot of outlets which facilitate political and ethical feedback, I’d love if we keep the discussion here close to the technology.

If you want to get in touch, the easiest way right now is via Twitter.

Thoughts about the handling of PEPP-PT proximity history IDs

In eigener Sache: #LucaApp

Sachsen ermöglicht als erstes Bundesland Corona-Warn-App zur Kontaktnachverfolgung

Das nächste #LucaApp Problem und Brillen ROTTLER