Many users rely on cloud-based machine learning and data collection for everything from tagging photos of friends online to remembering shopping preferences. Although this can be useful and convenient, it can also be a user privacy disaster. With new machine learning features in its latest phone and desktop operating system releases, Apple is exploring ways to provide these kinds of services and collect related user data with more regard for privacy. Two of these features—on-device facial recognition and differential privacy—deserve a closer look from a privacy perspective. While we applaud these steps, it's hard to know how effective they are without more information from Apple about their implementation and methods.
Facial recognition and machine learning
Let’s start with the new object and facial recognition feature for the Photos app. The machine learning processing necessary for an app like Photos to recognize faces in pictures is usually run in the cloud, exposing identifiable user data to security threats. Instead, Apple has bucked this industry trend and opted to develop a system that runs in the background entirely on your phone, tablet, or laptop, without you having to upload your photos to the cloud. Keeping user data on the device like this—rather than sending it off to Apple's servers or other third parties—is often better for user privacy and security.
The choice to run machine learning models like facial recognition on a device rather than in the cloud involves some trade-offs. When deployed this way, Apple loses speed, power, and instant access to mountains of user data for its facial recognition machine learning model. On the other hand, users gain something much more important: privacy and control over their information. Running these services on the device rather than in the cloud gives users a higher degree of privacy, especially in terms of law enforcement access to their data.
While the cloud is often the default for large-scale data processing, Apple has shown that it doesn't have to be. With these trade-offs in mind, Apple has rightly recognized that privacy is too great a price to pay when working with data as sensitive and identifiable as users' private photos. Running a machine learning model on the device is not a privacy guarantee—but at the very least, it’s a valuable effort to offer technically sophisticated facial recognition functionality to users without requiring all of them to hand over their photos.
Differential privacy
The second noteworthy feature of Apple’s latest release is a technique called differential privacy. In general, differential privacy is a process for making large datasets both as accurate and as anonymous as possible. It’s important to note that Apple is not the first large-scale data operation to take on differential privacy: Microsoft researchers pioneered the field, Google employs anonymized data collection algorithms, and the Census Bureau has released a differentially private dataset. Collectively, these initiatives show the way forward for other parts of the tech industry: when user data needs to be collected, there are often cleverer, safer, more privacy-respecting ways to do it.
In this case, Apple is trying to ensure that queries on its database of user data don’t leak too much information about any individuals. The best way to do that is to not have a database full of private information—which is where differential privacy comes in. Differential privacy helps companies like Apple learn as much as possible about their users in general without revealing identifiable information about any individual user in particular. Differentially private datasets and analysis can, for example, answer questions about what kinds of people like certain products, what topic is most popular in a news cycle, or how an application tends to break.
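To make that idea concrete, here is a minimal sketch—not Apple's implementation, with all names and numbers chosen purely for illustration—of randomized response, one of the oldest differentially private techniques. Each user's answer to a sensitive yes/no question is randomly perturbed before it is reported, so any single response is deniable, yet the overall rate can still be estimated accurately from a large number of reports:

```python
import random

def randomized_response(truth):
    """Report a sensitive yes/no answer with plausible deniability.

    With probability 1/2 report the truth; otherwise report a fair coin flip.
    Any single report could be noise, so it reveals little about one person.
    """
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimate_yes_rate(reports):
    """Recover the approximate true rate of 'yes' answers from noisy reports.

    The expected reported rate is 0.5 * true_rate + 0.25, so invert that.
    """
    reported = sum(reports) / len(reports)
    return 2 * reported - 0.5

# Simulate 100,000 users, 30% of whom would truthfully answer "yes".
truths = [random.random() < 0.30 for _ in range(100_000)]
reports = [randomized_response(t) for t in truths]
print(f"estimated yes rate: {estimate_yes_rate(reports):.3f}")  # close to 0.30
```

No individual's true answer can be confidently inferred from their report, but the aggregate statistic comes out close to the real value—which is the trade Apple is aiming for at much larger scale.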
Apple has released few details about its specific approach to differential privacy. It has publicly mentioned statistics and computer science methods like hashing (deterministically transforming data into a short, fixed-length value that obscures the original), subsampling (using only a portion of all the data), and noise injection (systematically adding random data to obscure individuals’ information). But until Apple provides more information about its process (which it may do in a white paper, as it has in the past), we are left guessing as to exactly how and at what point in data collection and analysis such methods are applied.
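Because Apple hasn't said how these pieces fit together, the following is purely illustrative: a hypothetical on-device pipeline (all constants and function names are our own) that hashes a value into a small bucket, reports only a random subsample of events, and flips bits at random before anything leaves the device. A server could then aggregate many such reports and correct for the known flip probability to estimate which buckets are popular.

```python
import hashlib
import random

NUM_BUCKETS = 256    # size of the hashed domain (illustrative constant)
SAMPLE_RATE = 0.05   # subsampling: report only a small fraction of events
FLIP_PROB = 0.25     # noise injection: chance of flipping each reported bit

def hash_to_bucket(item):
    """Hashing: map a raw value (say, an emoji or a word) to a small bucket
    index, so the raw value itself never leaves the device."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return digest[0]  # first byte gives a bucket in 0..255

def noisy_report(item):
    """Build a privatized report for one usage event, or None if not sampled."""
    if random.random() > SAMPLE_RATE:   # subsampling
        return None
    vector = [0] * NUM_BUCKETS
    vector[hash_to_bucket(item)] = 1    # one-hot encoding of the hashed value
    # Noise injection: flip each bit with some probability, so any single
    # report is deniable; the aggregator later subtracts the expected noise.
    return [bit ^ 1 if random.random() < FLIP_PROB else bit for bit in vector]
```

Whether Apple applies these steps in this order, on the device or on its servers, and with what parameters, is exactly the kind of detail we are still missing.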
Just as on-device machine learning has trade-offs, so too does differential privacy. Differential privacy relies on the concept of a privacy budget: essentially, the idea that you can only make so much use of your data without compromising its privacy-preserving properties. This is a tricky balancing act between accuracy and anonymity. The parameters and inputs of a given privacy budget can describe how information is being collected, how it is being processed, and what the privacy guarantees are.
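As a rough illustration of how a budget constrains analysis—again a toy sketch, not Apple's accounting, with hypothetical class and parameter names—consider counting queries answered with Laplace noise. Smaller values of the privacy parameter epsilon buy stronger anonymity at the cost of accuracy, and because the epsilons of successive answers add up under basic composition, a fixed total budget limits how much useful analysis the data can support.

```python
import numpy as np

class PrivacyBudget:
    """Toy privacy accountant: every noisy answer spends part of a total epsilon."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def noisy_count(self, true_count, epsilon):
        """Answer a counting query with Laplace noise of scale 1/epsilon.

        Adding or removing one person changes a count by at most 1
        (sensitivity 1), so Laplace(1/epsilon) noise makes this single answer
        epsilon-differentially private. Successive answers consume the budget.
        """
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

budget = PrivacyBudget(total_epsilon=1.0)
print(budget.noisy_count(true_count=1234, epsilon=0.25))  # cheap but noisy
print(budget.noisy_count(true_count=1234, epsilon=0.75))  # sharper, costlier
# Any further query would now raise: the budget is spent.
```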
With the new release, Apple is employing differential privacy methods when collecting usage data on typing, emoji, and searching in an attempt to provide better predictive suggestions. To date, differential privacy has had much more academic attention than practical application, so it's interesting and important to see major technology companies applying it—even if that application has both good and bad potential consequences.
On the good side, Apple has apparently put some work into collecting user data with regard for privacy. What's more, even the use of differential privacy methods on user data is opt-in, a step we're very glad to see Apple take.
However, Apple is collecting more data than it ever has before. Differential privacy is still a new, fairly experimental pursuit, and Apple is putting it to the test against millions of users' private data. And without any transparency into the methods employed, the public and the research community have no way to verify the implementation—which, just like any other initial release, is very likely to have flaws. Although differential privacy is meant to mathematically safeguard against such flaws in theory, the details of such a large roll-out can blow away those guarantees. Apple's developer materials indicate that it's well aware of these requirements—but with Apple both building and utilizing its datasets without any oversight, we have to rely on it to self-police.
In the cases of both facial recognition and differential privacy, Apple deserves credit for implementing technology with user privacy in mind. But to truly advance the cause of privacy-enhancing technologies, Apple should release more details about its methods to allow other technologists, researchers, and companies to learn from it and move toward even more effective on-device machine learning and differential privacy.