Exploring Trust, Privacy, and Security in Machine Learning, Natural Language Processing and AI

Wednesday, May 24, 2023
CPI's Gautam Kamath at the podium speaking about differential privacy

We sat down for a conversation with Gautam Kamath, an assistant professor at the Cheriton School of Computer Science in the Faculty of Mathematics and a member of the Cybersecurity and Privacy Institute, about trust, privacy, and security surrounding machine learning (ML) and natural language processing (NLP) models.

Gautam leads a research group and was recently named a Canada CIFAR AI Chair and a Vector Institute Faculty Member in recognition of his contributions to differential privacy, machine learning, and statistics. His research focuses on developing trustworthy and reliable machine learning and statistics, with a particular emphasis on fundamental problems in robustness and data privacy.

The following answers have been edited for clarity and brevity.



There’s a quote from Ernest Hemingway: “The best way to find out if you can trust somebody is to trust them.” Is that an approach we should take to the NLP and AI-driven tools we interact with?


I thought a lot about what this phrase means. My personal interpretation is to trust them a little, and then see if you can trust them more. I read that quote as a kind of test: you must test someone to see whether you can trust them, and this is often done in machine-learning contexts for security and privacy. I also interpret it to mean, can we trust them to give us something correct or not? While these models are powerful and can do a lot of amazing things, I don’t think they are quite at the point where you can trust them to give you the right answer 100% of the time or to make decisions in a life-or-death situation. These models can give wrong answers, confidently wrong answers.

How concerned should users of these AI and NLP tools be about their data privacy and security?


You start by asking where ChatGPT, or any of these other machine learning and NLP tools, gathers its data from. Essentially, how do they learn? One of the things that has powered a lot of the advances in machine learning and NLP over the last 5–10 years is large, publicly available datasets. An example is Common Crawl, a dataset scraped from the public Internet from a variety of sources. You can imagine that a lot of this is going to be innocuous: random Internet comments, jokes, and memes.

Now suppose I posted some sensitive information on my Facebook page and, having somehow misunderstood the privacy settings, it’s now accidentally visible to the world. It’s possible that information was used as training data. Down the line you don’t know what these tools will do with that information, and there have been cases demonstrating that these language models can spit out parts of their training data verbatim.
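The verbatim-memorization risk Gautam describes can be probed directly. Below is a rough sketch of one way researchers test for it, assuming the Hugging Face transformers library and a small open model ("gpt2" here as a stand-in); the suspected string is entirely hypothetical and not from the interview.

```python
# Hedged sketch: probing a language model for verbatim memorization.
# The model name and the suspected string are illustrative stand-ins.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical string that may (or may not) have appeared in training data.
suspected = "Jane Doe's home address is 123 Example Street, Springfield"
prefix = suspected[:30]          # prompt the model with only the beginning

inputs = tokenizer(prefix, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,             # greedy decoding: the model's most likely continuation
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# If the model reproduces the rest of the string exactly, that is evidence
# the sequence may have been memorized from its training data.
print("Verbatim continuation?", completion.startswith(suspected))
```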

Are there any security and privacy concerns specifically about ChatGPT that you’ve come across?


It’s not exactly ChatGPT, but I have this paper on the closely related GPT-4 in front of me, and in the paper they first comment on using publicly available datasets for training. The other thing I want to highlight is that the paper mentions data licensed from third-party providers. This is all they tell you about their datasets, which is kind of mysterious. What do these third-party providers have about me? I’m sure it’s an appropriately licensed dataset, but you can imagine at some point you clicked “OK” and accepted the terms of a license agreement on an app. Now your data might be in the hands of a third party, and unless specifically stated otherwise in the agreement, they can do whatever they want with it. They can sell it or license it to other people and companies. Now this third-party data, your data, is in this massive machine learning model.

Additionally, you’re also sending them data through the prompts you give ChatGPT. They state they will use this data to improve ChatGPT, so it essentially becomes new training data. Sensitive things you have told it or asked it, ChatGPT can memorize and use. People should think about the privacy considerations in all these cases. Unfortunately, I think people have already leaked a lot of their private information just by clicking “accept” on things without understanding or thinking about where their data is going to end up.

Are there any real incentives for creators of machine learning and NLP models to be more careful with user data and privacy? Or is it viewed as almost a hindrance to progress?


One reason why you might not want to be careless with user data is to maintain users’ trust, so they provide you with more of their data in the future. A lot of my work is on a specific notion of privacy called differential privacy, and a big complaint against this notion is that while it does guarantee individual privacy in a very precise sense, it can hurt utility. On the other hand, maybe there is an order of magnitude more data that you wouldn’t be able to access unless you put privacy and security first. So you can enhance your model’s utility by enhancing your commitment to user privacy. If you are respectful of users’ data, they might give you more data later, which can allow your model to eventually do more useful things.
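To make the privacy-utility trade-off concrete, here is a minimal sketch of the Laplace mechanism, one of the basic building blocks of differential privacy. The dataset, clipping bounds, and epsilon values below are purely illustrative and are not from the interview.

```python
# Minimal sketch of the Laplace mechanism (differential privacy).
# Toy data and parameters; smaller epsilon = stronger privacy, noisier answer.
import numpy as np

def dp_mean(values, lower, upper, epsilon):
    """Return a differentially private estimate of the mean of `values`.

    Each value is clipped to [lower, upper] so one person's data has
    bounded influence on the result.
    """
    clipped = np.clip(values, lower, upper)
    true_mean = clipped.mean()
    # Sensitivity of the mean: changing one person's value can shift it
    # by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_mean + noise

ages = np.array([23.0, 35.0, 41.0, 29.0, 52.0, 38.0, 61.0, 27.0])  # toy data
print(dp_mean(ages, 0.0, 100.0, epsilon=0.1))  # very noisy: strong privacy
print(dp_mean(ages, 0.0, 100.0, epsilon=5.0))  # close to the true mean: weaker privacy
```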

We keep offering up our personal information. Is there any real digital security and privacy anymore? Do we just accept a lack of security and privacy as the norm?


That’s a good but tough question. On one hand, there needs to be better education about the “bad things” that can happen when you allow access to your data. I’m not sure people think about these things, because the outcomes of these decisions are distant. It’s difficult for people to have foresight and understand the potential risk. There needs to be better education that explicitly says, “hey, you did this and now I can figure out this about you.” For example, there was a study showing it’s possible to guess someone’s sexuality from which pages they like on Facebook, even though the pages aren’t obviously related to it.

I think the security and privacy community needs to focus more on accessible and understandable information for the public. Researchers in these areas understand the risks, but they often come across as too technical when communicating them. There is rarely a “smoking gun” for the average user, so making the risks clear and concrete is something that could be done to better educate everyone.


Learn more about Dr. Kamath’s research and course lessons on his YouTube channel, where he covers an array of topics including machine learning; his lectures on differential privacy are particularly noteworthy.