Our bodies are made of trillions of cells that form tissues and organs. The genes inside the nucleus of each cell code for proteins that determine a cell’s structure and function, as well as instruct a cell when to grow, divide and die. Normally, our cells follow these instructions, but if a cell’s DNA mutates it can cause the cell to divide and grow out of control. Cancer is fundamentally a disease of uncontrolled cell growth and regulation, and all cancers ultimately are caused by mutations to the genes that regulate cell division, growth and differentiation.
Our immune system defends the body against harmful pathogens such as bacteria and viruses, but it also protects us against cancer by eliminating tumours as they form. Specialized cells of the immune system called T cells can detect cancer cells and destroy them. The immune system’s ability to find and kill cancer cells is the basis for a revolutionary kind of treatment known as cancer immunotherapy – a treatment clinicians use to strengthen the T cell’s response to cancer.
But understanding how T cells recognize cancer cells requires a short explanation of how the information in DNA is used to build proteins.
The DNA in our genes contains information to build proteins – the molecules in our cells that carry out all functions necessary for life – but DNA does not create proteins directly. The flow of genetic information from DNA in the cell’s nucleus to the proteins that are synthesized within the cell involves two major steps called transcription and translation.

The process by which DNA is copied to mRNA is called transcription, and the process by which mRNA is used to synthesize proteins is called translation.
Transcription is the first step in a gene’s expression, the process by which information from a gene is used to synthesize a protein. During transcription, the DNA of a gene serves as a template to create messenger RNA, also known as mRNA, which is a single-stranded molecule composed of nucleotides that correspond to the genetic sequence of a gene.
The mRNA copy of a gene’s DNA sequence carries the information needed to build a protein, a large molecule made of many amino acids. During translation, the cell’s ribosomes read the mRNA three nucleotides at a time, following the genetic code that relates the nucleotide sequence of a gene to the sequence of amino acids that form a protein molecule.
If a mutation occurs in a cell’s DNA – even if just one nucleotide is substituted for another in a gene – it can change the mRNA transcribed from that gene and, in turn, the sequence of amino acids produced during translation.
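To make the two steps concrete, here is a minimal sketch of transcription and translation applied to a toy gene fragment. The DNA sequences are hypothetical, and the codon table is deliberately cut down to the five codons the example needs (the real genetic code has 64). The codon assignments themselves are the standard ones.

```python
# Reduced codon table: only the codons used in this toy example.
CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "GAA": "E",  # glutamic acid
    "GUA": "V",  # valine
    "AAA": "K",  # lysine
    "UAA": "*",  # stop
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA three nucleotides at a time until a stop codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)


normal_dna = "ATGGAAAAATAA"   # hypothetical gene fragment
mutated_dna = "ATGGTAAAATAA"  # one substitution: A -> T at the fifth nucleotide

normal_peptide = translate(transcribe(normal_dna))
mutated_peptide = translate(transcribe(mutated_dna))
print(normal_peptide, mutated_peptide)
```

In this sketch the single substitution turns the codon GAA into GUA, so glutamic acid (E) is replaced by valine (V) while the rest of the peptide is unchanged.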
This is where the immune system’s ability to recognize non-self comes in. If the protein synthesized by a cell is changed, our immune system sees it as non-self. And using the same mechanism that T cells use to recognize pathogens in the body as foreign or non-self, T cells target and kill cancer cells that present neoantigens, small pieces of non-self protein on their surfaces.
“In cancer, when a missense mutation occurs in a cell’s DNA, a single nucleotide substitution results in a different amino acid during translation. As a result, the peptide – a fragment of the protein – that carries the mutated amino acid can be recognized by our immune system as foreign, even though it is synthesized by the cancer cell from our own body,” explains Hieu Tran, an Adjunct Assistant Professor at the Cheriton School of Computer Science and Senior Research Scientist at Bioinformatics Solutions.
“This mutated peptide is what is called a neoantigen – a new antigen that’s present only on the surface of cancer cells, but not normal cells. Our immune system can recognize neoantigens on cancer cells and kill those cancer cells, while not affecting the normal cells. We can use these same neoantigens to develop cancer vaccines to boost our immune system to eliminate a tumour.”
“When a cell becomes cancerous the human leukocyte antigen or HLA system knows about it,” adds Ming Li, a University Professor at the Cheriton School of Computer Science, who also holds the Canada Research Chair in Bioinformatics. “The HLA system cuts the mutated protein into peptides and presents those peptides on the surface of the cell. If the HLA presents a normal peptide, the T cells know that it is a self-peptide and they don’t attack it. They attack only cells with mutated peptides – the neoantigens – that are not recognized as self.”
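The self versus non-self distinction can be illustrated with a small sketch. It cuts a normal protein and its mutated counterpart into every possible 9-amino-acid peptide and keeps only the peptides that appear in the mutated protein alone. The two sequences and the function names are hypothetical. The sketch also assumes every fragment is presented, which is a simplification: real HLA presentation is selective.

```python
def all_peptides(protein, length):
    """Return every contiguous peptide of the given length in a protein."""
    return {protein[i:i + length] for i in range(len(protein) - length + 1)}


def candidate_neoantigens(normal_protein, mutated_protein, length=9):
    """Peptides found in the mutated protein but not in the normal one."""
    self_peptides = all_peptides(normal_protein, length)
    tumour_peptides = all_peptides(mutated_protein, length)
    return sorted(tumour_peptides - self_peptides)


# Hypothetical sequences that differ by a single amino acid (R -> W).
normal_protein = "MKTAYIAKQRQISFVKSH"
mutated_protein = "MKTAYIAKQWQISFVKSH"

candidates = candidate_neoantigens(normal_protein, mutated_protein)
for peptide in candidates:
    print(peptide)
```

Every peptide that survives the filter spans the mutated position. Peptides that do not overlap the mutation are identical to self-peptides and are discarded.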
When a tumour is found in the body, a surgeon will remove a sample for analysis. Then using a technique known as mass spectrometry, which identifies molecules based on their mass, the amino acid sequence of the neoantigens on the surface of the tumour cells can be determined, Dr. Tran said.
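As a rough illustration of why mass carries sequence information, the sketch below adds up standard monoisotopic residue masses to get the mass of a peptide. The function name is an assumption made for this example. Real mass spectrometry analysis works with measured mass-to-charge ratios of peptide fragments and is considerably more involved.

```python
# Standard monoisotopic residue masses in daltons, keyed by one-letter code.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056  # one water molecule is added for the peptide's two ends


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide, in daltons."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS


# A single substitution shifts the mass by the difference between residues.
shift = peptide_mass("AKQWQISFV") - peptide_mass("AKQRQISFV")
print(round(shift, 4))

# Leucine (L) and isoleucine (I) have the same mass, so mass alone
# cannot tell them apart.
print(peptide_mass("ALK") == peptide_mass("AIK"))
```

The leucine and isoleucine case shows one reason why sequencing from mass data alone is hard: different sequences can weigh the same.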
The trick, however, is finding the tumour-specific neoantigens – essentially a needle in a large haystack. Not surprisingly, it is a bewilderingly difficult task to do using conventional methods, but it is crucially important because neoantigens are the non-self peptides that the immune system uses to find and destroy cancer cells.
Amino acids are the building blocks of peptides and ultimately proteins. Although many amino acids have been identified in nature, just 20 amino acids make up the proteins found in the human body. By convention, amino acids are labelled using a one-letter code. For example, the amino acid alanine is labelled A, arginine is labelled R, asparagine is labelled N, and so on. A peptide’s amino acid sequence can be considered as a word composed of these letters.
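The one-letter convention is easy to express as a lookup table. The sketch below lists the 20 standard amino acids with their one-letter codes and spells out a peptide "word" letter by letter. The function name `spell_out` is an assumption made for this example.

```python
# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartic acid",
    "C": "cysteine", "Q": "glutamine", "E": "glutamic acid", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
}


def spell_out(peptide):
    """Expand a peptide written in one-letter codes into amino acid names."""
    return [AMINO_ACIDS[letter] for letter in peptide.upper()]


print(spell_out("ARN"))
```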
“If you are familiar with Natural Language Processing, you’ve likely seen your mobile phone guess the next word you might have typed as you compose a message. You write ‘how’ and it suggests ‘are’ and if you type ‘are’ it suggests ‘you’,” Dr. Tran said.
“We applied a similar machine-learning model to determine the amino acid sequence of neoantigens based on this one-letter amino acid code. We predict the peptide’s sequence by predicting its amino acids one at a time. If I know your immunopeptidome – the thousands of short 8 to 12 amino acid peptide antigens displayed on the cell surface – and I know that a neoantigen is different from your existing peptides by just one mutation, I can train a machine learning model using your normal peptides to predict the mutated peptides. We used a recurrent neural network – a machine learning model we call DeepNovo – to predict the amino acid sequence of neoantigens one letter at a time.”
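The idea of predicting a sequence one letter at a time can be sketched with a far simpler stand-in than a recurrent neural network. The toy model below is not DeepNovo: it only counts which amino acid most often follows each letter in a set of training peptides, much like the crudest form of next-word suggestion on a phone. The training peptides and function names are hypothetical, and the sketch ignores the mass spectrometry signal entirely.

```python
from collections import Counter, defaultdict


def train_bigram_model(peptides):
    """Count, for each amino acid, which amino acids follow it."""
    counts = defaultdict(Counter)
    for peptide in peptides:
        for current, following in zip(peptide, peptide[1:]):
            counts[current][following] += 1
    return counts


def predict_next(model, prefix):
    """Suggest the most frequent follower of the last letter in the prefix."""
    if not prefix:
        return None
    followers = model.get(prefix[-1])
    if not followers:
        return None  # this letter was never followed by anything in training
    return followers.most_common(1)[0][0]


# Hypothetical "normal" peptides standing in for a patient's immunopeptidome.
training_peptides = ["ALAK", "ALGK", "ALAR"]
model = train_bigram_model(training_peptides)

print(predict_next(model, "AL"))
```

A recurrent neural network plays the same role as the count table here, but it conditions each prediction on the whole prefix rather than on the previous letter alone.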

Personalized de novo peptide sequencing workflow to discover neoantigens to develop cancer vaccines.
To do this, the researchers downloaded the immunopeptidome datasets of five patients with melanoma, a type of skin cancer, which they then used to train, validate and test their machine learning model.
“Our machine-learning model expanded the predicted immunopeptidomes of those patients by 5 to 15 percent using only the data from mass spectrometry,” Dr. Tran said. “We also discovered neoantigens, including those with validated T-cell responses that had not been reported in previous studies.”
Even more impressively, the machine learning model is able to personalize the results – that is, it identifies specific neoantigens for each individual patient.
“The most exciting thing is that our approach is truly personalized – personalized to each individual patient as opposed to a group of similar patients. We used the data of each individual patient to identify his or her own neoantigens and develop a cancer vaccine specifically for that patient,” Dr. Tran said.
“Cancer immunotherapy is quickly becoming a fourth modality of cancer treatment, alongside surgery, chemotherapy and radiotherapy,” adds University Professor Li. “Every patient is different and every cancer is different, so cancer treatment shouldn’t be the same for all. Treatment should be tailored to the patient, and that’s what our personalized machine learning model allows us to do.”

Ming Li (L) is a University Professor at the Cheriton School of Computer Science and the Canada Research Chair in Bioinformatics. He is known for his fundamental contributions to Kolmogorov complexity, bioinformatics, machine learning theory, and analysis of algorithms. Hieu Tran (R) is an Adjunct Assistant Professor at the Cheriton School of Computer Science and Senior Research Scientist at Bioinformatics Solutions, a Waterloo-based company that uses machine learning to sequence and identify proteins.
To learn more about the research on which this feature is based, please see the following scientific journal publications.
Ngoc Hieu Tran, Rui Qiao, Lei Xin, Xin Chen, Baozhen Shan, Ming Li. Nature Machine Intelligence 2, 764–771 (2020).
Ngoc Hieu Tran, Rui Qiao, Lei Xin, Xin Chen, Chuyi Liu, Xianglilan Zhang, Baozhen Shan, Ali Ghodsi, Ming Li. Nature Methods 16, 63–66 (2019).
Ngoc Hieu Tran, Xianglilan Zhang, Lei Xin, Baozhen Shan, Ming Li. Proceedings of the National Academy of Sciences 114 (31), 8247–8252 (2017).