Date: 2022-08-01

AI learned to recognize and classify different dog breeds based on images. A new machine learning method from CZ Biohub now makes it possible to classify and compare different human proteins based on fluorescence microscopy images. Credit: CZ Biohub



People are good at looking at images and finding patterns or making comparisons. For example, look at a collection of dog photos and you can sort them by color, ear size, face shape, and so on. But could you compare them quantitatively? And perhaps more intriguingly, could a machine extract meaningful information from images that humans cannot?

Now, a team of Chan Zuckerberg Biohub scientists at Stanford University has developed a machine learning method to quantitatively analyze and compare images, in this case microscopy images of proteins, without prior knowledge. As reported in Nature Methods, their algorithm, called “cytoself,” provides rich, detailed information about the location and function of proteins in a cell. This capability could reduce research time for cell biologists and ultimately be used to speed up drug discovery and drug screening.

“This is very exciting — we’re applying AI to a new kind of problem and still recovering everything people know, plus more,” said Loic Royer, co-corresponding author of the study. “In the future, we could do this for different types of images. It opens up a lot of possibilities.”

Cytoself not only demonstrates the power of machine learning algorithms; it has also provided insights into cells, the basic building blocks of life, and into proteins, the molecular building blocks of cells. Each cell contains about 10,000 different types of proteins, some working alone and many working together, doing different tasks in different parts of the cell to keep it healthy. “A cell is much more spatially organized than we previously thought. That’s an important biological result about how the human cell is wired,” said Manuel Leonetti, also co-corresponding author of the study.

And like all tools developed at CZ Biohub, cytoself is open source and accessible to everyone. “We hope it will inspire many people to use similar algorithms to solve their own image analysis problems,” Leonetti said.

Never mind a PhD, machines can learn on their own

Cytoself is an example of what is known as self-supervised learning, which means that people don’t teach the algorithm anything about the protein images, unlike in supervised learning. “With supervised learning, you have to teach the machine one by one with examples; it’s a lot of work and very tedious,” said Hirofumi Kobayashi, lead author of the study. And if the machine is limited to the categories that people teach it, that can introduce bias into the system.

“Manu [Leonetti] and I believed the information was already in the images,” Kobayashi said. “We wanted to see what the machine could come up with on its own.”

The team, which also included CZ Biohub Software Engineer Keith Cheveralls, was indeed surprised by the amount of information the algorithm was able to extract from the images.

“The level of detail in protein localization was much higher than we had imagined,” said Leonetti, whose group is developing tools and technologies to understand cell architecture. “The machine transforms each protein image into a mathematical vector. So then you can start arranging images that look the same. We realized that by doing that, we could predict, with high specificity, which proteins interact in the cell, just by comparing their images, which was quite surprising.”
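To make the idea of comparing images as vectors concrete, here is a minimal sketch. It is not cytoself's code: the protein names, the three-number vectors, and the choice of cosine similarity as the comparison measure are all assumptions made for illustration. Real embeddings would come from a trained model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, embeddings):
    """Return (name, similarity) pairs for every other protein, most similar first."""
    query_vec = embeddings[query]
    scores = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings: invented names and values, for illustration only.
embeddings = {
    "protein_A": [0.9, 0.1, 0.0],
    "protein_B": [0.8, 0.2, 0.1],
    "protein_C": [0.0, 0.1, 0.9],
}

ranking = rank_by_similarity("protein_A", embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy example, protein_B ranks as the closest match to protein_A because their vectors point in nearly the same direction. In the spirit of Leonetti's description, proteins whose image vectors sit close together would be candidates for sharing a location or working together in the cell.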

In this rotating 3D UMAP image, each dot represents a single protein image, colored according to protein localization categories. Collectively, it forms a highly detailed map of the full diversity of protein localizations. Credit: CZ Biohub

First of its kind

While previous work has been done on protein images using self-supervised or unsupervised models, self-supervised learning has never before been used so successfully on such a large dataset: more than 1 million images encompassing more than 1,300 proteins measured from living human cells, said Kobayashi, an expert in machine learning and high-speed imaging.

The images were a product of CZ Biohub’s OpenCell, a project led by Leonetti to create a complete map of the human cell, which will ultimately include characterizing the approximately 20,000 types of proteins that power our cells. The first 1,310 proteins they characterized were published earlier this year in Science, including images of each protein (produced using a fluorescent tag) and maps of their interactions with each other.

Cytoself was key to what OpenCell achieved (all images are available at opencell.czbiohub.org), providing highly detailed, quantitative information on protein localization.

“The question of what are all the possible ways a protein can localize in a cell (all the places it can be and all kinds of combinations of places) is fundamental,” Royer said. “Biologists have spent decades trying to pinpoint all possible places and all possible structures within a cell. But that’s always been done by people looking at the data. The question is, how much have human limitations and biases made this process imperfect?”

Royer added, “As we’ve shown, machines can do it better than humans. They can find finer categories and see distinctions in the images that are extraordinarily subtle.”

The team’s next goal for cytoself is to explore how small changes in protein localization can be used to recognize different cellular states, for example, a normal cell versus a cancer cell. This could hold the key to better understanding many diseases and facilitating drug discovery.

“Drug screening is basically trial and error,” Kobayashi said. “But cytoself is a big leap, because you don’t have to run experiments one by one on thousands of proteins. It’s a low-cost method that could significantly speed up research.”

