Scientists leverage AI to design protein dataset

Scientists in Shanghai have made a breakthrough in protein design by leveraging artificial intelligence, establishing the world's largest protein sequence dataset and developing models that enable targeted modification and selection of proteins with specific functions.
The advancement has the potential to drastically reduce the time and cost involved in industrial protein modification, according to researchers from Shanghai Jiao Tong University.
Proteins play key roles in industries ranging from pharmaceuticals to green manufacturing. However, natural proteins often require modification to withstand environmental factors such as temperature shifts and acidity levels. For example, if a protein is used in laundry detergent, it must function in both hot and cold water to effectively break down stains.
Traditionally, modifying proteins required thousands of trial-and-error experiments, a costly and time-consuming process. The Shanghai team's approach replaces trial and error with AI-powered design, cutting the research and development timeline from between two and five years to as little as six months.
Their technology allows precise modifications to enhance specific properties such as extreme heat resistance, alkaline stability and resilience against digestion. The approach has broad implications for biotechnology, pharmaceuticals and industrial production.
The technology has already been put into industrial use alongside automated equipment, making protein design more efficient.
At the core of the research is the Venus-Protein Outsize Database, or Venus-Pod, which contains more than 9 billion protein sequences spanning a wide range of organisms, including extremophiles, the microorganisms that thrive in harsh conditions.
The dataset includes 3.62 billion terrestrial microbial protein sequences, 2.94 billion marine microbial sequences, 2.43 billion antibody sequences and 60 million viral protein sequences. Notably, 500 million of these are labeled with functional tags that indicate their optimal working conditions such as temperature, pressure, acidity and alkalinity.
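As a quick check on those figures, the short sketch below adds up the four reported categories and estimates what share of sequences carry functional labels. It assumes the four categories account for the whole dataset, which the article does not state outright; the variable names are illustrative only.

```python
# Sequence counts as reported in the article, in billions.
# Assumption: these four categories make up the entire dataset.
composition = {
    "terrestrial microbial": 3.62,
    "marine microbial": 2.94,
    "antibody": 2.43,
    "viral": 0.06,
}
labeled = 0.5  # 500 million sequences with functional tags

total = sum(composition.values())
labeled_share = labeled / total

print(f"total: {total:.2f} billion")
print(f"labeled share: {labeled_share:.1%}")
```

Under that assumption, the categories sum to about 9.05 billion sequences, consistent with the "more than 9 billion" figure, and roughly one sequence in 18 carries a functional label.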
Using Venus-Pod, researchers trained the Venus series models, which rank at the top of the industry leaderboard in predicting and designing protein functions, according to Hong Liang, the team's lead scientist.
The Venus models have two core functions: AI-directed protein evolution and AI-powered screening.
"The first optimizes underperforming proteins to meet specific application requirements, while the second precisely identifies proteins with exceptional properties, such as extreme heat or gastrointestinal resistance," Hong said.
The team has also developed what they say is the world's first integrated machine capable of high-volume protein expression, purification and functional testing. The system can complete more than 100 tasks in 24 hours, nearly 10 times faster than manual methods, cutting labor and resource costs while accelerating protein engineering research.
Over the past two years, the Venus models have successfully designed multiple proteins that are now moving toward industrialization.
For example, for early diagnosis of Alzheimer's disease, researchers optimized an enzyme known as alkaline phosphatase, or ALP, to perform at three times the activity level of the best product available globally, allowing for detection of biomarkers at extremely low concentrations. The modified ALP has entered the 200-liter scale-up production stage, marking a significant step toward commercial application.
Researchers say the achievement could have major implications for diagnostic testing projects that require ultra-sensitive detection.
zhouwenting@chinadaily.com.cn