蔡明剛 (Tsai, Ming-Kang)
鄭吉峰 (Cheng, Chi-Feng)
2023-12-08
2023-08-15
2023-12-08
2023
https://etds.lib.ntnu.edu.tw/thesis/detail/748c20681cd677c751b32b398b77ab49/
http://rportal.lib.ntnu.edu.tw/handle/20.500.12235/120993

Compound-protein interaction (CPI) is a key area of drug development concerned with how drugs interact with proteins. These interactions are crucial to a drug's activity and efficacy. Traditionally, CPI research has relied on laborious and time-consuming laboratory assays. Machine learning, which has advanced rapidly, offers several advantages here: it can process large-scale, complex biological data efficiently and learn features and patterns automatically, which accelerates drug development and reduces its cost.

This study aims to improve an existing CPI machine learning model to enhance its predictive capability. The original model primarily uses the self-attention mechanism of the Transformer to predict CPI activity, a mechanism that can capture both local and global relationships between molecules and proteins. We hypothesize that additionally incorporating molecular chemical fingerprints gives the model a richer description of molecular features and thereby improves its performance.
To achieve this, we used the PaDEL tool to generate chemical fingerprints for all molecules in the GPCR dataset. Through cluster analysis, we examined how the different chemical fingerprints are distributed within the dataset, which helped us understand the similarities and differences in molecular structure and properties. We then introduced the fingerprints into model training in three successive ways, to test whether they are effective and to identify the most suitable way of integrating them. First, we converted the chemical fingerprints into embedding vectors to provide more comprehensive information. Second, we supplied the chemical fingerprints as additional features so that the model could use the fingerprint information more fully. Third, we applied TF-IDF weighting to the fingerprint values to increase their variability, so that the model could better distinguish between molecules.

In the experiments, we compared the CPI prediction performance of these three models and analyzed how the results relate to the earlier cluster analysis. We observed that introducing chemical fingerprints improved the model's predictive accuracy and stability for specific fingerprint types, and that these improvements show some correlation with the cluster analysis results.

Deep learning; CPI; Chemical fingerprint; Transformer
結合化學指紋輔助原子嵌入和自注意力模型進行蛋白質-配體交互作用預測
Combining Molecular Fingerprints and Atomic Embedding with a Self-Attention Model for Protein-Ligand Interaction Prediction
etd
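The abstract's third integration approach, TF-IDF weighting of fingerprint values, can be illustrated with a minimal sketch. The abstract does not state which TF-IDF variant was used, so the smoothed IDF formula, the function name `tfidf_fingerprints`, and the toy 4-bit fingerprints below are all assumptions chosen for illustration, not the thesis's implementation.

```python
import math

def tfidf_fingerprints(fingerprints):
    """Reweight fingerprint rows (molecules x bits) with TF-IDF.

    Hypothetical helper: each molecule is treated as a "document" and
    each fingerprint bit (or count) as a "term".
    """
    n_mols = len(fingerprints)
    n_bits = len(fingerprints[0])
    # Document frequency: in how many molecules is each bit non-zero?
    df = [sum(1 for fp in fingerprints if fp[j] > 0) for j in range(n_bits)]
    # Smoothed IDF (an assumed variant); avoids division by zero for unused bits.
    idf = [math.log((1 + n_mols) / (1 + d)) + 1.0 for d in df]
    weighted = []
    for fp in fingerprints:
        total = sum(fp)
        # Term frequency: the bit's share of the molecule's total fingerprint mass.
        row = [(v / total if total else 0.0) * idf[j] for j, v in enumerate(fp)]
        weighted.append(row)
    return weighted

# Toy fingerprints: bit 0 is set in every molecule, bits 1 and 2 are rare.
fps = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
]
weights = tfidf_fingerprints(fps)
```

In this toy input, the bit shared by all molecules receives the lowest IDF, so within the first molecule the rarer bit ends up with the larger weight. This is the sense in which such weighting can make the fingerprint vectors of different molecules more distinct.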