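Machine-Learning-Based Prediction of the Highest Occupied Molecular Orbital, Lowest Unoccupied Molecular Orbital, and Energy Gap of Organic Molecules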
Date
2023
Abstract
Rapid advances in computing have driven a rise in data-intensive simulation research, and using machine learning to predict results, support experimental work, and uncover new possibilities has become a clear trend. Traditional quantum-chemical calculations, by contrast, are time-consuming, comparatively expensive, and feasible only for a small number of molecules.

The HOMO and LUMO energies and the HOMO-LUMO energy gap are widely used in organic chemistry because they relate to emission wavelength, electron transfer, and chemical reactivity. To address the limitations above, this study builds models using clustering together with linear and nonlinear regression, and applies them step by step to a large and varied set of organic compounds.

Specifically, Lasso regression, K-means clustering, and the random forest algorithm are used to predict the HOMO, LUMO, and energy gap of 114,896 organic molecules. The mean absolute error (MAE) between the theoretical and predicted values of HOMO, LUMO, and energy gap is less than 0.3 eV, and the adjusted R² of the nonlinear regression model is greater than 0.93, indicating that the predictions agree closely with the expected chemical properties.

The analysis shows that the models not only predict well but also select descriptors that are consistent with established chemical understanding. The concepts and analytical methods of this study may in future contribute to numerical analysis in related fields.
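The two figures of merit quoted in the abstract, MAE and adjusted R², can be illustrated with a minimal sketch. It assumes the standard definitions of both metrics; the function names, the energy values, and the descriptor count are hypothetical and are not taken from the thesis.

```python
from statistics import fmean


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error, in the same unit as the inputs (eV here)."""
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r2(y_true, y_pred, n_features):
    """R^2 penalized for the number of descriptors used by the model."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)


# Hypothetical HOMO energies in eV (illustrative values, not from the thesis).
reference = [-6.5, -7.0, -5.5, -6.0, -7.5, -6.8]
predicted = [-6.4, -7.2, -5.6, -6.1, -7.3, -6.9]

mae = mean_absolute_error(reference, predicted)
r2 = r2_score(reference, predicted)
# n_features=2 is an assumed descriptor count, chosen only for illustration.
adj = adjusted_r2(reference, predicted, n_features=2)
```

Adjusted R² is never larger than plain R² when the fit has at least one descriptor, which makes it the more conservative figure to report when many descriptors are available to the model.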
Keywords
machine learning, QM9 dataset (Quantum-Machine 9), K-means clustering, random forest