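以機器學習方法分析結構與螢光波長之關係 (Analyzing the Relationship between Structure and Fluorescence Wavelength with Machine Learning Methods)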

dc.contributor: 蔡明剛 [zh_TW]
dc.contributor: Tsai, Ming-Kang [en_US]
dc.contributor.author: 周弈銘 [zh_TW]
dc.contributor.author: Chou, Yi-Ming [en_US]
dc.date.accessioned: 2019-09-04T09:09:33Z
dc.date.available: 2023-12-31
dc.date.available: 2019-09-04T09:09:33Z
dc.date.issued: 2018
dc.description.abstract: 在定量構效關係的研究中,以機器學習方式進行資料挖掘的比例越來越高,而使用少量描述符對某種化學特性進行建模一直是化學訊息學中非常重要的一環,在擁有少量樣本以及大量從E-Dragon資料庫中取得的分子結構與特性相關的描述符數據後,透過機器學習的方式找出能夠對萘和香豆素之不同取代基化合物之螢光波長進行擬合的描述符和演算法,便成為本次實驗的目的,而透過四種不同的機器學習演算法(決策樹回歸、隨機森林回歸、GBDT回歸、極端樹回歸)之間投票和比較,從1664種描述符中取得R3m、Ss、R7u+三種描述符對螢光波長進行擬合;再透過測試集準確率的比較與檢驗,選出對於處理非線性問題具有良好功能的隨機森林回歸做為最後建模工具(隨機森林回歸所使用的層數為19層、65個弱學習器)。而此三種描述符則是在本實驗中做為具有預測螢光波長之描述符。在建模之後,分析訓練集和測試集的平均絕對誤差以及誤差百分率,得到訓練集之平均絕對誤差為16奈米、誤差百分率為百分之四;而測試集的平均絕對誤差為26奈米、誤差百分率為百分之六。而在分析誤差結果時也發現,R3m和Ss之相關性程度取決於取代基的複雜程度,而不同的複雜程度會對不同光區的分子有著不同的影響。如果具有高度相關性,也就是取代基具有多重鍵以及複雜性,則落在短波長區間(尤其是紫光)的預測能力較佳;若高度相關性的情況發生在長波長分子上,則模型的預測能力會變弱。 [zh_TW]
dc.description.abstract: In quantitative structure-activity relationship (QSAR) studies, machine learning is increasingly used for data mining, and modeling a chemical property with a small number of descriptors has long been an important part of cheminformatics. Given a small number of samples and a large set of descriptors of molecular structure and properties obtained from the E-Dragon database, the aim of this work was to use machine learning to identify descriptors and an algorithm that can fit the fluorescence wavelengths of naphthalene and coumarin compounds bearing different substituents. By voting among and comparing four machine learning algorithms (decision tree regression, random forest regression, GBDT regression, and extremely randomized trees regression), three descriptors, R3m, Ss, and R7u+, were selected from 1664 descriptors to fit the fluorescence wavelength. Test-set accuracy was then compared and checked, and random forest regression, which handles nonlinear problems well, was chosen as the final modeling tool (with a depth of 19 layers and 65 weak learners). These three descriptors serve in this work as the descriptors for predicting fluorescence wavelength. After modeling, the mean absolute error and percentage error of the training and test sets were analyzed: the training set has a mean absolute error of 16 nm and a percentage error of 4%, and the test set has a mean absolute error of 26 nm and a percentage error of 6%. The error analysis also showed that the degree of correlation between R3m and Ss depends on the complexity of the substituents, and that different levels of complexity affect molecules in different spectral regions differently. When the correlation is high, that is, when the substituents contain multiple bonds and are complex, prediction is better for molecules in the short-wavelength range (especially violet light); when high correlation occurs for long-wavelength molecules, the model's predictive ability is weaker. [en_US]
dc.description.sponsorship: 化學系 (Department of Chemistry) [zh_TW]
dc.identifier: G060442056S
dc.identifier.uri: http://etds.lib.ntnu.edu.tw/cgi-bin/gs32/gsweb.cgi?o=dstdcdr&s=id=%22G060442056S%22.&%22.id.&
dc.identifier.uri: http://rportal.lib.ntnu.edu.tw:80/handle/20.500.12235/100132
dc.language: 中文 (Chinese)
dc.subject: QSAR [zh_TW]
dc.subject: 機器學習 [zh_TW]
dc.subject: 螢光 [zh_TW]
dc.subject: QSAR [en_US]
dc.subject: Machine learning [en_US]
dc.subject: fluorescence [en_US]
dc.title: 以機器學習方法分析結構與螢光波長之關係 [zh_TW]
dc.title: Analyzing the relationship between the structure and fluorescence: Machine learning method [en_US]
