A Molecular Image-Based Novel Quantitative Structure-Activity Relationship Approach, DeepSnap-Deep Learning and Machine Learning

The quantitative structure-activity relationship (QSAR) approach has long been used for the in silico computational assessment of numerous chemical compounds. Machine learning methods have been developed and refined to achieve high-performance QSAR modeling. In particular, the three-dimensional structure of chemical compounds has been gaining increasing attention because it represents a large amount of structural information. However, extracting many features alone is not sufficient to build models with high predictive ability; suitable extraction and effective selection of features are essential for models with excellent performance. Recently, deep learning has been employed to construct prediction models with very high performance from big data, especially in the field of classification. Therefore, in this study, we developed a molecular image-based novel QSAR approach, called the DeepSnap-Deep learning approach, for designing high-performance models. This DeepSnap-Deep learning approach outperformed conventional machine learning methods in direct comparisons. Herein, we discuss the advantages and disadvantages of these machine learning methods as well as the applicability of the DeepSnap-Deep learning approach.

The QSAR approach derives linear or nonlinear relationships between the structural features of chemicals and their experimentally measured biological activities using chemical descriptors, which represent characteristic values of chemicals with the same basic skeleton (Muratov et al., 2020; Achary, 2020; Toropov and Toropova, 2020). These descriptors are categorized from zero-dimensional (0D) to four-dimensional (4D) based on the compound space considered in the calculation (González-Díaz et al., 2007; Yap, 2011; Kurgan and Disfani, 2011; Comelli et al., 2014; Yuan et al., 2018; Schneider et al., 2019; Ginex et al., 2019; Tangadpalliwar et al., 2019). First, a 0D-molecular descriptor is a configuration descriptor that represents the molecular weight of the compound and the numbers of rotatable bonds as well as double or triple bonds, or a count descriptor that indicates the numbers of atoms, rings, and total heavy atoms (Kramer and Gedeck, 2011; Damião et al., 2014). Second, a 1D-molecular descriptor is a physical property value, such as the number of specific functional groups, partial fragment structures, or hydrogen bond donor and acceptor atoms; a fingerprint representing the presence or absence of partial structures; or one of various LogP values, including AlogP, ClogP, MlogP, SlogP, XlogP, and so on (Moriguchi et al., 1992; Xue et al., 2000; Abreu et al., 2011; dos Reis et al., 2014; McDonagh et al., 2015). Third, a 2D-molecular descriptor is a topological or connectivity index, a quantitative variable that characterizes topological features as an invariant of the molecular graph; examples include the topological polar surface area, indicating the polar part of the molecular surface; the Wiener index, the sum of the shortest distances between the atoms in the molecule; and the Balaban J index, the average total bond distance in the molecule (Prasanna et al., 2005; Prasanna and Doerksen, 2009; Poša, 2011). Fourth, a 3D-molecular descriptor is a geometrical descriptor that encodes 3D information of molecules, such as molecular size, molecular structure, symmetry, and atomic distribution; examples include the highest occupied molecular orbital (HOMO)/lowest unoccupied molecular orbital (LUMO) energy levels obtained from quantum chemical calculations, and the weighted holistic invariant molecular (WHIM) descriptors, eigenvalues calculated from a molecular matrix corresponding to a molecular graph in which the 3D coordinates are weighted by the characteristics of each atom (Chen, 2008; Uesawa and Mohri, 2012; Marunnan et al., 2017; Elrhayam and Elharfi, 2019). Fifth, a 4D-molecular descriptor is calculated from interactions with other compounds, for example, the interaction energy obtained from molecular dynamics simulation (Kumar and Kulkarni, 2017; Ataide Martins et al., 2018; Ma et al., 2019). Since the affinity of chemical substances for protein enzymes and receptors depends strongly on intermolecular interactions, such as hydrogen bonding, electrostatic interactions, and hydrophobic interactions, the complementary positional relationship between the functional groups of a chemical and its binding site on the enzyme or receptor has been observed to be important for these interactions (Yang et al., 2017; Kimani et al., 2018; Pan et al., 2019; Garcia et al., 2020).
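As a concrete illustration, the following minimal Python sketch computes a few representative 0D-2D descriptors with the open-source RDKit toolkit; the example molecule (aspirin) and the chosen descriptors are illustrative and are not taken from the studies cited above.

```python
# A minimal sketch of computing 0D-2D molecular descriptors with RDKit;
# the molecule and descriptor selection are illustrative choices.
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

# 0D: configuration/count descriptors
print("MolWt:", Descriptors.MolWt(mol))
print("Rotatable bonds:", rdMolDescriptors.CalcNumRotatableBonds(mol))
print("Heavy atoms:", mol.GetNumHeavyAtoms())

# 1D: functional-group counts and a LogP estimate
print("H-bond donors:", rdMolDescriptors.CalcNumHBD(mol))
print("H-bond acceptors:", rdMolDescriptors.CalcNumHBA(mol))
print("MolLogP (Crippen):", Descriptors.MolLogP(mol))

# 2D: topological descriptors computed on the molecular graph
print("TPSA:", Descriptors.TPSA(mol))
print("BalabanJ:", Descriptors.BalabanJ(mol))
```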
Therefore, besides molecular descriptors that capture the physicochemical properties related to such intermolecular interactions, QSAR analysis in 3D space, using descriptors that represent the 3D structure of chemical substances, has recently been attracting research attention (Hadni and Elhallaoui, 2020; Kumar et al., 2020; Liu et al., 2020; Zhang et al., 2020). Additionally, prediction models in classical QSAR are constructed using descriptors as explanatory variables. The objective variables cover various types and strengths of activity, namely the reactivity and affinity of different chemical substances, including hormones and inhibitors that maintain homeostasis of biological functions, ligands that act on energy metabolism, and agonists/antagonists of vitamins involved in signal transduction (Hsu et al., 2014; Santos et al., 2017; Nagamani et al., 2018; Sakkiah et al., 2018; Agrawal et al., 2019; Kato, 2020). Moreover, these chemical substances participate in processes called key events (KEs), measurable steps in such a pathway that induce the expression of physiological activity through interaction with target molecules, such as receptors (Bal-Price and Meek, 2017; Scotti et al., 2017; Brüggemann et al., 2018; Ehlert, 2018; Naqvi et al., 2018). Further, the type of physiological activity depends on the type of target molecule, and the strength of the physiological activity is determined by the strength of the interaction with the target molecule (Devidas and Ranawat, 2019). In particular, QSAR analysis can be employed to identify or assess the toxic potential of chemical compounds, in which toxicity initiated by a molecular initiating event proceeds through KEs to an adverse outcome; numerous exogenous substances interfere with hormonal systems in this way to produce a range of developmental, reproductive, neurologic, immune, or metabolic diseases (Vinken, 2016; Ciallella and Zhu, 2019; Schneider et al., 2019).
In classical QSAR, derived from a set of small molecules with similar target-specific biological activity, mathematical correlations and patterns have been analyzed to date using various descriptors related to the physiological activity of interest by applying multivariate analysis, multiple linear regression, partial least squares (PLS) regression, and so on (Stanton, 2012; Varmuza et al., 2013; Martinez-Lopez et al., 2017; Baskin II, 2018; Cruz et al., 2018; Nnadi et al., 2018). In these analyses, increasing the number of descriptors used as explanatory variables increases the coefficient of determination, R², and the apparent prediction accuracy. However, when there is strong linear dependence among the explanatory variables, the regression coefficients inflate and the estimation of the partial regression coefficients becomes unstable (Venkatraman et al., 2004; Basak et al., 2006). Thus, the influence of each explanatory variable on the objective variable may be unclear, and reliability may decrease (multicollinearity). To address this, feature selection, that is, selecting appropriate subsets of explanatory variables for the regression equation, is combined either with orthogonal transformations that aggregate strongly correlated explanatory variables, such as principal component regression and PLS regression, or with penalized regressions, such as ridge, lasso, and elastic net, that shrink the partial regression coefficients toward zero (Hemmateenejad and Yazdani, 2009; Eklund et al., 2014; Tsiliki et al., 2015; Al-Sha'er et al., 2016). In these cases, owing to bias in the estimation of the regression coefficients and the aggregation of variables, the original effect of an explanatory variable may become unclear, and predictive performance may deteriorate (Fearn et al., 2008). Furthermore, as a solution to the variable selection problem of finding a combination of variables that accurately explains the target variable among numerous candidate explanatory variables, a variable specification method that designates the optimal explanatory variables based on scientific theory or knowledge can be used (George, 2000). However, it is rare that the optimal explanatory variables can be specified in advance. In the best subset selection, or round-robin, method, in which all combinations of the explanatory variables are evaluated, there are 2^k − 1 combinations of k explanatory variables, resulting in a huge calculation cost (Noon et al., 2011; Saavedra et al., 2020). Additionally, there are sequential selection methods based on the usefulness of each univariate regression coefficient, including the forward-backward, forward, backward, and backward-forward stepwise selection methods, which increase or decrease the explanatory variables one at a time (Goodarzi et al., 2012; Fatima et al., 2018; Fatima et al., 2019; Hrynkiewicz et al., 2019; Fatima and Agarwal, 2020; McCann et al., 2020). However, because the effect of unselected explanatory variables is not corrected, technical and academic knowledge must be incorporated into the variable selection criteria.
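The following minimal sketch, using scikit-learn on synthetic data, illustrates how penalized regression stabilizes coefficient estimates under the multicollinearity described above; the data, penalty types, and penalty strengths are arbitrary choices for demonstration, not values from any cited study.

```python
# A minimal sketch of penalized regression as a remedy for
# multicollinearity among descriptors; the synthetic data are
# illustrative, not a real QSAR dataset.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
n, k = 100, 20
X = rng.normal(size=(n, k))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)  # a strongly collinear pair
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

for model in (LinearRegression(), Ridge(alpha=1.0),
              Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    # With collinear columns, ordinary least squares splits the true
    # coefficient unstably between columns 0 and 1; the penalized fits
    # shrink toward a stable (and, for lasso, sparse) solution.
    print(type(model).__name__, np.round(model.coef_[:2], 2))
```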

Machine learning
When complex factors involving numerous explanatory variables are confounded and nonlinear predictive models must be constructed, it is often difficult to perform accurate modeling using conventional statistical methods. In contrast, machine learning, which can learn from large amounts of data and automatically build models for classification and regression, is well suited to such problems (Idakwo et al., 2018; Achary, 2020; Lin et al., 2020; Muratov et al., 2020; Xiao et al., 2020; Yang et al., 2020). In machine learning algorithms, regularization that reduces extra features is performed internally so that the regression coefficients are stable. Furthermore, data loading, preprocessing, and calculation costs can be reduced, and by improving algorithm speed and identifying the most informative features, the interpretation and explanation of prediction results can be expected to improve. To date, machine learning algorithms, such as random forest (RF), support vector machine (SVM), eXtreme Gradient Boosting (Xgboost), Light Gradient Boosting Machine (LightGBM), Category Boosting (CatBoost), and neural networks (NNs), have been used to perform QSAR analysis (Zhang et al., 2018; Maltarollo et al., 2019; Sidorov et al., 2019; Uesawa, 2019a, 2020). The RF is an ensemble learning method based on bagging, in which many decision trees that divide a group by conditional branching are combined and each model is trained in parallel (Ghosh et al., 2020). For classification, RF outputs the majority vote; for regression, it outputs the average value. The output results are relatively easy to explain, and the overfitting characteristic of individual decision trees can be reduced. However, noise in the input data can easily cause over-learning, which reduces generalization performance and increases the calculation cost for complex datasets. An SVM performs two-class classification by constructing a classification boundary that maximizes the margin, the distance between the boundary and the support vectors, which are the training data located nearest the boundary (Cho et al., 2019). Because the classification accuracy does not depend on increases in the dimensionality of the feature space, the parameters can be optimized easily. However, since the calculation cost grows with the size of the training data, SVMs are difficult to apply to large-scale data, and it is also difficult to optimize the discriminant function for multiple classes. Further, Xgboost is a classifier that combines gradient boosting, an ensemble learning method built from multiple weak classifiers, with RF-style tree ensembles (Zopluoglu, 2019). When constructing a new weak learner, Xgboost reduces the weights of the data correctly classified by the previously constructed weak learners and increases the weights of the misclassified data, updating the construction of the weak learners serially. Because the model aggregates these weightings based on accuracy, highly accurate modeling can be expected. However, many parameters must be tuned to improve accuracy, and because boosting performs learning in series, the learning cost increases.
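As a hedged illustration of these algorithm families, the following scikit-learn sketch cross-validates a random forest, an SVM, and a gradient-boosting classifier on synthetic data; scikit-learn's GradientBoostingClassifier stands in here for Xgboost-style boosting, and all dataset and hyperparameter settings are illustrative.

```python
# A minimal sketch comparing a bagged tree ensemble, a max-margin
# classifier, and sequential boosting on a synthetic binary task.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=10, random_state=0)

models = {
    "RF (bagged trees, trained in parallel)":
        RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM (max-margin boundary)":
        SVC(kernel="rbf", C=1.0),
    "Boosting (sequential weak learners)":
        GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC = {scores.mean():.3f}")
```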
LightGBM is also an ensemble learning method that applies gradient boosting based on a decision tree algorithm, but it adopts the "Leaf-wise" method, in which the decision tree is grown leaf by leaf during training. This method has a shorter training time than the "Level-wise" method, in which the tree grows level by level, and it is expected to improve prediction accuracy because the structure of the decision tree generated by the "Leaf-wise" method is more complicated than that of the "Level-wise" method. However, overfitting is more likely to occur, so the hyperparameter adjustment needed to avoid this issue becomes complicated. Furthermore, whereas the conventional search for branching points in a decision tree requires reading all data points, LightGBM divides the features of the training data into bins and builds histograms, reducing the computational cost even for large-scale data. Additionally, CatBoost is also an ensemble learning method that uses gradient boosting decision trees, but preprocessing is unnecessary because it can absorb and process categorical, namely qualitative, variables internally (Romagnoni et al., 2019). Overfitting such as target leakage is expected to be reduced by randomly selecting subsets of the dataset, repeatedly calculating statistics from them, and computing substitute values. However, the calculation cost is high, and it is difficult to achieve prediction accuracy beyond what the training data supports. Furthermore, a simple perceptron, which is a type of NN, comprises two layers: an input layer and an output layer. Each input unit, called a neuron, corresponds to one explanatory variable and is multiplied by a weight indicating its coupling strength before being passed to the output layer (Laudani et al., 2015; Roudi and Taylor, 2015). In the output layer, the weighted sum plus a bias term, which adds variation to the data, is passed through a step function, an activation function that outputs 0 when the input is less than zero and 1 when the input is greater than or equal to zero, and the result is output. In training the NN, the correct answer rate can be expected to improve through parameter optimization by repeatedly adjusting the weights and biases. Moreover, the simple perceptron can be applied to linearly separable problems, in which two sets in an N-dimensional space can be separated by an (N − 1)-dimensional hyperplane, but nonlinearly separable problems could not be handled (Van Calster et al., 2006; Martinetz et al., 2009). The multi-layer perceptron adds a layer, also called the intermediate or hidden layer, between the input and output layers, based on the concept that complicated functions can be approximated by combining numerous simple neurons; it makes nonlinear separation possible by applying an activation function that performs a nonlinear transformation after the linear transformation in the intermediate layer, as sketched below.
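A minimal numpy sketch of the simple perceptron just described follows; the weights and bias are illustrative values chosen to solve the linearly separable AND problem, and the closing comment notes the XOR limitation that motivates the hidden layer.

```python
# A minimal numpy sketch of a simple perceptron: weighted inputs plus
# a bias passed through a step activation, as described in the text.
import numpy as np

def step(z):
    # step activation: 0 for z < 0, 1 for z >= 0
    return np.where(z < 0, 0, 1)

def perceptron(x, w, b):
    return step(np.dot(w, x) + b)

# AND gate, a linearly separable problem the perceptron can solve;
# the weights and bias are illustrative hand-picked values.
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(np.array(x), w, b))
# XOR is not linearly separable; a hidden layer with a nonlinear
# activation (a multi-layer perceptron) is needed for such cases.
```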
Furthermore, although the inability to update the coupling coefficients from the input layer to the intermediate layer had prevented learning such nonlinear separations, the error backpropagation method corrects the weights between the intermediate and output layers based on the error between the output value from the output layer and the correct answer, and then corrects the weights between the input and intermediate layers based on this corrected value; modifying the entire NN in this way at each step enables highly accurate nonlinear separation (Adigun and Kosko, 2019; Ko and Lee, 2020). In a multi-layered NN, the process of reducing the error, the difference between the predicted and actual values, becomes complicated; this problem can be addressed by applying the error backpropagation method together with stochastic gradient descent (Elfwing et al., 2015; Stromatias et al., 2017). With a sigmoid function or the like, there are regions where the rate of change of the error, called the gradient, is close to zero, so the weights are hardly modified. Because all gradients downstream from such a region also approach zero, the optimal solution is not obtained and learning stalls (the vanishing gradient problem) (Zhang, 2017). Moreover, since the gradient of each layer is propagated as a product, it can shrink drastically or grow explosively (the exploding gradient problem) as the number of layers increases, making the gradients of the NN unstable (the unstable gradient problem) (Zhao et al., 2012; Sun et al., 2020). Consequently, the training error on the training data can be reduced significantly while the generalization error on the test data cannot, resulting in overfitting. In addition, the true optimal solution may not be derived because the system converges to a local optimal solution, a partially optimal solution (the local solution problem), and the calculation cost increases (Fu et al., 2018). Therefore, a stacked autoencoder, which makes the number of neurons in the intermediate layer smaller than in the input/output layers and repeatedly performs dimensionality compression or feature extraction by retaining the most important information and discarding the rest, can be used to estimate good initial weights in advance (preliminary learning) by training the network to reproduce its own input as the output; this enabled deep learning with deep neural networks (DNNs) composed of many layers (Bengio et al., 2013).
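The vanishing gradient phenomenon described above can be made concrete with a small numerical example: the sigmoid's derivative never exceeds 0.25, so a product of per-layer gradients shrinks rapidly with depth. The depth and the saturating input value below are illustrative choices, not values from the cited works.

```python
# A minimal sketch of the vanishing gradient problem: gradients
# propagated as a product across layers shrink toward zero when the
# activation (here, a sigmoid) saturates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # maximum value 0.25, at z = 0

z = 4.0        # an input in the saturated region of the sigmoid
depth = 20     # an illustrative number of stacked layers
grad = np.prod([sigmoid_grad(z)] * depth)
print(f"product of {depth} sigmoid gradients at z={z}: {grad:.3e}")
# ReLU-style activations keep the gradient at 1 in the active region,
# which is one reason deeper networks became trainable.
```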

Deep learning
In conventional machine learning, the feature values of interest in the dataset were determined manually (feature extraction) based on knowledge and technology, and the model was designed by fitting the space represented by these feature values (the feature space) to the data distribution. In deep learning, using a convolutional neural network (CNN), it has become possible to automate this series of processes, from highly versatile feature extraction to model construction (Zhao et al., 2019; Heo et al., 2020). In addition, conventional machine learning made it possible to process large-scale labeled training data, but when building complex models, the relationship between data size and generalization performance eventually reaches a plateau. In deep learning, by contrast, it was demonstrated that generalization performance improves in proportion to data size, thereby enabling high-accuracy modeling from big data (Lee and Chen, 2020). In addition, the learning result of deep learning depends on the weights of the nodes, and it is difficult to theoretically interpret the accuracy or regularity of this result (the black box problem) (Ishida et al., 2019; Leming et al., 2020). Further, many issues concerning the construction and training of CNNs that are needed to obtain highly accurate results remain unclear, and it is relatively difficult to obtain large-scale labeled training data with little noise. Therefore, in the case of image data, high-accuracy modeling has been reported by applying CNN-based feature extraction to augmented datasets, in which the amount of data is increased by creating variants of the images through rotation, stretching, and so on.
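A minimal sketch of such image augmentation with the Pillow library is shown below; the input file name is a placeholder, and the specific rotation angles and stretch factors are arbitrary examples rather than settings used in any cited study.

```python
# A minimal sketch of data augmentation by rotation and stretching,
# expanding a labeled image dataset before CNN training.
from PIL import Image

img = Image.open("molecule.png")  # hypothetical input image

augmented = []
for angle in (90, 180, 270):
    augmented.append(img.rotate(angle))           # rotated variants
w, h = img.size
augmented.append(img.resize((int(w * 1.2), h)))   # horizontal stretch
augmented.append(img.resize((w, int(h * 1.2))))   # vertical stretch

for i, aug in enumerate(augmented):
    aug.save(f"molecule_aug_{i}.png")             # save each variant
```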

DeepSnap-Deep learning method
The performance of deep learning methods may depend on the quantity and quality of the input datasets used in training and prediction. However, established methods for preparing suitable input data from three-dimensional (3D) chemical structures have been lacking in deep learning. Therefore, as a new QSAR analysis method based on deep learning, the DeepSnap-Deep learning method was developed by Uesawa at Meiji Pharmaceutical University in 2018 (Uesawa, 2018). In this DeepSnap-Deep learning approach, the 3D-optimized molecular structures, which can be rotated at any arbitrary angle on the x-, y-, and z-axes, are photographed as ball-and-stick models with different colors representing the corresponding atoms, so that as much structural information as possible is automatically input into the deep learning models (Figure 1). Using these snapshot images as input data for deep learning, feature extraction through the DNN and model construction are performed automatically (Figure 2). For example, when the 3D molecular structure is rotated in 45° increments on the x-, y-, and z-axes and photographed, a total of 8 × 8 × 8 = 512 images are captured for each molecule and saved in the portable network graphics (PNG) format. This combines digital information on the 2D plane locations of the atoms with pixel-level data representing the three primary colors (RGB). Therefore, compared with deep learning through a DNN based on a 2D-graph structure, the extraction of a larger amount of information on the chemical structure can be expected. Moreover, this method can be used to predict the potential activity of many different chemicals toward various receptors without the extraction of descriptors. In the process of drawing the 3D chemical structure in the DeepSnap-Deep learning method, parameters such as the atoms, the colors of the atoms, the bond radius, and the pixel size can be adjusted (Matsuzaka and Uesawa, 2019b). By decreasing the drawing angle, the number of generated images per molecule can be increased, but model performance is not simply proportional to this number of images. The calculation cost, by contrast, is proportional to the number of generated images, which suggests that optimizing the number of generated images may minimize the calculation cost. In addition, the effects of protonation, the selection of stable structures in the 3D processing of the compound, and the background color of the generated images on model performance have also been illustrated (Matsuzaka and Uesawa, 2019a,b, 2020).
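The rotate-and-snapshot idea can be sketched as follows. This is not the DeepSnap implementation, only a toy illustration using numpy rotation matrices and matplotlib rendering, with hard-coded water-like coordinates standing in for a 3D-optimized structure; stepping each axis in 45° increments yields 8 × 8 × 8 = 512 views per molecule.

```python
# A minimal, illustrative sketch of generating rotated molecular
# snapshots as PNG files; coordinates, colors, and rendering are
# toy placeholders, not the DeepSnap tool itself.
import itertools
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

coords = np.array([[0.0, 0.0, 0.0],     # e.g., an oxygen atom
                   [0.96, 0.0, 0.0],    # and two hydrogens (water-like)
                   [-0.24, 0.93, 0.0]])
colors = ["red", "gray", "gray"]

def rot(axis, theta):
    # rotation matrix about the given axis (sign convention is
    # unimportant for this illustration)
    c, s = np.cos(theta), np.sin(theta)
    i, j = {"x": (1, 2), "y": (0, 2), "z": (0, 1)}[axis]
    m = np.eye(3)
    m[i, i], m[i, j], m[j, i], m[j, j] = c, -s, s, c
    return m

angles = np.deg2rad(np.arange(0, 360, 45))  # 8 steps per axis
for n, (ax_, ay, az) in enumerate(itertools.product(angles, repeat=3)):
    pts = coords @ rot("x", ax_) @ rot("y", ay) @ rot("z", az)
    fig = plt.figure(figsize=(2, 2))
    ax3d = fig.add_subplot(projection="3d")
    ax3d.scatter(pts[:, 0], pts[:, 1], pts[:, 2], c=colors, s=200)
    ax3d.set_axis_off()
    fig.savefig(f"snapshot_{n:03d}.png")    # 512 PNGs in total
    plt.close(fig)
```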

(Figure caption: SMILES-derived molecular structures are captured as images at arbitrary angles on the x-, y-, and z-axes, saved as portable network graphics (PNG) files, and split into three datasets for input to deep learning: training, validation, and test.)
Further, when compared with conventional machine learning methods, such as RF, Xgboost, LightGBM, CatBoost, and NN, that use molecular descriptors extracted with the descriptor-calculation software Mordred (Moriwaki et al., 2018), this DeepSnap-Deep learning method was observed to outperform them (Matsuzaka and Uesawa, 2019a,b, 2020). In the DeepSnap-Deep learning method, classification performance was evaluated using information retrieved from a confusion matrix at cut-off points calculated with the Youden index. Based on the sensitivity, the true positive rate, that is, the proportion of actual positives (true positives and false negatives) identified as positive, and the specificity, the true negative rate, that is, the proportion of actual negatives (true negatives and false positives) identified as negative, a confusion matrix of the predicted and experimentally defined labels was used to draw the receiver operating characteristic (ROC) curve and to calculate the area under the curve (AUC), balanced accuracy, F value, and Matthews correlation coefficient (MCC). Utilizing appropriate statistical measures enables the construction of prediction models with high performance. Thus, the DeepSnap-Deep learning method has the following advantages. First, the features in the molecular images can be extracted automatically by CNNs, and these features can be visualized by examining the convolutional areas of the network. Second, high prediction performance can be expected because more detailed information on the chemical structure is captured from different viewing directions along the x-, y-, and z-axes. Third, determining and visualizing the conformer docked in the ligand-binding domain of a protein may reveal the critical conformations of chemicals and protein domains related to the biological activity.
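A minimal sketch of this evaluation workflow with scikit-learn, using synthetic labels and scores, might look as follows; the Youden index here is the threshold maximizing sensitivity + specificity − 1 (equivalently, TPR − FPR), and all data are fabricated for illustration.

```python
# A minimal sketch of classification evaluation: ROC curve with AUC,
# a Youden-index cut-off, and threshold-based metrics from the
# resulting confusion matrix. Labels and scores are synthetic.
import numpy as np
from sklearn.metrics import (roc_curve, roc_auc_score,
                             balanced_accuracy_score, f1_score,
                             matthews_corrcoef)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(y_true * 0.4 + rng.normal(0.3, 0.25, size=200), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
youden = thresholds[np.argmax(tpr - fpr)]   # cut-off maximizing TPR - FPR
y_pred = (y_score >= youden).astype(int)

print("AUC:", round(roc_auc_score(y_true, y_score), 3))
print("Youden cut-off:", round(float(youden), 3))
print("Balanced accuracy:", round(balanced_accuracy_score(y_true, y_pred), 3))
print("F value:", round(f1_score(y_true, y_pred), 3))
print("MCC:", round(matthews_corrcoef(y_true, y_pred), 3))
```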
In the future, we hope that improved interpretation and explanation of the prediction results of the DeepSnap-Deep learning method will yield high-performance modeling and help elucidate the molecular mechanisms exerted by chemical substances at the cellular, organ, and organism levels.