
Florian Nigsch


florian.nigsch@novartis.com

Journal articles

2010
Sai Chetan K Sukuru, Florian Nigsch, Jean Quancard, Martin Renatus, Rajiv Chopra, Natasja Brooijmans, Dmitri Mikhailov, Zhan Deng, Allen Cornett, Jeremy L Jenkins, Ulrich Hommel, John W Davies, Meir Glick (2010)  A lead discovery strategy driven by a comprehensive analysis of proteases in the peptide substrate space.   Protein Sci 19: 11. 2096-2109 Nov  
Abstract: We present here a comprehensive analysis of proteases in the peptide substrate space and demonstrate its applicability for lead discovery. Aligned octapeptide substrates of 498 proteases taken from the MEROPS peptidase database were used for the in silico analysis. A multiple-category naïve Bayes model, trained on the two-dimensional chemical features of the substrates, was able to classify the substrates of 365 (73%) proteases and elucidate statistically significant chemical features for each of their specific substrate positions. The positional awareness of the method allows us to identify the most similar substrate positions between proteases. Our analysis reveals that proteases from different families, based on the traditional classification (aspartic, cysteine, serine, and metallo), could have substrates that differ at the cleavage site (P1-P1') but are similar away from it. Caspase-3 (cysteine protease) and granzyme B (serine protease) are previously known examples of cross-family neighbors identified by this method. To assess whether peptide substrate similarity between unrelated proteases could reliably translate into the discovery of low molecular weight synthetic inhibitors, a lead discovery strategy was tested on two other pairs of cross-family neighbors: cathepsin L2 and matrix metalloproteinase 9, and calpain 1 and pepsin A. For both these pairs, a naïve Bayes classifier model trained on inhibitors of one protease could successfully enrich those of its neighbor from a different family and vice versa, indicating that this approach could be prospectively applied to lead discovery for a novel protease target with no known synthetic inhibitors.
2009
Florian Nigsch, N J Maximilian Macaluso, John B O Mitchell, Donatas Zmuidinavicius (2009)  Computational toxicology: an overview of the sources of data and of modelling methods.   Expert Opin Drug Metab Toxicol 5: 1. 1-14 Jan  
Abstract: BACKGROUND: Toxicology has the goal of ensuring the safety of humans, animals and the environment. Computational toxicology is an area of active development and great potential. There are tangible reasons for the emerging interest in this discipline from academia, industry, regulatory bodies and governments. RESULTS: Pharmaceuticals, personal health care products, nutritional ingredients and products of the chemical industries are all potential hazards and need to be assessed. Toxicological tests for these products are costly, frequently use laboratory animals and are time-consuming. This delays end-user access to improved products or, conversely, the timely withdrawal of dangerous substances from the market. The aim of computational toxicology is to accelerate the assessment of potentially dangerous substances through in silico models. CONCLUSIONS: In this review, we provide an overview of the development of models for computational toxicology. Addressing the significant divide between the experimental and computational worlds (believed to be a prime hindrance to computational toxicology), we briefly consider the fundamental issue of toxicological data and the assays they stem from. Different kinds of models that can be built using such data are presented: computational filters, models for specific toxicological endpoints and tools for the generation of testable hypotheses.
2008
Florian Nigsch, John B O Mitchell (2008)  How to winnow actives from inactives: introducing molecular orthogonal sparse bigrams (MOSBs) and multiclass Winnow.   J Chem Inf Model 48: 2. 306-318 Feb  
Abstract: In the present paper we combine the Winnow algorithm and an advanced scheme for feature generation into a tool for multiclass classification. The Winnow algorithm, specifically designed in the late 1980s to work well with high-dimensional data, by design ignores most of the irrelevant features for the scoring of each single training/test case. To augment the pool of available molecular features we use the Winnow algorithm in conjunction with a process that creates additional features from a set of given ones. We adapt a technique formerly employed in text classification termed "orthogonal sparse bigrams" and extend the use of that method to the domain of cheminformatics. Using circular molecular fingerprints as initial features, we create "molecular orthogonal sparse bigrams" (MOSBs) and report their successful application to the task of classification of bioactive molecules. Additionally, we introduce a memory-efficient way of bagging individual classifiers, avoiding the need to hold the complete training data set in memory. To compare the performance of our method with published results, we use the Hert data set of 8293 active molecules in 11 classes. We compare our method to Random Forest and find that our method not only is comparable or better in classification accuracy (up to 50% higher in MCC [Matthews correlation coefficient], 98% higher in fraction of correct predictions) but also is quicker to train (by a factor between 2 and 18, depending on the feature generation), more memory efficient, and able to cope more easily with large data sets when we seeded the actives into a pool of 94290 inactive molecules. It is shown that this method can be used with different fingerprints.
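The multiplicative update at the heart of Winnow can be illustrated with a minimal sketch. This is not the authors' implementation: the class name MulticlassWinnow, the one-weight-table-per-class layout, the averaged score, and the values of alpha and threshold are all assumptions chosen for illustration, and the MOSB feature generation and bagging steps are not shown.

```python
class MulticlassWinnow:
    """Illustrative multiclass Winnow over sparse binary features (hypothetical)."""

    def __init__(self, classes, alpha=2.0, threshold=1.0):
        self.classes = list(classes)
        self.alpha = alpha          # promotion/demotion factor (assumed value)
        self.threshold = threshold  # decision threshold on the averaged score
        # One sparse weight table per class; unseen features weigh 1.0.
        self.weights = {c: {} for c in self.classes}

    def score(self, features, cls):
        # Only the features present in this molecule contribute to the score.
        w = self.weights[cls]
        return sum(w.get(f, 1.0) for f in features) / max(len(features), 1)

    def train_one(self, features, label):
        for cls in self.classes:
            s = self.score(features, cls)
            w = self.weights[cls]
            if cls == label and s <= self.threshold:
                # Missed a true member: promote the active features.
                for f in features:
                    w[f] = w.get(f, 1.0) * self.alpha
            elif cls != label and s > self.threshold:
                # False alarm: demote the active features.
                for f in features:
                    w[f] = w.get(f, 1.0) / self.alpha

    def predict(self, features):
        return max(self.classes, key=lambda c: self.score(features, c))


# Toy usage with made-up fingerprint feature names.
model = MulticlassWinnow(classes=["x", "y"])
data = [(("a", "b"), "x"), (("c", "d"), "y")]
for _ in range(2):
    for feats, label in data:
        model.train_one(feats, label)
```

Because weights change only for features that occur in a misclassified example, features absent from an example never enter its score or update.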
Laura D Hughes, David S Palmer, Florian Nigsch, John B O Mitchell (2008)  Why are some properties more difficult to predict than others? A study of QSPR models of solubility, melting point, and Log P.   J Chem Inf Model 48: 1. 220-232 Jan  
Abstract: This paper attempts to elucidate differences in QSPR models of aqueous solubility (Log S), melting point (Tm), and octanol-water partition coefficient (Log P), three properties of pharmaceutical interest. For all three properties, Support Vector Machine models using 2D and 3D descriptors calculated in the Molecular Operating Environment were the best models. Octanol-water partition coefficient was the easiest property to predict, as indicated by the RMSE of the external test set and the coefficient of determination (RMSE = 0.73, r2 = 0.87). Melting point prediction, on the other hand, was the most difficult (RMSE = 52.8 degrees C, r2 = 0.46), and Log S statistics were intermediate between melting point and Log P prediction (RMSE = 0.900, r2 = 0.79). The data imply that for all three properties the lack of measured values at the extremes is a significant source of error. This source, however, does not entirely explain the poor melting point prediction, and we suggest that deficiencies in descriptors used in melting point prediction contribute significantly to the prediction errors.
Edward O Cannon, Florian Nigsch, John B O Mitchell (2008)  A novel hybrid ultrafast shape descriptor method for use in virtual screening.   Chem Cent J 2: 02  
Abstract: BACKGROUND: We have introduced a new Hybrid descriptor composed of the MACCS key descriptor encoding topological information and Ballester and Richards' Ultrafast Shape Recognition (USR) descriptor. The latter one is calculated from the moments of the distribution of the interatomic distances, and in this work we also included higher moments than in the original implementation. RESULTS: The performance of this Hybrid descriptor is assessed using Random Forest and a dataset of 116,476 molecules. Our dataset includes 5,245 molecules in ten classes from the 2005 World Anti-Doping Agency (WADA) dataset and 111,231 molecules from the National Cancer Institute (NCI) database. In a 10-fold Monte Carlo cross-validation this dataset was partitioned into three distinct parts for training, optimisation of an internal threshold that we introduced, and validation of the resulting model. The standard errors obtained were used to assess statistical significance of observed improvements in performance of our new descriptor. CONCLUSION: The Hybrid descriptor was compared to the MACCS key descriptor, USR with the first three (USR), four (UF4) and five (UF5) moments, and a combination of MACCS with USR (three moments). The MACCS key descriptor was not combined with UF5, due to similar performance of UF5 and UF4. Superior performance in terms of all figures of merit was found for the MACCS/UF4 Hybrid descriptor with respect to all other descriptors examined. These figures of merit include recall in the top 1% and top 5% of the ranked validation sets, precision, F-measure, area under the Receiver Operating Characteristic curve and Matthews Correlation Coefficient.
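As a rough sketch of the shape part of such a descriptor, the code below computes moments of atomic distance distributions from the four reference points used by Ultrafast Shape Recognition (centroid, atom closest to the centroid, atom farthest from the centroid, and atom farthest from that atom). The function names and the use of plain central moments are assumptions; published implementations normalise the higher moments differently, and the MACCS part of the Hybrid descriptor is not shown.

```python
import math


def _moments(values, n_moments):
    """Mean followed by central moments of order 2..n_moments (assumed form)."""
    mean = sum(values) / len(values)
    out = [mean]
    for order in range(2, n_moments + 1):
        out.append(sum((v - mean) ** order for v in values) / len(values))
    return out


def usr_descriptor(coords, n_moments=3):
    """Return 4 * n_moments shape values for a list of (x, y, z) atom positions."""
    n = len(coords)
    centroid = tuple(sum(c[i] for c in coords) / n for i in range(3))
    d_centroid = [math.dist(c, centroid) for c in coords]
    closest = coords[d_centroid.index(min(d_centroid))]
    farthest = coords[d_centroid.index(max(d_centroid))]
    d_farthest = [math.dist(c, farthest) for c in coords]
    far_from_far = coords[d_farthest.index(max(d_farthest))]

    descriptor = []
    for ref in (centroid, closest, farthest, far_from_far):
        distances = [math.dist(c, ref) for c in coords]
        descriptor.extend(_moments(distances, n_moments))
    return descriptor


# Toy coordinates, not a real molecule.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
usr3 = usr_descriptor(coords, n_moments=3)  # three moments per reference point
uf4 = usr_descriptor(coords, n_moments=4)   # four moments per reference point
```

With three moments this gives 12 values per molecule, with four moments 16, and with five moments 20.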
Florian Nigsch, John B O Mitchell (2008)  Toxicological relationships between proteins obtained from protein target predictions of large toxicity databases.   Toxicol Appl Pharmacol 231: 2. 225-234 Sep  
Abstract: The combination of models for protein target prediction with large databases containing toxicological information for individual molecules allows the derivation of "toxicological" profiles, i.e., to what extent are molecules of known toxicity predicted to interact with a set of protein targets. To predict protein targets of drug-like and toxic molecules, we built a computational multiclass model using the Winnow algorithm based on a dataset of protein targets derived from the MDL Drug Data Report. A 15-fold Monte Carlo cross-validation using 50% of each class for training, and the remaining 50% for testing, provided an assessment of the accuracy of that model. We retained the 3 top-ranking predictions and found that in 82% of all cases the correct target was predicted within these three predictions. The first prediction was the correct one in almost 70% of cases. A model built on the whole protein target dataset was then used to predict the protein targets for 150000 molecules from the MDL Toxicity Database. We analysed the frequency of the predictions across the panel of protein targets for experimentally determined toxicity classes of all molecules. This allowed us to identify clusters of proteins related by their toxicological profiles, as well as toxicities that are related. Literature-based evidence is provided for some specific clusters to show the relevance of the relationships identified.
Noel M O'Boyle, David S Palmer, Florian Nigsch, John B O Mitchell (2008)  Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction.   Chem Cent J 2: 10  
Abstract: BACKGROUND: We present a novel feature selection algorithm, Winnowing Artificial Ant Colony (WAAC), that performs simultaneous feature selection and model parameter optimisation for the development of predictive quantitative structure-property relationship (QSPR) models. The WAAC algorithm is an extension of the modified ant colony algorithm of Shen et al. (J Chem Inf Model 2005, 45: 1024-1029). We test the ability of the algorithm to develop a predictive partial least squares model for the Karthikeyan dataset (J Chem Inf Model 2005, 45: 581-590) of melting point values. We also test its ability to perform feature selection on a support vector machine model for the same dataset. RESULTS: Starting from an initial set of 203 descriptors, the WAAC algorithm selected a PLS model with 68 descriptors which has an RMSE on an external test set of 46.6 degrees C and R2 of 0.51. The number of components chosen for the model was 49, which was close to optimal for this feature selection. The selected SVM model has 28 descriptors (cost of 5, epsilon of 0.21) and an RMSE of 45.1 degrees C and R2 of 0.54. This model outperforms a kNN model (RMSE of 48.3 degrees C, R2 of 0.47) for the same data and has similar performance to a Random Forest model (RMSE of 44.5 degrees C, R2 of 0.55). However it is much less prone to bias at the extremes of the range of melting points as shown by the slope of the line through the residuals: -0.43 for WAAC/SVM, -0.53 for Random Forest. CONCLUSION: With a careful choice of objective function, the WAAC algorithm can be used to optimise machine learning and regression models that suffer from overfitting. Where model parameters also need to be tuned, as is the case with support vector machine and partial least squares models, it can optimise these simultaneously. The moving probabilities used by the algorithm are easily interpreted in terms of the best and current models of the ants, and the winnowing procedure promotes the removal of irrelevant descriptors.
Florian Nigsch, Andreas Bender, Jeremy L Jenkins, John B O Mitchell (2008)  Ligand-target prediction using Winnow and naive Bayesian algorithms and the implications of overall performance statistics.   J Chem Inf Model 48: 12. 2313-2325 Dec  
Abstract: We compared two algorithms for ligand-target prediction, namely, the Laplacian-modified Bayesian classifier and the Winnow algorithm. A dataset derived from the WOMBAT database, spanning 20 pharmaceutically relevant activity classes with 13 000 compounds, was used for performance assessment in 24 different experiments, each of which was assessed using a 15-fold Monte Carlo cross-validation. Compounds were described by different circular fingerprints, ECFP_4 and MOLPRINT 2D. A detailed analysis of the resulting approximately 2.4 million predictions led to very similar measures for overall accuracy for both classifiers, whereas we observed significant differences for individual activity classes. Moreover, we analyzed our data with respect to the numbers of compounds which are exclusively retrieved by either of the algorithms, but never by the other, or by neither of them. This provided detailed information that can never be obtained by considering the overall performance statistics alone.
2007
Florian Nigsch, Werner Klaffke, Silvia Miret (2007)  In vitro models for processes involved in intestinal absorption.   Expert Opin Drug Metab Toxicol 3: 4. 545-556 Aug  
Abstract: The abundance of different techniques and protocols available reflects the need for reliable in vitro methods to assess intestinal absorption of potentially bioactive compounds. Physicochemical assays try to pinpoint the molecular properties contributing to the absorption process. The end points of biologically based methods, such as cell cultures and excised tissues, account for all processes undergone by a molecule that traverses a 'living' biological membrane, a cell or tissue. On top of fundamental physical processes (e.g., solubility, diffusion) such biological methods incorporate physiological responses such as active transport and metabolism. In this review, an account of in vitro methods for the assessment of molecular properties (lipophilicity, solubility, permeability) influencing intestinal absorption is given. Their advantages and limitations and the possibilities offered by this area of research are also evaluated. The combination of results from both classes of assays (physicochemical and biological) and integration with computational models will guide future developments in this field. Finally, possible future developments including stem cell research and multiple-end point assays are discussed.
2006
Florian Nigsch, Andreas Bender, Bernd van Buuren, Jos Tissen, Eduard Nigsch, John B O Mitchell (2006)  Melting point prediction employing k-nearest neighbor algorithms and genetic parameter optimization.   J Chem Inf Model 46: 6. 2412-2422 Nov/Dec  
Abstract: We have applied the k-nearest neighbor (kNN) modeling technique to the prediction of melting points. A data set of 4119 diverse organic molecules (data set 1) and an additional set of 277 drugs (data set 2) were used to compare performance in different regions of chemical space, and we investigated the influence of the number of nearest neighbors using different types of molecular descriptors. To compute the prediction on the basis of the melting temperatures of the nearest neighbors, we used four different methods (arithmetic and geometric average, inverse distance weighting, and exponential weighting), of which the exponential weighting scheme yielded the best results. We assessed our model via a 25-fold Monte Carlo cross-validation (with approximately 30% of the total data as a test set) and optimized it using a genetic algorithm. Predictions for drugs based on drugs (separate training and test sets each taken from data set 2) were found to be considerably better [root-mean-squared error (RMSE)=46.3 degrees C, r2=0.30] than those based on nondrugs (prediction of data set 2 based on the training set from data set 1, RMSE=50.3 degrees C, r2=0.20). The optimized model yields an average RMSE as low as 46.2 degrees C (r2=0.49) for data set 1, and an average RMSE of 42.2 degrees C (r2=0.42) for data set 2. It is shown that the kNN method inherently introduces a systematic error in melting point prediction. Much of the remaining error can be attributed to the lack of information about interactions in the liquid state, which are not well-captured by molecular descriptors.
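The four ways of combining the neighbors' melting temperatures named in the abstract can be sketched as follows. The exact weighting formulas are not given here, so the inverse-distance and exponential weights below (1/d and exp(-d)) are assumed textbook forms, and the function name and input layout are hypothetical.

```python
import math


def knn_predict(neighbors, scheme="arithmetic"):
    """Combine (distance, melting_point) pairs of the k nearest neighbors.

    Melting points are assumed to be in kelvin so that the geometric
    average is defined (all values positive).
    """
    dists = [d for d, _ in neighbors]
    temps = [t for _, t in neighbors]

    if scheme == "arithmetic":
        return sum(temps) / len(temps)
    if scheme == "geometric":
        return math.exp(sum(math.log(t) for t in temps) / len(temps))
    if scheme == "inverse_distance":
        # Small constant avoids division by zero for identical molecules.
        weights = [1.0 / (d + 1e-9) for d in dists]
    elif scheme == "exponential":
        weights = [math.exp(-d) for d in dists]
    else:
        raise ValueError(f"unknown scheme: {scheme}")

    return sum(w * t for w, t in zip(weights, temps)) / sum(weights)


# Two made-up neighbors: (descriptor-space distance, melting point in K).
neighbors = [(1.0, 300.0), (2.0, 400.0)]
predictions = {
    scheme: knn_predict(neighbors, scheme)
    for scheme in ("arithmetic", "geometric", "inverse_distance", "exponential")
}
```

In this toy case both weighted schemes pull the prediction towards the closer neighbor, with the exponential form (as assumed here) doing so more strongly than inverse distance.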