Abstract: Introduction: The goal of early predictive safety assessment (PSA) is to keep compounds with detectable liabilities from progressing further in the pipeline. Such compounds jeopardize the core of pharmaceutical research and development and limit the timely delivery of innovative therapeutics to the patient. Computational methods are increasingly used to help understand observed data, generate new testable hypotheses of relevance to safety pharmacology, and supplement and replace costly and time-consuming experimental procedures. Areas covered: The authors survey methods operating on different scales of both physical extension and complexity. After discussing methods used to predict liabilities associated with structures of individual compounds, the article reviews the use of adverse event data and safety profiling panels. Finally, the authors examine the complexities of toxicology data from animal experiments and how these data can be mined. Expert opinion: A significant obstacle for data-driven safety assessment is the absence of integrated data sets due to a lack of sharing of data and of using standard ontologies for data relevant to safety assessment. Informed decisions to derive focused sets of compounds can help to avoid compound liabilities in screening campaigns, and improved hit assessment of such campaigns can benefit the early termination of undesirable compounds.
Abstract: Computational methods play an ever increasing role in lead finding. A vast repertoire of molecular design and virtual screening methods emerged in the past two decades and are today routinely used. There is increasing awareness that there is no single best computational protocol and correspondingly there is a shift recommending the combination of complementary methods. A promising trend for the application of computational methods in lead finding is to take advantage of the vast amounts of HTS (High Throughput Screening) data to allow lead assessment by detailed systems-based data analysis, especially for phenotypic screens where the identification of compound-target pairs is the primary goal. Herein, we review trends and provide examples of successful applications of computational methods in lead finding.
Abstract: Given the tremendous growth of bioactivity databases, the use of computational tools to predict protein targets of small molecules has been gaining importance in recent years. Applications span a wide range, from the 'designed polypharmacology' of compounds to mode-of-action analysis. In this review, we firstly survey databases that can be used for ligand-based target prediction and which have grown tremendously in size in the past. We furthermore outline methods for target prediction that exist, both based on the knowledge of bioactivities from the ligand side and methods that can be applied in situations when a protein structure is known. Applications of successful in silico target identification attempts are discussed in detail, which were based partly or in whole on computational target predictions in the first instance. This includes the authors' own experience using target prediction tools, in this case considering phenotypic antibacterial screens and the analysis of high-throughput screening data. Finally, we will conclude with the prospective application of databases to not only predict, retrospectively, the protein targets of a small molecule, but also how to design ligands with desired polypharmacology in a prospective manner.
Abstract: From a medicinal chemistry point of view, one of the primary goals of high throughput screening (HTS) hit list assessment is the identification of chemotypes with an informative structure-activity relationship (SAR). Such chemotypes may enable optimization of the primary potency, as well as selectivity and pharmacokinetic properties. A common way to prioritize them is molecular clustering of the hits. Typical clustering techniques, however, rely on a general notion of chemical similarity or standard rules of scaffold decomposition and are thus insensitive to molecular features that are enriched in biologically active compounds. This hinders SAR analysis, because compounds sharing the same pharmacophore might not end up in the same cluster and thus are not directly compared to each other by the medicinal chemist. Similarly, common chemotypes that are not related to activity may contaminate clusters, distracting from important chemical motifs. We combined molecular similarity and Bayesian models and introduce (I) a robust, activity-aware clustering approach and (II) a feature mapping method for the elucidation of distinct SAR determinants in polypharmacologic compounds. We evaluated the method on 462 dose-response assays from the PubChem BioAssay repository. Activity-aware clustering grouped compounds sharing molecular cores that were specific for the target or pathway at hand, rather than grouping inactive scaffolds commonly found in compound series. We also found many of these core structures in literature that discussed SARs of the respective targets. A numerical comparison of cores allowed for identification of the structural prerequisites for polypharmacology, i.e., distinct bioactive regions within a single compound, and pointed toward selectivity-conferring medchem strategies. The method presented here is generally applicable to any type of activity data and may help bridge the gap between hit list assessment and designing a medchem strategy.
Abstract: We present here a comprehensive analysis of proteases in the peptide substrate space and demonstrate its applicability for lead discovery. Aligned octapeptide substrates of 498 proteases taken from the MEROPS peptidase database were used for the in silico analysis. A multiple-category naïve Bayes model, trained on the two-dimensional chemical features of the substrates, was able to classify the substrates of 365 (73%) proteases and elucidate statistically significant chemical features for each of their specific substrate positions. The positional awareness of the method allows us to identify the most similar substrate positions between proteases. Our analysis reveals that proteases from different families, based on the traditional classification (aspartic, cysteine, serine, and metallo), could have substrates that differ at the cleavage site (P1-P1') but are similar away from it. Caspase-3 (cysteine protease) and granzyme B (serine protease) are previously known examples of cross-family neighbors identified by this method. To assess whether peptide substrate similarity between unrelated proteases could reliably translate into the discovery of low molecular weight synthetic inhibitors, a lead discovery strategy was tested on two other pairs of cross-family neighbors, namely cathepsin L2 and matrix metalloproteinase 9, and calpain 1 and pepsin A. For both these pairs, a naïve Bayes classifier model trained on inhibitors of one protease could successfully enrich those of its neighbor from a different family and vice versa, indicating that this approach could be prospectively applied to lead discovery for a novel protease target with no known synthetic inhibitors.
Abstract: We present a workflow that leverages data from chemogenomics based target predictions with Systems Biology databases to better understand off-target related toxicities. By analyzing a set of compounds that share a common toxic phenotype and by comparing the pathways they affect with pathways modulated by nontoxic compounds we are able to establish links between pathways and particular adverse effects. We further link these predictive results with literature data in order to explain why a certain pathway is predicted. Specifically, relevant pathways are elucidated for the side effects rhabdomyolysis and hypotension. Prospectively, our approach is valuable not only to better understand toxicities of novel compounds early on but also for drug repurposing exercises to find novel uses for known drugs.
Abstract: Different molecular descriptors capture different aspects of molecular structures, but this effect has not yet been quantified systematically on a large scale. In this work, we calculate the similarity of 37 descriptors by repeatedly selecting query compounds and ranking the rest of the database. Euclidean distances between the rank-ordering of different descriptors are calculated to determine descriptor (as opposed to compound) similarity, followed by PCA for visualization. Four broad descriptor classes are identified, which are circular fingerprints; circular fingerprints considering counts; path-based and keyed fingerprints; and pharmacophoric descriptors. Descriptor behavior is much more defined by those four classes than the particular parametrization. Using counts instead of the presence/absence of fingerprints significantly changes descriptor behavior, which is crucial for performance of topological autocorrelation vectors, but not circular fingerprints. Four-point pharmacophores (piDAPH4) surprisingly lead to much higher retrieval rates than three-point pharmacophores (28.21% vs 19.15%) but still similar rank-ordering of compounds (retrieval of similar actives). Looking into individual rankings, circular fingerprints seem more appropriate than path-based fingerprints if complex ring systems or branching patterns are present; count-based fingerprints could be more suitable in databases with a large number of repeated subunits (amide bonds, sugar rings, terpenes). Information-based selection of diverse fingerprints for consensus scoring (ECFP4/TGD fingerprints) led only to marginal improvement over single fingerprint results. While it seems to be nontrivial to exploit orthogonal descriptor behavior to improve retrieval rates in consensus virtual screening, those descriptors still each retrieve different actives which corroborates the strategy of employing diverse descriptors individually in prospective virtual screening settings.
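The rank-based comparison of descriptors described in the abstract above can be illustrated with a minimal sketch. It assumes that each descriptor has already produced one similarity score per database compound for a given query; the function names `ranks` and `rank_distance` are hypothetical and not taken from the paper, and the PCA step is omitted.

```python
import math

def ranks(scores):
    """Return the rank of each database compound (0 = most similar to the query)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    rank = [0] * len(scores)
    for position, idx in enumerate(order):
        rank[idx] = position
    return rank

def rank_distance(scores_a, scores_b):
    """Euclidean distance between the rank-orderings induced by two descriptors.

    scores_a and scores_b are similarity scores of the same database compounds
    to the same query, computed with descriptor A and descriptor B (assumed inputs).
    """
    ra, rb = ranks(scores_a), ranks(scores_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ra, rb)))
```

A distance of zero means that the two descriptors rank the database identically for that query; averaging the distance over many queries would give one entry of a descriptor-by-descriptor distance matrix.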
Abstract: The elucidation of drug targets is important both to optimize desired compound action and to understand drug side-effects. In this study, we created statistical models which link chemical substructures of ligands to protein domains in a probabilistic manner and employed the model to triage the results of affinity chromatography experiments. By annotating targets with their InterPro domains, general rules of ligand-protein domain associations were derived and successfully employed to predict protein targets outside the scope of the training set. This methodology was then tested on a proteomics affinity chromatography data set containing 699 compounds. The domain prediction model correctly detected 31.6% of the experimental targets at a specificity of 46.8%. This is striking since 86% of the predicted targets are not part of the training set (but share InterPro domains with training-set targets), and thus could not have been predicted by conventional target prediction approaches. Target predictions improve drastically when significance (FDR) scores for target pulldowns are employed, emphasizing their importance for eliminating artifacts. Filament proteins (such as actin and tubulin) are detected to be 'frequent hitters' in proteomics experiments and their presence in pulldowns is not supported by the target predictions. On the other hand, membrane-bound receptors such as serotonin and dopamine receptors are noticeably absent in the affinity chromatography sets, although their presence would be expected from the predicted targets of compounds. While this can partly be explained by the experimental setup, we suggest the computational methods employed here as a complementary step of identifying protein targets of small molecules.
Affinity chromatography results for gefitinib are discussed in detail: two of the three kinases with the highest affinity for gefitinib in biochemical assays are detected by affinity chromatography, and the method additionally suggests the possible involvement of NSF as a target for modulating cancer progression via beta-arrestin.
Abstract: We present a novel method to better investigate adverse drug reactions in chemical space. By integrating data sources about adverse drug reactions of drugs with an established cheminformatics modeling method, we generate a data set that is then visualized with a systems biology tool. Thereby new insights into undesired drug effects are gained. In this work, we present a global analysis linking chemical features to adverse drug reactions.
Abstract: Typically, screening collections of pharmaceutical companies contain more than a million compounds today. However, for certain high-throughput screening (HTS) campaigns, constraints posed by the assay throughput and/or the reagent costs make it impractical to screen the entire deck. Therefore, it is desirable to effectively screen subsets of the collection based on a hypothesis or a diversity selection. How to select compound subsets is a subject of ongoing debate. The authors present an approach based on extended connectivity fingerprints to carry out diversity selection on a per plate basis (instead of a per compound basis). HTS data from 35 Novartis screens spanning 5 target classes were investigated to assess the performance of this approach. The analysis shows that selecting a fingerprint-diverse subset of 250K compounds, representing 20% of the screening deck, would have achieved significantly higher hit rates for 86% of the screens. This measure also outperforms the Murcko scaffold-based plate selection described previously, where only 49% of the screens showed similar improvements. Strikingly, the 2-fold improvement in average hit rates observed for 3 of 5 target classes in the data set indicates a target bias of the plate (and thus compound) selection method. Even though the diverse subset selection lacks any target hypothesis, its application shows significantly better results for some targets (namely, G-protein-coupled receptors, proteases, and protein-protein interactions) but not for kinase and pathway screens. The synthetic origin of the compounds in the diverse subset appears to influence the screening hit rates. Natural products were the most diverse compound class, with significantly higher hit rates compared to the compounds from the traditional synthetic and combinatorial libraries. These results offer empirical guidelines for plate-based diversity selection to enhance hit rates, based on target class and the library type being screened.
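The idea of selecting whole plates rather than individual compounds can be sketched as follows. The abstract above does not state the selection algorithm, so the greedy coverage heuristic here is purely an assumption for illustration: each plate is represented by the union of the fingerprint features of its compounds, and plates are picked one at a time by how many not-yet-covered features they add. The function name `select_diverse_plates` and the input layout are hypothetical.

```python
def select_diverse_plates(plates, n_select):
    """Greedily pick plates that add the most new fingerprint features.

    plates: dict mapping plate id -> set of feature ids (the union over all
            compounds on that plate; an assumed, simplified plate fingerprint).
    n_select: number of plates to pick.
    NOTE: greedy feature coverage is an illustrative assumption, not the
    published selection procedure.
    """
    covered = set()
    chosen = []
    remaining = dict(plates)
    while remaining and len(chosen) < n_select:
        # Iterate over sorted ids so that ties are broken deterministically.
        best = max(sorted(remaining), key=lambda p: len(remaining[p] - covered))
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen
```

For example, with plates `{"P1": {1, 2, 3}, "P2": {1, 2}, "P3": {4, 5}}` and `n_select=2`, the sketch picks `P1` first and then `P3`, because `P2` adds no features beyond those already covered by `P1`.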
Abstract: High-throughput screening (HTS) is a well-established hit-finding approach used in the pharmaceutical industry. In this article, recent experience at Novartis with respect to factors influencing the success of HTS campaigns is discussed. An inherent measure of HTS quality could be defined by the assay Z and Z' factors, the number of hits and their biological potencies; however, such measures of quality do not always correlate with the advancement of hits to the later stages of drug discovery. Also, for many target classes, such as kinases, it is easy to identify hits, but, as a result of selectivity, intellectual property and other issues, the projects do not result in lead declarations. In this article, HTS success is defined as the fraction of HTS campaigns that advance into the later stages of drug discovery, and the major influencing factors are examined. Interestingly, screening compounds in individual wells or in mixtures did not have a major impact on the HTS success and, equally interesting, there was no difference in the progression rates of biochemical and cell-based assays. Particular target types, assay technologies, structure-activity relationships and powder availability had a much greater impact on success as defined above. In addition, significant mutual dependencies can be observed - while one assay format works well with one target type, this situation might be completely reversed for a combination of the same readout technology with a different target type. The results and opinions presented here should be regarded as groundwork, and a plethora of factors that influence the fate of a project, such as biophysical measurements, chemical attractiveness of the hits, strategic reasons and safety pharmacology, are not covered here. Nonetheless, it is hoped that this information will be used industry-wide to improve success rates in terms of hits progressing into exploratory chemistry and beyond. 
The support that can be obtained from new in silico approaches to phase transitions is also described, along with the gaps they are designed to fill.
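The HTS abstract above refers to the assay Z' factor without defining it. As a minimal sketch, the widely used definition is Z' = 1 - 3(σ_pos + σ_neg) / |μ_pos - μ_neg|, computed from the positive and negative control wells; the function name `z_prime` and the use of sample standard deviations are choices made here for illustration.

```python
from statistics import mean, stdev

def z_prime(positive_controls, negative_controls):
    """Z' factor from control-well readouts (sample standard deviations assumed)."""
    mu_p, mu_n = mean(positive_controls), mean(negative_controls)
    sd_p, sd_n = stdev(positive_controls), stdev(negative_controls)
    return 1.0 - 3.0 * (sd_p + sd_n) / abs(mu_p - mu_n)
```

The Z factor has the same form but uses the library sample wells in place of one of the control populations. Values close to 1 indicate a wide, well-separated assay window.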
Abstract: In this work we explore the possibilities of using fragment-based screening data to prioritize compounds from a full HTS library, a method we call virtual fragment linking (VFL). The ability of VFL to identify compounds of nanomolar potency based on micromolar fragment binding data was tested on 75 target classes from the WOMBAT database and succeeded in 57 cases. Further, the method was demonstrated for seven drug targets from in-house screening programs that performed both FBS of 8800 fragments and screens of the full library. VFL captured between 28% and 67% of the hits (IC50 < 10 µM) in the top 5% of the ranked library for four of the targets (enrichment between 5-fold and 13-fold). Our findings lead us to conclude that proper coverage of chemical space by the fragment library is crucial for the VFL methodology to be successful in prioritizing HTS libraries from fragment-based screening data.
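The enrichment figures quoted above follow from dividing the fraction of hits captured by the fraction of the library inspected: 28% / 5% = 5.6 and 67% / 5% = 13.4, consistent with the reported 5-fold to 13-fold range. A minimal sketch of that calculation, assuming a ranked list of hit flags as input (the function name `enrichment_factor` is hypothetical):

```python
def enrichment_factor(ranked_is_hit, top_fraction):
    """Return (fraction of hits captured, fold enrichment) for the top of a ranked list.

    ranked_is_hit: list of booleans, best-ranked compound first.
    top_fraction: fraction of the ranked library that is inspected, e.g. 0.05.
    """
    n = len(ranked_is_hit)
    n_top = max(1, round(n * top_fraction))
    total_hits = sum(ranked_is_hit)
    if total_hits == 0:
        raise ValueError("no hits in the ranked list")
    captured = sum(ranked_is_hit[:n_top]) / total_hits
    return captured, captured / (n_top / n)
```

An enrichment of 1 corresponds to random ordering; the maximum achievable value at a 5% cutoff is 20.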
Abstract: Human CD8 is a T cell coreceptor, which binds to pHLA I and plays a pivotal role in the activation of cytotoxic T lymphocytes. Soluble recombinant CD8αα has been shown to antagonize T cell activation, both in vitro and in vivo. However, because of a very low affinity for pHLA I, high concentrations of soluble CD8αα are required for efficient inhibition. Based upon our knowledge of the wild-type CD8/pHLA I structure, we have designed and produced a mutated form of soluble CD8αα that binds to pHLA I with approximately fourfold higher affinity. We have characterized the binding of the high affinity CD8 mutant using surface plasmon resonance and determined its structure at 2.1 Å resolution using X-ray crystallography. The analysis of this structure suggests that the higher affinity is achieved by providing a larger side chain that allows for an optimal contact to be made between the HLA α3 loop and the mutated CDR-like loops of CD8.
Abstract: Development of a pharmacophore hypothesis related to small-molecule activity is pivotal to chemical optimization of a series, since it defines features beneficial or detrimental to activity. Although crystal structures may provide detailed 3D interaction information for one molecule with its receptor, docking a different ligand to that model often leads to unreliable results due to protein flexibility. Graham Richards' lab was one of the first groups to utilize "fuzzy" pattern recognition algorithms taken from the field of image processing to solve problems in protein modeling. Thus, descriptor "fuzziness" was partly able to emulate conformational flexibility of the target while simultaneously enhancing the speed of the search. In this work, we extend these developments to a ligand-based method for describing and aligning molecules in flexible chemical space termed FEature POint PharmacophoreS (FEPOPS), which allows exploration of dynamic biological space. We develop a novel, combinatorial algorithm for molecular comparisons and evaluate it using the WOMBAT dataset. The new approach shows retrospective virtual screening performance superior to that of earlier shape-based or charge-based algorithms. Additionally, we use target prediction to evaluate how FEPOPS alignments match the molecules' biological activity by identifying the atoms and features that make the key contributions to overall chemical similarity. Overall, we find that FEPOPS are sufficiently fuzzy and flexible to find not only new ligand scaffolds, but also challenging molecules that occupy different conformational states of dynamic biological space, such as those arising from induced fit.
Abstract: Preclinical Safety Pharmacology (PSP) attempts to anticipate adverse drug reactions (ADRs) during early phases of drug discovery by testing compounds in simple, in vitro binding assays (that is, preclinical profiling). The selection of PSP targets is based largely on circumstantial evidence of their contribution to known clinical ADRs, inferred from findings in clinical trials, animal experiments, and molecular studies going back more than forty years. In this work we explore PSP chemical space and its relevance for the prediction of adverse drug reactions. Firstly, in silico (computational) Bayesian models for 70 PSP-related targets were built, which are able to detect 93% of the ligands binding at IC50 ≤ 10 µM at an overall correct classification rate of about 94%. Secondly, employing the World Drug Index (WDI), a model for adverse drug reactions was built directly based on normalized side-effect annotations in the WDI, which does not require any underlying functional knowledge. This is, to our knowledge, the first attempt to predict adverse drug reactions across hundreds of categories from chemical structure alone. On average 90% of the adverse drug reactions observed with known, clinically used compounds were detected, at an overall correct classification rate of 92%. Drugs withdrawn from the market (Rapacuronium, Suprofen) were tested in the model and their predicted ADRs align well with known ADRs. The analysis was repeated for acetylsalicylic acid and Benperidol which are still on the market. Importantly, features of the models are interpretable and back-projectable to chemical structure, raising the possibility of rationally engineering out adverse effects. By combining PSP and ADR models new hypotheses linking targets and adverse effects can be proposed and examples for the opioid mu and the muscarinic M2 receptors, as well as for cyclooxygenase-1 are presented.
It is hoped that predictive models for adverse drug reactions can support early SAR, accelerating drug discovery and decreasing late-stage attrition in drug discovery projects. In addition, models such as the ones presented here can be used for compound profiling in all development stages.
Abstract: High throughput screening (HTS) data is often noisy, containing both false positives and negatives. Thus, careful triaging and prioritization of the primary hit list can save time and money by identifying potential false positives before incurring the expense of follow-up. Of particular concern are cell-based reporter gene assays (RGAs) where the number of hits may be too high to be scrutinized manually for erroneous data. Based on statistical models built from chemical structures of 650,000 compounds tested in RGAs, we created "frequent hitter" models that make it possible to prioritize potential false positives. Furthermore, we followed up the frequent hitter evaluation with chemical structure based in silico target predictions to hypothesize a mechanism for the observed "off target" response. It was observed that the predicted cellular targets for the frequent hitters were known to be associated with undesirable effects such as cytotoxicity. More specifically, the most frequently predicted targets relate to apoptosis and cell differentiation, including kinases, topoisomerases, and protein phosphatases. The mechanism-based frequent hitter hypothesis was tested using 160 additional druglike compounds predicted by the model to be nonspecific actives in RGAs. This validation was successful (showing a 50% hit rate compared to a normal hit rate as low as 2%), and it demonstrates the power of computational models toward understanding complex relations between chemical structure and biological function.
Abstract: This work describes a novel semi-sequential technique for in silico enhancement of high-throughput screening (HTS) experiments now employed at Novartis. It is used in situations in which the size of the screen is limited by the readout (e.g., high-content screens) or the amount of reagents or tools (proteins or cells) available. By performing computational chemical diversity selection on a per plate basis (instead of a per compound basis), 25% of the 1,000,000-compound screening collection was optimized for general initial HTS. Statistical models are then generated from target-specific primary results (percentage inhibition data) to drive the cherry picking and testing from the entire collection. Using retrospective analysis of 11 HTS campaigns, the authors show that this method would have captured on average two thirds of the active compounds (IC50 < 10 µM) and three fourths of the active Murcko scaffolds while decreasing screening expenditure by nearly 75%. This result is true for a wide variety of targets, including G-protein-coupled receptors, chemokine receptors, kinases, metalloproteinases, pathway screens, and protein-protein interactions. Unlike time-consuming "classic" sequential approaches that require multiple iterations of cherry picking, testing, and building statistical models, here individual compounds are cherry picked just once, based directly on primary screening data. Strikingly, the authors demonstrate that models built from primary data are as robust as models built from IC50 data. This is true for all HTS campaigns analyzed, which represent a wide variety of target classes and assay types.
Abstract: CD8+ cytotoxic T lymphocytes (CTL) are key determinants of immunity to intracellular pathogens and neoplastic cells. Recognition of specific antigens in the form of peptide-MHC class I complexes (pMHCI) presented on the target cell surface is mediated by T cell receptor (TCR) engagement. The CD8 coreceptor binds to invariant domains of pMHCI and facilitates antigen recognition. Here, we investigate the biological effects of a Q115E substitution in the α2 domain of human leukocyte antigen (HLA)-A*0201 that enhances CD8 binding by approximately 50% without altering TCR/pMHCI interactions. Soluble and cell surface-expressed forms of Q115E HLA-A*0201 exhibit enhanced recognition by CTL without loss of specificity. These CD8-enhanced antigens induce greater CD3 ζ chain phosphorylation in cognate CTL leading to substantial increases in cytokine production, proliferation and priming of naive T cells. This effect provides a fundamental new mechanism with which to enhance cellular immunity to specific T cell antigens.
Abstract: Target identification is a critical step following the discovery of small molecules that elicit a biological phenotype. The present work seeks to provide an in silico correlate of experimental target fishing technologies in order to rapidly fish out potential targets for compounds on the basis of chemical structure alone. A multiple-category Laplacian-modified naïve Bayesian model was trained on extended-connectivity fingerprints of compounds from 964 target classes in the WOMBAT (World Of Molecular BioAcTivity) chemogenomics database. The model was employed to predict the top three most likely protein targets for all MDDR (MDL Drug Database Report) database compounds. On average, the correct target was found 77% of the time for compounds from 10 MDDR activity classes with known targets. For MDDR compounds annotated with only therapeutic or generic activities such as "antineoplastic", "kinase inhibitor", or "anti-inflammatory", the model was able to systematically deconvolute the generic activities to specific targets associated with the therapeutic effect. Examples of successful deconvolution are given, demonstrating the usefulness of the tool for improving knowledge in chemogenomics databases and for predicting new targets for orphan compounds.
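A multiple-category Laplacian-modified naïve Bayes model of the kind named above can be sketched in a few lines. The abstract does not give the weighting formula; the sketch assumes the commonly described form, in which a feature present in N_i training compounds, A_i of them belonging to a target class with base rate P, receives the weight log((A_i + 1) / (N_i · P + 1)), and a compound's score for that class is the sum of the weights of its features. Plain strings stand in for extended-connectivity fingerprint features, and the names `train` and `predict` are hypothetical.

```python
import math
from collections import defaultdict

def train(samples):
    """Fit per-target feature weights.

    samples: list of (feature_set, target_label) pairs (assumed input layout).
    Returns {target_label: {feature: weight}}.
    """
    n_total = len(samples)
    feature_count = defaultdict(int)   # N_i: compounds containing feature i
    class_count = defaultdict(int)     # compounds annotated with each target
    class_feature_count = defaultdict(lambda: defaultdict(int))  # A_i per target
    for features, label in samples:
        class_count[label] += 1
        for f in features:
            feature_count[f] += 1
            class_feature_count[label][f] += 1
    weights = {}
    for label, n_class in class_count.items():
        base_rate = n_class / n_total
        counts = class_feature_count[label]
        # Laplacian-corrected relative estimate (assumed form, see lead-in).
        weights[label] = {
            f: math.log((counts.get(f, 0) + 1) / (n_f * base_rate + 1))
            for f, n_f in feature_count.items()
        }
    return weights

def predict(weights, features, top_k=3):
    """Return the top_k targets ranked by summed feature weights."""
    scores = {
        label: sum(w.get(f, 0.0) for f in features)  # unseen features add 0
        for label, w in weights.items()
    }
    return sorted(scores, key=lambda label: -scores[label])[:top_k]
```

The `top_k=3` default mirrors the "top three most likely protein targets" setting described in the abstract; the correction term keeps rarely seen features from dominating the score.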
Abstract: Lead discovery in the pharmaceutical environment is largely an industrial-scale process in which it is typical to screen 1-5 million compounds in a matter of weeks using High Throughput Screening (HTS). This process is a very costly endeavor. Typically an HTS campaign of 1 million compounds will cost anywhere from $500,000 to $1,000,000. There is consequently a great deal of pressure to maximize the return on investment by finding faster and more effective ways to screen. A panacea that has emerged over the past few years to help address this issue is in silico screening. In silico screening is now incorporated in all areas of lead discovery, from target identification and library design to hit analysis and compound profiling. However, as lead discovery has evolved over the past few years, so has the role of in silico screening.
Abstract: Bridging chemical and biological space is the key to drug discovery and development. Typically, cheminformatics methods operate under the assumption that similar chemicals have similar biological activity. Ideally then, one could predict a drug's biological function(s) given only its chemical structure by similarity searching in libraries of compounds with known activities. In practice, effectively choosing a similarity metric is case dependent. This work compares both 2D and 3D chemical descriptors as tools for predicting the biological targets of ligand probes, on the basis of their similarity to reference molecules in a 46,000-compound, biologically annotated chemical database. Overall, we found that the 2D methods employed here outperform the 3D (88% vs 67% success) in correct target prediction. However, the 3D descriptors proved superior in cases of probes with low structural similarity to other compounds in the database (singletons). Additionally, the 3D method (FEPOPS) shows promise for providing pharmacophoric alignment of the small molecules' chemical features consistent with those seen in experimental ligand/receptor complexes. These results suggest that querying annotated chemical databases with a systematic combination of both 2D and 3D descriptors will prove more effective than employing single methods.
Abstract: High-throughput screening (HTS) plays a pivotal role in lead discovery for the pharmaceutical industry. In tandem, cheminformatics approaches are employed to increase the probability of the identification of novel biologically active compounds by mining the HTS data. HTS data is notoriously noisy, and therefore, the selection of the optimal data mining method is important for the success of such an analysis. Here, we describe a retrospective analysis of four HTS data sets using three mining approaches: Laplacian-modified naive Bayes, recursive partitioning, and support vector machine (SVM) classifiers with increasing stochastic noise in the form of false positives and false negatives. All three of the data mining methods at hand tolerated increasing levels of false positives even when the ratio of misclassified compounds to true active compounds was 5:1 in the training set. False negatives in the ratio of 1:1 were tolerated as well. SVM outperformed the other two methods in capturing active compounds and scaffolds in the top 1%. A Murcko scaffold analysis could explain the differences in enrichments among the four data sets. This study demonstrates that data mining methods can add a true value to the screen even when the data is contaminated with a high level of stochastic noise.
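The noise experiment described above can be illustrated by a small label-corruption helper. This is a sketch under stated assumptions: the ratio is interpreted as (inactives relabeled as active) : (true actives), the compounds to relabel are drawn uniformly at random, and the function name `add_false_positives` is hypothetical. The classifiers themselves are not shown.

```python
import random

def add_false_positives(labels, ratio, seed=0):
    """Relabel random inactives as active so that false positives : true actives = ratio : 1.

    labels: list of booleans (True = active) for the training set.
    ratio: e.g. 5 for the 5:1 setting (interpretation assumed, see lead-in).
    """
    rng = random.Random(seed)
    n_actives = sum(1 for y in labels if y)
    inactives = [i for i, y in enumerate(labels) if not y]
    # Cannot flip more compounds than there are inactives.
    n_flip = min(len(inactives), round(ratio * n_actives))
    flipped = set(rng.sample(inactives, n_flip))
    return [True if i in flipped else bool(y) for i, y in enumerate(labels)]
```

Training a model on the corrupted labels and evaluating it against the original ones would then show how much enrichment survives a given noise level.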
Abstract: Conventional similarity searching of molecules compares single (or multiple) active query structures to each other in a relative framework, by means of a structural descriptor and a similarity measure. While this often works well, depending on the target, we show here that retrieval rates can be improved considerably by incorporating an external framework describing ligand bioactivity space for comparisons ("Bayes affinity fingerprints"). Structures are described by Bayes scores for a ligand panel comprising about 1000 activity classes extracted from the WOMBAT database. The comparison of structures is performed via the Pearson correlation coefficient of activity classes, that is, the order in which two structures are similar to the panel activity classes. Compound retrieval on a recently published data set could be improved by as much as 24% relative (9% absolute). Knowledge about the shape of the "bioactive chemical universe" is thus beneficial to identifying similar bioactivities. Principal component analysis was employed to further analyze activity space with the objective to define orthogonal ligand bioactive chemical space, leading to nine major (roughly orthogonal) activity axes. Employing only those nine activity classes, retrieval rates are still comparable to original Bayes affinity fingerprints; thus, the concept of orthogonal bioactive ligand chemical space was validated as being an information-rich but low-dimensional representation of bioactivity space. Correlations between activity classes are a major determinant to gauge whether the desired multitarget activity of drugs is (on the basis of current knowledge) a feasible concept because it measures the extent to which activities can be optimized independently, or only by strongly influencing one another.
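
Comparing two compounds through their activity-class score vectors can be sketched as below. The sketch assumes each compound has already been reduced to a vector of Bayes scores, one per activity class; the four-element vectors and compound names are invented, whereas the panel described above has about 1000 classes.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length score vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no ranking information
    return cov / sqrt(var_x * var_y)


# Hypothetical affinity fingerprints (one score per activity class).
query = [2.0, -1.0, 0.5, 3.0]
library = {
    "cand_a": [4.0, -2.0, 1.0, 6.0],
    "cand_b": [-2.0, 1.0, -0.5, -3.0],
    "cand_c": [0.0, 1.0, 0.0, 1.0],
}

# Rank library compounds by correlation with the query, best first.
ranked = sorted(library, key=lambda name: pearson(query, library[name]), reverse=True)
print(ranked)
```
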
Abstract: The class I CD8 positive T-cell response is involved in a number of conditions in which artificial down-regulation and control would be therapeutically beneficial. Such conditions include a number of autoimmune diseases and graft rejection in transplant patients. Although the CD8 T-cell response is dominated by the TCR-pMHC interaction, activation of T cells is in most cases also dependent on a number of associated signalling molecules. Previous work has demonstrated the ability of one such molecule (CD8) to act as an antagonist to T-cell activation if added in soluble form. Therefore, a high-affinity mutant CD8 (haCD8) has been developed with the aim of developing a therapeutic immunosuppressor. In order to fully understand the nature of the haCD8 interaction, this protein was crystallized using the sitting-drop vapour-diffusion method. Single haCD8 crystals were cryocooled and used for data collection. These crystals belonged to space group P6(4)22 (assumed by similarity to the wild type), with unit-cell parameters a = 101.08, c = 56.54 Å. Matthews coefficient (VM) calculations indicated one molecule per asymmetric unit. A 2 Å data set was collected and the structure is currently being determined using molecular replacement.
Abstract: The off-rate (k(off)) of the T cell receptor (TCR)/peptide-major histocompatibility complex class I (pMHCI) interaction, and hence its half-life, is the principal kinetic feature that determines the biological outcome of TCR ligation. However, it is unclear whether the CD8 coreceptor, which binds pMHCI at a distinct site, influences this parameter. Although biophysical studies with soluble proteins show that TCR and CD8 do not bind cooperatively to pMHCI, accumulating evidence suggests that TCR associates with CD8 on the T cell surface. Here, we titrated and quantified the contribution of CD8 to TCR/pMHCI dissociation in membrane-constrained interactions using a panel of engineered pMHCI mutants that retain faithful TCR interactions but exhibit a spectrum of affinities for CD8 of >1,000-fold. Data modeling generates a "stabilization factor" that preferentially increases the predicted TCR triggering rate for low affinity pMHCI ligands, thereby suggesting an important role for CD8 in the phenomenon of T cell cross-reactivity.
Abstract: The noise level of a high-throughput screening (HTS) experiment depends on various factors such as the quality and robustness of the assay itself and the quality of the robotic platform. Screening of compound mixtures is noisier than screening single compounds per well. A classification model based on naïve Bayes (NB) may be used to enrich such data. The authors studied the ability of the NB classifier to prioritize noisy primary HTS data of compound mixtures (5 compounds/well) in 4 campaigns in which the percentage of noise presumed to be inactive compounds ranged between 81% and 91%. The top 10% of the compounds suggested by the classifier captured between 26% and 45% of the active compounds. These results are reasonable and useful, considering the poor quality of the training set and the short computing time that is needed to build and deploy the classifier.
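
The "top 10% captured between 26% and 45% of the actives" figure is a recall-at-cutoff measurement, which can be computed as in the sketch below. The scores and labels are toy values, and the function name is an assumption for illustration.

```python
from typing import Sequence


def recall_at_top(scores: Sequence[float], labels: Sequence[int],
                  fraction: float = 0.10) -> float:
    """Fraction of all actives found in the top-scoring `fraction` of compounds."""
    n_actives = sum(labels)
    if n_actives == 0:
        return 0.0
    n_top = max(1, int(round(len(scores) * fraction)))
    # Indices sorted by classifier score, best first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    found = sum(labels[i] for i in order[:n_top])
    return found / n_actives


# Toy data: 20 compounds with strictly decreasing scores, 4 of them active.
scores = [float(s) for s in range(20, 0, -1)]
labels = [0] * 20
for idx in (0, 1, 10, 15):
    labels[idx] = 1

recall = recall_at_top(scores, labels, fraction=0.10)
print(recall)
```
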
Abstract: We have previously shown that a machine learning technique can improve the enrichment of high-throughput docking (HTD) results. In the previous cases studied, however, the application of a naive Bayes classifier failed to improve enrichment for instances where HTD alone was unable to generate an acceptable enrichment. We present here a protocol to rescue poor docking results a priori using a combination of rank-by-median consensus scoring and naive Bayesian categorization.
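
Rank-by-median consensus scoring can be sketched as follows: each scoring function ranks the compounds, and a compound's consensus rank is the median of its individual ranks. The score tables are invented, lower scores are assumed to be better (as for docking energies), and every table is assumed to cover the same compounds; how the study coupled this step to the Bayesian categorization is not shown here.

```python
from statistics import median
from typing import Dict, List


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Map compound -> rank under one scoring function (1 = best, lowest score)."""
    ordered = sorted(scores, key=scores.get)
    return {cid: r for r, cid in enumerate(ordered, start=1)}


def rank_by_median(score_tables: List[Dict[str, float]]) -> List[str]:
    """Order compounds by the median of their per-function ranks."""
    per_function = [ranks(table) for table in score_tables]
    compounds = list(per_function[0])
    consensus = {c: median(r[c] for r in per_function) for c in compounds}
    # Ties on the median are broken alphabetically for determinism.
    return sorted(compounds, key=lambda c: (consensus[c], c))


# Hypothetical scores from three scoring functions.
tables = [
    {"A": -9.0, "B": -7.0, "C": -8.0, "D": -5.0},
    {"A": -6.0, "B": -9.0, "C": -8.0, "D": -4.0},
    {"A": -10.0, "B": -5.0, "C": -7.0, "D": -6.0},
]
consensus_order = rank_by_median(tables)
print(consensus_order)
```

The top of the consensus list could then serve as presumed actives for training a classifier, without any prior knowledge of true activity.
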
Abstract: A primary goal of 3D similarity searching is to find compounds with similar bioactivity to a reference ligand but with different chemotypes, i.e., "scaffold hopping". However, an adequate description of chemical structures in 3D conformational space is difficult due to the high-dimensionality of the problem. We present an automated method that simplifies flexible 3D chemical descriptions in which clustering techniques traditionally used in data mining are exploited to create "fuzzy" molecular representations called FEPOPS (feature point pharmacophores). The representations can be used for flexible 3D similarity searching given one or more active compounds without a priori knowledge of bioactive conformations or pharmacophores. We demonstrate that similarity searching with FEPOPS significantly enriches for actives taken from in-house high-throughput screening datasets and from MDDR activity classes COX-2, 5-HT3A, and HIV-RT, while also scaffold or ring-system hopping to new chemical frameworks. Further, inhibitors of target proteins (dopamine 2 and retinoic acid receptor) are recalled by FEPOPS by scaffold hopping from their associated endogenous ligands (dopamine and retinoic acid). Importantly, the method excels in comparison to commonly used 2D similarity methods (DAYLIGHT, MACCS, Pipeline Pilot fingerprints) and a commercial 3D method (Pharmacophore Distance Triplets) at finding novel scaffold classes given a single query molecule.
Abstract: We have previously reported that the application of a Laplacian-modified naive Bayesian (NB) classifier may be used to improve the ranking of known inhibitors from a random database of compounds after High-Throughput Docking (HTD). The method relies upon the frequency of substructural features among the active and inactive compounds from 2D fingerprint information of the compounds. Here we present an investigation of the role of extended connectivity fingerprints in training the NB classifier against HTD studies on the HIV-1 protease using three docking programs: Glide, FlexX, and GOLD. The results show that the performance of the NB classifier is due to the presence of a large number of features common to the set of known active compounds rather than a single structural or substructural scaffold. We demonstrate that the Laplacian-modified naive Bayesian classifier trained with data from high-throughput docking is superior at identifying active compounds from a target database in comparison to conventional two-dimensional substructure search methods alone.
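
The feature-frequency idea can be sketched with the Laplacian-corrected weight in the form commonly given for this classifier, log((A_F + 1) / (N_F * P + 1)), where N_F compounds contain feature F, A_F of them are active, and P is the overall fraction of actives. Whether this matches the exact implementation used in the study is an assumption, and the fingerprints below are toy integer sets rather than real extended connectivity fingerprints.

```python
from collections import Counter
from math import log
from typing import Dict, Iterable, List, Set


def train_laplacian_nb(fingerprints: List[Set[int]],
                       labels: List[int]) -> Dict[int, float]:
    """Per-feature weights log((A_F + 1) / (N_F * p_active + 1))."""
    p_active = sum(labels) / len(labels)
    n_with = Counter()  # compounds containing each feature
    a_with = Counter()  # active compounds containing each feature
    for fp, is_active in zip(fingerprints, labels):
        for feature in fp:
            n_with[feature] += 1
            if is_active:
                a_with[feature] += 1
    return {f: log((a_with[f] + 1) / (n_with[f] * p_active + 1)) for f in n_with}


def nb_score(fp: Iterable[int], weights: Dict[int, float]) -> float:
    """Sum of feature weights; features unseen in training contribute 0."""
    return sum(weights.get(f, 0.0) for f in fp)


# Toy training data: two actives, four inactives.
fingerprints = [{1, 2}, {1, 3}, {4, 5}, {4, 6}, {2, 4}, {5, 6}]
labels = [1, 1, 0, 0, 0, 0]
weights = train_laplacian_nb(fingerprints, labels)
print(nb_score({1, 3}, weights) > nb_score({4, 5}, weights))
```

Because the score is a sum over many features, a compound can rank highly without sharing any single scaffold with the known actives, which is consistent with the finding reported above.
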
Abstract: The technology underpinning high-throughput docking (HTD) has developed over the past few years to the point where it has become a vital tool in modern drug discovery. Although the performance of various docking algorithms is adequate, the ability to accurately and consistently rank compounds using a scoring function remains problematic. We show that by employing a simple machine learning method (naïve Bayes) it is possible to significantly overcome this deficiency. Compounds from the Available Chemicals Directory (ACD), along with known active compounds, were docked into two protein targets using three software packages. In cases where HTD alone was able to show some enrichment, the application of naïve Bayes was able to improve upon the enrichment. The application of this methodology to enrich HTD results can be carried out without a priori knowledge of the activity of compounds and results in superior enrichment of known actives compared to the use of scoring methods alone.
Abstract: The crystal structure of the bovine zinc metalloproteinase carboxypeptidase A (CPA) has been refined to 1.25 Å resolution based on room-temperature X-ray synchrotron data. The significantly improved structure of CPA at this resolution (anisotropic temperature factors, R factor = 10.4%, R(free) = 14.5%) allowed the modelling of conformational disorders of side chains, improved the description of the protein solvent network (375 water molecules) and provided a more accurate picture of the interactions between the active-site zinc and its ligands. The calculation of standard uncertainties in individual atom positions of the refined model of CPA allowed the deduction of the protonation state of some key residues in the active site and confirmed that Glu72 and Glu270 are negatively charged in the resting state of the enzyme at pH 7.5. These results were further validated by theoretical calculations that showed significant reduction of the pKa of these side chains relative to solution values. The distance between the zinc-bound solvent molecule and the metal ion is strongly suggestive of a neutral water molecule and not a hydroxide ion in the resting state of the enzyme. These findings could support both the general acid/general base mechanism, as well as the anhydride mechanism suggested for CPA.
Abstract: T lymphocytes recognize peptides presented in the context of major histocompatibility complex (MHC) molecules on the surface of antigen presenting cells. Recognition specificity is determined by the αβ T cell receptor (TCR). The T lymphocyte surface glycoproteins CD8 and CD4 enhance T cell antigen recognition by binding to MHC class I and class II molecules, respectively. Biophysical measurements have determined that equilibrium binding of the TCR with natural agonist peptide-MHC (pMHC) complexes occurs with KD values of 1-50 µM. The pMHCI/CD8 and pMHCII/CD4 interactions are significantly weaker than this (KD > 100 µM), and the relative roles of TCR/pMHC and pMHC/coreceptor affinity in T cell activation remain controversial. Here, we engineer mutations in the MHCI heavy chain and β2-microglobulin that further reduce or abolish the pMHCI/CD8 interaction to probe the significance of pMHC/coreceptor affinity in T cell activation. We demonstrate that the pMHCI/CD8 coreceptor interaction retains the vast majority of its biological activity at affinities that are reduced by over 15-fold (KD > 2 mM). In contrast to previous reports, we observe that the weak interaction between HLA A68 and CD8, which falls within this spectrum of reduced affinities, retains substantial functional activity. These findings are discussed in the context of current concepts of coreceptor dependence and the mechanism by which TCR coreceptors facilitate T cell activation.
Abstract: Antibody and T-cell receptors (TCRs) are the primary recognition molecules of the adaptive immune system. Antibodies have been extensively characterized and are being developed for a large number of therapeutic applications. This has been possible because of the ability to manufacture stable, soluble, monoclonal antibodies which retain the antigen specificity of B cells. Unlike antibodies, TCRs are not expressed in a soluble form, but are anchored to the T-cell surface by an insoluble trans-membrane domain. Characterization and development of TCRs has been hampered by the lack of suitable methods for producing them as soluble and stable proteins. Here we report the engineering of soluble human TCRs suitable for crystallization studies and potentially for in vivo therapeutic use.
Abstract: A feature of Peter Kollman's research was his exploitation of the latest computational techniques to devise novel applications of the free energy perturbation method. He would certainly have seized upon the opportunities offered by massively distributed computing. Here we describe the use of over a million personal computers to perform virtual screening of 3.5 billion druglike molecules against protein targets by pharmacophore pattern matching, together with other applications of pattern recognition such as docking ligands without any a priori knowledge about the binding site location.
Abstract: Structural genomics will yield an immense number of protein three-dimensional structures in the near future. Automated theoretical methodologies are needed to exploit this information and are likely to play a pivotal role in drug discovery. Here, we present a fully automated, efficient docking methodology that does not require any a priori knowledge about the location of the binding site or function of the protein. The method relies on a multiscale concept where we deal with a hierarchy of models generated for the potential ligand. The models are created using the k-means clustering algorithm. The method was tested on seven protein-ligand complexes. In the largest complex, human immunodeficiency virus reverse transcriptase/nevirapine, the root mean square deviation value when comparing our results to the crystal structure was 0.29 Å. We demonstrate on an additional 25 protein-ligand complexes that the methodology may be applicable to high throughput docking. This work reveals three striking results. First, a ligand can be docked using a very small number of feature points. Second, when using a multiscale concept, the number of conformers that need to be generated can be significantly reduced. Third, fully flexible ligands can be treated as a small set of rigid k-means clusters.
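
A hierarchy of coarse ligand models of the kind described above can be sketched by running k-means on atom coordinates for increasing k. This is a simplified illustration: it clusters bare 3D coordinates only, uses the first k atoms as initial centroids so the result is deterministic, and the coordinates are invented; the published method's treatment of atom properties and conformers is not reproduced.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def dist2(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points: List[Point], k: int, n_iter: int = 50):
    """Plain Lloyd's k-means; returns (centroids, cluster index per point)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [-1] * len(points)
    for _ in range(n_iter):
        new_assignment = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                          for p in points]
        converged = new_assignment == assignment
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if converged:
            break
    return centroids, assignment


# Hypothetical ligand atom coordinates forming two spatial groups.
atoms = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (1.0, 0.0, 0.0),
         (11.0, 0.0, 0.0), (0.0, 1.0, 0.0), (10.0, 1.0, 0.0)]

# Multiscale hierarchy: one feature point, then two.
hierarchy = {k: kmeans(atoms, k)[0] for k in (1, 2)}
centroids, assignment = kmeans(atoms, 2)
print(assignment)
```

Each level of the hierarchy represents the ligand with more feature points, so a coarse level can be used to prune placements before a finer level is evaluated.
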
Abstract: The problem of global optimization is pivotal in a variety of scientific fields. Here, we present a robust stochastic search method that is able to find the global minimum for a given cost function, as well as, in most cases, any number of best solutions for very large combinatorial "explosive" systems. The algorithm iteratively eliminates variable values that contribute consistently to the highest end of a cost function's spectrum of values for the full system. Values that have not been eliminated are retained for a full, exhaustive search, allowing the creation of an ordered population of best solutions, which includes the global minimum. We demonstrate the ability of the algorithm to explore the conformational space of side chains in eight proteins, with 54 to 263 residues, to reproduce a population of their low energy conformations. The 1,000 lowest energy solutions are identical in the stochastic (with two different seed numbers) and full, exhaustive searches for six of eight proteins. The others retain the lowest 141 and 213 (of 1,000) conformations, depending on the seed number, and the maximal difference between stochastic and exhaustive is only about 0.15 kcal/mol. The energy gap between the lowest and highest of the 1,000 low-energy conformers in eight proteins is between 0.55 and 3.64 kcal/mol. This algorithm offers real opportunities for solving problems of high complexity in structural biology and in other fields of science and technology.
Abstract: The CD8 coreceptor of cytotoxic T lymphocytes binds to a conserved region of major histocompatibility complex (MHC) class I molecules during recognition of peptide-MHC class I antigens on the surface of target cells. This event is central to the activation of cytotoxic T lymphocyte (CTL) effector functions. The contribution of the MHC class I light chain, β2-microglobulin, to CD8αα binding is relatively small and is mediated mainly through the lysine residue at position 58. Despite this, using molecular modeling, we predict that its mutation should have a dramatic effect on CD8αα binding. The predictions are confirmed using surface plasmon resonance binding studies and human CTL activation assays. Surprisingly, the charge-reversing mutation, Lys58 → Glu, enhances β2m-MHC class I heavy chain interactions. This mutation also significantly reduces CD8αα binding and is a potent antagonist of CTL activation. These results suggest a novel approach to CTL-specific therapeutic immunosuppression.
Abstract: Identification of a ligand binding site on a protein is pivotal to drug discovery. To date, no reliable and computationally feasible general approach to this problem has been published. Here we present an automated efficient method for determining binding sites on proteins for potential ligands without any a priori knowledge. Our method is based upon the multiscale concept where we deal with a hierarchy of models generated using a k-means clustering algorithm for the potential ligand. This is done in a simple approach whereby a potential ligand is represented by a growing number of feature points. At each increasing level of detail, a pruning of potential binding sites is performed. A nonbonding energy function is used to score the interactions between molecules at each step. The technique was successfully applied to seven protein-ligand complexes. In the current paper we show that the algorithm considerably reduces the computational effort required to solve this problem. This approach offers real opportunities for exploiting the large number of structures that will evolve from structural genomics.
Abstract: A novel automated method for the optimal placement of polar hydrogens in a protein structure is presented. The algorithm adds initially, to a protein data bank file of the protein, nonrotatable hydrogens such as peptide backbone hydrogens according to geometric considerations. Then, water protons and polar side chain protons of lysine, serine, threonine, tyrosine, aspartic acid, glutamic acid, and the C and N termini of a protein are added according to energy considerations. A unique stochastic approach has been developed to overcome a combinatorial explosion in the search for the lowest energy structure. First, the system is divided into ensembles. Each ensemble is treated separately: N conformations are sampled at random, their energies computed, whereas common components of high-energy combinations are gathered on one hand, and low-energy combinations on the other. Components that yield only high-energy conformations and do not contribute to any low energies are excluded. This is reiterated while the total amount of combinations is decreased along the iterative process. When the total number of combinations is lower than a user defined threshold, all remaining combinations are evaluated by exhaustive search. Energy evaluations use nonbonding energy expressions alone. The program was tested on five high-resolution crystal structures: bovine pancreatic trypsin inhibitor (Brookhaven Protein Data Bank file 5PTI), RNase-A (5RSA), trypsin (1NTP), and carbon monoxymyoglobin (2MB5), for which neutron diffraction structures are available, as well as phosphate binding protein (1IXH) for which very high resolution X-ray crystallography was used. The low RMS values prove the efficiency of this algorithm as a tool for positioning protons in proteins. It may be used for other biological structures.
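
The sample, eliminate, then exhaustively search loop described above can be sketched as follows. This is a simplified illustration under several assumptions: the system is treated as a single ensemble, the "high" and "low" groups are taken as fixed fractions (`tail`) of the sampled combinations, the cost function is a toy quadratic rather than a nonbonding energy, and all parameter names are hypothetical. Unlike the published algorithm's reported behaviour, this sketch makes no claim to always retain the global minimum.

```python
import itertools
import random
from math import prod


def stochastic_elimination(domains, cost, n_samples=200, threshold=500,
                           tail=0.25, max_rounds=50, seed=0):
    """Prune variable values seen only in high-cost samples, then search exhaustively."""
    rng = random.Random(seed)
    domains = [list(d) for d in domains]
    for _ in range(max_rounds):
        if prod(len(d) for d in domains) <= threshold:
            break  # few enough combinations left for exhaustive search
        samples = [tuple(rng.choice(d) for d in domains) for _ in range(n_samples)]
        samples.sort(key=cost)
        n_tail = max(1, int(n_samples * tail))
        low, high = samples[:n_tail], samples[-n_tail:]
        for i, values in enumerate(domains):
            in_low = {s[i] for s in low}
            in_high = {s[i] for s in high}
            # Drop values that appear in high-cost samples and in no low-cost sample.
            domains[i] = [v for v in values if v in in_low or v not in in_high]
    # Exhaustive search over whatever survived (may be large if max_rounds ran out).
    best = min(itertools.product(*domains), key=cost)
    return best, cost(best), domains


# Toy problem: 6 variables with 5 values each (15,625 combinations).
TARGET = (2, 0, 4, 1, 3, 2)


def toy_cost(combo):
    return sum((v - t) ** 2 for v, t in zip(combo, TARGET))


best, best_cost, remaining = stochastic_elimination([range(5)] * 6, toy_cost)
print(best, best_cost)
```
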