DMPK Insights #13: The Influence of QSAR and Physicochemical Descriptors on Drug Design – A Brief History and New Paradigms
About this Podcast
In this podcast, Scott and Matt discuss QSAR and how in silico ADME science has evolved to influence drug design and improve the ADME properties of hit and lead compounds.
We will address the following questions:
- What were the drivers for the emergence of QSAR and physicochemical descriptors?
- What is today’s state-of-the-art for ADME scientists to know about?
- How is QSAR likely to evolve in the near future?

Scott Summerfield: Hello and welcome to this Pharmaron podcast, part of our DMPK Insights series. My name is Scott Summerfield and I lead the metabolism group in our UK-based integrated DMPK discovery and development platform.
Well, today I’m delighted to be joined by Dr. Matt Segall to discuss the influence of QSAR and physicochemical descriptors on drug design, where we’ll be covering a brief history, the key steps that brought us to where we are today, and also some thoughts on the future. Welcome, Matt.
Matt Segall: Hi, Scott. It’s great to speak to you today.
Scott Summerfield: By way of introduction, Matt is CEO of Optibrium. He has a Master of Science in Computation from the University of Oxford and a PhD in Theoretical Physics from the University of Cambridge. As Associate Director at Camitro in the UK, ArQule, and Inpharmatica, he led a team developing predictive ADME models and state-of-the-art, intuitive decision-support and visualization tools for drug discovery.
In 2006, Matt became responsible for the management of Inpharmatica’s ADME business, including experimental ADME services and the StarDrop software platform. Following the acquisition of Inpharmatica, Matt became Senior Director responsible for BioFocus DPI’s ADMET division and, in 2009, led a management buyout of the StarDrop business to found Optibrium, which develops software and AI solutions for small molecule design, optimization, and data analysis. Matt has over 30 peer-reviewed publications and book chapters in the areas of computational chemistry, cheminformatics, and drug discovery.
So Matt, firstly, I see you have a background originating in computation and physics. And so I wondered how you were drawn to using those skills in DMPK.
Matt Segall: Yes, Scott. It’s certainly an unusual background for going into DMPK. I guess, as you said, I did an undergraduate degree in physics and a master’s degree in computation, so I was interested in computational physics, but I’ve always been fascinated by the boundaries between different scientific fields. I think it’s often most fruitful to take methods from one discipline and apply them in another.
And I have a family background in the pharma industry. So when I was interviewing for my PhD in computational physics with my then supervisor, Mike Payne, who was interested in applying quantum mechanical simulations at an atomistic level – so atoms and electrons – I just happened to say, well, have you ever thought of applying this in biology? And his eyes lit up and he was like, oh, I’d love to do that, but I don’t know anyone who would be interested in taking that on. I said, yeah, sure, I’d love to give it a go. I have some contacts, you know, through my family in the industry; we can see what might be possible.
And then when searching for appropriate applications – I mean, this is back in the nineties, so this really hadn’t been tried before – I was chatting with a professor at Sheffield, Professor Geoff Tucker, who many of your listeners may know, and who ultimately founded Simcyp. He was and is an expert in DMPK and particularly had an interest in cytochrome P450s, and chatting with him I realized it was almost the perfect application of these quantum mechanical methods, because we’re trying to model a reaction: there’s a very clear signal we can look for to see if the calculations are working.
And so that’s what I did my PhD on. I started doing a bit of a postdoc, and that got the interest of what was then Glaxo Wellcome – again dating myself. They were really interested in seeing what would be possible in that space, and so, yeah, ultimately I got involved in DMPK and drug discovery, and that’s led me to where I am today.
Scott Summerfield: Well, thank you. I mean, you do highlight something I think is really important, particularly in DMPK: you don’t go to university to study drug metabolism and pharmacokinetics. It’s kind of a family of really big things, you know – physiology, pharmacology, chemistry, physics, maths – and some of the latter ones, like computing, have come in later.
And I guess you graduated at a good time, when all of those were beginning to become important, you know, as we were trying to get out of an era of not very strong delivery of new medicines. The pipeline was quite poor then, but I guess, you know, computational chemistry is part of improving that.
Matt Segall: Yeah, absolutely. It’s been a growing area of application all the way from the 1970s and 1980s onwards, but it really hit its stride in the late 90s, 2000s, and so on. And I think it’s continuing today, which I’m looking forward to discussing more.
Scott Summerfield: So, you know, interestingly, we live in a world where the number of scientific papers and discoveries is orders of magnitude higher than even a few decades ago. And one of the values of these podcasts really is to link key milestones from the past to the current state of the art, and then to how these will lead to new possibilities in the future.
So we’ll start with the reflective piece, and get your thoughts on the key driving forces that led to the early exploration of QSAR and physicochemical descriptors for DMPK optimization specifically.
Matt Segall: Yeah, I mean, this really goes back to the 1990s. I think that was the start of the era of really target-focused drug discovery, and there was a lot of focus in early drug discovery on optimizing potency to a very high degree. ADME and DMPK were thought of as something downstream: let other people worry about that once we get into preclinical development, and, you know, we can fix any issues in formulation later on.
And what that ultimately led to was a very large number of clinical failures due to DMPK issues. I think the key paper that everyone cites is the paper by Kennedy in 1997, where he was talking about around 40% of compounds failing in the clinic due to DMPK issues. Now, we need to be a little bit careful about that, because if you read the paper carefully, he was very clear that that data set was somewhat biased towards anti-infectives, which had greater DMPK issues. But ultimately the picture was correct: there wasn’t enough attention being paid to ADME and to DMPK early in the process.
And a number of pharma companies noticed that, and that was really the advent of what was then called early ADME – bringing ADME screens earlier in the drug discovery process and using those to de-risk the compounds being progressed. Several pharma companies took that initiative. The one I’m most familiar with, due to my relationship back then with GSK, was the Combinatorial Lead Optimization Project, or CLOP, which was run by one of my bosses, ultimately, Mike Tarbit.
So this was generating what we now think of as very conventional things, like permeability assays and microsomal stability, earlier on. But the advent of these higher-throughput screens in early drug discovery led to the availability of a lot more data. And so attention naturally turned to: well, given all this extra data, can we apply these QSAR methods that had historically been used for predicting compound potency and activity to predicting ADME properties, DMPK, and so on? And that was really the genesis of the whole field of computational ADME and DMPK.
The other outcome of this was the recognition that some of these characteristics could at least be encoded or de-risked by considering simple physicochemical properties. And of course the most famous example of that is Lipinski’s Rule of Five: just looking at four simple properties – the log P, the molecular weight, the number of hydrogen bond donors and acceptors. This was a fantastic tool for raising awareness of the importance of getting into the right chemical space to achieve oral bioavailability, or whatever route of administration you’re looking for.
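As a concrete illustration of how simple these rules are to apply, here is a minimal sketch of a Rule of Five check, assuming the open-source RDKit toolkit is installed; the thresholds are Lipinski’s published cut-offs (molecular weight ≤ 500, calculated log P ≤ 5, ≤ 5 hydrogen-bond donors, ≤ 10 acceptors).

```python
# A minimal sketch of a Lipinski Rule of Five check using RDKit.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def rule_of_five_violations(smiles: str) -> int:
    """Count how many of Lipinski's four criteria a molecule violates."""
    mol = Chem.MolFromSmiles(smiles)
    violations = 0
    if Descriptors.MolWt(mol) > 500:        # molecular weight
        violations += 1
    if Descriptors.MolLogP(mol) > 5:        # calculated log P
        violations += 1
    if Lipinski.NumHDonors(mol) > 5:        # hydrogen-bond donors
        violations += 1
    if Lipinski.NumHAcceptors(mol) > 10:    # hydrogen-bond acceptors
        violations += 1
    return violations

# Example: ibuprofen passes all four criteria
print(rule_of_five_violations("CC(C)Cc1ccc(cc1)C(C)C(=O)O"))  # -> 0
```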
Again, it’s one of those things you have to be careful not to over-interpret. Lipinski’s Rule of Five gives some nice guidelines, but they’re far from hard and fast rules, and we now see the advent of beyond-Rule-of-Five compounds. Some people have taken the approach of using it to define “drug-like”, but that’s probably taking it a bit far. Nonetheless, many similar guidelines have come from this that have at least helped drug discovery projects to focus on areas of chemical space where they’re much more likely to find compounds with good ADME and PK properties.
Scott Summerfield: Thanks, Matt. You mentioned the advent of higher-throughput permeability screens and things like that, and I was curious, as you were talking, about the data quality and the curation – how early was that recognized? Because obviously any model needs a good data set. So what was it like back then?
Matt Segall: A little bit of a Wild West; in fact, arguably it still is. I think you’ve touched on a really important point, which is that quality, consistent data is key. If you’re in a large pharma company now, there are hundreds of thousands of data points from consistent assays, run in the same lab with the same protocol, that are really high-quality data sets to use to build these QSAR models. And, you know, this is now run-of-the-mill stuff in a large pharma company.
But if you’re trying to build your own data set, maybe in a smaller company, bringing in public domain literature data and so on, you need to be very, very careful indeed to look at the assay protocols used to generate that data and make sure they are genuinely comparable. If, for example, you collect every data point labeled hERG IC50 from the literature, throw it into a big pot, and try to build a model, what you’re going to be doing is modeling the difference between the different experimental protocols – essentially the noise in that data – and not really capturing the structure-activity relationships that will enable you to make good predictions for new compounds.
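In practice, that curation step can be as simple as refusing to pool incompatible records. Here is a hedged sketch of the idea in Python with pandas; the file name, column names, and protocol label are all hypothetical.

```python
# Sketch: before pooling literature hERG IC50 values, group records by
# assay protocol and keep only genuinely comparable measurements.
import pandas as pd

records = pd.read_csv("herg_literature.csv")  # hypothetical collated file

# Inspect how many distinct protocols are mixed together
print(records["assay_protocol"].value_counts())

# Keep a single, consistent protocol (e.g. patch clamp in HEK293 cells)
# rather than averaging across incompatible assay conditions
comparable = records[records["assay_protocol"] == "patch_clamp_HEK293"]

# Where duplicates remain for one compound, check their spread before trusting them
spread = comparable.groupby("smiles")["pIC50"].agg(["mean", "std", "count"])
print(spread.sort_values("std", ascending=False).head())
```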
Scott Summerfield: Yeah, that’s an interesting point. I was wondering as well, as you were talking – obviously, I don’t know what computational science was like back then, but computing was, say, simpler. I can’t think of the right word, but the level of computation that you could do on an average system was much less than today.
So, you know, with that increase in computational power, and also the knowledge that we’ve been able to build off the back of it, what would you say today we can do well? And where do you think we might see the short-term improvements?
Matt Segall: Another great point. I mean, if you go back to the 90s, comparatively speaking, the computers were steam-powered. You know, I started my PhD on a Cray X-MP – anyone who’s a computer geek out there will know it’s one of these huge cabinet-sized things with cooling systems and all sorts. It looked really cool, but frankly, it’s about the same as an iPhone nowadays; in fact, it’s probably even a bit slower than an iPhone would be. So yeah, there clearly have been huge advances in the intervening decades.
So what do we do really well now? Machine learning models have come on leaps and bounds. The methods being developed in companies like Google and so on have really improved the sophistication with which we can model the data we’re trying to understand and make predictions with. Now, one needs to be careful translating that into the space of drug discovery, because actually the data we deal with is different in a few ways. First of all, it’s a lot smaller – we think we’ve got big data sets now, but on the scale of Google, they’re tiny – and our data is noisy. So it’s not very straightforward to simply repurpose these methods, but where they’ve been carefully explored and carefully translated into this domain, they’ve had a really significant impact.
And so for simple physicochemical and ADME properties, these are now run-of-the-mill approaches, used routinely in every pharma company and pretty much all biotech companies, and having a real impact. There was a great paper published by an industry consortium – the lead author was Lombardo – back in 2017, where they looked at the impact of predictive ADME models across their companies and how they were influencing the compounds being synthesized and tested. And they saw a dramatic shift towards compounds being synthesized and tested with good ADME properties, like solubility, metabolic stability, and so on. They were being assessed in silico and the best compounds were moving forward, and that was just really improving the quality of the compounds being progressed in drug discovery projects.
But, you know, there are more sophisticated methods coming on stream now. A big area, given my background in physics, is actually using quantum methods in a practical, day-to-day way. Back when I was doing my PhD, it would take weeks to run a calculation, but now we can actually start to deploy these quantum mechanical methods where they add a lot of value, particularly where there are reactions occurring. Again, coming from drug metabolism – that’s obviously a catalytic reaction at the active site – using these quantum mechanical methods to model and simulate those reactions, and really get very valuable inputs to these models, like activation energies and hence rates of different reactions, can add enormously to the precision of the predictions we can make in areas like metabolism, pKa, and so on. So that has been, and continues to be, a big area of development.
And then an area that I’m particularly excited about is what’s called data imputation. What this enables us to do is use early-stage ADME data – for example, permeability, metabolic stability, solubility, protein binding, and so on – actually as inputs to a machine learning model that will predict downstream, much more complex endpoints like pharmacokinetics.
And we published a paper with AstraZeneca demonstrating this approach. The challenge with these in vivo properties, particularly drug disposition and PK, is that of course they’re governed by multiple factors, and trying to dissect the influence of compound structure on all these different factors and combine that into a prediction of PK would require vast amounts of data. But of course, as we know, that’s where we have the least data, because they’re extremely expensive, time-consuming experiments, and we want to minimize our use of animals. So these new, so-called imputation approaches, which leverage data where we have it to make more accurate predictions of those downstream, much more informative, and clinically relevant outcomes, are a really exciting new area.
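To make the idea concrete, here is a minimal sketch of this flavour of model in Python with scikit-learn. It is an illustration of the concept only, not the published method; the data file, column names, and endpoint are hypothetical.

```python
# Sketch of imputation-style modelling: measured early-stage assay results
# (permeability, microsomal stability, solubility, protein binding) are used
# as inputs alongside cheap calculated descriptors to predict a sparse
# downstream endpoint such as in vivo clearance.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

data = pd.read_csv("project_data.csv")  # hypothetical assay table

features = ["mol_weight", "clogp",              # calculated descriptors
            "caco2_papp", "mic_stab_clint",     # measured early ADME assays
            "log_solubility", "ppb_fu"]         # (may contain missing values)
target = "in_vivo_clearance"                    # measured for few compounds

train = data[data[target].notna()]

# HistGradientBoostingRegressor handles missing feature values natively,
# so compounds with only partial early-stage data still contribute
model = HistGradientBoostingRegressor().fit(train[features], train[target])

# Predict the downstream endpoint for compounds with only early-stage data
unmeasured = data[data[target].isna()]
data.loc[unmeasured.index, "predicted_clearance"] = model.predict(unmeasured[features])
```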
Scott Summerfield: You mentioned earlier Lipinski’s Rule of Five. So you’ve got hydrogen bond donors and acceptors, log P – which obviously has an influence across various parts of the life cycle of a drug, from the site of administration to the site of ultimate action – and I was wondering, with that computation, and maybe the time that we’ve had to think about physicochemical descriptors, do you think there have been any advances or additional descriptions of molecules?
Maybe simple descriptors, or maybe more complex ones that you might be able to pull out through quantum mechanical approaches – they might be a bit abstract to us, but I just wondered what you thought were the key developments beyond hydrogen bond donors and acceptors.
Matt Segall: Yeah, I think that is a key area of development. Essentially, the way that most QSAR models work is by using very simple descriptors – basically patterns of atoms that occur within those molecules. Each individual descriptor has very little information content about how that compound interacts with its surroundings, with proteins, with other molecules, and so on. And so what you end up with is actually a very complex model that tries to integrate lots and lots of inputs, each of which has very little information, into an overall prediction.
And so I’m a strong believer that the closer we can get to describing those molecules based on the fundamental physics and chemistry of what’s going on, the more information content those descriptors will have, and hence the simpler, more transferable, and more predictive those models will be. So yes, characterizing them in terms of the actual three-dimensional structure of those compounds, the charge distributions, the energetics – for example, pKa is obviously about either removing or adding a proton, and if we can actually look at the energetics of that and use it as an input to these machine learning models, we’ll get much better models in the long term. And actually, you’d be able to use smaller data sets of higher quality data with which to build those models.
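For contrast, here is a minimal sketch of the conventional approach Matt describes – thousands of low-information substructure bits feeding a statistical learner – assuming RDKit and scikit-learn are available; the data file and endpoint are hypothetical.

```python
# Sketch of a conventional fingerprint-based QSAR model: Morgan
# substructure bits as descriptors, a random forest as the learner.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

data = pd.read_csv("logd_measurements.csv")  # hypothetical: smiles, logd

# Each of the 2048 bits encodes the presence of a local atom pattern;
# individually they carry little information about the endpoint
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
X = [gen.GetFingerprintAsNumPy(Chem.MolFromSmiles(s)) for s in data["smiles"]]
y = data["logd"]

model = RandomForestRegressor(n_estimators=500, random_state=0)
print(cross_val_score(model, X, y, cv=5, scoring="r2"))
```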
Scott Summerfield: That’s a good point, because obviously it’s unlikely we’ll go back to the days of CLOP. And actually, I remember when I joined GSK, that phrase was used quite a lot. You mentioned energetics twice: once just a moment ago, and activation energies earlier as well. And that brings in, you know, thermodynamics.
And, you know, obviously that’s really important for target binding; any movement of drug across a membrane, any process, requires some thermodynamic consideration. I guess you don’t get that in, like you say, a simple descriptor, which in that sense is probably quite a flat parameter. In terms of quantum mechanical calculations and all the models that can come forward that way, is the assessment of free energies, et cetera, an important part of the way those models work? Forgive my naivety – I’m just curious.
Matt Segall: Yeah. The challenge that we have, of course, is that real-life biological systems are dynamic: things are moving, the molecules are changing conformation, they’re interacting with each other in a dynamic sense. And the way we characterize our molecules is typically very static. We look at the structure of the molecule, in the simplest case, as a 2D structure – the graph structure of the compound – never mind looking at it in 3D, never mind looking at how that 3D conformation changes.
So there are levels of abstraction and approximation that we’re always using. Even at a quantum mechanical level, we’re really looking at a snapshot of the molecule – a static picture, if you like. In an ideal world, we would run dynamic simulations and we could look at free energies exactly as you say, and for that we’d be looking at molecular dynamics. We’re not there yet, even given today’s computational power, to be able to do that on a routine basis.
There are what are called empirical molecular dynamics – using simple approximations to the forces between atoms – and they can run for timeframes that are meaningful in a biological sense, but it’s still a very expensive calculation. And if we try to go to quantum mechanics, which is orders of magnitude more computationally intensive, then we’re not able to run molecular dynamics for long enough to really get a good picture of the free energy.
Now that also may be changing. Thinking a little bit further into the future, a really exciting area being explored at the moment is actually a hybrid between quantum mechanics and machine learning: what are called machine learning potentials. By taking large databases of very accurate quantum mechanical simulations, we can train interatomic potentials, using machine learning, and then use those in molecular dynamics simulations that are much more accurate than the very simple potentials we typically use nowadays. So essentially you get quantum-mechanical-level accuracy at the speed of an empirical simulation. That’s a direction we’re heading in, but we’re not there yet.
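The fitting idea can be sketched in a few lines. The toy below trains a small neural network to reproduce a (here synthetic) energy surface from a simple geometric representation; real machine learning potentials such as ANI or MACE use far richer, physically motivated atomic-environment features, so treat this purely as an illustration of the workflow.

```python
# Toy illustration of fitting a machine learning potential: a regressor
# learns to reproduce reference energies from a cheap geometric
# representation, then scores new geometries at empirical-potential cost.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical representation: inverse interatomic distances for sampled
# conformers of a 10-atom molecule (45 atom pairs per conformer)
X_train = rng.uniform(0.2, 1.0, size=(5000, 45))

# Synthetic stand-in for the QM reference energies; in reality these come
# from DFT or coupled-cluster calculations on each conformer
y_train = (X_train ** 2).sum(axis=1) + rng.normal(0.0, 0.01, size=5000)

# Fit the surrogate "potential"
model = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=1000, random_state=0)
model.fit(X_train, y_train)

# The fitted surrogate now scores new conformers in microseconds, which is
# what makes long dynamics runs at near-QM accuracy feasible
X_new = rng.uniform(0.2, 1.0, size=(3, 45))
print(model.predict(X_new))
```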
Scott Summerfield: And coming back to the point you mentioned earlier, Matt, around the Rule of Five descriptors – hydrogen bond donors, acceptors, et cetera – I think you also mentioned calculation of pKas, which requires some chemical intuition, I guess, or knowledge. What would you say is driving the predictability and quality of those physicochemical measurements being done in silico?
Matt Segall: Yeah, so I guess there are three factors that improve the quality of a model. One is data: as the size of high-quality data sets increases, that’s obviously one factor that leads to improvement in the quality of the model. The second, as we discussed, is the way we characterize or describe those molecules as inputs – being able to describe those molecules in a more sophisticated way than simple descriptors. We mentioned quantum mechanics, for example, but there are other ways to represent them too.
And the third is the sophistication of the machine learning algorithms: as they get better and better at recognizing patterns, the quality of the predictions increases over time, and certainly that’s a trend we’ve seen. But there’s a tension between the quality of prediction within the chemical space the model has been trained on and the transferability of that model – how well can it extrapolate to new chemistries?
And again, representing those molecules in ways that relate to the fundamentals of physics and chemistry gives us greater transferability. But we also have to ensure that the models recognize when they are extrapolating beyond what they can confidently predict. So a big area is making sure that each model prediction has an associated uncertainty, so that the model recognizes when it’s gone beyond where it can make a confident prediction and is really transparent about that to the ultimate user.
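One common, simple way to attach such an uncertainty to each prediction is to use the spread across an ensemble – here, the trees of a random forest – and to flag compounds whose spread suggests the model is extrapolating. A sketch with scikit-learn on synthetic data; the flagging threshold is hypothetical.

```python
# Sketch: per-prediction uncertainty from the spread of an ensemble, used
# to flag predictions that fall outside the model's confident domain.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=50, noise=5.0, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

X_new = X[:5]  # compounds to score
per_tree = np.stack([tree.predict(X_new) for tree in model.estimators_])
mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)

for m, s in zip(mean, std):
    flag = "outside confident domain" if s > 20.0 else "ok"  # hypothetical cut-off
    print(f"prediction {m:8.2f} +/- {s:5.2f}  [{flag}]")
```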
Scott Summerfield: OK, yeah, cool. You’ve touched on a topic that is generating a huge amount of interest generally: beyond-Rule-of-Five molecules. There have always been molecules with molecular weights of a thousand, et cetera – they’ve always been there, almost like flagpoles – and by inference they don’t even meet Lipinski’s rules. Obviously I’m not teaching you anything you don’t know, but in some ways, particularly in DMPK, for example, we’re kind of back to square one.
A lot of the in vitro assays and those kinds of measurements are really hard to do when you’ve got very lipophilic molecules. Sometimes it’s almost impossible to measure the protein binding without adding something, you know, and then do you really have a true measure or not? I wondered about the physicochemical descriptor space and QSAR, et cetera, and where that stands with those molecules. Is that a step back as well? Or are there things in that space that we can take forward from the descriptors we have that are maybe applicable there?
Matt Segall: Yeah, that’s a huge challenge on the computational side as well. We talked about dynamics, so this actually segues very nicely: the challenge with these molecules is that they’re much larger and much more flexible. They have a much higher degree of conformational freedom than a typical small molecule drug, which tends to be relatively rigid – and because it’s a relatively rigid structure, we can characterize it using these sorts of molecular descriptors.
But when you’re talking about one of these large, flexible molecules, it doesn’t just sit in one conformation. It can change conformation depending on the environment it’s in or what it’s binding to, and so on. And in fact, those conformations can be exquisitely sensitive to small changes in the structure, so simple SAR almost breaks down, because you can lock in entirely new conformations with a single substitution.
So this is a really big unmet challenge in the QSAR space – even more conventional 3D molecular modeling struggles with these beyond-Rule-of-Five compounds. It’s an area that I’m really interested in, and some of our team are working on really sophisticated approaches for the 3D modeling of these structures, actually showing some remarkable results in collaboration with Merck and BMS and others: using small amounts of biophysical data and much more sophisticated explorations of conformational space to be able to really model these in 3D for the first time. And you really do need to model them in 3D, for the reasons we’ve just discussed.
So it is a big area of challenge. The other way that we’ve found to approach this is the imputation approach I spoke about before: if you have even a small amount of experimental data on these compounds, you can leverage that to make much better predictions of properties that are just impossible to model in that chemical space with QSAR models. We did a webinar with Merck just towards the end of last year, demonstrating the first application of these approaches to peptides and particularly macrocyclic compounds. And we’re working on a paper, so watch this space.
Scott Summerfield: A good plug. Yeah, I just wonder, you know, maybe there’s a “rule of something” out there – it won’t be as simple as the Rule of Five, I’m sure. Obviously there’s quite a strong reliance even now on in vivo work, which is in some ways not what you want – you want in vitro assays first – but in some areas that’s been a challenge for these beyond-Rule-of-Five molecules. So it’d be good to see what comes forward.
So you mentioned one of your thoughts on the future, which is the combination of machine learning and quantum mechanics. Where do you feel there are other significant advancements that might be on the horizon? And I guess as well, do you think there are things that the life sciences industry needs to do collectively – data curation, protocols, the standardization you mentioned earlier, collaboration, data transparency, those kinds of things?
Matt Segall: Yeah, absolutely. We’ve covered some of the longer-term developments that I see as being really important already, but there’s certainly more. You know, it’s all about the data at the end of the day; any model is only as good as the data on which you build it. And of course, the large global pharma companies have this rich seam of historical, consistent, high-quality data. I’d love to suggest a consortium for data sharing to be able to build models, but my experience tells me that’s unlikely. There have been many, many attempts to build such consortia – pre-competitive data sharing – and they inevitably fall down on considerations of IP and confidentiality and so on. So, unfortunately, I don’t think it is that simple.
But, you know, better data curation in the public domain would be really good. There are obviously great databases like ChEMBL and PubChem, and others, that are bringing together that data, but careful annotation and curation of it will be absolutely essential.
Our Moderator:
Scott Summerfield – Executive Director of Metabolism at Pharmaron
Scott Summerfield is the head of Metabolism, leading clinical and nonclinical radiolabeled ADME (pharma and environmental), in vivo support, imaging, as well as Discovery/Development metabolite ID and bioanalysis. Scott joined Pharmaron in 2022, having worked in the pharmaceutical industry for over 20 years, supporting both small and large molecule DMPK projects (Discovery and Development). He holds a PhD and completed postdoctoral research in protein mass spectrometry. He has published extensively in the areas of bioanalysis and the permeation of drugs across the blood-brain barrier.
Our Speakers:
Matt Segall – CEO of Optibrium
Matt Segall is the CEO of Optibrium. He holds a Master of Science in Computation from the University of Oxford and a Ph.D. in Theoretical Physics from the University of Cambridge. As Associate Director at Camitro (UK), ArQule Inc., and subsequently at Inpharmatica, he led a team in developing predictive ADME models and state-of-the-art, intuitive decision-support and visualization tools for drug discovery. In January 2006, he assumed responsibility for managing Inpharmatica’s ADME business, encompassing experimental ADME services and the StarDrop software platform. Following the acquisition of Inpharmatica, Matt became Senior Director responsible for BioFocus DPI’s ADMET division and, in 2009, led a management buyout of the StarDrop business to found Optibrium, which develops software and AI solutions for small molecule design, optimisation, and data analysis. Matt has published over 30 peer-reviewed papers and book chapters on computational chemistry, cheminformatics, and drug discovery.