Dear academic, biotech & drug discovery Twitter colleagues, I need your help! I'm collecting a list of benchmarks & evaluation datasets for protein-small molecule affinity and virtual screening capacity (e.g. published hit discovery campaign results). Which ones do you recommend? Tweet added by Gabriele Corso @GabriCorso

Gabriele Corso

5 months

Dear academic, biotech & drug discovery Twitter colleagues, I need your help! I'm collecting a list of benchmarks & evaluation datasets for protein-small molecule affinity and virtual screening capacity (e.g. published hit discovery campaign results). Which ones do you recommend?

Gabriele Corso

@GabriCorso

5 months

Examples include CSAR-HiQ, Merck FEP benchmark, CACHE #1 challenge... the more the merrier!

Gabriele Corso

@GabriCorso

5 months

Tagging a few people whose opinions I would love to have: @david_koes @olexandr @CGorgulla @akshat_ai @jchodera @alshedivat 🙏

Maruan Al-Shedivat

@alshedivat

5 months

@GabriCorso openff protein-ligand benchmark has a pretty diverse set of targets (although # of ligands per target is small). another one that comes to mind is merck fep, but you already mentioned it.

GitHub - openforcefield/protein-ligand-benchmark: Protein-Ligand Benchmark Dataset for Free Energy...

Protein-Ligand Benchmark Dataset for Free Energy Calculations - openforcefield/protein-ligand-benchmark

github.com

Gabriele Corso

@GabriCorso

5 months

@alshedivat Thanks Maruan! I actually was not aware of this one! Let me know if similar ones come to mind!

Jude Wells

@_judewells

5 months

@GabriCorso If you need solved structures: PDBbind and DUD-E. If you don't care about having the structure:

ChEMBL Database

A manually curated database of bioactive molecules with drug-like properties

www.ebi.ac.uk

Gabriele Corso

@GabriCorso

5 months

@_judewells Yeah, though in my experience these (without particular filtering) are all of somewhat poor quality and not very representative of what is actually useful in research/industry

Gabriele Corso

@GabriCorso

5 months

And connected to it, what is the right way of fairly evaluating methods on these (of course after blind prospective studies)? E.g. ensuring test proteins/pockets/ligands are never seen during training...
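
One way to picture the "never seen during training" constraint is a split that holds out whole proteins and then drops any training record whose ligand also occurs in the test set. The sketch below is only an illustration under stated assumptions: the record keys `protein_id` and `ligand_id` and the function name `leakage_free_split` are hypothetical, and matching on exact IDs is a simplification (a stricter setup would likely cluster proteins by sequence or pocket similarity and ligands by scaffold before splitting).

```python
import random
from collections import defaultdict


def leakage_free_split(records, test_fraction=0.2, seed=0):
    """Return (train, test) with no shared protein IDs and no shared ligand IDs.

    `records` is a list of dicts with hypothetical keys "protein_id" and
    "ligand_id". Exact ID matching is an assumption made for this sketch.
    """
    # Group records by protein so each protein lands entirely on one side.
    by_protein = defaultdict(list)
    for rec in records:
        by_protein[rec["protein_id"]].append(rec)

    # Shuffle proteins reproducibly and hold out a fraction of them.
    proteins = sorted(by_protein)
    random.Random(seed).shuffle(proteins)
    n_test = max(1, round(test_fraction * len(proteins)))
    test_proteins, train_proteins = proteins[:n_test], proteins[n_test:]

    test = [rec for p in test_proteins for rec in by_protein[p]]

    # Drop training records whose ligand also appears in the test set.
    test_ligands = {rec["ligand_id"] for rec in test}
    train = [
        rec
        for p in train_proteins
        for rec in by_protein[p]
        if rec["ligand_id"] not in test_ligands
    ]
    return train, test
```

Note that this trades data for cleanliness: any training record sharing a ligand with the test set is discarded, so the training set can shrink noticeably when the same ligands are measured against many targets.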

Giuseppe Marco (zeld) Randazzo

@GM_Randazzo

5 months

@GabriCorso Long story short: it depends on what you want to prove and achieve. Why not PoseBusters?

Gabriele Corso

@GabriCorso

5 months

@GM_Randazzo PoseBusters is only structural as far as I know; I'm mostly interested in affinity here!

Clemens Isert

@clemensisert

5 months

@GabriCorso Roche’s PDE10A dataset might be useful

A high quality, industrial data set for binding affinity prediction: performance comparison in...

Journal of Computer-Aided Molecular Design - We release a new, high quality data set of 1162 PDE10A inhibitors with experimentally determined binding affinities together with 77 PDE10A X-ray...

link.springer.com

Gabriele Corso

@GabriCorso

5 months

@clemensisert Thank you, yes this is very interesting (although I guess performance for a single target might be somewhat biased)!

José Jiménez-Luna

@josejimlun

5 months

@GabriCorso BindingDB protein-ligand validation sets. Old but contains lots of docked congeneric series data and some crystals.

Gabriele Corso

@GabriCorso

5 months

@josejimlun Thanks, I'll investigate!

Ahmet Sarıgün

@a_sarig_

5 months

@GabriCorso I've tried a dataset with molecular mechanics features derived from a subset of the PDBBind database. However, I would advise examining some of these features more closely. Paper: Dataset:

GitHub - LinaDongXMU/GXLE: Prediction of Binding Free Energy of Protein–Ligand Complexes with a...

Prediction of Binding Free Energy of Protein–Ligand Complexes with a Hybrid Molecular Mechanics/Generalized Born Surface Area and Machine Learning Method - LinaDongXMU/GXLE

github.com

Gabriele Corso

@GabriCorso

5 months

@a_sarig_ Thanks!

Andrea 🤌🏾 Ranieri

@4ndr3aR

5 months

@GabriCorso Hey there, just matching keywords here (I know almost nothing of the field), but could this be of some interest for you? It's from a few colleagues of mine, let me know if it may be relevant

GEO-Nav: a geometric dataset of voltage-gated sodium channels

Voltage-gated sodium (Nav) channels constitute a prime target for drug design and discovery, given their implication in various diseases such as epilepsy, migraine and ataxia to name a few. In...

arxiv.org

Sreejana Basu

@heyitsbasu

5 months

@GabriCorso I may be able to collaborate and help you with a database. Let's DM!

Dominique Beaini @ ICLR 2024

@dom_beaini

5 months

@GabriCorso I recommend getting in touch with @cas_wognum; he is at the center of a bio/pharma consortium for better benchmarks
