In the past,
we have received a number of queries about the status of the PDBbind core set.
We have also noticed some confusion in the literature regarding the
naming convention of the CASF benchmark developed by our group. Here, we would
like to make a formal statement about the PDBbind core set and the CASF
benchmark, in the hope of answering those queries and clarifying the confusion.
Our group has
a long-standing interest in scoring function development. The PDBbind
database is a notable outcome along the path (Liu et al., Acc. Chem. Res. 2017, 50, 302-309). The PDBbind database is now
updated on an annual basis, and each release of PDBbind is named after the
release year, such as PDBbind v.2016, PDBbind v.2017, and so on. The
PDBbind database collects experimentally measured binding affinity data for
four types of molecular complexes, i.e. protein-ligand complexes, nucleic
acid-ligand complexes, protein-protein complexes, and protein-nucleic acid
complexes. Among them, we call the collection of protein-ligand
complexes the "general set". We focus on this data set because
it is the most relevant to drug design and discovery studies. Obviously, not every
entry in the general set is suitable for calibrating or validating
docking/scoring methods, owing to various problems in 3D structure, binding data, and
other aspects. Therefore, we have selected the relatively
"healthy" entries from the general set to compile the so-called
"refined set". The refined set serves as a generally acceptable
data set for docking/scoring studies. Other researchers may apply the refined
set directly to their studies, or use the refined set as the starting point to
compile data sets with their own focus. Both the general set and the refined
set are updated with the PDBbind database on an annual basis. They should be
correctly cited as, for example, "the PDBbind general set v.2016",
"the PDBbind refined set v.2017", and so on.
As another part of our efforts, we have
established the CASF benchmark (Comparative Assessment of Scoring Functions),
which aims at providing an objective platform for assessing scoring functions. The
first published work was CASF-2007 (Cheng et al., J. Chem. Inf. Model. 2009, 49, 1079-1093). Another major update,
i.e. CASF-2013, was published a few years later (Li et al., J. Chem. Inf. Model. 2014, 54, 1700-1716; J. Chem. Inf. Model. 2014, 54, 1717-1736).
The CASF benchmark employs a high-quality set of protein-ligand complexes as the
primary test set. This data set, which we call the PDBbind "core set", is selected
from the PDBbind refined set through a systematic, non-redundant sampling
procedure. Accordingly, each public release of the CASF
benchmark is named after the version of the PDBbind database from which the
test set is selected. For example, the test set in CASF-2007 was compiled based
on PDBbind v.2007, the test set in CASF-2013 was compiled based on PDBbind
v.2013, and so on. Naming each CASF benchmark after its publication year would
not be a good idea, because we cannot predict when our paper will be published
at the time we prepare the manuscript.
It is important to point out that, unlike the PDBbind database, the PDBbind core
set is not updated on an annual basis. As implied above, the PDBbind core
set is a component of the CASF benchmark rather than the PDBbind database. The
CASF benchmark is not updated on an annual basis for the following reasons:
• A huge amount of effort is
needed to finish each CASF update. The CASF benchmark is more than a simple
data set. Instead, it consists of a whole set of evaluation methods, the
test set, and a large panel of standard scoring functions that are tested
as a demonstration. A lot of material needs to be prepared, and a lot of
computation needs to be conducted for each CASF update.
• Even if it were doable, in our opinion, there is no need to update
CASF so frequently. Our current plan is to update the CASF benchmark every
three years. In fact, we have already finished CASF-2016, and are preparing
a manuscript regarding it. We hope that this paper can be published in the year
As mentioned above, the last published version of the PDBbind core set is v.2013. This data set was not updated with PDBbind v.2014 or v.2015, so there is no PDBbind core set v.2014 or v.2015. For historical reasons, the PDBbind core set used to be included in the downloadable data package of some previous PDBbind releases. To avoid further confusion, we have removed the core set from the data packages of recent releases of PDBbind (e.g. PDBbind v.2014, v.2015, v.2016, and v.2017). If needed, users can find information on the PDBbind core set in the data package of the corresponding CASF benchmark (e.g. CASF-2007 and CASF-2013), which is also downloadable from the PDBbind-CN web site.
The take-home message is:
• The CASF benchmark should not
be referred to as the "PDBbind benchmark". This incorrect name
appears in the literature; the correct name is the CASF benchmark.
• The data package of the CASF
benchmark can be downloaded from the PDBbind-CN web site under the
"CASF" tab (http://www.pdbbind-cn.org/casf.asp). At this point, we do
not think it is necessary to set up two separate web sites to host PDBbind and CASF.
• Currently, the latest public
release of the CASF benchmark is CASF-2013. CASF-2016 will follow soon.
Introduction. The aim of the PDBbind database is to provide a comprehensive collection of the experimentally measured binding affinity data for all types of biomolecular complexes deposited in the Protein Data Bank (PDB). It thus provides an essential link between the energetic and structural information of these complexes, which is helpful for various computational and statistical studies on molecular recognition occurring in biological systems.
The PDBbind database was originally developed by Prof. Shaomeng Wang's group (http://sw16.im.med.umich.edu) at the University of Michigan in the USA, and was first released to the public in May 2004. This database is now maintained and further developed by Prof. Renxiao Wang's group (http://www.sioc-ccbg.ac.cn) at the Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, under a mutual agreement with the University of Michigan. The PDBbind database is now updated on an annual basis to keep up with the growth of the Protein Data Bank.
The current release, i.e. version 2018, is based on the contents of the PDB officially released by Jan 1st, 2018. This release provides binding data for a total of 19,588 biomolecular complexes, including protein-ligand (16,151), nucleic acid-ligand (125), protein-nucleic acid (896), and protein-protein complexes (2,416), which is currently the largest collection of this kind. Compared to the last release (v.2017), the binding data included in this release have increased by 9.43%. All binding data are curated by our group from over 34,700 original references. Moreover, a "refined set" and a "core set" (now CASF) are compiled as high-quality data sets of protein-ligand complexes for developing and validating docking/scoring methods. A brief introduction to the PDBbind database is available as a PDF brochure. See also "A Special Statement about the PDBbind core set and the CASF benchmark" (Mar 3rd, 2018).
The basic information of each complex in PDBbind is completely open for access (see the [BROWSE] page). Users are required to register under a license agreement in order to utilize the searching functions provided on this web site or to download the contents of PDBbind in bulk. Registration is free of charge to all academic and industrial users. Please go to the [REGISTER] page and follow the instructions to complete registration.
This project is financially supported by the National Natural Science Foundation of China (grants #81430083, #81172984, #21072213, #21102168, #21402230). We are very grateful to Prof. Zenghui (John) Zhang's group at East China Normal University for their help in collecting the raw data needed for versions 2015, 2016, and 2017.