Faculty of Mathematics and Natural Sciences (Fakultät für Mathematik und Naturwissenschaften)

Challenge in bi-molecular machine learning

Background

In molecular machine learning, we are interested in predicting properties of molecules. Since molecules need to be expressed as mathematical objects in a computer, it is common to start from a description of molecules by their atom types and atom coordinates, from which feature vectors or “representations” such as “Coulomb matrices” are derived. The regression task is then to learn the mapping from each molecule’s representation to a typically scalar-valued output such as a “ground state energy” or an “excitation energy”. Numerous machine learning models for solving such a regression task exist. Beyond neural networks, kernel-based regression models such as Kernel Ridge Regression or Gaussian Process Regression play an important role in this field. In recent years, models for learning the mapping from (single) input molecule geometries to real-valued quantities have received a lot of attention in the research community, and this task is by now quite well understood. A basic introduction to the regression task is given in [1,2].
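For orientation, the Coulomb matrix in its commonly used form has diagonal entries 0.5·Z_i^2.4 and off-diagonal entries Z_i·Z_j/|R_i − R_j|, where Z_i are nuclear charges and R_i are atom positions. The following is a minimal sketch of that definition; the function names `coulomb_matrix` and `flatten_upper` are illustrative and not part of the challenge material. Note that the definition is usually stated in atomic units, so coordinates given in Ångström may need converting, and that the plain matrix depends on atom ordering.

```python
import numpy as np

def coulomb_matrix(charges, coords):
    """Return the (unsorted) Coulomb matrix for one molecule.

    charges: nuclear charges Z_i, shape (n,)
    coords:  Cartesian coordinates R_i, shape (n, 3)
    """
    z = np.asarray(charges, dtype=float)
    r = np.asarray(coords, dtype=float)
    # Pairwise distances |R_i - R_j|
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    # Put a dummy 1.0 on the diagonal to avoid division by zero there
    np.fill_diagonal(dist, 1.0)
    # Off-diagonal entries: Z_i * Z_j / |R_i - R_j|
    cm = np.outer(z, z) / dist
    # Diagonal entries: 0.5 * Z_i ** 2.4
    np.fill_diagonal(cm, 0.5 * z ** 2.4)
    return cm

def flatten_upper(cm):
    """Flatten the upper triangle (including the diagonal) into a feature vector."""
    return cm[np.triu_indices(cm.shape[0])]

# Toy example: two hydrogen atoms at unit distance
cm_h2 = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
features_h2 = flatten_upper(cm_h2)
```

If all molecules in a dataset share the same atoms in the same order, the flattened upper triangle can serve directly as a feature vector; otherwise a sorted or padded variant would be needed.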

In this challenge, we address a more complicated case: the prediction of properties that are associated with two molecules, i.e. properties that describe some sort of interaction between two molecules, e.g. the “excitonic coupling”. In other words, a scalar-valued property is to be predicted for a pair of molecules. This task is more demanding than the single-molecule case, as the function to be learned typically depends on the shape of the first molecule, on the shape of the second molecule, and on the pairwise interaction between the two. Therefore, a much larger number of training samples often needs to be calculated to allow for an accurate prediction of a bi-molecular property, which makes the construction of machine learning models more expensive. Our research therefore aims at developing techniques to build regression models of identical quality but with much smaller training sets, leading to an overall reduced model cost.

The challenge

Our particular machine learning challenge covers the basic case of learning a bi-molecular property. Reference [3] provides the underlying dataset. For pairs of 200 molecular geometries, i.e. a total of 200×200 such pairs, artificially generated excitonic coupling data is provided. The geometries of the molecules that form the first input are stored in the file “Coord_A.xyz”, while the geometries of the molecules that form the second input are stored in the file “Coord_B.xyz”. Both files are “xyz” files, a common file format for representing molecules. Each contains 200 concatenated xyz files that describe the 200 molecular geometries. The associated outputs for all input pairs are provided in “CouplingEnergies.csv”, which maps the indices of the molecular geometries to the coupling energies. As an additional file, which may or may not be relevant for this challenge (depending on the chosen approach), one also finds the file “Coord_supermol.xyz”. It contains the 200×200 geometries that one would obtain by treating each pair of input molecules as one big molecule, a “supermolecule”, instead of two individual molecules.
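A concatenated xyz file can be read with a few lines of code. The sketch below assumes the standard xyz layout (one line with the atom count, one free-form comment line, then one line per atom with element symbol and x, y, z coordinates); whether the challenge files follow exactly this layout should be checked against the data. The function name `read_multi_xyz` and the sample text are illustrative only.

```python
def read_multi_xyz(text):
    """Parse concatenated xyz frames into a list of (symbols, coords) tuples."""
    lines = text.splitlines()
    frames = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            # Skip blank lines between frames
            i += 1
            continue
        n_atoms = int(lines[i].split()[0])
        # lines[i + 1] is the comment line and is ignored here
        symbols, coords = [], []
        for line in lines[i + 2 : i + 2 + n_atoms]:
            parts = line.split()
            symbols.append(parts[0])
            coords.append([float(x) for x in parts[1:4]])
        frames.append((symbols, coords))
        i += 2 + n_atoms
    return frames

# Made-up sample with two frames (not taken from the challenge dataset)
sample = (
    "2\nframe 0\nH 0.0 0.0 0.0\nH 0.0 0.0 0.74\n"
    "3\n\nO 0.0 0.0 0.0\nH 0.0 0.76 0.59\nH 0.0 -0.76 0.59\n"
)
frames = read_multi_xyz(sample)
```

For a file on disk, the same function could be applied to the file contents, e.g. `read_multi_xyz(open("Coord_A.xyz").read())`.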

The task is then to build and benchmark a bi-molecular machine learning model that maps pairs of input molecules to scalar outputs, using Kernel Ridge Regression or Gaussian Process Regression. As the representation for the molecular geometries, it is recommended to use Coulomb matrices, which one can either implement oneself or take from any existing library that calculates that representation. To check the outcome of the model, each participant in the challenge is asked to generate a “learning curve”, which is a double-logarithmic plot that has the number of training sample molecules on the horizontal axis and the mean absolute error, obtained by some (cross-)validation approach, on the vertical axis. Finally, participants create a one(!) page PDF document (text plus figures) in which they present the learning curve accompanied by a brief write-up of the model, chosen parameters, validation method, etc.
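To illustrate what a learning curve computation can look like, here is a minimal Kernel Ridge Regression sketch with a Gaussian kernel, evaluated on random train/test splits of growing training size. It runs on synthetic stand-in data, not on the challenge dataset, and the kernel choice, the hyperparameters `sigma` and `lam`, the repeated hold-out validation, and all function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all rows of X and Y."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def krr_fit(X, y, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression coefficients."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_test, sigma):
    """Predict as a kernel-weighted sum over the training samples."""
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

def learning_curve(X, y, train_sizes, n_test, sigma, lam, n_repeats=5, seed=0):
    """Mean absolute error on held-out samples, averaged over random splits."""
    rng = np.random.default_rng(seed)
    maes = []
    for n in train_sizes:
        errs = []
        for _ in range(n_repeats):
            perm = rng.permutation(len(X))
            test, train = perm[:n_test], perm[n_test:n_test + n]
            alpha = krr_fit(X[train], y[train], sigma, lam)
            pred = krr_predict(X[train], alpha, X[test], sigma)
            errs.append(np.mean(np.abs(pred - y[test])))
        maes.append(float(np.mean(errs)))
    return maes

# Synthetic stand-in data (NOT the challenge dataset)
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(600, 2))
y = np.sin(3.0 * X[:, 0]) * np.cos(2.0 * X[:, 1])
sizes = [25, 50, 100, 200, 400]
maes = learning_curve(X, y, sizes, n_test=100, sigma=0.5, lam=1e-6)
```

The resulting `sizes` and `maes` can then be drawn on double-logarithmic axes, for example with matplotlib's `plt.loglog(sizes, maes, "o-")`. In practice, `sigma` and `lam` would be selected by a validation procedure rather than fixed by hand.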

To summarize, the challenge requires participants to

  1. read the papers referred to above,
  2. familiarize themselves with the attached data set,
  3. think about how to model the bi-molecular regression task, in particular how to address the bi-molecular input,
  4. optionally use libraries, e.g. to build molecular representations such as the Coulomb matrix,
  5. implement the Kernel Ridge Regression or Gaussian Process Regression approach,
  6. create a learning curve, and
  7. document everything in a one(!) page PDF.
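As one possible way to think about the bi-molecular input in step 3, the sketch below builds a kernel between pairs of molecules as the product of a kernel on the first molecules and a kernel on the second molecules. This is only one modeling option among several (another being a single representation of the supermolecule) and is not prescribed by the challenge; the function names, the Gaussian kernel, and the random stand-in features are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all rows of X and Y."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def pair_kernel(XA, XB, pairs_1, pairs_2, sigma_a, sigma_b):
    """Product kernel between two lists of (index_A, index_B) pairs.

    XA, XB: feature matrices of the first and second molecules
    pairs_1, pairs_2: sequences of (i, j), meaning molecule i of A with molecule j of B
    """
    p1, p2 = np.asarray(pairs_1), np.asarray(pairs_2)
    KA = gaussian_kernel(XA[p1[:, 0]], XA[p2[:, 0]], sigma_a)
    KB = gaussian_kernel(XB[p1[:, 1]], XB[p2[:, 1]], sigma_b)
    # Two pairs are similar only if both constituent molecules are similar
    return KA * KB

# Random stand-in features (NOT derived from the challenge dataset)
rng = np.random.default_rng(0)
XA = rng.normal(size=(4, 3))
XB = rng.normal(size=(5, 3))
pairs = [(0, 0), (1, 2), (3, 4)]
K_pairs = pair_kernel(XA, XB, pairs, pairs, sigma_a=1.0, sigma_b=1.0)
```

With equal length scales, this product of Gaussian kernels coincides with a single Gaussian kernel on the concatenated feature vectors of both molecules; separate length scales add one extra degree of freedom per molecule.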

Submission and further details

Participants need to provide their submission (one-page PDF, source code) in two different formats:

  • one PDF containing first the one-page write-up and then a concatenated listing of all source code (the latter ideally as a Jupyter notebook exported to PDF)
  • the one-page write-up and all source code and/or binary files as one compressed archive, which they upload to an online storage service of their choice and for which they provide the link as part of the cover letter.

The single PDF (containing the write-up and the source code) is submitted to the online portal at https://stellenausschreibungen.uni-wuppertal.de together with the cover letter (containing the link to the compressed archive) and the other application documents.

The description intentionally does not cover all aspects of the task. The submitted material will make it possible to better understand how participants approach and document a given task. It should also be noted that the final performance of the model is not the central point of interest. Instead, the evaluation looks for a working solution whose functionality can be clearly understood, together with explanations that clearly convey the chosen solution strategy. Still, as a point of reference, it should be noted that kernel-based models can achieve an error in the range of 0.01 to 0.1 as mean absolute error (MAE).

While working on the solution, participants should think of the challenge as a task that they would carry out as a PhD student in the group. Therefore, as with any normal task for a PhD student in our group, e-mail-based interaction, requests for clarification, and questions on unclear points of the task are welcome.

A last word on large language models (LLMs): Participants are invited to use them as much as they would in their daily lives, while ensuring that they fully understand what a solution provided by an LLM does. If LLMs are used, the full prompt history used to complete the challenge has to be submitted as part of the compressed archive.

References:

  1. https://pure.mpg.de/pubman/faces/ViewItemOverviewPage.jsp?itemId=item_2175859 (introductory article, preprint)
  2. https://onlinelibrary.wiley.com/doi/full/10.1002/qua.24954 (introductory article, published version)
  3. BiMolData.tgz (dataset file)