A few months ago my boss declared me a Machine Learning researcher. On that day, my area of research officially shifted from "DFT calculations for materials science" to "Machine Learning for materials science". Of course, I had already shown a lot of interest in ML over the past year (which involved a lot of Coursera certifications!), but I hadn't yet had the opportunity to really get into it. This official change in my area of research was therefore the green light for me to dive deep into this exponentially growing and exciting field.
My first thought was to dust off my one-year-old Kaggle account, on which I had barely set foot. I was planning to find a 'not-too-hard' competition in order to apply what I had learned and see how it turned out in the real world (whether Kaggle is the real world is an entirely different discussion; my friend Bojan says it's even better!). It must have been karma that decided Kaggle would launch a competition that very day entitled Predicting Molecular Properties, in which the goal was to predict certain magnetic interactions between atoms in molecules and compare the predictions with DFT calculations.
That was the perfect competition for me! I had been doing DFT calculations for the past 10 years, so I clearly had some domain knowledge, although I am more of a physicist than a chemist. Still, it was an opportunity for me to focus on the ML, rather than struggling to understand both the ML and the scientific background. So I decided to jump in and began reading the competition discussions and kernels.
It was overwhelming. Really. There was a wide gap between what I had learned in textbooks and the actual use of ML in Kaggle competitions. My first source of information and inspiration was Andrew's kernels. The guy is the absolute Kaggle Kernels Grandmaster; we're talking the top-of-the-pyramid kind of guy. I learned so much from his kernels that I decided to take a veeeeeeery long shot: ask him whether he wanted to team up with me, so that we could own this competition together. And he accepted. I still don't understand why, but he did. And that triggered a very fortunate chain of events that led us all the way to a gold medal.
Andrew and I spent a few days climbing up the leaderboard. My domain knowledge combined with his ML skills was a good formula. I learned about LightGBM (Light Gradient Boosting Machine) and how it outperformed most of the models I had tried before, became better and better at Python programming, and even wrote my first (and, at the time of this writing, only) two kernels: HOW TO: Easy Visualization of Molecules and Introducing Atom-Centered Symmetry Functions: Application to the prediction of Mulliken charges, with 234 and 196 upvotes, respectively, and approximately 5,000 views each. Once again, it must have been karma that had me attend a workshop two weeks earlier on applying ML to materials science problems, with a focus on molecular properties! It proved very helpful for the competition.
A couple of days later, I received a message from Kaggler Psilogram, whose current competition rank is 5th out of 116,303. Apparently, Phil had sent me a team invitation a few days earlier, which I had never answered, and he was kindly reminding me that, according to Kaggle rules, every team invitation has to be answered. Truth is, I had no idea what he was talking about, since I had never received any invitation. Because obviously, I would have answered! You don't turn down that kind of invitation, do you? So I apologized and took a second veeeeeeeeery long shot: ask Phil whether he was interested in joining our team! And he accepted! The force was strong in our team: domain knowledge, exploratory data analysis (EDA), and very serious ML expertise. Things were getting very interesting, and the adventure had already far exceeded my expectations.
From this point on, everything went very fast. Phil used the "chemical" input features I had created (called ACSF; see this kernel) along with a lot of his ML magic. I learned about blending and meta-feature generation, and what impressed me most was that Phil had managed to reach the top 1% (at that point in the competition) just by using what was available in the public kernels and discussions. The takeaway is that many very important pieces of information are scattered throughout the discussions and kernels, and that to reach the top, one has to go through all of them thoroughly, which takes a fair amount of time. Little by little, we continued our ascent to the top, and Bojan subsequently joined our team. That was another big shot coming in: Bojan is currently 16th out of 116,303 Kagglers, and I learned that he was the master of ensembling (though at the time I did not know what that meant...).
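For readers unfamiliar with blending: in its simplest form, it is just a weighted average of several models' predictions. The sketch below is a minimal illustration with made-up model names, predictions, and weights, not the actual blend we used:

```python
# Minimal blending sketch: weighted average of per-model predictions.
# All model names, prediction values, and weights below are hypothetical,
# chosen purely for illustration.

def blend(predictions, weights):
    """Weighted average of several models' predictions (lists of floats)."""
    assert len(predictions) == len(weights)
    total = sum(weights)
    n = len(predictions[0])
    return [
        sum(w * preds[i] for preds, w in zip(predictions, weights)) / total
        for i in range(n)
    ]

# Three hypothetical models predicting the same four coupling constants:
lgbm_preds = [1.0, 2.0, 3.0, 4.0]
nn_preds   = [1.2, 1.8, 3.1, 3.9]
krr_preds  = [0.9, 2.1, 2.9, 4.2]

# Weights would normally be tuned on a validation set.
blended = blend([lgbm_preds, nn_preds, krr_preds], weights=[0.5, 0.3, 0.2])
```

In practice the weights are chosen by optimizing the validation score, and meta-feature generation goes a step further: the individual models' predictions become input features for a second-level model.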
Having Bojan on board marked another milestone. Not only because of his ML skills, but also because I got to join his "Hard-Core Modeling" Slack channel! I discovered a frightening new world, a world of top-level Data Scientists. It was as if all the top Kagglers had decided to join a party and were discussing their latest models and discoveries out in the open. It was frightening because I could not understand a word of what they were saying. I was drowning under a mountain of information, and it made me realize how little I knew about ML. But I did not mind. After all, it is always rewarding to realize that you are getting better every day, which I believe I did over the course of the competition.
Then LGBM hit a wall. We were at around -2-ish on the leaderboard, and clearly some competitors ahead of us had figured out a way to get deep into the -2 club. We had the best LGBM users on the team, so it had to be something else, and a quick look at the scientific literature, together with Heng's discussion on message passing neural networks (MPNN), gave us the answer: deep learning (DL) was the key. None of us on the team was a DL expert. I had a fair amount of knowledge of neural networks, especially the maths behind them, but MPNNs were not in the same ballpark, not even the same league. We thought about asking Heng to join, but Bojan recommended someone else he had already worked with: Christof. According to Bojan, the guy was a DL expert, so we trusted him, and Christof joined us.
Christof was actually more than a DL expert. He also had a PhD in mathematics and impressive programming skills, which obviously helped a lot. We all started working on finding the best architecture for the problem at hand, and Christof implemented the candidates in the blink of an eye. The dive into Graph Neural Networks (GNNs, which is what this family of models is called) opened yet another world to me, beyond the Convolutional and Recurrent Neural Networks (CNNs, RNNs) that I had already heard of. The basic principle of GNNs in chemistry is that a molecule is first featurized into a graph, with atoms acting as vertices (or nodes) and bonds as edges. Each atom and bond is then given some chemical and geometrical features: aromaticity, number of neighbors, single/double/triple bond character, angles between atom triplets, dihedral angles between atom quadruplets, and more. The next step is to have these nodes and edges interact with one another. There are many different architectures for this, and ours was based on the SchNet architecture. The final block, called the regression head, computes the target we wanted to predict: the scalar coupling constant. In the end, not much was left of the original SchNet. We had built our own custom architecture, our winning solution: a densely connected GNN.
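The interaction step can be sketched as a single round of message passing on a toy graph. Everything below (the tiny three-atom molecule, the feature values, the simple additive update rule) is a simplified illustration of the general idea, not our actual architecture:

```python
# Toy message-passing step on a molecular graph.
# Nodes = atoms (each with a small feature vector), edges = bonds
# (each with a scalar feature, e.g. a distance-based weight).
# One round: each atom adds its neighbors' features, scaled by the
# edge feature, to its own state.

# A hypothetical 3-atom molecule: atom 0 bonded to atoms 1 and 2.
node_features = {
    0: [1.0, 0.0],   # e.g. a crude encoding of the atom type
    1: [0.0, 1.0],
    2: [0.0, 1.0],
}
edges = {            # (i, j) -> edge feature (made-up values)
    (0, 1): 0.5,
    (0, 2): 0.5,
}

def message_passing_step(nodes, edges):
    """Return updated node states after one aggregation round."""
    new = {i: list(h) for i, h in nodes.items()}
    for (i, j), w in edges.items():
        # Messages flow both ways along an (undirected) bond.
        for a, b in ((i, j), (j, i)):
            for k in range(len(nodes[b])):
                new[a][k] += w * nodes[b][k]
    return new

updated = message_passing_step(node_features, edges)
```

A real architecture like SchNet replaces the fixed scalar edge weight with learned, distance-dependent continuous filters and stacks many such interaction blocks; the regression head then reads the final node and edge states to predict the target.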
Although the competition has ended, I still have a lot of work to do. Christof was working so fast that I did not have time to really grasp all the concepts he implemented. I suspect none of us but Christof has. So it will probably take me another month to study the architecture, run some more tests, benchmark it on the QM9 dataset, and eventually write an article with the team. That shouldn't be a problem; writing articles is what we research scientists do. It would be a shame to let all this work go to waste!
I am grateful for this amazing opportunity. Working with such amazing and talented team members was a great honor, especially for me as a Kaggle beginner. It is strange to see how fast things unfolded after Andrew agreed to team up with me. Talk about a butterfly effect!
And now life goes on, I can finally focus my mind on other things. But a little devil on my shoulder is whispering to me: "Hey! Why don't we take a look at the active competitions on Kaggle?".