samrodriques

What will it take to build an AI Scientist?

I run FutureHouse, a non-profit AI-for-Science lab where we are automating research in biology and other complex sciences. Several people have asked me to respond to Sakana's recent AI Scientist paper. However, judging from comments on Hacker News, Reddit, and elsewhere, I think people already get it: Sakana's AI Scientist is just ChatGPT (or Claude) writing short scripts, making plots, and grading its own work. It's a nice demo, but there's no major technical breakthrough. It's also not the first time someone has claimed to make an AI Scientist, and there will be many more such claims before we actually get there.


So, putting Sakana aside: what are the problems we have to solve to build something like a real AI scientist? Here’s some food for thought, based on what we have learned so far:



It will take fundamental improvements in our ability to navigate open-ended spaces, beyond the capabilities of current LLMs

Scientific reasoning consists of essentially three steps: coming up with hypotheses, conducting experiments, and using the results to update one’s hypotheses. Science is the ultimate open-ended problem, in that we always have an infinite space of possible hypotheses to choose from, and an infinite space of possible observations. For hypothesis generation: How do we navigate this space effectively? How do we generate diverse, relevant, and explanatory hypotheses? It is one thing to have ChatGPT generate incremental ideas. It is another thing to come up with truly novel, paradigm-shifting concepts. 
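
To make that loop concrete, here is a minimal sketch of the three-step structure in Python. Every component passed in (the hypothesis generator, the experiment picker, the belief update) is a placeholder named purely for illustration; the open problem is making each of them good enough to navigate a genuinely open-ended space.

```python
# A minimal sketch of the hypothesize -> experiment -> update loop.
# All callables passed in are hypothetical placeholders, not a real implementation.

def research_loop(question, generate_hypotheses, pick_experiment,
                  run_experiment, update_beliefs, budget=10):
    hypotheses = generate_hypotheses(question)           # step 1: propose hypotheses
    for _ in range(budget):
        experiment = pick_experiment(hypotheses)         # step 2: choose the next experiment
        result = run_experiment(experiment)
        hypotheses = update_beliefs(hypotheses, result)  # step 3: update on the evidence
    return hypotheses
```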


It will take tight integration with experiments

Once we have a hypothesis, we then need to decide which experiment to conduct. This is an iterative process. How can we identify experiments that will maximize our information gain? How do we build affordance models that tell us which experiments are possible and which are impossible? Affordance models are critical, because discovery is about doing things that have never been done before.
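
One standard way to formalize "which experiment maximizes information gain" is Bayesian experimental design: score each candidate experiment by the expected reduction in entropy over your hypotheses, after filtering out candidates an affordance model says you cannot actually run. The sketch below assumes a discrete hypothesis set and known outcome likelihoods, which is a big simplification of real lab work; the names are illustrative.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def expected_information_gain(prior, likelihoods):
    """prior: P(h), shape (H,); likelihoods: P(outcome | h, experiment), shape (H, O)."""
    p_outcome = prior @ likelihoods                       # P(o), shape (O,)
    posterior = prior[:, None] * likelihoods / p_outcome  # P(h | o), shape (H, O)
    expected_posterior_entropy = sum(
        p_outcome[o] * entropy(posterior[:, o]) for o in range(likelihoods.shape[1])
    )
    return entropy(prior) - expected_posterior_entropy

def pick_next_experiment(prior, candidates, is_feasible):
    """candidates: dict name -> (H, O) likelihood array; is_feasible: the affordance check."""
    feasible = {name: L for name, L in candidates.items() if is_feasible(name)}
    return max(feasible, key=lambda name: expected_information_gain(prior, feasible[name]))
```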


And then, once we have obtained results from an experiment: how do we tell whether the results are reliable, what sources of bias or confounding may exist, and how do we use that evidence to update our model of the world? This is particularly challenging when some of the evidence comes as images, some as sequencing data, and so on.
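
One simple way to fold reliability into the update, as a sketch rather than a recipe, is a tempered Bayesian update: discount each piece of evidence by an estimated reliability, so a noisy assay moves your beliefs less than a clean one. Estimating that reliability is itself the hard part and is simply assumed here.

```python
import numpy as np

def update_beliefs(prior, likelihood, reliability):
    """Tempered Bayesian update over a discrete hypothesis set.
    prior, likelihood: arrays of shape (H,); reliability: a value in [0, 1]
    (0 = ignore this evidence entirely, 1 = trust it as a clean measurement)."""
    posterior = prior * likelihood ** reliability
    return posterior / posterior.sum()

# Evidence from different modalities (imaging, sequencing, ...) can be applied
# sequentially, each with its own reliability weight. Numbers are illustrative.
beliefs = np.array([0.5, 0.3, 0.2])
beliefs = update_beliefs(beliefs, likelihood=np.array([0.9, 0.2, 0.1]), reliability=0.8)
beliefs = update_beliefs(beliefs, likelihood=np.array([0.6, 0.5, 0.4]), reliability=0.3)
```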


Improving performance on a defined benchmark is one thing; doing real science is another. For research in the natural sciences in particular, we do not have good simulations. The only way to make progress is by combining the in-silico part of science with real experiments.


It will take a ton of engineering

It is easy to build something that makes a small number of queries to the Semantic Scholar API. If you want access to full-text PDFs from the entire literature, at scale, it is a serious engineering undertaking. If you further want to access all of the publicly available databases on the web, use publicly available computational tools, interface with equipment in the wet lab, spin up servers, or do whatever else real scientists do, it's a huge amount of engineering effort.
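
For a sense of the gap: the "easy" end looks like the snippet below, a couple of metadata queries against the public Semantic Scholar Graph API (the endpoint and fields follow its published documentation; the query itself is made up). Getting from there to full-text access across the entire literature, with rate limits, parsing, deduplication, and licensing handled, is where the engineering actually lives.

```python
import requests

# A handful of metadata queries against the public Semantic Scholar Graph API.
# This is the easy part; it does not get you full-text PDFs at scale.
resp = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    params={"query": "CRISPR off-target effects", "fields": "title,year,abstract", "limit": 10},
    timeout=30,
)
resp.raise_for_status()
for paper in resp.json().get("data", []):
    print(paper.get("year"), paper.get("title"))
```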


It will take robust and scalable evaluations

Finally, as a field, we need robust and scalable evaluations for the accuracy of the AI Scientist systems we want to build. If we ask an LLM to do some analysis or read a paper or implement some method, we need to know how reliable its answer is. We have built infrastructure for scaling up human evals internally, and have used it to create LAB-Bench, an open set of evals for a variety of scientific tasks. We hope others will join us.
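
The mechanics of a scalable eval are not complicated once the answers are machine-checkable. The sketch below is a generic multiple-choice grading loop, not the actual LAB-Bench schema or harness (get those from the LAB-Bench release itself); the file fields and the ask_model callable are illustrative.

```python
import json
import re

def grade(model_answer: str, correct_letter: str) -> bool:
    """Extract the first standalone choice letter from the model's answer and compare."""
    match = re.search(r"\b([A-E])\b", model_answer.upper())
    return match is not None and match.group(1) == correct_letter.upper()

def run_eval(questions_path: str, ask_model) -> float:
    """questions_path: JSONL with illustrative fields 'prompt' and 'answer';
    ask_model: any callable mapping a prompt string to an answer string."""
    with open(questions_path) as f:
        questions = [json.loads(line) for line in f]
    correct = sum(grade(ask_model(q["prompt"]), q["answer"]) for q in questions)
    return correct / len(questions)
```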


At the same time, we cannot expect LLMs to always perform well in a zero-shot context. We need environments that mimic (or, indeed, simply implement) core aspects of scientific research, and that emit high-quality reward signals that we can use to train our AI Scientist agents at scale. Defining rewards for open-ended tasks is one of the most challenging pieces of machine learning, and one of the areas where the most innovation will be needed. We may also need fundamental advances in reinforcement learning that will allow us to learn in highly complex and open-ended environments.
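
As a toy illustration of an environment that implements a small piece of research work and emits a reward signal: in the sketch below, the agent must estimate a hidden effect size from noisy data and is rewarded for getting close. Everything here (the task, the interface, the reward) is invented for illustration; designing environments and rewards that capture real research is exactly the open problem.

```python
import numpy as np

class ReproduceEffectEnv:
    """A Gym-style toy task: recover a hidden effect size from noisy observations.
    reset() returns an observation; step(estimate) returns (obs, reward, done, info)."""

    def __init__(self, seed: int = 0):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.true_effect = self.rng.uniform(-2.0, 2.0)
        x = self.rng.normal(size=200)
        y = self.true_effect * x + self.rng.normal(scale=0.5, size=200)
        self.obs = {"x": x, "y": y, "task": "estimate the effect of x on y"}
        return self.obs

    def step(self, estimate: float):
        reward = -abs(estimate - self.true_effect)  # denser reward for closer estimates
        return self.obs, reward, True, {"true_effect": self.true_effect}
```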


Closing thought

Clearly, we have a lot of work to do. It's not going to be solved today, and it's not going to be solved by GPT-5; it's going to take a lot of elbow grease over an extended period. But it will happen, it will change the world when it does, and you will know it is happening because you will get that "oh shit" feeling in your stomach when the AI Scientist starts telling you about new experiments to conduct and new insights about the world that you had never considered before. If you have thoughts about this, or about any of the above, get in touch -- hello@futurehouse.org.
