Evaluating Large Language Models in Scientific Discovery
Summary
This paper introduces SDE (Scientific Discovery Evaluation), a benchmark framework that evaluates LLMs on actual scientific discovery tasks rather than decontextualized knowledge tests. Domain experts define real research projects across biology, chemistry, materials science, and physics, each decomposed into modular scenarios.
The framework assesses models at two levels:
- Question-level: accuracy on scenario-specific items
- Project-level: ability to propose testable hypotheses, design experiments/simulations, and interpret results
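Concretely, I imagine the two-level scoring looking something like the sketch below. This is purely my own reconstruction: the paper doesn't publish its scoring code, so the class names, the 0-1 rubric scales, and the unweighted averaging are illustrative assumptions, not the benchmark's actual implementation.

```python
# Hedged sketch of how an SDE-style two-level score might be aggregated.
# All names, rubric scales, and the unweighted averaging below are my own
# assumptions for illustration, not the benchmark's published method.
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Scenario:
    """One modular scenario: a set of question-level items with graded answers."""
    name: str
    item_scores: list[float]  # assumed grading: 1.0 = correct, 0.0 = incorrect

    @property
    def accuracy(self) -> float:
        return mean(self.item_scores) if self.item_scores else 0.0


@dataclass
class Project:
    """An expert-defined research project: its scenarios plus rubric scores
    for the open-ended, project-level abilities."""
    domain: str                        # e.g. "biology", "materials science"
    scenarios: list[Scenario] = field(default_factory=list)
    hypothesis_score: float = 0.0      # quality of proposed testable hypotheses
    design_score: float = 0.0          # experiment/simulation design
    interpretation_score: float = 0.0  # interpretation of results

    def question_level(self) -> float:
        """Mean accuracy across the project's scenario items."""
        return mean(s.accuracy for s in self.scenarios) if self.scenarios else 0.0

    def project_level(self) -> float:
        """Unweighted mean of the three rubric dimensions (an assumption)."""
        return mean([self.hypothesis_score, self.design_score, self.interpretation_score])


def report(projects: list[Project]) -> None:
    """Print both levels side by side so any gap between them is visible."""
    for p in projects:
        print(f"{p.domain}: question-level={p.question_level():.2f}, "
              f"project-level={p.project_level():.2f}")
```

Whatever the real aggregation is, keeping the two levels separate is what lets the paper's most interesting observation surface at all: a model can look weak on question-level accuracy yet still contribute something useful at the project level.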
Key findings:
- A consistent performance drop compared to the same models' scores on general science benchmarks
- Diminishing returns from scaling up model size and reasoning effort
- Systematic weaknesses shared across top-tier models from different providers
- Large variation across scenarios means the “best” model changes depending on the project
- All current LLMs are far from general scientific “superintelligence”
My thoughts
The most interesting finding: LLMs can still show promise at the project level even when their scores on the constituent scenarios are low, which highlights the role of guided exploration and serendipity in discovery. Science isn't just about answering questions correctly; it's about asking the right questions and making unexpected connections.
This echoes my view that LLMs are tools for augmenting human cognition, not replacing it. The benchmark’s focus on iterative reasoning, hypothesis generation, and observation interpretation feels much closer to how science actually works than typical Q&A benchmarks.