Google Introduces Simula: A Reasoning-First Framework for Generating Controllable, Scalable Synthetic Datasets Across Specialized AI Domains
Training powerful AI models depends on one resource that is quietly running out: specialized data. While the internet provided a seemingly infinite supply of text and images to train today’s generalist models, the next wave of AI breakthroughs — in cybersecurity, legal reasoning, healthcare, and other niche domains — requires data that simply doesn’t exist in sufficient volume, or can’t be accessed due to privacy concerns.
A team of researchers from Google and EPFL introduces Simula, a reasoning-driven framework for synthetic data generation and evaluation that prioritizes transparency, fine-grained control, and scalability. Unlike conventional approaches, Simula doesn’t rely on seed data from the target distribution, hand-crafted prompts, or evolutionary algorithms — it constructs each dataset from first principles, treating data generation as a problem of mechanism design.
Why Synthetic Data Generation is Harder Than It Looks
If you’ve worked with fine-tuning pipelines or domain-specific model training, you’ve likely run into the ‘not enough data’ wall. Manually collecting and annotating specialized datasets is expensive, time-consuming, and error-prone. But the obvious workaround — just prompt a large language model (LLM) to generate training data — runs into its own set of problems.
Most existing synthetic data methods optimize for only a subset of what the researchers define as the three axes of ‘good’ data: quality, diversity, and complexity. Quality refers to whether a data point meets specific semantic and syntactic requirements. Diversity covers both global coverage (do you have examples from across the entire concept space?) and local variation (do you have multiple distinct takes on each concept?). Complexity captures how confusing, uncommon, or elaborate a given example is. Simultaneously controlling all three, at scale, with explainability, is the unsolved challenge that Simula directly targets.
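The three axes can be pictured as independent, per-example scores. The sketch below is illustrative only — the field names, 0-to-1 scales, and the quality threshold are assumptions for this example, not Simula's actual API; note that in Simula, diversity and complexity are steered upstream during generation rather than filtered after the fact.

```python
from dataclasses import dataclass

@dataclass
class AxisScores:
    quality: float     # meets the semantic/syntactic requirements (0-1)
    diversity: float   # contribution to global coverage + local variation (0-1)
    complexity: float  # how confusing, uncommon, or elaborate the example is (0-1)

def acceptable(scores: AxisScores, min_quality: float = 0.8) -> bool:
    # Only quality gates acceptance here; diversity and complexity are
    # controlled at generation time (taxonomy sampling, complexification).
    return scores.quality >= min_quality

ok = acceptable(AxisScores(quality=0.9, diversity=0.6, complexity=0.4))
```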
How Simula Works: Taxonomies, Meta-Prompts, and Dual Critics
Simula breaks down the generation process into four distinct, controllable steps, each targeting a specific data property.
The first step addresses global diversity using hierarchical taxonomies. Given a dataset description — say, ‘a dataset of cybersecurity threat intelligence questions’ — a multi-modal model (referred to as M3) is prompted to identify the primary factors of variation for that domain (e.g., attack type, threat actor, vulnerability class). Each factor is then expanded breadth-first into a hierarchical taxonomy tree. To reduce the risk of missing important subcategories, the system uses a Best-of-N proposal strategy combined with a critic refinement step, where the model proposes N candidate child nodes and then critiques them for completeness, soundness, and specificity. The resulting taxonomies function as structured sampling scaffolds — ensuring that when you draw 512,000 training examples, they genuinely cover the long tail of the domain rather than clustering around common modes.
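The expansion loop can be sketched as follows. This is a hedged illustration of the Best-of-N-plus-critic pattern, not Simula's implementation: `propose_children` and `critique` stand in for calls to the M3, and the toy catalog and scoring heuristic exist only to make the sketch runnable.

```python
def propose_children(node: str, seed: int) -> list[str]:
    """Stub for an M3 call proposing one candidate set of child nodes."""
    catalog = {
        "attack type": [["phishing", "ransomware", "DDoS"],
                        ["phishing", "SQL injection", "ransomware", "DDoS"]],
        "threat actor": [["nation-state", "hacktivist"],
                         ["nation-state", "hacktivist", "insider"]],
    }
    options = catalog.get(node, [[]])
    return options[seed % len(options)]

def critique(node: str, children: list[str]) -> float:
    """Stub critic scoring a candidate set for completeness and specificity."""
    return len(set(children))  # toy proxy: broader, deduplicated sets win

def expand_best_of_n(node: str, n: int = 4) -> list[str]:
    """Propose N candidate child sets; keep the one the critic prefers."""
    candidates = [propose_children(node, seed) for seed in range(n)]
    return max(candidates, key=lambda c: critique(node, c))

def build_taxonomy(roots: list[str], depth: int = 1) -> dict[str, list[str]]:
    """Breadth-first expansion: each factor of variation becomes a subtree."""
    taxonomy, frontier = {}, list(roots)
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            children = expand_best_of_n(node)
            taxonomy[node] = children
            next_frontier.extend(children)
        frontier = next_frontier
    return taxonomy

tax = build_taxonomy(["attack type", "threat actor"])
```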
The second step handles local diversity. Sampled combinations of taxonomy nodes — called ‘mixes’ — are passed to the M3 to generate ‘meta prompts.’ For example, a mix of {house cat, poem, travel enthusiast} becomes ‘Compose an exciting haiku about a house cat who goes on an adventure.’ To prevent mode collapse when many meta prompts are generated from the same node set, Simula generates multiple meta prompts simultaneously and sub-samples the required fraction, ensuring distinct instantiations rather than identical repetitions.
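The mix-sampling and sub-sampling pattern can be sketched as below. Everything here is an assumption made for illustration: `generate_meta_prompts` stands in for the M3 call, and the taxonomy entries, `keep_fraction` parameter, and prompt templates are invented for the example.

```python
import random

def generate_meta_prompts(mix: tuple[str, ...], k: int) -> list[str]:
    """Stub: ask the model for k distinct meta prompts for one mix."""
    return [f"[v{i}] Write about {', '.join(mix)}" for i in range(k)]

def sample_meta_prompts(taxonomies: dict, n_mixes: int, per_mix: int,
                        keep_fraction: float, rng: random.Random) -> list[str]:
    prompts = []
    for _ in range(n_mixes):
        # A "mix" draws one node from each taxonomy's leaf list.
        mix = tuple(rng.choice(nodes) for nodes in taxonomies.values())
        batch = generate_meta_prompts(mix, per_mix)
        # Generate many at once, then keep only a fraction — distinct
        # instantiations rather than near-identical repetitions.
        keep = max(1, int(per_mix * keep_fraction))
        prompts.extend(rng.sample(batch, keep))
    return prompts

taxonomies = {
    "subject": ["house cat", "threat actor"],
    "format": ["poem", "report"],
    "audience": ["travel enthusiast", "analyst"],
}
rng = random.Random(0)
prompts = sample_meta_prompts(taxonomies, n_mixes=3, per_mix=4,
                              keep_fraction=0.5, rng=rng)
```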
The third step is complexification. A user-configurable fraction, c, of meta prompts is passed through a complexification step, which prompts the M3 to increase the complexity of the generated meta prompts and outputs while maintaining all other requirements. This separates complexity control from coverage control — you can raise the difficulty ceiling without sacrificing breadth.
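Routing a configurable fraction c through a rewrite step might look like the following minimal sketch, where `complexify` stands in for the M3 rewrite call (the `[harder]` prefix is purely illustrative):

```python
import random

def complexify(prompt: str) -> str:
    """Stub for an M3 call that raises difficulty while keeping requirements."""
    return f"[harder] {prompt}"

def apply_complexification(prompts: list[str], c: float,
                           rng: random.Random) -> list[str]:
    # Pick a random subset of size c * len(prompts) to complexify;
    # the rest pass through unchanged, so coverage is unaffected.
    n_complex = int(len(prompts) * c)
    chosen = set(rng.sample(range(len(prompts)), n_complex))
    return [complexify(p) if i in chosen else p for i, p in enumerate(prompts)]

rng = random.Random(0)
out = apply_complexification([f"prompt {i}" for i in range(10)], c=0.3, rng=rng)
```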
The fourth step enhances quality through a ‘dual-critic’ approach. Rather than asking the model once whether a generated answer is correct, Simula independently queries the model for whether the answer is correct and whether it is incorrect. This dual-verification design mitigates sycophancy bias — the tendency of LLMs to agree with plausible-sounding outputs — and is particularly important for tasks with a defined notion of correctness, such as multiple-choice questions or math problems.
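The dual-query logic can be sketched as follows. This is a hedged illustration of the pattern, not Simula's code: `ask_model` is a stand-in for the teacher LLM with a toy ground truth, and the acceptance rule (keep only when both framings agree) is the part the sketch is meant to show.

```python
def ask_model(question: str, answer: str, framing: str) -> bool:
    """Stub verifier. A real system would prompt the teacher with `framing`."""
    truth = answer == "4"  # toy ground truth for illustration only
    return truth if framing == "is_correct" else not truth

def dual_critic_accepts(question: str, answer: str) -> bool:
    # Two independent queries with opposite framings: asking "is it correct?"
    # and "is it incorrect?" separately counteracts the model's tendency to
    # agree with whichever framing it is handed (sycophancy bias).
    says_correct = ask_model(question, answer, framing="is_correct")
    says_incorrect = ask_model(question, answer, framing="is_incorrect")
    return says_correct and not says_incorrect

kept = dual_critic_accepts("What is 2 + 2?", "4")
rejected = dual_critic_accepts("What is 2 + 2?", "5")
```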
What the Experiments Show
The research team tested Simula using Gemini 2.5 Flash (non-thinking) as the teacher model and Gemma 3 4B as the student model, running 10 iterations of LoRA fine-tuning with different seeds per configuration and reporting mean accuracy with 95% confidence intervals. They generated datasets of up to 512K data points across five domains: CTI-MCQ, a multiple-choice question dataset for assessing understanding of CTI standards, threats, and mitigation; CTI-RCM, an open-ended generation task requiring the model to produce a Common Weakness Enumeration (CWE) category from a Common Vulnerabilities and Exposures (CVE) description; LEXam, covering Swiss, EU, and international law examinations in English and German; GSM8k (grade-school math); and Global MMLU (Math, Computer Science, and Physics in English, Korean, and Nepali).
Across all datasets and data sizes, the full Simula system — combining global diversification, local diversification, complexification, and critiquing — consistently outperformed simpler baseline configurations. Notably, combining both Global and Local diversification was critical; either in isolation produced suboptimal results depending on dataset and scale.
The complexity results were particularly instructive. On GSM8k, the High Complexity split yielded a 10% accuracy gain over the Low Complexity split at 64K data items. But on LEXam, where the teacher model achieved only 57% accuracy, higher complexity data actually hurt performance — demonstrating that complex data is only beneficial when the teacher model is strong enough to generate reliable labels for it. The critic rejection rate for LEXam reached 61%, compared to just 2% for CTI-MCQ, 9% for CTI-RCM, and 9% for GSM8k, directly reflecting the teacher model’s weakness on that domain.
A separate and practically important finding is what the research team calls the Student-Teacher Gap effect on scaling laws. For CTI-RCM, student model performance saturated at around 128K data points, after bridging approximately 83% of the gap between the student’s starting accuracy (40%) and the teacher model’s performance (70%). GSM8k, by contrast, showed no such saturation because the student model’s peak performance (75%) remained sufficiently far from the teacher’s (88%).
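The gap-bridging arithmetic is worth making explicit, using the CTI-RCM numbers reported above: a 30-point gap (40% to 70%) bridged 83% of the way puts the saturated student at roughly 40 + 0.83 × 30 ≈ 65% accuracy.

```python
def gap_bridged(start: float, current: float, teacher: float) -> float:
    """Fraction of the student-teacher gap the student has closed."""
    return (current - start) / (teacher - start)

# CTI-RCM: student starts at 40%, teacher sits at 70%. Bridging 83% of the
# 30-point gap corresponds to saturating near 40 + 0.83 * 30 ≈ 65% accuracy.
saturated_accuracy = 0.40 + 0.83 * (0.70 - 0.40)
cti_rcm_fraction = gap_bridged(start=0.40, current=saturated_accuracy, teacher=0.70)
```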
Intrinsic Evaluation Gets a Rethink
Beyond generation, the research team introduces two new evaluation approaches. Taxonomic Coverage measures what fraction of taxonomy nodes at each level are represented in a dataset — a structured alternative to coarse embedding-based cosine distance metrics that fail to provide actionable insights. Calibrated Complexity Scoring assigns Elo ratings to individual data points by running batch-wise pairwise comparisons, a method the research team calls ‘calibrated attribute scoring,’ which proved to align well with human-annotated complexity labels on the MATH dataset.
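The Elo mechanics can be sketched with the standard rating-update formula. This is an illustration of the general pairwise-comparison idea, not the paper's exact procedure: the K-factor of 32, the initial rating of 1000, and the toy length-based judge are all assumptions; in the real system, the judge would be a model asked which of two data points is more complex.

```python
import itertools

def expected(r_a: float, r_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """One Elo update after a single pairwise comparison."""
    s_a = 1.0 if a_wins else 0.0
    e_a = expected(r_a, r_b)
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

def elo_scores(items, judge, rounds: int = 20) -> dict:
    """Run repeated batch-wise pairwise comparisons and return ratings."""
    ratings = {item: 1000.0 for item in items}
    for _ in range(rounds):
        for a, b in itertools.combinations(items, 2):
            ratings[a], ratings[b] = update(ratings[a], ratings[b], judge(a, b))
    return ratings

# Toy judge: longer text counts as "more complex" for this sketch.
items = ["2+2", "solve x^2-5x+6=0", "prove the fundamental theorem of calculus"]
scores = elo_scores(items, judge=lambda a, b: len(a) > len(b))
```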
One finding stands out: on a taxonomic coverage basis, real-world reference datasets almost always cover less of the target domain than Simula-generated variants, even when embedding-based diversity metrics tell the opposite story. This underscores the limitation of relying on cosine distance alone as a proxy for dataset quality.
Key Takeaways
Simula’s reasoning-first, seedless framework controls quality, diversity, and complexity as independent axes — enabling fine-grained synthetic dataset design without relying on manual prompts, evolutionary algorithms, or seed data from the target distribution.
Combining Global and Local diversification is critical: either component in isolation produces suboptimal results, but together they consistently improve downstream model performance across all tested datasets and data sizes.
Data complexity helps model performance in most domains, but can hurt when the teacher model is weak — on LEXam, where Gemini 2.5 Flash (non-thinking) achieved only 57% accuracy, the Low Complexity split outperformed the High Complexity split.
Real-world reference datasets almost always cover less of the target domain than Simula-generated variants on a taxonomic coverage basis, even when standard embedding-based cosine distance metrics suggest otherwise.
Data scaling laws are driven by data properties, not size alone — the full Simula system reached higher downstream performance with fewer samples compared to baseline approaches, making it more cost-effective across the full data lifecycle despite requiring up to 5x more inference calls per data point.
Check out the Paper and technical details.
