SynthForge combines diffusion models, Gaussian copulas, and LLM-powered intelligence
to generate production-grade synthetic tabular data from a 2,500-row sample.
Detect PII. Preserve correlations. Ship as a pip-installable library.
$ pip install synthforge
98.5% Quality Score · 6 Synthesizers · 5-Layer Evaluation · 100+ LLM Providers
The Approach
Five-stage pipeline, one line of code
Most synthetic data tools treat generation as a black box. SynthForge decomposes it into
five observable, configurable stages, each one improvable independently; a sketch of the staged flow follows the list below.
01 · Profile: Auto-detect schema types. An LLM infers column semantics, relationships, and business rules from names plus sample values.
02 · Detect: 3-layer PII detection: regex heuristics → Microsoft Presidio NER → LLM catches non-obvious patterns. MNPI flagging for financial data.
03 · Fit: Auto-select a synthesizer by data type and hardware. Reversible transforms handle nulls, outliers, and mixed types. Constraints baked in via CAG.
04 · Generate: Batch synthesis at configurable scale (1K–10M rows). PII columns auto-replaced with Faker. Min/max enforced from the original data.
05 · Evaluate: 5-layer quality gate: diagnostics; KS/TV/correlation/C2ST fidelity; TSTR ML utility; MIA privacy; LLM semantic validation.
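For orientation, here is how the staged flow might look when run step by step. Only SynthForge(...), fit_generate(...), and evaluate(...) appear in the documented API shown later; the per-stage method names below are assumptions used to illustrate the pipeline's shape, not confirmed library calls.

import pandas as pd
from synthforge import SynthForge

df = pd.read_csv("production_sample.csv")
forge = SynthForge(llm_provider="anthropic", llm_model="claude-sonnet-4-20250514")

profile = forge.profile(df)              # 01 Profile (hypothetical method name)
findings = forge.detect_pii(df)          # 02 Detect (hypothetical method name)
forge.fit(df)                            # 03 Fit (hypothetical method name)
synthetic = forge.generate(100_000)      # 04 Generate (hypothetical method name)
report = forge.evaluate(df, synthetic)   # 05 Evaluate (documented API, see Usage)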
Generation Engines
Six models, from seconds to state-of-the-art
An auto-strategy engine selects the optimal model based on your data characteristics
and hardware. CUDA is enforced for neural models, so there are no accidental 3-hour CPU training runs.
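As an illustration of that strategy (not SynthForge's actual selection code; select_synthesizer is a name invented for this sketch):

import pandas as pd
import torch  # assumed available for the CUDA check

def select_synthesizer(df: pd.DataFrame) -> str:
    # Mirror the rules described above: CPU-only hardware never triggers a
    # neural run; mostly-numerical data stays on the fast copula path.
    numeric_share = len(df.select_dtypes("number").columns) / len(df.columns)
    if not torch.cuda.is_available():
        return "gaussian_copula"
    if numeric_share > 0.9:
        return "gaussian_copula"
    return "tabsyn"  # mixed types plus a GPU: use the strongest model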
Gaussian Copula (Default · CPU · Seconds)
Fits marginal distributions per column plus a Gaussian correlation structure. Trains in seconds, no GPU needed. Best for numerical data (see the sketch after this list).

CTGAN (NeurIPS 2019 · CUDA · Minutes)
Conditional GAN with mode-specific normalization and training-by-sampling. Best for imbalanced categoricals and high-cardinality columns.

TVAE (NeurIPS 2019 · CUDA · Minutes)
Tabular variational autoencoder. More stable than CTGAN, with fewer hyperparameters. A strong default for mixed-type data.

TabDDPM (ICML 2023 · CUDA · ~10 min)
Denoising diffusion with dual noise processes: Gaussian for continuous columns, multinomial for categorical. A decisive quality leap over GANs.

TabSyn (ICLR 2024 Oral · CUDA · SOTA)
Latent diffusion: a VAE encodes mixed types into a unified latent space, and score-based diffusion models that space. 86% better marginals and 93% faster sampling than TabDDPM.

GReaT (ICLR 2023 · LLM · Hours)
Fine-tunes GPT-2 on text-serialized rows, leveraging pretrained semantic knowledge. Supports conditional generation without retraining.
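A compact version of the default engine's idea, for numeric columns only (illustrative, not SynthForge's implementation): map each column to normal scores through its empirical CDF, estimate a correlation matrix, sample correlated normals, and map back through each column's empirical quantile function.

import numpy as np
import pandas as pd
from scipy import stats

def copula_sample(real: pd.DataFrame, n: int) -> pd.DataFrame:
    # Normal scores: rank -> uniform in (0, 1) -> standard-normal quantile.
    u = real.rank(method="average") / (len(real) + 1)
    z = stats.norm.ppf(u)
    corr = np.corrcoef(z, rowvar=False)
    # Correlated normal draws, mapped back through each empirical inverse CDF.
    draws = np.random.multivariate_normal(np.zeros(len(real.columns)), corr, size=n)
    uniforms = stats.norm.cdf(draws)
    return pd.DataFrame({c: np.quantile(real[c], uniforms[:, i])
                         for i, c in enumerate(real.columns)})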
The Differentiator
LLM intelligence at every stage
No existing library systematically uses LLMs for schema understanding, privacy detection,
and semantic validation. SynthForge does, through any provider (Claude, GPT, Ollama, vLLM) via LiteLLM.
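A minimal sketch of a provider-agnostic call through LiteLLM; the prompt and payload here are illustrative, not SynthForge's internals.

from litellm import completion

response = completion(
    model="anthropic/claude-sonnet-4-20250514",  # or "gpt-4o", "ollama/llama3", ...
    messages=[{
        "role": "user",
        "content": "Infer the semantic type of a column named 'fname' "
                   "with sample values ['Ana', 'Luis', 'Mei'].",
    }],
)
print(response.choices[0].message.content)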
Schema Enrichment
Infers that "fname" means first_name, "amt" means currency, and that city-state-zip form a hierarchical group. Maps columns to Faker providers for realistic replacement.
PII Detection
Layer 1: regex on column names. Layer 2: Presidio NER on values. Layer 3: an LLM catches non-obvious PII, such as a column named "cust_ref" containing SSNs, or quasi-identifiers that together re-identify individuals. A sketch of the first two layers follows these cards.
MNPI Detection
Flags material non-public information in financial data: unreleased earnings, M&A deal values, strategic plans. Classifies risk level (low/medium/high/critical) per column.
Semantic Validation
LLM-as-judge pattern: reviews batches of synthetic rows for impossible combinations, such as a 5-year-old with a PhD, Japanese names with Mexican zip codes, or shipping dates before order dates.
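A sketch of the first two PII layers; the patterns and thresholds are illustrative, not SynthForge's actual rules, and the LLM layer is omitted.

import re
import pandas as pd
from presidio_analyzer import AnalyzerEngine  # requires a spaCy model installed

NAME_PATTERNS = re.compile(r"ssn|email|phone|fname|lname|dob", re.IGNORECASE)

def flag_pii_columns(df: pd.DataFrame, sample_size: int = 20) -> dict:
    analyzer = AnalyzerEngine()
    flags = {}
    for col in df.columns:
        if NAME_PATTERNS.search(col):          # Layer 1: column-name regex
            flags[col] = "name_pattern"
            continue
        sample = df[col].dropna().astype(str).head(sample_size)
        hits = sum(bool(analyzer.analyze(text=v, language="en")) for v in sample)
        if hits > len(sample) // 2:            # Layer 2: Presidio NER on values
            flags[col] = "presidio"
    return flags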
Quality Benchmarks
Evaluation is first-class, not an afterthought
Every generate() call can return a quality report. Built-in pass/fail thresholds
configurable per use case. MIA-based privacy replaces the discredited DCR metric.
Dataset                   Score   KS Compl.   Correlation   C2ST    TV Compl.
Sensor (numerical)        98.5%   0.992       0.987         0.973   —
HR (mixed + PII)          78.3%   0.967       0.984         0.856   0.739
Financial (complex)       75.7%   0.496       0.983         —       0.811
E-commerce (categorical)  73.2%   0.755       0.865         —       0.682
All scores above are from the Gaussian Copula baseline on CPU; quality increases substantially with TabSyn or TabDDPM on GPU.
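For reference, the KS Compl. column corresponds to the per-column Kolmogorov-Smirnov complement (1 minus the KS statistic) averaged over numeric columns. A sketch of that metric, not necessarily SynthForge's exact implementation:

import pandas as pd
from scipy.stats import ks_2samp

def ks_complement(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    cols = real.select_dtypes("number").columns
    scores = [1 - ks_2samp(real[c].dropna(), synthetic[c].dropna()).statistic
              for c in cols]
    return sum(scores) / len(scores)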
Usage
Three lines to production
import pandas as pd
from synthforge import SynthForge
# Load a 2,500-row sample from Redshift / any warehouse
df = pd.read_csv("production_sample.csv")
# One line: profile → fit → generate
forge = SynthForge(llm_provider="anthropic", llm_model="claude-sonnet-4-20250514")
synthetic = forge.fit_generate(df, num_rows=100_000)
# Quality report with pass/fail gates
report = forge.evaluate(df, synthetic)
print(report.summary())
# → Overall: 98.34% PASS
Brief
The elevator pitch
Copy-ready description
SynthForge is a pip-installable Python library for generating high-fidelity synthetic tabular data from small production samples. It combines six generation backends — from fast Gaussian Copula (seconds, CPU) to state-of-the-art TabSyn latent diffusion (ICLR 2024) and TabDDPM denoising diffusion (ICML 2023) — with an LLM-augmented pipeline that automatically detects PII and MNPI, infers column semantics, and validates generated data for logical consistency. The library auto-selects the optimal synthesizer based on data characteristics (numerical, categorical, time-series, mixed) and hardware (CUDA enforced for neural models), supports configurable scale from thousands to millions of rows via batch generation, and ships with a 5-layer evaluation pipeline covering statistical fidelity, ML utility, and privacy metrics. LLM integration is provider-agnostic via LiteLLM, supporting Claude, OpenAI, Ollama, and 100+ providers.