u/DaxliaLabs

A genome-scale model of E. coli (iJO1366) has ~2,583 reactions. Each can be knocked out, overexpressed, or downregulated — that's ~7,500 single interventions, and double combinations run into the millions. Running two-step FBA on all of them isn't feasible.
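To make the scale concrete, here is a back-of-the-envelope count that takes the quoted reaction count at face value and assumes every reaction is eligible for all three intervention types. The exact eligible set is not stated in the post, so the numbers are rough upper bounds and the variable names are illustrative.

```python
from math import comb

n_reactions = 2583   # approximate reaction count quoted above
n_modes = 3          # knockout, overexpression, downregulation

# One intervention on one reaction.
singles = n_reactions * n_modes

# Two distinct reactions, each with its own intervention type.
doubles = comb(n_reactions, 2) * n_modes ** 2

print(f"single interventions: {singles:,}")
print(f"double interventions: {doubles:,}")
```

Under these assumptions the single count lands in the same range as the ~7,500 quoted above, and the double count is in the tens of millions.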

RéseauFlux trains a gradient boosting model to rank interventions from network features alone, then validates with FBA only the top candidates.

How it works

18 features are extracted per reaction from the model structure: FVA flux range, betweenness centrality, shortest path distances to both target and biomass reactions, subsystem one-hot encoding, flux entropy, stoichiometric relationship to the target metabolite, and derived ratios.
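As an illustration of one of these feature families, here is a minimal sketch of the shortest-path distance features, computed with a plain breadth-first search. The toy network, its reaction names, and the `distances_to` helper are all hypothetical; the real pipeline presumably builds its graph from the iJO1366 stoichiometry.

```python
from collections import deque

# Hypothetical toy network: reaction -> reactions consuming one of its products.
toy_graph = {
    "uptake": ["r1"],
    "r1": ["r2"],
    "r2": ["r3", "biomass"],
    "r3": ["r4"],
    "r4": ["target_export"],
    "biomass": [],
    "target_export": [],
}

def distances_to(graph, target):
    """BFS hop count from every reaction to `target` along directed edges."""
    # Reverse the edges so a single BFS from the target covers all sources.
    reverse = {node: [] for node in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse[dst].append(src)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for prev in reverse[node]:
            if prev not in dist:
                dist[prev] = dist[node] + 1
                queue.append(prev)
    return dist

to_target = distances_to(toy_graph, "target_export")
to_biomass = distances_to(toy_graph, "biomass")

# Two features per reaction; None marks "no directed path".
features = {
    rxn: (to_target.get(rxn), to_biomass.get(rxn)) for rxn in toy_graph
}
print(features["r2"])
```

How unreachable reactions are encoded (a sentinel value, a large constant, and so on) is a design choice the post does not specify.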

A two-step FBA (maximize growth → fix at 10% of max → maximize target flux) runs on ~300 sampled interventions to generate training labels. Labels are rank-normalized to [0, 1] to avoid the plateau problem where hundreds of KOs return the same ceiling value.
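The rank normalization step can be sketched as follows. This is a stand-alone version that assumes tied values share their average rank; the post does not say how ties are handled, and the flux values below are made up.

```python
def rank_normalize(values):
    """Map values to [0, 1] by rank; tied values share their average rank."""
    n = len(values)
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2  # zero-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [r / (n - 1) for r in ranks]

# Made-up target fluxes: three interventions hit the same ceiling value.
raw_flux = [0.0, 5.0, 5.0, 5.0, 0.2, 1.5]
labels = rank_normalize(raw_flux)
print(labels)
```

The labels depend only on ordering, so the spacing between non-tied interventions no longer reflects how far the raw fluxes sit from the ceiling.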

Training uses 5-fold GroupKFold CV grouped by reaction ID, so all three intervention types (KO, OE, downregulation) of the same reaction always land in the same fold. This tests genuine generalization: can the model predict outcomes for reactions it has never seen in any form?
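A minimal sketch of the grouping idea, using scikit-learn's `GroupKFold` on a hypothetical table of 10 reactions with three intervention types each. The feature rows and labels are placeholders; only the group key matters here.

```python
from sklearn.model_selection import GroupKFold

# Hypothetical sample table: 10 reactions x 3 intervention types.
reaction_ids = [f"R{i}" for i in range(10)]
kinds = ["KO", "OE", "DOWN"]
samples = [(rxn, kind) for rxn in reaction_ids for kind in kinds]

X = [[float(i)] for i in range(len(samples))]  # placeholder feature rows
y = [0.0] * len(samples)                       # placeholder labels
groups = [rxn for rxn, _ in samples]           # group key = reaction ID

cv = GroupKFold(n_splits=5)
leaks = 0
for train_idx, test_idx in cv.split(X, y, groups):
    train_rxns = {groups[i] for i in train_idx}
    test_rxns = {groups[i] for i in test_idx}
    # Count reactions that appear on both sides of the split.
    leaks += len(train_rxns & test_rxns)

print(leaks)
```

With a plain `KFold`, the KO variant of a reaction could land in training while its OE variant lands in the test fold, which would let the model score well by memorizing per-reaction features.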

After training, GBR scores all ~2,283 unenumerated reactions. Only the top candidates get full FBA validation.

Results (aerobic succinate production in E. coli)

| Metric | Value |
|---|---|
| GBR CV Spearman ρ | 0.557 |
| Holdout ρ (unseen reaction groups) | 0.602 |
| Top-K precision (CV) | 0.472 |
| Best single intervention | ATPS4rpp KO → 16.38 mmol/gDW/h (~100× WT) |
| Best double (ML-guided) | ATPM KO + O2tex KO → 16.80 mmol/gDW/h |
| Literature rank recall | 2/3 known KOs in top 10 |
| Multi-target ρ (L-malate) | 0.445 |
| Multi-target ρ (acetate) | 0.491 |

ATPS4rpp KO at #1 is consistent with prior computational predictions — the ATP synthase reaction is a known bottleneck for succinate overproduction under aerobic conditions.
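For readers unfamiliar with the two ranking metrics reported above, here is a small stand-alone sketch. Spearman ρ is computed as the Pearson correlation of ranks. The top-K precision definition used here (overlap between the predicted top K and the actual top K) is an assumption, since the post does not define it, and the scores are made up.

```python
from math import sqrt

def average_ranks(values):
    """Zero-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2
        i = j + 1
    return ranks

def spearman_rho(a, b):
    """Pearson correlation of the ranks of a and b."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / sqrt(var_a * var_b)

def top_k_precision(predicted, actual, k):
    """Share of the k highest-predicted items that are also in the actual top k."""
    def top(values):
        return set(sorted(range(len(values)), key=lambda i: -values[i])[:k])
    return len(top(predicted) & top(actual)) / k

# Made-up model scores and FBA-derived labels for five interventions.
predicted = [0.9, 0.1, 0.8, 0.4, 0.3]
actual = [0.7, 0.2, 0.9, 0.1, 0.5]
rho = spearman_rho(predicted, actual)
prec = top_k_precision(predicted, actual, k=2)
print(round(rho, 3), prec)
```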

Ablation study:

| Model | CV Spearman ρ |
|---|---|
| GBR alone | 0.557 |
| GBR + pairwise ranker | 0.358 |
| Pairwise ranker alone | 0.159 |

Adding the pairwise ranker to GBR lowered CV ρ from 0.557 to 0.358, so it was removed from the final pipeline.

What else is in the pipeline

  • Double KO search: top 25 ML-predicted pairs validated with FBA
  • Triple KO candidates via evolutionary search
  • Monte Carlo robustness: ±20% Gaussian noise on uptake rates
  • Flux Control Coefficient analysis for bottleneck identification
  • Gene-level knockout predictions
  • Pareto front (succinate flux vs. biomass growth)
  • Multi-target generalization to L-malate and acetate without retraining
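As one example from the list above, the Pareto front step can be sketched as a simple non-dominated filter over (target flux, growth rate) pairs. The `pareto_front` helper and the design values are hypothetical; this is a quadratic-time illustration, not necessarily how the pipeline computes it.

```python
def pareto_front(points):
    """Return the points not dominated by any other (both objectives maximized)."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# Hypothetical designs as (target flux in mmol/gDW/h, growth rate in 1/h).
designs = [
    (16.0, 0.10),
    (12.0, 0.40),
    (8.0, 0.40),   # dominated by (12.0, 0.40)
    (5.0, 0.80),
    (4.0, 0.70),   # dominated by (5.0, 0.80)
    (0.2, 0.95),
]
front = pareto_front(designs)
print(front)
```

Each surviving point is a design where more target flux can only be bought by giving up growth, which is the trade-off the front is meant to expose.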

All outputs (12+ CSVs + a 12-panel summary figure) are packaged into a timestamped ZIP.
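The packaging step might look roughly like the sketch below, which uses only the standard library. The `package_outputs` function, the `results_` prefix, the file names, and the timestamp format are all assumptions for illustration, not the pipeline's actual naming.

```python
import tempfile
import zipfile
from datetime import datetime
from pathlib import Path

def package_outputs(output_dir, dest_dir, prefix="results"):
    """Zip every CSV and PNG in output_dir into <prefix>_<timestamp>.zip."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    archive = Path(dest_dir) / f"{prefix}_{stamp}.zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(Path(output_dir).iterdir()):
            if path.suffix in {".csv", ".png"}:
                zf.write(path, arcname=path.name)
    return archive

# Demo on throwaway files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "out"
    out.mkdir()
    (out / "single_ko.csv").write_text("reaction,flux\nR1,1.0\n")
    (out / "summary.png").write_bytes(b"")
    (out / "notes.txt").write_text("not packaged")
    archive = package_outputs(out, tmp)
    archive_name = archive.name
    with zipfile.ZipFile(archive) as zf:
        packaged = sorted(zf.namelist())

print(archive_name, packaged)
```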

Running it

pip install cobra highspy scikit-learn networkx numpy pandas matplotlib scipy
python metabolic_ml_pipeline.py

iJO1366 downloads automatically via COBRApy. Compute: Kaggle free-tier (2× NVIDIA Tesla T4).

GitHub: https://github.com/Daxlia/ReseauFlux

DOI (Zenodo): https://doi.org/10.5281/zenodo.19984812

Medium: https://medium.com/@daxlia.work/réseauflux-an-ml-pipeline-for-genome-scale-metabolic-engineering-847e35d7a966

Post generated with Claude AI by Anthropic.

u/DaxliaLabs — 8 days ago