r/mltraders

Been running my algo bot for a while and this is the total P&L ($58K). Any suggestions?

Even though there are more losses, at the end of the day I'm in the green. I'm thinking of reducing the losses by adding a new strategy. Any suggestions from your experience?

u/helloimhello6688 — 4 days ago

I'm a discretionary swing trader, but I want to build an algorithm for backtesting

I have 200 trades backtested on a discretionary basis, but I want to build an algorithm so I can backtest more trades, across more data and pairs, without emotions or human error. I don't know if I should do it or stay discretionary. I have a positive edge on my strategy and 3 years of trading experience, but I think I'm going to start learning to code so I can get more metrics and, in the future, maybe automate the strategy.

Can someone give me advice on the subject, specifically the Python side? I have very low programming skills and would be starting practically from zero. But I think it's worth it, because at the end of the day a good strategy is based on data and metrics, and an algorithm that removes discretion gives me more of both.

u/TenaxFi — 14 hours ago

Why my backtests kept lying to me (and what I did about it)

I've spent the last year building a live algorithmic trading system from scratch on Alpaca — momentum rotation on ETFs, RSI mean-reversion swing trades, proper risk management (1% per trade, ATR-based stops, daily circuit breaker, drawdown kill switch).

The thing that humbled me most wasn't the coding. It was running what looked like a genuinely strong backtest, going live, and watching it fall apart within weeks.

After digging into why, I realised almost everything I'd read about backtesting was quietly skipping the hard parts:

  • In-sample optimisation is basically cheating. If you tune your RSI period and stop-loss on the same data you're testing on, you're not finding a strategy — you're finding the parameters that fit that specific historical period. It will not repeat.
  • Most retail backtesting tools don't model slippage honestly. Assuming you fill at the close price on a thinly traded ETF is fantasy.
  • Survivorship bias is invisible until you look for it. If your universe is "current S&P 500 constituents" you're testing on a list of companies that already survived.

What actually helped was walk-forward testing — train on one window, test on the next, roll forward, repeat. It produces worse-looking results but the live performance gap shrinks dramatically.
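
The mechanics are simple once you see them. A minimal sketch of the rolling split in generic Python (not my actual harness; window sizes are placeholders):

```python
def walk_forward_windows(n_bars, train_size, test_size):
    """Yield (train, test) index slices that roll forward through time."""
    start = 0
    while start + train_size + test_size <= n_bars:
        yield (slice(start, start + train_size),
               slice(start + train_size, start + train_size + test_size))
        start += test_size  # advance by exactly one test window

# Tune parameters on `train` only, evaluate untouched on `test`,
# then stitch the out-of-sample test segments into one equity curve.
for train, test in walk_forward_windows(n_bars=5000, train_size=1000, test_size=250):
    print(f"train {train.start}:{train.stop}  test {test.start}:{test.stop}")
```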

Curious how others here handle this. Are you using QuantConnect, TradingView Pine, something custom? And do your backtests actually predict your live performance or is there always a big gap?

u/TopTimPlayz — 1 day ago

https://preview.redd.it/4y0nfw9x6nzg1.png?width=1184&format=png&auto=webp&s=325841b9bb88109e864895060f1f5fd567fb4ef5

I've been building an evolutionary trading system for the past 119 days. The idea is simple: instead of hand-crafting strategies, let a genetic algorithm discover them. 3.2 billion iterations later, I have some real data to share.

**How it works (briefly):**

Each bot is a set of genes (entry/exit rules, position sizing, risk parameters). Every generation, the top 50 performers reproduce and mutate. The rest get replaced. Rinse and repeat across millions of ticks of live BTC/USDT data.
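
A stripped-down sketch of the generation step (the gene encoding and mutation scale here are simplified placeholders, not the production setup):

```python
import random

def mutate(value, rate=0.1):
    """Occasionally jitter a numeric gene; leave everything else unchanged."""
    if isinstance(value, (int, float)) and random.random() < rate:
        return value * random.uniform(0.8, 1.2)
    return value

def next_generation(population, fitness, keep=50):
    """Rank bots by fitness, keep the top `keep`, refill with mutated copies."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:keep]
    children = [{gene: mutate(v) for gene, v in random.choice(survivors).items()}
                for _ in range(len(population) - keep)]
    return survivors + children
```

Here each bot is just a dict of genes; the fitness call is where all the compute goes, since each bot has to be replayed over the tick stream.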

I'm running 9 parallel evolution sets — 4 spot configurations and 5 futures market-making configurations — each with different fee tiers and entry/exit styles. They all evolve independently from $100 starting capital.

**What the numbers actually look like right now:**

*Spot bots (4 sets):*

- Top performers consistently at $102.33–$102.46 equity (from $100)

- Win rates climbed from ~50% to 72%+ in the strongest sets

- Near-zero drawdown on all spot sets (0.06%–0.67% max)

- Conservative, consistent — what you'd want from a spot strategy

*Futures market-making bots (5 sets, 10x leverage):*

- Top individual performer: **$10,817 from $100** (+10,717%, medium_high)

- Best set average: **$211.65/bot** (low_fee, Gen7)

- **Every single futures set flipped from negative to positive between Gen6 and Gen7** — collective PnL went from -$6.3M to +$9.0M in one generation

- ~99% max drawdown still exists — this is the open problem I'm working on

**The most interesting discovery (to me):**

Every single spot set converged to limit orders — regardless of which entry/exit strategy the scenario was configured with. The bots evolved toward limit orders even when we started them with market orders. That wasn't intended by the setup, but the algorithm found something consistent across all 4 independent runs. I'm still figuring out whether this is a simulation artifact or a genuine market insight.

**What happened between Gen6 and Gen7 (the $15M swing):**

This is the data point I find most encouraging. On May 5, Gen6 futures bots were getting crushed — every set was showing -$1.2M to -$1.3M PnL. Twenty-four hours later, Gen7 had completely flipped the script:

| Set | Gen6 PnL | Gen7 PnL | Swing |
|:----|:--------:|:--------:|:-----:|
| low_fee | -$1.29M | +$2.37M | +$3.66M |
| medium_low | -$1.26M | +$2.26M | +$3.52M |
| medium_high | -$1.25M | +$1.54M | +$2.79M |
| high_fee | -$1.25M | +$1.02M | +$2.26M |
| medium | -$1.28M | +$1.76M | +$3.04M |

The gene pool found something in Gen7 that Gen6 couldn't. Same data. Same parameters. Different selection outcome. It tells me the system is genuinely exploring the solution space, not just getting lucky once.

**What we validated with a 50-hour historical replay:**

We took the top 50 DNA from each set and ran them through 302,143 ticks of collected market data (roughly 50.5 hours). The same strategies that made $1 in a 1-day evaluation window made $7,753 across the full replay. The longer window gave dramatically different — and better — results.

This tells me the 1-day evaluation window we're using for evolution is noisy. The bots are better than their daily scores suggest.
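
One idea I'm considering (a sketch only, not what's running today) is a blended multi-window score, so a single noisy day can't dominate fitness:

```python
def multi_window_fitness(bot, evaluate, windows_hours=(24, 168, 720),
                         weights=(0.2, 0.3, 0.5)):
    """Weighted blend of short-, medium-, and long-window scores.

    `evaluate(bot, hours)` is a placeholder that would return a
    risk-adjusted score from a replay over the last `hours` of ticks."""
    return sum(w * evaluate(bot, h) for w, h in zip(weights, windows_hours))
```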

**What's still broken:**

- Futures bots consistently hit 99% drawdown before recovering. The fitness function doesn't penalize risk enough (see the sketch after this list).

- Entry/exit style genes override the scenario configuration — the bots keep "escaping" toward limit orders regardless of what they're assigned.

- Limit→Limit spot set is still 4 generations behind the others (it started late, still converging).

- Gen-to-gen performance is volatile on futures — a great Gen can follow a terrible Gen with no obvious trigger.
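
For the drawdown point above, the obvious fitness-shaping fix looks something like this (a sketch with the exponent as a free knob; I haven't validated it):

```python
def risk_adjusted_fitness(final_equity, max_drawdown, start_equity=100.0, penalty=2.0):
    """Discount raw return by drawdown: at 99% drawdown the multiplier is
    (1 - 0.99) ** penalty, which all but zeroes the score."""
    raw_return = final_equity / start_equity - 1.0
    return raw_return * (1.0 - max_drawdown) ** penalty

# The $10,817 top performer with a 99% drawdown: 107.17 * 0.0001 ≈ 0.011
print(risk_adjusted_fitness(10_817, 0.99))
```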

**What I'd love feedback on:**

- Has anyone experimented with multi-window fitness functions (short-term + long-term combined)?

- How do you handle the simulation artifact vs. actual insight problem with GA-discovered strategies?

- The drawdown problem on leveraged bots — penalize harder in fitness, or let evolution solve it on its own?

**Full live stats:** evotrade.ca (updates every 5 minutes with real daemon state)

Happy to answer questions about the architecture, the GA setup, or specific gene configurations. I'm still learning what works and I'm genuinely curious what others have seen with similar approaches.

u/Accomplished-Rip9652 — 7 days ago

Data ingestion and avoiding lookahead bias are massive headaches, so I built an open-source CLI agent to automate my backtesting setup.

It takes a plain-English strategy idea, generates validated Python using your own LLM key, and runs a historical backtest.

I just added Binance support today.

My biggest challenge right now is the automated safety checks: the agent currently scans the AST for lookahead flaws before executing.
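
For a flavor of the approach, a stripped-down version of one such check (the real rule set is broader; this one only catches negative `shift` calls, a classic lookahead leak):

```python
import ast

class LookaheadScanner(ast.NodeVisitor):
    """Flag calls like df["close"].shift(-1), which pull future bars into a feature."""

    def __init__(self):
        self.flags = []

    def visit_Call(self, node):
        if isinstance(node.func, ast.Attribute) and node.func.attr == "shift":
            for arg in node.args:
                if isinstance(arg, ast.UnaryOp) and isinstance(arg.op, ast.USub):
                    self.flags.append(f"negative shift at line {node.lineno}")
        self.generic_visit(node)

scanner = LookaheadScanner()
scanner.visit(ast.parse('df["signal"] = df["close"].shift(-1) > df["close"]'))
print(scanner.flags)  # ['negative shift at line 1']
```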

The tool is free and open source locally at finnyai.tech, with an optional $10/mo tier for managed hosting.

If anyone here builds automated validation for strategy code, how do you handle edge cases and LLM data hallucinations?

u/Awkward_Weather5721 — 12 days ago

Is anyone getting big into agentic feature/model experimentation? Automating these pipelines is unlocking whole new worlds.

Been building an autonomous energy-demand forecasting research harness and curious if anyone here has gone deep on agentic/automated feature experimentation.

Current setup:
- NSW electricity demand forecasting
- weather + historical demand features
- rolling walk-forward validation
- Modal running large parallel experiment sweeps
- leaderboard + automatic scoring against fixed baselines

Right now the system is good at:
- model/config sweeps
- backtesting
- evaluation
- calibration

But I’m now moving toward automated feature generation/proposal.

The rough idea:
- LLM proposes feature sets/interactions/lags/transforms
- deterministic harness builds + evaluates them
- only improvements get promoted into the leaderboard
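
In code, the loop I have in mind is roughly this (all names are placeholders for harness pieces):

```python
def research_step(llm_propose, build_features, evaluate, best_score):
    """One iteration: the LLM proposes a feature spec, the deterministic
    harness builds and scores it, and only improvements get promoted."""
    spec = llm_propose()         # e.g. {"name": "temp_x_humidity", "lags": [1, 24]}
    X = build_features(spec)     # deterministic: same spec + data => same features
    score = evaluate(X)          # fixed rolling walk-forward protocol
    if score > best_score:
        return spec, score       # promote to the leaderboard
    return None, best_score      # discard and keep searching
```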

Examples:
- temp × humidity interactions
- lag structures
- rolling weather anomalies
- calendar effects
- weather regime features
- demand ramp features

I’m trying to avoid:
- leakage
- overfitting the leaderboard
- combinatorial garbage feature spam
- “LLM generated alpha soup”

Curious if anyone here has:
- done autonomous feature research seriously
- used agents for forecasting/model discovery
- built good constraints/DSLs around feature generation
- thoughts on how much value is actually there vs brute force + human intuition

Feels like forecasting is unusually well-suited to autonomous experimentation because the scoring loop is so clean.

u/jajohn99 — 5 days ago

For the last couple of months I have been tinkering with an ML model that predicts certain (relatively rare) events in BTC price movements. Recently, I have gotten results that are sometimes good and sometimes terrible. I have a few ideas for experiments that could improve performance, but I don't really understand the underlying cause of the problem. Hopefully someone has had a similar experience and can give me some tips.

More details:

I am mostly using 1-second granularity data on BTC prices, trades, and some other metrics.

As a validation scheme, I am using rolling windows for now, with a block of 500,000 rows for training and 86,400 rows (one day at 1-second granularity) for validation, mirroring actual live use. The training size was chosen based on some small experiments with autocorrelation (nothing sophisticated).

Currently, I am evaluating my feature selection and model-building process as a whole, not a particular model or fixed feature set. For this I plan to use around 10 to 20 folds. Below, I am showing 4 folds that illustrate what is going right and wrong. The validation data for each fold ends at 23:59:59 on 2026-04-28, 2026-02-28, 2025-11-28, and 2025-07-28. The month offsets are a bit arbitrary but lean toward more recent data: [0, 2, 5, 9].

Based on early experiments using other data (not the validation folds), I have found that embedded feature selection using only training data can work well when combined with a large number of candidate features. The selection process sometimes finds features with real predictive power; other times the model cannot beat 40% precision.

For now I am using XGB as a classifier with mostly basic parameters: I only quickly tuned the max_depth on some other data apart from the validation folds and set it to 10. The XGB predictions are also ensembled across 30 seeds to stabilize the PNL, as I found it was unstable using just one random seed.
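
The ensembling itself is simple; a sketch of what I mean (the subsampling parameters here are illustrative, since the seed only matters when the booster has a stochastic component):

```python
import numpy as np
from xgboost import XGBClassifier

def seed_ensemble_proba(X_train, y_train, X_val, n_seeds=30, max_depth=10):
    """Average predicted probabilities over models that differ only in seed."""
    probas = []
    for seed in range(n_seeds):
        model = XGBClassifier(max_depth=max_depth, subsample=0.8,
                              colsample_bytree=0.8, random_state=seed)
        model.fit(X_train, y_train)
        probas.append(model.predict_proba(X_val)[:, 1])
    return np.mean(probas, axis=0)

# Trades are then taken only where the averaged probability clears the threshold:
# signals = seed_ensemble_proba(X_train, y_train, X_val) > 0.8
```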

The chosen feature sets (selected using only the recent training data) and models are evaluated on the validation fold with a fixed fee model. The simulated trades don't use any position sizing yet, just a fixed amount per trade ($150), which is why results can be deeply negative. When it works, positions often get opened in quick succession (up to 20 concurrent positions).

Here's a snapshot of performance at a prediction threshold of 0.8 on the out-of-sample, unseen validation folds (one row per fold):

| threshold | n | n_tp | n_fp | precision | edge_per_trade | total_net_pnl |
|----------:|--:|-----:|-----:|----------:|---------------:|--------------:|
| 0.8 | 98 | 70 | 28 | 0.714286 | 22.779897 | 2232.42992 |
| 0.8 | 597 | 192 | 405 | 0.321608 | -39.229474 | -23419.995954 |
| 0.8 | 558 | 217 | 341 | 0.388889 | -15.50954 | -8654.323338 |
| 0.8 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |

Note: using a baseline model without feature engineering, the first fold's PNL is negative. Performance has also been positive in an experiment using similar data but ending on the 20th of April.

Per fold plots:

https://preview.redd.it/vzwy8tt7fhyg1.png?width=1089&format=png&auto=webp&s=163236ab7017b5b0fa24fc8e4c76ee1a20b48f4f

https://preview.redd.it/iapja7l8fhyg1.png?width=1089&format=png&auto=webp&s=832f233a10ecf0dfad87c5cc0d6305ee1a18c9d6

https://preview.redd.it/kn8lhl89fhyg1.png?width=1088&format=png&auto=webp&s=d461b2e9b30cdbcc062c99b2a84a11e6543e2615

https://preview.redd.it/xk4ehau9fhyg1.png?width=1089&format=png&auto=webp&s=7afe1b88b9ec5a42c509c815ef70ac0578f2fd61

https://preview.redd.it/yyi7te2qehyg1.png?width=989&format=png&auto=webp&s=d39cefc9948cb5225236427c624030d9f3edb173

Some ideas for what I could do without knowing the core underlying problem:

- Regime or per-trade filter

- Use more data for training

- Use feature stability when selecting features

What should I consider doing next?

Thanks in advance.

u/Apprehensive_Fox8212 — 13 days ago