u/Apprehensive_Fox8212

For the last couple of months I have been tinkering with an ML model that predicts certain (relatively rare) events in BTC price movements. Recently, I got results that are sometimes good and sometimes terrible. I have a few ideas on what experiments could improve performance, but I don't really understand the underlying cause of the problem. Hopefully someone has had a similar experience and can give me some tips.

More details:

I am using mostly 1-second granularity data of prices, trades, and some other metrics of BTC.

As a validation scheme, I am using rolling windows for now, with a block of 500,000 rows as training and 86,400 rows (one day at 1-second granularity) as validation, mirroring actual live use. The training size was chosen based on some small experiments with autocorrelation (nothing sophisticated).
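For concreteness, here is a minimal sketch of how that rolling-window fold construction could look. The block sizes match the numbers above; the total row count, number of folds, and the assumption that rows are ordered oldest-first are illustrative.

```python
TRAIN_ROWS = 500_000   # training block, as in the post
VAL_ROWS = 86_400      # one day of 1-second rows as validation

def rolling_folds(n_rows, n_folds, step):
    """Yield (train_slice, val_slice) index pairs, newest fold first."""
    folds = []
    end = n_rows
    for _ in range(n_folds):
        val_start = end - VAL_ROWS
        train_start = val_start - TRAIN_ROWS
        if train_start < 0:   # not enough history left for a full fold
            break
        folds.append((slice(train_start, val_start), slice(val_start, end)))
        end -= step           # move the whole window back by `step` rows
    return folds

folds = rolling_folds(n_rows=2_000_000, n_folds=3, step=VAL_ROWS)
```

With `step=VAL_ROWS` the validation days are back-to-back; larger steps would reproduce the more spread-out month offsets used for the folds below.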

Currently, I am evaluating my feature selection and model-building process as a whole, not a particular model or fixed feature set. For this I plan to use around 10 to 20 folds. In the following, I am showing 4 folds that illustrate what is going right and wrong. Dates (validation data ends at 23:59:59 on these dates) = 2026-04-28, 2026-02-28, 2025-11-28, 2025-07-28. The month offsets are a bit arbitrary but lean to more recent data: [0, 2, 5, 9].

Based on early experiments using other data (not the validation folds), I have found embedded feature selection using only train data to work well sometimes when combined with a large amount of candidate features. From my perspective, it seems that the selection process can find features with predictive power sometimes. Other times the model cannot beat 40% precision.
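As a sketch, one common embedded-selection scheme consistent with the above is to fit the model on the training block with all candidate features and keep the top-k by importance. The importance vector below is mocked with random numbers; in the real pipeline it would come from something like XGBoost's `feature_importances_`, and `k` is an illustrative choice.

```python
import numpy as np

# Mocked importance vector standing in for a model fit on the
# training block only (e.g. xgb_model.feature_importances_).
rng = np.random.default_rng(0)
gain = rng.random(200)                  # 200 candidate features

k = 50
selected = np.argsort(gain)[::-1][:k]   # indices of the top-k features
```

The key property is that `gain` is computed from training data only, so the selection never sees the validation fold.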

For now I am using XGB as a classifier with mostly basic parameters: I only quickly tuned the max_depth on some other data apart from the validation folds and set it to 10. The XGB predictions are also ensembled across 30 seeds to stabilize the PNL, as I found it was unstable using just one random seed.
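The seed ensembling can be sketched as averaging predicted probabilities across per-seed models before thresholding. The model is mocked here; in the real pipeline each `fit_predict` call would train an `xgboost.XGBClassifier` with a different `random_state` and predict on the validation block.

```python
import numpy as np

N_SEEDS = 30       # number of seeds, as in the post
N_VAL = 86_400     # one validation day

def fit_predict(seed):
    # Stand-in for: train XGB with random_state=seed on the training
    # block, then predict probabilities on the validation block.
    rng = np.random.default_rng(seed)
    return rng.random(N_VAL)

# Average probabilities across seeds, then threshold once.
probs = np.mean([fit_predict(s) for s in range(N_SEEDS)], axis=0)
signals = probs >= 0.8   # entry signals at the 0.8 threshold used below
```

Averaging probabilities (rather than majority-voting hard labels) keeps the threshold meaningful and is what stabilizes the PNL across seeds.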

The chosen feature sets and models (selected using only the recent training data) are evaluated on the validation fold using a fixed fee logic. The simulated trades don't use any position sizing yet, just a fixed amount per trade ($150), which is why there can be large negative results. When it works, positions often get opened in quick succession (concurrency of up to 20 positions).
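A hedged sketch of that fixed-size accounting: every trade stakes $150, so net PNL per trade is the stake times the realized return minus fees. The fee rate and returns below are illustrative assumptions, not the actual fee logic from the backtest.

```python
import numpy as np

STAKE = 150.0      # fixed dollar amount per trade, as in the post
FEE_RATE = 0.001   # assumed round-trip fee fraction (illustrative)

def trade_pnl(returns):
    """Net PNL per trade for fixed $150 positions at the assumed fee rate."""
    returns = np.asarray(returns, dtype=float)
    return STAKE * returns - STAKE * FEE_RATE

pnl = trade_pnl([0.004, -0.002, 0.003])  # hypothetical per-trade returns
edge_per_trade = pnl.mean()              # analogous to edge_per_trade below
total_net_pnl = pnl.sum()                # analogous to total_net_pnl below
```

With a fixed stake and no position sizing, a fold full of false positives accumulates losses linearly in the trade count, which is exactly how the large negative totals arise.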

Here's a snapshot of performance at a prediction threshold of 0.8 on the out-of-sample, unseen validation folds:

threshold      n  n_tp  n_fp  precision  edge_per_trade  total_net_pnl
      0.8     98    70    28   0.714286       22.779897     2232.42992
      0.8    597   192   405   0.321608      -39.229474  -23419.995954
      0.8    558   217   341   0.388889      -15.50954    -8654.323338
      0.8      0     0     0   0.0             0.0            0.0

Note: with a baseline model (no feature engineering), the first fold's PNL is negative. Performance was also positive in an experiment using similar data, but with the 20th of April as the end date.

Per-fold plots:

https://preview.redd.it/vzwy8tt7fhyg1.png?width=1089&format=png&auto=webp&s=163236ab7017b5b0fa24fc8e4c76ee1a20b48f4f

https://preview.redd.it/iapja7l8fhyg1.png?width=1089&format=png&auto=webp&s=832f233a10ecf0dfad87c5cc0d6305ee1a18c9d6

https://preview.redd.it/kn8lhl89fhyg1.png?width=1088&format=png&auto=webp&s=d461b2e9b30cdbcc062c99b2a84a11e6543e2615

https://preview.redd.it/xk4ehau9fhyg1.png?width=1089&format=png&auto=webp&s=7afe1b88b9ec5a42c509c815ef70ac0578f2fd61

https://preview.redd.it/yyi7te2qehyg1.png?width=989&format=png&auto=webp&s=d39cefc9948cb5225236427c624030d9f3edb173

Some ideas for what I could try without knowing the core underlying problem:

- Regime or per-trade filter

- Use more data for training

- Use feature stability when selecting features
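The feature-stability idea could be sketched as running selection on several training blocks and keeping only features chosen in most of them. The feature names and the minimum-count cutoff below are hypothetical.

```python
from collections import Counter

# Hypothetical per-block selection results (one set per training block).
selections = [
    {"ret_5s", "vol_60s", "imb_1s"},
    {"ret_5s", "vol_60s", "spread"},
    {"ret_5s", "imb_1s", "vol_60s"},
]

# Keep features selected in at least `min_blocks` of the blocks.
counts = Counter(f for sel in selections for f in sel)
min_blocks = 2
stable = {f for f, c in counts.items() if c >= min_blocks}
# stable == {"ret_5s", "vol_60s", "imb_1s"}
```

Features that survive across blocks are less likely to be fitting a single regime, which is one plausible mitigation for the fold-to-fold swings in the table above.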

What should I consider doing next?

Thanks in advance.
