
Introduction to Machine Learning for Financial Markets
Supervised, unsupervised, and reinforcement learning for trading
The Backtest That Made $0 in the Real World
In 2018, a quant team showed their fund a model that turned $100,000 into $4.2 million in a backtest. The equity curve was a near-perfect 45-degree line. The Sharpe ratio was 4.0 — better than almost any hedge fund in history. They deployed it with real money.
It lost money in the first week, then bled out for three months before they shut it down. The model that "made" $4.2 million on paper made nothing in reality. Nothing had been faked — no fraud, no bug. The model had simply memorized the past instead of learning anything about the future.
This story repeats so often it has a name: backtest overfitting, and it is the single biggest reason machine-learning funds fail. Marcos López de Prado — who has run billions in ML-driven strategies — put it bluntly: with enough tries, anyone can produce a beautiful backtest on pure noise. The market does not care how elegant your neural network is.
Machine learning is genuinely transforming how markets are traded. But the gap between a model that looks brilliant and one that survives contact with live markets is enormous — and learning to tell them apart is the whole game. This lesson teaches you both: how to build ML systems that find real edges, and how to avoid the traps that destroy the people who skip the second half.
What Is Machine Learning — And Why Markets Are the Hardest Case
Machine learning is the science of getting computers to learn patterns from data without being explicitly programmed for each scenario. Instead of writing a rule like "buy when RSI drops below 30," you feed the algorithm thousands of historical examples and let it discover which patterns precede profitable trades. The machine generalizes from experience — exactly as a seasoned trader develops intuition, but across more data than any human could hold in their head.
Financial markets generate staggering volumes of it: price ticks, order-book snapshots, funding rates, on-chain flows, sentiment scores, macro releases. Traditional rule-based strategies capture only the relationships their creator already imagined. ML models can detect nonlinear, high-dimensional interactions across hundreds of features at once — relationships a human analyst would never think to test.
But markets are the most hostile environment in all of applied ML, and it's worth being honest about why. Most ML breakthroughs — image recognition, language models — work because the underlying rules don't move. A cat looks like a cat whether the photo was taken in 2010 or 2026. Markets are the opposite:
- The data is not stationary. The statistical "rules" of the market shift constantly as regimes change. A model trained on a calm bull market can be actively dangerous in a crash.
- The signal-to-noise ratio is brutal. In image recognition, almost every pixel carries information. In returns, the vast majority of price movement is noise. You are mining for a faint signal in a roaring sea of randomness.
- You are competing against adaptive adversaries. A cat doesn't change its appearance to fool your classifier. The moment a real market edge becomes popular, other traders arbitrage it away. Your model's success quietly erodes its own advantage.
The good news: the barrier is no longer access to data or compute. Platforms like GaiaEx provide API access to real-time and historical market data on Hyperliquid L1, so any developer can collect the datasets ML requires. The remaining barrier is knowledge — which is exactly what this lesson provides.
Three Paradigms: Supervised, Unsupervised, and Reinforcement Learning
Machine learning is not one technique — it's a family of approaches, each suited to different problems. In finance, all three major paradigms have real applications.
Supervised learning is the workhorse. You give the model labeled examples: historical feature vectors (inputs) paired with known outcomes (targets), and it learns a mapping from one to the other. Two sub-types dominate financial use:
- Classification — predict a category. Will the asset rise or fall over the next hour? Is this transaction fraudulent? The output is a discrete label, usually with a probability attached.
- Regression — predict a continuous value. What will the 1-hour return be? What is the fair funding rate? The output is a number, and the model minimizes prediction error.
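The two framings differ only in how the target column is built. A minimal sketch, assuming a pandas Series of closing prices; the `make_targets` name and the toy prices are illustrative and not part of the lesson:

```python
import pandas as pd

def make_targets(close: pd.Series, horizon: int = 1) -> pd.DataFrame:
    """Build a regression target and a classification target from closes."""
    # Forward return: the price `horizon` bars ahead relative to the current price.
    fwd_return = close.shift(-horizon) / close - 1.0
    out = pd.DataFrame({"fwd_return": fwd_return})
    # Classification label: 1 if the forward return is positive, otherwise 0.
    out["up"] = (out["fwd_return"] > 0).astype(int)
    # The last `horizon` rows have no known future yet, so they are dropped.
    return out.dropna()

# Toy prices, purely illustrative.
close = pd.Series([100.0, 102.0, 101.0, 101.0, 105.0])
targets = make_targets(close, horizon=1)
print(targets)
```

A regressor would be trained on `fwd_return`, a classifier on `up`. Note that the target looks forward by design; it is the features that must never do so.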
Unsupervised learning finds structure in data without labels. Clustering algorithms like K-Means can group trading days into market regimes (trending, mean-reverting, high-volatility) without you defining those regimes in advance. Dimensionality reduction like PCA compresses hundreds of correlated features into a handful of uncorrelated components, which matters when your feature set is wider than your sample is deep.
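As a rough illustration of regime clustering, the sketch below runs K-Means on two volatility-style features built from synthetic returns. The data, the 20-day window, and the choice of two clusters are all assumptions made for the example; real regimes are rarely this cleanly separated.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic daily returns: 300 calm days followed by 300 turbulent days.
rng = np.random.default_rng(0)
returns = pd.Series(np.concatenate([
    rng.normal(0.0, 0.005, 300),  # calm regime, about 0.5% daily volatility
    rng.normal(0.0, 0.030, 300),  # turbulent regime, about 3% daily volatility
]))

# Two volatility-style features computed over a trailing 20-day window.
features = pd.DataFrame({
    "rolling_vol": returns.rolling(20).std(),
    "mean_abs_ret": returns.abs().rolling(20).mean(),
}).dropna()

# Standardize, then let K-Means find two groups without any labels.
scaled = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
features["regime"] = labels

# Cluster ids are arbitrary, so identify the high-volatility one by its mean.
vol_by_regime = features.groupby("regime")["rolling_vol"].mean()
high_vol_regime = vol_by_regime.idxmax()
print(vol_by_regime)
```

Here the scaler and the clusters are fit on the whole sample because the goal is description. For a live signal, both would need to be fit on past data only.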
Reinforcement learning (RL) trains an agent to make a sequence of decisions by maximizing cumulative reward. The agent interacts with an environment (the market), takes actions (buy, sell, hold), and receives feedback (profit or loss). RL is appealing for portfolio allocation and order execution, where the best action depends on your current position, transaction costs, and market impact. DeepMind's game-playing agents inspired a wave of finance RL research — but practical results remain mixed, precisely because markets are non-stationary and the "game" keeps changing its rules mid-play.
Feature Engineering: Turning Raw Data into Predictive Signals
In machine learning, features are the input variables the model uses to make predictions. Raw OHLCV data is a starting point, but feeding raw prices into a model is like handing a chef raw wheat instead of flour — you have to process it first. Feature engineering is where domain expertise meets data science, and it is far more often the difference between a working model and a useless one than the choice of algorithm is.
Common feature categories for financial ML include:
- Technical indicators: RSI, MACD, Bollinger Band width, ATR, ADX. These encode momentum, volatility, and trend strength. RSI and ADX are bounded between 0 and 100, but MACD and ATR are expressed in price units, so divide them by price (or otherwise normalize them) before comparing across assets or long time spans.
- Lag features: past returns over multiple horizons (1-bar, 5-bar, 20-bar, 60-bar), capturing momentum and mean reversion at different timescales.
- Volatility measures: rolling standard deviation, Parkinson volatility (from high/low), the Garman-Klass estimator (from open/high/low/close). Volatility clustering is one of the most reliable stylized facts in all of finance.
- Volume features: volume ratio (current vs. average), on-balance volume, volume-price correlation. Unusual volume frequently precedes price moves.
- Cross-asset features: BTC's return as an input when predicting ETH. Shifting correlations between assets often flag regime changes before price does.
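The range-based volatility estimators in the list above are short formulas. A sketch of both, using the standard textbook definitions; the function names and the toy bars are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def parkinson_vol(high: pd.Series, low: pd.Series, window: int) -> pd.Series:
    """Rolling Parkinson volatility from each bar's high/low range."""
    hl_sq = np.log(high / low) ** 2
    return np.sqrt(hl_sq.rolling(window).mean() / (4.0 * np.log(2.0)))

def garman_klass_vol(open_: pd.Series, high: pd.Series, low: pd.Series,
                     close: pd.Series, window: int) -> pd.Series:
    """Rolling Garman-Klass volatility from open/high/low/close."""
    hl_sq = np.log(high / low) ** 2
    co_sq = np.log(close / open_) ** 2
    # Per-bar variance estimate; the second term corrects for the open-to-close move.
    per_bar_var = 0.5 * hl_sq - (2.0 * np.log(2.0) - 1.0) * co_sq
    return np.sqrt(per_bar_var.rolling(window).mean())

# Toy bars, purely illustrative.
bars = pd.DataFrame({
    "open":  [100.0, 101.0, 102.0, 101.0],
    "high":  [102.0, 103.0, 103.0, 104.0],
    "low":   [99.0, 100.0, 100.0, 100.0],
    "close": [101.0, 102.0, 101.0, 103.0],
})
pk = parkinson_vol(bars["high"], bars["low"], window=2)
gk = garman_klass_vol(bars["open"], bars["high"], bars["low"], bars["close"], window=2)
print(pd.DataFrame({"parkinson": pk, "garman_klass": gk}))
```

Both return per-bar volatility and use only the current and past bars, so they are safe to use as features at prediction time.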
Here's a practical example of engineering features in Python:
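import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add features to hourly OHLCV bars (columns: high, low, close, volume)."""
    df = df.copy()  # leave the caller's frame untouched
    # Returns over the last 1, 4 and 24 bars (hours, on 1-hour candles)
    df["return_1h"] = df["close"].pct_change(1)
    df["return_4h"] = df["close"].pct_change(4)
    df["return_24h"] = df["close"].pct_change(24)
    # Realized volatility of hourly returns over the trailing day
    df["volatility_24h"] = df["return_1h"].rolling(24).std()
    # compute_rsi and compute_atr are helper functions assumed to be defined elsewhere
    df["rsi_14"] = compute_rsi(df["close"], 14)
    df["atr_14"] = compute_atr(df, 14)
    # Current volume relative to its trailing 24-bar average
    df["volume_ratio"] = df["volume"] / df["volume"].rolling(24).mean()
    # Drop the warm-up rows where the rolling windows are not yet full
    return df.dropna()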
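The feature function calls `compute_rsi` and `compute_atr`, which the lesson does not define. One possible sketch, assuming Wilder-style exponential smoothing (a common convention, though not the only one) and a frame with `high`, `low`, and `close` columns:

```python
import pandas as pd

def compute_rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI with Wilder-style smoothing (alpha = 1 / period)."""
    delta = close.diff()
    gain = delta.clip(lower=0.0)      # upward moves, zero otherwise
    loss = (-delta).clip(lower=0.0)   # downward moves as positive numbers
    avg_gain = gain.ewm(alpha=1.0 / period, min_periods=period, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1.0 / period, min_periods=period, adjust=False).mean()
    rs = avg_gain / avg_loss          # becomes inf when there were no losses
    return 100.0 - 100.0 / (1.0 + rs)

def compute_atr(df: pd.DataFrame, period: int = 14) -> pd.Series:
    """Average True Range with the same Wilder-style smoothing."""
    prev_close = df["close"].shift(1)
    true_range = pd.concat(
        [
            df["high"] - df["low"],
            (df["high"] - prev_close).abs(),
            (df["low"] - prev_close).abs(),
        ],
        axis=1,
    ).max(axis=1)  # the first bar falls back to high - low
    return true_range.ewm(alpha=1.0 / period, min_periods=period, adjust=False).mean()

# Sanity checks on toy data: a steadily rising series, its mirror image,
# and flat bars whose range is always 2.
rising = pd.Series([float(p) for p in range(100, 130)])
falling = rising[::-1].reset_index(drop=True)
flat_bars = pd.DataFrame({"high": [101.0] * 30, "low": [99.0] * 30, "close": [100.0] * 30})
rsi_up = compute_rsi(rising).iloc[-1]
rsi_down = compute_rsi(falling).iloc[-1]
atr_flat = compute_atr(flat_bars).iloc[-1]
print(rsi_up, rsi_down, atr_flat)
```

Both helpers look only backward, so they respect the no-future-data rule. ATR comes out in price units; dividing by `close` gives a scale-free version.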
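Two rules are non-negotiable. First, never use future data: every feature must be computable from information available at the moment of prediction. The most common way to "discover" a profitable strategy is to accidentally let tomorrow's number leak into today's features. Second, keep feature scales in check. Tree-based models are insensitive to scale, but linear models, neural networks, K-Means, and PCA all struggle when features span wildly different ranges (a price of 60,000 next to an RSI of 30). When you do normalize, compute the scaling statistics on the training window only, or the first rule is broken again.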
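The two rules interact: scaling statistics computed over the full dataset quietly leak the future. A minimal sketch of leak-free z-scoring, with hypothetical helper names (`fit_scaler`, `apply_scaler`) and synthetic features:

```python
import numpy as np
import pandas as pd

def fit_scaler(train: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    """Learn per-column mean and standard deviation from the training window only."""
    mean = train.mean()
    std = train.std(ddof=0).replace(0.0, 1.0)  # guard against constant columns
    return mean, std

def apply_scaler(df: pd.DataFrame, mean: pd.Series, std: pd.Series) -> pd.DataFrame:
    """Apply previously learned statistics to any window, past or future."""
    return (df - mean) / std

# Synthetic features on very different scales, purely illustrative.
rng = np.random.default_rng(1)
features = pd.DataFrame({
    "close": 60_000 + rng.normal(0, 500, 200).cumsum(),
    "rsi_14": rng.uniform(20, 80, 200),
})

# Chronological split: the first 70% of rows is the training window.
split = int(len(features) * 0.7)
train, test = features.iloc[:split], features.iloc[split:]

mean, std = fit_scaler(train)                  # statistics come from the past only
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)    # the future is scaled with past statistics
print(train_scaled.describe().loc[["mean", "std"]])
```

The test window will generally not have mean 0 and standard deviation 1 after scaling. That is expected, and it is itself a useful signal of distribution drift.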
The Most Important Section: Validating on Time Series
If you remember one thing from this entire lesson, make it this: standard ML validation is catastrophically wrong for financial markets.
In normal machine learning, you shuffle your data randomly and split it into train and test sets. Do that with time-series price data and you have just let the model train on the future to predict the past. Your accuracy will look spectacular. Your live trading will be a slaughter. This single mistake — known as look-ahead bias — is responsible for more blown-up quant strategies than any market crash. The correct approach always respects the arrow of time:
Chronological train/validation/test split. Divide data by date: train on 2020–2022, validate on 2023, test on 2024. The test set is touched exactly once — it is your simulation of live trading. The moment you tune hyperparameters against it, it stops being a test set and you've lost any honest estimate of out-of-sample performance.
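A date-based split takes only a few lines. The sketch below assumes a DataFrame with a DatetimeIndex; the function name and the synthetic daily data are illustrative:

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, train_end: str, val_end: str):
    """Split by date: train < train_end <= validation < val_end <= test."""
    train_end, val_end = pd.Timestamp(train_end), pd.Timestamp(val_end)
    df = df.sort_index()  # make sure rows are in time order, never shuffled
    train = df.loc[df.index < train_end]
    val = df.loc[(df.index >= train_end) & (df.index < val_end)]
    test = df.loc[df.index >= val_end]
    return train, val, test

# Synthetic daily data covering 2020 through 2024.
idx = pd.date_range("2020-01-01", "2024-12-31", freq="D")
data = pd.DataFrame({"close": range(len(idx))}, index=idx)

train, val, test = chronological_split(data, "2023-01-01", "2024-01-01")
print(len(train), len(val), len(test))
```

Every training row precedes every validation row, and every validation row precedes every test row, which is the property a random shuffle destroys.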
Walk-forward validation (expanding or sliding window) is more robust. Train on months 1–12, predict month 13. Then train on months 1–13 (or 2–13), predict month 14. Repeat. This generates many out-of-sample predictions, each on data the model has never seen, while adapting to evolving conditions — it mirrors how the strategy would actually be deployed and retrained.
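The walk-forward scheme can be expressed as a small index generator. This is a sketch with a hypothetical function name; positions are zero-based, so "months 1 to 12" correspond to indices 0 to 11:

```python
from typing import Iterator, Tuple

def walk_forward_splits(
    n_samples: int, train_size: int, test_size: int, expanding: bool = True
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges that always move forward in time."""
    train_end = train_size
    while train_end + test_size <= n_samples:
        # Expanding window keeps all history; sliding window keeps the last train_size points.
        train_start = 0 if expanding else train_end - train_size
        yield range(train_start, train_end), range(train_end, train_end + test_size)
        train_end += test_size

# 15 monthly observations, 12 months of training, 1 month of testing per fold.
expanding_folds = list(walk_forward_splits(15, train_size=12, test_size=1, expanding=True))
sliding_folds = list(walk_forward_splits(15, train_size=12, test_size=1, expanding=False))

for train_idx, test_idx in expanding_folds:
    print(f"train {train_idx.start}..{train_idx.stop - 1} -> test {test_idx.start}")
```

In each fold the model would be refit on the training range and scored on the test range; concatenating the test predictions gives one continuous out-of-sample track record.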
Purged cross-validation (introduced by Marcos López de Prado) adds a gap between training and test folds to stop information leaking through overlapping labels. If your target is the 24-hour forward return, observations near a fold boundary share information; the purge gap removes that contamination. An additional "embargo" period after the test fold guards against serial correlation bleeding back into training.
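To make purging and the embargo concrete, here is a simplified index-based sketch. It assumes evenly spaced bars and a fixed label horizon (for example, a 4-bar forward return), and it is not López de Prado's reference implementation; the function name is hypothetical.

```python
import numpy as np

def purged_train_indices(
    n_samples: int, test_start: int, test_end: int, horizon: int, embargo: int = 0
) -> np.ndarray:
    """Training indices for a test fold [test_start, test_end).

    The label of observation i uses prices from bar i to bar i + horizon, so any
    training observation whose label window overlaps the test fold's label
    windows is purged. The embargo removes a further block after the test fold.
    """
    idx = np.arange(n_samples)
    # Before the fold: the label window must end before the test fold starts.
    before = idx[idx < test_start - horizon]
    # After the fold: skip the bars still covered by test labels, plus the embargo.
    after = idx[idx >= test_end + horizon + embargo]
    return np.concatenate([before, after])

# 100 bars, test fold on bars 40..59, 4-bar forward labels, 5-bar embargo.
train_idx = purged_train_indices(100, test_start=40, test_end=60, horizon=4, embargo=5)
print(train_idx[:3], "...", train_idx[-3:], "count:", len(train_idx))
```

With these numbers, bars 36 to 39 are purged before the fold and bars 60 to 68 are purged or embargoed after it, leaving 67 training observations.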
Finally, judge the model on metrics that matter for money, not for textbooks. Accuracy alone is misleading. A model that's 51% accurate but right on the big moves can be wildly profitable; a 70%-accurate model that only nails the tiny moves can lose money after fees. Track precision on your directional calls, the Sharpe ratio of the resulting strategy, and maximum drawdown — and always net of realistic transaction costs and slippage.
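The three metrics named above are straightforward to compute from a series of per-period strategy returns. A sketch, assuming a zero risk-free rate and simple (not log) returns; the function names and toy inputs are illustrative:

```python
import numpy as np

def sharpe_ratio(returns, periods_per_year: int) -> float:
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    returns = np.asarray(returns, dtype=float)
    std = returns.std(ddof=1)
    if std == 0:
        return 0.0
    return float(returns.mean() / std * np.sqrt(periods_per_year))

def max_drawdown(returns) -> float:
    """Largest peak-to-trough loss of the compounded equity curve (a negative number)."""
    equity = np.concatenate([[1.0], np.cumprod(1.0 + np.asarray(returns, dtype=float))])
    running_peak = np.maximum.accumulate(equity)
    return float((equity / running_peak - 1.0).min())

def precision(predicted_up, actually_up) -> float:
    """Share of 'up' calls that were followed by an up move."""
    predicted_up = np.asarray(predicted_up, dtype=bool)
    actually_up = np.asarray(actually_up, dtype=bool)
    if predicted_up.sum() == 0:
        return float("nan")
    return float((predicted_up & actually_up).sum() / predicted_up.sum())

# Toy inputs; the returns should already be net of fees and slippage.
net_returns = [0.05, -0.02, 0.03, -0.10, 0.04]
mdd = max_drawdown(net_returns)
prec = precision([1, 1, 0, 1], [1, 0, 0, 1])
sr = sharpe_ratio([0.01, -0.01, 0.02, 0.0], periods_per_year=252)
print(f"max drawdown: {mdd:.2%}, precision: {prec:.2f}, Sharpe: {sr:.2f}")
```

For hourly bars in a market that trades around the clock, `periods_per_year` would be 24 * 365 rather than 252.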
Random Forests, Gradient Boosting, and Choosing the Right Model
For tabular financial data (the kind you get from OHLCV candles, technical indicators, and engineered features), tree-based ensemble models tend to outperform deep neural networks. This isn't just an opinion; it is backed by benchmark evidence on medium-sized tabular datasets (see Grinsztajn et al., 2022, "Why do tree-based models still outperform deep learning on tabular data?"). For images and language, deep learning rules. For a table of features, trees usually win.
Random Forests build hundreds of decision trees, each trained on a bootstrap sample of the rows and restricted to a random subset of features at each split, then average their predictions. This averaging reduces variance and overfitting. They're robust, need minimal tuning, and hand you a built-in feature-importance ranking, which is invaluable for understanding what is actually driving your model.
Gradient Boosting (XGBoost, LightGBM, CatBoost) builds trees sequentially, each one correcting the errors of the ensemble so far. This usually beats random forests on accuracy but demands more careful tuning. LightGBM is a common default for financial ML practitioners: fast training, native handling of missing values, and comfortable scaling to millions of rows.
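import lightgbm as lgb
from sklearn.model_selection import TimeSeriesSplit

# X (feature DataFrame) and y (label Series) are assumed to be in chronological order
model = lgb.LGBMClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    subsample_freq=1,  # row subsampling is only applied when this is > 0
    colsample_bytree=0.8,
    min_child_samples=50,
)

# Each fold trains on an expanding window and validates on the block that follows it
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    # fit() retrains from scratch on every fold
    model.fit(
        X.iloc[train_idx], y.iloc[train_idx],
        eval_set=[(X.iloc[val_idx], y.iloc[val_idx])],
    )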
Deep learning still has its place — but a narrow one. LSTMs and Transformers can genuinely add value on raw sequential data: tick-by-tick order flow, or fusing price with the text of news and social-media sentiment, where the ordering and context carry information that flat tables discard. Just don't reach for them by default. For structured data with engineered features, a well-tuned LightGBM trained on data pulled from GaiaEx's API will almost always beat a neural network — and it trains in seconds rather than hours.
Where ML Actually Earns Its Keep in Crypto
Price prediction gets all the attention, but it's the hardest and least reliable use of ML in markets. Some of the most valuable applications have nothing to do with forecasting the next candle:
- Market-regime detection. Unsupervised clustering sorts market conditions into regimes — quiet trending, choppy mean-reverting, high-volatility panic. Knowing which regime you're in lets you switch strategies or cut size, which is often worth more than any single directional prediction.
- Sentiment analysis. Large language models read news, X/Twitter, and Discord at a scale no human can, scoring shifts in market mood. Published studies often report that fusing a sentiment signal with price data improves crypto-prediction accuracy over price alone, plausibly because sentiment moves crypto unusually hard.
- Fraud and manipulation detection. This is where ML quietly shines. Models flag wash trading, spoofing, and classic pump-and-dump schemes by spotting anomalous volume and order-book patterns that precede a coordinated dump. Sequence-based anomaly detectors (LSTMs, Anomaly Transformers) are frequently reported to outperform both classical ML and simple statistical thresholds here.
- Execution and slippage modeling. ML predicts the market impact of a large order, helping execution algorithms slice it to minimize cost. In a 24/7 market like crypto, where a clumsy order can move the book against you, this directly protects your bottom line.
- Risk and liquidation management. Models estimate the probability that a position breaches a risk threshold given current volatility, helping size positions and set stops before a cascade rather than after.
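As a baseline for the risk use case, the breach probability can be approximated in closed form before any model is trained. The sketch below assumes zero-drift, normally distributed returns that are independent across bars. Real crypto returns have fatter tails, so this tends to understate risk, and it measures only the end-of-horizon return, not the chance of touching the threshold along the way. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def breach_probability(distance: float, vol_per_bar: float, horizon_bars: int) -> float:
    """P(return over the horizon <= -distance) under zero-drift normal returns.

    distance      -- adverse move that would breach the threshold, e.g. 0.10 for 10%
    vol_per_bar   -- standard deviation of one bar's return
    horizon_bars  -- number of bars in the look-ahead horizon
    """
    if vol_per_bar <= 0 or horizon_bars <= 0:
        raise ValueError("vol_per_bar and horizon_bars must be positive")
    # Volatility scales with the square root of time for independent returns.
    sigma = vol_per_bar * sqrt(horizon_bars)
    return NormalDist(0.0, sigma).cdf(-distance)

# A position 10% away from its threshold, held for 24 hourly bars.
calm = breach_probability(0.10, vol_per_bar=0.01, horizon_bars=24)
stressed = breach_probability(0.10, vol_per_bar=0.03, horizon_bars=24)
print(f"calm: {calm:.2%}, stressed: {stressed:.2%}")
```

An ML model earns its place here only if it estimates this probability better than the closed-form baseline, for instance by forecasting volatility more accurately.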
Why Most ML Strategies Fail — And What That Teaches You
Honest education means naming how this goes wrong, because it usually does. These pitfalls have destroyed more quant strategies than every bear market combined:
- Overfitting. The model memorizes the training data instead of learning anything generalizable. The cure is discipline, not cleverness: simpler models, fewer features, regularization, and ruthless out-of-sample testing. If your backtest looks too good to be true, it is.
- Look-ahead bias. Using information that wasn't available at decision time — computing features on the full dataset before splitting, using prices that were revised after the fact, or (more common than anyone admits) leaving the target variable in the feature set.
- Survivorship bias. Training only on assets that still exist today. Delisted tokens, rug-pulled projects, and dead coins are silently missing from your dataset, which biases every result upward. In crypto this is severe: thousands of tokens have gone to zero.
- Non-stationarity. Market distributions shift. A model trained on a 2021 bull market fails in a 2022 bear market because the rules it learned no longer hold. You must retrain regularly and monitor for distribution drift, not "set and forget."
- The multiple-testing trap. Try 1,000 strategy variations and a few will look brilliant by pure luck. This is how that "$4.2 million" backtest happens. Tools like the Deflated Sharpe Ratio exist specifically to discount performance you found only by searching hard enough.
- The research-to-production gap. Backtests assume frictionless, instant fills. Live markets impose slippage, fees, latency, and market impact that routinely shave 30–70% off theoretical returns — and can flip a "profitable" strategy negative.
There's a deeper point hiding in this list. Markets are an adversarial, adaptive system — every real edge attracts competitors who arbitrage it away. A model that wins for a year can stop working the month it gets crowded, through no fault of its code. This is why successful quant operations don't search for one perfect model; they build a pipeline to keep finding, validating, and retiring edges as the market evolves.
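The multiple-testing trap from the list above is easy to reproduce. The simulation below generates 1,000 "strategies" that are pure noise and reports the best annualized Sharpe ratio among them. The strategy count, the one-year sample, and the 1% daily volatility are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_strategies, n_days = 1000, 252

# Every "strategy" is pure noise: zero true edge, 1% daily volatility.
daily_returns = rng.normal(0.0, 0.01, size=(n_strategies, n_days))

# Annualized Sharpe ratio of each noise strategy (zero risk-free rate).
sharpes = (
    daily_returns.mean(axis=1) / daily_returns.std(axis=1, ddof=1) * np.sqrt(252)
)

best = float(sharpes.max())
median = float(np.median(sharpes))
print(f"median Sharpe: {median:.2f}, best of {n_strategies}: {best:.2f}")
```

The median sits near zero, as it should, while the best of the batch looks like a strong strategy despite having no edge at all. Reporting only that winner is what backtest overfitting amounts to.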
Building on GaiaEx: Your First End-to-End Project
Every quant skill you've read about depends on one thing first: clean, reliable data. This is where GaiaEx fits into your ML workflow. Because trades execute on Hyperliquid L1, every fill, funding rate, and order-book change is recorded on-chain and exposed through the GaiaEx API — giving you transparent, granular, real-time and historical data without the gaps and silent revisions that plague many centralized data feeds.
Why on-chain data matters for ML: a model is only as honest as its inputs. When your training data comes from a transparent L1 rather than a black-box internal database, you can trust that the prices and volumes you're learning from are the prices and volumes that actually traded. That removes a whole category of subtle data-integrity bugs before they ever reach your features.
Your first project should be deliberately simple. The goal is not to beat the market on attempt one — it's to build a complete, leak-free pipeline you can iterate on. Here's a concrete roadmap:
- Collect 1-hour candle data for BTC/USDC from GaiaEx's API — at least 6 months.
- Engineer 10–15 features: lagged returns, RSI, ATR, volume ratio, Bollinger Band width.
- Label each bar: 1 if the next 4-hour return is positive, 0 otherwise.
- Train a LightGBM classifier using walk-forward validation — never a random split.
- Evaluate precision, recall, and the Sharpe ratio of a simple strategy that goes long when the model predicts 1, after subtracting realistic fees.
Then — and this is the part beginners skip — try to break it. Add a purge gap and see if your edge survives. Check whether any feature is secretly leaking the future. Test it on a different time window. If the edge evaporates under scrutiny, you've learned something genuinely valuable: that it was never real, and you found out for free instead of with your capital.
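For the evaluation step of the roadmap, fees have to be charged whenever the position changes. A minimal sketch of turning model signals into net returns; the function name, the toy prices, and the 0.1% fee rate are placeholders rather than any exchange's actual fee schedule:

```python
import pandas as pd

def net_strategy_returns(close: pd.Series, signal: pd.Series, fee_rate: float) -> pd.Series:
    """Per-bar returns of a long/flat strategy after proportional fees.

    signal[t] is the model's call made at the close of bar t (1 = long, 0 = flat).
    """
    bar_return = close.pct_change().fillna(0.0)
    # The position held during bar t is the signal known at the close of bar t-1.
    position = signal.shift(1).fillna(0.0)
    # Fees are charged on every change in position (entries and exits).
    turnover = position.diff().abs().fillna(position.abs())
    return position * bar_return - fee_rate * turnover

# Toy data, purely illustrative.
close = pd.Series([100.0, 101.0, 102.0, 100.0, 103.0])
signal = pd.Series([1, 1, 0, 1, 0])
gross = net_strategy_returns(close, signal, fee_rate=0.0)
net = net_strategy_returns(close, signal, fee_rate=0.001)
print(pd.DataFrame({"gross": gross, "net": net}))
```

Shifting the signal by one bar is what keeps the backtest honest: a call made at the close of a bar can only earn the return of the following bar. This sketch ignores slippage and does not charge a fee for closing a position still open at the end of the sample.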


