
Time Series Forecasting with LSTM and Transformer Models
Predicting price movements with sequence models
Understanding Time Series Data
A time series is a sequence of data points ordered by time — stock prices at each minute, daily Bitcoin closes, hourly temperature readings. What distinguishes time series from other data is that order matters. Shuffling rows in a classification dataset is harmless; shuffling a time series destroys its meaning.
Three properties define the character of any time series. Trend is the long-term direction: BTC's price moved from roughly $3,500 in early 2019 to above $60,000 in late 2021, a clear upward trend. Seasonality refers to repeating patterns at fixed intervals; crypto markets often see increased volume during US market hours and reduced activity on weekends. Stationarity means that statistical properties like mean, variance, and autocorrelation remain constant over time. Most financial time series are non-stationary (prices drift upward or crash), which creates challenges for models that assume stable distributions.
Before feeding financial data into most models, you need to transform it. Taking log returns (the natural log of the price ratio between consecutive periods) converts non-stationary prices into approximately stationary returns. Differencing, which subtracts the previous value from each observation, achieves a similar effect. Treat these transformations as prerequisites rather than options: most forecasting methods assume a stable distribution and behave poorly on raw prices.
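The two transformations can be sketched in a few lines of NumPy. This is a minimal illustration, not a full preprocessing pipeline; the function names and the sample prices are hypothetical.

```python
import numpy as np

def log_returns(prices):
    """r_t = ln(P_t / P_{t-1}); the result is one element shorter than the input."""
    prices = np.asarray(prices, dtype=float)
    return np.diff(np.log(prices))

def first_difference(series):
    """d_t = x_t - x_{t-1}."""
    return np.diff(np.asarray(series, dtype=float))

# Hypothetical closing prices, for illustration only
closes = np.array([100.0, 110.0, 99.0, 108.9])
r = log_returns(closes)
d = first_difference(closes)

# Log returns add up across periods: their sum is the log of the total price ratio
total = np.log(closes[-1] / closes[0])
```

One practical reason to prefer log returns over simple differences is that they are scale-free: a 10-dollar move means something very different at a price of 100 than at 60,000, while a log return of 0.01 means roughly the same thing at any price level.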
ARIMA: The Classical Baseline
ARIMA (AutoRegressive Integrated Moving Average) has been the workhorse of time series forecasting for decades. It combines three components: AR(p) uses the previous p values to predict the next one, I(d) differences the series d times to achieve stationarity, and MA(q) models the error terms from past predictions.
For financial data, ARIMA serves as a critical baseline. Any machine learning model that can't beat a well-tuned ARIMA on your specific dataset is adding complexity without value. In practice, ARIMA often performs surprisingly well for short-horizon forecasts of low-frequency data (daily or weekly candles) where the signal-to-noise ratio is manageable.
# ARIMA baseline for BTC daily returns
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Assumes `prices` is a pandas Series of daily BTC closes with a DatetimeIndex
returns = prices.pct_change().dropna()

# AR(2), d=0 (returns are already approximately stationary), MA(1)
model = ARIMA(returns, order=(2, 0, 1))
fitted = model.fit()
forecast = fitted.forecast(steps=5)  # Forecast the next 5 daily returns
print(f"AIC: {fitted.aic:.2f}")  # Lower AIC is preferred when comparing orders on the same data
ARIMA's limitations become apparent with complex nonlinear patterns, regime changes, and high-dimensional feature spaces. This is where deep learning enters the picture.
LSTM Networks for Sequence Modeling
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network designed to learn long-range dependencies in sequential data. Unlike vanilla RNNs, which suffer from vanishing gradients and forget information after just a few time steps, LSTMs use a gating mechanism — forget gate, input gate, and output gate — to selectively retain or discard information over hundreds of steps.
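To make the gating mechanism concrete, here is a minimal NumPy sketch of a single LSTM time step. It follows the standard LSTM equations, but the weight layout (four stacked blocks in the order forget, input, candidate, output) is an assumption of this sketch, and the weights and input sequence are random placeholders rather than a trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * hidden, input + hidden) and b has shape (4 * hidden,).
    The four row blocks are stacked as: forget, input, candidate, output.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])  # forget gate: how much of c_prev to keep
    i = sigmoid(z[1 * hidden:2 * hidden])  # input gate: how much new information to write
    g = np.tanh(z[2 * hidden:3 * hidden])  # candidate values for the cell state
    o = sigmoid(z[3 * hidden:4 * hidden])  # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_features, hidden = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden, n_features + hidden))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
sequence = rng.normal(size=(60, n_features))  # 60 time steps of hypothetical features
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, b)
```

The line `c_t = f * c_prev + i * g` is what lets information survive many steps: when the forget gate is close to 1 and the input gate close to 0, the cell state passes through almost unchanged.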
For financial time series, LSTMs offer two advantages over ARIMA: they can model nonlinear relationships (price dynamics are rarely linear), and they can incorporate multiple input features simultaneously — not just past prices, but volume, volatility, funding rates, order book imbalance, and any other signal you believe carries predictive information.
Preparing data for an LSTM requires careful windowing. You create input sequences of fixed length (e.g., 60 time steps) paired with the target value (the next price or return). Normalization is critical — scale features to [0, 1] or standardize to zero mean and unit variance. But here's the trap: you must fit the scaler on training data only and transform validation and test data using the same parameters. Fitting on the full dataset leaks future information.
Never normalize using statistics computed from the entire dataset. Fit your scaler on the training set only, then apply it to validation and test sets. Skipping this step is one of the most common sources of look-ahead bias in data preparation.
# LSTM data preparation with proper normalization
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def create_sequences(data, window=60):
    """Pair each window of `window` rows with the close (column 0) of the row that follows it."""
    X, y = [], []
    for i in range(window, len(data)):
        X.append(data[i - window:i])
        y.append(data[i, 0])  # Predict next close
    return np.array(X), np.array(y)

# Assumes train_data and test_data are chronologically split 2-D arrays
# of shape (time steps, features) with the close in column 0.
# Fit scaler on training data ONLY
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_data)
test_scaled = scaler.transform(test_data)  # Transform, don't fit

X_train, y_train = create_sequences(train_scaled)
X_test, y_test = create_sequences(test_scaled)

Attention Mechanisms and Temporal Fusion Transformers
The same self-attention mechanism that revolutionized natural language processing has also proven effective for time series. Instead of processing sequences step by step like an LSTM, attention lets the model look at all time steps at once and learn which historical moments are most relevant for predicting the future.
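As a rough illustration, single-head scaled dot-product attention over a window of time steps fits in a few lines of NumPy. The projection matrices here are random placeholders, and the causal mask is an added assumption of this sketch so that each step only uses its own past; dropping the mask gives plain attention over the whole history window.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a (time, features) window."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Causal mask: step t may attend to steps <= t only, never to the future
    T = X.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)  # row t: relevance of each past step to step t
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_in, d_k = 6, 4, 8
X = rng.normal(size=(T, d_in))  # hypothetical window of 6 steps, 4 features
Wq, Wk, Wv = (rng.normal(size=(d_in, d_k)) for _ in range(3))
out, weights = causal_self_attention(X, Wq, Wk, Wv)
```

Each row of `weights` sums to 1 and can be read as a distribution over the time steps the model is drawing on, which is where the interpretability of attention-based forecasters comes from.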
The Temporal Fusion Transformer (TFT), introduced by researchers at the University of Oxford and Google Cloud AI (preprint in 2019, journal publication in 2021), is purpose-built for multi-horizon time series forecasting. It combines several innovations:
- Variable selection networks: Automatically learn which input features matter most, providing built-in interpretability. You can see whether the model relies more on price momentum, volume, or funding rates.
- Gated residual networks: Suppress irrelevant inputs and let the model handle complex nonlinear feature interactions.
- Multi-head attention over time: Identifies which historical time steps are most informative for each prediction horizon.
- Quantile outputs: Produce prediction intervals (for example the 10th, 50th, and 90th percentiles) rather than point estimates, giving you a measure of uncertainty that is essential for risk management in trading.
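Quantile outputs are typically trained with the quantile (pinball) loss, which penalizes under- and over-prediction asymmetrically. The sketch below shows that loss in NumPy for a single quantile; the function name and sample values are hypothetical, and a real TFT sums this loss over several quantiles and horizons.

```python
import numpy as np

def quantile_loss(y_true, y_pred, q):
    """Pinball loss for quantile q in (0, 1), averaged over samples."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs (1 - q)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

y_true = np.array([1.0, 2.0, 3.0])
under = y_true - 1.0  # predictions that are all too low by one unit
over = y_true + 1.0   # predictions that are all too high by one unit

loss_under_p90 = quantile_loss(y_true, under, 0.9)  # 0.9
loss_over_p90 = quantile_loss(y_true, over, 0.9)    # 0.1
```

For the 90th percentile, predicting too low is nine times as costly as predicting too high, which pushes the fitted output up until roughly 90% of observations fall below it.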
In published benchmarks, TFTs have outperformed LSTM-based and traditional models on tasks including electricity demand forecasting, retail sales prediction, and financial volatility modeling. For crypto markets accessible through platforms like GaiaEx, the TFT's ability to process heterogeneous inputs makes it particularly well-suited: static metadata (asset type, listing date), known future values (day of week, time of day), and observed time-varying features (price, volume, on-chain metrics).
Practical Example: Predicting BTC Price Direction
Let's be honest about what's achievable. Predicting the exact price of Bitcoin tomorrow is essentially impossible. The efficient market hypothesis (EMH) argues that prices already reflect all available information, making consistent prediction a fool's errand. In crypto, the weak form of EMH is debatable — markets are less efficient than equities — but the noise-to-signal ratio remains brutal.
A more realistic goal is directional accuracy: will BTC close higher or lower than it opened? This binary classification problem is more tractable, and even modest improvements over 50% accuracy (say, 53–55% consistently) can be profitable with proper position sizing and risk management, provided average wins are not smaller than average losses and trading costs stay small relative to the typical move.
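A quick back-of-the-envelope calculation shows how thin such an edge is. The sketch below computes the expected net return per trade and the break-even accuracy; the 1% average move and the 0.04% round-trip cost are hypothetical numbers chosen for illustration, not market data.

```python
def expected_return_per_trade(accuracy, avg_win, avg_loss, cost):
    """Expected net return of one trade, with returns as fractions of position size."""
    return accuracy * avg_win - (1.0 - accuracy) * avg_loss - cost

def breakeven_accuracy(avg_win, avg_loss, cost):
    """Accuracy at which the expected net return is exactly zero."""
    return (avg_loss + cost) / (avg_win + avg_loss)

# Hypothetical numbers: symmetric 1% moves, 0.04% round-trip cost
ev = expected_return_per_trade(0.54, 0.01, 0.01, 0.0004)  # about 0.0004 per trade
needed = breakeven_accuracy(0.01, 0.01, 0.0004)            # about 0.52
```

Under these assumptions a 54% hit rate earns about 0.04% per trade. Raise the round-trip cost to 0.1% and the break-even accuracy climbs to 55%, which turns the same model into a losing one.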
A practical pipeline looks like this:
- Features: 60-period returns, RSI, MACD histogram, Bollinger Band width, volume ratio (current vs. 20-period average), funding rate from perpetual futures, BTC dominance, and open interest change.
- Model: Two-layer LSTM with 128 hidden units, dropout of 0.3, followed by a dense layer with sigmoid activation for binary classification.
- Split: Train on 2020–2023, validate on Jan–Jun 2024, test on Jul–Dec 2024. Never shuffle — always split chronologically.
- Evaluation: Directional accuracy, precision/recall on up vs. down predictions, and — most importantly — simulated P&L assuming a fixed position size per signal.
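The chronological split in this pipeline can be sketched as boolean masks over a timestamp array. This is a minimal sketch assuming daily data; `chronological_split` and its arguments are hypothetical names, and the cut-off dates simply mirror the split listed above.

```python
import numpy as np

def chronological_split(timestamps, train_end, val_end):
    """Boolean masks for train / validation / test, split by time and never shuffled."""
    ts = np.asarray(timestamps, dtype="datetime64[D]")
    train_end = np.datetime64(train_end)
    val_end = np.datetime64(val_end)
    train = ts <= train_end
    val = (ts > train_end) & (ts <= val_end)
    test = ts > val_end
    return train, val, test

# One timestamp per day from 2020-01-01 through 2024-12-31
dates = np.arange("2020-01-01", "2025-01-01", dtype="datetime64[D]")
train, val, test = chronological_split(dates, "2023-12-31", "2024-06-30")
```

Because the masks are defined purely by date, every validation row comes after every training row and every test row comes after every validation row, so no future information can reach an earlier split.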
If your model achieves 54% directional accuracy on the out-of-sample test set with a profit factor (gross profit divided by gross loss) above 1.2, you have something worth exploring further. If it shows 65% accuracy, you have almost certainly overfit or leaked future information into training. Real edges in financial markets are small, and anyone claiming otherwise is selling something.
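Profit factor is gross profit divided by gross loss over the simulated trades. A minimal sketch of a fixed-size long/short simulation follows; `simulate_pnl`, the 0.5 threshold, the flat per-period cost, and the sample numbers are all assumptions made for illustration.

```python
import numpy as np

def simulate_pnl(prob_up, realized_returns, cost=0.0, threshold=0.5):
    """Fixed-size long/short P&L per period from predicted up-probabilities."""
    position = np.where(np.asarray(prob_up) > threshold, 1.0, -1.0)
    pnl = position * np.asarray(realized_returns, dtype=float) - cost
    gross_profit = pnl[pnl > 0].sum()
    gross_loss = -pnl[pnl < 0].sum()
    profit_factor = gross_profit / gross_loss if gross_loss > 0 else float("inf")
    return pnl, profit_factor

# Hypothetical model outputs and realized returns for four test periods
prob_up = np.array([0.7, 0.4, 0.6, 0.3])
rets = np.array([0.02, 0.01, -0.01, -0.03])
pnl, pf = simulate_pnl(prob_up, rets)  # pf is 2.5 for these four trades
```

Four trades prove nothing, of course; the point is that accuracy here is only 50% while the profit factor is 2.5, which is why P&L-based metrics belong next to directional accuracy in the evaluation.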
Evaluation Metrics and Ensemble Approaches
Choosing the right metric determines whether you're optimizing for the right thing. MAE (Mean Absolute Error) tells you the average magnitude of your prediction errors in the same units as your data — intuitive but doesn't penalize large errors disproportionately. RMSE (Root Mean Squared Error) squares errors before averaging, heavily penalizing outliers — appropriate when a single catastrophic misprediction matters more than many small ones. Directional accuracy measures the percentage of times your model correctly predicts whether the price goes up or down — often the most relevant metric for trading signals.
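The three metrics are short enough to write out directly. The sketch below uses NumPy and a hypothetical set of realized and predicted returns in which one prediction is a large miss, so the difference between MAE and RMSE is visible.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def directional_accuracy(y_true, y_pred):
    """Share of periods where predicted and realized returns have the same sign.

    A zero return only counts as a match if the other value is also exactly zero.
    """
    return float(np.mean(np.sign(y_true) == np.sign(y_pred)))

# Hypothetical realized vs. predicted returns; the last prediction is a large miss
y_true = np.array([0.01, -0.02, 0.03, -0.01])
y_pred = np.array([0.02, -0.01, 0.02, 0.07])

print(mae(y_true, y_pred), rmse(y_true, y_pred), directional_accuracy(y_true, y_pred))
```

Here the single large miss pulls RMSE (about 0.041) well above MAE (0.0275), while directional accuracy is 0.75 because only that last prediction has the wrong sign.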
Ensemble methods combine predictions from multiple models to reduce variance and improve robustness. Common approaches include:
- Simple averaging: Average the predictions of an LSTM, a Transformer, and an ARIMA. If the models make weakly correlated errors, the average tends to have lower variance than any single model, though it is not guaranteed to beat the best one.
- Stacking: Train a meta-model (e.g., a gradient-boosted tree) on out-of-sample base model predictions to learn how to combine them.
- Regime-aware switching: Use a volatility regime detector to select which model to trust. An LSTM might excel in trending markets while a mean-reversion model outperforms in ranging conditions.
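Simple averaging and regime-aware switching can be sketched with NumPy alone; stacking is left out because it needs a fitted meta-model. The function names, the sample forecasts, and in particular the rule that maps high volatility to the trend model are assumptions made for illustration, not a recommended regime detector.

```python
import numpy as np

def average_ensemble(predictions):
    """Equal-weight average of base-model forecasts, shape (n_models, n_periods)."""
    return np.mean(np.asarray(predictions, dtype=float), axis=0)

def regime_switch(trend_pred, meanrev_pred, realized_vol, vol_threshold):
    """Use the trend model in high-volatility periods and the mean-reversion model otherwise."""
    return np.where(np.asarray(realized_vol) > vol_threshold, trend_pred, meanrev_pred)

# Hypothetical return forecasts from three base models over three periods
lstm_pred = np.array([0.01, -0.02, 0.03])
transformer_pred = np.array([0.02, -0.01, 0.01])
arima_pred = np.array([0.00, -0.03, 0.02])

avg = average_ensemble([lstm_pred, transformer_pred, arima_pred])

# Hypothetical volatility estimates; 0.4 is an arbitrary threshold
vol = np.array([0.5, 0.1, 0.9])
switched = regime_switch(lstm_pred, arima_pred, vol, vol_threshold=0.4)
```

Whatever rule drives the switch, the volatility estimate for a period must be computed from data available before that period starts, otherwise the ensemble inherits the same look-ahead bias as a scaler fit on the full dataset.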
Whatever approach you take, remember that the gap between research and production is vast. A model that predicts BTC direction at 55% accuracy in a Jupyter notebook needs to survive latency, slippage, and transaction costs when executed live through a platform like GaiaEx. Build your evaluation pipeline to simulate these realities from day one — not as an afterthought.


