
Neural Networks: From Perceptrons to Deep Learning

How layers of simple functions learn complex patterns

The Perceptron: Where It All Begins

Every neural network, from a simple classifier to GPT, descends from a single idea: the perceptron, introduced by Frank Rosenblatt in 1958. It's a mathematical model loosely inspired by biological neurons. A biological neuron receives electrical signals through dendrites, processes them in the cell body, and fires an output signal through the axon if the combined input exceeds a threshold. The perceptron imitates this behavior with numbers: weighted inputs stand in for incoming signals, and a threshold decides whether the unit fires.

A perceptron takes a vector of inputs x, multiplies each by a learned weight w, sums the products, adds a bias term b, and passes the result through an activation function f. In the classic perceptron, f is a step function: if the weighted sum exceeds zero, the neuron "fires" (outputs 1); otherwise it outputs 0. Mathematically: output = f(w·x + b).
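The formula above can be sketched in a few lines of plain Python. The weights and bias here are hand-picked for illustration (they make the neuron behave like a logical AND gate); a real perceptron would learn them from data.

```python
def perceptron(x, w, b):
    """Return 1 if the weighted sum w·x + b exceeds zero, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Hypothetical weights: the neuron fires only when both inputs are 1.
w, b = [1.0, 1.0], -1.5
outputs = [perceptron(x, w, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [0, 0, 0, 1]
```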

A single perceptron can only learn linearly separable patterns: it draws a straight line (or hyperplane) to divide data into two classes. This is enough for some problems but fails on anything requiring a curved decision boundary; XOR is the classic example. The solution? Stack multiple perceptrons into layers, add nonlinear activation functions, and you get a network that can approximate any continuous function. This is the universal approximation theorem: a single hidden layer with enough neurons can approximate any continuous function on a bounded input range to arbitrary accuracy. The theorem guarantees that such weights exist, not that training will find them, and in practice deeper networks often learn better representations with fewer total parameters.
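XOR is the textbook pattern that no single perceptron can separate, yet two stacked layers handle it. The sketch below uses hand-set weights (an illustrative assumption, not learned values) to show the idea: one hidden neuron computes OR, another computes AND, and the output neuron combines them.

```python
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden layer: one neuron computes OR, the other computes AND.
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    # Output neuron fires when OR is on and AND is off.
    return step(h_or - h_and - 0.5)

xor_outputs = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)  # [0, 1, 1, 0]
```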

For financial applications, this means a neural network can in principle represent any continuous pattern in market data. The practical challenge is having enough quality data and the right architecture to learn it without overfitting.

One neuron: dot product, bias, then activation — the atom of every deep net.

Activation Functions: The Nonlinear Secret Sauce

Without activation functions, a neural network is just a fancy linear regression — no matter how many layers you stack, the output is always a linear combination of the inputs. Activation functions introduce nonlinearity, allowing networks to model complex, curved relationships in data.
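A small numeric check makes the collapse concrete. The matrices below are arbitrary illustrative values, and biases are left out to keep the sketch short: two linear layers give the same output as one layer whose weights are the matrix product, while inserting a ReLU between them breaks that equivalence.

```python
def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, 2.0], [0.5, -1.0]]   # first layer weights (arbitrary)
W2 = [[3.0, -1.0], [2.0, 0.0]]   # second layer weights (arbitrary)
x = [1.0, -2.0]

two_layers = matvec(W2, matvec(W1, x))        # layer 2 applied after layer 1
one_layer = matvec(matmul(W2, W1), x)         # one merged weight matrix
with_relu = matvec(W2, relu(matvec(W1, x)))   # nonlinearity in between

print(two_layers, one_layer, with_relu)
```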

The key activation functions every practitioner must know:

Sigmoid: σ(x) = 1/(1+e⁻ˣ). Squashes output to (0, 1). Still the usual choice for binary classification outputs, but problematic in hidden layers because its gradient, at most 0.25, vanishes for large |x|, slowing training dramatically.
Tanh: tanh(x) = (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ). Outputs range (−1, 1). Zero-centered, which helps optimization. Still suffers from vanishing gradients at the extremes, but less severely than sigmoid.
ReLU: f(x) = max(0, x). The default choice for hidden layers in modern networks. Simple, computationally cheap, and free of saturation for positive inputs, where its gradient is exactly 1. The downside is "dying ReLU": a neuron whose input is negative for every example outputs zero, receives zero gradient, and can stay permanently inactive.
Leaky ReLU: f(x) = max(0.01x, x). Fixes the dying neuron problem by allowing a small gradient for negative inputs.
GELU: Gaussian Error Linear Unit, f(x) = x·Φ(x), where Φ is the standard normal CDF. Used in transformers (BERT, GPT). A smooth relative of ReLU that performs well in deep architectures.
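For reference, the functions in this list can be written with nothing but the standard library. This is a scalar sketch for building intuition, not how a framework implements them; GELU is given in its exact form x·Φ(x), with Φ the standard normal CDF. The helper `sigmoid_grad` is added here to show how quickly the sigmoid gradient shrinks away from zero.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 when x = 0

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def gelu(x):
    # x * Phi(x), with Phi expressed through the error function
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x), math.tanh(x),
          relu(x), leaky_relu(x), gelu(x))
```

At x = 10 the sigmoid gradient is already below 0.0001, while ReLU still passes a gradient of 1.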

Practical rule: Use ReLU (or its variants) in hidden layers for almost all architectures. Use sigmoid for binary classification output layers, softmax for multi-class outputs, and linear (no activation) for regression outputs. This covers the large majority of cases you'll encounter in financial ML.

Backpropagation and Gradient Descent: How Networks Learn

A neural network learns by adjusting its weights to minimize a loss function — a measure of how wrong its predictions are. The algorithm that makes this possible is backpropagation, combined with gradient descent.

The process works in two phases. In the forward pass, input data flows through the network layer by layer, producing a prediction. The loss function compares this prediction to the true target — for regression, Mean Squared Error (MSE) measures the average squared difference; for classification, cross-entropy loss measures how far predicted probabilities are from the true labels.
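The two loss functions mentioned above are short enough to write out. This is a plain-Python sketch with made-up predictions; the clipping constant `eps` is an assumption added to avoid taking log(0).

```python
import math

def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def binary_cross_entropy(probs, labels, eps=1e-12):
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # clip so log() stays finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

print(mse([2.0, 4.0], [1.0, 6.0]))   # 2.5

# Same labels, confident-and-right versus confident-and-wrong predictions
confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
print(confident_right, confident_wrong)
```

Cross-entropy punishes confident mistakes far more heavily than hesitant ones, which is why the second value is roughly twenty times the first.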

In the backward pass, the algorithm computes the gradient of the loss with respect to every weight in the network using the chain rule of calculus. These gradients tell each weight how much it contributed to the error and in which direction to adjust. Gradient descent then updates each weight: w = w − α × ∂L/∂w, where α is the learning rate.
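To make the chain rule concrete, here is a sketch for the smallest possible network: one sigmoid neuron with squared-error loss. The starting values of w, b, x, y and the learning rate are arbitrary. The analytic gradient is checked against a finite-difference estimate, then one update w = w − α × ∂L/∂w is applied.

```python
import math

def forward(w, b, x):
    z = w * x + b
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    return (forward(w, b, x) - y) ** 2

def gradients(w, b, x, y):
    a = forward(w, b, x)
    dL_da = 2.0 * (a - y)      # derivative of the squared error
    da_dz = a * (1.0 - a)      # derivative of the sigmoid
    dL_dz = dL_da * da_dz      # chain rule
    return dL_dz * x, dL_dz    # dz/dw = x, dz/db = 1

w, b, x, y, alpha = 0.5, 0.0, 2.0, 1.0, 0.1
dw, db = gradients(w, b, x, y)

# Finite-difference check of dL/dw
eps = 1e-6
dw_numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

loss_before = loss(w, b, x, y)
w, b = w - alpha * dw, b - alpha * db   # one gradient descent step
loss_after = loss(w, b, x, y)
print(dw, dw_numeric, loss_before, loss_after)
```

Frameworks automate exactly this bookkeeping across millions of weights.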

The learning rate is arguably the most critical hyperparameter. Too large, and the model overshoots optimal values, oscillating or diverging. Too small, and training takes forever or gets stuck in poor local minima. Modern optimizers like Adam adapt the learning rate per-parameter, combining the benefits of momentum (smoothing out noisy gradients) and RMSProp (scaling gradients by their recent magnitude). Adam is the default starting optimizer for almost all deep learning projects.
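The Adam update itself is only a few lines. The sketch below minimizes a toy one-dimensional quadratic, (θ − 3)², chosen purely for illustration; `adam_minimize` is a hypothetical helper name, the β and ε values are the commonly used defaults, and the learning rate of 0.1 is tuned for this toy problem rather than a recommendation.

```python
import math

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g        # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # RMSProp: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Gradient of (theta - 3)^2 is 2 * (theta - 3); the minimum is at theta = 3.
theta = adam_minimize(lambda th: 2 * (th - 3.0), theta=0.0)
print(theta)
```

Dividing by the square root of `v_hat` is what gives each parameter its own effective step size.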

Batch size controls how many examples the model sees before updating weights. Full-batch gradient descent is mathematically clean but computationally impractical for large datasets. Stochastic gradient descent (SGD) updates after every single example — noisy but fast. Mini-batch SGD (typically 32–256 examples) strikes the balance: stable enough gradients, efficient GPU utilization, and some regularizing noise that helps escape local minima.
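A mini-batch training loop can be sketched without any framework. The example fits a line to synthetic, noise-free data generated from y = 2x + 1; the dataset, learning rate, batch size, and epoch count are all illustrative assumptions.

```python
import random

random.seed(0)
xs = [i / 10 for i in range(-20, 21)]        # 41 inputs in [-2, 2]
data = [(x, 2.0 * x + 1.0) for x in xs]      # synthetic target: y = 2x + 1

w, b, alpha, batch_size = 0.0, 0.0, 0.1, 8
for epoch in range(200):
    random.shuffle(data)                     # new batch composition every epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Average the MSE gradients over the mini-batch
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= alpha * grad_w
        b -= alpha * grad_b

print(w, b)
```

Setting `batch_size` to 1 gives per-example SGD, and setting it to `len(data)` gives full-batch gradient descent, so the same loop covers all three regimes.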

Autodiff walks the chain rule; optimizers decide how aggressively to step.

Regularization: Preventing Your Network from Memorizing Noise

Neural networks are extraordinarily powerful function approximators — so powerful that they can memorize training data perfectly, including all its noise and quirks. In finance, where signal-to-noise ratios are low, regularization is the difference between a model that generalizes and one that only works on historical data.

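Dropout is one of the most widely used regularization techniques for neural networks. During each training step, each neuron has a probability p (typically 0.2–0.5) of being temporarily "dropped", meaning its output is set to zero. This forces the network to distribute learned representations across many neurons rather than relying on a few, creating an ensemble-like effect. At inference time, all neurons are active. In the original formulation their outputs are then scaled by (1−p) to match the expected activation seen during training; most modern frameworks, PyTorch included, instead scale the surviving activations by 1/(1−p) during training ("inverted dropout"), so nothing needs rescaling at inference.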
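Here is a sketch of the "inverted" variant that frameworks such as PyTorch use: surviving activations are scaled by 1/(1−p) during training, so the expected activation is unchanged and inference becomes a plain pass-through. The function name and the all-ones test input are illustrative.

```python
import random

def dropout(activations, p, training, rng=random):
    """Inverted dropout: zero each unit with probability p, rescale survivors."""
    if not training or p == 0.0:
        return list(activations)      # inference: identity, no scaling needed
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10_000
train_out = dropout(acts, p=0.5, training=True, rng=rng)
eval_out = dropout(acts, p=0.5, training=False)

mean_train = sum(train_out) / len(train_out)
print(mean_train)   # close to 1.0: the expected activation is preserved
```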

L1 and L2 regularization add a penalty term to the loss function based on the magnitude of weights. L2 (weight decay) penalizes large weights quadratically, encouraging the model to distribute influence evenly. L1 encourages sparsity — driving irrelevant weights to exactly zero, effectively performing feature selection. In financial models where most features are noise, L1 regularization can automatically identify the handful of features that actually matter.
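The different behavior of the two penalties shows up in a single shrinkage step. In this sketch the L2 step is an ordinary gradient step on λ·w², while the L1 step uses soft-thresholding, one common way to handle the |w| term (plain subgradient steps rarely land on exact zeros). The weights, λ, and learning rate are made-up values.

```python
def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)

def l2_shrink(weights, lam, lr):
    # Gradient step on lam * w^2: every weight shrinks by the same factor.
    return [w * (1 - 2 * lr * lam) for w in weights]

def l1_shrink(weights, lam, lr):
    # Soft-thresholding: move each weight toward zero by lr * lam, stopping at zero.
    t = lr * lam
    return [max(abs(w) - t, 0.0) * (1 if w > 0 else -1) for w in weights]

weights = [0.8, -0.03, 0.002, -1.5]
after_l2 = l2_shrink(weights, lam=0.1, lr=0.5)
after_l1 = l1_shrink(weights, lam=0.1, lr=0.5)
print(after_l2)   # all four weights smaller, none zero
print(after_l1)   # the two small weights are now exactly zero
```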

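Batch normalization normalizes each layer's outputs to zero mean and unit variance across the current mini-batch, then applies a learned scale and shift. At inference time it uses running averages of those statistics collected during training. This stabilizes training, allows higher learning rates, and acts as a mild regularizer. It's standard in CNNs and feedforward networks, though less common in RNNs, where layer normalization is usually preferred.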
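The training-mode computation for a single feature looks roughly like this. It is a simplified sketch: `gamma` and `beta` stand for the learned scale and shift, `eps` is a small constant for numerical stability, and the running statistics used at inference are omitted.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n    # biased (population) variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [10.0, 12.0, 14.0, 20.0]    # one feature, four examples
normed = batch_norm(batch)

normed_mean = sum(normed) / len(normed)
normed_var = sum((v - normed_mean) ** 2 for v in normed) / len(normed)
print(normed_mean, normed_var)      # approximately 0 and 1
```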

Early stopping monitors validation loss during training and halts once it stops improving, typically after a fixed number of epochs without a new best (the "patience"). That is the point where the model begins memorizing training data rather than learning general patterns. Combined with model checkpointing (saving the best model so far), early stopping is a simple but powerful guard against overfitting.
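The stopping logic can be sketched independently of any training framework. The function below is a hypothetical helper that replays a list of per-epoch validation losses (made-up numbers here) and reports which epoch would have been checkpointed and where training would have halted, using a patience counter as one common variant of the rule.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: this is where a checkpoint would be saved.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Made-up validation curve: improves, then starts to overfit.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60]
best_epoch, stop_epoch = early_stopping(losses, patience=3)
print(best_epoch, stop_epoch)  # 3 6
```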

CNNs for Pattern Recognition, RNNs and LSTMs for Sequences

Beyond basic feedforward networks, two specialized architectures have transformed how we process structured data — and both have compelling financial applications.

Convolutional Neural Networks (CNNs) were designed for images but excel at detecting local patterns in any grid-like data. A CNN slides small filters across the input, detecting features like edges, textures, and shapes at multiple scales. In finance, researchers have applied CNNs to:

Candlestick chart images — treating technical analysis as a visual pattern recognition task
Order book heatmaps — detecting supply/demand imbalances from depth-of-market snapshots
1D convolutions on price sequences — learning temporal patterns the way image CNNs learn spatial patterns
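To see what a 1D filter does to a price sequence, consider this small sketch. The prices are invented and the filter is hand-picked to measure a two-step price change; a CNN would learn its filter values during training.

```python
def conv1d(sequence, kernel):
    """Valid-mode 1D cross-correlation, which deep learning libraries call convolution."""
    k = len(kernel)
    return [sum(kernel[j] * sequence[i + j] for j in range(k))
            for i in range(len(sequence) - k + 1)]

prices = [100.0, 101.0, 103.0, 102.0, 99.0, 100.0]
momentum_filter = [-1.0, 0.0, 1.0]     # computes price[t+2] - price[t]
features = conv1d(prices, momentum_filter)
print(features)  # [3.0, 1.0, -4.0, -2.0]
```

The same three weights are reused at every position, which is why convolutional layers need far fewer parameters than fully connected ones.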

Recurrent Neural Networks (RNNs) process sequential data by maintaining a hidden state that acts as memory, passing information from one time step to the next. In theory, perfect for time series. In practice, vanilla RNNs suffer from the vanishing gradient problem — as sequences grow long, gradients shrink exponentially during backpropagation, and the network forgets early inputs.
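The exponential shrinkage is easy to reproduce with a one-unit recurrent network. This sketch makes simplifying assumptions: there is no external input, the recurrence is h_t = tanh(w_rec × h_{t−1}), and the recurrent weight of 0.9 is arbitrary. It multiplies together the per-step derivatives that backpropagation through time would chain.

```python
import math

def gradient_through_time(w_rec, steps, h0=0.5):
    """Magnitude of d h_T / d h_0 for the recurrence h_t = tanh(w_rec * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w_rec * h)
        grad *= w_rec * (1.0 - h * h)   # chain rule: one factor per time step
    return abs(grad)

short_grad = gradient_through_time(w_rec=0.9, steps=5)
long_grad = gradient_through_time(w_rec=0.9, steps=100)
print(short_grad, long_grad)
```

Every factor is below 0.9, so after 100 steps the gradient is smaller than 0.9 raised to the 100th power, and the earliest inputs barely influence the weight updates.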

Long Short-Term Memory (LSTM) networks solve this with a gating mechanism. Three gates — forget, input, and output — control what information to discard, store, and emit at each time step. This allows LSTMs to learn dependencies across hundreds of time steps, making them effective for modeling financial time series where events from hours or days ago still influence current prices.
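A single-unit LSTM step can be written out by hand to show how the gates interact. Everything here is a scalar toy: the parameter dictionary and its values are hypothetical, chosen so the forget gate stays almost fully open and the input gate almost closed, which lets the cell state survive many steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM. p maps gate name -> (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, bias = p[name]
        return squash(w_x * x + w_h * h_prev + bias)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to store
    o = gate("output", sigmoid)        # how much of the cell state to emit
    g = gate("candidate", math.tanh)   # proposed new content
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hypothetical parameters: large forget bias, mostly closed input gate.
params = {"forget": (0.0, 0.0, 6.0), "input": (0.0, 0.0, -6.0),
          "output": (0.0, 0.0, 6.0), "candidate": (1.0, 0.0, 0.0)}
h, c = 0.0, 1.0
for _ in range(100):
    h, c = lstm_step(0.0, h, c, params)
print(h, c)
```

After 100 steps the cell state has lost only about a fifth of its starting value, because the update to `c` is additive rather than squashed at every step. In practice a library layer such as PyTorch's nn.LSTM applies the same equations to whole vectors and batches.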

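In PyTorch, a minimal LSTM-based predictor looks like this:

import torch.nn as nn

class PricePredictor(nn.Module):
    """LSTM that maps a window of features to a single predicted value."""

    def __init__(self, input_dim, hidden_dim, num_layers):
        super().__init__()
        # nn.LSTM applies dropout only between stacked layers, so it has no
        # effect (and PyTorch warns) when num_layers == 1.
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            num_layers, batch_first=True,
                            dropout=0.2 if num_layers > 1 else 0.0)
        self.fc = nn.Linear(hidden_dim, 1)  # regression head

    def forward(self, x):
        # x: (batch, seq_len, input_dim) because batch_first=True
        out, _ = self.lstm(x)               # out: (batch, seq_len, hidden_dim)
        # Predict from the hidden state at the final time step
        return self.fc(out[:, -1, :])       # shape (batch, 1)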

On platforms like GaiaEx, where real-time WebSocket feeds deliver continuous streams of price and order data, LSTMs can process this sequential information naturally — though modern transformer architectures are increasingly replacing LSTMs for many sequence tasks.

Deep Learning Hardware and Getting Started

Training neural networks is computationally intensive — and hardware choices significantly impact both training time and your electricity bill.

GPUs are the standard for deep learning. Their massively parallel architecture — thousands of cores executing the same operation on different data — maps perfectly to the matrix multiplications that dominate neural network training. NVIDIA dominates: consumer cards (RTX 4090) work for experimentation, while professional cards (A100, H100) and cloud instances (AWS, GCP, Lambda Labs) handle production-scale training. A model that takes 8 hours on a CPU might train in 15 minutes on a modern GPU.

TPUs (Tensor Processing Units), developed by Google, are custom ASICs optimized specifically for tensor operations. Available through Google Cloud, they excel at very large models and are the hardware behind most of Google's own ML research. For most financial ML practitioners, GPUs remain the practical choice due to broader software support.

For getting started, you don't need expensive hardware. Google Colab provides free GPU access sufficient for learning and prototyping. As your models grow, cloud GPU instances offer on-demand scaling without upfront investment. Only invest in dedicated hardware once you're training models daily and the cloud costs justify it.

Your learning path from here:

Start with feedforward networks on engineered features — they're fast to train and easy to debug
Experiment with 1D CNNs on raw price sequences — compare against your feature-engineered baseline
Build an LSTM that processes sequences of candlestick data from GaiaEx's API
Study attention mechanisms and transformers — they're increasingly the architecture of choice for sequence modeling
Always compare deep learning results against gradient boosting baselines — if the neural network doesn't beat XGBoost on your tabular financial data, keep the simpler model
