
Neural Networks: From Perceptrons to Deep Learning
How layers of simple functions learn complex patterns
The Move Nobody Understood
On March 10, 2016, in the second game of a five-game match in Seoul, a machine called AlphaGo placed a black stone on the fifth line of the board. The move was so strange that the human commentators assumed it was a mistake. One professional said out loud that a beginner would be scolded for playing it. Lee Sedol, an 18-time world champion at the game of Go, stood up and left the room to collect himself.
It was not a mistake. Fifty moves later, that stone, now famous as "Move 37," was the hinge the entire game turned on. AlphaGo won. Lee Sedol, considered one of the greatest players alive, lost the series 4–1. And here is the part that matters: no human taught AlphaGo that move. It was not in any textbook. After an initial grounding in records of human expert games, the machine had discovered it on its own, by playing millions of games against itself and adjusting millions of tiny internal numbers until a pattern emerged that no person had ever seen.
Go has more possible board positions than there are atoms in the observable universe. You cannot brute-force it. You cannot write down rules for it. The only way to play it at a superhuman level is to learn it — and the thing doing the learning was a neural network: a stack of simple mathematical units that, given enough data and enough adjustment, can find structure that no human can fully articulate.
That same machinery — the same weighted sums, the same training loop, the same gradient nudges — is what powers the language model you talk to, the system that flags fraudulent card transactions, and the models that quantitative funds now point at financial markets. This lesson is about what is actually happening inside that box.
The Perceptron: Where It All Begins
Every neural network — from a simple spam filter to GPT to AlphaGo — descends from a single idea: the perceptron, invented by Frank Rosenblatt in 1958. It's a mathematical model loosely inspired by a biological neuron. A real neuron receives electrical signals through its dendrites, accumulates them in the cell body, and fires a signal down its axon only if the combined input crosses a threshold. The perceptron does the same thing, just with numbers instead of voltage.
A perceptron takes a vector of inputs x, multiplies each input by a learned weight w, adds up the products, adds a bias term b that shifts the threshold, and passes the result through an activation function. If the result clears the bar, the neuron "fires"; otherwise it stays quiet. The whole thing is one compact expression: output = f(w·x + b).
Think of the weights as volume knobs. Each input arrives with some loudness, and the weight decides how much that input matters to this particular neuron. A neuron learning to spot a fraudulent trade might crank up the weight on "transaction size relative to account history" and turn down the weight on "time of day." Learning, at its core, is nothing more glamorous than turning those knobs to the right settings.
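The expression output = f(w·x + b) fits in a few lines of plain Python. The sketch below is illustrative only: the step activation, the function names, and the hand-picked weights (chosen to implement logical AND rather than learned) are assumptions, not part of any particular library.

```python
def step(z):
    """Threshold activation: fire (1) if the weighted sum is non-negative."""
    return 1 if z >= 0 else 0


def perceptron(x, w, b, f=step):
    """Compute f(w·x + b) for one input vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(z)


# Hand-picked (not learned) knob settings that implement logical AND.
w_and, b_and = [1.0, 1.0], -1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w_and, b_and))  # only (1, 1) prints 1
```

The bias of -1.5 sets the bar: a single active input contributes 1.0, which is not enough to fire, while two active inputs contribute 2.0, which is.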
A single perceptron has a hard limit: it can only separate data with a straight line (or, in higher dimensions, a flat hyperplane). It can answer "is this point above or below the line?" but not "is this point inside the curved region?" The historic example is the XOR problem: a pattern a single perceptron provably cannot learn. Minsky and Papert's 1969 book Perceptrons made that limitation famous, and it helped chill neural network funding through the 1970s. The escape is almost absurdly simple in hindsight: stack many perceptrons into layers and insert a nonlinear activation between them. Do that, and the resulting network can bend, fold, and carve space into arbitrarily complex shapes.
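To make the XOR escape concrete, here is a minimal sketch of a two-layer network of threshold units that computes XOR. The weights are set by hand for illustration (an OR unit, a NAND unit, and an AND unit on top); a trained network would find its own values, and the function names are hypothetical.

```python
def step(z):
    return 1 if z >= 0 else 0


def neuron(x, w, b):
    """One threshold unit: step(w·x + b)."""
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)


def xor_net(x1, x2):
    # Hidden layer: one unit computes OR, the other computes NAND.
    h_or = neuron((x1, x2), (1.0, 1.0), -0.5)
    h_nand = neuron((x1, x2), (-1.0, -1.0), 1.5)
    # Output layer: AND of the two hidden units is exactly XOR.
    return neuron((h_or, h_nand), (1.0, 1.0), -1.5)


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))  # prints 0, 1, 1, 0 in the last column
```

Each hidden unit still draws a single straight line; the output unit combines the two half-planes into a band that no single line could describe.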
Layers, Depth, and Why 'Deep' Learning
One neuron is a knob. A useful network is thousands of them, organized into layers that pass information forward like a bucket brigade.
- Input layer — where your raw features enter: pixels of an image, words of a sentence, or for a trading model, things like price, volume, volatility, and order-book imbalance.
- Hidden layers — the layers in between, where the real work happens. Each layer takes the outputs of the previous one and recombines them into new, more abstract features.
- Output layer — the final answer: a probability, a price forecast, a class label.
The word "deep" in deep learning simply means many hidden layers. And depth buys you something specific: a hierarchy of features. In an image network, the first layer detects edges, the next assembles edges into shapes, the next assembles shapes into objects. Nobody programmed "edge detector" or "eye detector" — those concepts emerge from training. The network invents its own intermediate vocabulary.
There's a beautiful theoretical result here, the universal approximation theorem: a network with even a single hidden layer and a nonlinear activation, given enough neurons, can approximate any continuous function on a bounded input range to arbitrary accuracy. In other words, the right network can in principle represent any such pattern that exists in your data. The catch, and it's a big one, is that "in principle" is doing heavy lifting. The theorem promises such a network exists; it says nothing about whether you can find it, whether you have enough data to pin it down, or whether the pattern you're chasing is even real. In low-signal domains like finance, that gap between "representable" and "learnable" is exactly where most models go to die.
Activation Functions: The Nonlinear Secret Sauce
Here is a fact that surprises people: without activation functions, depth is worthless. If every layer were a plain weighted sum, then stacking ten layers would be mathematically identical to a single layer — a chain of linear operations collapses into one linear operation. You'd have spent a fortune in compute to build a glorified straight line. Activation functions are the nonlinear kink inserted between layers that lets the network bend, and bending is what makes it powerful.
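The collapse of stacked linear layers is easy to check numerically. In this small sketch (arbitrary example weights, biases omitted for brevity, helper names hypothetical), two weight matrices applied in sequence give the same answer as their single product matrix, and inserting a ReLU between them breaks that equivalence.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu_vec(v):
    return [max(0.0, a) for a in v]


W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # 2 inputs -> 3 hidden units
W2 = [[2.0, 1.0, 1.0]]                       # 3 hidden units -> 1 output
x = [1.0, 2.0]

two_layers = matvec(W2, matvec(W1, x))           # layer by layer
one_layer = matvec(matmul(W2, W1), x)            # single merged matrix
with_relu = matvec(W2, relu_vec(matvec(W1, x)))  # nonlinearity in between

print(two_layers, one_layer, with_relu)  # [13.0] [13.0] [15.0]
```

The first two results agree because W2(W1 x) = (W2 W1) x; the third differs because ReLU zeroes the negative hidden value before the second layer sees it.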
The activations every practitioner should know:
- Sigmoid — σ(x) = 1/(1+e⁻ˣ). Squashes any input into (0, 1), which reads naturally as a probability. Great for a binary output, but a poor choice inside hidden layers: for large positive or negative inputs the curve flattens, its gradient goes to nearly zero, and learning grinds to a halt — the vanishing gradient problem.
- Tanh — outputs in (−1, 1) and is zero-centered, which helps optimization. It still saturates at the extremes, but its gradient near zero is stronger than sigmoid's (a peak of 1 versus 0.25).
- ReLU — f(x) = max(0, x). The modern default for hidden layers. It is laughably simple, cheap to compute, and keeps a healthy gradient for all positive inputs. Its flaw is the "dying ReLU": a neuron that gets pushed into always outputting zero can become permanently inert.
- Leaky ReLU — f(x) = max(0.01x, x). Lets a small trickle of gradient through for negatives, which keeps neurons from dying.
- GELU — a smooth, probabilistic cousin of ReLU that is the standard inside transformers like BERT and GPT.
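The activations listed above can be written directly from their formulas. This is a scalar sketch for illustration; the function names are chosen here, the GELU shown is the exact form x·Φ(x) (libraries often use a faster approximation), and real frameworks apply these element-wise to whole tensors.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh(x):
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    return max(slope * x, x)


def gelu(x):
    # Exact GELU: x times the standard normal CDF evaluated at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Saturation in action: the sigmoid gradient is 0.25 at zero and tiny at x = 10.
print(sigmoid_grad(0.0), sigmoid_grad(10.0))
print(relu(-3.0), leaky_relu(-3.0), gelu(-3.0))
```

Comparing the two sigmoid gradients shows the vanishing gradient problem in miniature: a unit pushed far into the flat region passes almost no learning signal backward.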
Backpropagation and Gradient Descent: How Networks Learn
So far we have a network full of weights. But where do the right weight values come from? Nobody types them in — there are millions. They are discovered through a feedback loop that is, at heart, organized trial and error with calculus doing the bookkeeping.
Step one is the forward pass: feed in an example, let it flow through the layers, and read off a prediction. Step two is to measure how wrong that prediction was using a loss function. For predicting a number (a price), Mean Squared Error averages the squared gap between prediction and truth. For classification, cross-entropy measures how far the predicted probabilities sit from the correct answer. A high loss means "very wrong"; the entire goal of training is to push that number down.
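Both loss functions are short enough to write out. The sketch below assumes plain Python lists and, for cross-entropy, a single example whose predicted class probabilities already sum to one; the function names and sample numbers are made up for illustration.

```python
import math


def mse(preds, targets):
    """Mean Squared Error: average squared gap between prediction and truth."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def cross_entropy(probs, true_index):
    """Cross-entropy for one example: -log of the probability given to the true class."""
    return -math.log(probs[true_index])


print(mse([101.0, 99.0], [100.0, 100.0]))   # 1.0
print(cross_entropy([0.70, 0.20, 0.10], 0))  # fairly confident and right: low loss
print(cross_entropy([0.05, 0.90, 0.05], 0))  # confident and wrong: high loss
```

Cross-entropy punishes confident mistakes much harder than hesitant ones, which is why it pairs naturally with probability outputs.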
Step three is the clever part: backpropagation. Using the chain rule of calculus, the algorithm works backward from the loss through every layer and computes, for each individual weight, a gradient — the answer to "if I nudge this one weight up a hair, does the error go up or down, and how much?" This is the trick that thawed the AI winter. Backprop, popularized in the 1980s, made it possible to assign blame across millions of weights efficiently instead of guessing.
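Here is what that backward pass looks like on the smallest network that still needs the chain rule: one tanh hidden unit feeding one linear output, with squared-error loss. The architecture, parameter values, and function names are assumptions made for this sketch. The analytic gradients are compared against a finite-difference estimate, which is literally "nudge this one weight a hair and see what the loss does."

```python
import math


def forward(params, x, t):
    """h = tanh(w1*x + b1), y = w2*h + b2, loss = (y - t)^2."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    y = w2 * h + b2
    return (y - t) ** 2, h, y


def backward(params, x, t):
    """Chain rule, applied from the loss back toward the input."""
    w1, b1, w2, b2 = params
    _, h, y = forward(params, x, t)
    dL_dy = 2.0 * (y - t)
    dL_dw2 = dL_dy * h
    dL_db2 = dL_dy
    dL_dh = dL_dy * w2
    dL_dz = dL_dh * (1.0 - h * h)  # derivative of tanh is 1 - tanh^2
    return [dL_dz * x, dL_dz, dL_dw2, dL_db2]  # same order as params


def numerical_grad(params, x, t, eps=1e-6):
    """Central finite differences: nudge each parameter up and down."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, t)[0] - forward(down, x, t)[0]) / (2 * eps))
    return grads


params = [0.5, -0.1, 1.5, 0.2]
print(backward(params, x=0.8, t=1.0))
print(numerical_grad(params, x=0.8, t=1.0))
```

The two printed lists should agree to several decimal places. The finite-difference version needs two forward passes per weight, which is exactly the cost backpropagation avoids when there are millions of weights.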
Step four is gradient descent: nudge every weight a small step in the direction that reduces the loss, using the rule w = w − α × ∂L/∂w. Picture standing on a foggy hillside trying to reach the valley floor; you can't see the bottom, but you can feel which way is downhill under your feet, so you take a step that way and repeat. The learning rate α is your stride length, and it is arguably the most important dial in the whole system. Too large and you bound past the valley and bounce around forever; too small and you inch along for an eternity or get stuck in a ditch partway down.
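The effect of the stride length is easy to see on a toy problem. This sketch applies the update rule w = w − α × ∂L/∂w to the made-up loss L(w) = (w − 3)², whose valley floor is at w = 3; the loss, starting point, and step counts are illustrative assumptions.

```python
def descend(alpha, steps=50, w=0.0):
    """Run gradient descent on L(w) = (w - 3)^2 and return the final w."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # dL/dw
        w = w - alpha * grad    # the update rule from the text
    return w


print(descend(alpha=0.1))    # lands very close to 3
print(descend(alpha=0.001))  # still far from 3 after 50 steps
print(descend(alpha=1.1))    # overshoots further on every step and diverges
```

On this loss each step multiplies the distance to the minimum by (1 − 2α), so α = 0.1 shrinks it by 20% per step, α = 0.001 barely moves, and α = 1.1 flips the sign and grows it.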
Two refinements you'll meet immediately. Mini-batches: instead of updating after every single example (noisy) or only after the entire dataset (slow), you update after a small batch — typically 32 to 256 examples — which is stable, GPU-friendly, and the sweet spot in practice. And Adam: a smarter optimizer that gives every weight its own adaptive stride and adds momentum to smooth out noisy gradients. Adam is the sensible default to start from on almost any project — though in finance, with its low signal-to-noise ratio, you'll still need to tune the learning rate, weight decay, and stopping point by hand.
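Both refinements fit in one short sketch: a mini-batch training loop that fits a single weight with a hand-written Adam update. The hyperparameters (learning rate 0.05, β₁ = 0.9, β₂ = 0.999) are common defaults used here as assumptions, and the noiseless data y = 2x is invented so the right answer is known; in a real project you would use a library optimizer rather than this loop.

```python
import math
import random


def adam_step(w, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight."""
    m = beta1 * m + (1 - beta1) * grad          # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # per-weight adaptive stride
    return w, m, v


random.seed(0)
data = [(x / 10.0, 2.0 * x / 10.0) for x in range(-20, 21)]  # y = 2x, no noise

w, m, v, t = 0.0, 0.0, 0.0, 0
batch_size = 8
for epoch in range(200):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Gradient of the mean squared error over this mini-batch only.
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
        t += 1
        w, m, v = adam_step(w, grad, m, v, t)

print(w)  # should end up close to 2.0
```

Dividing by the root of the squared-gradient average is what gives each weight its own stride: weights with consistently large gradients take smaller steps, and vice versa.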
Overfitting: When the Network Memorizes Instead of Learns
Here is the failure mode that wrecks more financial models than any other. A neural network is such a flexible function-fitter that, given the chance, it will memorize your training data outright — including all the random noise, the one-off flukes, and the coincidences that will never repeat. It scores beautifully on the data it has seen and falls flat the moment it meets anything new. This is overfitting, and in markets — where the genuine signal is faint and the noise is deafening — it is the default outcome, not the exception. The tools that fight it are called regularization.
Dropout is the most popular. During each training step, every neuron has a probability p (often 0.2–0.5) of being temporarily switched off. Forced to work even when random teammates keep vanishing, the network learns to spread its knowledge across many neurons rather than betting everything on a few fragile ones — an ensemble effect baked into a single model. At prediction time all neurons switch back on, scaled appropriately.
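A minimal sketch of dropout on a list of activations follows. It uses the "inverted" variant, which is an assumption relative to the description above: surviving activations are scaled up by 1/(1 − p) during training so that nothing needs rescaling at prediction time. This is the convention common in current libraries; the function name and signature are hypothetical.

```python
import random


def dropout(activations, p, training, rng=random):
    """Inverted dropout: zero each activation with probability p during training."""
    if not training or p == 0.0:
        return list(activations)  # prediction time: every neuron is on
    keep = 1.0 - p
    # Survivors are scaled by 1/keep so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


random.seed(1)
layer_output = [1.0] * 10
print(dropout(layer_output, p=0.5, training=True))   # a random mix of 0.0 and 2.0
print(dropout(layer_output, p=0.5, training=False))  # unchanged
```

Whichever side the scaling happens on, the goal is the same: the next layer should see roughly the same average input in training and in prediction.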
L1 and L2 penalties add a tax on large weights to the loss. L2 (weight decay) discourages any single weight from getting too big, spreading influence evenly. L1 is sharper: it actively drives useless weights all the way to exactly zero, performing automatic feature selection. In a financial dataset where most of your hundred features are noise, L1 can quietly find the handful that actually carry signal.
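The difference between the two penalties shows up in how each one moves a small weight. The sketch below is illustrative: the penalty strengths are arbitrary, and the L1 update uses soft-thresholding (a proximal step), which is one standard way to obtain exact zeros and is an assumption here rather than something the text prescribes.

```python
def penalized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Data loss plus the L1 and L2 'taxes' on weight size."""
    l1_term = l1 * sum(abs(w) for w in weights)
    l2_term = l2 * sum(w * w for w in weights)
    return data_loss + l1_term + l2_term


def l2_shrink(w, lr, l2):
    """Gradient step on l2 * w^2: shrinks w proportionally, never exactly to zero."""
    return w - lr * 2.0 * l2 * w


def l1_soft_threshold(w, lr, l1):
    """Proximal step on l1 * |w|: moves w toward zero by a fixed amount, clipping at zero."""
    t = lr * l1
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


print(penalized_loss(1.0, [3.0, -4.0], l1=0.1, l2=0.01))
print(l2_shrink(0.03, lr=0.1, l2=0.5))          # smaller, but still nonzero
print(l1_soft_threshold(0.03, lr=0.1, l1=0.5))  # snapped to 0.0
```

L2 takes a percentage of each weight, so small weights lose little; L1 takes a flat amount, so small weights are wiped out, which is where the feature-selection effect comes from.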
Batch normalization rescales each layer's outputs to a stable mean and variance, which steadies training and lets you use higher learning rates. Early stopping is the simplest and most reliable guard of all: watch the error on a held-out validation set, and once it starts creeping up while training error keeps falling, stop and keep the weights from the best validation epoch. That crossover is roughly where the network switches from learning patterns to memorizing noise. Because validation error is itself noisy, practitioners usually wait a few epochs before concluding the rise is real.
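Early stopping reduces to a few lines of bookkeeping. The sketch below works on a precomputed list of validation losses to stay self-contained; the `patience` parameter (how many non-improving epochs to tolerate) and the sample numbers are assumptions for illustration. In a real loop you would also save the model weights whenever the best loss improves.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (best_epoch, best_loss), stopping once validation loss has
    failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_loss


# Validation loss falls, bottoms out at epoch 3, then creeps back up.
val = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.64, 0.70]
print(early_stop_epoch(val))  # (3, 0.58)
```

A patience of a few epochs guards against stopping on a single noisy uptick.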
CNNs for Patterns, RNNs and LSTMs for Sequences
Plain feedforward networks treat every input as an unstructured bag of numbers. But a lot of real data has structure — an image has spatial layout, a price series has temporal order — and two specialized architectures exploit that structure directly.
Convolutional Neural Networks (CNNs) were built for images. Rather than connecting every pixel to every neuron, a CNN slides small filters across the input, hunting for local patterns — an edge here, a texture there — and reusing the same filter everywhere. Early layers catch edges; deeper layers assemble them into shapes and objects. In finance, researchers have pointed CNNs at:
- Candlestick chart images — reframing chart reading as a visual pattern-recognition task.
- Order-book heatmaps — spotting supply/demand imbalances in depth-of-market snapshots.
- 1D convolutions over raw price sequences — learning short temporal motifs the way an image CNN learns spatial ones.
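The sliding-filter idea behind the 1D case can be sketched without any framework. Here a two-tap filter is slid across a short, made-up price series; the filter weights are set by hand to respond to one-step price changes, whereas a real CNN would learn its filters during training. As in deep learning libraries, the operation is technically a cross-correlation (the kernel is not flipped).

```python
def conv1d(series, kernel):
    """Slide `kernel` across `series` with no padding and stride 1."""
    k = len(kernel)
    return [sum(kernel[j] * series[i + j] for j in range(k))
            for i in range(len(series) - k + 1)]


prices = [100.0, 101.0, 103.0, 102.0, 99.0, 100.0]
diff_kernel = [-1.0, 1.0]  # hand-set filter: fires on one-step price moves

print(conv1d(prices, diff_kernel))  # [1.0, 2.0, -1.0, -3.0, 1.0]
```

The same two weights are reused at every position, which is why a convolutional layer needs far fewer parameters than a fully connected one over the same input.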
Recurrent Neural Networks (RNNs) are built for sequences. They process one time step at a time while carrying a hidden state — a running memory — from each step to the next. In theory this makes them ideal for time series. In practice, vanilla RNNs are crippled by the vanishing gradient problem over long sequences: signal from many steps ago shrinks toward zero on the way back, so the network simply forgets the distant past.
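The vanishing gradient can be watched directly in a one-unit RNN. In this sketch the hidden state follows h_t = tanh(w·h_{t−1} + u·x_t), and the code tracks ∂h_t/∂h_0, the sensitivity of the current state to the starting state. The scalar setup and the values w = 0.9, u = 0.5 are assumptions chosen for illustration.

```python
import math


def rnn_gradient_through_time(w, u, inputs, h0=0.0):
    """Return d h_t / d h_0 after each step of a scalar tanh RNN."""
    h = h0
    grad = 1.0
    history = []
    for x in inputs:
        h = math.tanh(w * h + u * x)
        grad *= w * (1.0 - h * h)  # one chain-rule factor per time step
        history.append(grad)
    return history


history = rnn_gradient_through_time(w=0.9, u=0.5, inputs=[1.0] * 50)
print(history[0], history[9], history[49])  # shrinks toward zero
```

Every factor has magnitude below one here, so the product decays geometrically with sequence length: the backward signal from 50 steps ago is effectively gone.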
Long Short-Term Memory (LSTM) networks fix this with a small set of learnable gates (forget, input, and output) that decide at each step what to erase from memory, what to write in, and what to read out. That gating lets an LSTM hold onto relevant information across hundreds of steps, which is why it became a workhorse for modeling financial time series, where an event from many steps back can still bear on today's price.
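A single LSTM step with scalar state shows the three gates at work. This sketch follows the standard LSTM update equations; the parameter dictionary, its key names, and the hand-set biases (forget gate pinned near 1, input gate pinned near 0) are assumptions made to demonstrate memory retention, not learned values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step: returns the new hidden state h and cell memory c."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate: what to erase
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate: what to write
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate: what to read out
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate memory content
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


params = {"wf": 0.0, "uf": 0.0, "bf": 8.0,   # forget gate ~1: keep memory
          "wi": 0.0, "ui": 0.0, "bi": -8.0,  # input gate ~0: write nothing new
          "wo": 0.0, "uo": 0.0, "bo": 0.0,
          "wg": 1.0, "ug": 0.0, "bg": 0.0}

h, c = 0.0, 1.0  # start with something stored in memory
for _ in range(100):
    h, c = lstm_step(0.0, h, c, params)
print(c)  # still close to the original 1.0 after 100 steps
```

Because the cell memory is updated additively and scaled by a gate that can sit near 1, information (and gradient) can survive far longer than in the vanilla RNN.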
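```python
import torch.nn as nn


class PricePredictor(nn.Module):
    """LSTM over a window of features, with a linear head that outputs one value."""

    def __init__(self, input_dim, hidden_dim, num_layers):
        super().__init__()
        # nn.LSTM applies dropout only between stacked layers, so turn it
        # off for a single-layer model (PyTorch warns otherwise).
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            num_layers, batch_first=True,
                            dropout=0.2 if num_layers > 1 else 0.0)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x: (batch, time_steps, input_dim); out: (batch, time_steps, hidden_dim)
        out, _ = self.lstm(x)
        # Predict from the hidden state at the final time step.
        return self.fc(out[:, -1, :])
```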
On a platform like GaiaEx, where WebSocket feeds stream continuous price and order-book data, an LSTM can ingest that sequence naturally. That said, the field has largely moved on: transformers — the architecture behind modern language models — now match or beat LSTMs on many sequence tasks by using "attention" to look at all time steps at once instead of marching through them one by one.
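The core of attention is small enough to sketch in plain Python. The version below is scaled dot-product attention, the form used in transformers; the text above does not name a specific variant, so treat that choice as an assumption, and note that a real transformer also learns the projections that produce the queries, keys, and values, which are simply given here.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every time step at once.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one key per time step
values = [[10.0], [20.0], [30.0]]            # one value per time step
print(attention([[5.0, 0.0]], keys, values))  # weighted toward steps 1 and 3
```

Nothing in the loop depends on the order in which time steps are visited, which is the contrast with the step-by-step recurrence of an LSTM.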
What Neural Networks Don't Fix (Especially in Markets)
Neural networks are spectacular pattern-finders. But honest education means naming where they break — and in trading, they break in expensive ways.
- They find patterns even when none exist. Give a deep network noise and it will confidently fit it. Markets are largely noise, so the burden of proof is on you to show a pattern is real and persistent — not just present in last year's data.
- They are black boxes. A model can be accurate and still be unable to tell you why. When it suddenly loses money, there's often no clean explanation and no obvious fix. For risk-bearing capital, "I don't know why it did that" is a serious liability.
- They are hungry and fragile. Deep learning shines with millions of clean examples. Financial history is short, non-stationary, and noisy — the regime that trained your model may simply stop existing, a problem called distribution shift.
- They don't beat simpler models for free. On the tabular, feature-based data common in finance, gradient-boosted trees like XGBoost frequently outperform neural networks while being faster to train and easier to interpret. Complexity is a cost, not a virtue.
- They can be attacked. Tiny, deliberate perturbations to an input can flip a network's output — an adversarial example — which matters anywhere an adversary can shape what your model sees.
Hardware, Tools, and Your Path Forward
Training neural networks is heavy on arithmetic — overwhelmingly matrix multiplications — and the hardware you run it on changes your iteration speed from "overnight" to "over coffee."
GPUs are the standard. Their thousands of cores all run the same operation on different data at once, which is exactly the shape of neural network math. A training run that crawls for hours on a CPU can finish in minutes on a modern GPU, though the actual speedup depends heavily on the model and batch size. NVIDIA dominates: consumer cards (an RTX-class GPU) are fine for learning, while data-center cards (A100, H100) and cloud instances handle production-scale training. TPUs, Google's custom chips for tensor math, shine on very large models through Google Cloud, but GPUs remain the practical default for most practitioners thanks to broader software support.
You do not need to buy anything to start. Google Colab hands out free GPU time that's plenty for learning and prototyping; rent cloud GPUs on demand as your models grow; buy dedicated hardware only once you're training daily and the cloud bill justifies it.
A sane learning path from here:
- Start with a plain feedforward network on a few engineered features — fast to train, easy to debug, and an honest baseline.
- Try a 1D CNN on raw price sequences and compare it head-to-head against that baseline.
- Build an LSTM that consumes sequences of candlestick data from GaiaEx's API.
- Study attention and transformers — increasingly the architecture of choice for sequences.
- Above all, always pit your deep model against a gradient-boosting baseline like XGBoost. If the neural network can't beat the simpler model out-of-sample on your tabular financial data, ship the simpler model and move on.
The same loop that found Move 37 — predict, measure error, nudge the weights, repeat — is the loop running inside every system in this list. Understand that loop and you understand deep learning; the rest is architecture and engineering on top of it.