
How Large Language Models (LLMs) Work
Transformers, attention, and the architecture behind ChatGPT
What large language models are
A large language model (LLM) predicts the next token in a sequence. Train it on enough text and the same mechanism produces summaries, code, and dialogue when you steer it with prompts.
Scale matters because capacity changes what fits in the weights: small models mostly reproduce surface patterns, while larger models follow instructions more reliably and make better use of long context, within limits. Parameters are not “understanding” in a human sense; they are coefficients shaped by optimization against a prediction loss.
Older language models often used recurrence (RNN/LSTM), which serializes computation and struggles with long-range dependencies. The dominant architecture today is the Transformer, which trades recurrence for attention and parallelizes across positions.
Transformer and self-attention
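The 2017 Transformer paper replaced recurrence with self-attention: each position builds a weighted mix of value vectors from the positions it may attend to, itself included (decoder-style LLMs mask out later positions). Implementations use learned query, key, and value projections; the dot product of a query with each key, scaled and passed through softmax, sets the weights.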
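To make the weighting concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It assumes the query, key, and value vectors are already computed; the learned projection matrices and any causal mask are left out for brevity, and the toy input is invented for illustration.

```python
import math


def softmax(xs):
    # Subtract the max for numerical stability before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (no masking)."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # One score per key: dot(q, k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the weighted sum of the value vectors
        outputs.append([sum(wj * v[i] for wj, v in zip(w, values))
                        for i in range(len(values[0]))])
    return outputs, weights


# Toy input: three positions, two dimensions; used as Q, K, and V at once
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
```

Each row of `weights` sums to 1, and each output row is the corresponding mix of the value vectors. A real layer would first multiply `x` by three separate learned matrices.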
Multi-head attention runs several attention maps in parallel so different heads can specialize (syntax, long-range links, etc.). Positional information is added because attention alone is permutation-blind. Subword tokenization (BPE, SentencePiece, etc.) maps text to tokens the model sees.
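One way to add positional information is the sinusoidal encoding described in the 2017 paper; many current models use learned or rotary schemes instead. The sketch below assumes token embeddings are plain lists of floats, and the function names are illustrative.

```python
import math


def sinusoidal_position(pos: int, d_model: int) -> list[float]:
    """Encoding for one position: sine on even indices, cosine on odd indices."""
    enc = []
    for i in range(d_model):
        # Each sin/cos pair shares one frequency
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def add_positions(embeddings):
    """Add the position encoding to each token embedding, element by element."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, sinusoidal_position(pos, d_model))]
        for pos, vec in enumerate(embeddings)
    ]


# Two identical token embeddings become distinguishable once positions are added
same_token_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
with_positions = add_positions(same_token_twice)
```

Without this step, the two identical embeddings would be indistinguishable to attention regardless of order.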
Attention enables parallel training across sequence positions; that throughput is a practical reason Transformers scale better than RNNs on modern accelerators.
In production, you still pay for context length in memory and latency. Long-context models reduce how often you must chunk text, not the need to check outputs.
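As a rough illustration of the memory side, inference servers commonly cache keys and values for every past token, and that cache grows linearly with context length. The estimate below is a back-of-the-envelope sketch; the layer count, head count, and head size are hypothetical and do not describe any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Approximate key/value cache size for one batch of sequences."""
    # Factor 2: one key tensor and one value tensor per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


# Hypothetical configuration with 16-bit values
short = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
long = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=32768)
print(short / 2**30, "GiB at 4k tokens;", long / 2**30, "GiB at 32k tokens")
```

The numbers only show the scaling; real deployments vary with quantization, attention variants, and batching.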
Pre-training, fine-tuning, alignment
Pre-training minimizes next-token loss on large corpora. That yields a base model that completes text but does not automatically behave like a product assistant.
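The next-token objective can be sketched as a cross-entropy over shifted tokens. This is a minimal plain-Python version that assumes the model has already produced one row of logits per input position; the token ids and logits are made up.

```python
import math


def next_token_loss(logits, target_ids):
    """Mean cross-entropy: logits[t] scores each vocabulary item as the next token."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        # log-sum-exp with max subtraction for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]  # negative log-probability of the true token
    return total / len(target_ids)


tokens = [3, 1, 2, 0]
inputs, targets = tokens[:-1], tokens[1:]  # predict each token from its prefix

# Made-up logits over a 4-item vocabulary, one row per input position
uniform = [[0.0] * 4 for _ in inputs]
confident = [[5.0 if v == t else 0.0 for v in range(4)] for t in targets]

loss_uniform = next_token_loss(uniform, targets)      # equals log(4)
loss_confident = next_token_loss(confident, targets)  # much lower
```

Training adjusts the weights so that the logits move from the uniform case toward the confident one on real text.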
Supervised fine-tuning (SFT) trains on curated prompt/response pairs to teach format and policy. Reinforcement learning from human feedback (RLHF), or a preference-optimization variant, uses human or model judgments to steer outputs toward usefulness and away from disallowed content. The exact recipe varies by vendor.
“Emergent” behaviors with scale are debated: some capabilities appear abruptly in benchmarks as size increases, but how sharp the jump looks can depend on the metric used. Treat benchmarks as probes, not guarantees in your domain.
Context, memory, and retrieval
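The context window caps the tokens a model can attend to at once, prompt and generated output combined. Long windows help with documents and code, but cost grows with length: in standard self-attention, compute grows roughly quadratically with sequence length. Nothing persists between API calls unless you send history or store state yourself.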
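Because history must be resent on every call, applications usually trim it to a budget. The sketch below keeps the system message and the most recent turns that fit. The message shape follows the common role/content convention, and `rough_count` is a crude stand-in for a real tokenizer, so treat both as assumptions.

```python
def trim_history(messages, max_tokens, count_tokens):
    """Keep system messages plus the newest turns that fit within max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m["content"]) for m in system)
    kept = []
    for m in reversed(rest):  # walk from newest to oldest
        cost = count_tokens(m["content"])
        if cost > budget:
            break
        kept.append(m)
        budget -= cost
    return system + list(reversed(kept))


def rough_count(text):
    # Stand-in only: counts whitespace-separated words, not real tokens
    return len(text.split())


history = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "first question about bonds"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "second question"},
]
trimmed = trim_history(history, max_tokens=8, count_tokens=rough_count)
```

With a budget of 8, the oldest user turn is dropped and the rest is kept in order. Production systems often summarize dropped turns instead of discarding them.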
Retrieval-augmented generation (RAG) retrieves documents (often from a vector index) and places excerpts in the prompt so answers can cite fresher or private material than the base model saw at training. Quality depends on retrieval, chunking, and verification, not on the embedding model alone.
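The retrieval step can be sketched with cosine similarity over a tiny in-memory index. The vectors here are hand-written placeholders rather than output from a real embedding model, and the prompt wording is only an example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec, index, k=2):
    """index is a list of (text, vector); return the k most similar texts."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, excerpts):
    # Number the excerpts so the answer can cite them
    context = "\n".join(f"[{i + 1}] {t}" for i, t in enumerate(excerpts))
    return ("Answer using only the excerpts below and cite them by number.\n"
            f"{context}\n\nQuestion: {question}")


# Placeholder chunks and vectors, invented for illustration
index = [
    ("Q3 revenue rose 8%.", [0.9, 0.1, 0.0]),
    ("The CEO resigned in May.", [0.1, 0.9, 0.0]),
    ("Office lease renewed.", [0.0, 0.1, 0.9]),
]
excerpts = top_k([1.0, 0.2, 0.0], index, k=2)
prompt = build_prompt("How did revenue change?", excerpts)
```

A real pipeline would embed the question with the same model used for the chunks and would still verify any cited figure against the source document.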
Model families differ by modality, licensing, and tool use. Pick based on latency, cost, evals in your task, and data-handling rules—not leaderboard rank alone.
LLMs in trading workflows
Unstructured text is common in markets: filings, transcripts, chat, headlines. Models can label sentiment, extract entities, or draft summaries. Treat outputs as drafts: numbers, dates, and identifiers need reconciliation to primary sources.
For automation, constrain outputs (JSON schema, tool calls) and log prompts and responses for audit. If you connect to a trading API, keep keys out of prompts and enforce risk checks in code the model cannot bypass.
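A minimal sketch of a risk check that lives in application code: the model may propose an order as JSON, but nothing is sent unless it passes hard limits. The field names, symbol allowlist, and quantity cap are hypothetical.

```python
import json

ALLOWED_SYMBOLS = {"AAPL", "MSFT"}  # hypothetical allowlist
MAX_QTY = 100                       # hypothetical per-order cap


def validate_order(raw_json: str) -> dict:
    """Parse a model-proposed order and reject anything outside hard limits."""
    order = json.loads(raw_json)  # malformed JSON raises a ValueError subclass
    if not isinstance(order, dict) or set(order) != {"symbol", "side", "quantity"}:
        raise ValueError("order must have exactly: symbol, side, quantity")
    symbol, side, qty = order["symbol"], order["side"], order["quantity"]
    if not isinstance(symbol, str) or symbol not in ALLOWED_SYMBOLS:
        raise ValueError(f"symbol not allowed: {symbol!r}")
    if side not in ("buy", "sell"):
        raise ValueError(f"bad side: {side!r}")
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(qty, bool) or not isinstance(qty, int) or not 0 < qty <= MAX_QTY:
        raise ValueError(f"quantity out of bounds: {qty!r}")
    return order


ok = validate_order('{"symbol": "AAPL", "side": "buy", "quantity": 10}')
```

The limits are constants in code, so no prompt can change them. API keys stay in the process that calls the broker, never in the text sent to the model.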
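# Sketch: sentiment helper; validate JSON and bounds in application code
import json


def sentiment_from_headlines(client, model: str, headlines: list[str]) -> dict:
    # client is assumed to expose an OpenAI-style chat completions interface
    raw = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Return compact JSON: sentiment_score [-1,1], themes[]."},
            {"role": "user", "content": "\n".join(headlines)},
        ],
    )
    # json.loads raises on malformed output; the caller decides how to retry
    data = json.loads(raw.choices[0].message.content)
    # Explicit check instead of assert, which is skipped under python -O
    score = float(data["sentiment_score"])
    if not -1 <= score <= 1:
        raise ValueError(f"sentiment_score out of range: {score}")
    return data
Limits and failure modes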
Hallucination here means confident but wrong text: fake citations, bad arithmetic, subtle bugs in code. Mitigations include retrieval, calculators, compilers, unit tests, and human review—especially where capital is at risk.
- Knowledge cutoff — training data ends at a date; live markets need feeds.
- Reasoning — long chains fail more often; decompose tasks and verify steps.
- Cost and latency — large models are not free; smaller models or distillation may fit production SLAs.
- Prompt sensitivity — rephrase and ensemble checks when consistency matters.
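For the prompt-sensitivity item, one simple ensemble check is to ask the same question several ways and accept a label only when enough responses agree. The sketch below assumes the responses are already collected as short label strings; the agreement threshold is an arbitrary example.

```python
from collections import Counter


def majority_label(labels, min_agreement=0.6):
    """Return the most common label, or None when agreement is too low."""
    if not labels:
        return None
    normalized = [label.strip().lower() for label in labels]
    label, count = Counter(normalized).most_common(1)[0]
    return label if count / len(normalized) >= min_agreement else None


agreed = majority_label(["Positive", "positive ", "neutral"])
split = majority_label(["positive", "negative", "neutral"])
```

Here `agreed` is `"positive"` and `split` is `None`; a `None` result is a natural point to route the item to human review.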
Tool-using agents change the surface area: failures become mis-selected tools or bad parameters, not only bad prose. Design permissions and timeouts accordingly.
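A minimal sketch of permissions and timeouts for tool calls: the dispatcher only runs tools on an allowlist, checks parameter names and types, and stops waiting after a deadline. The `get_quote` tool and its return value are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def get_quote(symbol: str) -> dict:
    # Hypothetical read-only tool with a fixed placeholder price
    return {"symbol": symbol, "price": 100.0}


# Allowlist: tool name -> (function, expected parameter types)
TOOLS = {"get_quote": (get_quote, {"symbol": str})}


def call_tool(name, args, timeout_s=2.0):
    """Run a model-requested tool only if it is permitted and well-formed."""
    if name not in TOOLS:
        raise PermissionError(f"tool not permitted: {name}")
    func, schema = TOOLS[name]
    if (not isinstance(args, dict) or set(args) != set(schema)
            or not all(isinstance(args[k], t) for k, t in schema.items())):
        raise ValueError(f"bad parameters for {name}: {args!r}")
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(func, **args).result(timeout=timeout_s)
    except FutureTimeout:
        raise TimeoutError(f"{name} exceeded {timeout_s}s") from None
    finally:
        # Stop waiting for the worker; the thread itself is not killed
        pool.shutdown(wait=False)


quote = call_tool("get_quote", {"symbol": "AAPL"})
```

The timeout only bounds how long the caller waits. Tools that can hang or have side effects may need process-level isolation, which this sketch does not attempt.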


