GaiaEx AcademyGaiaEx Academy
SQL for Financial Data Warehouses
DeveloperProgrammingacademy.article.readingTime

SQL for Financial Data Warehouses

Query, aggregate, and analyze structured financial data

Share Posts

Why SQL Is the Language of Financial Data

Every trade, every tick, every order book update — financial markets produce staggering volumes of structured data. And for over four decades, SQL (Structured Query Language) has been the dominant tool for querying, transforming, and analyzing that data. It's not glamorous. It's not trendy. But when a portfolio manager asks "What was my average fill price on ETH trades last Tuesday between 2 PM and 3 PM UTC?", the answer comes from SQL.

Financial data is inherently relational. A trade references an order, which references an account, which belongs to a user. Positions relate to instruments; instruments relate to markets. The relational model — tables with rows and columns, connected by foreign keys — maps naturally onto these relationships. This is why relational databases like PostgreSQL, MySQL, and SQL Server remain the backbone of financial infrastructure, from Goldman Sachs to GaiaEx.

SQL's power lies in its declarative nature: you describe what you want, not how to get it. The database engine figures out the optimal execution plan. This matters enormously when your trade history table has 500 million rows and you need results in under a second. Write the right query, build the right index, and SQL delivers — regardless of whether you're running it against a local PostgreSQL instance or a cloud-scale data warehouse.

Relational model (simplified) users user_id PK accounts user_id FK → users tier, limits … trades user_id FK → accounts symbol, price, time … JOINs follow foreign keys — the shape of finance data is already relational.
Trades hang off accounts and users: the same links your SQL JOINs traverse.

SELECT, JOIN, and WHERE: Querying Trade Data

Let's start with a concrete schema. Imagine a trades table on a platform like GaiaEx:

CREATE TABLE trades (
    trade_id    BIGINT PRIMARY KEY,
    user_id     INT NOT NULL,
    symbol      VARCHAR(20) NOT NULL,
    side        VARCHAR(4) NOT NULL,  -- 'buy' or 'sell'
    price       NUMERIC(18,8) NOT NULL,
    quantity    NUMERIC(18,8) NOT NULL,
    fee         NUMERIC(18,8) DEFAULT 0,
    executed_at TIMESTAMPTZ NOT NULL
);

The most fundamental operation is the SELECT statement with filtering. To find all ETH-USD buy trades above $3,000 in the last 24 hours:

SELECT trade_id, price, quantity, executed_at
FROM trades
WHERE symbol = 'ETH-USD'
  AND side = 'buy'
  AND price > 3000
  AND executed_at > NOW() - INTERVAL '24 hours'
ORDER BY executed_at DESC;

JOINs connect related tables. Suppose you have an accounts table and want to see the trading volume per account tier:

SELECT a.tier, COUNT(*) AS trade_count,
       SUM(t.price * t.quantity) AS total_volume
FROM trades t
JOIN accounts a ON t.user_id = a.user_id
WHERE t.executed_at > NOW() - INTERVAL '30 days'
GROUP BY a.tier
ORDER BY total_volume DESC;

These queries are the daily bread of financial analysts, risk teams, and compliance officers. Master them and you can answer any question your data contains.

Window Functions: The Analyst's Secret Weapon

Window functions are what separate SQL beginners from SQL practitioners in finance. They let you perform calculations across a set of rows related to the current row — without collapsing the result into a single aggregated value. Think of them as "running computations" across your data.

The most commonly used window functions in financial analysis:

  • ROW_NUMBER() — Assigns a sequential integer to each row within a partition. Useful for deduplication or selecting the Nth trade per symbol.
  • LAG() and LEAD() — Access the previous or next row's value. Essential for computing trade-to-trade returns or detecting gaps in time-series data.
  • SUM() OVER () — Running totals. Compute cumulative volume, cumulative P&L, or running position size.
  • AVG() OVER (ROWS BETWEEN) — Moving averages directly in SQL. No Python needed.

Here's a practical example — computing trade-to-trade price change and a running position for BTC:

SELECT executed_at, price, quantity, side,
       price - LAG(price) OVER (ORDER BY executed_at) AS price_change,
       SUM(CASE WHEN side = 'buy' THEN quantity ELSE -quantity END)
           OVER (ORDER BY executed_at) AS running_position
FROM trades
WHERE symbol = 'BTC-USD'
ORDER BY executed_at;

Window functions don't reduce your row count — unlike GROUP BY, every input row produces an output row. This makes them perfect for adding analytical columns to detailed trade logs without losing granularity. Think of them as "enriching" your data rather than "summarizing" it.

For OHLCV aggregation — the foundation of candlestick charts — you can combine GROUP BY with time-bucketing and standard aggregates:

SELECT date_trunc('hour', executed_at) AS bucket,
       (ARRAY_AGG(price ORDER BY executed_at))[1] AS open,
       MAX(price) AS high,
       MIN(price) AS low,
       (ARRAY_AGG(price ORDER BY executed_at DESC))[1] AS close,
       SUM(quantity) AS volume
FROM trades
WHERE symbol = 'ETH-USD'
GROUP BY bucket
ORDER BY bucket;

CTEs: Composing Complex Analytical Queries

Common Table Expressions (CTEs) let you break a complex query into named, readable stages — like functions in programming. They are introduced with the WITH keyword and can reference each other sequentially.

Suppose you want to identify your top 10 most profitable trading days, but profitability requires computing net P&L per day including fees. With CTEs, you can build this step by step:

WITH daily_trades AS (
    SELECT DATE(executed_at) AS trade_date,
           SUM(CASE WHEN side = 'sell' THEN price * quantity
                    ELSE -price * quantity END) AS gross_pnl,
           SUM(fee) AS total_fees
    FROM trades
    WHERE user_id = 42
    GROUP BY DATE(executed_at)
),
daily_pnl AS (
    SELECT trade_date,
           gross_pnl - total_fees AS net_pnl,
           SUM(gross_pnl - total_fees) OVER (ORDER BY trade_date) AS cumulative_pnl
    FROM daily_trades
)
SELECT trade_date, net_pnl, cumulative_pnl
FROM daily_pnl
ORDER BY net_pnl DESC
LIMIT 10;

Each CTE reads like a paragraph. daily_trades aggregates raw trades into daily summaries. daily_pnl computes net P&L and a running total. The final SELECT picks the top 10 days. Compare this to a single monolithic subquery — the CTE version is maintainable, testable, and self-documenting.

In production financial systems, CTEs are used for everything from regulatory reporting (aggregating margin requirements across accounts) to real-time dashboard queries (computing rolling 24-hour volume on GaiaEx). They compose in the same way well-factored code composes — each layer builds on the previous one.

Indexing and Partitioning for Time-Series Performance

A query is only as fast as the index that supports it. Without the right index, even a simple WHERE clause forces a sequential scan — reading every row in the table. At 500 million rows, that's minutes instead of milliseconds.

For financial time-series data, the most critical index pattern is a composite B-tree index on (symbol, executed_at):

CREATE INDEX idx_trades_symbol_time
ON trades (symbol, executed_at DESC);

This single index accelerates the majority of analytical queries: "all ETH trades in the last hour," "BTC trades between two timestamps," or "latest trade per symbol." The column order matters — symbol first enables equality filtering, then executed_at enables efficient range scans within that partition of the index.

For tables that grow beyond hundreds of millions of rows, table partitioning is essential. PostgreSQL supports declarative partitioning by range — perfect for time-series:

CREATE TABLE trades (
    trade_id BIGINT, symbol VARCHAR(20),
    price NUMERIC(18,8), executed_at TIMESTAMPTZ
) PARTITION BY RANGE (executed_at);

CREATE TABLE trades_2026_q1 PARTITION OF trades
    FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');
CREATE TABLE trades_2026_q2 PARTITION OF trades
    FOR VALUES FROM ('2026-04-01') TO ('2026-07-01');

With partitioning, a query for March 2026 data only scans the Q1 partition — the engine skips Q2, Q3, and Q4 entirely. This technique, called partition pruning, delivers order-of-magnitude speedups on large historical datasets.

Time-range partition pruning Q1 2026 Jan–Mar rows Q2 2026 skipped Q3 2026 Q4 2026 Query: WHERE executed_at in March → planner reads only Q1 Composite index (symbol, executed_at) still helps within each partition
Partitions bound time ranges; pruning skips whole chunks that cannot match.

PostgreSQL vs ClickHouse vs BigQuery: Choosing the Right Engine

Not all databases handle financial analytics equally. The choice depends on your query patterns, data volume, and latency requirements.

PostgreSQL is the workhorse. It excels at transactional workloads (recording trades in real time), supports ACID guarantees, handles complex JOINs gracefully, and scales to hundreds of millions of rows with proper indexing and partitioning. It's the right choice for operational databases — the system of record. GaiaEx, for instance, relies on PostgreSQL-compatible infrastructure for its core trading data with the reliability guarantees that financial systems demand.

ClickHouse is a columnar database designed for analytical queries on billions of rows. Where PostgreSQL stores data row by row (great for inserting individual trades), ClickHouse stores data column by column (great for aggregating a single column across millions of rows). A query like "average hourly volume for BTC across 3 years" that takes 30 seconds in PostgreSQL might finish in 200 milliseconds in ClickHouse. The trade-off: ClickHouse doesn't support UPDATE or DELETE efficiently — it's append-only by design.

BigQuery (Google Cloud) is a serverless columnar warehouse. No infrastructure to manage, pay-per-query pricing, and petabyte-scale capacity. It's ideal for ad-hoc analytics, research, and exploration — but query latency is measured in seconds, not milliseconds, making it unsuitable for real-time applications.

  • Real-time trade recording and order management → PostgreSQL
  • Historical analytics and backtesting over billions of rows → ClickHouse
  • Ad-hoc exploration and cross-team data sharing → BigQuery

Many professional trading firms use all three in a tiered architecture: PostgreSQL as the hot operational store, ClickHouse as the warm analytical layer, and BigQuery as the cold archive. The skill of writing good SQL transfers across all of them — the syntax is 90% identical, and the analytical thinking is 100% the same.