
Data Analysis with Pandas, NumPy, and Matplotlib

Clean, transform, and visualize financial data like a quant

NumPy: The Engine Under the Hood

Every serious financial computation in Python ultimately runs through NumPy. While Python lists are flexible, they are slow for numeric work: each element is a full Python object that carries type-checking overhead. NumPy arrays store raw numbers of a single type in contiguous memory blocks, so array operations often run one to two orders of magnitude faster than equivalent pure-Python loops.

Here's the difference in practice:

import numpy as np

# Create an array of 1 million random daily returns
returns = np.random.normal(0.0005, 0.02, 1_000_000)

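# Vectorized operations: no loops needed
cumulative = np.cumprod(1 + returns)  # Growth of $1
# Annualized Sharpe ratio (risk-free rate assumed to be 0, 252 trading days per year)
sharpe = returns.mean() / returns.std() * np.sqrt(252)
# Largest peak-to-trough decline of the equity curve
max_drawdown = np.min(cumulative / np.maximum.accumulate(cumulative) - 1)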

print(f"Sharpe Ratio: {sharpe:.3f}")
print(f"Max Drawdown: {max_drawdown:.2%}")

Key NumPy concepts for finance:

Vectorized operations — Apply math to entire arrays at once instead of looping element by element.
Broadcasting — Automatically align arrays of different shapes for arithmetic (e.g., subtracting a scalar mean from every element).
Linear algebra — np.linalg provides matrix multiplication, eigenvalue decomposition, and solvers used in portfolio optimization.
Random sampling — Monte Carlo simulations, bootstrap sampling, and stochastic modeling all rely on np.random.
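As a rough illustration of the broadcasting and linear algebra points above, the sketch below demeans a matrix of returns and computes a portfolio variance. The return data, the three-asset shape, and the weights are all made up for the example.

```python
import numpy as np

# Hypothetical data: 250 days of returns for 3 assets (rows = days, columns = assets)
rng = np.random.default_rng(seed=42)
asset_returns = rng.normal(0.0005, 0.02, size=(250, 3))

# Broadcasting: the (3,) vector of column means is subtracted from every row
demeaned = asset_returns - asset_returns.mean(axis=0)

# Linear algebra: sample covariance matrix, then portfolio variance w' * Cov * w
cov = demeaned.T @ demeaned / (len(demeaned) - 1)
weights = np.array([0.5, 0.3, 0.2])  # illustrative weights, not a recommendation
portfolio_var = weights @ cov @ weights
portfolio_vol_annual = np.sqrt(portfolio_var * 252)

print(f"Annualized portfolio volatility: {portfolio_vol_annual:.2%}")
```

The hand-built covariance should agree with np.cov(asset_returns, rowvar=False); writing it out only makes the broadcasting step visible.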

Performance tip: Prefer vectorized NumPy operations over Python for-loops wherever the computation allows it. As an illustration, a portfolio calculation over 500 assets that takes 30 seconds with loops and 50 milliseconds when vectorized is a 600x speedup. Actual gains depend on the workload and the hardware.
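To make the loop-versus-vectorized contrast concrete, here is a minimal sketch that computes max drawdown both ways and checks that the results agree. The function names and the simulated returns are assumptions for illustration. No timings are shown because they vary by machine.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
returns = rng.normal(0.0005, 0.02, 10_000)  # simulated daily returns

def max_drawdown_loop(r):
    """Pure-Python version: update equity and running peak one element at a time."""
    equity, peak, worst = 1.0, 0.0, 0.0  # peak starts at 0 so the first equity value becomes the first peak
    for x in r:
        equity *= 1.0 + x
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst

def max_drawdown_vectorized(r):
    """NumPy version: the same calculation expressed as whole-array operations."""
    equity = np.cumprod(1.0 + r)
    running_peak = np.maximum.accumulate(equity)
    return float(np.min(equity / running_peak - 1.0))

loop_result = max_drawdown_loop(returns)
vectorized_result = max_drawdown_vectorized(returns)
print(f"Loop: {loop_result:.4%}  Vectorized: {vectorized_result:.4%}")
```

Wrapping each call in time.perf_counter() or using timeit is a simple way to measure the gap on your own data.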

Arrays are the fast path; broadcasting is how you keep code readable without loops.

Pandas: DataFrames for Financial Data

If NumPy is the engine, pandas is the dashboard. It wraps NumPy arrays in labeled, indexed structures that make financial data manipulation intuitive and expressive.

The two core structures are:

Series — A single column of data with an index (think: a time series of closing prices).
DataFrame — A table of columns, each a Series, sharing a common index (think: OHLCV candlestick data).
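A small sketch of the two structures, using made-up prices and volumes, may help fix the distinction:

```python
import pandas as pd

dates = pd.date_range("2025-01-01", periods=3, freq="D")

# Series: one labeled column (hypothetical closing prices)
close = pd.Series([94000.0, 95500.0, 93800.0], index=dates, name="close")

# DataFrame: several columns sharing the same index (hypothetical OHLCV candles)
candles = pd.DataFrame(
    {
        "open": [93500.0, 94000.0, 95500.0],
        "high": [94800.0, 96200.0, 95900.0],
        "low": [93100.0, 93700.0, 93200.0],
        "close": close,
        "volume": [1200.0, 1500.0, 1100.0],
    },
    index=dates,
)

# Selecting one column gives back a Series on the DataFrame's index
close_column = candles["close"]
print(close_column)
```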

Loading and inspecting data is straightforward:

import pandas as pd

# Read from CSV
df = pd.read_csv("btc_daily.csv", parse_dates=["date"], index_col="date")

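# Or from JSON (common in API responses); this placeholder URL must return
# table-shaped JSON, such as a list of records
df = pd.read_json("https://api.example.com/candles?symbol=ETH")

# Quick inspection
print(df.shape)          # (rows, columns), e.g. (365, 5) for a year of daily OHLCV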
print(df.dtypes)         # Column types
print(df.describe())     # Statistical summary
print(df.tail())         # Last 5 rows

What makes pandas indispensable for finance is its DatetimeIndex. Once your index is datetime-typed, you unlock powerful time-series operations:

# Slice by date range (partial-string indexing; both endpoints are inclusive)
q1_data = df.loc["2025-01":"2025-03"]

# Resample daily data to weekly OHLC
weekly = df["close"].resample("W").ohlc()

# Forward-fill missing data (weekends, holidays)
df = df.asfreq("D").ffill()

These operations handle the messy realities of financial data, such as gaps and irregular timestamps, so you can focus on analysis rather than data plumbing. Timezone differences are handled the same way through tz_localize() and tz_convert() on the index.
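The sketch below shows one way these pieces might fit together: localizing naive timestamps, converting between zones, and downsampling full OHLCV candles with a per-column aggregation. The hourly data is simulated, and the choice of UTC and New York is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 72 hourly candles with naive timestamps known to be UTC
idx = pd.date_range("2025-01-01", periods=72, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(seed=1)
close = 100 + rng.normal(0, 1, size=72).cumsum()
open_ = np.concatenate(([100.0], close[:-1]))
hourly = pd.DataFrame(
    {
        "open": open_,
        "high": np.maximum(open_, close) + 0.5,
        "low": np.minimum(open_, close) - 0.5,
        "close": close,
        "volume": rng.integers(100, 1000, size=72),
    },
    index=idx,
)

# Attach the zone the timestamps are already in, then view them in another zone
hourly = hourly.tz_localize("UTC")
hourly_ny = hourly.tz_convert("America/New_York")

# Each column needs its own rule when downsampling candles
daily = hourly.resample("D").agg(
    {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}
)
print(daily)
```

Note that Series.resample(...).ohlc() builds candles from a single price column, whereas the agg dictionary here preserves existing open, high, low, and volume columns.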

Rolling Calculations: Moving Averages and Volatility

Financial analysis is inherently windowed. You rarely care about a single data point in isolation—context comes from how it relates to recent history. Pandas' .rolling() method is your primary tool here.

import pandas as pd
import numpy as np

# Assume df has a "close" column with DatetimeIndex
df["sma_20"] = df["close"].rolling(20).mean()
df["sma_50"] = df["close"].rolling(50).mean()
df["ema_12"] = df["close"].ewm(span=12).mean()

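# Daily returns and rolling volatility
df["returns"] = df["close"].pct_change()
# Annualize with 365 for markets that trade every day (crypto); use 252 for equities
df["vol_30d"] = df["returns"].rolling(30).std() * np.sqrt(365)

# Bollinger Bands: 20-period SMA plus/minus 2 rolling standard deviations
rolling_std_20 = df["close"].rolling(20).std()
df["bb_upper"] = df["sma_20"] + 2 * rolling_std_20
df["bb_lower"] = df["sma_20"] - 2 * rolling_std_20

# Rolling correlation between two assets
# (assumes df also has "btc_returns" and "eth_returns" columns)
df["corr_btc_eth"] = df["btc_returns"].rolling(60).corr(df["eth_returns"])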

Key rolling calculations every trader should know:

Simple Moving Average (SMA) — Equal-weighted average over N periods. Lagging but stable.
Exponential Moving Average (EMA) — Weights recent data more heavily via .ewm(). More responsive to price changes.
Rolling Volatility — Standard deviation of returns over a window, annualized. Essential for position sizing and risk budgets.
Rolling Correlation — Measures how two assets move together over time. Critical for portfolio diversification.
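To show what .ewm() is doing under the hood, the sketch below reproduces an EMA with the textbook recursion and compares it with pandas. The prices and the span of 3 are made up. Note that it passes adjust=False, which gives the recursive form; the ema_12 line earlier uses the pandas default (adjust=True), which weights the earliest observations differently and converges to the same values as the series grows.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0, 106.0])  # made-up closes
span = 3
alpha = 2 / (span + 1)  # smoothing factor implied by the span

# Textbook recursion: ema_t = alpha * price_t + (1 - alpha) * ema_{t-1}
ema_manual = [prices.iloc[0]]
for p in prices.iloc[1:]:
    ema_manual.append(alpha * p + (1 - alpha) * ema_manual[-1])

ema_pandas = prices.ewm(span=span, adjust=False).mean()

# The SMA over the same window is NaN until a full window is available
sma = prices.rolling(span).mean()

print(pd.DataFrame({"price": prices, "sma": sma, "ema": ema_pandas}))
```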

On platforms like GaiaEx, you can pull historical candle data through the trading API, load it into a pandas DataFrame, and compute these indicators in seconds—all while your assets stay safe in your MPC wallet.

GroupBy and Aggregation for Portfolio Analysis

When you're managing multiple assets, .groupby() becomes essential. It lets you split data by category, apply calculations, and combine results—the split-apply-combine pattern.

# DataFrame with multiple assets
trades = pd.DataFrame({
    "symbol": ["BTC", "ETH", "BTC", "SOL", "ETH", "BTC"],
    "side": ["buy", "buy", "sell", "buy", "sell", "buy"],
    "pnl": [120.5, -45.2, 89.0, 210.3, 55.8, -30.1],
    "volume": [5000, 3200, 4800, 1500, 2900, 5100],
})

# Performance by asset
summary = trades.groupby("symbol").agg(
    total_pnl=("pnl", "sum"),
    avg_pnl=("pnl", "mean"),
    trade_count=("pnl", "count"),
    total_volume=("volume", "sum"),
    win_rate=("pnl", lambda x: (x > 0).mean()),
)

print(summary.sort_values("total_pnl", ascending=False))

Advanced aggregation patterns for portfolio work:

Multi-level groupby — Group by symbol and side to see long vs. short performance per asset.
Custom aggregation functions — Calculate Sharpe ratios, Sortino ratios, or max drawdown per asset group.
Pivot tables — Reshape data to see monthly returns by asset in a heatmap-friendly format using pd.pivot_table().
Resampling with groupby — Combine time-based resampling with categorical grouping for multi-asset time-series analysis.
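Two of the patterns above, multi-level groupby and pivot tables, can be sketched as follows. The trade values match the earlier example; the dates are hypothetical and added only so the trades can be bucketed by month.

```python
import pandas as pd

trades = pd.DataFrame({
    # Hypothetical trade dates, added for the monthly pivot
    "date": pd.to_datetime([
        "2025-01-05", "2025-01-12", "2025-01-20",
        "2025-02-03", "2025-02-14", "2025-02-25",
    ]),
    "symbol": ["BTC", "ETH", "BTC", "SOL", "ETH", "BTC"],
    "side": ["buy", "buy", "sell", "buy", "sell", "buy"],
    "pnl": [120.5, -45.2, 89.0, 210.3, 55.8, -30.1],
})

# Multi-level groupby: performance per asset and side
by_symbol_side = trades.groupby(["symbol", "side"])["pnl"].agg(["sum", "mean", "count"])

# Pivot table: months as rows, assets as columns, summed PnL as values
trades["month"] = trades["date"].dt.strftime("%Y-%m")
monthly = pd.pivot_table(
    trades, index="month", columns="symbol", values="pnl",
    aggfunc="sum", fill_value=0.0,
)

print(by_symbol_side)
print(monthly)
```

The monthly table has the month-by-asset shape that a heatmap function such as sns.heatmap expects.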

Matplotlib: Charting Price Action and Signals

Data without visualization is just numbers. Matplotlib transforms your analysis into charts that reveal patterns invisible in raw data.

import matplotlib.pyplot as plt
import matplotlib.dates as mdates

fig, axes = plt.subplots(3, 1, figsize=(14, 10),
                         sharex=True, gridspec_kw={"height_ratios": [3, 1, 1]})

# Price + Moving Averages
axes[0].plot(df.index, df["close"], label="Close", linewidth=1.2)
axes[0].plot(df.index, df["sma_20"], label="SMA 20", linestyle="--")
axes[0].fill_between(df.index, df["bb_upper"], df["bb_lower"],
                     alpha=0.1, color="blue", label="Bollinger Bands")
axes[0].set_ylabel("Price (USD)")
axes[0].legend(loc="upper left")

# Volume bars
colors = ["green" if c > o else "red"
          for c, o in zip(df["close"], df["open"])]
axes[1].bar(df.index, df["volume"], color=colors, alpha=0.7)
axes[1].set_ylabel("Volume")

# Rolling Volatility
axes[2].plot(df.index, df["vol_30d"], color="purple")
axes[2].set_ylabel("30d Volatility")
# Reference line at 80% annualized volatility (an illustrative threshold)
axes[2].axhline(y=0.8, color="red", linestyle=":", alpha=0.5)

axes[2].xaxis.set_major_formatter(mdates.DateFormatter("%b %Y"))
plt.tight_layout()
plt.savefig("analysis_dashboard.png", dpi=150)
plt.show()

This produces a professional three-panel dashboard: price action with Bollinger Bands on top, volume in the middle, and volatility at the bottom. This layout mirrors what you'd see on professional trading terminals.

For statistical visualization, seaborn builds on Matplotlib with higher-level functions:

import seaborn as sns

# Return distribution
sns.histplot(df["returns"].dropna(), bins=100, kde=True)

# Correlation heatmap across assets
# (assumes portfolio_returns is a DataFrame of returns with one column per asset)
corr_matrix = portfolio_returns.corr()
sns.heatmap(corr_matrix, annot=True, cmap="RdYlGn", center=0)

Pandas prepares tidy tables; Matplotlib maps them to axes you can annotate and share.

Putting It All Together: A Complete Analysis Pipeline

The real power emerges when you chain these tools into a repeatable pipeline. Here's the kind of workflow quants run routinely:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def analyze_asset(symbol: str, df: pd.DataFrame) -> dict:
    """Full analysis pipeline for a single asset (expects daily "close" prices)."""
    # Keep returns in a local Series so the caller's DataFrame is not modified
    returns = df["close"].pct_change()

    return {
        "symbol": symbol,
        "total_return": (df["close"].iloc[-1] / df["close"].iloc[0]) - 1,
        # Annualized with 365 periods (markets that trade every day)
        "annual_vol": returns.std() * np.sqrt(365),
        # Sharpe ratio with the risk-free rate assumed to be 0
        "sharpe": returns.mean() / returns.std() * np.sqrt(365),
        "max_drawdown": (df["close"] / df["close"].cummax() - 1).min(),
        "skewness": returns.skew(),
        "kurtosis": returns.kurtosis(),
    }

# Analyze multiple assets
# (assumes assets is a dict mapping each symbol to its price DataFrame)
results = [analyze_asset(sym, data) for sym, data in assets.items()]
summary = pd.DataFrame(results).set_index("symbol")
print(summary.round(4))

This pipeline takes raw price data and produces a comprehensive risk-return profile for each asset. From here, you can feed these metrics into a portfolio optimizer, generate allocation recommendations, or trigger rebalancing signals.
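As one hedged example of a downstream step, the sketch below turns the annual_vol column into inverse-volatility target weights and flags a rebalance when holdings drift too far. Inverse-volatility weighting is a simple stand-in here, not a full optimizer, and the volatility figures, current weights, and 5-point drift threshold are all made up.

```python
import pandas as pd

# Hypothetical metrics shaped like the `summary` table above (values are made up)
summary = pd.DataFrame(
    {"annual_vol": [0.55, 0.70, 0.95]},
    index=pd.Index(["BTC", "ETH", "SOL"], name="symbol"),
)

# Inverse-volatility weighting: less volatile assets get larger weights
inv_vol = 1.0 / summary["annual_vol"]
target_weights = inv_vol / inv_vol.sum()

# Hypothetical current holdings, as fractions of portfolio value
current_weights = pd.Series({"BTC": 0.50, "ETH": 0.30, "SOL": 0.20})

# Rebalancing signal: any asset more than 5 percentage points off target
drift = (current_weights - target_weights).abs()
needs_rebalance = bool((drift > 0.05).any())

print(target_weights.round(4))
print(f"Rebalance needed: {needs_rebalance}")
```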

The pandas-NumPy-Matplotlib stack is the foundation that everything else in Python finance builds upon. Master these three libraries and you'll have the skills to analyze any market, build any indicator, and visualize any strategy—whether you're trading equities on NYSE, crypto on GaiaEx, or derivatives on CME.
