How to lose 36 million in 1 year

A Physics and Machine Learning love story

Michael Green

AI Alpha Lab

2026-05-30

The beginning

Financial markets are unforgiving. A bad decision is often quite bad.

This is the story of what I learned building the uncertainty-first, physics-inspired neural network that now runs our fund at AI Alpha Lab. It starts, as these things do, with 36 million going up in smoke.

Backstory

%%{init: {
  'theme': 'base',
  'themeVariables': {
    'background':         '#1e1e2e',
    'primaryColor':       '#313244',
    'primaryTextColor':   '#cdd6f4',
    'primaryBorderColor': '#89b4fa',
    'lineColor':          '#89b4fa',
    'secondaryColor':     '#45475a',
    'tertiaryColor':      '#181825',
    'textColor':          '#cdd6f4',
    'noteTextColor':      '#cdd6f4',
    'sectionBkgColor':    '#313244',
    'altSectionBkgColor': '#45475a',
    'cScale0': '#cba6f7', 'cScaleLabel0': '#1e1e2e',
    'cScale1': '#89b4fa', 'cScaleLabel1': '#1e1e2e',
    'cScale2': '#fab387', 'cScaleLabel2': '#1e1e2e',
    'cScale3': '#a6e3a1', 'cScaleLabel3': '#1e1e2e',
    'cScale4': '#89dceb', 'cScaleLabel4': '#1e1e2e',
    'cScale5': '#f5c2e7', 'cScaleLabel5': '#1e1e2e',
    'cScale6': '#94e2d5', 'cScaleLabel6': '#1e1e2e'
  }
}}%%
timeline
    title My journey
    2000 : Operator
    2003 : MSc Computer Science
         : AI and Robotics
    2008 : PhD Theoretical Physics
         : Neural networks and Complex Systems
    2009 : Corporate Life
         : Consultant SimCorp
         : CTO GroupM
    2013 : Founder
         : CTO, Blackwood Seven
    2018 : Founder
         : CEO, Desupervised
         : CTO, AI Alpha Lab
    2022 : Founder
         : CEO, Alviss AI

The proof is in the pudding

“AI Alpha Lab Globale Aktier, kl n” is the best-performing fund in the Danish fund market.
We’re also in the top 2% of best-performing funds globally, according to Citywire.
The performance figures include price plus dividends for all funds in the list.
It’s a purely machine-learning-driven approach where no human is allowed to overrule a decision.
There are safety guards in place, though.

AI Alpha Lab Fund Performance - April 21

Looking for 𝛂 is never a smooth ride

Making the model admit what it doesn’t know

If a model can’t tell you how sure it is, you can’t size the bet. Every interesting question I care about in quantitative investing is a question about a distribution, not a single number. So how do we actually wring that distribution out of a large neural network?

Two flavours of uncertainty

Aleatoric, the noise in the data itself.

Irreducible. More data won’t save you.
In markets, this is the genuinely random part of returns. It’s always there.
Model it with a heteroscedastic head (predict both \(\mu(x)\) and \(\sigma(x)\)).
Essentially free. You change the loss from MSE to Gaussian NLL and move on.

Epistemic, what the model doesn’t know.

Shrinks with more (representative) data.
In markets: regime shifts, unseen macro, tail events. The darn interesting cases.
This requires a distribution over models, not a point estimate.
This is the part that gets expensive.
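A compact way to write the split, assuming each model \(w\) in the distribution over models predicts its own mean \(\mu_w(x)\) and variance \(\sigma_w^2(x)\) (notation introduced here for illustration), is the law of total variance:

```latex
\[
\operatorname{Var}[y \mid x]
= \underbrace{\mathbb{E}_{w}\!\left[\sigma_w^2(x)\right]}_{\text{aleatoric}}
+ \underbrace{\operatorname{Var}_{w}\!\left[\mu_w(x)\right]}_{\text{epistemic}}
\]
```

The first term survives even with a single model. The second term is zero unless you carry more than one plausible model around.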

What the two uncertainties look like

The good and the bad (as in naive)

A proper Bayesian neural network treats the weights \(w\) as random:

\[p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, dw\]

\(p(w \mid \mathcal{D})\) lives in a space with \(10^6\) to \(10^9\) dimensions.
HMC is the gold standard and completely infeasible here.
Mean-field variational inference is tractable and pathologically overconfident.
The posterior is multimodal. Most approximations happily collapse to one mode and pretend the others don’t exist.

For context

Representing a 100 million parameter posterior with 4000 samples, stored as 32-bit floats, would require a whoppin’ 1.6TB of RAM.
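The arithmetic behind that number, as a quick sketch. The 4 bytes per weight (Float32 storage) is an assumption; at Float64 the figure doubles.

```julia
n_params         = 100_000_000   # 100 million weights
n_samples        = 4_000         # posterior samples
bytes_per_weight = 4             # assumption: Float32 storage

total_bytes = n_params * n_samples * bytes_per_weight
total_tb    = total_bytes / 1e12  # decimal terabytes
println(total_tb, " TB")
```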

Heteroscedastic head, the free lunch

Two outputs instead of one. Predict \(\mu(x)\) and \(\log \sigma^2(x)\). Swap MSE for Gaussian NLL. That’s the entire change.

\[\mathcal{L}(x, y) = \frac{(y - \mu(x))^2}{2\sigma^2(x)} + \tfrac{1}{2}\log\sigma^2(x)\]

First term: weighted squared error. Noisy samples down-weighted.
Second term: keeps the network honest. Can’t just predict \(\sigma^2 \to \infty\) to zero the loss.
Predict \(\log \sigma^2\), not \(\sigma^2\) directly. Positivity for free, gradients well-behaved.
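To see why the second term keeps the network honest, fix one sample, write \(r = y - \mu(x)\) and \(s = \sigma^2(x)\) (shorthand introduced here), and minimise the loss over \(s\):

```latex
\[
\frac{\partial \mathcal{L}}{\partial s}
= -\frac{r^2}{2s^2} + \frac{1}{2s} = 0
\quad\Longrightarrow\quad
s^\star = r^2
\]
```

So the best the variance head can do on a single sample is match the squared residual. Pushing \(s \to \infty\) kills the first term, but the \(\tfrac{1}{2}\log s\) penalty grows without bound.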

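using Flux
using Statistics: mean

# `features` (the trunk), its output width `d`, `data`, and `opt` are defined elsewhere.
# Two-head output: μ and log σ²
head = Parallel(vcat,
    Dense(d, 1),          # μ(x)
    Dense(d, 1),          # log σ²(x)
)
model = Chain(features, head)

# Gaussian negative log-likelihood, averaged over the batch.
# `y` is a vector with one target per column of `x`.
function nll(x, y)
    out  = model(x)
    μ, logσ² = out[1, :], out[2, :]
    σ² = exp.(logσ²)
    return mean(@. (y - μ)^2 / (2σ²) + logσ²/2)
end

Flux.train!(nll, Flux.params(model), data, opt)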

One extra output, one new loss. Aleatoric uncertainty for free.

It’s not always easy to navigate

\(\mu(x_1, x_2)\) and \(\log \sigma^2(x_1, x_2)\) from a heteroscedastic head with data \(y = \sin(x_1)\cos(x_2)\).

The landscape the optimizer sees

\(\mu = \sin(w_1 x)\), \(\sigma^2 = \sigma_0^2 + w_2^2 x^2\)
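A minimal sketch of how such a landscape can be computed for the toy model above. The noise floor, the "true" weights, the input range, and the grid are all hypothetical values picked for the demo, not the ones behind the figure.

```julia
using Random, Statistics

const SIGMA0_SQ = 0.05                    # assumed noise floor σ₀² (hypothetical)
mean_fn(w1, x) = sin(w1 * x)              # μ = sin(w₁ x)
var_fn(w2, x)  = SIGMA0_SQ + w2^2 * x^2   # σ² = σ₀² + w₂² x²

# Mean Gaussian NLL over the data set (additive constants dropped)
function toy_nll(w1, w2, xs, ys)
    m = mean_fn.(w1, xs)
    v = var_fn.(w2, xs)
    return mean(@. (ys - m)^2 / (2v) + log(v) / 2)
end

# Synthetic data drawn from the same model at assumed "true" weights
rng = MersenneTwister(0)
w1_true, w2_true = 1.0, 0.3
xs = collect(range(-3, 3; length = 200))
ys = mean_fn.(w1_true, xs) .+ sqrt.(var_fn.(w2_true, xs)) .* randn(rng, length(xs))

# Loss on a grid of (w₁, w₂): this matrix is the landscape
w1_grid = range(-3, 3; length = 61)
w2_grid = range(0.0, 1.5; length = 31)
surface = [toy_nll(w1, w2, xs, ys) for w1 in w1_grid, w2 in w2_grid]
```

Plot `surface` as a heatmap or contour plot with any plotting package to look for ridges and local minima along the \(w_1\) axis.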

Deep ensembles

Train \(N\) independent networks from different random seeds.

\[p(y \mid x) \approx \frac{1}{N} \sum_{i=1}^{N} p_i(y \mid x, w_i)\]

Epistemic uncertainty shows up as disagreement between members.
Empirically beats most “principled” Bayesian methods on calibration and out-of-distribution detection (Lakshminarayanan et al. 2017; Ovadia et al. 2019).
Embarrassingly parallel. One of the few things in deep learning that actually scales linearly.
The cost is \(N\times\) training. In practice \(N = 5\) to \(10\) is plenty.
Pairs naturally with a heteroscedastic head for the aleatoric part.
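A small sketch of how the members can be combined at a single input, assuming each member has a heteroscedastic head and therefore emits a mean and a variance. The function name and the five example predictions are made up for illustration.

```julia
using Statistics

# Combine N Gaussian ensemble members at one input into mixture moments.
# member_means[i] and member_vars[i] are the outputs of member i.
function ensemble_moments(member_means::AbstractVector, member_vars::AbstractVector)
    μ_mix     = mean(member_means)                  # mean of the mixture
    aleatoric = mean(member_vars)                   # average predicted noise
    epistemic = mean(abs2, member_means .- μ_mix)   # disagreement between members
    return (μ = μ_mix, aleatoric = aleatoric, epistemic = epistemic,
            total = aleatoric + epistemic)
end

# Hypothetical outputs of five members at a single x
member_means = [0.010, 0.012, 0.008, 0.011, 0.009]
member_vars  = [0.040, 0.050, 0.045, 0.042, 0.048]
m = ensemble_moments(member_means, member_vars)
```

Here `m.epistemic` is the part that should blow up when the members are asked about something none of them has seen.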

One training run, many models

Ok so what if \(N\times\) training is too expensive? Can we wring a posterior out of a single run?

Laplace approximation

Fit a Gaussian around the MAP estimate (MacKay 1992).
Covariance from the Hessian (or Fisher / GGN as a cheap proxy).
Last-layer Laplace is cheap, post-hoc, and shockingly competitive (Daxberger et al. 2021).
Works on a model you already trained. No retraining required.

SWAG, SGD as a sampler

Fit a Gaussian to the trajectory of SGD iterates (Maddox et al. 2019).
Low-rank plus diagonal covariance. Memory-friendly.
Sample from this Gaussian as an implicit posterior.
Free epistemic uncertainty from a training run you were doing anyway.
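To make the SWAG idea concrete, here is a sketch of the diagonal-only variant: keep running first and second moments of the weight iterates, then sample from the resulting Gaussian. It leaves out the low-rank covariance term, and the type and function names are invented for this example. The "SGD iterates" are random stand-ins.

```julia
using Random

# Running first and second moments of the weight iterates (diagonal SWAG)
mutable struct SwagDiag
    n::Int
    avg::Vector{Float64}     # running mean of θ
    sqavg::Vector{Float64}   # running mean of θ.^2
end
SwagDiag(d::Int) = SwagDiag(0, zeros(d), zeros(d))

# Fold one iterate into the running moments
function swag_update!(s::SwagDiag, θ::AbstractVector)
    s.n += 1
    s.avg   .+= (θ .- s.avg) ./ s.n
    s.sqavg .+= (θ .^ 2 .- s.sqavg) ./ s.n
    return s
end

# Draw one weight vector from N(avg, diag(sqavg - avg.^2))
function sample_weights(rng::AbstractRNG, s::SwagDiag)
    v = max.(s.sqavg .- s.avg .^ 2, 0.0)   # clamp round-off negatives
    return s.avg .+ sqrt.(v) .* randn(rng, length(s.avg))
end

rng = MersenneTwister(1)
s = SwagDiag(3)
for t in 1:100
    θ_t = [1.0, -2.0, 0.5] .+ 0.1 .* randn(rng, 3)   # stand-in for an SGD iterate
    swag_update!(s, θ_t)
end
θ_sample = sample_weights(rng, s)
```

In a real run you would call `swag_update!` on the flattened network weights every few steps late in training, then push each `θ_sample` back into the network for a forward pass.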

Let’s run through a Last-layer Laplace

Three moves

Train the thing, then fit a Gaussian around the MAP. That’s the whole trick.

Point estimate. Train as usual to get the MAP weights \(\hat{w}\).
Curvature at \(\hat{w}\). Approximate the posterior as a Gaussian \[p(w \mid \mathcal{D}) \;\approx\; \mathcal{N}(\hat{w},\, \Sigma), \qquad \Sigma = H^{-1}\] For a Gaussian likelihood, the GGN / Fisher of the last-layer weights is \[H \;=\; \sum_{i=1}^{N} \frac{\Phi_i \Phi_i^\top}{\sigma^2} \;+\; \tau I\] where \(\Phi_i = \phi(x_i)\) are the penultimate-layer features, \(\sigma^2\) is the observation noise, and \(\tau\) is the prior precision.
Predict. At a fresh \(x_\star\) with features \(\Phi_\star = \phi(x_\star)\), \[\mu(x_\star) = \hat{w}^\top \Phi_\star, \qquad v(x_\star) = \Phi_\star^\top \Sigma\, \Phi_\star + \sigma^2.\] First term is epistemic (what the posterior doesn’t pin down), the \(\sigma^2\) is aleatoric.

A rough sketch in Julia

using Flux, LinearAlgebra

# 1. Train a point estimate the usual way
# (`features`, `d`, `loss`, `data`, and `opt` are defined elsewhere)
model = Chain(features, Dense(d, 1; bias=false))   # no bias, so μ = θᵀΦ exactly
Flux.train!(loss, Flux.params(model), data, opt)

# MAP last-layer weights
ϕ(x) = model[1:end-1](x)
θ = vec(model[end].weight)

# 2. Curvature at θ (GGN / Fisher for Gaussian likelihood)
# `residual_variance` is a placeholder helper returning the mean squared residual
τ, σ² = 1e-3, residual_variance(model, data)
H = zeros(d, d)
for (x, _) in data
    Φ = ϕ(x)
    H .+= (Φ * Φ') / σ²
end
Σ = inv(H + τ*I)                      # last-layer posterior covariance

Point estimate in, Gaussian over the last layer out. Nothing re-trained.

Last-layer Laplace, the predictive

# 3. Predictive distribution at a fresh x (a single input, not a batch)
function predict(x)
    Φ = ϕ(x)
    μ  = θ ⋅ Φ

    # epistemic + aleatoric
    v  = Φ' * Σ * Φ + σ²
    return μ, v
end

One forward pass for the mean, one quadratic form for the variance. That’s the whole inference cost.
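If you want to poke at the three moves without a trained network, here is a self-contained toy version. It swaps the penultimate layer for a fixed polynomial feature map, so the MAP weights have a closed form. The feature map, the targets, and the values of `τ` and `σ²` are all assumptions for the demo.

```julia
using LinearAlgebra

# Fixed feature map standing in for the penultimate layer (an assumption)
ϕ(x) = [1.0, x, x^2]

xs = collect(range(-1, 1; length = 50))       # training inputs
ys = 0.5 .* xs .+ 0.1 .* sin.(20 .* xs)       # deterministic toy targets
τ, σ² = 1e-3, 0.01                            # prior precision, noise variance

# H = Σᵢ ΦᵢΦᵢᵀ/σ² + τI, then Σ = H⁻¹
Φmat  = reduce(hcat, ϕ.(xs))                  # d × N feature matrix
H     = Φmat * Φmat' ./ σ² + τ * I
Σ     = inv(H)
w_map = Σ * (Φmat * ys) ./ σ²                 # MAP weights of the linear model

function predict(x)
    Φ = ϕ(x)
    μ = dot(w_map, Φ)
    v = Φ' * Σ * Φ + σ²                       # epistemic + aleatoric
    return μ, v
end

_, v_inside  = predict(0.0)   # in the middle of the training data
_, v_outside = predict(5.0)   # far outside it
```

The aleatoric part is the same at both points. The epistemic part is what makes `v_outside` much larger than `v_inside`.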

MC Dropout, simple but bad taste

Keep dropout switched on at inference, take \(T\) forward passes, and call the variance your uncertainty (Gal and Ghahramani 2016).

Trivially easy to implement. No retraining.
But the uncertainty is tied to the dropout rate, which is a hyperparameter, not a posterior.
As the network gets wider, the uncertainty collapses.
Under distribution shift it systematically under-estimates.

Feeding this straight into position sizing is asking for trouble.
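The mechanics fit in a few lines, which is most of the appeal. Below is a sketch on a tiny hand-written MLP with made-up weights and no Flux. It also shows the main complaint: change the dropout rate `p` and the reported variance changes, with the data untouched.

```julia
using Random, Statistics

# Tiny fixed MLP with hypothetical weights, just to show the mechanics
W1 = [0.5 -0.3; 0.8 0.1; -0.6 0.9; 0.2 0.4]   # 4 hidden units, 2 inputs
W2 = [0.7 -0.5 0.3 0.6]                        # 1 output

# One stochastic forward pass with dropout left on
function forward_dropout(rng, x; p = 0.5)
    h = max.(W1 * x, 0.0)                      # ReLU hidden layer
    mask = rand(rng, length(h)) .>= p          # keep each unit with prob 1 - p
    h = h .* mask ./ (1 - p)                   # inverted dropout scaling
    return (W2 * h)[1]
end

# T passes, then the sample mean and variance of the outputs
function mc_dropout(rng, x; T = 200, p = 0.5)
    outs = [forward_dropout(rng, x; p = p) for _ in 1:T]
    return mean(outs), var(outs)
end

rng = MersenneTwister(42)
x = [1.0, 2.0]
m_lo, v_lo = mc_dropout(rng, x; p = 0.2)
m_hi, v_hi = mc_dropout(rng, x; p = 0.5)
```

Same input, same weights, and `v_hi` comes out well above `v_lo` purely because of the knob.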

Conformal prediction as a post hoc check

Everything above is a model of uncertainty. Conformal prediction sits on top and gives you a coverage guarantee, whatever model you wrapped.

Wraps any point predictor to emit intervals with guaranteed finite-sample marginal coverage.
Distribution-free. It only assumes exchangeability of calibration and test data.
That assumption breaks in financial time series, where the data are ordered and the regime drifts.
Adaptive variants (Gibbs and Candès 2021) try to fix this for streaming data.
I use it to audit my Bayesian approximations, not to replace them.
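For reference, a sketch of the two pieces mentioned above: the split conformal half-width computed from held-out residuals, and the one-line update from adaptive conformal inference. The function names, the step size `γ`, and the calibration residuals are illustrative choices.

```julia
# Split conformal: interval half-width from held-out absolute residuals.
# The interval is prediction ± halfwidth, with miscoverage level α.
function conformal_halfwidth(residuals::AbstractVector, α::Real)
    n = length(residuals)
    k = ceil(Int, (n + 1) * (1 - α))       # rank of the conformal quantile
    k > n && return Inf                    # too few calibration points
    return sort(abs.(residuals))[k]
end

# One step of adaptive conformal inference (Gibbs and Candès 2021):
# α_{t+1} = α_t + γ (α_target - err_t), with err_t = 1 on a miss.
# A miss lowers the working α, which widens the next interval.
aci_step(α_t, α_target, missed::Bool; γ = 0.005) =
    α_t + γ * (α_target - (missed ? 1 : 0))

# Hypothetical calibration residuals y - ŷ
cal_resid = [0.1, -0.4, 0.25, -0.05, 0.3, -0.2, 0.15]
q = conformal_halfwidth(cal_resid, 0.25)
```

For the audit use case, compare the empirical miss rate of the Bayesian intervals on fresh data against what this wrapper says it should be.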

So what actually ends up in the stack?

It really depends on what exactly you need, but:

A heteroscedastic head for the aleatoric part. Essentially free.
A deep ensemble with \(N \approx 5\) to \(10\) for epistemic uncertainty.
Laplace when that ensemble is too expensive.
A conformal wrapper on top, to check calibration.

The posterior is a tool.

Match the complexity of the uncertainty machinery to what you’ll actually do with it.
Are you sizing positions?
Routing between models?
Refusing to trade?
Figure out which decision the uncertainty is supposed to change, then pick the cheapest method that lets you make it.

Now let’s do some physics!

The original Brownian motion

\(dS_t = \mu S_t\, dt + \sigma S_t\, dW_t\)
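A short sketch for simulating a path of this equation, using the exact log-normal step rather than an Euler scheme. The parameter values in the call (7% drift, 20% volatility, 252 steps over one year) are illustrative, not fitted to anything.

```julia
using Random

# Exact simulation of dS = μ S dt + σ S dW on a regular grid, using
# S_{t+Δ} = S_t * exp((μ - σ²/2) Δ + σ √Δ Z),  Z ~ N(0, 1)
function simulate_gbm(rng::AbstractRNG, S0, μ, σ, T, n_steps)
    Δ = T / n_steps
    S = Vector{Float64}(undef, n_steps + 1)
    S[1] = S0
    for k in 1:n_steps
        S[k + 1] = S[k] * exp((μ - σ^2 / 2) * Δ + σ * sqrt(Δ) * randn(rng))
    end
    return S
end

path = simulate_gbm(MersenneTwister(7), 100.0, 0.07, 0.2, 1.0, 252)
```

The whole model is `μ` and `σ`, constant for the full path.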

A stock in the same bath

Two parameters is just not enough

Thank you

Michael Green
CTO, AI Alpha Lab
https://aialphalab.com

CEO, Desupervised
https://desupervised.io

References

Daxberger, Erik, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. 2021. “Laplace Redux: Effortless Bayesian Deep Learning.” Advances in Neural Information Processing Systems 34.

Gal, Yarin, and Zoubin Ghahramani. 2016. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” Proceedings of the 33rd International Conference on Machine Learning.

Gibbs, Isaac, and Emmanuel Candès. 2021. “Adaptive Conformal Inference Under Distribution Shift.” Advances in Neural Information Processing Systems 34.

Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. 2017. “Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles.” Advances in Neural Information Processing Systems 30.

MacKay, David J. C. 1992. “Bayesian Interpolation.” Neural Computation 4 (3): 415–47. https://doi.org/10.1162/neco.1992.4.3.415.

Maddox, Wesley J., Pavel Izmailov, Timur Garipov, Dmitry P. Vetrov, and Andrew Gordon Wilson. 2019. “A Simple Baseline for Bayesian Uncertainty in Deep Learning.” Advances in Neural Information Processing Systems 32.

Ovadia, Yaniv, Emily Fertig, Jie Ren, et al. 2019. “Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift.” Advances in Neural Information Processing Systems 32.