Alien AFL: The Mathematical Foundations of Causal Reinforcement Learning in AFL Analytics

This document outlines the maths for an idea I had pertaining to Moneyball in the AFL. I used a thought experiment to come up with what I believe is a unique way of modelling AFL. The idea was to think about how aliens might model the AFL given they have no understanding of the game. Assuming you had any data you wanted, what would you do? I thought that aliens would probably try to learn the intrinsic value of any given set of states and the transitions between them by back-calculating that value from the score that results from those states and transitions.

1. The Environment as a Markov Decision Process (MDP)

Traditional sports analytics methods often treat invasion games as sequences of isolated, independent events. In contrast, I propose modelling football as a finite, episodic, multi-dimensional Markov Decision Process (MDP).

1.1 The State Space ( $S$ )

A state $s \in S$ represents a comprehensive snapshot of the spatial and tactical context of the game at any time $t$ . Rather than treating space as continuous pixels, the field is discretized into a coordinate system combined with contextual vectors. A state is parameterized as:

$s_{t} = (P_{t}, V_{t}, b_{t}, v_{t}, τ_{t})$

Where:

$P_{t}$ : An $N \times 2$ matrix (where $N = 36$ active players) tracking the global coordinate layout of every athlete on the field:

$P_{t} = \begin{bmatrix} x_{1, t} & y_{1, t} \\ x_{2, t} & y_{2, t} \\ \vdots & \vdots \\ x_{36, t} & y_{36, t} \end{bmatrix} \in \mathbb{R}^{36 \times 2}$

$V_{t}$ : The velocity matrix, approximated as the discrete-time finite difference of the continuous position function $P (t)$ . Given a standard tracking sampling interval of $Δ t$ , it acts as the empirical first derivative, representing the instantaneous velocity of all 36 players:

$V_{t} \approx \frac{d P (t)}{d t} \approx \frac{P_{t} - P_{t - 1}}{Δ t} \in \mathbb{R}^{36 \times 2}$

$b_{t}$ : The instantaneous position of the ball.
$v_{t} = (Δ x, Δ y)$ : The instantaneous velocity vector (speed and direction) of the ball.
$τ_{t}$ : Contextual game-state metadata (e.g., time remaining in the quarter, point differential).
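The state components above could be held in a small container, with the velocities filled in by the finite difference. This is only a sketch: the names `State`, `finite_difference` and `build_state` are hypothetical, and it assumes the ball velocity is computed the same way as the player velocities, which the definition above does not actually specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec2 = Tuple[float, float]


@dataclass(frozen=True)
class State:
    """One snapshot of the game: player layout, ball, and context."""
    P: List[Vec2]    # player positions, one (x, y) row per player
    V: List[Vec2]    # player velocities, same row order as P
    b: Vec2          # ball position
    v: Vec2          # ball velocity
    tau: Dict        # context metadata, e.g. {"seconds_left": 312.0}


def finite_difference(curr: List[Vec2], prev: List[Vec2], dt: float) -> List[Vec2]:
    """Approximate velocity row by row as (curr - prev) / dt."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    if len(curr) != len(prev):
        raise ValueError("curr and prev must track the same players")
    return [((cx - px) / dt, (cy - py) / dt)
            for (cx, cy), (px, py) in zip(curr, prev)]


def build_state(P_curr, P_prev, b_curr, b_prev, dt, tau) -> State:
    """Assemble a state from two consecutive tracking frames."""
    V = finite_difference(P_curr, P_prev, dt)
    # Assumption: ball velocity uses the same finite difference as the players.
    v = finite_difference([b_curr], [b_prev], dt)[0]
    return State(P=list(P_curr), V=V, b=b_curr, v=v, tau=dict(tau))
```

With real tracking data `P_curr` and `P_prev` would each have 36 rows; the function itself does not care how many players are passed in, as long as both frames list them in the same order.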

1.2 The Action Space ( $E_{t}$ )

We treat the action space as a discrete event set: the continuous play of the game is reduced to a discrete event stream.

$E_{t} = \{(i, \mathrm{event\_type})\}$

$i$ is the player ID
$\mathrm{event\_type} \in \{\mathrm{kick}, \mathrm{handball}, \mathrm{spoil}, \mathrm{mark}, \mathrm{tackle}\}$
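One way to hold this event set in code is sketched below. The names `EventType`, `Event`, `group_by_timestep` and `events_at` are hypothetical, and the sketch assumes the event stream arrives as `(t, event)` pairs. A timestep with no recorded event maps to the empty set.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, Iterable, Tuple


class EventType(Enum):
    KICK = "kick"
    HANDBALL = "handball"
    SPOIL = "spoil"
    MARK = "mark"
    TACKLE = "tackle"


@dataclass(frozen=True)
class Event:
    """A single (player id, event type) pair."""
    player_id: int
    event_type: EventType


def group_by_timestep(stream: Iterable[Tuple[int, Event]]) -> Dict[int, FrozenSet[Event]]:
    """Collect the events recorded at each timestep t into a set E_t."""
    grouped: Dict[int, set] = {}
    for t, event in stream:
        grouped.setdefault(t, set()).add(event)
    return {t: frozenset(events) for t, events in grouped.items()}


def events_at(grouped: Dict[int, FrozenSet[Event]], t: int) -> FrozenSet[Event]:
    """Return E_t, or the empty set when nothing happened at t."""
    return grouped.get(t, frozenset())
```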

1.3 Transition Probabilities ( $P$ )

The transition dynamics of the environment are dictated by the probability distribution $P (s_{t + 1} ∣ s_{t}, e_{t})$ . This defines the probability that executing action $e_{t}$ from state $s_{t}$ will result in the system transitioning to state $s_{t + 1}$ .

In a chaotic, 360-degree environment like AFL, $P (s_{t + 1} ∣ s_{t}, e_{t})$ captures both the physical execution error of the player and external disruption (e.g., opposition spoils, effective tackling, wind conditions).

1.4 The Reward Function ( $R$ ) and Horizon

Because AFL possessions are highly interdependent, rewards are modeled as terminal and episodic. Intermediate actions do not receive immediate extrinsic rewards ( $R_{t} = 0$ ). The reward function resolves strictly when a possession chain terminates ( $T$ ):

$R_{terminal} \in \{+ 6.0, + 1.0, 0.0, - 1.0, - 6.0\}$

Where:

$+ 6.0$ : Chain ends in a Goal scored by the attacking team.
$+ 1.0$ : Chain ends in a Behind scored by the attacking team.
$0.0$ : Chain ends in a neutral stoppage (umpire's ball-up or boundary throw-in) or clean out-of-bounds.
$- 1.0$ : Chain is turned over and results in an immediate opposition Behind.
$- 6.0$ : Chain is turned over and results in an immediate opposition Goal.

Because every match has a fixed duration and every possession chain is finite, the discount factor is set to $γ = 1.0$ . This ensures that the model optimizes for actual scoreboard outcomes without artificially diminishing the value of multi-stage setup play.
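To make the effect of $γ = 1.0$ concrete, the sketch below computes the return at every step of a single possession chain. The outcome labels in `TERMINAL_REWARD` and the function name `returns_for_chain` are hypothetical; the reward values are the five listed above. With zero intermediate rewards and no discounting, every step in the chain inherits the full terminal reward, whereas any $γ < 1$ would shrink the credit given to early setup play.

```python
# Hypothetical outcome labels mapped to the five terminal rewards.
TERMINAL_REWARD = {
    "goal": 6.0,
    "behind": 1.0,
    "neutral": 0.0,
    "opp_behind": -1.0,
    "opp_goal": -6.0,
}


def returns_for_chain(n_steps: int, outcome: str, gamma: float = 1.0) -> list:
    """Return G_t for each step t of a chain that ends in `outcome`."""
    if n_steps < 1:
        raise ValueError("a chain needs at least one step")
    # Intermediate rewards are zero; only the last step carries a reward.
    rewards = [0.0] * (n_steps - 1) + [TERMINAL_REWARD[outcome]]
    g = 0.0
    returns = []
    # Work backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns
```

For a four-step chain ending in a goal, `returns_for_chain(4, "goal")` gives `[6.0, 6.0, 6.0, 6.0]`.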

2. Value Estimation via Offline Reinforcement Learning

Because we cannot run active simulations to train a policy, we apply Offline Reinforcement Learning (ORL) to a historical dataset. This is where the magic comes in for me. Using ORL we can capture the value of any given game configuration without any priors about what is good or bad.

2.1 The Multi-Agent State-Value Function $V (s_{t})$

The State-Value Function $V (s_{t})$ represents the expected terminal reward given the global configuration of the match at timestamp $t$ . It maps the latent value of the entire field layout:

$V (s_{t}) = \mathbb{E} [R_{terminal} ∣ S_{t} = s_{t}]$

Because $s_{t}$ contains the player position matrix $P_{t}$ and the velocity matrix $V_{t}$ , $V (s_{t})$ does not just learn to evaluate the quality of the previous actions or where the ball is. It learns to evaluate the whole positional configuration of the game.
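The simplest possible estimator of this expectation is a tabular average: since intermediate rewards are zero and $γ = 1.0$ , every state visited in a chain has that chain's terminal reward as its return. The sketch below does exactly that (every-visit Monte Carlo averaging). It is not a full offline RL method, and it assumes states have already been reduced to hashable keys (the zone-style labels in the usage note are made up). With the real state, which includes 36 player positions and velocities, exact repeats would essentially never occur, so a function approximator would be needed instead of a lookup table.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Sequence, Tuple


def estimate_state_values(
    chains: Iterable[Tuple[Sequence[Hashable], float]]
) -> Dict[Hashable, float]:
    """Average terminal reward over every visit to each state key.

    chains: one (state_keys, terminal_reward) pair per possession chain.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for state_keys, terminal_reward in chains:
        for key in state_keys:
            # gamma = 1 and zero intermediate rewards: return == terminal reward.
            total[key] += terminal_reward
            count[key] += 1
    return {key: total[key] / count[key] for key in total}
```

For example, `estimate_state_values([(["A", "B"], 6.0), (["A", "C"], 0.0), (["B"], -1.0)])` gives `{"A": 3.0, "B": 2.5, "C": 0.0}`.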

2.2 The Sparse Event Action-Value Function $Q (s_{t}, E_{t})$

The Action-Value Function $Q (s_{t}, E_{t})$ defines the expected terminal reward of entering state $s_{t}$ and observing the sparse event set $E_{t}$ . It resolves under two distinct operational pathways based on our event-driven architecture:

$Q (s_{t}, E_{t}) = \sum_{s_{t + 1} \in S} P (s_{t + 1} ∣ s_{t}, E_{t}) V (s_{t + 1})$

Pathway A: Passive System Evolution ( $E_{t} = \emptyset$ )

When no technical event occurs, the $Q$ -value measures the expected progression of the play based purely on physical momentum and tracking trajectories:

$Q (s_{t}, \emptyset) = \mathbb{E} [V (s_{t + 1}) ∣ s_{t}]$

Pathway B: Active Event Disruptions ( $E_{t} = \{(i, κ, μ)\}$ )

When an explicit event $κ$ is executed by player $i$ (e.g., a kick or a spoil), the transition probability shifts non-linearly. The $Q$ -value captures the expected value of the state immediately after the event’s physical resolution:

$Q (s_{t}, E_{t}) = \sum_{s_{t + 1} \in S} P (s_{t + 1} ∣ s_{t}, i, κ, μ) V (s_{t + 1})$
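Both pathways are the same probability-weighted sum over next states; only the transition distribution being conditioned on changes. A minimal sketch, assuming a small discrete set of next states and made-up numbers (the state labels, probabilities and values in the usage note are purely illustrative, and `q_value` is a hypothetical name):

```python
from typing import Dict, Hashable


def q_value(transition_probs: Dict[Hashable, float],
            V: Dict[Hashable, float],
            tol: float = 1e-9) -> float:
    """Expected next-state value: sum over s' of P(s' | s, E) * V(s')."""
    total_p = sum(transition_probs.values())
    if abs(total_p - 1.0) > tol:
        raise ValueError(f"transition probabilities sum to {total_p}, not 1")
    return sum(p * V[s_next] for s_next, p in transition_probs.items())
```

Usage, with illustrative values only:

```python
V = {"mark_inside_50": 3.2, "contest": 0.4, "turnover": -1.5}
q_kick = q_value({"mark_inside_50": 0.5, "contest": 0.3, "turnover": 0.2}, V)
q_passive = q_value({"mark_inside_50": 0.25, "contest": 0.5, "turnover": 0.25}, V)
```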

3. The Counterfactual Causal Layer

To isolate an individual player’s execution and decision-making from structural team bias, I propose a counterfactual causal layer based on Structural Causal Models (SCMs). By conditioning our calculations on the multi-agent position and velocity matrices ( $P_{t}, V_{t}$ ), the state is intended to act as a backdoor adjustment set, neutralizing the confounding effects of system quality. This only holds to the extent that the state captures every factor that influences both the chosen event and the outcome.

3.1 Realized Action vs. Passive Counterfactual (Execution Value)

We evaluate the structural value of a physical intervention by comparing the observed active event $E_{t}$ against the hypothetical scenario where the player elected not to intervene ( $E_{t} = \emptyset$ , preserving passive physical tracking trajectory):

$α (s_{t}, E_{t}) = Q (s_{t}, E_{t}) - Q (s_{t}, \emptyset)$

This metric isolates pure physical execution capability. It answers: Did the mechanical execution of this kick, handball, or tackle improve our field state relative to simply continuing to run or hold the ball?
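As a worked illustration with made-up numbers (not estimates from any data): suppose a kick from state $s_{t}$ has $Q (s_{t}, E_{t}) = 1.42$ , while letting play continue without an event has $Q (s_{t}, \emptyset) = 0.625$ . Then:

$α (s_{t}, E_{t}) = 1.42 - 0.625 = 0.795$

A positive $α$ means the event raised the expected terminal reward by about 0.8 points relative to doing nothing. A negative $α$ would mean the player's side would have been better off, in expectation, had play simply continued.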

3.2 Observed Choice vs. Optimal Counterfactual (Decision Quality)

To evaluate a player’s field vision and cognitive execution under chaotic pressure, we map the observed event against the maximum potential value of all counterfactual alternative valid events $E_{t}^{'}$ contained in the local action envelope $E (s_{t})$ :

$\text{Decision Regret} (R_{t}) = Q (s_{t}, E_{observed}) - \max_{E^{'} \in E (s_{t})} Q (s_{t}, E^{'})$

Where $E (s_{t})$ is calculated geometrically by casting ray-traces through the player position matrix $P_{t}$ to identify open, un-interceptable passing lanes.
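A very rough geometric version of this lane check is sketched below. It treats each potential pass as a straight line segment from the passer to a teammate and calls the lane open if no opponent stands within a fixed distance of that segment. The `intercept_radius` parameter and the function names are hypothetical, and the sketch ignores player velocities, ball flight time and kick height, all of which a real implementation would need.

```python
import math
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]


def point_segment_distance(p: Vec2, a: Vec2, b: Vec2) -> float:
    """Shortest distance from point p to the segment a-b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: passer and target coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, then clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    cx, cy = ax + t * dx, ay + t * dy
    return math.hypot(px - cx, py - cy)


def open_lanes(passer: Vec2,
               teammates: Sequence[Vec2],
               opponents: Sequence[Vec2],
               intercept_radius: float = 3.0) -> List[int]:
    """Indices of teammates reachable by an unobstructed straight pass."""
    lanes = []
    for idx, mate in enumerate(teammates):
        blocked = any(
            point_segment_distance(opp, passer, mate) <= intercept_radius
            for opp in opponents
        )
        if not blocked:
            lanes.append(idx)
    return lanes
```

Note that under this simplification an opponent standing right next to the passer blocks every lane, which is arguably closer to "under tackle pressure" than to "lane intercepted"; separating those two cases is a design choice left open here.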

A player who consistently maintains a Decision Regret near $0.0$ is executing optimal spatial choices under pressure, regardless of whether their raw disposal count is high or low.
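Given $Q$ -values for each valid alternative, the regret itself is a one-line calculation. In the sketch below, `decision_regret` is a hypothetical name and the event labels and values in the usage note are invented for illustration. As long as the observed event is one of the valid alternatives, the result is never positive: $0.0$ means the player picked the best available option.

```python
from typing import Dict, Hashable


def decision_regret(q_by_event: Dict[Hashable, float], observed: Hashable) -> float:
    """Q of the observed event minus the best Q among valid alternatives."""
    if observed not in q_by_event:
        raise KeyError(f"observed event {observed!r} is not in the action envelope")
    return q_by_event[observed] - max(q_by_event.values())
```

Usage, with illustrative values only:

```python
q = {"kick_to_A": 1.42, "handball_to_B": 0.9, "hold": 0.3}
best_choice = decision_regret(q, "kick_to_A")       # 0.0
worse_choice = decision_regret(q, "handball_to_B")  # roughly -0.52
```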

3.3 Observed Player Action vs Average Player Action

I have some intuition that I haven’t formalised here. You should be able to ask a question like “How important was player x to state transitions $G$ ?”, where $G$ is some set of state transitions leading to a favourable position or score. If you replaced the player with some average player and asked what would have happened, it should tell you how valuable player x was w.r.t. $G$ . I don’t have a formalism for this yet, but if anyone wants to help me out that would be awesome.

4. Limitations

There are several key limitations to this approach that come from the underlying structure of invasion sport. Firstly, how do you get $P_{t}$ and $V_{t}$ ? It’s easy to build a model on these concepts but actually having data for the coordinates of each player across a game is non-trivial. I imagine only AFL clubs have access to this kind of data. Meaning that my modelling idea is kind of stuck in theory land.

Secondly, in causal analysis you really want action distributions for each player given each state. This would create a massive sparsity issue. So in my model what ends up happening is that $P (s_{t + 1} ∣ ...)$ just amalgamates the spatial decision making of all players into one distribution, which simplifies the maths but also misses the mark somewhat. That said, you can kind of kick the can down the road: once we have the universal baseline of what action we would expect to see for a given $s_{t}$ , you can measure how much a player deviates from that baseline in ways that lead to an increased $V (s_{t + 1})$ .

Tally Analytics