ARENIL.COM: Theory and Application
- Introduction
- Prediction versus Expectation
- Types of Predictions
- Prediction and Risk
- Quantitative Measures
- Variance: Typical Error Magnitude
- Skewness: Directional Risk
- Kurtosis: Extreme Risk Exposure
- Normal Distribution as a Constrained Benchmark for Uncertainty
- Moments
- Simulation
- Notes
- Student's t Distribution as an Extension for Controlling Tail Risk
- Mean–Std Parametrizations
- Fat Tails
- Simulation
- Notes
- Skew-Normal Distribution as an Extension for Controlling Directional Risk
- Mean–Std Representation
- Median–Std Parameterization
- Directional Risk
- Simulation
- SGT Distribution for Controlling Tail Risk and Directional Risk
- Shape
- Simulation
- Mixture Distribution as a Linear Opinion Pool
- Gaussian Mixture Distribution for Controlling Complex Uncertainty
- Quantile-based Interpolated Distribution for Approximation
- Qualitative Prediction Framework
- Dispersion Statements and Normal Distribution
- Stated Probability Interval
- Expression as Percentage Error
- Qualitative Levels
- Fat Tail Statements and t Distribution
- Step 1: Calibration
- Step 2: Estimating Dispersion
- Directional Imbalance and Skew-Normal Distribution
- Step 1: Calibration
- Step 2: Estimating Dispersion
- SGT Distribution
- Step 1: Calibration
- Step 2: Estimating Dispersion
- Extended Qualitative Prediction Framework
- Predictor's Data
- Estimation
- CRPS
- Notation
- Propriety
- Comparison
- Alternative Formulas
- CRPS via CDFs (1)
- CRPS via CDFs (2)
- CRPS via Expectations
- Sharpness: $E|X - y|$
- Dispersion Penalty: $E|X - X'|$
- CRPS for Distributions
- Equivariance of CRPS under Linear Transformations
- CRPS for Normal Distribution
- CRPS for Student-t Distribution
- CRPS for Gaussian Mixture Distribution
- CRPS for Uniform Distribution
- CRPS for Quantile-based Interpolated Distribution
- CRPS via Numerical Integration
- Directional Predictive Distributions
- Mixture of Directional Distributions
- Brier Score
- Designing Fair and Honest Competitions
- Combining Multiple Predictions from a Single Predictor
- From Betting to an Information-Revealing Framework
- Cross-Contest Normalization: The Skill Score
- The Choice of Baseline: Context-Aware Intelligence
- Aggregating Scores and Designing a Fair Market of Attention
- Averaging across horizons
- Compressing the negative tail
- Aggregating across competitions
- Aggregating three metrics
1 Introduction
In this document, we explain the three main types of predictions in our system, how our scoring system operates based on them, and how these predictions are combined. This is a technical document intended for readers familiar with statistics. We also use generative AI for tasks such as text editing.
This introduction provides a general discussion of our underlying assumptions.
1.1 Prediction versus Expectation
At Arenil.com, a prediction (also referred to as a forecast or expectation) can reflect either the output of a statistical model or a user’s subjective belief about a future value. While these sources differ (models rely on data and algorithms, whereas users draw on intuition or domain knowledge) the way we represent their expectations is unified. Whether the forecast comes from a model or a person, it can be expressed as a single value (a point prediction), a full probability distribution (to reflect uncertainty), or a simpler directional statement (e.g., the probability of an upward or downward movement). This consistent format allows our system to treat both types of inputs equally.
At Arenil.com, we aim to provide flexible tools that enable users to express their subjective expectations easily. Consistent with the ideas discussed above, our goal is to ensure that users can communicate their views in intuitive and natural ways while still grounding those expressions in rigorous statistical interpretation.
For clarity and simplicity, we use the terms prediction, forecast, and expectation interchangeably throughout this document to refer to any expression about a future outcome.
1.2 Types of Predictions
There are three main types of predictions in our system:
- Point Prediction: A single numerical value that represents the forecasted outcome.
- Distribution Prediction: Describes the forecast as a full probability distribution, capturing both the central tendency (such as the median or mean) and the uncertainty (variability) around the outcome.
- Directional Prediction: Specifies the probability that the outcome will move in a certain direction, either upward or downward.
In addition to the primary prediction types, we also derive related information automatically:
Directional Prediction from a Point Prediction: We infer the direction by comparing the predicted value to the most recent known value.
- If the predicted value is greater than the last known value, we assign a probability of 1 to an upward movement.
- If it is lower, we assign a probability of 1 to a downward movement.
- If it is equal, we assign full probability to no change.
Point Prediction from a Distribution Prediction: We use the mean of the distribution as the representative point prediction.
Directional Prediction from a Distribution Prediction: Once the mean is computed, we infer the direction using the same rule as for point predictions, by comparing the mean to the last known value.
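The derivation rules above can be sketched in code. This is a minimal illustration; the function names are ours, not part of the system's API:

```python
def directional_from_point(predicted: float, last_known: float) -> dict:
    """Derive a directional prediction from a point prediction by
    comparing the predicted value to the last known value."""
    if predicted > last_known:
        return {"up": 1.0, "down": 0.0, "no_change": 0.0}
    if predicted < last_known:
        return {"up": 0.0, "down": 1.0, "no_change": 0.0}
    return {"up": 0.0, "down": 0.0, "no_change": 1.0}

def point_from_distribution(mean: float) -> float:
    """The representative point prediction is the distribution mean."""
    return mean

def directional_from_distribution(mean: float, last_known: float) -> dict:
    """Apply the point-prediction rule to the distribution mean."""
    return directional_from_point(point_from_distribution(mean), last_known)
```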
2 Prediction and Risk
When we describe a future value \(X\) by a single number \(x\), we provide a point prediction. This expresses a central belief and contains no information about risk/uncertainty. In contrast, specifying a full probability distribution for \(X\) expresses structured beliefs about the shape of possible deviations from the center. In such cases, we are expressing a risk/uncertainty profile which captures multiple, conceptually distinct features: dispersion, fat tails, and skewness. These concepts describe different ways in which outcomes can depart from the central expectation. They are not merely technical properties of distributions; they correspond to different qualitative patterns of uncertainty about the future.
Dispersion refers to the overall spread of possible outcomes around the central value. It captures how far outcomes may move away from the center in typical circumstances. When dispersion is small, most realizations of \(X\) are expected to lie relatively close to the predicted center. The uncertainty exists, but it is limited. When dispersion is large, we expect outcomes to vary widely.
Skewness describes the asymmetry of potential deviations. A symmetric uncertainty profile implies that upward and downward deviations from the center are similarly likely and of comparable magnitude. With skewness, however, the balance changes. A positively skewed profile suggests that large upward surprises are more plausible than large downward ones. A negatively skewed profile indicates the opposite.
Fat tails concern the likelihood of extreme outcomes. Some uncertainty profiles imply that very large deviations from the center are exceedingly rare. Others assign a meaningful probability to such extreme events. When the tails are “fat” the distribution places relatively more weight on outcomes far from the center, meaning that rare but dramatic deviations occur more frequently than we would expect under thin-tailed uncertainty.
Note: Richer distributional assumptions provide greater expressive power, but at a cost. More flexible distributions introduce additional parameters (e.g., skewness, tail thickness, or mixture weights), which must be estimated from data or specified subjectively. This increases the risk of estimation error, overfitting, and model instability, particularly when data is limited. Importantly, incorporating poorly supported structure can be more harmful than omitting it: a misspecified feature (e.g., incorrect skewness or tail behavior) may systematically distort inference and decision-making. Simpler distributions (such as the normal distribution) impose stronger structure, but are often more robust to limited information. Thus, adopting richer assumptions requires not only more information, but sufficiently reliable information to justify the added complexity.
The discussion above implicitly focuses on continuous, single-mode predictive distributions, where uncertainty is naturally interpreted as deviations around a central value. In that setting, variance captures overall dispersion, fat tails describe how probability mass is allocated to extreme outcomes, and skewness reflects directional asymmetry. However, richer forms of uncertainty (such as truncation, discrete states, regime shifts, or multi-modal structures) introduce structural risk that is not fully characterized by variance, tails, or skewness alone. In such cases, the mean may be unrepresentative, tail behavior may reflect separate regimes rather than gradual decay, and dispersion may arise from state changes rather than local variability.
Richer frameworks such as multi-modal distributions require substantially stronger assumptions: one must identify distinct states, specify transition mechanisms, and estimate parameters within each regime. In many settings, the available information is insufficient to support this level of structure reliably. Moreover, imposing an incorrect structure (such as spurious regimes or misidentified states) can be more damaging than working with a simpler, single-regime approximation, as it may lead to systematically misleading conclusions about risk. For this reason, the present framework focuses on shape-based uncertainty within a single regime, prioritizing robustness and interpretability over potentially fragile structural detail.
Note: Distributions without a finite mean (such as the t-distribution with \(\nu \le 1\)) do not provide a well-defined notion of expected value. Without this anchor, standard interpretations of a “prediction” become problematic, and many decision rules based on expectations are no longer applicable. While such distributions can be useful for modeling extreme or highly unstable phenomena, using them as predictive models of future values risks introducing ambiguity rather than clarity. In particular, replacing a well-defined (even if imperfect) central estimate with an ill-defined one may degrade decision quality more than it improves realism. For this reason, we restrict attention to distributions with a finite mean, ensuring that predictions remain interpretable and operational.
Note: In addition to full predictive distributions over continuous outcomes, an alternative modeling approach focuses on discrete outcomes, where the target variable is discretized into a finite set of mutually exclusive states. Predicting direction as a special case of this approach will be discussed in a separate section.
2.1 Quantitative Measures
Suppose the outcome of interest is \(X\), and denote its expected value by \(\mu = \mathbb{E}[X]\). Note that we fundamentally assume that this value exists and is finite. Deviations from the center are then given by \(X-\mu\). Many important characteristics of the distribution can be expressed through expectations of functions of these deviations. In general, many properties of the distribution can be written in the form
\begin{equation} \mathbb{E}[f(X-\mu)] \end{equation}
for some function \(f(\cdot)\). Different choices of the function \(f\) highlight different aspects of the distribution. For example, \(f(e)=|e|\) measures the average absolute error and higher powers of \(e\) place increasing emphasis on larger deviations.
In practice, many widely used measures correspond to expectations of polynomial functions of the deviation, that is, powers of \(X-\mu\). The quantities \(\mathbb{E}[(X-\mu)^k]\) are called the central moments of the distribution. In contrast, \(\mathbb{E}[X^k]\) are referred to as raw moments. In what follows, we focus on central moments, as they directly describe deviations from the predictive center.
Moments arise naturally because they describe how a distribution behaves locally around its mean, serving as the fundamental building blocks in approximations of random variables. The use of powers such as \((X-\mu)^2\), \((X-\mu)^3\), and \((X-\mu)^4\) is not arbitrary. Polynomial functions arise naturally in probability theory because many analytical tools rely on expansions around the mean.
When a function of a random variable is approximated using a Taylor expansion around the mean \(\mu\), we obtain
\begin{equation} f(X) \approx f(\mu) + f'(\mu)(X-\mu)+\frac{1}{2}f''(\mu)(X-\mu)^2+\cdots . \end{equation}
Taking expectations of this expansion produces expressions involving
\begin{equation} \mathbb{E}[(X-\mu)^k]. \end{equation}
These expectations are precisely the moments of the distribution. Thus powers of the deviation arise naturally whenever we approximate or analyze functions of random variables. This mathematical structure explains why moments occupy a central role in probability and statistics.
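As a small worked illustration of how moments enter such approximations, consider the second-order Taylor approximation of \(\mathbb{E}[f(X)]\) for the hypothetical choice \(f(x)=e^x\) with Normal \(X\), where the exact value happens to be known in closed form:

```python
import math

# Second-order Taylor approximation of E[f(X)] around the mean:
#   E[f(X)] ≈ f(mu) + (1/2) f''(mu) Var(X),   since E[X - mu] = 0.
mu, sigma = 0.1, 0.2

# Hypothetical choice f(x) = exp(x), for which f''(x) = exp(x).
taylor_approx = math.exp(mu) * (1.0 + 0.5 * sigma**2)

# For Normal X the exact value is the lognormal mean exp(mu + sigma^2/2).
exact = math.exp(mu + 0.5 * sigma**2)

print(taylor_approx, exact)  # the two agree to second order in sigma
```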
The first few moments correspond closely to intuitive geometric features of a distribution. Because these moments depend on the scale of the variable, they are typically standardized by appropriate powers of the standard deviation. Standardization removes the influence of units and isolates the structural properties of the distribution. In particular, the standardized third and fourth moments measure asymmetry and tail heaviness independently of the overall magnitude of variability.
A probability distribution contains infinitely many details, and no finite set of statistics can capture it completely. Moments therefore serve as a structured hierarchy of probes into the shape of the distribution. Each successive power of the deviation places progressively more weight on larger errors, revealing deeper aspects of the uncertainty profile.
In this hierarchy, variance, skewness, and kurtosis provide compact summaries of the most important qualitative features of predictive uncertainty: the overall scale of typical deviations, the directional balance of surprises, and the exposure to extreme outcomes.
Although moments are widely used, they are not the only possible measures of distributional shape. Alternative statistics (such as mean absolute deviation, quantile-based measures, or tail indices) can also describe dispersion, asymmetry, or tail risk.
On the other hand, moment-based measures possess powerful mathematical properties. Expectations are linear, variances combine neatly for independent variables, and moments are closely connected to generating functions and analytical approximations. These properties make moments particularly convenient for theoretical work and statistical modeling, which explains their central role in statistical theory.
A limitation of moment-based descriptions is that they require the corresponding expectations to exist. In heavy-tailed settings, higher-order moments and in some cases even variance may be infinite. In such cases, moment-based summaries become incomplete or unavailable, and alternative measures of uncertainty may be required.
2.1.1 Variance: Typical Error Magnitude
The most familiar moment-based measure is variance, which summarizes the overall dispersion of outcomes around the central prediction. It is defined as
\begin{equation} \mathrm{Var}(X) = \mathbb{E}\big[(X-\mathbb{E}[X])^2\big]. \end{equation}
Variance measures the average squared magnitude of prediction errors. Because deviations are squared, larger errors receive disproportionately more weight, making variance particularly sensitive to large deviations from the mean. The square root of variance, known as the standard deviation, provides a measure of dispersion in the original units of the variable.
Squaring serves two purposes: it prevents positive and negative errors from canceling each other, and it gives greater weight to larger deviations. A larger variance therefore implies that realized outcomes tend to lie further away from the central prediction, reflecting a higher level of overall uncertainty.
2.1.2 Skewness: Directional Risk
While variance describes the overall magnitude of deviations, it does not indicate whether deviations tend to occur more often on one side of the distribution. This directional imbalance can be captured by skewness, which is based on the third moment of the distribution. It is defined as
\begin{equation} \gamma_1 = \mathbb{E}[(X-\mu)^3] / \sigma^3 \end{equation}
The key idea is that cubic deviations preserve their sign: positive deviations remain positive and negative deviations remain negative. As a result, the expectation of the cube of deviations reveals whether large positive or negative errors dominate. In this way, skewness measures asymmetry in the distribution of forecast errors. A symmetric distribution assigns equal probability to positive and negative deviations of the same size. A skewed distribution reflects directional risk, meaning that large positive errors may be more plausible than large negative ones, or vice versa.
2.1.3 Kurtosis: Extreme Risk Exposure
Another important feature of uncertainty concerns the likelihood of extreme prediction errors. As noted earlier, two distributions can have the same variance yet differ in how probability is allocated across the range of outcomes. One may concentrate most probability near the center, while another places relatively more probability in the extremes.
This feature is commonly described as fat tails. Fat tails indicate that very large deviations occur more frequently than would be expected under thin-tailed distributions.
The moment traditionally used to capture this sensitivity is the fourth central moment. Because the fourth power increases rapidly with the size of deviations, extreme observations contribute disproportionately to this expectation. The standardized version,
\begin{equation} \gamma_2 = \mathbb{E}[(X-\mu)^4] / \sigma^4 \end{equation}
is known as kurtosis.
Kurtosis is often interpreted as a measure of tail heaviness, although it also reflects how probability mass is distributed near the center. Importantly, in sufficiently heavy-tailed distributions, the fourth moment may be infinite, in which case kurtosis is not defined.
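The three standardized summaries above can be estimated directly from data. A minimal sketch, assuming NumPy is available, checked on a Normal sample where the targets are variance \(\sigma^2\), skewness \(0\), and kurtosis \(3\):

```python
import numpy as np

def central_moments(x):
    """Sample variance, skewness, and kurtosis as standardized central moments."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = x.std()                         # population standard deviation
    variance = sigma**2
    skewness = np.mean((x - mu)**3) / sigma**3
    kurtosis = np.mean((x - mu)**4) / sigma**4
    return variance, skewness, kurtosis

rng = np.random.default_rng(0)
var, skew, kurt = central_moments(rng.normal(0.0, 2.0, size=200_000))
# For a Normal sample we expect roughly var ≈ 4, skew ≈ 0, kurt ≈ 3.
```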
2.2 Normal Distribution as a Constrained Benchmark for Uncertainty
If we assume
\begin{equation} X \sim \mathcal{N}(\mu,\sigma^2),\quad \sigma>0, \end{equation}
then an important property is
\begin{equation} \mathbb{E}[X]=\mu, \qquad \mathrm{Var}(X)=\sigma^2, \end{equation}
which means the parameter \(\mu\) represents the central expectation (point forecast), and \(\sigma^2\) represents overall dispersion. Let
\begin{equation} z = \frac{x-\mu}{\sigma}. \end{equation}
The *standard Normal* random variable $Z \sim \mathcal{N}(0,1)$ has the following pdf and cdf: \begin{equation} \phi(z)= \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{z^2}{2}\right), \qquad \Phi(z)=\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} \exp\!\left(-\frac{t^2}{2}\right)dt . \end{equation}
which means:
\begin{equation} f(x)= \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right), \qquad F(x)= \Phi\!\left( \frac{x-\mu}{\sigma} \right). \end{equation}
Under the Normal distribution:
- The distribution is perfectly symmetric around \(\mu\).
- Tail thickness is fixed (thin, Gaussian tails).
- All uncertainty is fully summarized by a single parameter \(\sigma\).
2.2.1 Moments
A key reason for the central role of the Normal distribution is that it is fully characterized by its first two moments: the mean and the variance. All higher-order moments are fixed functions of these two parameters, implying that features such as skewness and kurtosis are not free to vary. In this sense, the Normal distribution represents a highly constrained model of uncertainty, where dispersion is the only adjustable dimension.
More generally, it is the maximum entropy distribution among all distributions with a given mean and variance, meaning it introduces no additional structure beyond these constraints. This makes it a natural baseline when only limited information is available.
The prominence of the Normal distribution is also supported by the Central Limit Theorem, which states that sums of many small, independent shocks tend to be approximately Normal, regardless of the underlying distributions. This provides a structural justification for its use in modeling aggregate uncertainty.
In the Normal case, skewness is zero and excess kurtosis is zero, reflecting perfect symmetry and benchmark (Gaussian) tail behavior. Relative to the broader framework of uncertainty, the Normal distribution effectively collapses the multi-dimensional structure of risk (dispersion, skewness, and tail behavior) into a single parameter. Assuming normality means variance is the only qualitative dimension of uncertainty. Since symmetry and tail behavior are fixed, changing \(\sigma\) only stretches or compresses the distribution. No other qualitative feature of risk changes.
The Normal distribution can be interpreted as a minimum-structure model of uncertainty: given only information about the mean and variance, it imposes symmetry and thin tails, ruling out directional asymmetry and extreme events beyond what is implied by Gaussian decay. As such, it serves as a natural baseline against which richer forms of uncertainty—such as skewness or fat tails—can be identified and measured.
Note that interpreting skewness and tail thickness requires a reference point. For skewness, the benchmark is symmetry: a skewness value of zero indicates that positive and negative deviations of equal size occur with equal probability.
For tail thickness, the standard benchmark is the Normal distribution, whose tail behavior defines the conventional baseline. This comparison is often formalized using excess kurtosis, which measures tail heaviness relative to the Normal case. A value of zero indicates Normal tail thickness, positive values indicate heavier (fat) tails with greater probability of extreme deviations, and negative values indicate thinner tails.
A key reason the Normal distribution plays such a central role is that its tail behavior is unusually thin compared with many real-world processes. This thinness implies that the probability of very large deviations from the mean declines extremely quickly as we move away from the center. In fact, tail probabilities decay extremely rapidly (at a Gaussian rate proportional to \(e^{-x^2/2}\)), which is faster than standard exponential decay.
Table: Frequency of Extreme Events (\(P(\vert X-\mu\vert > k\sigma)\))
| \(k\) | Ratio |
|---|---|
| 1 | \(\frac{3}{10}\) |
| 2 | \(\frac{5}{100}\) |
| 3 | \(\frac{3}{1000}\) |
| 4 | \(\frac{63}{10^{6}}\) |
| 5 | \(\frac{1}{10^{6}}\) |
| 6 | \(\frac{2}{10^{9}}\) |
Events beyond \(4\sigma\) are already extremely rare, occurring roughly 63 times per million observations. Beyond \(6\sigma\), they become extraordinarily unlikely, with only about 2 occurrences per billion observations under the Normal distribution.
This rapid decay of tail probabilities is what is meant by thin Gaussian tails. The Normal distribution effectively rules out very large deviations as practically impossible. Many real-world phenomena (particularly financial returns, insurance losses, or environmental shocks) exhibit more frequent extreme events than the Normal distribution would predict. In such cases the distribution is said to have fat tails, meaning that large deviations occur far more often than would be implied by the Gaussian benchmark. For instance, in financial markets large daily price movements that would correspond to “six-sigma events” under a Normal distribution have historically occurred many times. If the Normal distribution were literally correct, such events should be virtually impossible within any realistic sample size. Their repeated occurrence therefore signals that the true distribution has heavier tails than the Normal benchmark.
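The frequencies in the table can be reproduced numerically. A small sketch using the complementary error function, since \(P(|X-\mu|>k\sigma)=\mathrm{erfc}(k/\sqrt{2})\) under normality:

```python
import math

def normal_tail_prob(k: float) -> float:
    """P(|X - mu| > k*sigma) under the Normal distribution."""
    return math.erfc(k / math.sqrt(2.0))

for k in range(1, 7):
    print(k, normal_tail_prob(k))
# k=4 gives about 6.3e-05 (roughly 63 per million),
# k=6 about 2e-09 (roughly 2 per billion).
```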
2.2.2 Simulation
The plotted distributions illustrate how uncertainty changes when only variance varies while other structural features of the distribution remain fixed. All curves represent Normal distributions with the same mean, implying the same central prediction. Because the Normal distribution is inherently symmetric and has a fixed tail shape, all three cases share identical skewness (zero) and identical kurtosis. As a result, the qualitative structure of risk is unchanged: positive and negative deviations of equal size remain equally likely, and the probability of extreme outcomes declines at the same rate in all cases.
2.2.3 Notes
Note: The integral
\begin{equation} \int e^{-t^2/2} dt \end{equation}
cannot be written in terms of elementary functions. This result follows from Liouville's theorem in differential algebra, which classifies when integrals have elementary antiderivatives. Because of this, the normal CDF is defined as a special function. Instead of working directly with \(\Phi(z)\), mathematics introduces the error function:
\begin{equation} \mathrm{erf}(x)= \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2} dt \end{equation}
The normal CDF can be written using this function:
\begin{equation} \Phi(z)= \frac{1}{2} \left[ 1+\mathrm{erf}\!\left(\frac{z}{\sqrt{2}}\right) \right]. \end{equation}
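A minimal sketch of this identity in code, using the error function from the standard library:

```python
import math

def Phi(z: float) -> float:
    """Standard Normal CDF via the error function:
    Phi(z) = (1/2) * (1 + erf(z / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Sanity checks: Phi(0) = 1/2 and Phi(-z) = 1 - Phi(z) by symmetry.
print(Phi(0.0), Phi(1.96))  # Phi(1.96) ≈ 0.975
```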
2.3 Student's t Distribution as an Extension for Controlling Tail Risk
Let \(T_\nu\) denote the standard Student’s t distribution with \(\nu\) degrees of freedom. The canonical location–scale representation is
\begin{equation} X = l + s T_\nu . \end{equation}
Let
\begin{equation} z = \frac{x-l}{s}. \end{equation}
The pdf and CDF of the standardized variable \(T_\nu\) are
\begin{equation} t_\nu(z)= \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)} {\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left( 1+\frac{z^2}{\nu} \right)^{-\frac{\nu+1}{2}}, \qquad T_\nu(z)= \int_{-\infty}^{z} t_\nu(u)\,du . \end{equation}
An equivalent closed form uses the regularized incomplete Beta function
\begin{equation} T_\nu(z)= \frac{1}{2} + \frac{z}{|z|}\,\frac{1}{2} \left[ 1 - I_{\frac{\nu}{z^2+\nu}} \!\left(\frac{\nu}{2},\frac{1}{2}\right) \right], \qquad z \neq 0, \end{equation}
with \(T_\nu(0)=\tfrac{1}{2}\).
For the location–scale Student’s t distribution, as discussed before, we have:
\begin{equation} f_X(x)= \frac{1}{s} t_\nu\!\left(\frac{x-l}{s}\right), \qquad F_X(x)= T_\nu\!\left(\frac{x-l}{s}\right). \end{equation}
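As a sanity check, the incomplete-Beta form of the CDF can be compared against a library implementation. A sketch assuming SciPy is available; `t_cdf` is our illustrative name:

```python
import math
from scipy.special import betainc   # regularized incomplete beta I_x(a, b)
from scipy.stats import t as student_t

def t_cdf(z: float, nu: float) -> float:
    """Student-t CDF via the regularized incomplete Beta function."""
    if z == 0.0:
        return 0.5
    ib = betainc(nu / 2.0, 0.5, nu / (nu + z * z))  # I_{nu/(nu+z^2)}(nu/2, 1/2)
    return 0.5 + math.copysign(0.5, z) * (1.0 - ib)

# Compare with the library implementation at a few points.
for z in (-2.5, -0.3, 0.7, 4.0):
    print(z, t_cdf(z, 5.0), student_t.cdf(z, 5.0))
```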
For \(\nu>1\),
\begin{equation} \mathbb{E}[X] = l. \end{equation}
For \(\nu>2\),
\begin{equation} \mathrm{Var}(X)= s^2 \frac{\nu}{\nu-2}. \end{equation}
Thus, the scale parameter \(s\) is not the standard deviation. Instead,
\begin{equation} \mathrm{Std}(X)= s\sqrt{\frac{\nu}{\nu-2}}. \end{equation}
Higher moments are:
- skewness = 0 for \(\nu>3\)
- excess kurtosis
\begin{equation} \frac{6}{\nu-4} \qquad (\nu>4). \end{equation}
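These moment formulas can be checked against a library implementation. A sketch assuming SciPy is available (its `stats` method reports excess kurtosis directly):

```python
from scipy.stats import t as student_t

nu = 8.0
mean, var, skew, ex_kurt = student_t.stats(nu, moments="mvsk")

# Closed-form moments of the standard t distribution:
print(float(var), nu / (nu - 2))       # variance inflation nu/(nu-2), nu > 2
print(float(skew))                     # 0 for nu > 3
print(float(ex_kurt), 6 / (nu - 4))    # excess kurtosis 6/(nu-4), nu > 4
```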
2.3.1 Mean–Std Parametrizations
Many continuous probability models belong to a location–scale family. A random variable \(X\) belongs to such a family if it can be written as
\begin{equation} X = l + s R, \end{equation}
where
- \(R\) follows a fixed reference (base) distribution,
- \(l \in \mathbb{R}\) is a location parameter, and
- \(s > 0\) is a scale parameter.
The location parameter shifts the distribution horizontally, while the scale parameter stretches or compresses it. If the standardized variable \(R\) has mean \(m_R\) and variance \(v_R\), then
\begin{equation} \mathbb{E}[X] = l + s m_R, \qquad \mathrm{Var}(X) = s^2 v_R. \end{equation}
Thus, in general, the structural parameters \((l,s)\) are not equal to the mean and standard deviation of the distribution. They coincide only if the standardized distribution satisfies \(m_R = 0,\;v_R = 1\). This property holds for the Normal distribution.
Note that we have:
\begin{equation} R=\frac{X - l}{s} \end{equation}
It is similar to another equation we encounter in this context. Recall that given a random variable \(X\) with mean \(\mu\) and standard deviation \(\sigma>0\), the standardized variable is defined as
\begin{equation} Z = \frac{X - \mu}{\sigma}. \end{equation}
This transformation produces a variable with \(\mathbb{E}[Z] = 0, \; \mathrm{Var}(Z) = 1\). Standardization is therefore a data transformation that removes the units of measurement and expresses deviations relative to the standard deviation. The resulting variable is often called a standard score or z-score. Throughout this document, we use the term base/reference distribution instead of standard distribution, to reduce confusion and to highlight the difference between the location–scale and mean–std parameterizations.
For modeling and forecasting it is often preferable to work directly with parameters representing the mean and standard deviation. This approach separates two roles:
- moment parameters \((\mu,\sigma)\), which are easy to interpret and communicate;
- canonical parameters \((l,s)\), which define the mathematical form of the distribution.
We assume that the mathematical formulas for a base distribution \(R\) are known. Let \(f_R(r)\) and \(F_R(r)\) denote the probability density function and cumulative distribution function of the base variable \(R\), and let \(m_R\) and \(v_R\) be its mean and variance (assuming the first and second moments exist and are finite). These may themselves be functions of other parameters (e.g., the degrees of freedom in the t distribution, or the shape parameter in the skew-normal distribution). From the transformation \(X = l + s R\), we can derive the following formulas:
\begin{equation} f_X(x)= \frac{1}{s} f_R\!\left( \frac{x-l}{s} \right). \end{equation}
\begin{equation} F_X(x)= F_R\!\left( \frac{x-l}{s} \right). \end{equation}
\begin{equation} F_X^{-1}(p)= l + s F_R^{-1}(p). \end{equation}
These formulas are expressed in terms of the canonical parameters \((l,s)\). However, since the mean–std parameterization is important for us, we also need the pdf, cdf, etc. as functions of \(\mu\) and \(\sigma\). Because \(\mathbb{E}[X] = \mu\) and \(\mathrm{Var}(X) = \sigma^2\), substitution gives:
\begin{equation} s = \frac{\sigma}{\sqrt{v_R}}, \qquad l = \mu - s m_R. \end{equation}
Thus, any location–scale distribution with finite first two moments can be reparameterized so that location equals the mean and scale equals the standard deviation.
For the t distribution, we have
\begin{equation} l=\mu \end{equation}
\begin{equation} s = \sigma \sqrt{\frac{\nu-2}{\nu}}. \end{equation}
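A minimal sketch of this mean–std-to-canonical mapping for the t distribution (the function name is illustrative):

```python
import math

def t_location_scale(mu: float, sigma: float, nu: float):
    """Map mean-std parameters (mu, sigma) of a Student-t prediction to
    canonical location-scale parameters (l, s); valid for nu > 2."""
    if nu <= 2:
        raise ValueError("a finite standard deviation requires nu > 2")
    s = sigma * math.sqrt((nu - 2) / nu)
    l = mu
    return l, s

l, s = t_location_scale(mu=10.0, sigma=2.0, nu=5.0)
# Round trip: Std(X) = s * sqrt(nu/(nu-2)) recovers sigma.
print(l, s, s * math.sqrt(5.0 / 3.0))
```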
2.3.2 Fat Tails
The presence of fat tails has important implications for decision-making: models based solely on variance may severely underestimate the likelihood of extreme losses. In such settings, tail behavior (not just dispersion) becomes a primary driver of risk.
As \(\nu \to \infty\), the model converges to the Normal distribution. Smaller \(\nu\) implies heavier tails and a higher probability of extreme outcomes. Compared to the Normal distribution:
- the distribution remains symmetric
- tail thickness is controlled by \(\nu\)
- dispersion and tail behavior interact through \(\nu\)
The variance inflation factor
\begin{equation} \frac{\nu}{\nu-2} \end{equation}
shows that when the scale parameter is fixed, smaller \(\nu\) mechanically increases the variance. Thus, in the canonical parametrization, dispersion and tail thickness are not independent: the parameter \(\nu\) affects both the spread and the likelihood of extreme outcomes. In contrast, under the mean–standard deviation parametrization, dispersion is fixed by \(\sigma\), and \(\nu\) controls only the shape of the distribution. This separation allows us to isolate the effect of tail thickness on risk.
In contrast to the Normal distribution, where variance fully characterizes the magnitude of uncertainty, the Student’s t distribution separates typical variation from extreme risk. Two distributions can have the same variance but differ substantially in the probability of extreme outcomes. This highlights that variance alone is not sufficient to describe uncertainty in heavy-tailed settings.
Normal tails decay at a Gaussian rate proportional to \(e^{-x^2/2}\), while t tails decay polynomially as \(x^{-(\nu+1)}\). Polynomial decay is much slower than Gaussian decay, which explains the substantially higher probability of extreme outcomes.
Table: Frequency of Extreme Events (\(P(\vert X-\mu\vert > k\sigma)\))
| \(k\) | \(\begin{aligned}\nu=3\end{aligned}\) | \(\begin{aligned}\nu=5\end{aligned}\) | \(\begin{aligned}\nu=10\end{aligned}\) | \(\begin{aligned}\nu=1000\end{aligned}\) |
|---|---|---|---|---|
| 1 | \(\frac{2}{10}\) | \(\frac{3}{10}\) | \(\frac{3}{10}\) | \(\frac{3}{10}\) |
| 2 | \(\frac{4}{100}\) | \(\frac{5}{100}\) | \(\frac{5}{100}\) | \(\frac{5}{100}\) |
| 3 | \(\frac{1}{100}\) | \(\frac{1}{100}\) | \(\frac{1}{100}\) | \(\frac{3}{1000}\) |
| 4 | \(\frac{1}{100}\) | \(\frac{4}{1000}\) | \(\frac{1}{1000}\) | \(\frac{67}{10^{6}}\) |
| 5 | \(\frac{3}{1000}\) | \(\frac{1}{1000}\) | \(\frac{231}{10^{6}}\) | \(\frac{1}{10^{6}}\) |
| 6 | \(\frac{2}{1000}\) | \(\frac{1}{1000}\) | \(\frac{53}{10^{6}}\) | \(\frac{3}{10^{9}}\) |
With \(\nu = 3\), extreme events are relatively common. Observations beyond \(4\sigma\) occur about once per 100, as the table shows, compared with only 63 per million under the Normal distribution. Even more striking, the probability beyond \(6\sigma\) decreases only slightly, to roughly 2 per 1,000, highlighting the very slow decay of tail probabilities when \(\nu\) is small.
This behavior contrasts sharply with the Normal distribution, where probabilities drop by several orders of magnitude over the same range. For comparison, when \(\nu = 10\), the probabilities are much smaller—about 1 per 1,000 beyond \(4\sigma\) and 53 per million beyond \(6\sigma\)—demonstrating a substantially faster decline in the tails.
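The entries in the table can be reproduced (up to rounding) with a few lines of code. A sketch assuming SciPy; `tail_prob` is our helper name:

```python
import numpy as np
from scipy import stats

def tail_prob(k, nu):
    """P(|X - mu| > k*sigma) for a t distribution standardized to unit variance."""
    # a k-sigma deviation corresponds to k*sqrt(nu/(nu-2)) in canonical t units
    threshold = k * np.sqrt(nu / (nu - 2))
    return 2 * stats.t.sf(threshold, df=nu)

# e.g. with nu = 3, a 5-sigma event has probability about 3 per 1,000
p = tail_prob(5, 3)
```

The same function evaluated across \(k\) and \(\nu\) regenerates the full table.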
Student’s t distribution can therefore be viewed as a minimal extension of the Normal distribution that introduces a single additional degree of freedom controlling tail risk.
2.3.3 Simulation
The first plot compares t distributions with equal scale parameters. As the degrees of freedom decrease, the tails become heavier; however, interpretation is harder because the dispersion increases at the same time. The second plot adjusts the scale parameter so that all distributions have the same variance, which makes the comparison cleaner. Holding the variance constant forces heavier-tailed distributions to compress slightly near the center to offset the additional probability in the extremes. In other words, we expect more mass in the extreme tails, more mass very close to the mean, and less mass in the intermediate region. Note that kurtosis depends on \(\nu\), not on \(\sigma\), and it is therefore the same in the two plots.
When kurtosis increases while variance remains fixed, probability mass is redistributed across three regions: the center, the intermediate region, and the tails. Mass is removed from the intermediate region and reallocated to both the center and the extremes. This produces a sharper peak and heavier tails simultaneously. Thus, kurtosis should be understood not as a direct measure of tail width, but as a measure of the influence of extreme deviations on the distribution.
Note that the peak changes more than the tails. The reason is largely mathematical. If a distribution allows rare very large deviations, their squared magnitude contributes heavily to variance. To keep the overall variance fixed, the distribution must compensate by placing many observations very close to the mean, reducing the typical squared deviation. This produces a sharp central peak.
2.3.4 Notes
Note: Regarding \(\nu \le 1\): as explained before, distributions without a finite mean are excluded.
Note: Regarding \(\nu \le 2\), for the Student-t distribution, the probability density function (PDF) and cumulative distribution function (CDF) are well-defined for all positive degrees of freedom \(\nu > 0\) and can always be plotted. The existence of moments, however, depends on \(\nu\): the mean exists only if \(\nu > 1\), while the variance exists only if \(\nu > 2\). When \(1 < \nu \le 2\), the mean is finite, so the distribution provides a meaningful central value, but the variance is infinite. This does not prevent visualization, yet the PDF decays very slowly for large \(|x|\), and the CDF approaches 0 and 1 gradually. In this case, typical spread cannot be summarized by variance, but the distribution still conveys a valid prediction with a well-defined center and meaningful probabilistic interpretation.
Note: A similar caveat applies to excess kurtosis, which is finite only when \(\nu > 4\). There is an important distinction between what the distribution looks like and what its moments measure. In other words, kurtosis does not measure visual tail width directly; it measures the fourth moment, which heavily amplifies extreme values. Infinite kurtosis means \(\mathbb{E}[(X-\mu)^4] = \infty\): extremely large values occur often enough that the fourth power of the deviations cannot be averaged to a finite number.
Note: The Beta function is
\begin{equation} B(a,b)= \int_0^1 t^{a-1}(1-t)^{b-1} dt. \end{equation}
The incomplete Beta function is
\begin{equation} B_x(a,b)= \int_0^x t^{a-1}(1-t)^{b-1} dt. \end{equation}
The regularized incomplete Beta function normalizes this integral:
\begin{equation} I_x(a,b)=\frac{B_x(a,b)}{B(a,b)}. \end{equation}
Thus \(I_x(a,b)\) is essentially the CDF of a Beta distribution with parameters \(a,b\). This expression is important because the incomplete Beta function has stable numerical algorithms.
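The standard identity linking the Student-t CDF to \(I_x(a,b)\) is, for \(x \ge 0\), \(F_\nu(x) = 1 - \tfrac{1}{2} I_{\nu/(\nu+x^2)}(\nu/2, \tfrac{1}{2})\). A sketch of this identity in SciPy, where `betainc` is the regularized incomplete Beta function (`t_cdf` is our helper name):

```python
from scipy import special, stats

def t_cdf(x, nu):
    """Student-t CDF computed via the regularized incomplete Beta function."""
    z = nu / (nu + x * x)
    half_tail = 0.5 * special.betainc(nu / 2, 0.5, z)  # P(|T| > |x|) / 2
    return 1.0 - half_tail if x >= 0 else half_tail

# matches the library CDF, which uses the same stable Beta-function routines
val = t_cdf(1.3, 5)
```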
2.4 Skew-Normal Distribution as an Extension for Controlling Directional Risk
The Skew-Normal Distribution extends the Normal distribution by introducing a parameter that controls directional asymmetry. Let \(S_{\lambda}\) be the standard skew-normal distribution. Then, the canonical location-scale representation is:
\begin{equation} X = l+sS_{\lambda}, \end{equation}
where \(l \in \mathbb{R}\) is the location parameter, \(s > 0\) is the scale parameter, and \(\lambda \in (-\infty,\infty)\) is the shape parameter. The standard skew-normal distribution \(S_{\lambda}\) has probability density function:
\begin{equation} f_{S_{\lambda}}(x)= 2 \, \phi(x)\, \Phi(\lambda x), \end{equation}
where \(\phi(x)\) and \(\Phi(x)\) denote the standard Normal pdf and CDF, respectively. The cumulative distribution function does not admit a simple closed form, but can be expressed as:
\begin{equation} F_{S_{\lambda}}(x)= 2 \int_{-\infty}^{x} \phi(t)\, \Phi(\lambda t)\, dt. \end{equation}
Equivalently, it can be written using Owen’s T-function:
\begin{equation} F_{S_{\lambda}}(x)= \Phi(x) - 2\, T(x,\lambda). \end{equation}
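The Owen's T representation can be checked directly against a library skew-normal CDF. A minimal sketch assuming SciPy (which exposes `scipy.special.owens_t` and `scipy.stats.skewnorm`); the values of `x` and `lam` are arbitrary:

```python
from scipy import special, stats

lam, x = 2.0, 0.7
# CDF of the standard skew-normal via Owen's T-function
cdf_owen = stats.norm.cdf(x) - 2.0 * special.owens_t(x, lam)
# reference value from the library implementation
cdf_ref = stats.skewnorm.cdf(x, a=lam)
```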
Let
\begin{equation} \delta = \frac{\lambda}{\sqrt{1+\lambda^2}}, \end{equation}
which satisfies \(\delta \in (-1,1)\). The moments can be expressed in terms of the transformed skewness parameter \(\delta\).
\begin{equation} \mathbb{E}[X] = l + s \delta \sqrt{\frac{2}{\pi}} . \end{equation}
\begin{equation} \mathrm{Var}(X) = s^2 \left( 1-\frac{2\delta^2}{\pi} \right). \end{equation}
2.4.1 Mean–Std Representation
The shape parameter \(\lambda\) alters the expectation, variance, and skewness. As explained in the t distribution section, for modeling purposes it is often preferable to work with a mean–standard deviation parameterization, such that \(\mu = \mathbb{E}[X]\), \(\sigma^2 = \mathrm{Var}(X)\), and the parameter \(\lambda\) controls asymmetry. The canonical parameters \((l,s)\) can always be recovered from \((\mu,\sigma,\lambda)\). Using the moment formulas of the skew-normal distribution, we obtain:
\begin{equation} s = \frac{\sigma}{\sqrt{1-\frac{2\delta^2}{\pi}}}, \end{equation}
\begin{equation} l = \mu - s \delta \sqrt{\frac{2}{\pi}} . \end{equation}
Once \((l,s)\) are determined, the standard pdf, CDF, and quantile formulas of the skew-normal distribution can be evaluated in the usual way.
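The recovery of \((l,s)\) from \((\mu,\sigma,\lambda)\) can be sketched as follows, assuming SciPy; `skewnorm_mean_std` is our helper name:

```python
import numpy as np
from scipy import stats

def skewnorm_mean_std(mu, sigma, lam):
    """Frozen skew-normal with mean mu, std sigma, and shape lam."""
    delta = lam / np.sqrt(1 + lam ** 2)
    s = sigma / np.sqrt(1 - 2 * delta ** 2 / np.pi)   # scale from variance formula
    l = mu - s * delta * np.sqrt(2 / np.pi)           # location from mean formula
    return stats.skewnorm(a=lam, loc=l, scale=s)

# example: right-skewed prediction with mean 1 and standard deviation 2
X = skewnorm_mean_std(mu=1.0, sigma=2.0, lam=3.0)
```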
2.4.2 Median–Std Parameterization
In addition to the mean–standard deviation representation, it is sometimes useful to parameterize a distribution in terms of its median and standard deviation. This is particularly natural for skewed distributions, where the mean may be influenced by extreme values and therefore may not represent the central location of the bulk of the distribution. It is also useful in asymmetric settings, where the notion of a typical outcome is better captured by the median than by the mean.
Consider a location–scale representation:
\begin{equation} X = w + s R, \end{equation}
where \(R\) is a base/reference distribution, \(w \in \mathbb{R}\) is a location parameter, and \(s > 0\) is a scale parameter. A key property of location–scale families is that quantiles transform linearly. In particular, the median satisfies:
\begin{equation} \mathrm{median}(X) = w + s \cdot \mathrm{median}(R). \end{equation}
This follows from the fact that monotonic transformations preserve quantiles. Since \(X\) is an affine transformation of \(R\), the ordering of outcomes is preserved, and therefore all quantiles (including the median) shift and scale accordingly.
Let \(S_\lambda\) denote the standard skew-normal distribution, and define the median of the standard skew-normal distribution:
\begin{equation} z_{0.5} = F^{-1}_{S_\lambda}(0.5). \end{equation}
Then, for \(X = l + s S_\lambda\), we obtain:
\begin{equation} \mathrm{median}(X) = l + s z_{0.5}. \end{equation}
We want to express the distribution in terms of a target median \(m\), a target standard deviation \(\sigma\), and a skewness parameter \(\lambda\). Matching the variance gives the same expression for \(s\) as in the mean–std parameterization:
\begin{equation} s = \frac{\sigma}{\sqrt{1 - \frac{2\delta^2}{\pi}}}. \end{equation}
To match the median \(m\), we solve \(m = l + s z_{0.5}\), which gives:
\begin{equation} l = m - s z_{0.5}. \end{equation}
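The median–std construction differs from the mean–std one only in how the location is solved. A sketch assuming SciPy; `skewnorm_median_std` is our helper name:

```python
import numpy as np
from scipy import stats

def skewnorm_median_std(m, sigma, lam):
    """Frozen skew-normal with median m, std sigma, and shape lam."""
    delta = lam / np.sqrt(1 + lam ** 2)
    s = sigma / np.sqrt(1 - 2 * delta ** 2 / np.pi)   # same scale as mean-std case
    z50 = stats.skewnorm.ppf(0.5, a=lam)              # median of standard skew-normal
    return stats.skewnorm(a=lam, loc=m - s * z50, scale=s)

# example: prediction with a typical outcome (median) of 1 and std 2
X = skewnorm_median_std(m=1.0, sigma=2.0, lam=4.0)
```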
Unlike the mean, the median is *robust to extreme values*. As a result, this parameterization separates the *typical outcome* (median) from the *direction and magnitude of extreme deviations* (skewness). In a skewed distribution, the mean and median generally differ: the mean is pulled toward the long tail, while the median remains anchored to the central mass. Thus, the mean–std parameterization emphasizes *expected outcomes*, while the median–std parameterization emphasizes *typical outcomes*.
2.4.3 Directional Risk
The skew-normal distribution complements the Student’s t distribution: while the t distribution introduces tail risk without asymmetry, the skew-normal introduces directional asymmetry without substantially altering tail thickness. Together, these models isolate two fundamentally different dimensions of uncertainty.
An important property is that the skew-normal distribution cannot produce very large kurtosis. The excess kurtosis of the skew-normal distribution is given by:
\begin{equation} \gamma_2=2(\pi - 3) \left( \frac{\frac{2\delta^2}{\pi}}{1 - \frac{2\delta^2}{\pi}} \right)^2. \end{equation}
Since \(\delta \in (-1,1)\), this expression is bounded above. The maximum is attained as \(|\delta| \to 1\), yielding:
\begin{equation} \gamma_2^{\max}\approx 0.869177. \end{equation}
This boundedness implies that the skew-normal distribution cannot represent processes with substantial tail risk. It is therefore unsuitable for modeling phenomena where extreme events occur frequently.
The skewness parameter \(\lambda\) can take any real value. Its sign determines the direction of asymmetry:
- \(\lambda = 0\): symmetric Normal distribution
- \(\lambda > 0\): right-skewed distribution
- \(\lambda < 0\): left-skewed distribution
The magnitude of \(|\lambda|\) determines the strength of the asymmetry, although the relationship between \(\lambda\) and conventional skewness measures is nonlinear. When \(|\lambda|\) is small, the distribution remains close to symmetric. As \(|\lambda|\) increases, probability mass shifts toward one side of the distribution, producing directional imbalance. As \(|\lambda| \to \infty\), the skew-normal distribution converges to a half-normal-type distribution, where most mass lies on one side of the mode and the opposite tail becomes negligible.
Note that in the mean–standard deviation representation, \(\mu\) determines the expected outcome, \(\sigma\) determines the overall dispersion, and \(\lambda\) determines directional risk, that is, whether large deviations are more likely on the positive or negative side.
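The boundedness of both skewness and excess kurtosis (the latter capped near 0.8692, as derived above) can be observed numerically. A sketch assuming SciPy, which reports both moments via `skewnorm.stats`:

```python
from scipy import stats

# Skewness and excess kurtosis of the skew-normal for increasing lambda.
# Both are bounded: skewness stays below ~0.9953 and excess kurtosis
# below ~0.8692, no matter how large lambda becomes.
rows = {lam: stats.skewnorm.stats(lam, moments="sk") for lam in (0, 1, 2, 5, 100)}
```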
Table: Frequency of Extreme Events (\(P(X-\mu < -k\sigma)\) and \(P(X-\mu > k\sigma)\))
| \(k\) | \(\begin{aligned}\lambda=0\end{aligned}\) | \(\begin{aligned}\lambda=1\end{aligned}\) | \(\begin{aligned}\lambda=2\end{aligned}\) | \(\begin{aligned}\lambda=5\end{aligned}\) | \(\begin{aligned}\lambda=10\end{aligned}\) |
|---|---|---|---|---|---|
| 1 | \(\frac{2}{10}\), \(\frac{2}{10}\) | \(\frac{2}{10}\), \(\frac{2}{10}\) | \(\frac{2}{10}\), \(\frac{2}{10}\) | \(\frac{1}{10}\), \(\frac{2}{10}\) | \(\frac{1}{10}\), \(\frac{2}{10}\) |
| 2 | \(\frac{2}{100}\), \(\frac{2}{100}\) | \(\frac{2}{100}\), \(\frac{3}{100}\) | \(\frac{1}{100}\), \(\frac{3}{100}\) | \(\frac{485}{10^{6}}\), \(\frac{4}{100}\) | \(\frac{190}{10^{9}}\), \(\frac{4}{100}\) |
| 3 | \(\frac{1}{1000}\), \(\frac{1}{1000}\) | \(\frac{1}{1000}\), \(\frac{2}{1000}\) | \(\frac{104}{10^{6}}\), \(\frac{5}{1000}\) | \(\frac{0}{10^{9}}\), \(\frac{1}{100}\) | \(\frac{0}{10^{9}}\), \(\frac{1}{100}\) |
| 4 | \(\frac{32}{10^{6}}\), \(\frac{32}{10^{6}}\) | \(\frac{10}{10^{6}}\), \(\frac{110}{10^{6}}\) | \(\frac{118}{10^{9}}\), \(\frac{439}{10^{6}}\) | \(\frac{0}{10^{9}}\), \(\frac{1}{1000}\) | \(\frac{0}{10^{9}}\), \(\frac{1}{1000}\) |
| 5 | \(\frac{287}{10^{9}}\), \(\frac{287}{10^{9}}\) | \(\frac{33}{10^{9}}\), \(\frac{3}{10^{6}}\) | \(\frac{0}{10^{9}}\), \(\frac{25}{10^{6}}\) | \(\frac{0}{10^{9}}\), \(\frac{98}{10^{6}}\) | \(\frac{0}{10^{9}}\), \(\frac{126}{10^{6}}\) |
| 6 | \(\frac{1}{10^{9}}\), \(\frac{1}{10^{9}}\) | \(\frac{0}{10^{9}}\), \(\frac{34}{10^{9}}\) | \(\frac{0}{10^{9}}\), \(\frac{1}{10^{6}}\) | \(\frac{0}{10^{9}}\), \(\frac{6}{10^{6}}\) | \(\frac{0}{10^{9}}\), \(\frac{9}{10^{6}}\) |
With \(\lambda = \pm 10\), extreme events occur with highly asymmetric probabilities across the two tails. For positive skewness, observations below \(-4\sigma\) are essentially negligible, while exceedances above \(4\sigma\) occur with probability around 1 per 1,000 (substantially higher than the 63 per million under the Normal distribution).
This may superficially resemble a fat right tail. However, the effect arises from asymmetric reallocation of probability mass, not from a slower decay of tail probabilities: both tails still decay at an essentially Gaussian rate. Skewness redistributes probability mass from one side of the distribution to the other, making one tail appear more prominent relative to the other, while maintaining relatively light (exponential-type) tail decay. The skew-normal distribution therefore does not produce genuinely heavy tails.
This can be seen in the \(6\sigma\) probabilities: the right-tail probability is about 9 per million, which remains small in absolute terms and decreases rapidly, unlike heavy-tailed distributions such as the \(t\) distribution (about 1 per 1,000 with \(\nu=5\)).
As a result, skewness does not increase the overall likelihood of extreme events; it redistributes where those extreme events occur. In this sense, skewness represents directional risk, while kurtosis represents tail intensity.
2.4.4 Simulation
The first plot shows skew-normal distributions in the canonical location–scale representation, where the location and scale parameters \((l,s)\) are fixed while the skewness parameter \(\lambda\) varies. As skewness increases, the distribution becomes asymmetric: probability mass shifts toward the longer tail while the opposite side compresses. Because the canonical parameters are held constant, skewness also affects the mean and variance. In particular, the expected value moves toward the longer tail and the variance changes. As a result, the curves differ not only in shape but also in their central tendency and dispersion. This representation is mathematically convenient, but it mixes several effects—changes in asymmetry, mean, and variance—making interpretation less transparent.
The second plot uses the mean–standard deviation representation, where the mean \(\mu\) and standard deviation \(\sigma\) are held constant while the skewness parameter varies. With equal mean and variance, the primary visible change is the redistribution of probability mass between the two sides of the distribution. In a skewed distribution, the mean is pulled toward the long tail because it is sensitive to extreme values, whereas the median remains anchored to the central mass of the distribution.
The third plot uses the median–standard deviation representation, where the median and standard deviation are held constant while the skewness parameter varies. As the skewness parameter increases, the right tail becomes longer and the left side becomes more concentrated, but the central point defined by the median remains fixed. In contrast to the mean–standard deviation representation, the mean is no longer fixed.
2.5 SGT Distribution for Controlling Tail Risk and Directional Risk
The Skew Generalized-t (SGT) distribution is a flexible five-parameter family capable of modeling asymmetry and heavy tails simultaneously. The mean–std parametrization is common, and we adopt it here:
\begin{equation} X \sim \mathrm{SGT}(\mu,\sigma,\lambda,p,q), \end{equation}
where \(\mu \in \mathbb{R}\) and \(\sigma>0\) are mean and standard deviation. The parameters \((\lambda,p,q)\) control the shape of the distribution. \(\lambda \in (-1,1)\) is the skewness parameter, \(p>0\) and \(q>0\) are tail shape and decay parameters. The parameter \(q\) primarily controls tail decay, with smaller values producing heavier tails. The parameter \(p\) affects the shape near the center, influencing peak sharpness and the transition from the center to the tails.
The \(r\)-th moment exists only if \(pq > r\). This condition highlights that tail behavior directly affects the existence of moments. In particular, for sufficiently heavy tails (small \(p\) or \(q\)), higher-order moments such as variance or kurtosis may not exist, reflecting extreme sensitivity to large deviations.
2.5.1 Shape
The SGT distribution can be viewed as a minimal extension of the classical models that allows simultaneous control over dispersion, asymmetry, and tail risk. It provides a unified framework that separates the three fundamental dimensions of predictive uncertainty discussed earlier: dispersion (controlled by \(\sigma\)), directional asymmetry (controlled by \(\lambda\)), and tail intensity (controlled by \(q\), with interaction from \(p\)). This allows asymmetric distributions with heavy tails on both sides, where the probability of extreme outcomes can differ across directions.
Several well-known distributions arise as special cases of the SGT family:
- Normal distribution: \(p=2, \qquad q\to\infty, \qquad \lambda=0\).
- Student-t distribution: \(p=2, \qquad q=\nu/2, \qquad \lambda=0\).
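The SGT density is not available in SciPy, but the mean–std form can be implemented directly from the Beta-function constants. The sketch below follows a commonly used parameterization (the scale factor \(v\) enforces standard deviation \(\sigma\) and the shift \(m\) enforces mean \(\mu\)); `sgt_pdf` is our helper name, and the formula is an assumption to be checked against the special cases listed above:

```python
import numpy as np
from scipy import integrate, stats
from scipy.special import beta as B

def sgt_pdf(x, mu, sigma, lam, p, q):
    """SGT density in the mean-std parameterization (requires pq > 2, |lam| < 1)."""
    b1 = B(1 / p, q)
    b2 = B(2 / p, q - 1 / p)
    b3 = B(3 / p, q - 2 / p)
    # scale adjustment so that the standard deviation equals sigma
    v = q ** (-1 / p) / np.sqrt((3 * lam**2 + 1) * b3 / b1
                                - 4 * lam**2 * (b2 / b1) ** 2)
    # location shift so that the mean equals mu
    m = 2 * v * sigma * lam * q ** (1 / p) * b2 / b1
    z = x - mu + m
    core = np.abs(z) ** p / (q * (v * sigma) ** p
                             * (1 + lam * np.sign(z)) ** p) + 1
    return p / (2 * v * sigma * q ** (1 / p) * b1 * core ** (1 / p + q))

# special case: lam=0, p=2, q=nu/2 recovers the standardized Student-t
pdf_sgt = sgt_pdf(0.7, 0.0, 1.0, 0.0, 2, 2.5)
pdf_t = stats.t.pdf(0.7, df=5, scale=np.sqrt(3 / 5))
# for a skewed SGT, the mean still equals mu (checked by numerical integration)
mean_skewed = integrate.quad(lambda x: x * sgt_pdf(x, 0.0, 1.0, 0.5, 2, 3),
                             -np.inf, np.inf)[0]
```

The reduction to the standardized Student-t and the mean constraint provide two independent sanity checks on the implementation.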
2.5.2 Simulation
The skew generalized t distribution allows the modeling of skewness and tail thickness simultaneously. Unlike the skew-normal distribution, which introduces asymmetry while preserving thin (Gaussian) tails, and the Student's t distribution, which introduces heavy tails under symmetry, the SGT distribution allows both asymmetry and tail thickness to vary independently.
To understand the role of each parameter, we consider three experiments: 1. Vary skewness while holding tail parameters fixed; 2. Vary tail parameters under symmetry; 3. Combine skewness and heavy tails.
Table: Frequency of Extreme Events (\(P(X-\mu < -k\sigma)\) and \(P(X-\mu > k\sigma)\))
| \(k\) | \(\begin{aligned}\lambda=0 \\ p=2 \\ q=5\end{aligned}\) | \(\begin{aligned}\lambda=0.1 \\ p=2 \\ q=5\end{aligned}\) | \(\begin{aligned}\lambda=0.5 \\ p=2 \\ q=5\end{aligned}\) | \(\begin{aligned}\lambda=0.9 \\ p=2 \\ q=5\end{aligned}\) |
|---|---|---|---|---|
| 1 | \(\frac{1}{10}\), \(\frac{1}{10}\) | \(\frac{1}{10}\), \(\frac{1}{10}\) | \(\frac{1}{10}\), \(\frac{1}{10}\) | \(\frac{1}{10}\), \(\frac{1}{10}\) |
| 2 | \(\frac{2}{100}\), \(\frac{2}{100}\) | \(\frac{2}{100}\), \(\frac{3}{100}\) | \(\frac{3}{1000}\), \(\frac{4}{100}\) | \(\frac{18}{10^{9}}\), \(\frac{5}{100}\) |
| 3 | \(\frac{4}{1000}\), \(\frac{4}{1000}\) | \(\frac{2}{1000}\), \(\frac{1}{100}\) | \(\frac{68}{10^{6}}\), \(\frac{1}{100}\) | \(\frac{0}{10^{9}}\), \(\frac{1}{100}\) |
| 4 | \(\frac{1}{1000}\), \(\frac{1}{1000}\) | \(\frac{330}{10^{6}}\), \(\frac{1}{1000}\) | \(\frac{3}{10^{6}}\), \(\frac{3}{1000}\) | \(\frac{0}{10^{9}}\), \(\frac{4}{1000}\) |
| 5 | \(\frac{115}{10^{6}}\), \(\frac{115}{10^{6}}\) | \(\frac{57}{10^{6}}\), \(\frac{204}{10^{6}}\) | \(\frac{311}{10^{9}}\), \(\frac{1}{1000}\) | \(\frac{0}{10^{9}}\), \(\frac{1}{1000}\) |
| 6 | \(\frac{27}{10^{6}}\), \(\frac{27}{10^{6}}\) | \(\frac{12}{10^{6}}\), \(\frac{51}{10^{6}}\) | \(\frac{44}{10^{9}}\), \(\frac{234}{10^{6}}\) | \(\frac{0}{10^{9}}\), \(\frac{421}{10^{6}}\) |
Table: Frequency of Extreme Events (\(P(X-\mu < -k\sigma)\) and \(P(X-\mu > k\sigma)\))
| \(k\) | \(\begin{aligned}\lambda=0 \\ p=2 \\ q=2\end{aligned}\) | \(\begin{aligned}\lambda=0.1 \\ p=2 \\ q=2\end{aligned}\) | \(\begin{aligned}\lambda=0.5 \\ p=2 \\ q=2\end{aligned}\) | \(\begin{aligned}\lambda=0.9 \\ p=2 \\ q=2\end{aligned}\) |
|---|---|---|---|---|
| 1 | \(\frac{1}{10}\), \(\frac{1}{10}\) | \(\frac{1}{10}\), \(\frac{1}{10}\) | \(\frac{1}{10}\), \(\frac{1}{10}\) | \(\frac{2}{100}\), \(\frac{1}{10}\) |
| 2 | \(\frac{2}{100}\), \(\frac{2}{100}\) | \(\frac{2}{100}\), \(\frac{3}{100}\) | \(\frac{3}{1000}\), \(\frac{4}{100}\) | \(\frac{2}{10^{6}}\), \(\frac{4}{100}\) |
| 3 | \(\frac{1}{100}\), \(\frac{1}{100}\) | \(\frac{5}{1000}\), \(\frac{1}{100}\) | \(\frac{426}{10^{6}}\), \(\frac{1}{100}\) | \(\frac{128}{10^{9}}\), \(\frac{2}{100}\) |
| 4 | \(\frac{2}{1000}\), \(\frac{2}{1000}\) | \(\frac{2}{1000}\), \(\frac{3}{1000}\) | \(\frac{110}{10^{6}}\), \(\frac{1}{100}\) | \(\frac{26}{10^{9}}\), \(\frac{1}{100}\) |
| 5 | \(\frac{1}{1000}\), \(\frac{1}{1000}\) | \(\frac{1}{1000}\), \(\frac{1}{1000}\) | \(\frac{40}{10^{6}}\), \(\frac{3}{1000}\) | \(\frac{8}{10^{9}}\), \(\frac{4}{1000}\) |
| 6 | \(\frac{1}{1000}\), \(\frac{1}{1000}\) | \(\frac{341}{10^{6}}\), \(\frac{1}{1000}\) | \(\frac{18}{10^{6}}\), \(\frac{2}{1000}\) | \(\frac{4}{10^{9}}\), \(\frac{2}{1000}\) |
Table: Frequency of Extreme Events (\(P(\vert X-\mu\vert > k\sigma)\))
| \(k\) | \(\begin{aligned}\lambda=0 \\ p=2 \\ q=2\end{aligned}\) | \(\begin{aligned}\lambda=0 \\ p=2 \\ q=5\end{aligned}\) | \(\begin{aligned}\lambda=0 \\ p=2 \\ q=10\end{aligned}\) | \(\begin{aligned}\lambda=0 \\ p=2 \\ q=15\end{aligned}\) |
|---|---|---|---|---|
| 1 | \(\frac{2}{10}\) | \(\frac{3}{10}\) | \(\frac{3}{10}\) | \(\frac{3}{10}\) |
| 2 | \(\frac{5}{100}\) | \(\frac{5}{100}\) | \(\frac{5}{100}\) | \(\frac{5}{100}\) |
| 3 | \(\frac{1}{100}\) | \(\frac{1}{100}\) | \(\frac{5}{1000}\) | \(\frac{4}{1000}\) |
| 4 | \(\frac{5}{1000}\) | \(\frac{1}{1000}\) | \(\frac{424}{10^{6}}\) | \(\frac{259}{10^{6}}\) |
| 5 | \(\frac{2}{1000}\) | \(\frac{231}{10^{6}}\) | \(\frac{37}{10^{6}}\) | \(\frac{14}{10^{6}}\) |
| 6 | \(\frac{1}{1000}\) | \(\frac{53}{10^{6}}\) | \(\frac{4}{10^{6}}\) | \(\frac{1}{10^{6}}\) |
Table: Frequency of Extreme Events (\(P(\vert X-\mu\vert > k\sigma)\))
| \(k\) | \(\begin{aligned}\lambda=0.5 \\ p=2 \\ q=2\end{aligned}\) | \(\begin{aligned}\lambda=0.5 \\ p=2 \\ q=5\end{aligned}\) | \(\begin{aligned}\lambda=0.5 \\ p=2 \\ q=10\end{aligned}\) | \(\begin{aligned}\lambda=0.5 \\ p=2 \\ q=15\end{aligned}\) |
|---|---|---|---|---|
| 1 | \(\frac{2}{10}\) | \(\frac{3}{10}\) | \(\frac{3}{10}\) | \(\frac{3}{10}\) |
| 2 | \(\frac{4}{100}\) | \(\frac{4}{100}\) | \(\frac{4}{100}\) | \(\frac{4}{100}\) |
| 3 | \(\frac{1}{100}\) | \(\frac{1}{100}\) | \(\frac{1}{100}\) | \(\frac{1}{100}\) |
| 4 | \(\frac{1}{100}\) | \(\frac{3}{1000}\) | \(\frac{2}{1000}\) | \(\frac{1}{1000}\) |
| 5 | \(\frac{3}{1000}\) | \(\frac{1}{1000}\) | \(\frac{268}{10^{6}}\) | \(\frac{163}{10^{6}}\) |
| 6 | \(\frac{2}{1000}\) | \(\frac{234}{10^{6}}\) | \(\frac{47}{10^{6}}\) | \(\frac{21}{10^{6}}\) |
The tables show that decreasing \(q\) dramatically increases the probability of extreme events (slow decay), increasing \(|\lambda|\) redistributes these extreme events asymmetrically, and combining both effects leads to highly asymmetric extreme risk profiles.
Note: The skew-normal and SGT distributions generate asymmetry through fundamentally different mechanisms. The skew-normal introduces skewness by reweighting a symmetric Gaussian density, which primarily redistributes mass around the center while preserving thin tails. In contrast, the SGT distribution introduces asymmetry through asymmetric scaling of deviations, which can amplify both skewness and tail behavior. As a result, the SGT can produce asymmetric heavy-tailed distributions, while the skew-normal cannot.
The plot shows that even when the SGT distribution is calibrated to have Gaussian-like tails, its skewness mechanism differs fundamentally from that of the skew-normal distribution, leading to different shapes and tail behaviors.
Note: As explained in the introduction, the flexibility of the SGT distribution comes at a cost: it requires estimating multiple shape parameters. In practice, reliable estimation of these parameters requires sufficient data, and misspecification can lead to misleading conclusions about tail risk or asymmetry.
2.6 Mixture Distribution as a Linear Opinion Pool
Mixture distributions provide a general and conceptually clean way to combine multiple probabilistic views into a single predictive distribution. Suppose we have \(K\) predictive distributions for the same quantity \(X\), denoted by \(F_1, \dots, F_K\). These distributions may arise from different models, alternative data sources, or distinct expert assessments. Rather than selecting a single model, we form a combined distribution as a weighted average:
\begin{equation} F(x) = \sum_{k=1}^{K} \pi_k F_k(x), \qquad \pi_k > 0, \quad \sum_{k=1}^{K} \pi_k = 1. \end{equation}
This construction is known as a linear opinion pool. The corresponding density, when it exists, is given by
\begin{equation} f(x) = \sum_{k=1}^{K} \pi_k f_k(x), \end{equation}
so the aggregation operates directly at the level of probability distributions. Unlike averaging point forecasts, this approach preserves the full structure of uncertainty contained in each predictive view.
Under this interpretation, each component represents a distinct description of uncertainty rather than a physical regime that may or may not occur. The weights \(\pi_k\) quantify how much influence each view has on the final prediction. They may reflect relative confidence, historical performance, or subjective judgment. The key point is that the mixture does not attempt to identify which model is correct. Instead, it represents a synthesis of beliefs, where all views contribute simultaneously to the final distribution.
The linearity of expectations implies that the mean of the pooled distribution is
\begin{equation} \mathbb{E}[X] = \sum_{k=1}^{K} \pi_k \mu_k, \end{equation}
where \(\mu_k = \mathbb{E}_{F_k}[X]\). Thus, the combined mean coincides with the weighted average of individual point predictions. However, the behavior of dispersion reveals a deeper effect. Writing \(\sigma_k^2 = \mathrm{Var}_{F_k}(X)\), the variance of the pooled distribution can be expressed as
\begin{equation} \mathrm{Var}(X)= \sum_{k=1}^{K} \pi_k \sigma_k^2 + \sum_{k=1}^{K} \pi_k (\mu_k - \mathbb{E}[X])^2. \end{equation}
The first term reflects the average uncertainty within each predictive model, while the second term captures disagreement across models. This decomposition shows that uncertainty arises not only from variability within each view, but also from differences between views. Even if each individual model is highly concentrated, disagreement in their means increases overall dispersion. In this sense, the linear opinion pool incorporates both risk and model uncertainty in a unified way.
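The within/between decomposition is easy to verify numerically. A minimal sketch with two illustrative Normal views; the weights and moments are arbitrary choices for the example:

```python
import numpy as np

# Two predictive views: N(0, 1) with weight 0.6 and N(3, 2^2) with weight 0.4
pis  = np.array([0.6, 0.4])
mus  = np.array([0.0, 3.0])
sigs = np.array([1.0, 2.0])

pool_mean = float(np.sum(pis * mus))                    # weighted average of means
within    = float(np.sum(pis * sigs**2))                # average internal uncertainty
between   = float(np.sum(pis * (mus - pool_mean)**2))   # disagreement across views
pool_var  = within + between
# cross-check against Var(X) = sum_k pi_k (sigma_k^2 + mu_k^2) - E[X]^2
pool_var_alt = float(np.sum(pis * (sigs**2 + mus**2)) - pool_mean**2)
```

Even though both components are fairly concentrated, the disagreement term contributes roughly half of the pooled variance in this example.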
This perspective connects naturally with the earlier framework of uncertainty. Within each component, dispersion, skewness, and tail behavior describe the shape of deviations around a local center. Across components, differences in location, scale, and shape introduce an additional layer of uncertainty. As a result, features such as asymmetry, heavy tails, or even multiple modes can emerge without being explicitly parameterized. They arise from the interaction between different predictive views rather than from a single underlying mechanism.
The interpretation of mixture distributions as opinion pools should be distinguished from the alternative regime-based interpretation. In a regime model, one component is assumed to generate the outcome, but the identity of the regime is unknown. In an opinion pool, by contrast, the mixture reflects a combination of beliefs, not a hidden switching process. The same mathematical form therefore supports two conceptually different interpretations: one structural, the other epistemic. In other words, when mixtures are used as opinion pools, the emphasis is on combining information rather than identifying latent states. The goal is not to uncover the true data-generating mechanism, but to produce a predictive distribution that reflects the diversity and uncertainty of available views.
2.7 Gaussian Mixture Distribution for Controlling Complex Uncertainty
A particularly important special case arises when each predictive distribution \(F_k\) is approximated by a Normal distribution. In this setting, the linear opinion pool becomes a Gaussian mixture. The justification for this choice is not that individual views are exactly Normal, but that the Normal distribution provides a convenient and interpretable approximation of a single predictor’s uncertainty, summarized by its mean and variance.
Formally, a Gaussian mixture with \(K\) components is given by
\begin{equation} X \sim \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mu_k,\sigma_k^2), \end{equation}
where \(\pi_k > 0\), \(\sum_{k=1}^{K} \pi_k = 1\), and \((\mu_k,\sigma_k^2)\) represent the mean and variance associated with the \(k\)-th predictive view. The density function is
\begin{equation} f_X(x) = \sum_{k=1}^{K} \pi_k\, \frac{1}{\sigma_k \sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right), \end{equation}
and the cumulative distribution function is
\begin{equation} F_X(x) = \sum_{k=1}^{K} \pi_k\, \Phi\!\left(\frac{x-\mu_k}{\sigma_k}\right). \end{equation}
These expressions remain analytically tractable because they inherit the structure of the Normal distribution.
In this formulation, each component can be interpreted as a Gaussian approximation of a single predictive model. The mixture then aggregates these approximations into a combined distribution. This provides a practical bridge between simple parametric modeling and the more general idea of combining heterogeneous beliefs. Each model contributes a mean (its point prediction) and a variance (its internal uncertainty), and the mixture incorporates both these elements as well as the disagreement across models.
The mean and variance of the Gaussian mixture follow directly from the general mixture formulas:
\begin{equation} \mathbb{E}[X] = \sum_{k=1}^{K} \pi_k \mu_k, \end{equation}
\begin{equation} \mathrm{Var}(X) = \sum_{k=1}^{K} \pi_k (\sigma_k^2 + \mu_k^2) - \mathbb{E}[X]^2. \end{equation}
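The pdf, CDF, mean, and variance formulas above translate directly into code. A minimal sketch assuming SciPy; the class name `GaussianMixture` and the example parameters are ours:

```python
import numpy as np
from scipy import stats

class GaussianMixture:
    """Minimal Gaussian mixture viewed as a linear opinion pool."""

    def __init__(self, pis, mus, sigmas):
        self.pis = np.asarray(pis, float)
        self.mus = np.asarray(mus, float)
        self.sigmas = np.asarray(sigmas, float)

    def pdf(self, x):
        return float(np.sum(self.pis * stats.norm.pdf(x, self.mus, self.sigmas)))

    def cdf(self, x):
        return float(np.sum(self.pis * stats.norm.cdf(x, self.mus, self.sigmas)))

    def mean(self):
        return float(np.sum(self.pis * self.mus))

    def var(self):
        return float(np.sum(self.pis * (self.sigmas**2 + self.mus**2))
                     - self.mean() ** 2)

# two equally weighted views with different locations and scales
gm = GaussianMixture([0.5, 0.5], [-1.0, 2.0], [1.0, 0.5])
```

Because the mixture inherits the Normal structure, no numerical integration is needed for either the density or the CDF.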
As before, total dispersion reflects both within-component variability and differences in component means. This dual source of uncertainty is a defining feature of mixture models. Unlike single-distribution approaches, where dispersion is tied to local fluctuations around a single center, Gaussian mixtures allow dispersion to arise from heterogeneity across predictive views.
From the perspective of uncertainty modeling, Gaussian mixtures differ fundamentally from the parametric extensions considered earlier. In distributions such as the Student’s \(t\) or skew-normal, features like fat tails or asymmetry are introduced through explicit shape parameters within a single distribution. In a Gaussian mixture, these features emerge indirectly. For example, asymmetry can arise if component means are unevenly distributed or differently weighted, while heavy tails can result from combining components with different variances or distant means. Multi-modality appears when components are sufficiently separated.
Thus, Gaussian mixtures do not impose a specific structural form for skewness or tail behavior. Instead, they generate these features through the combination of simpler building blocks. This makes them highly flexible, but also less directly interpretable: the parameters describe individual components rather than global properties such as skewness or kurtosis.
This flexibility must be used with care. The number of parameters grows quickly with the number of components, and without sufficient information the model may capture noise rather than meaningful structure. In the context of opinion pooling, however, the interpretation is more disciplined: each component corresponds to a specific predictive view, and the weights reflect how these views are combined. Under this interpretation, the Gaussian mixture serves not merely as a flexible density, but as a structured representation of aggregated beliefs under uncertainty.
2.8 Quantile-based Interpolated Distribution for Approximation
For approximation purposes, we introduce and use the Quantile-based Interpolated distribution (QI), a continuous univariate distribution constructed from a finite sequence of strictly increasing quantiles. Let
\begin{equation} q_0 < q_1 < \dots < q_{k-1} \end{equation}
be given real numbers, and let $0 < p_{\mathrm{lo}} < p_{\mathrm{hi}} < 1$ denote lower and upper probability bounds. Define a uniform probability grid\begin{equation} p_i = p_{\mathrm{lo}} + i \cdot \Delta p, \quad \Delta p = \frac{p_{\mathrm{hi}} - p_{\mathrm{lo}}}{k-1}, \quad i=0,\dots,k-1. \end{equation}
We write\begin{equation} X \sim \mathrm{QI}(q_0,\dots,q_{k-1}; p_{\mathrm{lo}}, p_{\mathrm{hi}}) \end{equation}
to denote that $X$ follows the QI distribution induced by these quantiles and probability bounds. Conceptually, the distribution should be understood as a *probability-partitioned model*: each interval carries equal probability mass (except for the tails), and the geometry of the distribution is determined entirely by the spacing of the quantiles. This makes the model particularly robust and interpretable, especially when the quantiles arise from empirical data.

The QI distribution can also be viewed as a "dual" to a histogram. A standard histogram partitions the data space into fixed-width (or arbitrarily chosen) bins and assigns probability mass according to how many samples fall into each bin, resulting in equal-width, variable-mass intervals. In contrast, the QI distribution partitions the probability space into equal increments and lets the bin widths be determined by the spacing of quantiles, yielding equal-mass, variable-width intervals. This leads to a smoother and more stable representation, avoids sensitivity to bin edge choices, and guarantees a continuous, invertible CDF, unlike the typically discontinuous structure of histograms.
The QI distribution admits an exact representation as a finite mixture of uniform distributions. Define tail slopes based on the outer quantile spacings:
\begin{equation} s_{\mathrm{lo}} = \frac{\Delta p}{q_1 - q_0}, \qquad s_{\mathrm{hi}} = \frac{\Delta p}{q_{k-1} - q_{k-2}}. \end{equation}
These slopes determine the finite support of the distribution:\begin{equation} x_{\min} = q_0 - \frac{p_{\mathrm{lo}}}{s_{\mathrm{lo}}}, \qquad x_{\max} = q_{k-1} + \frac{1 - p_{\mathrm{hi}}}{s_{\mathrm{hi}}}, \end{equation}
ensuring that $F(x_{\min}) = 0$ and $F(x_{\max}) = 1$. The distribution can then be written as
\begin{equation} X \sim \sum_{i=0}^{k} w_i \cdot \mathrm{Uniform}(a_i, b_i), \end{equation}
where the intervals are given by\begin{equation} [a_0, b_0] = [x_{\min}, q_0], \quad [a_i, b_i] = [q_{i-1}, q_i] \quad (i=1,\dots,k-1), \quad [a_k, b_k] = [q_{k-1}, x_{\max}], \end{equation}
and the corresponding weights are\begin{equation} w_0 = p_{\mathrm{lo}}, \quad w_i = \Delta p \quad (i=1,\dots,k-1), \quad w_k = 1 - p_{\mathrm{hi}}. \end{equation}
These weights sum to one and assign equal probability mass $\Delta p$ to each interior interval, while allocating the remaining mass to the tails.

This representation highlights the key structural property of the QI distribution: probability is partitioned uniformly across intervals defined by quantiles, rather than space being partitioned uniformly. As a result, narrow quantile gaps correspond to high density, and wide gaps correspond to low density.
From the mixture representation, the density follows immediately. For a uniform distribution on $[a,b]$, the density is constant and equal to $1/(b-a)$. Therefore, the QI density is piecewise constant:
\begin{equation} f(x) = \begin{cases} \frac{p_{\mathrm{lo}}}{q_0 - x_{\min}} = s_{\mathrm{lo}}, & x \in [x_{\min}, q_0], \\ \frac{\Delta p}{q_i - q_{i-1}}, & x \in (q_{i-1}, q_i), \\ \frac{1 - p_{\mathrm{hi}}}{x_{\max} - q_{k-1}} = s_{\mathrm{hi}}, & x \in [q_{k-1}, x_{\max}], \\ 0, & \text{otherwise}. \end{cases} \end{equation}
Thus, the density is constant on each interval and discontinuous only at the quantile boundaries. The CDF is obtained by integrating the density or equivalently by summing the contributions of the mixture components. Because each component is uniform, the CDF is linear on each interval. Explicitly, for $x \in [q_{i-1}, q_i]$,\begin{equation} F(x) = p_{i-1} + \frac{x - q_{i-1}}{q_i - q_{i-1}} \cdot \Delta p. \end{equation}
On the tails, the same linear structure applies:\begin{equation} F(x) = p_{\mathrm{lo}} + s_{\mathrm{lo}} (x - q_0), \quad x \in [x_{\min}, q_0], \end{equation}
\begin{equation} F(x) = p_{\mathrm{hi}} + s_{\mathrm{hi}} (x - q_{k-1}), \quad x \in [q_{k-1}, x_{\max}]. \end{equation}
Outside the support, the CDF is identically 0 or 1. The resulting function is continuous, strictly increasing, and piecewise linear. The inverse CDF (quantile function) is also piecewise linear and can be derived by inverting the expressions above. For $p \in [p_{i-1}, p_i]$,\begin{equation} F^{-1}(p) = q_{i-1} + \frac{p - p_{i-1}}{\Delta p}(q_i - q_{i-1}). \end{equation}
Analogous expressions hold in the tails:\begin{equation} F^{-1}(p) = q_0 + \frac{p - p_{\mathrm{lo}}}{s_{\mathrm{lo}}}, \quad p \le p_{\mathrm{lo}}, \end{equation}
\begin{equation} F^{-1}(p) = q_{k-1} + \frac{p - p_{\mathrm{hi}}}{s_{\mathrm{hi}}}, \quad p \ge p_{\mathrm{hi}}. \end{equation}
This closed-form inverse makes sampling straightforward: if $U \sim \mathrm{Uniform}(0,1)$, then $X = F^{-1}(U)$ follows the QI distribution.

The mixture-of-uniforms representation makes moment calculations particularly simple. The expectation is given by
\begin{equation} \mathbb{E}[X] = \sum_{i=0}^{k} w_i \cdot \frac{a_i + b_i}{2}, \end{equation}
since the mean of a uniform distribution on $[a,b]$ is $(a+b)/2$. Similarly, the second moment is
\begin{equation} \mathbb{E}[X^2] = \sum_{i=0}^{k} w_i \cdot \frac{a_i^2 + a_i b_i + b_i^2}{3}, \end{equation}
from which the variance follows as
\begin{equation} \mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2. \end{equation}
These expressions are exact and require only a single pass over the intervals.

The tail construction is a defining feature of the QI distribution. Rather than truncating the distribution at the extreme quantiles or imposing a parametric tail, the model assumes that the *local density near the boundary quantiles continues linearly*. This results in constant-density tails that extend just far enough to account for the remaining probability mass. This choice ensures finite support while avoiding strong parametric assumptions. It can be interpreted as a minimal and conservative extrapolation: the distribution beyond the observed quantiles behaves consistently with the nearest observed spacing.
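The constructions above (finite support, piecewise-linear CDF, closed-form inverse, inverse-CDF sampling, and exact moments) can be sketched compactly. The class name `QI` below is illustrative, not a reference implementation; it exploits the fact that the CDF is piecewise linear through the knots $(x_{\min},0), (q_i, p_i), (x_{\max},1)$, so `np.interp` evaluates both $F$ and $F^{-1}$:

```python
import numpy as np

class QI:
    """Quantile-based Interpolated distribution (mixture-of-uniforms form)."""
    def __init__(self, q, p_lo, p_hi):
        q = np.asarray(q, float)
        k = len(q)
        dp = (p_hi - p_lo) / (k - 1)
        s_lo = dp / (q[1] - q[0])                 # tail slopes
        s_hi = dp / (q[-1] - q[-2])
        x_min = q[0] - p_lo / s_lo                # finite support
        x_max = q[-1] + (1.0 - p_hi) / s_hi
        # Knots of the piecewise-linear CDF.
        self.xs = np.concatenate(([x_min], q, [x_max]))
        self.ps = np.concatenate(([0.0], p_lo + dp * np.arange(k), [1.0]))

    def cdf(self, x):
        return np.interp(x, self.xs, self.ps)

    def ppf(self, p):
        return np.interp(p, self.ps, self.xs)

    def sample(self, n, rng=None):
        rng = np.random.default_rng(rng)
        return self.ppf(rng.uniform(size=n))      # inverse-CDF sampling

    def mean(self):
        a, b = self.xs[:-1], self.xs[1:]
        w = np.diff(self.ps)                      # component weights w_i
        return np.sum(w * (a + b) / 2.0)

    def var(self):
        a, b = self.xs[:-1], self.xs[1:]
        w = np.diff(self.ps)
        m2 = np.sum(w * (a * a + a * b + b * b) / 3.0)
        return m2 - self.mean() ** 2
```

For example, `QI([-1.0, 0.0, 1.0], 0.1, 0.9)` has support $[-1.25, 1.25]$, satisfies $F(0)=0.5$, and has mean zero by symmetry.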
3 Qualitative Prediction Framework
In this section, we present a framework for converting qualitative forecast statements into parametric distributions. This involves converting verbal or partial-information statements such as "Uncertainty is high", "Very heavy tails", "Slight right skew", "About 20% error", or "There is 90% probability between 3 and 4" into fully specified parametric distributions. Within our framework, we proceed as follows.
- \(\mu\) is usually stated directly as the point forecast.
- \(\sigma\) is cognitively harder to express, because it is a second-moment quantity.
- shape parameters are conveyed qualitatively, for example as "low", "moderate", or "high" tail risk or asymmetry.
The focus is on the Normal, Student's t, Skew-Normal, and SGT distributions. As noted earlier, the Gaussian mixture is fundamentally different from these distributions; it is better suited to the extended framework discussed in the next section.
3.1 Dispersion Statements and Normal Distribution
For a Normal distribution, parameter identification reduces to extracting \(\sigma\) from the predictor’s expressed beliefs, with \(\mu\) typically given numerically as the point forecast.
3.1.1 Stated Probability Interval
Suppose the predictor states: “There is 90% probability that \(X\) lies between \(a\) and \(b\)”. Under normality:
\begin{equation} P(a \le X \le b) = 1-\alpha. \end{equation}
Because the Normal distribution is symmetric, the remaining probability \(\alpha\) is split equally between the two tails. Therefore the endpoints correspond to the symmetric quantiles \(\alpha/2\) and \(1-\alpha/2\). Let \(z_{1-\alpha/2}\) denote the corresponding standard Normal quantile. Then:
\begin{equation} a = \mu - z_{1-\alpha/2}\sigma, \qquad b = \mu + z_{1-\alpha/2}\sigma. \end{equation}
Solving gives:
\begin{equation} \mu = \frac{a+b}{2}, \qquad \sigma = \frac{b-a}{2z_{1-\alpha/2}}. \end{equation}
As an example, assume $a = 80$ and $b = 120$. So the midpoint is 100 and the total width is 40. If $1-\alpha = 0.90$, $z_{0.95} \approx 1.645$. Thus:\begin{equation} \sigma = \frac{120-80}{2 \times 1.645} = \frac{40}{3.29} \approx 12.16. \end{equation}
Now suppose the predictor instead says: “There is 50% probability that \(X\) lies between \(80\) and \(120\).”. Now \(z_{0.75} \approx 0.674\). So:
\begin{equation} \sigma = \frac{40}{2 \times 0.674} = \frac{40}{1.348} \approx 29.67. \end{equation}
The interval is identical in both cases, but if it contains *90%* probability, the distribution must be relatively concentrated, i.e., have a smaller $\sigma$. If it contains only *50%* probability, the distribution must be much more dispersed, i.e., have a larger $\sigma$.

Note: This formulation assumes a symmetric interpretation of error around \(\mu\). When asymmetry is present, this representation should be replaced with asymmetric bounds. Such cases, for the skew-normal and SGT distributions, are discussed later.
Note: The statement \(P(a \le X \le b) = 1-\alpha\) is interpreted as a probability statement about the uncertain future value \(X\). While similar expressions appear in both frequentist confidence intervals and Bayesian credible intervals, here it is best understood as a subjective probability interval. Mathematically, all cases reduce to selecting quantiles that leave probability mass \(\alpha\) outside the interval.
Note: In a probabilistic statement such as \(P(a \le X \le b) = 1-\alpha\), some closely related quantities appear. The interval \([a,b]\) is the numerical range of values regarded as plausible for \(X\). In the language of Frequentist statistics, this range would typically be called a confidence interval, and the associated probability \(1-\alpha\) the confidence level. The remaining probability \(\alpha\) is known as the significance level. These terms originate in Statistical hypothesis testing, where \(\alpha\) represents the probability mass assigned to the rejection region. In the present discussion, however, we are not performing a hypothesis test. We are simply partitioning the probability mass of a distribution. From the perspective of Bayesian statistics, the same construction would instead be described as a credible interval with credibility level \(1-\alpha\), and \(\alpha\) would simply denote the total tail probability outside the interval. Thus, while the terminology is borrowed from testing theory, mathematically it only reflects how probability mass is divided between the central region and the tails of the distribution.
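The interval inversion above can be sketched as follows, using only the standard library; the function name is illustrative:

```python
from statistics import NormalDist

def normal_from_interval(a, b, level=0.90):
    """Recover (mu, sigma) from the statement 'P(a <= X <= b) = level'
    under normality, via the symmetric-quantile inversion above."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # z_{1 - alpha/2}
    mu = (a + b) / 2.0
    sigma = (b - a) / (2.0 * z)
    return mu, sigma
```

With `a = 80`, `b = 120` this reproduces the worked examples: a 90% interval gives \(\sigma \approx 12.16\), while a 50% interval gives \(\sigma \approx 29.65\).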
3.1.2 Expression as Percentage Error
Suppose the predictor states a value for \(\mu\) together with the statement: “I expect about 20% error”. The point forecast is \(\mu\). However, such a statement is ambiguous unless a probability level is specified. In other words, “20% error” in this statement actually means:
\begin{equation} P(|X-\mu| \le 0.2|\mu|) = 1-\alpha. \end{equation}
If the probability level is not specified, a default value must be chosen. In this framework, we adopt \(1-\alpha = 0.90\) as a convention, though this choice should be adjusted depending on the application. Under normality:
\begin{equation} \sigma = \frac{0.2|\mu|}{z_{1-\alpha/2}}. \end{equation}
Once \(\alpha\) is specified, this reduces to the standard interval inversion problem.
Note: If the mean \(\mu = 0\), percentage error is undefined in relative terms; an absolute statement of error must be used instead.
As an example, suppose a predictor says: “I expect it to be 100 with 20% error.”. For a two-sided 90% interval, \(\alpha/2 = 0.05\) and \(z_{0.95} \approx 1.645\). Under the Normal assumption:
\begin{equation} \sigma = \frac{0.2 \cdot \mu}{z_{0.95}} = \frac{20}{1.645} \approx 12.16. \end{equation}
3.1.3 Qualitative Levels
Suppose the predictor states a value for \(\mu\) together with the statement: “Uncertainty is low / moderate / high.” Such statements do not directly identify \(\sigma\). They require a calibration map translating qualitative categories into quantitative interval widths. Rather than mapping categories directly into variance, each category is mapped to a symmetric 90% probability interval around the mean. Formally:
\begin{equation} P(|X-\mu| \le k|\mu|) = 0.90. \end{equation}
The parameter \(k\) depends on the qualitative level. Its role here is essentially the same as the “percentage error” in the previous example, just expressed through qualitative categories instead of a numeric statement.
| Category | \(k\) | Interpretation | CV | \([90\%] \mu=100\) |
|---|---|---|---|---|
| Extremely low | 0.02 | 90% in ±2% of \(\mu\) | 1.2% | 99 – 101 |
| Very very low | 0.05 | 90% in ±5% of \(\mu\) | 3% | 97 – 103 |
| Very low | 0.10 | 90% in ±10% of \(\mu\) | 6.1% | 94 – 106 |
| Low | 0.18 | 90% in ±18% of \(\mu\) | 10.9% | 89 – 111 |
| Medium | 0.30 | 90% in ±30% of \(\mu\) | 18.2% | 82 – 118 |
| High | 0.50 | 90% in ±50% of \(\mu\) | 30.4% | 70 – 130 |
| Very high | 0.80 | 90% in ±80% of \(\mu\) | 48.6% | 51 – 149 |
| Very very high | 1.10 | 90% in ±110% of \(\mu\) | 66.9% | 33 – 167 |
| Extremely high | 1.50 | 90% in ±150% of \(\mu\) | 91.2% | 9 – 191 |
Note that these calibration values are heuristic and they provide a practical mapping between qualitative labels and quantitative parameters. The Coefficient of Variation (CV = \((\sigma / \mu) \times 100\)) indicates relative variability, while the last column shows the 90% bound in absolute terms (assuming \(\mu = 100\)), helping to interpret the magnitude of variation in familiar, concrete numbers.
With the category-specific width \(k\) and \(1-\alpha = 0.90\), we obtain:
\begin{equation} \sigma = \frac{k|\mu|}{z_{0.95}}. \end{equation}
If $\mu \neq 0$, interval widths are interpreted *relatively* (as percentages of $|\mu|$). However, if $\mu = 0$, interval widths are interpreted *absolutely*:\begin{equation} P(-k \le X \le k) = 0.90. \end{equation}
This avoids division by zero.
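A minimal sketch of the qualitative-level mapping, with the heuristic \(k\) values taken from the calibration table above (the dictionary name and keys are illustrative):

```python
from statistics import NormalDist

# Heuristic k values from the calibration table above.
K_TABLE = {"extremely low": 0.02, "very very low": 0.05, "very low": 0.10,
           "low": 0.18, "medium": 0.30, "high": 0.50, "very high": 0.80,
           "very very high": 1.10, "extremely high": 1.50}

def sigma_from_level(level, mu):
    """Map a qualitative uncertainty level to sigma via a symmetric
    90% interval, switching to absolute widths when mu = 0."""
    k = K_TABLE[level]
    z = NormalDist().inv_cdf(0.95)               # z_{0.95}
    half_width = k * abs(mu) if mu != 0 else k   # relative vs absolute width
    return half_width / z
```

For the "medium" level with \(\mu = 100\), this gives \(\sigma \approx 18.24\), matching the 18.2% coefficient of variation in the table.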
3.2 Fat Tail Statements and t Distribution
In this scenario, skewness is not present, but the predictor explicitly requests fat tails. Student’s t-distribution is the natural choice. Compared to the Normal distribution, the t distribution introduces an additional parameter: the degrees of freedom \(\nu\). While the Normal distribution is fully characterized by its mean and standard deviation, the t-distribution also requires specification of \(\nu\), which governs tail thickness. Smaller values of \(\nu\) imply heavier tails and a higher probability of extreme realizations. Within our framework, we proceed as follows:
3.2.1 Step 1: Calibration
We assign exogenous values of \(\nu\) based on the selected qualitative level for fat tails. Below is the qualitative calibration table.
| Tail Level | \(\nu\) | \(P\gt 2\sigma\) | \(P\gt 4\sigma\) | \(P\gt 6\sigma\) |
|---|---|---|---|---|
| None | Inf | \(\frac{5}{100}\) | \(\frac{63}{10^{6}}\) | \(\frac{2}{10^{9}}\) |
| Extremely Low | 50.0 | \(\frac{1}{10}\) | \(\frac{271}{10^{6}}\) | \(\frac{338}{10^{9}}\) |
| Very Very Low | 25.0 | \(\frac{1}{10}\) | \(\frac{1}{1000}\) | \(\frac{5}{10^{6}}\) |
| Very Low | 15.0 | \(\frac{1}{10}\) | \(\frac{2}{1000}\) | \(\frac{52}{10^{6}}\) |
| Low | 10.0 | \(\frac{1}{10}\) | \(\frac{1}{100}\) | \(\frac{316}{10^{6}}\) |
| Medium | 8.0 | \(\frac{1}{10}\) | \(\frac{1}{100}\) | \(\frac{1}{1000}\) |
| High | 6.0 | \(\frac{2}{10}\) | \(\frac{2}{100}\) | \(\frac{3}{1000}\) |
| Very High | 4.0 | \(\frac{2}{10}\) | \(\frac{5}{100}\) | \(\frac{1}{100}\) |
| Very Very High | 3.0 | \(\frac{3}{10}\) | \(\frac{1}{10}\) | \(\frac{4}{100}\) |
| Extremely High | 2.5 | \(\frac{4}{10}\) | \(\frac{2}{10}\) | \(\frac{1}{10}\) |
These values are meant as a practical guide linking descriptive labels to numbers. The three extra columns show how often unusually large deviations can occur, giving a sense of how extreme each category is in practice.
Note: The qualitative scale avoids undefined variance regions by keeping \(\nu>2\). Because the distribution is symmetric, skewness is always zero when defined (and undefined for \(\nu\le 3\)). The kurtosis, however, becomes infinite or undefined when \(\nu\le 4\).
3.2.2 Step 2: Estimating Dispersion
\(\nu\) and \(\mu\) are given, and the dispersion information takes the form:
\begin{equation} P(a \le X \le b) = 1-\alpha. \end{equation}
The distribution is symmetric. Let \(t_{\nu,1-\alpha/2}\) denote the corresponding quantile of the standard t distribution. We have:
\begin{equation} P\left(-t_{\nu,1-\alpha/2} \le T_\nu \le t_{\nu,1-\alpha/2}\right) = 1-\alpha. \end{equation}
Then we have:
\begin{equation} \sigma = \frac{b - a}{2 t_{\nu,1-\alpha/2}}. \end{equation}
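The scale recovery above can be sketched as follows, assuming SciPy is available for the t quantile (the function name is illustrative):

```python
from scipy.stats import t

def t_sigma_from_interval(a, b, nu, level=0.90):
    """Scale parameter sigma of a location-scale t distribution implied
    by a symmetric interval statement P(a <= X <= b) = level."""
    q = t.ppf(0.5 + level / 2.0, df=nu)   # t_{nu, 1 - alpha/2}
    return (b - a) / (2.0 * q)
```

For the interval \([80, 120]\) at the 90% level with \(\nu = 5\), this yields \(\sigma \approx 9.93\), as in the comparison below.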
Because the t-distribution has heavier tails, a larger quantile is required to capture the same probability. As a result, \(\sigma_{t} < \sigma_{\text{Normal}}\). For example, \(z_{0.95} \approx 1.645\) and \(t_{5,0.95} \approx 2.015\). If the interval width is \(40\),
\begin{equation} \frac{40}{2 \times 1.645}= \frac{40}{3.29} \approx 12.16 \end{equation}
and,
\begin{equation} \frac{40}{2 \times 2.015}= \frac{40}{4.03} \approx 9.93 \end{equation}
For the same 90% interval, the normal distribution requires $\sigma \approx 12.16$, while the $t(5)$ model requires $\sigma \approx 9.93$. The t-distribution (with adjusted scale) needs a *smaller scale parameter* because its heavy tails reduce the scale needed to match a central interval.
3.3 Directional Imbalance and Skew-Normal distribution
The skew-normal distribution allows the expression of three distinct aspects of predictive beliefs: a typical central outcome, overall dispersion, and directional risk (upside versus downside asymmetry). Within our framework, we proceed as follows:
3.3.1 Step 1: Calibration
Skewness information is assumed to be conveyed qualitatively, together with a sign indicating the direction of skewness (left or right). We therefore assign exogenous values of the skewness parameter \(\lambda\) based on the selected qualitative level.
| Skew Level | \(\lambda\) | \(P\lt -2\sigma\) | \(P\gt 2\sigma\) | \(P\lt -3\sigma\) | \(P\gt 3\sigma\) |
|---|---|---|---|---|---|
| None | 0.0 | \(\frac{2}{100}\) | \(\frac{2}{100}\) | \(\frac{1}{1000}\) | \(\frac{1}{1000}\) |
| Extremely Low | 0.4 | \(\frac{2}{100}\) | \(\frac{2}{100}\) | \(\frac{1}{1000}\) | \(\frac{1}{1000}\) |
| Very Very Low | 0.7 | \(\frac{2}{100}\) | \(\frac{2}{100}\) | \(\frac{1}{1000}\) | \(\frac{2}{1000}\) |
| Very Low | 1.1 | \(\frac{2}{100}\) | \(\frac{3}{100}\) | \(\frac{1}{1000}\) | \(\frac{3}{1000}\) |
| Low | 2.0 | \(\frac{1}{100}\) | \(\frac{3}{100}\) | \(\frac{104}{10^{6}}\) | \(\frac{5}{1000}\) |
| Medium | 3.0 | \(\frac{4}{1000}\) | \(\frac{4}{100}\) | \(\frac{4}{10^{6}}\) | \(\frac{1}{100}\) |
| High | 4.5 | \(\frac{1}{1000}\) | \(\frac{4}{100}\) | \(\frac{6}{10^{9}}\) | \(\frac{1}{100}\) |
| Very High | 7.0 | \(\frac{32}{10^{6}}\) | \(\frac{4}{100}\) | \(\frac{0}{10^{9}}\) | \(\frac{1}{100}\) |
| Very Very High | 9.0 | \(\frac{1}{10^{6}}\) | \(\frac{4}{100}\) | \(\frac{0}{10^{9}}\) | \(\frac{1}{100}\) |
| Extremely High | 12.0 | \(\frac{3}{10^{9}}\) | \(\frac{4}{100}\) | \(\frac{0}{10^{9}}\) | \(\frac{1}{100}\) |
3.3.2 Step 2: Estimating Dispersion
\(\lambda\) and \(\mu\) are given, and the dispersion information used to determine \(\sigma\) takes the form:
\begin{equation} F_{SN}(b; \mu, \sigma, \lambda)- F_{SN}(a; \mu, \sigma, \lambda) = 1-\alpha, \end{equation}
Unlike the normal and t cases, the symmetry assumption does not hold here, and the skew-normal CDF has no closed-form inverse. Consequently, there is no closed-form solution for \(\sigma\), and the equation must be solved numerically.
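A minimal sketch of the numerical solution, assuming SciPy is available. It converts the mean–std parameterization \((\mu, \sigma, \lambda)\) to SciPy's location–scale form via \(\delta = \lambda/\sqrt{1+\lambda^2}\) and root-finds \(\sigma\); the function name is illustrative:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import skewnorm

def sn_sigma_from_interval(a, b, mu, lam, level=0.90):
    """Solve F_SN(b) - F_SN(a) = level for sigma, where (mu, sigma) are
    the skew-normal's mean and standard deviation and lam its shape."""
    delta = lam / np.sqrt(1.0 + lam ** 2)
    c = np.sqrt(1.0 - 2.0 * delta ** 2 / np.pi)      # sd = omega * c

    def coverage_gap(sigma):
        omega = sigma / c                            # mean-std -> location-scale
        xi = mu - omega * delta * np.sqrt(2.0 / np.pi)
        return (skewnorm.cdf(b, lam, loc=xi, scale=omega)
                - skewnorm.cdf(a, lam, loc=xi, scale=omega)) - level

    # Coverage shrinks as sigma grows, so the bracket contains a sign change.
    return brentq(coverage_gap, 1e-8, 10.0 * (b - a))
```

With \(\lambda = 0\) the skew-normal reduces to the Normal, so the interval \([80,120]\) at the 90% level recovers \(\sigma \approx 12.16\) as before.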
3.4 SGT Distribution
The SGT distribution allows a more general framework, controlling all three aspects of uncertainty: dispersion, skewness, and fat tails. Within our framework, we proceed as follows:
3.4.1 Step 1: Calibration
As in the previous discussion, we assign exogenous values of the skewness parameter \(\lambda\) based on the selected qualitative level. Note that the \(\lambda\) parameter affects skewness differently in the SN and SGT distributions.
| Skew Level | \(\lambda\) | \(P\lt -2\sigma\) | \(P\gt 2\sigma\) | \(P\lt -3\sigma\) | \(P\gt 3\sigma\) |
|---|---|---|---|---|---|
| None | 0.00 | \(\frac{1}{100}\) | \(\frac{1}{100}\) | \(\frac{2}{10^{6}}\) | \(\frac{2}{10^{6}}\) |
| Extremely Low | 0.10 | \(\frac{1}{100}\) | \(\frac{1}{100}\) | \(\frac{361}{10^{9}}\) | \(\frac{7}{10^{6}}\) |
| Very Very Low | 0.20 | \(\frac{4}{1000}\) | \(\frac{1}{100}\) | \(\frac{39}{10^{9}}\) | \(\frac{21}{10^{6}}\) |
| Very Low | 0.33 | \(\frac{2}{1000}\) | \(\frac{2}{100}\) | \(\frac{1}{10^{9}}\) | \(\frac{59}{10^{6}}\) |
| Low | 0.45 | \(\frac{1}{1000}\) | \(\frac{2}{100}\) | \(\frac{0}{10^{9}}\) | \(\frac{124}{10^{6}}\) |
| Medium | 0.60 | \(\frac{50}{10^{6}}\) | \(\frac{2}{100}\) | \(\frac{0}{10^{9}}\) | \(\frac{251}{10^{6}}\) |
| High | 0.70 | \(\frac{1}{10^{6}}\) | \(\frac{2}{100}\) | \(0\) | \(\frac{365}{10^{6}}\) |
| Very High | 0.80 | \(\frac{0}{10^{9}}\) | \(\frac{2}{100}\) | \(0\) | \(\frac{499}{10^{6}}\) |
| Very Very High | 0.85 | \(\frac{0}{10^{9}}\) | \(\frac{2}{100}\) | \(0\) | \(\frac{1}{1000}\) |
| Extremely High | 0.95 | \(0\) | \(\frac{3}{100}\) | \(0\) | \(\frac{1}{1000}\) |
We assign exogenous values of \(p\) and \(q\) based on the selected qualitative level for fat tails. Below is the qualitative calibration table. Note that the degrees of freedom of the t distribution correspond to \(2q\): as \(q\) decreases, tails become heavier.
| Tail Level | \(p\) | \(q\) | \(P\gt 2\sigma\) | \(P\gt 4\sigma\) | \(P\gt 6\sigma\) |
|---|---|---|---|---|---|
| Extremely Low | 10.0 | 30.0 | \(\frac{2}{1000}\) | \(0\) | \(0\) |
| Very Very Low | 8.0 | 20.0 | \(\frac{4}{1000}\) | \(0\) | \(0\) |
| Very Low | 6.0 | 15.0 | \(\frac{1}{100}\) | \(0\) | \(0\) |
| Low | 4.0 | 10.0 | \(\frac{2}{100}\) | \(\frac{38}{10^{9}}\) | \(\frac{0}{10^{9}}\) |
| Medium | 3.0 | 6.0 | \(\frac{4}{100}\) | \(\frac{42}{10^{6}}\) | \(\frac{86}{10^{9}}\) |
| High | 2.5 | 4.5 | \(\frac{4}{100}\) | \(\frac{424}{10^{6}}\) | \(\frac{9}{10^{6}}\) |
| Very High | 2.0 | 3.5 | \(\frac{5}{100}\) | \(\frac{2}{1000}\) | \(\frac{194}{10^{6}}\) |
| Very Very High | 1.8 | 3.0 | \(\frac{1}{10}\) | \(\frac{4}{1000}\) | \(\frac{1}{1000}\) |
| Extremely High | 1.5 | 2.5 | \(\frac{5}{100}\) | \(\frac{1}{100}\) | \(\frac{2}{1000}\) |
3.4.2 Step 2: Estimating Dispersion
This step is similar to the SN case. \(\lambda\), \(p\), \(q\), and \(\mu\) are given, and the dispersion information used to determine \(\sigma\) takes the form:
\begin{equation} F_{SGT}(b; \mu, \sigma, \lambda,p,q)- F_{SGT}(a; \mu, \sigma, \lambda,p,q) = 1-\alpha, \end{equation}
There is no closed-form solution for \(\sigma\) and the equation must be solved numerically.
4 Extended Qualitative Prediction Framework
Qualitative assessments map the predictor's beliefs about the future of a variable to parametric distributions (e.g., Normal, t, Skew-Normal, and SGT). The extended approach described here is more flexible and is based on direct density elicitation. The main idea is that the predictor provides a graphical representation of the expected density shape. This drawing is interpreted as an approximate probability density function and is subsequently converted into a Gaussian Mixture Model (GMM).
The resulting model should be understood as a smooth approximation of the drawn shape, or a probabilistic representation of subjective beliefs, or a bridge between visual intuition and formal statistical modeling. It is not intended to perfectly reproduce every detail of the drawing, but rather to capture its overall structure and implied likelihoods in a stable and usable form.
4.1 Predictor's Data
In this extended framework, the predictor provides a curve, which is treated as a relative density profile, i.e., higher regions indicate more likely values and lower regions less likely values. Let \(\{(x_i, y_i)\}_{i=1}^n\) denote the predictor-provided drawing. The values \(y_i\) are interpreted as unnormalized density values, defined up to a multiplicative constant.
The curve is cleaned, smoothed, and normalized so that it behaves like a proper probability distribution. Importantly, the framework allows for the possibility that the predictor only specifies the central portion of the distribution, leaving the tails to be inferred.
The preprocessing consists of cleaning and ordering (sorting observations by \(x_i\) and removing near-duplicate points) and applying a moving average with a small fixed window (to reduce small-scale noise introduced by manual drawing, without materially altering the overall shape). Furthermore, the curve is shifted so that its minimum is zero and negative values are clipped, ensuring non-negativity of the density.
Let \((x_i, \hat y_i)_{i=1}^n\) denote the smoothed and shifted version of the original curve. This is interpreted as an unnormalized density, where higher values indicate more likely outcomes. To approximate its total area, we apply the trapezoidal rule:
\begin{equation} \sum_{i=1}^{n-1} A_i, \quad A_i = (x_{i+1} - x_i)\,\frac{\hat{y}_i + \hat{y}_{i+1}}{2}, \quad i = 1,\dots,n-1 \end{equation}
This corresponds to connecting each pair of adjacent points $(x_i, \hat y_i)$ and $(x_{i+1}, \hat y_{i+1})$ with straight lines, forming trapezoids whose areas are summed. This provides a numerical approximation of the area under the curve, which is then used for normalization.

The estimation algorithm operates on discrete weighted points, rather than a continuous curve. Therefore, we convert the curve into weights:
\begin{equation} w_i \approx \int_{\text{neighborhood of } x_i} f(x)\,dx \end{equation}
Each interval contributes half of its mass to each endpoint, yielding:
\begin{equation} w_i = \begin{cases} \tfrac{1}{2} A_1 & i = 1 \\ \tfrac{1}{2} A_{i-1} + \tfrac{1}{2} A_i & 2 \le i \le n-1 \\ \tfrac{1}{2} A_{n-1} & i = n \end{cases} \end{equation}
Thus, each point \(x_i\) is assigned a weight \(w_i\) proportional to the probability mass in its neighborhood.
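The trapezoidal-area and endpoint-weight construction above can be sketched as (the function name is illustrative):

```python
import numpy as np

def curve_to_weights(x, y_hat):
    """Convert a smoothed, non-negative curve (x_i, y_hat_i) into
    per-point probability masses via the trapezoidal rule."""
    x = np.asarray(x, float)
    y = np.asarray(y_hat, float)
    A = np.diff(x) * (y[:-1] + y[1:]) / 2.0   # trapezoid areas A_i
    w = np.zeros_like(x)
    w[:-1] += A / 2.0                          # half of A_i to left endpoint
    w[1:] += A / 2.0                           # half of A_i to right endpoint
    return w / w.sum()                         # normalize total mass to one
```

For a flat curve on an evenly spaced grid, the interior points receive equal mass and each boundary point receives half as much, as expected from the endpoint rule.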
4.2 Estimation
Given a target density (i.e., the normalized and smoothed version of the predictor-provided curve, interpreted as a probability density over the observed range), it is approximated by a Gaussian mixture; a set of mixture weights, component means, and component standard deviations. Intuitively, we are reconstructing the drawn curve by stacking smooth bell-shaped curves. The parameters are estimated using a weighted Expectation-Maximization (EM) algorithm. Note that Gaussian mixtures are not identifiable in a strict sense; multiple parameter configurations may represent the same density.
In the weighted version used here, each point \(x_i\) contributes proportionally to its weight \(w_i\), so regions with more probability mass have greater influence on the fitted model. EM maximizes the log-likelihood:
\begin{equation} \mathcal{L}_w = \sum_{i=1}^n w_i\, \log \left( \sum_{k=1}^K \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2) \right). \end{equation}
The number of components \(K\) is chosen a priori. The EM algorithm iteratively refines the mixture model through two steps:
- E-step (assignment): Each point \(x_i\) is softly assigned to all components, with probabilities proportional to how well each Gaussian explains it.
\begin{equation} r_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^K \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)} \end{equation}
where \(r_{ik}\) represents how much component \(k\) explains point \(x_i\). In practice, computations are performed in log-space for numerical stability.
- M-step (update): The component parameters \((\pi_k, \mu_k, \sigma_k)\) are updated using these assignments. Define:
\begin{equation} N_k = \sum_{i=1}^n w_i \, r_{ik} \end{equation}
Then the mixture weights \(\pi_k\), means \(\mu_k\), and variances \(\sigma_k^2\) are calculated by:
\begin{equation} \pi_k = \frac{N_k}{\sum_{j=1}^K N_j} \end{equation}
\begin{equation} \mu_k = \frac{\sum_i w_i \, r_{ik} \, x_i}{N_k} \end{equation}
\begin{equation} \sigma_k^2 = \frac{\sum_i w_i \, r_{ik} \, (x_i - \mu_k)^2}{N_k} \end{equation}
Since the EM objective is linear in the weights, scaling all weights by a constant does not change the solution. Therefore, only the relative mass distribution matters.
EM is a local optimization procedure and may converge to suboptimal solutions if poorly initialized. Therefore, we start with a simple neutral configuration that roughly covers the data range:
- Mixture weights: All components are initially equal:
\begin{equation} \pi_k = \frac{1}{K}. \end{equation}
- Component means: The means are placed evenly across the main range of the data.
\begin{equation} \mu_k = x_{\min} + \frac{k+1}{K+1} (x_{\max} - x_{\min}) \end{equation}
This ensures that the components are roughly evenly spread across the bulk of the density, providing coverage without biasing toward outliers.
- Component standard deviations: All components are initialized with the same variance, chosen to roughly "fill the gaps" between means:
\begin{equation} \sigma_k = \frac{\text{range}}{2K} \end{equation}
This simple rule-of-thumb ensures some overlap between adjacent components without creating excessively narrow spikes.
A component may try to explain a tiny fluctuation in the curve by collapsing into a very narrow spike. To prevent this, we enforce a minimum width:
\begin{equation} \sigma_k \geq \sigma_{\text{floor}} \end{equation}
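A minimal sketch of the weighted EM procedure described above (neutral initialization, log-space E-step, weighted M-step, and the \(\sigma\) floor); it is illustrative rather than production code:

```python
import numpy as np

def weighted_em(x, w, K, iters=200, sigma_floor=1e-3):
    """Weighted EM for a 1-D Gaussian mixture fit to weighted points.

    x: point locations, w: per-point probability masses (only relative
    mass matters), K: number of components chosen a priori."""
    x = np.asarray(x, float)
    w = np.asarray(w, float)
    w = w / w.sum()
    lo, hi = x.min(), x.max()
    # Neutral initialization: equal weights, evenly spaced means,
    # common width chosen to roughly fill the gaps between means.
    pi = np.full(K, 1.0 / K)
    mu = lo + (np.arange(K) + 1.0) / (K + 1.0) * (hi - lo)
    sigma = np.full(K, (hi - lo) / (2.0 * K))
    for _ in range(iters):
        # E-step: responsibilities r[i, k], computed in log-space.
        log_phi = (-0.5 * ((x[:, None] - mu) / sigma) ** 2
                   - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
        log_r = np.log(pi) + log_phi
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted updates with effective masses N_k.
        Nk = (w[:, None] * r).sum(axis=0)
        pi = Nk / Nk.sum()
        mu = (w[:, None] * r * x[:, None]).sum(axis=0) / Nk
        var = (w[:, None] * r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        sigma = np.maximum(np.sqrt(var), sigma_floor)  # enforce the floor
    return pi, mu, sigma
```

Fitting this to trapezoid weights derived from a curve with two well-separated unit-width bumps recovers components centered near the two bumps with roughly equal weights.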
To select the number of Gaussian components \(K\) when approximating a target density, we use the Bayesian Information Criterion (BIC). For each candidate mixture model, the weighted EM algorithm estimates parameters \((\pi_k, \mu_k, \sigma_k)\) and computes the weighted log-likelihood. The BIC balances model fit with complexity. In practice, because the EM log-likelihood is computed as a weighted average rather than a sum, we multiply by the number of points \(n\) as a proxy for the total contribution:
\begin{equation} \mathrm{BIC} \approx -2 \, n \, \mathcal{L}_w +p \log n, \end{equation}
where $p = 3K-1$ is the number of free parameters ($K$ means, $K$ standard deviations, and $K-1$ independent mixture weights). The model with the *lowest BIC* is selected as the final mixture.
5 CRPS
A scoring rule is a way to grade a forecast once the true outcome is known. If you predict a single number (say, the price of oil tomorrow will be 80), you can measure error directly. But if uncertainty is expressed, for example "There is a 70% chance the price of oil will be between 79 and 81", the prediction is a distribution, and we need a way to evaluate how good that distribution is. A scoring rule assigns a numerical penalty based on the prediction and what actually occurred. A helpful way to think about it is as a game: you must state your beliefs about an uncertain quantity as a probability distribution. After the true value is revealed, a rule assigns you a score. Your goal is to get the lowest possible score.
The Continuous Ranked Probability Score (CRPS) is a scoring rule used to evaluate probabilistic forecasts for continuous outcomes. In simple terms, it measures how close a predicted probability distribution is to what actually happens. Let \(F(x)\) be the forecast CDF and \(y\) be the observed outcome. The CRPS is defined as:
\begin{equation} \text{CRPS}(F, y)= \int_{-\infty}^{\infty} \left( F(x) - \mathbf{1}_{x \ge y} \right)^2 \, dx \end{equation}
The term \(\mathbf{1}_{x \ge y}\) is the indicator function, defined as:
\begin{equation} \mathbf{1}_{x \ge y} = \begin{cases} 1 & \text{if } x \ge y \\ 0 & \text{if } x < y \end{cases} \end{equation}
This is simply a step function that jumps from 0 to 1 at the observed value \(y\). Importantly, \(\mathbf{1}_{x \ge y}\) is the CDF of a degenerate distribution concentrated at \(y\) (i.e., the true distribution after observing the outcome). So CRPS is really:
\begin{equation} \int ( \text{forecast CDF} - \text{true CDF} )^2 dx \end{equation}
It measures the squared distance between the predicted distribution and the empirical distribution of the observation. Think of it as asking: "How far is my predicted distribution from reality?" In other words, CRPS measures the area between those two curves. Imagine plotting 1. the forecast CDF (a smooth S-shaped curve), and 2. the step function that jumps from 0 to 1 at \(y\). CRPS is the total squared vertical distance between those two curves. If the forecast puts most probability mass near \(y\), the curves are close, therefore a small CRPS. If the forecast spreads mass far away from \(y\), the curves differ a lot, therefore a large CRPS.
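This definition can be checked numerically. The sketch below (using SciPy; the truncation limits are arbitrary but wide enough for a standard normal forecast) integrates the squared gap between the forecast CDF and the step function, and compares it with the closed-form value for the normal distribution derived later in this section:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def crps_numeric(F, y, lo=-20.0, hi=20.0):
    """CRPS(F, y) = integral of (F(x) - 1{x >= y})^2 dx, truncated to [lo, hi]."""
    val, _ = quad(lambda x: (F(x) - float(x >= y)) ** 2, lo, hi, points=[y])
    return val

# Standard normal forecast, observation y = 0; the closed form derived
# later in this section gives 2*phi(0) - 1/sqrt(pi).
y = 0.0
approx = crps_numeric(stats.norm.cdf, y)
exact = 2 * stats.norm.pdf(y) - 1 / np.sqrt(np.pi)
```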
5.1 Notation
CRPS is a function of the predictive distribution and the realized value. So we use the notation \(\mathrm{CRPS}(F,y)\), where \(F\) is the predictive cumulative distribution function (CDF), and \(y\in\mathbb{R}\) is the observed realization. So mathematically,
\begin{equation} \mathrm{CRPS}: \mathcal P_1(\mathbb R)\times \mathbb R \to \mathbb R. \end{equation}
where \(\mathcal P_1(\mathbb R)\) is the set of all probability measures on \(\mathbb{R}\) with finite first moments. So, CRPS takes a distribution (with finite first moment) and a number, and returns a real-valued score.
If we write \(X \sim F\), then the distribution of \(X\) is \(F\), and writing \(\mathrm{CRPS}(X,y)\) is shorthand for \(\mathrm{CRPS}(F_X,y)\). It is common in probability theory to refer to a distribution via its random variable (or its CDF, or its density); all three encode the same law.
5.2 Propriety
We are dealing with three different distributions:
- There is some "true" or "actual" mechanism from which the outcome is generated. The actual value \(y\) is a realization of a random variable \(Y\) that follows this distribution, denoted by \(P\), so \(Y \sim P\).
- The forecaster may have a subjective belief about \(P\). We call it \(G\). From the forecaster's perspective, we have \(Y\sim G\).
- What the forecaster reports is not necessarily her belief. This can be any distribution, denoted by \(F \in \mathcal P_1(\mathbb R)\).
It is important to motivate the forecaster to report her actual belief. This is a job for the scoring rule. The scoring rule assigns a score to every reported prediction \(F\), but as we explain later, if the expected score is minimized by reporting the subjective belief, there is no reason to lie.
A scoring rule is called proper if the best strategy in that game is to be honest. To see why this matters, imagine you believe tomorrow’s oil price will be around 80, but you deliberately report a much wider distribution (say 50–100) to "play it safe". A poorly designed scoring rule might reward this kind of hedging. A proper scoring rule prevents this type of behavior. It ensures that: if you truly believe a distribution \(F\), then reporting \(F\) gives you the best expected score. In other words, you cannot improve your score by distorting your beliefs.
As we discussed, the forecaster's belief about the unknown outcome is given by a distribution \(G\), and she is allowed to report any distribution \(F\). A scoring rule assigns a loss \(S(F, y)\) once the outcome \(y\) is observed. However, we should move to the point in time when the decision is being made, i.e., when the outcome is not yet known. In this ex ante perspective, \(y\) is unknown, so we model it as a random variable, denoted by \(Y\). The pointwise score \(S(F,Y)\) then turns into a random quantity.
In this ex ante perspective, what matters is the expected score. Of course, the forecaster evaluates the expectation using her own subjective belief distribution \(G\). The main idea is that if you evaluate outcomes using your own beliefs and you can’t do better (in expectation) than reporting those beliefs honestly, you report them honestly. In this context, expected score under \(G\) is defined as:
\begin{equation} \mathbb{E}_{Y \sim G}[S(F,Y)] = \int S(F,y)\, dG(y) \end{equation}
We sometimes use the notation $\mathrm{CRPS}(F,G)$ for this expectation (not to be confused with the real-valued function $S(F,y)$; here the second argument is a distribution).

A scoring rule is called proper if, for every belief distribution \(G\), the expected score (taken with respect to \(Y \sim G\)) is minimized by reporting \(F = G\), i.e.
\begin{equation} G \in \arg\min_{F \in \mathcal{P}_1(\mathbb{R})} \; \mathbb{E}_{Y \sim G}[S(F, Y)] \end{equation}
or, equivalently:
\begin{equation} \mathbb{E}_{Y \sim G}[S(G, Y)] \le \mathbb{E}_{Y \sim G}[S(F, Y)] \quad \text{for all } F \in \mathcal{P}_1(\mathbb{R}). \end{equation}
If \(G\) is the unique minimizer, i.e., the inequality is strict for all \(F \neq G\), the rule is called strictly proper.
This definition can feel abstract at first, but it encodes a simple idea: "You are evaluated on average, according to what you think is likely to happen". Even if a single outcome might reward a "dishonest" forecast by chance, over many repetitions the scoring rule ensures that honesty wins.
Returning to the earlier example: suppose you believe the oil price will be tightly concentrated around 80. You might be tempted to report a very wide distribution (50–100) to avoid being "wrong". A proper scoring rule counters this by making such hedging worse on average. While a wide forecast reduces the risk of large errors, it also spreads probability mass away from where you actually think the outcome will be, and this is penalized.
In practice, people rarely have a perfectly defined distribution \(G\) in their heads. The definition of propriety does not assume that forecasters are perfectly self-aware; rather, it provides a normative benchmark: "If you had a coherent set of probabilistic beliefs, the scoring rule would reward you for reporting them faithfully". In practice, this has two important implications: 1. It discourages strategic distortion (e.g., inflating uncertainty to look safer); 2. It encourages building models that produce well-calibrated probabilities, because any systematic mismatch between forecasts and reality will increase the score over time. So, a proper scoring rule pushes the predictors in the right direction.
CRPS is a proper scoring rule; therefore, it does not just measure accuracy but also promotes honest and well-calibrated probabilistic forecasts, which is essential when evaluating uncertainty.
To prove this intuitively, start from the definition and take the expectation over \(Y \sim G\):
\begin{equation} \mathbb{E}_{Y}[\text{CRPS}(F, Y)]= \int \mathbb{E}_{Y} \left[ \left(F(x) - \mathbf{1}_{Y \le x}\right)^2 \right] dx \end{equation}
Here we used the fact that the expectation is itself an integral and swapped the two integrals, which is valid under mild conditions (Fubini's theorem). Now, given \(x\), define \(Z = \mathbf{1}_{Y \le x}\). This is a Bernoulli random variable:
\begin{equation} Z = \begin{cases} 1 & Y \le x \\ 0 & Y > x \end{cases} \end{equation}
Given \(x\), we define:
\begin{equation} q = F(x), \quad p = G(x) = \mathbb{P}(Y \le x) \end{equation}
Recall that we are evaluating outcomes using the predictor's beliefs, and by definition \(G\) is the CDF of \(Y\). Therefore, at each threshold \(x\), \(p\) is the probability we assign to the event \(Y \le x\) under our belief, and \(q\) is the probability the reported forecast assigns to the same event.
The inner expectation is \(\mathbb{E}_Y[(q - Z)^2]\). Since \(Z\) only takes two values (\(1\) or \(0\) with probabilities \(p\) and \(1-p\)), expectation is just a weighted average:
\begin{equation} \mathbb{E}[(q - Z)^2]= (q - 1)^2 \cdot p + (q - 0)^2 \cdot (1-p) \end{equation}
Simplifying, then adding and subtracting \(p^2\), we get:
\begin{equation} \mathbb{E}[(q - Z)^2]= (q - p)^2 + p(1 - p) \end{equation}
This decomposition says:
\begin{equation} \text{expected squared error}= (\underbrace{q - p}_{\text{mismatch between forecast and truth}})^2 + \underbrace{p(1-p)}_{\text{irreducible randomness}} \end{equation}
Note that the second term does not depend on \(q\). So the only way to minimize the expectation is to minimize \((q - p)^2\) and this happens only when \(q = p\). In other words, the best prediction for each threshold is the true probability \(G(x)\). While this is the main point, we can plug this result back into the integral and show that:
\begin{equation} \mathbb{E}_Y[\text{CRPS}(F, Y)]- \mathbb{E}_Y[\text{CRPS}(G, Y)] =\int (F(x) - G(x))^2 dx \end{equation}
Note that the right-hand side is an integral of squares and is non-negative, with zero only if \(F(x) = G(x)\) for all \(x\).
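This gap can be evaluated numerically for the oil-price example (a sketch; the belief \(\mathcal{N}(80, 2)\), the hedged report \(\mathcal{N}(75, 15)\), and the integration limits are illustrative choices of ours):

```python
from scipy import stats
from scipy.integrate import quad

# Belief G = N(80, 2); a hedged, overly wide report F = N(75, 15).
G = stats.norm(80.0, 2.0)
F = stats.norm(75.0, 15.0)

# Expected-score gap E_G[CRPS(F, Y)] - E_G[CRPS(G, Y)] = integral (F - G)^2 dx >= 0,
# evaluated on a range wide enough to cover both distributions.
gap, _ = quad(lambda x: (F.cdf(x) - G.cdf(x)) ** 2, 20.0, 140.0)
```

A positive `gap` quantifies exactly how much the hedged report loses, on average, relative to honest reporting.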
5.3 Comparison
When evaluating probabilistic forecasts, different proper scoring rules emphasize different aspects of accuracy. Three widely used examples are the log score, the quadratic score, and the CRPS. All are proper, but they differ in how they measure discrepancy.
The log score evaluates a forecast density \(f\) at the observed outcome \(y\):
\begin{equation} S_{\log}(f, y) = -\log f(y) \end{equation}
It focuses entirely on how much probability mass is assigned to what actually happened. This makes it highly sensitive: assigning very low probability to the observed outcome leads to a large penalty. This sensitivity is desirable in likelihood-based inference and when rare events matter, but it can also make the score unstable and overly harsh when models are imperfect.
The quadratic score measures the squared difference between the forecast distribution and the realized outcome. For a continuous outcome with density \(f\), it can be written as:
\begin{equation} S_{\text{quad}}(f, y) = \int f(x)^2 \, dx - 2 f(y) \end{equation}
(up to an additive constant that does not depend on \(f\)). It evaluates forecasts by penalizing both excessive spread (through \(\int f(x)^2 dx\)) and lack of concentration at the observed value (through \(f(y)\)). Compared to the log score, it is smoother and more forgiving, since errors are distributed across the whole density rather than concentrated at a single point. However, it is correspondingly less sensitive to assigning very low probability to the observed outcome.
The CRPS provides a distribution-wide comparison between the forecast and the observation (as discussed above), effectively extending quadratic scoring ideas to cumulative distributions.
The main distinction lies in how they trade off sensitivity and robustness. The log score is local and highly sensitive to errors at the realized outcome, making it powerful but potentially unstable. The quadratic score spreads error across the entire density, providing a more stable but less discriminating measure. CRPS builds on this by evaluating discrepancies across all thresholds of the distribution. In practice, the log score is preferred when likelihood and tail behavior are central, while CRPS and quadratic-type scores are often chosen when a more stable and globally informative measure of forecast quality is desired.
5.4 Alternative Formulas
5.4.1 CRPS via CDFs (1)
\begin{equation} CRPS(F,y)= \int_{-\infty}^{y} F(x)^2 dx + \int_{y}^{\infty} (1-F(x))^2 dx \end{equation}
The CRPS measures the squared distance between the forecast CDF \(F\) and the step function at the observation \(y\). For \(x < y\), the observation CDF is \(0\), so we penalize accumulated forecast probability. For \(x \ge y\), the observation CDF is \(1\), so we penalize missing forecast probability. Thus, the two integrals correspond to probability mass assigned too early and too late, respectively.
To derive it, we use the fact that the indicator function splits the real line into two regions. Therefore we split the integral at \(y\) and use the indicator function definition:
\begin{equation} CRPS= \int_{-\infty}^{y} (F(x)-0)^2 dx + \int_{y}^{\infty} (F(x)-1)^2 dx. \end{equation}
which is the stated formula.
5.4.2 CRPS via CDFs (2)
\begin{equation} \text{CRPS}(F,y)= \int_{-\infty}^{y} F(x) dx + \int_{y}^{\infty} (1-F(x)) dx - \int_{-\infty}^{\infty} F(x)(1-F(x)) dx. \end{equation}
To derive this, start from the squared-difference representation:
\begin{equation} \mathrm{CRPS}(F,y)= \int_{-\infty}^{\infty} F(x)^2 \, dx + \int_{-\infty}^{\infty} \mathbf{1}_{x \ge y} \, dx- 2 \int_{-\infty}^{\infty} F(x)\mathbf{1}_{x \ge y} \, dx. \end{equation}
Using the indicator function to adjust bounds:
\begin{equation} \int_{-\infty}^{\infty} \mathbf{1}_{x \ge y} \, dx= \int_{y}^{\infty} 1 \, dx, \quad \text{and}\quad \int_{-\infty}^{\infty} F(x)\mathbf{1}_{x \ge y} \, dx= \int_{y}^{\infty} F(x) \, dx. \end{equation}
Hence,
\begin{equation} \mathrm{CRPS}(F,y)= \int_{-\infty}^{\infty} F(x)^2 \, dx + \int_{y}^{\infty} 1 \, dx- 2 \int_{y}^{\infty} F(x) \, dx. \end{equation}
Although \(\int_{y}^{\infty} 1 \, dx\) is infinite, this term will cancel out in the final expression. Use the identity \(F(x)^2 = F(x) - F(x)(1 - F(x))\), and integrate both sides, then substitute:
\begin{equation} \mathrm{CRPS}(F,y)= \int_{-\infty}^{\infty} F(x)\, dx - \int_{-\infty}^{\infty} F(x)(1 - F(x))\, dx + \int_{y}^{\infty} 1 \, dx- 2 \int_{y}^{\infty} F(x)\, dx. \end{equation}
Now split the first integral at \(y\); the terms on \([y,\infty)\) combine as \(\int_y^\infty F(x)\,dx + \int_y^\infty 1\,dx - 2\int_y^\infty F(x)\,dx = \int_y^\infty (1 - F(x))\,dx\), which completes the derivation.
5.4.3 CRPS via Expectations
Let \(X, X' \sim F\) be i.i.d., and assume \(\mathbb{E}|X| < \infty\). Then
\begin{equation} \mathrm{CRPS}(F,y)= E|X - y|- \tfrac{1}{2} E|X - X'|. \end{equation}
This is also known as the energy form of CRPS. The finite first-moment condition ensures that these expectations are well-defined. Distributions with extremely heavy tails violate this condition, leading to divergence. Although this assumption is not always stated explicitly in the integral formulations, those integrals may also diverge in such cases. In other words, forecasts that assign too much probability mass to extreme values can yield an infinite CRPS and are therefore not meaningfully comparable.
This representation decomposes CRPS into:
- a calibration/sharpness term \(\mathbb{E}|X - y|\),
- (minus) a dispersion term \(\tfrac{1}{2} \mathbb{E}|X - X'|\).
We will interpret these components in more detail later.
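The energy form lends itself directly to Monte Carlo evaluation; here is a sketch for a standard normal forecast (the sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
y = 0.5

X = rng.normal(0.0, 1.0, size=n)    # draws from the forecast F = N(0, 1)
X2 = rng.normal(0.0, 1.0, size=n)   # independent second sample

# Energy form: CRPS(F, y) ~ mean|X - y| - 0.5 * mean|X - X'|.
# The closed form for N(0, 1) at y = 0.5 gives approximately 0.33140.
crps_mc = np.abs(X - y).mean() - 0.5 * np.abs(X - X2).mean()
```

This sample-based route works for any distribution one can simulate from, even when no closed form exists.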
Step 1: Representation of E|X - y|
We show that
\begin{equation} \mathbb{E}|X - y|= \int_{-\infty}^{y} F(x)\, dx + \int_{y}^{\infty} (1 - F(x))\, dx. \end{equation}
For any nonnegative random variable \(Z\), \(\mathbb{E}[Z] = \int_0^\infty P(Z > t)\, dt\). Apply this to \(Z = |X - y|\):
\begin{equation} \mathbb{E}|X - y|= \int_0^\infty P(|X - y| > t)\, dt. \end{equation}
Observe that if \(|X - y| > t\), then \(X < y - t\) or \(X > y + t\). So,
\begin{equation} P(|X - y| > t)= F(y - t) + 1 - F(y + t). \end{equation}
Thus,
\begin{equation} \mathbb{E}|X - y|= \int_0^\infty \big[ F(y - t) + 1 - F(y + t) \big]\, dt. \end{equation}
Now change variables:
- For the first term, let \(x = y - t\). Then \(t = 0 \Rightarrow x = y\), and \(t \to \infty \Rightarrow x \to -\infty\):
\begin{equation} \int_0^\infty F(y - t)\, dt = \int_{-\infty}^{y} F(x)\, dx. \end{equation}
- For the second term, let \(x = y + t\). Then \(t = 0 \Rightarrow x = y\), and \(t \to \infty \Rightarrow x \to \infty\):
\begin{equation} \int_0^\infty (1 - F(y + t))\, dt = \int_{y}^{\infty} (1 - F(x))\, dx. \end{equation}
Combining both gives the result.
Step 2: Representation of E|X - X'|
We show that
\begin{equation} \tfrac{1}{2}\mathbb{E}|X - X'|= \int_{-\infty}^{\infty} F(x)(1 - F(x))\, dx. \end{equation}
First note that
\begin{equation} \mathbb{E}|X - X'|= \mathbb{E}[(X - X')_+] + \mathbb{E}[(X' - X)_+]. \end{equation}
By symmetry of the joint distribution of $(X, X')$, these two terms are equal, hence
\begin{equation} \tfrac{1}{2}\mathbb{E}|X - X'|= \mathbb{E}[(X - X')_+]. \end{equation}
Now use the identity
\begin{equation} (X - X')_+= \int_{-\infty}^{\infty} \mathbf{1}_{{X' \le x < X}}\, dx. \end{equation}
Taking expectations,
\begin{equation} \mathbb{E}[(X - X')_+]= \mathbb{E}\left[ \int_{-\infty}^{\infty} \mathbf{1}_{{X' \le x < X}}\, dx \right]. \end{equation}
Exchanging expectation and integration:
\begin{equation} \mathbb{E}[(X - X')_+]= \int_{-\infty}^{\infty} \mathbb{E}[\mathbf{1}_{{X' \le x < X}}]\, dx = \int_{-\infty}^{\infty} P(X' \le x < X)\, dx. \end{equation}
Using independence of \(X\) and \(X'\),
\begin{equation} P(X' \le x < X)= P(X' \le x)\, P(X > x)= F(x)\big(1 - F(x)\big). \end{equation}
Therefore,
\begin{equation} \mathbb{E}[(X - X')_+]= \int_{-\infty}^{\infty} F(x)(1 - F(x))\, dx, \end{equation}
and multiplying by 2 yields the claim.
5.4.4 Sharpness: \(E|X - y|\)
Assume \(X\) is a continuous random variable with density \(f\), CDF \(F\), and finite first moment \(\mathbb{E}|X| < \infty\). Then
\begin{equation} \mathbb{E}|X - y|= \int_{-\infty}^{y} F(x)\, dx + \int_{y}^{\infty} (1 - F(x))\, dx. \end{equation}
This decomposition splits the absolute deviation into two one-sided tail contributions. The general identity for a nonnegative random variable shows that expectations can be expressed as tail integrals. Here, shifting by \(y\) moves the reference point from \(0\) to \(y\), so the “negative” and “positive” parts correspond to \(x < y\) and \(x > y\), respectively.
Geometrically, the area under the CDF up to \(y\), plus the area above the CDF beyond \(y\), together equal the expected absolute deviation from \(y\).
An important equivalent expression is:
\begin{equation} \mathbb{E}|X - y|= \mathbb{E}[X] + y(2F(y) - 1)- 2 \int_{-\infty}^{y} x f(x)\, dx. \end{equation}
Equivalently,
\begin{equation} \mathbb{E}|X - y|= \underbrace{\mathbb{E}[X]}_{\text{global center}} + \underbrace{y(2F(y) - 1)}_{\text{position of } y}- \underbrace{2\,\mathbb{E}[X \mathbf{1}_{{X \le y}}]}_{\text{left-side contribution}}. \end{equation}
Interpretation:
- The first term is the global mean of the distribution.
- The second term captures how the point \(y\) is positioned relative to the distribution via \(F(y)\). If \(y\) is a median (i.e., \(F(y)=\tfrac12\)), this term vanishes.
- The third term is the first lower partial moment (or truncated first moment), describing how mass below \(y\) contributes to the mean.
In general, this truncated moment,
\begin{equation} \mathbb{E}[X \mathbf{1}_{{X \le y}}]= \int_{-\infty}^{y} x f(x)\, dx \end{equation}
does not simplify further without additional assumptions. However, for common distributions it often admits a closed form. To derive this, start from
\begin{equation} \mathbb{E}|X - y|= \int_{-\infty}^{\infty} |x - y| f(x)\, dx. \end{equation}
Split at \(x = y\):
\begin{equation} = \int_{-\infty}^{y} (y - x) f(x)\, dx + \int_{y}^{\infty} (x - y) f(x)\, dx. \end{equation}
For the first integral, we have:
\begin{equation} \int_{-\infty}^{y} (y-x)f(x)\,dx = y\int_{-\infty}^{y} f(x)\,dx- \int_{-\infty}^{y} x f(x)\,dx =\int_{-\infty}^{y} F(x)\, dx. \end{equation}
The last equality can be verified via integration by parts. Let \(u = F(x)\), \(dv = dx\), so \(du = f(x)dx\), \(v = x\). Then
\begin{equation} \int_{-\infty}^{y} F(x)\, dx= \Big[ x F(x) \Big]_{-\infty}^{y}- \int_{-\infty}^{y} x f(x)\, dx. \end{equation}
Since \(x F(x) \to 0\) as \(x \to -\infty\) (a consequence of \(\mathbb{E}|X| < \infty\)), the boundary term at \(-\infty\) vanishes, yielding the result.
For the second integral, we have:
\begin{equation} \int_{y}^{\infty} (x-y)f(x)\,dx= \int_{y}^{\infty} x f(x)\,dx- y\int_{y}^{\infty} f(x)\,dx =\int_{y}^{\infty} (1 - F(x))\, dx. \end{equation}
To derive this, again we use integration by parts. Let $u = 1 - F(x)$, $dv = dx$, so $du = -f(x)dx$, $v = x$. Then
\begin{equation} \int_{y}^{\infty} (1 - F(x))\, dx= \Big[ x(1 - F(x)) \Big]_{y}^{\infty} + \int_{y}^{\infty} x f(x)\, dx. \end{equation}
Under the assumption \(\mathbb{E}|X| < \infty\), we have \(x(1 - F(x)) \to 0\) as \(x \to \infty\), so the boundary term equals \(-y(1 - F(y))\), which yields the desired equality.
Remarks on E|X|
The quantity \(\mathbb{E}|X|\) measures the average magnitude of \(X\), ignoring its sign. One representation is:
\begin{equation} \mathbb{E}|X|= \int_{-\infty}^{\infty} |x| f(x)\, dx, \end{equation}
which is a special case of $\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\, dx$ with $g(x) = |x|$. An alternative CDF-based representation is:
\begin{equation} \mathbb{E}|X|= \int_{-\infty}^{0} F(x)\, dx + \int_{0}^{\infty} (1 - F(x))\, dx. \end{equation}
If $X$ is symmetric about $0$ (i.e., $f(x) = f(-x)$), then $\mathbb{E}|X|= 2 \int_{0}^{\infty} x f(x)\, dx$. This simplification relies on symmetry around zero; it does not hold for distributions that are symmetric around a different point or are asymmetric. There is no simple algebraic relationship between $\mathbb{E}|X|$ and $\mathbb{E}[X]$. However, by Jensen's inequality (since $|\cdot|$ is convex), $\mathbb{E}|X| \ge |\mathbb{E}[X]|$. Equality holds if and only if $X$ does not change sign (i.e., $X \ge 0$ or $X \le 0$ almost surely), for example when $X$ is exponential.
5.4.5 Dispersion Penalty: E|X-X'|
We have shown that
\begin{equation} \mathbb{E}|X - X'|= 2 \int_{-\infty}^{\infty} F(x)\big(1 - F(x)\big)\, dx, \end{equation}
where $X$ and $X'$ are independent random variables with common CDF $F$.

The left-hand side is the expected absolute difference between two independent draws from the distribution. It is therefore a direct and natural measure of dispersion: if the distribution is wide, two independent draws will typically be far apart; if it is concentrated, they will be close.
To understand the right-hand side, fix a point \(x \in \mathbb{R}\). The key idea is to reinterpret the absolute difference \(|X - X'|\) geometrically. For any fixed realization of \(X\) and \(X'\), the quantity \(|X - X'|\) is simply the length of the interval between them. This length can be written as
\begin{equation} |X - X'|= \int_{-\infty}^{\infty} \mathbf{1}_{{x \text{ lies between } X \text{ and } X'}}\, dx, \end{equation}
because a point $x$ contributes to the length exactly when it lies between the two values. Taking expectations, we obtain
\begin{equation} \mathbb{E}|X - X'|= \int_{-\infty}^{\infty} P(x \text{ lies between } X \text{ and } X')\, dx. \end{equation}
Now, a point \(x\) lies between \(X\) and \(X'\) in two symmetric ways: either \(X \le x < X'\), or \(X' \le x < X\). Each of these events has probability \(F(x)(1 - F(x))\), and since they are disjoint and equally likely, their total probability is \(2F(x)(1 - F(x))\). Substituting this into the integral yields
\begin{equation} \mathbb{E}|X - X'|= 2 \int_{-\infty}^{\infty} F(x)(1 - F(x))\, dx. \end{equation}
5.5 CRPS for Distributions
5.5.1 Equivariance of CRPS under Linear Transformations
The CRPS behaves naturally under linear transformations, which makes it directly comparable across different scales and locations. Let \(X\) be a random variable with distribution \(F\), and consider a general linear transformation
\begin{equation} Y = a + bX, \quad \text{with } a,b \in \mathbb{R}, \; b > 0. \end{equation}
Let \(y\) be an observation corresponding to \(X\), and define the transformed observation \(y' = a + b y\). Then CRPS satisfies the equivariance property
\begin{equation} \mathrm{CRPS}(Y, y')= b \, \mathrm{CRPS}(X, y), \quad y=\frac{y'-a}{b} \end{equation}
This result immediately implies that shifting both the forecast and the observation does not change CRPS, and scaling by \(b>0\) multiplies CRPS by \(b\). As a result, it is sufficient to study CRPS on a reference distribution. All other members of the location–scale or mean-std family follow immediately by scaling.
To derive the equality, use the expectation representation and substitute \(Y = a + bX\) and \(y' = a + by\). We get the following equations for the sharpness and dispersion penalty terms:
\begin{equation} |Y - y'|= |a + bX - (a + by)|= b |X - y|. \end{equation}
\begin{equation} |Y - Y'|=|a + bX - (a + bX')|= b |X - X'|. \end{equation}
Taking expectations and combining the terms shows the expected result.
5.5.2 CRPS for Normal Distribution
Let \(\mathcal{N}_0\) denote the standard Normal distribution, with pdf \(\phi(z)\) and CDF \(\Phi(z)\) and let \(z\) be the observed value. The CRPS is calculated by:
\begin{equation} \text{CRPS}(\mathcal{N}_0,z)=z(2\Phi(z)-1)+2\phi(z)-\frac{1}{\sqrt{\pi}} \end{equation}
From equivariance of CRPS under linear transformations, if $\mathcal{N}(\mu,\sigma)$ is the Normal distribution with parameters $\mu$ and $\sigma$, and if $y$ is observed, then
\begin{equation} \text{CRPS}(\mathcal{N}(\mu,\sigma),y)=\sigma\,\text{CRPS}(\mathcal{N}_0,z),\quad z = \frac{y - \mu}{\sigma} \end{equation}
To derive the equation, let \(Z\sim \mathcal{N}_0\) and recall that:
\begin{equation} \text{CRPS}(Z,z)=\mathbb{E}[Z] + z(2\Phi(z)-1) -2\int_{-\infty}^z x \phi(x)\,dx- \frac12 \mathbb{E}|X - X'|, \end{equation}
where \(X, X' \sim \mathcal{N}_0\) are independent. We know that \(\mathbb{E}[Z]=0\) and \(F(z)=\Phi(z)\). Therefore, we should deal with the last two terms. First, the normal density satisfies \(\phi'(x) = -x\phi(x)\). So \(x\phi(x) = -\phi'(x)\) and therefore,
\begin{equation} \int_{-\infty}^z x\phi(x)\,dx= \int_{-\infty}^z -\phi'(x)\,dx= -\phi(z)+\phi(-\infty)= -\phi(z). \end{equation}
Second, if $X, X' \sim \mathcal{N}(0,1)$ are independent, then $X-X' \sim \mathcal{N}(0,2)$ (recall that means subtract while variances add). For a centered normal $W\sim \mathcal{N}(0,\tau^2)$, we have:
\begin{equation} \mathbb{E}|W|=\tau\sqrt{\frac{2}{\pi}}. \end{equation}
Here $\tau=\sqrt{2}$ and therefore,
\begin{equation} \mathbb{E}|X - X'| = \frac{2}{\sqrt{\pi}}. \end{equation}
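Putting these pieces together, the closed form can be implemented directly (a short sketch using SciPy; the function name is ours):

```python
import numpy as np
from scipy import stats

def crps_normal(mu, sigma, y):
    """Closed-form CRPS for N(mu, sigma), via the standardized formula."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * stats.norm.cdf(z) - 1)
                    + 2 * stats.norm.pdf(z)
                    - 1 / np.sqrt(np.pi))
```

By equivariance, `crps_normal(mu, sigma, y)` equals `sigma` times the standard-normal score at the standardized observation.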
An alternative formula for the CRPS of the normal distribution can be derived by using the following function:
\begin{equation} A(m,s) = 2s \, \phi\Big(\frac{m}{s}\Big) + m \Big( 2 \Phi\Big(\frac{m}{s}\Big) - 1 \Big). \end{equation}
It is easy to show the following equality:
\begin{equation} \mathrm{CRPS}(\mathcal{N}(\mu,\sigma),y) = A(y-\mu,\sigma) - \frac{1}{2} A(0, \sqrt{2}\sigma). \end{equation}
Note that \(A(0, \sqrt{2}\sigma)=2\sqrt{2}\,\sigma\,\phi(0)\), which equals \(2\sigma/\sqrt{\pi}\). Therefore,
\begin{equation} \mathbb{E}|X - X'| = A(0, \sqrt{2}\sigma). \end{equation}
\begin{equation} \mathbb{E}|X - y| = A(y-\mu,\sigma). \end{equation}
In fact, we can derive the following more general equation similarly:
\begin{equation} Z \sim \mathcal{N}(m, s) \quad\Rightarrow\quad \mathbb{E}|Z| = A(m,s) \end{equation}
If $Z \sim \mathcal{N}(0,1)$, then $\mathbb{E}|Z| = \sqrt{2/\pi} \approx 0.798$. This gives a useful interpretation: the *typical magnitude* of a standard normal draw is about $0.8$.
5.5.3 CRPS for Student-t Distribution
Let \(T_\nu\) denote the standard Student-t distribution with \(\nu\) degrees of freedom, with pdf \(t_\nu(z)\) and CDF \(T_\nu(z)\). Assume \(\nu>1\), so the mean exists. The CRPS is calculated by:
\begin{equation} \mathrm{CRPS}(T_\nu, z)= z \big(2T_\nu(z) - 1\big) + 2 \frac{\nu + z^2}{\nu - 1} t_\nu(z)- \frac{2\sqrt{\nu}}{\nu - 1} \frac{B\!\left(\tfrac{1}{2}, \nu - \tfrac{1}{2}\right)} {B\!\left(\tfrac{1}{2}, \tfrac{\nu}{2}\right)^2}. \end{equation}
Similar to the normal case, it is easy to use this formula to calculate the CRPS for the mean–std representation of the t distribution.
Compared to the CRPS for the normal distribution, there is a heavier-tail correction \(\frac{\nu+z^2}{\nu-1}\). Also, instead of \(1/\sqrt{\pi}\), there is a beta-function ratio in the formula.
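A direct transcription of this formula (a sketch using SciPy; the function name is ours):

```python
import numpy as np
from scipy import stats, special

def crps_t(nu, z):
    """Closed-form CRPS for the standard Student-t with nu > 1 degrees of freedom."""
    cdf = stats.t.cdf(z, df=nu)
    pdf = stats.t.pdf(z, df=nu)
    # Constant involving the beta-function ratio.
    tail = (2 * np.sqrt(nu) / (nu - 1)
            * special.beta(0.5, nu - 0.5) / special.beta(0.5, nu / 2) ** 2)
    return z * (2 * cdf - 1) + 2 * (nu + z ** 2) / (nu - 1) * pdf - tail
```

As a sanity check, the score is symmetric in \(z\) for this symmetric distribution, and at \(z=0\) it exceeds the normal value (heavier tails cost more).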
5.5.4 CRPS for Gaussian Mixture Distribution
Assume that \(X\) has a Gaussian mixture distribution with \(K\) components \((\pi_k, \mu_k, \sigma_k)\). The CRPS is calculated by:
\begin{equation} \mathrm{CRPS}(F,y)= \sum_{k=1}^{K} \pi_k \, A(y-\mu_k,\sigma_k) -\frac{1}{2} \sum_{k=1}^{K} \sum_{l=1}^{K} \pi_k \pi_l \, A(\mu_k - \mu_l,\; \sqrt{\sigma_k^2 + \sigma_l^2}) \end{equation}
To derive this, we start from the expectation formula. Note that the expectation of a mixture is the weighted sum of component expectations. This allows us to compute both terms explicitly. For the first term (sharpness), with \(X_k \sim \mathcal{N}(\mu_k,\sigma_k)\), we have:
\begin{equation} \mathbb{E}|X-y|= \sum_{k=1}^{K} \pi_k \, \mathbb{E}|X_k - y|. \end{equation}
There is no interaction between components here, because only one draw \(X\) is involved.
For the second term (dispersion penalty), let \(X, X'\) be independent draws from the mixture. Assume \(X\) comes from component \(k\) with probability \(\pi_k\) and \(X'\) comes from component \(l\) with probability \(\pi_l\). Conditional on this pair, the mean and variance of \(X-X'\) are \(\mu_k - \mu_l\) and \(\sigma_k^2 + \sigma_l^2\), respectively. Now we must consider all possible combinations; each pair occurs with probability \(\pi_k \pi_l\). So,
\begin{equation} \mathbb{E}|X-X'|= \sum_{k=1}^{K} \sum_{l=1}^{K} \pi_k \pi_l \, \mathbb{E}|X_k - X_l'|. \end{equation}
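Combining both terms with the function \(A\) from the normal section gives a compact implementation (a vectorized sketch; the function names are ours):

```python
import numpy as np
from scipy import stats

def A(m, s):
    """A(m, s) = E|Z| for Z ~ N(m, s)."""
    return 2 * s * stats.norm.pdf(m / s) + m * (2 * stats.norm.cdf(m / s) - 1)

def crps_gaussian_mixture(pi, mu, sigma, y):
    """Closed-form CRPS for a 1-D Gaussian mixture (pi_k, mu_k, sigma_k)."""
    pi, mu, sigma = map(np.asarray, (pi, mu, sigma))
    sharp = np.sum(pi * A(y - mu, sigma))                 # E|X - y|
    s_kl = np.sqrt(sigma[:, None] ** 2 + sigma[None, :] ** 2)
    disp = np.sum(pi[:, None] * pi[None, :]
                  * A(mu[:, None] - mu[None, :], s_kl))   # E|X - X'|
    return sharp - 0.5 * disp                             # energy form
```

With a single component this reduces exactly to the normal closed form, which provides a convenient consistency check.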
5.5.5 CRPS for Uniform Distribution
Assume \(X \sim U(a,b)\) for \(a,b \in \mathbb{R}\) with \(b>a\). Let \(y' = (y-a)/(b-a)\), we have:
\begin{equation} \mathrm{CRPS}_{[a,b]}(y) = (b-a)\times \begin{cases} \tfrac{1}{3} - y', & y' < 0, \\ y'^2 - y' + \tfrac{1}{3}, & 0 \le y' \le 1, \\ y' - \tfrac{2}{3}, & y' > 1. \end{cases} \end{equation}
To derive this, define the standardized variable:
\begin{equation} Z = \frac{X-a}{b-a} \sim U(0,1), \qquad y' = \frac{y-a}{b-a}. \end{equation}
This transformation reduces the general case \([a,b]\) to the unit interval \([0,1]\). Using the transformation \(X = a + (b-a)Z\), we get:
\begin{equation} |X-y| = (b-a)\,|Z - y'|, \quad |X-X'| = (b-a)\,|Z - Z'|. \end{equation}
Taking expectations:
\begin{equation} \mathbb{E}|X-y| = (b-a)\,\mathbb{E}|Z-y'|, \quad \mathbb{E}|X-X'| = (b-a)\,\mathbb{E}|Z-Z'|. \end{equation}
Therefore,
\begin{equation} \mathrm{CRPS}_{[a,b]}(y) = (b-a)\,\mathrm{CRPS}_{[0,1]}(y') \end{equation}
Therefore, we focus on $U(0,1)$. For the *sharpness*, writing \(X \sim U(0,1)\) and \(y\) for the (standardized) observation, we have:
\begin{equation} \mathbb{E}|X-y| = \begin{cases} \tfrac{1}{2} - y, & y < 0, \\ y^2 - y + \tfrac{1}{2}, & 0 \le y \le 1, \\ y - \tfrac{1}{2}, & y > 1. \end{cases} \end{equation}
To show this, note that we have $\mathbb{E}|X - y| = \int_0^1 |x - y| \, dx$. We split depending on where $y$ lies. When $0 \le y \le 1$, we have:
\begin{equation} \mathbb{E}|X-y| = \int_0^y (y - x)\,dx + \int_y^1 (x - y)\,dx = y^2 - y + \tfrac{1}{2}. \end{equation}
When \(y < 0\), then \(|x-y| = x - y\) for all \(x \in [0,1]\) and we have:
\begin{equation} \mathbb{E}|X-y| = \int_0^1 (x - y)\,dx = \tfrac{1}{2} - y. \end{equation}
When \(y > 1\), then \(|x-y| = y - x\) and we have:
\begin{equation} \mathbb{E}|X-y| = \int_0^1 (y - x)\,dx = y - \tfrac{1}{2}. \end{equation}
For the dispersion penalty, we have:
\begin{equation} \mathbb{E}|X - X'| = \tfrac{1}{3}. \end{equation}
To derive this, let \(X, X' \sim \mathrm{Unif}(0,1)\) i.i.d., then \(\mathbb{E}|X - X'|= \int_0^1 \int_0^1 |x - x'| \, dx \, dx'\). By using symmetry, this is \(2 \int_0^1 \int_0^x (x - x')\,dx'\,dx\). The inner integral:
\begin{equation} \int_0^x (x - x')\,dx' = \left[ x x' - \tfrac{x'^2}{2} \right]_0^x = x^2 - \tfrac{x^2}{2} = \tfrac{x^2}{2}. \end{equation}
And the outer integral:
\begin{equation} \mathbb{E}|X - X'| = 2 \int_0^1 \tfrac{x^2}{2} \,dx = \int_0^1 x^2\, dx = \tfrac{1}{3}. \end{equation}
Instead of working with piecewise expressions, it is convenient to introduce a *clipped (projected) value*. For the standardized variable $y' = \frac{y-a}{b-a}$, define\begin{equation} z' = \begin{cases} 0, & y' < 0, \\ y', & 0 \le y' \le 1, \\ 1, & y' > 1, \end{cases} \quad \text{equivalently} \quad z' = \min(\max(y',0),1). \end{equation}
Therefore:
\begin{equation} \mathbb{E}|X-y| = (b-a)\left[\,|y' - z'| + z'^2 - z' + \tfrac{1}{2}\right]. \end{equation}
This expression is equivalent to the usual piecewise formula but avoids case distinctions. \(|y'-z'|\) captures behavior outside \([0,1]\) and \(z'^2 - z'\) captures the quadratic behavior inside the interval.
Also,
\begin{equation} \mathbb{E}|X - X'| = \frac{b-a}{3}. \end{equation}
The constant \(\tfrac{1}{3}\) is the average distance between two independent uniform points on \([0,1]\).
Combining both expectations we obtain:
\begin{equation} \mathrm{CRPS}_{[a,b]}(y) = (b-a)\left[ |y' - z'| + z'^2 - z' + \tfrac{1}{3} \right]. \end{equation}
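The clipped form translates directly into code. Below is a minimal Python sketch of the closed-form uniform CRPS (the function name `crps_uniform` is illustrative):

```python
def crps_uniform(a, b, y):
    """CRPS of a U(a, b) forecast at observation y, via the clipped form:
    (b - a) * (|y' - z'| + z'^2 - z' + 1/3) with y' = (y-a)/(b-a) and
    z' = min(max(y', 0), 1)."""
    yp = (y - a) / (b - a)          # standardized observation y'
    zp = min(max(yp, 0.0), 1.0)    # clipped (projected) value z'
    return (b - a) * (abs(yp - zp) + zp * zp - zp + 1.0 / 3.0)
```

Because the clipping handles all three cases at once, no piecewise branching is needed.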
5.5.6 CRPS for Quantile-based Interpolated Distribution
Let
\begin{equation} X \sim \sum_{i=0}^{k} w_i \, U(a_i,b_i), \end{equation}
with weights \(w_i\), \(\sum w_i = 1\), intervals \([a_i,b_i]\), lengths \(L_i = b_i - a_i\), and midpoints \(m_i = \frac{a_i + b_i}{2}\). We have:
\begin{equation} \mathrm{CRPS}(y)= \sum_i w_i A_i(y) -\frac{1}{2} \sum_i w_i^2 \frac{L_i}{3} -\sum_{i < j} w_i w_j (m_j - m_i) \end{equation}
with
\begin{equation} A_i(y) = \begin{cases} m_i - y, & y \le a_i, \\ \frac{(y-a_i)^2 + (b_i-y)^2}{2(b_i-a_i)}, & a_i \le y \le b_i,\\ y - m_i, & y \ge b_i. \end{cases} \end{equation}
The derivation is straightforward from the uniform result and the uniform-mixture interpretation of the QI distribution. Using mixture linearity, we have:
\begin{equation} \mathbb{E}|X - y|= \sum_{i=0}^{k} w_i \,\mathbb{E}|U_i - y|, \quad U_i \sim U(a_i,b_i). \end{equation}
Define \(A_i(y) = \mathbb{E}|U(a_i,b_i) - y|\). From the uniform derivation above, we get \(\mathbb{E}|X - y|=\sum_i w_i A_i(y)\).
Regarding the term \(\mathbb{E}|X - X'|\), we expand using mixture independence:
\begin{equation} \mathbb{E}|X - X'|= \sum_i w_i^2 \mathbb{E}|U_i - U_i'| + 2 \sum_{i < j} w_i w_j \mathbb{E}|U_i - U_j|. \end{equation}
For the diagonal terms, we get
\begin{equation} \mathbb{E}|U_i - U_i'|=\frac{b_i - a_i}{3}= \frac{L_i}{3}. \end{equation}
For the cross terms (i.e., different intervals), note that for \(i < j\), the intervals are disjoint and ordered (i.e., \(b_i \le a_j\)), thus \(|U_i - U_j| = U_j - U_i\), so
\begin{equation} \mathbb{E}|U_i - U_j|= \mathbb{E}[U_j] - \mathbb{E}[U_i]= m_j - m_i. \end{equation}
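Putting the pieces together, here is a minimal Python sketch of the QI-mixture CRPS (names are illustrative; as in the derivation, the intervals are assumed disjoint and ordered so that \(b_i \le a_{i+1}\)):

```python
def crps_qi(weights, intervals, y):
    """CRPS of a quantile-interpolated (uniform-mixture) forecast:
    sum_i w_i A_i(y) - (1/2) sum_i w_i^2 L_i/3 - sum_{i<j} w_i w_j (m_j - m_i).

    weights:   list of w_i summing to 1
    intervals: list of (a_i, b_i), disjoint and ordered
    """
    def A(a, b, yy):                      # A_i(y) = E|U(a, b) - y|
        m = 0.5 * (a + b)
        if yy <= a:
            return m - yy
        if yy >= b:
            return yy - m
        return ((yy - a) ** 2 + (b - yy) ** 2) / (2.0 * (b - a))

    sharp = sum(w * A(a, b, y) for w, (a, b) in zip(weights, intervals))
    diag = sum(w * w * (b - a) / 3.0 for w, (a, b) in zip(weights, intervals))
    mids = [0.5 * (a + b) for a, b in intervals]
    cross = sum(weights[i] * weights[j] * (mids[j] - mids[i])
                for i in range(len(weights)) for j in range(i + 1, len(weights)))
    return sharp - 0.5 * diag - cross
```

As a sanity check, two equally weighted adjacent intervals \([0,1]\) and \([1,2]\) reproduce a single \(U(0,2)\) forecast, so the QI formula must agree with the closed-form uniform CRPS in that case.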
5.5.7 CRPS via Numerical Integration
We can rewrite the CRPS as:
\begin{equation} \mathrm{CRPS}(F, y)= \int_{-\infty}^{y} F(x)^2 \, dx + \int_{y}^{\infty} (1 - F(x))^2 \, dx. \end{equation}
This representation is often numerically more stable than formulations involving expectations. In practice, the integrals over \((-\infty, \infty)\) must be approximated numerically. The general approach is to truncate the integration domain to \([L, U]\), where \(L\) and \(U\) are chosen as extreme quantiles such that \(y\in[L,U]\). Techniques such as Gauss–Kronrod quadrature are commonly used because they work well for a wide range of distributions. The method only requires evaluation of the CDF, and is therefore applicable to any distribution with a computable CDF.
While the above approach is fully general, particular distributions allow for optimizations. For example, the uniform distribution has exact finite bounds, eliminating the need for truncation, while heavy-tailed distributions may require wider bounds or adaptive strategies.
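As an illustration of the truncation-plus-quadrature approach, here is a minimal pure-Python sketch using the composite trapezoidal rule. The integral is split at \(y\), where the integrand switches from \(F^2\) to \((1-F)^2\); names and the rule choice are illustrative, and a production implementation would typically use an adaptive scheme such as Gauss–Kronrod:

```python
def _trap(f, a, b, n):
    """Composite trapezoidal rule for f on [a, b] with n subintervals."""
    if b <= a:
        return 0.0
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h

def crps_numeric(cdf, y, lo, hi, n=4000):
    """CRPS(F, y) via the CDF representation, truncated to [lo, hi]
    (extreme quantiles chosen so that lo <= y <= hi). The domain is split
    at y to avoid integrating across the kink in the integrand."""
    left = _trap(lambda x: cdf(x) ** 2, lo, y, n)
    right = _trap(lambda x: (1.0 - cdf(x)) ** 2, y, hi, n)
    return left + right
```

Only a CDF callable is needed, so the same routine applies to any distribution with a computable CDF.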
6 Directional Predictive Distributions
In probabilistic forecasting, a fundamental distinction is made between continuous and discrete predictive distributions. A continuous predictive distribution assigns probability mass over an uncountable outcome space (e.g., real-valued returns), and is typically described via a probability density function (PDF) and its associated cumulative distribution function (CDF). In contrast, discrete predictive distributions assign probabilities to a finite or countable set of outcomes and are characterized by a probability mass function (PMF). Rather than modeling the full range of possible values, discrete models operate on a partitioned or coarsened outcome space, where each category represents a subset of the original continuous domain.
A particularly important special case of predictive modeling is directional prediction, where the continuous outcome \(Y \in \mathbb{R}\) is reduced to its direction relative to a reference value \(Y_0\). Define the directional variable:
\begin{equation} D = \begin{cases} \text{down}, & Y < Y_0 \\ \text{constant}, & Y = Y_0 \\ \text{up}, & Y > Y_0 \end{cases} \end{equation}
Note that \(D\) is a direction derived from a continuous variable (i.e., \(D=g(Y)\)); however, we work entirely in the induced categorical space of \(D\), without assuming consistency with an underlying continuous predictive distribution over \(Y\). We now assume that \(D\) has a categorical distribution with three parameters:
\begin{equation} D \sim \text{Categorical}(p_d, p_c, p_u), \quad p_d + p_c + p_u = 1, \quad 0 \le p_k \le 1 \end{equation}
This defines a three-state outcome space: \(\{\text{down}, \text{constant}, \text{up}\}\) with probability mass function:
\begin{equation} P(D = y) = \begin{cases} p_d, & y = \text{down} \\ p_c, & y = \text{constant} \\ p_u, & y = \text{up} \end{cases} \end{equation}
or equivalently:
\begin{equation} P(D = y) = \prod_{k \in \{\text{down}, \text{constant}, \text{up}\}} p_k^{\mathbb{1}_{\{y = k\}}} \end{equation}
A CDF is not naturally defined without imposing an artificial ordering on the categorical states. Also, to define moments such as the mean and variance, the categorical states must be mapped to numerical values.
6.1 Mixture of Directional Distributions
In many applications, we may have multiple predictive models, each providing its own directional probabilities. Suppose there are \(M\) predictors, and let \(\{\mathbf{p}^{(m)}\}_{m=1}^M\) denote their corresponding directional predictive distributions, where
\begin{equation} \mathbf{p}^{(m)} = \big(p_d^{(m)}, p_c^{(m)}, p_u^{(m)}\big), \end{equation}
and each \(\mathbf{p}^{(m)}\) parameterizes a categorical distribution over the three outcomes \(\{\text{down}, \text{constant}, \text{up}\}\).
A natural goal is to combine these predictions into a single consensus forecast. Let \(\{\pi_m\}_{m=1}^M\) be non-negative weights such that
\begin{equation} \pi_m \ge 0, \quad \sum_{m=1}^M \pi_m = 1. \end{equation}
We define the combined probability vector as the weighted average of the individual predictive probabilities:
\begin{equation} \mathbf{p}^{(\text{mix})} = \sum_{m=1}^M \pi_m \mathbf{p}^{(m)}. \end{equation}
Component-wise, this becomes:
\begin{equation} p_k^{(\text{mix})} = \sum_{m=1}^M \pi_m \, p_k^{(m)}, \quad k \in \{\text{down}, \text{constant}, \text{up}\}. \end{equation}
These combined probabilities can be used to define a new categorical distribution with three states, referred to as the mixture distribution (or linear opinion pool). To verify this, note that each \(p_k^{(\text{mix})} \in [0,1]\), since it is a convex combination of valid probabilities, and:
\begin{equation} \sum_{k} p_k^{(\text{mix})} = \sum_{k} \sum_{m=1}^M \pi_m p_k^{(m)} = \sum_{m=1}^M \pi_m \sum_{k} p_k^{(m)} = \sum_{m=1}^M \pi_m = 1. \end{equation}
Therefore, \(\mathbf{p}^{(\text{mix})}\) is a valid probability vector, and the resulting model is simply another categorical distribution with updated probabilities.
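The linear opinion pool for directional forecasts is a one-line computation; here is a minimal sketch (names are illustrative):

```python
def mix_directional(probs, weights):
    """Linear opinion pool over M directional forecasts.

    probs:   list of (p_down, p_const, p_up) probability vectors
    weights: non-negative pool weights pi_m summing to 1
    """
    return tuple(sum(w * p[k] for w, p in zip(weights, probs))
                 for k in range(3))
```

Because each output component is a convex combination of valid probabilities, the result is again a valid probability vector.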
6.2 Brier Score
For directional predictive distributions, a standard evaluation tool is the Brier Score, which measures how close the predicted probabilities are to the observed outcome. For the three-state case \(\{\text{down}, \text{constant}, \text{up}\}\), the Brier Score is defined as:
\begin{equation} \text{BS}(\mathbf{p}, \mathbf{o}) = \frac{1}{3} \sum_{k \in \{\text{down}, \text{constant}, \text{up}\}} (p_k - o_k)^2, \end{equation}
where \(\mathbf{p} = (p_d, p_c, p_u)\) is the predicted probability vector, \(\mathbf{o} = (o_d, o_c, o_u)\) is the observed outcome encoded as a one-hot vector (i.e., \(o_k = 1\) for the realized outcome and \(0\) otherwise). Alternatively:
\begin{equation} \text{BS}(\mathbf{p}, \mathbf{o}) = \frac{1}{3}\left( \sum_k p_k^2 - 2p_{\text{true}} + 1 \right), \end{equation}
where \(p_{\text{true}}\) is the predicted probability assigned to the realized outcome. This follows because, for a one-hot \(\mathbf{o}\), we have \(\sum_k o_k^2 = 1\) and \(\sum_k p_k o_k = p_{\text{true}}\).
The Brier Score is simply the average squared error between predicted probabilities and what actually occurred. If a model assigns high probability to the true outcome, we get a small error. If it assigns low probability to the true outcome, we get a large error.
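A minimal sketch of the three-state Brier Score, with the realized outcome passed as a state index (names are illustrative):

```python
def brier_score(p, outcome):
    """Three-state Brier Score: mean squared error between the predicted
    probabilities and the one-hot encoding of the realized outcome.

    p:       predicted probabilities (p_down, p_const, p_up)
    outcome: index of the realized state (0 = down, 1 = constant, 2 = up)
    """
    one_hot = [1.0 if k == outcome else 0.0 for k in range(3)]
    return sum((pk - ok) ** 2 for pk, ok in zip(p, one_hot)) / 3.0
```

The direct definition and the \(\frac{1}{3}(\sum_k p_k^2 - 2p_{\text{true}} + 1)\) form give the same value.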
The Brier Score is a strictly proper scoring rule, meaning that, in expectation, it is uniquely minimized when the predicted probabilities match the true distribution. Let the true data-generating distribution be:
\begin{equation} \mathbf{q} = (q_d, q_c, q_u). \end{equation}
We compute the expected Brier Score under \(\mathbf{q}\):
\begin{equation} \mathbb{E}_{\mathbf{q}}[\text{BS}(\mathbf{p}, \mathbf{o})] = \frac{1}{3} \sum_{k} \mathbb{E}_{\mathbf{q}} \big[(p_k - o_k)^2\big]. \end{equation}
Since \(o_k\) is an indicator variable, we have \(\mathbb{E}[o_k] = q_k, \quad \mathbb{E}[o_k^2] = q_k\). Therefore:
\begin{equation} \mathbb{E}[(p_k - o_k)^2] = p_k^2 - 2p_k \mathbb{E}[o_k] + \mathbb{E}[o_k^2] = p_k^2 - 2p_k q_k + q_k. \end{equation}
Substituting and rearranging terms (using \(p_k^2 - 2p_k q_k + q_k = (p_k - q_k)^2 + q_k - q_k^2\)), we get
\begin{equation} \mathbb{E}_{\mathbf{q}}[\text{BS}(\mathbf{p}, \mathbf{o})] = \frac{1}{3} \left( \sum_k (p_k - q_k)^2 + \sum_k q_k - \sum_k q_k^2 \right). \end{equation}
Since \(\sum_k q_k = 1\), this becomes:
\begin{equation} \mathbb{E}_{\mathbf{q}}[\text{BS}(\mathbf{p}, \mathbf{o})] = \frac{1}{3} \sum_k (p_k - q_k)^2 + \text{constant}, \end{equation}
where the constant does not depend on \(\mathbf{p}\). This distance is minimized if and only if \(\mathbf{p} = \mathbf{q}\). Therefore, the Brier Score is strictly proper.
7 Designing Fair and Honest Competitions
To evaluate forecasting performance across different prediction types, we rely on three standard scoring families:
- Brier Score for categorical or directional probabilities.
- Absolute Error for point predictions.
- Continuous Ranked Probability Score (CRPS) for full distributional forecasts.
Brier Score and CRPS are standard proper scoring rules for probabilistic forecasting, while absolute error is the standard loss for point forecasts when the target is the conditional median. In practice, we report all three because they capture different aspects of forecasting performance and provide complementary diagnostic information.
7.1 Combining Multiple Predictions from a Single Predictor
A predictor may submit more than one prediction for the same horizon, either to express several plausible views or because multiple submissions are allowed before scoring. The system therefore needs a principled way to reduce multiple submissions of the same type into one score for that horizon.
For probabilistic forecasts, such as distribution predictions, mixture distributions provide a natural foundation. A distribution forecast already represents uncertainty about the future value. Combining several distributional forecasts as a mixture preserves this uncertainty and represents the combined forecast as a weighted collection of submitted beliefs. The resulting mixture can then be scored using CRPS, consistent with CRPS as a score for full predictive distributions.
The same idea applies to directional predictions. Multiple directional forecasts can be combined as a mixture of directional distributions. The combined forecast remains probabilistic and can be scored using the Brier Score, preserving the meaning of the submitted directional uncertainty.
Point predictions require more care. A point prediction is a single-value forecast, not a probability distribution. Converting several point predictions into an empirical distribution can be useful for presentation, especially when comparing forecasts from different predictors. It can show disagreement, spread, clustering, and outliers. In that context, treating point forecasts as samples from a collection of opinions is reasonable.
However, this transformation is not appropriate for scoring multiple point predictions submitted by a single predictor. Unless the predictor explicitly submitted a distributional forecast, calculating CRPS on an empirical distribution built from point submissions changes the meaning of the submission. It evaluates the predictor as if they had made a distributional forecast, even though they chose the point-prediction format.
For scoring, multiple point predictions should remain point predictions. We evaluate each point prediction with absolute error and then average those absolute errors for the horizon. Thus, point forecasts are judged by point-forecast accuracy, not by distributional quality. This avoids rewarding or penalizing point predictors for distributional properties they did not explicitly claim.
This principle also affects cross-type inference. A distribution forecast may imply a point estimate, such as a mean or median, and both point and distribution forecasts may imply a direction. These implications can be useful for visualization, summaries, or exploratory comparison, such as displaying an implied direction or implied median.
For scoring, however, we intentionally avoid cross-type inference. A distribution submission is scored with CRPS, a point submission with absolute error, and a directional submission with the Brier Score. We do not derive absolute error from a distribution submission or derive a Brier Score from point or distribution submissions, even when such quantities are mathematically available.
This policy keeps the competition fair, honest, and interpretable. Users are evaluated according to the forecast type they chose: distributions on distributional quality, points on point accuracy, and directional predictions on directional probability calibration and accuracy.
Under this policy, a single horizon may still receive more than one score if the predictor submitted more than one prediction type. For example, if a predictor submits one directional prediction, one point prediction, and two distribution predictions for the same horizon, the system may compute:
- one Brier Score from the directional prediction;
- one Absolute Error score from the point prediction;
- one CRPS score from the combined distribution predictions.
The two distribution predictions are combined because they belong to the same prediction family. The point and directional predictions are scored separately. No score is created by inferring one prediction type from another.
This separation preserves the semantics of each submission type while still allowing multiple forecasts of the same type to be combined in a principled way.
7.2 From Betting to an Information-Revealing Framework
A scoring rule is proper if a forecaster maximizes expected score by reporting their true belief. This is the basic mathematical condition behind honest elicitation. Proper scoring rules are the right foundation for contests whose purpose is to reveal information rather than to reward strategic gaming.
Proper scoring rules remain proper under positive affine transformations. If \(R(p,y)\) is a proper reward score, then
\begin{equation} R'(p,y)=a+b\,R(p,y), \qquad b>0 \end{equation}
is also proper, because under the true belief \(q\),
\begin{equation} \mathbb{E}_q[R'(p,Y)] = a + b\,\mathbb{E}_q[R(p,Y)], \end{equation}
and the constants \(a\) and \(b>0\) do not change which report \(p\) maximizes expected reward. Taking \(b<0\) also preserves the truthful-reporting incentive; it merely flips the orientation of the score, for example turning a penalty for which smaller is better into a reward for which larger is better.
By contrast, arbitrary nonlinear transformations, step functions, rank rewards, and winner-take-all mechanisms generally do not preserve truthful incentives. In those designs, participants are no longer rewarded smoothly for being more accurate; instead, they are rewarded for maximizing their probability of beating others or crossing a threshold. This is the sense in which a forecasting contest can drift toward betting.
Our framework is designed to elicit forecasts rather than betting behavior. The distinction is important because the two activities optimize different objectives. In scientific prediction, the goal is to minimize expected loss under a stated evaluation rule; in betting, the goal is typically to maximize expected utility under a payoff schedule.
This also clarifies why forecasting and betting reveal different kinds of information. A predictive distribution contains information about center, spread, asymmetry, and tail risk. A bet, by contrast, is a decision obtained by combining beliefs with payoffs, wealth, budget constraints, and risk preferences. In that sense, betting compresses a distribution into an action. That compression is often appropriate for decision-making, but it is not ideal when the goal is to evaluate predictive accuracy itself, because the observed action no longer isolates the underlying belief.
Several standard considerations reinforce this distinction. First, risk aversion can make truthful reporting suboptimal. Second, cognitive and emotional factors can affect betting behavior in ways that are not part of the forecaster’s underlying statistical judgment. Third, once incentives become rank-based, threshold-based, or winner-take-all, participants may rationally prefer more extreme or higher-variance reports because the objective shifts from being accurate in expectation to improving the chance of a favorable payoff.
None of this means that odds are uninformative. On the contrary, betting and prediction-market prices can aggregate dispersed information. Information pooling is therefore a genuine strength of market-based mechanisms. Our framework is information-revealing rather than betting-based. It is designed to elicit forecasts themselves, not actions that combine beliefs with utility, wealth, risk tolerance, and strategic incentives. Because our objective is cross-project comparability and interpretable skill measurement rather than strict incentive compatibility under a single proper scoring rule, we allow limited and controlled departures from strict properness in later aggregation steps to improve robustness, fairness, and usability of the resulting scores.
7.3 Cross-Contest Normalization: The Skill Score
Because different competitions have different levels of difficulty, raw penalties such as CRPS, Brier Score, or Absolute Error are not directly comparable across projects. To normalize performance across heterogeneous settings, we use a baseline-relative Skill Score:
\begin{equation} SS_{u,c,h}^{(m)} = 1 - \frac{S_{u,c,h}^{(m)}}{S_{0,c,h}^{(m)}}, \end{equation}
where:
- \(u\) indexes the user,
- \(c\) indexes the competition or session,
- \(h\) indexes the forecast horizon,
- \(m \in \{\text{Brier}, \text{AE}, \text{CRPS}\}\) indexes the score type,
- \(S_{u,c,h}^{(m)}\) is the participant’s realized penalty,
- \(S_{0,c,h}^{(m)}\) is the corresponding baseline penalty.
This formulation applies to penalty metrics where smaller values are better, and it requires safeguards whenever the baseline score is close to zero.
- \(SS = 1\): perfect forecast,
- \(SS = 0\): no improvement over the baseline,
- \(SS < 0\): worse than the baseline.
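A minimal sketch of the baseline-relative Skill Score for a penalty metric, with a guard for near-zero baseline penalties (names and the `None` convention for degenerate baselines are illustrative):

```python
def skill_score(s_user, s_baseline, eps=1e-8):
    """Skill Score SS = 1 - S_user / S_baseline for a penalty metric
    (smaller penalty is better). Returns None when the baseline penalty
    is too close to zero for the ratio to be meaningful."""
    if s_baseline <= eps:
        return None
    return 1.0 - s_user / s_baseline
```

A perfect forecast (zero penalty) scores 1, matching the baseline scores 0, and doubling the baseline penalty scores -1.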
Skill Scores are dimensionless, which makes them much easier to compare and aggregate across very different projects. This is the main reason we use them. However, they should be understood as a normalization device, not as a pure proper scoring rule.
It is tempting to think that the Skill Score remains proper because it looks linear in the participant’s score. However, that is not correct in general. A true affine transform has the form \(R'(p,y)=a+b\,R(p,y),\) where \(a\) and \(b>0\) are fixed constants. The Skill Score, by contrast, has the form
\begin{equation} SS(p,y)=1-\frac{S(p,y)}{S(b,y)}, \end{equation}
where \(b\) is the baseline forecast. The denominator \(S(b,y)\) depends on the realized outcome \(y\), so the transformation is outcome-dependent, not a fixed affine transform. Therefore,
\begin{equation} \mathbb{E}[SS(p,Y)] = 1 - \mathbb{E}\!\left[\frac{S(p,Y)}{S(b,Y)}\right], \end{equation}
which is not the same optimization problem as minimizing \(\mathbb{E}[S(p,Y)]\). The denominator reweights outcomes by \(1/S(b,Y)\), and this can change the report that maximizes expected performance. This is why skill scores are generally improper even when they are built from proper scores.
A simple binary Brier example makes the problem concrete. Suppose the true event probability is \(q\), the forecaster reports \(p\), and the baseline reports \(b\). Using Brier loss,
\begin{equation} S(p,1)=(1-p)^2, \qquad S(p,0)=p^2. \end{equation}
The Skill Score objective is equivalent to minimizing
\begin{equation} L(p)= q\frac{(1-p)^2}{(1-b)^2} + (1-q)\frac{p^2}{b^2}. \end{equation}
Differentiating and solving gives the optimizer
\begin{equation} p^* = \frac{q\,b^2}{q\,b^2 + (1-q)(1-b)^2}. \end{equation}
If the score were proper, we would have \(p^*=q\) for all \(q\). But if \(q=0.7\) and \(b=0.8\), then
\begin{equation} p^* = \frac{0.7 \cdot 0.64}{0.7 \cdot 0.64 + 0.3 \cdot 0.04} = \frac{0.448}{0.46} \approx 0.974 \neq 0.7. \end{equation}
So a forecaster who truly believes the event probability is \(0.7\) would be incentivized to report approximately \(0.974\).
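The example can be checked numerically. The sketch below evaluates the expected Skill-Score loss \(L(p)\) from the binary Brier example and its closed-form minimizer (function names are illustrative):

```python
def expected_skill_loss(p, q, b):
    """Expected loss equivalent to the Brier-based Skill Score objective:
    L(p) = q (1-p)^2 / (1-b)^2 + (1-q) p^2 / b^2."""
    return q * (1.0 - p) ** 2 / (1.0 - b) ** 2 + (1.0 - q) * p ** 2 / b ** 2

def optimal_report(q, b):
    """Closed-form minimizer p* = q b^2 / (q b^2 + (1-q)(1-b)^2)."""
    num = q * b * b
    return num / (num + (1.0 - q) * (1.0 - b) ** 2)
```

For \(q=0.7\) and \(b=0.8\), the optimal report is about \(0.974\), and it achieves a strictly smaller expected loss than the truthful report \(p=0.7\), confirming the impropriety.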
For our purposes, this is an accepted trade-off. We use the Skill Score because it makes cross-project comparison and aggregation possible, while recognizing that it is not a fully proper elicitation rule.
7.4 The Choice of Baseline: Context-Aware Intelligence
The usefulness of the Skill Score depends critically on the quality of the baseline. If the baseline is too weak, participants can earn high scores by exploiting only the most basic structure in the data. If it is too strong, the competition becomes uninformative because improvement over baseline is too rare.
To strike this balance, we use a Seasonal Autoregressive baseline as the primary benchmark. The goal is not to compete with the best possible forecasting model, but to define a transparent and context-aware reference point: a baseline that captures immediate persistence and simple structure, while intentionally leaving room for human or model-based forecasting skill to add value.
- Context-awareness: the baseline responds to recent history rather than ignoring it entirely.
- Intentional simplicity: it does not attempt to absorb all available structure.
- Transparency: the benchmark is stable, easy to explain, and easy to reproduce.
When available history is too short to fit the main baseline reliably, we fall back to a simpler forecast. This ensures continuity and avoids overfitting the benchmark itself.
7.5 Aggregating Scores and Designing a Fair Market of Attention
A predictor may contribute to more than one competition. To produce a single leaderboard metric, we aggregate the transformed per-performance scores across competitions. A forecasting platform has two distinct goals that are not perfectly aligned:
- Encourage truthful and skillful forecasting.
- Direct attention toward projects that matter.
The second goal is not merely about ranking accuracy. It is about designing a mechanism that makes users more willing to participate in selected projects.
This creates a genuine design trade-off. The closer we stay to the pure theory of proper scoring, the cleaner the honesty incentive. The more we introduce boosts, support pools, visibility, or leaderboard incentives, the more we shape participation and attention. Our approach is to keep this trade-off explicit and controlled rather than pretending it does not exist. Therefore, we do not simply calculate average or sum of the scores.
7.5.1 Averaging across horizons
Each performance may contain several forecast horizons. We therefore begin by identifying the subset of horizons that are admissible for aggregation. Let \(C_u^{(m)}\) denote the set of competitions in which user \(u\) has a valid transformed performance for metric \(m\).
For user \(u\), competition \(c\), and metric \(m\), let \(\theta_m>0\) denote a small baseline floor, and define the aggregation-valid horizon set
\begin{equation} H_{u,c}^{*(m)}= \{ h : S_{u,c,h}^{(m)} \text{ and } S_{0,c,h}^{(m)} \text{ are finite, and } S_{0,c,h}^{(m)} > \theta_m \} \end{equation}
Horizons with \(S_{0,c,h}^{(m)} \le \theta_m\) are excluded from the aggregation. This includes the case in which both the participant and the baseline receive zero loss. The reason is not that such horizons are “wrong”, but that once the baseline loss is extremely small, the relative comparison becomes numerically fragile and can dominate aggregated scores for reasons that are not substantively informative. We therefore treat these horizons as unhelpful for cross-competition aggregation and simply omit them.
A practical default is to take \(\theta_m\) to be very small, for example
\begin{equation} \theta_m = 10^{-8}. \end{equation}
Under this choice, the floor is not intended to normalize the three score families onto a common scale. Its role is narrower: it serves only as a guard against degenerate or near-degenerate denominators in the aggregation. Because the negative side of the aggregated score is compressed later, one can justify using a fixed and very small common threshold as a purely numerical safeguard rather than as a substantive modeling choice. Nevertheless, such a common floor is appropriate only if ordinary baseline losses for all score families m remain well above that scale in practice; otherwise, metric-specific thresholds \(\theta_m\) should be used instead.
Given the valid set \(H^{*(m)}_{u,c}\), we define the raw per-performance score by averaging across the retained horizons:
\begin{equation} \overline{SS}_{u,c}^{(m),\mathrm{raw}}= \frac{1}{|H^{*(m)}_{u,c}|} \sum_{h \in H^{*(m)}_{u,c}} SS_{u,c,h}^{(m)}, \qquad |H^{*(m)}_{u,c}|>0. \end{equation}
If \(H^{*(m)}_{u,c}\) is empty, then that performance contributes no score for metric \(m\).
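A minimal sketch of the horizon filtering and averaging step for one user, one competition, and one metric (names and the `None` convention for an empty valid set are illustrative):

```python
import math

def per_performance_score(user_losses, base_losses, theta=1e-8):
    """Average Skill Score over the aggregation-valid horizon set: a
    horizon h is kept only when both losses are finite and the baseline
    loss exceeds the floor theta; returns None if no horizon survives."""
    ss = [1.0 - su / s0
          for su, s0 in zip(user_losses, base_losses)
          if math.isfinite(su) and math.isfinite(s0) and s0 > theta]
    return sum(ss) / len(ss) if ss else None
```

Horizons with a vanishing baseline loss, or with non-finite losses, simply contribute nothing to the average.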
7.6 Compressing the negative tail
A practical asymmetry of the raw Skill Score is that the positive side is bounded above by \(1\), while the negative side is unbounded below. Even after filtering out horizons with vanishing baseline losses, a poor performance can still yield a very large negative average. This is mathematically valid, but it can make aggregated scores unstable and discouraging. Since our framework already uses the Skill Score as a normalization device rather than as a fully proper scoring rule, we allow a bounded monotone transformation at the aggregation stage.
We therefore define the transformed per-performance score as
\begin{equation} \widehat{SS}_{u,c}^{(m)}= \psi\!\left(\overline{SS}_{u,c}^{(m),\mathrm{raw}}\right). \end{equation}
Rather than bending the score immediately at zero, we allow mildly negative values to pass through unchanged and compress only the more extreme negative tail. Let \(\tau>0\) denote the tail threshold, so that compression begins only below \(-\tau\), and let \(L>\tau\) be such that \(-L\) is the lower bound. A simple bounded monotone choice is
\begin{equation} \psi^{\mathrm{bound}}_{L,\tau}(s)= \begin{cases} s, & s \ge -\tau,\\ -\tau + \dfrac{(L-\tau)(s+\tau)}{(L-\tau)-(s+\tau)}, & s < -\tau. \end{cases} \end{equation}
This transformation is continuous, strictly increasing, and agrees with the identity at the threshold \(s=-\tau\). It also has slope \(1\) at that point, so the bend begins smoothly rather than abruptly. As \(s\to -\infty\), \(\psi^{\mathrm{bound}}_{L,\tau}(s)\to -L\), so the downside is bounded below by \(-L\). For example, with \(\tau=0.8\) and \(L=1\), all scores above \(-0.8\) remain unchanged, while more extreme negative values are compressed into the interval \([-1,-0.8)\).
A logistic-style alternative can be used when one wants a softer bend after the threshold while preserving the same threshold and lower bound. Let \(\sigma(x)=1/(1+e^{-x})\) denote the logistic function. Then one convenient choice is
\begin{equation} \psi^{\mathrm{logistic}}_{L,\tau}(s)= \begin{cases} s, & s \ge -\tau,\\ -L + 2(L-\tau)\,\sigma\!\left(\frac{2(s+\tau)}{L-\tau}\right), & s < -\tau. \end{cases} \end{equation}
This transformation is also continuous, strictly increasing, and equal to the identity at \(s=-\tau\). Its slope at the threshold is likewise \(1\), and it approaches \(-L\) as \(s\to -\infty\). Compared with the bounded monotone transform above, the logistic version bends the tail more gradually, while the bounded monotone form is simpler and more transparent.
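Both compression maps are straightforward to implement. A minimal Python sketch follows, with defaults \(\tau=0.8\) and \(L=1\) as in the example above (function names are illustrative; the logistic branch is written with \(e^{x}\) for \(x<0\) so it cannot overflow):

```python
import math

def psi_bound(s, L=1.0, tau=0.8):
    """Bounded monotone compression: identity above -tau, then a
    hyperbolic bend approaching the lower bound -L as s -> -inf."""
    if s >= -tau:
        return s
    u = s + tau                          # u < 0 below the threshold
    return -tau + (L - tau) * u / ((L - tau) - u)

def psi_logistic(s, L=1.0, tau=0.8):
    """Logistic-style compression with the same threshold -tau and
    lower bound -L, but a softer bend in the tail."""
    if s >= -tau:
        return s
    x = 2.0 * (s + tau) / (L - tau)      # x < 0 below the threshold
    ex = math.exp(x)
    return -L + 2.0 * (L - tau) * ex / (1.0 + ex)
```

Both maps agree with the identity at \(-\tau\), stay strictly increasing, and flatten toward \(-L\) for extreme negative scores.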
The figure shows two alternative transformations used to compress the negative tail of the aggregated Skill Score below the threshold \(0\) (vertical dotted line). Scores above the threshold remain unchanged, while more extreme negative values are smoothly mapped toward the lower bound \(-1\). The bounded monotone transform bends the tail more sharply, whereas the logistic transform produces a smoother transition.
7.6.1 Aggregating across competitions
We allow a competition to carry two distinct types of boost. Recall that \(\widehat{SS}_{u,c}^{(m)}\) denotes the transformed per-performance Skill Score for user \(u\), competition \(c\), and metric \(m\). We define the boosted contribution of that participation as
\begin{equation} \widetilde{SS}_{u,c}^{(m)} = \alpha_c^{(m)}\,\widehat{SS}_{u,c}^{(m)} + \beta_c^{(m)}, \end{equation}
where:
- \(\alpha_c^{(m)} \ge 1\) is a multiplicative boost,
- \(\beta_c^{(m)} \ge 0\) is an additive boost.
These two terms play different roles.
Additive boost: The additive term \(\beta_c^{(m)}\) is designed to make people show up. It rewards participation in a targeted competition regardless of whether the participant is only moderately above baseline or very strong. This is the right mechanism when the goal is to increase participation in an important project, to make users feel that participating in that project matters, or to direct attention toward competitions that the platform or community wants to highlight. In this sense, \(\beta\) acts like a participation bounty.
Multiplicative boost: The multiplicative term \(\alpha_c^{(m)}\) is designed to make strong forecasters show up. Since it scales the forecasting contribution itself, users with better forecasting performance benefit more from it. It also scales negative scores, thereby increasing the risk. This is the right mechanism when the goal is not merely to attract more participants, but to attract participants who expect to perform well. In this sense, \(\alpha\) acts like a talent multiplier.
For metric \(m\), the Average Score is defined as
\begin{equation} A_u^{(m)} = \frac{1}{|C_u^{(m)}|} \sum_{c \in C_u^{(m)}} \alpha_c^{(m)}\,\widehat{SS}_{u,c}^{(m)}. \end{equation}
This remains the cleaner summary of forecasting performance. It intentionally excludes the additive participation bounty.
The multiplicative boost is included in this score. This is because transformed Skill Scores may still be positive or negative. Therefore, multiplying by \(\alpha_c^{(m)} \ge 1\) does not simply inflate rewards: it also magnifies losses when performance is poor, even though the negative side is now bounded. A participant entering a multiplicatively boosted competition therefore faces greater upside and greater downside. In this sense, the multiplicative boost does not merely subsidize participation; it changes the effective stakes of forecasting performance itself. It is therefore more naturally interpreted as part of the skill-sensitive score than as a pure bounty.
For metric \(m\), the Total Score is
\begin{equation} T_u^{(m)} = \sum_{c \in C_u^{(m)}} \widetilde{SS}_{u,c}^{(m)}. \end{equation}
This reflects both forecasting performance and participation bounty. Because the total is a sum over participations rather than an average, a user who performs very well in only one competition may have a high Average Score but still a relatively modest Total Score. This is an intended feature. The average summarizes quality; the total summarizes cumulative contribution.
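For a single metric, the two aggregates can be sketched as follows (names are illustrative; each list entry corresponds to one competition the user participated in):

```python
def average_and_total(ss_hat, alpha, beta):
    """Per-metric Average Score (mean of alpha_c * SS_hat_c, excluding the
    additive bounty) and Total Score (sum of alpha_c * SS_hat_c + beta_c).

    ss_hat: transformed per-performance Skill Scores
    alpha:  multiplicative boosts (>= 1)
    beta:   additive boosts (>= 0)
    """
    boosted = [a * s for a, s in zip(alpha, ss_hat)]
    avg = sum(boosted) / len(boosted)
    total = sum(bo + be for bo, be in zip(boosted, beta))
    return avg, total
```

Note how the additive bounty enters only the total, while the multiplicative boost affects both, consistent with its role as part of the skill-sensitive score.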
7.6.2 Aggregating three metrics
Because we maintain three distinct score families, we can form a composite total by combining the three metric-specific totals linearly:
\begin{equation} \mathrm{Total}_u = w_{\mathrm{Brier}}\,T_u^{(\mathrm{Brier})} + w_{\mathrm{AE}}\,T_u^{(\mathrm{AE})} + w_{\mathrm{CRPS}}\,T_u^{(\mathrm{CRPS})}. \end{equation}
A natural default is to place greater weight on CRPS, because it evaluates the full predictive distribution rather than a single narrower aspect of predictive performance. One reasonable default is
\begin{equation} w_{\mathrm{Brier}} = 0.25, \qquad w_{\mathrm{AE}} = 0.25, \qquad w_{\mathrm{CRPS}} = 0.50. \end{equation}
The same linear structure can be used for the averaged headline score:
\begin{equation} \mathrm{Average}_u = w_{\mathrm{Brier}}\,A_u^{(\mathrm{Brier})} + w_{\mathrm{AE}}\,A_u^{(\mathrm{AE})} + w_{\mathrm{CRPS}}\,A_u^{(\mathrm{CRPS})}. \end{equation}
These weights are a design choice; the emphasis on CRPS may not be appropriate in every context, which is why the component metrics remain visible to participants.