
Predictions Ranking Methodology

How we evaluate predictive ability: Brier Score, operationalization rubric, and temporal structure of the Predictions Observatory.

By Philippe Prince Tritto · Published on February 28, 2026 · predictions · methodology

In December 2024, The National Law Review published 65 expert predictions on the future of AI and law for 2025. Our team took those predictions — which, when broken down, totaled 141 — and manually verified each one against reality. The results are in our U.S. Predictions Observatory.

The exercise revealed a fundamental problem: the predictions were collected without any scientific methodology. They included no assigned probabilities, no concrete deadlines, and no predefined verification criteria. Many were so vague that they were almost impossible to get wrong — what psychologists call the Barnum effect.

This document describes the methodology we are developing for the AI & Law Predictions Ranking · Mexico 2026. This is a work in progress and will be pre-registered shortly at the Center for Open Science (cos.io) to ensure transparency and methodological rigor.

The starting point

The following table summarizes the differences between NLR’s prediction-gathering approach and the LabDerIA instrument:

| Criterion | NLR (2024) | LabDerIA (2026) |
| --- | --- | --- |
| Assigned probability | ✗ No | ✓ 5%–95% |
| Resolution deadline | ✗ “In 2025” | ✓ Exact date |
| Verification criterion | ✗ No | ✓ Predefined by the predictor |
| Barnum effect control | ✗ No | ✓ Operationalization rubric (0–5) |
| Meta-cognition | ✗ No | ✓ 3 levels |
| Calculable ranking | ✗ No | ✓ Brier Score |
| Theoretical basis | ✗ Journalism | ✓ Peer-reviewed |

Scientific foundation

Our methodology is grounded in the theory of proper scoring rules developed in statistics and forecasting science. A proper scoring rule is an evaluation function that incentivizes the forecaster to report their true belief — neither overstating confidence nor hiding behind vagueness.

The core metric is the Brier Score, proposed by Glenn Brier in 1950 [1]. It is calculated as the mean squared error between the assigned probability and the observed outcome:

BS = (1/N) Σₜ (fₜ − oₜ)²

Where fₜ is the assigned probability and oₜ is the outcome (0 or 1). A score of 0 is perfect; 1 is the worst possible. If you always assign 50%, your Brier Score will be 0.25: the baseline for “knowing nothing.”
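
As a concrete illustration, here is a minimal sketch of the calculation in Python (the function and variable names are ours, not part of the Observatory's codebase):

```python
def brier_score(forecasts):
    """Mean squared error between assigned probabilities and observed outcomes.

    `forecasts` is a list of (probability, outcome) pairs, where probability is
    in [0, 1] and outcome is 0 or 1. Returns a value in [0, 1]; lower is better.
    """
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# Always answering 50% yields 0.25, the "knowing nothing" baseline.
print(brier_score([(0.5, 1), (0.5, 0), (0.5, 1), (0.5, 0)]))   # 0.25

# Confident and correct forecasts push the score toward 0.
print(brier_score([(0.9, 1), (0.1, 0), (0.8, 1), (0.05, 0)]))  # 0.015625
```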

Murphy (1973) [2] showed that the Brier Score can be decomposed into three components: calibration (when you say 70%, does it happen 70% of the time?), resolution (do you discriminate between events that do and don’t occur?), and uncertainty (the inherent difficulty of the phenomenon). This decomposition helps diagnose why a forecaster fails.
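
The decomposition can be computed directly from a forecaster's track record. The sketch below is our own helper, assuming forecasts are binned by the distinct probability values issued (e.g. the 5%–95% grid); the three components satisfy BS = reliability − resolution + uncertainty:

```python
from collections import defaultdict

def murphy_decomposition(forecasts):
    """Murphy (1973) partition of the Brier Score: BS = reliability - resolution + uncertainty.

    `forecasts` is a list of (probability, outcome) pairs, binned here by the
    distinct probability values the forecaster issued.
    """
    n = len(forecasts)
    base_rate = sum(o for _, o in forecasts) / n        # overall frequency of "yes"
    bins = defaultdict(list)
    for p, o in forecasts:
        bins[p].append(o)

    # Reliability (calibration): gap between stated probability and observed frequency per bin.
    reliability = sum(len(v) * (p - sum(v) / len(v)) ** 2 for p, v in bins.items()) / n
    # Resolution: how far the per-bin frequencies move away from the base rate.
    resolution = sum(len(v) * (sum(v) / len(v) - base_rate) ** 2 for v in bins.values()) / n
    # Uncertainty: inherent difficulty of the events being forecast.
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty
```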

Gneiting and Raftery (2007) [3] formalized the general theory of strictly proper scoring rules in the Journal of the American Statistical Association, proving that both the Brier Score and the logarithmic score are strictly proper: a forecaster's expected score is optimal only when they report their true belief distribution (for the Brier Score as defined above, that means the expected score is minimized).
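
A quick way to see strict propriety in action (a toy calculation, not part of the ranking pipeline): for a binary event with true probability q, the expected Brier Score of reporting p is q(p − 1)² + (1 − q)p², which is smallest exactly at p = q.

```python
# Expected Brier Score of reporting p when the event truly occurs with probability q.
def expected_brier(reported_p, true_q):
    return true_q * (reported_p - 1) ** 2 + (1 - true_q) * reported_p ** 2

true_q = 0.7
grid = [k / 20 for k in range(1, 20)]                      # the 5%-95% reporting grid
best = min(grid, key=lambda p: expected_brier(p, true_q))
print(best)  # 0.7 -- honestly reporting the true belief gives the lowest expected score
```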

Empirical validation: the Good Judgment Project

The most rigorous framework for evaluating predictions about social phenomena comes from the Good Judgment Project led by Philip Tetlock and Barbara Mellers, funded by IARPA (the research agency of the U.S. intelligence community). Between 2011 and 2015, over 5,000 forecasters made more than one million predictions on approximately 500 geopolitical questions.

The results — documented in over 25 peer-reviewed articles — showed that a small group of “superforecasters” (the top 2%) consistently outperformed professional intelligence analysts with access to classified information by 30% [4]. Mellers et al. (2014) [4] identified three key factors: training in probabilistic reasoning, teamwork, and tracking (accountability). Friedman et al. (2018) [5] demonstrated specifically that precision in probability estimates carries informational value, confirming that granularity is signal, not noise.

Our instrument

Each prediction registered in our system includes seven fields:

  1. Prediction in free text
  2. Explicit probability (5%–95% in 5% increments)
  3. Resolution deadline
  4. Objective verification criterion defined by the predictor
  5. Subject category
  6. Geographic zone (Mexico federal/state, U.S., EU, Latin America, Africa, Asia/Pacific, Global, Other)
  7. Meta-cognitive confidence self-assessment at three levels

The explicit probability is what enables calculating real Brier Scores. Without it, there are only opinions.
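
For illustration only, a registered prediction could be represented as a small record like the following; the field names are our own shorthand, not a published schema:

```python
from dataclasses import dataclass
from datetime import date
from typing import Literal

@dataclass
class Prediction:
    """One registered prediction with the seven fields described above (illustrative names)."""
    text: str                           # 1. prediction in free text
    probability: float                  # 2. explicit probability, 0.05-0.95 in 0.05 increments
    deadline: date                      # 3. resolution deadline
    verification_criterion: str         # 4. objective criterion defined by the predictor
    category: str                       # 5. subject category
    zone: str                           # 6. geographic zone, e.g. "Mexico federal", "Global"
    confidence_level: Literal[1, 2, 3]  # 7. meta-cognitive self-assessment, three levels

    def __post_init__(self):
        # Enforce the 5%-95% range in 5% steps.
        assert 0.05 <= self.probability <= 0.95
        assert abs(self.probability * 20 - round(self.probability * 20)) < 1e-9
```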

Barnum effect control: operationalization rubric

The Barnum effect is the tendency to formulate predictions so vague they are nearly impossible to falsify (“AI will transform the law”). To control this bias, the LabDerIA team evaluates each prediction using an operationalization rubric inspired by question quality criteria from platforms like Metaculus and Tetlock’s Clairvoyance Test: if you handed your prediction to a genuine clairvoyant, could they look into the future and tell you with certainty whether it came true?

The rubric has five dimensions, each binary (0 = does not meet, 1 = meets). Two independent team evaluators assess each prediction; in case of disagreement, a third evaluator breaks the tie.

| # | Dimension | ✗ Does not meet | ✓ Meets |
| --- | --- | --- | --- |
| D1 | Binary verifiability. Can it be resolved as YES or NO? | “AI will transform the law” | “The Senate will approve a federal AI law bill” |
| D2 | Actor specification. Is the relevant actor or institution identified? | “There will be more regulation” | “The Supreme Court will issue a protocol” |
| D3 | Temporal specification. Does it include a concrete time horizon? | “AI will be adopted in courts” | “By December 2026, at least 5 federal courts will use generative AI” |
| D4 | Observable verification criterion. Can it be verified with an identifiable public source? | “Judicial efficiency will improve” | “It will be published in the Official Gazette” |
| D5 | Non-trivial condition. Would an informed observer say there is genuine uncertainty? | “Firms will continue investing in technology” | “At least one top-10 firm will close its AI department due to lack of ROI” |

Score classification

The resulting score (0–5) is used as a quality filter at entry, not as a weight on the Brier Score:

| Score | Classification | Action |
| --- | --- | --- |
| 0–1 | Rejected | The predictor is asked to reformulate with greater precision |
| 2 | Conditional | Provisionally included; the team may request clarification |
| 3–4 | Accepted | Meets sufficient operationalization standards |
| 5 | Exemplary | Passes the Clairvoyance Test without difficulty |
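
To make the entry filter concrete, here is a minimal sketch of how the two evaluators' binary D1–D5 judgments could be combined, with the third evaluator breaking disagreements, and the total mapped to the classification above (the helper names are ours; the actual review is done manually by the team):

```python
def rubric_score(eval_a, eval_b, eval_c):
    """Combine two evaluators' binary D1-D5 judgments; the third breaks any disagreement.

    Each argument is a list of five 0/1 values. Returns the total score (0-5).
    """
    total = 0
    for a, b, c in zip(eval_a, eval_b, eval_c):
        total += a if a == b else c      # tie-breaker only counts where A and B disagree
    return total

def classify(score):
    """Map a 0-5 rubric score to the entry-filter classification."""
    if score <= 1:
        return "Rejected"
    if score == 2:
        return "Conditional"
    if score <= 4:
        return "Accepted"
    return "Exemplary"

print(classify(rubric_score([1, 1, 0, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1])))  # Exemplary
```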

This design choice is deliberate. The Brier Score is a strictly proper scoring rule [3]: it already incentivizes the forecaster to report their true belief and mathematically penalizes vagueness. Weighting the Brier Score by an additional specificity index would break this property. Instead, controlling quality at entry — as IARPA tournaments and platforms like Metaculus do — preserves the purity of the scoring rule and incentivizes participants to formulate well-operationalized predictions from the outset.

In the published report, we complement the Brier Score with the Murphy decomposition [2], which separates calibration, resolution, and uncertainty. Resolution — the predictor’s ability to discriminate between events that do and don’t occur — is a natural indicator of predictive quality already contained in the Brier Score, requiring no external adjustments.

Temporal structure

The ranking operates on two tracks:

  • Annual Track. Includes predictions with deadlines within the current year and feeds the publishable ranking every January.
  • Long-Term Track. Accumulates predictions with horizons beyond one year, incorporated into the predictor’s cumulative score as they expire.

Following Tetlock’s evidence, a minimum of 3 predictions per participant enables aggregate analysis, though a minimum of 10 is recommended for more robust individual scores.
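
As a minimal routing sketch (the cut-off logic below is our own reading of the two-track description, not an official rule), a prediction's resolution deadline determines its track:

```python
from datetime import date

def assign_track(deadline, ranking_year=2026):
    """Route a prediction to the Annual Track (resolves within the ranking year) or the Long-Term Track."""
    return "Annual" if deadline.year == ranking_year else "Long-Term"

print(assign_track(date(2026, 11, 30)))  # Annual
print(assign_track(date(2028, 6, 1)))    # Long-Term
```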


Want to test your predictive ability? Join the AI & Law Predictions Ranking · Mexico 2026 and show how well you can anticipate the future of law and artificial intelligence.


References

  1. Brier, G.W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2
  2. Murphy, A.H. (1973). A new vector partition of the probability score. Journal of Applied Meteorology, 12, 595–600. doi:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2
  3. Gneiting, T. & Raftery, A.E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378. doi:10.1198/016214506000001437
  4. Mellers, B. et al. (2014). Psychological strategies for winning a geopolitical forecasting tournament. Psychological Science, 25(5), 1106–1115. doi:10.1177/0956797614524255
  5. Friedman, J.A., Baker, J.D., Mellers, B.A., Tetlock, P.E. & Zeckhauser, R. (2018). The value of precision in probability assessment: Evidence from a large-scale geopolitical forecasting tournament. International Studies Quarterly, 62, 410–422. doi:10.1093/isq/sqx078
  6. Mellers, B. et al. (2015). Identifying and cultivating superforecasters as a method of improving probabilistic predictions. Perspectives on Psychological Science, 10(3), 267–281. doi:10.1177/1745691615577794