This living document describes goals and implementation progress for V6 of the OGS ratings system.
- Enable rating system to be used with non-standard komi and board sizes
- Extend metrics analyzed when evaluating rating system to enable more experiments
- Improve predictive performance of specific rating categories, enabling their use for automatch and handicap (instead of the general "overall" category)
- Make ratings less volatile for players that play lots of games (and play at a consistent strength)
- Re-evaluate conditions for annulment of games after timeouts (bot games and mass timeouts)
- Summary of proposed changes
- Background: Glicko-2
- Background: OGS Ratings v5
- Details of each proposed change
- 👍: Approved (maybe not implemented).
- ✅: Implemented and landed (maybe not approved / behind flag).
- 🔃: Implemented in a PR, but not immediately landed.
- 👎: Considered and rejected.
- Compute fine-grained effective rating of opponent based on handicap, komi, and ruleset 👍✅
- Increase effective rating deviation of opponent for larger handicaps and strange komi values 🔃
- Increase rating deviation as rating ages ✅
- Extend metrics analyzed when tallying
- Add cohesive-ratings-grid script, with the goal of eventually using specific ratings categories for matching and handicaps
- Improve predictive performance and volatility of cohesive-ratings-grid
- Evaluate changes to how timeouts are handled, for correspondence games and for bot games
- Evaluate and/or increase correlation of ratings categories
Here's a quick editorialized summary of the Glicko-2 rating system, based on the Glicko-2 paper dated March 22, 2022.
- $\mu$: the rating, an estimate of the mean of the playing strength based on past performance
  - $r = 173.7178\mu + 1500$: a user-friendly view of the rating
- $\phi$: the rating deviation, a standard error for $\mu$
  - $\textrm{RD} = 173.7178\phi$: a user-friendly view of the rating deviation
  - $\phi^{2}$: the rating variance
- $\sigma$: the rating volatility, how much the rating changes over time
- $\tau$: system constant that constrains volatility over time; higher values allow higher volatility
Typically, reasonable values of $\tau$ are between 0.3 and 1.2.
The default is to initialize new players to a rating of 1500 ($\mu = 0$), with a rating deviation of 350 ($\phi \approx 2.015$) and a volatility of $\sigma = 0.06$.
Glicko-2 is designed to run periodically (e.g., once per day, week, or month).
The ideal period length would have 10-15 games per active player.
Ratings are updated at the end of each period, computing new values $\mu'$, $\phi'$, and $\sigma'$.
A few definitions for the update:
- $m$: the number of games played
- $\mu_{1}, \ldots, \mu_{m}$: ratings of opponents in each game, taken from the beginning of the rating period
- $\phi_{1}, \ldots, \phi_{m}$: rating deviations of opponents in each game, taken from the beginning of the rating period
- $s_{1}, \ldots, s_{m}$: scores in each game ($0$ for a loss, $0.5$ for a tie, and $1$ for a win)
The basic update procedure (editorialized) is:
- Step 3: accumulate $v^{-1}$, which describes opponents' relative ratings and their deviations;
- Step 4a: accumulate $\Gamma$ (not in paper), which compares actual scores to expected scores;
- Step 4b: compute $\Delta = v\Gamma$, an estimate of the ratings change;
- Step 5: compute $\sigma'$, the new volatility, using $v$ and $\Delta$;
- Steps 6-7a: compute $\phi'$, the new deviation;
- Step 7b: compute $\mu' = \mu + \phi'^{2}\Gamma$, the new rating.
If the player hasn't played any games in the period, the deviation should increase by the volatility. All the steps reduce to the following:

$$\mu' = \mu, \qquad \phi' = \sqrt{\phi^{2} + \sigma^{2}}, \qquad \sigma' = \sigma$$
Games are accumulated into $v^{-1}$ and $\Gamma$:

$$v^{-1} = \sum_{j=1}^{m} g(\phi_{j})^{2}\,E_{j}\,(1 - E_{j}), \qquad \Gamma = \sum_{j=1}^{m} g(\phi_{j})\,(s_{j} - E_{j})$$

where $g(\phi_{j}) = 1 / \sqrt{1 + 3\phi_{j}^{2}/\pi^{2}}$ and $E_{j} = 1 / (1 + e^{-g(\phi_{j})(\mu - \mu_{j})})$ is the expected score of game $j$.
The estimated rating change is $\Delta = v\Gamma$, where $v = 1/v^{-1}$.
Define $a = \ln(\sigma^{2})$ and

$$f(x) = \frac{e^{x}\,(\Delta^{2} - \phi^{2} - v - e^{x})}{2\,(\phi^{2} + v + e^{x})^{2}} - \frac{x - a}{\tau^{2}}$$

Then set constants $A$ and $B$ to bracket a root of $f$ and iterate until they converge; the new volatility is $\sigma' = e^{A/2}$. See Step 5 of the Glicko-2 paper for the details omitted here.
The deviation is expanded by the new volatility, then shrunk by the information from the period's games:

$$\phi' = \left( \frac{1}{\phi^{2} + \sigma'^{2}} + v^{-1} \right)^{-1/2}$$
Finally, the rating is computed using the new deviation:

$$\mu' = \mu + \phi'^{2}\,\Gamma$$
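For concreteness, here is a minimal Python sketch of the whole period update, including the Step 5 Illinois iteration from the paper; the function and variable names are illustrative, not tied to any existing OGS code:

```python
import math

def g(phi):
    """Glicko-2 g() weighting: discounts games against uncertain opponents."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi**2 / math.pi**2)

def E(mu, mu_j, phi_j):
    """Expected score against an opponent with rating mu_j, deviation phi_j."""
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))

def solve_volatility(phi, v, delta, sigma, tau, eps=1e-6):
    """Step 5 of the paper: find the new volatility with the Illinois method."""
    a = math.log(sigma**2)

    def f(x):
        ex = math.exp(x)
        return (ex * (delta**2 - phi**2 - v - ex)
                / (2.0 * (phi**2 + v + ex) ** 2)) - (x - a) / tau**2

    A = a
    if delta**2 > phi**2 + v:
        B = math.log(delta**2 - phi**2 - v)
    else:
        k = 1
        while f(a - k * tau) < 0:
            k += 1
        B = a - k * tau
    fA, fB = f(A), f(B)
    while abs(B - A) > eps:
        C = A + (A - B) * fA / (fB - fA)
        fC = f(C)
        if fC * fB < 0:
            A, fA = B, fB
        else:
            fA /= 2.0
        B, fB = C, fC
    return math.exp(A / 2.0)

def glicko2_update(mu, phi, sigma, games, tau=0.5):
    """One rating period; games is a list of (mu_j, phi_j, s_j) tuples."""
    if not games:
        # No games: only the deviation grows, by the volatility.
        return mu, math.sqrt(phi**2 + sigma**2), sigma
    v_inv = sum(g(pj)**2 * E(mu, mj, pj) * (1.0 - E(mu, mj, pj))
                for mj, pj, _ in games)
    gamma = sum(g(pj) * (sj - E(mu, mj, pj)) for mj, pj, sj in games)
    v = 1.0 / v_inv                                            # Step 3
    delta = v * gamma                                          # Step 4
    sigma_new = solve_volatility(phi, v, delta, sigma, tau)    # Step 5
    phi_star = math.sqrt(phi**2 + sigma_new**2)                # Step 6
    phi_new = 1.0 / math.sqrt(1.0 / phi_star**2 + v_inv)       # Step 7a
    mu_new = mu + phi_new**2 * gamma                           # Step 7b
    return mu_new, phi_new, sigma_new
```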
OGS ratings (v5) uses a modified version of Glicko-2 with variable-length rating periods: a player's rating is updated after each game, each period contains exactly one game, and the period's length is the time since the player's last game.
Note that Glicko-2 documentation suggests it works best when active players have 10-15 games on average in each period, but on OGS every period has exactly one game.
A few notes about this system:
- Easy to compute.
- Easy to understand/predict the effect of each individual game on ratings.
- Every period looks to Glicko-2 like an "outlier", where the player has
"surprisingly" either won or lost all of their games.
- Ideally, a rating should be stable if there's lots of data and the player's strength is consistent.
- On OGS, the deviation stays relatively high and ratings are volatile (move around a lot) as a result.
- Deviation ought to increase after a long time with no games, but doesn't.
Most players have different strengths at different board sizes and game speeds. OGS ratings (v5) has a grid of ratings for different rating categories.
The primary rating is the general rating category:
- overall
"Overall" is used for match-making (automatch) and for setting handicaps.
There are also nine "specific" rating categories, representing board size and game speed:
- blitz-9x9
- blitz-13x13
- blitz-19x19
- live-9x9
- live-13x13
- live-19x19
- correspondence-9x9
- correspondence-13x13
- correspondence-19x19
... and six general categories in the middle:
- blitz
- live
- correspondence
- 9x9
- 13x13
- 19x19
... for 16 ratings total.
These are each separately maintained ratings:
- After a player finishes a game, their rating is updated in the three relevant categories.
- For "overall", the ratings are updated as specified above.
- The other two ratings take the opponent's rating and deviation ($\mu_{j}$ and $\phi_{j}$) from the "overall" category.
A few notes about the rating categories:
- Players get some visibility into their playing strength in different categories.
- Overall has a recency bias, dominated by the player's most recent games.
Thus, automatch and handicap settings are based on the recent games as well.
- Great if the player has had a dramatic change in strength while only playing in one specific category.
- Not great if the player consistently has a different strength in different categories.
- OGS forum regulars recommend maintaining multiple accounts, one for each specific rating category, as standard practice to work around this.
- Increases accuracy of ratings updates.
- Handles non-standard komi (including handicap via reverse komi) cleanly, enabling a policy decision to allow rated games with non-standard komi.
- Can easily be extended to handle non-standard board sizes (such as 7x7 or 25x25) by specifying a multiplier.
- Computes a fractional effective handicap for each game (see the sketch after this list)
- Assumes a territorial value of 12 for each stone at start of game
- Komi of 6 considered perfect for territory
- Komi of 7 considered perfect for area, since black usually gets extra play
- Combines value of handicap stones and effective komi (including handicap bonus), and compares against perfect komi to determine black's advantage
- Divides by the territorial value of stones to get the effective rank difference
- Multiplier of 3 for 13x13
- Multiplier of 6 for 9x9
- Extensible to other board sizes. For example, we could try:
- Multiplier of 12 for 7x7
- Multiplier of 0.5 for 25x25
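As a rough illustration, here is a hedged Python sketch of the computation above; the handicap-bonus handling and all names are assumptions rather than the exact implementation:

```python
STONE_VALUE = 12.0                            # territorial value of a handicap stone
RANK_MULTIPLIER = {19: 1.0, 13: 3.0, 9: 6.0}  # extensible, e.g. 7: 12.0, 25: 0.5

def effective_rank_difference(handicap, komi, ruleset, size):
    """Black's advantage, in ranks (positive favors black)."""
    perfect_komi = 7.0 if ruleset == "area" else 6.0  # area: black usually gets the extra play
    # Assumption: under area scoring, white's komi includes a one-point-per-stone
    # handicap bonus, which we fold into the effective komi.
    effective_komi = komi + (handicap if ruleset == "area" else 0)
    black_points = handicap * STONE_VALUE + perfect_komi - effective_komi
    return (black_points / STONE_VALUE) * RANK_MULTIPLIER[size]
```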
- Reflects higher uncertainty that the opponent's rating is correct, the more the game diverges from "normal".
- Reduces impact of high handicap games on ratings, enabling a policy decision to allow rated 19x19 games with 10+ stones.
- Reduces impact of reverse-komi games on ratings, enabling a policy decision to allow handicap via reverse komi.
- #70
- An error value is computed based on how "strange" the handicap is.
- For each rating update, the opponent's rating deviation is expanded by the error using root-of-sum-of-squares: $\phi_{j} \leftarrow \sqrt{\phi_{j}^{2} + \textrm{err}^{2}}$
- More closely matches the design of Glicko-2, where deviation increases when no games are played.
- Reflects that strength may have atrophied through neglect or improved through study elsewhere.
- Command-line option: `--aging-period`
- Before using a rating, checks if its timestamp is older than the length of the aging period (sketched below):
  - If so, increases the deviation using the no-games formula, $\phi' = \sqrt{\phi^{2} + \sigma^{2}}$, to "age" it to 1 period old.
  - Else, uses it as-is.
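A minimal sketch of that check, assuming the rating's age and the `--aging-period` value are in the same time units:

```python
import math

def aged_rating(mu, phi, sigma, rating_age, aging_period):
    """Hypothetical helper: age a stale rating by one period before use."""
    if rating_age > aging_period:
        phi = math.sqrt(phi**2 + sigma**2)  # the no-games Glicko-2 expansion
    return mu, phi, sigma
```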
Goals:
- Show whether and how much predictions improve when we expect them to (higher rank difference, lower deviation, etc.)
- Show rating volatility
- Clean up noise in the analysis output for metrics we don't know how to use
- Tally all rated games (stop ignoring rank diff greater than ±1)
- Show black win rate cross-sectioned by rank (as now!), by deviation, and by rating category; still limited to rank diff ±1
- Show expected-winner-wins rates cross-sectioned by effective rank diff, by rank, by deviation, and by rating category
- Show daily/weekly/monthly historical rating volatility cross-sectioned by rank, by deviation, by game frequency, and by rating category
- Remove old metrics we don't look at any more
Goals:
- Enable analysis of using specific rating categories (rather than "overall") as the primary rating(s)
- Add foundation for improving predictive performance of a ratings system centered on specific rating categories
- Land this script whether or not it's better than using "overall" as the primary rating. (It probably won't be, initially!)
- For bring-up, each rating category is updated in isolation
- Note: Initially, like ratings v5, except opponent rating/deviation come from the same category as the player.
- Single TallyGameAnalytics instance, which tallies each game exactly once, in
its most specific rating category
- Note: Existing ratings-grid scripts have a separate TallyGameAnalytics instance for each ratings category.
- Measures predictive performance of specific rating categories
- Ignores predictive performance of general rating categories
- Historical rating volatility metrics should look at all 16 rating graphs.
- Measure volatility of both specific and general rating categories.
Goals:
- Improve the predictive performance of specific rating categories, beating "overall"
- Reduce the volatility of ratings without compromising predictive performance
- If a player's strength changes, we still want the system to respond quickly!
- Change general rating categories to be weighted averages of specific categories, instead of independently computed Glicko-2 ratings
- For stale rating categories, blend in general rating categories
- Add per-user, fixed-length time periods for ratings updates, using incremental computation to mitigate computation costs
- Fine-tune mid-period "observed" rating/deviation
- Add metrics for games-per-period cross-sectioned by rank, by deviation, and by rating category, and tune period length of each rating category
- Tune the $\tau$ system constant for each rating category; consider lower values for correspondence to counteract naturally high volatility
This proposal differentiates between:
- "last" rating, which is the most recent rating after rating a game; and
- "effective" rating, which is the rating used as input for the next ratings update, and should be used for matching, handicap, and the UI.
This distinction clarifies the relationship between general and specific rating categories.
- For the "last" rating, the flow is upward: specific ratings are averaged to compute a new general rating.
- For "effective" ratings, the flow is downward: general ratings can be blended into the specific rating.
Also, either type of rating can be aged to a newer timestamp by expanding its deviation. Unless otherwise stated, it's assumed that a "last" rating has not been aged, but an "effective" rating has been.
After rating a game in a specific rating category, compute all relevant general rating categories by combining the new specific rating with the last ratings in other specific categories from previous games, using a weighted average. If there is no "last" rating, since no games have yet been rated, that category should be excluded.
The new general rating will have the timestamp of the just-rated game. Other ratings should be aged to the same timestamp (expanding their deviation) before combining them.
This proposal bases the weights on the deviation; specifically, the inverse of the variance: $w_{i} = 1 / \phi_{i}^{2}$.

For example, to compute the general blitz rating, combine the last ratings of blitz-9x9, blitz-13x13, and blitz-19x19.

Note: if this player has no rated blitz 9x9 games, then blitz-9x9 has no "last" rating and is excluded from the average.
Then compute the general rating, variance, and volatility as weighted averages:

$$\mu = \frac{\sum_{i} w_{i}\,\mu_{i}}{\sum_{i} w_{i}}, \qquad \phi^{2} = \frac{\sum_{i} w_{i}\,\phi_{i}^{2}}{\sum_{i} w_{i}}, \qquad \sigma^{2} = \frac{\sum_{i} w_{i}\,\sigma_{i}^{2}}{\sum_{i} w_{i}}$$

Note: the deviation, $\phi$, is recovered as the square root of the weighted-average variance.
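A sketch of this combination (inputs are assumed to already be aged to the just-rated game's timestamp; names are illustrative):

```python
import math

def weighted_general_rating(components):
    """components: (mu, phi, sigma) for each specific category that has a
    "last" rating; categories without one are simply excluded."""
    w = [1.0 / phi**2 for _, phi, _ in components]  # inverse-variance weights
    total = sum(w)
    mu = sum(wi * m for wi, (m, _, _) in zip(w, components)) / total
    var = sum(wi * p**2 for wi, (_, p, _) in zip(w, components)) / total
    sig = math.sqrt(sum(wi * s**2 for wi, (_, _, s) in zip(w, components)) / total)
    return mu, math.sqrt(var), sig  # deviation is the root of the variance
```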
At least initially, the overall rating should be computed in the same way, directly from the specific rating categories. Perhaps in the future we'll want to experiment with another option:
- cascade up, combining 9x9, 13x13, and 19x19;
- cascade up, combining blitz, live, and correspondence; or
- average of those two possibilities.
When a rating category is stale -- it has a high deviation and/or hasn't been used recently -- other rating categories with more recent games may be better predictors of performance. This proposal dynamically blends the effective overall rating into the specific rating when it seems stale enough.
The blend is weighted by age and deviation, it's gradual (to prevent cliffs), and it cascades from the overall rating through the midpoints to the specific rating.
Blending has two scaling factors:

- $w_{t}$ (time) starts when the last specific rating is at least 30 days older than the last general rating; and
- $w_{\phi}$ (deviation) starts when, after aging, the last specific rating has a $\phi$ at least 0.3 (~50 RD) higher than the last general rating.

These scaling factors are combined, and both must be active for any blending to occur.
The effective specific rating is the weighted average of the specific and general ratings. The deviation and volatility are increased by the weight of the general rating.
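One possible reading of this blend, as a hedged sketch; the multiplicative combination of $w_{t}$ and $w_{\phi}$ and the exact inflation of deviation and volatility are assumptions, not settled design:

```python
def blended_effective_rating(spec, gen, w_t, w_phi):
    """spec/gen: (mu, phi, sigma) tuples for the specific and general
    ratings, aged to a common timestamp. w_t and w_phi are in [0, 1]."""
    w = w_t * w_phi  # both factors must be nonzero for any blending to occur
    mu = (1.0 - w) * spec[0] + w * gen[0]  # weighted average of the ratings
    phi = (1.0 + w) * spec[1]              # deviation increased by the weight
    sigma = (1.0 + w) * spec[2]            # volatility increased likewise
    return mu, phi, sigma
```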
Set a fixed time period, $P$, for ratings updates, computing updates incrementally as each game finishes to mitigate computation costs.
Thus we have system constants:

- $P$: system constant that sets the length of a time period
- $\tau$: system constant that constrains volatility over time
Previously, we stored for each rating:

- $\mu$: rating (or $r = 173.7178\mu + 1500$)
- $\phi$: rating deviation (or $\textrm{RD} = 173.7178\phi$)
- $\sigma$: rating volatility

Instead, store (sketched below):

- Initial rating for the period:
  - $\mu$: rating
  - $\phi$: rating deviation
  - $\sigma$: rating volatility
- Estimated rating at end of period:
  - $\mu'$: rating
  - $\phi'$: rating deviation
  - $\sigma'$: rating volatility
- Period details:
  - $t_{e}$: the timestamp for the end of the rating period
  - $v^{-1}$: accumulation of opponent ratings and deviations for the period
  - $\Gamma$: accumulation of the user's performance in the period
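For illustration, that state could be carried in a small record type; the field names here are illustrative, not the codebase's:

```python
from dataclasses import dataclass

@dataclass
class PeriodRating:
    # Initial rating for the period
    mu: float
    phi: float
    sigma: float
    # Estimated rating at end of period
    mu_prime: float
    phi_prime: float
    sigma_prime: float
    # Period details
    t_e: float    # timestamp for the end of the rating period
    v_inv: float  # accumulated v^{-1} for the period
    gamma: float  # accumulated Gamma for the period
```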
Perform a ratings update for each game.
First, define:
- the game's timestamp as $t_{g}$,
- this player's last rating subscripted with $p$ (as in $\mu_{p}$, $\mu_{p}'$, $t_{ep}$, etc.), and
- the opponent's last rating subscripted with $o$ (as in $\mu_{o}$, $\mu_{o}'$, $t_{eo}$, etc.).
Second, determine the opponent's observed rating and deviation, $\mu_{j}$ and $\phi_{j}$.
If the opponent has no previous games, choose appropriate starting values.
Else, take the rating from the opponent's last period, aging the deviation if the rating is old: if $t_{g} > t_{eo}$, use $\mu_{j} = \mu_{o}'$ and $\phi_{j} = \sqrt{(\phi_{o}')^{2} + k\,(\sigma_{o}')^{2}}$, where $k = (t_{g} - t_{eo}) / P$; otherwise use the current period's initial $\mu_{o}$ and $\phi_{o}$.
Third, determine whether this game starts a new period for this player or continues an old one, and accumulate the game result in $v^{-1}$ and $\Gamma$.

If there is no previous game, set the period's initial rating, $\mu$, $\phi$, and $\sigma$, to appropriate starting values.

Else if $t_{g} > t_{ep}$, the last period has ended; set the period's initial rating from the last period's estimate ($\mu_{p}'$, $\phi_{p}'$, $\sigma_{p}'$), aging the deviation if the rating is old.

In both cases so far, this game starts a new period. Initialize the period details: set $t_{e}$ to the end of the new period, and reset the accumulators, $v^{-1} = 0$ and $\Gamma = 0$.

Else, this game continues the current period; keep the stored $t_{e}$, $v^{-1}$, and $\Gamma$.

In every case, accumulate this game into $v^{-1}$ and $\Gamma$ as in the Glicko-2 steps above.

Fourth, compute the estimated end-of-period rating, $\mu'$, $\phi'$, and $\sigma'$, from the period's initial rating and the accumulated $v^{-1}$ and $\Gamma$.
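Putting the four steps together, a hedged sketch that reuses `g`, `E`, and `solve_volatility` from the Glicko-2 sketch above and the `PeriodRating` record from the previous sketch; the period-boundary choices (fractional aging, `t_e = t_g + P`) are assumptions:

```python
import math

def rate_game(p, o, s, t_g, P, tau, new_rating):
    """One per-game update. p, o: PeriodRating for the player and opponent
    (None if no games yet); s: the player's score; t_g: the game's timestamp;
    new_rating() supplies starting values. Helper names are illustrative."""
    # Second: the opponent's observed rating and deviation.
    if o is None:
        nr = new_rating()
        mu_j, phi_j = nr.mu, nr.phi
    elif t_g > o.t_e:  # opponent's stored period has ended: age the deviation
        k = (t_g - o.t_e) / P
        mu_j = o.mu_prime
        phi_j = math.sqrt(o.phi_prime**2 + k * o.sigma_prime**2)
    else:              # opponent is mid-period: use that period's initial rating
        mu_j, phi_j = o.mu, o.phi

    # Third: start a new period, or continue the current one.
    if p is None:
        p = new_rating()
        p.t_e, p.v_inv, p.gamma = t_g + P, 0.0, 0.0
    elif t_g > p.t_e:
        k = (t_g - p.t_e) / P  # periods elapsed with no games
        p.mu = p.mu_prime
        p.phi = math.sqrt(p.phi_prime**2 + k * p.sigma_prime**2)
        p.sigma = p.sigma_prime
        p.t_e, p.v_inv, p.gamma = t_g + P, 0.0, 0.0
    # else: this game continues the current period; keep t_e, v_inv, gamma.

    # Accumulate this game into the period.
    e = E(p.mu, mu_j, phi_j)
    p.v_inv += g(phi_j)**2 * e * (1.0 - e)
    p.gamma += g(phi_j) * (s - e)

    # Fourth: recompute the estimated end-of-period rating.
    v = 1.0 / p.v_inv
    p.sigma_prime = solve_volatility(p.phi, v, v * p.gamma, p.sigma, tau)
    phi_star = math.sqrt(p.phi**2 + p.sigma_prime**2)
    p.phi_prime = 1.0 / math.sqrt(1.0 / phi_star**2 + p.v_inv)
    p.mu_prime = p.mu + p.phi_prime**2 * p.gamma
    return p
```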
Reduce the volatility in the observed rating graph within time periods by scaling the rating and deviation change according to the proportion of the period that the game accounts for.

- Define $\mu_{N}$ and $\phi_{N}$, the observed rating and deviation at timestamp $N$ (now), using one of the alternatives described below.
- Use $\mu_{N}$ and $\phi_{N}$ for:
  - measuring predictive performance and historical volatility
  - eventually, in OGS, for automatch and setting handicap
- Consider using $\mu_{N}$ and $\phi_{N}$ to set $\mu_{j}$ and $\phi_{j}$ when observing opponent ratings during a ratings update.
Here are a few alternatives to implement and evaluate:
As defined above, use the rating from the last completed time period, and age the deviation to the current timestamp. This is the formula used above for $\mu_{j}$ and $\phi_{j}$.
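A sketch of this alternative, using the `PeriodRating` record from above:

```python
import math

def observed_rating(p, t_N, P):
    """Alternative #1: rating from the last completed period, with the
    deviation aged to the observation timestamp t_N."""
    if t_N > p.t_e:  # the stored period has completed: age its estimate
        k = (t_N - p.t_e) / P
        return p.mu_prime, math.sqrt(p.phi_prime**2 + k * p.sigma_prime**2)
    return p.mu, p.phi  # mid-period: the period's initial rating, unaged
```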
Or, we could use the estimated new rating, and just age the deviation.
Or, we could scale the estimated rating according to how much of the period has passed by the time of the observation, $t_{N}$.

Given $\alpha = (t_{N} - (t_{e} - P)) / P$, the proportion of the period that has passed, interpolate between the period's initial and estimated ratings: $\mu_{N} = \mu + \alpha\,(\mu' - \mu)$.

Or, we could compute $\phi_{N}$ by expanding the initial deviation with only part of the new volatility. Assuming the volatility accrues uniformly across the period, $\phi_{N} = \sqrt{\phi^{2} + \alpha\,\sigma'^{2}}$.

Or we could compute a partial accumulation, scaling $v^{-1}$ and $\Gamma$ by $\alpha$, and use that to compute $\mu_{N}$ and $\phi_{N}$ via the usual update steps.
Add games-per-period metric, which shows the average number of games per period for all players in each specific rating category.
For each rating category:
- For each player, compute the average games per period by dividing the player's total number of rated games by the number of new periods that player had (see the sketch after this list).
- Ignores time between periods, when the player was inactive.
- Compute both the mean and median of those averages, to understand how many games per period an average player experiences.
- Show cross-sections for bots-only vs humans-only vs humans+bots.
- Show cross-sections by rank ranges and by deviation ranges.
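A sketch of the per-category tally (names are illustrative):

```python
from statistics import mean, median

def games_per_period_summary(players):
    """players: (total_rated_games, new_periods) per player in one category."""
    averages = [games / periods for games, periods in players if periods > 0]
    return mean(averages), median(averages)
```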
Using these data, tune the period length separately for each rating category, in each case aiming to get the mean games-per-period into the 10-15 game range.
Likely, we want to tune the humans-only games-per-period to the 10-15 games range, but we might also experiment with tuning the humans+bots metric.
Tune the $\tau$ system constant for each rating category:
- Consider lower values for correspondence to counteract naturally high volatility in playing strength.
- Consider higher values in rating categories that have long time periods to ensure ratings adjust quickly enough.
- Alternative #1: Annul all games lost by timeout (status quo for bots)
- Alternative #2: Annul the 2nd+ games lost in a mass timeout (status quo for correspondence)
- Alternative #3: Annul the 4th+ games lost in a mass timeout
- Alternative #4: Do nothing (treat as normal result)
- Alternative #5: Add error/deviation to all timeouts
- Alternative #6: Add error/deviation to 2+ mass timeouts
- Add metrics for similarity across time, using deviation cut-offs to filter out stale ratings
- Compare blitz vs live vs correspondence; consider whether there are options for increasing correlation
- Compare 9x9 vs 13x13 vs 19x19; tune rank-to-stone multipliers, and consider other options for increasing correlation