By Ian Kaplan, Hybrid Performance Method COO
We love sports for the opportunity to test ourselves, but we love sports more when we see where we stack up against the competition. It's really us, and our team, against the world. Athletes compete to see what they are made of, and fans enjoy sharing in their favorite athletes' successes.
We love sports for their objectivity. There is a game with clearly defined rules. In the end, there is a score. There are always winners and losers; ties never last. No conjecture, no "what ifs". Assuming the game was played fairly, we accept the outcome. You can't just show up and tell everyone how good you are. You have to prove it every time you play.
We love sports for the comparison, and the controversy. We endlessly argue over cases for "the greatest of all time", about who would win in a game that was never played, about "the most underrated" or the "next big thing". These debates are usually heavy on passion, but light on facts.
Nowadays, sports and statistics have a deeper connection than ever. Mathematically derived rating systems are used widely in sports from basketball to football to racing. They are used to seed playoffs and sometimes even to predict game outcomes. The beauty of such systems is that they limit subjectivity to the assumptions underneath the mathematics. Once we define our limited set of assumptions, the rest is just computation.
While rating systems are no substitute for playing the game, they can inform qualification decisions, inspire athletes to live up to the ratings (or prove them wrong), and create a richer viewing experience for fans. A priori ratings are often no less a part of the game experience than the final score.
Powerlifting is a perfect candidate for a more robust rating system. As athletes, coaches, and fans of the sport, we at HYBRID want to contribute a more unified leaderboard to give the sport the attention it deserves.
Until now, comparing powerlifting performances has been surprisingly challenging. The fundamental premise is very simple: squat, bench press, and deadlift; the highest total wins. Add in some weight classes and we know the champion for a given weight class. Multiply by a standardized coefficient to adjust for bodyweight and you have a "champion of champions": the best across all weight classes. It sounds simple, but the task gets complex quickly.
As powerlifting has evolved, so has the selection of supportive gear. First, it was garments designed to give lifters a little bit of a boost. Then it evolved into full-blown, single-ply and multi-ply suits, adding huge percentages to lifters' totals and fundamentally shifting the sport's dynamics. The rise of "raw" powerlifting meant shedding the most supportive gear, but different divisions permitted different kinds of limited support, particularly around the knees. For the past few decades the sport has been fragmented into two sub-sports, with divisions that classify as "equipped" on one side, and divisions that classify as "raw" on the other. Within each "sub-sport", there are differences that complicate the picture further. The type of equipment influences the weight lifted in ways that are sometimes significant, sometimes subtle.
It's not just differences in equipment, but differences in rules that contribute to differences in performance. Unlike other sports, there are hundreds of established powerlifting federations, and the rules that define a successful lift vary significantly between them. Some federations pass lifts that wouldn't count in other federations. To rank a large group of lifters against each other, we need to account for differences in lifting conditions that systematically influence final lifter performance.
With hundreds of established powerlifting federations and many more federation-equipment combinations, it's no wonder that many in the sport have given up comparing lifters across such wildly different conditions.
We at HYBRID believe that powerlifting is actually no more complex than other sports, and that classic sports rating algorithms can be applied to it with only minor adjustments. The major decisions lie in defining what constitutes a "game" and what constitutes a "score".
You might be wondering why we can't just rank everyone based on their Wilks score. Like other similar formulas out there, Wilks is not so much a rating tool as a bodyweight-adjustment factor. The Wilks coefficient (or another similar number) is applied to the lifter's total to control for bodyweight effects so that performances can be compared across weight classes. Wilks does not account for what the lifter was wearing while recording the total, or what the rules were. It is an absolute score, not a relative one.
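As a concrete illustration, the classic Wilks score is just the lifter's total multiplied by a bodyweight-dependent coefficient. Below is a minimal sketch using the commonly published coefficients for the classic Wilks polynomial; worth double-checking against an official federation table before relying on them:

```python
# Classic Wilks polynomial coefficients (commonly published values;
# verify against an official table before production use).
WILKS_COEFFS = {
    "M": (-216.0475144, 16.2606339, -0.002388645,
          -0.00113732, 7.01863e-06, -1.291e-08),
    "F": (594.31747775582, -27.23842536447, 0.82112226871,
          -0.00930733913, 4.731582e-05, -9.054e-08),
}

def wilks_coefficient(bodyweight_kg, sex):
    """Multiplier applied to a lifter's total; bodyweight in kilograms."""
    a, b, c, d, e, f = WILKS_COEFFS[sex]
    x = bodyweight_kg
    return 500.0 / (a + b * x + c * x**2 + d * x**3 + e * x**4 + f * x**5)

def wilks_score(total_kg, bodyweight_kg, sex):
    """Bodyweight-adjusted total: the 'score' used throughout this article."""
    return total_kg * wilks_coefficient(bodyweight_kg, sex)
```

For example, an 83 kg male lifter gets a coefficient of roughly 0.667, so a 700 kg total maps to about 467 Wilks points.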
Wilks coefficients differ by gender as well, but we are not yet confident that applying different coefficients to men and women adequately eliminates performance differences explained by gender. See this paper, courtesy of openpowerlifting.org, that evaluates Wilks as a tool to adjust for weight class differences. Wilks is more equivalent to a "game score" than a true "rating" on which we can rank lifters across different conditions. Wilks is the input, and a HYBRID Rating is the output.
*Note: The Ferland paper is the reason why we will focus on the classic Wilks as our "input of choice" until enough work is done on another formula to justify a switch. In our rating system, we opted to keep male and female lists separate. Maybe one day, when we are more confident, we'll combine them into one super-list.
The HYBRID Data Science team proposes a lifter rating system modeled after other gold-standard sports rating systems. The intuitions are relatively straightforward, though the math can get complex. For a full description of the mathematics behind the algorithm, see this paper on applying statistical models like least-squares linear regression to sports teams. (Fun fact: this paper was Kenneth Massey's undergraduate thesis. What did you contribute to your field in undergrad?)
We used the Open Powerlifting dataset, generously open sourced on openpowerlifting.org as our "source of truth". With such a clean data source, we were able to quickly develop a rating system dependent on several core assumptions...
A total recorded at a meet in a given federation-equipment combination constitutes a "game". All lifters who lifted in federation X with wraps can be directly compared against each other, even across weight classes (provided they are the same gender). The general idea is comparable to the Heat playing 82 games over the previous regular season: those scores are known and can be used as inputs to the model. Lifters who record totals in more than one federation-equipment combination enable comparison across combinations.
One major modeling decision was how to treat lifts recorded at different times and in different physical locations. We tried several versions and found that the simplest answer was best: we compare scores directly anytime the rules are roughly equivalent. Meets at different times and places still count toward a single "game" for our purposes. This constraint lets us directly compare many lifters at once, which suits the model better. We trade the ability to capture every subtle variation in location and time period for the ability to directly compare lifters across a much larger pool, all on their best days.
We acknowledge that small inconsistencies within federations across different meets are not captured by a model that doesn't define a game as a single meet, but inconsistencies in judging are an unavoidable part of sports. Until we have automated referees out of the game, missed calls are part of the magic we encapsulate mathematically in the error term of the model.
Another limitation is that we don't control for time effects. Our system pits lifters from decades ago against lifters today, who have the advantage of time: better equipment, better training methods, and better weight-control methods, among other factors, all could have subtle but systematic effects. We decided not to restrict direct comparisons to totals recorded in the same time period, keeping the model more easily understandable, given that the goal here is not to perfectly control for every subtlety in the sport, nor to perfectly predict future results. We've produced a simple way of inputting a bodyweight-adjusted total and returning a number that indicates how a lifter compares to other lifters who lifted under the same rules, so that they can in turn be compared against lifters who lifted under different rules. We think it's appropriate that ratings tend to get higher over the years, highlighting how the sport has evolved.
It's helpful to think of Wilks as an equivalent to a game score in a team sport. It's as if the Miami Heat scored 100 points in game 3 of the season. We chose to take a lifter's best performance in a federation-equipment combination as their score in that "game". Since the goal is to create an all-time list, we thought it was important to compare lifters at their best.
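In code, taking each lifter's best bodyweight-adjusted total per federation-equipment combination is a simple aggregation. A sketch with made-up rows (the lifter names, federations, and scores below are purely illustrative):

```python
from collections import defaultdict

# Hypothetical input rows: (lifter, federation, equipment, wilks).
results = [
    ("Lifter A", "USAPL", "Raw",   540.0),
    ("Lifter A", "USAPL", "Raw",   555.5),  # a later, better meet
    ("Lifter A", "USPA",  "Wraps", 548.0),
    ("Lifter B", "USAPL", "Raw",   530.0),
]

# Keep each lifter's best Wilks within each federation-equipment "game".
best = defaultdict(float)
for lifter, fed, equip, wilks in results:
    key = (lifter, fed, equip)
    best[key] = max(best[key], wilks)
```

Only the 555.5 survives for Lifter A in USAPL Raw; the lesser 540.0 performance is discarded, since the goal is to compare lifters at their best.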
The third major assumption is that the difference between two lifters' Wilks scores is proportional to the difference in lifter strength. Since Wilks scales the total by a constant for a given bodyweight, we believe this assumption holds. It means that we don't just care whether you beat other lifters; we care about the number of points between you and your opponents. The math depends on this assumption, and on the ratings being scaled so that they sum to zero. Under the right circumstances, the difference between two lifters' Wilks scores explains something important about their relative strength, and the difference between their HYBRID Ratings estimates the expected difference in bodyweight-adjusted total (Wilks) should those lifters compete against each other again.
After some experimentation, we decided a separate list for raw equipment divisions is necessary. The top totals are significantly higher in equipped divisions, and top lifters rarely move between raw divisions and geared powerlifting. When they do switch, they may not spend enough time adapting to the gear (or lack thereof) to maximize potential in that division. If the dataset doesn't include enough lifters on "their best day", it makes it more challenging to compare across the rather large divide between those equipment divisions. We believe the two kinds of powerlifting are sufficiently different and the "game matrix" is sparse enough to justify a list exclusively for raw powerlifting. So we chose to focus only on a subset of equipment divisions and federations, leaving the challenge of handling equipped divisions for an organization more active in that area. The HYBRID Rating list and rankings include all federation-equipment combinations in the "raw" and "wraps" equipment divisions.
*Above is a sample of the female raw powerlifting scores used as inputs to the model. You can see Marianna Gasparyan has three of the top five scores across different federation-equipment combinations.
After carefully drawing our assumptions, we apply these rules to build a long table of score differentials: a lifter name, a score, a second lifter name, and a second score. With a simple example of three teams, where team A beats team B, team B beats team C, and team A beats team C, we can easily rank the teams A > B > C. We can even rank team D after only one game against team B, having never played teams A or C (A > B > C > D), since rank is determined by an average normalized score. Real-life sports are never so simple, but even after defining only a few basic assumptions, there will always be a single "best" answer.
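The A > B > C example can be worked through with the least-squares (Massey) method the model is built on: each game contributes an equation rating_winner − rating_loser = margin, the normal equations are accumulated, and the otherwise-singular system is pinned down by requiring the ratings to sum to zero. A minimal pure-Python sketch with made-up margins:

```python
def massey_ratings(games, teams):
    """games: list of (winner, loser, margin); returns ratings summing to zero."""
    n = len(teams)
    idx = {t: i for i, t in enumerate(teams)}
    # Accumulate the Massey normal equations M r = p.
    M = [[0.0] * n for _ in range(n)]
    p = [0.0] * n
    for winner, loser, margin in games:
        i, j = idx[winner], idx[loser]
        M[i][i] += 1.0
        M[j][j] += 1.0
        M[i][j] -= 1.0
        M[j][i] -= 1.0
        p[i] += margin
        p[j] -= margin
    # M is singular, so replace the last equation with sum(r) = 0.
    M[n - 1] = [1.0] * n
    p[n - 1] = 0.0
    # Gauss-Jordan elimination with partial pivoting (fine at toy scale).
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p[col], p[piv] = p[piv], p[col]
        for row in range(n):
            if row != col and M[row][col] != 0.0:
                factor = M[row][col] / M[col][col]
                M[row] = [a - factor * b for a, b in zip(M[row], M[col])]
                p[row] -= factor * p[col]
    return {t: p[idx[t]] / M[idx[t]][idx[t]] for t in teams}

# A beats B by 10, B beats C by 5, A beats C by 12 (margins are made up).
ratings = massey_ratings([("A", "B", 10), ("B", "C", 5), ("A", "C", 12)],
                         ["A", "B", "C"])
```

In this complete round robin the solution reduces to each team's average point differential (A ≈ 7.33, B ≈ −1.67, C ≈ −5.67), recovering the A > B > C ordering while still accounting for margin of victory.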
A HYBRID Rating is what Kenneth Massey calls a normalized score. It's an average score across the lifter's best days in each federation-equipment combination, adjusted for the strength of the competition. It doesn't matter how many games you play, aside from the fact that more games give you more opportunities to record big scores relative to the competition. The rating is influenced less by absolute scores than by margin of victory and "strength of schedule". Significant wins over strong opponents power high ratings, intuitively controlling for the biggest variables not directly related to lifter strength.
The HYBRID system applies the A > B > C concept to thousands of lifters and millions of individual comparisons. To simplify things, we only take the top one thousand scores based on our criteria and generate a much longer list where each lifter on that top score list is compared to every other lifter he or she directly competed against (by "directly," we mean in the same federation-equipment combination). We then fit a standard least-squares model to that linear system to find the HYBRID Rating for each lifter and rank the top 10.
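The pairwise comparison table described above can be sketched as follows: every pair of lifters who share a federation-equipment combination produces one row for the least-squares fit. The names and scores here are hypothetical:

```python
from itertools import combinations

# Hypothetical best-Wilks scores, keyed by federation-equipment combination.
scores_by_combo = {
    ("USAPL", "Raw"):   {"Lifter A": 555.5, "Lifter B": 530.0, "Lifter C": 512.3},
    ("USPA",  "Wraps"): {"Lifter A": 548.0, "Lifter D": 541.1},
}

# Each pair of lifters within the same combination yields one "game" row:
# (lifter1, score1, lifter2, score2) -- the input to the least-squares fit.
comparisons = []
for combo, scores in scores_by_combo.items():
    for (l1, s1), (l2, s2) in combinations(sorted(scores.items()), 2):
        comparisons.append((l1, s1, l2, s2))
```

Three lifters in one combination yield three rows, and a lifter appearing in several combinations (Lifter A here) is what ties the ratings together across conditions.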
We have provided the outputs below...
Female Raw Top 10 HYBRID Ratings
Male Raw Top 10 HYBRID Ratings
HYBRID will continue to iterate on this rating system as we get more data and input from the community. If there is interest, we would like to share more of the deeper technical details involved in composing the above solution and engage in a deeper conversation around tech in sports. We are committed to building more tools to improve powerlifting for everyone. Our medium-term goal is to serve HYBRID Ratings online for the community and regularly update them to stay current, possibly rewarding lifters who break into the top tier. We hope that the powerlifting and broader strength community can use it as a tool to celebrate lifters for superior performances, regardless of where they choose to compete, and to provide fans with a richer spectating experience.
*For more collaboration between HYBRID Data Science and HYBRID Meets, stay tuned through the run-up to The Hybrid Showdown III, broadcast live February 20-21, 2021.
You can get in touch with Ian at firstname.lastname@example.org.
To keep Open Powerlifting great and free to enjoy, please consider becoming a supporter at https://www.patreon.com/join/openpowerlifting.