Stat Weights vs Machine Learning

We’re big into ranking gear at Ask Mr. Robot. We probably like it more than WoW itself. OK… almost more than WoW. We like WoW more because it lets us have fun ranking gear.

Throughout Legion, we ranked stats on gear with a Machine Learning methodology. We’ll be using Machine Learning to rank gear with BfA as well (no surprise, considering how much better it is).

Machine learning replaces stat weights. Stat weights are dead to us. OK not really, I was being a bit dramatic. If you want, you can still use stat weights on the site like you always did, and they will still work like they always have, forever. But machine learning is way better and way cooler, and it ranks gear incredibly well. Read on to learn why!

Stat Weights are Good

There are still good things about stat weights, which is why we used them for so long:

They actually can rank stats pretty well.
They are easy to use in optimizers and pawn and stuff.
They are popular.
They are easy to understand.

Stat Weights are Bad

But… there are also bad things about stat weights:

They sometimes don’t rank stats very well.
They don’t rank trinket procs, Azerite Powers, or anything else with a proc.
People say that you have to recalculate them any time you change gear, which is both true and not true.
They suck for specs that really need to balance stats instead of stacking one stat through the roof.
“Caps” are weird and fake and a band-aid, but sometimes people think they are real.
While they seem easy to understand, the problem is a lot more complex and stat weights aren’t able to tell the full story.

So… What Are Stat Weights?

I’m sure that you know, and aced all your math classes. But if you have that friend who, ugh, just doesn’t get math and stuff, we’ll give you a short explanation that you can quote for them.

There are two “interpretations” of stat weights and thus two methods for creating them:

Partial Derivatives of the DPS Function

Oh shit, calculus. But don’t worry, we don’t need to get all complicated.

A derivative is just a fancy word for how “steep” something is. Imagine you are trying to get to the top of a hill with infinitely strong robotic thighs, but they only have enough power to take 100 steps. You would find the steepest part of the hill because it would get you highest in the least number of steps. So you can see how choosing gear in WoW is exactly the same as having robotic thighs. You only have so many stat points (item budget), so you want to pick the stats that make your DPS go up the fastest.

Thing is… hills aren’t perfect triangles. At the bottom it might be super steep, but near the top it might level off. The steepness changes with every step that you take. This is why people say that you need to recalculate your stat weights all the time — to find the new slope at your current spot on the hill.

This approach is good for finding the next-best item. It’s not necessarily the best for getting to the highest DPS though. Here’s another hill-climbing analogy: imagine you are in the Himalayas. You search around at ground level for the steepest starting point, and with every step, you keep going towards the steepest spot. After climbing for a while you reach the top… of K2, the 2nd tallest mountain in the world. WHAT? WHY AM I NOT AT THE TOP OF EVEREST? Well, the bottom of K2 is a lot steeper, but it doesn’t go as high. (This analogy could totally suck because Mr. Robot aggressively avoids extreme sports, snow, and heights, and thus knows nothing about how steep any of these mountains are. But you get the idea.)

This is one of the big traps of this approach, getting stuck in a “local maximum” instead of finding the “global maximum”.

Simulationcraft was the first major simulator to generate stat weights and it uses this method. It calculates these “partial derivatives” by choosing two data points for each stat on either side of your current value. It sounds all fancy, but really all it does is add some crit, divide the gain in DPS by how much crit you added, and assign that as the stat weight. The fact that it only uses three data points (current crit, a little less, and a little more) makes it extremely “noisy” and also very sensitive to how much crit you decide to add. Sometimes this “noise” could be caused by some minor issue in the simulator that manifests at a particular stat value. (Secret! No computer program is perfect, no matter who wrote it. But don’t tell Mr. Robot I said that!)
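To make the “add some crit and divide” idea concrete, here is a minimal Python sketch. The toy_dps function and every number in it are made up for illustration. This is not Simulationcraft’s code, and a real simulator would also add random noise to every result, which this toy leaves out.

```python
def toy_dps(crit: float, haste: float) -> float:
    """Hypothetical DPS curve with diminishing returns; not a real simulator."""
    return 1000.0 + 40.0 * crit ** 0.5 + 30.0 * haste ** 0.5 + 0.002 * crit * haste


def crit_weight(crit: float, haste: float, delta: float) -> float:
    """Slope estimate: (DPS with more crit - DPS with less crit) / crit gap."""
    dps_more = toy_dps(crit + delta, haste)
    dps_less = toy_dps(crit - delta, haste)
    return (dps_more - dps_less) / (2.0 * delta)


# Same character, two different choices for "how much crit to add".
weight_small_step = crit_weight(crit=500.0, haste=700.0, delta=10.0)
weight_large_step = crit_weight(crit=500.0, haste=700.0, delta=400.0)

# Same step size, but further up the hill (more crit already on the gear).
weight_more_crit = crit_weight(crit=900.0, haste=700.0, delta=10.0)

print(f"small step: {weight_small_step:.2f}")
print(f"large step: {weight_large_step:.2f}")
print(f"more crit:  {weight_more_crit:.2f}")
```

With this made-up curve the small step gives roughly 2.29, the large step gives 2.40, and the same small step at 900 crit gives roughly 2.07. Same character, three different “crit weights”, depending only on where you stand and how big a step you take.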

Wouldn’t it be great if there was a different method that used more data to smooth over any potential “noise” in the data instead of amplifying it into Great and Meaningful Stat Plateau Cap Shift TC Buzz Words? Well, funny that you ask…

Multiple Linear Regression

Oh shit, statistics. But don’t worry, we still don’t need to get all complicated.

The actual mathematical “function” for your DPS in terms of your stats is some beastly equation of doom (if there even is one that works). There are complex interactions between stats that cause one stat to gradually become better than another, and then maybe as you get even better gear they cross again. Stuff like that. But if you look at a small enough region… say between player average item level 320 and 340 (random example, nothing special about those ilvls, these asides are necessary to preempt trolling), you can just draw a straight line and it is a pretty good approximation.

There are tons of really good ways to figure out the best straight line to approximate a bunch of data points that are straight-ish. “Multiple linear regression” is one of those techniques. The basic approach is this: generate a bunch of data points for random combinations of stats in the region of interest. In our example, say we generate 1000 random combinations of stats that are possible to get with a player between ilvl 320 and 340 and simulate the DPS for all of them. Then you pass those combos and DPS values into a math library, and it spits back the stat weights. Done!

Wow Mr. Robot… you cheated pretty hardcore right there. “Math library” is all you have to say about linear regression? Yep. It is a simple concept (find the best “line” that is as close to as many of the data points as possible) that is pretty complex to implement. But it is also a technique that has been used and perfected by statisticians and mathematicians for a very long time. We use Math.NET Numerics to do it. Check it out if you want more details.

By using a lot of data and some solid statistics, we can smooth over a lot of “noise”. This gives a more stable result and also tends to work over a wider range without having to recalculate your stat weights as often. That said, this approach still has its flaws — if you make the range for the data too big, then the weights will be further away from your actual DPS. If you make the range too small, you start to get some of the problems of the derivative approach.
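Here is a rough Python sketch of that recipe using only the standard library. The toy_sim function, its hidden weights (3.0, 2.0, 1.5), the stat range, and the noise level are all invented for illustration. The little solver stands in for a real math library such as Math.NET Numerics, and none of this is a copy of anything the site actually runs.

```python
import random


def toy_sim(crit, haste, mastery, rng):
    """Stand-in for a simulator: made-up linear DPS plus random sim noise."""
    return 5000.0 + 3.0 * crit + 2.0 * haste + 1.5 * mastery + rng.gauss(0.0, 50.0)


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_stat_weights(stat_rows, dps_values):
    """Least squares via the normal equations: (X^T X) w = X^T y."""
    rows = [[1.0] + list(stats) for stats in stat_rows]  # leading 1.0 = intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, dps_values)) for i in range(k)]
    return solve(xtx, xty)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# 1000 random stat combinations in a made-up "region of interest".
combos = [[rng.uniform(300.0, 900.0) for _ in range(3)] for _ in range(1000)]
dps = [toy_sim(crit, haste, mastery, rng) for crit, haste, mastery in combos]

base_dps, w_crit, w_haste, w_mastery = fit_stat_weights(combos, dps)
print(f"crit={w_crit:.2f} haste={w_haste:.2f} mastery={w_mastery:.2f}")
```

Even though every simulated DPS value is off by a random amount, the fitted weights land close to the hidden 3.0, 2.0, and 1.5, because a thousand noisy points average out in a way that three points cannot.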

The Perfect Stat Weights

Enough background and math talk. Doesn’t matter how you got your stat weights or how you “think” about them, let’s say that you have found the perfect stat weights for your character. You ran 5 billion simulations on 5 different continents, you won 13 theorycraft battles and used the blood of the losers to summon a spirit who also confirmed that your weights are perfect.

Eh. They still won’t work that great in some cases, like:

Balancing Stats

Just look at your perfect weights. Crit = 8. Everything else = less than 8. What do you think an optimizer will tend to do? Give you a ton of crit. But… what if you do more damage by keeping some kind of balance between crit and haste? And what if that balance shifts up or down depending on how much mastery you have or something?

A “hack” that us stat weight people had been using for years is called “caps”. You could put a cap on crit at say 7000 rating, after which you say it is worth less, like 6, and haste becomes better. That’s an OK approach… but it only works for a very specific setup. If your gear was a little different, maybe a better cap for you would be 6800 rating, or 7300, or whatever.
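A cap boils down to “use one weight up to a number, and a lower weight after it”. The Python sketch below uses the numbers from the example above (crit worth 8 up to 7000 rating and 6 after that). The haste weight of 7 and the function names are assumptions added for illustration.

```python
def crit_score(crit_rating, cap=7000.0, weight_below=8.0, weight_above=6.0):
    """Score crit with one weight up to the cap and a lower weight past it."""
    below_cap = min(crit_rating, cap)
    above_cap = max(crit_rating - cap, 0.0)
    return below_cap * weight_below + above_cap * weight_above


HASTE_WEIGHT = 7.0  # hypothetical; the example only says it is less than 8


def better_next_stat(current_crit, points=100.0, cap=7000.0):
    """Which stat gains more score from the next `points` of rating?"""
    crit_gain = crit_score(current_crit + points, cap) - crit_score(current_crit, cap)
    haste_gain = points * HASTE_WEIGHT
    return "crit" if crit_gain > haste_gain else "haste"


print(better_next_stat(6500.0))              # crit: still below the cap
print(better_next_stat(7000.0))              # haste: crit is past the cap
print(better_next_stat(7000.0, cap=7300.0))  # crit: same gear, different cap
```

The last line is the whole problem in miniature: nothing about the character changed, only the cap did, and the advice flipped.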

People tend to latch on to a single number that somebody posts somewhere for a “cap” and think that it will work for everyone in every case. Not true. Just as you have to change your stat weights as your gear changes, you also have to change your caps. So why don’t theorycrafters tell you to do this? Some try to explain it but no matter what they do, the number they use as an example gets set in stone then passed around and turned into the cap you must use. It is hard to figure out what these “caps” are… mainly because they don’t actually exist. It would take too much work and be too hard to explain a “shifting cap” of some kind. I’ve stopped making sense to myself already.

Non-Linear Scaling

That math-talk again… this is just a fancy way of saying: for some specs, the value of some stats goes up or down more quickly than a straight line can suggest. It’s more like a curve than a line. So trying to approximate it with a line (stat weights) always gives mediocre results if the curve is really curvy. If you recalculate your stat weights often enough you can mitigate this… but you are losing accuracy no matter how good your weights are.

Weird stat weights

Stat weights should be a predictor of your DPS. So if your stat weight for Intellect is 3 (a made up value), that means you get 3 DPS for every point of Intellect.

You might see some ‘normalized’ stat weights shared in guides or on forums. When stat weights are normalized, they set your main stat’s value to 1, and adjust everything around it. So in this case, Intellect would be a value of 1, instead of 3.

That’s a problem for 2 reasons:

It doesn’t do a good job of describing the actual value of the stat anymore. You no longer know how powerful Intellect is.
It causes problems when trying to use stat weights to rank a stat-stick trinket compared to a proc-trinket.
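A quick bit of arithmetic shows how the trinket comparison goes wrong. In this Python sketch the Intellect weight of 3 comes from the example above. The other weights, the 300 Intellect stat stick, and the 700 DPS proc trinket are made-up numbers.

```python
# Stat weights that predict DPS: 1 point of Intellect = 3 DPS (made-up values).
raw_weights = {"intellect": 3.0, "crit": 2.4, "haste": 1.8}

# Normalized weights: divide everything by the main stat's weight.
normalized = {stat: w / raw_weights["intellect"] for stat, w in raw_weights.items()}

stat_stick_intellect = 300   # hypothetical stat-stick trinket
proc_trinket_dps = 700.0     # hypothetical simulated value of a proc trinket

# With DPS-predicting weights, both trinkets are measured in DPS.
stat_stick_dps = stat_stick_intellect * raw_weights["intellect"]

# With normalized weights, the stat stick is measured in "Intellect-equivalents".
stat_stick_score = stat_stick_intellect * normalized["intellect"]

print(f"real weights:       stat stick {stat_stick_dps:.0f} vs proc {proc_trinket_dps:.0f}")
print(f"normalized weights: stat stick {stat_stick_score:.0f} vs proc {proc_trinket_dps:.0f}")
```

With the real weights the stat stick is worth 900 DPS and beats the 700 DPS proc. With normalized weights it scores 300, which looks worse than 700 only because the two numbers are no longer in the same units.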

Special Items

Another aside, but worth a quick mention: stat weights are useless for ranking special effects like trinket procs, Azerite Powers, and anything else with a proc. Theorycrafters know this already, of course, which is why you see all of those ranked trinket lists. Those are meant to be a good starting point for casual players (or your alts), giving you a quick and dirty idea of how powerful each trinket can be.

But as you’ve probably already noticed, things get complicated when you have warforged or socketed versions. Confusion sets in, maybe even panic. Don’t panic. You don’t need to simulate different setups with each of those trinkets (and other special items), because our ‘gearing strategy’ simulations have done that for you already. And then we use some smarts to handle all of the variations, which Mr. Robot also does for you.

What’s the better way to rank gear?

It may feel like we’re coming down pretty hard on stat weights at this point. Before we continue though, we should step back and get some perspective: well-crafted stat weights are surprisingly good at ranking gear. The best stat weights are those designed to actually predict your DPS. In conjunction with simulation data for special items and a little extra effort to customize weights to your own situation, you can get decent results.

But we wanted to improve even more on our gear rankings, and we’d sort of hit a wall with how far we could take stat weights. We created a better method in Legion for calculating stat weights with the regression technique. We made it really easy for even the least technical users to create their own stat weights whenever they need to, and plug them into the optimizer with no hassle. We have a UI for customizing stat weights if you need to tweak them based on some other data. What else can we do?

The answer was staring us in the face the whole time: stop using stat weights.

Stat weights just can’t handle the way stats actually interact in the game. Feel free to keep using them, but know that you are “capped” man — no amount of clever theorycraft will make them much better than they are now.

Our new method ditches stat weights entirely. Enter…

Machine Learning

“Machine learning” is a pretty generic term these days. For our purposes it is best described by example:

Create 1000 (or more) random combinations of stats on your gear, simulate them all, and get the DPS. Then do fancy math and statistics on that data to create a “model” that can predict your DPS for any combination of stats, not just the 1000 we simulated.

Sometimes this is also called “predictive analytics” or “computational statistics” but “machine learning” is way catchier, so we’ll go with that. Some troll computer science college student (or more likely his elder sibling grad student) will probably come at me because they didn’t read this far down the page, but that’s cool, the internet is a harsh place and I cry myself to sleep each night to prepare for the next cruel, cruel day. (Kidding! I don’t think everyone gets my jokes).

This sounds a lot like the multiple linear regression approach described above… doesn’t it? Yes it does! The twist is that we don’t limit ourselves to the really basic stat weight formula — we explore many possible predictive models, ones that can describe all the curves and interactions between your stats. Then we pick the best one and save it with your gearing strategy.

The actual method used can vary: it could be one of a plethora of regression techniques, a decision tree variant, some other cool-kid computer term, or a combination of them. Describing these methods is beyond the scope of this blog post. It would be very long and very technical because it is way more complex than stat weights, so we’ll do that another day. Just like stat weights, the important thing is that we give you some way to understand the output, and some way to determine how well it is working. To that end we have done two crazy things:

Stat Goals

After we do our fancy analysis, we spit out some representative “stat goals” that the predictive model will try to achieve. For example, it might find that around item level 320 you should go for about 500 crit, 700 haste, and 200 versatility. But around item level 340 you should go for 400 crit, 800 haste, 200 mastery, and 400 versatility.

In Legion, we showed you a summary in a mini-graph, as well as a larger graph with all of the data points. It wasn’t as clear to everyone as we had hoped, so we have a new way to display this data coming out in BfA. We’re excited to get your feedback and see if you can easily understand it.

Benchmarks

This is an insane concept that has never been done before for stat weights or any stat ranking approach: uh… test it? To see how well it actually works? Seriously… we looked all over the internet and found basically zero actual tests of the predictive power of stat weights (including our own, which is shameful). So on our gearing strategy outputs we now include “benchmarks” that are a simple but good test:

We generate a bunch of random stat combinations that are different from the stat combinations used to create the stat weights or predictive model. The fact that they are different is key. You want to see how well your model can work outside of the data that you already have.

Then we use the predictive model (whether it is stat weights or machine learning) to calculate how much DPS it thinks you should do for those combinations. We compare that to the simulated DPS, and find the “mean absolute error”. Fancy term for: on average, how far away was the prediction from the real value? The best predictive model is the one with the lowest difference from the real value.

It’s that simple — true data-driven stat ranking should strive to be as close to the simulated data as possible.
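The benchmark idea can be sketched in plain Python. Everything here is an assumption made for illustration: the toy_sim function, the choice of a quadratic model as the “richer” option, and the sample sizes. The actual models behind the gearing strategies are not described in this post, and the toy numbers say nothing about any real spec.

```python
import random


def toy_sim(crit, haste, rng):
    """Made-up simulator: curved stats, a crit*haste interaction, and noise.
    Stats are in hundreds of rating points to keep the numbers small."""
    return (5000.0 + 400.0 * crit ** 0.5 + 300.0 * haste ** 0.5
            + 20.0 * crit * haste + rng.gauss(0.0, 5.0))


def linear_features(crit, haste):
    """Plain stat weights: an intercept plus one weight per stat."""
    return [1.0, crit, haste]


def rich_features(crit, haste):
    """One possible richer model: adds an interaction and squared terms."""
    return [1.0, crit, haste, crit * haste, crit * crit, haste * haste]


def fit(features, stats, dps):
    """Least-squares fit via the normal equations and Gaussian elimination."""
    rows = [features(c, h) for c, h in stats]
    k = len(rows[0])
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * d for r, d in zip(rows, dps))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    coeffs = [0.0] * k
    for r in range(k - 1, -1, -1):
        known = sum(m[r][c] * coeffs[c] for c in range(r + 1, k))
        coeffs[r] = (m[r][k] - known) / m[r][r]
    return coeffs


def mean_absolute_error(features, coeffs, stats, dps):
    """On average, how far is the predicted DPS from the simulated DPS?"""
    errors = []
    for (c, h), actual in zip(stats, dps):
        predicted = sum(w * f for w, f in zip(coeffs, features(c, h)))
        errors.append(abs(predicted - actual))
    return sum(errors) / len(errors)


rng = random.Random(7)  # fixed seed so the sketch is repeatable


def random_stats(count):
    return [(rng.uniform(2.0, 10.0), rng.uniform(2.0, 10.0)) for _ in range(count)]


train = random_stats(1000)
train_dps = [toy_sim(c, h, rng) for c, h in train]
held_out = random_stats(200)  # different combos from the ones used for fitting
held_out_dps = [toy_sim(c, h, rng) for c, h in held_out]

mae_by_model = {}
for name, features in [("stat weights", linear_features), ("richer model", rich_features)]:
    coeffs = fit(features, train, train_dps)
    mae_by_model[name] = mean_absolute_error(features, coeffs, held_out, held_out_dps)
    print(f"{name}: mean absolute error = {mae_by_model[name]:.1f} DPS")
```

The important design choice is that held_out is never passed to fit. Both models are graded only on stat combinations they have not seen, which is exactly what an optimizer asks of them when it tries out new gear.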

In all of our tests, the error of our machine learning models is anywhere from 2-4x smaller than the best stat weights we have been able to create.

Why Has Nobody Done This?

If your new method is so great, Mr. Robot, why has nobody done it before? There are tons of smart theorycrafters, and there is no way that you are smarter than all of them. True! The main reason this hasn’t been done before has nothing to do with the intelligence or skill of the theorycrafting community. It has everything to do with the tools available to them.

To clarify, there are definitely theorycrafters out there who “get” all this and have been working on stuff. There is a cool project that can script Simulationcraft to find the optimal secondary stat balance at a given item level, similar to the “stat goals” that we talked about above. There is another set of people who create 4D plots to figure out secondary stat relationships, also cool.

While the concepts and insights of these projects compared to our machine learning approach are similar, they are essentially “research projects”. You use them to gain a deeper understanding of how the game works, not to actually rank gear or get simple, actionable advice.

We have bridged the gap between this research and a tool that can actually put it into use: our gear optimizer. This means that every WoW player has access to advanced gear ranking techniques — not just a few hardcore programmers.

We also took it one step further halfway through Legion, and calculated all of these data points with all of these talent and gear combinations for you ahead of time. That means you don’t have to ‘sim it’ every time you get new gear. We’ve already simmed “all the things” and applied the “Machine Learning” approach to it. So as your gear changes, or you pick new talents, we already have the data needed to rank gear for you. These are called “Adaptive Gearing Strategies” on our site.

We’ll have Adaptive Gearing Strategies again with BfA because they worked so well, and people seemed to really appreciate not having to constantly sim to get updated “stat weights.”

If you want to customize the gearing strategies, you can. The Ask Mr. Robot optimizer works with any predictive model. It is what us programmer kids call a “combination optimizer”. There are quadrillions of gear combinations just in the stuff that you have in your own bags. It would take forever just to list them all out, let alone find the one with the highest predicted DPS. But programmer nerd-magic can sift through all those combos super fast.
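To show what “works with any predictive model” means, here is a tiny brute-force version in Python. The items, the predicted_dps model, and the function names are all hypothetical. Brute force only works because this example has 12 combinations; the real optimizer’s search method is not described in this post and would have to be far smarter.

```python
from itertools import product
from math import prod

# Hypothetical items in your bags: (name, crit, haste) for each slot.
bags = {
    "head": [("Helm A", 120, 40), ("Helm B", 60, 110)],
    "chest": [("Chest A", 150, 30), ("Chest B", 70, 120), ("Chest C", 100, 100)],
    "ring": [("Ring A", 90, 0), ("Ring B", 0, 95)],
}


def predicted_dps(crit, haste):
    """Any predictive model can go here; this made-up one rewards balance."""
    return 5000.0 + 2.0 * crit + 2.0 * haste + 0.01 * crit * haste


def best_combination(bags, model):
    """Try every one-item-per-slot combination and keep the best prediction."""
    best_score, best_set = float("-inf"), None
    for combo in product(*bags.values()):
        crit = sum(item[1] for item in combo)
        haste = sum(item[2] for item in combo)
        score = model(crit, haste)
        if score > best_score:
            best_score, best_set = score, [item[0] for item in combo]
    return best_set, best_score


# The number of combinations multiplies with every slot: 2 * 3 * 2 = 12 here.
combination_count = prod(len(items) for items in bags.values())
best_set, best_score = best_combination(bags, predicted_dps)

print(combination_count, best_set, best_score)
```

Note that best_combination never looks inside the model. It only asks “what DPS do you predict for these stats?”, so stat weights, a regression, or something fancier can all be plugged in the same way. The multiplication in combination_count is also why listing everything stops being an option once you have many slots with many items each.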

Conclusion

Phew. That was a lot of theorycraft. If you got lost along the way, the short version is this: simulators are cool, but slow. So how can you simulate just some stuff and use it to rank all stuff? Do a bunch of super fancy math and statistics on a chunk of simulation data to create a [beastly thing] that can predict your performance with any gear. Plug the [beastly thing] into the AMR gear optimizer.

[beastly thing] eats [stat weights] for breakfast. End of story. It is literally impossible for stat weights to beat machine learning. That’s not a boast, it is a mathematical fact. If you find a case where this is not true, tell us immediately! We will investigate and fix it.

Try it! If you haven’t tried it yet, give it a whirl with Battle for Azeroth. Once you like data-driven gear ranking, you’ll never go back.