We’re big into ranking gear at Ask Mr. Robot. We probably like it more than WoW itself. OK… almost more than WoW. We like WoW more because it lets us have fun ranking gear.
We released an update today with a beta version of a new way to rank stats on gear that we call Machine Learning. It replaces stat weights. Stat weights are dead to us. OK not really, I was being a bit dramatic. You can still use them on the site like you always did, and they will still work like they always have, forever. But machine learning is way better and way cooler. Read on to learn why!
NOTE: To use the new method during the beta, you must run your own custom gearing strategy simulation, then check “Use Machine Learning” on the report that it generates, and then press “Save and Use”. You will know that it is in effect because you will see a “stat goal” next to each section on the gear optimizer (Gear Explorer, Best in Bags, etc.). If you encounter any bugs or issues, let us know immediately! It is a beta!
- Stat Weights are Good
- Stat Weights are Bad
- So… What Are Stat Weights?
- Partial Derivatives of the DPS Function
- Multiple Linear Regression
- The Perfect Stat Weights
- Balancing Stats
- Non-Linear Scaling
- Weird Stat Weights
- Special Items
- How Can We Improve?
- Machine Learning
- Stat Goals
- Why Has Nobody Done This?
Stat Weights are Good
There are still good things about stat weights, which is why we used them for so long:
- They actually can rank stats pretty well.
- They are easy to use in optimizers and stuff.
- They are popular.
- They seem like they are simple and make sense.
Stat Weights are Bad
But… there are also bad things about stat weights:
- They sometimes don’t rank stats very well.
- They don’t rank trinket procs or legendaries (or anything with a proc).
- People say that you have to recalculate them any time you change gear, which is both true and not true.
- They suck for specs that really need to balance stats instead of stacking one stat through the roof.
- “Caps” are weird and fake and a band-aid, but people think they are real.
- They aren’t as simple and don’t make as much sense as people think.
So… What Are Stat Weights?
I’m sure that you know, and aced all your math classes. But if you have that friend who, ugh, just doesn’t get math and stuff, we’ll give you a short explanation that you can quote for them.
There are two “interpretations” of stat weights and thus two methods for creating them:
Partial Derivatives of the DPS Function
Oh shit, calculus. But don’t worry, we don’t need to get all complicated.
A derivative is just a fancy word for how “steep” something is. Imagine you are trying to get to the top of a hill with infinitely strong robotic thighs, but they only have enough power to take 100 steps. You would find the steepest part of the hill because it would get you highest in the least number of steps. So you can see how choosing gear in WoW is exactly the same as having robotic thighs. You only have so many stat points (item budget), so you want to pick the stats that make your DPS go up the fastest.
Thing is… hills aren’t perfect triangles. At the bottom it might be super steep, but near the top it might level off. The steepness changes with every step that you take. This is why people say that you need to recalculate your stat weights all the time — to find the new slope at your current spot on the hill.
This approach is good for finding the next-best item. It’s not necessarily the best for getting to the highest DPS though. Another hill-climbing analogy: imagine you are in the Himalayas. You search around at ground level for the steepest starting point, and with every step, you keep going towards the steepest spot. After climbing for a while you reach the top… of K2, the second-tallest mountain in the world. WHAT? WHY AM I NOT AT THE TOP OF EVEREST? Well, the bottom of K2 is a lot steeper, but it doesn’t go as high. (This analogy could totally suck because Mr. Robot aggressively avoids extreme sports, snow, and heights, and thus knows nothing about how steep any of these mountains are. But you get the idea.)
This is one of the big traps of this approach, getting stuck in a “local maximum” instead of finding the “global maximum”.
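If you want to see the K2 trap in action, here is a tiny Python sketch. Everything in it is made up for illustration: `heights` is an imaginary terrain (think of each index as a bit more crit and each value as DPS), and `hill_climb` is a hypothetical helper, not something from any real simulator. It always steps to the higher neighbor and stops when neither neighbor is higher.

```python
# Made-up terrain with two peaks; the first peak is lower than the second.
heights = [0, 5, 9, 11, 10, 6, 3, 4, 6, 9, 12, 14, 13]

def hill_climb(terrain, start):
    """Greedy climb: keep stepping to the higher neighbor until none is higher."""
    pos = start
    while True:
        neighbors = [p for p in (pos - 1, pos + 1) if 0 <= p < len(terrain)]
        best = max(neighbors, key=lambda p: terrain[p])
        if terrain[best] <= terrain[pos]:
            return pos  # no higher neighbor: we are standing on some peak
        pos = best

local_peak = hill_climb(heights, start=0)
global_peak = max(range(len(heights)), key=lambda p: heights[p])

print(local_peak, heights[local_peak])    # where the greedy climber stops
print(global_peak, heights[global_peak])  # the true highest point
```

Starting from index 0 the climber stops on the first peak, even though the second one is taller. Starting from index 7 it reaches the taller peak, which is the whole problem: where you end up depends on where you started.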
SimulationCraft is the main tool that uses this interpretation. It calculates these “partial derivatives” by choosing two data points for each stat on either side of your current value. It sounds all fancy, but really all it does is add some crit, divide the gain in DPS by how much crit you added, and call that the stat weight. The fact that it only uses three data points (current crit, a little less, and a little more) makes it extremely “noisy” and also very sensitive to how much crit you decide to add. Sometimes this “noise” could be caused by some minor issue in the simulator that manifests at a particular stat value. (Secret! No computer program is perfect, no matter who wrote it. But don’t tell Mr. Robot I said that!)
Wouldn’t it be great if there was a different method that used more data to smooth over any potential “noise” in the data instead of amplifying it into Great and Meaningful Stat Plateau Cap Shift TC Buzz Words? Well, funny that you ask…
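Here is a rough Python sketch of the "add some crit, divide the gain" idea. It is not SimulationCraft's actual code: `toy_dps` is an invented DPS curve, the noise level is an assumption, and `crit_weight` and `spread` are hypothetical helpers. The point is only to show how sensitive the result is to the step size when the simulator output wobbles.

```python
import random
import statistics

def toy_dps(crit, haste, rng=None, noise=0.0):
    """Made-up DPS curve with diminishing returns on crit (not a real WoW formula)."""
    dps = 100_000 + 8.0 * crit + 6.0 * haste - 0.0004 * crit * crit
    if rng is not None:
        dps += rng.gauss(0.0, noise)  # stand-in for simulator "noise"
    return dps

def crit_weight(crit, haste, delta, rng=None, noise=0.0):
    """DPS gained per point of crit, from one run above and one below the current value."""
    high = toy_dps(crit + delta, haste, rng, noise)
    low = toy_dps(crit - delta, haste, rng, noise)
    return (high - low) / (2 * delta)

def spread(delta, trials=200, noise=300.0):
    """How much the estimated weight wobbles across repeated noisy runs."""
    rng = random.Random(42)
    estimates = [crit_weight(5000, 4000, delta, rng, noise) for _ in range(trials)]
    return statistics.pstdev(estimates)

exact = crit_weight(5000, 4000, delta=100)  # noise-free slope at 5000 crit
print(round(exact, 3))
for delta in (50, 200, 800):
    print(delta, round(spread(delta), 3))   # smaller steps amplify the noise
```

With the same amount of noise, a small step gives a much shakier weight than a large step. A large step has its own cost on a curvier function, since it measures the slope over a wider stretch of the hill instead of where you are standing.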
Multiple Linear Regression
Oh shit, statistics. But don’t worry, we still don’t need to get all complicated.
The actual mathematical “function” for your DPS in terms of your stats is some beastly equation of doom (if there even is one that works). There are complex interactions between stats that cause one stat to gradually become better than another, and then maybe as you get even better gear they cross again. Stuff like that. But if you look at a small enough region… say between player average item level 860 and 880 (random example, nothing special about those ilvls, these annoying asides are necessary to preempt trolls), you can just draw a straight line and it is a pretty good approximation.
There are tons of really good ways to figure out the best straight line to approximate a bunch of data points that are straight-ish. “Multiple linear regression” is one of those techniques. The basic approach is this: generate a bunch of data points for random combinations of stats in the region of interest. In our example, say we generate 1000 random combinations of stats that are possible to get with a player between ilvl 860 and 880 and simulate the DPS for all of them. Then you pass those combos and DPS values into a math library, and it spits back the stat weights. Done!
Wow Mr. Robot… you cheated pretty hardcore right there. “Math library” is all you have to say about linear regression? Yep. It is a simple concept (find the best “line” that is as close to as many of the data points as possible) that is pretty complex to implement. But it is also a technique that has been used and perfected by statisticians and mathematicians for a very long time. We use Math.NET Numerics to do it. Check it out if you want more details.
By using a lot of data and some solid statistics, we can smooth over a lot of “noise”. This gives a more stable result and also tends to work over a wider range without having to recalculate your stat weights as often. That said, this approach still has its flaws — if you make the range for the data too big, then the weights will be further away from your actual DPS. If you make the range too small, you start to get some of the problems of the derivative approach.
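For the curious, the whole "generate data, hand it to a math library" loop fits in a short Python sketch. We use Math.NET Numerics for the real thing; the pure-Python `solve` and `fit_linear` below are stand-ins for that library so the example runs on its own. `toy_sim` and all of its numbers are invented, and the 3000 to 7000 rating range is an arbitrary assumption.

```python
import random

def toy_sim(crit, haste, mastery, rng):
    """Stand-in for one simulator run (hypothetical numbers, with some noise)."""
    return 200_000 + 8.0 * crit + 6.5 * haste + 5.0 * mastery + rng.gauss(0.0, 500.0)

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations; returns [intercept, weights...]."""
    xs = [[1.0] + list(row) for row in rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)

rng = random.Random(7)
# 1000 random stat combinations in the region of interest
combos = [tuple(rng.uniform(3000, 7000) for _ in range(3)) for _ in range(1000)]
dps = [toy_sim(*combo, rng) for combo in combos]

intercept, crit_w, haste_w, mastery_w = fit_linear(combos, dps)
print(round(crit_w, 2), round(haste_w, 2), round(mastery_w, 2))
```

The fitted weights land close to the 8.0, 6.5, and 5.0 baked into the toy simulator, even though every single "simulation" was noisy. That is the smoothing effect of using lots of data points instead of three.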
The Perfect Stat Weights
Enough background and math talk. Doesn’t matter how you got your stat weights or how you “think” about them, let’s say that you have found the perfect stat weights for your character. You ran 5 billion simulations on 5 different continents, you won 13 theorycraft battles and used the blood of the losers to summon a spirit who also confirmed that your weights are perfect.
Eh. They still won’t work that great in some cases, like:
Balancing Stats
Just look at your perfect weights. Crit = 8. Everything else = less than 8. What do you think an optimizer will tend to do? Give you a ton of crit. But… what if you do more damage by keeping some kind of balance between crit and haste? And what if that balance shifts up or down depending on how much mastery you have or something?
A “hack” that us stat weight people have been using for years is called “caps”. You could put a cap on crit at say 7000 rating, after which you say it is worth less, like 6, and haste becomes better. That’s an OK approach… but it only works for a very specific setup. If your gear was a little different, maybe a better cap for you would be 6800 rating, or 7300, or whatever.
People tend to latch on to a single number that somebody posts somewhere for a “cap” and think that it will work for everyone in every case. Not true. Just as you have to change your stat weights as your gear changes, you also have to change your caps. So why don’t theorycrafters tell you to do this? Short answer: it is hard to figure out what these “caps” are… mainly because they don’t actually exist. It would take too much work and be too hard to explain a “shifting cap” of some kind. I’ve stopped making sense to myself already.
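A quick Python sketch of what a cap does inside a stat-weight score. The crit numbers (8 below a 7000 cap, 6 above it) come from the example above; the haste weight of 7, the 12,000 point budget, and the helper names `capped_score` and `best_crit` are assumptions made up for illustration.

```python
def capped_score(crit, haste, cap=7000, crit_w=8.0, crit_w_over_cap=6.0, haste_w=7.0):
    """Stat-weight score where crit is worth less once it passes the cap."""
    under = min(crit, cap)
    over = max(crit - cap, 0)
    return crit_w * under + crit_w_over_cap * over + haste_w * haste

def best_crit(budget, cap, step=100):
    """Try every crit/haste split of the budget; return the crit amount that scores highest."""
    return max(range(0, budget + 1, step),
               key=lambda crit: capped_score(crit, budget - crit, cap=cap))

for cap in (6800, 7000, 7300):
    print(cap, best_crit(12_000, cap))
```

The "optimizer" parks crit exactly on whatever cap you hand it. If the cap number you copied from a forum post is wrong for your gear, the result is wrong by the same amount.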
Non-Linear Scaling
That math-talk again… this is just a fancy way of saying: for some specs, the value of some stats goes up or down more quickly than a straight line can suggest. It’s more like a curve than a line. So trying to approximate it with a line (stat weights) always gives mediocre results if the curve is really curvy. If you recalculate your stat weights often enough you can mitigate this… but you are losing accuracy no matter how good your weights are.
Weird Stat Weights
Sort of an aside: many stat weights posted out there on the internet have become progressively “weirder”. For example, a popular thing to do is “normalize” them by dividing them all by the weight on the primary stat, so you get numbers like 1.0, 0.8, 0.5, etc. Another popular thing to do is “merge” stat weights for different kinds of fights or different setups to create a new set of weights that seems like it would work well for multiple kinds of fights.
Ultimately, stat weights have to predict your DPS in order to rank gear, so applying a transformation like normalization serves no purpose except to make the weights harder to understand (or if you are a conspiracy theorist, it serves to hide how well the weights are able to predict DPS… because maybe they aren’t that great).
Merging stat weights to come up with something that can work for multiple fights is OK… but it’s more for casual players, as it will be categorically worse at predicting your DPS for any particular style of fight. A serious player should optimize for a particular fight type on which they want to perform well. For other fight types, a quick gear swap to some decent gear for that other fight type will be sufficient and work better than trying to come up with a single magical set of gear that works for all fight types.
Special Items
Another aside, but worth a quick mention: stat weights are useless for ranking special effects like trinket procs, legendary effects, relic traits, and set bonuses. Theorycrafters know this already, of course, which is why you see all of those ranked trinket lists. Those are meant to be a good starting point for casual players (or your alts), giving you a quick and dirty idea of how powerful each trinket can be. But as you’ve probably already noticed, things get complicated when you have warforged or socketed versions. Confusion sets in, maybe even panic. Don’t panic. You need to simulate different setups with each of those trinkets (and other special items), which our ‘gearing strategy’ simulations do for you. And then use some smarts to handle all of the variations, which Mr. Robot also does for you.
A related but important note on this: if you use simulation data to rank special effects, then your stat weights must be designed to predict DPS if you want to use them in conjunction with the special effect simulations. “Weird” stat weights described above are almost useless in Legion because of this. For example, some “stat-stick” trinkets are pretty good compared to some fancy proc trinkets. Stat weights are used to value the stat-stick. Simulated DPS is used to value the proc trinket. If your stat weights don’t predict DPS, the value for that stat stick will be meaningless next to the proc, and you will have no way to make an informed choice.
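The stat-stick versus proc trinket problem is easy to show with a few lines of Python. All of the numbers here are hypothetical: the weights, the 1200 crit on the stat stick, and the 8000 DPS that a simulation supposedly reported for the proc trinket. `stat_value` is just an illustrative helper.

```python
def stat_value(item_stats, weights):
    """Value of an item's plain stats under a set of stat weights."""
    return sum(weights[stat] * amount for stat, amount in item_stats.items())

# Weights that predict DPS: each point of the stat is worth this much DPS.
dps_weights = {"primary": 10.0, "crit": 8.0, "haste": 6.0}

# The same weights "normalized" by dividing by the primary stat weight.
primary = dps_weights["primary"]
normalized = {stat: w / primary for stat, w in dps_weights.items()}

stat_stick = {"crit": 1200}   # trinket with plain stats only
proc_trinket_dps = 8000.0     # simulated DPS gain of a proc trinket

print(stat_value(stat_stick, dps_weights), "vs", proc_trinket_dps)  # both in DPS
print(stat_value(stat_stick, normalized), "vs", proc_trinket_dps)   # mixed units
```

With DPS-predicting weights, both trinkets are measured in DPS and the stat stick comes out ahead in this made-up case. With normalized weights, the stat stick's score is in some unitless scale, so putting it next to the simulated number makes the proc trinket look far better than it is.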
How Can We Improve?
It may feel like we’re coming down pretty hard on stat weights at this point. Before we continue though, we should step back and get some perspective: well-crafted stat weights are surprisingly good at ranking gear. The best stat weights are those designed to actually predict your DPS. In conjunction with simulation data for special items and a little extra effort to customize weights to your own situation, you can get decent results.
But we want to improve even more on our gear rankings, and we’ve sort of hit a wall with how far we can take stat weights. We created a better method for calculating stat weights with the regression technique. We made it really easy for even the least technical users to create their own stat weights whenever they need to, and plug them into the optimizer with no hassle. We have a UI for customizing stat weights if you need to tweak them based on some other data. What else can we do?
The answer was staring us in the face the whole time: stop using stat weights.
Stat weights just can’t handle the way stats actually interact in the game. Feel free to keep using them, but know that you are “capped” man — no amount of clever theorycraft will make them much better than they are now.
Our new method ditches stat weights entirely. Enter…
Machine Learning
“Machine learning” is a pretty generic term these days. For our purposes it is best described by example:
Create 1000 (or more) random combinations of stats on your gear, simulate them all, and get the DPS. Then do fancy math and statistics on that data to create a “model” that can predict your DPS for any combination of stats, not just the 1000 we simulated.
Sometimes this is also called “predictive analytics” or “computational statistics” but “machine learning” is way catchier, so we’ll go with that. Some troll computer science college student (or more likely his elder sibling grad student) will probably come at me because they didn’t read this far down the page, but that’s cool, the internet is a harsh place and I cry myself to sleep each night to prepare for the next cruel day.
This sounds a lot like the multiple linear regression approach described above… doesn’t it? Yes it does! The twist is that we don’t limit ourselves to the really basic stat weight formula — we explore many possible predictive models, ones that can describe all the curves and interactions between your stats. Then we pick the best one and save it with your gearing strategy.
The actual method used can vary: it could be one of a plethora of regression techniques, a decision tree variant, some other cool-kid computer term, or a combination of them. Describing these methods is beyond the scope of this blog post. It would be very long and very technical because it is way more complex than stat weights, so we’ll do that another day. Just like stat weights, the important thing is that we give you some way to understand the output, and some way to determine how well it is working. To that end we have done two crazy things:
Stat Goals
After we do our fancy analysis, we spit out some representative “stat goals” that the predictive model will try to achieve. For example, it might find that around item level 850 you should go for about 5000 crit, 7000 haste, and 2000 versatility. But around item level 890 you should go for 4000 crit, 8000 haste, 2000 mastery, and 4000 versatility. And we’ll show you this at all the points in between. This gives you a really simple understanding of how the predictive model thinks your stats interact with each other. The gear optimizer will also show these stat goals.
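One way to picture where stat goals come from, as a Python sketch: take a predictive model, take a stat budget, and look for the split of that budget with the highest predicted DPS. This is not our actual implementation. `predict_dps` is an invented model with some curvature and a crit/haste interaction, the budgets are arbitrary stand-ins for different item levels, and the brute-force grid in `stat_goals` is just the simplest thing that works.

```python
def predict_dps(crit, haste, mastery):
    """Hypothetical predictive model with curvature and a crit/haste interaction."""
    return (300_000 + 9.0 * crit + 8.0 * haste + 5.0 * mastery
            - 0.0006 * crit ** 2 - 0.0004 * haste ** 2 + 0.0002 * crit * haste)

def stat_goals(budget, step=500):
    """Grid search over every (crit, haste, mastery) split that uses the whole budget."""
    best = None
    for crit in range(0, budget + 1, step):
        for haste in range(0, budget - crit + 1, step):
            mastery = budget - crit - haste
            dps = predict_dps(crit, haste, mastery)
            if best is None or dps > best[0]:
                best = (dps, (crit, haste, mastery))
    return best[1]

# Bigger budgets stand in for higher item levels.
for budget in (10_000, 14_000, 18_000):
    print(budget, stat_goals(budget))
```

Because the model is curved, the best split is not "everything into the highest weight", and it can shift as the budget grows. A plain set of stat weights cannot express that.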
This is an insane concept that has never been done before for stat weights or any stat ranking approach: uh… test it? To see how well it actually works? Seriously… we looked all over the internet and found basically zero actual tests of the predictive power of stat weights (including our own, shameful). So on our gearing strategy outputs we now include “benchmarks” that are a simple but good test:
We generate a bunch of random stat combinations that are different from the stat combinations used to create the stat weights or predictive model. The fact that they are different is key. You want to see how well your model can work outside of the data that you already have.
Then we use the predictive model (whether it is stat weights or machine learning) to calculate how much DPS it thinks you should do for those combinations. We compare that to the simulated DPS, and find the “mean absolute error”. Fancy term for: on average, how far away was the prediction from the real value? The best predictive model is the one with the lowest difference from the real value.
It’s that simple — true data-driven stat ranking should strive to be as close to the simulated data as possible.
In all of our tests, the error of our machine learning models is 2 to 4 times smaller than that of the best stat weights we have been able to create.
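The benchmark idea can be sketched in Python like this. It is a toy, not our production code: `toy_sim` is an invented simulator (stats in thousands of rating, DPS in thousands), the "curved" model is just a linear fit with squared and interaction terms added, and the hand-rolled least-squares `fit` stands in for a real statistics library. The part that matters is that the test combinations are generated separately from the training combinations.

```python
import random

def toy_sim(crit, haste, rng):
    """Hypothetical simulator: curved, with a crit/haste interaction and some noise."""
    return (300.0 + 9.0 * crit + 8.0 * haste - 0.6 * crit * crit
            - 0.4 * haste * haste + 0.5 * crit * haste + rng.gauss(0.0, 0.3))

def linear_features(crit, haste):      # plain stat weights plus an intercept
    return [1.0, crit, haste]

def curved_features(crit, haste):      # adds curvature and an interaction term
    return [1.0, crit, haste, crit * crit, haste * haste, crit * haste]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit(features, samples, targets):
    """Least-squares fit on the given features; returns a predict(crit, haste) function."""
    xs = [features(*s) for s in samples]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    coef = solve(xtx, xty)
    return lambda crit, haste: sum(c * f for c, f in zip(coef, features(crit, haste)))

def mean_absolute_error(model, samples, targets):
    """On average, how far is the prediction from the simulated value?"""
    return sum(abs(model(*s) - t) for s, t in zip(samples, targets)) / len(samples)

def make_data(n, seed):
    rng = random.Random(seed)
    samples = [(rng.uniform(3.0, 8.0), rng.uniform(3.0, 8.0)) for _ in range(n)]
    return samples, [toy_sim(c, h, rng) for c, h in samples]

train_x, train_y = make_data(1000, seed=1)
test_x, test_y = make_data(300, seed=2)  # fresh combos the models never saw

linear_mae = mean_absolute_error(fit(linear_features, train_x, train_y), test_x, test_y)
curved_mae = mean_absolute_error(fit(curved_features, train_x, train_y), test_x, test_y)
print(round(linear_mae, 3), round(curved_mae, 3))
```

On this toy data the curved model's error on unseen combinations is much lower than the straight-line model's. How big the gap is for a real spec depends entirely on how curvy that spec's stats actually are.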
Why Has Nobody Done This?
If your new method is so great, Mr. Robot, why has nobody done it before? There are tons of smart theorycrafters, and there is no way that you are smarter than all of them. True! The main reason this hasn’t been done before has nothing to do with the intelligence or skill of the theorycrafting community. It has everything to do with the tools available to them.
To clarify, there are definitely theorycrafters out there who “get” all this and have been working on stuff. There is a cool project that can script SimulationCraft to find the optimal secondary stat balance at a given item level, similar to the “stat goals” that we talked about above. There is another set of people creating 4D plots to figure out secondary stat relationships, also cool.
While the concepts and insights behind these projects are similar to our machine learning approach, they are essentially “research projects”. You use them to gain a deeper understanding of how the game works, not to actually rank gear or get simple, actionable advice.
We have bridged the gap between this research and a tool that can actually put it into use: our gear optimizer. Every WoW player now has access to the most advanced gear ranking techniques — not just a few hardcore programmers.
The AMR simulator is designed to make it easy for anybody to generate chunks of data suitable for statistical analysis with our custom batches. The flexibility is incredible, and it does a lot of the legwork for you, like determining what a realistic stat budget would be at any given item level, and giving you tons of options for how to randomize over a range.
From there, we added the key insight of creating a model that can predict your DPS for any combination of stats. This is available to all users via our “gearing strategy” simulation type. It runs a custom batch of simulations and then does all the statistics for you to create a predictive model.
Then, with the press of one button, you can save this predictive model and use it directly in our gear optimizer. This is the huge piece that was missing: how do you take these crazy complex predictive models and actually… tell me what gear to use? The Ask Mr. Robot optimizer works with any predictive model. It is what we programmer kids call a “combinatorial optimizer”. There are quadrillions of gear combinations just in the stuff that you have in your own bags. It would take forever just to list them all out, let alone find the one with the highest predicted DPS. But programmer nerd-magic can sift through all those combos super fast.
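To show the shape of the problem, here is a deliberately naive Python sketch that lists every combination in a tiny bag and scores each one with a predictive model. The items, their stats, and `predict_dps` are all invented. This brute-force loop is exactly the approach that stops working at quadrillions of combinations, so treat it as an illustration of the question being asked, not of how our optimizer answers it.

```python
from itertools import product
from math import prod

# A tiny made-up bag: two candidate items per slot.
bags = {
    "head": [{"name": "Helm A", "crit": 900, "haste": 300},
             {"name": "Helm B", "crit": 300, "haste": 900}],
    "chest": [{"name": "Chest A", "crit": 1000, "haste": 200},
              {"name": "Chest B", "crit": 500, "haste": 700}],
    "ring": [{"name": "Ring A", "crit": 700, "haste": 0},
             {"name": "Ring B", "crit": 0, "haste": 600}],
}

def predict_dps(crit, haste):
    """Hypothetical model that rewards keeping crit and haste in balance."""
    return 100_000 + 8.0 * crit + 7.0 * haste - 0.002 * (crit - haste) ** 2

def best_set(bags, model):
    """Score every one-item-per-slot combination and keep the best."""
    best = None
    for combo in product(*bags.values()):
        crit = sum(item["crit"] for item in combo)
        haste = sum(item["haste"] for item in combo)
        score = model(crit, haste)
        if best is None or score > best[0]:
            best = (score, [item["name"] for item in combo])
    return best

combo_count = prod(len(items) for items in bags.values())
score, names = best_set(bags, predict_dps)
print(combo_count, names, round(score))
```

Note that the winning set is not the one with the most crit, even though crit has the highest per-point value in this toy model. Any function that takes stats and returns predicted DPS can be dropped in for `predict_dps`, which is why the optimizer does not care whether the model is stat weights or something fancier.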
Of course there are catches. Nothing is perfect.
The main “catch” with this approach is that you still need to set up your simulations well. The item level range for which you choose to generate data will have a significant impact. The machine learning approach will work much better over a wide range than stat weights though, so you should not have to recalculate nearly as often.
Certain special items (like legendaries or set bonuses) can also have a significant impact on how stats rank and relate. We plan to release extra gearing strategies for the cases that are significantly different, or you can run your own. This will be the final “big step” for gear ranking that we plan to work on early in 2017 — providing a lot of gearing strategies for different cases, and automatically choosing the best one based on your situation.
Phew. That was a lot of theorycraft. If you got lost along the way, the short version is this: simulators are cool, but slow. So how can you simulate just some stuff and use it to rank all stuff? Do a bunch of super fancy math and statistics on a chunk of simulation data to create a [beastly thing] that can predict your performance with any gear. Plug the [beastly thing] into the AMR gear optimizer.
[beastly thing] eats [stat weights] for breakfast. End of story. It is literally impossible for stat weights to beat machine learning. That’s not a boast, it is a mathematical fact. If you find a case where this is not true while beta testing, tell us immediately! We will investigate and fix it.
Try it! If you like data-driven gear ranking, you’ll never go back.
Don’t like data? No problem! You can still enter custom stat weights, and you can even benchmark them on the fly now, right where you enter them! Nobody has done that before either: give you a way to quickly test your own stat weights.
We are very excited about this update: it is a huge step forward for gear ranking. Our main reservation is that it will be harder to “explain” and we aren’t sure if this beta makes it clear. Give us feedback on our forum or in Discord!