A critique of single-card analysis based on "Played Winrate" and a Machine Learning idea to fix it.

Hey all,

I'm Lagerbaer, a rank 5 player and a scientist in my day job. My work spans a range of topics from finance to physics to data science, and I enjoy thinking about how some of these concepts apply to Hearthstone.

I just tuned into Trump's stream, where he was analyzing one of his decks card by card based on the "win rate when this card is played" statistic from HSReplay. He used it to figure out the "worst card" in his deck so he could replace it with something stronger.

Here is why, from a data science point of view, this statistic can be very misleading. Later, I will also discuss what a better statistic would look like.

First, why is it bad? Let's use a real-life example. Statistically, you are more likely to get injured on a car ride where your airbag was deployed than on one where it was not. Does that mean an airbag is a bad safety feature? No, it means that a deployed airbag tells you there was an accident, and in an accident you are at much higher risk of being injured. The statistic conditions on the very event that makes injury likely.

Back to Hearthstone: cards meant to mount a comeback in an unfavorable matchup are not bad. Their "played" win rate is merely dragged down by the fact that you really only play them in a bad matchup in the first place. If you play some sort of Control Warrior and never get pressured by the opponent's board, you won't bother playing Brawl. But when you're up against Taunt Druid, even playing two Brawls won't save you from their obscene amount of reload.

Other statistical effects distort the analysis too: is it really fair to compare the "played" win rates of cards with different mana costs? To play a 10-drop at all, you have to survive to turn 10.

Say, for the sake of discussion, your deck has a 0% win rate against aggro and a 100% win rate against control. You only ever reach turn 10 in the control matchups you win anyway, so the 10-drop's played win rate will look perfect. The only thing it actually tells you is how much aggro and control there is on ladder, and it definitely won't tell you that maybe you should cut that 10-drop for something that improves your early game.

So what would be a better way of measuring a card's deck impact?

Enter machine learning, and the concept of feature importance.

A simple method that I'd expect to already improve on the one currently used by, e.g., HSReplay would work as follows:

Use the game data (cards in mulligan, cards drawn, cards played, game won or lost) to train a machine learning model that predicts whether a game was won based on the input features (cards drawn, cards played, whatever else you want to include). A plain Random Forest would be a good starting point: they are versatile, robust, insensitive to the underlying statistical distributions, and can pick up subtle correlations.
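As a rough sketch of what that first step might look like in Python with Pandas and scikit-learn (the file name and column layout here are my own assumptions, not real HSReplay data):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Hypothetical per-game table: one row per game, one 0/1 column per
    # card (e.g. "played_brawl" = 1 if Brawl was played), plus "won".
    games = pd.read_csv("games.csv")

    X = games.drop(columns=["won"])  # card indicators are the features
    y = games["won"]                 # 1 = win, 0 = loss

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    model = RandomForestClassifier(n_estimators=500, random_state=0)
    model.fit(X_train, y_train)

    # The model doesn't have to be perfect, just decent:
    print("Held-out accuracy:", model.score(X_test, y_test))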

Now you have a model that predicts whether you'd win a game based on what cards were drawn and played. Of course it won't be perfect, but it doesn't have to be.

Now comes the crucial step: we apply our machine learning model back to the game data we already collected, but with a twist. First, we go through every row in our table of games and change the entry for the card we're interested in to "didn't play that card". Then we ask the model whether it thinks we'd win that game. We do this for all the games we collected and compute a predicted win rate from the answers.

Then we repeat the process, but now we change every entry to "played that card" and, again, compute the win rate predicted by the model.

The difference between those two values then gives you a pretty good idea of whether playing that card has a positive impact.
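Continuing the sketch above (the card column "played_brawl" is again a made-up example), the two counterfactual win rates and their difference could be computed like this. I use predicted win probabilities rather than hard win/loss predictions, which gives a smoother estimate of the same quantity:

    def played_impact(model, X, card_column):
        # Force "didn't play that card" for every game...
        X_without = X.copy()
        X_without[card_column] = 0
        # ...and "played that card" for every game.
        X_with = X.copy()
        X_with[card_column] = 1

        # Average predicted win probability over all collected games.
        wr_without = model.predict_proba(X_without)[:, 1].mean()
        wr_with = model.predict_proba(X_with)[:, 1].mean()

        # "Everything else being equal, what difference did it make?"
        return wr_with - wr_without

    print(played_impact(model, X, "played_brawl"))

Incidentally, this amounts to comparing two points of a partial dependence plot, which is a standard feature-importance tool.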

The cool thing is that the machine learning model would pick up on a lot of subtleties, like what other cards were played, in what situations the card was played, etc. In a way, you are asking the model: "Everything else being equal, what difference would playing this card make?"

The key phrase here is "everything else being equal". That's how you avoid incorrect conclusions about, e.g., cards that you'd only need in unfavorable matchups anyway. It also helps you identify "win more" cards. Those, too, get distorted stats, because you only play them when you're already winning, so statistically they have a super high "played" win rate. The machine learning model will dig down to the deeper truth: "Sure, it has a good win rate, but what DIFFERENCE did it make?"

I'd love to hear thoughts about this from fellow nerds. And heck, maybe someone with access to the right data even wants to turn it into a project? (I'd recommend Python, Pandas, and scikit-learn.)


  • Iksar

    Posted 6 years ago (Source)

    There are many flaws with play-win-rate, I feel it's best to leave it off entirely because it can be misleading. Currently we use a metric based on draw-win-rate with an adjustment for how many cards were drawn that game.



