Greetings!

It has been a while since I posted anything here at OilersNerdAlert, though since the season ended I have had the occasional rant over at BeerLeagueHeroes.com.

The quietness is not indicative of a lack of activity though! Woodguy (@Woodguy55, becauseoilers.blogspot.ca) and I have been hammering and chiseling and bulldozing away on what we feel is potentially groundbreaking new analytical work around quality of competition.

It’s looking fantastic, and so we’ll soon be publishing a whole boatload of that work at BLH and at Because Oilers, along with explanations, charts, and data feeds – an analytical smorgasbord!

A healthy aspect of this new work relies on my Dangerous Fenwick statistic. I explained what it is and how it is calculated last October. I also mentioned at the time that I’d done some quick statistical work on it that gave me comfort it was a useful and valid statistic, but I never published that work.

As we go forward with this new work, it will likely put a spotlight on DangerFen, so I figure I’d better get off my duff and explain to you why it’s OK to believe.

If you’re a skeptic and need this info, please read on.

And if you’re not – that’s perfectly OK, stay tuned for a ‘quality of competition’ tidal wave!

### Split Half Reliability

A key initial test for any statistic is that it should have reasonable *split half reliability* – that is, if you randomly take half the games in a season for any given team, and compare that to the other half of the games, they should show a strong relationship.

If they fail to show that relationship, then the statistic in question may very well not be measuring anything repeatable enough to have value.

Shot metrics have a second hurdle, which is that they should show reliability that is at least approximately in line with old venerable Corsi. Otherwise, whatever methods are being used to generate the new statistic are arguably adding more noise than value.

The way in which I tested this: I used random sampling to split the 2015-2016 season’s games into two groups of 41 for each NHL team (some use even-odd splits for this, but it’s my own particular oddity that I prefer statistical tests be based on random samples). I then calculated DFF% for that team in each of the resulting two halves.

This was repeated for every team, and the correlation between the first and second halves of the resulting data set was calculated. As a control, I also calculated the correlations for Corsi and for Fenwick (on which Dangerous Fenwick is based) on the same random samples.

Here’s how that looks.

#### Dangerous Fenwick Split Half

(The r value is shown, square it for R^2). All data shown here is 5v5 data. The data values for Dangerous Fenwick were generated by me; all other values are from corsica.hockey. As you can see, there is a healthy correlation between the two splits for Dangerous Fenwick.

#### Corsi and Fenwick Split Half

Unsurprisingly, both Corsi and Fenwick show good split half reliability, with Corsi in the lead. The split half correlations of all three metrics are very close, and all have high statistical significance.

It’s commonplace to run this many times and use the average results of the runs. However, these results are in line with other tests I’ve reviewed, so I’m comfortable that this is a valid result.

Even so, I did rerun the calculation a handful of times (each time generates a different random sample so the results are always slightly different) to confirm. The pattern seen here is quite consistent and representative: the correlations are high (generally .72 to .82), the three metrics are close, and Corsi generally has a slight edge in correlation followed by DangerFen then Fenwick.

All of which suggests to me that the adjustments that are being made to Fenwick to incorporate “Danger” are not adding noise, and instead are adding useful information to the metric.

#### The ‘xGF’ Test

There are a variety of additional tests that can be used to give additional confidence in the metric, but I felt there was a useful shortcut available to me.

It’s my opinion that Emmanuel Perry’s (@MannyElk, corsica.hockey) xGF metric is arguably the gold standard for danger-weighted metrics. (Perhaps Nick Abe and a few others might disagree).

My thought process is that if Danger Fenwick indicates much the same things as xGF does, I have comfort that what DFF is telling me has validity.

(In fact, if I had access to how Manny’s xGF is calculated, I would even consider replacing Danger Fen with xGF in my stats programs, but as I don’t have that, I’m using Danger Fen as an acceptable replacement. And I’ll show you here why I think it’s just fine as a substitute!)

First, here’s a panel of charts showing a variety of scatter plots (with regression line) visualizing the relationship between DFF% (Dangerous Fenwick), xGF% (Perry’s Expected Goals), CF% (unadjusted Corsi), FF% (unadjusted Fenwick), and GF% (goals) for all 30 teams for the 2015-2016 season:

You can see where the metrics have a lot of similarities, and where they don’t. In particular, note how well DFF and xGF track each other.

Specifically, here are selected correlation values:

- DFF% and xGF%, r = 0.954. I’ve casually mentioned before that DFF and xGF are well correlated – and when I say well correlated, I do mean *well* correlated.

- DFF% and CF% correlation is r = 0.853, and xGF% and CF% is r = 0.852, so both metrics correlate with Corsi equally well.

- The correlation of DFF% to GF% is r = 0.462, the correlation of xGF% and GF% is r = 0.431, and the correlation of CF% to GF% is r = 0.345. In many ways, shot metrics are simply a large-sample proxy for what really counts, which is goals. As a metric, goals (or GF%) has too much noise to be of much use within a single season, but it’s still of interest to see how the large sample metrics track goals scored. In this sample, DFF% actually correlates/predicts goals a hair *better* than xGF% (though I don’t believe that what we’re seeing is likely to be a sustainable difference). Perhaps more importantly, both metrics outperform raw Corsi by quite a notable margin.

And with that, I’ll end this ‘light’ statistical analysis.

### Northern Comfort

The split half test tells me that Dangerous Fenwick is as or more reliable than raw Fenwick, and almost on par with Corsi.

The correlations with the wider set of metrics tell me that the danger-weight adjustments being made to Fenwick to create Dangerous Fenwick add significant value to the metric over raw Fenwick or Corsi.

Additionally, the way in which Danger Fen tracks xGF puts it at least close to being in the same league as that metric, despite DFF being a simpler metric.

I can test more, and in future I might – but for now, I’m comfortable with where it’s at.

Excellent post.

Thanks!

I’m typing this comment as something of a summary of an extended conversation (argument?) with super-smart Micah Blake McCurdy. Darcy and I have nothing but respect for Micah, but we’re not sure we understand his question(s). And Twitter is a *terrible* way to clarify matters. So let me try here.

As best as I can tell, Micah objects to the way in which we bucket the players, due to issues with variance, arbitrary bucketing, etc.

My counter to that is this: if we were running comparative statistics on those bins, I see the concern. “Bucket a’s corsi is x and bucket b’s corsi is y and therefore z”.

But we aren’t. We’re not using those buckets directly in the analysis. We’re using them as what we believe to be a direct analogue to a real world ‘three tiers of player’, so we can see how players do against those tiers.

Why three tiers? Because it’s a practical number it turns out. There are data availability and other concerns at play beyond mathematical purity, such as sample size and volume of information generated. (In fact, mathematical purity and NHL data arguably are allergic to each other).

What we are doing is using those bins to decompose a singular number (a player’s Corsi) into numbers against the bins. (Variance who?)

The metaphor I would use is that of someone telling you that the average daily temperature in your house is 22C.

Then the WoodMoney Furnace Company comes along and tells you that we bucketed your day into three eight-hour segments, and within those segments, it was 22C between 8am and 4pm, 24 C between 4pm and midnight, and 20C between midnight and 8am.

Even if the buckets we’re using are not relevant to you (we would argue that those time buckets have relevance in the ‘real world’ but accept that you may object to them), the fact is that there is no loss of information.

We measured the temperature in those three buckets the same way the original 22C was measured. When you take the weighted sum of the bucket temperatures, you get the original value back.
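To make the arithmetic of the metaphor concrete (three equal eight-hour buckets, so each weight is 8/24):

```python
# Temperatures from the furnace metaphor, one per eight-hour bucket.
buckets = {"8am-4pm": 22.0, "4pm-midnight": 24.0, "midnight-8am": 20.0}
weights = {name: 8 / 24 for name in buckets}  # hours in bucket / hours in day

daily_avg = sum(weights[name] * buckets[name] for name in buckets)
# daily_avg recovers the original 22C: the decomposition loses no information.
```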

We believe that those temperature buckets *are* useful for understanding why your heating bills are what they are, much moreso than using just the average daily temperature.

You can criticize the definition of the buckets, but it seems irrelevant to the output. Either you believe the buckets have real world significance (and then the decomposed values are meaningful) or you don’t (in which case go ahead and ignore the decomposed values).

But arguing that the buckets are “wrong”, or that we should have more or fewer of them, is mystifying in that regard. They’re concrete choices made using (our published) criteria, representing what we believe is a good match to the real world of NHL players and their skill levels.

You’d like all models to be mathematically rigorous, but real world models have all kinds of drivers beyond mathematical purity, and this one is no exception.

Anyway, I don’t know if I’m capturing Micah’s point or not. We certainly don’t want to dismiss anything he says, but at the moment, we seem to be speaking past each other.

Very excited to see this project as it continues in real time.

Appreciate the tone and content of your clarification as well, hopefully it explains some of the misconceptions out there.

I think the buckets you chose are more than reasonable. Perhaps down the road they may need to be tweaked (hopefully we get access to more accurate and comprehensive data collection), but they seem like a strong starting point to me.

As always, really appreciate the work you do (Woodguy as well)!

Took me too long to see this – but great stuff. I see your comments on Lowetide also. I think your 3 buckets thing is the right approach – it’s unimportant overall if player X is on the fringe of muddle or dregs, just how someone else is faring against those buckets in general. Otherwise it’s only by having a hundred buckets or more that it could be fair – and that’s not practical.

Looking at the correlation between GF and CF or DFF – yes, DFF is better – but can we get better still? Maybe someone (you?) can, but I suspect that will be some effort – and then you would have the golden ticket of hockey analysis …

Thanks Dave! As to getting better prediction, it’s an interesting question. There are so many unpredictable factors in hockey (we lump them together and call them ‘random’ … some are, some aren’t, but at the moment we can’t predict them, so the terminology fits) that I suspect the ultimate bounds of predictability will prove to be quite modest.

Visualization idea: A Vollman type of chart.

x-axis: Gritensity (DFF%-50)

y-axis: Middle (DFF%-50)

bubble: Elite (DFF%-50) colour-coded

So every defenseman in quadrants 2 and 3, where Gritensity is negative, is a horrible defenseman.

Every defenseman in quadrant 4, where Gritensity is positive and Middle is negative, is a 3rd pairing D.

Every defenseman in quadrant 1, where Gritensity and Middle are both positive, is a 2nd pairing D.

A defenseman in quadrant 1 with a positively coloured bubble is a 1st pairing D; with a negatively coloured bubble, a 2nd pairing D.

Roughly speaking.
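The quadrant rules above can be sketched as a small classifier. This is a hypothetical helper for illustration only, not part of the published WoodMoney work; inputs are the three (DFF% − 50) values:

```python
def classify_dman(gritensity, middle, elite):
    """Rough pairing label from the quadrant rules above.

    Each argument is (DFF% - 50) against that competition tier.
    """
    if gritensity < 0:
        return "horrible"      # quadrants 2 and 3: negative vs Gritensity
    if middle < 0:
        return "3rd pairing"   # quadrant 4: + vs Gritensity, - vs Middle
    if elite > 0:
        return "1st pairing"   # quadrant 1 with a positive Elite bubble
    return "2nd pairing"       # quadrant 1 with a negative Elite bubble
```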
