Skeptical Football: Goatslingers, Manning vs. Messi, And The Andrew Luck Experiment

In last week’s column, I pointed out the importance of teams’ early records when trying to predict their playoff fates. This prompted a few skeptical tweets, like so:

@skepticalsports Correlation is not causation: early results reveal team quality, they don’t create it.

— Tim Dierks (@tdierks) September 11, 2014

This tweeter is obviously right. The first few games of the season are predictive in part because losing games makes it harder to make the playoffs, and in part because they tell us something about the strength of the teams that lost them.

That said, “Correlation is not causation” is what I like to call The Hammer to end arguments against all kinds of statistical findings. People use it to bash anything, but it’s blunt and dangerous.¹

The artist and writer Randall Munroe took on The Hammer in xkcd:

In the alt text of that comic, he hits the nail on the head: “Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there.’”

Let’s break down an example²: Last week I observed that quarterbacks who (A) start more games in their rookie seasons (B) tend to have better careers. What does this observation imply?

There are several possibilities:

1. Starting rookies causes them to have better careers (A causes B).
2. The types of rookies who are likely to have better careers are more likely to earn a rookie starting spot (B causes A).
3. Rookies who are drafted higher are more likely to get starts, and are also more likely to have better careers (something else — call it C — causes both A and B).
4. This is all just a coincidence and we should go home.
5. Some combination of the above.

That covers a lot of bases, but once we make the observation, Nos. 1 through 3 become more likely than they were before. In this case, it’s fairly easy to establish that the relationship between A and B (rookie starts and non-rookie career AV) exists even when controlling for C (draft position).

Following the observation that A and B are correlated, basically any possible state of the universe in which A and B are causally related has become more likely. For a Bayesian, determining which possibilities have seen their likelihood change the most involves two steps: consulting his prior beliefs to establish which possibilities were the most likely before the new observation, and then asking how likely the observation would be if each possibility were true. This leads to an updated set of beliefs about the likelihood of each scenario, which becomes the baseline for evaluating new observations, and so on.
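Here is a minimal sketch of that updating process in Python. The priors and likelihoods are made-up numbers chosen only to show the mechanics; they are not estimates from the rookie-QB data. For simplicity the scenarios are treated as mutually exclusive, so the “some combination” case is left out.

```python
# Hypothetical numbers for illustration only: "prior" is how plausible each
# scenario seemed before the observation, "likelihood" is how probable the
# observed correlation would be if that scenario were true.
hypotheses = {
    "A causes B": {"prior": 0.20, "likelihood": 0.80},
    "B causes A": {"prior": 0.30, "likelihood": 0.80},
    "C causes both": {"prior": 0.30, "likelihood": 0.70},
    "coincidence": {"prior": 0.20, "likelihood": 0.05},
}


def posterior(hyps):
    """Bayes' rule: the posterior is proportional to prior times likelihood."""
    unnormalized = {
        name: h["prior"] * h["likelihood"] for name, h in hyps.items()
    }
    total = sum(unnormalized.values())
    return {name: weight / total for name, weight in unnormalized.items()}


updated = posterior(hypotheses)
for name, p in updated.items():
    prior = hypotheses[name]["prior"]
    print(f"{name}: prior {prior:.2f} -> posterior {p:.2f}")
```

With these assumed inputs, every causal scenario gains probability and the coincidence scenario loses it. Which causal scenario gains the most depends entirely on the priors and likelihoods you plug in.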

Charitably, “correlation ≠ causation” itself is a kind of limited Bayesian analysis. When people use it, they often mean simply that the “A causes B” scenario still doesn’t seem very likely to them, and thus they think other explanations are more likely. This is the case for most popular statistics examples, like the fact that lemon imports correlate negatively with highway fatality. That lemons are somehow preventing accidents is obviously ridiculous, so it doesn’t matter how strong the correlation is: It’s either a coincidence or we’re going to need other explanations.³

But the idea that rookies playing could help them develop is not ridiculous — it’s highly debatable. After observing the relationship between rookie QB starts and career success (plus controlling for draft position), I must conclude that playing rookies is more likely to be good for their careers than I thought before, barring any other evidence. But that doesn’t mean it’s true. The alternative (or concurrent) explanation is also plausible: If coaches are good at determining which rookie QBs are actually good, and then tend to start the better ones, it’s still possible that starting them has a neutral (or even negative) effect on their careers individually. Regardless of which explanation is true, the observation remains the same: a rookie QB getting the start is good news for his prospects.

Charts of the week

Aaron Rodgers had his ups and downs against the Jets last week:

Aaron Rodgers: 42 attempts, sacked 4 times, played well after going down 21-3, but failed to bring his team back in the 4th quarter AGAIN.

— Benjamin Morris (@skepticalsports) September 15, 2014

I jest, of course. Rodgers brilliantly brought the Packers back from a 21-3 hole, but the comeback was complete by the end of the third quarter.

This was Rodgers’s first-ever win after being down 15 points or more⁴ against an opponent — though it was only his 12th opportunity. Here’s how he stacks up against other QBs since 2001 in comparable situations:

Whoa, Peyton Manning! Forget Rodgers, Manning is the story here. But it’s only 10 wins. Crazy things happen, right? Let’s widen the scope, taking a look at all games in which a player’s team trailed by eight or more points, rather than just 15 or more:

Peyton Manning is a practically Messi-esque outlier, complete with his own Cristiano Ronaldo to keep him company.

Goatslinger of the week

This was a tough week for gunslingers, as QBs who threw interceptions went 1-14, most of those games weren’t that close, and many of the interceptions were terrible. (Our nominal winner: Matt Ryan, whose three interceptions were at least all thrown downfield while his team was trailing.)

So I’ve invented a new (hopefully temporary) award of ignominy: the Goatslinger.

Andrew Luck, last week’s Gunslinger, is a contender for Goatslinger this week. With just 5:15 left, up seven against the Philadelphia Eagles, and already in field goal range, he threw an interception to Malcolm Jenkins. Plays like that give gamblers a bad name!

But the top Goatslinger was Colin Kaepernick for his amazing effort to throw away San Francisco’s win against Chicago. He managed four turnovers (three interceptions and a fumble), three of them with his team up, including the interception up 20-14 in the fourth quarter that led to Chicago’s game-deciding touchdown.

Twitter question of the week, Part 1

I had two interesting questions on Twitter this week related to the timing and length of drives. First up:

@skepticalsports how much more scoring happens in pressure situations like 2 minute drill before half vs 1st qtr/3rd qtr first drives.

— jagdwire (@jagdwire) September 16, 2014

The answer is essentially “none”; if anything, there ends up being even less scoring in these scenarios. But the question is deceptively interesting. It’s also a fun vehicle for exploring the relationship between turnover rates and scoring/touchdown rates.

In general, teams score more per drive when they are behind, but are also more likely to turn the ball over. I’ve broken down drives by quarter and point margin (tied, up or down 1-3 points, 4-7 points, 8-14 points, and 15 points or more) and compared how often the drives resulted in touchdowns to how often they resulted in turnovers.⁵ This gives us a sense of the trade-off between the two.

Think of a drive when the game is tied in the first quarter as a kind of baseline: If a team starts at least 70 yards out, 15.5 percent of such drives will end in TDs, and 12.5 percent will end in turnovers. Compare that to the situation where teams are most aggressive: when they’re down 8-14 points in the fourth quarter. In those scenarios they score touchdowns 21.2 percent of the time and turn it over at a 27.5 percent clip.

As teams play more aggressively, their chances of scoring go up, but so do their chances of turning the ball over. You can think of the ratio between these chances as the “price” of marginal scoring. For example, increasing your chances of scoring a touchdown by 1 percentage point requires increasing your chances of turning the ball over by up to 2 percentage points.⁶ In some situations, that’s a price you’re willing to pay (such as when you’re behind and stalled drives are pretty much just as bad), and in some it’s not.

Understanding this trade-off is useful in analyzing a whole range of things in football, and my study of it is ongoing. But in the meantime, we can use our immediate findings to look at the situations our tweeter asked about and see what’s going on there.

Before the half, it’s apparent that teams are extremely willing to settle for the points they have. With between one and two minutes on the clock in the second quarter, teams score touchdowns on 7 percent of their drives and turn the ball over on 12.9 percent. These are both lower than our baseline, so they’re definitely being conservative. It’s unclear what effect more aggression would have.

With between one and two minutes on the clock at the end of the fourth quarter in games separated by between four and eight points, teams score touchdowns on 15.3 percent of drives, and turn the ball over on 42.1 percent of them. This is interesting because they spend a large number of turnovers on a completely average number of touchdowns. I think this reflects time pressure, but it could also suggest that true last-ditch “prevent” defenses may be pretty effective.

Twitter question of the week, Part 2

@skepticalsports do long (time wise) possessions actually bring more value than “normal” ones? Everyone freaked out about KC’s 10 min drive

— Matt Mills (@millsGT49) September 16, 2014

The simple answer is: Absolutely, a drive that eats up clock is valuable — when a team is ahead and wants to shorten the game. But shortening the game can also be useful when one team is a lot worse than the other.

Imagine trading 100 drives with a team led by Peyton Manning, the Chiefs’ opponent in Week 2. Manning scores more per drive than anyone, and his accumulated points scored over 100 of them would be impossible for all but the best teams to overcome. Say the difference between your team and Manning’s was that Manning’s was one point per drive better — in a 100-drive game, your team would have to run 100 points above expectation to have a fighting chance. Statistically, that’s virtually impossible.⁷

But if each team got only one drive, yours would win every time it scored and Manning’s didn’t. That’s orders of magnitude more likely.
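To see how game length changes the underdog’s chances, here is a rough Python sketch. It assumes a deliberately crude touchdown-or-nothing model in which every drive is worth 7 points or 0, with the favorite one point per drive better on average. The probabilities `P_UNDERDOG` and `P_FAVORITE` are hypothetical, not measured rates for any team.

```python
from math import comb

# Hypothetical touchdown-or-nothing model: each drive is worth 7 points or 0,
# so a gap of 1/7 in touchdown probability equals one point per drive.
P_UNDERDOG = 2 / 7  # assumed: 2 points per drive on average
P_FAVORITE = 3 / 7  # assumed: 3 points per drive on average


def td_distribution(n_drives, p_td):
    """Binomial probabilities of scoring 0..n_drives touchdowns."""
    return [
        comb(n_drives, k) * p_td**k * (1 - p_td) ** (n_drives - k)
        for k in range(n_drives + 1)
    ]


def underdog_outcomes(n_drives, p_underdog, p_favorite):
    """Return (P(underdog wins outright), P(tie)) with n_drives per team."""
    dog = td_distribution(n_drives, p_underdog)
    fav = td_distribution(n_drives, p_favorite)
    # Underdog wins when it scores d touchdowns and the favorite scores fewer.
    win = sum(
        dog[d] * fav[f] for d in range(n_drives + 1) for f in range(d)
    )
    tie = sum(dog[k] * fav[k] for k in range(n_drives + 1))
    return win, tie


for n in (1, 10, 100):
    win, tie = underdog_outcomes(n, P_UNDERDOG, P_FAVORITE)
    print(f"{n:>3} drives each: underdog wins outright {win:.3f}, ties {tie:.3f}")
```

In this toy model the underdog’s chance of an outright win is far smaller at 100 drives than at one. Ties are reported separately because they are common in very short games and would otherwise muddy the comparison.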

This was pretty much exactly what happened with the Chiefs against the Broncos. The Chiefs had two extremely long drives in the second half: The first came at the start of the third quarter, lasted 10 minutes, and ended with a missed 37-yard field goal. The second came at the start of the fourth quarter, lasted 7:42, and ended with a Chiefs TD that drew them within four and set up a potential game-winning drive after Manning failed to score. As a result, Manning had only two meaningful possessions in the entire second half. Down 11 points, the Chiefs needed to score twice in their three possessions and have Denver score none in their two to win. Given the circumstances, those aren’t terrible odds.

But let’s focus on their second drive at the very beginning of the fourth. It’s extremely risky to draw up a drive that lasts that long when down 11, as the end of the game quickly approaches. But leaving that aside, they did score a TD in a supposedly back-breaking fashion. Are such TDs any more valuable than regular TDs in similar situations?

Using play-by-play data from ESPN, I looked back at all touchdown-scoring drives starting in the third quarter⁸ since 2001 in which a team was down 11-13 points at the start. I was kind of surprised by the results:

The sample sizes on these aren’t very big (it’s only 107 cases total, and the most likely drive is right around the middle), but teams have won nine of 19 cases (47 percent) in which their scoring drives lasted longer than three minutes. That’s a pretty big number for a team that was trailing, and it’s way higher than the 20 percent teams won after scoring on more normal drives. Whether that’s significant, and why, I don’t know, but it certainly leaves open the possibility that long drives like that may indicate or affect something larger.
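One rough way to gauge how unusual nine wins in 19 would be is a binomial tail probability, sketched below in Python. It rests on two simplifying assumptions that are mine, not the article’s: that the 20 percent win rate on more normal drives is the exact true rate, and that the 19 games are independent.

```python
from math import comb


def binomial_tail(n, k_min, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1)
    )


# Figures from the text: 9 wins in 19 long scoring drives, against a
# 20 percent win rate on more normal drives (assumed here to be exact).
p_value = binomial_tail(n=19, k_min=9, p=0.20)
print(f"P(at least 9 wins in 19 at a 20% rate) = {p_value:.4f}")
```

Under those assumptions the tail probability comes out below 1 percent. That overstates the evidence, though: the 20 percent figure is itself an estimate from a small sample, and the cutoff for a “long” drive was picked after looking at the data.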

The Hacker Gods read FiveThirtyEight

As we all know, the Hacker Gods — who probably created this universe, by accident, while simulating a fourth-dimensional supernova — obviously read FiveThirtyEight. Last week they appeared to enjoy bolstering my analysis of Philip Rivers, but this week they are trying to undo me.

Aaron Rodgers, whom I previously criticized for playing too conservatively (especially when behind), somehow brought the Packers back from 18 down against the Jets, earning the first 15-point comeback victory of his career.
Last week I talked up the majesty of gambling even if it risks an interception, but in Week 2 quarterbacks who threw one or more interceptions went 1-14.
The only INT-throwing QB to win was Nick Foles against the Colts, but he won in part because inaugural Gunslinger of the Week Andrew Luck basically gave the game away by throwing his own INT with his team up seven and in field goal range in the fourth (suffice it to say, that is a terrible spot to gamble).

Experimental chart of the week

Inspired by the Aaron Rodgers comeback, I asked on Twitter who people would want leading their team if it was down 15 or more points. Andrew Luck won the straw poll by a landslide with 47 percent of the votes, versus 20 percent for Peyton Manning. (Turnout was poor.⁹)

From the Charts of the Week above, this might seem pretty silly. For the most part, it is: Manning has won a higher percentage of games in which he has been down by 15 points than Luck, over a lot more games, even though it seems Luck has been on a tear for a couple of years. Impressive, but Manning has been down 15 much less often than Luck.

This chart plots the percentage of 15-point comeback opportunities won vs. how often those opportunities have come up. I’ve also represented the total number of games, the number of comeback opportunities, and the number of successful comebacks as concentric circles, and plotted like so:

Manning is even more impressive relative to Brady/Rodgers, but Luck managing to win in 3 of just 13 tries despite being on a team that ends up in that spot 36 percent of the time isn’t too shabby (the other data point near Luck at 20 percent is Matthew Stafford). If he can keep that up for another decade or so, he might just be a worthy successor to Manning.

Most empirically significant game of Week 3

If I could only watch one game, obviously it would be the Broncos/Seahawks Super Bowl rematch. But there is probably nothing that could happen in that game that would surprise me.

Minnesota at New Orleans, on the other hand, holds some mystery. It may have even more empirical effect on Peyton Manning’s legacy than Manning’s own game: Every game that Matt Cassel bombs is more evidence that Bill Belichick has more to do with Tom Brady’s success than Tom Brady (because then it’s more likely that Cassel’s/Brady’s success in New England was because of Belichick), that Randy Moss is likely responsible for much of Brady’s (and Cassel’s) statistical accomplishment, and thus that Peyton Manning is the greatest quarterback of this generation.

Charts by Reuben Fischer-Baum

CORRECTION (Sept. 18, 1:50 p.m.): This article originally misstated the time and recipient of Andrew Luck’s interception in the Colts’ game against the Eagles. Luck threw the interception with 5:15, not 5:32, left in the fourth quarter and Malcolm Jenkins, not Rahim Moore, intercepted it.

Footnotes

1. Every time someone uses The Hammer on me, a puppy loses its wings.
2. Rookie Quarterback Watch has pretty much devolved into “How Bad Will the QBs Ahead of Rookie QBs Get Before the Rookie QBs Get to Start?” Watch.
3. I should note that for a true Bayesian, the odds that lemon imports actually do reduce highway fatality have still increased on the margins.
4. I picked this number because it’s the smallest margin that Rodgers had never overcome, but as a separate and interesting point, I’ve found that 15-16 point margins, while technically “two scores” because they can be reached with two touchdowns plus two-point conversions, actually act more like three-score margins (17) than two-score margins (14).
5. To pre-empt a question I will almost certainly get despite this attempt to pre-empt it: Yes, obviously a lot more can happen on drives than just touchdowns or turnovers. For example, drives that end in field goal attempts count as neither, even though they may lead to points. This matters in situations where there’s no time for a touchdown, or where a team only cares about the three points. But we’ve excluded a lot of those situations by filtering out the last two minutes of each half. It’s also possible to do the same analysis on a points-per-drive, or even “expected points added” basis, but the results are similar. Considering the implications are the same, I prefer the symmetry and ease of interpreting touchdowns vs. turnovers.
6. I should note that this exchange rate is likely skewed a little by the fact that worse teams tend to be behind more. I’m working on deskewing this to get a more exact comparison for a future project.
7. A team’s standard deviation on points scored over 100 drives is only 10 times (the square root of 100) the standard deviation of points scored for a single drive. A drive is worth at most about seven points, so its standard deviation can’t exceed 3.5, and the 100-drive figure can’t be more than 35, which would make a 100-point swing roughly a three-standard-deviation event.
8. I excluded the fourth quarter to minimize end-of-game effects.
9. Only 15 votes total.