You know, I know nothing about sports (except for the small bits that I've gleaned from my roommate's obsession with football and a general obsession of my own with games and game design) and I care even less, but what I am interested in is math and matching systems. Which is effectively what we have here, except that in a matching system you're looking for the matchups which are the most uncertain, and thus produce the most memorable games for the players, rather than trying to discover the most likely winner. Still, in both cases you're looking for a series of exchanges which let you estimate an abstract spatial difference between two players.
It occurs to me, and I'm sure that this is not a new idea, that there's nothing about these algorithms that says they have to be run on current data. Knowing the general impulses of sports fans, I'm betting that the last few decades of scores from pretty much everything you might ever want to predict brackets for are stored online somewhere, along with a pile of other metadata and potentially informative signals. It would be interesting to run the algorithms on historical data and see how well their predictions match how things actually turned out. It's probably not terribly interesting for people who have been immersed in algorithmic bracket creation for a while, but as someone from outside the field, I would find it interesting. I would compare the predictive potential of different historical eras in whatever sport is in question; for example, does feeding in football data from the 70s give significantly different predictive results than feeding in data from the 80s? As a random example more than anything.
Sure, granted, data gets more sparse the further back in time you go. Advanced statistics in the sports realm have only really gotten popular in the last twenty years. But given large amounts of time to waste on this problem, it would be interesting to run these models on prior data and against prior events to see how effective they are over time. College basketball has recently been tending toward being more random, given the higher percentage of three-point shots being taken nowadays versus twenty years ago. It would be interesting to see if this increased randomness makes games harder to predict nowadays than in earlier periods.
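To make the "run it on old data" idea concrete, here is a minimal sketch of what such a backtest might look like. It is not any model from the posts: I'm assuming a plain Elo rating as a stand-in, and a hypothetical input format of `(year, team_a, team_b, a_won)` tuples in chronological order. The function name `backtest_elo_by_decade` and the parameters `k` and `base` are made up for illustration.

```python
from collections import defaultdict

def backtest_elo_by_decade(games, k=20.0, base=1500.0):
    """Replay games in order and score the pre-game favorite per decade.

    games: iterable of (year, team_a, team_b, a_won), chronological (assumed format).
    Returns {decade: fraction of games in which the Elo favorite won}.
    """
    ratings = defaultdict(lambda: base)  # every team starts at the same rating
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, a, b, a_won in games:
        ra, rb = ratings[a], ratings[b]
        # Standard Elo expected score for team a.
        p_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        decade = (year // 10) * 10
        if p_a != 0.5:  # skip games where the ratings name no favorite
            totals[decade] += 1
            if (p_a > 0.5) == a_won:
                hits[decade] += 1
        # Update ratings only after the prediction has been scored.
        delta = k * ((1.0 if a_won else 0.0) - p_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta
    return {d: hits[d] / totals[d] for d in totals}
```

Comparing the per-decade numbers would be one crude way to ask whether the 70s were easier to call than the 80s, though sparse early data would make the older decades noisier.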
See, that's the sort of thing that I find fascinating.
When the dynamics of the meta-game change, how does that affect our ability to predict the outcome of games going forward based on all the data we have versus smaller and more specific slices of the data for training? There's probably an entire PhD thesis waiting for somebody with a pile of sports knowledge, computing power, and the patience for sorting through a lot of tedious, scattered information – but I would definitely read that thesis.
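The "all the data versus a smaller slice" question can be sketched as a walk-forward comparison. This is only a toy under my own assumptions: the predictor just picks whichever team had the higher win rate in the training window, the games are hypothetical `(year, team_a, team_b, a_won)` tuples, and `win_rates` and `walk_forward_accuracy` are names invented for this example.

```python
from collections import defaultdict

def win_rates(games):
    """Fraction of games won by each team in the given games."""
    wins, played = defaultdict(int), defaultdict(int)
    for _, a, b, a_won in games:
        played[a] += 1
        played[b] += 1
        wins[a if a_won else b] += 1
    return {team: wins[team] / played[team] for team in played}

def walk_forward_accuracy(games, window=None):
    """For each season, train on earlier seasons and score the picks.

    window=None trains on every earlier season; window=n uses only the
    n seasons immediately before the one being predicted.
    Returns the overall hit rate, or None if no game could be scored.
    """
    seasons = sorted({g[0] for g in games})
    hits = total = 0
    for season in seasons[1:]:
        lo = float("-inf") if window is None else season - window
        train = [g for g in games if lo <= g[0] < season]
        rates = win_rates(train)
        for year, a, b, a_won in games:
            if year != season:
                continue
            ra, rb = rates.get(a), rates.get(b)
            if ra is None or rb is None or ra == rb:
                continue  # no history or no favorite: make no pick
            total += 1
            hits += int((ra > rb) == a_won)
    return hits / total if total else None
```

Calling it once with `window=None` and once with, say, `window=5` on the same games gives two numbers to compare. If the meta-game really has shifted, I would expect the shorter window to catch up faster, but that is a guess and not a result.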
It would also be interesting to compare the predictive power of the various models run against one sport versus another. Is the bracket density of college football more amenable to statistical analysis than college basketball? I have no idea; I barely know enough to even ask the question. But it's interesting!
I am all for more analysis of algorithms and looking at their applicability across novel regimes. That's cool stuff. Which is why I'm following this series of posts even though I really have no grounding or interest in the sports side of things. It's fascinating in and of itself.