I wanted to come back and bump this thread with some new information about the Soccer Rankings (SR) app. I weighed the options of just starting a new thread, but figured it might make more sense to have the information consolidated here where there has already been so much discussion about the ratings/rankings/algorithm/etc.
So today Mark made a pretty incredible discovery, and I'm giddy because it was at least partially based on a suggestion I gave him. But before I get there, a little background might be helpful to ground the discussion. So first off, the way this system works is pretty well known and well described at this point, at least to folks who frequent this board. Game data is pulled in from various electronic sources and assigned to a team entity. If a correct team entity for the data can't be identified, a new team entity is created. Rinse and repeat, continuing to add game results to each entity. If a game result has a rated team on the other side of it, the rating for each team is adjusted based on the new result. The ratings of the two teams are compared, and if the actual goal difference is more than the existing ratings expected, the team that overperformed has its rating bumped up a smidge. If the goal difference is less than expected, the team that underperformed has its rating bumped down a smidge. And if the goal difference is pretty much spot on with what was expected, neither team's rating moves much at all. (More details on this are up in the FAQ for the app.)
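For the folks here who think in code, here's roughly how I picture that update step. To be clear, this is just my own sketch of the general idea in Python - not Mark's actual code - and the constant and the expected-margin rule are my own assumptions for illustration:

```python
# My own sketch of the rating update described above, NOT the app's actual code.
# K is a hypothetical "smidge" size; the real algorithm presumably tunes this.
K = 0.1

def expected_goal_diff(rating_a, rating_b):
    # Assumption for illustration: the expected margin is just the rating gap.
    return rating_a - rating_b

def update_ratings(rating_a, rating_b, goals_a, goals_b):
    actual = goals_a - goals_b
    surprise = actual - expected_goal_diff(rating_a, rating_b)
    # Overperform and your rating gets bumped up a smidge;
    # underperform and it gets bumped down a smidge.
    return rating_a + K * surprise, rating_b - K * surprise

# A 3.0-rated team beats a 2.0-rated team 3-1, when the ratings expected a 1-goal margin:
print(update_ratings(3.0, 2.0, 3, 1))  # -> (3.1, 1.9)
```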
There are a couple of outcomes of these ratings, but essentially they are useful for predicting what is going to happen when two rated teams compete. Those predictions can be used to flight tournaments, choose proper league brackets, or as a fun prediction for how an upcoming weekend may be expected to play out. Now these predictions are never going to be 100% accurate (right every time) or 0% accurate (wrong every time); but the better the data, and the better the algorithm, the better the quality of the predictions can be. For definitions, Mark uses "predictive power" to express this same concept. 0% predictive power means a coin flip (getting no better than 50% correct). 100% predictive power = god. You can convert predictive power to the % of results correctly predicted by dividing by 2 and adding 50%, so 70% predictive power would translate to getting 85% of predictions correct. In all of these trials, "correct" is defined as picking the correct winner, for games that result in a winner. If the wrong winner is chosen, it's a failure. Tie games are excluded from these predictivity results.
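Written out explicitly, that conversion is just this (nothing new here, only restating the sentence above):

```python
# Convert "predictive power" to the share of winners picked correctly (ties excluded).
def percent_correct(predictive_power):
    return predictive_power / 2 + 50

print(percent_correct(70))  # 85.0 - the example above
print(percent_correct(0))   # 50.0 - a coin flip
```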
With this setup, the predictivity of the app isn't an estimate or a guess - it's a specific number that can be calculated as often as desired. Run through all the stored games in the database right now, compare the predicted result from the comparative ratings against the actual result for each game, divide the correct predictions by the total number of games being predicted, and one number gets spit out. Turns out this number, as of today, is
66.7% predictive over all games, which translates into picking the correct winner of the soccer game
83.35% of the time. So as expected, it's way better than a coin flip, and will pick the right winner about 5 out of 6 times. This predictive number is a validation that the ratings derived from the algorithm themselves have a certain level of accuracy. If the ratings were wildly inaccurate, the predictive number would trend toward 0%; if the ratings were supernatural, it would trend toward 100%. But by any measure, the real, provable, actual predictivity number is pretty darned good (and better than another well-known ranking system by more than 50 points, which is insane). For any skeptics who doubt that youth soccer can be ranked/rated, or even skeptics of this particular algorithm/ranking system, the predictivity number is what mathematically shows how likely a prediction is to be right - and it's an admirable number.
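If it helps to see the bookkeeping, this is roughly what that calculation amounts to. Again, it's my own sketch with made-up field names, not the app's actual database or query:

```python
from dataclasses import dataclass

# A rough sketch of the predictivity calculation; the Game fields and the
# ratings lookup are made up for illustration, not the app's real schema.
@dataclass
class Game:
    home: str
    away: str
    goals_home: int
    goals_away: int

def predictivity(games, ratings):
    correct = total = 0
    for g in games:
        if g.goals_home == g.goals_away:
            continue                          # tie games are excluded
        # Predict the higher-rated team to win (rating ties go to the home side here).
        predicted = g.home if ratings[g.home] >= ratings[g.away] else g.away
        actual = g.home if g.goals_home > g.goals_away else g.away
        total += 1
        correct += (predicted == actual)
    pct_correct = 100 * correct / total
    predictive_power = 2 * (pct_correct - 50)  # inverse of the conversion above
    return predictive_power, pct_correct
```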
But that still isn't the interesting discovery. Here comes the interesting discovery. There is an intuition, even among proponents of this type of comparative ranking based on goal differences, that the quality of the data (and the predictions) depends on how close the compared teams are to each other and how many shared opponents they have. The more interplay, the better; the less interplay, the more drift. I believed that to be the case, as it seemed reasonable. For example, if teams are in the same league, same conference, or even same state, they play each other enough that their comparative ratings will be honed and sharpened against each other, and would have a higher predictive value. Conversely, if you're comparing teams that are not in the same league or same location, may have never seen each other before, and have few if any common opponents, it makes intuitive sense that their comparative ratings would drift a bit more and would be somewhat less accurate. Remember, this actual predictivity - the quality of each prediction - can be calculated by looking at the existing data for games that fit into a given category.
So what I suggested to Mark - and to be fair, he had also thought of it himself within the past few days - was that he should exclude all in-state games and measure the predictivity of interstate games exclusively. CA teams playing AZ, TX playing OK, or any other permutation in the country where the opposing teams are in different states. What this would do is measure how good the predictions are when there is very little shared information going into the upcoming game. Interplay is low. This represents what happens when you go to a big tournament elsewhere, as opposed to predicting what will happen in a local league game. He coded the query, ran the data, and a few hours later the number was spat out. And it turns out that for these interstate games, the algorithm is
67.0% predictive, which translates into picking the correct winner of the soccer game
83.5% of the time. So all of the intuitive worry about drift, or local data being more refined than remote data, turned out to be a false intuition. The comparative ratings, even when used across different states, provide just as good (and in fact a teensy bit better) predictions as when they are applied to local/in-league contests. If a team has sufficient data to be rated, that rating can be trusted whether there has been extensive interplay or not. It's an incredible finding, and it validates all of the work and effort Mark and his team have put in over the years to polish and refine the algorithm, tying game data to a useful rating.
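For completeness, the interstate cut is just the same calculation run over a filtered list of games - building on the predictivity() sketch above, with a hypothetical team-to-state lookup standing in for however the app actually tracks location:

```python
# Same measurement as predictivity() above, restricted to interstate games.
# `team_state` is a hypothetical {team: state} lookup, not the app's real data model.
def interstate_predictivity(games, ratings, team_state):
    interstate_games = [g for g in games
                        if team_state[g.home] != team_state[g.away]]
    return predictivity(interstate_games, ratings)
```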
And now for a real-world use: it looks like my youngest's team is predicted to lose both games this Saturday, so what's the leading recommendation for filling my thermos?