I might be misunderstanding, but it looks to me like your proposed extension is essentially just the Elo model with some degrees of freedom that don’t yet appear to matter?
The dot product has the property that <theta_A-theta_B,w> = <theta_A,w> - <theta_B,w>, so the only thing that matters is the <theta_P,w> for each player P, which is just a single scalar. So we are on a one-dimensional scale again where predictions are based on taking a sigmoid of the difference between a single scalar associated with each player.
As far as I can tell, the way that such a model could still be a nontrivial extension of Elo would be if you posited w could vary between games, whether randomly from some distribution or whether via additional parameters associated to players that influence what w is in the games they are involved in, or other things like that. But it seems you would need something like that, or else some source of nonlinearity, because if w is constant then every dimension orthogonal to that fixed w can never have any effect on predictions by the model.
Circling back to this with a thing I was thinking about—suppose one wanted to figure out just one additional degree of freedom to the Elo rating a player had (at a given point in time, if you also allow evolution over time) that would add as much improvement as possible. Almost certainly you need more dimensions than that to properly fit real idiosyncratic nonlinearities/nontransitivities (i.e. if you had a playing population with specific pairs of players that were especially strong/weak only against specific other players, or cycles of players where A beats B beats C beats A, etc), but if you just wanted to work out what the “second principal component” might be, what’s a plausible guess?
First, you can essentially reproduce the Elo model by rather than each player having a rating and the winning chance being a function of the difference between their ratings, instead you posit that each player has a rating and when they play a game, they each indepedently sample a random value from a fixed probability distribution centered around their own rating, and the player with the larger sample wins.
I think that you exactly reproduce the Elo model up to scaling if this distribution is a Gumbel distribution, because the difference of two Gumbels is apparently equivalent to a draw from a logistic distribution, and the CDF of the logistic distribution is precisely the sigmoid that the Elo model posits. But in practice, you should end up with almost the same thing if you choose any other reasonable distribution so long as it has the right heaviness of tail.
In particular, I’d expect having linearly-exponential tails is good rather than quadratically-exponential tails like the normal distribution has, because linearly-exponential tails tend to be desirable for real-world ratings models due to being much more outlier-resistant and in the real world you have issues like forfeits, sandbaggers, internet disconnection/timeouts, etc. (If you have a quadratically exponential tail, then a ratings model can put so low probability on an outlier that subject to seeing the outlier, the ratings model is forced to make a too-large update to accommodate it, this should be intuitive from a Bayesian perspective). Outliers and noise and the realities of real world ratings data I’d expect introduces far bigger variation in ratings quality anyways than any minor distribution-shape differences would.
So for example, you could also say each player draws from a logistic distribution, rather than only a Gumbel. The difference of two logistics is not quite a logistic distribution but up to rescaling it should be pretty close so this is nearly the Elo model again.
Anyways, with any reformulation like this, there is a very natural candidate now for a second dimension—that of the variance of the distribution that a player draws their sample from. Rather than each player drawing from a fixed distribution centered around their rating before seeing who has the higher value and wins, we now add a second parameter that allows the variance of that distribution to vary by player. So the ratings model now becomes able to express things like “this player is more variable in performance between games, or prone to blunders uncharacteristic of their skill level than this other player”. This parameter might also improve the rating system’s ability to “explain away” things like sandbagger players by assigning them a high variance, thereby reducing their distortionary impact on other players’ ratings even before manual intervention.