Fascinating. Do the gained insights allow one to compute the central flow directly, in order to speed up convergence? Or is this preliminary exploration to understand how it has been working?
They explicitly ignore momentum and exponentially weighted moving averages, but that should yield time-averaged gradient descent (along the valley, not across it). That requires multiple evaluations, though: do any of the expressions for the central flow admit a fast, computationally efficient calculation?
> We emphasize that the central flow is a theoretical tool for understanding optimizer behavior, not a practical optimization method. In practice, maintaining an exponential moving average of the iterates (e.g., Morales-Brotons et al., 2024) is likely a computationally feasible way to estimate the optimizer's time-averaged trajectory.
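For concreteness, here is a minimal sketch of what that EMA-of-iterates estimate could look like in PyTorch; the decay value and the training-loop structure are my own illustrative assumptions, not from the paper:

```python
import torch

def train_with_weight_ema(model, loss_fn, data_loader, opt, decay=0.99):
    """Track an exponential moving average of the weights themselves,
    as a cheap estimate of the optimizer's time-averaged trajectory."""
    ema = {n: p.detach().clone() for n, p in model.named_parameters()}
    for x, y in data_loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        with torch.no_grad():
            for n, p in model.named_parameters():
                # One multiply-add per parameter per step: negligible next
                # to the Hessian computations an exact central flow needs.
                ema[n].mul_(decay).add_(p, alpha=1 - decay)
    return ema  # smoothed weights; the raw weights keep oscillating
```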
They analyze the behavior of RMSProp (Adam without momentum) using their framework, deriving simplified mathematical models that predict actual training behavior in experiments. It looks like their models explain why RMSProp works, in a way that is more satisfying than the usual hand-waving explanations.
Yes, it certainly provides a lot more clarity than the handwaving.
While momentum seems to work in practice, and the authors clearly state the central flow is not intended as a practical optimization method, I can't rule out that we could improve convergence rates by building on this knowledge.
Is the oscillating behavior guaranteed to have a period of 2 steps, or is, say, a 3-step period also possible (a vector in a plane could alternately point at 0 degrees, 120 degrees, and 240 degrees)?
The way I read this presentation, the implication seems to be that it's always a period of 2. Perhaps if the top-2 sharpnesses are degenerate (identical), a period N distinct from 2 could be possible?
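One intuition for why 2 is special, at least in the linearized picture: near a minimum, gradient descent acts like x -> (I - eta*H)x, and since the Hessian H is symmetric, the eigenvalues of that update map are real. A real eigenvalue can only make a mode decay, grow, or flip sign each step; the 120-degree rotation scenario would require complex eigenvalues. A minimal numeric check on a 1-D quadratic L(x) = 0.5*lam*x^2 (eta and lam are arbitrary values chosen so that eta*lam = 2, the edge of stability):

```python
# Gradient descent on L(x) = 0.5 * lam * x**2: the update is
# x_{t+1} = (1 - eta*lam) * x_t, a purely multiplicative map.
eta, lam = 0.1, 20.0          # eta * lam = 2.0, the edge of stability
x = 1.0
for t in range(6):
    x = (1 - eta * lam) * x   # the multiplier is exactly -1 here
    print(t, x)               # -1.0, 1.0, -1.0, ... an exact 2-cycle
```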
It makes you wonder: what if, instead of storing momentum as an exponential moving average, one used the average of the last 2 iterates, so there would be less lag (see the sketch below)?
It also makes me wonder whether we should perform 2 iterative steps per sequence, so that the single-sample sequence gives feedback along its valley instead of across it. One would go through the corpus at half the speed, but convergence might be more accurate.
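If anyone wants to poke at the last-two-iterates idea, here is a hypothetical sketch (nothing in it comes from the paper; the function name and the equal 0.5/0.5 weighting are my assumptions). If the oscillation really has period 2, averaging consecutive iterates cancels the across-valley component with only one step of lag, versus the longer lag of an EMA:

```python
import torch

def two_iterate_average(model, opt, loss_fn, batches):
    """Hypothetical: after each optimizer step, yield the average of the
    current and previous weights as an along-valley position estimate."""
    prev = {n: p.detach().clone() for n, p in model.named_parameters()}
    for x, y in batches:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        with torch.no_grad():
            avg = {n: 0.5 * (p + prev[n])
                   for n, p in model.named_parameters()}
            prev = {n: p.detach().clone()
                    for n, p in model.named_parameters()}
        yield avg  # a period-2 component of the weights cancels in avg
```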
Not a formal proof, but there is this fun theorem called "period three implies chaos" that my gut instinct says applies here.
Basically, if you have a continuous mapping from [a,b] -> [a,b] and there exists a 3-cycle, then cycles of every other length also exist.
Which in this case would kinda say that if you are bouncing between three values on the y-axis (and the bouncing is a continuous function, which admittedly the gradient of a ReLU is not), you are probably in a chaotic system.
Now, that requires assuming that the behaviour of y is largely a function of just y, but their derivation seems to imply that is the case.
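For anyone who wants to see the theorem in action: the standard concrete example is the logistic map, which has a well-known period-3 window around r = 3.83 (a textbook value, nothing to do with this paper). By Li and Yorke's result, the existence of that 3-cycle implies cycles of every other length exist too:

```python
# Li & Yorke, "Period Three Implies Chaos": a continuous map of an interval
# into itself with a 3-cycle has cycles of every length. Classic example:
# the logistic map f(y) = r * y * (1 - y) in its period-3 window.
r, y = 3.83, 0.5
for _ in range(1000):          # burn-in: let transients die out
    y = r * y * (1 - y)
orbit = []
for _ in range(6):             # sample the attractor
    y = r * y * (1 - y)
    orbit.append(round(y, 6))
print(orbit)                   # values repeat every 3 steps: a 3-cycle
```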