Ivan Yevenko


Dynamical Systems for Generalists

Dynamical systems theory is a field which has largely been abandoned since the 1990s when chaos was still a hot topic. It was developed by mathematicians studying general forms of differential equations and the kinds of behaviors they can exhibit. That's why the theory is incredibly difficult to learn. The best resources are buried in pages of proofs about abstract function spaces and ergodic measures, so it's nearly impossible for physicists, biologists and ML researchers to make sense of it. At this point I've exhausted every available resource on dynamical systems and I have yet to find one which would make sense to a non-mathematician. That's why I wrote this post!

Hopefully, I can convince you that this theory is the best tool we have for studying any kind of complex system. With that, let's begin the math.

1. Flow: A Mathematical and Geometric Interpretation

We begin with an ODE: \[\large \frac{dx}{dt} = f(x), \quad x \in \mathbb{R}^n, \quad x(0) = x_0 \] This ODE is autonomous, meaning the dynamics do not depend on time. We're interested in the behavior (dynamics) of \(x\) for a general non-linear function \(f\). Why non-linear? Because a linear system has a closed form solution and that's not very exciting: \[\large \frac{dx}{dt} = Ax \implies x(t) = e^{tA}\,x_0 \] Here, a solution means a mapping from an initial condition \(x_0\) and a time \(t\) to a final state \(x(t)\). A general non-linear function doesn't have a closed form solution, but it can still have such a mapping. We call the function mapping \((x_0, t) \mapsto x(t)\) a flow, and denote it as: \[\large \begin{equation} \Phi(x_0,t) = x(t) \label{1} \end{equation} \] In the case of the linear system, the flow is \(\Phi(x_0,t) = e^{tA}\,x_0\). The flow encodes how every initial state maps to a final state, so it's natural to ask: how does the flow map an initial distribution over states to a final distribution? It turns out that with a few clever steps we can derive an evolution equation for the density: \begin{align} \nabla \cdot f(\Phi(x,t)) &= \operatorname{Tr} (\nabla f(\Phi(x,t))) \nonumber \\[8pt] &= \tfrac{d}{dt} \log \det (\nabla \Phi(x,t)) & \text{(by Jacobi's formula)} \nonumber \\[8pt] &= \partial_t \log \det (\nabla \Phi(x,t)) + f(\Phi(x,t)) \cdot \nabla \log \det (\nabla \Phi(x,t))& \text{(Material Derivative)} \nonumber \\[8pt] -\det (\nabla \Phi(x,t))^{-1} (\nabla \cdot f(\Phi(x,t))) &= \partial_t \det (\nabla \Phi(x,t))^{-1} + f(\Phi(x,t)) \cdot \nabla \det (\nabla \Phi(x,t))^{-1} &\nonumber \\[8pt] \partial_t \det (\nabla \Phi(x,t))^{-1} &= -\nabla \cdot (f(\Phi(x,t)) \det (\nabla \Phi(x,t))^{-1}) \label{2}&\\[8pt] \end{align} This is the form of the Liouville equation! We can recover it exactly, but we have to be incredibly careful not to mix up \(x\), \(x_0\) and \(x(t)\). In \eqref{1}, we defined \(\Phi\) to be the function which brings an initial state \(x_0\) to its future state at time \(t\). When we plug in a variable \(x\), it can have any value but it must be interpreted as an initial state that gets mapped by \(\Phi\). With that said, we can make the following definition: \[\large \rho(\Phi(x,t),t) = \frac{\rho_0(x)}{\det (\nabla \Phi(x,t))}, \quad \rho(x,0) = \rho_0(x) \] Then, plugging this into \eqref{2} and letting \(x\) represent a final state, we get the Liouville equation: \[\large \partial_t \rho(x,t) = -\nabla \cdot \left(f(x)\, \rho(x,t)\right) \] Intuitively, the determinant is playing the role of volume since \(\rho\) is a density. But does this tell us something new? The Liouville equation is a well known result which simply follows from conservation of probability. Well, the determinant is just the product of the singular values. Therefore: \[\large \log \rho_0(x_0) - \log \rho(x,t) = \sum_{i=1}^n \log \sigma_i \left(\nabla \Phi(x_0,t)\right), \quad x = \Phi(x_0,t) \] Another way to write this is in the form of a singular value decomposition: \[\large \nabla \Phi = U \Sigma V^\top \implies \log \rho_0 - \log \rho = \operatorname {Tr}(\log \Sigma) \] How do we interpret these singular values? Their product is a volume, and there's one for each state dimension, so maybe they can be thought of as the axis lengths of some ellipsoid. While this is true, it turns out that whatever geometric shape you want to imagine gets rotated and distorted very quickly by the flow.
To see that, we can look at the evolution equations for both the singular values and the left and right singular vectors: \[\large \begin{align*} \dot U &= U \Omega_U \\ \dot V &= V \Omega_V \\ \dot \Sigma &= \Sigma \operatorname{diag}\left(U^\top (\nabla f) U\right) \end{align*} \] Here, \(\Omega_U\) and \(\Omega_V\) are skew-symmetric matrices which evolve along the flow. Skew-symmetric matrices are the Lie algebra of rotations in n dimensions, so right multiplying is the same as an infinitesimal rotation. In simpler terms, \(\Omega_U\) and \(\Omega_V\) only rotate \(U\) and \(V\). The exact expressions for the skew-symmetric matrices are quite ugly to write down in full, but if you're curious: \[\large \boxed{ \begin{aligned} P &:=U^{\top}(\nabla f)\,U\\[8pt] \dot\Sigma &= \Sigma\;\operatorname{diag}\bigl(P_{11},\dots ,P_{nn}\bigr),\\[8pt] \dot U &= U\,\Omega_U, & (\Omega_U)_{ij} &=\frac{P_{ij}-P_{ji}}{2}+\frac{\sigma_i^{2}+\sigma_j^{2}}{\sigma_j^{2}-\sigma_i^{2}} \frac{P_{ij}+P_{ji}}{2},\quad i\neq j,\\[8pt] && (\Omega_U)_{ii}&=0, \quad \Omega_U^{\top}=-\Omega_U,\\[8pt] \dot V &= V\,\Omega_V,& (\Omega_V)_{ij} &=\frac{2\sigma_i\sigma_j}{\sigma_j^{2}-\sigma_i^{2}} \frac{P_{ij}+P_{ji}}{2},\quad i\neq j,\\[8pt] && (\Omega_V)_{ii}&=0, \quad \Omega_V^{\top}=-\Omega_V. \\[8pt] \end{aligned}} \] One more important note is that if there are multiple identical singular values, then \(U\) and \(V\) are actually arbitrary vectors that just span the space. They don't have a unique evolution equation. Now that we know how \(\Sigma\) evolves, we see that it is just a matrix whose diagonal elements exponentially shrink/grow depending on the projection of the Jacobian of \(f\) onto the left singular vectors. In geometric terms, that means there is some ellipsoid that is advected, rotated and exponentially stretched/squeezed depending on how it aligns with the spatial variation in the flow. Still, this analogy is lacking because if you really were to drop some finite ellipsoid into the flow, it would get distorted until it was no longer an ellipsoid.
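
As a quick numerical sanity check of the \(\dot \Sigma\) equation (a rough sketch assuming NumPy/SciPy, with the Lorenz system standing in for \(f\)), we can compare a finite-difference derivative of the singular values of \(\nabla \Phi\) against the predicted \(\sigma_i \left(U^\top (\nabla f) U\right)_{ii}\). The larger singular values should match closely; the smallest one is numerically delicate because it is exponentially tiny:

```python
import numpy as np
from scipy.integrate import solve_ivp

S, R, B = 10.0, 28.0, 8.0 / 3.0  # classic Lorenz parameters (stand-in for f)

def f(x):
    return np.array([S * (x[1] - x[0]), x[0] * (R - x[2]) - x[1], x[0] * x[1] - B * x[2]])

def grad_f(x):  # Jacobian of f at the point x
    return np.array([[-S, S, 0.0], [R - x[2], -1.0, -x[0]], [x[1], x[0], -B]])

def rhs(_, y):
    # Integrate the state x together with the flow Jacobian J = grad Phi(x0, t).
    x, J = y[:3], y[3:].reshape(3, 3)
    return np.concatenate([f(x), (grad_f(x) @ J).ravel()])

x0, t, dt = np.array([1.0, 1.0, 1.0]), 1.0, 1e-3
sol = solve_ivp(rhs, (0.0, t + dt), np.concatenate([x0, np.eye(3).ravel()]),
                t_eval=[t, t + dt], rtol=1e-11, atol=1e-13)
x_t = sol.y[:3, 0]
J_t, J_next = sol.y[3:, 0].reshape(3, 3), sol.y[3:, 1].reshape(3, 3)

U, sigma, _ = np.linalg.svd(J_t)
sigma_next = np.linalg.svd(J_next, compute_uv=False)

P = U.T @ grad_f(x_t) @ U
print((sigma_next - sigma) / dt)  # finite-difference d(sigma_i)/dt
print(sigma * np.diag(P))         # predicted sigma_i * P_ii; should roughly match the line above
```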

Fortunately, there exists a better interpretation. It turns out that the singular values of a Jacobian also appear in the study of continuum mechanics! The right/left Cauchy-Green deformation tensors are defined as \(J^\top J = V \Sigma^2 V^\top\) and \(JJ^\top = U \Sigma^2 U^\top\) respectively, where \(J = U \Sigma V^\top\) is the Jacobian of some deformation map. The eigenvalues of these tensors are the diagonal of \(\Sigma^2\) in both cases. Formally, \(J^\top J\) is the metric tensor which describes the local geometry as compared to the Euclidean metric. Physically, it encodes the squared local changes in distance. The singular values of the Jacobian are the ratios between the stretched/compressed lengths in some rotated coordinates and the original lengths in Euclidean coordinates. This is directly related to strain, which is defined as a relative difference \(\frac{l - l_0}{l_0}\), so you essentially just need to subtract the identity matrix from the metric tensor (the Green-Lagrange strain tensor is \(\tfrac{1}{2}(J^\top J - I)\)).
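
As a toy illustration (a sketch of my own, not tied to any particular flow), take a simple 2D shear map, form the right Cauchy-Green tensor \(C = J^\top J\), and check that its eigenvalues are the squared singular values of \(J\); the Green-Lagrange strain \(\tfrac{1}{2}(C - I)\) then encodes the squared changes in length:

```python
import numpy as np

# A simple shear deformation (x, y) -> (x + gamma * y, y); gamma is the shear amount.
gamma = 0.5
J = np.array([[1.0, gamma],
              [0.0, 1.0]])         # Jacobian of the deformation map

C = J.T @ J                        # right Cauchy-Green tensor (the induced metric)
eigvals = np.linalg.eigvalsh(C)
sing_vals = np.linalg.svd(J, compute_uv=False)

print(np.sort(eigvals))            # eigenvalues of C ...
print(np.sort(sing_vals) ** 2)     # ... equal the squared singular values of J

E = 0.5 * (C - np.eye(2))          # Green-Lagrange strain tensor
print(E)                           # off-diagonal terms encode the shear
```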

Since \(\nabla \Phi\) is a Jacobian, we can interpret the flow \(\Phi\) as some kind of deformation in space which drastically changes the lengths between points so that in some directions they shrink onto the same point, and in others they grow apart. But remember that these contractions and expansions are entirely local. Exponential growth of local distances does not imply that a small initial distance will exponentially grow forever. Just like when studying the deformation of a metal rod, we are making a very good first-order approximation of the actual deformation. The rod will eventually begin to buckle, tear, and fracture if you keep stretching far enough. In a similar way, a flow will eventually have higher order effects which behave nothing like exponential stretching/compression. The analogy only draws a parallel between the deformation of the lattice of atoms in a material and the abstract notion of coordinates deformed by a flow. Globally, a flow really behaves more like, well, something that flows.
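
To make all of this concrete, here is a minimal numerical sketch (my own example, assuming NumPy/SciPy, with the Lorenz system standing in for \(f\)): it integrates the variational equation \(\tfrac{d}{dt}\nabla\Phi = \nabla f\,\nabla\Phi\) alongside the state and checks the local volume factor \(\det(\nabla\Phi)\) against the Liouville prediction. For Lorenz the divergence \(\nabla \cdot f\) is constant, so \(\log\det(\nabla\Phi(x_0,t))\) should equal \(t\,\nabla\cdot f\); for a general \(f\) you would integrate the divergence along the trajectory instead.

```python
import numpy as np
from scipy.integrate import solve_ivp

S, R, B = 10.0, 28.0, 8.0 / 3.0  # classic Lorenz parameters (stand-in for f)

def f(x):
    return np.array([S * (x[1] - x[0]), x[0] * (R - x[2]) - x[1], x[0] * x[1] - B * x[2]])

def grad_f(x):  # Jacobian of f at the point x
    return np.array([[-S, S, 0.0], [R - x[2], -1.0, -x[0]], [x[1], x[0], -B]])

def rhs(t, y):
    # Integrate the state x together with the flow Jacobian J = grad Phi(x0, t).
    x, J = y[:3], y[3:].reshape(3, 3)
    return np.concatenate([f(x), (grad_f(x) @ J).ravel()])  # dJ/dt = grad_f(x) @ J

x0, t_final = np.array([1.0, 1.0, 1.0]), 1.0
sol = solve_ivp(rhs, (0.0, t_final), np.concatenate([x0, np.eye(3).ravel()]),
                rtol=1e-10, atol=1e-12)
J = sol.y[3:, -1].reshape(3, 3)  # grad Phi(x0, t_final)

# For Lorenz, div f = -(S + 1 + B) is constant, so the log-volume factor
# log det(grad Phi) should equal t * div f; densities scale by the inverse of this factor.
print(np.linalg.slogdet(J)[1], -(S + 1.0 + B) * t_final)
```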

2. Lyapunov Exponents and their Corresponding Vectors

Lyapunov exponents are fundamental to the study of dynamical systems, but their mathematical definition is typically very confusing. To avoid this confusion, most sources just give the equation for the largest Lyapunov exponent, but that hides the true complexity of the object. I will also avoid leaning on ergodicity, since it is a highly system-dependent assumption and invoking it glosses over necessary nuance.

The truth is, I've already introduced a definition for Lyapunov exponents in the previous section! The textbook definitions are just the averages of the log singular values we encountered before: \[\large \begin{align*} \lambda_i(x,t) &= \frac{1}{t} \log \sigma_i(\nabla \Phi(x,t)) & \text{(Local finite-time Lyapunov exponents)} \\[8pt] \bar \lambda_i(t) &= \int_{\mathbb{R}^n} \lambda_i(x,t) \rho_0(x)\, dx & \text{(Finite Time Lyapunov exponents)} \\[8pt] \lambda_i(x) & = \lim_{t \to \infty} \lambda_i(x,t) & \text{(Local Lyapunov exponents)} \\[8pt] \bar \lambda_i & = \lim_{t \to \infty} \int_{\mathbb{R}^n} \lambda_i(x,t) \rho_0(x)\, dx & \text{(Lyapunov exponents)} \\[8pt] \end{align*} \] I've specially denoted the spatial averages with a bar, because these are typically the implied definitions. That's because most systems where Lyapunov exponents are applied are (approximately) ergodic, which just means that \(\lambda_i(x) = \bar \lambda_i\) for almost every \(x\). While we don't require ergodicity, we do need some minimal conditions for these limits to actually exist. We will cover those later.
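
In practice, the averaged exponents are usually estimated with the standard Benettin/QR procedure: propagate a set of tangent vectors with the variational equation, periodically re-orthonormalize them with a \(QR\) decomposition, and average the logs of the stretching factors. Here is a minimal sketch for the Lorenz system (assuming NumPy/SciPy); with the classic parameters it should land near \((0.9, 0, -14.6)\):

```python
import numpy as np
from scipy.integrate import solve_ivp

S, R, B = 10.0, 28.0, 8.0 / 3.0

def f(x):
    return np.array([S * (x[1] - x[0]), x[0] * (R - x[2]) - x[1], x[0] * x[1] - B * x[2]])

def grad_f(x):
    return np.array([[-S, S, 0.0], [R - x[2], -1.0, -x[0]], [x[1], x[0], -B]])

def rhs(t, y):
    # State plus a full set of tangent vectors (the columns of Q), evolved together.
    x, Q = y[:3], y[3:].reshape(3, 3)
    return np.concatenate([f(x), (grad_f(x) @ Q).ravel()])

x, Q = np.array([1.0, 1.0, 1.0]), np.eye(3)
dt, n_steps, log_growth = 0.1, 5000, np.zeros(3)

for _ in range(n_steps):
    sol = solve_ivp(rhs, (0.0, dt), np.concatenate([x, Q.ravel()]), rtol=1e-9, atol=1e-11)
    x, M = sol.y[:3, -1], sol.y[3:, -1].reshape(3, 3)
    Q, Rtri = np.linalg.qr(M)            # re-orthonormalize the tangent vectors
    log_growth += np.log(np.abs(np.diag(Rtri)))

print(log_growth / (n_steps * dt))       # Lyapunov exponents, roughly (0.9, 0, -14.6)
```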

Another formulation of Lyapunov exponents you might encounter is in terms of a small perturbation vector \(v_i\): \[\large \begin{equation} \lambda_i(x_0) = \lim_{t \to \infty} \frac{1}{t} \log \bigl\|\nabla \Phi(x_0,t) \, v_i \bigr\|, \quad v_i \in E_i(x_0) \label{3} \end{equation} \] ...except it would normally be missing all the important details, which I will now explain. You see, you might expect that the vectors which satisfy the above equation are just the left or right singular vectors of \(\nabla \Phi\). It's not that simple!

Let's say you choose an initial condition \(x_0\) and some large time \(t\) and calculate that the \(i\)th right singular vector of \(\nabla \Phi(x_0,t)\) is \(v_i\). Then, you plug that vector naively into \eqref{3}. What you will get is not \(\lambda_i(x_0)\). Instead, the formula will give you \(\lambda_1(x_0)\) for every single \(v_i\)! This is because, as we derived before, the singular vectors constantly rotate over time. So, after a short time \(v_i\) will no longer correspond to a singular vector in \(V\) and it will have components in each one of the basis directions. That means it will have a finite component in the direction of the right singular vector associated with the largest singular value, and this component will asymptotically grow like \(e^{\lambda_1(x_0)t}\). If the largest Lyapunov exponent is positive, this immediately implies that almost any initial perturbation vector will grow exponentially. This is precisely the definition of a chaotic system.
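
Here is a quick way to see this numerically (a sketch on the Lorenz system, assuming NumPy/SciPy): push several unrelated random perturbation directions through the tangent flow and watch \(\frac{1}{t} \log \|\nabla \Phi(x_0,t)\, v\|\) settle toward the same number, the largest exponent, no matter which \(v\) you started with:

```python
import numpy as np
from scipy.integrate import solve_ivp

S, R, B = 10.0, 28.0, 8.0 / 3.0

def f(x):
    return np.array([S * (x[1] - x[0]), x[0] * (R - x[2]) - x[1], x[0] * x[1] - B * x[2]])

def grad_f(x):
    return np.array([[-S, S, 0.0], [R - x[2], -1.0, -x[0]], [x[1], x[0], -B]])

def rhs(_, y):
    # Carry the state x and a single tangent (perturbation) vector v.
    x, v = y[:3], y[3:]
    return np.concatenate([f(x), grad_f(x) @ v])

x0, t_final = np.array([1.0, 1.0, 1.0]), 30.0
rng = np.random.default_rng(0)

for _ in range(3):                                    # three unrelated random directions
    v0 = rng.standard_normal(3)
    v0 /= np.linalg.norm(v0)
    sol = solve_ivp(rhs, (0.0, t_final), np.concatenate([x0, v0]), rtol=1e-10, atol=1e-12)
    growth_rate = np.log(np.linalg.norm(sol.y[3:, -1])) / t_final
    print(growth_rate)                                # all settle near the largest exponent (~0.9)
```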

How do we avoid this? Well, it turns out that there is a subspace of \(\mathbb{R}^n\) which contains only the vectors that do not get rotated onto other singular vectors. The space, as implied in \eqref{3}, is called \(E_i(x)\) and it is the set of vectors which do not get rotated onto other directions by the flow and which get stretched like \(e^{\lambda_i(x)t}\) asymptotically. The simplest example of such a space is just \(\operatorname{span}\{f(x)\}\). This corresponds to \(\lambda_i(x) = 0\) because perturbations in this direction don't grow or shrink exponentially, they just push the state forward or backward along the trajectory it was already going to take. If the ODE has any other continuous symmetry besides time translation (e.g. rotation, scale invariance), then the zero-exponent subspace will have one dimension for each symmetry.

Oseledets' theorem (which, contrary to what Wikipedia says, has nothing to do with ergodicity) states that as long as the expected log norm of \(\nabla \Phi(x_0,1)\) is finite, there exists a "splitting" of \(\mathbb{R}^n\) into independent subspaces \(E_i(x)\) which are invariant under the flow and grow like \(e^{\lambda_i(x)t}\). In simpler terms, that means for every Lyapunov exponent, there is a corresponding vector space which contains perturbations that grow like that exponent. These subspaces span all of \(\mathbb{R}^n\) but they do not form an orthogonal basis. You cannot project a given perturbation vector onto the basis with a simple dot product; instead you can stack the basis vectors into a matrix, take its \(QR\) decomposition, and read off the coefficients with \(R^{-1}Q^\top\). Furthermore, a linear combination of perturbation vectors with different Lyapunov exponents will grow like the largest of those exponents. That means we can actually generalize \eqref{3} to: \[\large \begin{equation} \lambda_i(x_0) = \lim_{t \to \infty} \frac{1}{t} \log \bigl\|\nabla \Phi(x_0,t) \, v \bigr\|, \quad v \in \bigoplus_{j \ge i} E_j(x_0), \; v \notin \bigoplus_{j > i} E_j(x_0) \label{4} \end{equation} \] The vectors in each of the subspaces \(E_i(x)\) are called covariant Lyapunov vectors. This is poor naming because Lyapunov exponents are already associated with another set of vectors; namely, the left and right singular vectors of \(\nabla \Phi\). Perhaps they should be called Oseledets vectors, but for now I will just call them covariant vectors of the flow. They are called covariant because the field \(v_i(x)\) is fixed, and evolving a vector in the field will keep it aligned with the field: \[\large \nabla \Phi(x,t) \, v_i(x) \propto v_i(\Phi(x,t)) \] From the previous definition, the proportionality constant will scale like \(e^{\lambda_i(x)t}\) as \(t \to \infty\). In practice we really do need to take the long-time limit to calculate them. But once you know these covariant vectors, it opens the door to making informed perturbations that can actually drive a system to entirely new behaviors.
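
One covariant vector we know in closed form is the flow direction itself, \(v(x) = f(x)\). Here is a small sketch (Lorenz again, assuming NumPy/SciPy) checking that it really is covariant, i.e. that \(\nabla \Phi(x,t)\, f(x)\) stays parallel to \(f(\Phi(x,t))\), and that it neither grows nor shrinks exponentially:

```python
import numpy as np
from scipy.integrate import solve_ivp

S, R, B = 10.0, 28.0, 8.0 / 3.0

def f(x):
    return np.array([S * (x[1] - x[0]), x[0] * (R - x[2]) - x[1], x[0] * x[1] - B * x[2]])

def grad_f(x):
    return np.array([[-S, S, 0.0], [R - x[2], -1.0, -x[0]], [x[1], x[0], -B]])

def rhs(_, y):
    x, v = y[:3], y[3:]
    return np.concatenate([f(x), grad_f(x) @ v])

x0, t_final = np.array([1.0, 1.0, 1.0]), 20.0
v0 = f(x0)                                            # perturb along the flow direction
sol = solve_ivp(rhs, (0.0, t_final), np.concatenate([x0, v0]), rtol=1e-10, atol=1e-12)
x_t, v_t = sol.y[:3, -1], sol.y[3:, -1]

alignment = v_t @ f(x_t) / (np.linalg.norm(v_t) * np.linalg.norm(f(x_t)))
print(alignment)                                      # ~1: still parallel to f(Phi(x0, t))
print(np.log(np.linalg.norm(v_t) / np.linalg.norm(v0)) / t_final)  # near 0, unlike 0.9 or -14.6
```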

3. The Attractor and its Dimension

Before we can understand how the system will respond to perturbation, we had better know how it behaves in the absence of perturbation. As \(t\to \infty\), how does the flow map an initial distribution of states to a final distribution? If we try to naively evaluate \(\lim_{t\to \infty} \rho(x,t)\), we immediately run into a problem. Except in very special cases, \(\rho\) will exponentially grow/shrink and/or oscillate indefinitely, so the limit doesn't exist. There are two ways to resolve this problem, and each tells us different physically meaningful things about the system.

Method 1: Normalizing with Time Averages

The first way is to just say that infinities and non-stationary distributions are totally fine as long as we only care about expectation values. We know that \(\lim_{t \to \infty} \int_{\mathbb{R}^n} \rho(x,t)\, dx\) is finite, in fact it's just equal to 1 by definition. However, for a smooth and bounded function \(\phi(x)\), its ensemble average \(\int_{\mathbb{R}^n} \phi(x)\rho(x,t)\, dx\) may oscillate indefinitely as \(t\to \infty\), so the limit will not exist. For this reason, the only limits we can safely evaluate have to be both ensemble and time averages. To do this, we will need three things:

  • \(f\) is continuous.
  • There exists a closed and bounded invariant set \(\Omega \subset \mathbb{R}^n\), meaning trajectories starting in \(\Omega\) will stay in \(\Omega\) forever.
  • \(\Omega\) is an absorbing set, meaning every trajectory in \(\mathbb{R}^n\) ends up in \(\Omega\) in finite time.
Those are precisely the conditions for the existence of a global attractor. Note that global just means "everything ends up in the set", but the set itself can be made up of several disjoint subsets. Knowing there exists a global attractor is a very good starting point for understanding the long-time behavior of the system! We can now start to analyze the properties of the attractor.

Recall that our definition of Lyapunov exponents from before was exactly the time average of an ensemble average. Now we know that if a global attractor exists, then \(\bar\lambda_i\) is guaranteed to exist. We can use this to see that the time-averaged rate of change of differential entropy is the sum of the Lyapunov exponents: \[ \begin{aligned} h(t)\, &\colon\!= -\int_{\mathbb{R}^{n}} \rho(x,t) \log \rho(x,t)\,dx & \text{(Definition of differential entropy)} \\ \implies \sum_{i=1}^{n}\bar\lambda_i &= \sum_{i=1}^{n} \lim_{t\to\infty}\int_{\mathbb R^{n}} \frac{1}{t} \log \sigma_i(\nabla \Phi(x,t))\, \rho_0(x)\,dx &\\ &= \lim_{t\to\infty}\int_{\mathbb R^{n}} \frac{1}{t} \log \operatorname{det}(\nabla \Phi(x,t))\, \rho_0(x)\,dx &\\ &= \lim_{t\to\infty}\frac{1}{t} \int_{\mathbb R^{n}} \bigl( \log\rho_0(x) - \log \rho(\Phi(x,t),t) \bigr)\, \rho_0(x)\,dx &\\ &= \lim_{t\to\infty}\frac{1}{t} \left(-h(0) - \int_{\mathbb R^{n}} \rho(x,t) \log \rho(x,t)\,dx\right) & \text{(change of variables)}\\ &= \lim_{t\to\infty}\frac{h(t)-h(0)}{t}&\\[4pt] &= \;\bigl\langle\dot h\bigr\rangle_{t} \end{aligned} \] Another related identity is that: \[\large \begin{equation} \dot h(t) = \int_{\mathbb{R}^n} ( \nabla \cdot f(x))\, \rho(x,t) \, dx = \big\langle \nabla \cdot f \bigr\rangle_{\rho_t} \label{5} \end{equation} \]
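
As a sanity check on \eqref{5}: for the Lorenz system the divergence is the constant \(-(\sigma + 1 + \beta) \approx -13.67\), and the commonly quoted Lyapunov exponents \((0.906, 0, -14.57)\) do indeed sum to it: \[\large \sum_{i=1}^3 \bar\lambda_i \approx 0.906 + 0 - 14.57 \approx -13.67 = \big\langle \nabla \cdot f \big\rangle_{\rho_t} \]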

Method 2: Normalizing by Coarse-Graining

The second way to resolve the problem of the diverging state distribution is to coarse-grain \(\rho\). This instantly solves the problem of singular (infinite) densities, since the infinities always appear below any finite resolution. It does not, however, solve the problem of non-stationary/oscillating densities. To avoid that problem this time around, we are going to restrict our focus to the coarse-grained entropy of the attractor.

There are a number of ways you could go about coarse-graining the density. The simplest way is to discretize the state space into cubes of side length \(\varepsilon\) (volume \(\varepsilon^n\)) and redefine the expectation integral with a sum. Another way is to convolve the density with a Gaussian kernel with covariance \(\varepsilon^2 I\). The entropy associated with the cube coarse-graining is defined as: \[\large H_\text{cube}^{\varepsilon}(t) = -\sum_{i=1}^N p_i(t) \log p_i(t), \quad p_i(t) = \int_{C_i} \rho(x,t)\, dx \] where \(C_i\) is the \(i\)th \(\varepsilon\)-cube. The Gaussian coarse-graining is defined as: \[\large H_\text{conv}^{\varepsilon}(t) = -\int_{\mathbb{R}^n} \rho(x,t) \log \left(\int_{\mathbb{R}^n} \rho(y,t)\, e^{-\frac{\|x-y\|^2}{2\varepsilon^2}}\, dy\right)\, dx \] This is known to be related to the differential entropy by the following asymptotic \(\varepsilon \to 0\) relation: \[\large H^{\varepsilon}(t) = h(t) - \mathcal{D}_I \log \varepsilon + \mathcal{O}(1), \quad \mathcal{D}_I \in [0,n] \] where the \(\mathcal{O}(1)\) terms come from the particular choice of coarse-graining procedure. \(\mathcal{D}_I\) is a constant known as the information dimension of the attractor. In this form, \(\mathcal{D}_I\) just looks like an empirical constant, and that's the way it's largely been treated in the literature, but it's a very important, physically relevant quantity! It answers the question: how many real numbers do I need to encode the state of a system? This is opposed to the classical notion of entropy, which only tells you how many bits you need in total. This is about as close as you can get to a universal definition of complexity in a physical system! The biggest problem with information dimension is that it's exactly \(n\) at all finite times (for any smooth initial density). It's defined in a way that only after taking the limit as \(t \to \infty\) can you ever get a smaller number. This makes it impossible to assign an intermediate value or associate an "energy" to this "force". What causes information dimension to decrease? Until we have a definition which has a meaningful finite-time value and a finite rate of change, we cannot pin down the cause. That's why I'm working on defining a new quantity which equals the information dimension in the limit, but has physically meaningful intermediate values. Paper coming soon!
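
To get a feel for \(\mathcal{D}_I\) in practice, here is a rough sketch (assuming NumPy/SciPy): sample a long Lorenz trajectory as an empirical stand-in for the late-time density, compute the cube-coarse-grained entropy \(H^\varepsilon\) at several resolutions, and read a dimension estimate off the slope of \(H^\varepsilon\) versus \(-\log \varepsilon\). With enough samples this comes out near the Kaplan-Yorke estimate of roughly 2.06 for the Lorenz attractor, though finite-sample effects bias it downward at small \(\varepsilon\):

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, x, s=10.0, r=28.0, b=8.0 / 3.0):
    return [s * (x[1] - x[0]), x[0] * (r - x[2]) - x[1], x[0] * x[1] - b * x[2]]

# Sample the attractor: discard a transient, then record a long trajectory.
t_eval = np.arange(50.0, 2050.0, 0.01)
sol = solve_ivp(lorenz, (0.0, t_eval[-1]), [1.0, 1.0, 1.0], t_eval=t_eval, rtol=1e-8)
pts = sol.y.T                                   # samples approximating the late-time density

def coarse_entropy(points, eps):
    """Cube coarse-grained entropy H^eps from empirical box probabilities."""
    cells = np.floor(points / eps).astype(np.int64)
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

eps_list = np.array([4.0, 2.0, 1.0, 0.5, 0.25])
H = np.array([coarse_entropy(pts, e) for e in eps_list])

# H^eps ~ h - D_I * log(eps): the slope of H against -log(eps) estimates D_I.
slope = np.polyfit(-np.log(eps_list), H, 1)[0]
print(slope)                                    # roughly 2, near the Kaplan-Yorke value ~2.06
```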