“Model-free” analysis of a complex system. Part II
Or: when you don't even see what your model is
For this post, as for the previous ones, there is a podcast. Click on the “Our podcast” menu above to find it on Spotify.
This is the second part of a series about the widespread opinion that there are “model-free” methods for the analysis of a complex system (or of anything else, in general). You can find Part I of this series here:
In that post I discussed the several assumptions that are buried behind widespread procedures for the analysis of complex systems (such as the human brain, gene regulatory networks, financial networks, …) in terms of some measure of causality or correlation that is later used for network analysis.
Here, triggered by a long and stimulating exchange with Nicole Rust on BlueSky (you can follow me there), I am going to discuss why other methods (not necessarily based on networks) can be limited in scope, are not model-free and should not be used as magic black boxes for the problem at hand.
In a nutshell, my suggestion is always the same: if you are not an expert in a method, whatever it is, you should first use it on synthetic data that you fully control, under different levels of complexity (length, noise, …). Once you have understood the limitations of the method, you can apply it to real data. Skipping this exploratory part, which improves your understanding of the method, can lead to poor results.
Empirical Dynamical Modeling in a nutshell
Empirical Dynamical Modeling (EDM, from now on) is a powerful framework rooted in nonlinear dynamical systems theory. Its primary goal is to infer and analyze the dynamics of a system directly from time-series data, without relying on pre-defined parametric models: EDM operates under the important assumption that the behavior of a system can be reconstructed from observed data because an underlying attractor exists.
It was popularized by Sugihara and May in a pioneering paper published in 1990 and became known for allowing “equation-free” forecasting of nonlinear dynamical systems. In another paper, its “model-free” nature was pushed even further.
But is it really model-free? The short answer is NO. The long answer is, well NO.
In fact, the “model” used to do the magic is the one reconstructed from the data itself: thanks to an old yet powerful theorem it is possible, under strict conditions, to approximate the attractor of a nonlinear dynamical system from the observation of one of its time series (e.g., the observation of one of its components). Since the system lives along its attractor, that’s exactly the object you need to reconstruct to overcome your ignorance of the mechanisms (which we usually approximate through ODEs).
It is a very powerful approach, since it exploits the underlying geometry of the system to predict its future states from the knowledge of its past ones. It works as if you were an ant walking on a manifold: from your current location you cannot move anywhere in space, but must follow the paths allowed by the geometric shape…
Note, by the way, that it is only the tip of a giant iceberg: there are countless papers on the reconstruction of attractors from the observation of time series. It would be impossible to mention all of them in this short post, so I’ll do my best to mention the ones relevant for our discussion.
Foundations: delay embedding and Takens' theorem
A cornerstone of EDM is Takens' theorem (1981), which provides the mathematical foundation for reconstructing the state space of a dynamical system. The theorem states that, under the assumption that an underlying low-dimensional attractor exists, the dynamics of a system described by a deterministic autonomous equation like

$$\dot{\mathbf{x}}(t) = \mathbf{F}\big(\mathbf{x}(t)\big)$$

can be reconstructed from a sequence of observations through a delay embedding space, formed by vectors of the form:

$$\mathbf{y}(t) = \big(x(t),\, x(t+d),\, x(t+2d),\, \dots,\, x(t+(m-1)d)\big)$$

where x(t) is one component of the observed state, m is known as the embedding dimension and d is the time delay. If you are wondering whether you can use more than one component to enhance the reconstruction, the answer is yes.
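To make the delay-embedding construction concrete, here is a minimal sketch in Python (the function name delay_embed and the toy sine-wave example are mine, not taken from any specific EDM library):

```python
import numpy as np

def delay_embed(x, m, d):
    """Build the delay-embedding matrix of a scalar time series x,
    with embedding dimension m and time delay d (in samples).
    Row i is the vector (x[i], x[i+d], ..., x[i+(m-1)d])."""
    x = np.asarray(x, dtype=float)
    n_vectors = len(x) - (m - 1) * d
    if n_vectors <= 0:
        raise ValueError("time series too short for this choice of (m, d)")
    return np.column_stack([x[i * d : i * d + n_vectors] for i in range(m)])

# Example: embed a sine wave with m=3 and d=25
t = np.linspace(0, 20 * np.pi, 2000)
x = np.sin(t)
Y = delay_embed(x, m=3, d=25)
print(Y.shape)  # (1950, 3)
```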
While powerful, the theorem comes with an important practical requirement: we must be able to properly estimate d and m, the embedding parameters.
Challenges in estimating embedding parameters
The necessity of estimating the embedding parameters opened up a whole new research field, which was named embedology.
Fortunately, solutions arrived. At the beginning, it was proposed to use the first zero of the autocorrelation function to estimate the time delay: at that lag the time series is no longer correlated with itself, so it was plausible to use that value to capture an orbit before decorrelation takes place. That was a smart idea, but autocorrelation cannot capture nonlinear features, so a better method was proposed in 1986 by Fraser and Swinney. Based on mutual information, it was similar in spirit to the autocorrelation approach while relying on an information-theoretic quantity able to capture nonlinearities. Accordingly, the first local minimum of their average mutual information (AMI) was suggested as the delay time d. This is a first important assumption, since in principle other delays could be used as well (e.g., the values at other local minima).
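As an illustration of the AMI criterion, here is a minimal histogram-based sketch (the function names, the number of bins and the lag range in the commented usage are arbitrary choices of mine; dedicated packages use more careful estimators):

```python
import numpy as np

def average_mutual_information(x, lag, bins=32):
    """Histogram estimate of the mutual information between x(t) and x(t+lag)."""
    x = np.asarray(x, dtype=float)
    x1, x2 = x[:-lag], x[lag:]
    joint, _, _ = np.histogram2d(x1, x2, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))

def first_local_minimum(values):
    """Index of the first local minimum of a 1D sequence (None if not found)."""
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i] < values[i + 1]:
            return i
    return None

# Usage (hypothetical): scan lags 1..100 on a scalar series x
# lags = range(1, 101)
# ami = [average_mutual_information(x, lag) for lag in lags]
# d = list(lags)[first_local_minimum(ami)]
```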
The problem of determining the parameter m was successfully attacked by Kennel, Brown and Abarbanel in 1992, who proposed the false nearest neighbors (FNN) method. It checks whether points that look close together in a lower-dimensional embedding still stay close when you add more dimensions. If not, it means the lower-dimensional view is distorting the data, and more dimensions are needed to faithfully reconstruct the underlying attractor. The process continues until adding extra dimensions no longer changes the relationships between points. In principle, this algorithm ensures that the reconstructed space accurately reflects the system’s dynamics without unnecessary complexity, but in practice it depends on an arbitrary definition of “good enough”, i.e., the fraction of false neighbors below which one considers the reconstruction acceptable. Let’s show this with one practical example, the famous Lorenz attractor described by the set of equations

$$\dot{x} = \sigma (y - x), \qquad \dot{y} = x(\rho - z) - y, \qquad \dot{z} = xy - \beta z$$

where the coefficients have been chosen to guarantee that the system is in its chaotic regime (if you don’t know what chaos is, let’s say that at low dimension it looks like a pseudo-periodic time series with a strong sensitivity to initial conditions, at variance with periodic, non-chaotic, systems).
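For reference, a trajectory like the one used below can be generated with a few lines of scipy; the parameter values in this sketch are the standard chaotic choice (σ=10, ρ=28, β=8/3) and the integration settings are arbitrary, so they do not necessarily match the ones behind my figures:

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

# Integrate, discard a transient so the trajectory settles onto the attractor,
# and keep 10,000 equally spaced samples of the x component.
t_eval = np.linspace(0, 100, 12000)
sol = solve_ivp(lorenz, (0, 100), [1.0, 1.0, 1.0], t_eval=t_eval, rtol=1e-9, atol=1e-9)
x = sol.y[0, 2000:]  # scalar observable used for the delay embedding
print(x.shape)       # (10000,)
```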
I have generated a trajectory of 10,000 samples: calculating the AMI, the first local minimum is around d=30, while the FNN is below 5% (my arbitrarily chosen threshold) for m>6. That’s in perfect agreement with what we expect from the theory: for a system with a correlation dimension D we need the embedding dimension m to be around 2D for a good reconstruction, and for the Lorenz attractor D≈2.05.
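And here is a compact, simplified sketch of the FNN criterion in the spirit of Kennel et al. (the tolerance rtol, the neighbor search and the 5% cutoff in the commented usage are the arbitrary choices discussed above; the full algorithm includes an additional test on the attractor size that I omit here):

```python
import numpy as np
from scipy.spatial import cKDTree

def false_nearest_fraction(x, m, d, rtol=15.0):
    """Fraction of nearest neighbors in dimension m that become 'false'
    when the embedding dimension is increased to m + 1 (simplified Kennel test)."""
    x = np.asarray(x, dtype=float)
    n = len(x) - m * d                      # vectors that also exist in dimension m+1
    Y_m = np.column_stack([x[i * d : i * d + n] for i in range(m)])
    extra = x[m * d : m * d + n]            # the coordinate added in dimension m+1
    tree = cKDTree(Y_m)
    dist, idx = tree.query(Y_m, k=2)        # k=1 is the point itself
    d_m, nn = dist[:, 1], idx[:, 1]
    valid = d_m > 0
    growth = np.abs(extra[valid] - extra[nn[valid]]) / d_m[valid]
    return np.mean(growth > rtol)

# Usage (hypothetical): scan m = 1..10 at a fixed delay d and keep the first m
# for which the fraction drops below an (arbitrary) 5% threshold.
# fractions = [false_nearest_fraction(x, m, d=30) for m in range(1, 11)]
```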
As you can see from the above image, the original attractor (its bidimensional projection) in the bottom-left panel is well reconstructed using d and m from the previous analyses, as shown in the bottom-right panel.
The first time I saw this was in 2007, when I was working on my thesis, characterizing the nature of the flow of high-energy particles detected by the Pierre Auger Observatory.
It looked like magic.
In practice, the delay embedding exploits the intrinsic dynamical correlations in the data to reconstruct the underlying geometry (i.e., the attractor): an amazing intuition by Takens. The almost incredible aspect is that even if your state has 3, 4 or 5 dimensions (i.e., degrees of freedom), it is enough to observe one of them and then use Takens’ theorem to reconstruct the whole geometry.
Why is this feature so powerful? Because you can exploit it to forecast the future states of your system!
Again, at the beginning it was mind-blowing, but once I understood the underlying idea it was simply elegant and powerful. If you correctly reconstruct the attractor, you don’t need the equations of motion (i.e., the aforementioned ODEs) to predict the next state: you are in the state space and you are traveling along a trajectory constrained by temporal correlations, therefore you are not allowed to move next into any possible state in the state space (see our ant, above). You are allowed to flow only towards a few possible states: therefore, by looking at a bunch of past orbits close to your current one (e.g., within a hypersphere), you can see where they went next (within another hypersphere) and use that knowledge to compute an average orbit towards which your current state is likely to go. Simply amazing, because it works like a charm.
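A minimal sketch of this nearest-neighbor (simplex-projection-style) forecasting idea, assuming a scalar series x and embedding parameters already chosen; the weighting scheme and the function name are my own simplifications, not the exact algorithm used in EDM packages:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_forecast(x, m, d, k=5, horizon=1):
    """Predict x[t + horizon] for the last reconstructed state, by averaging
    where its k nearest past neighbors went 'horizon' steps later."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (m - 1) * d
    Y = np.column_stack([x[i * d : i * d + n] for i in range(m)])
    query = Y[-1]                              # current state (last delay vector)
    library = Y[: n - horizon]                 # past states whose future is known
    tree = cKDTree(library)
    dist, idx = tree.query(query, k=k)
    w = np.exp(-dist / (dist.min() + 1e-12))   # closer orbits count more
    futures = x[idx + (m - 1) * d + horizon]   # observed values 'horizon' steps ahead
    return np.sum(w * futures) / np.sum(w)

# Usage (hypothetical): next_value = knn_forecast(x, m=6, d=30, k=10)
```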
But… there is always a but, as you can imagine. This is not magic, it’s science, and there are limitations. Even under the assumption that your attractor is well reconstructed (see next section), the prediction of future states is strongly limited by the knowledge of the past ones; therefore, if you don’t have a long enough observation, you could be severely limited in the averages that you predict. Another strong limitation comes from the theory itself: the goodness of the predicted orbit strongly depends on the Lyapunov horizon of the system. Chaotic systems are characterized by at least one positive Lyapunov exponent, which quantifies the rate at which two close orbits diverge exponentially, and you can think of the Lyapunov horizon as proportional to the inverse of the largest Lyapunov exponent. The closer the exponent is to zero, the longer your system stays “less chaotic” and the longer you can predict it reliably. Conversely, the larger the exponent, the smaller the temporal window within which your prediction will be accurate.
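To make the last point concrete: if two nearby orbits are initially separated by a small error δ₀, on average they diverge as δ(t) ≈ δ₀ exp(λ_max t), so the time needed for the error to reach a tolerated value Δ, i.e., the prediction (Lyapunov) horizon, scales as

$$t_{\text{horizon}} \approx \frac{1}{\lambda_{\max}} \ln\frac{\Delta}{\delta_0}$$

(here δ₀, Δ and λ_max are generic symbols I am using for illustration, not quantities defined elsewhere in this post).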
The issue is that you have no way to know, a priori, what the limitations of the framework are. You have some empirical data and you see some patterns that suggest the existence of some deterministic behavior that could be generated by an attractor, so you assume there is an attractor. Under this assumption you somehow estimate the embedding parameters (which depend on some arbitrary choices on your side) and, at best, you can use the same knowledge to estimate the largest Lyapunov exponent and, from it, your prediction horizon.
At this point, it should be clear that embedology is a sort of art: lovely, powerful, but to be taken with a grain of salt, since I haven’t yet mentioned that it also requires your data to be nearly noise-free.
The impact of the embedding parameters and of noise
What happens if the chosen delay time is sub-optimal? After all, there are some recipes to choose it, but no ultimate method. Let’s see it together: starting from the same Lorenz system generated for the previous analysis and fixing the embedding dimension m=6 as before, let’s change the delay time:
Yes, we see that if the delay time is too short then the attractor is not correctly unfolded; if the delay time is too long, the attractor unfolds too much. In both cases it becomes almost useless, since you cannot fully rely on the trajectories in the state space for your prediction. And this is just the beginning… remember that we are still working with a low-dimensional, fully deterministic, noise-free system: the reconstruction is very sensitive to the choice of the delay time, not a desirable feature when there is no definitive way to find its optimal value.
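The corresponding experiment is easy to reproduce: the sketch below repeats the delay_embed helper from above, assumes x is the Lorenz series generated earlier, and plots the 2D projection of the 6-dimensional embedding for a few illustrative delays (not necessarily the exact values behind my figures):

```python
import numpy as np
import matplotlib.pyplot as plt

def delay_embed(x, m, d):                  # same helper as in the sketch above
    n = len(x) - (m - 1) * d
    return np.column_stack([x[i * d : i * d + n] for i in range(m)])

# Assumes `x` is the scalar Lorenz series generated earlier; the delay values
# below are illustrative choices spanning the too-short / too-long range.
delays = (1, 5, 20, 50, 100, 200)
fig, axes = plt.subplots(1, len(delays), figsize=(18, 3))
for ax, d in zip(axes, delays):
    Y = delay_embed(x, m=6, d=d)
    ax.plot(Y[:, 0], Y[:, 1], lw=0.3)      # 2D projection of the 6D reconstruction
    ax.set_title(f"d = {d}")
plt.tight_layout()
plt.show()
```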
There is more. Let’s try to add some noise, like the one we inevitably have when we perform any measurement. If we are lucky, the response of our instrument is linear and the noise is additive; otherwise the response can be nonlinear, and the noise will still be there. This is an important point: we never measure the true observable x(t), but a function h(…) of it that can be nonlinear and create problems for our reconstruction. Remember this the next time you perform nonlinear time series analysis, since you are going to build an attractor from this time series:

$$s(t) = h\big(x(t)\big) + \eta(t)$$

The second term on the r.h.s. of the equation is additive noise that, in the best case, is uncorrelated and Gaussian-like. Let’s see what happens to our reconstruction if different levels of additive Gaussian noise affect our measurement (and, to keep things simple, let’s assume that the function h(…) is just the identity or, more generally, linear).
To control the level of noise, let’s consider a variable signal-to-noise ratio (SNR). The idea is simple: our Gaussian noise has zero mean and some positive variance, and the SNR (in decibels) is defined by

$$\mathrm{SNR} = 10 \log_{10} \frac{\sigma^2_{\text{signal}}}{\sigma^2_{\text{noise}}}$$

An SNR of 10 dB means that the variance of the Gaussian noise is 10% of the variance of the signal or, equivalently, that the ratio between the two standard deviations is about 31%. To achieve a standard deviation for the noise that is 10% of that of the signal, your SNR needs to be 20 dB (i.e., a good measurement).
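A small sketch of how additive Gaussian noise at a prescribed SNR (in dB) can be generated, simply inverting the definition above (the function name and the use of numpy’s default_rng are my own choices):

```python
import numpy as np

def add_noise_at_snr(x, snr_db, rng=None):
    """Return x plus zero-mean Gaussian noise whose variance is set so that
    10 * log10(var(x) / var(noise)) equals snr_db."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    noise_var = np.var(x) / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(noise_var), size=len(x))

# Example: at 20 dB the noise standard deviation is ~10% of the signal's
# x_noisy = add_noise_at_snr(x, snr_db=20.0)
```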
I have calculated the AMI and the FNN for different levels of noise and this is the result:
The AMI is somewhat robust to noise above an SNR of 20 dB, while below that threshold the local minima start to change or to become less marked. For the FNN the impact is even more catastrophic: while for SNR larger than 50 dB we find that the fraction of false nearest neighbors is below 5% for m>6, for SNR=20 dB we need m=10, and for higher levels of noise the FNN never goes below 75%. Take a second and think about how dramatic this result is: if the standard deviation of the additive noise (due to your instrument) is larger than 10% of the standard deviation of the signal you are trying to measure, your chances of finding good embedding parameters are essentially gone.
To give an idea of the resulting reconstruction, I show the attractors below:
At SNR=20 dB we can barely see the attractor (and believe me: this is the case corresponding to the majority of empirical data), while for lower values of the SNR the “attractor” is not distinguishable from an Escherichia coli.
Yes, there are nonlinear techniques for noise reduction which, by the way, are based on additional assumptions and can sometimes dramatically alter the signal. I will not cover them here.
The impact of the signal length
Unfortunately, there is more to account for. After the original hype about the power of embedology, which lasted for a decade, in 1992 Eckmann and Ruelle came up with an unexpected finding: the length of your time series matters, a lot.
They didn’t specifically focus on the impact of length on the embedding reconstruction, but they focused on some of its direct effects: the estimate of (fractal) dimensions and Lyapunov exponents. Why? Well, for two obvious reasons:
The fractal dimension of a system is a proxy for its number of degrees of freedom; i.e., it provides guidance about the minimum number of equations needed to describe the system.
The Lyapunov exponents of a system will tell us if it is in a chaotic regime (at least one positive exponent) or not (all exponents are negative) or something else (all exponents are non-positive).
Their abstract is perfect:
We show that values of the correlation dimension estimated over a decade from the Grassberger-Procaccia algorithm cannot exceed the value 2 log10N if N is the number of points in the time series. When this bound is saturated it is thus not legitimate to conclude that low dimensional dynamics is present. The estimation of Lyapunov exponents is also discussed.
That is: if your time series has a given length N, the best estimate of the correlation dimension you can achieve will depend on N. That’s not good news at all for high-dimensional attractors: while for low-dimensional chaos the number of required points is still manageable, for a system with just 10 degrees of freedom one would need to measure a nearly noise-free time series of at least 100,000 samples. Maybe today this is not considered an ambitious goal, but in the 90s it was out of reach for most experimental setups.
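Spelling out the arithmetic behind that number: inverting the bound D ≤ 2 log₁₀ N quoted above gives

$$N \ge 10^{D/2}, \qquad \text{e.g. } D = 10 \;\Rightarrow\; N \ge 10^{5} = 100{,}000$$

which is exactly the figure mentioned for a 10-dimensional system.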
You did assume you had an attractor, right?
So far so good, but there is still more… when the noise is part of the dynamical process as well. That happens, essentially, in every type of measurement that we perform on biological systems. On the one hand, we have intrinsic/dynamical noise (e.g., think about the thermal fluctuations of a molecule suspended in some fluid, or the random directions taken by an E. coli while moving along a sugar gradient). On the other hand, we have the (possibly nonlinear) response function of our instrument with its own noise (due to mechanics, friction, electromagnetism, …). Overall, what we are really dealing with is this:

$$\dot{\mathbf{x}}(t) = \mathbf{F}\big(\mathbf{x}(t)\big) + \mathbf{G}\big(\mathbf{x}(t)\big)\,\boldsymbol{\xi}(t), \qquad s(t) = h\big(\mathbf{x}(t)\big) + \eta(t)$$

where we want to learn something about the unknown underlying governing laws (i.e., F and G) from the bare measurement s(t). Here ξ(t) is a vector of stationary noise with some correlation function, while G is, in general, a matrix.
Yes, it’s an ambitious goal at this point. Note that if G is not identically null, we have a beautiful stochastic process that, under weak and plausible conditions, leads to colored noise, i.e., autocorrelated noise: random numbers with a persistent or anti-persistent pattern. You might have heard of it, since its power spectrum shows a power-law scaling with some exponent, and the value of that exponent is associated with a specific “color”: e.g., pink noise has exponent -1, while white noise has exponent 0. A whole sub-field was dedicated to the study of this type of process, where real signals were first detrended to remove the deterministic part and then analyzed with techniques such as multifractal detrended fluctuation analysis.
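To give an idea, here is one of several possible ways to generate such colored noise, via spectral synthesis (the function name, the normalization and the convention spectrum ∝ 1/f^β are my own choices for illustration):

```python
import numpy as np

def colored_noise(n, beta, rng=None):
    """Generate n samples of noise with power spectrum ~ 1/f**beta
    (beta=0: white, beta=1: pink, beta=2: Brownian-like) via spectral synthesis."""
    rng = np.random.default_rng() if rng is None else rng
    freqs = np.fft.rfftfreq(n)
    amplitude = np.zeros_like(freqs)
    amplitude[1:] = freqs[1:] ** (-beta / 2.0)        # sqrt of the power spectrum
    phases = rng.uniform(0, 2 * np.pi, size=len(freqs))
    spectrum = amplitude * np.exp(1j * phases)
    x = np.fft.irfft(spectrum, n=n)
    return (x - x.mean()) / x.std()                   # standardize for convenience

# Example: 10,000 samples of pink-like noise (power spectrum exponent -1)
# xi = colored_noise(10_000, beta=1.0)
```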
Our fundamental assumption, before starting to play with embedology, was that we were actually convinced there was an attractor for our observed system. In 1989, Osborne and Provenzale (a friend and collaborator nowadays; I did my MSc thesis under his guidance) shook the field with another unexpected discovery:
We discuss a counter-example to the traditional view that stochastic processes lead to a non-convergence of the correlation dimension in computed or measured time series. Specifically we show that a simple class of “colored” random noises characterized by a power-law power spectrum have a finite and predictable value for the correlation dimension. These results have implications on the experimental study of deterministic chaos as they indicate that the sole observation of a finite fractal dimension from the analysis of a time series is not sufficient to infer the presence of a strange attractor in the system dynamics. We demonstrate that the types of random noises considered herein may be given an interpretation in terms of their fractal properties. The consequent exploitation of the non-Gaussian behavior of these random noises leads us to the introduction of a new time series analysis method which we call multivariate scaling analysis. We apply this approach to characterize several “global” properties of random noise.
Recap:
Before this paper, everyone was convinced that embedology could be safely applied because for purely stochastic systems the fractal dimension should plausibly diverge.
Thousands of papers were published between 1981 and 1989 showing that whatever signal we can think of (neurons, climate, earthquakes, ecosystems, finance, …) exhibited a finite correlation dimension.
After this paper, it was clear that colored noise (i.e., a broad class of stochastic processes) can have a finite correlation dimension: the assumption that we could distinguish chaos from random processes just by looking at this number was challenged forever.
In 1992, Provenzale et al. proposed robust tests to distinguish chaos from genuine stochastic processes, providing further guidance for the (otherwise uncontrolled) use of embedology as a tool to inspect time series:
The success of current attempts to distinguish between low-dimensional chaos and random behavior in a time series of observations is considered. First we discuss stationary stochastic processes which produce finite numerical estimates of the correlation dimension and K2 entropy under naive application of correlation integral methods. We then consider several straightforward tests to evaluate whether correlation integral methods reflect the global geometry or the local fractal structure of the trajectory. This determines whether the methods are applicable to a given series; if they are we evaluate the significance of a particular result, for example, by considering the results of the analysis of stochastic signals with statistical properties similar to those of observed series. From the examples considered, it is clear that the correlation integral should not be used in isolation, but as one of a collection of tools to distinguish chaos from stochasticity.
That’s probably the first paper calling for the use of more than one measure to make any assessment about a signal. Sounds familiar, doesn’t it? Still true today, but for other methods (tSNE, UMAP, networks, deep learning, …).
Take home message
What a ride! We have known since 1989 that even noise can lead to the reconstruction of (spurious) attractors and, still, the vast majority of users of embedology failed — in thousands of papers — to apply useful checks to recognize whether their data was chaotic or random. Again, it sounds very familiar (see my previous post concerning the uncontrolled use of network science on correlation networks, as well as Tiago Peixoto’s blog on the problems of modularity maximization).
Despite that, some efforts to generalize embedology to the noisy case are available in the literature. In this paper (and refs. therein) you can find the generalization of Takens’ theorem to the use of multiple time series to reconstruct an attractor. This paper adds some support for discrete signals with additive noise. But to really appreciate the problem of reconstruction in the presence of noise, you should read this paper by Casdagli et al., which appeared in 1991 (yes, 10 years after Takens’ paper and right in the middle of the hype for embedology):
We extend Takens' treatment, applying statistical methods to incorporate the effects of observational noise and estimation error. We define the distortion matrix, which is proportional to the conditional covariance of a state, given a series of noisy measurements, and the noise amplification, which is proportional to root-mean-square time series prediction errors with an ideal model. We derive explicit formulae for these quantities, and we prove that in the low noise limit minimizing the distortion is equivalent to minimizing the noise amplification.
We identify several different scaling regimes for distortion and noise amplification, and derive asymptotic scaling laws. When the dimension and Lyapunov exponents are sufficiently large these scaling laws show that, no matter how the state space is reconstructed, there is an explosion in the noise amplification - from a practical point of view determinism is lost, and the time series is effectively a random process.
In the low noise, large data limit we show that the technique of local singular value decomposition is an optimal coordinate transformation, in the sense that it achieves the minimum distortion in a state space of the lowest possible dimension. However, in numerical experiments we find that estimation error complicates this issue. For local approximation methods, we analyze the effect of reconstruction on estimation error, derive a scaling law, and suggest an algorithm for reducing estimation errors.
For an overview of the (other) techniques for embedding reconstruction, I recommend starting from this paper by Tan et al. (2023) and this paper by Pecora et al. (2007).
Let me finally remark that I have worked with embedology since 2007, and I definitely love it. It has shown that, under some conditions, it is possible to do a kind of magic with applied math, like reconstructing a geometry from the bare observation of a time series. Despite all its merits, it comes with several limitations:
it needs “long enough” data
it needs “almost noise-free” data, since the estimation of the embedding parameters is very sensitive to noise
it needs sanity checks to distinguish chaos from random processes
it needs to find the embedding parameters, which is still an (artistic) task depending on the choice of arbitrary points and thresholds
it reliably works only for low-dimensional chaos
If your next dream is to use embedology and EDM for the analysis of signals from the metabolic network of C. elegans or from the human brain, be very careful. These systems usually consist of thousands of nodes with coupled dynamics.
For instance, let’s consider a Kuramoto-like model on a network:

$$\dot{\theta}_i(t) = \omega_i + \sum_{j=1}^{N} K_{ij}\, g_{ij}\big(\theta_j(t) - \theta_i(t)\big)$$

where the classical Kuramoto model is recovered if g(…) = sin(…) for all pairs and the coupling strength is constant and equal to K/N. In this simplified case, the order parameter r (quantifying the amount of collective behavior) as a function of the coupling strength leads to a well-known transition:
For complex networks with heterogeneous structure things are more complicated, but the outcome would be similar. The point is that if your brain network is in a regime where collective behavior is strong (panel G, large values of the coupling strength), then the effective degrees of freedom of the system are few enough to open the door to applications of embedology. Otherwise, if collective behavior is weak or absent, the degrees of freedom are too many and, as we have discussed above, the system cannot be reliably distinguished from a random process.
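To make the transition above concrete, here is a minimal sketch of the classical all-to-all Kuramoto model (g = sin, coupling K/N) with a simple Euler integration and the standard order parameter r; all parameter values are arbitrary choices for illustration:

```python
import numpy as np

def kuramoto_order_parameter(K, N=500, dt=0.01, steps=5000, rng=None):
    """Simulate the classical all-to-all Kuramoto model (g = sin, coupling K/N)
    with Euler steps and return the time-averaged order parameter r."""
    rng = np.random.default_rng(0) if rng is None else rng
    omega = rng.normal(0.0, 1.0, N)          # natural frequencies
    theta = rng.uniform(0, 2 * np.pi, N)     # initial phases
    r_values = []
    for step in range(steps):
        z = np.mean(np.exp(1j * theta))      # r * exp(i * psi)
        r, psi = np.abs(z), np.angle(z)
        # Mean-field identity: sum_j (K/N) sin(theta_j - theta_i) = K r sin(psi - theta_i)
        theta += dt * (omega + K * r * np.sin(psi - theta))
        if step > steps // 2:                # discard the transient
            r_values.append(r)
    return np.mean(r_values)

# Sweeping K shows the well-known transition from incoherence (r ~ 0)
# to collective synchronization (r -> 1) as the coupling increases.
# rs = [kuramoto_order_parameter(K) for K in np.linspace(0, 4, 9)]
```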
Since it seems that systems like the brain tend to avoid living close to highly synchronized states — i.e., there are many degrees of freedom (at least at some observational scale) — a thorough analysis of the applicability of EDM to neuronal signals is needed.
In a paper about network reconstruction from time series (related to the previous post about the inverse problem) we have reported that:
Overall, analysis shows that the majority of the results obtained from the fuzzy analysis are consistent with the expectations, whereas the thresholding approach, regardless of the statistical method, tends to underestimate or overestimate the true values for varying thresholds. These results provide a strong indication that results from threshold models not only depend strongly on the value of the threshold, but also that there can be no thresholds for which, on average, a reliable measure of network indicators as simple as degree centrality and local clustering coefficient can be obtained.
We have shown that whatever correlation measure you use (cross-correlation, mutual information, spectral coherence, Spearman correlation or convergent cross mapping; the latter being based on EDM), your results, even for something as simple as the estimation of the node degree, will inevitably depend on:
the choice of the correlation measure;
the choice of the threshold to apply;
the nature of the underlying dynamics.
In light of this post, does it sound familiar? Yes, we find, at a larger scale, most of the issues discovered for the analysis of low-dimensional dynamical systems.
I do not expect, though, that this will prevent the appearance of new rounds of thousands of papers adopting such approaches without understanding their limitations and, above all, the fact that they are not really “model-free”. They are not assumption-free and, since assumptions are hypotheses, every time one assumes something one is literally building some form of model.
By the way, research is ongoing, and it cannot be excluded that approaches like manifold learning or Koopman operator theory might provide solutions to mitigate some or most of the current issues.
In the meanwhile, for many applications of interest (like the analysis of connectomes) it has recently been shown that old-fashioned methods based on the joint analysis of spatial and temporal correlations do their job, while overcoming most of the issues mentioned in this post and in the previous ones.
Thank you very much! Really nice post! I have only one small question. Can it be that, only in the noiseless case, attractors reconstructed for different delay values (1, 5, 20, 50, 100, 200) are still diffeomorphic to the original one? You use m=6, so the embedding space is 6-dimensional; however, you show only 2D projections of it.
Of course, the fact of being merely diffeomorphic does not make such projections useful. However, as I understand the formulation of Takens' theorem, such embeddings should still be diffeomorphic regardless of the choice of the delay, in the ideal case of noiseless data and infinite observation time.
I apologise in advance if it is a stupid question.
Thanks for the kind feedback.
In the ideal noiseless case (with infinite observation time) the attractors reconstructed with different delay values should still be diffeomorphic to the original attractor, as long as the embedding dimension satisfies the conditions of Takens' theorem. AFAIK, the choice of delay affects the geometry of the embedding but does not break the diffeomorphic relationship.
However, in noisy cases (i.e., the real world) the diffeomorphism guarantee no longer holds, and inappropriate delay values can significantly distort the attractor's structure. Even in the noiseless case, very small or very large delays can lead to reconstructions that are mathematically valid but less useful for practical analysis. I have shown the example of an increasing delay where the geometry of the attractor is too unfolded: using that geometry to predict the next state will surely lead to a poor estimate.
Regarding the 2D projections, you are right: they do not preserve the diffeomorphism to the original attractor and are just a way to grasp the higher-dimensional structure. I do not think that anyone uses the 2D projection beyond visual inspection, in fact.
In a nutshell, I could say that even if the math allows you, theoretically, to reconstruct the attractor (under the relatively strict requirements of Takens' theorem), it hardly guarantees that the reconstruction will be useful for your applications.