Topological Data Analysis: the algebraic topology of point data clouds?

This area combines ideas from Computational Topology, and Intelligent Data Analysis. Extending both my original research area in Topology and the Geometry of Information project in a practical direction is this area of Intelligent Data Analysis also called Topological Data Analysis (TDA). This new area, to quote from a DARPA website, `aims to develop mathematical concepts and techniques necessary to determine the fundamental geometric structures underlying massive data sets and then develop further tools to exploit that knowledge.' The methods used combine statistical and pattern recognition techniques with algorithmic approaches to topology developed initially for use in Visualisation and Computer Graphics (often called Computational Topology), and methods from Algebraic Topology itself. My research in this area is centred on two projects. (a) (Joint with R. Zwiggelaar (CS, Aberystwyth) and S.Cox (Mathematics and Physics, Aberystwyth)), Exploiting Data Topology and Manifolds in Medical Image Analysis aims to examine the relevance of ideas such as Persistent Homology , which measures the way in which features in the space can become apparent at certain filtering resolutions, survive for a certain range of filtering parameters and then disappear as that range is exceeded. (b) Using multiresolution techniques, the second project will look at the possibility of detecting phenomena related to fractal topology in artificially generated data. This will combine the persistent homology idea with methods and ideas from Shape Theory. These methods do not assume that the data space being observed is a manifold, as this is an unnecessary and unreasonable assumption if the phenemena are non-linear, for instance, with feedback in the system, as such systems often have strange attractors, which are far from being locally nice . (Such non-locally nice spaces are the objects of study for the area of Shape Theory.)


Here is an extract of the introduction to a (draft) paper making some of the points in more detail:

The subject of Topological Data Analysis (TDA) has emerged recently as being that part of Computational Topology concerned with applying the methods of that subject to the analysis of data sets, that are often of very large size; the methods used are adapted from algebraic and differential topology and are are closely related to those used for spatial reconstruction from scanned data in Visualisation, but the context is, theoretically, not limited to low dimensions nor to data of spatial origin nor, initially, to the visualisation of the data. Its aim, rather, is to give qualitative information on the data, allowing for statistical variation, noise etc. This, if that view is correct, puts it in a strange position. The invariants it puts forward will, if they are to be successful, have to give useful information about the data. The theory does require some hard work by the user to interpret the information it is giving in a usable way, but until it is interpreted, how can it be useful? It is probable that one can extend the methods from those currently being used, but also there should be an examination what new information can be obtained by using those same methods in new ways.

The sort of applications considered so far have been looking, for instance, for qualitative structure in the clusters obtained by some classifier. These methods usually assume the data comes from sampling a manifold or simplicial complex. To these we would suggest the addition of a new type of analysis related to the verification of a mathematical model for what theoretically might be a non-linear situation involving feedback and perhaps even some chaotic aspects. The idealised mathematical model would then be likely to predict that the experimental data might be as if sampled from a (possibly high dimensional) space that is embedded in some (even higher dimensional) Euclidean space, but this idealised model space need not form a manifold, and might even be fractal in its nature.

It has been argued that there are two related views of Spatial Representation. The first is representation of spaces, but the second, and for the purposes here the more basic one, is representation by spaces. In both cases, the space is an idealised object obtained as the limiting case of indefinitely refined observations of the context object or data. A mathematical model, if possible, will give a second idealised object, another `space'. As this second idealised space is typically observable only through finite approximations, it also is obtained as a limit. One eventual aim of TDA in this situation could be to provide a comparison between these two `spaces'. In other words, in this analysis, any space gives an idealisation of a context and whether or not that is a `spatial context', or even what `spatial context' means might be the subject of a lot of philosophical debate and we will not explore it more here.

The classical methods of algebraic topology had quite a lot to say about the `invariants' of such limiting spaces, and we will look briefly at some of these methods later. They, by themselves, are not algorithmically efficient, and sometimes not even feasible, but they can be adapted to give much more computationally `friendly' tools. We will briefly review the history of these tools to show how, on a limiting space, the information on the (finite) approximations relates to that on the limit. Here some examples show relevant behaviour for our `non-linear model' thought experiment. The critical phenomena in such examples relates closely to the analysis of not so much the approximations to the limit, but to the refinement or comparison maps between them. This means that we must examine whether the available tools (in particular the various forms of homology) of the present form of TDA can be pushed beyond their current analysis of objects to help in the analysis of maps, and, in the limit, to give qualitative information on the idealised limiting space.


This page was last modified on 28-10-2005. T.P.