For the record, I'm working on a topological data analysis project that is in some ways quite similar to this, and I'd be astonished if a single community hospital has the depth needed. We're working with thirty-some.
Thank you for your feedback. I didn't know that topological data analysis even existed! I have some published and unpublished research on the topic.
I picked this particular dataset at random. I wanted something with integers, a large sample size, and high dimension.
I'm glad you did the community detection stuff. I've found it useful when the dimension is really large (on the order of a thousand variables), for instance when doing text mining.
very nice comment!
Topological Data Analysis was the next big thing a couple of years ago, and it certainly has its applications [1], but the breadth of its applicability may have been overhyped and the energy has fizzled somewhat.
The most well-known practitioners of this sort of 'topological' approach are Ayasdi; they have some slick demos [1]. The general name for this idea is topological data analysis [2].
I replicated this particular experiment in WL, of course, because it's a 5-minute thing to do [3], and I could actually do the community detection the author alluded to.
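The author's replication was in WL; here is a rough Python analogue of the same idea, sketched under the assumption that the dataset is the one shipped as sklearn's `load_digits` (an 8x8 version of the NIST handwritten digits). The 0.4 correlation threshold is an arbitrary choice for illustration, not taken from the original.

```python
# Sketch: treat the 64 pixel variables as graph nodes, connect strongly
# correlated pairs, then run greedy modularity community detection.
# (Assumptions: sklearn's digits dataset stands in for the original data;
# the 0.4 threshold is arbitrary.)
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.datasets import load_digits

X = load_digits().data                            # shape (1797, 64)
C = np.nan_to_num(np.corrcoef(X, rowvar=False))   # constant pixels give NaN rows

G = nx.Graph()
G.add_nodes_from(range(C.shape[0]))
for i in range(C.shape[0]):
    for j in range(i + 1, C.shape[0]):
        if abs(C[i, j]) > 0.4:                    # keep only strong correlations
            G.add_edge(i, j, weight=abs(C[i, j]))

communities = greedy_modularity_communities(G)
print(len(communities), "communities on", G.number_of_nodes(), "nodes")
```

The threshold step is where information gets thrown away, which may be why the raw correlation matrix looks more suggestive than the resulting graph.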
But I noticed that the correlation matrix itself is much more suggestive than the graph ends up being, with or without community detection. Take a look at the correlation matrix (note that MatrixPlot does some clever combination of rank and absolute value to get high dynamic range):
The tri-diagonal structure arises because the original dataset is derived from pixel counts over successive 4x4 tiles of NIST handwritten-digit images [4].
That 8x8 matrix of tiles is flattened into the 64 random variables, so the large correlation with tiles to the left and right explains the off-by-1 orange diagonal lines; the other two diagonals are offset by 8 and correspond to the high correlation with the tiles above and below. That's the 'connectivity kernel' of a 2D manifold, so to speak.
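The claimed diagonal structure can be checked numerically. Assuming sklearn's `load_digits` is the same 8x8 dataset, the mean absolute correlation along the offset-8 diagonal (vertically adjacent tiles) should dominate a non-neighbour offset such as 13, picked here arbitrarily as a control.

```python
# Sanity check of the off-diagonal claim: in a row-major 8x8 flattening,
# flat-index offset 8 pairs vertically adjacent tiles, while offset 13
# pairs spatially distant tiles.
import numpy as np
from sklearn.datasets import load_digits

X = load_digits().data
C = np.nan_to_num(np.corrcoef(X, rowvar=False))  # constant pixels give NaN rows

def mean_abs_at_offset(C, k):
    """Mean |correlation| along the k-th superdiagonal."""
    return np.abs(np.diagonal(C, offset=k)).mean()

print("offset 8:", mean_abs_at_offset(C, 8))
print("offset 13:", mean_abs_at_offset(C, 13))
```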
The squiggles in all the other blocks of this matrix are curious. I don't know what's going on there; maybe something interesting.
The total is about a million nodes, though I'm fairly sure the most interesting data is a subset of about 50,000 nodes. Each node has on the order of 100 edges. Since that's still a lot, I'll have to rethink my plan, I'm afraid.
Thanks. I skimmed the linked "Mapper" article they cite as their method, and it looks about as topological as t-SNE, i.e. in the sense of caring about local nearness but not global distance.
But did I miss any heavier stuff? Is this all people mean when they talk about topological data analysis?
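For readers unfamiliar with Mapper, a minimal sketch may make the "local nearness, no global distance" point concrete. This is not the cited implementation; it assumes a 1-D lens, an overlapping interval cover, and single-linkage clustering at a fixed cut distance, which is roughly the recipe the Mapper paper describes.

```python
# Minimal Mapper sketch: cover the lens values with overlapping intervals,
# cluster the points in each interval's preimage, make one node per cluster,
# and join two nodes whenever their clusters share a point.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def mapper_graph(X, lens, n_intervals=8, overlap=0.3, cut=1.0):
    lo, hi = lens.min(), lens.max()
    width = (hi - lo) / n_intervals
    nodes, edges = [], set()
    for k in range(n_intervals):
        a = lo + k * width - overlap * width          # interval with overlap
        b = lo + (k + 1) * width + overlap * width
        idx = np.where((lens >= a) & (lens <= b))[0]
        if len(idx) == 0:
            continue
        if len(idx) == 1:
            labels = np.array([1])
        else:
            labels = fcluster(linkage(X[idx], method="single"),
                              t=cut, criterion="distance")
        for lab in np.unique(labels):
            nodes.append(set(idx[labels == lab]))
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if nodes[i] & nodes[j]:                   # shared points -> edge
                edges.add((i, j))
    return nodes, edges

# Example: points on a circle, lensed by their x-coordinate. Mapper should
# recover the circle as a cycle graph, knowing nothing about global distances.
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)]
nodes, edges = mapper_graph(X, X[:, 0], n_intervals=6, overlap=0.3, cut=0.5)
print(len(nodes), "nodes,", len(edges), "edges")
```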
The paper seems to be using homology as the topological feature here. I've done some work in topological data analysis before, and it feels like the hidden issue is that computing homology is generally very inefficient (since it usually amounts to reducing an n×n matrix).
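The matrix reduction being alluded to is the standard persistence algorithm: columns of the filtered boundary matrix, over Z/2, are XORed left to right until every non-empty column has a unique lowest row, which is worst-case cubic in the number of simplices. A toy sketch on a filtered triangle:

```python
# Standard persistence reduction over Z/2. Columns are stored as sets of
# row indices; we repeatedly XOR a column with the earlier column owning the
# same lowest row. Paired (birth, death) indices give finite persistence
# intervals; unpaired birth simplices give essential (infinite) classes.
def reduce_boundary(cols):
    cols = [set(c) for c in cols]
    low_of = {}                       # lowest row index -> owning column
    pairs = []
    for j in range(len(cols)):
        while cols[j] and max(cols[j]) in low_of:
            cols[j] ^= cols[low_of[max(cols[j])]]     # Z/2 column addition
        if cols[j]:
            low_of[max(cols[j])] = j
            pairs.append((max(cols[j]), j))
    paired = {i for p in pairs for i in p}
    essential = [j for j in range(len(cols)) if j not in paired and not cols[j]]
    return pairs, essential

# Filtration of a triangle: vertices 0,1,2, then edges 01,12,02, then the
# 2-cell whose boundary is the three edges.
boundary = [[], [], [],
            [0, 1], [1, 2], [0, 2],
            [3, 4, 5]]
pairs, essential = reduce_boundary(boundary)
print(pairs, essential)
```

Here the loop born with edge 02 dies when the triangle fills it in, and one connected component (vertex 0) survives forever; on real filtrations the column sets grow large, which is where the inefficiency bites.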
It definitely feels like graphs and topology should be helpful tools for working with data (since graph-like structures are good representations of the real world), but we need to solve this efficiency issue before that can happen.
Also, to address the confusion about how category theory comes into it: category theory studies abstract structures where you have objects and relationships between those objects. A lot of algebraic topology (which is the sort of topology relevant here) is built in the language of category theory (either by necessity or by convention).
I'm nearly salivating at the thought of writing graph algorithms at that scale . . . and actually having the outcome mean something and be acted upon in a timely fashion. It sounds like a dream job to me. That scale and depth of information is a very powerful tool, no doubt, and it should be wielded for a good purpose. This article at least encourages me that people are thinking beyond the bottom line on these issues. Awesome.
Yeah, now I'm wondering what topological skeletons might have in common with the abstract simplicial complexes generated by running a persistent homology algorithm on a point cloud.
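For concreteness, the simplicial complexes fed into persistent homology are usually Vietoris-Rips complexes: one simplex for every subset of points that are pairwise within a scale parameter eps. A brute-force sketch up to dimension 2 (this is illustrative only; real libraries build these far more cleverly):

```python
# Vietoris-Rips complex of a point cloud at scale eps, up to triangles.
# A k-vertex subset is a simplex iff all its pairwise distances are <= eps.
import numpy as np
from itertools import combinations

def rips_complex(points, eps, max_dim=2):
    n = len(points)
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    simplices = [(i,) for i in range(n)]
    for k in range(2, max_dim + 2):               # k = number of vertices
        for s in combinations(range(n), k):
            if all(d[i, j] <= eps for i, j in combinations(s, 2)):
                simplices.append(s)
    return simplices

# Four corners of a unit square at eps = 1.1: the four sides (length 1)
# become edges, the diagonals (length ~1.41) do not, so no triangles form
# and the complex is a hollow square -- a loop persistent homology detects.
pts = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)
S = rips_complex(pts, eps=1.1)
print(S)
```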
Also very cool is the detection of branching in the data via the computation of persistent Borel-Moore homology. This is the method that was used in their cancer study.