I get the impression that much of the popularity of t-SNE and UMAP comes from their ability to detect clusters. But then, I think: if you want to detect clusters, wouldn’t it be better to use a clustering algorithm which is designed for that purpose and not constrained by the need to make nice pictures?

I think I have some insight into why UMAP works well, but it has nothing to do with category theory.

In a high dimensional space, all the distances between pairs of points from a finite set are about the same. This isn’t quite true of course, but if someone hands you a set of high dimensional real-world data, it’s a good guess. Taking the MNIST data for example, over 99% of the pairwise distances are between 9 and 14. Almost all are between 5 and 15. If these are to be sensibly mapped to 2 dimensions, something major needs to happen to this distribution of distances.
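A minimal sketch of this concentration effect, assuming synthetic i.i.d. Gaussian points rather than MNIST (real data has more structure, so its distances are less tightly concentrated than in this toy case). The function names, sample size, and seed are made up for illustration; 784 is the number of pixels in an MNIST image.

```python
import math
import random


def pairwise_distances(points):
    """Return the Euclidean distance for every unordered pair of points."""
    return [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]


def max_min_ratio(dim, n_points=60, seed=0):
    """Ratio of the largest to the smallest pairwise distance.

    Points are drawn from a standard Gaussian in `dim` dimensions
    (an assumption for illustration, not real-world data).
    """
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]
    dists = pairwise_distances(points)
    return max(dists) / min(dists)


ratio_2d = max_min_ratio(dim=2)
ratio_784d = max_min_ratio(dim=784)
print(f"2 dimensions:   max/min = {ratio_2d:.1f}")
print(f"784 dimensions: max/min = {ratio_784d:.2f}")
```

In 784 dimensions the ratio stays close to 1, while in the plane it is far larger, which is the mismatch a 2-dimensional embedding has to absorb.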

Puzzle: how many points can you place in the plane so that the maximum distance between any pair is no more than 3 times the minimum?

To make local distances from each point, UMAP subtracts the distance to the nearest neighbour. This is done to ensure local connectedness. But it has a big effect on the distribution of distances. I guess that roughly speaking, the range 9 to 14 becomes 0 to 5 for a typical point in the MNIST data.
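As a rough sketch of that shift, assuming a single point with four neighbours at made-up distances in the 9 to 14 range (the function name and numbers are hypothetical, and this covers only the subtraction step, not the rest of UMAP):

```python
def local_distances(dists_to_neighbours):
    """Shift one point's neighbour distances so the nearest neighbour sits at 0."""
    rho = min(dists_to_neighbours)  # distance to the nearest neighbour
    return [max(0.0, d - rho) for d in dists_to_neighbours]


example = [9.0, 10.5, 12.0, 14.0]  # hypothetical distances from one point
shifted = local_distances(example)
print(shifted)  # [0.0, 1.5, 3.0, 5.0]
```

Before the shift the farthest neighbour is about 1.6 times as far as the nearest; afterwards the nearest is at distance 0, so the relative spread is unbounded.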

UMAP then converts these to asymmetric similarities which have to be made symmetric using some function f(p,q). I would expect this to satisfy f(p,p) = p. If a pair of similarities are already the same, why would you change them? In UMAP f(p,p) = 2p – p^2. So, for example, a similarity of .9 becomes .99. This corresponds to more squashing of small distances.
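A small numerical check of that formula, assuming the two-argument form is the probabilistic sum p + q - pq (which reduces to 2p - p^2 when p = q); the function name is made up for illustration:

```python
def fuzzy_union(p, q):
    """Probabilistic sum p + q - p*q; assumed form of the symmetrising f(p, q)."""
    return p + q - p * q


for p in (0.1, 0.5, 0.9):
    # Equal inputs are pushed upwards instead of being left unchanged.
    print(f"f({p}, {p}) = {fuzzy_union(p, p):.2f}")
```

Every similarity strictly between 0 and 1 increases, so the corresponding distances shrink.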

In summary, the ways that UMAP ensures local connectivity and symmetry have drastic side effects on distances, and these side effects are a good thing if you are making a large reduction in dimensionality.

]]>**The details of UMAP are not constrained by category theory.**

Instead, the algorithm’s details seem to have been chosen to make it work in practice, and sometimes justified by analogy to concepts from pure math. For example:

UMAP’s score function is based on probability theory, but probability appears nowhere in the Barr/Spivak framework.

Spivak’s use of the log function was arbitrary, but it plays a key role in UMAP (as mentioned above).

UMAP sets the distance between nearest-neighboring data points to zero. This is justified by local connectivity of the embedding manifold. But manifolds are always locally connected – this instead makes it non-Hausdorff (and not actually a manifold).

It might be possible to come up with a categorical framework from which you can derive UMAP’s details, and that would be a great topic of research which could lead to further improvements in the algorithm. For example, this would involve modifying the Barr/Spivak framework so that UMAP’s cross-entropy score emerges naturally.

UMAP is a great algorithm, whose inventors were inspired by pure math. Which is fine: the discovery of benzene’s ring structure was inspired by a dream of a snake eating its own tail. Whatever works.

But to claim right now that UMAP is a killer app for category theory would be a mistake. This will just lead to disappointment for those who actually work through the details, and won’t help the field of applied category theory gain respect.

]]>John: Spivak introduces the logarithm in the second paragraph of Section 3 in http://math.mit.edu/~dspivak/files/metric_realization.pdf, though we all agree that that choice is not crucial for the theory.

What I had meant is that the UMAP authors make minimal adaptations of this framework to the case of finite metric spaces, including the non-unique choice of using the logarithm! In UMAP, the finite extended pseudo-metric spaces in Definition 7 are the counterparts of Spivak’s and the logarithm seems to me to have been inspired by Spivak’s — though I’d agree that there is no well-defined rigorous link.

One could maybe interpret the points in as the vertices of Spivak’s geometric realizations of the standard fuzzy -simplices, in the sense that in both cases pairwise distances grow linearly with .

]]>The appearance of the logarithm does not follow from Spivak’s theory: he just chooses that function out of the blue, and any function with a few nice properties would work as well for his results.

]]>@Graham, you’re right to say that about logarithms. Indeed I raised that point to one of the authors of UMAP a few weeks ago, and he agreed that the theory should carry through with more general functions (monotonically increasing from $-\infty$ to $+\infty$, roughly speaking). Apparently, the logarithm also happens to give good results in practice, so UMAP sticks to this original choice.

Indeed, the kernel in UMAP is not really Gaussian; it is rather a “Laplace kernel”.

]]>Spivak’s paper is at

I don’t know category theory, so I may have missed something big, but it doesn’t seem to me that logarithms play an important role. I think (1-x)/x, for example, would work just as well as -lg(x).
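To make the comparison concrete, here is a small sketch that tabulates both functions, reading lg as the base-2 logarithm (an assumption) and using made-up function names. Both send a similarity of 1 to a distance of 0 and grow without bound as the similarity approaches 0:

```python
import math


def neg_log2(x):
    """-lg(x), reading lg as the base-2 logarithm (an assumption)."""
    return -math.log2(x)


def odds_like(x):
    """(1 - x) / x, the alternative suggested above."""
    return (1.0 - x) / x


samples = (1.0, 0.5, 0.25, 0.01)
for x in samples:
    print(f"x = {x:<5} -lg(x) = {neg_log2(x):6.2f}   (1-x)/x = {odds_like(x):6.2f}")
```

The two agree at x = 1 and x = 0.5 and differ only in how fast they blow up near 0.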

Also, the Gaussian uses squared distance, not distance, and squared distance is not a metric.
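A one-line counterexample shows why: three equally spaced points on the real line already violate the triangle inequality under squared distance (the function name is made up for illustration).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points on the real line."""
    return (a - b) ** 2


x, y, z = 0.0, 1.0, 2.0
direct = squared_distance(x, z)                          # 4.0
via_y = squared_distance(x, y) + squared_distance(y, z)  # 1.0 + 1.0 = 2.0
print(direct > via_y)  # True: going through y is "shorter" than going direct
```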

]]>It seems I cannot reply to your last comment (perhaps because it’s too nested), so I’m replying here instead.

Thanks! This is very helpful. The points in your list that are most interesting for me are (d) and (f). Regarding (d), I did not realize that the exponentially decaying weights are something that follows naturally from Spivak’s theory! I have now re-read Section 3.1 of the UMAP preprint, and it does not mention this at all. If true, this is interesting. However, practically speaking, the Gaussian kernel (which is exponentially decaying) is by far the most often used kernel in machine learning and statistics, so while it’s nice to have an additional motivation, it does not really yield anything new.

Point (f) is the most interesting one. The cross-entropy loss is arguably UMAP’s most important ingredient for the comparison with t-SNE. (By the way, this loss was introduced in a method called largeVis back in 2016 without any category theory, and I largely see UMAP as providing a motivation for largeVis.) So I think it’s important to understand whether there could be various reasonable ways to measure fuzzy set dissimilarity, or whether cross-entropy is unique and/or somehow preferred.

]]>Sure, no problem!

It seems that most of this algebraic geometry is needed to motivate working with a weighted k-nearest-neighbours (kNN) graph of the dataset (where by “weighted” I mean that the edges have weights, with shorter edges having larger weights). Do you think it’s fair to say that?

I think it *is* fair to say that. Though I’d refine the statement to say that the category theory and computational topology are used as a justification for the *specific* form of the final weighted graph. Schematically:

a) consider ways to compute a meaningful topological signature from a finite “metric-like” space;

b) Spivak’s framework (appropriately modified) says that the “right thing” to look at, categorically speaking, is a “fuzzy singular homology” functor which sends a “metric-like” space to a “fuzzy simplicial set”;

c) due to combinatorial explosions, it is only reasonable to ask a computer to store a part of the resulting fuzzy simplicial set, namely the fuzzy set of its edges (1-simplices);

d) a quick computation shows that the resulting structure is that of a weighted graph with exponential weights (due to logarithms appearing in Spivak’s theory);

e) the fuzzy set framework lends itself to local-to-global procedures via fuzzy unions — this is an extra bonus;

f) the fuzzy set cross-entropy is quite a natural way of measuring dissimilarity between fuzzy sets (you essentially regard a fuzzy set as a “field” of Bernoulli random variables). So this type of loss seems well-adapted to the algorithm’s theoretical guiding principles.
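To illustrate point (f), here is a minimal sketch of a fuzzy set cross-entropy, treating each membership value as the parameter of an independent Bernoulli variable and summing the per-element divergences. The function name, the `eps` clamp, and the sample membership values are assumptions for illustration, not UMAP’s actual implementation.

```python
import math


def fuzzy_cross_entropy(mu, nu, eps=1e-12):
    """Sum over elements of a*log(a/b) + (1-a)*log((1-a)/(1-b)).

    `mu` and `nu` are membership strengths in [0, 1] for the same elements.
    Values are clamped to (eps, 1 - eps) to avoid log(0); this clamp is an
    illustrative choice.
    """
    def clamp(v):
        return min(max(v, eps), 1.0 - eps)

    total = 0.0
    for a, b in zip(mu, nu):
        a, b = clamp(a), clamp(b)
        total += a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))
    return total


mu = [0.9, 0.5, 0.1]  # hypothetical edge memberships in the high-dimensional graph
nu = [0.6, 0.5, 0.4]  # hypothetical memberships in the low-dimensional embedding
print(fuzzy_cross_entropy(mu, mu))  # zero when the two fuzzy sets coincide
print(fuzzy_cross_entropy(mu, nu))  # positive when they differ
```

The first term penalises edges that are present in `mu` but weak in `nu`; the second penalises edges that are absent in `mu` but strong in `nu`.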

I hope this helps!

]]>This paper by Ting, Huang, and Jordan, https://arxiv.org/abs/1101.5435, I think gives a good framework for graph kernels including kNN graphs and approximating embeddings. Also, the classic papers by Coifman on diffusion maps (e.g., https://www.pnas.org/content/102/21/7426). I think these are more straightforward standard manifold learning treatments.

(I am still trying to understand enough to ask John a more intelligent question about the CT framework of UMAP and how it motivates the algorithm.)

]]>