Network theory is catching on—in a very practical way!

Google recently started a new open source library called TensorFlow. It’s for software built using **data flow graphs**. These are graphs where the edges represent tensors—that is, multidimensional arrays of numbers—and the nodes represent operations on tensors. Thus, they are reminiscent of the spin networks used in quantum gravity and gauge theory, or the tensor networks used in renormalization theory. However, I bet the operations involved are nonlinear! If so, they’re more general.

Here’s what Google says:

## About TensorFlow

TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

## What is a Data Flow Graph?

Data flow graphs describe mathematical computation with a directed graph of nodes & edges. Nodes typically implement mathematical operations, but can also represent endpoints to feed in data, push out results, or read/write persistent variables. Edges describe the input/output relationships between nodes. These data edges carry dynamically-sized multidimensional data arrays, or tensors. The flow of tensors through the graph is where TensorFlow gets its name. Nodes are assigned to computational devices and execute asynchronously and in parallel once all the tensors on their incoming edges become available.

## TensorFlow Features

Deep Flexibility. TensorFlow isn’t a rigid neural networks library. If you can express your computation as a data flow graph, you can use TensorFlow. You construct the graph, and you write the inner loop that drives computation. We provide helpful tools to assemble subgraphs common in neural networks, but users can write their own higher-level libraries on top of TensorFlow. Defining handy new compositions of operators is as easy as writing a Python function and costs you nothing in performance. And if you don’t see the low-level data operator you need, write a bit of C++ to add a new one.

True Portability. TensorFlow runs on CPUs or GPUs, and on desktop, server, or mobile computing platforms. Want to play around with a machine learning idea on your laptop without need of any special hardware? TensorFlow has you covered. Ready to scale-up and train that model faster on GPUs with no code changes? TensorFlow has you covered. Want to deploy that trained model on mobile as part of your product? TensorFlow has you covered. Changed your mind and want to run the model as a service in the cloud? Containerize with Docker and TensorFlow just works.

Connect Research and Production. Gone are the days when moving a machine learning idea from research to product required a major rewrite. At Google, research scientists experiment with new algorithms in TensorFlow, and product teams use TensorFlow to train and serve models live to real customers. Using TensorFlow allows industrial researchers to push ideas to products faster, and allows academic researchers to share code more directly and with greater scientific reproducibility.

Auto-Differentiation. Gradient based machine learning algorithms will benefit from TensorFlow’s automatic differentiation capabilities. As a TensorFlow user, you define the computational architecture of your predictive model, combine that with your objective function, and just add data — TensorFlow handles computing the derivatives for you. Computing the derivative of some values w.r.t. other values in the model just extends your graph, so you can always see exactly what’s going on.

Language Options. TensorFlow comes with an easy to use Python interface and a no-nonsense C++ interface to build and execute your computational graphs. Write stand-alone TensorFlow Python or C++ programs, or try things out in an interactive TensorFlow iPython notebook where you can keep notes, code, and visualizations logically grouped. This is just the start though — we’re hoping to entice you to contribute SWIG interfaces to your favorite language — be it Go, Java, Lua, JavaScript, or R.

Maximize Performance. Want to use every ounce of muscle in that workstation with 32 CPU cores and 4 GPU cards? With first-class support for threads, queues, and asynchronous computation, TensorFlow allows you to make the most of your available hardware. Freely assign compute elements of your TensorFlow graph to different devices, and let TensorFlow handle the copies.

## Who Can Use TensorFlow?

TensorFlow is for everyone. It’s for students, researchers, hobbyists, hackers, engineers, developers, inventors and innovators and is being open sourced under the Apache 2.0 open source license.

TensorFlow is not complete; it is intended to be built upon and extended. We have made an initial release of the source code, and continue to work actively to make it better. We hope to build an active open source community that drives the future of this library, both by providing feedback and by actively contributing to the source code.

## Why Did Google Open Source This?

If TensorFlow is so great, why open source it rather than keep it proprietary? The answer is simpler than you might think: We believe that machine learning is a key ingredient to the innovative products and technologies of the future. Research in this area is global and growing fast, but lacks standard tools. By sharing what we believe to be one of the best machine learning toolboxes in the world, we hope to create an open standard for exchanging research ideas and putting machine learning in products. Google engineers really do use TensorFlow in user-facing products and services, and our research group intends to share TensorFlow implementations alongside many of our research publications.

For more details, try this:

Looks a lot like LabVIEW, which can also be used for mathematical computation. One problem, which this may also have, is that you have to enforce the order of computation to avoid race conditions.

I don’t think there is a graphical interface to TensorFlow (like LabVIEW has); it’s just regular old programming in C++ or Python.

It has been on my mind for a while now that I should learn about TensorFlow. (I’m a working computer scientist after all.) I didn’t know it was open sourced! (The only problem is that this competes with my attempted learning of mathematical physics.)

TensorFlow is the dernier cri in so-called Artificial Intelligence, now based on the revival of neural nets, which has pushed the state of the art in several machine perception tasks such as artificial vision or voice recognition and even language understanding (how? firstly by introducing realistic semantic *metric* spaces learned by the technique of word embeddings). The artificial neuron combines a dot product with a deliberate nonlinearity, and neurons are connected in a big network with a huge set of parameters, learned by minimizing an error function by following the corresponding gradient. The error measures how good the net is at predicting an output from an input, and the learning needs a training set or sample of good size.

There have been efforts to express neural nets in terms of graphical models, for instance the sigmoid belief network of Neal; a survey is in [1]. I especially like the chain of connections from neural net to directed graphical model (Bayes net), factor graph, tensor network, and (via monoidal categories) string diagrams. I haven’t seen this developed systematically, but there’s a talk by Jason Morton called “An Algebraic Perspective on Deep Learning” [2] with a lot of keys. I think that to relate a Bayes net with a tensor net, one needs to note that the vector representing the marginal probability distribution of a parent node is the result of applying the multilinear transform associated with it to the vectors of the distributions of all its children, and that is a tensor multiplication. The tensor has the info of the conditional probabilities; there are pictures in this blog post [3]. Other work on this categorical Bayesian machine learning is [4] and [5] by Culbertson and Sturtz. I’ve found the idea of a probabilistic mapping a very powerful tool to easily disentangle Markov chains, HMMs, MDPs, POMDPs, and Kalman filters, and to see a Bayes net node as a many-variable mapping.

And speaking of MDPs, one of the headline-making successes of Deep learning has been to use a deep neural network to learn the policy of an agent that was able to play lots of Atari games with superhuman scores. While playing Atari is frivolous, robotic control or artificial life behaviour learning aren’t, and the principles are not that far away. It’s intriguing to compare this with the bayesian perception/action loop of [6]. While deep learning excels in applications, the mathematics under it or its deep understanding is advancing slowly. It would be nice if the categorical and bayesian glasses could help with this.

[1] Towards Bayesian Deep Learning: A Survey, Hao Wang, Dit-Yan Yeung, arXiv:1604.01662 [stat.ML]

[2] https://www.ipam.ucla.edu/programs/summer-schools/graduate-summer-school-deep-learning-feature-learning/?tab=schedule

[3] https://qbnets.wordpress.com/2015/03/26/tensor-networks-versus-quantum-bayesian-networks-and-the-winner-is/

[4] A categorical foundation for Bayesian probability, Jared Culbertson, Kirk Sturtz

[5] Bayesian Inference using the Symmetric Monoidal Closed Category Structure, Kirk Sturtz

[6] https://johncarlosbaez.wordpress.com/2014/10/30/sensing-and-acting-under-information-constraints/

I have seen the slides from reference [2] before (pdf link here) and have wondered if there is a corresponding paper or any follow-up research.

John,

It would be interesting to see people like yourself delve into deep learning. I am curious to see what category theory has to say about deep networks.

The field has shown that deep networks + stochastic gradient descent + some regularization works *way* better than it should. And there is still a bit of mystery as to why.

I myself have been learning applied sheaf theory (from Ghrist and others) to see if there is application to data flow graphs in deep learning. That is, when the graph is given the Alexandrov topology, the data flow graphs look a bit sheafy. What do you think?

I’m very slowly trying to get up to speed on neural networks, so I can include them in my network theory program. The vast amount of attention given to them deters me a bit from diving in: I like quieter waters. However, from what little I’ve learned so far, it seems there’s a lot people don’t understand. That attracts me.

Data flow graphs mainly remind me of tensor networks: if the description in this blog post is accurate, the only difference is that we allow nonlinear maps at vertices. So, while tensor networks are string diagrams in the symmetric monoidal category of finite-dimensional vector spaces and linear maps, data flow graphs are string diagrams in the symmetric monoidal category of finite-dimensional vector spaces and *arbitrary* maps.

I’d be happy if someone could confirm or disconfirm this: I didn’t dig down into the TensorFlow tutorials deep enough to see a formal definition of data flow graphs!

For me, figuring out a slick mathematical description of something is a good way to start learning about it.

I have had a hard time finding a formal definition of data flow graphs, but I can give some pointers.

First, in deep learning they tend to be called ‘computational graphs’.

One starting point is the just-about-published Deep Learning book, in particular chapter 6. I didn’t find the definitions crystal-clear, but it’s a workable resource.

Figure 6.2 shows that there is some flexibility in how you define the graph. You can represent each individual operation (each multiplication, each addition, each nonlinear operation), or combine them into one big operation, say, $f(Wx + b)$.

The vertex has $W$, $x$, and $b$ as inputs. $x$ usually comes from the previous ‘layer’, but the parameters of this layer, $W$ and $b$, are there as well, and are sources in the overall graph. These are sometimes implied but not drawn in diagrams.
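To make the two granularities concrete, here is a minimal pure-Python sketch of a computational graph evaluator. Everything in it is an illustrative assumption rather than something taken from the book or from TensorFlow: the helper names `evaluate`, `affine` and `relu`, the choice of ReLU as the nonlinearity, and the sample numbers. It only aims to show that a fine-grained graph (one vertex per operation) and a coarse graph (one vertex for the whole layer) have the same sources and compute the same value.

```python
def evaluate(graph, sources, output):
    """Evaluate vertex `output` by pulling values along incoming edges.

    `graph` maps a vertex name to (operation, list of input vertex names).
    `sources` maps source vertices (data and parameters) to their values.
    """
    cache = dict(sources)

    def value(name):
        if name not in cache:
            op, inputs = graph[name]
            cache[name] = op(*(value(i) for i in inputs))
        return cache[name]

    return value(output)


def affine(W, x, b):
    # Matrix-vector product plus bias, on plain Python lists.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    # Hypothetical choice of nonlinearity for this sketch.
    return [max(0.0, vi) for vi in v]


# Fine-grained graph: the affine map and the nonlinearity are separate vertices.
fine = {
    "z": (affine, ["W", "x", "b"]),
    "h": (relu, ["z"]),
}

# Coarse graph: the whole layer is a single vertex with the same three inputs.
coarse = {
    "h": (lambda W, x, b: relu(affine(W, x, b)), ["W", "x", "b"]),
}

# Sources of the graph: the incoming data x and the layer parameters W and b.
sources = {
    "x": [1.0, -2.0],
    "W": [[1.0, 0.5], [-1.0, 1.0]],
    "b": [0.5, 0.0],
}

h_fine = evaluate(fine, sources, "h")
h_coarse = evaluate(coarse, sources, "h")
print(h_fine, h_coarse)
```

Both graphs treat the parameters as ordinary source vertices, which is why diagrams that omit them are still describing the same computation.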

Figure 6.8 gives examples of computational graphs.

Figure 6.10 shows one way to include the derivatives in order to do backpropagation. You create a second graph where the arrows flow downward. The arrows from the main graph to the second graph show the chain rule in action. I think there are other (better?) ways to formalize the concept of backpropagation in graph terms, but this is how TensorFlow and Theano do things.
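Here is a small hand-written sketch of that idea, assuming a toy forward graph that computes `loss = (w*x + b)**2`. The function names and the example graph are made up for illustration, and this is not how any particular library generates its gradient graph. The point is only that each gradient vertex takes one input from the gradient vertex "above" it and, where the chain rule needs it, one input from the forward graph.

```python
def forward(w, x, b):
    """Forward graph: each entry is the value carried by one edge."""
    p = w * x        # multiplication vertex
    s = p + b        # addition vertex
    loss = s * s     # squaring vertex
    return {"w": w, "x": x, "b": b, "p": p, "s": s, "loss": loss}


def backward(fwd):
    """Second graph: gradient vertices, visited in reverse order.

    Each line is one application of the chain rule. Arrows from the forward
    graph show up as reads of `fwd[...]`.
    """
    d_loss = 1.0
    d_s = d_loss * 2.0 * fwd["s"]   # d(s*s)/ds = 2s, needs forward value s
    d_p = d_s                       # addition passes the gradient through
    d_b = d_s
    d_w = d_p * fwd["x"]            # d(w*x)/dw = x, needs forward value x
    d_x = d_p * fwd["w"]            # d(w*x)/dx = w, needs forward value w
    return {"w": d_w, "x": d_x, "b": d_b}


fwd = forward(w=3.0, x=2.0, b=-1.0)
grads = backward(fwd)
print(fwd["loss"], grads)
```

Writing the gradient computation as ordinary vertices is what lets the derivative "just extend the graph": the result is another data flow graph that can be inspected or differentiated again.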

Figure 6.11 shows what a computational graph might look like in practice, regularization vertices, hyperparameters and all.

Finally, you can look at this blog post, which gives some straightforward examples of computational graphs, with an eye towards the backpropagation part of the story.

In deep learning, the nonlinear part is essential. A “deep linear network” is nothing more than a sequence of matrix multiplications.
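A quick way to convince yourself of this is to compare a three-layer linear network with the single matrix obtained by multiplying its weights. The sketch below uses plain Python lists and made-up integer weights; `matmul` and `matvec` are small helpers written for this example.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(A, v):
    """Product of a matrix (list of rows) with a vector (list)."""
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]


# Made-up weights for three linear layers, and a made-up input.
W1 = [[1, 2], [0, 1]]
W2 = [[2, 0], [1, 1]]
W3 = [[1, -1], [3, 0]]
x = [1, 2]

# Apply the three layers one after another ...
deep = matvec(W3, matvec(W2, matvec(W1, x)))

# ... or multiply the weights once and apply a single layer.
W = matmul(W3, matmul(W2, W1))
shallow = matvec(W, x)

print(deep, shallow)
```

The two results agree for every input, so without a nonlinearity between layers, depth adds no expressive power (though, as the next remark notes, it can still change the learning dynamics).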

(Interestingly, there is actually a great paper on deep linear networks, where they study the learning dynamics of gradient descent.)

Nevertheless, the nonlinear operation is an essential part of why deep networks work the way they do. Figure 6.5 of the book chapter I linked gives one theory for why this is. Furthermore, the last layer of a deep network is usually a loss function, say mean-squared-error or softmax, which is nonlinear.

By symmetric monoidal category of finite-dimensional vector spaces and arbitrary maps, what do you mean exactly? Is the tensor product the cartesian product? If so, this would be equivalent to the cartesian monoidal category of sets of cardinality that of $\mathbb{R}$ plus a terminal set chucked in for good measure.

Todd wrote:

No, the usual tensor product of vector spaces. That’s what all this junk is supposed to hint at:

Let’s ignore the “dynamically-sized” bit, which requires us to formalize the idea of a vector space whose dimension is variable rather than fixed.

Apart from that, my guess is that in practice, we start with the symmetric monoidal category consisting of vector spaces and linear maps, with its usual tensor product. Then we create a bigger symmetric monoidal category generated by this category together with some nonlinear maps. Since different people seem to use different nonlinear maps, I was chucking them all in, just to get something easy to describe until someone comes along and tells me exactly what they want.

For example, in neural networks people like to include a sigmoid function from the reals to the reals among the morphisms in their category. The most famous sigmoid function is the logistic function, but (as the Wikipedia link explains), there are a number of other popular ones. They all look kind of like this:

The point is that we want something like an “off-on” gate, so our neuron can be “on” when the input is big enough, and “off” when it’s small enough… but we want something smooth, not a characteristic function, since we want our neural networks to behave in a fairly mellow fashion, not jerkily jumping between outputs as the input changes.
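The contrast between a smooth gate and a characteristic function is easy to see numerically. The sketch below compares the logistic function with a hard step, and wraps the logistic in a toy `neuron` helper (dot product, plus a bias term, then the gate). The helper names and the bias parameter are assumptions made for this illustration.

```python
import math


def step(x):
    """Characteristic function of the positive reals: a jerky on-off gate."""
    return 1.0 if x > 0 else 0.0


def logistic(x):
    """Logistic sigmoid 1 / (1 + exp(-x)): a smooth on-off gate."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(weights, bias, inputs, gate=logistic):
    """Dot product of weights and inputs, plus a bias, passed through a gate."""
    return gate(sum(w * x for w, x in zip(weights, inputs)) + bias)


# Nudging the input across zero barely moves the logistic output ...
small_change = logistic(0.01) - logistic(-0.01)
# ... but flips the step function from fully off to fully on.
jump = step(0.01) - step(-0.01)

print(small_change, jump)
print(neuron([10.0, -10.0], 0.0, [1.0, 0.0]))  # strongly "on"
```

The smoothness also matters for training: the logistic function has a nonzero derivative everywhere, while the step function's derivative is zero wherever it exists, which would leave gradient descent with nothing to follow.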

But anyway, thanks. I don’t think we should rely too heavily on using “all” maps between vector spaces to disintegrate their structure down to mere sets. But I wouldn’t be shocked if the direct sum of vector spaces were *also* important in this game!

The question in that case is the well-definedness of the tensor product on maps (whatever those may be). The usual tensor product behaves well on linear maps, but other maps?
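A small worked example shows how this can fail. It assumes the simplest possible case (my choice, purely for illustration): take both vector spaces to be $\mathbb{R}$, so that $\mathbb{R} \otimes \mathbb{R} \cong \mathbb{R}$ via $x \otimes y \mapsto xy$, and take the nonlinear map $f(x) = x + 1$. If we try to define $f \otimes f$ on pure tensors by $(f \otimes f)(x \otimes y) = f(x) \otimes f(y)$, then two ways of writing the same element give different answers:

$$
2 \otimes 3 \;=\; 6 \otimes 1 \;\in\; \mathbb{R} \otimes \mathbb{R},
\qquad \text{but} \qquad
f(2) \otimes f(3) \;=\; 3 \otimes 4 \;=\; 12\,(1 \otimes 1)
\;\neq\;
14\,(1 \otimes 1) \;=\; 7 \otimes 2 \;=\; f(6) \otimes f(1).
$$

So the naive formula does not descend to a well-defined map on the tensor product, even for an affine $f$. It only works in general when the maps are linear.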

Good point. I still feel pretty sure that some category like this is what “data flow graphs” are supposed to formalize, but you’re right, I don’t see why it should be a symmetric monoidal category!

Right now it seems like a category where there’s a tensor product of objects (which we could make strictly associative, to give it some teeth), but not for general morphisms. That’s pretty sad.

I’ll have to look into this some more, to learn more precisely what data flow graphs are, and what people do with them. If they’re really important there should be some mathematically nice way to think about them.

I should point out, for what it’s worth, that all our vector spaces here come with distinguished bases. So, we could equally well think of the objects as finite sets, and the tensor product of the objects as the cartesian product.
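Spelled out, this is the standard identification, with $A$ and $B$ standing for the (hypothetically named) finite index sets of the two bases:

$$
\mathbb{R}^{A} \otimes \mathbb{R}^{B} \;\cong\; \mathbb{R}^{A \times B},
\qquad
e_a \otimes e_b \;\longmapsto\; e_{(a,b)},
\qquad
\dim\left(\mathbb{R}^{A} \otimes \mathbb{R}^{B}\right) = |A| \cdot |B| .
$$

For example, an array of shape $3 \times 4$ is a vector in $\mathbb{R}^{3} \otimes \mathbb{R}^{4} \cong \mathbb{R}^{12}$, with one coordinate for each pair of indices. On objects, then, the tensor product is just the cartesian product of index sets.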

For machine learning, the difficulty is in understanding why something works. Deep learning methods will produce models that fit the data very well, yet understanding the “why” is tricky. And often there is another level of abstraction that you will have to penetrate from what the tool produces. In other words, the physical interpretation is left open to interpretation.

I appreciate the deep learning tools that produce mathematical equations and adjunct complexity metrics. Just last week I was looking at some old geophysics data and the tool found a simple sinusoidal component buried in it. That also happened on further analysis to be aliased to a well-known lunar cycle.

http://contextEarth.com/2016/06/02/seasonal-aliasing-of-long-period-tides-found-in-length-of-day-data/

Perhaps the important part is having a social network where people can share findings and work out interpretations. I wonder where such a scientific/mathematical forum exists? :)

John said in the post

I can’t tell by looking at TensorFlow whether the edges represent tensors or lists of tensors (possibly of different rank and shape). From a programming point of view, it seems nearly as easy to allow lists, and more flexible, so I guess lists are allowed. I did see that strings as well as numbers are allowed as elements.

and from a comment:

Are you aiming for a description of TensorFlow or of a typical/vanilla sort of ANN? (A slick mathematical description of all ANNs is about as likely as a slick mathematical description of all bridges.)

Assuming you want the latter… ANNs operate in at least two modes: training and prediction. The graphs will be different, though often the prediction graph is a subgraph of the training graph. Graphs will be directed acyclic graphs. You do get recurrent ANNs but they are a minority pursuit.

ANN = artificial neural net

I would love a slick formalization of these two interacting graphs, because I feel there is something missing from the descriptions offered by those in ML.

(You can formulate it the way TensorFlow and theano do: by formulating a second gradient graph that takes input from the first.)

But since training mode can also be thought of as flipping the arrows, certainly something co____ must be going on? Apologies if this is a naive question; I am a CT neophyte.

In any case, there are relations of the training/prediction phases and the “forward/backward pass” seen in things like Kalman filtering. A potentially clarifying post is offered by Ben Recht: http://www.argmin.net/2016/05/18/mates-of-costate/

I’m aiming for a description of something in this general vicinity that admits a slick mathematical description. Being a mathematician, this is a way for me to get my foot in the door.

Interesting that TensorFlow (TF) is mentioned in this blog. TF is a convenient computational framework that was designed for DL computation. As for whether you can gain any insight by looking at TF graphs, I seriously doubt it.

I’m in the process of writing what I call “A Pattern Language for Deep Learning”. Deep Learning (DL) systems have a common requirement that the layers are differentiable (so as to support SGD). The question I have is what Category theory has to say about this? How does a category that supports a “natural transformation” imply the need for differentiable layers?

It’s because I’m an expert on some aspects of tensor networks, so the name and description caught my attention.

It would be helpful to explain why you doubt it.

Since I’m an expert on how various kinds of labeled graphs are used to describe morphisms in various kinds of categories, and how operations on graphs correspond to operations on morphisms, it seems like an obvious thing for me to look into — especially since it’s something other people are less likely to have spent a lot of time on.

(If you have a super-power, you have to try to use it even in situations where it’s not obviously the best solution to a problem. This is why Spiderman spends so much time swinging around on webs.)

I’m pretty good at the theory of differentiable categories, so I’d be inclined to try something like that. But it will take me a year or two, at *least*, before I have anything very interesting to say about neural networks.

The reason I remarked about the lack of insight in the TensorFlow computation graph concept is that it is no different from the dataflow parallel computation that was popular in the early 80’s. Recall the Japanese 5th generation computing project? The closest equivalent would be Petri nets. It’s a means for specifying a computation, which really does not give insight into the dynamics of a DL system.

For a DL system, I conjecture that there would be 3 categories that need to be defined. The first would be the Representation category, which would probably be equivalent to a vector space. The second would be the Model category. The Model category will have morphisms that map from one Model to another. Furthermore, a Model happens to be a morphism in the Representation category. The 3rd category, which I’m a bit fuzzy about, would be a Learner category. Nevertheless, SGD would be a morphism in the Model category.

I think there is a lot of promise in a category theory for Deep Learning, considering the work done lately. That is, I do see category theory applied to Bayesian causality, as well as some work regarding entropy. The step to the next level may not be that much of a leap.

I think just being able to formalize DL in terms of category theory will help in discussion.

What will be extremely difficult is coming up with an explanation as to why DL depth architecture works. There is some group theory based proof that explains the higher probability of forming morphisms that have a large limit cycle, however I can’t see how a layer (or several layers) build up into higher classification abstraction.

My goal is not to be too ambitious, but rather, have a set of design patterns that aid in building DL solutions.

I think there are a bunch of those ‘plug components into other components’ interfaces out there. The one that comes to mind right now is gnuradio; here’s a recent example: http://gnuradio.org/blog/filtering-time-series-data__elemental-building-blocks/

I did not know about TensorFlow. I will check it out, mainly because of its portability across hardware platforms.