The Flywire Minimum Feedback Data Challenge and “The Bitter Lesson”

code: https://github.com/TuragaLab/connectome_reordering

The brain of the fruit fly Drosophila melanogaster is vast, consisting of a staggering ~135,000 neurons interconnected by roughly 50 million synapses in a sophisticated, well-structured network that allows the fly to avoid danger, localize itself in its environment, and find mates, among many, many other behaviors. Now, after the immense effort of constructing a detailed “wiring diagram” of this connectivity, called the connectome, the community is turning to seriously analyzing the network.

With such complexity packed into a structure barely the size of the tip of a needle, it isn’t obvious where to begin. We have a rough idea of cell “types,” loosely defined as cells that are functionally or morphologically similar, but these only simplify things a bit. Trying to look for meaningful subnetworks also only takes you so far. We are aware of most of the large structural divisions within the brain (the optic lobes stick out obviously from the sides, emphasizing how important vision is to the fly), but the detailed functional organization within these regions remains murky. Ideas abound for how we can better understand the function embedded within the structure of the brain (my own main research consists of trying to simulate the fly brain as realistically as possible), but one broad approach lies in sorting the brain into “feed-forward” and “feedback” areas. For instance, multiple disconnected feedforward pathways may indicate parallel feedforward processing units, while the feedback between and within these regions may be considered control or attentional units (in the behavioral sense, not the machine learning architecture sense). However, it is also important to remember that the underlying graph is still the same, and these distinctions probably need to be carefully qualified when applied to the continuous activity of the neurons of a living organism.

When I first saw the challenge, my labmate and I were thinking primarily in terms of permutation matrices. Applied to an input matrix of compatible size, a permutation matrix simply reshuffles the values in the input; any square matrix with exactly one “1” per row and per column, and zeros elsewhere, is a permutation matrix. This makes optimizing over permutation matrices a struggle: it is not trivial to constrain a differentiable problem so that the learned matrix remains a permutation matrix. You might think about computing continuous scores and taking a softmax, but taking a softmax simultaneously along rows and columns is also nontrivial. Essentially, it has to be done iteratively, normalizing rows, then columns, then repeating. There are a lot more details here (if you want to read more, see “Learning Latent Representations with Gumbel-Sinkhorn Networks”), but this approach didn’t work simply because the iteration takes much too long on a weight matrix the size of the Drosophila connectome, with about 135,000 nodes.
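To make the scaling problem concrete, here is a minimal sketch of the Sinkhorn-style relaxation (my own illustration, not the challenge code): a permutation is relaxed to a doubly stochastic matrix by alternately normalizing the rows and columns of a dense score matrix, and that dense matrix is what kills you at connectome scale.

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def sinkhorn(log_alpha, n_iters=20):
    """Relax an (n, n) score matrix toward a doubly stochastic matrix.

    Normalizing the rows breaks the column normalization and vice versa,
    so the two steps must be alternated; a hard permutation is only
    approached in the limit of sharp scores and many iterations.
    """
    for _ in range(n_iters):
        log_alpha = log_alpha - logsumexp(log_alpha, axis=1, keepdims=True)  # rows
        log_alpha = log_alpha - logsumexp(log_alpha, axis=0, keepdims=True)  # columns
    return jnp.exp(log_alpha)

# For n ≈ 135,000 the dense score matrix alone holds ~18 billion float32
# values (~73 GB), before storing any gradients -- hence abandoning this route.
```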

An approach that works, then, probably needs to operate in only one dimension to be tractable. To that end, I thought of a recent paper by the neuroscientist Axel Borst of the Max Planck Institute, Connectivity Matrix Seriation via Relaxation, which describes this exact problem. The idea is that, given a complex connectivity matrix, one way to understand it is to find an arrangement of rows and columns that maximizes the number of edges in the upper triangle (edges where the target’s position is greater than the source’s). This reduces the problem to a one-dimensional formulation over the edge list: each node receives an embedding, we index the embeddings with the edge list, and take the difference between the target and source embeddings. Pass this difference through a nonlinearity, sum over edges, and differentiate to form a gradient update, leveraging JAX for efficient computation.
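As a rough sketch of that one-dimensional relaxation (variable names and the plain gradient update are my own assumptions, not necessarily the competition code): each node gets a scalar position, and the synapse-weighted sigmoid of the position difference along each edge serves as a differentiable stand-in for “this edge points forward.”

```python
import jax
import jax.numpy as jnp

def soft_feedforward(x, src, dst, w, beta=10.0):
    # x: scalar position per node; (src, dst, w): weighted edge list.
    diff = x[dst] - x[src]                          # positive = feedforward
    return jnp.sum(w * jax.nn.sigmoid(beta * diff)) / jnp.sum(w)

loss_fn = lambda x, src, dst, w: -soft_feedforward(x, src, dst, w)
grad_fn = jax.jit(jax.grad(loss_fn))

def step(x, src, dst, w, lr=1e-2):
    # Plain gradient descent on the negative (soft) feedforward fraction.
    return x - lr * grad_fn(x, src, dst, w)

# The final ordering is just the argsort of the learned positions:
# order = jnp.argsort(x)
```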

However, embedding-based optimization can be highly sensitive to initialization and parameter choices. In particular, I found the approach to be sensitive to the choice of nonlinearity. Using a sigmoid, the beta (sharpness) parameter largely determined the final score. To that end, I tried adding a beta scheduler: start with a high beta to encourage edges with embedding differences close to zero to flip, then gradually decrease beta to pull edges that are farther away toward the center. Repeat, and repeat, and repeat. I also tried adding a hinge loss toward the same end as the low-beta phase. I tried learning-rate scheduling, to minimal effect. One thing that did help was multiplying the original sigmoid by a sigmoid with very high beta, which made the objective function consider only edges that were currently backwards.
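Here is how I read that last trick; this is a reconstruction under my own assumptions, not the exact loss in the repository. A very sharp “gate” sigmoid zeroes out the contribution of edges that already point forward, so the gradient is driven almost entirely by the edges that are currently backwards:

```python
import jax
import jax.numpy as jnp

def backward_gated_loss(x, src, dst, w, beta=5.0, beta_gate=200.0):
    # Penalty to minimize: smooth sigmoid on backwards edges, gated by a
    # near-step sigmoid that is ~1 for backwards edges and ~0 for forward ones.
    diff = x[src] - x[dst]                      # positive = backwards edge
    smooth = jax.nn.sigmoid(beta * diff)        # carries the useful gradient
    gate = jax.nn.sigmoid(beta_gate * diff)     # masks out forward edges
    return jnp.sum(w * smooth * gate)
```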

This reached a limit that I suspected the other teams had passed. To explore whether random swapping could provide a useful comparison, I implemented a Monte Carlo approach that randomly selects two nodes in each iteration, swaps their positions, and calculates the resulting change in the feedforward metric. If the new configuration improves the score, the swap is accepted; if not, it may still be accepted with a probability given by a simulated-annealing-style criterion, which lets the algorithm escape local minima by occasionally accepting worse configurations (though I found the algorithm works best without annealing).
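A minimal sketch of one such step in JAX, assuming an integer position per node and a weighted edge list (the greedy variant, since annealing was ultimately dropped); the names and the masked delta computation are illustrative rather than the exact competition implementation:

```python
import jax
import jax.numpy as jnp

def forward_weight(pos, src, dst, w):
    # Full feedforward score (numerator): total synapse weight on forward edges.
    return jnp.sum(w * (pos[dst] > pos[src]))

@jax.jit
def mc_step(pos, key, src, dst, w):
    key, k1, k2 = jax.random.split(key, 3)
    i = jax.random.randint(k1, (), 0, pos.shape[0])
    j = jax.random.randint(k2, (), 0, pos.shape[0])

    # Propose swapping the positions of nodes i and j.
    new_pos = pos.at[i].set(pos[j]).at[j].set(pos[i])

    # Only edges touching i or j can change direction, so score just those.
    touched = (src == i) | (dst == i) | (src == j) | (dst == j)
    before = jnp.sum(w * touched * (pos[dst] > pos[src]))
    after = jnp.sum(w * touched * (new_pos[dst] > new_pos[src]))

    # Greedy acceptance; an annealing variant would also accept worse swaps
    # with probability exp((after - before) / temperature).
    return jnp.where(after >= before, new_pos, pos), key
```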

The Monte Carlo approach offered a direct, computationally straightforward way to navigate the solution space without relying on gradient-based optimization, making it a useful baseline against which more sophisticated methods could be measured. But it was dumb as rocks algorithmically. Even so, with JAX, it seems anything is possible. Hearing the presentations from the other contenders, who had been running their approaches on CPUs for over a month, I found that with JAX we were able to explore the solution space many times faster, exceeding their results in a weekend.
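One illustration of where that speedup can come from (again a hedged sketch, not necessarily how the repository does it): compiling thousands of swap proposals into a single fused call with lax.scan, so the per-swap Python overhead disappears and the whole loop stays on the accelerator.

```python
from functools import partial
import jax

@partial(jax.jit, static_argnames="n_steps")
def run_chunk(pos, key, src, dst, w, n_steps=10_000):
    # Run many Monte Carlo steps (mc_step from the sketch above) inside one
    # compiled call; only the final positions and PRNG key return to Python.
    def body(carry, _):
        pos, key = carry
        return mc_step(pos, key, src, dst, w), None
    (pos, key), _ = jax.lax.scan(body, (pos, key), None, length=n_steps)
    return pos, key
```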

This invokes Rich Sutton’s bitter lesson: the simpler the approach, and the better it leverages computation, the better.

“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.” - Rich Sutton
