Deep learning is black magic. For some reason, a neural network with millions of parameters is not cursed to overfit. Somewhere, either in the architecture or the training or the weights themselves, exists a magic that allows deep neural networks to generalize. We are now only beginning to understand why this is. As in most cases in science, a good starting place is to observe a paradox about the system, suggest a hypothesis to explain the paradox, and then test, test, test.
The Paradox: This week we discussed “The lottery ticket hypothesis: Finding sparse, trainable neural networks” by Frankle and Carbin, ICML, 2019. Recent studies have shown that a deep neural network can be pruned down to as little as 10% of the size of the original network with no loss in test prediction. This leads to a paradox: Training the small, pruned network (randomly initialized) leads to worse test prediction than training a large network and then pruning it. This is heresy to the ML doctrine of “Thou shalt start simple, and then go complex.” What is going on?
The Hypothesis: This paradox suggests that the initialization of the weights of the pruned network matters. This leads to the Lottery Ticket Hypothesis:
The Lottery Ticket Hypothesis: “dense, randomly-initialized, feed-forward networks contain subnetworks (i.e., winning tickets) that—when trained in isolation—reach test prediction comparable to the original network in a similar number of iterations.”
These “winning tickets” start out with an initialization that make training particularly effective. If the winning tickets were re-initialized randomly, they would no longer be winning tickets—training would not reach the same level of test prediction. This may explain why deep and wide networks tend to perform better than shallow, narrow networks—> the deeper, wider networks have more chances of having the winning ticket. It also suggests that training a neural network is akin to stochastic gradient descent seeking out these winning tickets.
The Test, Test, Test: To find the winning ticket, the authors employ iterative pruning. They start with a large neural network, train it a bit, set the smallest-magnitude weights to zero (these weights are no longer trainable), rewind the trainable parameters to their original initialized values, and train again. This is repeated until pruning drastically hurts test prediction. They find that these winning tickets can be as sparse as 3.7% of the original network’s size while having the same if not higher test prediction (Figure 1, solid lines). Interestingly, if they randomly re-initialize the weights of these pruned networks, test prediction decreases considerably (Figure 1, dashed lines, ‘reinit’). Thus, we have a way to identify winning tickets, these winning tickets are a measly fraction of the size of the original network, and their original initialization is crucial to train them. The authors do a thorough job of confirming the robustness of this effect (45 figures in all). However, they did not investigate the properties of the subnetworks (e.g., did pruning happen most in the deeper layers?).
One question I had remaining is if these winning tickets were necessary or sufficient to train the large network. One analysis that could get at this question is the following. First, identify a winning ticket, and then re-initialize its weights in the original, large network. This ensures that this subnetwork is no longer is a winning ticket. Keep repeating this process. After removing winning tickets, does the large network fail to train? How many winning tickets does the large network have? Will the large network always have a winning ticket? We could also do the reverse of this: For a large network, keep initializing subnetworks that are winning tickets. Is it the case that with more winning tickets, the network trains faster with higher test prediction?
Implications for deep learning: There are several implications for deep learning. 1. Finding out what is special about the initializations of these winning tickets may help us improve initialization strategies. 2. We may be able to improve optimization techniques by better guiding stochastic gradient descent to find and train the winning ticket as fast as possible. 3. New theory that focuses on subnetworks may lead to more insight into deep learning. 4. These winning tickets may be helpful for solving other tasks (i.e., transfer learning). There have already been some follow up studies about these issues and re-examining the lottery ticket hypothesis: Liu et al., 2019, Crowley et al., 2019, Frankle et al., 2019, Zhou et al, 2019.
Does the brain have winning tickets? I had one recurring thought while reading this paper: I know brains prune synaptic connections substantially during development—could the brain also consist of whittled-down winning tickets? After some searching, I realized that it is largely unknown the amount of pruning that occurs or when certain pruning happens. What would be some tell-tale signs of winning tickets? First, substantial pruning should occur (and likely does…perhaps as much as 50%). Second, randomly initializing a developing circuit should lead to a drop in performance after the same amount of training as a control subject (not sure if we can randomly set synaptic weights yet). Third, what would happen if we could prune small-magnitude synaptic connections ourselves during development? Could the brain recover? These tests could first be carried out in insects, where we have gene lines, optogenetics, whole-brain recordings, and well-labeled cell types.