May 2017 – Inherit a mess, build a miracle

I happened to come across this recently published paper with the titled Face classification using electronic synapses. Since I did similar research at school I spent some time reading it. As I scrolled down the web page, there was a section called “Peer Review File” which greatly aroused my interest. I clicked linked and quickly found this Review #3 (shown below), though lengthy, it is very read-worthy. It reminds me of those perplexing days struggling with a research topic that was born with defects.

To give you a little bit background in case you are not familiar with machine learning based neural networks – you can deem them (detailed definition can be found here) as a black box which you can feed it human images and it will tell you who it is. Of course, in order to give neural networks such capabilities, you must “train” it first, which is traditionally done digitally on a CPU based computer. In recent years a new type of analog device called “memristor” emerged as a potentially “promising” component in many applications. Loosely speaking, a memristor is a variable resistor that can be modulated by external electric stimuli. Subsequently, some researchers claim that they might find spots in machine applications where low power feature is desired. Many works tackle the charming features of memristors by showing a trending machine learning based example, but most of them are so vulnerable in turns of the following points

Up to now, no robust memristor has ever been fabricated. Conceptually an ideal memristor should demonstrate good linearity, just like its counterparts – resistors, capacitors, and inductors. If you give a capacitor two times the charge, you know it will have two times the voltage. Nothing is even close for a memristor. This makes memristor’s applicability questionable because you will always need calibration. How are you able to calibrate devices that are of the size of tens of thousands? Furthermore, If you consider device degradation due to aging, resistance drift over time, memristor becomes almost instantly unattractive.
Even if you can manufacture a memristor that has good linearity and is as stable as a traditional resistor, it still does not bring many benefits. All the memristor related application up to now assume they work in open loop and we rely on its absolute resistance value, which is a scheme circuit designers try to avoid. To battle the sensitivity to temperature, analog circuit engineers would only use resistance ratio and feedback loops to make an amplifier – in any scenario high precision is needed, an absolute resistance value is not the thing to trust. How can you make people trust a memristor to substitute a synapse weight that is usually represented by a 16-bit value or higher?
It is almost a consensus that memristors would not demonstrate competency in turns of machine learning accuracy and speed – so energy efficiency becomes the last highlight. But this viewpoint is vulnerable too – accounting the auxiliary circuitry and all those data converters (ADC & DAC), without a full functioning system, people can always doubt if you are making fair comparisons with digital platforms.

If these comments appear in your paper review, it’s very difficult to respond, because they shake the very foundation your research is built upon. It’s not like there are side experiments you have missed or too broad claims that you need to narrow down. It embodies a sad truth – not all research topics are created equal. This paper is lucky to be accepted for publication finally, but I can feel the pain the authors have felt during the revision process.

Appendix – Review #3 to the above-mentioned paper.

Reviewer #3 (Remarks to the Author):

This manuscript describes an experimental implementation of the on-chip training of a very small, single-layer perceptron neural network on a toy-size problem. The in-situ training is described, generalization and noisy inference is tested, and network accuracy is shown. The authors do share a large amount of data illustrating the high variability of their devices, which is good. Authors also focus overly much on the power advantages of such an online array, which of course are large — yet do not provide useful information to other experts in the field, vis-a-vis: If we or they built a large array of these devices and used it to train a multi-layer neural network on a non-toy problem, would there be any chance of it actually working (e.g., converging to acceptably-interesting accuracies within any reasonable time)? I recommend the authors go back and attempt to answer the question above using simulation of neural networks as informed by their extensive data on device-to-device variability and conductance response and provide the answer in their revised paper. Also, since I have very good reason to believe the answer to my question above is “No, it won’t work with these devices,” then authors will furthermore need to describe for the reader how much these devices need to be improved (in order to make non-toy DNN work), whether this is feasible, and how this could potentially be done. I recommend reject and mandatory revision before resubmission. Specific comments:

1) The authors include the usual hand-waving rhetoric about the glories of neuromorphic computing, but of course low energy computation is completely useless if you are not obtaining a result of relevance. In this context, future neuromorphic systems built around such devices MUST be able to train the exact same size (or bigger) networks as are currently trained by GPUs, MUST obtain similar classification accuracies by the end of training, and most likely MUST provide a speedup advantage IN ADDITION TO lower power. Yes, there is a niche market for solutions that train DNN at lower power to sufficient accuracy but more slowly than GPUs, but it is a fairly small niche. There is NO niche for solutions that fail to achieve similar classification accuracies on non-toy-size DNN. There is zero mention or acknowledgement of this reality in the current intro to this paper, and this must be fixed.

2) Authors claim “this is the first experimental demonstration of such a significant cognitive task.” I get tired of papers that are explicitly built so that one can claim they are “the first,” leaving all the actual hard work of taking something from first demonstration to usefulness to someone else. In this case, I cannot agree that the network shown here can be considered “significant.” This network and dataset is far too small to allow any predictive (or even indicative) power for larger non-toy networks. One can say that MNIST itself is also far smaller in size and easier in difficulty than CIFAR or ImageNet, but if a memristor-based solution were able to deliver identical-accuracy-performance on MNIST, I would have to claim that that accomplishment would be a major step towards the demonstration of NVM-based neuromorphic systems. The present network is far far smaller and much easier — to the extent that success on this toy problem is indicative of nothing.

3) There is a major bait-and-switch between the aspect ratio of the experimental array (described as having 128 rows and 8 columns) and the logical array needed for the network of interest (described as having 3 rows and 320 columns). Thus statements like “A fully parallel read operation” is clearly a bald-faced lie, since there is absolutely no way the authors can experimentally be inputting all 320 columns at the same time given the dimensions of the physical array. Statements like this will need to be fixed to reflect the actual realities of the experiment as it was actually performed.

4) Ways in which this experiment is a toy experiment – It is not that the images are too small – 20×20 is not far off in size from MNIST. The problem is that there are so few classes, which appear to be cherry-picked from the dataset in order to maximize the distance between the classes and make the problem easier. – Batch-based processing (e.g., going through the whole dataset and THEN updating all the weights) has long been abandoned by the AI community. IF you wanted to do this for a large dataset, you would take a large hit in convergence rate, plus you must describe how this data will be stored locally (or include the power for aggregating this data offline in your power and latency assessment). – The convergence of this learning appears to be so rapid that the authors are able to almost completely avoid having to use any SET pulses during the experiment. This will most definitely not be the case for any non-toy experiment. (Authors should show the histogram of desired conductance changes during training, so readers will know what the balance between requests-for-lower-conductances and requests-for-higher-conductances). – Because the dataset is toy-size, authors probably terminated training too early, thus losing the opportunity for further improvements in generalization performance or classification of noisy inputs. – The hundreds of write-read-verify cycles required for these devices are not going to scale well for a non-toy problem, without some trick which allows you to do this extremely infrequently and/or efficiently

5) Other comments: – If authors are going to focus on power to this extent, they must include the power required to implement their tanh() function, to compute the delta values after each example, and to aggregate and store the weight updates during their batch-based programming.

6) The RRAM devices do seem to show a nice gentle behavior. This reviewer is impressed at the slick usage of a logarithmic horizontal axis to make the conductance response “look” linear, when in actuality it is quite far from linear. That said, the supplementary data in S3 S4 and S5 (showing just how bad the variability between these devices really is) is greatly appreciated by the reviewer. After seeing this data, I was amazed that even this toy experiment worked at all, until I realized that the toy database only required RESET pulses, and that in the verify version of your experiment, you were willing to wait as long as it took, even if that meant many hundreds of programming pulses. One comment on Figure 3B,C — it is disconcerting that the vertical axes of B and C are not the same scale. You should either start your RESET characteristic at the same peak value that the SET characteristic terminates at, OR at least make the two plots have the same vertical scale so that the reader can immediately observe that the SET and RESET characteristics are NOT completing each other. Assuming that there were even any SET pulses needed during the “training” of this toy problem, it would be good to show the “SET” versions of Figures 3F,G.

7) The results with classification of noisy training data and with the test data (non-training data) are appreciated, and help improve the quality of the paper. Two points here:

– since the 9000 noisy images are noise added to the training images, they should NOT be referred to EVER as “test patterns” (see line 269). It is VERY important that it be 100% clear this is noise on already-seen training images, NOT on never-before-seen test data.

– to differentiate the statistically relevant percentage numbers in Figure 5b (out of 9000 noisy images) from the generalization “percentages” in Figure 5c, please label the green bars as generalization accuracy and include the exact fraction of correct generalized images (e.g., 22/24, etc.) as well as the percentage (to make it clear these “percentages” are accumulated over a very small number of instances).

Authors take great pains to laud the great power of experimental results, and similarly spend significant effort disparaging simulation-based efforts (including Ref 36 and the relevant portions of Ref 19, which also included a MUCH larger “first ever” experiment than the present manuscript). However, this reviewer would claim that simulations — especially ones that are done well enough that they can project accurately for the field whether the devices at hand will succeed or will fail at tasks of actual interest — are much more useful for the field than experimental demonstrations at toy scale without such simulations. In most fields, toy experiments have little forward predictive power for tasks of actual interest — for neural networks, they have zero forward predictive power.

That said, it would not take much for the authors of this manuscript to take their extensive experimental data and to project through simulation whether these devices as they stand (nonlinear conductance response AND massive device-to-device variability) would suffice for even easy non-toy problems like MNIST. Such simulations would also determine whether the nonlinear conductance response itself were invalidating — e.g., if every single device in a large array functioned exactly like the device described by Figure 3B,C with ZERO variability, even in that ideal case would one have a shot at training MNIST successfully? This is the most important and relevant question, which remains unanswered (or even acknowledged) by the present manuscript. This is why the manuscript should not be accepted as it currently stands. Without this kind of information, the reviewer cannot agree that “these experimental results pave the way” towards the desired energy-efficient neuromorphic systems. Instead, this manuscript is merely yet another “first to show yet more useless results” paper.

Month: May 2017

Major Things I Learned or Got Exposed to in My Current Job

Struggling with Research Topic Born with Defects

Dealing with flaky downstream services: simple can be effective