

## Supplementary Information



**Fig. S1: The proposed two-stage programming scheme.** (a) Programming curves (program@  $V_g=9$  V and read@  $V_g=6$  V,  $V_d=0.8$  V). (b) The proposed two-stage programming scheme. (c) Programming curves with different  $V_d$  incremental steps. (d) Programming curves with excellent linearity by employing the two-stage programming scheme.

2 In the scenario of working as CAM, Flash states are represented by the read  
 3 currents, rather than the threshold voltages as in memory mode. The applicable  
 4 programming scheme should be also different, which demands a linear current  
 5 modulation during programming. However, when programming under a fixed drain  
 6 voltage ( $V_d$ ), the read currents will decrease linearly at first, and then more and more  
 7 nonlinearly, as shown in **Fig. S1(a)**. This is because as the Flash programming goes on,  
 8 the electrons trapped in the floating gate is accumulating to saturation, which reduces  
 9 the programming efficiency [1].

10 To address this and improve linearity, we propose a two-stage programming  
 11 scheme, which employs a constant  $V_d$  in the first  $N_0$  cycles, and then an increasing  $V_d$   
 12 with a  $\Delta V$  which is also incremental, as shown in **Fig. S1(b)**. The principle is utilizing  
 13 the increasing  $V_d$  to compensate for the influence of electron saturation. The effects of  
 14 different  $V_d$  incremental steps with initial  $V_d=5$  V are shown in **Fig. S1(c)**. As the step  
 15 increases, we can see that the programming linearity is recovered gradually. Besides,  
 16 by selecting the step sizes carefully, we can obtain an excellent linearity for different  
 17 initial  $V_d$ , as shown in **Fig. S1(d)**. Hence, we can obtain different programming speeds  
 18 and current steps by selecting the different initial  $V_d$  and corresponding  $\Delta V$ .



**Fig. S2: The employed parallel programming scheme.** (a) Flow chart of the parallel scheme. (b) Array structure and parallel programming operation. (c) Programming results in a parallel group of 8.

1 In addition to improving the programming linearity of single transistor, parallel  
2 programming is another effective approach to speed up on-the-fly learning. The flow  
3 chart of the employed parallel programming scheme is shown in **Fig. S2(a)**, which is  
4 combined with the aforementioned two-stage programming scheme. The parallelism is  
5 along with different BLs, where the cells needed to be programmed are applied with  $V_d$ ,  
6 and others connected to ground, as shown in **Fig. S2(b)**. According to the erasing  
7 mechanism, the Flash transistors in the same WL will be erased together. Therefore,  
8 they should be erased first, and then programmed to the respective states for termination,  
9 which guarantees no erasing operation during programming. **Fig. S2(c)** shows the  
10 whole process of parallel programming in a group of 8. We can see that the Flash  
11 currents start from the erased states and gradually decrease during the programming  
12 operation. When one Flash transistor reaches its target, the corresponding programming  
13 is terminated.

14

15



**Fig. S3: The comparison of programming cycles in three programming schemes.** PGM cycles are dramatically reduced by parallel scheme. The target conductance is set by the L2 distance in each column for fair comparison. (measured @400cells)

1    **Fig. S3** shows the comparison of required programming cycles in three  
 2    programming schemes, constant  $V_d$  programming, two-stage programming alone,  
 3    parallel and two-stage programming together, respectively. Noted that the number of  
 4    cycles required in parallel scheme depends on the smallest target currents in the group,  
 5    therefore, for a fair comparison, the targets in each group are set to cover all the states  
 6    of 3-bit L2 distance. We set the parallelism as 16 (BL number), and the target currents  
 7    in each group are 0, 1, 4, 9, 16, 25, 36, 49, 0, 1, 4, 9, 16, 25, 36, 49. By jointly employing  
 8    the parallel and two-stage programming schemes, we can get a 19.6-fold (average of  
 9    25 WLs) reduction in total programming cycles, which provides the basis for rapid and  
 10   precise on-device lifelong learning.



**Fig. S4: The overall current distribution of 3-bit-storage CAM.** This degree of current variation ( $\sigma<0.6\mu\text{A}$ ) is proved to have almost no degradation to accuracy (<0.15%) in **Fig. 4(d)** of main text.

1  
2



**Fig. S5: The fitted (a)  $I_d$ - $V_g$  and (b)  $I_d$ - $V_d$  curves of NOR Flash transistors based on BSIM3v3 model calibrated with experimental data.** The model shows a good consistency with the experiments. The modified parameters include the thickness of oxide layer, saturation velocity of electron, carrier mobility, and source-drain series resistance.



**Fig. S6: Search latency with different word widths in the proposed CAM design.** The search latency will increase as the word width increases due to the increasing parasitic resistance and capacitance in MLs. We can see that the search latency in 256-dimension word is almost 500 ps, which increases by 3.3-times compared to 64-dimension word. Therefore, multi-bit capability in CAM is also crucial to reduce the word width (related to the circuit area and search energy) and search latency.



**Fig. S7: Experiments for few-shot learning.** Pre-trained 4-layer CNN is used to extract features. During few-shot learning, features of support set are written into the Flash-based CAM. During inference, search vector of query is compared with all the stored contents by calculating pair-wise L2 distance. Predicted label is given by sensing the minimum current of MLs.

1 The schematic of our experiments is shown in **Fig. S7**. we choose the widely used  
 2 Omniglot dataset for demonstration. In this dataset, there are 1623 handwritten  
 3 characters (classes) from different alphabets, and each character contains 20 examples  
 4 from different people. We randomly choose 1200 classes for meta-training of feature  
 5 extractor and the rest of 423 classes for configuring few-shot learning tasks. We employ  
 6 a four-layer CNN as the feature extractor, which consists of 4 convolutional modules,  
 7 each including 64-channel  $3 \times 3$  kernels, batch normalization, ReLU, and  $2 \times 2$  max-  
 8 pooling. The output of final layer is flattened to a 64-dimension vector.

9 During meta-training, for one  $n$ -way  $k$ -shot task, we randomly sample  $n$  different  
 10 classes (each class with  $k + x$  examples) from 1200 classes to establish support set ( $k$   
 11 examples) and query set ( $x$  examples). Here, both the support set and query set are  
 12 labeled examples. The CNN is then trained to minimize the error of predicting labels in  
 13 the query set by the back-propagation (BP) algorithm, which is called an “episode”. By  
 14 repeating this process (randomly sampling few-shot tasks and BP training), the CNN is  
 15 gradually meta-trained to be capable of employing a general scheme to map the input  
 16 data to their appropriate labels regardless of their specific contents, namely learning to  
 17 learn [2]. Note that the CNN weight parameters are fixed after meta-training and will  
 18 no longer change during the following phase. During few-shot learning, the 64-  
 19 dimension features of support set are first extracted by the CNN, quantized to 3-bit  
 20 according to the scheme of L2 distance, and then stored in the Flash-based CAM.  
 21 During inference, the extracted query features are coded and transformed into a series  
 22 of WL voltages for computing L2 distance with all the stored features in parallel. The  
 23 predicted label is given by sensing MLs for minimum current.



**Fig. S8: Conductance distribution of the stored 64-dimension feature vectors in different few-shot learning tasks.** The features of 5-shot task are average results across different shots.



**Fig. S9: To achieve the same accuracy with the proposed L2 CAM, the word width of TCAM needs to be extended.**

1  
2  
3  
4



**Fig. S10: Increase L2 CAM dimensions by using p-stable locality sensitive hash algorithm ( $E^2$ LSH) for higher inference accuracy.**

1 The LSH algorithm is to solve the problem of approximate nearest neighbor search.  
 2 The original LSH method embeds the original feature-vector space into the Hamming  
 3 space and converts the distance measure of original space to the Hamming distance,  
 4 which performs bitwise XOR operation in two equal-length binary sequences and  
 5 accumulates the results of “1” [3]. It has been applied in TCAM, but is difficult to  
 6 directly extend to the multi-bit CAM.

7 To address this and apply the LSH method to our proposed multi-bit-storage L2  
 8 CAM, we employ the p-stable LSH method, which can directly perform the locality-  
 9 sensitive hash operation in Euclidean space and is also called  $E^2$ LSH [4]. The hashing  
 10 function  $h_{a,b}(v)$  of p-stable LSH is defined in equation (1).

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor \quad (1)$$

12 The parameter  $a$  is a random vector that has the same dimension as the feature  
 13 vector  $v$ . Each element in  $a$  is randomly and independently generated from a p-stable  
 14 distribution ( $p=2$  is Gaussian distribution). The parameter  $b$  is a random number in the  
 15 range of  $[0, r]$ , of which  $r$  is an empirical hyperparameter. Besides, the final result needs  
 16 a round-down operation. One hashing function acting on the original feature vector can  
 17 generate one value for the new mapping vector. Therefore, by establishing more  
 18 hashing functions, we can map the 64-dimension feature vector to a larger width and  
 19 increase the inference accuracy of MANN. as shown in **Fig. S10**.

20  
 21



**Fig. S11: The impacts of ADC quantization precision on the inference accuracy.** 4-bit is enough for the quantization precision of ML sum current, with an accuracy loss less than 0.3% for both of 5-way and 20-way tasks.

1  
2  
3

1    **Supplementary References**

2    1. Xiang, Y. C., Huang, P., Yang, H. Z., Wang, K. L., Han, R. Z., Shen, W. S., Kang,  
3    J. F., et al. (2019). Storage reliability of multi-bit flash oriented to deep neural  
4    network. In 2019 IEEE International Electron Devices Meeting (IEDM) (pp. 38-2).  
5    IEEE.

6    2. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. (2016).  
7    Meta-learning with memory-augmented neural networks. In International  
8    conference on machine learning (pp. 1842-1850). PMLR.

9    3. Andoni, A., and Indyk, P. (2008). Near-optimal hashing algorithms for approximate  
10   nearest neighbor in high dimensions. Communications of the ACM, 51(1), 117-122.

11   4. Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-sensitive  
12   hashing scheme based on p-stable distributions. In Proceedings of the twentieth  
13   annual symposium on Computational geometry (pp. 253-262).