TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. X, NO. X, X 201X 1 Training Passive Photonic Reservoirs with Integrated Optical Readout Matthias Freiberger, Student Member, IEEE, Andrew Katumba, Student Member, IEEE, Peter Bienstman, Member, IEEE, and Joni Dambre, Member, IEEE Abstract—As Moore’s law comes to an end, neuromorphic PD approaches to computing are on the rise. One of these, passive ADC photonic reservoir computing, is a strong candidate for comput- reservoir ... ing at high bitrates (>10 Gbps) and with low energy consumption. Currently though, both benefits are limited by the necessity ADC to perform training and readout operations in the electrical domain. Thus, efforts are currently underway in the photonic ADC community to design an integrated optical readout, which allows ADC to perform all operations in the optical domain. In addition to the technological challenge of designing such a readout, new ADC algorithms have to be designed in order to train it. Foremost, ... suitable algorithms need to be able to deal with the fact that the actual on-chip reservoir states are not directly observable. ADC In this work, we investigate several options for such a training algorithm and propose a solution in which the complex states of readout the reservoir can be observed by appropriately setting the readout Fig. 1: Swirl reservoir and readout system of existing RC weights, while iterating over a predefined input sequence. We perform numerical simulations in order to compare our method photonic chip prototypes, with the nodes of the reservoir being with an ideal baseline requiring full observability as well as with collected in the readout section, where optical output signals an established black-box optimization approach (CMA-ES). are converted to electrical signals and then processed to a final Index Terms—Cognitive Computing, Reservoir Computing, output. PD: photodiode, ADC: AD converter, MP: Micropro- Photonic Computing, Neuromorphic Computing, Nonlinearity cessor. The blue and orange parts represent respectively the Inversion, Integrated Optical Readout, Limited Observability. optical and electronic signals and components. I. INTRODUCTION and header recognition tasks relevant for telecom applications AS Moore’s law appears to come to an end [1], novel at data rates higher than 10 Gbps. computing approaches, that go beyond traditional com- Nevertheless, the technology is still afflicted by a number puting paradigms of silicon semiconductors are on the rise. of limitations, which prevent its application to problems on a Among these, Reservoir Computing [2]–[4], a machine learn- larger scale. One of these limitations lies in the fact that the ing technique proposed to perform brain-inspired computing, training, as well as the mixing of signals to solve a meaningful has been applied in unconventional computing recently in task, has so far only happened in the electrical domain. Indeed, attempts to overcome the Von Neumann bottleneck. While Vandoorne et al. [14] transferred the signal at each node from promising results have been shown on various hardware plat- the optical to the electrical domain using a photodetector, and forms [5]–[7], Photonic Reservoir Computing [8]–[13] makes then sent it through an AD converter. Finally, the required lin- an especially strong case as a future-proof technology platform ear combination of signals and readout weights was performed due to its potential to perform high-bandwidth tasks with little in the electrical domain using a microprocessor. See Figure 1 energy consumption. Especially, integrated passive photonic for a detailed illustration of the process. reservoir computing [14]–[16] is a promising approach since In order to truly reap the benefits of optical computing it exploits common CMOS fabrication technology and thus though, signals need to be processed at very high data rates ensures an easy route to mass market production. Integrated in an energy-efficient way. Considering ecological as well passive photonic reservoirs have been shown to solve parity bit as economical factors, minimizing power consumption is of paramount importance for future computing technologies. M. Freiberger and J. Dambre are with the Ghent University - imec From that perspective, the approach pursued in [14] for reading IDLab, Department of Electronics and Information Systems, Technologiepark- Zwijnaarde 15, B-9052 Ghent, Belgium. A. Katumba and P. Bienstman are out integrated photonic reservoirs is inefficient, since there is with the Ghent University - imec Photonics Research Group, Department a significant energy and latency cost associated to it. Hence, of Information Technology, Technologiepark-Zwijnaarde 15, B-9052 Ghent, it is desirable to perform the summing of signals in the Belgium. Email: Matthias.Freiberger@UGent.be optical domain instead of in the electrical domain. Using Manuscript received 13.10.2017; revised 29.06.2018 such an integrated optical readout, only a single photodetec- TNNLS-2017-P-8539.R1 ©c 2018 IEEE arXiv:1810.03377v1 [cs.NE] 8 Oct 2018 MP TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. X, NO. X, X 201X 2 on real-valued signals, whereas here we need to use complex- OM bias valued weights on complex-valued signals. A second possible solution could be to train the weights OM reservoir ... based on simulations of the behaviour of a virtual reservoir, using photonic circuit simulation software, which will obvi- OM ously have full observability of all the nodes. However, the PD OM fabrication tolerances of these devices are such that the prop- agation phase of two nominally identical waveguides could OM be completely different. This prohibits the successful transfer OM of weights trained using the idealized simulated reservoir ... to actual hardware. In summary: finding a suitable training algorithm is not straight-forward. OM In this paper, we evaluate several approaches to train in- tegrated photonic reservoirs without full state observability. readout We simulate the behavior of passive photonic reservoirs with Fig. 2: Illustration of a swirl reservoir connected to a fully integrated optical readout and train the simulated readouts of optical readout. Each optical output signal is modulated by these reservoirs to perform 3 bit header recognition. We eval- an Optical Modulator (OM) implementing the weights. The uate the investigated approaches by comparing their achieved optical outputs are then sent to a combiner structure where all bit error rates over a wide range of input signal bit rates. signals are summed and finally converted to an electric output In Section II, we give an overview on our methodology and signal using a photo diode. the simulation setup which is applied throughout the paper. Thereafter, in Section III, we introduce a baseline approach inspired by classic reservoir training methods, against which tor, which receives the weighted sum of all optical signals, all other approaches will be evaluated. We apply CMA-ES, a is required. A straightforward low-power optical weighting well known general-purpose black-box optimization approach, element can take the form of a reverse-biased pn-junction. A in Section IV to train simulated reservoirs with integrated better solution would be to use non-volatile optical weighting readout and evaluate a possibility of a combination of our elements, such as the ones that are currently being developed baseline with CMA-ES in Section V. Finally, in Section VI by several groups [17]–[19]. The necessary summing of the we propose a technique to estimate the complex-valued states individual weighted reservoir state signals can be performed of the reservoir using only the final photodiode by running a by a combiner tree, i.e. by summing signals pairwise using predetermined input sequence several times, while appropri- an 2x1 optical combiner structure and repeating this step on ately setting the readout weights. The estimated complex states the resulting intermediate signals until a single optical output can then be used for classic training algorithms on a digital signal is obtained. Figure 2 further illustrates the concept of a computer and the resulting weights can be programmed on the fully optical integrated readout. actual readout. However, by employing such an integrated optical readout, we lose direct observability of the states of the photonic reservoir. Observing all states is mandatory though, in order to II. METHODOLOGY use classical linear readout training algorithms such as ridge We evaluate the suitability of our proposed training ap- regression [20] and other least-squares approaches. At first proaches by comparing the achieved bit error rates of sim- glance, a possible solution could be to add a separate high- ulated reservoirs over a wide range of input signal bitrates. speed photodetector to each reservoir node, which is only In more detail, we train the readout weights of simulated used during training to observe the states. The weights would passive photonic reservoirs with integrated optical readout then be calculated in the electrical domain, while the trained to perform the 3-bit header recognition task. Our simulation reservoir could still be operated entirely in the optical domain. setup builds upon the setup in [16]: Katumba et al. inject an This is unfortunately challenging due to a number of reasons. intensity-encoded bit signal into a simulated optical circuit First, high-speed photodetectors tend to be costly in terms of corresponding to an integrated 4x4 passive photonic reservoir chip footprint. Therefore, such an architecture would not scale using the swirl [14] architecture as as illustrated in Figure well when increasing the number of nodes in the photonic 1. Further, they convert the complex-valued optical signals at reservoirs to numbers common in classic echo state networks each of the 16 nodes of the reservoir, from the optical into the [3]. Second, since passive photonic reservoirs make use of electrical domain using a previously introduced photodetector coherent light for added richness, the tunable readout weights model that takes into account the responsitivity, bandwidth in the optical domain have to be complex-valued as well. and background noise of an integrated optical photodetector. However, unless we go to even more complicated coherent Finally, they arrange the 16 sampled electrical signals into detectors, these photodiodes can only measure the intensities a reservoir state matrix and detect 1 bit delayed XOR bit of the states and not their phases, so we cannot calculate the patterns in the bit sequence which has previously been fed correct complex-valued weights. Note that this is different into the simulated photonic reservoir. In this work, contrary to from the approach in [14], which used real-valued weights Katumba et al., we do not convert all the reservoir’s node combiner structure TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. X, NO. X, X 201X 3 signals into the electrical domain prior to training. Instead 〈Id〉 = 0.1 nA, T = 300 K and RL = 1 MΩ. Finally, to model we assume an integrated optical readout operating directly the limited bandwidth B of the integrated optical detector, on the optical signals of the reservoir capable of weighting a fourth-order Butterworth low-pass filter is applied to the and summing signals which are complex-valued by nature. resulting output signal. By tuning the weights of this integrated optical readout, which We fix the delay time between any two connected nodes in is simulated on the functional level, we are able to perform our simulated reservoir to 62.5 ps and assume the waveguide meaningful computations by composing an optical signal from loss to be a rather pessimistic 3 dB/cm. We inject the input the signals occurring at the individual reservoir nodes. This signal into the reservoir through nodes 5, 6, 9 and 10, where optical signal is then converted from the optical into the node indices are ordered row by row and from left to right. electrical domain using the detector model used in [16]. This input node configuration offers a good trade-off between In more detail, we use Caphe [21] to simulate an optical performance and wiring effort [16]. circuit corresponding to an integrated 4x4 passive photonic As mentioned before, we use intensity modulation to encode reservoir using the swirl [14] architecture. We subsample our bit patterns into the optical signals sent to the simulated the intensity-modulated input signal 24 times per bit period, reservoir. In more detail, we distribute a total power ptotal = smoothen it using a single-pole low-pass filter, and simulate 0.1 Watt over the 4 input nodes, which leaves the maximal the response of our optical integrated swirl circuit using Caphe. amplitude power per node input signal at 0.1W 4 = 0.025W. We obtain a sampled complex output signal of each reservoir Moreover, we have found that training our reservoirs with node as result, denoting amplitude and phase of the optical a bias term is beneficial to classifier performance. Thus we signal at that node at a certain instant in time. We arrange the implement an additional optical bias line which does not sampled complex optical signals at each reservoir node as a lead into the reservoir, but directly into the integrated optical complex state-node matrix X ∈ C and simulate the integrated readout (see Figure 2). This line carries a constant input power optical readout by computing an inner product between this of 0.02W. Every training algorithm in this paper treats the bias matrix and a complex weight vector w, which represents like an additional reservoir state signal, with the exception that the complex optical weights. The resulting complex-valued the bias signal is not regularized during training. Assuming an signal is fed into a photodetector model to obtain the electrical equal occurrence probability of symbols 1 and 0 in the input output signal of the integrated photonic reservoir. This output signal, the average signal power to drive the reservoir amounts power signal is then downsampled at a predetermined optimal to 0.1W · 0.5 + 0.02W = 0.07W. The power consumption for sampling point and thresholded in order to obtain a clean actual implementations of an integrated optical readout with binary output bit sequence. We place a threshold T in the possibly active weighting elements needs of course also to be middle of the signal range of our output signal y[n] as taken into account, but is beyond the scope of this work. T = P5(y[n]) + (P95(y[n])− P5(y[n]))/2, (1) Since we seek to find a training algorithm that works well over a wide range of dynamics, we assess the performance where P5(y[n]) and P95(y[n]) are the 5th and 95th percentile of our classifiers by training integrated photonic reservoirs of y[n] respectively. We simulate the reservoir’s optical readout excited by input signals over a wide range of bitrates. We as sweep the bit rate of the input signal in 1 Gbps steps between y = σ(Xw), (2) 1 and 31 Gbps. where X ∈ CN×F is the matrix of complex reservoir states, As a machine learning task to assess the performance of our containing N samples of the complex signal occurring at F classifiers, we use the header recognition task. We expect the reservoir nodes of an integrated passive photonic reservoir. reservoir to present 1 at the output whenever a certain sought w ∈ CF×1 is a vector holding the complex weights of the header bit sequence occurs in the input signal and 0 otherwise. integrated optical readout. σ(a) : CN → RN is the mapping More precisely, given an input signal u[n] and a predefined from the optical to the electrical domain that is realized by header bit pattern h[n] we define our ideal desired signal the photodetector. In this paper, the photodetector model of dideal[n] to be [16] is used for all simulations and experiments. This model { ∑ if M−1 1 Jh[(M − 1)−m] = u[n−m]K = M computes the electric current of a sampled complex signal a d m=0 ideal[n] = as 0 else i(a) = R|a|2, (3) (5) where the notation above is Iversons bracket notation, defined where R is the responsitivity of the photodetector. Thereafter as { a zero-mean Gaussian noise vector n with a variance σ2 n is 1 if P true added to i(a). The variance σ2 n is computed as JP K = (6) 0 else σ2 n = 2qB(〈I〉+ 〈Id〉) + 4kBTB/RL (4) ∑ and M is the length of h[n] in bits. We start out setting where q is the elementary particle charge, B is the bandwidth M = 3 and h[n] = δ[n] + δ[(n − 2)], where δ denotes of the photodetector, 〈I〉 N = 1 N n i[n] is the photocurrent, the dirac delta function, thus seeking to detect the header 〈Id〉 is the dark current, kB is the Boltzmann constant, T is bit pattern ”101”. We use this pattern throughout this work the temperature and RL is the load impedance of the photode- as we discuss our findings. Thereafter, to demonstrate that tector. In our simulations we set R = 0.5 A W , B = 25 GHz, our introduced approaches exhibit consistent performance on TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. X, NO. X, X 201X 4 the task, we train classifiers to detect all possible 3 bit function. Consequently, if we train our readout weights in a headers. While it is more common in recent literature [13], classical way, using ridge regression [20], the desired product [14] to compute errors in a winner-takes-all fashion jointly vector d of states and readout weights will be transformed by for all trained classifiers and patterns, we analyze the detection the detector output function. In more detail if we train w as performance of each header bit pattern individually. We argue H 2 −1 H that individual classifiers can easier be analyzed employing w = (X X + α I) X d, (10) the latter approach, especially when investigating reservoir where α ∈ R is the regularization strength and I ∈ RF×F is performance over a large range of bitrates. the identity matrix. Assuming w is ideal, We train our reservoir readouts using a modified version of the ideal desired signal σ(Xw) = σ(d). (11) d[n] = dideal[n] · ptotal, (7) Therefore, at the model output, we obtain where again ptotal = 0.1 W, the maximal attainable power of y = σ(Xw) = σ(d). (12) the input signal. Since we would like our model to output d rather than We generate 10010 random bits as training data as well as σ(d), we need to find an approximate inversion of the readout 10010 random bits of test data. Since we remove the reservoir nonlinearity σ̂−1 to invert σ such that signals generated by the first 10 input bits for both training and test states in order to minimize the impact of transients, σ(σ̂−1(d)) ≈ d. (13) this leaves us with 10000 bits of training data as well as 10000 bits of test data. To account for manufacturing variations, in We therefore train w as general, we simulate each reservoir 10 times with identical w = (XHX + α2I)−1XH σ̂−1(d), (14) train and test input, but different random phase configurations of the waveguides between nodes, as well as of the waveguides We can approximate the detector model in [16] (for details, feeding the input signals to the nodes. Subsequently we train see Section II) as a function in closed form, if we neglect its the output weights of the simulated readout for each instance. bandlimiting lowpass filter for simplicity : To compare the performance of our trained classifiers, we σ(a) = R|a|2 + n. (15) use the bit error rate (BER). The BER is defined as N Again, R denotes the responsitivity of the photodetector and 1 ∑ ebit = JyT [n] =6 dideal[n]K, (8) n is a noise vector. One can see that σ(a) cannot be inverted N n=1 exactly, primarily since we take the absolute value of a, and where y [n] is the subsampled, thresholded output signal of also due to the added, unknown noise vector. Nevertheless we T the reservoir and dideal[n] is again the ideal desired signal. can approximate the inverse of th√e detector function above as With 10000 bits of test data, the minimal detectable bit −1 a error rate with a confidence level of ≈ 90% is 10−3 [22]. σ̂ (a) = . (16) R Before computing the bit error rate on the test data, for all trained classifiers we determine the optimal sampling point By doing so, we minimize the sum of squared errors minimizing the bit error rate on the train data, which we ∑[ √ ] N 2 (n) − d[n] then use to downsample the test prediction and compute the (x w) (17) bit error rate as described above. We have found that this R n=0 additional step leads to more robust results for all classifiers where we denote the row vector with index n of X as x(n). investigated. While this approach obviously can not be used on actual devices due to the fact that the optical signals on the chip are III. BASELINE: COMPLEX-VALUED RIDGE REGRESSION not observable, its similarity to classic, real-valued training Consider again the model of an integrated optical readout approaches makes it a suitable candidate to be used as a as introduced in Section II, defined as baseline to assess and compare novel training approaches for (9) actual integrated reservoirs. Whenever we refer to this baseline y = σ(Xw). we call it plainly complex-valued ridge regression. Whenever X ∈ CN×F is the matrix of complex reservoir states, contain- we train classifiers using complex-valued ridge regression, we ing N samples of the complex signal occurring at F reservoir use 5-fold cross validation to find a suitable regularization nodes of an integrated passive photonic reservoir. w ∈ CF×1 is parameter α for an optimal training result. a vector holding the complex weights of the integrated optical We assess the performance of this baseline by training readout. σ(a) : CN → RN is the mapping realized by the integrated photonic reservoirs for different input signal bitrates photodetector of the readout. on the 3-bit header recognition task. Figure 3 shows the Contrary to classic reservoir readouts, the readout weights achieved bit error rates. of this model do not lie after the reservoir’s nonlinear detector As one can see, our proposed baseline works well for function, but before it. This implies that any result of the dot bitrates between 2 and 20 Gbps with minor increases of the product Xw of the model will be passed through this nonlinear error for bitrates at 14 Gbps and 17 Gbps. TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. X, NO. X, X 201X 5 Error Rate vs Bit Rate Error Rate vs Bit Rate 100 100 10−1 10−1 10−2 10−2 10−3 10−3 Baseline CMA-ES 10−4 10−4 0 5 10 15 20 25 30 0 5 10 15 20 25 30 Bit Rate[Gbps] Bit Rate[Gbps] Fig. 3: Bit error rates as a function of input signal bitrate for Fig. 4: Bit error rates as a function of input signal bitrate for our baseline approach, complex-valued ridge regression. Clas- the black-box approach, CMA-ES. Baseline approach is shown sifiers have been trained to perform 3 bit header recognition for comparison. Classifiers have been trained to perform 3 bit (pattern 101) on a 4x4 passive photonic swirl. All results are header recognition (pattern 101) on a 4x4 passive photonic averaged over 10 different reservoirs, the minimal detectable swirl. All results are averaged over 10 different reservoirs, the error rate is 10−3. minimal detectable error rate is 10−3. IV. BLACK-BOX OPTIMIZATION from the reservoir’s subsequent output and feeding that error measure back to the CMA-ES algorithm. The algorithm tra- Since the states of our integrated photonic reservoir are not verses the loss function space suggesting new candidates to observable, a straight-forward approach is to train the read- minimize the loss function. We encode the real and imaginary outs using a black-box optimization approach. Among these, parts of the complex weight vector w ∈ C into a real-valued Convergence Matrix Adaption - Evolution Strategy (CMA-ES) vector ( ) [23] appears to be a suitable candidate, since it usually deals ′ Re(w) well with non-convex search spaces, which are likely to occur w = Im (18) (w) for our problem since we perform optimization in the complex domain. Evolution strategies solve optimization problems in an prior to handing it to the CMA-ES algorithm. The inverse iterative way using a repeated two-step process based on of the transformation is applied to weight vectors suggested by the variation and selection of possible solutions. In the variation algorithm before setting them to the readout. We initialize the ′ step, new candidate solutions are generated by adding noise to algorithm with a zero vector w0 = 0 and perform sweeps −5 the currently known best solution. This noise is drawn from a over the initial variance in steps of one decade between 10 2 normal distribution called the mutation distribution. Thereafter, and 10 . As recommended in [24], we set the population size newly generated candidate solutions are evaluated using a to 4 + b3 log(F )c. Note that we might achieve better results predefined cost function. Candidate solutions which perform by cross-validating initial variance and population size, from better than the previously best known solution are selected which we refrain due to the typical very long training times and used to update the parameters of the mutation distribution. of CMA-ES. As an objective function, we minimize the sum Using the updated best known solution as well as the updated of squared errors mutation distribution, the steps described above are repeated. ∑N 2 The overall process can be repeated until a satisfactory solution eSSE = (y[n]− d[n]) . (19) to the given optimization problem has been found. The exact n=1 rules how the mutation distribution is updated vary inbetween We assess the performance of CMA-ES by training inte- evolution strategies. CMA-ES adapts the mean value and grated photonic reservoirs with different input signal bitrates covariance matrix of the used mutation distribution such that on the 3-bit header recognition task. Figure 4 shows the the probability to draw updates which lead to previously resulting error rates (results are averaged over 10 different selected solutions increases. The underlying rationale is that reservoirs). conducting updates which lead to good solutions in the past As one can see, CMA-ES performs only slightly worse than might result in even better solutions when performed again. the baseline, achieving the minimal detectable bit error rate of For details see [23]. 10−3 for bitrates between 4 and 17 Gbps. Therefore, it is in Integrated photonic reservoirs can be trained using CMA-ES principle capable of training integrated photonic reservoirs. by transferring a candidate solution suggested by the algorithm However, because CMA-ES typically involves high training to the reservoir’s readout weights, presenting the training bit time and requires many iterations of the input data, we run an sequence to the reservoir, computing a chosen loss function additional experiment investigating its convergence behavior. Error Rate[.] Error Rate[.] TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. X, NO. X, X 201X 6 Maximum Phase Perturbation Error Rate vs Evaluations 0 0.1π 0.2π 0.3π > 0.3π 100 < 0.001 0.506 0.643 0.673 ≈ 0.70 TABLE I: Bit error rate for the 3 bit header recognition 10−1 task (pattern 101) for increasing phase perturbations in the reservoir’s waveguides (readout trained in simulation with 10−2 complex-valued ridge regression). 10−3 2) We create simulation models of 10 possible physical instances of the previous reservoir that reflect possible −4 waveguide phase variations in a physical reservoir due to 10 0 200 400 600 800 1000 manufacturing. This is done by adding random variations Number of evaluations[.] ηi and ηc to the phase configurations of input and con- nection waveguides of the nominal reservoir. The phase Fig. 5: Bit error rate as a function of iterations over the training variations are drawn from a uniform distribution U(0, b), data for CMA-ES. Classifiers have been trained to perform 3 where b represents the maximal phase perturbation in bit header recognition (pattern 101) on a 4x4 passive photonic each waveguide. swirl. All results are averaged over 10 different reservoirs, the 3) We reapply the previously trained readout weight vector minimal detectable error rate is 10−3. to each of these physical reservoir models and record the resulting bit error rate. This procedure is repeated for for all perturbations We simulate a passive photonic reservoir and train it with b ∈ {0.1π, 0.2π, 0.3π, 0.4π, 0.5π, 0.6π, 0.7π, 0.8π, 0.9π, π}. CMA-ES where we record the error rate at each iteration. We In addition, for each value of b, the resulting bit error rates are again average our results over 10 simulated reservoirs driven averaged across 10 different instances of the nominal reservoir by a bit rate of 10 Gbps to obtain the graph seen in Figure 5. (each with different random phases of all connections). The results of this experiment show that the full input Table I summarises the results. The bit error rate already training sequence needs to be presented about 500 times before increases by two orders of magnitude for amounts of ran- CMA-ES reaches satisfactory results. Since a short training dom phase noise that are currently well below fabrication process on the actual hardware is mandatory for our devices tolerances. This renders a pretraining-retraining approach chal- in order to obtain mass-market maturity, a way to drastically lenging. In the next section, we therefore pursue a radically reduce the number of necessary iterations over the input is different approach, in which we estimate the reservoir’s states necessary. Since stand-alone CMA-ES training converges too through the available photodetector in order to train a weight slowly, a promising alternative is the pretraining of models in vector from the actual states on a digital computer. This weight simulation and the refining them using CMA-ES on the actual vector can then be transferred back to the hardware where no devices to speed up the training process. We investigate the significant increase in error is to be expected. feasibility of such a pretraining-retraining approach in the next section. VI. NONLINEARITY INVERSION V. PRETRAINING IN SIMULATION Consider again the model of an integrated optical readout as introduced in Section II, defined as As mentioned in Section I, training reservoirs in simulation allows full observability. However the resulting weights are y = σ(Xw), (20) not directly transferable to hardware, due to the fabrication tolerances of integrated photonic reservoirs. However, if the the closed form approximation of the photodetector function changes in reservoir output signals due to process variations 2 are small enough, a pretraining approach could be useful if σ(a) = R|a| + n, (21) weights trained in simulation still perform considerably better than random weights. In such a case, one could use them to as well as its approximate inversion initialise, e.g., the training with CMA-ES, leading to much √ a faster training convergence. In this section, we quantify how σ̂−1(a) = , (22) phase variability of in our reservoirs affects the quality of R weights trained in simulation. Our approach is as follows: introduced in Section III. 1) We train the weights of a nominal simulated reservoir σ̂−1 can be used to estimate the reservoir state matrix |X| with integrated readout using complex-valued ridge re- when it cannot be observed directly. gression. We choose the bit rate of the input signal to Indeed, taking a closer look at Equation 20, one can see be 5 Gbps, based on Figure 3. that it is possible to observe the powers of the state matrix X Error Rate[.] TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. X, NO. X, X 201X 7 Pk exp(jφk)+Pl exp(jφl). The absolute value of their relative OM bias phase difference φkl = φk − φl can be computed using the phase estimation equation ((see Appendix A fo)r derivation) OM reservoir ... 2 2 2 | P φ | = arccos kl − (Pk + Pl ) kl . (27) OM 2PkPl PD OM Due to the fact that arccos(x) is injective in the sense that for any given input x there are two possible solutions on OM the interval [−π, π], using Equation 27 we can only find the absolute value |φkl|, while the sign of (φkl) remains unknown. OM ... To resolve this issue, one performs this estimation of φkl twice with different values Pkl and P ′kl: While Pkl is being OM computed as shown before, a phase difference of π 2 is added to state k such that readout ′ | π P = P 2 Fig. 6: Illustration of modulus (light intensity) observation kl k exp(jφk) exp(j ) + Pl exp(jφl)| 2 (28) procedure. The weight highlighted in red is set to 1, all =|Pk exp(jφ′k) + Pl exp(jφl)|2. remaining weights are set to 0, the observable output is the By comparing the estimates of |φ square of the modulus of the corresponding reservoir state. ( kl| and ′ ) 2 | ′ | Pkl − (P 2 k + P 2 l ) φkl = arccos , (29) 2PkPl through an appropriate selection ofthe input weight vector w. If we choose we are able to infer {the sign of φkl. 1 ′   |φkl| if |φkl| ∈ [0, π ] 0 φkl = 2 (30) w =  .. (23) −|φkl| else . 0 (see Appendix A for proof). For illustration, see Figure 7. Since as a weight vector and present an input signal u[n] to the input of the integrated photonic reservoir, we obtain |Pk exp(jφk) + Pl exp(jφl)| = |Pk + Pl exp(jφkl)| (31) y = σ(x1) = R|x1|2 + n, (24) and thus in consequence ∀q 6=k at the output of the integrated readout, where x1 is the first ∑ |P exp(jφ ) + P column of the state matrix . Since we can observe | |2 k k ∑ q exp(jφq)| = X R x1 + (32) n, we can estimate the modulus |x1| of ∀q=6 k |Pk + Pq exp(jφkq)|, x1 = |x1| exp(j arg(x1)). (25) we can pick a certain state x(t,k) as reference state and If we neglect n, assuming we know R, we can apply σ̂−1 determine the relative phase φkq between the value of this to y and estimate |x1| as √ state x(t,k) and the value of every other state x(t,q). After having estimated each reservoir state this way, we | n xˆ1| = σ̂−1(R|x |21 + n) = |x1|2 + . (26) apply complex-valued ridge regression to find the optimal R weights for our reservoir. Since the above operation is an approximate inversion of In summary, after presenting the same input 3F − 2 times, the nonlinearity of the photodetector, we call our method with F being the number of output channels, we are able to nonlinearity inversion. measure the full complex time evolution of each of the F Repeating this procedure for every channel of the read- output channels, even though we only have a single detector. out/column of the state matrix, we are able to estimate the This information can then be used to calculate the required moduli of state values in the state matrix |X|. See Figure 6 weights in a single pass in software on a digital computer, for illustration. after which the computed weights are transferred back to Nevertheless, in order to tune the weights of the integrated the readout. For our simulated 16 node reservoirs, taking optical readout, we need information about the arguments (the into account the bias channel, which can be treated like any phases) of |X| as well. Absolute phase information about the other readout channel, this requires to present the input data states is lost as soon as it passes the photodetector. However, 3F − 2 = 3 × (16 + 1) − 2 = 49 times to the reservoir. only the relative phases between states are of interest since As we have seen in Section IV, 3F − 2 measurements is they influence the output sum of the readout. much less than what is typically necessary using CMA-ES. Consider two given complex state values x(t,k) and x(t,l) In addition, it is also a deterministic number, in contrast to at a certain instant in time t, with moduli Pk and Pl as a black-box optimisation technique, for which it is hard to well as the modulus Pkl of their sum Pkl exp(jφkl) = determine beforehand how many iterations will be needed. combiner structure TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. X, NO. X, X 201X 8 OM Error Rate vs Delay bias 100 OM reservoir ... 10−1 OM PD OM 10−2 OM − Baseline 10 3 OM CMA-ES ... NL-Inv 10−4 OM 0 5 10 15 20 25 30 Bit Rate[Gbps] readout Fig. 7: Illustration of phase estimation procedure. The weights Fig. 8: Bit error rates as a function of input signal bitrate for highlighted in red are set to 1, the remaining weights are our proposed approach, nonlinearity inversion. Baseline and set to 0. This results in the power of the summed states at CMA-ES approaches are shown for comparison. Classifiers the output, which allows us to calculate the argument (phase have been trained to perform 3 bit header recognition (pattern angle) between states of the highlighted channels via the 101) on a 4x4 passive photonic swirl. All results are averaged relationship of the sum power and the powers of the individual over 10 different reservoirs, the minimal detectable error rate states obtained in the previous step. is 10−3. In our experiments, we simulate the repeated measurements on the other hand, the performance varies more in between for the nonlinearity inversion procedure by setting the cor- bit patterns, where it performs very well for some patterns responding rows of the weight vector in the readout model such as ’000’ and ’100 ’and slightly less well for others such according to our estimation procedure. For every setting of the as ’011’. Nevertheless, CMA-ES attains the minimal error weight vector we apply the complete model of our readout as rate for all bitpatterns in a range between 6 and 16 Gbps. described in Section II. We collect the corresponding output Interestingly, it manages to minimize the error rate for the signal from the detector and replace any samples of the output ’100’ pattern between 20 and 22 Gbps where our baseline signal smaller than 0 (which might occur due to noise or performs two orders of magnitude worse. This is consistent ringing of the bandlimiting filter of the photodetector) with 0. with the performance of our nonlinearity inversion approach Thereafter X is estimated using Equations 22, 27 and 30. We for that pattern, which also manages to minimize the error again train integrated photonic reservoirs with input bit rates rate in the same region contrary to the baseline. The overall between nodes on the 3-bit header recognition task to asses performance across bitrates is slightly smaller than for the two the performance of our proposed method. Figure 8 shows the former approaches, still, nonlinearity inversion minimizes the bit error rate as function of the input signal bit rate. error rate for the given task on all possible header bit patterns The nonlinearity inversion approach performs only slightly on a range between 6 and 14 Gbps, and is thus almost on par worse than the the CMA-ES approach and the complex- with CMA-ES here. In a nutshell, since nonlinearity inversion valued ridge regression baseline. It is remarkable though, that shows only slightly worse performance than CMA-ES, while for some bitrates, for instance at 14 Gbps, the nonlinearity requiring significantly less (3F − 2 = 3× (16 + 1)− 2 = 49 inversion approach outperforms the baseline. As the non- times) iterations of the input data, it appears to be the most linearity inversion approach operates on an estimate of the suitable among the training approaches for integrated photonic states used by the baseline, one would expect it to perform reservoirs that have been investigated in this paper. at best just as well as the baseline. A possible explanation for this phenomenon is that the noise introduced by the detector model in the estimation step acts as an additional VII. CONCLUSION regularizer for training. To provide a comprehensive picture on In this paper, we have assessed the suitability of different the investigated training approaches, we have performed the training methods for passive photonic reservoirs with inte- last experiment measuring the performance when recognizing grated optical readout. The most important boundary condition all possible bit patterns in a 3 bit header. Figure 9 shows for this type of hardware is the fact that the internal reservoir the results for all approaches discussed in this work. The states are not directly observable in the electronic domain. In results for all headers are largely consistent with the results we addition, the variability between individual devices makes it already observed for bit pattern ’101’: Our baseline approach necessary to perform the training on the actual hardware. For performs consistently well over a wide range of bitrates for this reason, our analysis has focused on the total training time all bitpatterns, approximately between 2 and 18 Gbps, with while staying below the maximally allowed bit-error rate of minor increases in error at 14 and 17 Gbps. For CMA-ES 10−3. Error Rate[.] combiner structure TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. X, NO. X, X 201X 9 recorded output signals allow us to estimate the amplitude and Baseline 100 phase of the reservoir’s states within 3F − 2 iterations over the training data. We have shown that this method performs 10−1 as well as a classic training approach, which requires full observability, although for a more narrow bitrange. While the 10−2 CMA-ES black box algorithm performs slightly better than our own method with respect to task performance, our method 10−3 requires significantly fewer iterations over the input data. We conclude that nonlinearity inversion is a suitable candidate for 10−4 training integrated passive photonic reservoirs. 0 5 10 15 20 25 30 While all our approaches are roughly comparable with Bit Rate[Gbps] respect to task performance, we find that the baseline approach still outperforms the remaining approaches. On has to take CMA-ES 100 into account here though, that the baseline approach requires full observability and thus can not be implemented without 10−1 considerable hardware effort which affects the scalability of our systems. In contrast, CMA-ES and our nonlinearity 10−2 inversion approach exhibit slightly worse task performance, but are still applicable for a large range of bit rates. It is 10−3 remarkable that our approach exhibits better results than the baseline at certain bit rates, especially since it operates on 10−4 an estimate of the states used by the baseline. A possible 0 5 10 15 20 25 30 explanation for this phenomenon is that the noise introduced Bit Rate[Gbps] by the detector model in the estimation step acts as an additional regularizer for training. In future work, we will Nonlinearity Inversion 100 further investigate the reasons behind this better performance 000 in order to come up with a more accurate but still efficient way −1 001 10 of training the weights based on the estimated states obtained 010 from nonlinearity inversion. Moreover we will investigate the −2 011 10 performance on actual hardware implementations of integrated 100 optical readouts as soon as they are available. While we have 10−3 101 assumed in this work that amplitude and phase of the readout 110 are largely independently tunable, we are aware that this will 10−4 111 not be perfectly realizable for all hardware implementations, 0 5 10 15 20 25 30 especially considering nonvolatile tunable weights. We will Bit Rate[Gbps] address this problem in future work which could be met for instance by refining weights trained by nonlinearity inversion Fig. 9: Bit error rates as a function of input signal bitrate using reinforcement learning. for all possible 3 bit headers as performed by our baseline, CMA-ES and nonlinearity inversion approaches. Classifiers APPENDIX A have been trained to perform 3 bit header recognition on a DERIVATION OF THE PHASE ESTIMATION PROCESS 4x4 passive photonic swirl. All results are averaged over 10 Two given optical signals of our reservoir states x(t,k) and different reservoirs, the minimal detectable error rate is 10−3. x(t,l) can be represented as x(t,k) = Pk exp(jφk) (33) In the process, we found that CMA-ES delivers acceptable performance, but has to be ruled out as a practical training and method due to its long training times. Due to the high x(t,l) = Pl exp(jφl) (34) fabrication tolerances of integrated photonic reservoirs, the respectively. Given Pk, Pl , as well as the squared sum of both use of pretraining-retraining as a way to reduce these training signals times does also not seem very promising. As an alternative, P 2 kl = |Pk exp(jφk) + Pl exp(jφ ))|2l (35) we have proposed a novel method for training such reservoirs which we call nonlinearity inversion. Our method essentially we want to gain information on the state phases φk and φl. resolves the issue of limited observability of the states of Applying Eulers identity on the previous equation, separating an integrated passive photonic reservoir by estimating the real and imaginary parts and computing the absolute value reservoir’s states through a single photodetector at its output. gives us In more detail, we iterate over the training data several times P 2 kl = [Pk cos(φk)+Pl cos(φl)] 2 +[Pk sin(φk)+Pl sin(φ 2 l)] . while setting the weights according to a certain pattern. The (36) Error Rate[.] Error Rate[.] Error Rate[.] TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. X, NO. X, X 201X 10 After expanding the squared brackets, we get If φkl has been mapped correctly in Equation 47, then 0 ≤ 2 2 2 φkl ≤ π. In consequence, (0− π ) ≤ φ′ ≤ (π− π ), and thus Pkl =Pk cos(φk) + 2PkPl cos(φl) cos(φ 2 kl 2 k)+ will be mapped on the interval [0, π2 ] by the arccos function. P 2 l cos(φ )2l + P 2 k sin(φk)2+ (37) In all remaining cases, φkl has been mapped wrong and the 2P P sin(φ ) sin(φ ) + P 2 k l l k l sin(φl) 2. sign of the result of Equation 47 needs to be reversed. This leaves us with a simp{le rule to determine the sign of φkl from By setting φ′ : sin(φ 2 l) = 1− cos(φ 2 kl l) (38) |φ | if |φ′kl kl| ∈ [0, π2 ] and applying the trigonometric identities φst = (50) −|φkl| else. 1 1 cos(φl) cos(φk) = cos(φl − φk) + cos(φl + φk) (39) 2 2 ACKNOWLEDGMENT and This research was funded by the EU Horizon 2020 1 1 PHRESCO Grant (Grant No. 688579) and the BELSPO IAP sin(φl) sin(φk) = cos(φl − φk)− cos(φl + φk), (40) 2 2 P7-35 program Photonics@be. we obtain P 2 = P 2 REFERENCES kl k + P 2 l + 2PkPl cos(φl − φk). (41) [1] W. Haensch, E. Nowak, R. Dennard, P. Solomon, A. Bryant, O. Doku- As expected, we can see in the last equation that the output of maci, A. Kumar, X. Wang, J. Johnson, and M. Fischetti, “Silicon cmos the system only depends on the difference of the input signal devices beyond scaling,” IBM Journal of Research and Development, phases, but not the absolute phases themselves. Thus if we vol. 50, no. 4/5, p. 339, 2006. [2] W. Maas, T. Natschlaeger, and H. Markram, “Real-time computing consider without stable states: a new framework for neural computation based |P + P exp(j(φ − φ ))|2 (42) on perturbations,” Neural Computation, vol. 14, no. 11, pp. 2531–2560, k l l k 2002. we get [3] H. Jaeger and H. Haas, “Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication,” Science, vol. P 2 + 2P P cos(φ − φ ) + P 2 cos(φ − φ )2+ 304, no. 5667, 2004. k k l l k l l k (43) [4] D. Verstraeten, B. Schrauwen, and D. S. M. D’Haene, “An experimental P 2 sin(φl − φ 2 k) unification of reservoir computing methods,” Neural Networks, vol. 20, l no. 3, pp. 391–403, 2007. which, after substitution with Equation 38 again yields [5] J. Bürger and C. Teuscher, “Variation-tolerant computing with mem- ristive reservoirs,” in Nanoscale Architectures (NANOARCH), 2013 P 2 + P 2 + 2P P cos(φ − φ ) = P 2 k l k l l k kl (44) IEEE/ACM International Symposium on, 2013. [6] E. Demis, R. Aguilera, H. Sillin, K. Scharnhorst, E. Sandouk, M. Aono, and thus A. Stieg, and J. Gimzewski, “Atomic switch networks-nanoarchitectonic design of a complex system for natural computing,” Nanotechnology, | vol. 26, no. 20, p. 204003, 2015. Pk exp(jφk) + Pl exp(jφl)|2 = |Pk + Pl exp(j(φ 2 l − φk))| . [7] M. Hermans, M. Soriano, J. Dambre, P. Bienstman, and I. Fischer, (45) “Photonic delay systems as machine learning implementations.” Journal Therefore we can just set the phase φ as a reference point of Machine Learning Research, vol. 16, pp. 2081–2097, 2015. k for our system and restructure Equation 41 such that [8] G. V. der Sande, D. Brunner, and M. Soriano, “Advances in photonic reservoir computing,” Nanophotonics, vol. 6, no. 3, pp. 561–576, 2017. P 2 − (P 2 + P 2) [9] L. Larger, M. Soriano, D. Brunner, L. Appeltant, J. Gutiérrez, L. Pes- cos(φ − φ ) = cos(φ − φ kl k l quera, C. Mirasso, and I. Fischer, “Photonic information processing be- k l k l) = . (46) 2PkPl yond turing: an optoelectronic implementation of reservoir computing,” Optics express, vol. 20, no. 3, pp. 3241–3249, 2012. Note that, since φkl = φl − φk and φlk = φk − φl mirror [10] A. Smerieri, F. Duport, Y. Paquot, M. Haelterman, B. Schrauwen, and each other along the real axis, their cosines are identical. S. Massar, “Towards fully analog hardware reservoir computing for Consequently, when applying an inverse cosine operation to speech recognition,” in AIP Conference Proceedings, 2012. [11] M. Fiers, T. V. Vaerenbergh, F. Wyffels, D. Verstraeten, B. Schrauwen, Equation 46 we can not determine the sign of φkl but its J. Dambre, and P. Bienstman, “Nanophotonic reservoir computing with magnitude via ( ) photonic crystal cavities to generate periodic patterns,” IEEE transac- tions on neural networks and learning systems, vol. 25, no. 2, pp. 344– | − | P 2 − (P 2 + P 2) 355, 2014. φk φl = |φkl| = arccos kl k l . (47) 2P P [12] K. Vandoorne, J. Dambre, D. Verstraeten, B. Schrauwen, and P. Bi- k l enstman, “Parallel reservoir computing using optical amplifiers,” IEEE We refer to Equation 47 as the phase estimation equation. transactions on neural networks, vol. 22, no. 9, pp. 1469–1481, 2011. [13] J. Qin, Q. Zhao, H. Yin, Y. Jin, and C. Liu, “Numerical simulation To find the actual sign of φkl, additional information needs and experiment on optical packet header recognition utilizing reservoir to be found. Assume we know not only P , but also P ′kl kl computing based on optoelectronic feedback,” IEEE Photonics Journal, defined as vol. 9, no. 1, pp. 1–11, 2017. [14] K. Vandoorne, P. Mechet, T. V. Vaerenbergh, M. Fiers, G. Morthier, P ′ π =|P exp(jφ ) exp(j ) + P exp(jφ )|2 D. Verstraeten, B. Schrauwen, J. Dambre, and P. Bienstman, “Exper- kl k l l l 2 (48) imental demonstration of reservoir computing on a silicon photonics =|P exp(jφ′ ) + P exp(jφ )|2. chip,” Nature Communications, vol. 5, no. 3541, 2014. k k l l [15] A. Katumba, P. Bienstman, and J. Dambre, “Photonic reservoir com- Then, puting approaches to nanoscale computation,” in Proceedings of the ′ π Second Annual International Conference on Nanoscale Computing and φkl = φl − φk − . (49) Communication, 2015. 2 TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. X, NO. X, X 201X 11 [16] A. Katumba, M. Freiberger, P. Bienstman, and J. Dambre, “A multiple- Peter Bienstman was born in Ghent, Belgium, in input strategy to efficient integrated photonic reservoir computing,” 1974. He received a degree in electrical engineering Cognitive Computation, vol. pp1-8, 2017. from Ghent University, Belgium, in 1997 and a [17] S. Abel, T. Stferle, C. Marchiori, C. Rossel, M. Rossell, R. Erni, Ph.D. from the same university in 2001, at the D. Caimi, M. Sousa, A. Chelnokov, B. Offrein, and J. Fompeyrine, Department of Information Technology (INTEC), “A strong electro-optically active lead-free ferroelectric integrated on where he is currently a full professor. His research silicon,” Nature Communications, vol. 4, p. 1671, 2013. interests include several applications of nanopho- [18] C. Rı́os, M. Stegmaier, P. Hosseini, D. Wang, T. Scherer, C. Wright, tonics (biosensors, photonic information processing, H. Bhaskaran, and W. Pernice, “Integrated all-photonic non-volatile ...) as well as nanophotonics modelling. He has multi-level memory,” Nature Photonics, vol. 9, no. 11, pp. 725–732, published over 110 papers and holds several patents. 2015. He has been awarded a ERC starting grant for [19] B. V. Bilzen, P. Homm, L. Dillemans, C. Su, M. Menghini, M. Sousa, the Naresco-project: Novel paradigms for massively parallel nanophotonic C. Marchiori, L. Zhang, J. Seo, and J. Locquet, “Production of vo 2 information processing. thin films through post-deposition annealing of v 2 o 3 and vo x films,” Thin Solid Films, vol. 591, pp. 143–148, 2015. [20] A. Hoerl and R. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970. [21] M. Fiers, T. V. Vaerenbergh, K. Caluwaerts, D. V. Ginste, B. Schrauwen, J. Dambre, and P. Bienstman, “Time-domain and frequency-domain modeling of nonlinear optical components at the circuit-level using a node-based approach,” J. Opt. Soc. Am. B, vol. 29, no. 5, pp. 896–900, May 2012. [Online]. Available: http://josab.osa.org/abstract.cfm?URI=josab-29-5-896 [22] M. Jeruchim, “Techniques for estimating the bit error rate in the simulation of digital communication systems,” IEEE Journal on Selected Areas in Communications, vol. 2, no. 1, pp. 153–170, 1984. [23] N. Hansen and A. Ostermeier, “Completely derandomized self- adaptation in evolution strategies,” Evolutionary computation, vol. 9, no. 2, pp. 159–195, 2001. [24] N. Hansen, “The cma evolution strategy: A tutorial,” arXiv preprint arXiv:1604.00772, 2016. Matthias Freiberger was born in Graz, Austria, in Joni Dambre was born in Ghent, Belgium, in 1973. 1983. He received a M.Sc. degree in information She received the M.Sc. degree in electronics engi- and computer engineering from Graz University of neering and the Ph.D. degree in computer science Technology, Graz, Austria, in 2016. He is currently engineering from Ghent University, Ghent, in 1996 pursuing a Ph.D degree at the UGent-imec IDLab, and 2003, respectively. She is currently a professor Department of Electronics and Information Systems with the UGent-imec IDLab, and the Department at Ghent University. His current research focuses on of Electronics and Information Systems at Ghent training algorithms for neuromorphic devices. His University. She performs research on machine learn- research interests include scaling up neuromorphic ing and neural networks, with applications in sensor systems, deep learning, and recurrent neural net- processing and robotics, as well as innovative un- works on chip. conventional and neuromorphic computing hardware that exploit the computational power of physical dynamical systems. Andrew Katumba was born in Masaka, Uganda in 1985. He received an M.Sc. degree in Optics and Photonics from Karlsrule Institute of Technology, Germany in 2013. He is pursuing a Ph.D. degree in Photonics Engineering at the Photonics Research Group, Gent University-imec, Belgium. His current research focuses on photonic neuromorphic archi- tectures for high speed optical telecommunications systems. He is a student member of IEEE Photonics Society and the International Society for Optics and Photonics.