FPGA implementation of a MIMO DFE in 40 Gb/s DQPSK optical links

EMERETLIS, Andreas, KELEFOURAS, Vasileios <http://orcid.org/0000-0001-9591-913X>, THEODORIDIS, George, NANOU, Maki, GEORGOULAKIS, Kwnstantinos and GLENTIS, Othon

Available from Sheffield Hallam University Research Archive (SHURA) at:
http://shura.shu.ac.uk/18372/

This document is the author deposited version. You are advised to consult the publisher's version if you wish to cite from it.

Copyright and re-use policy

See http://shura.shu.ac.uk/information.html
FPGA IMPLEMENTATION OF A MIMO DFE IN 40 GB/S DQPSK OPTICAL LINKS

A.Emeretlis, V.Kelefouras, G.Theodoridis

University of Patras
Dept. of Electrical and Computer Eng.
Patras, Greece

M.Nanou, C.Politi, K.Georgoulakis, G.Glentis

University of Peloponnese
Dept. of Informatics and Telecom.
Tripoli, Greece

ABSTRACT

In this paper, an FPGA implementation of a Multi Input Multi Output (MIMO) Decision Feedback Equalizer (DFE) is proposed for the electronic compensation of impairments in 40 Gb/s Intensity Modulated Direct Detection (IM/DD) optical communication links employing NRZ DQPSK signaling. The proposed equalizer electronically compensates the residual Chromatic Dispersion (CD) along installed, optically compensated optical paths. The required processing rate is achieved by applying intensive pipelining and parallelism to the original architecture of the equalizer. At the given processing rate, an 8-input 2-output DFE with three-tap feedforward filtering and two-tap backward filtering is implemented on a single, cutting-edge Xilinx FPGA device.

Index Terms— DQPSK Optical Transmission, DFE equalization, FPGA implementation

1. INTRODUCTION

High capacity optical transmission links suffer from linear impairments like Chromatic Dispersion (CD). Advances in digital signal processing techniques assist the development of high capacity optical transmission systems where all, or part, of the accumulated dispersion is compensated electronically [1, 2]. IM/DD systems have been extensively used for commercial optical communication links. Although there is a recent trend towards coherent optical systems, which enable even higher transmission rates, Direct Detection (DD) still offers low cost and simple-to-design receiver modules.

CD manifests itself as Intersymbol Interference (ISI) in optical communications. Optical dispersion compensation modules usually comprise Dispersion Compensating Fiber (DCF) parts placed along the optical path. Those bulky modules compensate the CD accumulated along the preceding part of the optical path in a rigid way: the compensation is static with respect to the amount of CD it mitigates, which implies that the CD is known in advance. Electronic techniques can be placed at either end of the path. In particular, equalizers are utilized to compensate all of the CD or any residual CD. To this end, the optical signals must be converted into electrical ones. In DD receivers the conversion is performed by a photo-detector, which acts as a square law device. As a result, linear effects of the optical channel have a non-linear impact. Equalization can be performed by the maximum likelihood sequence estimation (MLSE) method [5], or by the use of standard DFE or Volterra DFE counterparts [4, 6, 7], noting, however, that the use of Volterra equalizers becomes cumbersome when the ISI extends over more than a few symbols.

In this paper, we investigate the performance of the MIMO DFE [9, 10] in the context of DD optical communication links, where Non-Return to Zero Differential Quadrature Phase Shift Keying (NRZ-DQPSK) is deployed [1, 3]. NRZ-DQPSK transmission is chosen because its symbol rate is half the 40 Gb/s bit rate and because it exhibits upgrade compatibility with existing transmission systems. Moreover, efficient pipelined and parallel architectures that meet the throughput requirement of the optical communication link are presented. At the given processing rate, an 8-input 2-output NRZ-DQPSK DFE with three-tap feedforward filtering and two-tap backward filtering fits in a single, cutting-edge Xilinx FPGA device. The designs efficiently exploit the specific FPGA features and achieve the throughput of 40 Gb/s despite the high computational complexity caused by the advanced modulation format. To the best of our knowledge, it is the first time that DFEs for NRZ-DQPSK optical systems are implemented on FPGAs, since previous approaches have been introduced for Non-Return to Zero On/Off Keying (NRZ-OOK) systems [8].

2. DQPSK EQUALIZATION

Optical transport networks may comprise a number of 40 Gb/s DQPSK channels transmitted simultaneously over an installed fibre link like the one shown in Fig. 1(a) [1]. The optical fibre links considered in this paper consist of numerous identical fibre spans, where part of the accumulated dispersion in the single mode fibre (SMF) is compensated by means of DCF and the losses are compensated by optical amplifiers. Any transmitter (Tx)/receiver (Rx) configuration can be employed in such a transmission system; here 40 Gb/s DQPSK is deployed. DQPSK is a four level phase modulation format, signaling at 20 Gsymbol/s, that overcomes the spectral efficiency limitation imposed by the binary modulation formats and relaxes the need for high frequency DAC converters. Consequently, the tolerance to impairments like CD increases [3]. The DQPSK Tx unit is most commonly implemented by using a Super Mach Zehnder (MZ) structure, which comprises two parallel MZ modulators operated as phase modulators (Fig. 1(b)). Each of them modulates half of the optical signal, and a phase shift of $\pi/2$ is applied in one of the branches of this superstructure. The most common DQPSK receiver uses two separate Mach Zehnder Delay Interferometers (MZDI) (Fig. 1(b)), whose arms have a phase shift of $\pi/4$ (in-phase (I) component) and $-\pi/4$ (quadrature (Q) component). Each of the MZDI outputs is detected by a photodiode. After electrical low pass filtering, the analog waveforms are fed to the equalizer described below.

Let $I_1(n) \in \{0, 1\}$ and $I_2(n) \in \{0, 1\}$ represent the binary input sequences of the DQPSK encoded transmitter, and let $T_s$ be the symbol period. Let $y_c^I(t)$, $y_d^I(t)$, $y_c^Q(t)$ and $y_d^Q(t)$ be the electrical signals that are produced at the constructive (lower branch) and at the destructive (upper branch) port of the I and the Q part of the receiver. The analog electrical waveforms are sampled with a fractional spacing of $T_s/2$, as in this case the performance of the equalizers becomes less sensitive to the sampling phase of the receiver. A $T_s/2$-spaced, $8 \times 2$ MIMO DFE aiming to recover the transmitted information is employed [9, 10]. It is described by

$$u_\ell(n) = \sum_{i=1}^{2} \sum_{\kappa=1}^{4} \sum_{m=1}^{M_f} f^{\kappa,\ell}_{i,m}\, y_{i,\kappa}(n - m + 1) + \sum_{i=1}^{2} \sum_{m=1}^{M_b} b^{\ell}_{i,m}\, \hat{I}_i(n - m), \quad \ell = 1, 2 \quad (1)$$

where $u_1(n)$ and $u_2(n)$ denote the output signals at the I and the Q part of the DFE, $y_{1,1}(n) = y_c^I(n)$, $y_{1,2}(n) = y_d^I(n)$, $y_{1,3}(n) = y_c^Q(n)$, $y_{1,4}(n) = y_d^Q(n)$, and $y_{2,1}(n) = y_c^I(n + T_s/2)$, $y_{2,2}(n) = y_d^I(n + T_s/2)$, $y_{2,3}(n) = y_c^Q(n + T_s/2)$, $y_{2,4}(n) = y_d^Q(n + T_s/2)$, whereas $\hat{I}_1(n) = Q[u_1(n)]$ and $\hat{I}_2(n) = Q[u_2(n)]$, with $Q[.]$ denoting the detection device. $f^{\kappa,\ell}_{i,m}$ and $b^{\ell}_{i,m}$ are the coefficients of the feedforward and the feedback part of the DFE, and $M_f$ and $M_b$ are the associated memory lengths. (1) is compactly written as $u_\ell(n) = \mathbf{c}_\ell^{T} \mathbf{x}(n)$, $\ell = 1, 2$, where $\mathbf{c}_1$ and $\mathbf{c}_2$ are the $M \times 1$ vectors that carry the coefficients of the equalizer, $\mathbf{x}(n)$ is the corresponding data vector, $T$ denotes the vector transpose, and $M = 8M_f + 2M_b$. The designated DFE is hereafter referred to as the DFE[$M_f, M_b$] DQPSK equalizer. When the statistics of the input and output signals are not known in advance, the estimation of the DFE coefficients is carried out using a set of training data, employing standard batch or adaptive learning algorithms [11].
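As an illustration only, the following Python sketch evaluates one reading of (1) symbol by symbol. It assumes that the feedback part sums over the two decision streams and over $M_b$ past symbols, which is the reading consistent with $M = 8M_f + 2M_b$. The array layout, the function name `mimo_dfe` and the decision threshold of 0.5 are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def mimo_dfe(y, f, b, threshold=0.5):
    """Direct-form 8x2 MIMO DFE, one symbol at a time.

    y: (N, 8) received samples; the column order is an assumption
       (2 sampling phases x 4 photodetected signals).
    f: (2, 8, Mf) feedforward taps, f[l, k, m] weights y[n - m, k].
    b: (2, 2, Mb) feedback taps, b[l, i, m] weights decision i at n - m - 1.
    Returns soft outputs u (N, 2) and hard decisions d (N, 2) in {0, 1}.
    """
    n_sym = y.shape[0]
    mf, mb = f.shape[2], b.shape[2]
    u = np.zeros((n_sym, 2))
    d = np.zeros((n_sym, 2), dtype=int)
    for n in range(n_sym):
        for l in range(2):
            acc = 0.0
            for m in range(mf):              # feedforward part of (1)
                if n - m >= 0:
                    acc += f[l, :, m] @ y[n - m]
            for m in range(mb):              # feedback part of (1)
                if n - m - 1 >= 0:
                    acc += b[l, :, m] @ d[n - m - 1]
            u[n, l] = acc
        # Both outputs are sliced only after both sums are complete.
        d[n] = (u[n] > threshold).astype(int)
    return u, d
```

For DFE[3,2] the arrays `f[l]` and `b[l]` hold $8 \cdot 3 + 2 \cdot 2 = 28$ coefficients per output, matching $M = 8M_f + 2M_b$.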

The efficiency of the suggested electronic equalization solution is evaluated by means of a metric referred to as the required Optical Signal to Noise Ratio (OSNR) to reach a specific bit error rate (BER) value [1]. The system is modeled in the VPI TransmissionMaker simulator, which co-simulates equalizers implemented in Matlab. The transmitter operates at 193.1 THz with 0 dBm output power. On the receiver side, an optical filter of 40 GHz bandwidth is utilized. Fig. 2(a) illustrates the required OSNR for different sets of $[M_f, M_b]$ with respect to the residual chromatic dispersion. DFE[5,3] enhances the CD tolerance to 1100 ps/nm for $BER = 10^{-3}$, which is indicated as the FEC limit, i.e., the maximum pre-FEC BER for which forward error correction (FEC) still achieves $BER < 10^{-9}$. The performance of the electronic equalization in a typical optical link comprising 8 identical spans, each of length $L = 100$ km, is illustrated in Fig. 2(b). Here, the CD is primarily compensated by means of a DCF, and two separate amplifiers with a 5 dB noise figure are utilized to compensate the SMF and DCF losses (Fig. 1(a)). Equalizers are used to compensate the residual dispersion. Evidently, the use of electronic equalization enhances the CD tolerance of the optical link, i.e., despite the residual CD resulting from the incomplete optical dispersion compensation, the link can achieve $BER < 10^{-3}$.

3. HARDWARE DESIGN

The block diagram of the DFE equalizer in the case of the NRZ-DQPSK transmission is depicted in Fig. 3 [9, 10]. It has a typical butterfly structure at the forward as well as at the backward part. The received in-phase and quadrature signals are jointly processed to produce estimates of the transmitted in-phase and quadrature data. The Forward Filtering (FF) part comprises four $4 \times 1$ Multi Input Single Output (MISO) linear Feed-Forward Filters (FFF), while the Backward Filtering (BF) part consists of four SISO Feed-Back Filters (FBF). The input signals are compactly represented by $4 \times 1$ vectors, each of which consists of two $T_s/2$ fractionally spaced samples of the constructive and the destructive signals. Hence, the FF part can be implemented using 16 individual FIR filters, each one of order equal to $M_f$, involving feed-forward interconnections only. The FF part requires $16M_f$ processing units.

Given the restrictions on the maximum processing speed of the current FPGA technology, the required target processing speed of 40 Gb/s, or equivalently 20 Gsymbols/s, can be reached by means of extensive pipelining and parallelism. The chosen parallelism factor $P$, i.e., the number of symbols processed in parallel, depends on the frequency $F$ of the implemented circuitry: for a given throughput requirement $T$, the two are related by $T = F \times P$.
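The relation $T = F \times P$ can be checked with a few lines of arithmetic. The sketch below computes the smallest integer $P$ that reaches 20 Gsymbols/s for the two clock frequencies reported in Section 5. The helper name `parallelism_factor` is hypothetical, and the value obtained for 445 MHz is derived here from the formula rather than quoted from the paper.

```python
def parallelism_factor(symbol_rate_hz: int, clock_hz: int) -> int:
    """Smallest integer P such that clock_hz * P >= symbol_rate_hz (T = F x P)."""
    return (symbol_rate_hz + clock_hz - 1) // clock_hz

SYMBOL_RATE = 20_000_000_000      # 20 Gsymbol/s, i.e. 40 Gb/s at 2 bits per symbol

for clock_mhz in (400, 445):      # clock frequencies reported in Section 5
    p = parallelism_factor(SYMBOL_RATE, clock_mhz * 1_000_000)
    print(f"F = {clock_mhz} MHz -> P = {p}")
# F = 400 MHz -> P = 50
# F = 445 MHz -> P = 45
```

With $P = 50$ at 445 MHz the same formula gives 22.25 Gsymbols/s, i.e. 44.5 Gb/s, which agrees with the figure quoted in Section 5.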

Pipelining and parallelism are straightforward in the FF section, as it consists of FIR filters interconnected in a feed forward way. Thus, a parallel architecture can easily be produced by unrolling the FF part of (1), while extensive pipelining can be achieved by inserting appropriate registers in the data path. However, the feedback loop and the presence of the quantizers (decision devices) within it prevent the direct application of the above techniques to the FB part of the equalizer. Based upon and extending the results of [12, 13] for the case of the SISO DFE, a pipelined and parallel architecture for the BF of (1) is proposed. The multiplexer-based approach [12] exploits the fact that, due to the very nature of the feedback signals ($\hat{I}_1(n)$, $\hat{I}_2(n) \in \{0, 1\}$ in our case), the possible outputs of the FBF can easily be pre-computed and the proper one can be selected using a multiplexer.

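Let $u_\ell(n) = y_{F\ell}(n) + y_{B\ell}(n)$, $\ell = 1, 2$, where $y_{F\ell}(n)$ and $y_{B\ell}(n)$ denote the contributions of the FF and FB parts, respectively, to the output of the DFE, and consider as an example the case when $M_b = 1$. In this particular case $y_{B\ell}(n) = b^{\ell}_{1,1}\hat{I}_1(n-1) + b^{\ell}_{2,1}\hat{I}_2(n-1)$, where $b^{\ell}_{i,1}$ is the feedback coefficient that weights the decision $\hat{I}_i(n-1)$ in output $\ell$, and we get

\[
\begin{array}{cccc}
\hat{I}_1(n-1) & \hat{I}_2(n-1) & y_{B1}(n) & y_{B2}(n) \\
0 & 0 & 0 & 0 \\
0 & 1 & b^{1}_{2,1} & b^{2}_{2,1} \\
1 & 0 & b^{1}_{1,1} & b^{2}_{1,1} \\
1 & 1 & b^{1}_{1,1} + b^{1}_{2,1} & b^{2}_{1,1} + b^{2}_{2,1} \\
\end{array}
\]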

These values are added to the corresponding outputs $y_{F1}(n)$ 
and $y_{F2}(n)$ of the FF part, and the proper output is selected 
by means of two 4-to-1 multiplexers driven by $\hat{I}_1(n-1)$ and 
$\hat{I}_2(n-1)$.
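The equivalence between the ordinary feedback loop and the multiplexer form for $M_b = 1$ can be illustrated in software. The sketch below is a behavioural model under stated assumptions, not the hardware design: `b[l, i]` is a hypothetical name for the coefficient that weights the decision of stream $i$ in output $\ell$, the slicer threshold of 0.5 is assumed, and both decision streams start at zero.

```python
import numpy as np

def dfe_direct(y_f, b, threshold=0.5):
    """Reference loop: multiply-add and slicer sit inside the feedback loop.
    y_f: (N, 2) feedforward outputs; b[l, i] weights decision i in output l."""
    d = np.zeros((len(y_f), 2), dtype=int)
    prev = np.zeros(2, dtype=int)
    for n in range(len(y_f)):
        prev = ((y_f[n] + b @ prev) > threshold).astype(int)
        d[n] = prev
    return d

def dfe_mux(y_f, b, threshold=0.5):
    """Multiplexer form: four sliced candidates per output, chosen by past decisions."""
    # Row j holds (y_B1, y_B2) for (I1, I2) = (j >> 1, j & 1), as in the table above.
    patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    table = patterns @ b.T                       # shape (4, 2), computed once
    d = np.zeros((len(y_f), 2), dtype=int)
    sel = 0                                      # initial decisions (0, 0)
    for n in range(len(y_f)):
        # Feedforward side: add and slice all four candidates for both outputs.
        candidates = ((y_f[n] + table) > threshold).astype(int)
        d[n] = candidates[sel]                   # the two 4-to-1 multiplexers
        sel = 2 * d[n, 0] + d[n, 1]              # select signal for the next symbol
    return d
```

In `dfe_mux` the only operation that depends on the previous decisions is the row selection, which is the software counterpart of keeping just the multiplexers inside the loop.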

In the general case, when the BF part is of order $M_b$ there 
are $2^{2M_b}$ possible values for each of the signals $y_{B1}(n)$ and 
$y_{B2}(n)$. Thus a $2^{2M_b}$-to-1 multiplexer is required. Based on 
the multiplexer approach, the structure for the computation of 
the output $\hat{I}_1(n)$ is depicted in Fig. 4, while a similar one is 
used for the computation of $\hat{I}_2(n)$. The values $B_j^i$ correspond 
to pre-computed outputs of the FBFs, where $i = 1, 2$ is the index of the output, $j = 1, 2, \ldots, N$ and $N = 2^{2M_b}$. This 
structure has the advantage that the feedback loop contains 
only a 1-bit $2^{2M_b}$-to-1 multiplexer, while the units performing 
arithmetic operations are located in the FF part. Hence, they 
can be easily pipelined.
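To make the growth of the pre-computed set concrete, the following sketch enumerates all $2^{2M_b}$ decision histories for one output and stores the corresponding feedback contribution. The tap values and the function name `feedback_table` are illustrative assumptions and do not come from the paper.

```python
from itertools import product

import numpy as np

def feedback_table(b):
    """All possible feedback contributions for one output.

    b: (2, Mb) taps; b[i, m] weights decision stream i delayed by m + 1 symbols.
    Returns a dict that maps each history, a tuple of 2*Mb bits ordered as
    (I1(n-1), ..., I1(n-Mb), I2(n-1), ..., I2(n-Mb)), to the value y_B.
    """
    mb = b.shape[1]
    table = {}
    for bits in product((0, 1), repeat=2 * mb):
        history = np.array(bits).reshape(2, mb)
        table[bits] = float(np.sum(b * history))
    return table

b = np.array([[0.20, -0.05],
              [0.10, 0.03]])      # illustrative taps for Mb = 2
table = feedback_table(b)
print(len(table))                 # 2**(2*Mb) = 16 candidates per output
```

The table size grows exponentially with $M_b$, so the multiplexer approach is practical mainly for short feedback filters such as the two-tap case considered here.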

The parallel architecture is derived by applying the lookahead transformation [12]. This technique pipelines the multiplexer loop by adding extra registers inside it at the expense of lookahead stages, Fig. 5(a). In the case of DQPSK, each lookahead stage consists of $2^{2M_b}$ 4-to-1 multiplexers, because the output of each FB section depends on the output of two feedback filters. Specifically, each lookahead block (LB), the first of which is shown in Fig. 5(b), contains two 4-to-1 multiplexers. The purpose of the lookahead stages is to compensate for the extra delays in the loop. This is achieved by using the values of the possible outputs at previous time instances as select signals in the multiplexers of the lookahead blocks.

In order to produce a fully pipelined parallel loop, the pipelining of the multiplexer loop is necessary before the application of unfolding. In detail, if the number of delays of the initial multiplexer loop equals the applied parallelism, then these delays are equally distributed between the parallel loops. This results in a structure having one delay between the output of a multiplexer and the input of another one. In our case, the number of lookahead stages is chosen to be $P-1$, where $P$ is the parallelism factor.

**Fig. 2.** a) Required OSNR for DFE with various $[M_f, M_b]$. b) BER versus residual CD

**Fig. 3.** DFE structure in NRZ-DQPSK transmission

When the level of the applied parallelism is greater than the order of the feedback filter, a low complexity architecture based on incremental block processing can be employed [13]. In detail, for an $L$-parallel architecture of an $M_b$-th order filter, the first $M_b$ outputs are produced in parallel, while the remaining $L-M_b$ outputs are calculated directly from the previous ones, eliminating the lookahead stages and resulting in area reduction. This approach was followed in this work.
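The following sketch is a simplified software view of block processing for $M_b = 1$. It is not the lookahead or incremental architecture of [12, 13]; it only shows that, for a block of symbols, all additions and slicing operations can be carried out in parallel, and that only a chain of selections depends on earlier decisions. The names `dfe_block` and `b[l, i]`, the block size and the threshold of 0.5 are assumptions made for the example.

```python
import numpy as np

PATTERNS = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

def dfe_block(y_f, b, block=4, threshold=0.5):
    """Block-wise multiplexer DFE for Mb = 1.

    y_f: (N, 2) feedforward outputs; b[l, i] weights decision i in output l.
    """
    table = PATTERNS @ b.T                     # (4, 2) pre-computed feedback values
    d = np.zeros((len(y_f), 2), dtype=int)
    sel = 0                                    # decisions before the first block: (0, 0)
    for start in range(0, len(y_f), block):
        blk = y_f[start:start + block]         # up to `block` symbols
        # Parallel part: sliced candidates for every symbol and every history.
        cand = ((blk[:, None, :] + table[None, :, :]) > threshold).astype(int)
        # Sequential part: one selection per symbol, no arithmetic in the chain.
        for k in range(len(blk)):
            d[start + k] = cand[k, sel]
            sel = 2 * d[start + k, 0] + d[start + k, 1]
    return d
```

The decisions do not depend on the chosen block size, since the block only changes how much work is done at once, not the recursion being evaluated.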

4. IMPLEMENTATION DETAILS

For the implementation of the proposed architecture, two state-of-the-art Xilinx FPGA devices were selected from two different platforms. The first device (XC7VX690T-2) belongs to the Virtex-7 family, built on a 28 nm process, while the second one (XCKU100-FLVF1924-2-i) belongs to the Kintex UltraScale (US) family, which is built on a high-performance 20 nm 3D-IC process.

These platforms are rich in dedicated components suitable for DSP applications, namely the DSP48E1 and DSP48E2 slices in the Virtex-7 and Kintex US families, respectively. They are embedded full-custom units that offer high performance and flexibility, and their use is strongly recommended when high-speed designs are needed. Hence, the computational units (multipliers, adders) of the architecture were mapped onto them.

The DSP slices mainly consist of a 2's complement multiplier followed by a 48-bit accumulator, which can be used separately, along with internal pipeline registers. To operate at full speed, two internal pipeline registers were used for the addition and three for the multiply-add operation. Also, the DSP slices that were used to implement the parallel FIR filters were interconnected with each other through their cascading I/O ports. In this way, they are mapped next to each other and use the high-speed dedicated DSP routing resources. Finally, for the implementation of the additions of Fig. 4, the 48-bit adder of the DSP slice was used in SIMD mode to perform 5 additions, resulting in an important reduction of the utilized DSP slices. This is possible because the word-length of the data-path of the implemented DFE is 8 bits, as explained below.

5. EXPERIMENTAL RESULTS AND DISCUSSION

Based on the computational complexity and the available DSP resources of the target FPGA devices, the equalizer that was implemented and achieved the target throughput of 40 Gb/s was the DFE[3,2]. A MATLAB model in fixed-point arithmetic was developed, and an extensive study, analysis, and comparison with the reference floating-point model was performed to determine the word-length of the architecture without affecting its quality in terms of BER. The outcome was that 6, 8, and 8 bits are required for the input signals, coefficients, and data-path operations, respectively. Then, based on the general architecture (Section 3), the specific design of the DFE[3,2] was described through a parametric VHDL model, so that different parallel and pipelined instances can easily be derived. The correct functionality was verified through extensive Post Place & Route simulations and comparisons with the reference MATLAB model. For the implementation of the design, the Vivado Design Suite (v2014.4) was used. As it is the first time, to the best of our knowledge, that such systems are implemented on an FPGA, no comparison with other similar designs can be performed. The implementation results are shown in Table 1, in which the second column denotes the applied parallelism by which the throughput of 40 Gb/s was met.

Based on the above results, the target throughput of 40 Gb/s is achieved by applying high parallelism. This is caused by the limited operating frequency, which is degraded by the routing delay in large and complex FPGA designs such as ours. However, the Kintex US technology deals better with this restriction, and the corresponding implementation requires smaller parallelism than the Virtex-7 implementation.

Studying the area of the designs, it is evident that there is high utilization of the dedicated DSP blocks, whereas the utilization of the other resources is very low. The large amount of logic due to the parallelism is efficiently handled by the Vivado Suite, which is highly compatible with the employed FPGA devices. As a result, the designs occupy only a small amount of the fabric resources.

Even though the utilization of the DSP slices is high, these resources are efficiently exploited by the proposed architecture, achieving high-performance implementations. In the case of Virtex-7, the achievable frequency is 400 MHz in spite of the utilization of 78% of the available DSP resources. This indicates that exploiting the dedicated DSP routing resources, which are discussed in Section 4, facilitates the interconnection of the whole design. On the other hand, the operating frequency of the Kintex US implementation is 445 MHz, which allows smaller parallelism and therefore results in lower total area. This fact demonstrates the advantage of next-generation FPGA devices in handling computationally intensive DSP applications.

As a matter of fact, when the parallelism of the Kintex US implementation was set to 50 (equal to the parallelism of the Virtex-7 implementation), the achieved frequency was also 445 MHz. Hence, when the same parallelism is applied in both designs, the Kintex US implementation achieves a higher throughput, equal to 44.5 Gb/s.

6. CONCLUSION

In this paper, the performance of DFEs in the context of NRZ-DQPSK was studied and suitable architectures for FPGA implementations were proposed. Based on the experimental results and the available hardware resources, it is shown that modern FPGA devices are a suitable technology for electronic equalization methods in optical communication systems, which are computationally intensive and demand extremely high data processing rates.

REFERENCES