A Fast CRC Implementation on FPGA Using a Pipelined Architecture for the Polynomial Division Fabrice MONTEIRO, Abbas DANDACHE, Amine M’SIR,Bernard LEPLEY LICM, University of Metz, SUPELEC, Rue Edouard Belin, 57078 Metz Cedex phone: +33(0)3875473 11, fax: +33(0)387547301, email: fabrice. [email protected] org ABSTRACT The CRC error detection is a very common function on telecommunication applications. The evolution towards increasing data rates requires more and more sofisticated implementations. In this paper, we present a method to implement the CRC function based on a pipeline structure for the polynomial division.It improves very effectively the speed performance, allowing data rates from 1 Gbits/s to 4 Gbits/s on FPGA implementions, according to the parallelisation level (8 to 32 bits).

1 INTRODUCTION The CRC (Cyclic Redundancy Checking) codes are used in a lot of telecommunication applications. They are used in the internal layers of protocols such as Ethernet, X25, FDDI and ATM (AAL5). However, on modem networks, the need for increasing data rates (over 1 Gbit/s) is setting the constraints on performance very high. Indeed, the speed improvement (higher clock rates) due to the technological evolution is unable to fit the demand.Consequently, new architectures must be devised. Targetting the applications to an FPGA device is an issue for this paper, as it allows low-cost designs. The simple and evident serial implementation is a classical hardware implementation of the CRC algorithm.

Unfortunatly, on an FPGA implementation with maximal clock frequency of 250 MHz, maximal data rate is limited to 250 Mbits/s is the best case. Higher data rates can only be obtained through parallelisation. Some parallel architectures have been proposed in the past to address the need for high data throughput [ 1][2].The main problem is usually to limit the rapidly increasing area overhead while improving the speed performance. In this paper, we present a parallel approach for the polynomial division based on a pipeline structure. The parallelisation can be led to any level and is only lim- ited by the area constraint set on the design. The data throughput is almost directly linked to the parallelisation level, as the maximal clock rate is not very sensitive to it.

2 PRINCIPLE The polynomial division is the fundamental operation of the CRC applications.The serial implementation of the division is shown in figure 1 for the case where the polynomial divisor is G ( X ) = Go + G1. X1 + Gz. X2 + G3. X3 = 1 + X + X 3 . As indicated previously, the data throughput of this serial implementation is quite low. Very high data rates can only be achieved with high clock frequencies, which in turn can only be obtained using rather expensive technological solutions.

Parallelisation of data processing is the main solution to improve the speed performance of a circuit (or system) if the clock rate must remain low.Pipelining may be used as an effective parallelisation method when a repeatitive process must be applied on large volumes of ‘data. Previous works have addressed the parallelisation problem in large demanding computational applications, particularly in arithmetic (eg. [3][4]) and error control coding circuits (eg. [11[21[61). In the serial architecture (figure I), a new data bit is inject on each clock cycle. The previous cumulated remainder is simultaneously multiplied by X and divided by G(z) (where G(z) is the polynomial divisor).

On P Figure I : Serial polynomial division for G ( X ) = 1 -tX + X 3 -7803-7057-0/01/$10. 00 02001 IEEE. 1231 successive clock cycle , P bits are injected and P successive multiplication and divisions are performed. The next formula (related to the example of figure 1) describes the operation performed on one clock cycle. 0 T = [ o o 1 !]=[n Gz 0 1 o 1 1 Go GI 0 i ] 0 3 RESULTS This architecture have been implemented on FPGA devices of the FLEXlOKE ALTERA family. These devices have their maximal clock frequency limited to 250 MHz. The architecture was tested on the generating polynomials of table 1.

The results in table 2 were obtained on FPGA devices of the FLEXlOKE ALTERA family.The architecture tested in these examples implements a fully operational CRC checker. The synchronisation signals to write and read data respectively on input and ouput are fully implemented. The synthesis was done using Synplify 5. 3 and MaxPlus11 10. 0. The architecture was tested for 3 different levels of paralelism on 6 differents standard divisor polynomials.

It can be noticed that G17(z) is used on ethernet, FDDI and AALS-ATM, while G14(z) is the standard polynomial for the X2. 5 protocol. The clock rates must be compared to the highest frequency (250 MHz) that can be performed on FLEXlOKE devices.The “IC” indication means “logical cells” and is an indication of the area consumption. The results must be compared to those obtained in [SI. A data rate of 160 Mbits/s was obtained on an ALTERA FLEXIOK device (max. clock rate of 125 MHz), on a 32-bit parallel CRC runtime-configurable implementation of the decoder, based on the use of parallel combi- A pipeline structure can be devised by the implementation of P successive multiplications and divisions.

However, to keep the clock rate high, the P operations should not be done in a single combinatorial block. Thus, the stages of the P-multiplingldivising block must be separated by registers.This is the basic idea of the pipeline structure. Each of the P parallel bits of an input must be injected in their respective pipeline stage. consequently, they must be injected on different clock cycles. This may be done if the bits are delayed in a shift-register structure and (cf. the shift register path between [ d i n o ,.

.. , [douto, ... ,doutp-l] in the figure 2, with P = 8 in this example and G ( X ) = 1 + X + X 3 . The operation performed when passing from the stage k + l to the stage k of the pipeline (k>O) is described in the next formula, where G ( X ) = 1 + X + X 3 as it is in figure 2.

ith Ri,J= 0 wheni + j > p - 1. The P bits of an input are processed in P clock cycles. At each clock cycle, the result of the processing of P bits is available at the output of the pipeline structure. This result (the remainder of the P bits divided by G(z) must be cumulated in the [ROO, ROZ] ROI, register using a recurrent approach, similar to the scheme of the serial architecture of figure 1. The cumulated remainder at time t must be multiplied by X p and then divided by G(x). Then, the new partial remainder coming out of the pipeline structure can be cumulated. This process is describet in the next formula.

Ro,o,ROJ,R0,Sltfl = [Ro,o,RO,l,R0,zIt * M +[Ri,o, Ri,i, Rl,z]t * T f [Do,P-l, 0,Olt natorial block for the polynomial division as presented in [ 11. The gain obtained on the 32-bit parallel architecture is within 16 and 30 times, that is, 8 to 1. 5 times using the same technology (cf. table 2). For any combination of the design parametres, the latency is alway equal to P clock cycles where P denotes the parallelisation level. It can be noticed that for given a maximal polynomial divisor degree, the area consumption (number of logic cells ) is almost proportional to the parallelisation level of the architecture.Furthermore, the results show that a large increase of the parallelisation level can be done with a reasonable decrease of maximal clock frequency.

The critical path is due to the M matrix. The complexity of this matrix depends on the choosen polynomial (number and position of the non-zero terms in the polynomial). It also depends on the parallelisation 1232 level, but not linearly. Actually, a higher parallelisation level can lead to a less complex matrix.

1 INTRODUCTION The CRC (Cyclic Redundancy Checking) codes are used in a lot of telecommunication applications. They are used in the internal layers of protocols such as Ethernet, X25, FDDI and ATM (AAL5). However, on modem networks, the need for increasing data rates (over 1 Gbit/s) is setting the constraints on performance very high. Indeed, the speed improvement (higher clock rates) due to the technological evolution is unable to fit the demand.Consequently, new architectures must be devised. Targetting the applications to an FPGA device is an issue for this paper, as it allows low-cost designs. The simple and evident serial implementation is a classical hardware implementation of the CRC algorithm.

Unfortunatly, on an FPGA implementation with maximal clock frequency of 250 MHz, maximal data rate is limited to 250 Mbits/s is the best case. Higher data rates can only be obtained through parallelisation. Some parallel architectures have been proposed in the past to address the need for high data throughput [ 1][2].The main problem is usually to limit the rapidly increasing area overhead while improving the speed performance. In this paper, we present a parallel approach for the polynomial division based on a pipeline structure. The parallelisation can be led to any level and is only lim- ited by the area constraint set on the design. The data throughput is almost directly linked to the parallelisation level, as the maximal clock rate is not very sensitive to it.

2 PRINCIPLE The polynomial division is the fundamental operation of the CRC applications.The serial implementation of the division is shown in figure 1 for the case where the polynomial divisor is G ( X ) = Go + G1. X1 + Gz. X2 + G3. X3 = 1 + X + X 3 . As indicated previously, the data throughput of this serial implementation is quite low. Very high data rates can only be achieved with high clock frequencies, which in turn can only be obtained using rather expensive technological solutions.

Parallelisation of data processing is the main solution to improve the speed performance of a circuit (or system) if the clock rate must remain low.Pipelining may be used as an effective parallelisation method when a repeatitive process must be applied on large volumes of ‘data. Previous works have addressed the parallelisation problem in large demanding computational applications, particularly in arithmetic (eg. [3][4]) and error control coding circuits (eg. [11[21[61). In the serial architecture (figure I), a new data bit is inject on each clock cycle. The previous cumulated remainder is simultaneously multiplied by X and divided by G(z) (where G(z) is the polynomial divisor).

On P Figure I : Serial polynomial division for G ( X ) = 1 -tX + X 3 -7803-7057-0/01/$10. 00 02001 IEEE. 1231 successive clock cycle , P bits are injected and P successive multiplication and divisions are performed. The next formula (related to the example of figure 1) describes the operation performed on one clock cycle. 0 T = [ o o 1 !]=[n Gz 0 1 o 1 1 Go GI 0 i ] 0 3 RESULTS This architecture have been implemented on FPGA devices of the FLEXlOKE ALTERA family. These devices have their maximal clock frequency limited to 250 MHz. The architecture was tested on the generating polynomials of table 1.

The results in table 2 were obtained on FPGA devices of the FLEXlOKE ALTERA family.The architecture tested in these examples implements a fully operational CRC checker. The synchronisation signals to write and read data respectively on input and ouput are fully implemented. The synthesis was done using Synplify 5. 3 and MaxPlus11 10. 0. The architecture was tested for 3 different levels of paralelism on 6 differents standard divisor polynomials.

It can be noticed that G17(z) is used on ethernet, FDDI and AALS-ATM, while G14(z) is the standard polynomial for the X2. 5 protocol. The clock rates must be compared to the highest frequency (250 MHz) that can be performed on FLEXlOKE devices.The “IC” indication means “logical cells” and is an indication of the area consumption. The results must be compared to those obtained in [SI. A data rate of 160 Mbits/s was obtained on an ALTERA FLEXIOK device (max. clock rate of 125 MHz), on a 32-bit parallel CRC runtime-configurable implementation of the decoder, based on the use of parallel combi- A pipeline structure can be devised by the implementation of P successive multiplications and divisions.

However, to keep the clock rate high, the P operations should not be done in a single combinatorial block. Thus, the stages of the P-multiplingldivising block must be separated by registers.This is the basic idea of the pipeline structure. Each of the P parallel bits of an input must be injected in their respective pipeline stage. consequently, they must be injected on different clock cycles. This may be done if the bits are delayed in a shift-register structure and (cf. the shift register path between [ d i n o ,.

.. , [douto, ... ,doutp-l] in the figure 2, with P = 8 in this example and G ( X ) = 1 + X + X 3 . The operation performed when passing from the stage k + l to the stage k of the pipeline (k>O) is described in the next formula, where G ( X ) = 1 + X + X 3 as it is in figure 2.

ith Ri,J= 0 wheni + j > p - 1. The P bits of an input are processed in P clock cycles. At each clock cycle, the result of the processing of P bits is available at the output of the pipeline structure. This result (the remainder of the P bits divided by G(z) must be cumulated in the [ROO, ROZ] ROI, register using a recurrent approach, similar to the scheme of the serial architecture of figure 1. The cumulated remainder at time t must be multiplied by X p and then divided by G(x). Then, the new partial remainder coming out of the pipeline structure can be cumulated. This process is describet in the next formula.

Ro,o,ROJ,R0,Sltfl = [Ro,o,RO,l,R0,zIt * M +[Ri,o, Ri,i, Rl,z]t * T f [Do,P-l, 0,Olt natorial block for the polynomial division as presented in [ 11. The gain obtained on the 32-bit parallel architecture is within 16 and 30 times, that is, 8 to 1. 5 times using the same technology (cf. table 2). For any combination of the design parametres, the latency is alway equal to P clock cycles where P denotes the parallelisation level. It can be noticed that for given a maximal polynomial divisor degree, the area consumption (number of logic cells ) is almost proportional to the parallelisation level of the architecture.Furthermore, the results show that a large increase of the parallelisation level can be done with a reasonable decrease of maximal clock frequency.

The critical path is due to the M matrix. The complexity of this matrix depends on the choosen polynomial (number and position of the non-zero terms in the polynomial). It also depends on the parallelisation 1232 level, but not linearly. Actually, a higher parallelisation level can lead to a less complex matrix.