Fast and efficient implementation of floating point complex matrix decomposition

Floating point offers a large dynamic range, so many algorithms can use a single data type throughout. This article describes how to implement floating-point complex matrix decomposition using Vivado HLS. With HLS, various matrix decomposition algorithms can be implemented quickly and efficiently, which improves productivity and makes it easier for developers to move an algorithm onto an FPGA. The top-level function and its two sub-functions are declared as follows:
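void matrix_dcmp
(
    cf_t in_u[(R_DIM+Y_DIM)/DIV_NUM][DIV_NUM],
    cf_t pd_err_in,
    float lamda,
    float lamda_sqrt,
    float diag[R_DIM],
    cf_t r[R_DIM][X_DIM],
    cf_t p[R_DIM]
)
{
    // Excerpt only: the surrounding loops and local declarations are not shown.
    // Compute the coefficients (c, s and their scaled forms) for the current diagonal element.
    coef_calc(lamda_sqrt, lamda, diag[i], pre_in_u, pd_err_in, &s_o, &s_conj_o,
              &lamda_sqrtxs_o, &c_o, &lamda_sqrtxc_o, &diag_out, &p_o, &pd_err);
    // Apply the coefficients to one element of in_u and r.
    calc_core(u_tmp, r_tmp, s_n, i, j, k, c_o, lamda_sqrtxc_o, lamda_sqrtxs_o,
              s_conj_o, &in_u_w2, &r[i][r_addr]);
}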
void coef_calc
(
    float lamda_sqrt,
    float lamda,
    float r_diag,
    cf_t u_diag,
    cf_t pd_err_in,
    cf_t *s,
    cf_t *s_conj,
    cf_t *lamda_sqrtxs,
    float *c,
    float *lamda_sqrtxc,
    float *diag,
    cf_t *p_o,
    cf_t *pd_err
);
void calc_core
(
    cf_t in_u,
    cf_t r,
    int s_n,
    int i,
    unsigned char j,
    unsigned char k,
    float c_o,
    float lamda_sqrtxc_o,
    cf_t lamda_sqrtxs_o,
    cf_t s_conj_o,
    cf_t* u_ret,
    cf_t* r_ret
);
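The article does not show the bodies of coef_calc and calc_core. The parameter names (c, s, s_conj, lamda_sqrtxc, lamda_sqrtxs) suggest a Givens-rotation update with a forgetting factor, but that is an assumption, not something the source states. Under that assumption, a minimal plain C++ sketch of the arithmetic might look like this; givens_t, givens_coef, givens_apply and cmul are hypothetical names introduced only for illustration:

```cpp
#include <cassert>
#include <cmath>

struct cf_t { float re; float im; };   // stand-in for the article's complex type

// Hypothetical bundle of the coefficients that coef_calc returns by pointer.
struct givens_t {
    float c;             // real coefficient
    cf_t  s;             // complex coefficient
    cf_t  s_conj;        // conj(s)
    float lamda_sqrtxc;  // sqrt(lamda) * c
    cf_t  lamda_sqrtxs;  // sqrt(lamda) * s
    float diag;          // updated (real) diagonal element
};

// Assumed math: diag' = sqrt(lamda*r^2 + |u|^2), c = sqrt(lamda)*r/diag', s = u/diag'.
givens_t givens_coef(float lamda_sqrt, float r_diag, cf_t u_diag) {
    assert(lamda_sqrt > 0.0f);
    float scaled = lamda_sqrt * r_diag;
    float norm = std::sqrt(scaled * scaled + u_diag.re * u_diag.re + u_diag.im * u_diag.im);
    givens_t g;
    g.diag = norm;
    if (norm == 0.0f) {          // nothing to rotate: fall back to the identity
        g.c = 1.0f; g.s.re = 0.0f; g.s.im = 0.0f;
    } else {
        g.c = scaled / norm;
        g.s.re = u_diag.re / norm;
        g.s.im = u_diag.im / norm;
    }
    g.s_conj.re = g.s.re;
    g.s_conj.im = -g.s.im;
    g.lamda_sqrtxc = lamda_sqrt * g.c;
    g.lamda_sqrtxs.re = lamda_sqrt * g.s.re;
    g.lamda_sqrtxs.im = lamda_sqrt * g.s.im;
    return g;
}

// Complex multiply helper.
cf_t cmul(cf_t a, cf_t b) {
    cf_t out;
    out.re = a.re * b.re - a.im * b.im;
    out.im = a.re * b.im + a.im * b.re;
    return out;
}

// Assumed update: r' = sqrt(lamda)*c*r + conj(s)*u,  u' = c*u - sqrt(lamda)*s*r.
void givens_apply(const givens_t& g, cf_t r_in, cf_t u_in, cf_t* r_out, cf_t* u_out) {
    cf_t su = cmul(g.s_conj, u_in);
    cf_t sr = cmul(g.lamda_sqrtxs, r_in);
    r_out->re = g.lamda_sqrtxc * r_in.re + su.re;
    r_out->im = g.lamda_sqrtxc * r_in.im + su.im;
    u_out->re = g.c * u_in.re - sr.re;
    u_out->im = g.c * u_in.im - sr.im;
}
```

Applying the coefficients to the same pair they were computed from should drive the input element to zero and leave the new diagonal value in r, which is a quick way to check such a sketch in C simulation.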
By default, the RTL code generated by Vivado HLS retains the hierarchy of the original C code. When building the C code hierarchy, you can partition modules top-down or bottom-up. Write the basic floating-point operations (add, subtract, multiply, divide, square root, and so on) as the lowest-level sub-functions and pipeline them; you can even register their outputs for better timing performance, as in the following example:
// Registers the return value so the function output is driven by a flip-flop.
template <typename T> T reg(T x) {
#pragma HLS inline self off
#pragma HLS interface ap_none register port=return
    return x;
}

// Single-precision multiply, defined below.
float hfmult(float in1, float in2);

// Complex multiply built from four real multiplies.
cf_t mult(cf_t in1, cf_t in2) {
#pragma HLS PIPELINE
    cf_t out;
    float in1_re_in2_re, in1_im_in2_im, in1_re_in2_im, in1_im_in2_re;
    in1_re_in2_re = hfmult(in1.re, in2.re);
    in1_im_in2_im = hfmult(in1.im, in2.im);
    in1_re_in2_im = hfmult(in1.re, in2.im);
    in1_im_in2_re = hfmult(in1.im, in2.re);
    out.re = (in1_re_in2_re - in1_im_in2_im);
    out.im = (in1_re_in2_im + in1_im_in2_re);
    return reg(out);
}

float hfmult
(
    float in1,
    float in2
)
{
#pragma HLS PIPELINE
    float out;
    out = in1 * in2;
    return reg(out);
}
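Before synthesis, a leaf function like mult can be checked in plain C++ against a reference. The sketch below is one way to do that, assuming cf_t is a simple struct with re and im fields (the article does not show its definition). The HLS pragmas and the reg() wrapper are left out because they do not change the computed value, and mult_matches_reference and check_mult are hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>
#include <complex>

struct cf_t { float re; float im; };   // assumed layout of the article's complex type

// Software model of the leaf multiplier (pragmas omitted).
float hfmult(float in1, float in2) {
    return in1 * in2;
}

// Software model of the complex multiplier: four real multiplies, one subtract, one add.
cf_t mult(cf_t in1, cf_t in2) {
    cf_t out;
    out.re = hfmult(in1.re, in2.re) - hfmult(in1.im, in2.im);
    out.im = hfmult(in1.re, in2.im) + hfmult(in1.im, in2.re);
    return out;
}

// Returns true when mult() agrees with std::complex<float> within tol.
bool mult_matches_reference(cf_t a, cf_t b, float tol) {
    cf_t got = mult(a, b);
    std::complex<float> ref =
        std::complex<float>(a.re, a.im) * std::complex<float>(b.re, b.im);
    return std::fabs(got.re - ref.real()) <= tol &&
           std::fabs(got.im - ref.imag()) <= tol;
}

// Testbench-style check with a few small inputs.
void check_mult() {
    assert(mult_matches_reference({1.0f, 2.0f}, {3.0f, 4.0f}, 1e-6f));
    assert(mult_matches_reference({0.5f, -1.5f}, {-2.0f, 0.25f}, 1e-6f));
}
```

In a Vivado HLS project, a check like this would normally live in the C testbench rather than in the synthesized source.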
In addition, to increase parallelism, the in_u and r arrays in matrix_dcmp (which HLS synthesizes into BRAM or distributed RAM on the FPGA) must be partitioned with the ARRAY_PARTITION directive, so that the data can be fed to the parallel processing units in the same cycle.
#pragma HLS ARRAY_PARTITION variable=in_u complete dim=2
#pragma HLS ARRAY_PARTITION variable=r complete dim=2
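To illustrate why partitioning the second dimension matters, here is a small sketch that is separate from the article's design. With dim=2 fully partitioned, every element of a row sits in its own memory, so an unrolled inner loop can read and write all DIV_NUM lanes in the same cycle. The array sizes, scale_rows and check_scale_rows are hypothetical; a regular C++ compiler ignores the HLS pragmas and may warn about them.

```cpp
#include <cassert>

struct cf_t { float re; float im; };   // assumed layout of the article's complex type

const int ROWS = 4;      // hypothetical depth of the serial dimension
const int DIV_NUM = 4;   // hypothetical number of parallel lanes

// Scales every element; rows are processed serially, lanes in parallel.
void scale_rows(cf_t buf[ROWS][DIV_NUM], float gain) {
#pragma HLS ARRAY_PARTITION variable=buf complete dim=2
    for (int i = 0; i < ROWS; ++i) {
#pragma HLS PIPELINE
        for (int d = 0; d < DIV_NUM; ++d) {
#pragma HLS UNROLL
            // Each lane d maps to its own memory after partitioning,
            // so these accesses do not compete for RAM ports.
            buf[i][d].re *= gain;
            buf[i][d].im *= gain;
        }
    }
}

// Testbench-style check of the software behavior.
void check_scale_rows() {
    cf_t buf[ROWS][DIV_NUM];
    for (int i = 0; i < ROWS; ++i) {
        for (int d = 0; d < DIV_NUM; ++d) {
            buf[i][d].re = static_cast<float>(i * DIV_NUM + d);
            buf[i][d].im = 1.0f;
        }
    }
    scale_rows(buf, 2.0f);
    assert(buf[1][2].re == 12.0f);   // (1*4 + 2) * 2
    assert(buf[3][3].im == 2.0f);
}
```

Without the partition, a single dual-port RAM could serve at most two lanes per cycle, and the unrolled loop would be limited by memory ports rather than by the arithmetic.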
3, Vivado-HLS matrix decomposition timing optimization
To improve the timing of the RAM synthesized for in_u, apply the RESOURCE directive to in_u and select a 3-stage RAM core, so that the generated RAM has a register on both its input and its output.
#pragma HLS RESOURCE variable=in_u core=RAM3S
Similarly, we can assign enough latency to the operators mapped onto DSP48s, so that there are enough clock cycles to use the DSP48's internal pipeline registers.
If the latency of the single-precision floating-point multiplier hfmult above is set to 3, only 2 stages of latency are allocated to each DSP48. In that case the A_reg or P_reg inside the DSP48 is not used, and timing performance drops significantly:
derive_core fmul_der -base FMul_maxdsp -latency 3 -fixed
set_directive_resource -core fmul_der hfmult out
Alternatively, for better timing performance, set the latency of hfmult to at least 4, so that 3 stages of latency are allocated to each DSP48. The A_reg or P_reg inside the DSP48 is then used, which adds one more pipeline stage:
derive_core fmul_der -base FMul_maxdsp -latency 4 -fixed
set_directive_resource -core fmul_der hfmult out
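The same intent can also be expressed in the source file instead of the Tcl directives file. The sketch below assumes a Vivado HLS version whose RESOURCE pragma accepts the latency option. It binds the multiply directly to the FMul_maxdsp core rather than to a derived core, so it is not an exact equivalent of the derive_core commands with -fixed. The reg() wrapper is omitted to keep the example short, and check_hfmult is a hypothetical testbench helper.

```cpp
#include <cassert>

// Single-precision multiply with the operator latency requested in-source.
float hfmult(float in1, float in2) {
#pragma HLS PIPELINE
    float out;
    // Assumption: request a 4-cycle FMul_maxdsp implementation for the
    // operation that produces 'out', leaving room for the DSP48 registers.
#pragma HLS RESOURCE variable=out core=FMul_maxdsp latency=4
    out = in1 * in2;
    return out;
}

// Testbench-style check; the pragmas do not affect the C simulation result.
void check_hfmult() {
    assert(hfmult(1.5f, 2.0f) == 3.0f);
    assert(hfmult(-2.0f, 0.25f) == -0.5f);
}
```

Whichever form is used, the synthesis report should be checked to confirm the latency that the tool actually assigned to the multiplier.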
4, Vivado-HLS matrix decomposition design results
The matrix in this design is 128x128, single-precision floating-point complex.
The implementation combines serial and parallel processing.
