
Hardware Acceleration of EDA Algorithms - P9


Document information:

Hardware Acceleration of EDA Algorithms - P9: Single-threaded software applications have ceased to see significant gains in performance on a general-purpose CPU, even with further scaling in very large scale integration (VLSI) technology. This is a significant problem for electronic design automation (EDA) applications, since the design complexity of VLSI integrated circuits (ICs) is continuously growing. In this research monograph, we evaluate custom ICs, field-programmable gate arrays (FPGAs), and graphics processors as platforms for accelerating EDA algorithms, instead of the general-purpose single-threaded CPU....
Content extracted from the document:
Now, by definition,

CD(k) = (CD(i) · D(i, k) + CD(j) · D(j, k)) and
CD(i) = (CD(a) · D(a, i) + CD(b) · D(b, i)).

From the first property discussed for CD, CD(a) = FD(a s-a-0, a) = 1010, and by definition CD(b) = 0000. By substitution, and by similarly computing CD(i) and CD(j), we compute CD(k) = 0010.

The implementation of the computation of detectabilities and cumulative detectabilities differs between FSIM∗ and GFTABLE, since in GFTABLE all computations of detectabilities and cumulative detectabilities are done on the GPU, with every kernel executed on the GPU launched with T threads. Thus a single kernel in GFTABLE computes T times more data than the corresponding computation in FSIM∗. In FSIM∗, the backtracing is performed in a topological manner from the output of the FFR to its inputs and is not scheduled for gates driving zero critical lines in the packet. We found that this pruning reduces the number of gate evaluations by 42% in FSIM∗ (based on tests run on four benchmark circuits). In GFTABLE, however, T times more patterns are evaluated at once, and as a result, no reduction in the number of scheduled gate evaluations was observed for the same four benchmarks. Hence, in GFTABLE, we perform a brute-force backtracing on all gates in an FFR.

As an example, the pseudocode of the kernel which evaluates the cumulative detectability at the output k of a 2-input gate with inputs i and j is provided in Algorithm 11. The arguments to the kernel are the pointer to global memory, CD, where cumulative detectabilities are stored; the pointer to global memory, D, where detectabilities to the immediate dominator are stored; the gate_id of the gate being evaluated (k); and its two inputs (i and j). Let the thread's (unique) threadID be tx. The data in CD and D are indexed at locations (tx + i × T) and (tx + j × T), and the result, computed as per

CD(k) = (CD(i) · D(i, k) + CD(j) · D(j, k)),

is stored in CD at location (tx + k × T). Our implementation has a similar kernel for 2-, 3-, and 4-input gates in our library.

Algorithm 11 Pseudocode of the Kernel to Compute CD of the Output k of a 2-Input Gate with Inputs i and j

  CPT_kernel_2(int ∗CD, int ∗D, int i, int j, int k) {
    tx = my_thread_id
    CD[tx + k ∗ T] = CD[tx + i ∗ T] · D[tx + i ∗ T] + CD[tx + j ∗ T] · D[tx + j ∗ T]
  }

9.4.2.4 Fault Simulation of SR(s) (Lines 15, 16)

In the next step, the FSIM∗ algorithm checks that CD(s) ≠ (00...0) (line 15) before it schedules the simulation of SR(s) until its immediate dominator t and the computation of D(s, t). In other words, if CD(s) = (00...0), it implies that, for the current vector, the frontier of all faults upstream from s has died before reaching the stem s, and thus no fault can be detected at s. In that case, the fault simulation of SR(s) would be pointless.

In the case of GFTABLE, the effective packet size is 32 × T. T is usually set to more than 1,000 (in our experiments it is ≥10K), in order to take advantage of the parallelism available on the GPU and to amortize the overhead of launching a kernel and accessing global memory. The probability of finding CD(s) = (00...0) in GFTABLE is therefore very low (∼0.001). Further, this check would require the logical OR of T 32-bit integers on the GPU, which is an expensive computation. As a result, we bypass the test of line 15 in GFTABLE and always schedule the computation of SR(s) (line 16).
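As a concrete illustration of the kernels discussed in this section, the following is a minimal CUDA sketch of the 2-input CPT kernel of Algorithm 11. It is a sketch under stated assumptions rather than the authors' implementation: packets are assumed to be stored as unsigned 32-bit words, the '·' and '+' of the pseudocode are taken to be bitwise AND and OR, and the packet length T (which equals the number of threads) is passed in explicitly; the bounds check and launch configuration are illustrative.

  __global__ void CPT_kernel_2(unsigned int *CD, unsigned int *D,
                               int i, int j, int k, int T) {
    // Each thread owns one 32-bit word of the T-word packet.
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    if (tx >= T) return;
    // CD(k) = CD(i) · D(i, k) + CD(j) · D(j, k), computed bitwise:
    // '·' becomes AND and '+' becomes OR on the packed pattern words.
    CD[tx + k * T] = (CD[tx + i * T] & D[tx + i * T]) |
                     (CD[tx + j * T] & D[tx + j * T]);
  }

A host-side launch such as CPT_kernel_2<<<(T + 255) / 256, 256>>>(CD, D, i, j, k, T) would then update all T words of CD for gate k in a single kernel call.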
In simulating SR(s), explicit fault simulation is performed in the forward levelized order from the stem s to its immediate dominator t. The input at the stem s during the simulation of SR(s) is CD(s) XORed with the fault-free value at s. This is equivalent to injecting the faults which are upstream from s and observable at s. After the fault simulation of SR(s), the detectability D(s, t) is computed by XORing the simulation output at t with the true-value simulation at t. During the forward levelized simulation, the immediate fanout of a gate g is scheduled only if the result of the logic evaluation at g is different from its fault-free value. This check is conducted for every gate on all paths from the stem s to its immediate dominator t. On the GPU, this step involves XORing the current gate's T 32-bit outputs with the previously stored fault-free T 32-bit outputs. It would then require the computation of a logical reduction OR of the T 32-bit results of the XOR into one 32-bit result, because line 17 is computed on the CPU, which requires a 32-bit operand.

In GFTABLE, the reduction OR operation is a modified version of the highly optimized tree-based parallel reduction algorithm on the GPU described in [2]. The approach in [2] effectively avoids bank conflicts and divergent warps, minimizes global memory access latencies, and employs loop unrolling to gain further speedup. Our modified reduction algorithm differs from [2] in two key ways: the approach in [2] computes a SUM rather than a logical OR, and it is a breadth-first approach. In our case, employing a breadth-first approach is expensive, since we only need to detect whether any of the T × 32 bits is not equal to 0; as soon as we find a single non-zero entry, we can finish our computation. Note that performing this test sequentially would be extremely slow in the worst case. We therefore divide the array of T 32-bit words equally into smaller groups of Q words each and compute the logical OR of all words within a group using our modified parallel reduction approach. As a result, our approach is a hybrid of a breadth-first and a depth-first approach. If the reduction result for any group is not (00...0), we return from the parallel reduction kernel and schedule the fanout of the current ...
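The group-wise reduction described above can be pictured with the following CUDA sketch. It is a minimal illustration under assumptions, not the authors' implementation: the XOR of the faulty and fault-free outputs is assumed to already reside in an array xor_out of T words, the group size Q is taken to be the thread-block size (a power of two), and the early-exit path as well as the bank-conflict avoidance and loop unrolling adopted from [2] are omitted.

  // One thread block reduces one group of Q = blockDim.x words into a single
  // word; group_or[b] is non-zero iff some word in group b has a set bit.
  __global__ void group_or_reduce(const unsigned int *xor_out,
                                  unsigned int *group_or, int T) {
    extern __shared__ unsigned int sdata[];      // Q words of shared memory
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (idx < T) ? xor_out[idx] : 0u;  // pad the last group with 0
    __syncthreads();
    // Tree-based reduction as in [2], but computing OR instead of SUM.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
      if (tid < s) sdata[tid] |= sdata[tid + s];
      __syncthreads();
    }
    if (tid == 0) group_or[blockIdx.x] = sdata[0];
  }

Launched as group_or_reduce<<<(T + Q - 1) / Q, Q, Q * sizeof(unsigned int)>>>(xor_out, group_or, T), this leaves one word per group; the caller can stop at the first non-zero entry of group_or and schedule the fanout of the current gate without reducing the remaining groups.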