Hardware Acceleration of EDA Algorithms - P9: Single-threaded software applications have ceased to see significant gains in performance on a general-purpose CPU, even with further scaling in very large scale integration (VLSI) technology. This is a significant problem for electronic design automation (EDA) applications, since the design complexity of VLSI integrated circuits (ICs) is continuously growing. In this research monograph, we evaluate custom ICs, field-programmable gate arrays (FPGAs), and graphics processors as platforms for accelerating EDA algorithms, instead of the general-purpose single-threaded CPU....
9.4 Our Approach

Now, by definition,

CD(k) = (CD(i) · D(i, k) + CD(j) · D(j, k))

and

CD(i) = (CD(a) · D(a, i) + CD(b) · D(b, i))

From the first property discussed for CD, CD(a) = FD(a s-a-0, a) = 1010, and by definition CD(b) = 0000. By substitution, and by similarly computing CD(i) and CD(j), we compute CD(k) = 0010.

The implementation of the computation of detectabilities and cumulative detectabilities in FSIM∗ and GFTABLE is different, since in GFTABLE all computations of detectabilities and cumulative detectabilities are done on the GPU, with every kernel launched on the GPU executing T threads. Thus a single kernel in GFTABLE computes T times more data than the corresponding computation in FSIM∗. In FSIM∗, backtracing is performed in a topological manner from the output of the FFR to its inputs and is not scheduled for gates driving zero critical lines in the packet. We found that this pruning reduces the number of gate evaluations by 42% in FSIM∗ (based on tests run on four benchmark circuits). In GFTABLE, however, T times more patterns are evaluated at once, and as a result no reduction in the number of scheduled gate evaluations was observed for the same four benchmarks. Hence, in GFTABLE, we perform a brute-force backtracing on all gates in an FFR.

As an example, the pseudocode of the kernel which evaluates the cumulative detectability at output k of a 2-input gate with inputs i and j is provided in Algorithm 11. The arguments to the kernel are the pointer to global memory, CD, where cumulative detectabilities are stored; the pointer to global memory, D, where detectabilities to the immediate dominator are stored; the gate_id of the gate being evaluated (k); and its two inputs (i and j). Let the thread's (unique) threadID be tx.
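The CD computation above reads as a bitwise expression once "·" is taken as AND and "+" as OR on packed fault-detection bit-vectors (one bit per test pattern in the packet). The following is a minimal CPU sketch of that merge step; the operand values in the comment are illustrative, not the D values from the book's worked example:

```c
#include <stdint.h>

/* Cumulative-detectability merge at the output k of a 2-input gate:
 * CD(k) = CD(i) . D(i,k) + CD(j) . D(j,k), where "." is bitwise AND
 * and "+" is bitwise OR on packed detection bit-vectors.
 * Example (illustrative operands): cd_merge(0xA, 0x3, 0x4, 0xC) == 0x6,
 * since (1010 & 0011) | (0100 & 1100) = 0010 | 0100 = 0110. */
static uint32_t cd_merge(uint32_t cd_i, uint32_t d_ik,
                         uint32_t cd_j, uint32_t d_jk)
{
    return (cd_i & d_ik) | (cd_j & d_jk);
}
```

A set bit in the result marks a pattern for which some fault upstream of the gate propagates through output k toward the immediate dominator.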
The data in CD and D are indexed at locations (tx + i × T) and (tx + j × T), and the result, computed as

CD(k) = (CD(i) · D(i, k) + CD(j) · D(j, k))

is stored in CD indexed at location (tx + k × T). Our implementation has a similar kernel for 2-, 3-, and 4-input gates in our library.

Algorithm 11 Pseudocode of the Kernel to Compute CD of the Output k of a 2-Input Gate with Inputs i and j

CPT_kernel_2(int ∗CD, int ∗D, int i, int j, int k){
    tx = my_thread_id
    CD[tx + k ∗ T] = CD[tx + i ∗ T] · D[tx + i ∗ T] + CD[tx + j ∗ T] · D[tx + j ∗ T]
}

9.4.2.4 Fault Simulation of SR(s) (Lines 15, 16)

In the next step, the FSIM∗ algorithm checks that CD(s) ≠ (00...0) (line 15) before it schedules the simulation of SR(s) up to its immediate dominator t and the computation of D(s, t). In other words, if CD(s) = (00...0), it implies that for the current vector, the frontier of all faults upstream from s has died before reaching the stem s, and thus no fault can be detected at s. In that case, the fault simulation of SR(s) would be pointless.

In the case of GFTABLE, the effective packet size is 32 × T. T is usually set to more than 1,000 (in our experiments it is ≥10K), in order to take advantage of the parallelism available on the GPU and to amortize the overhead of launching a kernel and accessing global memory. The probability of finding CD(s) = (00...0) in GFTABLE is therefore very low (∼0.001). Further, this check would require the logical OR of T 32-bit integers on the GPU, which is an expensive computation. As a result, we bypass the test of line 15 in GFTABLE and always schedule the computation of SR(s) (line 16).

In simulating SR(s), explicit fault simulation is performed in forward levelized order from stem s to its immediate dominator t. The input at stem s during simulation of SR(s) is CD(s) XORed with the fault-free value at s. This is equivalent to injecting the faults which are upstream from s and observable at s.
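The kernel body of Algorithm 11 maps directly to plain C if each GPU thread tx is replaced by a loop iteration. The sketch below assumes the layout described in the text (gate g's T packed 32-bit words occupy indices [g·T, (g+1)·T) of CD and D); T is passed as a parameter here, whereas on the GPU it is the thread count of the kernel launch:

```c
#include <stdint.h>
#include <stddef.h>

/* CPU sketch of CPT_kernel_2: each iteration of the loop plays the
 * role of one GPU thread tx. "." is bitwise AND and "+" is bitwise
 * OR, as in Algorithm 11. Gates i and j are the inputs of the gate
 * whose output is k. */
static void cpt_kernel_2(uint32_t *CD, const uint32_t *D,
                         size_t i, size_t j, size_t k, size_t T)
{
    for (size_t tx = 0; tx < T; tx++) {
        CD[tx + k * T] = (CD[tx + i * T] & D[tx + i * T])
                       | (CD[tx + j * T] & D[tx + j * T]);
    }
}
```

On the GPU the loop disappears: the T iterations run as T independent threads, which is why a single GFTABLE kernel launch covers T times the data of the corresponding FSIM∗ step.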
After the fault simulation of SR(s), the detectability D(s, t) is computed by XORing the simulation output at t with the true-value simulation at t. During the forward levelized simulation, the immediate fanout of a gate g is scheduled only if the result of the logic evaluation at g is different from its fault-free value. This check is conducted for every gate on all paths from stem s to its immediate dominator t. On the GPU, this step involves XORing the current gate's T 32-bit outputs with the previously stored fault-free T 32-bit outputs. It then requires the computation of a logical reduction OR of the T 32-bit results of the XOR into one 32-bit result. This is because line 17 is computed on the CPU, which requires a 32-bit operand. In GFTABLE, the reduction OR operation is a modified version of the highly optimized tree-based parallel reduction algorithm on the GPU described in [2]. The approach in [2] effectively avoids bank conflicts and divergent warps, minimizes global memory access latencies, and employs loop unrolling to gain further speedup. Our modified reduction algorithm has two key differences compared to [2]: the approach in [2] computes a SUM instead of a logical OR, and it is a breadth-first approach. In our case, employing a breadth-first approach is expensive, since we need to detect whether any of the T × 32 bits is non-zero; as soon as we find a single non-zero entry, we can finish our computation. Note that performing this test sequentially would be extremely slow in the worst case. We therefore equally divide the array of T 32-bit words into smaller groups of size Q words and compute the logical OR of all numbers within a group using our modified parallel reduction approach. As a result, our approach is a hybrid of a breadth-first and a depth-first approach. If the reduction result for any group is not (00...0), we return from the parallel reduction kernel and schedule the fanout of the current ...
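The grouped early-exit scan can be sketched on the CPU as follows. The inner loop here stands in for the tree-based parallel reduction of [2] (with OR in place of SUM); Q is a tuning parameter, and the function name and signature are illustrative, not from the book:

```c
#include <stdint.h>
#include <stddef.h>

/* Hybrid breadth-first/depth-first OR test: the T 32-bit XOR results
 * are split into groups of Q words; each group is OR-reduced (in
 * parallel on the GPU), and the scan over groups stops as soon as one
 * group's OR is non-zero. Returns 1 if any bit in words[0..T) is set,
 * i.e. if some fault effect survived at the current gate. */
static int any_nonzero_grouped(const uint32_t *words, size_t T, size_t Q)
{
    for (size_t g = 0; g < T; g += Q) {
        uint32_t group_or = 0;
        size_t end = (g + Q < T) ? g + Q : T;
        for (size_t w = g; w < end; w++)   /* tree reduction on the GPU */
            group_or |= words[w];
        if (group_or != 0)
            return 1;   /* early exit: schedule the gate's fanout */
    }
    return 0;
}
```

The early exit is what makes the hybrid cheaper than a full breadth-first reduction: a non-zero group near the front of the array ends the test without touching the remaining T − Q words.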