ADVANCED FAULT-TOLERANT ON-CHIP INTERCONNECTS


Future Systems-on-Chip (SoCs) will contain hundreds of components, including processor cores, DSPs, memory, accelerators, and I/O, all integrated into a single die area of just a few square millimeters. Such complex SoCs will be interconnected via a novel on-chip interconnect that is closer to a sophisticated network than to current bus-based solutions. This network must provide high throughput and low latency while keeping area and power consumption low. Our research effort addresses several design challenges to enable this new paradigm in massively parallel many-core systems. In particular, we are investigating fault tolerance, 3D-TSV integration, photonic communication, low-power mapping techniques, and low-latency adaptive routing.

Patents
  1. [Japanese Patent No. 6846027] (2021.03.03) Abderazek Ben Abdallah, "Defect tolerance router for network on-chip", Japanese Patent Application No. 2016-100732 (2016.05.19)
  2. [Japanese Patent No. 6747660] (registered 2020.11.08) Abderazek Ben Abdallah, "Optical network-on-chip system using non-block photo-switches each including control unit, and optical network-on-chip setup method", Japanese Patent Application No. 2015-196698 (2015.10.02)
  3. [Japanese Patent No. 6284177] (registered 2018.2.09) Abderazek Ben Abdallah, "Error resilience router, IC using the same, and error resilience router control method", Japanese Patent Application No. 2013-262523 (2013.12.19)
  4. [Japanese Patent No. 7239099] Abderazek Ben Abdallah, Khanh N. Dang, Masayuki Hisada, "TSV Error Tolerant Router Device for 3D Network On Chip", Japanese Patent Application No. 2017-218953 (2023.03.14)
  5. Abderazek Ben Abdallah, Khanh N. Dang, Masayuki Hisada, "Distance-aware Extended Parity Product Coding for multiple faults detection for on-chip links" [三次元ICリンクにおける多重故障検出のための距離に基づく拡張パリティ積符号], Japanese Patent Application No. 2020-171553
  6. Abderazek Ben Abdallah, Khanh N. Dang, "Multiple error detection circuit detecting multiple errors in multiple links and error correction circuit having multiple error detection circuit", Japanese Patent Application No. 2020-094220
  • Khanh N. Dang, Akram Ben Ahmed, Abderazek Ben Abdallah, Xuan-Tu Tran, "HotCluster: A thermal-aware defect recovery method for Through-Silicon-Vias Towards Reliable 3-D ICs systems", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, March 2021. DOI: 10.1109/TCAD.2021.3069370
    Through-silicon via (TSV) is considered the near-future solution for realizing low-power and high-performance 3D-integrated circuits (3D-ICs) and 3D-Network-on-Chips (3D-NoCs). However, the lifetime reliability of TSVs is one of the most critical challenges, owing to their fault sensitivity and the high operating temperature of 3D-ICs, which also accelerates the fault rate. Meanwhile, most current works focus on detecting and correcting TSV defects after manufacturing without considering the impact of high-temperature nodes on lifetime reliability. Besides, the recovery of defective clusters is also challenging because of costly redundancies. In this work, we present HotCluster: a hotspot-aware self-correction platform for clustering defects in 3D-NoCs to help understand and tackle this problem. We first give a method to predict normalized fault rates and place redundant TSV groups according to each region's fault rate. In our particular medium fault rate (normalized to the coolest area), HotCluster reduces about 60% of the redundancies in comparison to uniformly distributed redundancies while having a higher ratio of routers working in a normal state. Furthermore, HotCluster integrates both online (weight-based) and offline (max-flow min-cut) mapping algorithms to help the system correct the faulty TSV clusters. The experimental results show that both the max-flow min-cut offline method and the weight-based online mode with a redundancy of 0.25 exhibit less than 1% of routers disabled under 50% defect rates.
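
    The offline mapping mentioned in this abstract relies on max-flow min-cut. As a rough illustration only, the sketch below assigns routers with a defective TSV cluster to neighboring routers that have spare TSV groups, using a textbook Edmonds-Karp max-flow. The graph construction, the function names (max_flow, map_defective_clusters), the node labels "SRC"/"SINK", and the example data are all assumptions made here; the paper's actual formulation may differ.

```python
from collections import deque, defaultdict

def max_flow(capacity, source, sink):
    """Textbook Edmonds-Karp. capacity[u][v] is the capacity of edge u->v.
    Returns (flow_value, residual). In this bipartite setting there are no
    original reverse edges, so residual[v][u] > 0 means flow went u->v."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, nbrs in capacity.items():
        for v, cap in nbrs.items():
            residual[u][v] += cap
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in list(residual[u].items()):
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total, residual
        # Walk back from the sink to collect the path edges.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        total += bottleneck

def map_defective_clusters(defective, spare_slots, adjacency):
    """Hypothetical mapping step: each defective router needs one donor.
    spare_slots[d] is how many neighbors donor d can serve (an assumption)."""
    capacity = defaultdict(dict)
    for r in defective:
        capacity["SRC"][r] = 1
        for d in adjacency.get(r, ()):
            if spare_slots.get(d, 0) > 0:
                capacity[r][d] = 1
    for d, slots in spare_slots.items():
        capacity[d]["SINK"] = slots
    value, residual = max_flow(capacity, "SRC", "SINK")
    mapping = {r: d for r in defective for d in capacity[r] if residual[d][r] > 0}
    return value, mapping

# Made-up example: three defective routers, two donors.
recovered, mapping = map_defective_clusters(
    defective=["R1", "R2", "R3"],
    spare_slots={"R4": 1, "R5": 2},
    adjacency={"R1": ["R4"], "R2": ["R4", "R5"], "R3": ["R5"]},
)
print(recovered, mapping)
```

    Routers missing from the returned mapping would have no donor and would be disabled, which is the quantity the abstract reports as the ratio of disabled routers.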

  • Khanh N. Dang, Akram Ben Ahmed, Abderazek Ben Abdallah, X. Tran, "A thermal-aware on-line fault tolerance method for TSV lifetime reliability in 3D-NoC systems", IEEE Access, Volume 8, pp. 166642-166657, 2020.
    Through-silicon-via (TSV) based 3D Integrated Circuits (3D-ICs) are one of the most advanced architectures, providing low power consumption, shorter wire length, and a smaller footprint. However, 3D-ICs face lifetime reliability issues due to high operating temperature and interconnect reliability, especially that of the TSVs, which can significantly affect the accuracy of the applications. In this paper, we present an online method, named IaSiG, that supports the detection and correction of lifetime TSV failures. By reusing the conventional recovery method and analyzing the output syndromes, IaSiG can determine and correct the defective TSVs. Results show that within a group, R redundant TSVs can fully localize and correct R defects and support the detection of R+1 defects. Moreover, by using G groups, it can localize up to G×R and detect up to G×(R+1) defects. An implementation of IaSiG for 32-bit data in eight groups and two redundancies has a worst-case execution time (WCET) of 5,152 cycles while supporting at most 16 defective TSVs (50% localization). By integrating IaSiG into a 3D Network-on-Chip, we also perform a grid-search-based empirical method to insert suitable numbers of redundancies into TSV groups. The empirical method takes the operating temperature as the factor of accelerated fault, since temperature is one of the major issues of 3D-ICs. The results show that the proposed method can reduce the number of redundancies compared to the uniform method while still maintaining the required Mean Time to Failure.
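
    The bounds quoted in this abstract can be checked with a few lines of arithmetic. The helper below is only a convenience written for this page (the name iasig_capacity is hypothetical), and it assumes the "50% localization" figure is the ratio of localizable defects to the 32 data TSVs.

```python
def iasig_capacity(data_bits: int, groups: int, redundancies: int) -> dict:
    """Upper bounds stated in the abstract: G*R localizable defects and
    G*(R+1) detectable defects for G groups with R redundant TSVs each."""
    localizable = groups * redundancies
    detectable = groups * (redundancies + 1)
    return {
        "localizable": localizable,
        "detectable": detectable,
        # Assumption: the ratio is taken against the number of data TSVs.
        "localization_ratio": localizable / data_bits,
    }

# Configuration from the abstract: 32-bit data, eight groups, two redundancies.
caps = iasig_capacity(data_bits=32, groups=8, redundancies=2)
print(caps)  # {'localizable': 16, 'detectable': 24, 'localization_ratio': 0.5}
```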

  • Khanh N. Dang, Akram Ben Ahmed, Michael Meyer, Abderazek Ben Abdallah, and Xuan-Tu Tran, "A non-blocking non-degrading multiple defect link test method for 3D-Networks-on-Chip," IEEE Access, Vol. 8, pp. 59571-59589, 2020. DOI: 10.1109/ACCESS.2020.2982836
    As one of the most promising technologies to realize 3D Integrated Circuits (3D-ICs), Through-Silicon-Via (TSV) acts as the inter-layer link inside 3D Networks-on-Chip. However, reliability issues due to low yield rates and sensitivity to thermal hotspots and stress are preventing TSV-based 3D-ICs from being widely and efficiently used. To ensure the correctness of TSV connections at run-time, detecting multiple (clustering) defects is an important feature. While Error Correction Codes are limited by a certain number of detectable faults, using Built-In-Self-Test (BIST) prevents the system from operating normally during the test time. This paper first presents a Parity Product Code (PPC) with the ability to correct one fault and detect at least two faults. Second, we present extended PPC (EPPC) to detect multiple defects within the links of Networks-on-Chip by using two or more additional matrices. Furthermore, we present the distance-aware version of EPPC to detect multiple defects by using only one extra matrix. The results show that the distance-aware EPPC can detect 100% of clustering defects and multiple random defects within two and three cycles, respectively. The performance evaluation for Network-on-Chip testing also shows no degradation while providing an extremely short response time (2-3 cycles).
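
    To make the "correct one fault, detect two" property concrete, here is a minimal sketch of a generic row/column parity product code. It is not the paper's PPC or EPPC design: the matrix layout, the function names (parities, check_and_correct), and the simplifying assumption that the parity bits themselves arrive error-free are all ours.

```python
from typing import List, Tuple

Matrix = List[List[int]]

def parities(bits: Matrix) -> Tuple[List[int], List[int]]:
    """Even parity of every row and every column of a bit matrix."""
    row_par = [sum(row) % 2 for row in bits]
    col_par = [sum(col) % 2 for col in zip(*bits)]
    return row_par, col_par

def check_and_correct(received: Matrix, row_par: List[int],
                      col_par: List[int]) -> Tuple[str, Matrix]:
    """Compare recomputed parities with the transmitted ones.
    One failing row plus one failing column pinpoints a single flipped bit;
    any other mismatch pattern is reported as detected but not corrected."""
    fixed = [row[:] for row in received]
    new_rows, new_cols = parities(fixed)
    bad_rows = [i for i, (a, b) in enumerate(zip(row_par, new_rows)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(col_par, new_cols)) if a != b]
    if not bad_rows and not bad_cols:
        return "ok", fixed
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        fixed[bad_rows[0]][bad_cols[0]] ^= 1  # flip the located bit back
        return "corrected", fixed
    return "detected", fixed

data = [[1, 0, 1, 1],
        [0, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 0, 1]]
row_par, col_par = parities(data)

one_fault = [row[:] for row in data]
one_fault[2][1] ^= 1
status1, repaired = check_and_correct(one_fault, row_par, col_par)

two_faults = [row[:] for row in data]
two_faults[0][0] ^= 1
two_faults[0][3] ^= 1
status2, _ = check_and_correct(two_faults, row_par, col_par)
print(status1, status2)  # corrected detected
```

    With three or more faults, a plain product code like this one can mistake the pattern for a single fault and miscorrect, which is one motivation for stronger multi-defect detection such as the EPPC variants described in the abstract.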


  • K. N. Dang, A. B. Ahmed, A. Ben Abdallah and X. Tran, "TSV-OCT: A Scalable Online Multiple-TSV Defects Localization for Real-Time 3-D-IC Systems," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 3, pp. 672-685, 3/2020. doi: 10.1109/TVLSI.2019.2948878.
    In order to detect and localize through-silicon-via (TSV) failures in both manufacturing and operating phases, most of the existing methods use a dedicated testing mechanism with long response time and prerequisite interruptions for online testing. This article presents an error correction code (ECC)-based method named "TSV on-communication test" (TSV-OCT) to detect and localize faults without halting the operation of TSV-based 3-D-IC systems. We first propose a statistical detector, a method that detects open and short defects in TSVs and works in parallel with data transactions. Second, we propose an isolation-and-check algorithm to enhance the localization ability of the method. Monte Carlo simulations show that the proposed statistical detector doubles (×2) the number of detected faults compared to conventional ECC-based techniques. With the help of isolation and check, TSV-OCT raises the number of localized defects by up to ×4 and ×5. In addition, the response time is kept below 65000 cycles, so the method could be easily integrated into real-time applications. Finally, an implementation of TSV-OCT on a 3-D Network-on-Chip (NoC) router shows no performance degradation for testing while having a reasonable area overhead.



  • K. N. Dang, A. B. Ahmed, Y. Okuyama and A. Ben Abdallah, "Scalable Design Methodology and Online Algorithm for TSV-Cluster Defects Recovery in Highly Reliable 3D-NoC Systems," in IEEE Transactions on Emerging Topics in Computing, vol. 8, no. 3, pp. 577-590, 1 July-Sept. 2020, doi: 10.1109/TETC.2017.2762407.
    3D-Network-on-Chips exploit the benefits of Network-on-Chips and 3D-Integrated Circuits, allowing them to be considered as one of the most advanced and auspicious communication methodologies. On the other hand, the reliability of 3D-NoCs, due to the vulnerability of Through Silicon Vias, remains a major problem. Most of the existing techniques rely on correcting the TSV defects by using redundancies or employing routing algorithms. Nevertheless, they are not suitable for TSV-cluster defects, as they can either lead to costly area and power consumption overheads or result in non-minimal routing paths, thus posing serious threats to the system reliability and overall performance. In this work, we present a scalable and low-overhead TSV usage and design method for 3D-NoC systems where the TSVs of a router can be utilized by its neighbors to deal with cluster open defects. An adaptive online algorithm is also introduced to help the proposed system immediately work around newly detected defects without using redundancies. The experimental results show that the proposal ensures that less than 2 percent of the routers are disabled, even with 50 percent of the TSV clusters defective. The performance evaluations also demonstrate unchanged performance for real applications under 5 percent of cluster defects.

  • Khanh N. Dang, Akram Ben Ahmed, Xuan-Tu Tran, Yuichi Okuyama, Abderazek Ben Abdallah, "A Comprehensive Reliability Assessment of Fault-Resilient Network-on-Chip Using Analytical Model", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, Issue 11, pp. 3099-3112, 2017. DOI: 10.1109/TVLSI.2017.2736004
    Component failure in network-on-chips (NoCs) has been a critical factor in system reliability. To alleviate the impact of faults, fault tolerance has been investigated in recent years to enhance the robustness of NoCs. Due to the vast selection of fault-tolerance mechanisms and critical design constraints, selecting and configuring an appropriate mechanism to satisfy the fault-tolerance requirements constitutes a new challenge for designers. Consequently, reliability assessment has become prominent in the early stages of the manufacturing process to solve these problems. This paper approaches the fault-tolerance analysis by providing an analytical model to approximate the lifetime reliability and compares it with a system-level simulation. Based on the proposed approach, we measure the fault-tolerance efficiency using a new parameter, named reliability acceleration factor. The goal of this paper is to provide an efficient and accurate reliability assessment to help designers easily understand and evaluate the advantages and drawbacks of their potential fault-tolerance methods.


  • Achraf Ben Ahmed, Tsutomu Yoshinaga, Abderazek Ben Abdallah, "Scalable Photonic Networks-on-Chip Architecture Based on a Novel Wavelength-Shifting Mechanism", IEEE Transactions on Emerging Topics in Computing, 2017. DOI: 10.1109/TETC.2017.2737016
    Since Photonic Networks-on-Chip (PNoCs) were proposed, there has been unanimity about the benefits that photonic links could bring to on-chip interconnection. However, there is an ongoing debate regarding the suitable architecture and routing scheme to be used. This debate concerns the use of a fully photonic PNoC or an electro-assisted one. Both schemes have their pros and cons, but the main drawback of both architectures is their scalability. We propose in this paper an alternative to these two conventional PNoC architectures. Our proposed system is based on a novel Wavelength-Shifting mechanism, which combines the benefits of the previously mentioned schemes while limiting their drawbacks. The proposed system was validated by an analytical model, in addition to a set of simulations using synthetic and realistic traffic patterns. Evaluation results show that, compared to the electro-assisted architectures, we could enhance the latency, power, and bandwidth by an order of magnitude, reaching a performance similar to the fully photonic architecture. In addition, the number of photonic devices used is still much lower than in conventional fully photonic architectures, by an average of 60 percent. Furthermore, the new wavelength-shifting mechanism is highly scalable and is no longer affected by the communication distance or the traffic pattern, which makes it a promising solution to replace existing conventional architectures.

Visualization of Si-Photonic 3D-NoC Routing on 16x16x5/(4+1) and 8x8x5/(4+1) Chips


  • Akram Ben Ahmed, Abderazek Ben Abdallah, "Adaptive Fault-Tolerant Architecture and Routing Algorithm for Reliable Many-Core 3D-NoC Systems", Journal of Parallel and Distributed Computing, Volumes 93–94, July 2016, Pages 30-43, ISSN 0743-7315, doi:10.1016/j.jpdc.2016.03.014
    During the last few decades, Three-dimensional Network-on-Chips (3D-NoCs) have been showing their advantages over 2D-NoC architectures, thanks to the reduced average interconnect length and lower interconnect-power consumption inherited from Three-dimensional Integrated Circuits (3D-ICs). On the other hand, questions about their reliability are starting to arise. This issue is mainly caused by their complex nature, where a single faulty transistor may cause intolerable performance degradation or even the collapse of the entire system. To ensure correct functionality, 3D-NoC systems must tolerate any short-term malfunction or permanent physical damage, delivering messages on time while minimizing performance degradation. In this paper, we present a fault-tolerant 3D-NoC architecture, called 3D-Fault-Tolerant-OASIS (3D-FTO). With the aid of a light-weight routing algorithm, 3D-FTO manages to avoid system failure in the presence of a large number of transient, intermittent, and permanent faults. Moreover, the proposed architecture leverages reconfigurable components to handle faults in links, input-buffers, and the crossbar, where faults occur most often. The proposed 3D-FTO system is able to work around different kinds of faults, ensuring graceful performance degradation while minimizing additional hardware complexity and remaining power-efficient. (This project is partially supported by Competitive research funding, Ref. P1-5, Fukushima, Japan.)
    Adaptive fault-tolerant 3D-Network-on-Chip system architecture. RAB mechanism for deadlock recovery and fault-tolerance in input-buffers. Traffic-Prediction-Unit technique for congestion relief. Bypass-Link-on-Demand to tackle fault-occurrence in the Crossbar. Fault-tolerance and graceful performance degradation obtained at high fault-rates.

  • Akram Ben Ahmed, A. Ben Abdallah, "Graceful Deadlock-Free Fault-Tolerant Routing Algorithm for 3D Network-on-Chip Architectures", Journal of Parallel and Distributed Computing, 74/4 (2014), pp. 2229-2240.
    Three-Dimensional Networks-on-Chip (3D-NoC) have been presented as an auspicious solution merging the high parallelism of the Network-on-Chip (NoC) interconnect paradigm with the high performance and lower interconnect power of 3-dimensional integrated circuits. However, 3D-NoC systems are exposed to a variety of manufacturing and design factors, making them vulnerable to different faults that cause corrupted message transfer or even catastrophic system failures. Therefore, a 3D-NoC system should be fault-tolerant to transient malfunctions or permanent physical damage. In this work, we present an efficient fault-tolerant routing algorithm, called Hybrid-Look-Ahead-Fault-Tolerant (HLAFT), which takes advantage of both local and look-ahead routing to boost the performance of 3D-NoC systems while ensuring fault-tolerance. A deadlock-recovery technique associated with HLAFT, named Random-Access-Buffer (RAB), is also presented. RAB takes advantage of look-ahead routing to detect and remove deadlock without considerable additional hardware complexity. We implemented the proposed algorithm and deadlock-recovery technique on a real 3D-NoC architecture (3D-OASIS-NoC) and prototyped it on FPGA. Evaluation results show that the proposed algorithm performs better than XYZ, even when considering high fault-rates (i.e., 20%), and outperforms our previously designed Look-Ahead-Fault-Tolerant routing (LAFT), with a latency/flit reduction of up to 12.5% and a throughput enhancement of up to 11.8%, in addition to a 7.2% dynamic-power saving thanks to the Power-management module integrated with HLAFT.
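
    For readers unfamiliar with the XYZ baseline named in this abstract, the sketch below shows dimension-order (XYZ) routing on a 3D mesh, plus a simple variant that picks another minimal direction when the preferred link is faulty. This is not HLAFT: it has no look-ahead and no deadlock handling (which the paper addresses with RAB), and all names (productive_hops, fault_aware_next_hop, route) and the fault set are illustrative assumptions.

```python
from typing import FrozenSet, List, Optional, Tuple

Node = Tuple[int, int, int]
Link = Tuple[Node, Node]

def productive_hops(cur: Node, dst: Node) -> List[Node]:
    """Neighbors that move one step closer to dst, listed in X, Y, Z order."""
    hops = []
    for axis in range(3):
        if cur[axis] != dst[axis]:
            step = 1 if dst[axis] > cur[axis] else -1
            nxt = list(cur)
            nxt[axis] += step
            hops.append(tuple(nxt))
    return hops

def fault_aware_next_hop(cur: Node, dst: Node,
                         faulty_links: FrozenSet[Link]) -> Optional[Node]:
    """First minimal hop whose outgoing link is healthy.
    With no faults this reduces to plain XYZ dimension-order routing."""
    for nxt in productive_hops(cur, dst):
        if (cur, nxt) not in faulty_links:
            return nxt
    return None

def route(src: Node, dst: Node,
          faulty_links: FrozenSet[Link] = frozenset()) -> List[Node]:
    """Hop-by-hop path; raises if every minimal direction is blocked."""
    path = [src]
    while path[-1] != dst:
        nxt = fault_aware_next_hop(path[-1], dst, faulty_links)
        if nxt is None:
            raise RuntimeError(f"no healthy minimal hop from {path[-1]}")
        path.append(nxt)
    return path

xyz_path = route((0, 0, 0), (2, 1, 1))
detour = route((0, 0, 0), (2, 1, 1),
               frozenset({((0, 0, 0), (1, 0, 0))}))
print(xyz_path)
print(detour)
```

    Because every hop stays on a minimal path, the detour has the same length as the fault-free route; a routing scheme that also permits non-minimal detours would be needed once all minimal directions are blocked.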