VLSI Projects

VLSI Projects: Very-large-scale integration (VLSI) is the process of creating an integrated circuit (IC) by combining thousands of transistors, and in modern devices millions or billions, on a single chip. We offer VLSI projects that can be applied to real-time systems, with an emphasis on optimizing processors to improve overall system efficiency.

Quality Factor

  • 100% Assured Results
  • Best Project Explanation
  • Extensive references
  • Cost optimized
  • Control panel access


1. VLSI Architecture for Novel Hopping Discrete Fourier Transform Computation
The hopping discrete Fourier transform (HDFT) is a new method for time-frequency spectral analysis of time-varying signals. In the implementation of HDFT algorithms, the updating vector transform (UVT) plays a key role, and therefore a novel recursive DFT-based UVT formula is introduced in the proposed design for an HDFT algorithm and its architecture. The advantages can be summarized as follows: 1) the proposed algorithm reduces the number of multiplications, additions, and coefficients by 42.3%, 33.3%, and 50%, respectively, compared with Park’s method under the settings of an M-sample complex input sequence (M = 256) and an N-point recursive DFT computation scheme (N = 64) for time hop L (L = 4); 2) by adopting the hardware-sharing scheme and the register-shifting concept, the proposed design takes only nine multipliers and 12 adders for realization. The proposed hardware accelerator can be implemented using a field-programmable gate array, which can operate at a 48.33 MHz clock rate. The resource utilization of combinational logic lookup tables (LUTs) and digital signal processing (DSP) blocks is reduced by 11.7% and 42.5%, respectively, compared with Juang et al.’s work. For very-large-scale integration realizations, the proposed design would be more powerful than other existing algorithms in future applications focusing on DSP, filtering, and communications.
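
To make the idea of a "hop" concrete, the following is a minimal reference sketch in Python. It is not the paper's UVT formula: it simply applies the textbook one-sample sliding-DFT update L times per hop, so it shows what an HDFT computes rather than how the proposed design saves multiplications. The function names `dft` and `hopping_dft` are hypothetical.

```python
import cmath

def dft(window):
    """Direct N-point DFT, used to start the recursion and as a reference."""
    n = len(window)
    return [sum(window[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def hopping_dft(x, n, hop):
    """Spectra of the windows x[s:s+n] for s = 0, hop, 2*hop, ...

    Each hop applies the one-sample sliding-DFT update `hop` times.
    This is a baseline illustration, NOT the paper's UVT recursion.
    """
    twiddle = [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]
    spectrum = dft(x[:n])
    spectra = [list(spectrum)]
    start = 0
    while start + hop + n <= len(x):
        for i in range(hop):
            delta = x[start + i + n] - x[start + i]   # newest minus oldest sample
            spectrum = [(spectrum[k] + delta) * twiddle[k] for k in range(n)]
        start += hop
        spectra.append(list(spectrum))
    return spectra
```

With the abstract's settings, `hopping_dft(x, 64, 4)` on a 256-sample sequence would return one 64-bin spectrum for every fourth sample position.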

2. VLSI design of low-cost and high-precision fixed-point reconfigurable FFT processors
Fast Fourier transform (FFT) plays an important role in digital signal processing systems. In this study, the authors explore the very large-scale integration (VLSI) design of a high-precision fixed-point reconfigurable FFT processor. To achieve high accuracy under a limited wordlength, this study analyses the quantisation noise in FFT computation and proposes the mixed use of multiple scaling approaches to compensate for the noise. In addition, a statistics-based optimisation scheme is proposed to configure the scaling operations of the cascaded arithmetic blocks at each stage, yielding the best accuracy for a given FFT length. On the basis of this approach, they further present a VLSI implementation of an area-efficient and high-precision FFT processor, which can perform power-of-two FFTs from 32 to 8192 points. Using the SMIC 0.13 μm process, the area of the proposed FFT processor is 27 mm² with a maximum operating frequency of 400 MHz. When the FFT processor is configured to perform an 8192-point FFT at 40 MHz, the signal-to-quantisation-noise ratio is up to 53.28 dB and the power consumption measured by post-layout simulation is 35.7 mW.
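
The signal-to-quantisation-noise ratio quoted above can be illustrated with a small Python sketch. It only shows how such a figure is defined and how it grows with wordlength for a plain rounding quantiser; it does not model the paper's per-stage scaling scheme, and the test signal and the helper names `quantize` and `sqnr_db` are assumptions made for illustration.

```python
import math

def quantize(value, frac_bits):
    """Round to a fixed-point grid with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale

def sqnr_db(reference, quantized):
    """Signal-to-quantisation-noise ratio in dB."""
    signal_power = sum(abs(r) ** 2 for r in reference)
    noise_power = sum(abs(r - q) ** 2 for r, q in zip(reference, quantized))
    return 10.0 * math.log10(signal_power / noise_power)

# Hypothetical test signal: a sine wave that stays inside (-1, 1).
samples = [0.9 * math.sin(2 * math.pi * 5 * n / 64 + 0.1) for n in range(64)]
for bits in (8, 12, 16):
    approx = [quantize(s, bits) for s in samples]
    print(bits, round(sqnr_db(samples, approx), 1))
```

Each extra fractional bit is expected to add roughly 6 dB, which is why the choice of scaling at every FFT stage matters when the wordlength is fixed.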

3. Spiking Neural Classifier with Lumped Dendritic Nonlinearity and Binary Synapses: A Current Mode VLSI Implementation and Analysis
We present a neuromorphic current mode implementation of a spiking neural classifier with lumped square law dendritic nonlinearity. It has been shown previously in software simulations that such a system with binary synapses can be trained with structural plasticity algorithms to achieve comparable classification accuracy with fewer synaptic resources than conventional algorithms. We show that even in real analog systems with manufacturing imperfections (CV of 23.5% and 14.4% for dendritic branch gains and leaks respectively), this network is able to produce comparable results with fewer synaptic resources. The chip fabricated in m complementary metal oxide semiconductor has eight dendrites per cell and uses two opposing cells per class to cancel common-mode inputs. The chip can operate down to a V and dissipates 19 nW of static power per neuronal cell and 125 pJ/spike. For two-class classification problems of high-dimensional rate encoded binary patterns, the hardware achieves performance comparable to a software implementation of the same network, with only about a 0.5% reduction in accuracy. On two UCI data sets, the integrated circuit (IC) has classification accuracy comparable to standard machine learners like support vector machines and extreme learning machines while using two to five times fewer binary synapses. We also show that the system can operate on mean rate encoded spike patterns, as well as short bursts of spikes. To the best of our knowledge, this is the first attempt in hardware to perform classification exploiting dendritic properties and binary synapses.

4. Accurate performance evaluation of VLSI designs with selected CMOS process parameters
As process monitors have become vital components in modern very-large-scale integration (VLSI) designs, performance targets often determine the physical implementation of such monitors. However, as various process and environmental parameters collectively affect circuit behaviour, the design of process monitors can be difficult. In addition, process parameters from device-level models may not provide sufficient resolution in circuit-level performance. Therefore, the authors propose an intelligent novel flow for selecting dominant process parameters for evaluating performance targets such as timing and leakage. The proposed flow is applied to ISCAS'85, ISCAS'89, and IWLS'05 benchmark circuits and selects the dominant parameters in 32 and 45 nm complementary metal-oxide-semiconductor (CMOS) technologies. Through this flow, the authors identify the supply voltage, temperature, gate-oxide thickness, and effective gate length as the four dominant factors for timing and leakage. Experimental results show that the suggested process parameters achieve high evaluation accuracy (<3% errors in timing and <1% errors in leakage on average) in the benchmark circuits. Therefore, the proposed flow can select dominant parameters for performance targets, and the four determined factors can be used to accurately evaluate timing and leakage in 32 and 45 nm CMOS technologies.

5. A VLSI On-Chip Analog High-Order Low-Pass Filter Performance Evaluation Strategy
This paper presents a strategy for evaluation of an analog high-order low-pass filter. In the proposed strategy, the evaluation procedure is divided into three modes, first, to estimate the passband characteristic, second, to estimate the stopband characteristic, and finally, to evaluate the performance of the analog filter by the analysis of the relationship between the attenuations at the passband and stopband. The proposed strategy is investigated to quantify the tradeoffs between the evaluation accuracy and hardware cost, estimate the probability of having a specification-passed device, and estimate the evaluation time. The evaluated result is related to a “ratio” rather than a “specific value.” This is an interesting and advantageous feature of the proposed strategy. Hardware reusability makes the proposed strategy practicable and possible. The experimental verification of the proposed strategy demonstrates the functionality, effectiveness, and feasibility of the proposed strategy.

6. A New Paradigm in High-Speed and High-Efficiency Silicon Photodiodes for Communication—Part II: Device and VLSI Integration Challenges for Low-Dimensional Structures
The ability to monolithically integrate high-speed photodetectors (PDs) with silicon (Si) can contribute to drastic reduction in cost. Such PDs are envisioned to be integral parts of high-speed optical interconnects in the future intrachip, chip-to-chip, board-to-board, rack-to-rack, and intra-/interdata center links. Si-based PDs are of special interest since they present the potential for monolithic integration with CMOS and BiCMOS very-large-scale integration and ultralarge-scale integration electronics. In the second part of this review, we present the efforts pursued by the researchers in engineering and integrating Si, SiGe alloys, and Ge PDs to CMOS and BiCMOS electronics and compare the performance of recently demonstrated CMOS-compatible ultrafast surface-illuminated Si PD with absorption-enhancing low-dimensional structures. We discuss the advantages and challenges of device design with micro-/nanostructures, and finally, we conclude with the future directions that low-dimensional structures can offer to potentially cause a paradigm shift in high-performance PD design for various applications such as extended-reach links, single-photon detection, light detection and ranging, and high-performance computing.

7. VLSI Design of an ML-Based Power-Efficient Motion Estimation Controller for Intelligent Mobile Systems
In this paper, a machine learning (ML)-based power-efficient motion estimation (ME) controller algorithm and VLSI architecture incorporating coding bandwidth and rate-distortion (R-D) cost using convex optimization are proposed to realize a smart and bandwidth-efficient ME design for intelligent mobile systems. To adapt to time-varying coding bandwidth under the intelligent power-management techniques of modern application processor systems, we first propose an ML-based bandwidth-on-demand ME controller algorithm based on the convex optimization method to resolve the lack of awareness of coding bandwidth in prior ME designs. Then, a hardware-friendly and power-efficient VLSI architecture is developed to implement an intelligent, high-performance, and low-power ME controller design that can be combined with prior ME designs to satisfy the bandwidth-efficient ME design target under bandwidth constraints. The final implementation results show that the proposed smart ME controller architecture using our proposed bandwidth control scheme requires 0.816K gates, consumes 0.873 mW of power at a working frequency of 1.1 GHz with Taiwan Semiconductor Manufacturing Company (TSMC) 90-nm CMOS technology, and achieves an average bandwidth reduction of 56.08% compared with previous non-bandwidth-on-demand ME designs for high-definition (HD) videos.

8. VLSI Design of SVM-Based Seizure Detection System with On-Chip Learning Capability
A portable automatic seizure detection system is convenient for epilepsy patients to carry. In order to make the system on-chip trainable with high efficiency and attain high detection accuracy, this paper presents a very large scale integration (VLSI) design based on the nonlinear support vector machine (SVM). The proposed design mainly consists of a feature extraction (FE) module and an SVM module. The FE module performs the three-level Daubechies discrete wavelet transform to fit the physiological bands of the electroencephalogram (EEG) signal and extracts the time-frequency domain features reflecting the nonstationary signal properties. The SVM module integrates the modified sequential minimal optimization algorithm with the table-driven-based Gaussian kernel to enable efficient on-chip learning. The presented design is verified on an Altera Cyclone II field-programmable gate array and tested using two publicly available EEG datasets. Experimental results show that the designed VLSI system improves the detection accuracy and training efficiency.
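
A table-driven Gaussian kernel can be sketched as follows. This is only an illustration of the general idea of replacing the exponential with a lookup: the table size, the argument range, and the nearest-entry lookup are assumptions, and the abstract does not describe how the actual table is organized. The names `build_exp_table` and `gaussian_kernel_lut` are hypothetical.

```python
import math

def build_exp_table(max_arg=8.0, entries=256):
    """Precompute exp(-t) on a uniform grid over [0, max_arg]."""
    step = max_arg / (entries - 1)
    return step, [math.exp(-i * step) for i in range(entries)]

def gaussian_kernel_lut(x, y, gamma, step, table):
    """Approximate exp(-gamma * ||x - y||^2) by nearest-entry table lookup."""
    arg = gamma * sum((a - b) ** 2 for a, b in zip(x, y))
    index = min(int(round(arg / step)), len(table) - 1)   # saturate for large arguments
    return table[index]

step, table = build_exp_table()
approx = gaussian_kernel_lut([1.0, 2.0], [1.5, 1.0], 0.5, step, table)
exact = math.exp(-0.5 * ((1.0 - 1.5) ** 2 + (2.0 - 1.0) ** 2))
print(approx, exact)
```

In hardware the table would typically live in a small ROM, trading a bounded approximation error for the removal of an exponential unit.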

9. VLSI Designs for Joint Channel Estimation and Data Detection in Large SIMO Wireless Systems
Channel estimation errors have a critical impact on the reliability of wireless communication systems. While virtually all existing wireless receivers separate channel estimation from data detection, it is well known that joint channel estimation and data detection (JED) significantly outperforms conventional methods at the cost of high computational complexity. In this paper, we propose a novel JED algorithm and corresponding VLSI designs for large single-input multiple-output (SIMO) wireless systems that use constant-modulus constellations. The proposed algorithm is referred to as PRojection Onto conveX hull (PrOX) and relies on biconvex relaxation (BCR), which enables us to efficiently compute an approximate solution of the maximum-likelihood JED problem. Since BCR solves a biconvex problem via alternating optimization, we provide a theoretical convergence analysis for PrOX. We design a scalable, high-throughput VLSI architecture that uses a linear array of processing elements to minimize hardware complexity. We develop corresponding field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) designs, and we demonstrate that PrOX significantly outperforms the only other existing JED design in terms of throughput, hardware-efficiency, and energy-efficiency.

10. VLSI Design and Implementation of Reconfigurable 46-Mode Combined-Radix-Based FFT Hardware Architecture for 3GPP-LTE Applications
This paper presents a reconfigurable fast Fourier transform (FFT) hardware architecture, supporting 46 different FFT sizes defined in 3GPP-LTE applications. Our proposed design concept is mainly based on combined radix-5, radix-3², and radix-2⁴ single-path delay feedback FFT design approaches. In addition, to refine our hardware design, we develop three design techniques, namely a reconfigurable processing kernel with seven types (RPK-ST), an efficient FIFO management scheme, and a single-table approximation method. In an ASIC implementation with TSMC 40-nm CMOS technology, our 46-mode reconfigurable FFT chip occupies a core area of only 0.36 mm², dissipates 48.46 mW, and operates at clock frequencies up to 500 MHz. Compared with other state-of-the-art works, our work delivers high-quality design results in terms of area- and energy-related performance indexes, providing a constructive FFT design prototype for 3GPP-LTE systems.

11. A Compact VLSI System for Bio-Inspired Visual Motion Estimation
This paper proposes a bio-inspired visual motion estimation algorithm based on motion energy, along with its compact very-large-scale integration (VLSI) architecture using low-cost embedded systems. The algorithm mimics motion perception functions of retina, V1, and MT neurons in a primate visual system. It involves operations of ternary edge extraction, spatiotemporal filtering, motion energy extraction, and velocity integration. Moreover, we propose the concept of a confidence map to indicate the reliability of estimation results at each probing location. Our algorithm involves only additions and multiplications during runtime, which makes it suitable for low-cost hardware implementation. The proposed VLSI architecture employs multiple (frame, pixel, and operation) levels of pipeline and massively parallel processing arrays to boost the system performance. The array unit circuits are optimized to minimize hardware resource consumption. We have prototyped the proposed architecture on a low-cost field-programmable gate array platform (Zynq 7020) running at a 53-MHz clock frequency. It achieved 30-frame/s real-time performance for velocity estimation on 160 × 120 probing locations. A comprehensive evaluation experiment showed that the velocity estimated by our prototype has relatively small errors (average endpoint error < 0.5 pixel and angular error < 10°) for most motion cases.

12. A Variable-Clock-Cycle-Path VLSI Design of Binary Arithmetic Decoder for H.265/HEVC
The next-generation 8K ultra-high-definition video format involves an extremely high bit rate, which imposes a high throughput requirement on the entropy decoder component of a video decoder. Context adaptive binary arithmetic coding (CABAC) is the entropy coding tool in the latest video coding standards including H.265/High Efficiency Video Coding and H.264/Advanced Video Coding. Due to critical data dependencies at the algorithm level, a CABAC decoder is difficult to accelerate by simply leveraging parallelism and pipelining. This letter presents a new very-large-scale integration design of the arithmetic decoder, which is the most critical bottleneck in CABAC decoding. Our design features a variable-clock-cycle-path architecture that exploits the differences in critical path delay and in probability of occurrence between various types of binary symbols (bins). The proposed design also incorporates a novel data-forwarding technique (rLPS forwarding) and a fast path-selection technique (coarse bin type decision), and is enhanced with the capability of processing additional bypass bins. As a result, its maximum throughput reaches 1010 Mbins/s in 90-nm CMOS, when decoding 0.96 bin per clock cycle at a maximum clock rate of 1053 MHz, which outperforms previous works by 19.1%.
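
The quoted throughput follows directly from the two figures given in the abstract, as this short Python check shows. The variable names are illustrative only.

```python
bins_per_cycle = 0.96        # average bins decoded per clock cycle (from the abstract)
clock_mhz = 1053             # maximum clock rate in MHz (from the abstract)

# Throughput in Mbins/s is bins per cycle times cycles per microsecond.
throughput_mbins = bins_per_cycle * clock_mhz
print(f"{throughput_mbins:.2f} Mbins/s")   # close to the quoted 1010 Mbins/s
```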

13. VLSI Architecture Exploration of Guided Image Filtering for 1080P@60Hz Video Processing
Guided image filtering (GIF) is a promising edge-preserving filtering technique that has been applied in a variety of applications. Nevertheless, an efficient very-large-scale integration (VLSI) architecture design of GIF is still very challenging for the real-time processing of full-high-definition videos. Previously proposed architectures are somewhat inefficient in terms of either on-chip memory usage or off-chip memory bandwidth. This paper aims to improve the balance between on-chip memory usage and off-chip memory bandwidth through architecture exploration. Three critical architectural tradeoffs in the VLSI design of GIF are explored, and two efficient VLSI architectures, namely sequential line-based and parallel line-based architectures, are proposed. Experimental results demonstrate that the proposed VLSI design consumes only 34.1K logic gates, 25.4 KB of on-chip memory, and 373 MB/s of off-chip memory bandwidth while achieving real-time video processing of 1080P@60Hz at a maximum clock frequency of 297 MHz. Moreover, the proposed VLSI circuits are fully pipelined and synchronized to the pixel clock of the output video, so they can be seamlessly integrated into diverse real-time video processing systems.
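
For readers unfamiliar with guided filtering, here is a minimal one-dimensional floating-point sketch in Python of the standard guided filter equations (box means, per-window linear coefficients, then averaged coefficients). It is a software illustration only: it says nothing about the line-based hardware architectures proposed in the paper, and the function names, the 1-D simplification, and the border handling are assumptions.

```python
def box_mean(values, radius):
    """Mean over a sliding window of half-width `radius`, clipped at the borders."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def guided_filter_1d(guide, src, radius, eps):
    """Guided filter of `src` steered by `guide` (1-D, floating point)."""
    mean_i = box_mean(guide, radius)
    mean_p = box_mean(src, radius)
    corr_ii = box_mean([g * g for g in guide], radius)
    corr_ip = box_mean([g * p for g, p in zip(guide, src)], radius)
    # Per-window linear model q = a * I + b; eps regularizes flat regions.
    a = [(cip - mi * mp) / (cii - mi * mi + eps)
         for cip, cii, mi, mp in zip(corr_ip, corr_ii, mean_i, mean_p)]
    b = [mp - ak * mi for mp, ak, mi in zip(mean_p, a, mean_i)]
    mean_a = box_mean(a, radius)
    mean_b = box_mean(b, radius)
    return [ma * g + mb for ma, g, mb in zip(mean_a, guide, mean_b)]
```

The many box means over intermediate signals are what create the on-chip buffering versus off-chip bandwidth tradeoff discussed in the abstract: each one needs access to a neighbourhood of previously computed values.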

14. A Neuromorphic VLSI Circuit for Spike-Based Random Sampling
This paper presents a novel neuromorphic circuit that produces a continuous stream of analog random samples. The circuit encodes these samples by the temporal difference between the onset times of two subsequent voltage jumps, which mimic action potentials of biological neurons. By elegantly combining concepts from renewal theory and analog very large scale integrated technology, the circuit is in principle able to sample from arbitrary distributions of positive, real random variables. Moreover, these distributions can be defined online by the circuit user in terms of an input current time series, without the need to reconfigure the circuit. We show results from this circuit fabricated in a CMOS 0.35-μm technology process. Random sampling is demonstrated for the uniform and exponential distributions and, by means of circuit simulation, also for a more complex bimodal distribution.

15. Thermal Management of Batteries Using Supercapacitor Hybrid Architecture with Idle Period Insertion Strategy
Thermal analysis and management of batteries have been an important research issue for battery-operated systems, such as electric vehicles and mobile devices. Nowadays, battery packs are designed with heat dissipation in mind, and external cooling devices, such as cooling fans, are also widely used to improve the reliability and extend the lifetime of a battery. However, this type of approach cannot achieve an immediate temperature drop to avoid a thermal emergency. Approaches based on removing the heat from the heat sources via idle period insertion (similar to what is done for silicon devices) would allow faster thermal response; however, it is not obvious how to implement these schemes in the context of batteries. In this paper, we propose the use of a simple parallel battery-supercapacitor hybrid architecture with a dual-mode discharge strategy that can provide immediate temperature management, in which the supercapacitor is used as an energy buffer during the idle periods of the battery. Simulation results show that the proposed method can reduce the battery temperature during charge and discharge while exploiting the advantage of the parallel connection.

16. Data Reuse Buffer Synthesis Using the Polyhedral Model
Current high-level synthesis (HLS) tools for the automatic design of computing hardware perform excellently for the synthesis of computation kernels, but they often do not optimize memory bandwidth. As accessing memory is a bottleneck in many algorithms, the performance of the generated circuit could benefit substantially from memory access optimization. In this paper, we present a method and a tool to automate the optimization of memory accesses to array data in HLS by introducing local memory tailored perfectly to store only the data that are used repeatedly. Our method detects data reuse in the source code of the algorithm to be implemented in hardware, selects and parameterizes data reuse buffers, and generates a register transfer level design of the data buffers and a matching loop controller that coordinates reuse buffers and datapath operations. Throughout this paper, the polyhedral representation is used extensively as it proves to be well suited for calculations on loop nests and data accesses. As a consequence, this paper is limited to affine programs which can be represented in this model. Experiments show that our method outperforms state-of-the-art academic and commercial HLS tools.
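
The benefit of a data reuse buffer can be illustrated with a small Python model of an affine loop, here a hypothetical three-point stencil. The naive version reads three array elements per output, while the buffered version keeps a three-entry shift register so that every array element is fetched only once. This is a hand-written illustration of the kind of structure such a tool would generate, not the tool's output or algorithm.

```python
def stencil_naive(x):
    """Three reads of x per output: x[i-1], x[i], x[i+1]."""
    return [x[i - 1] + x[i] + x[i + 1] for i in range(1, len(x) - 1)]

def stencil_with_reuse_buffer(x):
    """Same result, but each element of x is fetched from 'memory' exactly once."""
    reads = 0
    window = []                      # models a 3-entry shift register
    out = []
    for i in range(len(x)):
        window.append(x[i])          # the only access to the array
        reads += 1
        if len(window) > 3:
            window.pop(0)            # oldest element is no longer needed
        if len(window) == 3:
            out.append(sum(window))
    return out, reads
```

For an input of length n, the naive loop issues 3(n − 2) reads while the buffered loop issues n, which is the kind of memory-bandwidth saving the abstract refers to.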

17. High-Performance Architecture Using Fast Dynamic Reconfigurable Accelerators
System accelerators (ACCs) improve performance and break power and utilization walls. They can be implemented by fixed-function hard macros or reconfigurable logic such as field-programmable gate arrays (FPGAs). For systems running various applications, dynamic reconfigurable ACCs offer a very attractive feature; however, the reconfiguration time is an unavoidable overhead. This paper proposes a high-performance architecture with fast dynamic reconfigurable FPGA ACCs (F-RACCs) based on a novel bitstream reprogramming method, which is made feasible by emerging technologies. The architecture includes CPU cores, caches, memories, ACCs, and networks-on-chip. A portion of the computing tasks can be offloaded from CPUs to ACCs to improve performance. The ACCs can be reprogrammed rapidly to accommodate various functions required by a wide spectrum of applications. The performance is evaluated with the platform for ACC-rich architectural design and exploration, a gem5-based cycle-accurate full-system simulation platform. Eleven benchmark applications from different domains are evaluated. Compared with systems using conventional FPGA ACCs partially configured at the fastest configuration speed, this architecture improves system performance on all applications and achieves maximum speedups of 1.31× and 2.82× using 1 and 12 ACC instances, respectively. It achieves a maximum speedup of 94.93× with one and 565.12× with 12 F-RACCs over the CPU software path with no ACCs.

18. Dynamic Reconfiguration of Thermoelectric Generators for Vehicle Radiators Energy Harvesting Under Location-Dependent Temperature
Conventional internal combustion engine vehicles generally have a fuel efficiency below 30%, and most of the wasted energy is dissipated as heat. This excessive heat dissipation is a primary reason for poor fuel efficiency, but reclamation of the heat energy has not been a main focus of vehicle design. With thermoelectric generators (TEGs), wasted heat energy can be directly converted to electric energy. All heat exchangers, including vehicle radiators, gradually cool down the coolant or gas from the inlet to the outlet. TEG modules are commonly mounted throughout the heat exchanger to fulfill the required power density and voltage. Each TEG module has a different hot-side temperature depending on its mounting location (distance from the inlet) and thus a different maximum power point (MPP) voltage and current. Nevertheless, TEG modules are commonly connected in series and parallel, with both ends connected to a single power converter. As a result, the whole TEG module array exhibits a significant efficiency degradation even if the power converter has MPP tracking capability. Although material and device researchers have been putting a lot of effort into enhancing TEG efficiency, this system-level issue has not been deeply investigated. This paper proposes a cross-layer, system-level solution to enhance TEG array efficiency by introducing online reconfiguration of TEG modules. The proposed method is useful for any sort of TEG array used to reclaim wasted heat energy, because heat exchangers generally have different inlet and outlet temperature values.
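
The efficiency loss of a fixed series connection can be illustrated with a simple model in Python. The sketch assumes each TEG module behaves as a linear Thevenin source (open-circuit voltage plus internal resistance), which is a common simplification and not something stated in the abstract; the module values below are hypothetical.

```python
def mpp_power(v_oc, r_int):
    """Maximum power of a Thevenin source, reached at a terminal voltage of v_oc / 2."""
    return v_oc ** 2 / (4.0 * r_int)

def series_string_mpp(modules):
    """MPP of modules wired in series: open-circuit voltages and resistances add."""
    total_v = sum(v for v, _ in modules)
    total_r = sum(r for _, r in modules)
    return mpp_power(total_v, total_r)

# Hypothetical modules (v_oc in volts, r_int in ohms); v_oc falls from inlet to outlet.
modules = [(4.0, 2.0), (3.0, 2.0), (2.0, 2.0), (1.0, 2.0)]
ideal = sum(mpp_power(v, r) for v, r in modules)     # every module at its own MPP
series = series_string_mpp(modules)
print(ideal, series, series / ideal)
```

Because all modules in a series string share one current, modules with different hot-side temperatures cannot all sit at their own MPP, so the string delivers less than the sum of the individual maxima under this model. Regrouping modules with similar temperatures, as online reconfiguration allows, narrows that gap.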

19. Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA
As convolution accounts for most operations in a convolutional neural network (CNN), the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling, and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit data reuse or manage data movement efficiently. This paper overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., memory access) of the CNN accelerator based on multiple design variables. Then, we propose a specific dataflow of hardware CNN acceleration to minimize the data communication while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated by implementing end-to-end CNNs including NiN, VGG-16, and ResNet-50/ResNet-152 for inference. For the VGG-16 CNN, the overall throughput reaches 348 GOPS and 715 GOPS on Intel Stratix V and Arria 10 FPGAs, respectively.
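
The loop nest in question can be written out as a direct convolution in Python. The grouping into four levels in the comments (kernel window, input feature maps, spatial scan, output feature maps) is one common way to count them and is an assumption here, since the abstract does not enumerate the levels. Stride 1 and no padding are also assumed.

```python
def conv_layer(inputs, weights):
    """Direct convolution, stride 1, no padding.

    inputs[c][y][x]        : C input feature maps
    weights[m][c][ky][kx]  : M filters, each C x K x K
    """
    channels, height, width = len(inputs), len(inputs[0]), len(inputs[0][0])
    filters, k = len(weights), len(weights[0][0])
    out_h, out_w = height - k + 1, width - k + 1
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(filters)]
    for m in range(filters):                      # level 4: across output feature maps
        for oy in range(out_h):                   # level 3: scan within one feature map
            for ox in range(out_w):
                for c in range(channels):         # level 2: across input feature maps
                    for ky in range(k):           # level 1: MACs inside one kernel window
                        for kx in range(k):
                            out[m][oy][ox] += inputs[c][oy + ky][ox + kx] * weights[m][c][ky][kx]
    return out
```

Unrolling, tiling, or interchanging any of these levels changes how many multipliers run in parallel and how often inputs, weights, and partial sums must be moved, which is the design space the paper analyzes.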

20. On the Analysis and the Mitigation of Power Supply Noise and Power Distribution Network Impedance Variation for Scan-Based Delay Testing Techniques
In this paper, we analyze the impact of the power supply noise and the power distribution network (PDN) impedance variation on the timing margin in both the test and mission modes for ICs with multiple clock domains. We investigate the so-called intermodulation products (IMPs). We show that IMPs are mainly induced by the dependent nature of the transistors. We also provide experimental results showing that scan-based delay testing can be optimistic with respect to the mission mode for maximum achievable nominal frequency prediction, even at lower clock frequencies. We also show that IMPs can induce timing margin fluctuations larger than those induced by the voltage droop in the test mode. Using an improved HSpice simulation model of a PDN validated by experimental results, we also quantify the timing margin variation due to power noise in the test mode as a function of the clock frequency, including the so-called clock stretching phenomenon. Finally, we propose a robust test signal scheme for multiple clock domain chips. The simulation results reveal that this scheme is less sensitive to PDN impedance variation than the most popular existing test schemes, and that it provides timing margins closer to those obtained in the mission mode.

21. Low Overhead Warning Flip-Flop Based on Charge Sharing for Timing Slack Monitoring
Timing error predictors have a strong potential to reduce worst-case timing margins by monitoring the timing slack of a design. However, these timing error predictors incur a substantial amount of silicon area and power, which limits the overall benefits at the system level. This paper presents a low overhead warning flip-flop (FF), which predicts setup time violations. It consists of a delay buffer and a warning generator along with a conventional master-slave FF. A low overhead FF can be designed by exploiting the concept of charge sharing to implement the warning generator. As the warning generator requires only seven transistors to predict the timing violation, the proposed warning FF occupies 30% less area and consumes 27% less power compared to the state-of-the-art timing error predictors. A test chip is fabricated using the proposed FF in a 130-nm CMOS technology to verify the functionality of the proposed warning FF in dynamic voltage and frequency scaling applications. Measurement results from the test chip show that a performance improvement of 44% can be achieved at a supply voltage of 0.9 V by employing the proposed technique compared to the worst-case design. For a typical chip, the power consumption can be reduced by 36% compared to the worst-case design.

22. A Changing-Reference Parasitic-Matching Sensing Circuit for 3-D Vertical RRAM
The 3-D vertical array architecture is considered to be a promising technology for emerging nonvolatile memories. However, in 3-D vertical emerging nonvolatile memories, planar parasitic elements, vertical parasitic elements, and the sneak currents of the half-selected memory cells delay the read operation and cause read errors. This paper addresses the case of 3-D vertical resistive switching random access memory (RRAM) technology. A read scheme, the memory core design, and the read path are proposed or analyzed, and the factors that affect the read operation are summarized. A changing-reference parasitic-matching sensing circuit is then proposed. In the proposed circuit, the reference side and the read side share similar sneak currents and read paths. Simulated in a 40-nm CMOS process, the sensing time of a 128-Mb 3-D vertical RRAM is 8.54 ns compared to the conventional 34.26 ns. Monte Carlo simulations show a 130.21-ns worst sensing time compared to the conventional 401.95 ns. The read errors are reduced by 100% and 95.31% under the regular and the worst RRAM resistance, respectively.

26Toward Energy-Efficient Stochastic Circuits Using Parallel Sobol Sequences
Stochastic computing (SC) often requires long stochastic sequences and, thus, a long latency to achieve accurate computation. The long latency leads to an inferior performance and low energy efficiency compared with most conventional binary designs. In this paper, a type of low-discrepancy sequences, the Sobol sequence, is considered for use in SC. Compared to the use of pseudorandom sequences generated by linear feedback shift registers (LFSRs), the use of Sobol sequences improves the accuracy of stochastic computation with a reduced sequence length. The inherent feature in Sobol sequence generators enables the parallel implementation of random number generators with an improved performance and hardware efficiency. In particular, the underlying theory is formulated and circuit design is proposed for an arbitrary level of parallelization in a power of 2. In addition, different strategies are implemented for parallelizing combinational and sequential stochastic circuits. The hardware efficiency of the parallel stochastic circuits is measured by energy per operation (EPO), throughput per area (TPA), and runtime. At a similar accuracy, the 8× parallel stochastic circuits using Sobol sequences consume approximately 1% of the EPO of the conventional LFSR-based nonparallelized circuits. Meanwhile, an average of 70 (up to 89) times improvements in TPA and less than 1% runtime are achieved. A sorting network is implemented for a median filter (MF) as an application. For a similar image processing quality, a higher energy efficiency is obtained for an 8× parallelized stochastic MF compared with its binary counterpart.
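
The accuracy argument above can be illustrated with a minimal sketch of unipolar stochastic multiplication (an AND of two bitstreams). It is not the paper's generator: the low-discrepancy source here is a counter paired with its bit-reversal (the base-2 van der Corput sequence, which is the first Sobol dimension), used as a stand-in for a two-dimensional Sobol generator. The LFSR taps, seeds, and all function names are assumptions chosen for illustration.

```python
import itertools


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i (base-2 van der Corput order)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def counter_source(bits):
    """Plain binary counter: 0, 1, 2, ... wrapping at 2**bits."""
    mask = (1 << bits) - 1
    return (i & mask for i in itertools.count())


def van_der_corput_source(bits):
    """Bit-reversed counter (stand-in for a low-discrepancy dimension)."""
    mask = (1 << bits) - 1
    return (bit_reverse(i & mask, bits) for i in itertools.count())


def lfsr8_source(seed):
    """8-bit Fibonacci LFSR with taps 8, 6, 5, 4 (assumed tap choice)."""
    state = seed & 0xFF
    while True:
        yield state
        fb = ((state >> 7) ^ (state >> 5) ^ (state >> 4) ^ (state >> 3)) & 1
        state = ((state << 1) | fb) & 0xFF


def sc_multiply(a, b, src_a, src_b, bits=8):
    """Estimate a*b by AND-ing two comparator-generated bitstreams."""
    n = 1 << bits
    thr_a, thr_b = round(a * n), round(b * n)
    ones = 0
    for _ in range(n):
        ra, rb = next(src_a), next(src_b)  # advance both sources every cycle
        ones += (ra < thr_a) & (rb < thr_b)
    return ones / n


if __name__ == "__main__":
    a, b = 0.3, 0.7
    ld = sc_multiply(a, b, counter_source(8), van_der_corput_source(8))
    pr = sc_multiply(a, b, lfsr8_source(0xA5), lfsr8_source(0x3C))
    print("exact:", a * b, "low-discrepancy:", ld, "LFSR:", pr)
```

With 256-bit streams, the low-discrepancy pair reproduces 0.5 × 0.5 exactly, because the two sources cover the unit square evenly. The LFSR result depends on the seeds chosen.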

27Design and Analysis of Energy-Efficient and Reliable 3-D ReRAM Cross-Point Array System
In this paper, we study the energy, performance, and reliability of 3-D horizontal 1-selector-1-resistor (1S1R) cross-point resistive random access memory (ReRAM) systems. We present access schemes which activate multiple subarrays with multiple layers in a subarray to achieve high energy efficiency through activating fewer subarrays and good reliability through innovative data organization. We propose two low-cost access schemes [namely, multilayer access scheme (MAS)-I and MAS-II] which enable multilayer programming but differ in the number of activated layers (NL) and hence differ in energy efficiency. To improve reliability, we propose to distribute data across subarrays as well as along the layers of a subarray such that the error characteristics of all accessed data lines are the same. At the system level, we use Bose-Chaudhuri-Hocquenghem (BCH) codes with different strengths so that all competing systems have the same reliability. We show that for a 1-GB 3-D horizontal 1S1R ReRAM system with an I/O width of 64 bits, the NB = 16, NL = 4 system based on MAS-I that utilizes BCH t = 6 code consumes the lowest energy with 33% lower energy consumption compared to the baseline system where only one layer is activated at a time.

28Secure Double Rate Registers as an RTL Countermeasure against Power Analysis Attacks
Power analysis attacks (PAAs), a class of side-channel attacks based on power consumption measurements, are a major concern in the protection of secret data stored in cryptographic devices. In this paper, we introduce the secure double rate registers (SDRRs) as a register-transfer level (RTL) countermeasure to increase the security of cryptographic devices against PAAs. We exploit the SDRR in a conventional advanced encryption standard (AES)-128 architecture, improving the immunity of the cryptographic hardware to the state-of-the-art PAAs. In the AES-128 exploiting SDRR, the combinational path evaluates random data throughout the entire clock cycle, and the interleaved processing of random and real data ensures the protection of both combinational and sequential logics. Our technique does not require the duplication of the combinational path to process the random data, thus limiting area overhead, unlike previous RTL countermeasures. The proposed approach is validated by means of PAAs based on real measurements on a field-programmable gate array implementation and on a 65-nm CMOS prototype chip. The protected implementation shows a strongly reduced correlation coefficient for the correct key, and more than three orders of magnitude increase in the measurements to disclosure with respect to the unprotected AES-128.

29A Flexible and Energy-Efficient Convolutional Neural Network Acceleration with Dedicated ISA and Accelerator
State-of-the-art convolutional neural networks (CNNs) usually have a large number of layers and filter weights which bring huge computation and communication overheads. A general purpose instruction set architecture (ISA) is flexible but has low code density and high power consumption. The existing CNN-specific accelerators are much more efficient but usually are inflexible or require a complex controller to handle the computation and data transfer of different CNNs. In this brief, we propose a new CNN-specific ISA which embeds the parallel computation and data reuse parameters in the instructions. An instruction generator deploys the instruction parameters according to the feature of CNNs and hardware’s computation and storage resources. In addition, a reconfigurable accelerator with 225 multipliers and 24 adder trees is realized to obtain efficient parallel computation and data transfer. Compared with x86 processors, our design has 392 times better energy efficiency and 16 times higher code density. Compared with other state-of-the-art accelerators, our solution has a higher flexibility to support all popular CNNs and a higher energy efficiency.

30Low-Complexity VLSI Design of Large Integer Multipliers for Fully Homomorphic Encryption
Large integer multiplication has been widely used in fully homomorphic encryption (FHE). Implementing feasible large integer multiplication hardware is thus critical for accelerating the FHE evaluation process. In this paper, a novel and efficient operand reduction scheme is proposed to reduce the area requirement of radix-r butterfly units. We also extend the single-port, merged-bank memory structure to the design of number theoretic transform (NTT) and inverse NTT (INTT) for further area minimization. In addition, an efficient memory addressing scheme is developed to support both NTT/INTT and resolving carries computations. Experimental results reveal that significant area reductions can be achieved for the targeted 786,432- and 1,179,648-bit NTT-based multipliers designed using the proposed schemes in comparison with the related works. Moreover, the two multiplications can be accomplished in 0.196 and 2.21 ms, respectively, based on 90-nm CMOS technology. The low-complexity feature of the proposed large integer multiplier designs is thus obtained without sacrificing the time performance.
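
As a rough software illustration of the NTT, pointwise product, INTT, and carry-resolution flow mentioned above, the sketch below multiplies two non-negative integers with a textbook recursive radix-2 NTT. It does not model the paper's operand reduction scheme or memory structure. The modulus 998244353 with primitive root 3, the 8-bit digit size, and all function names are assumptions chosen for illustration.

```python
P = 998244353  # NTT-friendly prime, 119 * 2**23 + 1 (assumed for this sketch)
G = 3          # a primitive root modulo P


def ntt(a, invert=False):
    """Recursive radix-2 number theoretic transform (length must be a power of 2)."""
    n = len(a)
    if n == 1:
        return a[:]
    w_n = pow(G, (P - 1) // n, P)      # primitive n-th root of unity mod P
    if invert:
        w_n = pow(w_n, P - 2, P)       # modular inverse of the root
    even = ntt(a[0::2], invert)
    odd = ntt(a[1::2], invert)
    out = [0] * n
    w = 1
    for k in range(n // 2):            # radix-2 butterfly
        t = w * odd[k] % P
        out[k] = (even[k] + t) % P
        out[k + n // 2] = (even[k] - t) % P
        w = w * w_n % P
    return out


def to_digits(v, digit_bits):
    """Split a non-negative integer into little-endian digits."""
    mask = (1 << digit_bits) - 1
    digits = []
    while v:
        digits.append(v & mask)
        v >>= digit_bits
    return digits or [0]


def ntt_multiply(x, y, digit_bits=8):
    """Multiply non-negative integers via NTT, pointwise product, INTT, carries.

    With 8-bit digits each convolution term stays below P only while the
    shorter operand has fewer than about 15,000 digits.
    """
    dx, dy = to_digits(x, digit_bits), to_digits(y, digit_bits)
    n = 1
    while n < len(dx) + len(dy):
        n *= 2
    fa = ntt(dx + [0] * (n - len(dx)))
    fb = ntt(dy + [0] * (n - len(dy)))
    conv = ntt([u * v % P for u, v in zip(fa, fb)], invert=True)
    n_inv = pow(n, P - 2, P)           # scale factor for the inverse transform
    mask = (1 << digit_bits) - 1
    result, carry = 0, 0
    for i, c in enumerate(conv):       # resolve carries digit by digit
        total = c * n_inv % P + carry
        result |= (total & mask) << (digit_bits * i)
        carry = total >> digit_bits
    return result + (carry << (digit_bits * n))
```

For example, `ntt_multiply(123456789, 987654321)` should agree with Python's built-in product of the same operands.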

31Search for Editor-In-Chief of the IEEE Transactions on Very Large Scale Integration (VLSI) Systems

32Reducing Rollback Cost in VLSI Circuits to Improve Fault Tolerance
In nanometer technologies, circuits are more and more sensitive to various kinds of perturbations. Alpha particles and atmospheric neutrons induce single-event upsets, affecting memory cells, latches, and flip-flops. They also induce single-event transients, initiated in the combinational logic and captured by the latches and flip-flops associated with the outputs of this logic. In the past, the major efforts were focused on memories. However, as the whole situation is getting worse, solutions that protect the entire design are mandatory. Solutions for detecting the error in logic functions already exist, but there are only a few solutions allowing the correction, leading to a lot of hardware overhead in nonprocessor design. In this paper, we present a novel technique that includes several hardware architectures and an algorithm for their implementations, which reduces the cost of rollback in any kind of circuit.

33Thermal management of batteries using a hybrid supercapacitor architecture
Thermal analysis and management of batteries have been an important research issue for battery-operated systems such as electric vehicles and mobile devices. Nowadays, battery packs are designed considering heat dissipation, and external cooling devices such as a cooling fan are also widely used to enforce the reliability and extend the lifetime of a battery. Approaches of this type, which target the enhancement of the cooling efficiency via the reduction of the thermal resistance, cannot achieve an immediate temperature drop to avoid a thermal emergency situation. Approaches based on removing the heat from the heat sources via idle period insertion (similar to what is done for silicon devices) would allow faster thermal response; however, it is not obvious how to implement these schemes in the context of batteries. In this paper, we propose the use of a simple parallel battery-supercapacitor hybrid architecture with a dual-mode discharging strategy that can provide immediate temperature management, in which the supercapacitor is used as an energy buffer during the idle periods of the battery. Simulation results show that the proposed method can keep the battery temperature within the safe range without external cooling devices while exploiting the advantage of the battery-supercapacitor parallel connection.

34 A 900 MHz, 3.5mW 8 bit pipelined sub ranging ADC combining flash ADC and TDC
The time-based ADC (TB-ADC) is gaining attention as a key technology to solve challenges concerning CMOS scaling; however, it cannot achieve high speeds and a high resolution at the same time. We proposed a TB-ADC architecture combining a flash ADC and TDC to achieve both high speeds and a high resolution. The flash ADC and the ADC using a VTC and TDC are pipelined to enhance the conversion speed. The charge steering amplifier is used for the low-power residue transfer. Moreover, the Vernier TDC using the dynamic delayer enables low-power operation. An 8-bit ADC test chip fabricated using 65-nm CMOS technology demonstrated a high sampling frequency of 900 MHz and a low-power consumption of 3.5 mW. The FOM was 32 fJ/conv.-step.

35 A high speed 2 bit/cycle SAR ADC with time domain quantization
This brief presents a 2-bit/cycle successive approximation register (SAR) analog-to-digital converter (ADC) with time-domain quantization, which only needs one capacitive digital-to-analog converter (DAC) array. A duplicated dynamic comparator is adopted to generate the time references. To quantize the time value, a dynamic latch-based high precision time-domain comparator is proposed. Moreover, a redundancy technique is utilized to overcome the effect of nonideal factors, such as incomplete DAC settling, reference scale mismatch, and offset of comparators. A design example of 9-bit 700 MS/s SAR ADC in 65-nm CMOS technology is presented. Simulation results show that with a differential 600-mVp-p input, the spurious free dynamic range at Nyquist input is above 65 dB. The simulated effective number of bits is up to 8.3 bits at 10-MHz input with the presence of noise and mismatches calibration.

36 A Linearity-Improved 8-bit 320-MS/s SAR ADC with Metastability Immunity Technique
This paper presents an 8-bit 320-MS/s successive approximation register (SAR) analog-to-digital converter (ADC) with the linearity-improved technique. A linearity-improved sampling switch with parasitic capacitance compensation, which makes the parasitic capacitance of sampling switch to be almost constant with varied input signal, is proposed. It also improves the matching of the differential sampling switches. Moreover, a metastability immunity technique is provided to suppress the uncertain decision behavior of dynamic comparator at high conversion rate. In addition, a bypass SAR logic that parallels comparator and SAR logic operations is exhibited to reduce the delay of SAR feedback loop. To demonstrate the proposed techniques, a design example of SAR ADC is fabricated in a 55-nm CMOS technology, consuming 1.2 mW at a 1-V power supply. It achieves a signal-to-noise-and-distortion ratio >43.5 dB and spurious free dynamic range >54 dB at 320 MS/s. The ADC core occupies an active area of only 0.02 mm², and the corresponding figure of merit is 30 fJ/conversion step with Nyquist rate.
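
The reported figure of merit can be sanity-checked with a short calculation. The sketch below assumes the commonly used Walden definition, FOM = P / (2^ENOB × fs), with ENOB = (SNDR − 1.76) / 6.02. The abstract does not state which definition the authors used, so the formula and the function names are assumptions.

```python
import math


def enob_from_sndr(sndr_db):
    """Effective number of bits from SNDR in dB (standard sine-wave formula)."""
    return (sndr_db - 1.76) / 6.02


def walden_fom(power_w, fs_hz, enob_bits):
    """Walden figure of merit in joules per conversion step."""
    return power_w / (2 ** enob_bits * fs_hz)


def enob_from_fom(power_w, fs_hz, fom_j):
    """Invert the Walden formula to get the ENOB implied by a quoted FOM."""
    return math.log2(power_w / (fom_j * fs_hz))


if __name__ == "__main__":
    # Entry 36: 1.2 mW, 320 MS/s, SNDR > 43.5 dB
    enob = enob_from_sndr(43.5)
    print("ENOB (bits):", enob)
    print("FOM (fJ/step):", walden_fom(1.2e-3, 320e6, enob) * 1e15)
    # Entry 34: 3.5 mW, 900 MHz, quoted FOM of 32 fJ/conv.-step
    print("Implied ENOB, entry 34:", enob_from_fom(3.5e-3, 900e6, 32e-15))
```

Under these assumptions the 43.5 dB SNDR corresponds to an ENOB of about 6.9 bits and a FOM of roughly 31 fJ per conversion step, which is consistent with the quoted 30 fJ.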

37 A low power forward and reverse body bias generator in CMOS 40nm
This brief presents a low-power forward and reverse body bias (FRBB) generator with body bias (BB) switches to dynamically set BB voltage. The reverse BB (RBB) P-well generator uses pulse frequency modulation (PFM)-based switching capacitor power converter to achieve low power consumption in a wide load current range of 1-30 μA. The FBB adopts class-AB output stage and an over-current indicator to detect large body diode forward conduction leakage current. BB switch is introduced to dynamically configure the FRBB generated voltage. The proposed FRBB with BB switch is fabricated in a CMOS 40-nm triple well process, occupying an area of around 0.1 mm². The 300k digital gates are used as the load for test. The measured active current for FBB and RBB are 13 and 5 μA, respectively.

38 A low power high speed comparator for precise applications
A low-power comparator is presented. pMOS transistors are used at the input of the preamplifier of the comparator as well as the latch stage. Both stages are controlled by a special local clock generator. At the evaluation phase, the latch is activated with a delay to achieve enough preamplification gain and avoid excess power consumption. Meanwhile, small cross-coupled transistors increase the preamplifier gain and decrease the input common mode of the latch to strongly turn on the pMOS transistors (at the latch input) and reduce the delay. Unlike the conventional comparator, the proposed structure lets us set the optimum delay for preamplification and avoid excess power consumption. The speed and the power benefits of the comparator were verified using solid analytical derivations, process-VDD-temperature corners, and Monte Carlo simulations along with silicon measurements in 0.18 μm. The tests confirm that the proposed circuit reduces the power consumption by 50% and provides 30% better comparison speed at the same offset and almost the same noise budgets. Moreover, the comparator provides a rail-to-rail input Vcm range at fclk = 500 MHz.

39A variable size FFT hardware accelerator based on matrix transposition
Fast Fourier transform (FFT) is the kernel and the most time-consuming algorithm in the domain of digital signal processing, and the FFT sizes of different applications are very different. Therefore, this paper proposes a variable-size FFT hardware accelerator, which fully supports the IEEE-754 single-precision floating-point standard and the FFT calculation with a wide size range from 2 to 2²⁰ points. First, a parallel Cooley-Tukey FFT algorithm based on matrix transposition (MT) is proposed, which can efficiently divide a large size FFT into several small size FFTs that can be executed in parallel. Second, guided by this algorithm, the FFT hardware accelerator is designed, and several FFT performance optimization techniques such as hybrid twiddle factor generation, multibank data memory, block MT, and token-based task scheduling are proposed. Third, its VLSI implementation is detailed, showing that it can work at 1 GHz with the area of 2.4 mm² and the power consumption of 91.3 mW at 25 °C, 0.9 V. Finally, several experiments are carried out to evaluate the proposal's performance in terms of FFT execution time, resource utilization, and power consumption. Comparative experiments show that our FFT hardware accelerator achieves at most 18.89× speedups in comparison to two software-only solutions and two hardware-dedicated solutions.
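
The idea of splitting one large FFT into small FFTs linked by matrix transpositions can be sketched with the textbook "four-step" Cooley-Tukey decomposition for N = N1 × N2. This is a generic illustration, not the accelerator's algorithm or schedule. The small transforms are plain O(n²) DFTs for clarity, and all function names are assumptions.

```python
import cmath


def dft(v):
    """Direct O(n^2) DFT, standing in for a small FFT kernel."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def transpose(m):
    """Matrix transposition of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


def fft_four_step(x, n1, n2):
    """Length n1*n2 DFT built from independent length-n1 and length-n2 DFTs."""
    n = n1 * n2
    assert len(x) == n
    # View the input as an n1 x n2 matrix A[j1][j2] = x[n2*j1 + j2].
    a = [x[r * n2:(r + 1) * n2] for r in range(n1)]
    # Transpose so each row holds one column of A, then run n2 independent
    # length-n1 DFTs (these could execute in parallel).
    b = [dft(row) for row in transpose(a)]            # b[j2][k1]
    # Twiddle multiplication by W_N^(j2*k1).
    b = [[b[j2][k1] * cmath.exp(-2j * cmath.pi * j2 * k1 / n)
          for k1 in range(n1)] for j2 in range(n2)]
    # Transpose again, then run n1 independent length-n2 DFTs.
    c = [dft(row) for row in transpose(b)]            # c[k1][k2]
    # A final transpose puts the output in natural order: X[k1 + n1*k2].
    return [value for row in transpose(c) for value in row]
```

Applied to a 12-point input with n1 = 3 and n2 = 4, the result should match a direct 12-point DFT to within floating-point rounding.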

40Accurate performance evaluation of VLSI design with selected CMOS process parameter
As process monitors have become vital components in modern very-large-scale integration (VLSI) designs, performance targets often determine the physical implementation of such monitors. However, as various process and environmental parameters collectively affect circuit behaviour, the design of process monitors can be difficult. In addition, process parameters from device-level models may not provide sufficient resolution in circuit-level performance. Therefore, the authors propose an intelligent novel flow for selecting dominant process parameters for evaluating performance targets such as timing and leakage. The proposed flow is applied to ISCAS'85, ISCAS'89, and IWLS'05 benchmark circuits and selects the dominant parameters in 32 and 45 nm complementary metal-oxide-semiconductor (CMOS) technologies. Through this flow, the authors identify the supply voltage, temperature, gate-oxide thickness, and effective gate length as the four dominant factors for timing and leakage. Experimental results show that the suggested process parameters achieve high evaluation accuracy (<3% errors in timing and <1% errors in leakage on average) in the benchmark circuits. Therefore, the proposed flow can select dominant parameters for performance targets, and the four determined factors can be used to accurately evaluate timing and leakage in 32 and 45 nm CMOS technologies.

41An area efficient low-voltage 6-T SRAM cell using stacked silicon nanowires
An area efficient low-voltage 6-T SRAM cell using stacked silicon nanowires is proposed. Among emerging CMOS devices, nanowire (NW) / gate-all-around (GAA) silicon MOSFETs have shown advantages for scaling features as the semiconductor technology continues to progress. While preserving the intrinsic GAA advantages, this paper provides a design methodology for the optimal and feasible manufacturability with different doping concentrations to achieve high density design and assesses the performance via three-dimensional TCAD simulation. However, due to limited atoms in the extremely scaled channel, a heavy doping with in-situ doping process is needed. In addition, using vertical stacked gate-all-around MOSFETs to achieve high density in the same layout area with the proposed multi-threshold doping scheme is beneficial for system on chip (SoC) application. Circuit performance projection of the 6-T SRAM is provided based on balanced read and write performances.

42Approximate hybrid high radix encoding for energy efficient inexact multipliers
Approximate computing forms a design alternative that exploits the intrinsic error resilience of various applications and produces energy-efficient circuits with small accuracy loss. In this paper, we propose an approximate hybrid high radix encoding for generating the partial products in signed multiplications that encodes the most significant bits with the accurate radix-4 encoding and the least significant bits with an approximate higher radix encoding. The approximations are performed by rounding the high radix values to their nearest power of two. The proposed technique can be configured to achieve the desired energy-accuracy tradeoffs. Compared with the accurate radix-4 multiplier, the proposed multipliers deliver up to 56% energy and 55% area savings, when operating at the same frequency, while the imposed error is bounded by a Gaussian distribution with near-zero average. Moreover, the proposed multipliers are compared with state-of-the-art inexact multipliers, outperforming them by up to 40% in energy consumption, for similar error values. Finally, we demonstrate the scalability of our technique.

43Area optimized fully flexible BCH decoder for multiple GF dimensions
Recently, there are increasing demands for fully flexible Bose-Chaudhuri-Hocquenghem (BCH) decoders, which can support different dimensions of Galois fields (GF) operations. As the previous BCH decoders are mainly targeting the fixed GF operations, the conventional techniques are no longer suitable for multiple GF dimensions. For the area-optimized flexible BCH decoders, in this paper, we present several optimization schemes for reducing hardware costs of multi-dimensional GF operations. In the proposed optimizations, we first reformulate the matrix operations in syndrome calculation and Chien search for sharing more common sub-expressions between GF operations having different dimensions. The cell-based multi-m GF multiplier is newly introduced for the area-efficient flexible key-equation solver. As case studies, we design several prototype flexible BCH decoders for digital video broadcasting systems and NAND flash memory controllers managing different page sizes. The implementation results show that the proposed fully-flexible BCH decoder architecture remarkably enhances the area-efficiency compared with the conventional solutions.

44Average 7T1R Nonvolatile SRAM with R/W Margin Enhanced for Low-Power Application
A new average 7T1R nonvolatile SRAM for low-power application is presented in this brief, which improves the read and write margin (RM/WM), as well as the restore energy, simply by using the source switch transistor. Simulation results demonstrate that the RM and WM will be improved by ~23% and ~73%, respectively, and the energy consumption will be decreased by ~63% for low-resistance state restoration, compared with the prior art initialization-and-overwrite-7T1R at nMOS typical corner and pMOS typical corner in Taiwan Semiconductor Manufacturing Company's 65-nm technology. In addition, with the column-shared structure, the area penalty is cheerfully acceptable.

45Enhancing the delay performance of junction less silicon nanotube based 6T SRAM
This work investigates the delay performance of junctionless silicon nanotube (JLSiNT) field-effect transistor (FET) based 6T SRAM cell. The study demonstrates that the delay performance of symmetric drain/source DS-JLSiNT FET (inner gate covers drain, channel, and source regions) based 6T SRAM gets improved when the inner gate of nanotube covers only either drain and channel regions (D-JLSiNT FET) or source and channel regions (S-JLSiNT FET) because of improved I_on/C_gg. The improvement in read (write) access time is ~22% (17%) and ~9% (20%) when DS-JLSiNT FET is replaced by D-JLSiNT FET and S-JLSiNT FET, respectively, in DS-JLSiNT FET based 6T SRAM. Furthermore, due to partial covering of inner gate, the gate electrostatic integrity is reduced which decreases the ratio of on-current to off-current (I_on/I_off) resulting in degraded static noise margin (SNM). However, the deterioration in write SNM, hold SNM, and read SNM are almost minimal (~0.3, 0.9, and 2%, respectively) for S-JLSiNT FET based SRAM as compared to DS-JLSiNT FET based SRAM. However, the deterioration in SNMs is aggravated for D-JLSiNT FET based SRAM as compared to DS-JLSiNT FET based SRAM. Thus, S-JLSiNT FET is the best configuration for designing of JLSiNT FET based 6T SRAM cell.

46Leakage Power Attack-Resilient Symmetrical 8T SRAM Cell
Power analysis attacks have become a serious threat to security systems by enabling secret data extraction using side-channel leakage information. Embedded memories, often implemented with 6T SRAM cells, serve as a key component in many of these systems. However, conventional SRAM cells are prone to side-channel leakage power attacks. To provide resiliency to these types of attacks, we propose a symmetric 8T SRAM cell which incorporates two more transistors than the conventional 6T cell to significantly reduce the correlation between the stored data and the leakage currents. To demonstrate the improved security of the suggested memory array, both cells were implemented in a 65-nm CMOS technology. Simulation results, including Monte Carlo analysis and signal-to-noise ratio comparison, illustrate the resiliency of the 8T cell to leakage power attacks.

47Low power and fast full adder by exploring new XOR and XNOR gates
In this paper, novel circuits for XOR/XNOR and simultaneous XOR–XNOR functions are proposed. The proposed circuits are highly optimized in terms of the power consumption and delay, which are due to low output capacitance and low short-circuit power dissipation. We also propose six new hybrid 1-bit full-adder (FA) circuits based on the novel full-swing XOR–XNOR or XOR/XNOR gates. Each of the proposed circuits has its own merits in terms of speed, power consumption, power-delay product (PDP), driving ability, and so on. To investigate the performance of the proposed designs, extensive HSPICE and Cadence Virtuoso simulations are performed. The simulation results, based on the 65-nm CMOS process technology model, indicate that the proposed designs have superior speed and power against other FA designs. A new transistor sizing method is presented to optimize the PDP of the circuits. In the proposed method, the numerical computation particle swarm optimization algorithm is used to achieve the desired value for optimum PDP with fewer iterations. The proposed circuits are investigated in terms of variations of the supply and threshold voltages, output capacitance, input noise immunity, and the size of transistors.
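
At the logic level, a hybrid full adder built around simultaneous XOR and XNOR signals is often organised as two multiplexers. The behavioural sketch below shows that common structure and checks it exhaustively. It is not a model of any of the six proposed circuits, and the function names are assumptions.

```python
def xor_xnor(a, b):
    """Return the XOR and XNOR of two bits, as a simultaneous XOR-XNOR cell would."""
    x = a ^ b
    return x, x ^ 1


def full_adder(a, b, cin):
    """1-bit full adder using XOR/XNOR signals as multiplexer inputs and selects."""
    x, xn = xor_xnor(a, b)
    # Sum multiplexer: Cin selects XNOR (Cin = 1) or XOR (Cin = 0).
    s = xn if cin else x
    # Carry multiplexer: when A != B the carry follows Cin, otherwise it equals A.
    cout = cin if x else a
    return s, cout


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for cin in (0, 1):
                print(a, b, cin, full_adder(a, b, cin))
```

Every row of the truth table should satisfy a + b + cin = sum + 2 × cout.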

48NMOS only Schmitt trigger circuit for NBTI resilient CMOS circuits
A novel N-type MOS (NMOS) only Schmitt trigger with voltage booster (NST-VB) circuit is presented. The proposed NST-VB circuit uses NMOS transistors in both pull-up and pull-down networks to reduce the effect of negative bias temperature instability (NBTI) on the circuit. The proposed circuit is less affected by both inter-die and intra-die process variations in consequence of NMOS only structure. Owing to NBTI, the increase in delay for the proposed NST-VB circuit is only 0.47% as compared with 7.2% for conventional Schmitt trigger after the stress time of three years. For the viability of the proposed circuit figure of merit (FOM) is used as a performance metric and it is found that the proposed circuit has 15.35× and 3.53× improved FOM as compared with the conventional Schmitt trigger and NMOS inverter, respectively.

49On-Chip Adaptive Body Bias for Reducing the Impact of NBTI on 6T SRAM Cells
Negative bias temperature instability (NBTI) is a major reliability issue with the scaled devices at elevated temperature. The effect of NBTI increases with time, and it increases the threshold voltage of pMOS. In this paper, an on-chip adaptive body bias (O-ABB) circuit to compensate the degradation due to NBTI aging is presented. The O-ABB is used to compensate the parameter variations and improves the SRAM circuit yield regarding read current, hold SNM, read SNM, write margin, and word line write margin (WLWM). The O-ABB consists of standby leakage current (Iddq) sensor circuit, decision circuit, and body bias control circuit. Circuit level simulation for SRAM cell is performed for pre- and post-stress of ten years NBTI aging. The proposed O-ABB reduces the effect of NBTI on the stability of SRAM cell. The simulation results show that the hold SNM, read SNM, and WLWM decrease by 10.55%, 8.55%, and 3.25%, respectively, in the absence of O-ABB, whereas hold SNM, read SNM, and WLWM decrease by only 0.47%, 1.15%, and 0.62%, respectively, if O-ABB is used to compensate the degradation.

50Performance analysis of junctionless DG-MOSFET-based 6T-SRAM with gate-stack configuration
In this work, the investigation of high-K gate-stack-based junctionless (JL) double-gate (DG) metal-oxide-semiconductor field-effect transistor (MOSFET) is carried out to study the high-K gate dielectric effect on six-transistor (6T)-static random-access memory (SRAM) built with JLDG-MOSFET. It is observed that the utilisation of the high-K gate dielectric in JLDG-MOSFETs improves the static noise margin (SNM) that is the stability of the cell as well as access time (AT) which reflects the delay performance of the SRAM cell. Furthermore, scaling down of L_g degrades the stability. Moreover, it is also described that the enhancement in hold SNM (ΔHSNM = HSNM(K=40) - HSNM(K=3.9)), read SNM (ΔRSNM = RSNM(K=40) - RSNM(K=3.9)), and write SNM (ΔWSNM = WSNM(K=40) - WSNM(K=3.9)) is restricted at lower L_g without much enhancement in the improvement of read AT and write AT.

51Read Static Noise Margin Decrease of 65-nm 6-T SRAM Cell Induced by Total Ionizing Dose
Read static noise margin (SNM) decrease of 65-nm 6-T cell induced by total ionizing dose (TID) was observed in this paper. The static random access memory (SRAM) cell test structure allowing precise measurement of read SNM was specifically designed and irradiated by gamma ray. Experimental results show that read SNM of 65-nm 6-T cell is sensitive to TID irradiation. The largest decrease of read SNM is 48 mV after 1000 krad(Si) irradiation, which is 36% of the value before TID irradiation. Being dependent on the measurement of radiation responses of cell transistors and simulation results, we conclude that the read SNM decrease is due to a threshold voltage shift induced by TID. Because narrow width transistors are employed in SRAM cells, threshold voltage of cell transistors will be shifted by charges trapped in shallow trench isolation, known as “Radiation-Induced Narrow Channel Effect.”

52Study on a low-complexity ECG compression scheme with two tier sensors
Toward reducing the bit overhead in telemonitoring systems, a novel ECG signal compression scheme at low complexity is proposed. The ECG signal compression for a single sensor or multiple single-tier sensors is most widely studied in the literature. Different from the existing work, a scheme based on two-tier sensors is considered where the lower tier sensors are designed to be at the lower complexity for reducing the cost. The compression for both the intersensor transmission and the outward transmission from the whole system is optimized in terms of quantization bits size. The stability of the proposed scheme is analyzed. Furthermore, the joint optimization of the transmission period and the quantization bits per transmission is investigated. Experimental results show that the proposed scheme outperforms the conventional ones with respect to ECG reconstruction accuracy at a given bit.

53Ultra low power high stability 8T SRAM for application in object tracking system
In this paper, an ultra-low power (ULP) 8T static random access memory (SRAM) is proposed. The proposed SRAM shows better results than conventional SRAMs in terms of leakage power, write static noise margin, write-ability, read margin, and I_ON/I_OFF ratio. The leakage power is reduced by 82× and 75× compared with the conventional 6T SRAM and the read-decoupled (RD)-8T SRAM, respectively, at 300 mV VDD. In addition, the write static noise margin (WSNM), write trip point (WTP), read dynamic noise margin, and I_ON/I_OFF ratio are improved by 7.1%, 43%, 7.4%, and 74×, respectively, over the conventional 6T SRAM at 0.3 V VDD. Moreover, the WSNM, WTP, and I_ON/I_OFF values are improved by 6.67%, 7.14%, and 68×, respectively, compared with the RD-8T SRAM at 0.3 V VDD. Furthermore, a fast, reliable object tracking algorithm with low memory usage is proposed, together with an implementation of its memory block using the ULP 8T SRAM. A quadtree-based approach is employed to shrink the bounding box and to reduce the computations for fast, low-power object tracking. This, in turn, minimizes the complexity of the algorithm and reduces the memory requirement for tracking. The proposed object detection and tracking method is based on macroblock resizing and demonstrates an accuracy rate of 96.5%. In addition, the average total power consumption for object detection and tracking, which includes write, read, and hold power, is 1.63× and 1.45× lower than that of the C6T and RD8T SRAMs, respectively, at 0.3 V VDD.
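The idea of using a quadtree to shrink a bounding box can be sketched as follows. This is a generic quadtree subdivision under assumed inputs, not the macroblock-resizing method from the paper: the binary foreground mask, the function names `quadtree_blocks` and `bounding_box`, and the 8x8 frame are hypothetical.

```python
def quadtree_blocks(mask, x, y, size, min_size):
    """Return (x, y, size) leaf blocks of a square region that contain foreground pixels."""
    region = [row[x:x + size] for row in mask[y:y + size]]
    if not any(any(row) for row in region):
        return []                       # empty block: pruned, no further work
    if size <= min_size or all(all(row) for row in region):
        return [(x, y, size)]           # smallest allowed block, or uniformly foreground
    half = size // 2
    blocks = []
    for dy in (0, half):
        for dx in (0, half):
            blocks += quadtree_blocks(mask, x + dx, y + dy, half, min_size)
    return blocks

def bounding_box(blocks):
    """Tight box (x0, y0, x1, y1) around the leaf blocks; x1 and y1 are exclusive."""
    x0 = min(bx for bx, _, _ in blocks)
    y0 = min(by for _, by, _ in blocks)
    x1 = max(bx + s for bx, _, s in blocks)
    y1 = max(by + s for _, by, s in blocks)
    return x0, y0, x1, y1

# Hypothetical 8x8 frame with a 2x2 object at columns 4-5, rows 2-3.
mask = [[0] * 8 for _ in range(8)]
for r in (2, 3):
    for c in (4, 5):
        mask[r][c] = 1

blocks = quadtree_blocks(mask, 0, 0, 8, min_size=1)
box = bounding_box(blocks)
print(blocks, box)
```

Empty quadrants are discarded at the coarsest level where they are found, so only the region around the object is examined in detail. That pruning is what keeps both the computation and the memory footprint small.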




Topic Highlights



IEEE VLSI Projects

Advances in technology have made it possible to implement embedded systems in home appliances. VLSI design involves building integrated circuits (ICs) by logically combining thousands of transistors into a single chip using different logic circuits. Real-time projects are carried out here, and ElysiumPro assists you in developing Final Year IEEE Projects.

Analyse various topics and research them in depth to get a clear idea. Download the latest project titles with abstracts from our site. You can then select a project topic of your choice from the VLSI projects list, with enough confidence to take up any topic.