# Systematic Design of an Approximate Adder: the Optimized Lower-part Constant-OR Adder

Ayad Dalloo<sup>†</sup>, *Student Member, IEEE,* Ardalan Najafi<sup>†</sup>, *Student Member, IEEE,* and Alberto Garcia-Ortiz<sup>†</sup>, *Member, IEEE* 

*Abstract*—Exploiting the trade-off between accuracy and hardware cost has a tremendous potential to improve the efficiency of integrated systems. Using this concept, numerous approximate adders have been proposed in the last ten years. Although conceptually different, all previous architectures have been obtained with an ad-hoc and non-systematic methodology. Instead, this work generalizes and systematically optimizes an architectural template for approximate adders. The outcome, called Optimized Lower-part Constant-OR Adder, outperforms previous approaches in terms of accuracy and hardware cost. For example, an 8-bit approximate adder implemented with our new approach improves the mean squared error by 58.5%, while simultaneously reducing the cost by 7.2% with respect to the previously reported best architecture.

Index Terms—Approximate Computing, Stochastic Computing, Adder Architecture, Error-Cost Trade-off.

## I. INTRODUCTION

STOCHASTIC computing has begun to emerge in response to the languishing benefits of technology scaling. Rather than hiding variations under expensive guard-bands, designers have begun to relax traditional correctness constraints and deliberately expose hardware variability to higher levels of the computing stack [1]. Approximate computing, a promising technique to reduce power, area and delay in VLSI design, approximates a system by redesigning its logic circuit [2]. It exploits the gap between the level of accuracy required by the applications and that provided by the computing system, for achieving diverse optimizations.

Researchers in the field of approximate computing have paid special attention to adders, one of the key components of arithmetic circuits. In fact, a surprisingly large number of approximate adders [3]–[10] have been proposed in the literature: segmented adders, where an *n*-bit adder is divided into *k*-bit sub-adders [3]–[5]; carry select adders, in which multiple sub-modules are used [6], [7]; approximate full adders, where the full adder itself is approximated [9], [10]; and speculative adders, which are built upon the observation that the critical path is rarely activated in traditional adders [11]–[13]. The situation is such that even a fair comparison of approximate adders is a challenging endeavor [14], [15]. Although all the architectures are conceptually different, they share a common characteristic: they have been obtained with an ad-hoc and non-systematic methodology. A remarkable exception is the Generic Accuracy Configurable Adder (GeAr), which uses the idea of a template [8] but is not optimal.

This work is funded by the German Research Foundation (DFG) project GA 763/4-1.

Among all the purely combinatorial approximate adders, the *Lower-part OR Adder* (LOA) [9] shows the best error versus hardware-cost trade-off [14], [15]. As can be seen in Fig. 1, LOA divides an *n*-bit adder into two sub-adders. While the higher significant sub-adder consists of an $(n_h-1)$-bit exact adder, the lower significant sub-adder is simply constructed from $n_l$ OR gates (bits 0 to $n_l-1$). To generate the carry-in signal for the exact adder, an extra AND gate combines the adder inputs at bit position $n_l$, i.e., $a_{n_l}$ and $b_{n_l}$. The key advantage of LOA with respect to other architectures such as the Equal Segmentation Adder (ESA) [4], the Error Tolerant Adder (ETAII) [5] or the Almost Correct Adder (ACA) [3] is that the approximation is restricted to the least significant bits and, therefore, the magnitude of the errors is limited.
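The behavior of LOA described above can be illustrated with a small bit-level model. The following is only a sketch under stated assumptions, not the authors' implementation: the function name `loa_add` is hypothetical, the operands are unsigned integers, and the sum bit at position $n_l$ is assumed to be the OR of the two inputs (consistent with the OR\_AND block and the $(n_l + 1)$ OR gates attributed to LOA later in this brief).

```python
def loa_add(a: int, b: int, n_l: int) -> int:
    """Behavioral sketch of a Lower-part OR Adder (hypothetical helper).

    Bits 0 .. n_l-1 are approximated with OR gates, bit n_l uses an
    OR gate for the sum and an AND gate for the carry, and all bits
    above n_l are added exactly.
    """
    low_mask = (1 << n_l) - 1
    # Lower part: one OR gate per bit, no carry propagation.
    low = (a | b) & low_mask
    # Bit n_l: approximate sum (OR) and carry-in for the exact part (AND).
    a_nl = (a >> n_l) & 1
    b_nl = (b >> n_l) & 1
    sum_nl = a_nl | b_nl
    carry = a_nl & b_nl
    # Exact part: ordinary addition of the remaining upper bits.
    high = (a >> (n_l + 1)) + (b >> (n_l + 1)) + carry
    return (high << (n_l + 1)) | (sum_nl << n_l) | low


if __name__ == "__main__":
    a, b = 0b0111, 0b0101
    approx = loa_add(a, b, n_l=2)
    print(approx, a + b, approx - (a + b))
```

In this model the error never involves bits above position $n_l$ except through the single AND-generated carry, which is why the error magnitude stays bounded by the size of the lower part.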

The goal of this brief is to improve LOA systematically. First, we generalize the LOA architecture in the form of an architectural template; then, studying all the possible choices to implement that template, we obtain an *optimal* architecture for the presented template focusing on Mean Squared Error (MSE). We call it *Optimized Lower-part Constant-OR Adder* (OLOCA). Since LOA is the superior adder among the existing approximate adders, our optimized architecture outperforms all the existing approximate adders when considering the trade-off between hardware cost and accuracy. The experimental evidence reported in this brief corroborates this fact.

Following the aforementioned goals, the paper is organized as follows: Section II describes the structures of the architectural template and of OLOCA. Afterwards, in Section III, we quantify the advantages of OLOCA using experimental results; furthermore, we validate the mathematical formulas developed in Section II. Finally, Section IV concludes the paper.



Fig. 1. Hardware architecture of the Lower-part OR Adder (LOA)

<sup>†</sup>All the authors contributed equally to this work.

A. Najafi and A. Garcia-Ortiz are with the Institute of Electrodynamics and Microelectronics, University of Bremen, 28359 Bremen, Germany. E-mail: {ardalan, agarcia}@item.uni-bremen.de

## II. ARCHITECTURE

To systematically obtain an optimal<sup>1</sup> approximate adder, we proceed in three steps. First, we describe the error metrics and the hardware cost that quantify the quality of an architecture; second, we generalize the architecture of LOA into a more abstract template; third, we optimize the template with respect to the MSE to produce OLOCA.

### A. Metrics

Different metrics need to be considered to evaluate the quality of approximate adders; they quantify the trade-off between error and hardware-cost.

The error is defined as the difference between approximate and accurate output results of the adder, i.e.,

$$\varepsilon = \tilde{S} - S,\tag{1}$$

where  $\tilde{S}$  is the approximate (erroneous) output of the adder and S is the accurate result. The magnitude of the error can be quantified with several metrics; among them, the most common ones are the *Average Error* ( $\mu$ ), the *Standard Deviation* (STD or  $\sigma$ ), the *Mean Squared Error* (MSE), and the *Mean Absolute Error* (MAE). They can be calculated as:

$$\mu = E[\varepsilon] , \qquad (2)$$

$$\sigma = \sqrt{E[(\varepsilon - \mu)^2]} , \qquad (3)$$

$$MSE = E[\varepsilon^2] = \mu^2 + \sigma^2 , \qquad (4)$$

$$MAE = E[|\varepsilon|] , \qquad (5)$$

where $E$ is the expectation operator. It is also common to employ normalized versions of these metrics, obtained by dividing them by the range of the adder, i.e., $2^n$.
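As an illustration of Eqs. (2) to (5), the metrics can be computed from a list of error samples as sketched below. The function name `error_metrics` and the dictionary keys are hypothetical, and the samples are assumed to be equally likely, so that the expectation reduces to an arithmetic mean.

```python
import math
from typing import Dict, Iterable


def error_metrics(errors: Iterable[float]) -> Dict[str, float]:
    """Average error, standard deviation, MSE and MAE of error samples.

    Assumes all samples are equally likely (population statistics).
    """
    errs = list(errors)
    count = len(errs)
    mu = sum(errs) / count                         # Eq. (2)
    var = sum((e - mu) ** 2 for e in errs) / count
    sigma = math.sqrt(var)                         # Eq. (3)
    mse = sum(e * e for e in errs) / count         # Eq. (4)
    mae = sum(abs(e) for e in errs) / count        # Eq. (5)
    return {"mu": mu, "sigma": sigma, "mse": mse, "mae": mae}


if __name__ == "__main__":
    # Four equally likely errors, e.g. one bit replaced by a constant 1.
    metrics = error_metrics([1, 0, 0, -1])
    print(metrics)
    # The identity MSE = mu^2 + sigma^2 of Eq. (4) holds for any sample set.
    assert math.isclose(metrics["mse"], metrics["mu"] ** 2 + metrics["sigma"] ** 2)
```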

In order to evaluate the hardware efficiency of the architectures, the area and delay of the designs need to be considered. In the rest of this paper, *A* and *D* denote the hardware area and delay, respectively. In the mathematical analysis, we use the unit-gate model [16], where simple monotonic 2-input gates (AND, OR, NAND, etc.) have a cost of one in area and delay, and simple non-monotonic 2-input gates (XOR, XNOR) have a cost of two in area and delay. In the experimental results, the actual area and delay of the synthesized circuits are considered.



Fig. 2. The hardware structure of the general template

<sup>1</sup>Throughout this manuscript, "optimal" refers to "optimal for the given template".

Table I ERROR METRICS AND UNIT GATE CHARACTERISTICS OF THE POSSIBILITIES FOR 2-TO-1 BLOCKS

|        | $\hat{\mu}$ | $\hat{\sigma}^2$ | $\widehat{MSE}$ | $\hat{A}$ | $\hat{D}$ |
|--------|-------------|------------------|-----------------|-----------|-----------|
| AND    | $-3/4$      | $3/16$           | $3/4$           | 1         | 1         |
| OR     | $-1/4$      | $3/16$           | $1/4$           | 1         | 1         |
| Buffer | $-1/2$      | $1/4$            | $1/2$           | 0         | 0         |
| Cte-0  | $-1$        | $1/2$            | $3/2$           | 0         | 0         |
| Cte-1  | $0$         | $1/2$            | $1/2$           | 0         | 0         |

Table II ERROR METRICS AND UNIT GATE CHARACTERISTICS OF THE POSSIBILITIES FOR THE 2-TO-2 BLOCKS

|            | $\hat{\mu}$ | $\hat{\sigma}^2$ | $\widehat{MSE}$ | $\hat{A}$ | $\hat{D}$ |
|------------|-------------|------------------|-----------------|-----------|-----------|
| Half-Adder | $0$         | $0$              | $0$             | 3         | 2(1)      |
| OR_AND     | $1/4$       | $3/16$           | $1/4$           | 2         | 1(1)      |
| Cte-1_AND  | $1/2$       | $1/4$            | $1/2$           | 1         | 0(1)      |
| Buffer_AND | $0$         | $1/2$            | $1/2$           | 1         | 0(1)      |
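The error columns of Tab. I and Tab. II can be re-derived by enumerating the four equally likely input combinations of each block. The sketch below is an assumption-laden illustration: the dictionaries `BLOCKS_2TO1` and `BLOCKS_2TO2` and the helper `block_stats` are hypothetical names, each block is modeled by the value it contributes at its own bit position (sum bit plus twice the AND-generated carry for the 2-to-2 blocks), and the exact contribution is taken to be $a + b$.

```python
from fractions import Fraction
from itertools import product

# Value produced by each 2-to-1 block for input bits a and b.
BLOCKS_2TO1 = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "Buffer": lambda a, b: a,
    "Cte-0": lambda a, b: 0,
    "Cte-1": lambda a, b: 1,
}

# 2-to-2 blocks: (approximate) sum bit plus 2 * carry, with carry = a AND b.
BLOCKS_2TO2 = {
    "Half-Adder": lambda a, b: (a ^ b) + 2 * (a & b),
    "OR_AND": lambda a, b: (a | b) + 2 * (a & b),
    "Cte-1_AND": lambda a, b: 1 + 2 * (a & b),
    "Buffer_AND": lambda a, b: a + 2 * (a & b),
}


def block_stats(block):
    """Return (mean, variance, MSE) of the block error for uniform inputs."""
    errs = [Fraction(block(a, b) - (a + b)) for a, b in product((0, 1), repeat=2)]
    mu = sum(errs) / 4
    mse = sum(e * e for e in errs) / 4
    return mu, mse - mu * mu, mse


if __name__ == "__main__":
    for name, block in {**BLOCKS_2TO1, **BLOCKS_2TO2}.items():
        mu, var, mse = block_stats(block)
        print(f"{name:11s} mu={mu} var={var} mse={mse}")
```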

### B. General Template Architecture Based on LOA

As discussed in Section I, experimental results show that, considering the error versus hardware-cost trade-off, LOA is the best architecture among the existing approximate adders [14], [15]. A careful study of LOA's architecture shows that it can be generalized as in Fig. 2: the lower significant sub-adder is divided into $n_l$ 2-to-1 logic blocks (bits 0 to $n_l-1$) and a single 2-to-2 logic block. This latter block receives the inputs of the adder at bit position $n_l$; it generates the input carry for the exact part using an AND gate, while its sum signal can be generated inexactly. Finally, the higher significant sub-adder is an exact adder. The architecture of LOA is obtained from this general template by placing OR gates in each bit of the lower significant sub-adder and using the approximate OR\_AND circuit as the 2-to-2 block.

In principle, any Boolean function of the right size is a valid choice for the blocks; even a constant function equal to one (Cte-1) or zero (Cte-0) can be selected. For concreteness, the relevant choices for the 2-to-1 and 2-to-2 blocks are tabulated in Tab. I and Tab. II, respectively. Although we have studied all the possibilities, the blocks with a higher error for the same cost have been eliminated from both tables. To obtain an *optimal* architecture for the template, the best combination of blocks from each table has to be chosen. For uniformly distributed data, the bits are uncorrelated and the error metrics of the template (T) can be calculated as a function of the error characteristics of each block. Since the total error, $\varepsilon_T$, is the sum of the errors of each block, $\hat{\varepsilon}_i$, weighted according to its bit position, i.e., $\varepsilon_T = \sum_{i=0}^{n_l} \hat{\varepsilon}_i 2^i$, we obtain:

$$\mu_T = \sum_{i=0}^{n_l} \hat{\mu}_i 2^i \tag{6}$$

$$\sigma_T^2 = \sum_{i=0}^{n_l} \hat{\sigma}_i^2 2^{2i} , \qquad (7)$$

$$MSE_{T} = \sum_{i=0}^{n_{l}} \hat{\sigma}_{i}^{2} 2^{2i} + \left(\sum_{i=0}^{n_{l}} \hat{\mu}_{i} 2^{i}\right)^{2}, \qquad (8)$$

where $\hat{\mu}_i$ and $\hat{\sigma}_i^2$ are the average error and the variance of the error associated with the block instantiated in bit position *i*. The corresponding values are given in Tab. I for bits 0 to $n_l - 1$ and in Tab. II for bit $n_l$, under the column names $\hat{\mu}$ and $\hat{\sigma}^2$, respectively. For example, using this method, we obtained the error metrics for LOA shown in Tab. III, which agree with the simulation results of [15]. The key question now is whether the particular choices made by LOA are optimal and, if not, which is the optimal alternative for the selected template. The next subsection addresses this topic.
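Eqs. (6) to (8) can be evaluated mechanically once the per-block statistics are known. The following sketch assumes the values of Tab. I and Tab. II; the names `STATS`, `template_metrics` and `loa_blocks` are hypothetical. Exact rational arithmetic is used so that the result for LOA can be compared symbolically with $\mu = 1/4$, $\sigma^2 = \frac{1}{4}4^{n_l} - \frac{1}{16}$ and $MSE = \frac{1}{4}4^{n_l}$.

```python
from fractions import Fraction as F

# (mu_hat, sigma_hat^2) of each block, copied from Tab. I and Tab. II.
STATS = {
    "AND": (F(-3, 4), F(3, 16)),
    "OR": (F(-1, 4), F(3, 16)),
    "Buffer": (F(-1, 2), F(1, 4)),
    "Cte-0": (F(-1), F(1, 2)),
    "Cte-1": (F(0), F(1, 2)),
    "Half-Adder": (F(0), F(0)),
    "OR_AND": (F(1, 4), F(3, 16)),
    "Cte-1_AND": (F(1, 2), F(1, 4)),
    "Buffer_AND": (F(0), F(1, 2)),
}


def template_metrics(blocks):
    """Evaluate Eqs. (6)-(8); blocks[i] is the block at bit position i.

    The last entry of `blocks` is the 2-to-2 block at position n_l.
    """
    mu = sum(STATS[name][0] * 2**i for i, name in enumerate(blocks))
    var = sum(STATS[name][1] * 4**i for i, name in enumerate(blocks))
    return mu, var, var + mu * mu


def loa_blocks(n_l):
    """LOA as an instance of the template: n_l OR blocks and an OR_AND."""
    return ["OR"] * n_l + ["OR_AND"]


if __name__ == "__main__":
    for n_l in range(1, 7):
        print(n_l, template_metrics(loa_blocks(n_l)))
```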

### C. The Optimized Architecture

Depending on the error metrics which are chosen, different optimization results might be obtained. Illustratively, here, we choose the MSE as the error metric because of its relevance in data processing applications. In order to obtain the optimal architecture out of the general template, we need to evaluate all the possible combinations of 2-to-1 and 2-to-2 logic blocks of Tab. I and Tab. II. Let us proceed firstly intuitively and then more formally.

The errors in the upper bits have a higher weight than those in the lower ones (see Eq. (8)). Thus, it is more profitable to spend resources on the 2-to-2 block than on the lower 2-to-1 blocks. The best 2-to-2 blocks are the OR\_AND and the Half-Adder. Replacing the Half-Adder with an OR\_AND does not improve the delay and improves the area only marginally; the penalty is a large increase in the MSE. For this reason, the choice made by LOA (using the OR\_AND as the 2-to-2 block) is not efficient. Observe that the average error introduced by the 2-to-2 blocks is zero or positive, while the 2-to-1 blocks introduce a zero or negative average error. Once we fix the 2-to-2 block to a Half-Adder, which contributes no error, nothing can compensate the negative averages of the 2-to-1 blocks. Thus, it is only useful to use blocks with small $\hat{\mu}$ (the Cte-1) or small $\hat{\sigma}$ (the OR). Therefore, the optimal disposition of 2-to-1 blocks consists of OR blocks in the upper positions followed by Cte-1 blocks in the lower bits, where the errors are less relevant. Since the adder is constructed using Cte-1s and OR gates, we call it Lower-part Constant-OR Adder (LOCA). The structure of LOCA is depicted in Fig. 3 and its error metrics can be expressed as follows:

$$\mu_{LOCA} = 2^{n_{cte}-2} - 2^{n_l-2} , \qquad (9)$$

$$\sigma_{LOCA}^2 = 2^{2n_l - 4} + \frac{5}{3} 2^{2n_{cte} - 4} - \frac{1}{6} , \qquad (10)$$

$$\begin{aligned}
MAE_{LOCA} ={}& 2^{n_l - 2} - 2^{n_{cte} - 2} \qquad && (11) \\
&+ \frac{1}{3}\left(\frac{3}{4}\right)^{n_{l}-n_{cte}}\left(2^{n_{cte}}-\frac{1}{2^{n_{cte}}}\right) , \qquad && (12)
\end{aligned}$$

$$MSE_{LOCA} = \frac{1}{6}2^{2n_l - 2n_{or}} + 2^{2n_l - 3} - 2^{2n_l - n_{or} - 3} - \frac{1}{6} . \qquad (13)$$

To determine the optimal number of OR gates, we minimize Eq. (13) with respect to $n_{or}$, which yields the optimal value $n_{or} = \log_2\left(\frac{8}{3}\right) \approx 1.42$. The closest integers, $n_{or} = 1$ and $n_{or} = 2$, produce the same MSE and are both optimal. We prefer $n_{or} = 2$, which provides a better STD. We call this architecture *Optimized Lower-part Constant-OR Adder* (OLOCA). Although remarkably simple, it outperforms LOA regarding STD, MSE and MAE.
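The minimization over $n_{or}$ can be checked numerically by evaluating Eq. (13) for every admissible integer. The sketch below uses exact fractions; `mse_loca` and `best_n_or` are hypothetical helper names, and $n_{cte} = n_l - n_{or}$ is assumed as in Fig. 3.

```python
from fractions import Fraction as F


def mse_loca(n_l: int, n_or: int) -> F:
    """Eq. (13): MSE of LOCA with n_or OR gates and n_l - n_or constants."""
    return (
        F(1, 6) * F(2) ** (2 * n_l - 2 * n_or)
        + F(2) ** (2 * n_l - 3)
        - F(2) ** (2 * n_l - n_or - 3)
        - F(1, 6)
    )


def best_n_or(n_l: int):
    """Return the list of integer minimizers of Eq. (13) and the minimum MSE."""
    scores = {n_or: mse_loca(n_l, n_or) for n_or in range(n_l + 1)}
    best = min(scores.values())
    return sorted(n for n, v in scores.items() if v == best), best


if __name__ == "__main__":
    for n_l in (2, 4, 8):
        print(n_l, best_n_or(n_l))
```

For $n_l \geq 2$ the two minimizers $n_{or} = 1$ and $n_{or} = 2$ give the same value, $\frac{5}{48}4^{n_l} - \frac{1}{6}$, which is the MSE listed for OLOCA in Tab. III.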





Fig. 3. The structure of LOCA;  $n_l = n_{cte} + n_{or}$ 

The error metrics, area and delay of LOA and OLOCA are tabulated in Tab. III. These formulas provide a better understanding of the architectures and make the comparison easier. As can be seen in the table, the average error of OLOCA is larger in magnitude than that of LOA, while its STD is much smaller. As a result, the MSE of OLOCA is almost 2.4 times smaller than the MSE of LOA, which represents a considerable improvement for practical circuits. Regarding the MAE, LOA has an error about 1.6 times larger than that of OLOCA. Although OLOCA does not improve the delay over LOA, its silicon area is clearly smaller (for $n_l > 2$).
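The closed-form metrics of OLOCA can be cross-checked against an exhaustive evaluation of a behavioral model. The sketch below is an illustration only: `loca_add`, `oloca_add` and `exhaustive_stats` are hypothetical names, operands are unsigned, Cte-1 blocks occupy bits 0 to $n_{cte}-1$, OR gates occupy bits $n_{cte}$ to $n_l-1$, and the Half-Adder at bit $n_l$ together with the exact upper part is modeled as an ordinary addition of the upper bits without carry-in.

```python
from fractions import Fraction as F


def loca_add(a: int, b: int, n_l: int, n_cte: int) -> int:
    """Behavioral sketch of LOCA (hypothetical helper, unsigned operands)."""
    low_mask = (1 << n_l) - 1
    cte_mask = (1 << n_cte) - 1
    # OR gates in the lower part, with the n_cte lowest bits forced to 1.
    low = ((a | b) & low_mask) | cte_mask
    # Half-Adder at bit n_l plus exact adder above it: exact sum, no carry-in.
    high = (a >> n_l) + (b >> n_l)
    return (high << n_l) | low


def oloca_add(a: int, b: int, n_l: int) -> int:
    """OLOCA keeps two OR gates (n_or = 2) when n_l allows it."""
    return loca_add(a, b, n_l, max(n_l - 2, 0))


def exhaustive_stats(adder, n: int):
    """Exact mean error, MSE and MAE over all pairs of n-bit operands."""
    errs = [adder(a, b) - (a + b) for a in range(1 << n) for b in range(1 << n)]
    count = len(errs)
    mu = F(sum(errs), count)
    mse = F(sum(e * e for e in errs), count)
    mae = F(sum(abs(e) for e in errs), count)
    return mu, mse, mae


if __name__ == "__main__":
    n_l = 4
    mu, mse, mae = exhaustive_stats(lambda a, b: oloca_add(a, b, n_l), n=6)
    print(mu, mse, mae)
    # Closed forms for OLOCA: -(3/16) 2^n_l, (5/48) 4^n_l - 1/6,
    # and (15/64) 2^n_l - (3/4) 2^-n_l.
    print(-F(3, 16) * 2**n_l, F(5, 48) * 4**n_l - F(1, 6),
          F(15, 64) * 2**n_l - F(3, 4) / 2**n_l)
```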

It is also possible to obtain the optimal architecture out of the general template more rigorously. Firstly, let us observe that Eq. (8) and Tab. II imply that an architecture where the 2-to-2 block is not a Half-Adder has necessarily a MSE of at least  $MSE_T \geq \hat{\sigma}_{n_l}^2 4^{n_l} \geq \frac{3}{16} 4^{n_l}$ , which is worse than the MSE of OLOCA (see Tab. III). Thus, the 2-to-2 block has to be a Half-Adder in the optimal architecture.

In order to demonstrate that the selection of 2-to-1 blocks of OLOCA is optimal in terms of MSE for the given template (Fig. 2), we can proceed by induction, using  $n_l$  as the induction variable. A simple computation of all the possibilities, using Eq. (8), shows that OLOCA is indeed optimal for  $n_l = 1$ ,  $n_l = 2$  and  $n_l = 3$ . Let us analyze an architecture with  $n_l = K$ , assuming the optimality of OLOCA for an architecture with  $n_l = K - 1$ . Observe that the total error,  $\varepsilon_T$ , can be decomposed into the independent contributions of the block in bit position 0,  $\hat{\varepsilon}_0$ , and the remaining blocks,  $\varepsilon_{MSBs}$ . Since  $\varepsilon_T = \varepsilon_{MSBs} + \hat{\varepsilon}_0$ , the  $MSE_T$  can be expressed as a function of the statistical characteristics of  $\hat{\varepsilon}_0$  and  $\varepsilon_{MSBs}$ ; more precisely:

$$MSE_{T} = MSE_{MSBs} + MSE_{0} + 2\hat{\mu}_{0}\mu_{MSBs} , \qquad (14)$$

where  $\mu_{MSBs}$  and  $MSE_{MSBs}$  can be calculated using Eq. (6) and Eq. (8), respectively, iterating *i* from 1 to *K*.

Note that the optimizations of block 0 and of the remaining $K-1$ blocks are not independent due to the term $2\hat{\mu}_0 \mu_{MSBs}$. However, if we prove that block 0 is a Cte-1 in the optimal architecture, then it follows that $\hat{\mu}_0 = 0$ and the term $2\hat{\mu}_0\mu_{MSBs}$ disappears. In this case, the optimization of $MSE_{MSBs}$, consisting of $K-1$ blocks, yields an OLOCA architecture by the induction hypothesis.

Table III FORMULAS OF ERROR METRICS, AREA AND DELAY

|            | LOA                                                   | OLOCA                                                      |
|------------|-------------------------------------------------------|------------------------------------------------------------|
| $\mu$      | $\frac{1}{4}$                                         | $-\frac{3}{16}2^{n_l}$                                     |
| $\sigma^2$ | $\frac{1}{4}4^{n_l} - \frac{1}{16}$                   | $\frac{53}{768}4^{n_l} - \frac{1}{6}$                      |
| MSE        | $\frac{1}{4}4^{n_l}$                                  | $\frac{5}{48}4^{n_l} - \frac{1}{6}$                        |
| MAE        | $\frac{3}{8}2^{n_l} - \frac{1}{8}$                    | $\frac{15}{64}2^{n_l} - \frac{3}{4}2^{-n_l}$               |
| A          | $(n_h - 1) \cdot A_{FA} + A_{AND} + (n_l + 1) A_{OR}$ | $(n_h - 1) \cdot A_{FA} + A_{HA} + (n_l - n_{cte}) A_{OR}$ |
| D          | $(n_h - 1) \cdot t_c + T_{AND}$                       | $(n_h - 1) \cdot t_c + T_{AND}$                            |

Fig. 4. Comparison of 16-bit LOA and OLOCA synthesized in a 65 nm technology: simulation and formula results. (a) MAE vs. ADP, (b) MSE vs. ADP.

Let us show that the block in bit position 0 has to be a Cte-1 (for $K \geq 3$) in the architecture that is optimal regarding MSE. Firstly, note that choosing Cte-1 for all the blocks 1 to $K-1$, which produces $MSE_{MSBs} = \frac{1}{6}4^K - \frac{2}{3}$ and $\mu_{MSBs} = 0$, is sub-optimal: the resulting $MSE_T$, which is greater than or equal to $MSE_{MSBs}$, is worse than that of OLOCA. As a result, at least one of these blocks is not a Cte-1 in the optimal architecture. In any of those cases, $\mu_{MSBs} = \sum_{i=1}^{K-1} \hat{\mu}_i 2^i \leq -\frac{1}{2}$, because no term in the sum is positive and every block other than a Cte-1 contributes at most $-\frac{1}{4} \cdot 2 = -\frac{1}{2}$. A simple calculation of the MSE for each of the five possibilities for block 0, using Eq. (14), shows that a Cte-1 is the optimal selection when $\mu_{MSBs} \leq -\frac{1}{2}$. This observation concludes the proof.
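For small values of $n_l$, the optimality claim can also be checked by brute force: every combination of blocks from Tab. I and Tab. II is scored with Eq. (8) and the minimum is compared with the OLOCA formula. This is only a sanity check under the assumptions of the template (uniform, uncorrelated inputs) and not a replacement for the induction argument; the names `ONE`, `TWO`, `mse` and `best_architectures` are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

# (mu_hat, sigma_hat^2) of the 2-to-1 blocks (Tab. I) ...
ONE = {
    "AND": (F(-3, 4), F(3, 16)),
    "OR": (F(-1, 4), F(3, 16)),
    "Buffer": (F(-1, 2), F(1, 4)),
    "Cte-0": (F(-1), F(1, 2)),
    "Cte-1": (F(0), F(1, 2)),
}
# ... and of the 2-to-2 blocks (Tab. II).
TWO = {
    "Half-Adder": (F(0), F(0)),
    "OR_AND": (F(1, 4), F(3, 16)),
    "Cte-1_AND": (F(1, 2), F(1, 4)),
    "Buffer_AND": (F(0), F(1, 2)),
}


def mse(lower, top):
    """Eq. (8) for 2-to-1 blocks `lower` (bit 0 first) and 2-to-2 block `top`."""
    stats = [ONE[name] for name in lower] + [TWO[top]]
    mu = sum(m * 2**i for i, (m, _) in enumerate(stats))
    var = sum(v * 4**i for i, (_, v) in enumerate(stats))
    return var + mu * mu


def best_architectures(n_l):
    """Exhaustively score all 5**n_l * 4 instances of the template."""
    scored = [
        (mse(lower, top), lower, top)
        for lower in product(ONE, repeat=n_l)
        for top in TWO
    ]
    best = min(score for score, _, _ in scored)
    return best, [(lower, top) for score, lower, top in scored if score == best]


if __name__ == "__main__":
    for n_l in range(1, 5):
        best, winners = best_architectures(n_l)
        print(n_l, best, winners)
```

The search grows as $5^{n_l}$, so it is only practical for the small cases used as the base of the induction.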

## III. EXPERIMENTAL RESULTS

To assess the circuit characteristics and evaluate the architectures presented in the previous section, we have generated VHDL descriptions of the adders. Different configurations of these adders are synthesized with a commercial low-power 65 nm library, for 16-bit and 8-bit operands. Using back-annotated simulations, the dynamic power dissipation of the adders is evaluated after synthesis at a frequency of 1 GHz. Ripple Carry Adders (RCA) are used as the sub-adders of all the approximate adders. All the adders have been simulated with $10^7$ uniformly distributed random input patterns. In this section, each adder's name is followed by a number. For ESA and ETAII, this number is the size of the equal segments. For LOA and OLOCA, it is the size of the lower significant sub-adder, i.e., $n_l$.

To check the accuracy of the formulas and to compare the adder architectures, the error versus cost of the adders for different values of $n_l$ is depicted in Fig. 4. The MAE and MSE versus Area-Delay Product (ADP) of the 16-bit adders are shown in two graphs. LOCA has been simulated with different numbers of constants for each value of $n_l$. As can be seen in the graphs, replacing OR gates with Cte-1s decreases the MAE and MSE and, at the same time, the ADP; the trend continues until only 2 OR gates remain. Beyond that point, the error values start increasing while the cost of the adder keeps decreasing. As a result, the optimal architecture, considering the error-cost trade-off, is obtained by keeping 2 OR gates and placing Cte-1s in the rest of the 2-to-1 blocks. This confirms the discussion in the previous section that the optimal architecture has 2 OR gates. Replacing all the 2-to-1 blocks with Cte-1s considerably increases the error values; although the resulting architecture is still better than LOA, it is not optimal, as shown in the figure. It can also be seen that the formulas of OLOCA and LOA (see Tab. III) accurately predict the behavior of the adders for all values of $n_l$. Moreover, for all values of $n_l$, OLOCA outperforms LOA from both the cost and the error points of view: for the same error, OLOCA improves the cost by almost 25%, and for the same cost, the error of OLOCA is almost half that of LOA. As an example, a 16-bit OLOCA-8 improves the cost by 13.6%, the MAE by 37.4% and the MSE by 58% in comparison with LOA-8.

In order to evaluate OLOCA with another bit-width, we have studied 8-bit adders as well; the results are tabulated in Tab. IV. The table shows the accuracy of the presented formulas versus the simulation results, as well as the superiority of OLOCA over LOA for all values of $n_l$. As an example, OLOCA-4 improves the MAE by 36.9%, the MSE by 58.5% and the cost by 7.2% in comparison with LOA-4.
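The "Formula" columns of Tab. IV for the error metrics can be regenerated from the closed-form expressions, as sketched below. The function names are hypothetical. For the MAE of LOA, the sketch uses $\frac{3}{8}2^{n_l} - \frac{1}{8}$, the expression that follows from the block error model of Section II and that reproduces the tabulated formula values (e.g., 1.38 for $n_l = 2$); the ADP rows depend on synthesis results and are not modeled here.

```python
import math


def loa_formulas(n_l: int):
    """(MAE, MSE, STD) of LOA from the closed-form expressions."""
    mae = 3 / 8 * 2**n_l - 1 / 8
    mse = 4**n_l / 4
    std = math.sqrt(4**n_l / 4 - 1 / 16)
    return mae, mse, std


def oloca_formulas(n_l: int):
    """(MAE, MSE, STD) of OLOCA (n_or = 2) from the closed-form expressions."""
    mae = 15 / 64 * 2**n_l - 3 / 4 * 2 ** (-n_l)
    mse = 5 / 48 * 4**n_l - 1 / 6
    std = math.sqrt(53 / 768 * 4**n_l - 1 / 6)
    return mae, mse, std


def improvement(reference: float, new: float) -> float:
    """Relative improvement of `new` over `reference`, in percent."""
    return 100 * (1 - new / reference)


if __name__ == "__main__":
    for n_l in range(2, 7):
        loa = loa_formulas(n_l)
        oloca = oloca_formulas(n_l)
        print(n_l, [f"{x:.2f}" for x in loa], [f"{x:.2f}" for x in oloca])
    # Formula-based MSE improvement of OLOCA-4 over LOA-4, in percent.
    print(improvement(loa_formulas(4)[1], oloca_formulas(4)[1]))
```

The formula-based MSE improvement for $n_l = 4$ is close to the 58.5% obtained from the simulated values in Tab. IV.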

To show the superiority of OLOCA over the existing approximate adders besides LOA, we consider ESA, ETAII and GeAr. Among the existing combinational approximate adders, these architectures have been shown to have the best performance after LOA [14], [15]. Different configurations of the adders have been simulated and the results are depicted in Fig. 5. Fig. 5(a) depicts the MAE of the adders versus ADP. Similarly, the MSE versus Power-Delay Product (PDP) of the adder architectures is illustrated in Fig. 5(b). Although ESA is hardware-efficient, it is the least accurate adder architecture. As an example, for almost the same ADP, OLOCA-8 improves the error by 97% in comparison with ESA-4. Compared with ETAII-4, OLOCA-8 improves the error, ADP and PDP by 53%, 54.9% and 42.6%, respectively. The improvements for the MSE are even larger.

Table IV SIMULATION AND FORMULA RESULTS FOR 8-BIT ADDERS SYNTHESIZED IN A COMMERCIAL 65 NM TECHNOLOGY

| Metric                 | Adder | $n_l=2$ Sim. | $n_l=2$ Formula | $n_l=3$ Sim. | $n_l=3$ Formula | $n_l=4$ Sim. | $n_l=4$ Formula | $n_l=5$ Sim. | $n_l=5$ Formula | $n_l=6$ Sim. | $n_l=6$ Formula |
|------------------------|-------|--------------|-----------------|--------------|-----------------|--------------|-----------------|--------------|-----------------|--------------|-----------------|
| MAE                    | LOA   | 1.38         | 1.38            | 2.88         | 2.88            | 5.87         | 5.88            | 11.87        | 11.88           | 23.86        | 23.88           |
|                        | OLOCA | 0.75         | 0.75            | 1.78         | 1.78            | 3.70         | 3.70            | 7.48         | 7.48            | 14.96        | 14.99           |
| MSE                    | LOA   | 4.00         | 4.00            | 16.00        | 16.00           | 63.93        | 64.00           | 255.90       | 256.00          | 1023.39      | 1024.00         |
|                        | OLOCA | 1.50         | 1.50            | 6.53         | 6.50            | 26.50        | 26.50           | 106.57       | 106.50          | 424.95       | 426.50          |
| STD                    | LOA   | 1.99         | 1.98            | 3.99         | 3.99            | 7.99         | 8.00            | 16.00        | 16.00           | 31.99        | 32.00           |
|                        | OLOCA | 0.97         | 0.97            | 2.06         | 2.06            | 4.18         | 4.18            | 8.40         | 8.40            | 16.79        | 16.81           |
| ADP [zm<sup>2</sup>s]  | LOA   | 26.82        | 26.82           | 19.19        | 19.50           | 13.19        | 13.32           | 7.95         | 8.28            | 3.94         | 4.38            |
|                        | OLOCA | 27.00        | 27.00           | 18.89        | 19.20           | 12.24        | 12.36           | 6.74         | 7.02            | 3.05         | 3.18            |

Fig. 5. Comparison of 16-bit approximate adders synthesized in a commercial 65 nm technology with various configurations: (a) MAE vs. ADP, (b) MSE vs. PDP.

## IV. CONCLUSION

In this paper, an optimal approximate adder has been obtained by generalizing and systematically optimizing an architectural template for approximate adders. The proposed adder, the *Optimized Lower-part Constant-OR Adder* (OLOCA), shows a considerable improvement in both error and hardware-cost metrics in comparison with the previously reported best architectures. The superiority of OLOCA over the existing approximate adders has been shown through mathematical analysis and confirmed by experimental results. For instance, a 16-bit approximate adder implemented with the OLOCA approach improves the mean squared error by 58% while reducing the area-delay product by 13.8%, in comparison with an approximate adder implemented with the LOA approach.

## REFERENCES

- [1] J. Sartori and R. Kumar, "Stochastic computing," *Found. Trends Electron. Des. Autom.*, vol. 5, no. 3, pp. 153–210, Mar. 2011.
- [2] S. Mittal, "A survey of techniques for approximate computing," *ACM Comput. Surv.*, vol. 48, no. 4, pp. 62:1–62:33, Mar. 2016.
- [3] A. B. Kahng and S. Kang, "Accuracy-configurable adder for approximate arithmetic designs," in *Proceedings of the 49th Annual Design Automation Conference*, ser. DAC '12. New York, NY, USA: ACM, 2012, pp. 820–825.
- [4] D. Mohapatra, V. K. Chippa, A. Raghunathan, and K. Roy, "Design of voltage-scalable meta-functions for approximate computing," in *2011 Design, Automation Test in Europe*, March 2011, pp. 1–6.
- [5] N. Zhu, W. L. Goh, and K. S. Yeo, "An enhanced low-power high-speed adder for error-tolerant application," in *Proceedings of the 2009 12th International Symposium on Integrated Circuits*, Dec 2009, pp. 69–72.
- [6] K. Du, P. Varman, and K. Mohanram, "High performance reliable variable latency carry select addition," in *2012 Design, Automation Test in Europe Conference Exhibition (DATE)*, March 2012, pp. 1257–1262.
- [7] I. C. Lin, Y. M. Yang, and C. C. Lin, "High-performance low-power carry speculative addition with variable latency," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 23, no. 9, pp. 1591–1603, Sept 2015.
- [8] M. Shafique, W. Ahmad, R. Hafiz, and J. Henkel, "A low latency generic accuracy configurable adder," in *2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC)*, June 2015, pp. 1–6.
- [9] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, "Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 57, no. 4, pp. 850–862, April 2010.
- [10] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, "Low-power digital signal processing using approximate adders," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 32, no. 1, pp. 124–137, Jan 2013.
- [11] A. K. Verma, P. Brisk, and P. Ienne, "Variable latency speculative addition: A new paradigm for arithmetic circuit design," in *Proceedings of the Conference on Design, Automation and Test in Europe*, ser. DATE '08, 2008, pp. 1250–1255.
- [12] S.-L. Lu, "Speeding up processing with approximation circuits," *Computer*, vol. 37, no. 3, pp. 67–73, Mar 2004.
- [13] D. Esposito, D. D. Caro, and A. G. M. Strollo, "Variable latency speculative parallel prefix adders for unsigned and signed operands," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 63, no. 8, pp. 1200–1209, Aug 2016.
- [14] H. Jiang, J. Han, and F. Lombardi, "A comparative review and evaluation of approximate adders," in *Proceedings of the 25th Edition on Great Lakes Symposium on VLSI*, ser. GLSVLSI '15. New York, NY, USA: ACM, 2015, pp. 343–348.
- [15] A. Najafi, M. Weißbrich, G. P. Vayá, and A. Garcia-Ortiz, "A fair comparison of adders in stochastic regime," in *2017 27th International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS)*, Sept 2017, pp. 1–6.
- [16] R. Zimmermann, "Binary adder architectures for cell-based VLSI and their synthesis," Ph.D. dissertation, Swiss Federal Institute of Technology (ETH) Zurich, Hartung-Gorre Verlag, 1998.