Fast and Precise Cache Performance Estimation for Out-Of-Order Execution

Douma, R.J.; Altmeyer, S.; Pimentel, A.D.

DOI
10.5555/2755753.2757075

Publication date
2015

Document Version
Final published version

Published in

License
Article 25fa Dutch Copyright Act (https://www.openaccess.nl/en/in-the-netherlands/you-share-we-take-care)

Link to publication

Citation for published version (APA):
https://doi.org/10.5555/2755753.2757075

General rights
It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons).

Disclaimer/Complaints regulations
If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 426, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.

UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl)
Fast and Precise Cache Performance Estimation for Out-Of-Order Execution

Roeland J. Douma, Sebastian Altmeyer, Andy D. Pimentel
Computer Systems Architecture Group, University of Amsterdam, Netherlands
\{r.j.douma, altmeyer, a.d.pimentel\}@uva.nl

Abstract—Design space exploration (DSE) is a key ingredient of system-level design, enabling designers to quickly prune the set of possible designs and determine, e.g., the number of the processing cores, the mapping of application tasks to cores, and the core configuration such as the cache organization. High-level performance estimation is a principle component of any system-level DSE: it has to be fast and sufficiently precise. Modern out-of-order architectures with caches pose a significant problem to this performance estimation process, as no simple one-to-one mapping of the number of cache misses and resulting cycle time exists.

We present a high-level cache performance-estimation framework for out-of-order processors. Evaluation shows that our prediction method is on average 15 times faster than cycle-accurate simulation, while our estimates only show an average error of below 3.5%, reduce the pessimism of a naive high-level performance estimation by around 66%, and still maintain a high fidelity. Our approach thus enables quick yet accurate performance estimation and extends the applicability of system-level DSE to out-of-order processors with caches.

I. INTRODUCTION

As the complexity of modern embedded systems increases, the development of these systems becomes more and more challenging. Design space exploration (DSE) methods are thus used to cope with the increasing complexity, to speed-up the development process and thus, to reduce the time-to-market. Automated exploration tools quickly prune the set of possible architectures and select the set of (pareto) optimal candidate architectures according to multiple objectives, such as performance, costs, size and power consumption. The system specification automatically selected by the exploration tools range from high-level information such as the number and the type of the processing cores or the mapping of application tasks to cores to more low-level specifications such as the cache hierarchy configuration. As the pruning of the set of candidate architectures is already required at an early development stage, a high-level performance estimation of the target application is necessary. Such an estimation has to fulfill two opposing requirements: it has to be fast (to evaluate a sufficient set of candidates) and precise (to select the correct candidates).

Caches are nowadays an integral component of most embedded systems where performance is critical. On the downside, caches require a significant amount of area on the chip and can increase the system’s energy consumption [1]. Choosing the right cache size and right cache parameters is thus a paramount task of any DSE, and the aforementioned high-level performance estimation must account for caches.

In case of processors with in-order execution, it is sufficient to compute the number of hits and misses as a simple one-to-one mapping from misses to cycles exist. To this end, an abundance of cache miss estimation techniques, such as stack-distance histograms and cache-miss equations, exist and fast and accurate performance estimation is available.

In contrast, modern out-of-order architectures with caches pose a significant problem to the performance estimation, as no simple one-to-one mapping of the number of cache misses and cycles exist: the actual address trace can change depending on the selected cache configuration and pipelining of outstanding memory requests change the average memory latency. A naive estimation of the number of cycles to execute an application purely based on the number of hits and misses (as valid for in-order processors) thus provides unreliable results. Hybrid approaches using source-code instrumentation and relatively fast source-level simulation exist, but typically are limited to in-order execution. Consequently, the currently available performance estimation techniques are either precise (based on slow cycle-accurate simulation) or fast (based on hit/miss ratios), but not both at the same time.

We present a cache performance estimation framework for out-of-order processors. The framework is split in two phases: a setup phase and a DSE-phase. In the setup-phase, we perform cycle-accurate simulations for two cache configurations, extract stack-distance histograms and the memory-overlap of the target application. This information serves as input for the DSE-phase which is then used to quickly estimate the performance of the target application for the candidate cache configurations. This two-step approach thus enables quick yet accurate performance estimation to be used for system-level DSE.

Our framework improves upon related work in the following aspects: The framework is:

- fast during the actual DSE using a two-phase approach,
- precise as it explicitly models the effect of out-of-order execution on the memory performance, and
- agnostic of the target architecture as it only relies on the existence of a cycle-accurate simulator, but does not require an in-depth analysis of the hardware.

The paper is structured as follows: Section II reviews the related work and Section III introduces the required terminology and notation. In Section IV, we present the performance-estimation framework and evaluate the proposed techniques in Section V. Section VI concludes the paper.

II. RELATED WORK

The estimation of cache performance has been an intensive subject of research in the past decades. We distinguish two types of approaches: static analysis and simulation based approaches.

The two most prominent examples of the former type are the cache-miss equations by Ghosh et al. [2], and the stack-distance computation [3]. Cache-miss equations provide a high-level representation of the cache behaviour. They are used in compiler frameworks to estimate the effect of optimizations. Estimation techniques based on the stack distance [3] provide
more accuracy, but suffer from the memory overhead. Several methods have been proposed to enable an efficient computation [4]–[7] or approximation [8] of the stack distance. Stack distances and cache miss equations, however, only provide the number of misses and hits for different cache configurations. While this is sufficient for simple in-order architectures that stall in case of a cache miss, the performance of out-of-order architectures is not sufficiently described by this metric alone.

Simulation techniques provide an alternative to the static analyses, but are either restricted to in-order processors [9]–[11] or to the simulation of only one cache configuration at a time, thus resulting in an unacceptable execution time [12].

A different set of performance estimation techniques is source-level simulation based on instrumentation [13]–[15]. The source code is instrumented with memory access annotations and timing information, enabling the performance estimation via source-code simulations instead of using a cycle-accurate low-level simulator. These approaches thus still require explicit cache simulation and do typically not take the effect of out-of-order execution into account. Only recently, Plyaskin et al. [16] presented a source-level simulation approach that accounts for out-of-order execution behavior. However, this approach still requires explicit cache simulation.

To the best of our knowledge, no fast and accurate high-level performance estimation for out-of-order processors with caches exist. DSE techniques thus either do not support these systems or have to restrict precise performance estimation to a small subset of the design space [17].

III. BACKGROUND AND NOTATION

In this section, we introduce the necessary notation and concepts needed in the remainder of the paper. We assume a target architecture with disjoint instruction and data caches that are both connected to main memory via a shared bus.

A. Caches and Address Traces

We concentrate in the following on set-associative caches with LRU replacement: i.e., caches that are partitioned into $s$ cache sets, where each memory block of size $l$ (i.e., cache line size is $l$) maps to exactly one of the cache sets. Each cache set in turn may contain up to $k$ different memory blocks at once, where $k$ is referred to as the associativity of the cache. The cache size is thus given by $l \cdot s \cdot k$.

We will later on use the concept of a perfect cache, which denotes a fully associative cache that is large enough to store the complete data of a task/program. A perfect cache thus only exhibits cold misses, but no conflict or capacity misses [5]. Note that direct-mapped and fully associative caches are special cases of set-associative caches with $k = 1$ or $s = 1$, respectively.

A cache configuration or cache setup $\zeta$ is a tuple $(k, s)$ consisting of the associativity and the number of cache sets. For the sake of simplicity, we assume a fixed line size $l$ throughout the paper. The set of all cache setups is denoted by $C$ with the perfect cache $pc$ as special instance of a cache setup, i.e.: $C = \{(k, s) \mid (k, s) \in \mathbb{N}\} \cup \{pc\}$. Architectural restrictions or a pre-selection of candidate cache setups allows us to reduce the complete set of cache setups $C$ and to focus only on a subset $\mathcal{C} \subset C$.

A cache address trace $T$ of size $n$ is an ordered sequence $[m_1, \ldots, m_n]$ of memory blocks $m_i \in \mathcal{M}$, where $\mathcal{M}$ is the set of all memory blocks. The set of all traces is given by $\mathcal{T}$. The stack distance $sd_s$ (s denotes the number of sets) [3] is the number of distinct memory elements mapping to cache set $s$ accessed in between an access to a memory element that also maps to $s'$ and the previous access to the same memory element, with $\infty$ denoting that there is no prior access:

$$
\forall \zeta \in \mathcal{C}, \forall l, \forall i<j<l, \forall s \in \mathbb{N}, \exists s' \in \mathbb{N}, \exists m_1 \neq m_j, \exists m_l \neq m_j, \forall 0 \leq m < l, \forall m_{l+k-1} \neq m, \exists m_{l+k} = m_l
$$

$$
\text{sd}_s : [m_1, [m_1, \ldots, m_n]] = \begin{cases} 
& |\{m_i \mid i < j < l \land \text{cs}(m_i) = \text{cs}(m_j)\}| \\
& \infty \\
& \text{if } m_i = m_j \\
& \land \forall s < j < l : m_i \neq m_j
\end{cases}
$$

Table I summarizes the definitions used throughout the paper.

### B. Stack Distance Histogram Computation

To derive the number of cache misses, we will later use stack distance histograms [4]–[7] within our performance estimation framework. These stack distance histograms only depend on the number of cache sets and the cache line size (which we assume to be fixed). It is therefore sufficient to compute one stack distance histogram per distinct number of cache sets, irrespective of the associativity of the cache configuration. The computation of several stack distance histograms for different numbers of cache sets can be done in parallel.

Consider for example the following address trace $A, B, B', A', B'', A, B, A', B', B, A''$ where we assume that all addresses map to distinct cache lines. The first 6 accesses are thus cold misses in any cache setup and have a stack distance of $\infty$.

In a fully-associative cache, three accesses have a stack distance of 5 ($A, B, B'$), and two accesses have a stack distance of 4 ($A', A''$). Only the last access to $B$ has a stack distance of 2. In case of the set-associative cache, we assume that all references to $A, A', A''$ map to the same cache set, but to distinct cache lines. The same holds for $B, B', B''$. Moreover, we assume that $A, B$ all map to distinct cache sets. Thus, to determine e.g. the stack distance for $A$, only accesses to $A, A'$, and $A''$ need to be taken into account. The number of cold misses is unchanged, five accesses have a stack distance of 2 ($A, B, A', B', A''$). The last access to $B$ has a stack distance of 1. This example shows that the stack distance can significantly change if the cache setup changes.

### IV. CACHE PERFORMANCE ESTIMATION

System-level DSE requires fast, yet precise estimation of the execution time of the target application on the candidate architecture. Exhaustive evaluation of all cache configurations
using simulation is thus not an option; the simulation runs would consume more time than budgeted for the complete DSE.

We propose a method that relies on a minimal number of cache simulations needed to predict the performance of all candidate cache setups from the set $\mathcal{C}$, and a limited number of trace analyses. We divide the set of candidate setups into $q$ congruent sets $\mathcal{C}_s$, where $q$ is the number of distinct numbers of sets within $\mathcal{C}$:

$$q = |\{s| (s,s) \in \mathcal{C}\}|$$  

(2)

$\mathcal{C}_s$ contains all candidate cache setups with $s$ cache sets:

$$\mathcal{C}_s = \{\zeta| \zeta = (s,s) \in \mathcal{C}\}$$  

(3)

Instead of performing $|\mathcal{C}|$ simulations, our approach relies on the results of only 2 simulations in total (one to extract the address trace and one to estimate the memory overlap), a performance metric explained in detail in Section IV-B) and $q$ trace analyses, which can be performed in parallel.

The first simulation assumes a perfect cache. This simulation provides us with the number of cycles, $\text{cyc}_{pc}$, and the address trace $T$. The second simulation is performed for a reference setup $\zeta' \in \mathcal{C}$, as explained later on.

The $q$ trace analyses are needed to compute a histogram of the stack distances for a given $s$: $H_s : \mathbb{N}_\infty \to \mathbb{N}$

$$H_s(x) = |\{i| \text{sd}_x(m_i,T) = x\}|$$  

(4)

Using the histograms, the number of cold misses for cache setup $\zeta = (k,s)$ is given by:

$$\text{miss}_{\zeta}^{\text{cold}} = H_s(\infty)$$  

(5)

The number of conflict or capacity misses by:

$$\text{miss}_{\zeta}^{\text{warm}} = \sum_{k' \geq k \land k' \neq \infty} H_s(k')$$  

(6)

And the total number of misses is given by:

$$\text{miss}_{\zeta} = \text{miss}_{\zeta}^{\text{cold}} + \text{miss}_{\zeta}^{\text{warm}}$$  

(7)

A. In-Order Execution

In an in-order processor, a cache miss means that the processor stalls until this cache line is retrieved from main memory. This means that the number of cycles required for a given cache setup can be approximated as follows:

$$\text{cyc}_{\zeta} = \text{cyc}_{pc} + \text{miss}_{\zeta}^{\text{warm}}, \gamma$$  

(8)

Where $\gamma$ is the nominal memory latency of the system.

For the sake of simplicity, we assume unrestricted and immediate access to all potentially shared resources such as a bus and only concentrate on the execution cycles of the target application in isolation. Even though this assumption may be considered a strong restriction, it allows us to focus on the problem at hand and is common in the related work.

B. Out-Of-Order Execution

Out-of-order processors do not stall in case of a cache miss, but execute other instructions while waiting for data. To fully exploit the advantages of out-of-order execution many processors also allow multiple outstanding memory requests to be handled simultaneously. Consequently, the effective memory latency $\hat{\gamma}$ is often significantly smaller than the actual time to transport data from main memory to the cache; and the number of cache misses alone is not sufficient to estimate the performance anymore.

1) Illustrated Example: We illustrate the effect of out-of-order execution on the memory latency with an example depicted in Figure 1. We assume two cache configurations: Configuration 1 is a direct mapped cache with 2 sets; Configuration 2 has an associativity of 2 and also 2 sets. The second cache is thus twice as large as the first one.

We consider the address given in Section III-B and we assume again that all letters map to pairwise distinct cache sets, but all the accentted letters map to different cache lines (but the same cache set). The cache is initially cold, i.e., empty, which means that the first $6$ accesses in this trace are all misses for all cache configurations. The trace with corresponding hits, misses and cycle counts is shown in Figure 1.

The memory latency is 100 cycles and two outstanding memory requests can be overlaid with a combined memory latency of 150 cycles (instead of 200 for an in-order processor). Note that this is an oversimplification purely for the purpose of the example. For the sake of simplicity, we also assume that if two memory requests overlap they have to finish before another request can start.

Although Figure 1 shows the entire address trace, for this example we will only focus on the final part of the trace, since the first 6 accesses are all cold misses. The processor with the direct mapped cache has 6 cache misses and spends a total of 450 cycles waiting on memory. The average memory latency is thus 75 cycles. The processor with the 2-way set associative cache has only 5 cache misses but waits for a total of 400 cycles for data. The average memory latency is thus 80 cycles.

The example illustrates the imprecision of Eq. (8) when applied to out-of-order processors. The nominal latency $\hat{\gamma}$ is 100 cycles while we observe values of 75 and 80 cycles instead. The naive estimation of the execution cycles is thus not valid for DSE. Instead of the simple one-to-one mapping of the number of misses and the execution time, the average memory latency depends on the actual cache configuration.

2) Effective Memory Latency: As Eq. (8) is restricted to in-order processors, we propose an adapted version for out-of-order execution. We multiply the number of warm misses $\text{miss}_{\zeta}^{\text{warm}}$ of cache configuration $\zeta$ with the effective memory latency $\hat{\gamma}$ instead of the nominal latency $\gamma$:

$$\text{cyc}_{\zeta} = \text{cyc}_{pc} + \text{miss}_{\zeta}^{\text{warm}}, \hat{\gamma}$$  

(9)

Unfortunately, the effective memory latency is not constant, but depends on the cache configuration as we have seen in the example.

In the following, we show how we can compute an approximation of execution cycles $\text{cyc}_{\zeta'}$ for cache configuration $\zeta'$ based on the memory overlap, a metric which we compute using a reference cache configuration $\zeta'$. We first compute the miss-rate of the warm misses for the reference configuration $\zeta'$:

$$\text{miss}_{\zeta'}^{\text{rate}_{\text{warm}}} = \frac{\text{miss}_{\zeta'}^{\text{warm}}}{n}$$  

(10)

where $n$ is the size of the address trace, i.e., the number of memory accesses, and the average number of cycles per
Our framework to estimate the cache performance consists of two phases, the setup and the DSE phase, as shown in Fig. 2. Within the setup phase, we perform the two simulations of the target application. One simulation assuming a perfect cache, and one assuming the minimal (i.e. reference) cache configuration $\zeta_{\min}$. These runs provide us with the number of execution cycles for these two cache setups ($\text{cyc}_{\text{pc}}, \text{cyc}_{\zeta_{\min}}$) and the address trace. The address trace is then input to our stack histogram computation, where we derive a stack histogram for each set $C_s$ (see Eq. (3)).

The last step within the setup phase consists of calculating the miss statistics for the reference cache setup $\zeta_{\min}$, i.e., the miss-rate $\text{miss}_{\zeta_{\min}}$ and the number of cycles of $\text{cyc}_{\zeta_{\min}}$ for the perfect cache $\zeta_{\zeta_{\text{pc}}}$, and the memory overlap $\zeta_{\zeta_{\text{mem}}}$ for the minimal configuration. This information is sufficient to estimate the number of cycles $\text{cyc}_{\zeta}$ for any cache configuration we are interested in within negligible time overhead using Eq. (14).

### Example Revisited

3) Example Revisited: We now revisit the example depicted in Fig. 1 to illustrate the presented concepts. The figure shows that a run of the simulator with a perfect cache takes 450 cycles and the run of the simulator with a direct mapped cache takes 900 cycles (i.e., $\text{cyc}_{\zeta_{\min}}$). We see using Eq. (5) and (6) that the execution with configuration 1, direct mapped, results in 6 cold misses and 6 warm misses. Thus, the warm miss-rate as given by Eq. (10) is $\frac{6}{12} = \frac{1}{2}$. Using the miss-rate, the number of cycles obtained from simulation (i.e., $\text{cyc}_{\zeta_{\min}}$) and Eq. (11), we calculate the fraction of cycles associated with the warm misses: $900 \cdot \frac{1}{2} = 450$. Furthermore, we calculate the error using Eq. (12): $(450 + 6 \cdot 100) - 900 = 150$ and the memory overlap as described in Eq. (13): $\frac{150}{450} = \frac{1}{3}$. By combining the memory overlap and the number of cycles of the perfect cache, we calculate $\text{cyc}_{\zeta}$ for cache configuration 1 using Eq. (14) as follows $\text{cyc}_{\zeta} \approx \frac{450 + 6 \cdot 100}{1 + \frac{1}{3}} = 900$ which provides an exact estimate (which was to be expected as we have used configuration 1 as reference configuration).

Using the same stack distance histogram, we also calculate that setup 2 results in 6 cold misses, 5 warm misses and 1 hit. Using Eq. (10), we get the following miss-rate $\frac{miss}{miss}_{\zeta_{\min}} = \frac{5}{12}$ and using Eq. (14), we estimate the number of cycles as follows: $\text{cyc}_{\zeta} \approx \frac{450 + 5 \cdot 100}{1 + \frac{1}{3}} \approx 834$, which is only off by 16 cycles (error of less than 1.9%). Using the naive approach from Eq. (8), we get the following estimate $\text{cyc}_{\zeta} = 450 + 5 \cdot 100 = 950$ which is 100 cycles off (error rate of 11.7%).

### Framework

V. Evaluation

In this section, we evaluate the effectiveness of our performance estimation framework. We focus in particular on (i) the accuracy of our predictions and the improvement with respect to the naive estimation from Eq. (8) (both in terms of predicted misses and predicted cycles), (ii) the fidelity in the predicted order of candidate configurations, and (iii) the speed-up compared to the use of cycle-accurate simulation.

The design space, i.e., the set of candidate cache configurations is as follows. We assume a line size of 32 bytes. The size of the cache ranges from 1 to 32 kB with a range of the
We calculate the error using the cache set-up. We therefore first evaluate the number of misses of-order execution. All simulations have been performed on an a 32-bit ARMv7-A with a 5-stage pipeline which supports out-to-evaluate the precision of our estimates. The simulator models the cycle counts for all benchmarks and all cache configurations (for the perfect and minimal cache configuration) and provides simulation within the framework (i.e. to obtain the cycle counts [19] instruction set simulator (ISS) serves as the cycle-accurate results. This means that results for the missing benchmarks lie in between those presented in the graphs. The simulator models a 32-bit ARMv7-A with a 5-stage pipeline which supports out-of-order execution. All simulations have been performed on an Intel QuadCore 17-2600.

A. Misses

Out-of-order execution may influence the actual address trace, such that the order of accesses depends on the actual cache set-up. We therefore first evaluate the number of misses predicted by our method (which is based on the address trace of the perfect cache) compared to the actual number of misses derived by the simulation (assuming the specific cache setup).

We calculate the error using $error = 1 - \frac{miss_{pred}}{miss_{act}}$, where $miss_{pred}$ is the number of misses predicted by the trace from the perfect cache. The average error is shown in Figure 3. The figure shows that the average error in the number of misses of our estimation framework is below 5%. It also shows that the I-cache is more strongly influenced by the out-of-order behaviour than the D-cache. We note that in an in-order processor, the order of the memory operations would not have changed and we would have an error of 0%.

B. Validity

In Figure 4, we have visualized the average error in our predicted number of cycles compared to the actual number of cycles obtained using simulation. For readability we have omitted the error bars from Figure 4, which would have shown an average standard deviation of 2%.

We calculate the error using $error = 1 - \frac{cyc_{pred}}{cyc_{act}}$, where $cyc_{pred}$ is the number of cycles we predict. We have determined the memory overlap using the smallest cache configuration ($c_{min}$) from our set of candidate configurations. This is a 1 kB cache with associativity 1, i.e., direct mapped.

The figure shows that for most benchmarks our predictions are close to the obtained values using simulation: From the 66 scenarios (of which only 34 are shown in the graphs due to the limited space), only 4 exhibit an average error of more than 5% and the majority (61) has an average error of below 3%. This indicates that our framework yields good results in predicting the number of required cycles.

The line graphs in Figure 4 show the improvement of our predictions compared to the naive approach from Eq. (8) on the secondary y-axis. From these results we can see that our approach is on average a factor 3.95 better for the prediction of the D-cache and a factor 6.95 for the I-cache.

The bars in Figure 4 show a large error of our predictions for two benchmarks (matmult and nsichneu). We have observed for these two benchmarks significant performance differences for the minimal cache setup ($c_{min}$) compared to other cache setups (not shown). This results in a stark under- or over-approximation of the memory overlap, which in turn results in a larger error of our predictions. However, our framework still improves over the naive approach.

Note that we are agnostic to the target architecture and did not perform an in-depth analysis of the hardware. We only require the availability of a cycle-accurate ISS to predict the number of required cycles.

C. Fidelity

As we have seen, our estimates are very close to the actual number of misses and execution cycles, but not exact. However, for early DSE it is often equally important to preserve the fidelity than to have precise predictions [20]. To show the fidelity of the results we use two common metrics: Spearman’s $\rho$ [21] and Kendall’s $\tau$ [22]. In short, Spearman’s $\rho$ compares the rank of elements between two datasets. Where a value of 1 means the ranks match and a value of -1 means the ranks are reversed. Kendall’s $\tau$ compares the relation of one item in the set to another. Again a value of 1 means the ordering matches and a value of -1 means a reverse ordering.

For the I-cache, the average value of Spearman’s $\tau$ for the whole benchmark suite is 0.994 with a standard deviation of 0.004. The average of Kendall’s $\rho$ is 0.983 with a standard deviation of 0.017. For the D-cache, the average of Spearman’s $\rho$ is 0.956 with a standard deviation of 0.038. The average of Kendall’s $\tau$ is 0.924 with a standard deviation of 0.062.

Our method is an approximation and is subject to an error margin, as can be seen from the figures. For cache setups that
perform within 1% of each other, our method can therefore not provide a proper judgement. This, however, is acceptable for early DSE as other design-criteria (such as hardware costs or energy consumption) are likely to outweigh such a minor difference. To take this into account, we also investigated the effect of allowing a 1% error in the number of calculated cycles in the calculation of the fidelity metrics. This means we allow shuffling of the ordering if the results are within a 1% range of each other. For the I-cache prediction, this improves Spearman’s ρ to an average of 0.999 with a standard deviation of 0.003 and Kendall’s τ to an average of 0.990 with a standard deviation of 0.012. The D-cache experiments show similar behaviour with Spearman’s ρ improved to an average of 0.991 with a standard deviation of 0.015. Kendall’s τ is improved to an average of 0.954 with a standard deviation of 0.042. These results indicate that there are several cache configurations with very similar performance.

D. Speed-up

Finally, we evaluate the execution time of our approach compared to exhaustive simulation using gem5. We therefore compare the total execution time of the simulation of the 36 cache setups with the execution time of our framework, including both the setup-phase (executed once in total) and DSE-phase (executed once for each cache setup). In Figure 5, the speed-up of our approach is shown per benchmark. We have measured speed-ups between 6x and 23x with an average of 15x. This shows that for all benchmarks, we are significantly faster than a complete simulation of all cache configurations.

Figure 6 visualizes the point at which the overhead of the setup-phase amortizes and our approach is faster than simulation. The dotted blue line shows the normalized execution time when exploring cache configurations using simulation. The line with the upward pointing arrows represents the normalized time our framework requires for the benchmark fir when predicting D-cache, i.e., the minimal speed-up. The line with the downward pointing arrows represents the benchmark lms when predicting I-cache, i.e., the maximum speedup. The break-even points show the number of cache configurations where the prediction using our framework and using the simulations require the same execution time. The break-even points are at 6 cache configurations in case of fir and D-cache exploration, and at 2 cache configurations in case of lms and I-cache exploration. The average break-even point lies at 2.37, which is to be expected as our framework requires two runs of the simulator and limited additional overhead to compute the metrics.

VI. CONCLUSION

In this paper, we have presented a framework to estimate the performance of systems with out-of-order processors using cache. Our framework consists of two phases: a setup and a DSE phase. The setup-phase performs two cycle-accurate simulations and a trace analysis to extract relevant information metrics, after which the DSE-phase then derives accurate performance estimations of the target application for architectures with different cache setups within negligible execution time. Evaluation has shown that the framework is precise both in terms of estimated number of cache misses (average error of less than 4%) and estimated execution cycles (average error of less than 3.5%). Our framework improves upon a naive estimation based purely on the number of misses by a factor of 3.95 in case of I-caches and by a factor of 6.95 in case of D-caches. The fidelity in the ordering of candidate cache setups is near optimal with an error of around 5%. We furthermore have observed an average speed-up of 15x compared to cycle-accurate simulation, which represents the only precise alternative. Our framework thus enables quick yet accurate performance estimation for out-of-order processors with caches to be used for system-level DSE.

ACKNOWLEDGEMENTS

This work was partially funded by the NWO SES Project Nr. 647.000.002 and the ICT COST Action IC1202 TACLe.

REFERENCES