HERCULES will develop an integrated framework to achieve predictable performance on top of low-power multi-core heterogeneous platforms. To that aim, a set of design decisions has already been made at this stage to maximize the impact and exploitation potential of the project outcomes. We outline these decisions below.
Commercial-off-the-shelf Heterogeneous Platforms
Real-time application developers have always struggled to promote the design, production and adoption of new predictable hardware platforms that are easier to program, schedule and analyse, especially from a timing perspective. Unfortunately, real-time systems have mostly been based on embedded architectures that were not (or at least not exclusively) designed to address the predictability and analysability requirements of time-critical applications. The few platforms designed to be fully timing-analysable quickly became obsolete as process technologies advanced.
With the rising cost of manufacturing technologies, these constraints have become even tighter. Spending the considerable amount of money required to design domain-specific computing platforms is not economically viable, unless cheaper but older, and therefore lower-performing, manufacturing technologies are used. This is essentially why this project will not devise new hardware solutions to achieve its predictability targets. Instead, it will use embedded COTS platforms with state-of-the-art technologies, performance and power consumption.
Parallel programming model
The tremendous potential of heterogeneous platforms in terms of performance and power efficiency comes at the cost of increased programming complexity, which hinders the programmability of these platforms. Nowadays, it is widely accepted that heterogeneous integration is key to attacking the technology and utilization walls at all computing scales.
For the acceleration of general-purpose, computation-intensive programs, a widely adopted approach to simplify programming is to use compiler directives on top of standard languages from the uniprocessor domain, such as C, to express parallelism over heterogeneous computing resources. Unfortunately, all the parallel programming approaches that address heterogeneous systems exhibit certain limitations when compared to the objectives of the HERCULES project. For example, the constructs for accelerator exploitation are tailored to the characteristics of today's GPUs, emphasizing data-level parallelism and copy-based host-to-device communication.
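As a purely illustrative sketch, the following C fragment shows how such directive-based annotations express data-level parallelism and make the copy-based host-to-device communication explicit. OpenMP's accelerator directives are used here as one widely adopted instance of this approach, and the saxpy kernel with its parameters is a hypothetical example, not a HERCULES deliverable.

#include <stddef.h>

/* Offload a data-parallel kernel to the default accelerator device.
 * The map() clauses spell out the copy-based host-to-device traffic:
 * x is copied in, y is copied in and copied back. */
void saxpy(float *x, float *y, float a, size_t n)
{
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp teams distribute parallel for
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

The construct maps naturally onto a GPU-style device, but it offers little control over when data is moved or how contention with other types of accelerators is resolved.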
HERCULES will instead look at an offload model capable of exploiting different types of accelerators (not only GPUs but also DSP clusters, FPGA modules and many-core accelerators). HERCULES will not propose new programming models to deal with the issues described above, but will build upon existing and widely adopted interfaces to maximize the impact of the framework.
Towards data-centric scheduling
Energy consumption of modern computing architectures is ever more driven by moving data between memory and the computing cores rather than by computing on data within the cores. Current embedded systems have strong constraints on power consumption deriving from the application requirements: it is therefore of paramount importance to limit data movement between the cores and the different memory levels so as to fit within the strict power cap required.
Observing the hardware configuration of modern low-power multi-core processors for the embedded market, it can be noted that they feature very limited on-chip cache sizes. The reason is that memory replication and coherency mechanisms lead to data being repeatedly bounced between the various memory levels, causing significant power consumption. While such mechanisms may help improve the average-case performance of general-purpose systems, they are detrimental to the power efficiency and predictability of embedded applications, as they rely on power-hungry hardware mechanisms that cannot be modified via software.
Project HERCULES will implement state-of-the-art memory-centric co-scheduling mechanisms and execution models that jointly consider memory accesses and data processing to minimize the movement of data within the platform. However, the main reason to implement memory-centric scheduling algorithms in HERCULES is not power, but predictability. Real-time applications require bounding (and guaranteeing) the worst-case execution time (WCET) of time-critical activities; average-case performance is less important. To this end, it is necessary to profile and upper-bound the memory and communication delays experienced by each task. When access to shared resources is not properly orchestrated, significant pessimism is introduced in the timing analysis due to the possible occurrence of repeated memory conflicts and unbounded bus contention. The figure below shows a typical distribution of the execution times of a task: while most instances may have relatively small execution times, the WCET may be considerably higher than the average-case performance.
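To make the kind of execution model targeted here more concrete, the following C sketch shows a simplified, PREM-like phased task, a common memory-centric pattern from the literature used here purely as an assumed illustration; all names, sizes and the mutex standing in for a memory-access scheduler are hypothetical and do not represent the HERCULES design.

#include <pthread.h>
#include <string.h>

#define WSET_SIZE 4096   /* size of the task's working set, chosen arbitrarily */

static pthread_mutex_t mem_phase_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct {
    const char *input;        /* data residing in shared DRAM            */
    char local[WSET_SIZE];    /* core-local copy (e.g. scratchpad memory) */
} task_ctx_t;

/* Memory phase: prefetch the working set into local memory.
 * Serializing these phases bounds contention on the shared interconnect. */
static void memory_phase(task_ctx_t *ctx)
{
    pthread_mutex_lock(&mem_phase_lock);
    memcpy(ctx->local, ctx->input, WSET_SIZE);
    pthread_mutex_unlock(&mem_phase_lock);
}

/* Compute phase: operates only on the prefetched copy, so it runs free of
 * DRAM contention and admits a tighter, less pessimistic WCET bound. */
static void compute_phase(task_ctx_t *ctx)
{
    for (size_t i = 0; i < WSET_SIZE; i++)
        ctx->local[i] ^= 0x5A;
}

void run_task(task_ctx_t *ctx)
{
    memory_phase(ctx);
    compute_phase(ctx);
}

Because memory accesses are confined to an explicit, schedulable phase, a co-scheduler can decide when each task is allowed to touch shared memory, which is what makes the memory and communication delays amenable to upper-bounding.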
Interestingly, there is a peculiar convergence between predictability, energy efficiency and performance. Properly orchestrating memory accesses reduces cache misses. This reduces (the variability of) memory-related timing delays and at the same time improves energy efficiency, by avoiding data being repeatedly moved around the chip. HERCULES intends to exploit this convergence to satisfy the increasing demands of future real-time applications by enforcing a smart orchestration of data movements within and between SoC components.