San Jose State University
San Jose, CA 95192
Parallel processing is becoming more common, and with the emergence of new processing units it is worth asking how efficient our current hardware is under this programming paradigm. Most computers now have multi-core processors and can run in parallel to some extent. We examine how performance compares among the processing units available today and how portable parallel programs are across their architectures. We do this by examining the task scheduling and portability efficiency of these processing units. The processing units we evaluate are central processing units, graphical processing units, and accelerated processing units.
Traditionally, computers had a single-core central processing unit (CPU) and ran programs sequentially. In recent years we have exhausted much of the potential of this hardware architecture and have since moved to multi-core and many-core systems. Commercial CPUs have been around since the mid-1900s, a time when parallel processing was not the conventional programming paradigm. With the advancement of processor technology, and in an attempt to speed up algorithms, parallel programming has emerged as a new paradigm. To exploit it, some companies, such as AMD and NVIDIA, have engineered new types of processing units designed specifically for parallel workloads. The processing units currently available are multi-core CPUs, graphical processing units (GPUs), and accelerated processing units (APUs).
Multi-core CPUs are components with two or more independent CPU cores, each of which performs basic arithmetic, logical, and input/output operations and moves data using registers. A related special-purpose processor is the physics processing unit (PPU), first released by the company Ageia. PPUs were initially designed to accelerate physics calculations such as particle systems, transformations, and collisions. GPUs, also called visual processing units (VPUs), are designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display. In this respect they resemble the PPU: both offload specialized computation from the CPU. The GPU architecture was designed to perform computationally intensive transform and lighting calculations. It performs these outside the CPU, reducing latency by distributing the computations across many small cores that run in parallel. This hardware feature makes GPUs very popular for parallel processing and has given rise to general-purpose graphical processing units (GPGPUs). AMD wanted to take advantage of this architecture and created a hybrid processing unit it calls an accelerated processing unit (APU), also known as an advanced processing unit. The APU serves as the computer's main processing unit and is designed to accelerate computations outside of the CPU cores while still sharing memory with them. These new processing units are on the rise and are developing a large following, but how well do they compare against one another?
In this section we look at the similarities and differences among the currently available processing units. Multi-core CPU hardware designs vary: a chip can consist of a single core design replicated on a die, or it can have several different cores, each optimized for specific tasks.
Figures 1.1 and 1.2 illustrate dual-core (1.1) and quad-core (1.2) multi-core CPU architectures.
GPU hardware uses a similar design; however, it devotes more transistors to its many smaller cores and their arithmetic logic units, allowing it to run more threads of computation on a chip of the same size. Each core therefore has a smaller cache than a CPU core, since more transistors are dedicated to computation and data processing than to data caching or flow control. Because of this parallel architecture, some GPGPUs are used in place of CPUs for intensive computations. However, because of the GPU's architecture, many control and serial instructions still have to be performed by a CPU.
The figure above shows a GPU hardware architecture.
This brings us to the APU, which places a GPGPU, or a similar specialized processing unit, and a CPU on the same die in order to reduce the overhead of data transfers between the two units and to reduce power consumption. This architecture combines the control and serial data processing of a CPU with the parallel data processing and display computations of a GPU, improving performance while remaining flexible for both programs that are designed to run in parallel and programs that are not.
The figure above shows 4 conceptual APU architectures.
To measure the performance of each processing unit, we must first understand its limits so that we can test it accordingly. Since GPUs have smaller cache memory, we must make sure the data fits within the capacity of the GPU memory. When programming in parallel on a classic von Neumann CPU architecture, we must remember that this design creates a bottleneck on data transfer that limits throughput. We must also consider the overhead of data transfers between the CPU and GPU in CPU-GPU collaborative environments. APUs share main memory between the two units but still incur some overhead because of differences in how each unit interprets data during transfer.
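The two constraints described above, device memory capacity and transfer overhead, can be made concrete with a small back-of-the-envelope model. The Python sketch below is only illustrative: the function names, the idea of splitting data into device-sized chunks, and the simple "copy in, compute, copy out" cost model are assumptions made for this example, not measurements or methods from the work surveyed here.

```python
def fits_on_device(data_bytes: int, device_mem_bytes: int) -> bool:
    """Return True if the whole data set fits in device (e.g. GPU) memory."""
    return data_bytes <= device_mem_bytes


def chunk_sizes(data_bytes: int, device_mem_bytes: int) -> list:
    """Split a data set into chunks that each fit in device memory.

    Hypothetical helper: every chunk is at most device_mem_bytes long.
    """
    if device_mem_bytes <= 0:
        raise ValueError("device memory must be positive")
    full, rest = divmod(data_bytes, device_mem_bytes)
    return [device_mem_bytes] * full + ([rest] if rest else [])


def offload_time(data_bytes: int, bandwidth_bytes_per_s: float,
                 compute_seconds: float) -> float:
    """Estimate total time for an offloaded task under a simple model.

    Assumes the data is copied to the device and the same amount is copied
    back, so the transfer cost is paid twice on top of the compute time.
    """
    transfer_seconds = 2 * data_bytes / bandwidth_bytes_per_s
    return transfer_seconds + compute_seconds
```

Under this model, offloading only pays off when the estimate from `offload_time` is lower than the time the CPU would need on its own, which is why small or low-intensity workloads may be better left on the CPU.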
To compare the performance of these processing units, we must use a computing model that suits them all. The APU must be treated as a separate model because of the nature of its architecture, which requires a specific algorithm implementation. This stems from how computing tasks and communication tasks are scheduled onto the different hardware: the GPU or CPU is used for computing tasks, and the CPU controls the input and output data of those tasks. A large body of research on separate CPUs and GPUs indicates that the GPU has a great advantage over the CPU in solving compute-intensive problems. In most CPU-GPU collaborative environments the two units therefore rely on one another, with the CPU acting as the master that distributes tasks and executes some of them, while the GPU only helps execute tasks. Further research has shown that lower-intensity calculations should be done by the CPU and higher-intensity calculations should be mapped to the GPU. Even with the communication overhead of data transfers, the collaborative environment outperforms the CPU-only and GPU-only architectures. Since APUs are so new and specific to AMD processors, there is very little research, if any, on this type of processor.
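The intensity-based mapping described above can be sketched as a simple threshold rule. The following Python example is a minimal illustration, not the scheduling algorithm of any cited work: the `Task` type, the numeric `intensity` field, and the `threshold` parameter are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    intensity: float  # hypothetical measure of how compute-heavy the task is


def schedule(tasks, threshold):
    """Map low-intensity tasks to the CPU and high-intensity tasks to the GPU.

    The CPU plays the master role here: it decides where each task runs.
    Tasks at or above the (assumed) threshold are sent to the GPU.
    """
    plan = {"cpu": [], "gpu": []}
    for task in tasks:
        unit = "gpu" if task.intensity >= threshold else "cpu"
        plan[unit].append(task.name)
    return plan


if __name__ == "__main__":
    tasks = [Task("parse input", 1.0), Task("matrix multiply", 50.0)]
    print(schedule(tasks, threshold=10.0))
```

A real scheduler would also weigh the data transfer overhead of sending a task to the GPU; a fixed threshold is used here only to keep the idea visible.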
The figure above shows parallel architectures with one CPU and one GPU.
In this section we look at the impact of programming paradigms on parallel processing units. The oldest programming paradigm in use is the imperative programming paradigm (PP). Most computer architectures are modeled after the von Neumann model, and since its conception the imperative PP has been used directly and indirectly. Today we have a number of different paradigms, including object-oriented, quantum, and others, many of which owe much to the predecessors on which they were modeled. With the advancement of different paradigms we have also seen the birth of high-level programming languages that try to take advantage of parallel processing, such as OpenCL and CUDA. High-level languages help reduce critical programming errors and are easier for programmers to read and debug. These PPs still inherit elements of the imperative PP and of other abstraction models. All programs must ultimately be translated into machine instructions so that the hardware understands what computations it is being told to execute. This makes it difficult to completely abandon previous PPs and slows the development of new paradigms and languages that focus on parallel processing without being entirely architecture specific. Nevertheless, many languages now incorporate multi-threading and other elements of parallel programming in order to take advantage of the advances in technology. The burden still falls on the programmer, who must explicitly tell the processing units how to distribute the data, which makes debugging more difficult and more prone to human error.
3.4 Portability with OmniDB
Until now, parallel processing on CPU, GPU, CPU-GPU, and APU architectures, ranging from desktop to mobile devices, has been architecture specific. One proposed approach is a kernel-adapter based design known as OmniDB, which can be implemented across the different architectures. While this design addresses cross-platform implementation, it runs into some unique problems. For instance, in CPU-only and GPU-only architectures the parallel programming elements (PPEs) each have their own memory, which may or may not overlap with that of other PPEs. In CPU-GPU environments the CPU and GPU each have their own memory, whereas in the APU architecture the two PPEs share main memory. To bridge these architectural differences and make the approach fully cross-platform, the kernel-adapter design first determines the resources of the given architecture through an architecture-aware query and an adapter. The adapter then allows the kernel to optimize data distribution according to the available resources. Where to draw the boundary between the kernel and the adapters remains an open problem and a subject for further research.
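To make the kernel-adapter idea more tangible, the Python sketch below separates an architecture-independent kernel from an adapter that decides how data is distributed. It is a toy illustration only and does not reproduce OmniDB's actual interfaces: `Architecture`, `query_architecture`, `Adapter`, `kernel_sum`, and the relative speed figures are all hypothetical and chosen purely for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Architecture:
    name: str
    shared_memory: bool      # True when the PPEs share main memory (as in an APU)
    relative_speeds: dict    # assumed relative throughput of each PPE


# Hypothetical lookup table standing in for an architecture-aware query.
_KNOWN = {
    "cpu-only": Architecture("cpu-only", True, {"cpu": 1.0}),
    "cpu-gpu": Architecture("cpu-gpu", False, {"cpu": 1.0, "gpu": 3.0}),
    "apu": Architecture("apu", True, {"cpu": 1.0, "gpu": 3.0}),
}


def query_architecture(kind: str) -> Architecture:
    """Return a description of the resources of the given architecture."""
    return _KNOWN[kind]


class Adapter:
    """Architecture-specific part: decides how data is distributed."""

    def __init__(self, arch: Architecture):
        self.arch = arch

    def needs_copy(self) -> bool:
        # Separate memories mean data must be transferred between PPEs.
        return not self.arch.shared_memory

    def partition(self, n_items: int) -> dict:
        """Split n_items among the PPEs in proportion to their assumed speed."""
        units = list(self.arch.relative_speeds.items())
        total_speed = sum(speed for _, speed in units)
        shares, assigned = {}, 0
        for unit, speed in units[:-1]:
            count = int(n_items * speed / total_speed)
            shares[unit] = count
            assigned += count
        # The last unit takes whatever remains so that every item is assigned.
        shares[units[-1][0]] = n_items - assigned
        return shares


def kernel_sum(data, adapter: Adapter):
    """Architecture-independent kernel: sums each partition, then combines."""
    total, start = 0, 0
    for _, count in adapter.partition(len(data)).items():
        total += sum(data[start:start + count])
        start += count
    return total
```

The kernel never inspects the architecture directly, so the same `kernel_sum` gives the same result for `"cpu-only"`, `"cpu-gpu"`, and `"apu"`. Only the adapter changes, which is one possible way to draw the kernel-adapter boundary discussed above.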
Naturally, two heads are better than one, and this was the thinking behind the development of multi-core and many-core systems. Single-core CPUs have reached the limit of their peak performance and efficiency, a limit commonly known as the heat-and-power wall. A single-core CPU can no longer meaningfully outperform another single-core CPU of the same caliber. With the help of a GPU, a CPU can reduce its energy consumption and heat. However, a GPU is slow at basic arithmetic computations when compared to a CPU, which leaves the CPU best suited for sequential and low-intensity computations. Flow control and input/output operations are also still handled by the CPU, which requires a larger cache to do so, so the CPU remains a valuable part of computer architecture. While a CPU can perform parallel processing through task parallelism, it requires threads to be explicitly defined. A GPU, on the other hand, performs parallel processing through data parallelism, in which threads are managed and scheduled by the hardware. Overall, the CPU-GPU collaborative environment is currently the most efficient for parallel processing, with the APU having the additional benefit of a much smaller data transfer overhead thanks to shared main memory.
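The difference between task parallelism and data parallelism can be illustrated in a few lines of Python. This is a sketch under stated assumptions: it uses the standard library's `concurrent.futures` thread pool on a CPU, so the pool's software scheduler only stands in for the hardware thread scheduling a GPU would provide, and the helper functions are hypothetical examples.

```python
from concurrent.futures import ThreadPoolExecutor


def total(values):
    return sum(values)


def largest(values):
    return max(values)


def square(x):
    return x * x


def task_parallel(values):
    """Task parallelism: two different tasks, each submitted explicitly."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_sum = pool.submit(total, values)    # task 1
        future_max = pool.submit(largest, values)  # task 2
        return future_sum.result(), future_max.result()


def data_parallel(values):
    """Data parallelism: one operation applied to every element.

    The programmer only names the operation; the pool decides which
    worker handles which element.
    """
    with ThreadPoolExecutor() as pool:
        return list(pool.map(square, values))


if __name__ == "__main__":
    print(task_parallel([1, 2, 3]))
    print(data_parallel([1, 2, 3]))
```

In the task-parallel version the programmer names each unit of work, which mirrors the explicit thread definition required on a CPU. In the data-parallel version the distribution of elements is left to the runtime, which is closer in spirit to how a GPU schedules its threads.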
In conclusion, new processing units are always being developed, along with programming languages and paradigms that take advantage of the available technology. We are currently in a time when our technology is cheap and capable of great performance, but our paradigms are not designed to tap the true potential of what we have created. We still hold on to the sequential programs we have grown accustomed to and try our best to adapt them to more parallel architectures, in the hope of making leaps instead of steps into the next level of computer science. Many advancements still await us in the next couple of centuries, from mobile portability of parallel processing to the untapped potential of GPUs and APUs. We may even find ourselves developing yet another processing unit capable of doing everything our current ones can and much more. In future work we will look at multi-core APU performance.
Accelerated Processing Unit
Wang, Lei. Task Scheduling of Parallel Processing in CPU-GPU Collaborative Environment. International Conference on Computer Science and Information Technology 2008.
Zuhri Zuhud, Daeng A. From Programming Sequential Machine to Parallel Smart Mobile Devices: Bringing Back the Imperative Paradigm to Today's Perspective. 8th International Conference on Information Technology 2013. http://ieeexplore.ieee.org.libaccess.sjlibrary.org/stamp/stamp.jsp?tp=&arnumber=6637580&tag=1