ACE Tutorials Week-5
Q.1
Let α be the percentage of program code that can be executed simultaneously by n processors in a computer system. Assume that the remaining code must be executed sequentially by a single processor. Each processor has an execution rate of x MIPS.
-
Derive an expression for the effective MIPS rate when using the system for exclusive execution of this program, in terms of n, α, and x.
-
If n = 16 and x = 4 MIPS, determine the value of α that will yield a system performance of 40 MIPS.
a. Effective MIPS rate = [nα + (1 – α)]x = (nα – α + 1)x
b. 40 = (16α – α + 1) × 4, so 40 = 60α + 4, giving α = 0.6
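As a quick numerical check, the expression from part (a) can be evaluated directly and part (b) solved for α. This is a minimal C sketch; the function name effective_mips and the variable names are illustrative only.

```c
#include <stdio.h>

/* Effective MIPS rate from part (a): [n*alpha + (1 - alpha)] * x */
static double effective_mips(int n, double alpha, double x)
{
    return (n * alpha + (1.0 - alpha)) * x;
}

int main(void)
{
    int n = 16;
    double x = 4.0, target = 40.0;

    /* Part (b): solve target = (n*alpha - alpha + 1) * x for alpha */
    double alpha = (target / x - 1.0) / (n - 1.0);

    printf("alpha = %.2f\n", alpha);                            /* 0.60      */
    printf("check: %.1f MIPS\n", effective_mips(n, alpha, x));  /* 40.0 MIPS */
    return 0;
}
```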
Q.2
A multiprocessor with eight processors has 20 attached tape drives. There are a large number of jobs submitted to the system that each require a maximum of four tape drives to complete execution. Assume that each job starts running with only three tape drives for a long period before requiring the fourth tape drive for a short period toward the end of its operation. Also assume an endless supply of such jobs.
-
Assume the scheduler in the OS will not start a job unless there are four tape drives available. When a job is started, four drives are assigned immediately and are not released until the job finishes. What is the maximum number of jobs that can be in progress at once? What are the maximum and minimum number of tape drives that may be left idle as a result of this policy?
-
Suggest an alternative policy to improve tape drive utilization and at the same time avoid system deadlock. What is the maximum number of jobs that can be in progress at once? What are the bounds on the number of idling tape drives?
-
If this conservative policy is used, at most ⌊20/4⌋ = 5 jobs can be in progress at once. Because each job uses only three of its four assigned drives for most of its run, up to 5 drives (one per job) may be idle at a time. In the best case, when every job is in its final phase, none of the drives are idle.
-
To improve drive utilization while still avoiding deadlock, each job can initially be allocated three tape drives, with the fourth drive allocated only on demand. With this policy, at most ⌊20/3⌋ = 6 jobs can be in progress at once; the 2 remaining drives guarantee that at least one pending request for a fourth drive can always be granted, so some job can always run to completion and deadlock cannot occur. The minimum number of idle drives is 0 and the maximum is 2.
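A minimal C sketch of the arithmetic behind these bounds (the variable names are illustrative, not from the original problem):

```c
#include <stdio.h>

int main(void)
{
    int drives  = 20;
    int initial = 3;   /* drives each job holds for most of its run  */
    int peak    = 4;   /* drives each job needs briefly near the end */

    /* A job is started only if its three initial drives are free. */
    int jobs = drives / initial;              /* floor(20/3) = 6 */

    /* Most drives idle: every job is still in its 3-drive phase. */
    int max_idle = drives - jobs * initial;   /* 20 - 18 = 2 */

    /* Fewest drives idle: the spare drives have been claimed as
     * fourth drives, so the idle count cannot drop below zero. */
    int min_idle = drives - jobs * peak;
    if (min_idle < 0) min_idle = 0;           /* 0 */

    printf("jobs=%d, idle drives: min=%d max=%d\n", jobs, min_idle, max_idle);
    return 0;
}
```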
Q.3
Consider the following two versions of a program to add two vectors:
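The original program listings are not reproduced here. The C sketch below shows one plausible shape of the two versions: the statements assumed to correspond to lines L2, L4, and L6 of the uniprocessor program are marked in comments, and the DOALL construct of the parallel version is indicated only by a comment, since its outer loop is meant to be executed by M processors in parallel rather than sequentially.

```c
/* Uniprocessor version (left). Only the three marked statements
 * are assumed to cost one clock cycle each. */
void add_and_sum(double *A, const double *B, const double *C,
                 double *sum, int N)
{
    for (int i = 0; i < N; i++)
        A[i] = B[i] + C[i];          /* L2: one cycle per element */
    *sum = 0.0;                      /* L4: one cycle             */
    for (int j = 0; j < N; j++)
        *sum += A[j];                /* L6: one cycle per element */
}

/* Multiprocessor version (right): M sections of L = N/M elements,
 * producing M partial sums. */
void add_and_partial_sums(double *A, const double *B, const double *C,
                          double *partial, int N, int M)
{
    int L = N / M;
    /* DOALL k = 0 .. M-1: each section runs on its own processor. */
    for (int k = 0; k < M; k++) {
        for (int i = k * L; i < (k + 1) * L; i++)
            A[i] = B[i] + C[i];
        partial[k] = 0.0;
        for (int j = k * L; j < (k + 1) * L; j++)
            partial[k] += A[j];
    }
    /* The M partial sums are then merged by an l = log2(M)-level
     * binary adder tree (part b of the question). */
}
```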
-
The program on the left executes on a uniprocessor. Suppose each line of code L2, L4, and L6 takes one processor clock cycle to execute. For simplicity, ignore the time required for the other lines of code. Initially all arrays are already loaded in main memory and the short program fragment is in the instruction cache. How many clock cycles are required to execute this program?
-
The program on the right is written to execute on a multiprocessor with M processors. We partition the looping operations into M sections with L = N/M elements per section. DOALL declares that all M sections are executed in parallel. The result of this program is to produce M partial sums. Assume that k clock cycles are needed for each interprocessor communication operation via the shared memory and that therefore the addition of each partial sum requires k cycles. An l-level binary adder tree can merge all the partial sums, where l = log₂M. How many cycles are needed to produce the final sum?
-
Suppose there are N = 2²⁰ elements in the array and M = 256. What is the speedup achieved by using the multiprocessor? Assume k = 200. What percentage is this of the theoretical speedup of a factor of 256?
a. The I loop requires N cycles, as does the J loop. With the L4 statement, the total is 2N + 1.
b. The sectioned I loop can be done in L cycles, and the sectioned J loop produces the M partial sums in L cycles. Merging the partial sums takes l = log₂M levels of the binary adder tree, each level requiring k cycles of interprocessor communication plus one cycle for the addition. Total = 2L + l(k + 1) cycles.
c. Sequential execution of the original program takes approximately 2N = 2²¹ cycles.
Parallel execution requires L = N/M = 2²⁰/2⁸ = 2¹² cycles per section and l = log₂M = 8 adder-tree levels, so Total = 2L + l(k + 1) = 2¹³ + 8 × 201 = 8192 + 1608 = 9800 cycles. This is a speedup factor of approximately 2²¹/9800 ≈ 214, which is 214/256 ≈ 83.6% of the theoretical speedup of 256.
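The arithmetic above can be reproduced with a short C check (a sketch; the constants come from the problem statement):

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    double N = 1048576.0;   /* 2^20 array elements           */
    double M = 256.0;       /* number of processors          */
    double k = 200.0;       /* cycles per communication step */

    double L     = N / M;                    /* 2^12 = 4096 elements/section */
    double l     = log2(M);                  /* 8 adder-tree levels          */
    double t_seq = 2.0 * N;                  /* 2^21 cycles                  */
    double t_par = 2.0 * L + l * (k + 1.0);  /* 8192 + 1608 = 9800 cycles    */
    double speedup = t_seq / t_par;          /* ~214                         */

    printf("speedup ~ %.0f, efficiency ~ %.1f%%\n", speedup, 100.0 * speedup / M);
    return 0;
}
```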
Q.4
Give several reasons for the choice by designers to move to a multicore organization rather than increase parallelism within a single processor.
In the case of pipelining, simple 3-stage pipelines were replaced by pipelines with 5 stages, and then many more stages, with some implementations having over a dozen stages. There is a practical limit to how far this trend can be taken, because with more stages, there is the need for more logic, more interconnections, and more control signals. With superscalar organization, performance increases can be achieved by increasing the number of parallel pipelines. Again, there are diminishing returns as the number of pipelines increases. More logic is required to manage hazards and to stage instruction resources.
Eventually, a single thread of execution reaches the point where hazards and resource dependencies prevent the full use of the multiple pipelines available. This same point of diminishing returns is reached with SMT, as the complexity of managing multiple threads over a set of pipelines limits the number of threads and number of pipelines that can be effectively utilized.
Q.5
What are the differences between CPU and GPU architecture?
In the CPU, as discussed in Chapter 18, the control logic and cache memory make up the majority of the chip's real estate; in the GPU, the chip area is mostly devoted to processing logic. As a result, the GPU can perform at a significantly higher FLOPS rate than the CPU.