


ACA
Date: 17.04.2021

Load R1, A
Load R2, B
Add R3, R1, R2
Store R3, C

This is the machine language corresponding to the C statement C = A + B.



  a. Following is the representation of the 32-bit fixed instruction format for arithmetic operations (R-type):



Opcode | Register - Source 1 (rs) | Register - Source 2 (rt) | Register - Destination (rd) | Shift  | Function
6 bits | 5 bits                   | 5 bits                   | 5 bits                      | 5 bits | 6 bits

Following is a description of each field:

  1. Opcode: The opcode selects the class of operation; several related instructions can share the same opcode. It is 6 bits long.

  2. Registers (rs, rt, rd): These are the numeric identifiers of the source and destination registers. Each of these fields is 5 bits long; rs and rt are the first and second source registers, and rd is the destination register.

  3. Shift: This field is used only with shift and rotate instructions; it gives the count by which the source operand is shifted or rotated. It is 5 bits long.

  4. Function: The function field is 6 bits long. It is an extension of the opcode, and together they determine the operation: it carries the control codes that differentiate instructions sharing the same opcode. For example, if the opcode routes the instruction to the ALU, the function field selects which ALU operation to perform.
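The R-type layout above amounts to packing six fields into one 32-bit word. A minimal bit-packing sketch in Python (the function name is an illustration; the register numbers and the funct value 0x20 in the example follow the standard MIPS add instruction, not anything given in the question):

```python
def encode_r_type(opcode, rs, rt, rd, shamt, funct):
    """Pack the six R-type fields into a 32-bit instruction word.

    Widths (from the table above): opcode 6, rs 5, rt 5, rd 5,
    shamt (shift amount) 5, funct 6 bits.
    """
    assert 0 <= opcode < 2**6 and 0 <= funct < 2**6
    assert all(0 <= f < 2**5 for f in (rs, rt, rd, shamt))
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# Example: Add R3, R1, R2 from the first answer, encoded MIPS-style
# (opcode 0 routes to the ALU, funct 0x20 selects the add operation)
word = encode_r_type(0, rs=1, rt=2, rd=3, shamt=0, funct=0x20)
print(f"{word:#010x}")  # 0x00221820
```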



  b. Load/store instruction format

Loads and stores use the I-type instruction format. I-type instructions are used when an instruction must operate on an immediate value and a register value. They are converted into machine-code words in the following format:

Opcode | Register - Source (rs) | Register - Target (rt) | Immediate (IMM)
6 bits | 5 bits                 | 5 bits                 | 16 bits

Opcode: The instruction has a 6-bit opcode. For I-type instructions, every mnemonic has a one-to-one corresponding opcode, because there is no function field to differentiate instructions sharing an identical opcode.

Registers (rs, rt): These are the source and target register operands, 5 bits each. rs specifies the source register operand, and rt specifies the register that receives the result of the computation.

Immediate field: This field is 16 bits long and can represent up to 2^16 different values. It is large enough to hold the offset of a typical load (lw) or store (sw); depending on the instruction, it may be interpreted as a two's-complement value.
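The I-type layout can likewise be sketched as a bit-packing routine. A minimal sketch in Python, assuming MIPS-style encoding (the function name is an illustration; the opcode value 0x23 in the example is the standard MIPS lw opcode, used here only to demonstrate the field widths):

```python
def encode_i_type(opcode, rs, rt, imm):
    """Pack the four I-type fields into a 32-bit instruction word.

    Widths (from the table above): opcode 6, rs 5, rt 5, immediate 16.
    A negative immediate is stored in 16-bit two's complement.
    """
    assert 0 <= opcode < 2**6 and 0 <= rs < 2**5 and 0 <= rt < 2**5
    assert -2**15 <= imm < 2**16
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

# Example: lw $2, 8($1) -- load a word into rt=$2 from offset 8 off rs=$1
word = encode_i_type(0x23, rs=1, rt=2, imm=8)   # 0x23 is the MIPS lw opcode
print(f"{word:#010x}")  # 0x8c220008
```

A negative offset such as -4 is stored as 0xFFFC in the low 16 bits, which is the two's-complement interpretation mentioned above.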

2. Machine code for an accumulator machine. An accumulator architecture minimizes the internal state of the machine and allows short instructions.

Load A
Add B
Store C

Opcode | Index Addressing | Index
7 bits | 1 bit            | 24 bits

Opcode: 7 bits long, capable of encoding 2^7 = 128 different operations.

Index addressing field: This field is 1 bit; it is set only when index addressing mode is used. Index addressing mode determines how the effective address of the data is formed.

Index: This field is 24 bits long and holds the memory address of the operand.
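Since the accumulator program above uses only three operation types, its semantics can be shown with a tiny interpreter. A sketch in Python (the dict-backed memory and the function name are illustrative assumptions, not part of the question):

```python
def run_accumulator(program, memory):
    """Interpret the Load/Add/Store accumulator instructions used above.

    The machine has a single accumulator register; each instruction
    names one memory operand (a dict key stands in for an address).
    """
    acc = 0
    for op, addr in program:
        if op == "Load":
            acc = memory[addr]        # memory -> accumulator
        elif op == "Add":
            acc += memory[addr]       # accumulator += memory
        elif op == "Store":
            memory[addr] = acc        # accumulator -> memory
        else:
            raise ValueError(f"unknown opcode: {op}")
    return memory

# C = A + B, exactly as in the three-instruction program above
mem = run_accumulator([("Load", "A"), ("Add", "B"), ("Store", "C")],
                      {"A": 2, "B": 3, "C": 0})
print(mem["C"])  # 5
```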

3. a. Let m be the total number of memory accesses, and let d be the fraction of executed instructions that are data transfers (loads and stores). Every instruction requires one memory access to fetch the instruction itself, and each data transfer makes one additional access for its data. The percentage of memory accesses for data is therefore

= data accesses / (instruction fetches + data accesses)

= d / (1 + d)

= 0.29 (for the instruction mix given in the question)

So, 29% of memory accesses are for data.



b. It is given in the question that 2/3 of the data transfers are loads. Reads consist of all instruction fetches plus the loads, so with d as above, the percentage of memory accesses that are reads is given by

= (instruction fetches + loads) / (total accesses)

= (1 + (2/3)d) / (1 + d)

= 1 - (1/3) * d/(1 + d)

= 1 - (1/3)(0.29)

= 0.90

So 90% of memory accesses are reads.
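A quick numeric check of parts (a) and (b). Note that the data-transfer fraction d is derived here from the stated 29% answer; the question's own instruction mix is not quoted in the text above:

```python
# Part (a) gave: data_share = d / (1 + d) = 0.29, where d is the
# fraction of instructions that are data transfers (loads + stores).
data_share = 0.29
d = data_share / (1 - data_share)   # invert d / (1 + d) = 0.29

# Part (b): reads = every instruction fetch plus the loads,
# and 2/3 of the data transfers are loads (given in the question).
read_share = (1 + (2 / 3) * d) / (1 + d)

print(round(d, 2))           # implied data transfers per instruction, ~0.41
print(round(read_share, 2))  # 0.9
```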

Paper Review



Muesli's existing data-parallel skeleton library targets distributed-memory machines: algorithmic skeletons wrap the classic parallel programming patterns so that users can apply them easily in their applications. Augmenting the skeleton library allows the user to run the same application on parallel machines ranging from distributed-memory multiprocessors to many-core shared-memory machines. The Muenster Skeleton Library (Muesli) is a C++ template library that uses the OpenMP API in combination with MPI to work efficiently on multi-core architectures while concealing the parallelism from users. Algorithmic skeletons offer predefined parallel computation and communication patterns, reducing low-level programming problems. The enhanced version provides the skeletons fold, map, scan, and zip, and their variants, on three distributed data structures, keeping the parallelism transparent to users.

To execute a parallel program written with Muesli, a computer mainly needs an ISO/IEC C++ 1998 compliant compiler along with an MPI 1.2 compliant runtime system; in addition, the compiler must support OpenMP 2.5 to make use of multi-core processors. The OpenMP runtime environment creates multiple threads, distributes the skeleton's workload among them, and assigns the threads to the available cores, which then execute in parallel. Without any changes to a program implemented with Muesli, the user can run the application on a single-core cluster or compile it with a compiler that does not support OpenMP. To isolate the library from direct use of OpenMP routines, a special layer, the OpenMP Abstraction Layer (OAL), was introduced; it acts as a set of wrapper functions and makes use of the pre-processor macro _OPENMP. For example, the fold-columns skeleton reduces all elements of each column of a matrix to a single element, so that each process ends up with an array containing m elements; a private clause (listing the affected variables) was added to the work-sharing directive to ensure that each thread gets its own copy of those variables.

The benchmarks of this new algorithmic skeleton library were conducted on a multi-node, multi-core cluster computer at the University of Münster. The cluster consisted of ten compute nodes, each with two quad-core AMD Opteron 2352 CPUs at 2.1 GHz, for a total of 160 cores. Speedups varied with the number of processes and threads. The skeleton-based implementation of the Bellman-Ford algorithm performed better than the LM OSEM-based image reconstruction, but suffered from large, non-parallelizable code segments. The FFT implementation achieved reasonable speedups, mainly using the mapIndexInPlace and permutePartition skeletons. The best result among all benchmarks was Gaussian elimination, which achieved linear speedup using the rotateColumns, rotateRows, and mapIndexInPlace skeletons. The results show that the skeletons scale reasonably on multi-node, multi-core computer architectures; nevertheless, scalability depends on the problem size and the amount of code amenable to parallelization.
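As a purely conceptual illustration of the skeleton idea (Python with a thread pool standing in for Muesli's C++ templates, MPI processes, and OpenMP threads; none of these names come from the library), a map and a fold skeleton hide the parallelism behind ordinary function calls:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def parallel_map(f, data, workers=4):
    """Map skeleton: apply f to every element, hiding how the work
    is distributed from the caller."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, data))

def parallel_fold(op, data, workers=4):
    """Fold skeleton: reduce data with an associative operator op,
    combining per-worker partial results at the end."""
    chunks = [data[i::workers] for i in range(workers) if data[i::workers]]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: reduce(op, c), chunks))
    return reduce(op, partials)

print(parallel_map(lambda x: x * x, [1, 2, 3, 4]))         # [1, 4, 9, 16]
print(parallel_fold(lambda a, b: a + b, list(range(10))))  # 45
```

The caller never touches a thread, which is the property the paper's skeletons preserve while targeting MPI and OpenMP underneath.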

Philipp Ciechanowicz, Herbert Kuchen, “Enhancing Muesli’s Data Parallel Skeletons for Multi-core Computer Architectures”, IEEE 12th International Conference on High Performance Computing and Communications (HPCC), September 2010