
Process 0	Process 1
a₀₀ a₀₁	a₀₂ a₀₃
a₁₀ a₁₁	a₁₂ a₁₃
Process 2	Process 3
a₂₀ a₂₁	a₂₂ a₂₃
a₃₀ a₃₁	a₃₂ a₃₃

Fox’s algorithm distributes the matrices using a checkerboard scheme like the one above. To simplify the discussion, let’s assume that the matrices have order n and that the number of processes, p, equals n². Then a checkerboard mapping assigns a_ij, b_ij, and c_ij to process (i,j). In an n × n process grid, process (i,j) is the process with rank i * n + j; that is, the processes are numbered in row-major order.
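
As a small illustration of the row-major numbering described above, the following Python sketch converts between grid coordinates and process ranks. The function names `grid_to_rank` and `rank_to_grid` are hypothetical, chosen only for this example.

```python
def grid_to_rank(i, j, n):
    """Row-major rank of process (i, j) in an n x n process grid."""
    return i * n + j


def rank_to_grid(rank, n):
    """Inverse mapping: grid coordinates (i, j) of a row-major rank."""
    return divmod(rank, n)


# Print the ranks of a 3 x 3 process grid, one process row per line.
for i in range(3):
    print([grid_to_rank(i, j, 3) for j in range(3)])
```

The two functions are inverses of each other, so a process that only knows its rank can recover its row and column in the grid.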

Fox’s algorithm for matrix multiplication proceeds in n stages: one stage for each term a_ik b_kj in the dot product
c_ij = a_i0 b_0j + a_i1 b_1j + … + a_i,n-1 b_n-1,j.
During the initial stage, each process multiplies the diagonal entry of A in its process row by its element of B:
Stage 0 on process (i,j): c_ij = a_ii b_ij.
During the next stage, each process multiplies the element immediately to the right of the diagonal of A by the element of B directly beneath its own element of B:
Stage 1 on process (i,j): c_ij = c_ij + a_i,i+1 b_i+1,j.
In general, during the kth stage, each process multiplies the element k columns to the right of the diagonal of A by the element k rows below its own element of B:
Stage k on process (i,j): c_ij = c_ij + a_i,i+k b_i+k,j.
Of course, we can’t just add k to a row or column subscript and expect to always get a valid row or column number. For example, if i = j = n – 1, then any positive value added to i or j will result in an out-of-range subscript. One possible solution is to use subscripts modulo n. That is, rather than use i + k for a row or column subscript, use (i + k) mod n.
Perhaps we should say that the algorithm is correct but incomplete: we still haven’t said how each process obtains the appropriate values a_i,i+k and b_i+k,j (subscripts taken mod n). Only process (i, (i + k) mod n) holds a_i,i+k, so this element must be broadcast across the ith process row before each multiplication. Finally, observe that during the initial stage, each process uses its own element, b_ij, of B. During stage k, process (i,j) needs b_i+k,j, which starts out k rows below it. Thus, after each multiplication is completed, the elements of B should be shifted up one row, and the elements in the top row should be sent to the bottom row.
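
The stage formula with subscripts taken modulo n can be sketched serially in Python. This is only an illustration under simplifying assumptions: the loops over i and j stand in for the n² processes, no data actually moves, and the function name `fox_elementwise` is hypothetical.

```python
def fox_elementwise(a, b):
    """Compute C = A * B, visiting the terms in the stage order of Fox's algorithm.

    Serial sketch: the loops over (i, j) play the role of the n*n processes.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for stage in range(n):
        for i in range(n):
            # Subscript modulo n: column of A and row of B used by row i at this stage.
            k = (i + stage) % n
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


print(fox_elementwise([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Because k = (i + stage) mod n takes every value from 0 to n – 1 exactly once as the stage runs from 0 to n – 1, each c_ij receives all n terms of its dot product, only in a rotated order.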
Example-1: The stages in Fox’s algorithm for multiplying two 2 × 2 matrices distributed across four processes.

Stages	First Step	Second Step	Third Step
Stage 0	a₀₀ a₀₁	c₀₀ = a₀₀b₀₀ c₀₁ = a₀₀b₀₁	b₀₀ b₀₁
Stage 0	a₁₀ a₁₁	c₁₀ = a₁₁b₁₀ c₁₁ = a₁₁b₁₁	b₁₀ b₁₁
Stage 1	a₀₀ a₀₁	c₀₀ = c₀₀ + a₀₁b₁₀ c₀₁ = c₀₁ + a₀₁b₁₁	b₁₀ b₁₁
Stage 1	a₁₀ a₁₁	c₁₀ = c₁₀ + a₁₀b₀₀ c₁₁ = c₁₁ + a₁₀b₀₁	b₀₀ b₀₁

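First step: in the ith row of A, the element a_i,i+k (subscripts mod n; highlighted in blue in the original figure) is broadcast across process row i at stage k and multiplied by each element of the row of B shown in the third step.

Second step: each process (i,j) accumulates its product into c_ij.

Third step: the column shows the elements of B held by each process row during the stage. After each stage, the elements of B are shifted up one row, and the elements in the top row are sent to the bottom row.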

Example-2: The stages in Fox’s algorithm for multiplying two 3 × 3 matrices distributed across nine processes.

Stages	First Step			Second Step			Third Step
Stage 0	a₀₀	a₀₁	a₀₂	c₀₀+=a₀₀b₀₀	c₀₁+=a₀₀b₀₁	c₀₂+=a₀₀b₀₂	b₀₀	b₀₁	b₀₂
	a₁₀	a₁₁	a₁₂	c₁₀+=a₁₁b₁₀	c₁₁+=a₁₁b₁₁	c₁₂+=a₁₁b₁₂	b₁₀	b₁₁	b₁₂
	a₂₀	a₂₁	a₂₂	c₂₀+=a₂₂b₂₀	c₂₁+=a₂₂b₂₁	c₂₂+=a₂₂b₂₂	b₂₀	b₂₁	b₂₂
Stage 1	a₀₀	a₀₁	a₀₂	c₀₀+=a₀₁b₁₀	c₀₁+=a₀₁b₁₁	c₀₂+=a₀₁b₁₂	b₁₀	b₁₁	b₁₂
	a₁₀	a₁₁	a₁₂	c₁₀+=a₁₂b₂₀	c₁₁+=a₁₂b₂₁	c₁₂+=a₁₂b₂₂	b₂₀	b₂₁	b₂₂
	a₂₀	a₂₁	a₂₂	c₂₀+=a₂₀b₀₀	c₂₁+=a₂₀b₀₁	c₂₂+=a₂₀b₀₂	b₀₀	b₀₁	b₀₂
Stage 2	a₀₀	a₀₁	a₀₂	c₀₀+=a₀₂b₂₀	c₀₁+=a₀₂b₂₁	c₀₂+=a₀₂b₂₂	b₂₀	b₂₁	b₂₂
	a₁₀	a₁₁	a₁₂	c₁₀+=a₁₀b₀₀	c₁₁+=a₁₀b₀₁	c₁₂+=a₁₀b₀₂	b₀₀	b₀₁	b₀₂
	a₂₀	a₂₁	a₂₂	c₂₀+=a₂₁b₁₀	c₂₁+=a₂₁b₁₁	c₂₂+=a₂₁b₁₂	b₁₀	b₁₁	b₁₂
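
The pattern in the table above can be generated mechanically. The sketch below lists, for each stage, which index k each process row uses; the function name `fox_schedule` is hypothetical and the printed text is only meant to mirror the "Second Step" column.

```python
def fox_schedule(n):
    """Return schedule[stage][i] = k = (i + stage) mod n.

    At the given stage, process (i, j) multiplies a[i][k] by b[k][j],
    so row k of B is the one sitting in process row i.
    """
    return [[(i + stage) % n for i in range(n)] for stage in range(n)]


# Print the updates of every process for a 3 x 3 example.
for stage, ks in enumerate(fox_schedule(3)):
    for i, k in enumerate(ks):
        terms = [f"c{i}{j} += a{i}{k}*b{k}{j}" for j in range(3)]
        print(f"Stage {stage}:", "  ".join(terms))
```

Reading one row of the schedule from top to bottom also gives the order of the rows of B in the "Third Step" column for that stage.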

Example-3: The stages in Fox’s algorithm for multiplying two 4 × 4 matrices distributed across four processes.
For large n, it is unlikely that we have access to n² processors. So a natural solution is to store sub-matrices rather than matrix elements on each process. We use a square grid of processes, where the number of process rows or process columns, sqrt(p), evenly divides n. With this assumption, each process is assigned a square
n/sqrt(p) × n/sqrt(p)
sub-matrix of each of the three matrices. In our example p = 4 and n = 4, so the sub-matrices of A are as follows:
A₀₀ =	a₀₀ a₀₁	A₀₁ =	a₀₂ a₀₃
	a₁₀ a₁₁		a₁₂ a₁₃

A₁₀ =	a₂₀ a₂₁	A₁₁ =	a₂₂ a₂₃
	a₃₀ a₃₁		a₃₂ a₃₃
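
The partition of a matrix into square sub-matrices can be sketched as follows. This is a minimal illustration that assumes q = sqrt(p) divides n evenly, as stated above; the helper name `split_into_blocks` is hypothetical.

```python
def split_into_blocks(m, q):
    """Split an n x n matrix (list of lists) into a q x q grid of square blocks.

    blocks[i][j] is the sub-matrix that process (i, j) would hold.
    """
    n = len(m)
    if n % q != 0:
        raise ValueError("q must divide the matrix order n")
    s = n // q  # order of each sub-matrix, n / sqrt(p)
    return [
        [[row[j * s:(j + 1) * s] for row in m[i * s:(i + 1) * s]] for j in range(q)]
        for i in range(q)
    ]


# Label the entries of a 4 x 4 matrix and split it across a 2 x 2 process grid.
labels = [[f"a{r}{c}" for c in range(4)] for r in range(4)]
blocks = split_into_blocks(labels, 2)
print(blocks[0][1])  # the block A01
```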

If we make similar definitions of B_ij and C_ij, assign A_ij, B_ij, and C_ij to process (i,j), and define q = sqrt(p), then our algorithm will compute
C_ij = A_ii B_ij + A_i,i+1 B_i+1,j + … + A_i,q-1 B_q-1,j + A_i0 B_0j + … + A_i,i-1 B_i-1,j.
The stages for computing C_ij are as follows:

Stage	First Step	Second Step	Third Step
Stage 0	A₀₀ A₀₁	C₀₀ = A₀₀B₀₀ C₀₁ = A₀₀B₀₁	B₀₀ B₀₁
Stage 0	A₁₀ A₁₁	C₁₀ = A₁₁B₁₀ C₁₁ = A₁₁B₁₁	B₁₀ B₁₁
Stage 1	A₀₀ A₀₁	C₀₀ = C₀₀ + A₀₁B₁₀ C₀₁ = C₀₁ + A₀₁B₁₁	B₁₀ B₁₁
Stage 1	A₁₀ A₁₁	C₁₀ = C₁₀ + A₁₀B₀₀ C₁₁ = C₁₁ + A₁₀B₀₁	B₀₀ B₀₁
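
The block version of the computation can be sketched serially in Python. The sketch assumes the matrices have already been split into a q × q grid of square blocks (nested lists), and the names `mat_mul`, `mat_add`, and `fox_blocks` are hypothetical helpers written for this example.

```python
def mat_mul(x, y):
    """Ordinary product of two small matrices stored as lists of lists."""
    return [[sum(x[r][t] * y[t][c] for t in range(len(y))) for c in range(len(y[0]))]
            for r in range(len(x))]


def mat_add(x, y):
    """Element-wise sum of two matrices of the same shape."""
    return [[u + v for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]


def fox_blocks(a_blocks, b_blocks):
    """Accumulate C_ij = sum over stages of A_i,k * B_k,j with k = (i + stage) mod q."""
    q = len(a_blocks)
    s = len(a_blocks[0][0])  # order of each block
    c_blocks = [[[[0] * s for _ in range(s)] for _ in range(q)] for _ in range(q)]
    for stage in range(q):
        for i in range(q):
            k = (i + stage) % q
            for j in range(q):
                product = mat_mul(a_blocks[i][k], b_blocks[k][j])
                c_blocks[i][j] = mat_add(c_blocks[i][j], product)
    return c_blocks
```

The only change from the element-wise scheme is that scalar multiplication and addition are replaced by matrix multiplication and matrix addition on the blocks.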

Fox’s Algorithm Pseudocode

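/* my process row = i, my process column = j */
q = sqrt(p);
dest = ((i + q - 1) mod q, j);   /* process directly above, with wraparound */
source = ((i + 1) mod q, j);     /* process directly below, with wraparound */
for (stage = 0; stage < q; stage++)
{
    k_bar = (i + stage) mod q;
    /* process (i, k_bar) is the root of the row broadcast */
    Broadcast A[i,k_bar] across process row i;
    C[i,j] = C[i,j] + A[i,k_bar]*B[k_bar,j];
    /* circular shift of B up one process row */
    Send B[k_bar,j] to dest;
    Receive B[(k_bar+1) mod q,j] from source;
}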
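
The communication pattern of the pseudocode can be imitated without a message-passing library. The following Python sketch is a serial simulation under stated assumptions: each "block" is a single number to keep it short, dictionaries keyed by (i, j) stand in for the local memory of each process, and the name `fox_message_passing` is hypothetical. It is not an MPI program.

```python
def fox_message_passing(a, b):
    """Simulate the broadcast / multiply / shift steps on a q x q process grid."""
    q = len(a)
    grid = [(i, j) for i in range(q) for j in range(q)]
    local_a = {(i, j): a[i][j] for i, j in grid}  # A never moves
    local_b = {(i, j): b[i][j] for i, j in grid}  # B is shifted after every stage
    local_c = {pos: 0 for pos in grid}

    for stage in range(q):
        # Broadcast: in process row i, process (i, k_bar) is the root.
        received_a = {}
        for i, j in grid:
            k_bar = (i + stage) % q
            received_a[(i, j)] = local_a[(i, k_bar)]

        # Local multiply-accumulate with whatever piece of B is currently held.
        for pos in grid:
            local_c[pos] += received_a[pos] * local_b[pos]

        # Circular shift: send the held piece of B to the process above.
        # Python's % is non-negative here; in C one would write (i + q - 1) % q.
        inbox = {}
        for i, j in grid:
            dest = ((i - 1) % q, j)
            inbox[dest] = local_b[(i, j)]
        local_b = inbox

    return [[local_c[(i, j)] for j in range(q)] for i in range(q)]


print(fox_message_passing([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Building the `inbox` dictionary before replacing `local_b` mirrors the fact that every process sends its old piece of B before it uses the piece it receives.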

Process(0,0)

i = j = 0
Assign A₀₀, B₀₀, C₀₀ to this process(0,0)
q = 2
dest = (1,0)
source = (1,0)
stage = 0
k_bar = 0
Broadcast A₀₀ across process row i = 0
C₀₀ = C₀₀ + A₀₀ * B₀₀
Send B₀₀ to dest = (1,0)
Receive B₁₀ from source = (1,0)
stage = 1
k_bar = 1
Broadcast A₀₁ across process row i = 0
C₀₀ = C₀₀ + A₀₁ * B₁₀
Send B₁₀ to dest = (1,0)
Receive B₀₀ from source = (1,0)

Process(0,1)

i = 0, j = 1
Assign A₀₁, B₀₁, C₀₁ to this process(0,1)
q = 2
dest = (1,1)
source = (1,1)
stage = 0
k_bar = 0
Broadcast A₀₀ across process row i = 0
C₀₁ = C₀₁ + A₀₀ * B₀₁
Send B₀₁ to dest = (1,1)
Receive B₁₁ from source = (1,1)
stage = 1
k_bar = 1
Broadcast A₀₁ across process row i = 0
C₀₁ = C₀₁ + A₀₁ * B₁₁
Send B₁₁ to dest = (1,1)
Receive B₀₁ from source = (1,1)

Process(1,0)

i = 1, j = 0
Assign A₁₀, B₁₀, C₁₀ to this process(1,0)
q = 2
dest = (0,0)
source = (0,0)
stage = 0
k_bar = 1
Broadcast A₁₁ across process row i = 1
C₁₀ = C₁₀ + A₁₁ * B₁₀
Send B₁₀ to dest = (0,0)
Receive B₀₀ from source = (0,0)
stage = 1
k_bar = 0
Broadcast A₁₀ across process row i = 1
C₁₀ = C₁₀ + A₁₀ * B₀₀
Send B₀₀ to dest = (0,0)
Receive B₁₀ from source = (0,0)

Process(1,1)

i = 1, j = 1
Assign A₁₁, B₁₁, C₁₁ to this process(1,1)
q = 2
dest = (0,1)
source = (0,1)
stage = 0
k_bar = 1
Broadcast A₁₁ across process row i = 1
C₁₁ = C₁₁ + A₁₁ * B₁₁
Send B₁₁ to dest = (0,1)
Receive B₀₁ from source = (0,1)
stage = 1
k_bar = 0
Broadcast A₁₀ across process row i = 1
C₁₁ = C₁₁ + A₁₀ * B₀₁
Send B₀₁ to dest = (0,1)
Receive B₁₁ from source = (0,1)