Process 0
a00 a01
a10 a11
Process 1
a02 a03
a12 a13
| Process 2
a20 a21
a30 a31
| Process 3
a22 a23
a32 a33
|
Fox’s algorithm is a one that distributes the matrix using a checkerboard scheme like the above. In order to simplify the discussion, let’s assume that the matrices have order n, and the number of processes, p, equals n2. Then a checkerboard mapping assigns aij, bij, and cij to process (i,j). In a process grid like the above, the process (i,j) is the same as process p= i * n + j, or, loosely, process (i,j) using row major form in the process grid.
Fox’s algorithm for matrix multiplication proceeds in n stages: one stage for each term aikbkj in the dot product
Cij = ai0b0j + ai1b1i + … + ai,n-1bn-1,j.
During the initial stage, each process multiplies the diagonal entry of A in its process row by its element of B:
Stage 0 on process(i,j): cij=aiibij.
During the next stage, each process multiplies the element immediately to the right of the diagonal of A by the element of B directly beneath its own element of B:
Stage 1 on process(i,j): cij = cij + ai,i+1bi+1,j.
In general, during the kth stage, each process multiplies the element k columns to the right of the diagonal of A by the element k rows below its own element of B:
Stage k on process(i,j): cij = cij + ai,i+kbi+k,j.
Of, course, we can’t just add k to a row or column subscript and expect to always get a valid row or column number. For example, if i = j = n –1, then any positive value added to i or j will result in an out-of-range subscript. One possible solution is to use subscripts module n. That is, rather than use i + k for a row or column subscript, use i + k mode n. Perhaps we should say that the incomplete algorithm is correct. We still haven’t said how we arrange that each process gets the appropriate values ai,k and bk,i. Thus, we need to broadcast aik across the ith row before each multiplication. Finally, observe that during the initial stage, each process uses its own element, bij, of B. During subsequent stages, process(i,j) will use bik. Thus, after each multiplication is completed, the element of B should be shifted up one row, and elements in the top row should be sent to the bottom row.
Example-1: illustrates the stages in Fox’s algorithm for multiplying two 2 2 matrices distributes across four processes.
Stages
|
First Step
|
Second Step
|
Third Step
|
Stage 0
|
a00 a01
|
c00 = a00b00 c01 = a00b01
|
b00 b01
|
a10 a11
|
c10 = a11b10 c11 = a11b11
|
b10 b11
|
Stage 1
|
a00 a01
|
c00 = c00 + a01b10 c01 = c01 + a01b11
|
b10 b11
|
a10 a11
|
c10 = c10 + a10b00 c11 = c11 + a10b01
|
b00 b01
|
First step: each blue element in the ith row of A will be multiplied by all of the corresponding ith row of B in the third step.
Second step: assigns the result of the multiplications to cij.
Third Step: after each stage, the elements of B should be shifted up one row, and elements in the top row should be sent to the bottom row.
Example-2: illustrates the stages in Fox’s algorithm for multiplying two 3 3 matrices distributed across nine processes.
Stages
|
First Step
|
Second Step
|
Third Step
|
Stage 0
|
a00
|
a01
|
a02
|
c00+=a00b00
|
c01+=a00b01
|
c02+=a00b02
|
b00
|
b01
|
b02
|
a10
|
a11
|
a12
|
c10+=a11b10
|
c11+=a11b11
|
c12+=a11b12
|
b10
|
b11
|
b12
|
a20
|
a21
|
a22
|
c20+=a22b20
|
c20+=a22b21
|
c22+=a22b22
|
b20
|
b21
|
b22
|
Stage 1
|
a00
|
a01
|
a02
|
c00+=a01b10
|
c01+=a01b11
|
c02+=a01b12
|
b10
|
b11
|
b12
|
a10
|
a11
|
a12
|
c10+=a12b20
|
c11+=a12b21
|
c12+=a12b22
|
b20
|
b21
|
b22
|
a20
|
a21
|
a22
|
c20+=a20b00
|
c21+=a20b01
|
c22+=a20b02
|
b00
|
b01
|
b02
|
Stage 2
|
a00
|
a01
|
a02
|
c00+=a02b20
|
c01+=a02b21
|
c02+=a02b22
|
b20
|
b21
|
b22
|
a10
|
a11
|
a12
|
c10+=a10b00
|
c11+=a10b01
|
c12+=a10b02
|
b00
|
b01
|
b02
|
a20
|
a21
|
a22
|
c20+=a21b10
|
c21+=a21b11
|
c22+=a21b12
|
b10
|
b11
|
b12
|
Example-3: illustrates the stages in Fox’s algorithm for multiplying two 4 4 matrices distributed across four processes.
For large n it is that we can unlikely access to n2 processors. So a natural solution would seem to be to store sub-matrices rather than matrix elements on each process. We use a square grid of processes, where the number of process rows or process columns, sqrt(p), evenly divides n. With this assumption, each process is assigned a square
n/sqrt(p) n/sqrt(p)
sub-matrix of each of the three matrices. In our example p = 4, and n = 4, so the sub-matrices for matrix of A will be as following:
a00 a01 a02 a03
A00 = A01 =
a10 a11 a12 a13
a20 a21 a22 a23
A10 = A11 =
a30 a31 a32 a33
If we make similar definition of Bij and Cij, assign Aij, Bij, and Cij to process (i,j), and we define q = sqrt(p), then our algorithm will compute
Cij = AiiBij + Ai,i+1Bi+1,j + … + Ai,q-1Bq-1,j + Ai0B0j + … +Ai,i-1Bi-1,j.
The stages for computing Cij will be as following:
Q
|
First Step
|
Second Step
|
Third Step
|
q = 0
|
A00 A01
|
C00 = A00B00 C01 = A00B01
|
B00 B01
|
A10 A11
|
C10 = A11B10 C11 = A11B11
|
B10 B11
|
q = 1
|
A00 A01
|
C00 = C00 + A01B10 C01 = C01 + A01B11
|
B10 B11
|
A10 A11
|
C10 = C10 + A10B00 C11 = C11 + A10B01
|
B00 B01
|
Fox’s Algorithm Psuedocode
/* my process row = i , my process column = j */
q = sqrt(p);
dest = ((i – 1) mod q , j);
source = ((i + 1) mod q, j);
for (stage = 0 ; stage < q ; stage++)
{
k_bar = (i + stage) mod q;
Broadcast A[i,k_bar] across process row i;
C[i,j] = C[i,j] + A[i,k_bar]*B[k_bar,j];
Send B[(k_bar+1) mod q, j] to dest;
Receive B[(k_bar+1) mod q, j] from source;
}
Process(0,0)
i = j = 0
Assign A00,B00,C00 to this process(0,0)
q = 2
dest = (1,0)
source = (1,0)
stage = 0
k_bar = 0
Broadcast A00 across process row i = 0
C00 = C00 + A00 * B00
Send B00 to dest = (1,0)
Receive B10 from source = (1,0)
stage = 1
k_bar = 1
Broadcast A01 across process row i = 0
C00 = C00 + A01B10
Send B00 to dest = (1,0)
Receive B00 from source = (1,0)
|
Process(0,1)
i = 0, j = 1
Assign A01,B01,C01 to this process(0,1)
q = 2
dest= (1,1)
source= (0,1)
stage = 0
k_bar = 0
Broadcast A00 across process row i = 0
C01 = C01 + A00 * B01
Send B11 to dest = (1,1)
Receive B11 from source = (0,1)
stage = 1
k_bar = 1
Broadcast A01 across process row i = 0
C01 = C01 + A01 * B11
Send B01 to dest = (1,1)
Receive B01 from source = (0,1)
|
Process(1,0)
i = 1, j = 0
Assign A10,B10,C10 to this process(1,0)
q = 2
dest = (0,0)
source = (0,0)
stage = 0
k_bar = 1
Broadcast A11 across process row i = 1
C00 = C00 + A11 * B10
Send B00 to dest = (0,0)
Receive B00 from source = (0,0)
stage = 1
k_bar = 0
Broadcast A10 across process row i = 0
C00 = C00 + A10B00
Send B10 to dest = (0,0)
Receive B10 from source = (0,0)
|
Process(1,1)
i = 1, j = 1
Assign A11,B11,C11 to this process(1,1)
q = 2
dest = (0,1)
source = (0,1)
stage = 0
k_bar = 1
Broadcast A11 across process row i = 1
C11 = C11 + A11 * B11
Send B01 to dest = (0,1)
Receive B01 from source = (0,1)
stage = 1
k_bar = 0
Broadcast A10 across process row i = 1
C11 = C11 + A10B01
Send B11 to dest = (0,1)
Receive B11 from source = (0,1)
|
Share with your friends: |