We have implemented parallel versions of the MI and LP algorithms for solving the IASVP, using ScaLAPACK routines together with routines implemented directly by the authors on top of PBLAS and BLACS. We assume a distributed-memory machine consisting of Prc processors, each with its own local memory, connected through an interconnection network characterized by two parameters: the word-sending time and the latency of the network. No assumption is made about the physical topology of the interconnection network.
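Under this model, the cost of sending a point-to-point message of n words is usually charged as the latency plus n times the word-sending time. The following small Python sketch only illustrates that linear model; the parameter names tau (word-sending time) and beta (latency) are our own and are not taken from the paper.

# Hedged sketch: linear point-to-point communication-cost model (not the
# authors' code); 'tau' and 'beta' are assumed parameter names.
def message_time(n_words, tau, beta):
    # Estimated time to send a message of n_words between two processors.
    return beta + n_words * tau

# Example: a 1000-word message with beta = 50 microseconds and tau = 0.01
# microseconds per word gives 6.0e-05 seconds.
print(message_time(1000, tau=0.01e-6, beta=50e-6))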
The ScaLAPACK library partitions matrices into blocks and uses a 2-D block-cyclic data distribution, mapping the blocks onto a logical mesh of processors. We work with Prc = Pr x Pc processors arranged in a logical mesh, with Pr processor rows and Pc processor columns, and assume that the matrices and some of the vectors of MI and LP are partitioned into blocks and distributed cyclically among the processors of the mesh, following the ScaLAPACK scheme in order to obtain a good load balance.
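For readers unfamiliar with this layout, the following Python sketch shows the usual ScaLAPACK-style mapping of a global matrix entry to its owning process and to its local position; the block size nb and the (0, 0) source process are assumptions for illustration, not parameters reported in the paper.

# Hedged sketch of 2-D block-cyclic ownership, assuming square nb x nb blocks
# and source process (0, 0); not the authors' actual distribution parameters.
def owner_and_local(i, j, nb, Pr, Pc):
    # Map global entry (i, j) to ((pr, pc), (li, lj)) on a Pr x Pc mesh.
    pr, pc = (i // nb) % Pr, (j // nb) % Pc    # owning process coordinates
    li = (i // (nb * Pr)) * nb + i % nb        # local row index
    lj = (j // (nb * Pc)) * nb + j % nb        # local column index
    return (pr, pc), (li, lj)

# Example: global entry (5, 7), block size nb = 2, on a 2 x 3 process mesh.
print(owner_and_local(5, 7, nb=2, Pr=2, Pc=3))  # ((0, 0), (3, 3))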
We summarize the data distribution in the parallel machine for MI and LP (a per-process storage sketch follows this list):
Global mxn matrices A0, A1, ..., Al and P(k) are locally stored in blocks of size m/Pr x n/Pc (note that only the first n columns of P(k) are needed);
Global nxn matrix Q(k) is locally stored in blocks of size n/Pr x n/Pc;
Global l vector c(k) is stored in a replicated way;
Global n vectors S(c(k)) and S* are stored in a replicated way.
For MI:
Global nxl matrix J(k) is locally stored in blocks of size n/Pr x l/Pc;
Global n vector b(k) is locally stored in blocks of size n/Pr.
For LP:
Global lxl matrix T is locally stored in blocks of size l/Pr x l/Pc;
Global l vector d(k) is locally stored in blocks of size l/Pr.
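As a rough numeric illustration of this summary, the Python sketch below counts the double-precision words each process would hold for the MI data structures, assuming for simplicity that Pr divides m, n and l and that Pc divides n and l exactly; the function name and the example sizes are ours, not the paper's.

# Hedged sketch: per-process storage (in 8-byte words) for the MI data listed
# above, assuming the block sizes divide the matrix dimensions exactly.
def mi_words_per_process(m, n, l, Pr, Pc):
    words = (l + 2) * (m // Pr) * (n // Pc)   # A0, A1, ..., Al (l+1 matrices) and P(k)
    words += (n // Pr) * (n // Pc)            # Q(k)
    words += l + 2 * n                        # replicated c(k), S(c(k)) and S*
    words += (n // Pr) * (l // Pc)            # J(k)
    words += n // Pr                          # b(k)
    return words

# Example: m = n = l = 1200 on a 2 x 2 process mesh.
print(mi_words_per_process(1200, 1200, 1200, Pr=2, Pc=2))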
4.1 Parallel MI algorithm
Each processor executes the following parallel MI algorithm on the portion of the data it holds (a serial sketch of the iteration is given after the listing).
Parallel MI algorithm for IASVP
1. A(0) = A0 + c1(0)A1 + ... + cl(0)Al
(* using daxpy of BLAS *)
2. [P(0),S(0),Q(0)] = svd(A(0))
(* using pdgesvd of ScaLAPACK *)
3. error = ||S(0) - S*||2
(* using daxpy, dnrm2 of BLAS *)
4. For k = 0,1,..., While error > tol
4.1 Compute J(k)
(* using ddot of BLAS, pdgemm of PBLAS
dgsum2d,dgesd2d,dgerv2d of BLACS *)
4.2 Compute b(k)
(* using ddot of BLAS, pdgemm of PBLAS
dgsum2d,dgesd2d,dgerv2d of BLACS *)
4.3 Compute c(k+1) solving J(k)c(k+1) = b(k)
(* using pdgetrf,pdgetrs of ScaLAPACK *)
4.4 Broadcast c(k+1) to all processors
(* using dgebs2d, dgebr2d of BLACS *)
4.5 A(k+1) = A0 + c1(k+1)A1 + ... + cl(k+1)Al
(* using daxpy of BLAS *)
4.6 [P(k+1),S(k+1),Q(k+1)] = svd(A(k+1))
(* using pdgesvd of ScaLAPACK *)
4.7 error = ||S(k+1) - S*||2
(* using daxpy, dnrm2 of BLAS *)
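To make the structure of steps 1 to 4.7 concrete, the following serial Python/NumPy sketch mimics the iteration with the parallel ScaLAPACK, PBLAS and BLACS calls replaced by their serial equivalents. The formulas used for J(k) and b(k), namely J_ij = p_i^T A_j q_i and b_i = s*_i - p_i^T A0 q_i, are the standard Newton-type choices for this kind of problem and are an assumption on our part; they may differ in detail from the authors' definitions, and convergence is only local.

import numpy as np

# Hedged serial sketch of steps 1-4.7; not the authors' parallel code.
def mi_iteration(A, s_star, c0, tol=1e-10, max_iter=50):
    # A = [A0, A1, ..., Al] (each m x n); s_star holds the l target singular values.
    A0, As = A[0], A[1:]
    c = np.asarray(c0, dtype=float)
    for k in range(max_iter):
        Ak = A0 + sum(cj * Aj for cj, Aj in zip(c, As))        # steps 1 / 4.5
        P, S, Qt = np.linalg.svd(Ak)                           # steps 2 / 4.6
        error = np.linalg.norm(S[:len(s_star)] - s_star)       # steps 3 / 4.7
        if error <= tol:
            break
        idx = range(len(s_star))
        J = np.array([[P[:, i] @ Aj @ Qt[i, :] for Aj in As] for i in idx])  # step 4.1
        b = s_star - np.array([P[:, i] @ A0 @ Qt[i, :] for i in idx])        # step 4.2
        c = np.linalg.solve(J, b)                              # step 4.3
    return c, error

# Toy example with m = n = l = 3 and a starting guess close to the solution.
rng = np.random.default_rng(0)
A = [rng.standard_normal((3, 3)) for _ in range(4)]            # A0, A1, A2, A3
c_true = rng.standard_normal(3)
s_star = np.linalg.svd(A[0] + sum(cj * Aj for cj, Aj in zip(c_true, A[1:])), compute_uv=False)
c_sol, err = mi_iteration(A, s_star, c0=c_true + 0.01 * rng.standard_normal(3))
print(c_sol, err)

In the parallel version, step 4.3 corresponds to the LU factorization and triangular solves performed by pdgetrf and pdgetrs, and the singular value decompositions of steps 2 and 4.6 are carried out by pdgesvd.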
The theoretical computational cost of this parallel algorithm can be approximated as TPAR = TA + TC, where TA represents the arithmetic time, which can be expressed as
TA = O() + k O() flops,
and TC represents the communication time, which can be expressed as
TC = O(17m+) + k O(m^2).
In these expressions it has been assumed that m = n = l.