`Matrix Multiplication1. Inner product method. Matrix multiplication is usually written: do i=1,n do j=1,n do k=1,n C(i,j)=C(i,j)+A(i,k)*B(k,j) end do end do end do The most direct translation of this program into vector form is: do i=1,n do j=1,nC(i,j)=DOT_PRODUCT(A(i,1:n),B(1:n,j))end do end do1 of 20The time on a pipelined machine is:n 2 ( ( s ­ 1 ) log n + s + 2 ( n ­ 1 ) )The time on an array machine or multiprocessor if P &gt; n is:( t + log n + t * )n 22 of 202. Middle-product method (n-parallelism) This is obtained by interchanging the headers in the original matrix multiplication loop. do j=1,n do k=1,n do i=1,n C(i,j)=C(i,j)+A(i,k)*B(k,j) end do end do end do The direct translation of this loop into vector form is: do j=1,n do k=1,n C(1:n,j)=C(1:n,j)+A(1:n,k)*B(k,j) end do end do3 of 20Alternatively, the headers could have been exchanged in a different order to obtain the loop: do j=1,n do k=1,n C(i,1:n)=C(i,1:n)+A(i,k)*B(k,1:n) end do end do The time on a pipelined machine is:2n 2 ( s + ( n ­ 1 ) )The time in an array machine is:( t + + t * )n 24 of 203. Outer-product method (n2-parallelism) Another interchange of the loop headers produce: do k=1,n do i=1,n do j=1,n C(i,j)=C(i,j)+A(i,k)*B(k,j) end do end do end do To obtain n2 parallelism, the inner two loops should take the form of a matrix operation: do k=1,n C(1:n,1:n)=C(1:n,1:n)+A(1:n,k) B(k,1:n) end do Where the operator  represents the outer product of two vectors. Given two vectors a and b, their outer product is a matrix Z such that Zi,j=ai × bj . Notice that the previous loop is5 of 20NOT a valid Fortran or Fortran 90 loop because  is not a valid Fortran character. The outer product matrix in the loop above has the following form:A 1k B k1 A 1k Bk2 A 1k B k3 ... A 2k B k1 A 2k Bk2 A 2k B k3 ... A 3k B k1 A 3k Bk2 A 3k B k3 ... ... ... ... ...2 1 The two directions of replicationThis matrix is the element-by-element product of the following two matrices:B A A ...Ak k k × B k k...kBwhich are formed by replicating Ak=A(1:n,k) and Bk=B(k,1:n) along the appropriate dimensions. This6 of 20replication can be achieved using the Fortran 90 SPREAD function discussed above:spread(A(1:n,k),dim=2,ncopies=n)*spread(B(k,1:n),dim=1,ncopies= n))The resulting loop is therefore: do k=1,n C=C+SPREAD(A(1:n,k),2,n)*SPREAD(B(k,1:N,1,n) end do In an array machine with P&gt;n2, the time would be:( 2 t copy log n + t * + t+ )nwhere tcopy is the time to copy a vector. The time to spread to n copies is logarithmic as discussed in class.7 of 204. n3 parallelism The product of two n×n matrices, C=matmul(A,B), can be computed by adding n matrices of rank (n,n): ... 3 B1 ... ... A 11 13 ... B 23 ... B 2 2 3 B 1 A 21 A 1 B 23 ... B3 1 3 A 1 12 A 1 B 33 22 A 22 B B 2 1 11 3 2 B A 1 B 22 A2 A2 B3 1 3 A1 A 1 B 32 11 21 A 22 B B 3 1 21 12 A B 21 A A2 B3 3 C= 2 A 1 B 31 A2 ... 3 A2 ...+ + ... + ...8 of 20These n matrices of rank (n,n) can be computed by multiplying (element-by-element) two three-dimensional arrays of rank(n,n,n). The two three-dimensional arrays are formed by replicating A and B along different dimensions as shown next: A A AB B B B.. .This replication can, again, be achieved, with SPREAD....9 of 20Thus, give the following three directions of replication: 3 2 1we can start by computing a n3 temporary array T as follows:T(:,:,:)=SPREAD(A,DIM=3,NCOPIES=n)*SPREAD(B,DIM=1,NCOPIES=n)Then, the result is just C=SUM(T,DIM=2) In an array machine with P&gt;=n3 processing unit, the time to compute C would be:( 2 t copy log n + t * + t + log n )10 of 208 Multiplication by DiagonalsAn n×n matrix A is banded if Aij=0 for i-j1, j-i2:A 11 ... ... A 12 A 22 ... ... ... A 23 ... ... A 1,  ... ... ... ... ... A n, n ­  + 1 1 0 2 A 2,  + 1 0 0 2 ... ... 0 ... ... A n ­  + 1, n 2 ... ... ... ... ... ... ... A n ­ 1, n A n, n 0 0A , 1 1 0 A  + 1, 2 ... 1 0 0 ... 0 0 0For a small band, for example, 1=2=3, the algorithm discussed before for matrix-vector multiplication is not efficient.11 of 20An alternative is to do the product by diagonals:A 0 A1 ... A 0 0 0 2 A­1 ... ... ... ... 0 0 ... ... ... A­ ... ... ... 1 0 ... ... ... 0 0 ... ... 0 0 0 ... ... ... ... 0 ... ... ... ×V ... ... ... ... ... ... ... ... ...After separating the diagonals into separate matrices, we get:A0 0 0 0 0 0 0 0 0 ... 0 0 0 0 ... 0 0 0 0 ... 0 0 0 0 ... 0 A1 0 0 V+ 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 ... 0 0 0 0 ... 0 0 0 0 A 2 00 0 0 V+...+ 00 0 0 00 0 0 00 0 0 0 ... V+ 0 0 0 0 0 0 0 0 A­1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 V 00 ... 0 0 0 V+...+ A 0 0 ... 0 0 2 0 0 0 ... 0 0 ... 00 012 of 20which can be written as follows:A0V  A1V  ...  A V + + + 2 2 2 + A­1 Vn ­ 1  ...  A­  Vn ­  1 1+ +where Vj=(Vj,...,Vn) and Vn-j=(V1,...,Vn-j). Also,  means add the shorter vector to the first component of the longer one, and  means add the shorter vector to the last component of the longer one. In Fortran 90 (except for the greek letters and the subscripts):+ (/ ... (/ A1(1:n-1)*V(1+1:n), (0., j=1,1) /) + (/ 0., A-1(1:n-1)*V(1:n-1) /) + ... (/ (0., j=1,2), A2(1:n-2)*V(1:n-2) /) A0(1:n)*V(1:n) + A1(1:n-1)*V(2:n),0. /)13 of 209 Consistent AlgorithmsA vector algorithm for solving a problem of size n is consistent with the best serial algorithms for the same problem if the redundancy is bounded as n  . Vector reduction is consistent.n -- + -- + ... + 1 - On 2 4 n­1 R n = ------ = ----------------------------- = -----------  1 n­1 O1 n­1 nBut parallel prefix is not (see Section 8.9). The time for the simple form of parallel prefix presented in class earlier given P processing units and vector length n=2k is:k­1t parallel =i=0n­2 -----------PiThe problem with algorithms that are non consistant is that for n (size of input data) large enough, the array algorithm is14 of 20slower than the scalar version. Assuming constant number of devices (pipeline stages or processing units). For example, assuming 8 processors tparallel has the following values:n16 32 64 512 1024 ~106tserial15 31 63 511 1023 ~106tparallel7 17 41 513 1153 ~2.5×106Notice that the array algorithm becomes slower than the scalar version when n=512.15 of 20A typical situation with inconsistent algorithms is depicted in the following graph: tparallel tserial timen overhead makes parallel algorithm slower parallel algorithm is better in this region assymptotic complexity16 of 2010 A consistent version of parallel prefixA consistent version of parallel prefix can be obtained by blocking the original algorithm. This can be done for both a pipelined unit or an array machine. We will show the array machine version. There are several steps in the algorithm 1. First the array B(1:n) should be reshaped into an matrix as follows: C=RESHAPE(B,(/n -Pn -- × P P, P /),0.0)The reshape(source,shape[,pad][,order]) function takes the elements of source, in normal Fortran order, and returns them (as many as will fit) as an array whose shape is specified by the one-dimensional integer array shape. If there is space remaining, then pad must be specifie and is used to fill out the rest. See the manual for a description of order. Thus, if17 of 20B = (/1.,2.,3.,4.,5.,6.,7.,8.,9.,10.,11. /) then, the result of RESHAPE above would be:1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 0.2. Then, we assign a column of C to each processing unit and compute the partial sums of each column separately as follows: do i=2,n -PC(i,1:P) = C(i-1,1:P) + C(i,1:P) end do At the end of this loop we will havekC ( k, q ) = C ( i, q )i=118 of 203. The third step is to compute the true value of the last element of each column: do j=2,P C(n -P,j)= C(n -P,j-1)+C(n -P,j)end do At the end of this step, the last element of each column will have the sum of all elements in its own column and the columns preceding it. 4. Finally, the elements in each processor are adjusted: do i=1,n -P-1n -PC(i,2:P) = C( end do,1:P-1) + C(i,2:P)19 of 20The total number of operation in the algorithm is: n ­ 1 P + ( P ­ 1 ) +  n ­ 1 ( P ­ 1 )  -  - p PThe algorithms is consistent because: n ­ 1 P + ( P ­ 1 ) +  n ­ 1 ( P ­ 1 )  -  - p P ----------------------------------------------------------------------------------------------------  2 n20 of 20`

20 pages

#### Report File (DMCA)

Our content is added by our users. We aim to remove reported files within 1 working day. Please use this link to notify us:

Report this file as copyright or inappropriate

830591