
Parallelization of Simple Loops
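
For reference, the sequential computation that is being parallelized here (discussed in the previous section) has roughly the following shape; the variable names are assumptions reconstructed from the parallel code below.

  /* assumed sequential form of the two irregular loops and the final sums */
  for (i = 0 ; i < n ; i++)
    y[i] = -x[ perm[i] ];       /* irregular read of x through perm   */
  for (i = 0 ; i < n ; i++)
    x[ perm[i] ] += y[i];       /* irregular update of x through perm */
  for (sum_y = 0.0, i = 0 ; i < n ; i++)
    sum_y += y[i];
  for (sum_x = 0.0, i = 0 ; i < l ; i++)
    sum_x += x[i];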

The first step is to distribute the data among the processors. As the simplest approach we choose BLOCK data distribution, which means that each computing node owns only a part of the data array and of the index array. To do that, we add the following lines to our source code:



#define CD(j,k) (((j)-1)/(k)+1) /* ceiling division */
  MPI_Init( &argc, &argv );
  LIP_Setup( NULL, 0 ); /* all processes go to LIP */
  LIP_Rank( &rank );
  LIP_Size( &size );
  l_l = CD( l, size ); /* number of `x' elements on a single node */
  n_l = CD( n, size ); /* number of `y' & `perm' elements on a single node */
  l_start = l_l * rank;
  n_start = n_l * rank;
  generate_data( x, perm, l, l_start, l_start+l_l, n_start, n_start+n_l );
  LIP_Datamap_create( LIP_DATAMAP_BLOCK, l, &l_l, LIP_INDEXTYPE_INT, 1, 0,
    &datamap ); /* describes the BLOCK distribution of `x' */



We use the ceiling division CD(j,k) to make sure that enough memory is allocated on every node in case the data length is not evenly divisible by the number of nodes; for example, with l = 10 and size = 4, CD gives l_l = 3, so ranks 0-2 own three elements each and rank 3 owns the remaining one. Each node generates only its own data (beginning at l_start) and indices (beginning at n_start). After that a new Datamap object is created with two important parameters: l, the total length of the dataset, and l_l, the size of the local portion after partitioning. The Datamap is the object that holds the description of the data distribution. What we have to do next is to create the communication schedule, transform the index array, and use the communication layer to move all the needed data into place. We write the following lines:



  LIP_Schedule_create( l_l, &schedule );
  LIP_Localize( datamap, perm, LIP_INDEXTYPE_INT, perm_l, LIP_INDEXTYPE_INT,
    n_l, 0, &schedule, LIP_INDEXINFO_NULL );



In the first line we create the communication schedule and inform it that the local data array has size l_l. Then we call the LIP_Localize function; its behavior is shown in Fig. [*] in Chapter [*]. It reads the information stored in the Datamap object and transforms the array perm[] into perm_l[]: indices that pointed to non-local data elements now point to their places in the ghost area, which is located in memory directly after the local array x_l[]. All the knowledge needed to bring these non-local data into place is kept in the modified schedule object. A small illustrative example of this index translation is sketched below.
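
To make the effect of localization concrete, here is a small hypothetical stand-alone illustration (not LIP code): two processes, l = 8 elements of x distributed in blocks of l_l = 4, and four indices used on rank 0. The ghost ordering shown is only an assumption, since the real placement is decided by the schedule.

/* Hypothetical illustration of BLOCK-style index localization.  It mimics
   what LIP_Localize is described to do; it is not part of the LIP library. */
#include <stdio.h>

int main( void )
{
  int l = 8, size = 2, rank = 0;       /* 8 elements of x on 2 processes    */
  int l_l = ((l-1)/size)+1;            /* block size: 4 local elements      */
  int lo = l_l * rank, hi = lo + l_l;  /* rank 0 owns global indices 0..3   */
  int perm[4] = { 2, 5, 7, 1 };        /* global indices used on rank 0     */
  int perm_l[4];
  int ghost = l_l;                     /* ghost slots start right after x_l */
  int i;

  for (i = 0 ; i < 4 ; i++) {
    if (perm[i] >= lo && perm[i] < hi)
      perm_l[i] = perm[i] - lo;        /* local element: plain offset       */
    else
      perm_l[i] = ghost++;             /* non-local: next free ghost slot   */
    printf( "perm[%d] = %d  ->  perm_l[%d] = %d\n", i, perm[i], i, perm_l[i] );
  }
  return 0;
}

On rank 0 this prints perm_l = {2, 4, 5, 1}: the two local indices become plain offsets into x_l[], while the two remote ones are redirected to ghost slots 4 and 5, which the subsequent LIP_Gather will fill with the imported values.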
After localizing the indices we are ready to compute the loops. The first loop reads the array x[] in an irregular way, so before it can be executed all the needed data must be gathered from the other nodes. We write:



  LIP_Schedule_commit( &schedule ); /* synchronize the schedule with the other nodes */
  LIP_Gather( x_l, x_l + l_l, MPI_DOUBLE, schedule ); /* fill the ghost-area */
  for (i = 0 ; i < n_l ; i++)
    y[i] = -x_l[ perm_l[i] ];



The schedule has to be committed to ensure synchronization with the other nodes. Then we gather the data, placing the imported elements in the ghost area.
The second loop is more complicated: it modifies the array x[] by adding values to it. After the loop has been computed, the data therefore have to be scattered back to their owners, where the original values must be updated, i.e. the summation has to be performed. Before this loop can be started, the ghost area has to be cleared; otherwise the values left there by the previous gather would be added as well, which would give erroneous results.



  for (i = 0 ; i < n_l ; i++) /* clear the ghost-area */
    x_l[ i + l_l ] = 0.0;
  for (i = 0 ; i < n_l ; i++)
    x_l[ perm_l[i] ] += y[i];
  LIP_Scatter( x_l + l_l, x_l, MPI_DOUBLE, schedule, LIP_OP_SUM ); /* add ghost values on their owners */
  LIP_Schedule_free( &schedule );



After scattering the data and updating the local values, we can compute the total sums of the arrays. Each node first computes its local sum, and the final result is obtained with MPI_Reduce.



  for (sum_y_l = 0.0, i = 0 ; i < n_l ; i++)
    sum_y_l += y[i];

  MPI_Reduce( &sum_y_l, &sum_y, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );

  for (sum_x_l = 0.0, i = 0 ; i < l_l ; i++)
    sum_x_l += x_l[i];

  MPI_Reduce( &sum_x_l, &sum_x, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );
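
For completeness, a possible ending of the program is sketched below. It only assumes that rank 0 prints the reduced results and that MPI is shut down afterwards; any LIP-specific cleanup (not shown in this section) would normally precede MPI_Finalize.

  /* sketch only: report the global sums on the root process and finish */
  if (rank == 0)
    printf( "sum(x) = %f   sum(y) = %f\n", sum_x, sum_y );
  MPI_Finalize();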


