Parallel external memory

In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory abstract machine.^[1] It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the parallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.

Model

Definition

The PEM model^[1] is a combination of the EM model and the PRAM model. The PEM model is a computation model which consists of $P$ processors and a two-level memory hierarchy. This memory hierarchy consists of a large external memory (main memory) of size $N$ and $P$ small internal memories (caches). The processors share the main memory. Each cache is exclusive to a single processor. A processor can't access another’s cache. The caches have a size $M$ which is partitioned in blocks of size $B$ . The processors can only perform operations on data which are in their cache. The data can be transferred between the main memory and the cache in blocks of size $B$ .

I/O complexity

The complexity measure of the PEM model is the I/O complexity,^[1] which determines the number of parallel blocks transfers between the main memory and the cache. During a parallel block transfer each processor can transfer a block. So if $P$ processors load parallelly a data block of size $B$ form the main memory into their caches, it is considered as an I/O complexity of $O(1)$ not $O(P)$ . A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches.

Read/write conflicts

In the PEM model, there is no direct communication network between the P processors. The processors have to communicate indirectly over the main memory. If multiple processors try to access the same block in main memory concurrently read/write conflicts^[1] occur. Like in the PRAM model, three different variations of this problem are considered:

Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently.
Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time.
Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.

The following two algorithms^[1] solve the CREW and EREW problem if $P\leq B$ processors write to the same block simultaneously. A first approach is to serialize the write operations. Only one processor after the other writes to the block. This results in a total of $P$ parallel block transfers. A second approach needs $O(\log(P))$ parallel block transfers and an additional block for each processor. The main idea is to schedule the write operations in a binary tree fashion and gradually combine the data into a single block. In the first round $P$ processors combine their blocks into $P/2$ blocks. Then $P/2$ processors combine the $P/2$ blocks into $P/4$ . This procedure is continued until all the data is combined in one block.

Comparison to other models


Model	Multi-core	Cache-aware
Random-access machine (RAM)	No	No
Parallel random-access machine (PRAM)	Yes	No
External memory (EM)	No	Yes
Parallel external memory (PEM)	Yes	Yes

Examples

Multiway partitioning

Let $M=\{m_{1},...,m_{d-1}\}$ be a vector of d-1 pivots sorted in increasing order. Let A be an unordered set of N elements. A d-way partition^[1] of A is a set $\Pi =\{A_{1},...,A_{d}\}$ , where $\cup _{i=1}^{d}A_{i}=A$ and $A_{i}\cap A_{j}=\emptyset$ for $1\leq i<j\leq d$ . $A_{i}$ is called the i-th bucket. The number of elements in $A_{i}$ is greater than $m_{i-1}$ and smaller than $m_{i}^{2}$ . In the following algorithm^[1] the input is partitioned into N/P-sized contiguous segments $S_{1},...,S_{P}$ in main memory. The processor i primarily works on the segment $S_{i}$ . The multiway partitioning algorithm (PEM_DIST_SORT^[1]) uses a PEM prefix sum algorithm^[1] to calculate the prefix sum with the optimal $O\left({\frac {N}{PB}}+\log P\right)$ I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm.

// Compute parallelly a d-way partition on the data segments  $S_{i}$ 
for each processor i in parallel do
    Read the vector of pivots M into the cache.
    Partition  $S_{i}$  into d buckets and let vector  $M_{i}=\{j_{1}^{i},...,j_{d}^{i}\}$  be the number of items in each bucket.
end for

Run PEM prefix sum on the set of vectors  $\{M_{1},...,M_{P}\}$  simultaneously.

// Use the prefix sum vector to compute the final partition
for each processor i in parallel do
    Write elements  $S_{i}$  into memory locations offset appropriately by  $M_{i-1}$  and  $M_{i}$ .
end for

Using the prefix sums stored in  $M_{P}$  the last processor P calculates the vector B of bucket sizes and returns it.

If the vector of $d=O\left({\frac {M}{B}}\right)$ pivots M and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model with $O\left({\frac {N}{PB}}+\left\lceil {\frac {d}{B}}\right\rceil >\log(P)+d\log(B)\right)$ I/O complexity. The content of the final buckets have to be located in contiguous memory.

Selection

The selection problem is about finding the k-th smallest item in an unordered list A of size N. The following code^[1] makes use of PRAMSORT which is a PRAM optimal sorting algorithm which runs in $O(\log N)$ , and SELECT, which is a cache optimal single-processor selection algorithm.

if  $N\leq P$  then 
     ${\texttt {PRAMSORT}}(A,P)$ 
    return  $A[k]$ 
end if 

//Find median of each  $S_{i}$ 
for each processor i in parallel do 
     $m_{i}={\texttt {SELECT}}(S_{i},{\frac {N}{2P}})$ 
end for 

// Sort medians
 ${\texttt {PRAMSORT}}(\lbrace m_{1},\dots ,m_{2}\rbrace ,P)$ 

// Partition around median of medians
 $t={\texttt {PEMPARTITION}}(A,m_{P/2},P)$ 

if  $k\leq t$  then 
    return  ${\texttt {PEMSELECT}}(A[1:t],P,k)$ 
else 
    return  ${\texttt {PEMSELECT}}(A[t+1:N],P,k-t)$ 
end if

Under the assumption that the input is stored in contiguous memory, PEMSELECT has an I/O complexity of:

O\left({\frac {N}{PB}}+\log(PB)\cdot \log({\frac {N}{P}})\right)

Distribution sort

Distribution sort partitions an input list A of size N into d disjoint buckets of similar size. Every bucket is then sorted recursively and the results are combined into a fully sorted list.

If $P=1$ the task is delegated to a cache-optimal single-processor sorting algorithm.

Otherwise the following algorithm^[1] is used:

// Sample  ${\tfrac {4N}{\sqrt {d}}}$  elements from A
for each processor i in parallel do
    if  $M<|S_{i}|$  then
         $d=M/B$ 
        Load  $S_{i}$  in M-sized pages and sort pages individually
    else
         $d=|S_{i}|$ 
        Load and sort  $S_{i}$  as single page
    end if
    Pick every  ${\sqrt {d}}/4$ 'th element from each sorted memory page into contiguous vector  $R^{i}$  of samples
end for 

in parallel do
    Combine vectors  $R^{1}\dots R^{P}$  into a single contiguous vector  ${\mathcal {R}}$ 
    Make  ${\sqrt {d}}$  copies of  ${\mathcal {R}}$ :  ${\mathcal {R}}_{1}\dots {\mathcal {R}}_{\sqrt {d}}$ 
end do

// Find  ${\sqrt {d}}$  pivots  ${\mathcal {M}}[j]$ 
for  $j=1$  to  ${\sqrt {d}}$  in parallel do
     ${\mathcal {M}}[j]={\texttt {PEMSELECT}}({\mathcal {R}}_{i},{\tfrac {P}{\sqrt {d}}},{\tfrac {j\cdot 4N}{d}})$ 
end for

Pack pivots in contiguous array  ${\mathcal {M}}$ 

// Partition Aaround pivots into buckets  ${\mathcal {B}}$ 
 ${\mathcal {B}}={\texttt {PEMMULTIPARTITION}}(A[1:N],{\mathcal {M}},{\sqrt {d}},P)$ 

// Recursively sort buckets
for  $j=1$  to  ${\sqrt {d}}+1$  in parallel do
    recursively call  ${\texttt {PEMDISTSORT}}$  on bucket jof size  ${\mathcal {B}}[j]$ 
    using  $O\left(\left\lceil {\tfrac {{\mathcal {B}}[j]}{N/P}}\right\rceil \right)$  processors responsible for elements in bucket j
end for

The I/O complexity of PEMDISTSORT is:

O\left(\left\lceil {\frac {N}{PB}}\right\rceil \left(\log _{d}P+\log _{M/B}{\frac {N}{PB}}\right)+f(N,P,d)\cdot \log _{d}P\right)

where

f(N,P,d)=O\left(\log {\frac {PB}{\sqrt {d}}}\log {\frac {N}{P}}+\left\lceil {\frac {\sqrt {d}}{B}}\log P+{\sqrt {d}}\log B\right\rceil \right)

If the number of processors is chosen that $f(N,P,d)=O\left(\left\lceil {\tfrac {N}{PB}}\right\rceil \right)$ and $M<B^{O(1)}$ the I/O complexity is then:

$O\left({\frac {N}{PB}}\log _{M/B}{\frac {N}{B}}\right)$

Other PEM algorithms


PEM Algorithm	I/O complexity	Constraints
Mergesort^[1]	$O\left({\frac {N}{PB}}\log _{\frac {M}{B}}{\frac {N}{B}}\right)={\textrm {sort}}_{P}(N)$	$P\leq {\frac {N}{B^{2}}},M=B^{O(1)}$
List ranking^[2]	$O\left({\textrm {sort}}_{P}(N)\right)$	$P\leq {\frac {N/B^{2}}{\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}$
Euler tour^[2]	$O\left({\textrm {sort}}_{P}(N)\right)$	$P\leq {\frac {N}{B^{2}}},M=B^{O(1)}$
Expression tree evaluation^[2]	$O\left({\textrm {sort}}_{P}(N)\right)$	$P\leq {\frac {N}{B^{2}\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}$
Finding a MST^[2]	$O\left({\textrm {sort}}_{P}(\|V\|)+{\textrm {sort}}_{P}(\|E\|)\log {\tfrac {\|V\|}{pB}}\right)$	$p\leq {\frac {\|V\|+\|E\|}{B^{2}\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}$

Where ${\textrm {sort}}_{P}(N)$ is the time it takes to sort N items with P processors in the PEM model.

References

^ ^a ^b ^c ^d ^e ^f ^g ^h ⁱ ^j ^k ^l Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041.
^ ^a ^b ^c ^d Arge, Lars; Goodrich, Michael T.; Sitchinava, Nodari (2010). "Parallel external memory graph algorithms". 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE. pp. 1–11. doi:10.1109/ipdps.2010.5470440. ISBN 9781424464425. S2CID 587572.