Parallel external memory

PEM Model

In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory abstract machine.[1] It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the parallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.

Model

Definition

The PEM model[1] is a combination of the EM model and the PRAM model. The PEM model is a computation model which consists of P {\displaystyle P} processors and a two-level memory hierarchy. This memory hierarchy consists of a large external memory (main memory) of size N {\displaystyle N} and P {\displaystyle P} small internal memories (caches). The processors share the main memory. Each cache is exclusive to a single processor. A processor can't access another’s cache. The caches have a size M {\displaystyle M} which is partitioned in blocks of size B {\displaystyle B} . The processors can only perform operations on data which are in their cache. The data can be transferred between the main memory and the cache in blocks of size B {\displaystyle B} .

I/O complexity

The complexity measure of the PEM model is the I/O complexity,[1] which determines the number of parallel blocks transfers between the main memory and the cache. During a parallel block transfer each processor can transfer a block. So if P {\displaystyle P} processors load parallelly a data block of size B {\displaystyle B} form the main memory into their caches, it is considered as an I/O complexity of O ( 1 ) {\displaystyle O(1)} not O ( P ) {\displaystyle O(P)} . A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches.

Read/write conflicts

In the PEM model, there is no direct communication network between the P processors. The processors have to communicate indirectly over the main memory. If multiple processors try to access the same block in main memory concurrently read/write conflicts[1] occur. Like in the PRAM model, three different variations of this problem are considered:

  • Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently.
  • Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time.
  • Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.

The following two algorithms[1] solve the CREW and EREW problem if P B {\displaystyle P\leq B} processors write to the same block simultaneously. A first approach is to serialize the write operations. Only one processor after the other writes to the block. This results in a total of P {\displaystyle P} parallel block transfers. A second approach needs O ( log ( P ) ) {\displaystyle O(\log(P))} parallel block transfers and an additional block for each processor. The main idea is to schedule the write operations in a binary tree fashion and gradually combine the data into a single block. In the first round P {\displaystyle P} processors combine their blocks into P / 2 {\displaystyle P/2} blocks. Then P / 2 {\displaystyle P/2} processors combine the P / 2 {\displaystyle P/2} blocks into P / 4 {\displaystyle P/4} . This procedure is continued until all the data is combined in one block.

Comparison to other models

Model Multi-core Cache-aware
Random-access machine (RAM) No No
Parallel random-access machine (PRAM) Yes No
External memory (EM) No Yes
Parallel external memory (PEM) Yes Yes

Examples

Multiway partitioning

Let M = { m 1 , . . . , m d 1 } {\displaystyle M=\{m_{1},...,m_{d-1}\}} be a vector of d-1 pivots sorted in increasing order. Let A be an unordered set of N elements. A d-way partition[1] of A is a set Π = { A 1 , . . . , A d } {\displaystyle \Pi =\{A_{1},...,A_{d}\}} , where i = 1 d A i = A {\displaystyle \cup _{i=1}^{d}A_{i}=A} and A i A j = {\displaystyle A_{i}\cap A_{j}=\emptyset } for 1 i < j d {\displaystyle 1\leq i<j\leq d} . A i {\displaystyle A_{i}} is called the i-th bucket. The number of elements in A i {\displaystyle A_{i}} is greater than m i 1 {\displaystyle m_{i-1}} and smaller than m i 2 {\displaystyle m_{i}^{2}} . In the following algorithm[1] the input is partitioned into N/P-sized contiguous segments S 1 , . . . , S P {\displaystyle S_{1},...,S_{P}} in main memory. The processor i primarily works on the segment S i {\displaystyle S_{i}} . The multiway partitioning algorithm (PEM_DIST_SORT[1]) uses a PEM prefix sum algorithm[1] to calculate the prefix sum with the optimal O ( N P B + log P ) {\displaystyle O\left({\frac {N}{PB}}+\log P\right)} I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm.

// Compute parallelly a d-way partition on the data segments 
  
    
      
        
          S
          
            i
          
        
      
    
    {\displaystyle S_{i}}
  

for each processor i in parallel do
    Read the vector of pivots M into the cache.
    Partition 
  
    
      
        
          S
          
            i
          
        
      
    
    {\displaystyle S_{i}}
  
 into d buckets and let vector 
  
    
      
        
          M
          
            i
          
        
        =
        {
        
          j
          
            1
          
          
            i
          
        
        ,
        .
        .
        .
        ,
        
          j
          
            d
          
          
            i
          
        
        }
      
    
    {\displaystyle M_{i}=\{j_{1}^{i},...,j_{d}^{i}\}}
  
 be the number of items in each bucket.
end for

Run PEM prefix sum on the set of vectors 
  
    
      
        {
        
          M
          
            1
          
        
        ,
        .
        .
        .
        ,
        
          M
          
            P
          
        
        }
      
    
    {\displaystyle \{M_{1},...,M_{P}\}}
  
 simultaneously.

// Use the prefix sum vector to compute the final partition
for each processor i in parallel do
    Write elements 
  
    
      
        
          S
          
            i
          
        
      
    
    {\displaystyle S_{i}}
  
 into memory locations offset appropriately by 
  
    
      
        
          M
          
            i
            
            1
          
        
      
    
    {\displaystyle M_{i-1}}
  
 and 
  
    
      
        
          M
          
            i
          
        
      
    
    {\displaystyle M_{i}}
  
.
end for

Using the prefix sums stored in 
  
    
      
        
          M
          
            P
          
        
      
    
    {\displaystyle M_{P}}
  
 the last processor P calculates the vector B of bucket sizes and returns it.

If the vector of d = O ( M B ) {\displaystyle d=O\left({\frac {M}{B}}\right)} pivots M and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model with O ( N P B + d B > log ( P ) + d log ( B ) ) {\displaystyle O\left({\frac {N}{PB}}+\left\lceil {\frac {d}{B}}\right\rceil >\log(P)+d\log(B)\right)} I/O complexity. The content of the final buckets have to be located in contiguous memory.

Selection

The selection problem is about finding the k-th smallest item in an unordered list A of size N. The following code[1] makes use of PRAMSORT which is a PRAM optimal sorting algorithm which runs in O ( log N ) {\displaystyle O(\log N)} , and SELECT, which is a cache optimal single-processor selection algorithm.

if 
  
    
      
        N
        
        P
      
    
    {\displaystyle N\leq P}
  
 then 
    
  
    
      
        
          
            PRAMSORT
          
        
        (
        A
        ,
        P
        )
      
    
    {\displaystyle {\texttt {PRAMSORT}}(A,P)}
  

    return 
  
    
      
        A
        [
        k
        ]
      
    
    {\displaystyle A[k]}
  

end if 

//Find median of each 
  
    
      
        
          S
          
            i
          
        
      
    
    {\displaystyle S_{i}}
  

for each processor i in parallel do 
    
  
    
      
        
          m
          
            i
          
        
        =
        
          
            SELECT
          
        
        (
        
          S
          
            i
          
        
        ,
        
          
            N
            
              2
              P
            
          
        
        )
      
    
    {\displaystyle m_{i}={\texttt {SELECT}}(S_{i},{\frac {N}{2P}})}
  

end for 

// Sort medians

  
    
      
        
          
            PRAMSORT
          
        
        (
        {
        
          m
          
            1
          
        
        ,
        
        ,
        
          m
          
            2
          
        
        }
        ,
        P
        )
      
    
    {\displaystyle {\texttt {PRAMSORT}}(\lbrace m_{1},\dots ,m_{2}\rbrace ,P)}
  


// Partition around median of medians

  
    
      
        t
        =
        
          
            PEMPARTITION
          
        
        (
        A
        ,
        
          m
          
            P
            
              /
            
            2
          
        
        ,
        P
        )
      
    
    {\displaystyle t={\texttt {PEMPARTITION}}(A,m_{P/2},P)}
  


if 
  
    
      
        k
        
        t
      
    
    {\displaystyle k\leq t}
  
 then 
    return 
  
    
      
        
          
            PEMSELECT
          
        
        (
        A
        [
        1
        :
        t
        ]
        ,
        P
        ,
        k
        )
      
    
    {\displaystyle {\texttt {PEMSELECT}}(A[1:t],P,k)}
  

else 
    return 
  
    
      
        
          
            PEMSELECT
          
        
        (
        A
        [
        t
        +
        1
        :
        N
        ]
        ,
        P
        ,
        k
        
        t
        )
      
    
    {\displaystyle {\texttt {PEMSELECT}}(A[t+1:N],P,k-t)}
  

end if

Under the assumption that the input is stored in contiguous memory, PEMSELECT has an I/O complexity of:

O ( N P B + log ( P B ) log ( N P ) ) {\displaystyle O\left({\frac {N}{PB}}+\log(PB)\cdot \log({\frac {N}{P}})\right)}

Distribution sort

Distribution sort partitions an input list A of size N into d disjoint buckets of similar size. Every bucket is then sorted recursively and the results are combined into a fully sorted list.

If P = 1 {\displaystyle P=1} the task is delegated to a cache-optimal single-processor sorting algorithm.

Otherwise the following algorithm[1] is used:

// Sample 
  
    
      
        
          
            
              
                4
                N
              
              
                d
              
            
          
        
      
    
    {\displaystyle {\tfrac {4N}{\sqrt {d}}}}
  
 elements from A
for each processor i in parallel do
    if 
  
    
      
        M
        <
        
          |
        
        
          S
          
            i
          
        
        
          |
        
      
    
    {\displaystyle M<|S_{i}|}
  
 then
        
  
    
      
        d
        =
        M
        
          /
        
        B
      
    
    {\displaystyle d=M/B}
  

        Load 
  
    
      
        
          S
          
            i
          
        
      
    
    {\displaystyle S_{i}}
  
 in M-sized pages and sort pages individually
    else
        
  
    
      
        d
        =
        
          |
        
        
          S
          
            i
          
        
        
          |
        
      
    
    {\displaystyle d=|S_{i}|}
  

        Load and sort 
  
    
      
        
          S
          
            i
          
        
      
    
    {\displaystyle S_{i}}
  
 as single page
    end if
    Pick every 
  
    
      
        
          
            d
          
        
        
          /
        
        4
      
    
    {\displaystyle {\sqrt {d}}/4}
  
'th element from each sorted memory page into contiguous vector 
  
    
      
        
          R
          
            i
          
        
      
    
    {\displaystyle R^{i}}
  
 of samples
end for 

in parallel do
    Combine vectors 
  
    
      
        
          R
          
            1
          
        
        
        
          R
          
            P
          
        
      
    
    {\displaystyle R^{1}\dots R^{P}}
  
 into a single contiguous vector 
  
    
      
        
          
            R
          
        
      
    
    {\displaystyle {\mathcal {R}}}
  

    Make 
  
    
      
        
          
            d
          
        
      
    
    {\displaystyle {\sqrt {d}}}
  
 copies of 
  
    
      
        
          
            R
          
        
      
    
    {\displaystyle {\mathcal {R}}}
  
: 
  
    
      
        
          
            
              R
            
          
          
            1
          
        
        
        
          
            
              R
            
          
          
            
              d
            
          
        
      
    
    {\displaystyle {\mathcal {R}}_{1}\dots {\mathcal {R}}_{\sqrt {d}}}
  

end do

// Find 
  
    
      
        
          
            d
          
        
      
    
    {\displaystyle {\sqrt {d}}}
  
 pivots 
  
    
      
        
          
            M
          
        
        [
        j
        ]
      
    
    {\displaystyle {\mathcal {M}}[j]}
  

for 
  
    
      
        j
        =
        1
      
    
    {\displaystyle j=1}
  
 to 
  
    
      
        
          
            d
          
        
      
    
    {\displaystyle {\sqrt {d}}}
  
 in parallel do
    
  
    
      
        
          
            M
          
        
        [
        j
        ]
        =
        
          
            PEMSELECT
          
        
        (
        
          
            
              R
            
          
          
            i
          
        
        ,
        
          
            
              P
              
                d
              
            
          
        
        ,
        
          
            
              
                j
                
                4
                N
              
              d
            
          
        
        )
      
    
    {\displaystyle {\mathcal {M}}[j]={\texttt {PEMSELECT}}({\mathcal {R}}_{i},{\tfrac {P}{\sqrt {d}}},{\tfrac {j\cdot 4N}{d}})}
  

end for

Pack pivots in contiguous array 
  
    
      
        
          
            M
          
        
      
    
    {\displaystyle {\mathcal {M}}}
  


// Partition Aaround pivots into buckets 
  
    
      
        
          
            B
          
        
      
    
    {\displaystyle {\mathcal {B}}}
  


  
    
      
        
          
            B
          
        
        =
        
          
            PEMMULTIPARTITION
          
        
        (
        A
        [
        1
        :
        N
        ]
        ,
        
          
            M
          
        
        ,
        
          
            d
          
        
        ,
        P
        )
      
    
    {\displaystyle {\mathcal {B}}={\texttt {PEMMULTIPARTITION}}(A[1:N],{\mathcal {M}},{\sqrt {d}},P)}
  


// Recursively sort buckets
for 
  
    
      
        j
        =
        1
      
    
    {\displaystyle j=1}
  
 to 
  
    
      
        
          
            d
          
        
        +
        1
      
    
    {\displaystyle {\sqrt {d}}+1}
  
 in parallel do
    recursively call 
  
    
      
        
          
            PEMDISTSORT
          
        
      
    
    {\displaystyle {\texttt {PEMDISTSORT}}}
  
 on bucket jof size 
  
    
      
        
          
            B
          
        
        [
        j
        ]
      
    
    {\displaystyle {\mathcal {B}}[j]}
  

    using 
  
    
      
        O
        
          (
          
            
            
              
                
                  
                    
                      
                        B
                      
                    
                    [
                    j
                    ]
                  
                  
                    N
                    
                      /
                    
                    P
                  
                
              
            
            
          
          )
        
      
    
    {\displaystyle O\left(\left\lceil {\tfrac {{\mathcal {B}}[j]}{N/P}}\right\rceil \right)}
  
 processors responsible for elements in bucket j
end for

The I/O complexity of PEMDISTSORT is:

O ( N P B ( log d P + log M / B N P B ) + f ( N , P , d ) log d P ) {\displaystyle O\left(\left\lceil {\frac {N}{PB}}\right\rceil \left(\log _{d}P+\log _{M/B}{\frac {N}{PB}}\right)+f(N,P,d)\cdot \log _{d}P\right)}

where

f ( N , P , d ) = O ( log P B d log N P + d B log P + d log B ) {\displaystyle f(N,P,d)=O\left(\log {\frac {PB}{\sqrt {d}}}\log {\frac {N}{P}}+\left\lceil {\frac {\sqrt {d}}{B}}\log P+{\sqrt {d}}\log B\right\rceil \right)}

If the number of processors is chosen that f ( N , P , d ) = O ( N P B ) {\displaystyle f(N,P,d)=O\left(\left\lceil {\tfrac {N}{PB}}\right\rceil \right)} and M < B O ( 1 ) {\displaystyle M<B^{O(1)}} the I/O complexity is then:

O ( N P B log M / B N B ) {\displaystyle O\left({\frac {N}{PB}}\log _{M/B}{\frac {N}{B}}\right)}

Other PEM algorithms

PEM Algorithm I/O complexity Constraints
Mergesort[1] O ( N P B log M B N B ) = sort P ( N ) {\displaystyle O\left({\frac {N}{PB}}\log _{\frac {M}{B}}{\frac {N}{B}}\right)={\textrm {sort}}_{P}(N)} P N B 2 , M = B O ( 1 ) {\displaystyle P\leq {\frac {N}{B^{2}}},M=B^{O(1)}}
List ranking[2] O ( sort P ( N ) ) {\displaystyle O\left({\textrm {sort}}_{P}(N)\right)} P N / B 2 log B log O ( 1 ) N , M = B O ( 1 ) {\displaystyle P\leq {\frac {N/B^{2}}{\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}}
Euler tour[2] O ( sort P ( N ) ) {\displaystyle O\left({\textrm {sort}}_{P}(N)\right)} P N B 2 , M = B O ( 1 ) {\displaystyle P\leq {\frac {N}{B^{2}}},M=B^{O(1)}}
Expression tree evaluation[2] O ( sort P ( N ) ) {\displaystyle O\left({\textrm {sort}}_{P}(N)\right)} P N B 2 log B log O ( 1 ) N , M = B O ( 1 ) {\displaystyle P\leq {\frac {N}{B^{2}\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}}
Finding a MST[2] O ( sort P ( | V | ) + sort P ( | E | ) log | V | p B ) {\displaystyle O\left({\textrm {sort}}_{P}(|V|)+{\textrm {sort}}_{P}(|E|)\log {\tfrac {|V|}{pB}}\right)} p | V | + | E | B 2 log B log O ( 1 ) N , M = B O ( 1 ) {\displaystyle p\leq {\frac {|V|+|E|}{B^{2}\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}}

Where sort P ( N ) {\displaystyle {\textrm {sort}}_{P}(N)} is the time it takes to sort N items with P processors in the PEM model.

See also

References

  1. ^ a b c d e f g h i j k l Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041.
  2. ^ a b c d Arge, Lars; Goodrich, Michael T.; Sitchinava, Nodari (2010). "Parallel external memory graph algorithms". 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE. pp. 1–11. doi:10.1109/ipdps.2010.5470440. ISBN 9781424464425. S2CID 587572.
  • v
  • t
  • e
GeneralLevelsMultithreading
TheoryElements
CoordinationProgrammingHardwareAPIsProblems
  •  Category: Parallel computing