A Simple Fast Parallel Sort Algorithm
Here is a source code tarball and screenshot for a program that explores parallel sorting of data using a 2D grid data structure. The screenshot below shows a count of fewer than 12,000 logically parallel steps to sort 16 million random numbers. This is significantly faster than I was expecting, so I decided to put it on the web in case anyone else wants to play with it. More algorithm info is below.
This program also animates the data movement of the sort for clarity. The source code partitions the 16 million integers, arranged as a matrix, into 256 smaller matrices of 64K elements each, visible in the pane on the left. Each sub-matrix is connected to its immediate neighbors so that data can swap across the sub-matrix boundaries, which is required for a global sort of all the data. The pane on the right animates the data movement of the swapping occurring within the sub-matrix selected in the left pane.
Each of the sub-matrices can be processed independently, and they can be sized arbitrarily. My current computer has only 8 hardware threads, so the software allocates each sub-matrix to one of the available real hardware threads. The algorithm can easily scale to thousands of CPU hardware threads when they become available; performance on current computers is gated primarily by thread count.
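A minimal sketch of this work distribution, assuming a simple chunked layout (names here are illustrative, not from the tarball): each sub-matrix becomes an independent task handed to a pool sized to the real hardware thread count. Python threads only illustrate the scheduling pattern; the SDL program would get true parallelism from native threads.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def sort_submatrices(data, num_tiles):
    """Split data into num_tiles equal chunks (the sub-matrices) and sort
    each one on a thread pool sized to the hardware thread count.
    Illustrative sketch only: shows the task-per-sub-matrix pattern."""
    chunk = len(data) // num_tiles
    tiles = [data[i * chunk:(i + 1) * chunk] for i in range(num_tiles)]
    workers = os.cpu_count() or 1          # e.g. 8 hardware threads
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One task per sub-matrix; the pool maps tasks onto real threads.
        return list(pool.map(sorted, tiles))
```

With 256 sub-matrices and 8 threads, the pool simply keeps 8 sub-matrix tasks in flight at a time; with thousands of threads, the same code would run all 256 at once.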
This code uses the SDL graphics library, and I have released it under the Zero Clause BSD (0BSD) license for maximally permissive reuse.
The Algorithm
- Phase 1: Sort all columns in parallel in M steps, where M is the number of elements in a column (the square root of N). This is done in M steps of alternating odd/even swaps with each element's immediate neighbor. After M steps, each column is locally sorted within itself.
- Phase 2: Now we switch to row sorting, with occasional column swaps interlaced to allow data to move between rows. We continue in this mode until no more swaps are occurring. Note that for odd rows we reverse the swap direction so values increase right to left. We do this so that all swaps can be with an immediate neighbor instead of having to wrap around at the row end.
- Phase 3: Optionally reverse the data in the odd rows in M steps, if you don't want to manage the reverse-order rows at extraction time. This adds another square root of N steps.
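The three phases above can be sketched in a small single-threaded simulation (a hedged sketch, not the tarball's code; the function name and structure are illustrative). Each pass of a loop body below touches only disjoint pairs, so it counts as one logically parallel step:

```python
import random

def parallel_sort_steps(grid):
    """Sort an M x M grid in place per the three phases, returning the
    count of logically parallel steps. Odd rows increase right to left
    during Phase 2; Phase 3 un-reverses them. Illustrative sketch only."""
    m = len(grid)
    steps = 0

    def column_pass(parity):
        swapped = False
        for r in range(parity, m - 1, 2):      # disjoint row pairs: one parallel step
            for c in range(m):
                if grid[r][c] > grid[r + 1][c]:
                    grid[r][c], grid[r + 1][c] = grid[r + 1][c], grid[r][c]
                    swapped = True
        return swapped

    def row_pass(parity):
        swapped = False
        for r in range(m):
            descending = (r % 2 == 1)          # odd rows increase right to left
            for c in range(parity, m - 1, 2):  # disjoint column pairs
                a, b = grid[r][c], grid[r][c + 1]
                if a != b and (a > b) != descending:
                    grid[r][c], grid[r][c + 1] = b, a
                    swapped = True
        return swapped

    # Phase 1: M alternating odd/even column passes; each column ends sorted.
    for i in range(m):
        column_pass(i % 2)
        steps += 1

    # Phase 2: row passes with interlaced column passes until nothing moves.
    while True:
        moved = False
        for i in range(m):
            moved |= row_pass(i % 2)
            steps += 1
        for i in range(2):                     # the "occasional" column swaps
            moved |= column_pass(i % 2)
            steps += 1
        if not moved:
            break

    # Phase 3: optionally un-reverse the odd rows, M more parallel steps.
    for r in range(1, m, 2):
        grid[r].reverse()
    steps += m
    return steps
```

When no row pass swaps, every row is sorted in its direction; when no column pass swaps, every column is sorted top to bottom; together those two conditions make the whole snake-order traversal sorted, which is why the termination test in Phase 2 is sufficient.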
Performance
- This performs in near Log(N) steps, which is dramatically faster than the N steps I was expecting.
- This shows there is a lot of untapped performance available in new high-thread-count architectures. The sooner we can get massive CPU thread counts, the better.
Next Steps
- This specific algorithm could be improved further by adding dimensions to the interconnects.
- Architecturally, the big open question is how to manage the interconnects; as parallelism scales, the interconnect will dominate performance.
- I suggest experimenting with dynamic interconnects to 'learn' the best interconnect topologies, as is now popularly done in AI machine learning.
Links