External merge sort in C++
External merge sort in C++
External sorting is a concept for a group of sorting algorithms capable of handling large data volumes. External sorting is needed if the information getting sorted does not fit into a computer device's primary memory and, instead, it must reside in the lighter external memory. Typically, external sorting developed a hybrid merging tactic. Amounts of information small enough to fit into primary memory are read, sorted, and written to a temporary file during the sorting process. The sorted sub-files are merged into a single bigger file during the merge phase.
Algorithm:
- Read the input file, and that at many other components of 'run size' is the view at a time.
- Next, read inside an array per each run.
- Order the execution using Merge Sort.
- Store the array sorted into a file. Let's say 'i' for file i.
- Use the approach discussed to merge k sorted arrays to fuse the sorted files
Example:
#include <iostream> #include <algorithm> #include <queue> #include <limits> using namespace std; struct MinHeapNode { int element; int k; }; struct comp { bool operator()(const MinHeapNode lhs, const MinHeapNode rhs) const { return lhs.element > rhs.element; } }; FILE* openFile(char* fileName, char* mode) { FILE* fp = fopen(fileName, mode); if (fp == NULL) { perror("Error detected while opening the file.\n"); exit(EXIT_FAILURE); } return fp; } void mergeFiles(char *output_file, int n, int k) { FILE* in[k]; for (int j = 0; j < k; j++) { char fileName[2]; // convert i to string snprintf(fileName, sizeof(fileName), "%d", j); in[i] = openFile(fileName, "r"); } FILE *out = openFile(output_file, "w"); MinHeapNode harr[k]; priority_queue<MinHeapNode, vector<MinHeapNode>, comp> pq; int i; for (i = 0; i < k; i++) { if (fscanf(in[i], "%d ", &harr[i].element) != 1) break; harr[i].i = i; pq.push(harr[i]); } int count = 0; while (count != i) { MinHeapNode root = pq.top(); pq.pop(); fprintf(out, "%d ", root.element); . if (fscanf(in[root.i], "%d ", &root.element) != 1 ) { root.element = numeric_limits<int>::max(); count++; } // Replace root with next element of input file pq.push(root); } // close input and output files for (int i = 0; i < k; i++) fclose(in[i]); fclose(out); } void createInitialRuns(char *input_file, int run_size, int num_ways) { // For big input file FILE *in = openFile(input_file, "r"); // output scratch files FILE* out[num_ways]; char fileName[2]; for (int i = 0; i < num_ways; i++) { // convert i to string snprintf(fileName, sizeof(fileName), "%d", i); // Open output files in write mode. out[i] = openFile(fileName, "w"); } int* arr = new int[run_size]; bool more_input = true; int next_output_file = 0; int i; while (more_input) { for (i = 0; i < run_size; i++) { if (fscanf(in, "%d ", &arr[i]) != 1) { more_input = false; break; } } sort(arr, arr + i); for (int j = 0; j < i; j++) fprintf(out[next_output_file], "%d ", arr[j]); next_output_file++; } // deallocate memory delete arr; // close input and output files for (int i = 0; i < num_ways; i++) fclose(out[i]); fclose(in); } // Program to demonstrate external sorting int main() { // No. of partitions of input file int num_ways = 10; // The size of each partition int run_size = 1000; char input_file[] = "input.txt"; char output_file[] = "output.txt"; FILE* in = openFile(input_file, "w"); srand(time(NULL)); // generate input for (int i = 0; i < num_ways * run_size; i++) fprintf(in, "%d ", rand()); fclose(in); createInitialRuns(input_file, run_size, num_ways); mergeFiles(output_file, run_size, num_ways); return 0; }
Complexity Analysis:
Time Complexity: O(n + run_size log run_size).
The time needed by the merge sort is O(nlogn). However, there are components in the most run size. Thus the time complexity is O(run size log run size), but then the time complexity is O(n) to merge the sorted arrays. The multiply-accumulate of the time also is O(n + run size log run size).
Auxiliary space:O(run_size).
The run size was its space required for array storage.
One such code will not work on compiler online, as it needs permissions to create files. So, if running a local machine, the sample input file "input.txt" is produced with different characters of 10000. It sorts the number and put the numbers sorted in an "output.txt" file. It also produces files with names 1, 2, .. Sorted passes to shop.