Burrows - Wheeler Data Transform Algorithm in Java
An algorithm for data transformation called BWT (Burrows – Wheeler Data Transform) rearranges data so that the resultant message can be compressed more effectively. It is, in terms of technology, a lexicographical reversible permutation (rotation) of a string's characters. This is the first of three steps that must be followed in order to develop the Burrows-Wheeler Data Compression method, which is the basis for the bzip2 Unix compression tool.
What is the main reason behind BWT and the Need for BWT?
The fields of biology and ecology are where BWT is most commonly used. In genomes, which are lengthy strings written in the alphabets A, C, T, and G, there are numerous repetitions but with few cycles. Creating an array whose rows are all cyclic shifts of the given input string in dictionary order is the main concept behind the BWT. The last column of the array, which typically has long sequences of identical characters, is the one that has to be returned.
The advantage of this is that the characters also have an order once they have been clustered together. This can help our string compress more for use with other methods, such as Huffman Coding and run-length encoding. The most amazing aspect of BWT is that there is very little data overhead, and it can be reversed.
Example 1:
Input:
text = "helloworld$"
Output:
The text after applying the BWT Algorithm is dlh$relwloo.
Explanation:
For the given string the BWT Algorithm follows the below Steps.
Step 1: Compute each and every possible cyclic rotation for the provided text.
helloworld$
$ h $helloworld
d e d$helloworl
l l ld$hellowor
Cyclic rotations ---------> rld$hellowo
r l orld$hellow
o w o world$hello
oworld$hell
loworld$hel
lloworld$he
elloworld$h
Step 2: The rotations must then be sorted lexicographically. Even before "d," the '$' sign is considered the first letter in lexicographically.
Should Perform Sorting Alphabetically
helloworld$ $helloworld
$helloworld d$helloworl
d$helloworl elloworld$h
ld$hellowor helloworld$
rld$hellowo ld$hellowor
orld$hellow -------------------> lloworld$he
world$hello loworld$hel
oworld$hell orld$hellow
loworld$hel oworld$hell
lloworld$he rld$hellowo
elloworld$h world$hello
Step 3: The final Column characters represent the output of the BWT Algorithm.
BWT (helloworld$) ----------> dlh$relwloo
Why is only the last Column taken into Consideration?
In comparison to the other columns, the final column exhibits superior symbol clustering. We can retrieve the remaining cyclic rotations in full even if we only have the BWT of our string. This feature is missing from the other columns and is crucial for calculating the inverse of BWT. The main reason for using the '$' symbol is that even if there are no EOF characters ('$') in our given text, we can still compute BWT. As one computes the inverse of BWT, the '$' symbol becomes relevant.
Approach: Using Rotation Class
Let us first create our input string, "helloworld$," and our output, a character array called bwt_array.
To save the index of each suffix, let's identify all the suffixes that come with the word "helloworld" and compute its suf_array.
0 helloworld$ 10 $
1 elloworld$ 9 d$
2 lloworld$ 1 elloworld$
3 loworld$ 0 helloworld$
4 oworld$ Sorting 8 ld$
5 world$ ----------> 2 lloworld$
6 orld$ alphabetically 3 loworld$
7 rld$ 6 orld$
8 ld$ 4 oworld$
9 d$ 7 rld$
10$ 5 world$
The final character of each rotation should now be added to our output array, bwt_array, as we continue iterating over the suf_array.
One can compute the final character of every input rotation, beginning at the point indicated by the current value in the suffix array, using the formula input[(suf_array[i] – 1 + n ) % n], where n is the number of elements in the suf_array.
bwt_array[0]
= input[(suf_array[0] - 1 + 11) % 11]
= input[9]
= d
Implementation:
FileName: BWTAlgorithm.java
import java.util.Arrays;
import java.io.*;
class RotationCycle implements Comparable<RotationCycle>
{
int index;
String suf;
public RotationCycle(int index, String suf)
{
this.index = index;
this.suf = suf;
}
@Override public int compareTo(RotationCycle o)
{
return this.suf.compareTo(o.suf);
}
}
public class BWTAlgorithm
{
public static int[] SuffixArrayComputation(String input)
{
int Textlength = input.length();
RotationCycle[] suffix = new RotationCycle[Textlength];
for (int i = 0; i < Textlength; i++)
{
suffix[i] = new RotationCycle(i, input.substring(i));
}
Arrays.sort(suffix);
int[] suffixArr = new int[Textlength];
for (int i = 0; i < Textlength; i++)
{
suffixArr[i] = suffix[i].index;
}
return suffixArr;
}
public static String findLastChar(String input, int[] sufArray)
{
int n = input.length();
StringBuilder bwtArr = new StringBuilder();
for (int i = 0; i < n; i++)
{
int j = sufArray[i] - 1;
if (j < 0)
{
j = j + n;
}
bwtArr.append(input.charAt(j));
}
return bwtArr.toString();
}
public static void main(String[] args)
{
String input = "helloworld$";
int[] sufArray = SuffixArrayComputation(input);
String bwtArr = findLastChar(input, sufArray);
System.out.println("The given Input text : " + input);
System.out.println("The text after applying Burrows - Wheeler Transform Algorithm : " + bwtArr);
}
}
Output:
The given Input text : helloworld$
The text after applying Burrows - Wheeler Transform Algorithm : dlh$relwloo
Complexity Analysis:
For the above BWT program, O(n^2 Logn) is the time complexity. This is because the preceding approach for creating the suffix array has an O(n^2 Logn) time complexity, which results from the O(n) time required for string comparisons in the O(nLogn) sorting algorithm.