An algorithm for data transformation called BWT (Burrows – Wheeler Data Transform) rearranges data so that the resultant message can be compressed more effectively. It is, in terms of technology, a lexicographical reversible permutation (rotation) of a string's characters. This is the first of three steps that must be followed in order to develop the Burrows-Wheeler Data Compression method, which is the basis for the bzip2 Unix compression tool.

What is the main reason behind BWT and the Need for BWT?

The fields of biology and ecology are where BWT is most commonly used. In genomes, which are lengthy strings written in the alphabets A, C, T, and G, there are numerous repetitions but with few cycles. Creating an array whose rows are all cyclic shifts of the given input string in dictionary order is the main concept behind the BWT. The last column of the array, which typically has long sequences of identical characters, is the one that has to be returned.

The advantage of this is that the characters also have an order once they have been clustered together. This can help our string compress more for use with other methods, such as Huffman Coding and run-length encoding. The most amazing aspect of BWT is that there is very little data overhead, and it can be reversed.

Example 1:

Input:

text = "helloworld$"

Output:

The text after applying the BWT Algorithm is dlh$relwloo.

Explanation:

For the given string the BWT Algorithm follows the below Steps.

Step 1: Compute each and every possible cyclic rotation for the provided text.

helloworld$

       $    h                                            $helloworld

    d           e                                        d$helloworl

l                     l                                     ld$hellowor

   Cyclic rotations     --------->       rld$hellowo

r                    l                                     orld$hellow

    o    w    o                                        world$hello

                                                            oworld$hell

                                                            loworld$hel

                                                            lloworld$he

                                                            elloworld$h

Step 2: The rotations must then be sorted lexicographically. Even before "d," the '$' sign is considered the first letter in lexicographically.

Should Perform Sorting Alphabetically

helloworld$                                     $helloworld      

$helloworld                                     d$helloworl                                               

d$helloworl                                     elloworld$h

ld$hellowor                                     helloworld$

rld$hellowo                                     ld$hellowor            

orld$hellow    ------------------->   lloworld$he            

world$hello                                     loworld$hel      

oworld$hell                                     orld$hellow

loworld$hel                                     oworld$hell

lloworld$he                                     rld$hellowo

elloworld$h                                     world$hello

Step 3: The final Column characters represent the output of the BWT Algorithm.

BWT (helloworld$) ---------->  dlh$relwloo

Why is only the last Column taken into Consideration?

In comparison to the other columns, the final column exhibits superior symbol clustering. We can retrieve the remaining cyclic rotations in full even if we only have the BWT of our string. This feature is missing from the other columns and is crucial for calculating the inverse of BWT. The main reason for using the '$' symbol is that even if there are no EOF characters ('$') in our given text, we can still compute BWT. As one computes the inverse of BWT, the '$' symbol becomes relevant.

Approach: Using Rotation Class

Let us first create our input string, "helloworld$," and our output, a character array called bwt_array.

To save the index of each suffix, let's identify all the suffixes that come with the word "helloworld" and compute its suf_array.

0 helloworld$                      10 $  

1 elloworld$                        9 d$

2 lloworld$                           1 elloworld$

3 loworld$                            0 helloworld$

4 oworld$      Sorting         8 ld$

5 world$      ---------->        2 lloworld$      

6 orld$     alphabetically   3 loworld$

7 rld$                                     6 orld$

8 ld$                                       4 oworld$

9 d$                                        7 rld$

10$                                         5 world$

The final character of each rotation should now be added to our output array, bwt_array, as we continue iterating over the suf_array.

One can compute the final character of every input rotation, beginning at the point indicated by the current value in the suffix array, using the formula input[(suf_array[i] – 1 + n ) % n], where n is the number of elements in the suf_array.

bwt_array[0]

  = input[(suf_array[0] - 1 + 11) % 11]

  = input[9]

  = d

Implementation:

FileName: BWTAlgorithm.java

import java.util.Arrays;

import java.io.*;

class RotationCycle implements Comparable<RotationCycle>

{

            int index;

            String suf;

            public RotationCycle(int index, String suf)

            {

                        this.index = index;

                        this.suf = suf;

            }

            @Override public int compareTo(RotationCycle o)

            {

                        return this.suf.compareTo(o.suf);

            }

}

public class BWTAlgorithm

{

            public static int[] SuffixArrayComputation(String input)

            {

                        int Textlength = input.length();

                        RotationCycle[] suffix = new RotationCycle[Textlength];

                        for (int i = 0; i < Textlength; i++)

                        {

                                    suffix[i] = new RotationCycle(i, input.substring(i));

                        }

                        Arrays.sort(suffix);

                        int[] suffixArr = new int[Textlength];

                        for (int i = 0; i < Textlength; i++)

                        {

                                    suffixArr[i] = suffix[i].index;

                        }

                        return suffixArr;

            }

            public static String findLastChar(String input, int[] sufArray)

            {

                        int n = input.length();

                        StringBuilder bwtArr = new StringBuilder();

                        for (int i = 0; i < n; i++)

                        {

                                    int j = sufArray[i] - 1;

                                    if (j < 0)

                                    {

                                                j = j + n;

                                    }

                                    bwtArr.append(input.charAt(j));

                        }

                        return bwtArr.toString();

            }

            public static void main(String[] args)

            {

                        String input = "helloworld$";

                        int[] sufArray = SuffixArrayComputation(input);

                        String bwtArr = findLastChar(input, sufArray);

                        System.out.println("The given Input text : " + input);

                        System.out.println("The text after applying Burrows - Wheeler Transform Algorithm : "            + bwtArr);

            }

}

Output:

The given Input text : helloworld$

The text after applying Burrows - Wheeler Transform Algorithm : dlh$relwloo

Complexity Analysis:

For the above BWT program, O(n^2 Logn) is the time complexity. This is because the preceding approach for creating the suffix array has an O(n^2 Logn) time complexity, which results from the O(n) time required for string comparisons in the O(nLogn) sorting algorithm.

← Prev Next →

Java Tutorial Index

Java Loops

Java Programs

Java Sorting

Java OOPs Concepts

Java Strings

Java Exceptions

Garbage Collection

Multithreading

Java IO

Serialization

Networking

AWT

Swing

Java Collections

Java Generics

Java Annotations

Java JDBC

Java Differences

How to

Java 8 Features

Java 9 Features

Java 12

Java 13

Java 14

Java 15

Java 16

Java 17

Java Math Methods

Java String Methods

Java Conversion

Java Keywords

Java Problems

Java Questions

Java Interview Questions

Misc

Burrows - Wheeler Data Transform Algorithm in Java

What is the main reason behind BWT and the Need for BWT?

Why is only the last Column taken into Consideration?

Approach: Using Rotation Class