Tokenizing a String in C++
In C++, tokenizing a string means dividing a given text into smaller parts called tokens. Depending on your requirements, these tokens can be single characters, words, or phrases. String tokenization is a common text processing step that is used in a wide range of applications, such as text analysis, natural language processing, and parsing data from text files.
Tokenization's primary goal is to break up a lengthy text document into digestible chunks that can be handled separately. Tokens are usually divided by delimiters, which are strings or characters that show where one token stops and another one starts. Though you can use any character or string as a delimiter depending on your needs, common delimiters include spaces, commas, periods, and more.
Tokenizing a string in C++ can be done in a few different ways, such as reading from a std::istringstream (a standard library stream class) with std::getline, or searching for delimiter positions manually with the std::string member functions find and substr. The method and delimiter you choose depend on the parsing requirements and the complexity of the task. After a string has been tokenized, your C++ application can process, analyze, or otherwise use the tokens.
Example:
A C++ implementation of tokenizing the string:
#include <iostream>
#include <string>
#include <sstream>
#include <vector>

int main()
{
    std::string input_string = "This is a sample string for tokenization";
    std::vector<std::string> tokens;
    std::istringstream tokenStream(input_string);
    std::string token;

    // Read up to the next space on each iteration
    while (std::getline(tokenStream, token, ' ')) {
        tokens.push_back(token);
    }

    for (const std::string &t : tokens) {
        std::cout << t << std::endl;
    }
    return 0;
}
Output:
This
is
a
sample
string
for
tokenization
Explanation:
In the above example, the input_string is wrapped in a std::istringstream and read with std::getline, with a space (' ') serving as the delimiter. Each token is appended to a std::vector<std::string>, which is then iterated to print the tokens one per line.
If your tokens are separated differently, you can pass any other single character as the delimiter. Note that std::getline accepts only one character as its delimiter, not a string.
Another approach is to extract tokens manually around a custom delimiter, using the std::string member functions find and substr. Here's an illustration of this method in action:
Example 2 code:
#include <iostream>
#include <string>

int main()
{
    std::string input_string = "This,is,a,sample,string,for,tokenization,example 2";
    std::string delimiter = ",";
    size_t start = 0, end = 0;

    while ((end = input_string.find(delimiter, start)) != std::string::npos) {
        std::string token = input_string.substr(start, end - start);
        std::cout << token << std::endl;
        start = end + delimiter.length();
    }

    // Print the last token (if any)
    std::string token = input_string.substr(start);
    std::cout << token << std::endl;
    return 0;
}
Output:
This
is
a
sample
string
for
tokenization
example 2
Explanation:
In this example, we employ a comma (',') as the separator. The substr function extracts the token between the current position (start) and the delimiter position (end), and the find function finds the delimiter in the string. Until every token is extracted, the procedure is repeated.
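Time Complexity: O(n) for a single-character delimiter, where n is the length of the string.
Auxiliary Space: O(1), apart from the temporary token string, which can be as long as the input.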
By using std::sregex_token_iterator:
Example:
#include <algorithm>
#include <iostream>
#include <regex>
#include <string>
#include <vector>

std::vector<std::string> tokenize(const std::string& str, const std::regex& re)
{
    // -1 selects the text between matches of re, i.e. the tokens
    std::sregex_token_iterator it{ str.begin(), str.end(), re, -1 };
    std::vector<std::string> tokenized{ it, {} };

    // Additional check to remove empty strings
    tokenized.erase(std::remove_if(tokenized.begin(), tokenized.end(),
                                   [](std::string const& s) { return s.empty(); }),
                    tokenized.end());
    return tokenized;
}

// Driver Code
int main()
{
    const std::string str = "Break string a,spaces,and,commas";
    // One or more whitespace characters or commas
    const std::regex re(R"([\s,]+)");

    const std::vector<std::string> tokenized = tokenize(str, re);
    for (const std::string& token : tokenized)
        std::cout << token << std::endl;
    return 0;
}
Output:
Break
string
a
spaces
and
commas
Conclusion:
Tokenizing a string is an essential step in text processing and parsing in C++. This article covered three ways to do it: reading from a std::istringstream with std::getline, manually parsing the string using the find and substr functions, and using std::sregex_token_iterator. Tokenization tasks where tokens are separated by a single character (spaces, for example) are best served by the first approach because it is simple and effective. The manual parsing approach offers greater flexibility and control when working with multi-character delimiters, and the regular expression approach suits cases where several different separators must be handled at once. You can deconstruct a string into its component tokens for additional processing or analysis by using whichever approach best suits your use case.