Sentence to Python Vector
Conversion of a Sentence to Vector in Python
Before starting the tutorial, let's recap what a vector is and which package has to be imported to work with one in Python.
Python Vector:
Put simply, a vector in Python is a one-dimensional array. Python has no built-in vector type; vectors are usually created with NumPy, a third-party package that is installed separately (for example with pip install numpy). Once the package is imported, we can create a vector.
Creation of a Vector:
To create a vector, NumPy provides the numpy.array() function, which can build both horizontal (row) and vertical (column) vectors.
Consider the program:
#Python 3
import numpy as np
list_1 = [100, 200, 300, 400, 500]  # horizontal, one row
list_2 = [[10], [11], [12], [13], [14]]  # vertical, one column
vector_1 = np.array(list_1)
vector_2 = np.array(list_2)
print("Horizontal Vector : ")
print(vector_1)
print("Vertical Vector : ")
print(vector_2)
Output:
Horizontal Vector : 
[100 200 300 400 500]
Vertical Vector : 
[[10]
 [11]
 [12]
 [13]
 [14]]
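As a small illustrative check (not part of the original program), the shape and ndim attributes show how NumPy stores the two vectors. The horizontal vector is a true one-dimensional array, while the vertical one is stored as a two-dimensional array with a single column.

```python
import numpy as np

# Same data as in the program above
vector_1 = np.array([100, 200, 300, 400, 500])
vector_2 = np.array([[10], [11], [12], [13], [14]])

# shape is a tuple of sizes per axis; ndim is the number of axes
print(vector_1.shape, vector_1.ndim)  # (5,) 1
print(vector_2.shape, vector_2.ndim)  # (5, 1) 2
```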
We now know how to create a vector with the NumPy package. Next, we are going to convert a sentence, which can be any line of text, into a vector.
We can store a sentence in a vector in two ways: by separating it into words or into characters. In the following example we take a line of text, split it into a list of words with Python's split() method, and convert that list to a vector.
#Python 3
import numpy as np
st = "welcome to Java T Point"
np.array(st.split(" "))  # split the string and convert the list of words to a vector
Output:
array(['welcome', 'to', 'Java', 'T', 'Point'], dtype='<U7')
Explanation:
- First we converted the line of text into a list of words with the split() method, a simple form of tokenization, and then converted that list to a one-dimensional NumPy array, which we call a vector.
- Tokenization means splitting a string or sentence into words, phrases, keywords or symbols (tokens). Also note the dtype='<U7' in the output above, which means the data type is Unicode text with a maximum length of 7 characters (the length of "welcome").
- A data type object, or dtype, describes how the fixed-size block of memory for each array element is interpreted. For a string array, its width is the length of the longest string present when the array is created. Once that width is fixed, a longer string cannot be stored in full: only the characters up to that length are kept and the rest are dropped.
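The truncation described above can be seen in a minimal sketch. The replacement word "tutorials" and the wider '<U20' dtype are arbitrary choices made for illustration, not values from the original example.

```python
import numpy as np

words = np.array("welcome to Java T Point".split(" "))

# The array holds at most 7 characters per element,
# so a 9-character string is silently cut to 7.
words[1] = "tutorials"
print(words[1])  # tutoria

# Converting to a wider dtype first leaves room for longer strings.
wide = words.astype("<U20")
wide[1] = "tutorials"
print(wide[1])  # tutorials
```

If longer strings may be added later, choosing a wider dtype up front avoids this silent loss of characters.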
Let's look at another simple program that converts a list of sentences to a vector.
Example 1:
#Python 3
import numpy as np
arr = ["hello everyone", "this is javatpoint", "you are welcome"]
np.asarray(arr, dtype=None, order=None)  # dtype=None lets NumPy infer the data type
Output:
array(['hello everyone', 'this is javatpoint', 'you are welcome'],
dtype='<U18')
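Here, we converted a list of sentences to a vector. Passing dtype=None does not mean the vector has no data type; it tells NumPy to infer the type from the input, which is '<U18' here because the longest sentence has 18 characters.
We can do the same thing with a different method, as shown in the following example.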
Example 2:
#Python 3
import numpy as np
arr = ["hello everyone", "this is javatpoint", "you are welcome"]
np.array([i for i in arr])
Output:
array(['hello everyone', 'this is javatpoint', 'you are welcome'],
dtype='<U18')
We get the same output as the program above: the list comprehension iterates over the list, assigns each value to i in turn, and builds a new list that is passed to np.array().
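Both functions give the same result for a plain list, but they differ when the input is already a NumPy array. The following sketch, with variable names chosen only for illustration, shows that np.array() makes a copy by default while np.asarray() returns the existing array when no conversion is needed.

```python
import numpy as np

sentences = ["hello everyone", "this is javatpoint", "you are welcome"]

# From a plain list, both functions build an equal new array
from_array = np.array(sentences)
from_asarray = np.asarray(sentences)
print((from_array == from_asarray).all())  # True

# From an existing array, asarray reuses it and array copies it
same = np.asarray(from_array)
copy = np.array(from_array)
print(same is from_array)  # True
print(copy is from_array)  # False
```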
Overview:
- A vector is a one-dimensional NumPy array and can be created with the numpy.array() function.
- To convert a sentence or sequence of words to a vector, we tokenize the text and then build the vector from the tokens.
- For a vector of strings, the number in the dtype (for example '<U7') is the maximum length of an individual string in the sequence.
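The tutorial mentions that a sentence can also be separated into characters. A minimal sketch of both approaches side by side, assuming that spaces should be kept as characters of their own, could look like this:

```python
import numpy as np

sentence = "welcome to Java T Point"

# Word-level vector: one element per space-separated token
word_vector = np.array(sentence.split(" "))

# Character-level vector: list() breaks the string into single characters,
# including the four spaces
char_vector = np.array(list(sentence))

print(word_vector.shape)  # (5,)
print(char_vector.shape)  # (23,)
print(char_vector[:7])
```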