Bag of Words Dataset
To that end I've selected the Bag of Words dataset from the UCI repository, which is renowned for its neat and tidy datasets and is a popular starting point. It contains five distinct text collections, each of which can be used as-is or split into smaller subsets, though I'm only using one of them, the infamous Enron dataset, since the others are far too large for either GitHub or the machines I have. Real-world datasets, however, are huge, with millions of words.
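Below is a minimal sketch of loading one of these collections, assuming the standard UCI Bag of Words layout, where docword.enron.txt.gz starts with the document count, vocabulary size, and number of non-zero counts, followed by "docID wordID count" triples; the file names and paths are assumptions, not part of this write-up.

```python
import gzip
from scipy.sparse import coo_matrix

def load_uci_bow(docword_path, vocab_path):
    """Load a UCI Bag of Words collection into a sparse document-term matrix."""
    with open(vocab_path) as f:
        vocab = [line.strip() for line in f]
    with gzip.open(docword_path, "rt") as f:
        n_docs = int(f.readline())    # D: number of documents
        n_words = int(f.readline())   # W: vocabulary size
        f.readline()                  # NNZ: number of non-zero counts (not needed here)
        rows, cols, vals = [], [], []
        for line in f:
            doc_id, word_id, count = map(int, line.split())
            rows.append(doc_id - 1)   # IDs in the file are 1-based
            cols.append(word_id - 1)
            vals.append(count)
    X = coo_matrix((vals, (rows, cols)), shape=(n_docs, n_words)).tocsr()
    return X, vocab

# X, vocab = load_uci_bow("docword.enron.txt.gz", "vocab.enron.txt")
```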
This approach is a simple and flexible way of extracting features from documents. The number of distinct words is typically larger than 100,000. Building a bag of words involves 3 steps: tokenizing the text, building a vocabulary, and counting how often each word appears in each document.
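Here is a minimal illustration of those three steps on a couple of toy sentences; the example texts are made up for demonstration and are not taken from the dataset.

```python
from collections import Counter

docs = ["the meeting is at noon", "please reschedule the meeting"]

# Step 1: tokenize each document into words
tokenized = [doc.lower().split() for doc in docs]

# Step 2: build the vocabulary (one index per unique word)
vocab = {word: i for i, word in enumerate(sorted({w for doc in tokenized for w in doc}))}

# Step 3: count word occurrences to form one count vector per document
vectors = []
for doc in tokenized:
    counts = Counter(doc)
    vectors.append([counts.get(word, 0) for word in vocab])

print(vocab)
print(vectors)
```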
In the BoW model, a text such as a sentence or a document is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. Each document, in this case an email, is converted into a vector representation. I then used a Decision Tree to train my model on the bag-of-words input and predict whether a sentence is important or not.
Training and Classification
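A hedged sketch of what that training step could look like using scikit-learn's CountVectorizer and DecisionTreeClassifier; the sentences, labels, and parameter choices below are illustrative assumptions, not the actual training data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: sentences labelled important (1) or not (0)
sentences = [
    "please review the attached contract before friday",
    "lunch options in the cafeteria today",
    "the merger announcement must remain confidential",
    "parking lot will be repaved next week",
]
labels = [1, 0, 1, 0]

# Convert each sentence into a bag-of-words count vector
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

# Train a Decision Tree on the count vectors
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, labels)

# Classify a new, unseen sentence
print(clf.predict(vectorizer.transform(["review the confidential contract"])))
```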