Python Text Classification Using NLTK with Example Code (Beginner Guide)

Introduction

Text classification is one of the most important tasks in Natural Language Processing (NLP). Whether you're building a spam filter, sentiment analyzer, or chatbot, classification helps machines understand human language.

But beginners often struggle with how to actually implement it in Python. In this guide, you'll learn how to build a simple text classification model using NLTK step-by-step with clean code and real output.

Concept Explanation

Text classification is the process of assigning predefined labels to text data.

Common use cases include:

  • Sentiment analysis (positive/negative)
  • Spam detection (spam/not spam)
  • Topic classification
  • Chatbots and recommendation systems

How it works:

  • Convert text into features (words)
  • Train a machine learning model
  • Predict category for new text
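
The three steps above can be sketched in plain Python, with no library at all. This is a simplified illustration with add-one smoothing, not how NLTK implements its classifier. The function names (to_features, train, predict) and the four-sentence dataset are made up for this example.

```python
import math
from collections import Counter, defaultdict

def to_features(text):
    # Step 1: convert text into features (here: the set of lowercase words)
    return set(text.lower().split())

def train(samples):
    # Step 2: count how many samples of each label contain each word
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in samples:
        label_counts[label] += 1
        for word in to_features(text):
            word_counts[label][word] += 1
    return label_counts, word_counts

def predict(model, text):
    # Step 3: score every label and return the most likely one
    label_counts, word_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)  # prior probability of the label
        for word in to_features(text):
            # Add-one smoothing so unseen words never give a zero probability
            score += math.log((word_counts[label][word] + 1) / (count + 2))
        scores[label] = score
    return max(scores, key=scores.get)

samples = [
    ("I love this product", "positive"),
    ("This is amazing", "positive"),
    ("I hate this", "negative"),
    ("This is terrible", "negative"),
]
model = train(samples)
print(predict(model, "I love it"))  # positive
```

The word "love" appears only in positive samples, so it tips the score toward that label. The NLTK program below does the same kind of counting for you.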

Program Code

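# Install NLTK first (run this in a terminal, not inside Python):
#   pip install nltk

import nltk
from nltk.tokenize import word_tokenize
from nltk.classify import NaiveBayesClassifier

# One-time download of the tokenizer data used by word_tokenize.
# Recent NLTK releases look for 'punkt_tab'; older ones use 'punkt'.
nltk.download('punkt')
nltk.download('punkt_tab')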

# Sample dataset
data = [
    ("I love this product", "positive"),
    ("This is amazing", "positive"),
    ("I feel great today", "positive"),
    ("I hate this", "negative"),
    ("This is terrible", "negative"),
    ("I feel bad", "negative")
]

# Feature extraction
def extract_features(sentence):
    words = word_tokenize(sentence.lower())
    return {word: True for word in words}

# Training data
training_data = [(extract_features(text), label) for (text, label) in data]

# Train model
classifier = NaiveBayesClassifier.train(training_data)

# Test sentence
test_sentence = "I love this"
features = extract_features(test_sentence)

# Prediction
result = classifier.classify(features)

print("Sentence:", test_sentence)
print("Predicted Sentiment:", result)

Sample Output

Sentence: I love this
Predicted Sentiment: positive
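
Besides a single label, the classifier can also report how confident it is. The sketch below uses prob_classify and show_most_informative_features, both methods of NLTK's NaiveBayesClassifier. To stay self-contained it retrains on the same six sentences and, as a simplifying assumption, splits on whitespace with str.split() instead of word_tokenize, so no tokenizer download is needed.

```python
from nltk.classify import NaiveBayesClassifier

data = [
    ("I love this product", "positive"),
    ("This is amazing", "positive"),
    ("I feel great today", "positive"),
    ("I hate this", "negative"),
    ("This is terrible", "negative"),
    ("I feel bad", "negative"),
]

def extract_features(sentence):
    # Assumption: a whitespace split stands in for word_tokenize here
    return {word: True for word in sentence.lower().split()}

classifier = NaiveBayesClassifier.train(
    [(extract_features(text), label) for text, label in data]
)

# Probability distribution over labels instead of one hard answer
dist = classifier.prob_classify(extract_features("I love this"))
for label in sorted(dist.samples()):
    print(f"{label}: {dist.prob(label):.2f}")

# Print the words that most strongly separate the two labels
classifier.show_most_informative_features(5)
```

With a dataset this small the probabilities are only a rough signal, but on larger data they help you spot predictions the model is unsure about.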

Step-by-Step Explanation

  • A small dataset of labeled sentences is created
  • Each sentence is lowercased, tokenized, and turned into a dictionary of word features
  • A Naive Bayes classifier is trained on those features
  • A new sentence is classified based on the learned patterns

Time Complexity

The approximate cost of each phase is:

  • Training: O(n × m), where n = number of samples and m = average words per sentence
  • Prediction: O(m × k), where k = number of labels (each word is scored once per label)

Real-World Applications

  • Email spam filtering
  • Sentiment analysis on social media
  • Customer feedback analysis
  • News categorization

Common Mistakes

  • Using very small datasets
  • Not cleaning text data
  • Not handling stop words (very common words such as "the" or "is")
  • Not testing with new data
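
As a rough sketch of text cleaning and stop-word handling, the feature extractor below strips punctuation and drops a few very common words. The stop-word list is a tiny hand-picked set for illustration only, and clean_features is a hypothetical name. NLTK ships a fuller English list in nltk.corpus.stopwords, which needs its own one-time download.

```python
import re

# Assumption: a tiny hand-picked stop-word list, for illustration only
STOP_WORDS = {"a", "an", "the", "is", "this", "i", "it", "to", "and"}

def clean_features(sentence):
    # Lowercase and keep only alphabetic tokens (drops punctuation and digits)
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())
    # Drop stop words so only content-bearing words remain as features
    return {word: True for word in words if word not in STOP_WORDS}

print(clean_features("This is AMAZING!!!"))    # {'amazing': True}
print(clean_features("I hate this, really."))  # {'hate': True, 'really': True}
```

Negation words such as "not" are deliberately left out of the list, because removing them can flip the meaning of a sentence in sentiment tasks.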

Best Practices

  • Use larger datasets for better accuracy
  • Apply preprocessing (stopword removal, stemming)
  • Use cross-validation
  • Try advanced models after basics
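
Cross-validation can be sketched in a few lines of plain Python. The helper below is a minimal version, not a library function. The names k_fold_accuracy and train_majority are hypothetical, and the majority-label "model" is only a placeholder so the example runs without NLTK.

```python
import random

def k_fold_accuracy(samples, train_fn, k=3, seed=0):
    """Average accuracy over k folds.

    train_fn takes a list of (features, label) pairs and returns a
    function that maps features to a predicted label.
    """
    samples = list(samples)
    if len(samples) < k:
        raise ValueError("need at least k samples")
    random.Random(seed).shuffle(samples)
    # Deal the shuffled samples into k folds of near-equal size
    folds = [samples[i::k] for i in range(k)]
    scores = []
    for i, test_fold in enumerate(folds):
        # Train on every fold except the one held out for testing
        train_set = [s for j, fold in enumerate(folds) if j != i for s in fold]
        predict = train_fn(train_set)
        correct = sum(predict(feats) == label for feats, label in test_fold)
        scores.append(correct / len(test_fold))
    return sum(scores) / k

def train_majority(train_set):
    # Placeholder model (assumption): always predicts the most common label
    labels = [label for _, label in train_set]
    majority = max(set(labels), key=labels.count)
    return lambda feats: majority

toy = [({"love": True}, "positive")] * 4 + [({"hate": True}, "negative")] * 2
print(k_fold_accuracy(toy, train_majority, k=3))
```

To evaluate the NLTK classifier instead, you could pass a function such as lambda train_set: NaiveBayesClassifier.train(train_set).classify as train_fn.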

Frequently Asked Questions

1. What is NLTK used for?
NLTK is a Python library used for Natural Language Processing tasks like tokenization, classification, and sentiment analysis.

2. Which algorithm is used here?
A Naive Bayes classifier (NLTK's NaiveBayesClassifier), which is simple and fast for text classification.

3. Can I use this for real projects?
Yes, but you should train it with larger datasets for better performance.

4. What is tokenization?
It is the process of splitting text into words or tokens.
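
To make that concrete, here is a small sketch comparing two simple ways to tokenize a sentence in plain Python. The regular expression is an illustrative assumption, not what NLTK uses internally.

```python
import re

text = "I don't like this, at all!"

# Naive approach: split on whitespace (punctuation stays glued to words)
print(text.split())
# ['I', "don't", 'like', 'this,', 'at', 'all!']

# Regex approach: words (with an optional apostrophe part) or single punctuation marks
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
print(tokens)
# ['I', "don't", 'like', 'this', ',', 'at', 'all', '!']
```

NLTK's word_tokenize goes further than either approach. For example, it splits contractions such as "don't" into separate tokens.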

Conclusion

Text classification using NLTK is one of the easiest ways to get started with NLP. With just a few lines of Python code, you can build a working sentiment analyzer.

Once you're comfortable with this, you can move to advanced models like Logistic Regression, SVM, or deep learning for better accuracy and real-world applications.
