Features:
1. Advanced NLP Preprocessing:
   - URL, email, and HTML tag removal
   - Tokenization with POS tagging
   - Lemmatization with WordNet POS mapping
   - Stop word removal
   - Feature extraction (text length, spam indicators, special characters, etc.)
2. Multiple ML Models:
   - Random Forest
   - SVM with linear kernel
   - Naive Bayes
   - Logistic Regression
   - Gradient Boosting
   - XGBoost
   - Neural Network (MLP)
3. GUI Features:
   - Modern interface with tabs
   - Email input with sample emails
   - Real-time analysis with feature extraction
   - Model selection dropdown
   - Performance metrics display
   - Visualization capabilities
   - Dataset loading functionality
4. Advanced Features:
   - Spam indicator detection
   - Feature importance visualization
   - Multiple vectorization techniques
   - Cross-validation ready
   - Real-time predictions with probabilities
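As a rough illustration of the cleaning and feature-extraction steps listed above, here is a minimal sketch that uses only the Python standard library. The regular expressions, the function names (clean_text, extract_features), and the SPAM_INDICATORS list are assumptions made for this example and are not taken from spam_classifier.py; tokenization, POS tagging, lemmatization, and stop word removal (the NLTK steps) are left out.

```python
import re
from html import unescape

# Illustrative patterns; the application's own patterns may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
TAG_RE = re.compile(r"<[^>]+>")

# Hypothetical indicator phrases, chosen only for this example.
SPAM_INDICATORS = ("free", "winner", "urgent", "click here", "act now")


def clean_text(text: str) -> str:
    """Strip HTML tags, URLs, and email addresses, then normalize whitespace."""
    text = TAG_RE.sub(" ", text)      # remove tags first so URLs end cleanly
    text = unescape(text)             # turn entities such as &amp; into characters
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def extract_features(text: str) -> dict:
    """Compute a few simple surface features from the raw (uncleaned) text."""
    lowered = text.lower()
    return {
        "text_length": len(text),
        "url_count": len(URL_RE.findall(text)),
        "email_count": len(EMAIL_RE.findall(text)),
        "exclamation_count": text.count("!"),
        "special_char_count": sum(
            1 for c in text if not c.isalnum() and not c.isspace()
        ),
        # Plain substring matching, so "free" also matches "freedom".
        "spam_indicators": [p for p in SPAM_INDICATORS if p in lowered],
    }


sample = "<p>WINNER! Click here: http://example.com/prize or mail win@example.com</p>"
cleaned = clean_text(sample)
features = extract_features(sample)
```

Features are computed on the raw text because cleaning removes the URLs and addresses that the counts rely on.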
How to Use:
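- Install the dependencies and run the application (tkinter is part of the Python standard library and is not installed with pip; on some Linux distributions it comes from a separate system package such as python3-tk):
  pip install nltk scikit-learn pandas numpy matplotlib seaborn xgboost
  python spam_classifier.py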
- Load a dataset (CSV format with 'text' and 'label' columns)
- Train models using the "Train Models" button
- Analyze emails by typing or loading samples
- View the results, which include:
  - Spam/Ham prediction
  - Confidence probabilities
  - Extracted features
  - Spam indicators found
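Outside the GUI, the train-then-analyze flow described above can be approximated with scikit-learn. This is only a sketch: the toy emails, the analyze helper, and the choice of TF-IDF with Logistic Regression are assumptions for illustration, since the source does not show how the application wires its vectorizers and models together.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: 1 = spam, 0 = ham.
texts = [
    "Win a free prize now, click here",
    "Urgent: claim your free reward today",
    "Lowest price guaranteed, act now",
    "Meeting moved to 3pm tomorrow",
    "Please review the attached report",
    "Lunch on Friday with the team?",
]
labels = [1, 1, 1, 0, 0, 0]

# Vectorizer and classifier chained so raw strings can be passed in directly.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)


def analyze(email: str) -> dict:
    """Return the predicted class and both class probabilities for one email."""
    # Classes are sorted, so with 0/1 labels column 0 is ham and column 1 is spam.
    ham_p, spam_p = (float(p) for p in model.predict_proba([email])[0])
    return {
        "prediction": "spam" if spam_p >= 0.5 else "ham",
        "ham_probability": ham_p,
        "spam_probability": spam_p,
    }


result = analyze("Claim your free prize today")
```

With so few training examples the probabilities are not meaningful; the point is the shape of the result, which matches the prediction and confidence values listed above.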
Dataset Format:
The classifier expects a CSV file with:
- text: Email content
- label: 1 for spam, 0 for ham
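Before training, it can help to check that a file actually matches this format. The following sketch uses the standard csv module; the load_dataset function and its error messages are hypothetical and are not part of the application.

```python
import csv
import io


def load_dataset(source):
    """Read rows from an open CSV file and return (texts, labels).

    Raises ValueError if a required column is missing or a label is not 0 or 1.
    """
    reader = csv.DictReader(source)
    missing = {"text", "label"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")

    texts, labels = [], []
    for row_no, row in enumerate(reader, start=1):
        label = (row["label"] or "").strip()
        if label not in {"0", "1"}:
            raise ValueError(f"data row {row_no}: label must be 0 or 1, got {label!r}")
        texts.append(row["text"])
        labels.append(int(label))
    return texts, labels


# In-memory example; with a real file use open(path, newline="", encoding="utf-8").
sample = io.StringIO(
    'text,label\n'
    '"Win a free prize now!",1\n'
    '"Meeting moved to 3pm",0\n'
)
texts, labels = load_dataset(sample)
```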
Customization:
You can easily:
- Add more spam indicators in the AdvancedTextPreprocessor class
- Modify feature extraction parameters
- Add new ML models to the ensemble
- Customize the GUI appearance
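The internals of AdvancedTextPreprocessor are not shown here, so the following is a hypothetical stand-in that illustrates one way to make the indicator list extensible. The class name IndicatorDetector, its DEFAULT_INDICATORS attribute, and the extra_indicators parameter are all assumptions; the real class may store its indicators differently.

```python
class IndicatorDetector:
    """Hypothetical stand-in for the indicator logic in AdvancedTextPreprocessor."""

    # Example defaults only; not the application's actual list.
    DEFAULT_INDICATORS = ("free", "winner", "urgent")

    def __init__(self, extra_indicators=()):
        # Lowercase everything and drop duplicates while keeping the original order.
        combined = (*self.DEFAULT_INDICATORS, *extra_indicators)
        self.indicators = tuple(dict.fromkeys(p.lower() for p in combined))

    def find(self, text: str) -> list:
        """Return the indicator phrases that occur in the text (substring match)."""
        lowered = text.lower()
        return [p for p in self.indicators if p in lowered]


# Add domain-specific phrases without editing the defaults.
detector = IndicatorDetector(extra_indicators=["wire transfer", "Free"])
found = detector.find("URGENT: confirm the wire transfer today")
```

Passing extra phrases through the constructor keeps the defaults intact, which makes it easier to compare results with and without the additions.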
The application covers the full email spam classification workflow: batch training on a labeled dataset and real-time analysis of individual emails.