# Sentiment Classification

https://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html

Download the "Sentiment Polarity Dataset Version 2.0" from http://www.nltk.org/nltk_data/ and extract it into the folder that `moviedir` points to in the next cell.
The dataset zip file is available here: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/movie_reviews.zip
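
The download step can also be scripted. The following is a minimal sketch using only the standard library; `fetch_and_extract` is a hypothetical helper (not part of NLTK or scikit-learn), and it assumes the archive unpacks into a `movie_reviews` folder.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the text above.
DATASET_URL = ("https://raw.githubusercontent.com/nltk/nltk_data/"
               "gh-pages/packages/corpora/movie_reviews.zip")


def fetch_and_extract(url, dest_dir=".", zip_name="movie_reviews.zip"):
    """Download the archive if it is not already present, then unpack it.

    Hypothetical helper; assumes the zip contains a top-level
    'movie_reviews' directory.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / zip_name
    if not zip_path.exists():
        urllib.request.urlretrieve(url, str(zip_path))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest / "movie_reviews"


# Uncomment to download into the current directory:
# fetch_and_extract(DATASET_URL)
```

The download call is left commented out so that the cell has no side effects until it is enabled deliberately.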

In [3]:
import sklearn
from sklearn.datasets import load_files
moviedir = r'./movie_reviews'

# Data loading and preparation
### Load the dataset and inspect its content

In [5]:
movie = load_files(moviedir, shuffle=True)
len(movie.data)

2000

In [6]:
movie.target_names

['neg', 'pos']

In [7]:
movie.data[0]

b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actually so absurd , that they would fit right in with dogma . \nyes , the film is 

In [8]:
movie.target[0]

0
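
The outputs above reflect how `load_files` works: each subfolder name becomes a class, documents are returned as raw bytes unless an `encoding` is given, and `target` holds the index of the class in `target_names`. A small sketch on a throwaway directory (the folder contents here are made up for illustration):

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a tiny corpus: one subfolder per class, one text file per document.
root = Path(tempfile.mkdtemp())
samples = {"neg": ["a dull , sloppy film ."], "pos": ["a sharp , funny film ."]}
for label, texts in samples.items():
    (root / label).mkdir()
    for i, text in enumerate(texts):
        (root / label / f"doc{i}.txt").write_text(text, encoding="utf-8")

# shuffle=False keeps the folder order so the result is easy to read.
toy = load_files(str(root), shuffle=False)
print(toy.target_names)   # class names, taken from the folder names
print(toy.data[0])        # raw bytes of the first document
print(toy.target)         # index into target_names for each document
```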

### Split the data into training and test sets

In [11]:
import nltk
nltk.download('punkt')

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split

docs_train, docs_test, y_train, y_test = train_test_split(movie.data, movie.target, 
                                                          test_size = 0.20, random_state = 12)


[nltk_data] Downloading package punkt to /home/jmag/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
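
The split above is purely random. As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument that keeps the class proportions identical in both parts. A sketch on made-up placeholder data:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder documents and balanced labels, for illustration only.
docs = [f"review {i}" for i in range(100)]
labels = [0] * 50 + [1] * 50

tr_docs, te_docs, tr_y, te_y = train_test_split(
    docs, labels, test_size=0.20, random_state=12, stratify=labels
)

print(len(tr_docs), len(te_docs))  # sizes of the two parts
print(Counter(te_y))               # class counts in the test part
```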


### Build the vocabulary and the document-term count matrix

In [12]:
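# Keep tokens that occur in at least 2 documents, capped at the 3000 most frequent
movieVzer = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, max_features=3000)
docs_train_counts = movieVzer.fit_transform(docs_train)  # learn the vocabulary on the training set only
docs_test_counts = movieVzer.transform(docs_test)        # reuse that vocabulary for the test set
docs_train_counts.shape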

(1600, 3000)
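
To see what `min_df` and `max_features` do, here is a sketch on a three-document toy corpus. It uses the default tokenizer instead of `nltk.word_tokenize` so that it runs without extra downloads; the corpus is invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "great film great cast",
    "boring film",
    "great plot film cast",
]

# min_df=2 drops tokens seen in fewer than 2 documents ("boring", "plot").
# max_features=2 then keeps the 2 most frequent survivors ("great", "film").
vec = CountVectorizer(min_df=2, max_features=2)
counts = vec.fit_transform(corpus)

print(sorted(vec.vocabulary_))  # the retained tokens
print(counts.toarray())         # one row per document, one column per token
```

The columns follow the alphabetical order of the retained tokens, which is why the vocabulary is printed sorted.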

### TF-IDF weighting


In [14]:
movieTfmer = TfidfTransformer()
docs_train_tfidf = movieTfmer.fit_transform(docs_train_counts)
docs_test_tfidf = movieTfmer.transform(docs_test_counts)
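
With its default settings, `TfidfTransformer` multiplies each count by a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and then scales every row to unit Euclidean length. The sketch below reproduces that computation by hand on a small made-up count matrix, using only the standard library:

```python
import math

# Rows are documents, columns are terms (toy counts for illustration).
counts = [[1, 2], [1, 0], [1, 1]]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents in which each term appears.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf, as in the formula above.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_row(row):
    """Weight one row of counts by idf, then L2-normalize it."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]


tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(v, 3) for v in row])
```

A term that appears in every document gets an idf of exactly 1, so it is kept but never boosted.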

# Model Training

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

- Parameters relevant to class imbalance: class_weight.
- Parameters relevant to regularization: penalty, C.
- Parameters relevant to stopping criteria: tol, max_iter.
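
As an illustration of the parameters listed above, the sketch below sets several of them explicitly on a small, deliberately imbalanced toy problem. The data and the chosen values are arbitrary assumptions, not tuned settings for the movie review task.

```python
from sklearn.linear_model import LogisticRegression

# Toy data: six examples of class 0, two of class 1.
X = [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [0.0, 0.7],
     [0.1, 1.0], [0.2, 0.9], [1.0, 0.0], [0.9, 0.1]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

toy_clf = LogisticRegression(
    C=0.5,                    # inverse regularization strength: smaller means stronger
    class_weight="balanced",  # reweight classes inversely to their frequency
    tol=1e-4,                 # stop when the optimizer's progress falls below this
    max_iter=200,             # hard cap on solver iterations
    random_state=0,
).fit(X, y)

print(toy_clf.coef_)      # one weight per feature
print(toy_clf.n_iter_)    # iterations actually used by the solver
print(toy_clf.predict(X))
```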


In [15]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(docs_train_tfidf, y_train)

The learned feature weights are stored in the coef_ attribute (the bias term is in intercept_):

In [18]:
print(clf.coef_)

[[-0.98274511 -0.02516272 -0.07721108 ...  0.09443541  0.07750834
   0.02408974]]
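
The raw array is hard to read on its own. Pairing each weight with its token shows which words push a review toward 'neg' (negative weight) or 'pos' (positive weight). The sketch below uses a made-up vocabulary and made-up weights; with the fitted objects one would presumably use `movieVzer.get_feature_names_out()` and `clf.coef_[0]` in their place.

```python
# Hypothetical tokens and weights, standing in for the fitted vocabulary
# and the single row of coefficients of a binary model.
vocab = ["bad", "boring", "film", "great", "superb"]
weights = [-0.98, -0.45, 0.02, 0.61, 0.77]

# Sort (weight, token) pairs from most negative to most positive.
ranked = sorted(zip(weights, vocab))

most_negative = [tok for _, tok in ranked[:2]]
most_positive = [tok for _, tok in ranked[-2:][::-1]]

print("pushes toward 'neg':", most_negative)
print("pushes toward 'pos':", most_positive)
```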


## Evaluation

https://scikit-learn.org/stable/modules/model_evaluation.html

In [12]:
import numpy as np
from sklearn.metrics import classification_report
predict_train = clf.predict(docs_train_tfidf)
predicted_test = clf.predict(docs_test_tfidf)

### Training results

In [13]:
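target_names = ['neg', 'pos']
# Note: classification_report expects (y_true, y_pred). The arguments here are
# passed as (y_pred, y_true), so in the output below the precision and recall
# columns are swapped and "support" counts predicted labels, not true labels.
# The conventional call is classification_report(y_train, predict_train, ...).
print(classification_report(predict_train, y_train, target_names=target_names))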

              precision    recall  f1-score   support

         neg       0.90      0.91      0.90       785
         pos       0.91      0.90      0.91       815

    accuracy                           0.91      1600
   macro avg       0.91      0.91      0.91      1600
weighted avg       0.91      0.91      0.91      1600



### Test results

In [14]:
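# Note: arguments are passed as (y_pred, y_true); classification_report expects
# (y_true, y_pred), so precision and recall appear swapped in the output below.
print(classification_report(predicted_test, y_test, target_names=target_names))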

              precision    recall  f1-score   support

         neg       0.73      0.81      0.77       186
         pos       0.82      0.74      0.78       214

    accuracy                           0.78       400
   macro avg       0.78      0.78      0.77       400
weighted avg       0.78      0.78      0.78       400
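
`classification_report` takes the true labels first and the predictions second, and the order matters: precision divides the true positives by the number of predicted positives, while recall divides them by the number of actual positives. The sketch below computes both by hand on made-up labels and shows that swapping the two arguments swaps the two metrics. `precision_recall` is a hypothetical helper written for this illustration.

```python
# Made-up labels for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]


def precision_recall(true_labels, pred_labels, label):
    """Precision and recall for one class, computed from raw counts."""
    tp = sum(t == label and p == label for t, p in zip(true_labels, pred_labels))
    predicted = sum(p == label for p in pred_labels)  # predicted as `label`
    actual = sum(t == label for t in true_labels)     # truly `label`
    return tp / predicted, tp / actual


p, r = precision_recall(y_true, y_pred, label=1)
p_swapped, r_swapped = precision_recall(y_pred, y_true, label=1)

print(f"correct order: precision={p:.2f} recall={r:.2f}")
print(f"swapped order: precision={p_swapped:.2f} recall={r_swapped:.2f}")
```

The F1 score and the accuracy are symmetric in the two arguments, so only the precision, recall, and support columns are affected by the order.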

