In the age of social media, Twitter has emerged as a powerful platform for expressing opinions, sharing news, and shaping public discourse. Leveraging the wealth of data generated by millions of tweets daily, we embark on a journey to develop a sentiment analysis system capable of gauging public sentiment towards various topics, brands, and events. In this article, we explore the methodologies, algorithms, and potential applications of machine learning in Twitter sentiment analysis, offering insights into the dynamic world of social media analytics.
- Data Collection and Preprocessing: Our journey begins with the acquisition of tweet data related to specific topics, keywords, or hashtags of interest. Through meticulous preprocessing steps including text normalization, tokenization, and sentiment labeling, we ensure the quality and relevance of the data for sentiment analysis tasks.
- Feature Extraction and Representation: Next, we extract meaningful features from the tweet text that capture different aspects of sentiment, such as word frequency, sentiment lexicons, and syntactic patterns. Feature extraction techniques such as bag-of-words, TF-IDF (Term Frequency-Inverse Document Frequency), and word embeddings are employed to capture the semantic meaning and context of the tweet content.
- Model Training and Evaluation: With our feature-engineered dataset prepared, we train machine learning models to classify tweets into sentiment categories such as positive, negative, or neutral. Supervised learning algorithms such as logistic regression, Naive Bayes, and support vector machines (SVM) can be explored to predict sentiment labels accurately (a small baseline sketch follows this list). Models are evaluated using metrics such as accuracy, precision, recall, and F1-score to assess their performance and generalization capabilities.
- Real-Time Analysis and Deployment: Once trained and evaluated, our sentiment analysis system is deployed for real-time monitoring and analysis of Twitter data streams. Whether deployed as a standalone application or integrated into social media analytics platforms, the system provides valuable insights into public opinion, brand perception, and emerging trends in real time.
- Applications and Future Directions: By accurately analyzing sentiment on Twitter, our system opens up a wide range of applications across various domains. From marketing and brand management to political analysis and crisis response, the ability to monitor and interpret public sentiment in real time has transformative potential. Future directions for research and development include multimodal sentiment analysis, opinion dynamics modeling, and cross-lingual sentiment analysis to further enhance the capabilities and applicability of sentiment analysis systems.
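Before turning to the deep-learning notebook below, here is a minimal sketch of the classical baseline mentioned above: TF-IDF features fed to a logistic regression classifier, evaluated with accuracy, precision, recall, and F1-score. It assumes a pandas DataFrame df with a cleaned text column and a target label column, matching the dataset used later; the variable names are illustrative only.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Hold out 20% of the tweets for evaluation (df is assumed to exist)
X_train, X_test, y_train, y_test = train_test_split(
    df.text, df.target, test_size=0.2, random_state=42)

# Bag-of-words features weighted by TF-IDF, using unigrams and bigrams
vectorizer = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Simple linear classifier as a baseline
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_tfidf, y_train)

# Accuracy, precision, recall and F1 for each sentiment class
print(classification_report(y_test, clf.predict(X_test_tfidf)))

The notebook below goes further and trains an LSTM on Word2Vec embeddings, but a baseline like this is a useful reference point when evaluating the deep model.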
Download the dataset from the Kaggle repository or from the download link below.
The code below is a Python (.py) export of the notebook; the # %% markers delimit the notebook cells.
# %% [markdown]
# # Twitter Sentiment Analysis
# %%
!pip install gensim --upgrade
!pip install keras --upgrade
!pip install pandas --upgrade
# %%
# DataFrame
import pandas as pd
# Matplot
import matplotlib.pyplot as plt
%matplotlib inline
# Scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import TfidfVectorizer
# Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout, Embedding, Flatten, Conv1D, MaxPooling1D, LSTM
from keras import utils
from keras.callbacks import ReduceLROnPlateau, EarlyStopping
# nltk
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
# Word2vec
import gensim
# Utility
import re
import numpy as np
import os
from collections import Counter
import logging
import time
import pickle
import itertools
# Set log
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# %%
nltk.download('stopwords')
# %% [markdown]
# ### Settings
# %%
# DATASET
DATASET_COLUMNS = ["target", "ids", "date", "flag", "user", "text"]
DATASET_ENCODING = "ISO-8859-1"
TRAIN_SIZE = 0.8
# TEXT CLEANING
TEXT_CLEANING_RE = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"
# WORD2VEC
W2V_SIZE = 300
W2V_WINDOW = 7
W2V_EPOCH = 32
W2V_MIN_COUNT = 10
# KERAS
SEQUENCE_LENGTH = 300
EPOCHS = 8
BATCH_SIZE = 1024
# SENTIMENT
POSITIVE = "POSITIVE"
NEGATIVE = "NEGATIVE"
NEUTRAL = "NEUTRAL"
SENTIMENT_THRESHOLDS = (0.4, 0.7)
# EXPORT
KERAS_MODEL = "model.h5"
WORD2VEC_MODEL = "model.w2v"
TOKENIZER_MODEL = "tokenizer.pkl"
ENCODER_MODEL = "encoder.pkl"
# %% [markdown]
# ### Read Dataset
# %% [markdown]
# ### Dataset details
# * **target**: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
# * **ids**: the id of the tweet (e.g. 2087)
# * **date**: the date of the tweet (e.g. Sat May 16 23:58:44 UTC 2009)
# * **flag**: the query used to collect the tweet (e.g. lyx); NO_QUERY if there was no query
# * **user**: the user that tweeted (e.g. robotickilldozr)
# * **text**: the text of the tweet (e.g. Lyx is cool)
# %%
dataset_filename = os.listdir("../input")[0]
dataset_path = os.path.join("..","input",dataset_filename)
print("Open file:", dataset_path)
df = pd.read_csv(dataset_path, encoding=DATASET_ENCODING, names=DATASET_COLUMNS)
# %%
print("Dataset size:", len(df))
# %%
df.head(5)
# %% [markdown]
# ### Map target label to String
# * **0** -> **NEGATIVE**
# * **2** -> **NEUTRAL**
# * **4** -> **POSITIVE**
# %%
decode_map = {0: "NEGATIVE", 2: "NEUTRAL", 4: "POSITIVE"}
def decode_sentiment(label):
    return decode_map[int(label)]
# %%
%%time
df.target = df.target.apply(lambda x: decode_sentiment(x))
# %%
target_cnt = Counter(df.target)
plt.figure(figsize=(16,8))
plt.bar(target_cnt.keys(), target_cnt.values())
plt.title("Dataset labels distribuition")
# %% [markdown]
# ### Pre-Process dataset
# %%
stop_words = stopwords.words("english")
stemmer = SnowballStemmer("english")
# %%
def preprocess(text, stem=False):
    # Remove links, user mentions and special characters
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
    tokens = []
    for token in text.split():
        if token not in stop_words:
            if stem:
                tokens.append(stemmer.stem(token))
            else:
                tokens.append(token)
    return " ".join(tokens)
# %%
%%time
df.text = df.text.apply(lambda x: preprocess(x))
# %% [markdown]
# ### Split train and test
# %%
df_train, df_test = train_test_split(df, test_size=1-TRAIN_SIZE, random_state=42)
print("TRAIN size:", len(df_train))
print("TEST size:", len(df_test))
# %% [markdown]
# ### Word2Vec
# %%
%%time
documents = [_text.split() for _text in df_train.text]
# %%
# Note: in gensim >= 4.0 the `size` parameter was renamed to `vector_size`
w2v_model = gensim.models.word2vec.Word2Vec(size=W2V_SIZE,
                                            window=W2V_WINDOW,
                                            min_count=W2V_MIN_COUNT,
                                            workers=8)
# %%
w2v_model.build_vocab(documents)
# %%
words = w2v_model.wv.vocab.keys()  # gensim >= 4.0: use w2v_model.wv.key_to_index instead
vocab_size = len(words)
print("Vocab size", vocab_size)
# %%
%%time
w2v_model.train(documents, total_examples=len(documents), epochs=W2V_EPOCH)
# %%
w2v_model.wv.most_similar("love")
# %% [markdown]
# ### Tokenize Text
# %%
%%time
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_train.text)
vocab_size = len(tokenizer.word_index) + 1
print("Total words", vocab_size)
# %%
%%time
x_train = pad_sequences(tokenizer.texts_to_sequences(df_train.text), maxlen=SEQUENCE_LENGTH)
x_test = pad_sequences(tokenizer.texts_to_sequences(df_test.text), maxlen=SEQUENCE_LENGTH)
# %% [markdown]
# ### Label Encoder
# %%
labels = df_train.target.unique().tolist()
labels.append(NEUTRAL)
labels
# %%
encoder = LabelEncoder()
encoder.fit(df_train.target.tolist())
y_train = encoder.transform(df_train.target.tolist())
y_test = encoder.transform(df_test.target.tolist())
y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)
print("y_train",y_train.shape)
print("y_test",y_test.shape)
# %%
print("x_train", x_train.shape)
print("y_train", y_train.shape)
print()
print("x_test", x_test.shape)
print("y_test", y_test.shape)
# %%
y_train[:10]
# %% [markdown]
# ### Embedding layer
# %%
embedding_matrix = np.zeros((vocab_size, W2V_SIZE))
for word, i in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[word]
print(embedding_matrix.shape)
# %%
embedding_layer = Embedding(vocab_size, W2V_SIZE, weights=[embedding_matrix], input_length=SEQUENCE_LENGTH, trainable=False)
# %% [markdown]
# ### Build Model
# %%
model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.5))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.summary()
# %% [markdown]
# ### Compile model
# %%
model.compile(loss='binary_crossentropy',
              optimizer="adam",
              metrics=['accuracy'])
# %% [markdown]
# ### Callbacks
# %%
callbacks = [ReduceLROnPlateau(monitor='val_loss', patience=5, cooldown=0),
             EarlyStopping(monitor='val_acc', min_delta=1e-4, patience=5)]
# %% [markdown]
# ### Train
# %%
%%time
history = model.fit(x_train, y_train,
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS,
                    validation_split=0.1,
                    verbose=1,
                    callbacks=callbacks)
# %% [markdown]
# ### Evaluate
# %%
%%time
score = model.evaluate(x_test, y_test, batch_size=BATCH_SIZE)
print()
print("ACCURACY:",score[1])
print("LOSS:",score[0])
# %%
# Note: newer Keras versions log 'accuracy'/'val_accuracy' instead of 'acc'/'val_acc'
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'b', label='Training acc')
plt.plot(epochs, val_acc, 'r', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'b', label='Training loss')
plt.plot(epochs, val_loss, 'r', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
# %% [markdown]
# ### Predict
# %%
def decode_sentiment(score, include_neutral=True):
    if include_neutral:
        label = NEUTRAL
        if score <= SENTIMENT_THRESHOLDS[0]:
            label = NEGATIVE
        elif score >= SENTIMENT_THRESHOLDS[1]:
            label = POSITIVE
        return label
    else:
        return NEGATIVE if score < 0.5 else POSITIVE
# %%
def predict(text, include_neutral=True):
    start_at = time.time()
    # Tokenize text
    x_test = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=SEQUENCE_LENGTH)
    # Predict
    score = model.predict([x_test])[0]
    # Decode sentiment
    label = decode_sentiment(score, include_neutral=include_neutral)
    return {"label": label, "score": float(score),
            "elapsed_time": time.time() - start_at}
# %%
predict("I love the music")
# %%
predict("I hate the rain")
# %%
predict("i don't know what i'm doing")
# %% [markdown]
# ### Confusion Matrix
# %%
%%time
y_pred_1d = []
y_test_1d = list(df_test.target)
scores = model.predict(x_test, verbose=1, batch_size=8000)
y_pred_1d = [decode_sentiment(score, include_neutral=False) for score in scores]
# %%
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Plot the confusion matrix, normalized by the number of true samples in each class.
    """
    cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, fontsize=30)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90, fontsize=22)
    plt.yticks(tick_marks, classes, fontsize=22)
    fmt = '.2f'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label', fontsize=25)
    plt.xlabel('Predicted label', fontsize=25)
# %%
%%time
cnf_matrix = confusion_matrix(y_test_1d, y_pred_1d)
plt.figure(figsize=(12,12))
plot_confusion_matrix(cnf_matrix, classes=df_train.target.unique(), title="Confusion matrix")
plt.show()
# %% [markdown]
# ### Classification Report
# %%
print(classification_report(y_test_1d, y_pred_1d))
# %% [markdown]
# ### Accuracy Score
# %%
accuracy_score(y_test_1d, y_pred_1d)
# %% [markdown]
# ### Save model
# %%
model.save(KERAS_MODEL)
w2v_model.save(WORD2VEC_MODEL)
pickle.dump(tokenizer, open(TOKENIZER_MODEL, "wb"), protocol=0)
pickle.dump(encoder, open(ENCODER_MODEL, "wb"), protocol=0)
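# %% [markdown]
# ### Reload saved artifacts (sketch)
# A minimal sketch, not part of the original pipeline: how the exported model and
# tokenizer might be reloaded elsewhere, e.g. in the real-time scoring scenario
# described at the start of the article. It reuses the EXPORT file names from the
# Settings section; the helper `score_tweet` is a hypothetical name.
# %%
from keras.models import load_model

reloaded_model = load_model(KERAS_MODEL)
with open(TOKENIZER_MODEL, "rb") as f:
    reloaded_tokenizer = pickle.load(f)

def score_tweet(text):
    # Same tokenization and padding as at training time
    seq = pad_sequences(reloaded_tokenizer.texts_to_sequences([text]), maxlen=SEQUENCE_LENGTH)
    return float(reloaded_model.predict(seq)[0])

score_tweet("I love the music")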
Conclusion
In conclusion, the application of machine learning to Twitter sentiment analysis offers exciting opportunities to understand and interpret public opinion in real time. By leveraging text data and advanced algorithms, we can develop systems that accurately classify sentiment, enabling organizations and individuals to gain valuable insights into public opinion and respond effectively to emerging trends and events.
"Stay tuned for future data science projects that will turbocharge your learning journey and take your skills to the next level!"