
Sentiment Analysis Model

Last Updated on December 26, 2025 by KnownSense

Before starting with the Sentiment Analysis Model, make sure to read NLP Architectures – Text Classification. In that article, we explored the key steps of a sentiment analysis model, from tokenization and embeddings to LSTM processing and loss optimization, and built a clear picture of how text classification and sentiment analysis work. Each stage plays a crucial role in transforming raw text into meaningful predictions, enabling models to effectively interpret and evaluate sentiment.
In this article, we get hands-on and work through each step with real code and examples.

NLP-related Python libraries

Before building a text classification or sentiment analysis model, we need to install a few essential NLP libraries. These libraries help us load datasets, convert text into numerical form, and represent words as meaningful vectors.
We begin by installing datasets, tokenizers, and gensim:

  • datasets is used to easily download and manage popular NLP datasets.
  • tokenizers helps convert raw text into tokens and numerical IDs that models can process.
  • gensim provides tools for creating word embeddings that capture semantic meaning.

With these libraries installed, we are ready to move on to loading data and preparing text for model training.

!pip3 install datasets tokenizers gensim

Importing the required libraries and classes

from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace
from gensim.models import KeyedVectors
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
import torch.optim as optim

Tokenizer

Loading the Dataset

This line loads the IMDb movie reviews dataset using the Hugging Face datasets library. Each entry contains text (the movie review) and a label (sentiment → 0 = negative, 1 = positive). We will use this dataset to train the sentiment classification model.

dataset = load_dataset("imdb", split="train")

Creating the Tokenizer

tokenizer = Tokenizer(WordLevel(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

Here, we create a word-level tokenizer. The tokenizer assigns each unique word a token and uses <unk> (unknown token) when it encounters a word it has never seen before. The pre-tokenizer splits text based on spaces. This approach allows the model to handle unseen words gracefully.
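To make the <unk> behavior concrete, here is a minimal pure-Python sketch. The vocab dictionary and encode helper are hypothetical stand-ins for what the trained tokenizer does internally:

```python
# Hypothetical toy vocab mimicking what a trained word-level tokenizer produces
vocab = {"<unk>": 0, "<pad>": 1, "this": 2, "movie": 3, "rocks": 4}

def encode(text):
    # Unseen words fall back to the <unk> ID, mirroring unk_token="<unk>"
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]

print(encode("this movie stinks"))  # [2, 3, 0] -- "stinks" maps to <unk>
```

Any word outside the vocabulary simply maps to the <unk> ID instead of crashing the pipeline, which is exactly why the unknown token exists.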

Defining the Trainer

The trainer is responsible for building the tokenizer’s vocabulary.

  • <unk> → represents unknown words
  • <pad> → used to pad sentences so that all inputs have the same length

Padding is required because neural networks expect fixed-size inputs.

trainer = WordLevelTrainer(special_tokens=["<unk>", "<pad>"])
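To illustrate why padding is needed, here is a minimal sketch with made-up token IDs (we assume, for illustration, that "<pad>" was assigned ID 1):

```python
# Minimal padding sketch with hypothetical token IDs; assume "<pad>" has ID 1
PAD_ID = 1
sentences = [[5, 12, 7], [9, 3]]  # two tokenized sentences of unequal length

max_len = max(len(s) for s in sentences)
# Append PAD_ID until every sentence reaches the same length
padded = [s + [PAD_ID] * (max_len - len(s)) for s in sentences]
print(padded)  # [[5, 12, 7], [9, 3, 1]]
```

After padding, every sequence in a batch has the same length, so the batch can be stacked into one rectangular tensor.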

Training the Tokenizer Vocabulary and Enabling Padding

This step trains the tokenizer using the IMDb review text. It scans all reviews, collects unique words, and assigns each word a numerical ID. After training, the tokenizer can convert raw text into token IDs. We also enable automatic padding—shorter sentences get padded with <pad> to ensure all sequences have equal length, which is essential for batch processing in neural networks.

tokenizer.train_from_iterator(dataset["text"], trainer)
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("<pad>"), pad_token="<pad>"
)

Loading Pre-trained Word Embeddings

# GoogleNews-vectors-negative300.bin.gz Download Link. Replace the value with the file path on your system.
import os

word2vec_model_path = os.path.expanduser(
    "~/Downloads/GoogleNews-vectors-negative300.bin.gz"
)
word2vec_model = KeyedVectors.load_word2vec_format(
    word2vec_model_path, binary=True
)

This line loads the Word2Vec model using Gensim.

  • binary=True indicates the file is in binary format
  • KeyedVectors allows efficient lookup of word embeddings

These embeddings provide semantic meaning to words before feeding them into the LSTM.

At first glance, it may seem redundant to use pre-trained word embeddings when we are already creating our own tokenizer and training its vocabulary on our dataset. But training a tokenizer does not mean the model understands word meaning: tokenizers only assign IDs to words, while pre-trained embeddings provide semantic relationships learned from large corpora.
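The difference can be seen with a toy example. The IDs and 3-dimensional vectors below are invented for illustration; real Word2Vec vectors have 300 dimensions:

```python
import math

# Hypothetical IDs and toy 3-dim vectors contrasting tokenization with embeddings
ids = {"good": 7, "great": 812}  # the IDs alone carry no similarity signal
vectors = {"good": [0.9, 0.1, 0.2], "great": [0.85, 0.15, 0.25]}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Nearby vectors expose the semantic closeness that the raw IDs cannot
print(round(cosine(vectors["good"], vectors["great"]), 3))
```

The IDs 7 and 812 say nothing about how related "good" and "great" are, while their embedding vectors sit close together in vector space.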

Embedding weight matrix

This step aligns the tokenizer’s vocabulary with pre-trained Word2Vec embeddings by constructing an embedding matrix. Words found in Word2Vec receive rich semantic vectors, while unseen words are initialized randomly and learned during training, allowing the model to balance prior knowledge with task-specific learning.

vocab = tokenizer.get_vocab()
itos = {index: word for word, index in vocab.items()}

embedding_dim = 300
vocab_size = len(vocab)

weight_matrix = torch.zeros(vocab_size, embedding_dim)

for index, word in itos.items():
    if word in word2vec_model:
        weight_matrix[index] = torch.from_numpy(word2vec_model[word].copy())
    else:
        # Random vector for words not in Word2Vec
        weight_matrix[index] = torch.randn(embedding_dim)

Defining the Sentiment Classification Model

class SentimentModel(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, output_dim, vocab_size, weight_matrix):
        super().__init__()
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim) #embedding lookup table
        self.embedding.weight.data.copy_(weight_matrix)
        self.embedding.weight.requires_grad = True  # trainable embeddings

        # LSTM layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

        # Fully connected output
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        # text: tensor of token IDs, shape [batch_size, seq_len]
        embedded = self.embedding(text)           # [batch_size, seq_len, embedding_dim]
        lstm_out, (hidden, cell) = self.lstm(embedded)  # lstm_out: [batch_size, seq_len, hidden_dim]
        out = self.fc(hidden[-1])                 # Take last hidden state
        return out

Embedding layer: This creates an embedding lookup table where each token ID maps to a dense vector, initialized using pre-trained Word2Vec vectors. We set requires_grad = True to allow the embeddings to be fine-tuned—the model can adjust word meanings slightly for the sentiment task. This approach combines general language knowledge from Word2Vec with task-specific learning from the IMDb sentiment dataset.
LSTM layer: LSTM reads words in order, maintains a memory of important context, and handles long sentences better than vanilla RNNs. Setting batch_first=True means the input shape is [batch_size, sequence_length, embedding_dim].
Fully connected output: This layer converts the LSTM’s final hidden state into a single sentiment score (a logit).

The forward method defines how data flows through the model. It takes text—a tensor of token IDs with shape [batch_size, seq_len]—and passes it through the embedding layer (self.embedding(text)), which converts token IDs into word vectors of shape [batch_size, seq_len, embedding_dim]. The LSTM then processes these embeddings, returning lstm_out (outputs for all time steps), hidden (final hidden states), and cell (final cell states). We use hidden[-1] (the last layer’s final hidden state) because it summarizes the entire sentence. Finally, we pass this through the linear layer (self.fc) to produce the sentiment logit.
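The shape transformations can be checked in isolation with toy sizes (vocab of 10, embedding dim 8, hidden dim 6 are all made-up values for this sketch):

```python
import torch
import torch.nn as nn

# Toy layer sizes (hypothetical): vocab 10, embedding dim 8, hidden dim 6
embedding = nn.Embedding(10, 8)
lstm = nn.LSTM(8, 6, batch_first=True)
fc = nn.Linear(6, 1)

text = torch.randint(0, 10, (2, 5))        # [batch_size=2, seq_len=5]
embedded = embedding(text)                 # [2, 5, 8]
lstm_out, (hidden, cell) = lstm(embedded)  # lstm_out: [2, 5, 6]; hidden: [1, 2, 6]
logit = fc(hidden[-1])                     # [2, 1] -- one score per review
print(embedded.shape, lstm_out.shape, logit.shape)
```

Note that hidden has shape [num_layers, batch_size, hidden_dim], so hidden[-1] selects the final layer’s summary for every sequence in the batch.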

Batch Preparation and Padding

The collate_batch function converts raw text samples into padded tensor batches suitable for neural network training. It tokenizes text, converts tokens to tensors, applies padding to handle variable-length sequences, and groups labels into a single tensor, ensuring consistent input shapes for efficient batch processing.

def collate_batch(batch):
    text_list, label_list = [], []

    for text, label in batch:
        token_ids = tokenizer.encode(text).ids  # convert raw text into token IDs
        text_list.append(torch.tensor(token_ids, dtype=torch.int64))  # token IDs as a tensor
        label_list.append(label)

    text_tensor = pad_sequence(
        text_list,
        batch_first=True,
        padding_value=vocab["<pad>"]
    )

    label_tensor = torch.tensor(label_list, dtype=torch.float32) #Creating Label Tensor
    return text_tensor, label_tensor

train_dataset = list(zip(dataset["text"], dataset["label"]))
# turn the dataset into shuffled batches for training
train_loader = DataLoader(
    train_dataset, batch_size=8, shuffle=True, collate_fn=collate_batch
)

Loss Function, backward() & Optimizer

model = SentimentModel(
    embedding_dim=300,
    hidden_dim=256,
    output_dim=1,
    vocab_size=len(vocab),
    weight_matrix=weight_matrix
)

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

We create our sentiment model with 300-dimensional word embeddings (each word becomes a vector of 300 numbers), 256 hidden units (the LSTM’s memory size), and a single output for positive/negative classification. We load pre-trained Word2Vec weights so the model starts with meaningful word representations. The Adam optimizer handles training by adjusting all model weights, and BCEWithLogitsLoss measures how wrong the predictions are—it converts raw scores to probabilities and calculates the error in one step.
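The "one step" claim about BCEWithLogitsLoss can be verified directly. The logits and labels below are made-up values for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical logits and labels to show what BCEWithLogitsLoss computes
logits = torch.tensor([2.0, -1.0])
labels = torch.tensor([1.0, 0.0])

combined = nn.BCEWithLogitsLoss()(logits, labels)
# Equivalent to sigmoid followed by binary cross-entropy, in one stabler step
manual = nn.BCELoss()(torch.sigmoid(logits), labels)
print(torch.isclose(combined, manual))  # tensor(True)
```

Fusing the sigmoid into the loss avoids the numerical instability of computing log(sigmoid(x)) for large negative logits, which is why the model outputs raw logits rather than probabilities during training.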

Training Loop: Optimizing the LSTM Sentiment Model

num_epochs = 3

for epoch in range(num_epochs):
    total_loss = 0
    for texts, labels in train_loader:
        optimizer.zero_grad()

        predictions = model(texts).squeeze(1)
        loss = criterion(predictions, labels)

        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(
        f"Epoch {epoch + 1}, "
        f"Average Loss: {total_loss / len(train_loader)}"
    )

We set num_epochs = 3, meaning the model will see the full dataset three times. For each epoch, we loop through batches of text and labels from train_loader. Before each forward pass, we call optimizer.zero_grad() to clear previous gradients. We then pass the text through the model to get predictions, compute the loss by comparing predicted logits with true labels, and call loss.backward() to calculate gradients for all parameters. Finally, optimizer.step() updates the weights using Adam. We track the total loss and print the average loss per epoch to monitor training progress.
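The zero_grad → forward → backward → step cycle described above can be seen on a toy linear model with made-up data (this sketch is illustrative, not part of the sentiment model):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# One training step on a toy linear model (hypothetical data, seeded for repeatability)
torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

x = torch.randn(4, 3)
y = torch.tensor([1.0, 0.0, 1.0, 0.0])

optimizer.zero_grad()                     # clear gradients from any prior step
loss = criterion(model(x).squeeze(1), y)  # forward pass + loss
loss.backward()                           # compute gradients for all parameters
weights_before = model.weight.detach().clone()
optimizer.step()                          # Adam updates the weights
print(torch.equal(weights_before, model.weight.detach()))  # False -- weights moved
```

The full training loop simply repeats this cycle over every batch in every epoch.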

Inference: Predicting Sentiment for New Sentences

def predict_sentiment(model, sentence):
    model.eval()
    with torch.no_grad():
        token_ids = tokenizer.encode(sentence).ids
        tensor = torch.tensor(token_ids).unsqueeze(0)
        logits = model(tensor)
        prob = torch.sigmoid(logits).item()
        return "Positive" if prob >= 0.5 else "Negative"

The predict_sentiment function takes a trained model and a raw sentence as input. First, we call model.eval() to set the model to evaluation mode, which disables dropout and other training-only behaviors. We wrap the inference in torch.no_grad() to disable gradient computation, saving memory and speeding up inference (Using trained model to make predictions on new, unseen data). The function tokenizes the input sentence into token IDs and adds a batch dimension with unsqueeze(0) (since PyTorch expects batches). We pass the token IDs through the model to get a logit, apply sigmoid to convert it to a probability, and return “Positive” if the probability is ≥ 0.5, otherwise “Negative”.
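The final sigmoid-and-threshold step can be isolated in a few lines of pure Python (the logit values below are hypothetical):

```python
import math

# The thresholding step of predict_sentiment in isolation (hypothetical logits)
def label_from_logit(logit, threshold=0.5):
    prob = 1 / (1 + math.exp(-logit))  # sigmoid maps the logit into (0, 1)
    return "Positive" if prob >= threshold else "Negative"

print(label_from_logit(1.2))   # Positive (sigmoid(1.2) is about 0.77)
print(label_from_logit(-0.7))  # Negative (sigmoid(-0.7) is about 0.33)
```

Positive logits map to probabilities above 0.5 and negative logits to probabilities below it, so the sign of the logit already determines the predicted class at the default threshold.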

Usage Examples

Now it’s time to test the model.

predict_sentiment(model, "Worst movie I have ever seen.")

Response: Negative

predict_sentiment(model, "This movie is amazing!")

Response: Positive

But our model fails on more complex sentences:

predict_sentiment(model, "I don't think this is a good movie.")

Response: Positive

The model predicts “Positive” because a basic LSTM sentiment model relies on learned word patterns rather than true grammatical understanding. Strong positive words like “good” and “movie” carry more weight in the embeddings and often dominate the prediction, while negation phrases such as “don’t think” are harder for the model to interpret, especially if they appear less frequently in the training data. Tokenization can further weaken negation by splitting contractions like “don’t,” causing the negative signal to be lost. As a result, the overall score may cross the 0.5 threshold and be labeled Positive, even though a human would clearly interpret the sentence as negative.

Possible solutions include improving the model’s ability to understand negation and context. This can be done by adding more negation-heavy examples (such as “not good” or “don’t like this movie”) to the training data so the model learns these patterns. Preprocessing text to explicitly preserve negation (for example, converting “not good” to “not_good”) can also help. Using more advanced architectures like BiLSTM with attention or transformer-based models (BERT, RoBERTa) significantly improves handling of context and negation.
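One of the preprocessing ideas above, explicitly preserving negation, can be sketched in a few lines. The negator word list and the "mark only the next word" rule are deliberately simplistic and illustrative:

```python
# Hedged sketch of negation-marking preprocessing (word list and rule are illustrative)
NEGATORS = {"not", "don't", "never", "no"}

def mark_negation(text):
    out, negate = [], False
    for word in text.lower().split():
        if word in NEGATORS:
            out.append(word)
            negate = True        # flag the next word as negated
        elif negate:
            out.append("not_" + word)
            negate = False
        else:
            out.append(word)
    return " ".join(out)

print(mark_negation("I don't think this is good"))
# i don't not_think this is good
```

After this transformation, "not_good" becomes its own vocabulary entry with its own embedding, so the negated form can learn a sentiment distinct from "good". Real systems extend the rule to mark every word up to the next punctuation mark rather than only the immediately following one.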

You can find the complete code on our GitHub: Sentiment_Analysis.ipynb

Conclusion

In this article, we built a complete sentiment analysis model from scratch. We learned how to tokenize text, load pre-trained Word2Vec embeddings, and create an LSTM-based classifier using PyTorch. We walked through the training loop—forward pass, loss computation, backpropagation, and weight updates—and used the trained model to predict sentiment on new sentences. We also observed that basic LSTM models can struggle with negation and context, which can be addressed using techniques like negation-aware preprocessing or advanced architectures such as BiLSTM with attention or transformers.
