Agglomerative Clustering with Sentence Transformer In Python

Photo by Mel Poole / Unsplash

Agglomerative Clustering is a type of hierarchical clustering algorithm. It starts with each data point as its own cluster and then iteratively merges the two closest clusters until a specified number of clusters or another stopping criterion is reached. The resulting hierarchy can be drawn as a tree-like diagram called a dendrogram, which makes the nesting of clusters easy to inspect. The method is also called "bottom-up" hierarchical clustering because it starts from individual points and builds them up into a hierarchy.
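To make the merge process concrete, here is a minimal sketch using SciPy's hierarchical clustering functions. The toy points and the choice of Ward linkage are assumptions made for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points: two tight groups far apart (illustrative data only).
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])

# Each row of the linkage matrix records one merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster].
merges = linkage(points, method="ward")

# A full tree over n points always contains n - 1 merges.
print(merges.shape)

# Cut the tree so that at most two flat clusters remain.
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)
```

The same `merges` matrix can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree if matplotlib is installed.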

Why Use Agglomerative Clustering?

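One of the main advantages of Agglomerative Clustering is that it does not require the number of clusters to be specified in advance, unlike methods such as k-means. Instead, you can stop merging once the remaining clusters are farther apart than a chosen distance threshold, or build the full tree and decide where to cut it afterwards. This makes it useful for exploring and understanding the structure of a dataset when the number of clusters is not known.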
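As a rough sketch of that idea, scikit-learn's AgglomerativeClustering can be given a distance threshold instead of a cluster count. The synthetic blobs and the threshold of 5.0 are assumptions chosen so the example is easy to follow; real data needs its own threshold.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Three well-separated toy blobs of 20 points each (synthetic data).
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
data = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

# n_clusters=None tells scikit-learn to stop merging based on
# distance_threshold rather than on a fixed number of clusters.
model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="ward")
labels = model.fit_predict(data)

# The number of clusters is discovered from the data, not supplied.
print("Clusters found:", model.n_clusters_)
```

The trade-off is that the threshold becomes the parameter to tune: a smaller value produces more, tighter clusters, and a larger value produces fewer, broader ones.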

Common Use Cases Of Agglomerative Clustering

  • Data Exploration: It can be used to explore and understand the structure of a dataset by identifying patterns and grouping similar data points together.

  • Image Segmentation: It can be used to segment images into different regions by grouping pixels with similar color or texture together.

  • Document Clustering: It can be used to group documents by topic by clustering together documents that have similar content.

  • Market Segmentation: It can be used to segment customers by grouping together those that have similar characteristics or behaviors.

  • Gene Expression Analysis: It can be used to identify patterns in gene expression data by clustering together genes that have similar expression profiles.

What Is Sentence Transformer?

Sentence Transformer (the sentence-transformers library, also known as SBERT) is a Python library with an accompanying collection of pre-trained models, originally developed by the UKP Lab at the Technical University of Darmstadt. It can be used for various natural language processing (NLP) tasks such as semantic text similarity, text classification, and information retrieval. It is based on the transformer architecture, a type of neural network architecture that has been shown to be very effective for a wide range of NLP tasks.

Sentence Transformer is trained on a large corpus of text data and can be fine-tuned on specific tasks using a smaller dataset. It can be used to generate sentence embeddings, which are fixed-length, dense representations of sentences that capture their meaning. These embeddings can be used as input for various NLP tasks such as text classification, information retrieval, and question answering.
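To illustrate what "fixed-length embeddings that capture meaning" buys you, the sketch below compares vectors with cosine similarity. The 4-dimensional vectors are hand-made stand-ins, not real model output (a model such as all-MiniLM-L6-v2 produces 384-dimensional vectors), and the helper name is only for this example.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; higher means more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: every sentence maps to a vector of the same length.
rain_a = np.array([0.9, 0.1, 0.0, 0.2])  # stand-in for "It might rain today"
rain_b = np.array([0.8, 0.2, 0.1, 0.3])  # stand-in for "It will not rain today"
code_c = np.array([0.0, 0.1, 0.9, 0.1])  # stand-in for an unrelated sentence

sim_related = cosine_similarity(rain_a, rain_b)
sim_unrelated = cosine_similarity(rain_a, code_c)

print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

Because every embedding has the same length, any two sentences can be compared this way, which is what lets a clustering algorithm treat them as points in a shared space.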

Sentence Transformer also includes pre-trained models that are fine-tuned on specific tasks such as semantic text similarity, text classification, and information retrieval, which can be used out-of-the-box.

Overall, Sentence Transformer is a powerful tool for NLP tasks, particularly when it comes to understanding the meaning and context of sentences, and offers a lot of flexibility as it can be fine-tuned to different tasks and domains.

Read more about the Python library here.

Basic Example For Implementing Agglomerative Clustering

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# What the code does:
# 1- Load the model (it is downloaded automatically on first use)
# 2- Encode the sentences into embeddings
# 3- Normalize the embeddings to unit length
# 4- Create a clustering model
# 5- Fit the model on the embeddings
# 6- Print the cluster assigned to each sentence
sentences = ["This is an example sentence", "Each sentence is converted",
             "into a single embedding", "It will not rain today", "It might rain today", "Today will be a sunny day", "It will be hot today"]

embedder = SentenceTransformer('all-MiniLM-L6-v2')

embeddings = embedder.encode(sentences)

embeddings = embeddings / \
    np.linalg.norm(embeddings, axis=1, keepdims=True)

clustering_model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.8, compute_full_tree=True, linkage="ward")

cluster_assignment = clustering_model.fit_predict(embeddings)

for sentence, cluster_id in zip(sentences, cluster_assignment):
    print("Sentence:", sentence)
    print("Cluster ID:", cluster_id)
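The loop above prints one sentence per cluster ID, which gets hard to read as the list grows. A small helper like the one below, offered as an optional sketch, collects sentences by cluster instead. The function name and the example cluster IDs are hypothetical; in practice you would pass in `sentences` and `cluster_assignment` from the code above.

```python
from collections import defaultdict

def group_by_cluster(sentences, cluster_ids):
    """Return a dict mapping each cluster ID to the sentences assigned to it."""
    groups = defaultdict(list)
    for sentence, cluster_id in zip(sentences, cluster_ids):
        # int() turns NumPy integer labels into plain Python ints.
        groups[int(cluster_id)].append(sentence)
    return dict(groups)

# Hypothetical assignments, for illustration only.
example_sentences = ["It will not rain today", "It might rain today",
                     "This is an example sentence"]
example_ids = [0, 0, 1]

grouped = group_by_cluster(example_sentences, example_ids)
for cluster_id, members in sorted(grouped.items()):
    print(f"Cluster {cluster_id}: {members}")
```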

Agglomerative Clustering and Sentence Transformer are important tools that an NLP practitioner should have in their arsenal.
Happy Hacking 🤓