When dealing with large amounts of text, it is essential to have tools that help computers recognize and evaluate the similarity between documents. One of the most effective methods in this field is cosine similarity, a technique that measures the proximity between two documents by representing them as vectors within a vector space. This representation puts human language into a format that machines can easily process and compare.

The idea behind this concept is that documents can be represented as vectors, where each dimension corresponds to a feature of the text, such as the frequency of a word or its contextual relevance. Calculating the cosine similarity between two vectors then becomes a way to measure how similar two documents are in content and context: it disregards their length and depends only on the direction of the vectors representing them.

Although more complex methods exist for analyzing text similarity, such as neural networks or advanced clustering algorithms, cosine similarity offers a good balance between simplicity and effectiveness for analyzing moderate-sized documents. It is valuable in applications such as recommendation systems, automatic text classification, and semantic search, where quickly understanding the relationship between different documents is crucial.

Below, with a simple example, we will see how to determine the similarity between various documents, starting from the definition of cosine similarity.

Formula

The cosine similarity between two vectors is calculated using the following formula:

\[ \text{cosine similarity} \ (V_x, V_y) = \frac{\sum_{i=1}^{n} V_{x_i} \cdot V_{y_i}}{\sqrt{\sum_{i=1}^{n} (V_{x_i})^2} \times \sqrt{\sum_{i=1}^{n} (V_{y_i})^2}} \]

or in compact form:

\[\text{cosine similarity} \ (V_x, V_y) = \frac{V_x \cdot V_y}{||V_x|| \ ||V_y||}\]

where:

\( V_x \cdot V_y \) is the dot product of the vectors \( V_x \) and \( V_y \).
\( ||V_x|| \) and \( ||V_y|| \) are the norms (lengths) of the vectors \( V_x \) and \( V_y \).

In general, the value of cosine similarity ranges from \( -1 \) to 1. For vectors with non-negative components, such as the word-frequency vectors used here, it ranges from 0 to 1.
A value close to 1 means that the angle between the two vectors is minimal; therefore, the vectors are very similar. Conversely, a value close to 0 means that the angle between the two vectors approaches \( \frac{\pi}{2} \), and therefore, the two vectors have a low degree of similarity.
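The summation form of the formula can be written out directly. The following is a minimal sketch in plain Python, not a reference implementation; the function name `cosine_similarity` and the choice to raise an error for a zero vector (where the formula is undefined) are assumptions made for illustration.

```python
import math

def cosine_similarity(vx, vy):
    """Cosine similarity of two equal-length numeric sequences."""
    if len(vx) != len(vy):
        raise ValueError("vectors must have the same length")
    # Numerator: sum of the element-wise products (dot product)
    dot = sum(a * b for a, b in zip(vx, vy))
    # Denominator: product of the two Euclidean norms
    norm_x = math.sqrt(sum(a * a for a in vx))
    norm_y = math.sqrt(sum(b * b for b in vy))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

# Two vectors pointing in the same direction but with different lengths
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])
# Two vectors that share no non-zero dimension
orthogonal = cosine_similarity([1, 0, 0], [0, 1, 0])

print(round(same_direction, 6), orthogonal)
```

The first pair differs only in length, so its similarity is 1 up to floating-point rounding, while the second pair has no dimension in common and yields 0.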
Example

Let’s consider a scenario in which we aim to evaluate the similarity between various documents. For clarity, we will explore a straightforward example involving three brief sentences whose degrees of similarity we want to determine:

\(x\) = I am fond of reading thriller novels.
\(y\) = I prefer reading thriller novels.
\(z\) = Yesterday, I arrived late.

Upon a closer look, it’s clear that sentences \(x\) and \(y\) have similarities, while sentence \(z\) is unrelated. The initial step in our analysis involves transforming the sentences into vectors by extracting all words and computing their frequencies within the sentences. We then refine the data by eliminating words that contribute little to no meaningful information, such as the preposition "of", the pronoun "I", and the verb "to be". This process is crucial, particularly in large corpora, to ensure that the dataset focuses on the most informative elements for our analysis. Here is the result:
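\[
\begin{array}{c|cccccccc}
 & \text{arrived} & \text{fond} & \text{late} & \text{novels} & \text{prefer} & \text{reading} & \text{thriller} & \text{yesterday} \\
\hline
V_x & 0 & 1 & 0 & 1 & 0 & 1 & 1 & 0 \\
V_y & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 \\
V_z & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 1
\end{array}
\]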
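The preprocessing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not necessarily the exact procedure used to build the table: the stop-word set `STOP_WORDS`, the simple regular-expression tokenizer, and the alphabetical ordering of the vocabulary are choices made here so that the output matches the table.

```python
import re

sentences = {
    "x": "I am fond of reading thriller novels.",
    "y": "I prefer reading thriller novels.",
    "z": "Yesterday, I arrived late.",
}

# Assumed stop-word list: only the words the text says were removed
STOP_WORDS = {"i", "am", "of"}

def tokenize(text):
    """Lowercase the text, keep alphabetic words, drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

tokens = {name: tokenize(text) for name, text in sentences.items()}

# One dimension per distinct remaining word, in alphabetical order
vocabulary = sorted({w for words in tokens.values() for w in words})

# Each vector holds the frequency of every vocabulary word in the sentence
vectors = {
    name: [words.count(term) for term in vocabulary]
    for name, words in tokens.items()
}

print(vocabulary)
for name, vector in vectors.items():
    print(f"V_{name} = {vector}")
```

In a real corpus, a curated stop-word list from a text-processing library would normally replace the three-word set used here.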
The vector representation of the three sentences is as follows: \(V_x = [0, 1, 0, 1, 0, 1, 1, 0]\)
\(V_y = [0, 0, 0, 1, 1, 1, 1, 0]\)
\(V_z = [1, 0, 1, 0, 0, 0, 0, 1]\)

Let us now compute the cosine similarity between vector \(V_x\) and vector \(V_y\), which appear significantly alike upon initial inspection. We start from the compact form of the formula:

\[\text{cosine similarity} \ (V_x, V_y) = \frac{V_x \cdot V_y}{||V_x|| \ ||V_y||}\]

The dot product \(V_x \cdot V_y\) between the vectors \(V_x\) and \(V_y\) is given by:

\[
\begin{aligned}
V_x \cdot V_y &= (0 \times 0) + (1 \times 0) + (0 \times 0) + (1 \times 1) \\
&\quad + (0 \times 1) + (1 \times 1) + (1 \times 1) + (0 \times 0) = 3
\end{aligned}
\]

Next, we calculate the denominator of the formula, \( ||V_x|| \ ||V_y|| \), given by the product of the lengths of the two vectors. We obtain:
\[||V_x|| = \sqrt{0^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2 + 1^2 + 0^2} = \sqrt{4} = 2\]

\[||V_y|| = \sqrt{0^2 + 0^2 + 0^2 + 1^2 + 1^2 + 1^2 + 1^2 + 0^2} = \sqrt{4} = 2\]
Therefore, we obtain a cosine similarity value of: \[\text{cosine similarity} \ (V_x, V_y) = \frac{3}{2 \times 2} = \frac{3}{4} = 0.75\]
For non-negative vectors such as these word-count vectors, cosine similarity ranges between 0 and 1, where 1 indicates maximum similarity. A value of 0.75 therefore suggests significant similarity between the two vectors, indicating that the sentences have similar content.

To find the angle \( \theta \) between the two vectors \(V_x\) and \(V_y\) from the value of the cosine similarity, we apply the arccosine function:

\[ \theta = \arccos\left(\frac{V_x \cdot V_y}{||V_x|| \ ||V_y||}\right) = \arccos(0.75) \approx 41.4^\circ \]
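The same angle calculation can be checked numerically, and extended to the unrelated sentence \(z\). The snippet below is a small sketch using only the standard library; the helper names `cosine_similarity` and `angle_degrees` are hypothetical, and the clamping step is a precaution against floating-point rounding rather than part of the formula.

```python
import math

Vx = [0, 1, 0, 1, 0, 1, 1, 0]
Vy = [0, 0, 0, 1, 1, 1, 1, 0]
Vz = [1, 0, 1, 0, 0, 0, 0, 1]

def cosine_similarity(v1, v2):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

def angle_degrees(v1, v2):
    """Angle between two vectors, in degrees."""
    cos = cosine_similarity(v1, v2)
    # Clamp to [-1, 1] so rounding errors cannot push acos out of its domain
    cos = max(-1.0, min(1.0, cos))
    return math.degrees(math.acos(cos))

angle_xy = angle_degrees(Vx, Vy)
angle_xz = angle_degrees(Vx, Vz)

print(f"Angle between Vx and Vy: {angle_xy:.1f} degrees")
print(f"Angle between Vx and Vz: {angle_xz:.1f} degrees")
```

Because \(V_x\) and \(V_z\) have no word in common, their dot product is 0, so the cosine similarity is 0 and the angle is a right angle, which matches the earlier observation that sentence \(z\) is unrelated.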
In general, as the angle approaches zero, the cosine similarity value increases, indicating greater similarity between the vectors.

Below is an example of Python code for calculating the cosine similarity of vectors \(V_x\) and \(V_y\), which you can test in an online IDE.
```python
import numpy as np

# Define the vectors
Vx = np.array([0, 1, 0, 1, 0, 1, 1, 0])
Vy = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# Function to calculate cosine similarity
def cosine_similarity(vector1, vector2):
    # Calculate the dot product
    dot_product = np.dot(vector1, vector2)
    # Calculate the norms of each vector
    norm1 = np.linalg.norm(vector1)
    norm2 = np.linalg.norm(vector2)
    # Calculate the cosine similarity
    cosine_sim = dot_product / (norm1 * norm2)
    return cosine_sim

# Calculate and print the cosine similarity
similarity = cosine_similarity(Vx, Vy)
print(f"Cosine similarity (Vx,Vy): {similarity}")
```
Because I removed some words from the original example sentences by hand to make the evaluation more precise, I inserted the sentences already reduced to their vectors directly into the code, which returns the cosine similarity value of 0.75.

For a complete example that starts from the raw sentences, you can consider the following code. It weights the terms with TF-IDF and, with its default settings, does not remove words such as "am" and "of", so the result differs: the cosine similarity value between \(V_x\) and \(V_y\) will be approximately 0.48.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "I am fond of reading thriller novels.",  # x
    "I prefer reading thriller novels.",      # y
    "Yesterday, I arrived late."              # z
]

# Initialize a TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the sentences to a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(sentences)

# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

print("Cosine Similarity Matrix:\n", cosine_sim)
```
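To see where a value near 0.48 comes from, the TF-IDF weighting can be reproduced by hand. The sketch below uses only the standard library and assumes the vectorizer's default behavior: lowercasing, tokens of at least two word characters (which drops "I"), no stop-word removal, and the smoothed inverse document frequency \( \ln\frac{1+n}{1+\text{df}} + 1 \). The helper names are hypothetical. Row normalization is omitted because cosine similarity does not depend on vector length.

```python
import math
import re

sentences = [
    "I am fond of reading thriller novels.",  # x
    "I prefer reading thriller novels.",      # y
    "Yesterday, I arrived late.",             # z
]

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(s) for s in sentences]
vocabulary = sorted({w for doc in docs for w in doc})
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency (assumed default formula)."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_vector(doc):
    """Raw term count times IDF for every vocabulary term."""
    return [doc.count(term) * idf(term) for term in vocabulary]

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

tfidf = [tfidf_vector(doc) for doc in docs]
sim_xy = cosine_similarity(tfidf[0], tfidf[1])
sim_xz = cosine_similarity(tfidf[0], tfidf[2])

print(f"Cosine similarity (x, y): {sim_xy:.2f}")
print(f"Cosine similarity (x, z): {sim_xz:.2f}")
```

The value is lower than 0.75 because "am", "fond", "of", and "prefer" each appear in only one sentence and therefore receive a higher weight than the shared words "reading", "thriller", and "novels".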