Perhaps the most beautiful idea in modern machine learning is that meaning can be captured in geometry. That abstract concepts like words, images, or even entire documents can be represented as points in space, arranged so that similar things are near each other and different things are far apart.
This idea is called embedding, and it's fundamental to nearly everything impressive that modern AI does. When a search engine understands that your query for "running shoes" should also match "jogging trainers", that's embeddings. When a recommendation system suggests songs similar to ones you love, that's embeddings. When a language model produces coherent text about topics it was never explicitly taught, embeddings are doing the heavy lifting underneath.
Let's build up the intuition from first principles, then implement real embeddings from scratch.
The Problem with Discrete Representations
Consider how a computer naturally represents words. The simplest approach is one-hot encoding: create a vocabulary of all possible words, assign each word a unique index, and represent each word as a vector with a 1 in that position and 0s everywhere else.
One-hot encoding

import numpy as np

# A tiny vocabulary
vocabulary = ['cat', 'dog', 'fish', 'bird', 'mammal', 'pet']
vocab_size = len(vocabulary)

def one_hot_encode(word):
    """Convert a word to a one-hot vector."""
    vector = np.zeros(vocab_size)
    if word in vocabulary:
        vector[vocabulary.index(word)] = 1
    return vector

# Encode some words
cat_vector = one_hot_encode('cat')
dog_vector = one_hot_encode('dog')
fish_vector = one_hot_encode('fish')

print("cat:", cat_vector)    # [1. 0. 0. 0. 0. 0.]
print("dog:", dog_vector)    # [0. 1. 0. 0. 0. 0.]
print("fish:", fish_vector)  # [0. 0. 1. 0. 0. 0.]
This representation has a fatal flaw: it treats all words as equally different from all other words. The distance between "cat" and "dog" is exactly the same as the distance between "cat" and "fish". But intuitively, cats and dogs are more similar than cats and fish. They're both mammals, both common pets, both have four legs.
One-hot encoding throws away all of this semantic information. To the computer, words are just arbitrary symbols with no inherent relationships.
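You can see this directly by continuing the snippet above: every pair of distinct one-hot vectors is exactly the same distance apart and has zero similarity.

# Continuing the one-hot example: all distinct words are equally far apart
print(np.linalg.norm(cat_vector - dog_vector))   # 1.414... (sqrt(2))
print(np.linalg.norm(cat_vector - fish_vector))  # 1.414... (sqrt(2))

# And every pair of different words has zero dot product, i.e. zero similarity
print(np.dot(cat_vector, dog_vector))   # 0.0
print(np.dot(cat_vector, fish_vector))  # 0.0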
The Embedding Solution
Embeddings solve this by representing words as dense vectors in a continuous space. Instead of sparse vectors with one meaningful dimension, we use dense vectors where every dimension carries partial information about meaning.
Dense embeddings example

# Hypothetical learned embeddings (3 dimensions for visualisation)
embeddings = {
    'cat':   np.array([0.9, 0.8, 0.2]),   # pet-ness, mammal-ness, aquatic-ness
    'dog':   np.array([0.95, 0.85, 0.1]),
    'fish':  np.array([0.6, 0.0, 0.95]),
    'bird':  np.array([0.7, 0.0, 0.1]),
    'whale': np.array([0.2, 0.9, 0.95]),
    'shark': np.array([0.1, 0.0, 0.98]),
}

def cosine_similarity(a, b):
    """Measure similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Now similarity reflects semantic relationships
print(f"cat-dog similarity: {cosine_similarity(embeddings['cat'], embeddings['dog']):.3f}")
print(f"cat-fish similarity: {cosine_similarity(embeddings['cat'], embeddings['fish']):.3f}")
print(f"fish-shark similarity: {cosine_similarity(embeddings['fish'], embeddings['shark']):.3f}")

# Output:
# cat-dog similarity: 0.996
# cat-fish similarity: 0.532
# fish-shark similarity: 0.895
Now the geometry reflects semantics. Cats and dogs are close together in this space because they share properties. Fish and sharks are close for different reasons. The distance between points encodes meaningful relationships.
But where do these embedding vectors come from? We could hand-craft them as I did above, but that doesn't scale. For a vocabulary of 100,000 words, we need a method to learn these representations automatically from data.
Learning Embeddings: Word2Vec
The breakthrough came in 2013 with Word2Vec, which proposed a simple but powerful idea: words that appear in similar contexts should have similar embeddings. "Cat" and "dog" often appear in similar sentences ("The ___ slept on the couch", "She fed the ___"), so their embeddings should be close.
Word2Vec comes in two flavours. Skip-gram predicts context words given a target word. CBOW (Continuous Bag of Words) predicts a target word given context words. Let's implement a simplified Skip-gram.
Skip-gram implementation

import numpy as np
from collections import Counter

class Word2Vec:
    def __init__(self, embedding_dim=50, window_size=2, learning_rate=0.01):
        """
        Simple Word2Vec Skip-gram implementation.

        Args:
            embedding_dim: Size of word vectors
            window_size: Context window on each side
            learning_rate: Learning rate for SGD
        """
        self.embedding_dim = embedding_dim
        self.window_size = window_size
        self.lr = learning_rate

    def build_vocabulary(self, corpus):
        """Build vocabulary from corpus."""
        # Tokenise and count words
        words = []
        for sentence in corpus:
            words.extend(sentence.lower().split())
        word_counts = Counter(words)
        # Filter rare words
        self.vocabulary = [w for w, c in word_counts.items() if c >= 2]
        self.word_to_idx = {w: i for i, w in enumerate(self.vocabulary)}
        self.idx_to_word = {i: w for w, i in self.word_to_idx.items()}
        self.vocab_size = len(self.vocabulary)
        print(f"Vocabulary size: {self.vocab_size}")

    def initialize_embeddings(self):
        """Initialize embedding matrices."""
        # W_in: embeddings for input (target) words
        # W_out: embeddings for output (context) words
        self.W_in = np.random.randn(self.vocab_size, self.embedding_dim) * 0.01
        self.W_out = np.random.randn(self.embedding_dim, self.vocab_size) * 0.01

    def generate_training_pairs(self, corpus):
        """Generate (target, context) pairs from corpus."""
        pairs = []
        for sentence in corpus:
            words = sentence.lower().split()
            indices = [self.word_to_idx[w] for w in words
                       if w in self.word_to_idx]
            for i, target in enumerate(indices):
                # Context window
                start = max(0, i - self.window_size)
                end = min(len(indices), i + self.window_size + 1)
                for j in range(start, end):
                    if i != j:
                        pairs.append((target, indices[j]))
        return pairs

    def softmax(self, x):
        """Numerically stable softmax."""
        exp_x = np.exp(x - np.max(x))
        return exp_x / exp_x.sum()

    def train_pair(self, target_idx, context_idx):
        """Train on a single (target, context) pair."""
        # Forward pass
        hidden = self.W_in[target_idx]       # (embedding_dim,)
        output = np.dot(hidden, self.W_out)  # (vocab_size,)
        probs = self.softmax(output)
        # Compute loss gradient
        grad_out = probs.copy()
        grad_out[context_idx] -= 1           # Cross-entropy gradient
        # Backpropagate to embeddings
        grad_hidden = np.dot(self.W_out, grad_out)
        # Update weights
        self.W_out -= self.lr * np.outer(hidden, grad_out)
        self.W_in[target_idx] -= self.lr * grad_hidden
        # Return loss for monitoring
        loss = -np.log(probs[context_idx] + 1e-10)
        return loss

    def train(self, corpus, epochs=5):
        """Train embeddings on corpus."""
        self.build_vocabulary(corpus)
        self.initialize_embeddings()
        pairs = self.generate_training_pairs(corpus)
        print(f"Training pairs: {len(pairs)}")
        for epoch in range(epochs):
            np.random.shuffle(pairs)
            total_loss = 0
            for target, context in pairs:
                loss = self.train_pair(target, context)
                total_loss += loss
            avg_loss = total_loss / len(pairs)
            print(f"Epoch {epoch + 1}, Loss: {avg_loss:.4f}")

    def get_embedding(self, word):
        """Get embedding vector for a word."""
        if word not in self.word_to_idx:
            return None
        return self.W_in[self.word_to_idx[word]]

    def most_similar(self, word, n=5):
        """Find most similar words."""
        vec = self.get_embedding(word)
        if vec is None:
            return []
        similarities = []
        for other_word in self.vocabulary:
            if other_word != word:
                other_vec = self.get_embedding(other_word)
                sim = cosine_similarity(vec, other_vec)
                similarities.append((other_word, sim))
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:n]
The training process is elegant in its simplicity. For each word in the corpus, we try to predict its context words. The embedding vectors are adjusted so that words appearing in similar contexts end up with similar embeddings. No explicit knowledge about word meanings is provided; the semantics emerge purely from patterns of co-occurrence. (Production implementations replace the full softmax over the vocabulary with negative sampling or a hierarchical softmax so training scales to large vocabularies, but the core idea is the same.)
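To see the whole pipeline in one place, here is a minimal usage sketch. The toy corpus below is made up for illustration; a corpus this small only demonstrates the mechanics, not meaningful neighbours.

# Illustrative usage on a made-up toy corpus (real training needs far more text)
corpus = [
    "the cat slept on the couch",
    "the dog slept on the rug",
    "she fed the cat every morning",
    "she fed the dog every morning",
    "the dog chased the cat",
]

model = Word2Vec(embedding_dim=10, window_size=2, learning_rate=0.05)
model.train(corpus, epochs=50)

print(model.get_embedding('cat'))      # a 10-dimensional vector
print(model.most_similar('cat', n=3))  # e.g. [('dog', ...), ...] if training went well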
The Magic of Word Arithmetic
The most famous demonstration of word embeddings is vector arithmetic. If the embeddings truly capture semantic relationships, then relationships between concepts should be encoded as directions in the embedding space.
The classic example: the vector from "king" to "queen" should be similar to the vector from "man" to "woman". Both capture the concept of gender transformation. This means we can solve analogies with vector arithmetic:
def analogy(word2vec, a, b, c, n=3):
    """
    Solve: a is to b as c is to ?
    Uses: b - a + c ≈ ?
    """
    vec_a = word2vec.get_embedding(a)
    vec_b = word2vec.get_embedding(b)
    vec_c = word2vec.get_embedding(c)
    # Check each vector explicitly; `None in [...]` breaks on NumPy arrays
    if vec_a is None or vec_b is None or vec_c is None:
        return []
    # Target vector: b - a + c
    target = vec_b - vec_a + vec_c
    # Find closest words (excluding inputs)
    exclude = {a, b, c}
    similarities = []
    for word in word2vec.vocabulary:
        if word not in exclude:
            vec = word2vec.get_embedding(word)
            sim = cosine_similarity(target, vec)
            similarities.append((word, sim))
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:n]
# Examples with pre-trained embeddings:
# analogy(model, 'king', 'queen', 'man') → [('woman', 0.85), ...]
# analogy(model, 'paris', 'france', 'london') → [('england', 0.78), ...]
# analogy(model, 'walked', 'walking', 'swam') → [('swimming', 0.81), ...]
This works because the training process organises the space so that semantic relationships are encoded as geometric relationships. The difference vector between "king" and "queen" points in the "female" direction. Add this direction to "man" and you arrive near "woman".
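A tiny made-up example shows the geometry. The numbers below are hypothetical 2D vectors chosen so the structure is obvious; real embeddings have hundreds of dimensions and the relationships hold only approximately.

# Hypothetical 2D vectors: dimension 0 ≈ "royalty", dimension 1 ≈ "femaleness"
toy = {
    'king':  np.array([0.9, 0.1]),
    'queen': np.array([0.9, 0.9]),
    'man':   np.array([0.1, 0.1]),
    'woman': np.array([0.1, 0.9]),
}

# king → queen and man → woman differ by the same direction
print(toy['queen'] - toy['king'])  # [0.  0.8]
print(toy['woman'] - toy['man'])   # [0.  0.8]

# queen - king + man lands exactly on woman in this toy space
print(toy['queen'] - toy['king'] + toy['man'])  # [0.1 0.9]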
Beyond Words: Document Embeddings
The same principles apply to larger units of text. We can embed entire sentences or documents into vector spaces where similar documents are nearby.
Simple document embedding

def document_embedding_average(word2vec, document):
    """
    Create document embedding by averaging word embeddings.
    Simple but surprisingly effective baseline.
    """
    words = document.lower().split()
    vectors = []
    for word in words:
        vec = word2vec.get_embedding(word)
        if vec is not None:
            vectors.append(vec)
    if not vectors:
        return np.zeros(word2vec.embedding_dim)
    return np.mean(vectors, axis=0)

def document_embedding_tfidf(word2vec, document, idf_weights):
    """
    Create document embedding using TF-IDF weighted average.
    Gives more weight to distinctive words.
    """
    words = document.lower().split()
    word_counts = Counter(words)
    total_words = len(words)
    weighted_sum = np.zeros(word2vec.embedding_dim)
    total_weight = 0
    for word, count in word_counts.items():
        vec = word2vec.get_embedding(word)
        if vec is not None:
            tf = count / total_words
            idf = idf_weights.get(word, 1.0)
            weight = tf * idf
            weighted_sum += weight * vec
            total_weight += weight
    if total_weight == 0:
        return np.zeros(word2vec.embedding_dim)
    return weighted_sum / total_weight

def find_similar_documents(word2vec, query, documents, n=5):
    """Find documents most similar to a query."""
    query_vec = document_embedding_average(word2vec, query)
    similarities = []
    for i, doc in enumerate(documents):
        doc_vec = document_embedding_average(word2vec, doc)
        sim = cosine_similarity(query_vec, doc_vec)
        similarities.append((i, doc, sim))
    similarities.sort(key=lambda x: x[2], reverse=True)
    return similarities[:n]
Averaging word embeddings is a simple baseline, but it works surprisingly well. More sophisticated approaches like Doc2Vec or sentence transformers can capture additional nuances, but the fundamental idea remains: represent documents as points in a semantic space.
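One loose end in the TF-IDF variant above is where idf_weights comes from. A common approach is to compute inverse document frequencies over the collection you're indexing; here is a minimal sketch using the standard smoothed formula (the helper name compute_idf_weights is ours, not part of the code above).

import math

def compute_idf_weights(documents):
    """Compute inverse document frequency weights over a document collection."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each word at most once per document
        doc_freq.update(set(doc.lower().split()))
    # Smoothed IDF: rare words get high weight, ubiquitous words get low weight
    return {word: math.log((1 + n_docs) / (1 + df)) + 1
            for word, df in doc_freq.items()}

# idf = compute_idf_weights(documents)
# doc_vec = document_embedding_tfidf(word2vec, documents[0], idf)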
Image Embeddings
Embeddings aren't limited to text. Convolutional neural networks trained for image classification learn to create embeddings in their intermediate layers. The final classification layer sees a compressed representation of the image that captures its essential visual features.
Using pre-trained image embeddings

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

def create_image_embedder():
    """Create an image embedding model using ResNet50."""
    # Load pre-trained ResNet50
    base_model = ResNet50(weights='imagenet')
    # Remove classification layer, keep embedding layer
    model = Model(
        inputs=base_model.input,
        outputs=base_model.get_layer('avg_pool').output
    )
    return model

def get_image_embedding(model, img_path):
    """Get embedding vector for an image."""
    # Load and preprocess image
    img = image.load_img(img_path, target_size=(224, 224))
    img_array = image.img_to_array(img)
    img_array = np.expand_dims(img_array, axis=0)
    img_array = preprocess_input(img_array)
    # Get embedding (2048-dimensional for ResNet50)
    embedding = model.predict(img_array)
    return embedding.flatten()

def find_similar_images(model, query_path, image_paths, n=5):
    """Find images most similar to a query image."""
    query_emb = get_image_embedding(model, query_path)
    similarities = []
    for path in image_paths:
        emb = get_image_embedding(model, path)
        sim = cosine_similarity(query_emb, emb)
        similarities.append((path, sim))
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:n]
The ResNet50 network was trained to classify images into 1000 categories. But in the process, it learned to create rich representations that capture visual similarity. Two images of dogs will have similar embeddings even if they're different breeds, because the network learned features that distinguish dogs from non-dogs.
Modern Embedding Models
The embedding techniques we've discussed are foundational, but the field has advanced significantly. Modern embedding models like BERT and its descendants create contextual embeddings where the same word gets different vectors depending on its context.
Contextual embeddings with transformers

from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Same word, different contexts
sentences = [
    "I need to bank the money.",
    "The river bank was steep.",
    "I deposited cash at the bank.",
    "Fish swam near the river bank."
]

# Get embeddings
embeddings = model.encode(sentences)

# Compute similarities
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        sim = cosine_similarity(embeddings[i], embeddings[j])
        print(f"Sentences {i} and {j}: {sim:.3f}")

# Output shows financial "bank" sentences are similar to each other
# and river "bank" sentences are similar to each other,
# despite using the same word
This is the power of contextual embeddings. The word "bank" no longer has a single fixed vector. Its representation depends on surrounding words, allowing the model to distinguish between financial institutions and riverbanks.
Practical Applications
Understanding embeddings opens up a world of practical applications. Semantic search goes beyond keyword matching to find documents that are conceptually similar to a query. Recommendation systems use embeddings to find items similar to what users have liked. Clustering algorithms group similar items together, whether they're documents, images, or products.
Building a simple semantic search

import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticSearch:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None

    def index_documents(self, documents):
        """Index a collection of documents."""
        self.documents = documents
        self.embeddings = self.model.encode(documents)
        print(f"Indexed {len(documents)} documents")

    def search(self, query, top_k=5):
        """Search for documents similar to query."""
        query_embedding = self.model.encode([query])[0]
        # Compute cosine similarities against the whole index
        similarities = np.dot(self.embeddings, query_embedding)
        similarities /= (np.linalg.norm(self.embeddings, axis=1) *
                         np.linalg.norm(query_embedding))
        # Get top results
        top_indices = np.argsort(similarities)[::-1][:top_k]
        results = []
        for idx in top_indices:
            results.append({
                'document': self.documents[idx],
                'score': similarities[idx]
            })
        return results

# Usage
search = SemanticSearch()
search.index_documents([
    "Machine learning algorithms can learn from data.",
    "Deep neural networks revolutionised computer vision.",
    "Natural language processing enables chatbots.",
    "Reinforcement learning trains agents through rewards.",
    "The cat sat on the warm windowsill.",
])

results = search.search("How do AI systems learn?")
for r in results:
    print(f"Score: {r['score']:.3f} | {r['document']}")
The Deeper Picture
Embeddings represent a profound shift in how we think about meaning in computation. Rather than defining meaning through explicit rules and relationships, we let meaning emerge from patterns in data. Words that appear together, images that look similar, documents that discuss related topics, all become nearby points in a geometric space.
This geometric view of meaning has philosophical implications. If "king" and "queen" differ by the same vector as "man" and "woman", what does that say about how language encodes concepts? If images of dogs cluster together in embedding space, what has the neural network learned about the essence of "dogness"?
These are questions without easy answers, but they point to something fascinating: machine learning systems are discovering structure in data that we never explicitly taught them. The embeddings they create reveal patterns in language and vision that we use every day but rarely articulate.
Understanding embeddings isn't just about building better search engines or recommendation systems. It's about understanding how meaning can be represented, computed, and compared. In that sense, embeddings are one of the most intellectually rich areas in all of machine learning.