husniddin's notes

Search, semantic search, and vectors

Let's say you're building a search feature. The simplest approach is keyword matching — check if the user's query words appear in the document. But this breaks down quickly. If someone searches "how to fix a bug" and your document says "debugging techniques," keyword search won't find it.

Inverted index

The classic approach to full-text search is the inverted index. Instead of scanning every document, you build an index that maps each word to the list of documents containing it.

"python"  → [doc1, doc3, doc7]
"search"  → [doc2, doc3, doc5]
"vector"  → [doc3, doc8]

When a user searches for "python search," you find the intersection of those two lists. It's fast, but it only matches exact words.
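The index-and-intersect idea can be sketched in a few lines of Python. The documents and their ids here are made up for illustration:

```python
from collections import defaultdict

# Toy corpus, assumed for illustration.
docs = {
    "doc1": "python tutorial for beginners",
    "doc2": "search engines explained",
    "doc3": "python search with vector indexes",
}

# Build the inverted index: word -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def keyword_search(query):
    """Return the docs containing every query word (set intersection)."""
    word_sets = [index[word] for word in query.split()]
    return set.intersection(*word_sets) if word_sets else set()

print(keyword_search("python search"))  # {'doc3'} — the only doc with both words
```

Real engines add tokenization, stemming, and ranking (e.g. BM25) on top, but the core lookup is exactly this intersection.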

The problem with keywords

Keywords don't understand meaning. "Car" and "automobile" are the same thing but keyword search treats them as completely different. And "bank" could mean a financial institution or a river bank — keywords can't tell the difference from context.

Vector embeddings

This is where semantic search comes in. The idea: represent every piece of text as a vector — a list of numbers that captures its meaning. Similar texts get similar vectors.

"king"  → [0.2, 0.8, 0.1, ...]
"queen" → [0.3, 0.7, 0.1, ...]
"car"   → [0.9, 0.1, 0.6, ...]

These vectors are generated by neural networks (like BERT, OpenAI embeddings, etc.) that have been trained on massive amounts of text. They learn that "king" and "queen" are semantically close, while "car" is far from both.

Cosine similarity

To find similar texts, we measure the angle between their vectors using cosine similarity:

similarity = (A · B) / (|A| × |B|)

A value of 1 means identical direction (same meaning), 0 means orthogonal (no shared meaning), and -1 means opposite. In practice, most semantic searches look for vectors with cosine similarity above some threshold, like 0.7.
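The formula translates directly to code. Using the toy three-dimensional vectors from above (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: (A . B) / (|A| * |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

king  = [0.2, 0.8, 0.1]
queen = [0.3, 0.7, 0.1]
car   = [0.9, 0.1, 0.6]

print(cosine_similarity(king, queen))  # high: similar direction
print(cosine_similarity(king, car))    # low: different direction
```

With these toy values, king vs. queen scores around 0.99 while king vs. car scores around 0.35, matching the intuition that the first pair points in nearly the same direction.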

Putting it together

A semantic search system works like this:

  1. Convert all your documents into vectors and store them
  2. When a user searches, convert their query into a vector
  3. Find the documents whose vectors are closest to the query vector
  4. Return those documents, ranked by similarity

The result: searching for "how to fix a bug" will match documents about "debugging techniques" because they have similar meaning vectors, even though they share no words.
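The four steps above can be sketched end to end. The `embed` function here is a stand-in for a real embedding model — just a hand-made lookup table of toy vectors chosen so that the bug/debugging pair lands close together:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Stand-in for a neural embedding model (BERT, OpenAI, etc.).
# The vectors are invented for this example.
TOY_VECTORS = {
    "how to fix a bug":     [0.90, 0.10, 0.20],
    "debugging techniques": [0.85, 0.15, 0.25],
    "banana bread recipe":  [0.05, 0.90, 0.10],
}

def embed(text):
    return TOY_VECTORS[text]

# Step 1: convert all documents into vectors and store them.
documents = ["debugging techniques", "banana bread recipe"]
doc_vectors = {doc: embed(doc) for doc in documents}

def semantic_search(query, top_k=2):
    # Step 2: convert the query into a vector.
    q = embed(query)
    # Steps 3-4: rank documents by similarity to the query vector.
    ranked = sorted(doc_vectors.items(),
                    key=lambda kv: cosine_similarity(q, kv[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

print(semantic_search("how to fix a bug"))  # "debugging techniques" ranks first
```

A production system swaps the lookup table for a real model and the sorted scan for an approximate nearest-neighbor index, but the shape of the pipeline is the same.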

Libraries like FAISS, Pinecone, or pgvector make this surprisingly easy to implement. The hard part is choosing the right embedding model for your use case.