From 70s Vectors to Modern AI Agents

🧩 AI Agent Architecture: From 70s Vectors to Modern RAG Pipelines

1. Nothing New Under the Sun ☀️

As usual, with my meticulous approach, I decided to peek under the hood and figure out how it all actually works. And there it was... the pipeline! Ta-da!

The ideas of similarity and nearest neighbor search are nothing new. These concepts have been used for decades in search, computer vision, pattern recognition, and research systems long before the public AI boom.

The only "new" thing is the wrapper:

  • Vector DB: Marketed as a "new" type of database (Pinecone and others), though they are essentially just efficient indices for vectors.
  • API and Cloud services: A convenient way to deliver and scale.
  • RAG pipelines and LLMs: Simply a high-level interface for the end user.

Vector representations—or Embeddings—are essentially data mapped as spatial coordinates, much like X, Y, and Z in 3D space. This was already a standard in the 70s, known back then as Vector Space Models. The only real 'novelty' today? We’re doing it in thousands of dimensions and processing massive datasets via an API.


2. What's Under the Hood? (The Pipeline) ⚙️

When we send a prompt, that same pipeline kicks in:

The Tokenizer

Text is sliced into chunks (tokens). A word or symbol becomes an ID. Essentially, it’s a simple data structure (Map) and a slicing algorithm.

  • In open-source: these are tokenizer.json, vocab.json, and merges.txt files.
  • In managed APIs: this is hidden inside the provider's runtime.
const userQuestion = "How does Vector DB work?"

// Tokenization process (simplified):
// input  -> ["How", " does", " Vector", " DB", " work", "?"]
// output -> [2437, 857, 12944, 6212, 990, 30]

Elementary mapping: each token gets its own unique ID.
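
Here is that mapping as a toy sketch. The vocabulary, the IDs, and the naive slicing regex are all made up for illustration; real tokenizers (BPE and friends) learn merge rules from data instead:

// Hypothetical toy vocabulary: token string -> ID (real vocabs hold ~50k+ entries)
const vocab = new Map<string, number>([
  ["How", 2437], [" does", 857], [" Vector", 12944],
  [" DB", 6212], [" work", 990], ["?", 30],
])

// Naive slicing; real tokenizers apply learned merge rules instead
function tokenize(text: string): number[] {
  const pieces = text.match(/ ?[A-Za-z]+|[^A-Za-z\s]/g) ?? []
  return pieces.map(p => vocab.get(p) ?? 0) // 0 = "unknown token"
}

tokenize("How does Vector DB work?") // -> [2437, 857, 12944, 6212, 990, 30]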

Embedding Layer

Tokens enter the model and are replaced by vectors (coordinates in a multi-dimensional space):
2437 -> [0.12, -0.03, 0.88, ...]
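
In code, the embedding layer is just a lookup table: the token ID is a row index, and the row is the learned vector. A toy sketch with random numbers and 4 dimensions (real models use hundreds or thousands):

// Hypothetical embedding matrix: one row per vocabulary entry
const VOCAB_SIZE = 50000
const DIM = 4 // real models: hundreds to thousands of dimensions
const embeddingTable: number[][] = Array.from({ length: VOCAB_SIZE }, () =>
  Array.from({ length: DIM }, () => Math.random() * 2 - 1)
)

// "Embedding" a token is literally indexing a row by its ID
const embed = (tokenId: number): number[] => embeddingTable[tokenId]

embed(2437) // -> e.g. [0.12, -0.03, 0.88, 0.41]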

Attention Mechanism

Here, the model calculates which tokens in the context are related to each other and how strongly. In the question "How does Vector DB work?", the token "work" must "attentively" look at "Vector DB" to understand the meaning. It’s almost like a keyword search for the intent of the question.
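
Under the hood, that "attentive look" boils down to dot products: each token's vector is scored against every other token's vector, and softmax turns the scores into weights that sum to 1. A stripped-down sketch, single head and no learned query/key/value projections (real transformers add those on top):

// Similarity between two token vectors = dot product
const dot = (a: number[], b: number[]) =>
  a.reduce((sum, x, i) => sum + x * b[i], 0)

// Softmax: raw scores -> weights that sum to 1
const softmax = (scores: number[]) => {
  const max = Math.max(...scores)
  const exps = scores.map(s => Math.exp(s - max))
  const total = exps.reduce((a, b) => a + b, 0)
  return exps.map(e => e / total)
}

// How strongly does the "work" vector attend to every other token?
const attentionWeights = (query: number[], keys: number[][]) =>
  softmax(keys.map(k => dot(query, k)))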

Matrix Multiplication

The fundamental math. Vectors are multiplied by the weights of the pre-trained model. Weights act like coefficients in a weighted average: inputs with a higher weight influence the final result more strongly.

It’s important to understand: the model does not learn at the moment of the request. It simply runs the data through a pre-set function:
$$\text{weights} \times \text{vectors} \rightarrow \text{probabilities}$$
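
A worked toy example of that function, with placeholder numbers instead of real model weights; a softmax (as in the attention sketch above) would then squash the resulting scores into probabilities:

// y = W * x: a frozen weight matrix applied to the input vector
function matVec(W: number[][], x: number[]): number[] {
  return W.map(row => row.reduce((sum, w, i) => sum + w * x[i], 0))
}

const W = [
  [0.2, -0.5, 0.1], // hypothetical pre-trained weights (2 outputs x 3 inputs)
  [0.7, 0.3, -0.2],
]
const x = [0.12, -0.03, 0.88] // a token vector from the embedding step

matVec(W, x) // -> [0.127, -0.101]; no learning here, just arithmetic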

Next Token Prediction

At the output, we get probabilities for the next token. We choose one, add it to the context, and start a new cycle (loop):
Context -> Trained Function -> Prediction.
And so on, token by token, until the answer is complete.
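
The same loop as a sketch; sampleNextToken, decode, and the stop-token ID are hypothetical stand-ins for the machinery described above:

declare function sampleNextToken(context: number[]): number // trained function: context -> probabilities -> one token
declare function decode(tokenIds: number[]): string         // token IDs back to text
const END_OF_TEXT = 50256 // special "stop" token; the ID varies by model
const maxTokens = 4096    // the hard context limit

const context = [2437, 857, 12944, 6212, 990, 30] // "How does Vector DB work?"

while (context.length < maxTokens) {
  const next = sampleNextToken(context)
  if (next === END_OF_TEXT) break // the answer is complete
  context.push(next)              // the prediction joins the context for the next pass
}

const answer = decode(context)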

"This reminded me exactly of Andrew Ng's Machine Learning course on Coursera. Parameters went through a function, we calculated the error (cost), and moved toward the minimum. Only here, the prediction is not a class label, but the next token from the vocabulary."

By the way, Andrew Ng's course was released over 12 years ago! And then it really made me think: if these ideas were being taught publicly so long ago, how many closed systems were using them even earlier?

Air defense systems, satellite intelligence, medical expert systems (like MYCIN from the 70s). All of this has been living on these principles for decades!


3. The "Memory" Problem and RAG 🧠

The context window is a hard limit on the number of tokens per request.
The model hasn't "forgotten" anything: when there is too much data, the application simply cuts off the oldest messages on the FIFO (First In, First Out) principle.

// countTokens and contextLimit are app-level concerns, not the model's
while (countTokens(messages) > contextLimit) {
  messages.shift() // Old context is simply dropped from the request
}

To use the model effectively, you need to "feed" it fresh data using the RAG (Retrieval-Augmented Generation) method.

How it works: User question ⮕ Document search ⮕ Found fragments + Question ⮕ Language model ⮕ Answer.
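
A minimal sketch of that chain; embedQuery, vectorDb, and llm are hypothetical stand-ins for whatever embedding model, vector index, and LLM client you actually use:

// Hypothetical clients: any embedding model, vector index, and LLM fit here
declare function embedQuery(text: string): Promise<number[]>
declare const vectorDb: { search(v: number[], opts: { topK: number }): Promise<string[]> }
declare const llm: { complete(prompt: string): Promise<string> }

async function answerWithRag(question: string): Promise<string> {
  // 1. Question -> vector -> nearest document fragments
  const queryVector = await embedQuery(question)
  const fragments = await vectorDb.search(queryVector, { topK: 3 })

  // 2. Found fragments + question glued into one prompt
  const prompt = `Answer using only this context:\n${fragments.join("\n")}\n\nQuestion: ${question}`

  // 3. The model answers from retrieved facts, not its internal "memory"
  return llm.complete(prompt)
}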

Why it matters:

  • Reduces "hallucinations": The model relies on facts from the search rather than its own internal "memory."
  • Relevance: RAG allows the model to use fresh data without expensive retraining or hoping for magic.

4. Who Runs the Process? (Orchestration) 🎮

The model doesn't manage the flow itself. It doesn't decide which chat history to keep or which tools to call. This is handled by the orchestration layer.

When a simple chatbot isn't enough, we use tools like LangGraph, a state machine that wraps around the model. The orchestrator creates the macro-flow:

  1. User query ⮕ Orchestration layer.
  2. Check state / memory.
  3. Determine route.
  4. Search Vector DB.
  5. Call tools.
  6. Build final context ⮕ Call LLM.
  7. Validate / post-process answer.
  8. Finish or loop again.

This is a great reminder of the difference between the imperative and declarative programming approaches:

  • Declarative (The LLM): You describe what you want (the goal), and the model predicts the path.

  • Imperative (The Orchestrator): You explicitly define how the agent must behave ("If SQL fails, try Vector DB; if that fails, ask for human help").

In plain terms, the orchestrator says: "First, go to SQL for facts; if that's not enough, look into the Vector DB; check the result; if it's junk, go back for another cycle."
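
Spelled out imperatively, that rule is just a loop with explicit branches. The step functions below are hypothetical, not a real LangGraph API; LangGraph would express the same thing as nodes and edges in a graph:

// The orchestrator, not the model, owns the control flow
declare function askSql(q: string): Promise<string>
declare function searchVectorDb(q: string): Promise<string>
declare function isGoodEnough(answer: string): boolean
declare function escalateToHuman(q: string): Promise<string>

async function route(question: string): Promise<string> {
  for (let attempt = 0; attempt < 3; attempt++) {
    let answer = await askSql(question)       // first, go to SQL for facts
    if (!isGoodEnough(answer)) {
      answer = await searchVectorDb(question) // not enough? look into the Vector DB
    }
    if (isGoodEnough(answer)) return answer   // check the result
    // junk? go back for another cycle
  }
  return escalateToHuman(question) // explicit fallback, decided by code
}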


Architecture Instead of Magic 🏗️

A modern AI agent is architecture, not magic. Calling it "intelligence" in the full sense of the word is a stretch, but the fact that we are all now "vibe-coders" living in a statistical matrix is a reality!
