
The foundation nobody teaches you to survive the AI era (Part 1)


AI does everything. But if you don't understand what's underneath, you're replaceable. The technical fundamentals that separate those who use AI from those used by it.


Can you explain why ChatGPT sometimes “hallucinates”?

I’m not asking whether you know it hallucinates. Everyone knows that. I’m asking whether you know the technical mechanism that causes it. Whether you can pop the hood and point exactly where the gears fail.

If the answer is no, I have uncomfortable news: you’re not an AI professional. You’re a user. And at the speed this industry moves, the distance between the two is getting increasingly dangerous.

Think of it this way: everyone knows how to drive a car. But when the engine makes a strange noise, 99% of people take it to the mechanic. The mechanic understands what’s underneath. And in the era of self-driving cars, those who only know how to drive will be replaced by the software. Those who understand the engine will build the software.

It’s the same with AI. Millions of people know how to use ChatGPT, Claude, Gemini. They write prompts, copy answers, paste into code. But when something goes wrong, when the model hallucinates, when costs explode, when latency blows the SLA, they have no idea what to tweak. Because they never learned what’s underneath.

There is a modern technical foundation that separates those who will thrive in this era from those who will be replaced. It’s not a PhD in mathematics. It’s not memorizing papers. It’s specific, practical fundamentals that any developer with discipline can learn.

In this post (Part 1 of 2), I’ll give you the complete map. Today we cover theory: essential mathematics, how LLMs actually work, and data and representations. In Part 2, we’ll go to practice: AI systems engineering, evaluation and observability.

Let’s go.

The map: what you actually need to know

Before diving in, I need to show you the complete map. There are five fundamental areas, organized like an RPG skill tree. Each one unlocks the next:

| # | Area | What it covers | Part |
|---|------|----------------|------|
| 1 | Essential mathematics | Linear algebra, probability, gradients | Part 1 (this post) |
| 2 | How LLMs work | Tokenization, embeddings, attention, generation | Part 1 (this post) |
| 3 | Data and representations | Vector databases, chunking, feature engineering | Part 1 (this post) |
| 4 | AI systems engineering | RAG, agents, guardrails, orchestration | Part 2 |
| 5 | Evaluation and observability | Evals, metrics, monitoring, costs | Part 2 |

The order matters. You can’t understand embeddings without linear algebra. You can’t understand attention without probability. You can’t build RAG without understanding vector databases. Each level is a prerequisite for the next.

The good news: you don’t need to master everything at an academic level. You need functional intuition. Understanding enough to make informed technical decisions, debug real problems, and have productive conversations with those who go deeper.

Essential mathematics: you don’t need a PhD, you need intuition

This is the part that scares most developers. “Math? I became a programmer precisely to escape that.” I get it. I was that person. But the truth is that the math behind modern AI is surprisingly accessible if you focus on what matters and ignore what doesn’t.

Linear algebra: the language of AI

If AI had a native language, it would be linear algebra. Everything an AI model “sees,” “hears,” or “reads” is transformed into vectors and matrices before being processed. Every word, every image, every sound becomes a list of numbers.

A vector is simply an ordered list of numbers. For example, the word “coffee” can be represented as a vector of 1,536 numbers (in OpenAI’s text-embedding-3-small model). Those numbers capture the “meaning” of the word in mathematical space.

The most famous example is word arithmetic. In 2013, researchers at Google (Mikolov et al.) showed that with Word2Vec:

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

This works because vectors capture semantic relationships. The “direction” from man to woman is the same as from king to queen. It looks like magic, but it’s pure linear algebra.

The most important concept for day-to-day work is cosine similarity. It measures how much two vectors point in the same direction, regardless of magnitude. This is how semantic search systems find relevant documents, how RAG selects context, and how recommendations work.

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Fictional vectors to illustrate
coffee = np.array([0.8, 0.6, 0.1])
cappuccino = np.array([0.75, 0.65, 0.15])
car = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(coffee, cappuccino))  # ~0.99 (very similar)
print(cosine_similarity(coffee, car))         # ~0.31 (not similar)

What you need to know: vectors, matrices, dot product, cosine similarity, matrix multiplication.

What you DON’T need: spectral decomposition, eigenvalues/eigenvectors, Hilbert spaces, advanced tensor algebra.

Probability and statistics: the language of uncertainty

LLMs don’t “know” anything. They calculate probabilities. When ChatGPT answers your question, it’s not querying a knowledge base. It’s doing next-token prediction: given all the preceding text, what is the most probable next token?

Formally: P(next_token | all_previous_tokens)

Each time the model generates a token, it calculates a probability distribution over the entire vocabulary (tens of thousands of possible tokens) and picks one. The temperature parameter controls how that choice is made:

  • Temperature = 0: always picks the highest-probability token. More “deterministic” and repetitive output.
  • Temperature = 1: samples proportionally to probabilities. More varied and “creative.”
  • Temperature > 1: flattens the distribution, making unlikely tokens more probable. More chaotic.

The function that transforms the model’s raw scores into probabilities is called softmax:

import numpy as np

def softmax(logits, temperature=1.0):
    scaled = logits / temperature
    exp_scaled = np.exp(scaled - np.max(scaled))
    return exp_scaled / exp_scaled.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])

print(softmax(logits, temperature=0.5))  # [0.83, 0.11, 0.04, 0.02] - concentrated
print(softmax(logits, temperature=1.0))  # [0.57, 0.21, 0.13, 0.09] - balanced
print(softmax(logits, temperature=2.0))  # [0.41, 0.25, 0.19, 0.16] - spread out

See how low temperature concentrates nearly all probability on the highest-scoring token, while high temperature distributes more evenly. That’s everything you need to know about temperature to make technical decisions day to day.

What you need to know: probability distributions, mean and variance, Bayes’ theorem (intuition, not the formula), softmax, sampling.

What you DON’T need: advanced Markov chains, stochastic processes, formal information theory.

Calculus: just the minimum to understand gradients

This is where most people give up. But I promise: you need exactly ONE concept from calculus to understand how AI models learn. That concept is the gradient.

Think of it this way: you’re on top of a mountain, in total darkness, and need to get down to the valley. You can’t see anything. But you can feel the terrain’s slope with your feet. So you take a step in the steepest downhill direction, feel the slope again, adjust direction, and repeat.

That’s gradient descent. The AI model starts with random weights (position on the mountain), calculates how wrong the answer is (altitude), and adjusts the weights in the direction that reduces error (walks downhill). The gradient is the “slope” that tells which direction to adjust.

The learning rate is the step size. Too large: you leap over the valley and climb the other side. Too small: it takes forever to arrive.
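The mountain analogy maps directly to code. Here is a minimal sketch of gradient descent on a toy one-dimensional loss function; the function, starting point, and learning rate are all illustrative choices, not values from any real model:

```python
# Toy loss: loss(w) = (w - 3)^2, whose minimum is at w = 3.
# The gradient tells us which direction is "downhill".

def loss(w):
    return (w - 3) ** 2

def gradient(w):
    # Derivative of (w - 3)^2 with respect to w
    return 2 * (w - 3)

w = 10.0             # arbitrary starting point ("top of the mountain")
learning_rate = 0.1  # step size

for step in range(50):
    w -= learning_rate * gradient(w)  # walk downhill

print(round(w, 4))  # converges toward 3.0
```

Try setting `learning_rate = 1.5` to see the "leap over the valley" failure mode: instead of converging, `w` oscillates and diverges.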

What you need to know: derivative as “rate of change,” gradient as “direction of greatest change,” gradient descent as the learning loop, learning rate as step size.

What you DON’T need: solving integrals, Taylor series, differential equations, backpropagation at the original paper level.

Summary: what to study vs what to ignore

| Topic | Need to know | Can ignore |
|-------|--------------|------------|
| Linear algebra | Vectors, matrices, dot product, cosine similarity | Spectral decomposition, eigenvalues |
| Probability | Distributions, Bayes (intuition), softmax | Advanced Markov chains, information theory |
| Calculus | Derivatives, gradients, learning rate | Integrals, series, differential equations |

How LLMs actually work (no bullshit)

Now that you have the mathematical foundation, let’s pop the hood on an LLM. I’ll explain the complete flow, from the prompt you type to the response that appears on screen. No unnecessary jargon, no simplifications that lie.

Tokenization: the first step nobody explains

LLMs don’t read words. They read tokens. A token is a piece of text that can be a whole word, part of a word, a special character, or even a space.

The sentence “Artificial intelligence is fascinating” doesn’t enter the model as 4 words. It’s broken into tokens:

["Art", "ificial", " intelligence", " is", " fascinating"]

Why does this matter day to day?

Cost. You pay per token, not per word. And languages like Portuguese generate more tokens than English for the same content (because the tokenizer was trained predominantly on English). I’ve written in detail about the invisible cost of wasted tokens.

Context limits. When a model has “128K tokens of context,” that’s 128K tokens, not 128K words. In most languages, that’s significantly fewer words than it sounds.

Strange behavior. Models sometimes fail at simple character-counting tasks (“how many R’s in ‘strawberry’?”) because they don’t see individual characters, they see tokens.

You can experiment with OpenAI’s tiktoken:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Artificial intelligence is fascinating")
print(f"Tokens: {len(tokens)}")  # probably 4-5 tokens
print([enc.decode([t]) for t in tokens])  # shows each token

Embeddings: from tokens to vectors

Each token is converted into a vector of numbers (an embedding). If the model has dimension 4096, each token becomes a list of 4,096 numbers. Those numbers are the “weights” the model learned during training.

Tokens with similar meaning end up with similar vectors. “Dog” and “canine” are close in vector space. “Dog” and “economics” are far apart.

This conversion from token to vector is the bridge between text (which humans understand) and mathematics (which the model processes). All remaining processing happens in vector space.

In practice, you interact with embeddings all the time:

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["dog playing in the park", "canine running in the garden"]
)

# Two semantically similar texts will have close embeddings
embedding_1 = response.data[0].embedding  # 1536-dimension vector
embedding_2 = response.data[1].embedding

When you do semantic search, RAG, or text classification with AI, you’re using embeddings under the hood.

Attention: the mechanism that changed everything

In 2017, researchers at Google published the paper “Attention Is All You Need” (Vaswani et al.) and changed the history of AI. The central mechanism of that paper is self-attention.

The idea is elegant: for each token in the sequence, the model calculates how much that token should “pay attention” to each of the other tokens.

Example: in the sentence “The bank is near the river,” how does the model know “bank” refers to a riverbank and not a financial institution? Because the attention mechanism gives high weight to the connection between “bank” and “river.” The context “river” pulls the meaning of “bank” in the right direction.

Without attention, the model would process each token in isolation, like a bag of words. With attention, it understands relationships between tokens, regardless of distance. “The cat that my friend’s neighbor adopted last year escaped” links “cat” to “escaped” even with 10 tokens between them.

What you need to know:

  • Self-attention allows each token to “look at” every other token in the sequence
  • This is what makes Transformers powerful for understanding context
  • The context window (128K, 200K, 1M tokens) is the limit of how many tokens can “see each other” simultaneously
  • More context = more attention calculations = more computational cost (quadratic with sequence length)
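To make this concrete, here is scaled dot-product attention (the core operation from the paper) sketched in NumPy over a handful of made-up token vectors. The dimensions and values are illustrative, and in a real Transformer Q, K, and V come from learned projections of the input; here we use the input directly to keep the sketch minimal:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # how much each token attends to each other token
    weights = softmax(scores)      # each row sums to 1
    return weights @ V, weights

# 4 toy "tokens", each a 3-dimensional vector (values are made up)
X = np.array([
    [1.0, 0.0, 0.5],
    [0.2, 1.0, 0.1],
    [0.9, 0.1, 0.4],
    [0.0, 0.3, 1.0],
])

output, weights = self_attention(X, X, X)
print(weights.shape)  # (4, 4): one attention row per token
```

Notice the shape: 4 tokens produce a 4×4 weight matrix. That n×n matrix is exactly why attention cost grows quadratically with sequence length.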

The generation loop: next-token prediction

Now let’s put it all together. The complete process of generating a response works like this:

| Step | What happens |
|------|--------------|
| 1. Input | You type “What is the capital of Brazil?” |
| 2. Tokenization | Text is broken into tokens |
| 3. Embeddings | Each token becomes a vector |
| 4. Attention layers | Tokens interact with each other (dozens of layers) |
| 5. Output | Model produces probabilities for the next token |
| 6. Sampling | A token is chosen (influenced by temperature) |
| 7. Loop | Chosen token is added to the sequence, back to step 3 |

The model generates one token at a time. “Brasília” doesn’t appear all at once. It appears as “Bras” + “ília” (or similar, depending on the tokenizer). And for each token, the model runs the entire pipeline again.
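The loop can be sketched end to end. This toy version uses a fake four-token vocabulary and a stand-in function that returns made-up logits; a real LLM would run tokenization, embeddings, and all the attention layers at every single step:

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["Bras", "ília", " is", "."]  # fake 4-token vocabulary

def fake_model(tokens):
    # Stand-in for the real network: returns made-up logits over the vocabulary.
    logits = rng.normal(size=len(vocab))
    logits[len(tokens) % len(vocab)] += 3.0  # arbitrary bias, just for the demo
    return logits

def softmax(logits, temperature=1.0):
    scaled = logits / temperature
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

tokens = []
for _ in range(4):
    probs = softmax(fake_model(tokens), temperature=1.0)  # step 5: distribution
    chosen = rng.choice(len(vocab), p=probs)              # step 6: sampling
    tokens.append(vocab[chosen])                          # step 7: loop

print("".join(tokens))
```

The structure is what matters: score, sample, append, repeat. Nothing in the loop checks whether the output is true.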

This explains something fundamental: hallucinations. The model doesn’t query a database. It generates the most probable token given the context. If the statistical context suggests that “the capital of Brazil is” is usually followed by “Brasília,” it gets it right. But if the statistical pattern is ambiguous or the model never saw enough information about the topic, it generates the most probable token even if it’s wrong. It doesn’t “know” it’s wrong because it has no concept of truth. It has a concept of probability.

Data and representations: garbage in, garbage out (now on steroids)

In the AI era, the saying “garbage in, garbage out” has gained a new dimension. If the data feeding your system is poorly represented, poorly structured, or poorly retrieved, no model in the world will save the result.

Vector databases: the new database

Traditional databases (PostgreSQL, MySQL) are designed for exact search: “give me all users with age > 30.” Vector databases are designed for similarity search: “give me the documents most similar to this question.”

When you store an embedding (a vector of 1,536 numbers) in a vector database and then search, it finds the “closest” vectors using metrics like cosine similarity. This is the foundation of:

  • RAG (Retrieval-Augmented Generation): fetching relevant documents to enrich the LLM’s context
  • Semantic search: finding results by meaning, not exact words
  • Agent memory: storing and retrieving past experiences

The main vector databases you should know:

| Database | Type | Best for |
|----------|------|----------|
| pgvector | PostgreSQL extension | Those already using Postgres who want simplicity |
| Chroma | Open-source, embedded | Rapid prototyping, smaller projects |
| Weaviate | Open-source, standalone | Production with advanced features |
| Pinecone | Managed SaaS | Those who don’t want to manage infrastructure |
| Qdrant | Open-source, Rust | High performance, large scale |

The most important concept to understand is HNSW (Hierarchical Navigable Small World), the most common indexing algorithm. It creates a hierarchical graph that allows approximate search in sublinear time. It’s not exact search (hence ANN, Approximate Nearest Neighbor), but it’s fast enough for production.
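What a vector database approximates is exact nearest-neighbor search, which is easy to sketch by brute force. The vectors below are random stand-ins for real embeddings; the point is the full O(n) scan that HNSW's graph index exists to avoid:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 fake "document embeddings" of dimension 64 (stand-ins for real ones)
docs = rng.normal(size=(10_000, 64))
query = rng.normal(size=64)

def top_k_cosine(query, docs, k=3):
    # Exact search: compares the query against EVERY document (O(n)).
    # HNSW trades this exactness for sublinear speed (ANN).
    docs_norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    sims = docs_norm @ query_norm            # cosine similarity to all docs
    return np.argsort(sims)[::-1][:k]        # indices of the k most similar

top3 = top_k_cosine(query, docs)
print(top3)  # indices of the 3 most similar documents
```

At 10,000 documents, brute force is fine. At hundreds of millions, the full scan becomes the bottleneck, and that is when the approximate index earns its keep.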

Chunking and context window

LLMs have context limits. Even the latest models with 1 million tokens can’t process an entire knowledge base at once. You need to split your documents into pieces (chunks) and retrieve only the relevant ones.

Three main strategies:

Fixed-size chunking: splits text into pieces of N tokens. Simple, fast, but can cut sentences in the middle and lose context.

Semantic chunking: splits by semantic boundaries (paragraphs, sections, topic changes). Better quality, more complex to implement.

Recursive chunking: tries to split by large boundaries first (sections), and if the chunk is still too big, splits by smaller boundaries (paragraphs, sentences). It’s LangChain’s default and works well in most cases.
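A minimal sketch of the first strategy, fixed-size chunking with overlap (sizes here are in characters for simplicity; production code usually counts tokens):

```python
def chunk_fixed(text, size=200, overlap=50):
    """Split text into fixed-size chunks with overlap between neighbors.

    Overlap reduces (but does not eliminate) the risk of cutting a
    sentence in half exactly at a chunk boundary.
    """
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks

doc = "word " * 200  # stand-in for a real document (1,000 characters)
chunks = chunk_fixed(doc, size=200, overlap=50)
print(len(chunks), len(chunks[0]))  # 7 chunks, first one 200 chars
```

The overlap is the interesting parameter: the last 50 characters of each chunk are repeated at the start of the next, so information near a boundary appears whole in at least one chunk.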

Bad chunking is the number 1 cause of poor RAG. If your chunks cut information in the middle, the model receives incomplete context and generates incomplete or wrong answers. I go deeper on this in the post about RAG in practice.

Feature engineering in the embeddings era

Before LLMs, transforming raw data into useful representations for models was manual, artisanal work. You’d create features like one-hot encoding for categories, TF-IDF for text, normalization for numbers.

With embeddings, much of this work is automated. An embedding model transforms text into vectors that capture meaning without you needing to define rules manually.

But be careful: embeddings aren’t a silver bullet. The practical rule is:

  • Structured data (tables, numbers, categories): traditional features still work better. XGBoost with good features beats LLMs on many tabular tasks.
  • Unstructured data (text, image, audio): embeddings are superior. Use embedding models specific to the data type.
  • Mixed data: combine both. Numeric features + text embeddings concatenated in the same pipeline.
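The mixed case is often just concatenation. A sketch, with a random vector standing in for the text embedding (in practice it would come from an embedding model, and the feature names are invented for the example):

```python
import numpy as np

# Structured features for a hypothetical product:
# [normalized_price, stock_count, category_id]
numeric_features = np.array([0.42, 120.0, 3.0])

# Stand-in for a real text embedding of the product description
# (a real one would have e.g. 1,536 dimensions)
text_embedding = np.random.default_rng(1).normal(size=8)

# One combined feature vector for a downstream model (e.g. XGBoost)
combined = np.concatenate([numeric_features, text_embedding])
print(combined.shape)  # (11,)
```

One practical caveat: the embedding dimensions and the raw numeric features live on very different scales, so normalize the numeric columns before feeding the combined vector to a scale-sensitive model.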

This connects directly with what we discussed in the post about AI for everything is a cannon to kill a mosquito: knowing when to use embeddings and when to use traditional features is part of knowing when to use AI and when not to.

Next chapter: from theory to production (Part 2)

Everything we covered so far is theory. The foundation. The base. But knowing theory without knowing how to build is like knowing anatomy without knowing how to operate. You understand the body, but you can’t save the patient.

In Part 2, we head to the operating table:

  • AI systems engineering: how to architect RAG in production, orchestrate autonomous agents, implement guardrails that prevent disasters, and decide between fine-tuning and prompting
  • Evaluation and observability: how to measure whether your AI actually works (spoiler: “looks good” is not a metric), create consistent evals, monitor costs and latency, and detect degradation before the user complains

Part 2 is coming soon. If you made it this far, you already have a huge advantage over those who only know how to copy prompts.

Conclusion: AI does everything, except think for you

Let’s recap the map:

  1. Essential mathematics: linear algebra to understand embeddings, probability to understand generation, gradients to understand learning
  2. How LLMs work: tokenization, embeddings, attention, next-token prediction
  3. Data and representations: vector databases, chunking, when to use embeddings vs traditional features

These three pillars are what separates those who pilot AI from those piloted by it. AI replaces execution, not comprehension. It generates code, but doesn’t understand why that code is the right choice. It answers questions, but doesn’t know when the answer is wrong. It processes data, but doesn’t know whether the data makes sense.

Those who understand the fundamentals can:

  • Diagnose why the model is hallucinating (bad context? wrong chunking? high temperature?)
  • Choose the right tool for each problem (LLM? embedding + vector search? traditional algorithm?)
  • Optimize costs (reduce tokens, cache embeddings, use smaller models where possible)
  • Have productive technical conversations with the team instead of repeating buzzwords

I’ve previously written about why fundamentals matter more than certifications from a philosophical perspective. This post is the practical complement: what exactly those fundamentals are and how to start studying them.

If this post was useful, share it with that dev who thinks knowing how to prompt is enough. They’ll thank you when the hype fades and the fundamentals remain.
