AI Learnings - Understanding Words, Sentences, and Tokens - Using a Real Example - Custom LLM For AWS EC2 Documentation
Introduction
For this post I have taken the header text of the AWS EC2 documentation from https://docs.aws.amazon.com/ec2/; the full paragraph is quoted below under "The Example Text".
Purpose
When working with large language models (LLMs), one term often causes confusion: tokens.
If you’re reading technical documentation or training your own AI model, it’s important to understand the difference between words, sentences, and tokens – especially when calculating data sizes, training cost, or billing.
Let’s explore these concepts using a real example from Amazon EC2 documentation.
The Example Text
Here’s a real paragraph from AWS EC2’s official documentation:
<pre> Amazon Elastic Compute Cloud Documentation Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable computing capacity—literally, servers in Amazon’s data centers—that you use to build and host your software systems. </pre>
Step 1: Count the Words
Let’s break this down as a human would — by counting words (space-separated units):
Total words: 36
Why it matters:
Words are how humans read and write – but LLMs don’t process text this way.
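As a minimal sketch, counting words the human way is just whitespace splitting (shown here on a fragment of the paragraph; note that dash-joined pairs like "capacity—literally" would count as a single unit under this scheme):

```python
# Count "words" the way a human reader might: split on whitespace.
fragment = "Amazon EC2 is a web service that provides resizable computing capacity"
words = fragment.split()
print(len(words))  # 11 space-separated units
```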
Step 2: Tokenize the Text
Now let’s tokenize this using a language model’s tokenizer (like GPT-2 or GPT-3):
<pre> ["Amazon", " Elastic", " Compute", " Cloud", " Documentation", " Amazon", " Elastic", " Compute", " Cloud", " (", "Amazon", " EC", "2", ")", " is", " a", " web", " service", " that", " provides", " resizable", " computing", " capacity", "—", "literally", ",", " servers", " in", " Amazon", "'s", " data", " centers", "—", "that", " you", " use", " to", " build", " and", " host", " your", " software", " systems", "."] </pre>
That’s 45 tokens.
So even though we had 36 words, the tokenizer sees 45 tokens.
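To see why the counts diverge, here is a simplified, ASCII-only version of the GPT-2 pre-tokenization pattern. This is an illustration only, not the real tokenizer: actual GPT-2 also applies byte-pair encoding after this split, so real token counts can be higher.

```python
import re

# Simplified, ASCII-only sketch of the GPT-2 pre-tokenizer pattern.
# Contractions split off, and a leading space sticks to the next piece.
PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+"
)

def rough_tokens(text):
    """Split text into GPT-2-style pieces, keeping leading spaces."""
    return PATTERN.findall(text)

pieces = rough_tokens("Amazon's EC2 is a web service.")
print(pieces)
# ['Amazon', "'s", ' EC', '2', ' is', ' a', ' web', ' service', '.']
```

Notice how "Amazon's" becomes two pieces and "EC2" splits at the letter/digit boundary, exactly the behavior listed above.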
Why Tokens > Words?
Let’s look at a few reasons why this happens:
| Word or Phrase | Tokens |
|---|---|
| "EC2" | "EC", "2" (split into two tokens) |
| "Amazon's" | "Amazon", "'s" |
| Long compound terms | Broken into multiple parts |
| Punctuation marks | Count as separate tokens |
| Spaces before words | Often folded into the following token |
Tokens are more fine-grained than words — they’re what the LLM actually processes.
The Real Impact
Why does this matter in practice?
Training Costs: When training a model, you pay (in compute) based on total tokens, not words.
Inference Costs: When you send a prompt to an API such as OpenAI's or Anthropic's, you are charged per token (typically quoted per 1,000 or per 1 million tokens), not per sentence.
Model Design: When deciding how much data you need to train or fine-tune a model, you measure in tokens, not words.
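As a sketch of the billing math (the price here is a made-up placeholder, not any provider's actual rate):

```python
# Hypothetical inference-cost estimate; the rate is a placeholder, not real pricing.
PRICE_PER_1K_TOKENS = 0.002  # assumed USD per 1,000 tokens

def inference_cost(num_tokens, price_per_1k=PRICE_PER_1K_TOKENS):
    """Cost of a request billed per token."""
    return num_tokens / 1000 * price_per_1k

# Our 45-token paragraph:
print(f"${inference_cost(45):.6f}")  # $0.000090
```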
In this case:
- 1 documentation paragraph
- 36 words
- 45 tokens

That’s 45 ÷ 36 ≈ 1.25 tokens per word.
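The ratio is simple division over the two counts from this example:

```python
# Tokens-per-word ratio for the example paragraph.
words, tokens = 36, 45
ratio = tokens / words
print(round(ratio, 2))  # 1.25
```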
Rule of Thumb
| Text Type | Avg. Tokens per Word |
|---|---|
| Plain English | ~1.3 |
| Technical docs | ~1.4–1.6 |
| Code or config files | ~1.6–2.0 |
So if you’re working with 1 million words of AWS documentation, expect roughly 1.4 to 1.6 million tokens – which directly impacts how much compute and memory you’ll need.
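That back-of-the-envelope estimate can be sketched as a small helper; the multipliers are the rule-of-thumb ranges from the table above, not measured values:

```python
# Rough tokens-from-words estimate using rule-of-thumb multipliers (assumed ranges).
TOKENS_PER_WORD = {
    "plain_english": (1.3, 1.3),
    "technical_docs": (1.4, 1.6),
    "code": (1.6, 2.0),
}

def estimate_tokens(word_count, text_type="technical_docs"):
    """Return a (low, high) token estimate for a given word count."""
    low, high = TOKENS_PER_WORD[text_type]
    return round(word_count * low), round(word_count * high)

print(estimate_tokens(1_000_000))  # (1400000, 1600000)
```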
Final Takeaway
| Concept | Meaning |
|---|---|
| Word | Human-readable unit of language |
| Sentence | Sequence of words expressing a complete thought |
| Token | Subword unit (word piece, symbol, or punctuation) processed by LLMs |
Whether you’re training a model or calculating usage costs, it’s the tokens — not the words — that truly matter.
If you’re building custom LLMs using AWS docs or technical manuals, always plan your data pipeline, compute sizing, and evaluation based on tokens.