AI Learnings - Understanding Words, Sentences, and Tokens - Using a Real Example - Custom LLM For AWS EC2 Documentation
Introduction
For this post I have taken the header text of the AWS EC2 documentation from https://docs.aws.amazon.com/ec2/; the full paragraph is quoted below under "The Example Text".
Purpose
When working with large language models (LLMs), one term often causes confusion: tokens.
If you’re reading technical documentation or training your own AI model, it’s important to understand the difference between words, sentences, and tokens – especially when calculating data sizes, training cost, or billing.
Let’s explore these concepts using a real example from Amazon EC2 documentation.
The Example Text
Here’s a real paragraph from AWS EC2’s official documentation:
<pre> Amazon Elastic Compute Cloud Documentation Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable computing capacity—literally, servers in Amazon’s data centers—that you use to build and host your software systems. </pre>
Step 1: Count the Words
Let’s break this down as a human would — by counting words (space-separated units):
Total words: 36
Why it matters:
Words are how humans read and write – but LLMs don’t process text this way.
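As a minimal sketch, counting words the human way is just whitespace splitting (shown here on a fragment of the paragraph; note that dash-joined pairs like "capacity—literally" would count as a single unit under this scheme):

```python
# Count "words" the way a human reader might: split on whitespace.
fragment = "Amazon EC2 is a web service that provides resizable computing capacity"
words = fragment.split()
print(len(words))  # 11 space-separated units
```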
Step 2: Tokenize the Text
Now let’s tokenize this using a language model’s tokenizer (like GPT-2 or GPT-3):
<pre> ["Amazon", " Elastic", " Compute", " Cloud", " Documentation", " Amazon", " Elastic", " Compute", " Cloud", " (", "Amazon", " EC", "2", ")", " is", " a", " web", " service", " that", " provides", " resizable", " computing", " capacity", "—", "literally", ",", " servers", " in", " Amazon", "'s", " data", " centers", "—", "that", " you", " use", " to", " build", " and", " host", " your", " software", " systems", "."] </pre>
That’s 45 tokens.
So even though we had 36 words, the tokenizer sees 45 tokens.
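To see why the counts diverge, here is a simplified, ASCII-only version of the GPT-2 pre-tokenization pattern. This is an illustration only, not the real tokenizer: actual GPT-2 also applies byte-pair encoding after this split, so real token counts can be higher.

```python
import re

# Simplified, ASCII-only sketch of the GPT-2 pre-tokenizer pattern.
# Contractions split off, and a leading space sticks to the next piece.
PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+"
)

def rough_tokens(text):
    """Split text into GPT-2-style pieces, keeping leading spaces."""
    return PATTERN.findall(text)

pieces = rough_tokens("Amazon's EC2 is a web service.")
print(pieces)
# ['Amazon', "'s", ' EC', '2', ' is', ' a', ' web', ' service', '.']
```

Notice how "Amazon's" becomes two pieces and "EC2" splits at the letter/digit boundary, exactly the behavior listed above.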
Why Tokens > Words?
Let’s look at a few reasons why this happens:
| Word or Phrase | Tokens |
|---|---|
| "EC2" | "EC", "2" (split into two tokens) |
| "Amazon's" | "Amazon", "'s" |
| Long compound terms | Broken into multiple parts |
| Punctuation marks | Count as separate tokens |
| Spaces before words | Often folded into the following token |
Tokens are more fine-grained than words — they’re what the LLM actually processes.
The Real Impact
Why does this matter in practice?
Training Costs: When training a model, you pay (in compute) based on total tokens, not words.
Inference Costs: When you send a prompt to an API such as OpenAI's or Anthropic's, you are charged per token (typically quoted per 1,000 or per 1 million tokens), not per sentence.
Model Design: When deciding how much data you need to train or fine-tune a model, you measure in tokens, not words.
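As a sketch of the billing math (the price here is a made-up placeholder, not any provider's actual rate):

```python
# Hypothetical inference-cost estimate; the rate is a placeholder, not real pricing.
PRICE_PER_1K_TOKENS = 0.002  # assumed USD per 1,000 tokens

def inference_cost(num_tokens, price_per_1k=PRICE_PER_1K_TOKENS):
    """Cost of a request billed per token."""
    return num_tokens / 1000 * price_per_1k

# Our 45-token paragraph:
print(f"${inference_cost(45):.6f}")  # $0.000090
```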
In this case:
- 1 documentation paragraph
- 36 words
- 45 tokens

That’s 45 ÷ 36 ≈ 1.25 tokens per word.
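The ratio is simple division over the two counts from this example:

```python
# Tokens-per-word ratio for the example paragraph.
words, tokens = 36, 45
ratio = tokens / words
print(round(ratio, 2))  # 1.25
```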
Rule of Thumb
| Text Type | Avg. Tokens per Word |
|---|---|
| Plain English | ~1.3 |
| Technical docs | ~1.4–1.6 |
| Code or config files | ~1.6–2.0 |
So if you’re working with 1 million words of AWS documentation, expect roughly 1.4 to 1.6 million tokens – which directly impacts how much compute and memory you’ll need.
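That back-of-the-envelope estimate can be sketched as a small helper; the multipliers are the rule-of-thumb ranges from the table above, not measured values:

```python
# Rough tokens-from-words estimate using rule-of-thumb multipliers (assumed ranges).
TOKENS_PER_WORD = {
    "plain_english": (1.3, 1.3),
    "technical_docs": (1.4, 1.6),
    "code": (1.6, 2.0),
}

def estimate_tokens(word_count, text_type="technical_docs"):
    """Return a (low, high) token estimate for a given word count."""
    low, high = TOKENS_PER_WORD[text_type]
    return round(word_count * low), round(word_count * high)

print(estimate_tokens(1_000_000))  # (1400000, 1600000)
```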
Final Takeaway
| Concept | Meaning |
|---|---|
| Word | Human-readable unit of language |
| Sentence | Sequence of words expressing a complete thought |
| Token | Subword unit (word piece, symbol, or punctuation) processed by LLMs |
Whether you’re training a model or calculating usage costs, it’s the tokens — not the words — that truly matter.
If you’re building custom LLMs using AWS docs or technical manuals, always plan your data pipeline, compute sizing, and evaluation based on tokens.