
AI Learnings - Understanding Words, Sentences, and Tokens - Using a Real Example - Custom LLM For AWS EC2 Documentation

Introduction

For this post, I've taken the header text of the AWS EC2 documentation from https://docs.aws.amazon.com/ec2/; the full paragraph is quoted in the example below.

Purpose

When working with large language models (LLMs), one term often causes confusion: tokens.

Whether you're reading technical documentation or training your own AI model, it's important to understand the difference between words, sentences, and tokens – especially when calculating data sizes, training costs, or billing.

Let’s explore these concepts using a real example from Amazon EC2 documentation.

The Example Text

Here’s a real paragraph from AWS EC2’s official documentation:

<pre> Amazon Elastic Compute Cloud Documentation Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable computing capacity—literally, servers in Amazon’s data centers—that you use to build and host your software systems. </pre>

Step 1: Count the Words

Let’s break this down as a human would — by counting words (space-separated units):

Total words: 36

Why it matters:
Words are how humans read and write – but LLMs don’t process text this way.
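This word count is easy to reproduce in Python, assuming (as in the count above) that the em-dashes separate words while the apostrophe in "Amazon's" does not:

```python
# The EC2 header paragraph, with the original em-dashes and curly apostrophe.
text = (
    "Amazon Elastic Compute Cloud Documentation Amazon Elastic Compute "
    "Cloud (Amazon EC2) is a web service that provides resizable "
    "computing capacity\u2014literally, servers in Amazon\u2019s data "
    "centers\u2014that you use to build and host your software systems."
)

# Treat em-dashes as word separators, then split on whitespace.
words = text.replace("\u2014", " ").split()
print(len(words))  # 36
```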

Step 2: Tokenize the Text

Now let’s tokenize this using a language model’s tokenizer (like GPT-2 or GPT-3):

<pre> ["Amazon", " Elastic", " Compute", " Cloud", " Documentation", " Amazon", " Elastic", " Compute", " Cloud", " (", "Amazon", " EC", "2", ")", " is", " a", " web", " service", " that", " provides", " resizable", " computing", " capacity", "—", "literally", ",", " servers", " in", " Amazon", "'s", " data", " centers", "—", "that", " you", " use", " to", " build", " and", " host", " your", " software", " systems", "."] </pre>

That’s 45 tokens.

So even though we had 36 words, the tokenizer sees 45 tokens.
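You can see where splits like "EC2" → " EC" + "2" come from with a simplified, regex-only sketch of GPT-2-style pre-tokenization. This is an illustration, not the real tokenizer – the actual GPT-2 tokenizer additionally applies learned BPE merges, so use a library such as Hugging Face `transformers` or OpenAI's `tiktoken` for exact counts:

```python
import re

# Simplified GPT-2-style pre-tokenization: runs of letters, digits, or
# punctuation, each keeping the leading space as part of the token.
pat = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+")

pieces = pat.findall("Amazon EC2 is a web service.")
print(pieces)
# ['Amazon', ' EC', '2', ' is', ' a', ' web', ' service', '.']
```

Even this rough sketch shows "EC2" splitting into two pieces and the period counting as its own token, which is why token counts outrun word counts.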

Why Tokens > Words?

Let’s look at a few reasons why this happens:

| Word or Phrase       | Tokens                           |
|----------------------|----------------------------------|
| EC2                  | "EC", "2" (split into 2 tokens)  |
| "Amazon's"           | "Amazon", "'s"                   |
| Long compound terms  | Broken into multiple parts       |
| Punctuation marks    | Count as separate tokens         |
| Spaces before words  | Sometimes included in tokens     |

Tokens are more fine-grained than words — they’re what the LLM actually processes.

The Real Impact

Why does this matter in practice?

  • Training Costs: When training a model, you pay (in compute) based on total tokens, not words.

  • Inference Costs: When you send a prompt to an API like OpenAI or Anthropic, they charge per 1,000 tokens, not per sentence.

  • Model Design: When deciding how much data you need to train or fine-tune a model, you measure in tokens, not words.
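As a quick sketch of what per-token billing means in practice – the rate below is a made-up illustrative number, not any provider's actual price:

```python
# Illustrative per-token billing. The rate is a hypothetical example,
# not a real provider's price.
PRICE_PER_1K_TOKENS = 0.002  # dollars per 1,000 tokens (assumed)

def prompt_cost(token_count: int) -> float:
    """Cost in dollars for a prompt of the given token count."""
    return token_count / 1000 * PRICE_PER_1K_TOKENS

print(f"${prompt_cost(45):.6f}")  # the 45-token example paragraph
```

The point is simply that the billable quantity is the token count, so the same text costs more than a naive word count would suggest.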

In this case:

  • 1 document paragraph

  • 36 words

  • 45 tokens

  • That’s ~1.25 tokens per word
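The tokens-per-word ratio is just the two counts divided:

```python
words, tokens = 36, 45
ratio = tokens / words
print(ratio)  # 1.25
```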

Rule of Thumb

| Text Type            | Avg. Tokens per Word |
|----------------------|----------------------|
| Plain English        | ~1.3                 |
| Technical Docs       | ~1.4–1.6             |
| Code or Config Files | ~1.6–2.0             |

So if you’re working with 1 million words of AWS documentation, expect to see roughly 1.4 to 1.6 million tokens – which directly impacts how much compute or memory you’ll need.
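A small planning helper can turn the rule of thumb into a token budget. The specific ratios here (roughly the midpoints of the ranges in the table) are my assumptions, and the function name is hypothetical:

```python
# Rough tokens-per-word ratios, assumed from the rule-of-thumb table.
RATIOS = {
    "plain_english": 1.3,
    "technical_docs": 1.5,
    "code_or_config": 1.8,
}

def estimate_tokens(word_count: int, text_type: str) -> int:
    """Rough token estimate for capacity/cost planning."""
    return round(word_count * RATIOS[text_type])

print(estimate_tokens(1_000_000, "technical_docs"))  # 1500000
```

For real sizing work you'd replace the estimate with an actual tokenizer pass over a sample of your corpus.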

Final Takeaway

| Concept  | Meaning                                |
|----------|----------------------------------------|
| Word     | Human-readable language unit           |
| Sentence | Sequence of words expressing a thought |
| Token    | Unit used by LLMs (subword, symbol)    |

Whether you’re training a model or calculating usage costs, it’s the tokens — not the words — that truly matter.

If you’re building custom LLMs using AWS docs or technical manuals, always plan your data pipeline, compute sizing, and evaluation based on tokens.