From Lucene segments to distributed clusters —
understanding the search engine from the ground up.
A Java library for full-text search.
The foundation everything else is built on.
First released in 1999 by Doug Cutting — named after his wife's middle name.
Think of it like the index at the back of a textbook — but automatically built for every word in every document.
Scan Doc 1: "elasticsearch is fast and scalable" — no match
Scan Doc 2: "lucene is fast and powerful" — match!
Scan Doc 3: "elasticsearch runs on lucene" — match!
Must scan every document in the entire corpus.
Look up "lucene" in term dictionary — found!
Read posting list → Doc 2, Doc 3 — done!
Direct lookup. Zero scanning.
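The contrast can be sketched in a few lines of illustrative Python (a toy model, not Lucene's actual structures): the scanning work happens once at index time, and every query afterwards is a single dictionary lookup.

```python
docs = {
    1: "elasticsearch is fast and scalable",
    2: "lucene is fast and powerful",
    3: "elasticsearch runs on lucene",
}

# Build the inverted index once: term -> sorted list of doc IDs (posting list).
inverted = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        inverted.setdefault(term, []).append(doc_id)
for postings in inverted.values():
    postings.sort()

# Querying is now a direct lookup -- no document is scanned.
print(inverted["lucene"])  # → [2, 3]
```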
Before text enters the inverted index, it passes through a 3-stage pipeline that transforms raw text into searchable terms.
"The Quick Brown Foxes are <b>FAST</b>!"
Transform raw characters before tokenizing.
Splits text into individual tokens.
Transform, remove, or add tokens.
Unicode tokenizer + lowercase filter
"Foxes" → "foxes"
Your own char_filter + tokenizer + filter chain
"Foxes" → "fox" (with stemmer)
No-op — entire input is one token
"a@b.com" → "a@b.com"
Language-specific stemmer + stop words
"running" → "run" (english)
Three layers working together to make full-text search fast.
A sorted list of every unique term → disk pointer. Stored as an FST (Finite State Transducer) — a compressed automaton that shares prefixes and suffixes.
Lookup: O(length of term) — independent of index size.
Why can't we just use a HashMap? Because with 100M unique terms, a HashMap would eat all your RAM.
Terms: fast faster fastest fun funny
A HashMap stores each term separately. At 100M unique terms, that's ~6 GB of RAM just for one field's dictionary.
Like a trie that also shares suffixes. Common prefixes AND suffixes stored once.
• "fast" and "faster" share f→a→s→t nodes
• Accept nodes store a disk pointer to the posting list
(Diagram: a start/entry node, a–z character transitions between nodes, and green accept nodes marking valid terms.)
Lookup "fast": start→f→a→s→t (accept!) → read posting list pointer
Lookup "fan": start→f→a→no "n" child! → term doesn't exist
Three layers working together to make full-text search fast.
FST: term → disk pointer
Lookup: O(term length)
For each term: the sorted list of document IDs + frequency + positions. Compressed with delta encoding + bit packing.
Stores: doc IDs, term frequency, positions, offsets.
"product-42"). Stored in _source, mapped via a separate lookup.
"fast AND scalable"
Posting lists are sorted by doc ID — this enables three key compression & query optimizations:
Delta encoding: raw doc IDs like 1000000, 1000002, 1000005 each need 32 bits. Stored as deltas — 1000000, 2, 3 — dramatically smaller numbers!
Bit packing: a delta of 3 doesn't need 32 bits; blocks of small deltas are packed into just a few bits each.
Skip lists: for fast AND scalable, we need the intersection of two posting lists. Scanning all elements is O(n+m). Instead, iterate the shorter list (scalable, 4 docs) and for each ID skip-search within fast.
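Delta encoding and skip-based intersection can be sketched in Python (binary search stands in for Lucene's on-disk skip pointers):

```python
from bisect import bisect_left

def delta_encode(doc_ids):
    # 1000000, 1000002, 1000005 -> 1000000, 2, 3
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def intersect(short, long):
    # Iterate the shorter posting list; binary-search (a stand-in for
    # skip pointers) to jump forward in the longer one.
    result, pos = [], 0
    for doc in short:
        pos = bisect_left(long, doc, pos)
        if pos < len(long) and long[pos] == doc:
            result.append(doc)
    return result

print(delta_encode([1000000, 1000002, 1000005]))  # → [1000000, 2, 3]

fast = [2, 5, 13, 89, 144, 200, 350, 500]
scalable = [5, 89, 350, 999]
print(intersect(scalable, fast))  # → [5, 89, 350]
```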
Three layers working together to make full-text search fast.
FST: term → disk pointer
Lookup: O(term length)
Sorted doc IDs + metadata
Compression: delta + bit packing
Each field gets its own inverted index. Completely independent.
Why: title:laptop only scans the title index.
Key insight: The inverted index trades write-time work (analyzing text, building data structures) for read-time speed (instant lookups). This is the fundamental trade-off of every search engine.
A Lucene index is not a single giant inverted index. It's a collection of smaller, immutable pieces called segments.
Once written, never modified. This enables:
Multiple threads read simultaneously — zero contention.
OS page cache never invalidated — data never changes.
Compress once, read forever — no decompress for updates.
Delete segment = delete file. No garbage collection.
A segment is a self-contained mini-index on disk. Inside, the same data is stored in 3 different layouts — each optimized for a different access pattern.
Question: "Which docs contain this word?"
| Term Dictionary (FST, in RAM) | |
|---|---|
| laptop | → disk:0x4A20 |
| phone | → disk:0x7F10 |
| Posting Lists (on disk, sorted by ID) | |
|---|---|
| laptop | 2 5 13 89 144 |
| phone | 1 3 5 8 |
Question: "Give me the original JSON of doc 5."
| _source (row-oriented, on disk) | |
|---|---|
| doc 2 | {"name":"laptop pro", "price":999} |
| doc 5 | {"name":"laptop air", "price":1299} |
| doc 13 | {"name":"gaming laptop", "price":1899} |
Question: "Sort these results by price."
| doc | name | category | price |
|---|---|---|---|
| 0 | laptop pro | electronics | 999 |
| 1 | laptop air | electronics | 1299 |
| 2 | gaming laptop | gaming | 1899 |
Row-oriented (_source): 3 fields read when only 1 was needed; random I/O, each row on a different disk block.
Column-oriented (doc values): only the price column read, one contiguous block; sequential I/O, a single disk read.
One column file per field → ideal for sorting & aggregations. Same-type values stored together → better compression.
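A minimal illustration of why the columnar layout wins for sorting (illustrative Python; real doc values are compressed on-disk arrays):

```python
# Row store: sorting by price forces every field of every doc to be read.
rows = [
    {"name": "laptop pro",    "category": "electronics", "price": 999},
    {"name": "laptop air",    "category": "electronics", "price": 1299},
    {"name": "gaming laptop", "category": "gaming",      "price": 1899},
]
row_sorted = sorted(rows, key=lambda r: r["price"], reverse=True)

# Column store (doc values): one array per field, index = doc ID.
price_column = [999, 1299, 1899]

# Sorting touches only the price column; other fields are never read.
order = sorted(range(len(price_column)),
               key=price_column.__getitem__, reverse=True)
print(order)  # doc IDs by price, highest first → [2, 1, 0]
```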
Immutable segments can't delete in-place — a .liv bitset (1 bit/doc) tracks live docs. Reclaimed on merge.
How documents go from your application into a searchable index.
Data passes through three layers, each with different trade-offs between speed and safety.
Disk (fsync): survives everything. The problem: fsync is slow (~10 ms per call). Calling it after every document kills throughput, but without it data is lost on a crash. Elasticsearch solves this with the translog.
A search must visit every segment. Per-segment search is the fundamental unit of work.
Every matching document gets a relevance score. Lucene uses BM25 — an evolution of TF-IDF with term frequency saturation.
How rare is the term? Rare terms score much higher.
More occurrences = higher score, but with diminishing returns.
Shorter fields rank higher. "laptop" in a 3-word title beats "laptop" in a 500-word body.
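The three ingredients combine in the standard BM25 formula; a compact sketch with the usual defaults k1 = 1.2 and b = 0.75 (illustrative, not Lucene's implementation):

```python
import math

def bm25(tf, doc_len, avg_len, df, n_docs, k1=1.2, b=0.75):
    # IDF: rare terms (low df) score much higher.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Length norm: the same tf counts for more in a shorter field.
    norm = 1 - b + b * doc_len / avg_len
    # TF with saturation: more occurrences help, with diminishing returns.
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# "laptop" once in a 3-word title vs once in a 500-word body.
title = bm25(tf=1, doc_len=3, avg_len=10, df=5, n_docs=1000)
body = bm25(tf=1, doc_len=500, avg_len=300, df=5, n_docs=1000)
assert title > body  # shorter field ranks higher

# Saturation: 10x the term frequency gives far less than 10x the score.
assert bm25(10, 3, 10, 5, 1000) < 10 * bm25(1, 3, 10, 5, 1000)
```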
"laptop"k1 = 1.2 controls TF saturation, b = 0.75 field-length impact"explain": true in your query to see the full breakdownSince segments are immutable, neither delete nor update can modify existing data in-place.
Document stays in the segment. A .liv bitset tracks who's alive.
.liv
Bytes only reclaimed during segment merge.
No in-place update. Old version marked deleted, new version indexed fresh.
Old bytes remain until merge. Frequent updates = many dead copies across segments.
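Delete-as-mask and update-as-reindex can be sketched with a live-docs bitmask (illustrative; Lucene's .liv file is a compressed per-segment bitset):

```python
class Segment:
    def __init__(self, docs):
        self.docs = list(docs)            # immutable once written
        self.live = [True] * len(docs)    # the .liv bitset: 1 bit per doc

    def delete(self, doc_id):
        self.live[doc_id] = False         # bytes stay on disk until merge

    def live_docs(self):
        return [d for d, alive in zip(self.docs, self.live) if alive]

seg = Segment(["v1 of doc A", "doc B"])

# "Update" doc A: mark the old copy dead, index the new version fresh.
seg.delete(0)
new_seg = Segment(["v2 of doc A"])

print(seg.live_docs() + new_seg.live_docs())  # → ['doc B', 'v2 of doc A']
```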
The essential background process that keeps a Lucene index healthy.
Merging is essential for keeping search performance high and reclaiming disk space.
Each search visits every segment. 50 segments = 10x the overhead of 5 segments.
Deleted docs waste disk and slow searches. Merging rewrites only live docs — dead docs are gone forever.
Larger segments compress more efficiently. Delta encoding and bit packing work better with more data.
Groups segments of roughly equal size for merging. Avoids merging tiny segments into huge ones.
Segments with many deleted docs are prioritized — more space to reclaim per merge.
Merged segment won't exceed 5 GB. Runs on background threads — does not block indexing or search.
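A deliberately simplified merge pick in Python (real TieredMergePolicy scoring is more involved; the pressures are the same: prefer small segments and reclaimable deletes, respect the size cap):

```python
MAX_MERGED_BYTES = 5 * 1024**3  # 5 GB cap

def pick_merge(segments, fanout=3):
    """segments: list of (size_bytes, deleted_fraction) tuples.
    Smaller segments and more deletions -> merged sooner."""
    ranked = sorted(segments, key=lambda s: s[0] * (1 - s[1]))
    chosen, total = [], 0
    for size, deleted in ranked:
        if len(chosen) == fanout or total + size > MAX_MERGED_BYTES:
            break
        chosen.append((size, deleted))
        total += size
    return chosen

segs = [(6 * 1024**3, 0.0),    # huge, no deletes -> left alone
        (100 * 1024**2, 0.4),  # small with many deletes -> merged first
        (120 * 1024**2, 0.0),
        (90 * 1024**2, 0.1)]
merge = pick_merge(segs)
print(merge[0])  # → (104857600, 0.4)
```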
Lucene is a brilliant library, but it's just a library.
Using it in production exposes several hard problems.
Your data must fit on one machine. No built-in distribution.
One server down = total outage. No replication, no failover.
fsync after every doc? ~100 docs/sec. Unusable.
Docs invisible until next commit or reader reopen.
Java library only. Need your own server layer for other languages.
You handle field types, backward compatibility, reindexing yourself.
A distributed system built on top of Lucene that solves
every limitation we just discussed.
Lucene runs on one machine.
Split into shards (each a Lucene index), distribute across nodes. 10TB? 10 nodes with 1TB each.
One server down = total outage.
Replica shards on different nodes. Primary dies? Replica promoted automatically. Zero downtime.
fsync after every doc is too slow.
Append-only transaction log, cheap to fsync. Full Lucene commits happen infrequently in the background.
Docs invisible until commit.
New segments written to OS page cache every 1s. Searchable via NRT reader, no fsync needed.
One or more nodes working together. Identified by a cluster name. Nodes auto-discover each other.
master cluster state
data stores shards
ingest pipelines
coordinating routes requests
Each shard = one Lucene index. ES distributes many Lucene indexes across machines. Each contains its own segments.
shard = hash(_routing) % num_primary_shards
Default: _routing = _id. This formula is also why the primary shard count is fixed at creation — changing it would send existing documents to different shards.
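A sketch of the routing formula (CRC32 stands in for Elasticsearch's actual murmur3 hash):

```python
import zlib

def route(doc_id, num_primary_shards):
    # shard = hash(_routing) % num_primary_shards, with _routing = _id.
    # (Stable CRC32 used here as a stand-in for murmur3.)
    return zlib.crc32(doc_id.encode()) % num_primary_shards

shard = route("product-42", 3)
# Every read and write for "product-42" goes to this same shard.
# Change num_primary_shards and the formula points somewhere else,
# so existing documents would no longer be found.
```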
Lucene's trick: new segments can be opened from the filesystem cache without a full commit.
Refresh only writes to the filesystem cache — not durable. If the node crashes, unflushed data is lost. The translog prevents this.
Every operation goes to both in-memory buffer and translog.
Fsync'd after every op by default. Append-only writes stay fast.
Translog kept after refresh — segment isn't committed yet.
Truncated when Lucene commit (flush) makes segments durable.
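The four points above can be sketched as a toy write path (illustrative Python, not the real implementation):

```python
class WritePath:
    def __init__(self):
        self.buffer = []    # in-memory, becomes a segment on refresh
        self.translog = []  # append-only, cheap to fsync
        self.segments = []  # durable only after flush (Lucene commit)

    def index(self, doc):
        # Every op goes to BOTH the buffer and the translog.
        self.buffer.append(doc)
        self.translog.append(("index", doc))  # fsync'd per op by default

    def refresh(self):
        # Buffer -> new searchable segment (page cache, not yet durable).
        if self.buffer:
            self.segments.append(self.buffer)
            self.buffer = []
        # Translog is kept: the segment isn't committed yet.

    def flush(self):
        self.refresh()
        # Lucene commit makes segments durable; translog is truncated.
        self.translog = []

    def recover(self):
        # After a crash, replay the translog over the last commit point.
        return [doc for op, doc in self.translog]

wp = WritePath()
wp.index("doc-a")
wp.refresh()            # "doc-a" searchable, translog still holds it
wp.index("doc-b")
print(wp.recover())     # → ['doc-a', 'doc-b']
wp.flush()
print(wp.recover())     # → []
```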
Write routing: hash(_id) % shards. Read routing: response_time based, to the fastest replica.
Each shard has multiple copies (primary + replicas). Which copy should handle the search? ARS (adaptive replica selection) picks the fastest one.
The coordinating node maintains a score for every shard copy, updated after each response.
EWMA (exponentially weighted moving average) of past response times. Recent performance weighted more heavily.
Number of in-flight requests to that node. High queue = node is overloaded, penalized in ranking.
EWMA of time the node actually spent processing (excludes network). Separates slow nodes from slow networks.
Formula: rank(s) = queue_size(s) × service_time(s) + response_time(s)
Lowest rank wins. Automatically adapts to GC pauses, hot nodes, disk I/O spikes, and uneven shard sizes.
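The ranking as stated can be sketched directly (node names and numbers are made up for illustration):

```python
def rank(queue_size, service_time_ewma, response_time_ewma):
    # rank(s) = queue_size(s) x service_time(s) + response_time(s)
    return queue_size * service_time_ewma + response_time_ewma

copies = {
    "node-A": rank(queue_size=2,  service_time_ewma=5.0,  response_time_ewma=12.0),
    "node-B": rank(queue_size=10, service_time_ewma=5.0,  response_time_ewma=12.0),  # overloaded queue
    "node-C": rank(queue_size=2,  service_time_ewma=40.0, response_time_ewma=90.0),  # GC pause / hot node
}
best = min(copies, key=copies.get)  # lowest rank wins
print(best)  # → node-A
```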
Master promotes replica to primary → retries op → allocates new replica elsewhere.
Removed from in-sync set → primary continues → master rebuilds replica on another node.
Replicas reject writes (term mismatch) → old primary discovers replacement → ops rerouted.
Coordinator retries on another shard copy → if all fail, returns partial results.
| Lucene Concept | ES Layer |
|---|---|
| Inverted Index | Core search data structure in every shard |
| Segment (immutable) | Written on refresh (every 1s), merged in background |
| Commit Point | Created during flush, not after every write |
| - | Translog fills the durability gap between refreshes and commits |
| Lucene Index | = 1 ES Shard (primary or replica) |
| - | ES Index = collection of shards distributed across nodes |
The Big Picture
Lucene gives us the search.
Translog gives us durability.
Refresh gives us near real-time.
Shards give us scale.
Replicas give us availability.
Cluster ties it all together.