Regex 101: a working developer’s introduction

Published 2026-05-15 · 12-min read

Most developers learn regex through copy-paste — they grab a pattern from Stack Overflow, it works for the case in front of them, and they move on. That works until it doesn’t. This guide gives you the smallest correct mental model — enough that you can read any regex, modify it deliberately, and know when you’ve picked the wrong tool.

The five things a regex is made of

Almost every pattern you’ll see is a combination of five primitives. If you can identify each of them in a pattern, you can read any regex.

1. Literal characters

Any non-special character matches itself. hello matches the string hello. 2026 matches 2026. This part is boring on purpose: the regex engine doesn’t care about semantics, only about the characters.

The special characters that DO have meaning are . * + ? ( ) [ ] { } | ^ $ \. To match one of those literally, escape it with a backslash: \. matches a literal dot, \\ matches a literal backslash.
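Here is a small sketch of the escaping rule using Python’s re module. The sample strings are made up for illustration. It shows an unescaped dot matching more than intended, the escaped version behaving as expected, and re.escape() doing the escaping for you when part of the pattern is arbitrary text.

```python
import re

# Unescaped dot: "." matches any character, so "3x14" slips through.
loose = re.compile(r"3.14")
# Escaped dot: only a literal "." is accepted.
strict = re.compile(r"3\.14")

print(bool(loose.fullmatch("3x14")))   # True
print(bool(strict.fullmatch("3x14")))  # False
print(bool(strict.fullmatch("3.14")))  # True

# re.escape() backslash-escapes the special characters in a string,
# which helps when part of the pattern comes from outside your code.
user_text = "a+b (draft)"  # hypothetical input
escaped = re.escape(user_text)
print(bool(re.search(escaped, "notes on a+b (draft) v2")))    # True
# Used raw, the same text is read as a pattern and no longer matches itself.
print(bool(re.search(user_text, "notes on a+b (draft) v2")))  # False
```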

2. Character classes

Square brackets define a class — “match any one of these characters.”

  • [abc] — any one of a, b, or c
  • [a-z] — any lowercase letter (range)
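  • [^0-9] — any character that’s NOT a digit (negation)
  • \d — a digit; shorthand for [0-9] in ASCII mode (Unicode-aware engines such as Python 3 also accept digits from other scripts)
  • \w — “word character” — [A-Za-z0-9_], plus other letters and digits in Unicode-aware engines
  • \s — whitespace — spaces, tabs, newlines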
  • . — any character except newline (by default)

The uppercase variants \D, \W, and \S are negations of the lowercase ones.
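To make the class syntax concrete, here is a short Python sketch with made-up sample strings. The last two lines rely on a Python-specific detail: on str patterns, \w is Unicode-aware by default, and the re.ASCII flag narrows it to [A-Za-z0-9_]. Other engines may behave differently.

```python
import re

# Explicit set, range, and negation.
print(re.findall(r"[abc]", "cabbage"))       # ['c', 'a', 'b', 'b', 'a']
print(re.findall(r"[a-z]+", "Hello World"))  # ['ello', 'orld']
print(re.findall(r"[^0-9]+", "a1b22c"))      # ['a', 'b', 'c']

# Shorthands: \d for digits, \s for whitespace, \S for its negation.
print(re.findall(r"\d+", "room 12, floor 3"))  # ['12', '3']
print(re.findall(r"\S+", "tab\tand space"))    # ['tab', 'and', 'space']

# Python 3 treats \w as Unicode-aware on str patterns unless re.ASCII is set.
name = "Zo\u00eb"  # "Zoë", written with an escape to avoid encoding issues
print(re.findall(r"\w+", name))            # ['Zoë']
print(re.findall(r"\w+", name, re.ASCII))  # ['Zo']
```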

3. Quantifiers

Quantifiers specify how many times the preceding thing should match.

  • ? — zero or one (optional)
  • * — zero or more
  • + — one or more
  • {3} — exactly 3
  • {3,5} — between 3 and 5 (inclusive)
  • {3,} — 3 or more

Quantifiers default to greedy — they match as much as possible. Append a ? to make them lazy (as little as possible). On aaaaa, a+ matches all five; a+? matches one.
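The greedy/lazy difference is easiest to see side by side. This is a minimal Python sketch: the first pair reproduces the aaaaa case above, and the quoted-string sample is an invented illustration of why the choice matters in practice.

```python
import re

# Greedy takes everything it can; lazy takes the minimum.
print(re.match(r"a+", "aaaaa").group())   # aaaaa
print(re.match(r"a+?", "aaaaa").group())  # a

# With a delimiter on both sides, the two produce different matches.
text = 'say "hi" and "bye"'
print(re.findall(r'".*"', text))   # ['"hi" and "bye"']
print(re.findall(r'".*?"', text))  # ['"hi"', '"bye"']

# Counted quantifier: {3,5} accepts 3, 4, or 5 repetitions.
for candidate in ["12", "1234", "123456"]:
    print(candidate, bool(re.fullmatch(r"[0-9]{3,5}", candidate)))
# 12 False
# 1234 True
# 123456 False
```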

4. Groups and alternation

Parentheses group multiple things into one unit. (ab)+ matches ababab — the quantifier applies to the whole group. The pipe inside a group gives you alternation: (cat|dog) matches either cat or dog.

By default, groups also capture: they save the matched text into a numbered slot you can refer to later ($1, $2, etc. in JavaScript replacement strings; \1, \2 in many other engines). If you only want grouping without capture, use (?:...), a “non-capturing group.” Non-capturing groups skip that bookkeeping, which can be marginally faster, and they keep group numbering free of noise. Many experienced developers default to non-capturing unless they specifically need the capture.
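A short Python sketch of grouping, alternation, and capture follows; the sample strings are invented. Python’s replacement syntax uses \1 rather than $1, and the date reordering at the end is just one illustrative use of numbered captures.

```python
import re

# The quantifier applies to the whole group.
print(bool(re.fullmatch(r"(ab)+", "ababab")))  # True

# Alternation inside a capturing group: group(1) holds the branch that matched.
m = re.search(r"(cat|dog) food", "premium dog food")
print(m.group(1))  # dog

# A non-capturing group still groups, but takes no numbered slot,
# so the weight is group 1 rather than group 2.
m = re.search(r"(?:cat|dog) food (\d+)kg", "cat food 5kg")
print(m.groups())  # ('5',)

# Captures can be reused in a replacement string (\1, \2, \3 in Python).
print(re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2026-05-15"))  # 15/05/2026
```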

5. Anchors

Anchors don’t match characters — they match positions.

  • ^ — start of string (or start of line with the m flag)
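  • $ — end of string (or end of line with the m flag); some engines, including Python and PCRE, also let it match just before a final newline
  • \b — word boundary (between a word char and a non-word char, or at the edge of the string next to a word char)
  • \B — non-word-boundary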

Without anchors, cat matches the substring cat anywhere — inside category, scatter, or cat alone. With ^cat$, only the exact string cat matches. Forgetting anchors is the most common source of regex bugs in validation code.
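The cat example can be checked directly. This Python sketch uses an invented word list; the last two lines show a Python-specific wrinkle, where $ tolerates a trailing newline and re.fullmatch() is the stricter choice for validation.

```python
import re

words = ["cat", "category", "scatter", "tomcat"]

# No anchors: the pattern matches anywhere inside the string.
print([w for w in words if re.search(r"cat", w)])
# ['cat', 'category', 'scatter', 'tomcat']

# Anchored at the start only, then at both ends.
print([w for w in words if re.search(r"^cat", w)])   # ['cat', 'category']
print([w for w in words if re.search(r"^cat$", w)])  # ['cat']

# Word boundaries pick out whole words inside longer text.
print(re.findall(r"\bcat\b", "cat category scatter cat."))  # ['cat', 'cat']

# In Python, $ also matches just before a trailing newline;
# fullmatch() requires the entire string to match.
print(bool(re.search(r"^cat$", "cat\n")))   # True
print(bool(re.fullmatch(r"cat", "cat\n")))  # False
```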

Flags change everything

Flags are toggles that modify how the whole pattern behaves. The common ones, written after the trailing slash in regex literal syntax (/foo/g):

  • g — global: find every match, not just the first
  • i — case-insensitive
  • m — multiline: ^ and $ match line boundaries, not just string boundaries
  • s — dotall: . matches newlines too
  • u — unicode: enables Unicode interpretation of escapes like \u{...}

Flag handling differs by engine. In Python you pass them as constants (re.IGNORECASE) or put them inline at the start of the pattern ((?i)foo); Go’s regexp package accepts only the inline form; in JavaScript and Perl they go after the trailing slash.
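As a sketch of how the i, m, and s flags look in Python (the log-like text is invented): each flag is a constant passed as an argument, or an inline (?i)-style prefix. Python has no g flag; you choose between re.search() for the first match and re.findall() or re.finditer() for all of them.

```python
import re

text = "Error: disk full\nerror: retrying\nok"

# Case-insensitive, as a constant and as an inline flag.
print(re.findall(r"error", text, re.IGNORECASE))  # ['Error', 'error']
print(re.findall(r"(?i)error", text))             # ['Error', 'error']

# Multiline: ^ matches at the start of every line, not only the string.
print(re.findall(r"^\w+", text))                # ['Error']
print(re.findall(r"^\w+", text, re.MULTILINE))  # ['Error', 'error', 'ok']

# Dotall: . is allowed to cross the newline between the first two lines.
print(bool(re.search(r"full.error", text)))             # False
print(bool(re.search(r"full.error", text, re.DOTALL)))  # True
```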

How the engine actually runs

There are two families of regex engines:

NFA (backtracking) — JavaScript, PCRE, Java, .NET, Python re, Ruby Onigmo. The engine tries alternatives one at a time and backs up when one fails. Backtracking engines offer the richest feature set, including lookaround and backreferences, but carefully crafted input can push them into catastrophic (exponential-time) backtracking.

DFA (linear time) — Go regexp (RE2 syntax), Rust regex, Hyperscan. The engine runs the input through a state machine in a single pass, with no backtracking. Matching time is linear in the input length, which rules out ReDoS, but these engines don’t support lookaround or backreferences.

The practical takeaway: in user-facing services, prefer RE2-style engines if your language has one (Go, Rust, hyperscan via FFI). In application code where you control the inputs, NFA is fine and gives you more expressive power.
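To get a feel for catastrophic backtracking, here is a rough Python sketch. The nested pattern (a+)+$ is a textbook ReDoS shape, not something taken from a real codebase, and the input length of 20 is an arbitrary choice that keeps the run short. Timings vary by machine, so the code prints them rather than promising specific numbers.

```python
import re
import time


def time_match(pattern: str, text: str) -> float:
    """Return the seconds a single re.match() attempt takes."""
    start = time.perf_counter()
    re.match(pattern, text)
    return time.perf_counter() - start


# Almost-matching input: a run of "a" followed by one character that
# forces the overall match to fail.
text = "a" * 20 + "b"

# Nested quantifiers give the engine many ways to split the run of "a"s,
# and a backtracking engine tries them all before giving up.
nested = time_match(r"(a+)+$", text)
# The flat version describes the same strings without the ambiguity.
flat = time_match(r"a+$", text)

print(f"nested: {nested:.4f}s  flat: {flat:.6f}s")
```

Each extra "a" roughly doubles the work for the nested pattern, which is why the length is kept small here. A linear-time engine would not show this growth.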

The minimum testing discipline

Before you ship any non-trivial regex:

  1. Find at least three matching inputs — including one that’s unusual but legal.
  2. Find at least three non-matching inputs — including the “false friends” that almost match but shouldn’t.
  3. Run it against the AI Regex Toolkit or any tester to confirm.
  4. If it’s on a hot path or runs on untrusted input, audit for catastrophic backtracking — see ReDoS prevention.

Test cases should live in code — write a unit test that exercises both your matching and non-matching expectations. When the regex changes (and it will), the tests catch regressions.
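One way this could look in Python is sketched below. The date-shape pattern and both input lists are hypothetical stand-ins for whatever you are shipping. The functions use plain assert statements, so they run under pytest or when called directly.

```python
import re

# Hypothetical pattern under test: the *shape* of a YYYY-MM-DD date.
# re.ASCII keeps \d limited to 0-9.
DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}", re.ASCII)

# At least three matching inputs, including an unusual but legal one.
SHOULD_MATCH = ["2026-05-15", "1999-12-31", "0001-01-01"]

# At least three non-matching inputs, including near misses.
SHOULD_NOT_MATCH = [
    "2026-5-15",     # month not zero-padded
    "2026/05/15",    # wrong separator
    "x2026-05-15",   # leading junk
    "2026-05-15\n",  # trailing newline
]


def test_matching_inputs():
    for s in SHOULD_MATCH:
        assert DATE_SHAPE.fullmatch(s), f"expected a match: {s!r}"


def test_non_matching_inputs():
    for s in SHOULD_NOT_MATCH:
        assert DATE_SHAPE.fullmatch(s) is None, f"expected no match: {s!r}"
```

Using fullmatch() instead of search() means the tests also guard against the missing-anchor bug.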

Patterns you’ll write all the time

A few patterns come up over and over. The pattern library has the full set with examples, edge cases, and code in 13 languages — these are the most common:

What regex isn’t for

Three things regex is structurally bad at. Reach for a real parser:

  • Nested structure — HTML, XML, JSON, source code. The grammars are recursive; regex isn’t.
  • Semantic constraints — “is this email reachable”, “is this UUID actually random”, “is this date in the future”. Regex tells you about shape, not meaning.
  • Performance-sensitive scanning of huge text — for log scanning at GB/s, prefer purpose-built tools (ripgrep, hyperscan) over regex in your application.
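The shape-versus-meaning split can be sketched in a few lines of Python. The helper name parse_date is made up for this example: the regex checks the shape, then the standard library’s datetime.date decides whether the value is a real calendar date.

```python
import re
from datetime import date

# Shape only: four digits, two digits, two digits.
DATE_SHAPE = re.compile(r"(\d{4})-(\d{2})-(\d{2})", re.ASCII)


def parse_date(text):
    """Return a date if text is a real YYYY-MM-DD date, else None."""
    m = DATE_SHAPE.fullmatch(text)
    if m is None:
        return None  # wrong shape
    year, month, day = (int(part) for part in m.groups())
    try:
        return date(year, month, day)  # the meaning check happens here
    except ValueError:
        return None  # right shape, impossible date


print(parse_date("2026-05-15"))   # 2026-05-15
print(parse_date("2026-02-30"))   # None (February has no 30th)
print(parse_date("15 May 2026"))  # None (wrong shape)
```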

FAQ

What does the `g` flag actually do?

Without `g`, methods like `String.prototype.match()` return the first match (with capture groups). With `g`, the same method returns every match in an array, but capture groups are dropped. To get global matches AND capture groups in JavaScript, use `String.prototype.matchAll()` (with `g`) instead.

Why does `.` not match newlines by default?

The dot was originally designed for line-oriented text-processing tools (`grep`, `sed`) where input was processed one line at a time. To make `.` match newlines, enable the dotall flag (`s` in most engines).

What's the difference between greedy and lazy quantifiers?

Greedy quantifiers (`*`, `+`, `{n,m}`) match as much as possible while still letting the overall regex succeed. Lazy versions (`*?`, `+?`, `{n,m}?`) match as little as possible. The two can return different matches on the same input, so pick based on which match you actually want. Neither is inherently faster: both backtrack, and which one does less work depends on the pattern and the input.

Are regex engines the same across languages?

No. JavaScript, PCRE (PHP/nginx), Python `re`, Java, .NET, Go RE2, Rust, and POSIX ERE all have meaningful differences. Lookbehind, possessive quantifiers, atomic groups, recursion, Unicode property classes, and named-group syntax all vary. Test in the engine you'll deploy to.

When should I NOT use regex?

Parsing HTML, parsing JSON, parsing code, or anything with nested structure. Use a real parser. Famous answer: 'now you have two problems.' For shape validation, extraction from line-oriented text, and simple substitution, regex is the right tool.

Where to go next

Two follow-ups that level up directly from here:

And if you’d rather skip the writing and describe what you want to match in plain English, the AI generator handles the translation.