JavaScript Text Processing: Mastering Strings and Regular Expressions | Mehran Khanjan

Strings

String creation
String literals (single, double quotes)
Template literals
String length
String indexing
String immutability
String methods (charAt, charCodeAt, codePointAt)
concat
includes, startsWith, endsWith
indexOf, lastIndexOf
slice, substring, substr
split
toLowerCase, toUpperCase
trim, trimStart, trimEnd
repeat
padStart, padEnd
replace, replaceAll
match, matchAll
search
localeCompare
String.raw
Unicode and strings
Normalization

Regular Expressions

RegExp literals
RegExp constructor
Regex patterns
Flags (g, i, m, s, u, y, d)
Character classes
Quantifiers
Anchors
Groups and capturing
Lookahead and lookbehind
Backreferences
test method
exec method
match, matchAll
replace with regex
search with regex
split with regex
Named capture groups

Strings

String Creation

Strings can be created using literals (quotes) or the String constructor; primitives are preferred over String objects for performance and predictable comparisons.

const primitive = "hello";           // primitive string (preferred)
const fromConstructor = String(123); // converts to "123"
const objectString = new String("hello"); // String object (avoid)
typeof primitive;        // "string"
typeof objectString;     // "object"

String Literals (Single, Double Quotes)

Single and double quotes are functionally identical in JavaScript; pick one style and apply it consistently (many style guides and formatters default to single quotes).

const single = 'Hello World';
const double = "Hello World";
const nested = "She said 'Hi'";
const escaped = 'It\'s working';

Template Literals

Template literals use backticks and support embedded expressions (${}), multi-line strings, and tagged templates for custom processing.

const name = "Alice";
const age = 30;
const greeting = `Hello, ${name}! 
You are ${age} years old.
Next year: ${age + 1}`;

// Tagged template
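const highlight = (strings, ...values) => strings.reduce((acc, str, i) =>
    // Wrap only real interpolated values; the last string has no value after it
    acc + str + (i < values.length ? `<b>${values[i]}</b>` : ''), '');
highlight`Hi ${name}!`;   // "Hi <b>Alice</b>!"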

String Length

The length property returns the number of UTF-16 code units, which may not equal the number of visible characters for emojis or certain Unicode symbols.

"hello".length;     // 5
"café".length;      // 4
"👨‍👩‍👧".length;        // 8 (family emoji = multiple code units)
[..."👨‍👩‍👧"].length;   // 5 (spread yields code points, not grapheme clusters)
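
Neither length nor the spread operator counts user-perceived characters. One way to do that is Intl.Segmenter, sketched below; countGraphemes is a hypothetical helper name, and the example assumes a runtime that ships Intl.Segmenter (recent browsers and Node.js versions do).

```javascript
// Count user-perceived characters (grapheme clusters) with Intl.Segmenter.
// countGraphemes is a hypothetical helper name, not a built-in.
const countGraphemes = (text, locale = "en") => {
    const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
    return [...segmenter.segment(text)].length;
};

// The family emoji written with escapes: man + ZWJ + woman + ZWJ + girl
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

console.log(family.length);          // 8 (UTF-16 code units)
console.log([...family].length);     // 5 (code points)
console.log(countGraphemes(family)); // 1 (grapheme cluster)
```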

String Indexing

Access individual characters using bracket notation or charAt(); indices are zero-based, and out-of-bounds access returns an empty string with charAt() and undefined with bracket notation.

const str = "JavaScript";
str[0];        // "J"
str[4];        // "S"
str.charAt(0); // "J"
str[100];      // undefined
str.charAt(100); // "" (empty string)

// ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
// │ J │ a │ v │ a │ S │ c │ r │ i │ p │ t │
// ├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
// │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ 8 │ 9 │
// └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

String Immutability

Strings are immutable—once created, their contents cannot be changed; all string methods return new strings rather than modifying the original.

let str = "hello";
str[0] = "H";      // Ignored in sloppy mode; throws TypeError in strict mode
console.log(str);  // "hello" (unchanged)

str = str.toUpperCase(); // Creates NEW string, reassigns variable
console.log(str);  // "HELLO"
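
Because strings cannot be edited in place, "changing" one character means building a new string around it. A minimal sketch, where replaceAt is a hypothetical helper name:

```javascript
// Hypothetical helper: return a copy of text with the character at index replaced.
const replaceAt = (text, index, replacement) =>
    text.slice(0, index) + replacement + text.slice(index + 1);

const original = "hello";
const changed = replaceAt(original, 0, "H");

console.log(original); // "hello" (the original is untouched)
console.log(changed);  // "Hello"
```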

String Methods (charAt, charCodeAt, codePointAt)

charAt returns the character at an index, charCodeAt returns the UTF-16 code unit (0-65535) at that index, and codePointAt returns the full Unicode code point (it combines a surrogate pair when the index points at the high surrogate).

const str = "A😀Z";
str.charAt(0);       // "A"
str.charCodeAt(0);   // 65 (ASCII/Unicode for 'A')
str.codePointAt(1);  // 128512 (full emoji code point)
str.charCodeAt(1);   // 55357 (high surrogate only - incomplete!)

// Use codePointAt for emoji/Unicode beyond BMP

concat

concat() joins strings together; however, template literals or the + operator are generally preferred because they read more clearly.

const first = "Hello";
const second = "World";

first.concat(" ", second);           // "Hello World"
first.concat(", ", second, "!");     // "Hello, World!"
"".concat("a", "b", "c");            // "abc"

// Preferred alternatives:
`${first} ${second}`;                // "Hello World"
first + " " + second;                // "Hello World"

includes, startsWith, endsWith

These methods return booleans for substring presence checks; they accept an optional position parameter and are case-sensitive.

const str = "JavaScript is awesome";

str.includes("Script");      // true
str.includes("script");      // false (case-sensitive)
str.includes("is", 12);      // false (starts searching at index 12)

str.startsWith("Java");      // true
str.startsWith("Script", 4); // true (starts checking at index 4)

str.endsWith("awesome");     // true
str.endsWith("is", 13);      // true (treats string as 13 chars long)

indexOf, lastIndexOf

indexOf returns first occurrence index (-1 if not found), lastIndexOf searches from the end; both accept optional starting position.

const str = "banana";

str.indexOf("a");        // 1 (first 'a')
str.indexOf("a", 2);     // 3 (first 'a' from index 2)
str.indexOf("x");        // -1 (not found)

str.lastIndexOf("a");    // 5 (last 'a')
str.lastIndexOf("a", 4); // 3 (last 'a' before/at index 4)

// ┌───┬───┬───┬───┬───┬───┐
// │ b │ a │ n │ a │ n │ a │
// │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │
// └───┴───┴───┴───┴───┴───┘
//       ↑       ↑       ↑
//   indexOf  (2nd)  lastIndexOf

slice, substring, substr

slice(start, end) extracts a portion and accepts negative indices; substring(start, end) treats negative values as 0 and swaps its arguments if start > end; substr(start, length) is deprecated, so prefer slice.

const str = "JavaScript";

// slice(start, end) - end not included
str.slice(0, 4);     // "Java"
str.slice(4);        // "Script"
str.slice(-6);       // "Script" (negative = from end)
str.slice(-6, -1);   // "Scrip"

// substring(start, end) - no negative support, swaps if needed
str.substring(4, 0); // "Java" (swapped to 0,4)

// substr(start, length) - DEPRECATED
str.substr(4, 6);    // "Script"
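
The differences between slice and substring show up mainly at the edges. The sketch below compares them on the same input and shows one way to rewrite a substr(start, length) call with slice; the variable names are only illustrative.

```javascript
const word = "JavaScript";

// Negative arguments: slice counts from the end, substring clamps them to 0.
console.log(word.slice(-3));        // "ipt"
console.log(word.substring(-3));    // "JavaScript"

// start > end: slice returns an empty string, substring swaps the arguments.
console.log(word.slice(4, 0));      // ""
console.log(word.substring(4, 0));  // "Java"

// Replacing substr(start, length) with slice(start, start + length),
// valid for a non-negative start.
const start = 4;
const length = 6;
console.log(word.slice(start, start + length)); // "Script"
```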

split

split divides a string into an array by a delimiter (string or regex); optional limit parameter caps the number of elements returned.

"a,b,c".split(",");          // ["a", "b", "c"]
"hello".split("");           // ["h", "e", "l", "l", "o"]
"a,b,c".split(",", 2);       // ["a", "b"] (limit)
"a1b2c3".split(/\d/);        // ["a", "b", "c", ""] (regex)
"  a  b  ".split(/\s+/);     // ["", "a", "b", ""]

// Preserve delimiters with capturing group
"a1b2c".split(/(\d)/);       // ["a", "1", "b", "2", "c"]

toLowerCase, toUpperCase

These methods return new strings with all characters converted to respective case; for locale-aware conversion, use toLocaleLowerCase()/toLocaleUpperCase().

"Hello World".toLowerCase();  // "hello world"
"Hello World".toUpperCase();  // "HELLO WORLD"

// Locale-aware (Turkish example)
"I".toLocaleLowerCase('tr');  // "ı" (dotless i)
"i".toLocaleUpperCase('tr');  // "İ" (dotted I)

// Common use: case-insensitive comparison
str1.toLowerCase() === str2.toLowerCase();

trim, trimStart, trimEnd

These methods remove whitespace (spaces, tabs, newlines) from both ends, start only, or end only respectively.

const str = "   Hello World   \n";

str.trim();       // "Hello World"
str.trimStart();  // "Hello World   \n" (alias: trimLeft)
str.trimEnd();    // "   Hello World" (alias: trimRight)

// Before: "   Hello World   \n"
//         ^^^            ^^^^
//         trimStart      trimEnd
// After:  "Hello World"

repeat

repeat(count) returns a new string with the original repeated count times; throws RangeError for negative or infinite values.

"ab".repeat(3);    // "ababab"
"x".repeat(5);     // "xxxxx"
"hi".repeat(0);    // ""

// Practical uses
"-".repeat(20);           // "--------------------"
"  ".repeat(indentLevel); // Indentation

// Error cases
"x".repeat(-1);    // RangeError
"x".repeat(Infinity); // RangeError

padStart, padEnd

padStart(length, padString) and padEnd pad the current string to target length with the specified string (default space).

"5".padStart(3, "0");     // "005"
"42".padStart(5);         // "   42" (default: space)
"abc".padStart(2);        // "abc" (no change if longer)

"5".padEnd(3, "0");       // "500"
"hi".padEnd(5, ".");      // "hi..."

// Practical examples
const price = "9.99";
price.padStart(10);       // "      9.99" (align right)

String(7).padStart(2,"0"); // "07" (leading zeros)
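
As a small worked example of leading-zero padding, the sketch below formats a number of seconds as mm:ss. formatDuration is a hypothetical helper name; the last two lines show that a multi-character pad string repeats and is cut to fit.

```javascript
// Hypothetical helper: format whole seconds as mm:ss with leading zeros.
const formatDuration = (totalSeconds) => {
    const minutes = Math.floor(totalSeconds / 60);
    const seconds = totalSeconds % 60;
    return `${String(minutes).padStart(2, "0")}:${String(seconds).padStart(2, "0")}`;
};

console.log(formatDuration(75)); // "01:15"
console.log(formatDuration(5));  // "00:05"

// The pad string repeats as needed and is truncated to the target length.
console.log("abc".padEnd(8, "12345")); // "abc12345"
console.log("7".padStart(4, "ab"));    // "aba7"
```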

replace, replaceAll

replace substitutes first match (or all with regex+g flag), replaceAll replaces all occurrences; both support strings and callbacks.

"banana".replace("a", "o");     // "bonana" (first only)
"banana".replaceAll("a", "o");  // "bonono" (all)
"banana".replace(/a/g, "o");    // "bonono" (regex global)

// Callback function
"abc".replace(/./g, (char, index) => `${char}${index}`);
// "a0b1c2"

// Special replacement patterns
"John Smith".replace(/(\w+) (\w+)/, "$2, $1"); // "Smith, John"
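
Two details of replacement are easy to trip over: "$" sequences in a replacement string are interpreted even when the search value is a plain string, and replaceAll rejects a regex without the g flag. A short sketch of both:

```javascript
// "$&" in a replacement string means "the matched text", so it is not inserted literally.
console.log("price: X".replace("X", "$&5"));       // "price: X5"

// A replacer function's return value is used as-is, with no "$" processing.
console.log("price: X".replace("X", () => "$&5")); // "price: $&5"

// "$$" is the escape for a single literal dollar sign.
console.log("price: X".replace("X", "$$5"));       // "price: $5"

// replaceAll throws a TypeError when given a regex without the g flag.
let threw = false;
try {
    "banana".replaceAll(/a/, "o");
} catch (err) {
    threw = err instanceof TypeError;
}
console.log(threw); // true
```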

match, matchAll

match returns an array of matches (or null); matchAll returns an iterator of match results, each with its capture groups and index. matchAll throws a TypeError if the regex lacks the g flag.

const str = "test1test2test3";

// match without 'g' flag - includes groups
str.match(/test(\d)/);  // ["test1", "1", index: 0, ...]

// match with 'g' flag - all matches, no groups
str.match(/test\d/g);   // ["test1", "test2", "test3"]

// matchAll - iterator with full details (requires 'g')
[...str.matchAll(/test(\d)/g)];
// [
//   ["test1", "1", index: 0, ...],
//   ["test2", "2", index: 5, ...],
//   ["test3", "3", index: 10, ...]
// ]

search

search returns the index of the first regex match (-1 if not found); unlike indexOf, it matches by pattern, and a non-regex argument is converted to a regex with new RegExp().

const str = "Hello World 123";

str.search(/\d+/);      // 12 (index of "123")
str.search(/world/i);   // 6 (case-insensitive)
str.search(/xyz/);      // -1 (not found)

// Comparison with indexOf
str.indexOf("World");   // 6 (string only)
str.search(/World/);    // 6 (regex - more flexible)
str.search(/\s/);       // 5 (first whitespace - impossible with indexOf)

localeCompare

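localeCompare compares strings according to locale rules, returning a negative number, 0, or a positive number (the exact nonzero values are implementation-dependent, so test only the sign); essential for proper alphabetical sorting across languages.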

// Returns: negative (before), 0 (equal), positive (after)
"a".localeCompare("b");     // negative (typically -1)
"b".localeCompare("a");     // positive (typically 1)
"a".localeCompare("a");     // 0

// Locale-aware sorting
["ä", "z", "a"].sort((a, b) => a.localeCompare(b, 'de'));
// German: ["a", "ä", "z"]

// Options
"a".localeCompare("A", 'en', { sensitivity: 'base' }); // 0 (equal)
"2".localeCompare("10", 'en', { numeric: true });      // negative (proper number sort)
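
When sorting many strings, the same options can be set once on an Intl.Collator and its compare function passed to sort. A minimal sketch, with illustrative file names:

```javascript
// A collator is built once and reused; its options mirror those of localeCompare.
const collator = new Intl.Collator("en", { numeric: true, sensitivity: "base" });

const files = ["file10.txt", "file2.txt", "File1.txt"];
const sorted = [...files].sort(collator.compare);
console.log(sorted); // ["File1.txt", "file2.txt", "file10.txt"]

// Only the sign of a comparison result is meaningful.
console.log(Math.sign("a".localeCompare("b"))); // -1
```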

String.raw

String.raw is a tag function that returns raw string content with escape sequences unprocessed; useful for regex patterns and file paths.

String.raw`Hello\nWorld`;     // "Hello\\nWorld" (literal \n)
`Hello\nWorld`;               // "Hello
                              //  World" (newline)

// Useful for regex
const pattern = String.raw`\d+\.\d+`;  // "\\d+\\.\\d+"
new RegExp(pattern);

// Windows paths
String.raw`C:\Users\name`;    // "C:\\Users\\name"

Unicode and Strings

JavaScript strings are UTF-16 encoded; characters outside BMP (Basic Multilingual Plane) use surrogate pairs, requiring special handling for accurate length/iteration.

const emoji = "😀";
emoji.length;                 // 2 (surrogate pair)
[...emoji].length;            // 1 (counts code points)

// Iterate properly
for (const char of "A😀B") {
    console.log(char);        // "A", "😀", "B"
}

// Unicode escape sequences
"\u0041";                     // "A" (BMP)
"\u{1F600}";                  // "😀" (beyond BMP)
"\uD83D\uDE00";               // "😀" (surrogate pair)
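
Index-based methods such as slice work on code units, so they can split a surrogate pair. The sketch below shows the problem and one way around it using Array.from, which iterates by code point; the variable names are only illustrative.

```javascript
const text = "A\u{1F600}B"; // "A😀B"

// slice works on code units and can cut a surrogate pair in half.
const broken = text.slice(0, 2);
console.log(broken.length);        // 2
console.log(broken.charCodeAt(1)); // 55357 (a lone high surrogate)

// Array.from iterates by code point, so pairs stay intact.
const safe = Array.from(text).slice(0, 2).join("");
console.log(safe === "A\u{1F600}");            // true
console.log(text.codePointAt(1).toString(16)); // "1f600"
```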

Normalization

normalize() converts strings to a standard Unicode form (NFC, NFD, NFKC, NFKD) for consistent comparison of visually identical characters.

const e1 = "\u00E9";      // "é" as a single code point (U+00E9)
const e2 = "e\u0301";     // "é" as e + combining acute accent (U+0065 U+0301)

e1 === e2;                // false (different representations!)
e1.normalize() === e2.normalize(); // true (NFC default)

e1.length;                // 1
e2.length;                // 2
e2.normalize().length;    // 1

// Forms: NFC (composed), NFD (decomposed), NFKC, NFKD
"ﬁ".normalize("NFKC");    // "fi" (compatibility decomposition, then composition)
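
A common use of normalization is accent-insensitive matching. One possible approach is sketched below: decompose with NFD, strip combining marks with the Unicode property escape \p{M}, then lowercase. foldForSearch is a hypothetical helper name, and this folding is deliberately simple rather than linguistically complete.

```javascript
// Strip combining marks after canonical decomposition (NFD), then lowercase.
// foldForSearch is a hypothetical helper name.
const foldForSearch = (text) =>
    text.normalize("NFD").replace(/\p{M}/gu, "").toLowerCase();

const composed = "Caf\u00E9";    // é as one code point
const decomposed = "Cafe\u0301"; // e + combining acute accent

console.log(composed === decomposed);                                   // false
console.log(composed.normalize("NFC") === decomposed.normalize("NFC")); // true
console.log(foldForSearch(composed) === foldForSearch("cafe"));         // true
```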

Regular Expressions

RegExp Literals

Regex literals are enclosed in forward slashes with optional flags; they're compiled at script load time, making them efficient for static patterns.

const regex = /hello/i;        // literal syntax
const pattern = /\d{3}-\d{4}/; // phone pattern

// When to use literals vs constructor:
// ✓ Literal: static patterns known at write-time
// ✓ Constructor: dynamic patterns from variables

RegExp Constructor

The RegExp constructor creates regex from strings at runtime; requires double-escaping backslashes and enables dynamic pattern building.

const pattern = "\\d+";                    // Note: double backslash
const regex = new RegExp(pattern, "gi");   // /\d+/gi

// Dynamic patterns
const searchTerm = "hello";
const dynamic = new RegExp(searchTerm, "i");

// Escape user input!
const escapeRegex = (str) => str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
new RegExp(escapeRegex(userInput));
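
To see why escaping matters, the sketch below builds a regex from a search term that contains a metacharacter. countOccurrences is a hypothetical helper name; escapeRegex repeats the escaping idea shown above.

```javascript
// Prefix every regex metacharacter with a backslash.
const escapeRegex = (str) => str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Hypothetical helper: count case-insensitive occurrences of a literal term.
const countOccurrences = (text, term) => {
    const regex = new RegExp(escapeRegex(term), "gi");
    return [...text.matchAll(regex)].length;
};

console.log(countOccurrences("1+1=2, and 1+1 is not 11", "1+1")); // 2

// Unescaped, the + is a quantifier and the pattern matches "11".
console.log(new RegExp("1+1").test("11"));              // true
console.log(new RegExp(escapeRegex("1+1")).test("11")); // false
```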

Regex Patterns

Patterns define text matching rules using literal characters and metacharacters; mastering the building blocks enables complex text processing.

/hello/          // Literal match
/hel.o/          // . = any char except newline
/hel\wo/         // \w = word character [a-zA-Z0-9_]
/\d\d\d/         // \d = digit [0-9]
/\s+/            // \s = whitespace, + = one or more
/[aeiou]/        // Character set
/[^0-9]/         // Negated set (not digits)

// Common patterns
/^.+$/           // Entire non-empty line
/\b\w+\b/        // Whole word

Flags (g, i, m, s, u, y, d)

Flags modify regex behavior: g global, i case-insensitive, m multiline anchors, s dotAll, u unicode, y sticky, d indices.

/abc/g   // global: find all matches, not just first
/abc/i   // ignoreCase: case-insensitive
/^abc/m  // multiline: ^ and $ match line boundaries
/a.b/s   // dotAll: . matches newlines too
/\u{1F600}/u  // unicode: proper emoji/unicode handling
/abc/y   // sticky: match only at lastIndex position
/abc/d   // hasIndices: include match index info

// Combine flags
/pattern/gim
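
The less common flags are easier to understand by running them. The sketch below exercises y, d, and s on small inputs; it assumes a runtime that supports the d flag (hasIndices), which is newer than the others.

```javascript
// Sticky (y): the match must begin exactly at lastIndex.
const sticky = /\d+/y;
sticky.lastIndex = 0;
console.log(sticky.test("abc123")); // false (no digit at index 0)
sticky.lastIndex = 3;
console.log(sticky.test("abc123")); // true (digits start at index 3)

// hasIndices (d): the result gains an indices array of [start, end] pairs.
const result = /(?<num>\d+)/d.exec("abc123");
console.log(result.indices[0]);         // [3, 6]
console.log(result.indices.groups.num); // [3, 6]

// dotAll (s): . also matches line terminators.
console.log(/a.b/.test("a\nb"));  // false
console.log(/a.b/s.test("a\nb")); // true
```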

Character Classes

Character classes match sets of characters; predefined shortcuts (\d, \w, \s) and custom sets ([abc]) provide flexible matching.

/[abc]/       // a, b, or c
/[a-z]/       // lowercase letter
/[a-zA-Z0-9]/ // alphanumeric
/[^abc]/      // NOT a, b, or c

// Predefined classes
/\d/  // [0-9]
/\D/  // [^0-9]
/\w/  // [a-zA-Z0-9_]
/\W/  // [^a-zA-Z0-9_]
/\s/  // whitespace
/\S/  // non-whitespace
/./   // any char (except newline, unless 's' flag)

Quantifiers

Quantifiers specify how many times a pattern should match; they're greedy by default but can be made lazy with ?.

/a*/      // 0 or more
/a+/      // 1 or more
/a?/      // 0 or 1
/a{3}/    // exactly 3
/a{2,4}/  // 2 to 4
/a{2,}/   // 2 or more

// Greedy vs Lazy
"aaaaaa".match(/a+/);    // ["aaaaaa"] (greedy: max)
"aaaaaa".match(/a+?/);   // ["a"] (lazy: min)

/<.+>/.exec("<a><b>");   // ["<a><b>"] (greedy)
/<.+?>/.exec("<a><b>");  // ["<a>"] (lazy)

Anchors

Anchors match positions rather than characters; ^ and $ match string/line boundaries, \b matches word boundaries.

/^hello/     // starts with "hello"
/world$/     // ends with "world"
/^exact$/    // exactly "exact"

/\bword\b/   // whole word "word"
/\Bword/     // "word" NOT at word boundary

// Multiline mode
const text = "line1\nline2";
text.match(/^line/gm);  // ["line", "line"] (each line start)

// ┌─────────────────────────┐
// │ ^ start      end $      │
// │   ↓            ↓        │
// │   hello world           │
// │   ↑     ↑↑    ↑         │
// │   \b    \b\b  \b        │
// └─────────────────────────┘

Groups and Capturing

Parentheses create groups for capturing matched substrings and applying quantifiers; use (?:...) for non-capturing groups.

const match = /(\d{3})-(\d{4})/.exec("555-1234");
// match[0] = "555-1234" (full match)
// match[1] = "555"      (first group)
// match[2] = "1234"     (second group)

// Non-capturing group
/(?:ab)+/.exec("ababab");  // ["ababab"] (no group capture)

// Alternation within group
/(cat|dog)/.exec("I have a cat"); // ["cat", "cat"]

Lookahead and Lookbehind

Lookahead (?=, ?!) and lookbehind (?<=, ?<!) assert patterns without consuming characters; useful for complex conditional matching.

// Positive lookahead: followed by
/\d+(?=px)/.exec("100px");    // ["100"]

// Negative lookahead: NOT followed by
/\d+(?!px)/.exec("100em");    // ["100"]

// Positive lookbehind: preceded by
/(?<=\$)\d+/.exec("$100");    // ["100"]

// Negative lookbehind: NOT preceded by
/(?<!\$)\d+/.exec("€100");    // ["100"]

// Password validation example
/^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$/
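
Negative lookahead interacts with backtracking in a way that often surprises: the engine may give back characters until the assertion passes. The sketch below shows that effect and one way to guard against it, then runs the password pattern on two sample inputs; the variable names are only illustrative.

```javascript
// Backtracking lets the negative lookahead pass on a shorter run of digits.
console.log(/\d+(?!px)/.exec("100px")[0]);   // "10"

// Also rejecting a following digit prevents the partial match.
const notPixels = /\d+(?!\d|px)/;
console.log(notPixels.exec("100px"));        // null
console.log(notPixels.exec("100em")[0]);     // "100"

// Each lookahead in the password pattern checks one requirement from the start.
const strongEnough = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$/;
console.log(strongEnough.test("Passw0rdX")); // true
console.log(strongEnough.test("password1")); // false (no uppercase letter)
```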

Backreferences

Backreferences (\1, \2) match the same text as previously captured groups; useful for finding repeated patterns or paired elements.

// Match repeated words
/(\w+)\s+\1/.exec("the the");     // ["the the", "the"]

// Match quoted strings (same quote type)
/(["']).*?\1/.exec("'hello'");    // ["'hello'", "'"]
/(["']).*?\1/.exec('"world"');    // ['"world"', '"']

// HTML tag matching (simple)
/<(\w+)>.*?<\/\1>/.exec("<div>content</div>");
// ["<div>content</div>", "div"]

test Method

test() returns a boolean indicating whether the pattern matches; most efficient for simple yes/no validation checks.

const emailPattern = /^\S+@\S+\.\S+$/;

emailPattern.test("user@example.com");  // true
emailPattern.test("invalid-email");      // false

// Validation function
const isValidPhone = (phone) => /^\d{3}-\d{3}-\d{4}$/.test(phone);
isValidPhone("555-123-4567");  // true

// ⚠️ Caution with global flag - lastIndex changes!
const regex = /a/g;
regex.test("abab");  // true (lastIndex = 1)
regex.test("abab");  // true (lastIndex = 3)
regex.test("abab");  // false (lastIndex = 0)
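
There are two common ways around the lastIndex pitfall: drop the g flag for yes/no checks, or reset lastIndex before each call. Both are sketched below on the same inputs.

```javascript
const inputs = ["abab", "abab", "abab"];

// Shared global regex: lastIndex carries over between calls.
const shared = /a/g;
console.log(inputs.map((s) => shared.test(s))); // [true, true, false]

// Option 1: drop the g flag for pure yes/no checks.
const plain = /a/;
console.log(inputs.map((s) => plain.test(s)));  // [true, true, true]

// Option 2: reset lastIndex before each call when the regex must stay global.
const results = inputs.map((s) => {
    shared.lastIndex = 0;
    return shared.test(s);
});
console.log(results);                           // [true, true, true]
```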

exec Method

exec() returns detailed match info (array with groups, index, input) or null; with g flag, successive calls iterate through matches via lastIndex.

const regex = /(\w+)@(\w+)/;
const result = regex.exec("email: user@domain");
// result[0] = "user@domain" (full match)
// result[1] = "user"        (group 1)
// result[2] = "domain"      (group 2)
// result.index = 7
// result.input = "email: user@domain"

// Iterate all matches with global flag
const gRegex = /\d+/g;
let match;
while ((match = gRegex.exec("a1b2c3")) !== null) {
    console.log(match[0], match.index); // "1" 1, "2" 3, "3" 5
}

match, matchAll

String methods that use regex: match() returns matches array, matchAll() returns iterator with full details including groups for each match.

const str = "test1 test2 test3";

// match (already covered in Strings section)
str.match(/test(\d)/g);     // ["test1", "test2", "test3"] (no groups!)

// matchAll - gets groups for each match
const matches = [...str.matchAll(/test(\d)/g)];
// [
//   ["test1", "1", index: 0, ...],
//   ["test2", "2", index: 6, ...],
//   ["test3", "3", index: 12, ...]
// ]
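
Because each matchAll result carries its own groups, it works well for pulling structured data out of text in a single loop. A minimal sketch, using a made-up key=value string and named groups:

```javascript
const query = "name=Alice; age=30; city=Paris";

// Each matchAll result carries its own groups and index.
const pairs = {};
for (const m of query.matchAll(/(?<key>\w+)=(?<value>\w+)/g)) {
    pairs[m.groups.key] = m.groups.value;
}

console.log(pairs); // { name: "Alice", age: "30", city: "Paris" }
```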

replace with regex

replace() with regex enables powerful pattern-based substitution; supports special replacement patterns and callback functions for dynamic replacement.

// Global replacement
"banana".replace(/a/g, "o");  // "bonono"

// Special patterns
"John Smith".replace(/(\w+) (\w+)/, "$2, $1");    // "Smith, John"
"hello".replace(/./g, "$&!");                      // "h!e!l!l!o!"

// Callback function
"hello".replace(/[aeiou]/g, (match) => match.toUpperCase()); // "hEllO"

// Named groups in replacement
"2023-12-25".replace(
    /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/,
    "$<month>/$<day>/$<year>"
); // "12/25/2023"

search with regex

String's search() method returns the index of the first regex match; unlike indexOf, it matches by pattern, but it ignores the g flag and lastIndex and always searches from the start of the string.

const str = "Hello123World";

str.search(/\d+/);        // 5 (index of "123")
str.search(/[A-Z]/);      // 0 (first uppercase)
str.search(/world/i);     // 8 (case-insensitive)
str.search(/xyz/);        // -1 (not found)

// Comparison with indexOf
str.indexOf("123");       // 5 (identical result)
str.search(/\d{3}/);      // 5 (but regex is more powerful)

split with regex

split() with regex enables complex delimiter patterns; capturing groups in the regex include matched delimiters in the result array.

"a1b2c3d".split(/\d/);           // ["a", "b", "c", "d"]
"a1b2c3d".split(/\d+/);          // ["a", "b", "c", "d"]
"  hello   world  ".split(/\s+/); // ["", "hello", "world", ""]

// Keep delimiters with capturing group
"a1b2c3".split(/(\d)/);          // ["a", "1", "b", "2", "c", "3", ""]

// Complex splitting
"key:value;foo:bar".split(/[:;]/);  // ["key", "value", "foo", "bar"]

Named Capture Groups

Named groups (?<name>) provide readable access to captured content via groups property; makes complex patterns self-documenting and maintainable.

const dateRegex = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const match = dateRegex.exec("2023-12-25");

match.groups.year;   // "2023"
match.groups.month;  // "12"
match.groups.day;    // "25"

// Destructuring
const { groups: { year, month, day } } = dateRegex.exec("2023-12-25");

// In replace
"2023-12-25".replace(dateRegex, "$<month>/$<day>/$<year>");
// "12/25/2023"

// Backreference with name
/(?<word>\w+)\s+\k<word>/.test("the the"); // true