What Is a Lexer, and How to Implement One in PHP

Mohasin Hossain
2 min read · Jun 3, 2024

A lexer, also known as a lexical analyzer or tokenizer, is a component of a compiler or interpreter that processes input text to produce a sequence of tokens. Tokens are the meaningful units of text, such as keywords, identifiers, literals, and symbols, that the parser uses to understand the structure of the source code or data.
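
As a small illustration of the idea, independent of JSON, the sketch below splits a line of made-up source text into identifiers, integer literals, and single-character symbols using one regular expression. The function name tokenize and the sample input are assumptions chosen for this example, not part of any library.

```php
<?php

// Hypothetical helper: split source text into identifiers, integers, and symbols.
function tokenize(string $source): array {
    // At each position the alternatives are tried left to right:
    // an identifier, then an integer literal, then any single non-space character.
    // Whitespace matches none of them, so it is skipped.
    preg_match_all('/[A-Za-z_][A-Za-z0-9_]*|\d+|\S/', $source, $matches);
    return $matches[0];
}

print_r(tokenize('total = price * 2;'));
```

A real lexer usually also records what kind of token each piece is, but the core job is the same: turn a flat string into a list of meaningful units.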

Photo by Ben Griffiths on Unsplash

In the context of parsing JSON, a lexer would break down the JSON string into tokens such as:

{ and } for object delimiters
[ and ] for array delimiters
: for key-value separation
, for item separation
Strings enclosed in double quotes (")
Numbers, booleans (true, false), and null
These tokens are then fed into a parser, which interprets their sequence according to the grammar rules of JSON to construct a meaningful data structure (such as a dictionary or list in Python, or an associative array in PHP).
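
One way to make these categories explicit is to label each raw token with its kind before handing it to the parser. The sketch below uses a hypothetical helper, tokenKind; its name and the label strings are assumptions made for illustration.

```php
<?php

// Hypothetical helper: map a raw JSON token string to a descriptive label.
function tokenKind(string $token): string {
    $punctuation = [
        '{' => 'object-start', '}' => 'object-end',
        '[' => 'array-start',  ']' => 'array-end',
        ':' => 'colon',        ',' => 'comma',
    ];
    if (isset($punctuation[$token])) {
        return $punctuation[$token];
    }
    if ($token !== '' && $token[0] === '"') {
        return 'string';
    }
    if ($token === 'true' || $token === 'false') {
        return 'boolean';
    }
    if ($token === 'null') {
        return 'null';
    }
    // Optional minus sign, digits, optional fraction, optional exponent.
    if (preg_match('/^-?\d+(\.\d+)?([eE][-+]?\d+)?\z/', $token) === 1) {
        return 'number';
    }
    return 'unknown';
}

foreach (['{', '"key"', ':', '123', '}'] as $token) {
    echo $token, ' => ', tokenKind($token), PHP_EOL;
}
```

Carrying the kind alongside the text saves the parser from re-inspecting each token's characters.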

Lexing Example

For example, consider the following JSON:


{
"key": "value",
"number": 123,
"boolean": true,
"nullValue": null,
"object": {"nestedKey": "nestedValue"},
"array": [1, 2, 3]
}

A lexer would produce the following sequence of tokens:

{
"key"
:
"value"
,
"number"
:
123
,
"boolean"
:
true
,
"nullValue"
:
null
,
"object"
:
{
"nestedKey"
:
"nestedValue"
}
,
"array"
:
[
1
,
2
,
3
]
}

Lexer Implementation in PHP

Here is an implementation of a lexer in PHP:

<?php

/**
 * Split a JSON string into a flat array of token strings.
 * Returns null on an unterminated string or an unrecognized character.
 */
function lex($input) {
    $tokens = [];
    $length = strlen($input);
    $i = 0;
    while ($i < $length) {
        $char = $input[$i];
        if (in_array($char, ['{', '}', '[', ']', ':', ','], true)) {
            // Structural characters are single-character tokens.
            $tokens[] = $char;
            $i++;
        } elseif ($char === '"') {
            // Match up to the closing quote, stepping over escapes such as \".
            // The A modifier anchors the match at offset $i.
            if (!preg_match('/"(?:[^"\\\\]|\\\\.)*"/A', $input, $match, 0, $i)) {
                return null; // Handle error: unmatched quote
            }
            $tokens[] = $match[0];
            $i += strlen($match[0]);
        } elseif (ctype_space($char)) {
            $i++; // Skip whitespace between tokens
        } elseif (preg_match('/(?:true|false|null)\b/A', $input, $match, 0, $i)) {
            // \b rejects longer words that merely start with a literal, e.g. "nullable".
            $tokens[] = $match[0];
            $i += strlen($match[0]);
        } elseif (preg_match('/-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?/A', $input, $match, 0, $i)) {
            // The optional leading minus sign is part of the number token.
            $tokens[] = $match[0];
            $i += strlen($match[0]);
        } else {
            return null; // Handle error: unrecognized character
        }
    }
    return $tokens;
}

// Example usage:
$input = '{"key": "value", "number": 123, "boolean": true, "nullValue": null, "object": {"nestedKey": "nestedValue"}, "array": [1, 2, 3]}';
$tokens = lex($input);
print_r($tokens);
?>

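Conclusion

The lex function scans the input JSON string from left to right and returns an array of token strings, or null if it meets an unterminated string or a character that cannot start a token. It only checks that each piece is a recognizable token, not that the tokens appear in a valid order. That is the job of the parser, which takes the tokens and constructs the JSON structure.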
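
To illustrate how a parser might consume these tokens, here is a minimal recursive-descent sketch. The function name parseValue is hypothetical, the token array is written out by hand so the example runs on its own, and scalar tokens are converted with PHP's built-in json_decode. It is not a complete or validated JSON parser.

```php
<?php

// Hypothetical parser sketch: turn a flat token array into PHP values.
// $pos is passed by reference and always points at the next unread token.
function parseValue(array $tokens, int &$pos) {
    if (!isset($tokens[$pos])) {
        throw new RuntimeException('Unexpected end of input');
    }
    $token = $tokens[$pos];
    $pos++;

    if ($token === '{' || $token === '[') {
        $isObject = ($token === '{');
        $close = $isObject ? '}' : ']';
        $result = [];
        if (($tokens[$pos] ?? null) === $close) { // empty object or array
            $pos++;
            return $result;
        }
        do {
            if ($isObject) {
                // Objects hold "key" : value pairs; the key must be a string token.
                $key = parseValue($tokens, $pos);
                $colon = $tokens[$pos] ?? null;
                $pos++;
                if (!is_string($key) || $colon !== ':') {
                    throw new RuntimeException('Expected a string key followed by :');
                }
                $result[$key] = parseValue($tokens, $pos);
            } else {
                $result[] = parseValue($tokens, $pos);
            }
            $separator = $tokens[$pos] ?? null;
            $pos++;
        } while ($separator === ',');
        if ($separator !== $close) {
            throw new RuntimeException("Expected , or $close");
        }
        return $result;
    }

    if (in_array($token, ['}', ']', ':', ','], true)) {
        throw new RuntimeException("Unexpected token $token");
    }
    // Remaining tokens are scalars: strings, numbers, true, false, null.
    return json_decode($token);
}

// Tokens as a JSON lexer would emit them for {"key": "value", "array": [1, 2]}
$tokens = ['{', '"key"', ':', '"value"', ',', '"array"', ':', '[', '1', ',', '2', ']', '}'];
$pos = 0;
$data = parseValue($tokens, $pos);
print_r($data);
```

Objects become associative arrays and JSON arrays become list-style PHP arrays, matching the structures mentioned earlier. A fuller parser would also confirm that no tokens remain after the top-level value.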
