What Is a Lexer, with an Implementation in PHP

Mohasin Hossain
2 min read · Jun 3, 2024

A lexer, also known as a lexical analyzer or tokenizer, is a component of a compiler or interpreter that processes input text to produce a sequence of tokens. Tokens are the meaningful units of text, such as keywords, identifiers, literals, and symbols, that the parser uses to understand the structure of the source code or data.


In the context of parsing JSON, a lexer would break down the JSON string into tokens such as:

{ and } for object delimiters
[ and ] for array delimiters
: for key-value separation
, for item separation
Strings enclosed in double quotes (")
Numbers, booleans (true, false), and null

These tokens are then fed into a parser, which interprets their sequence according to the grammar rules of JSON to construct a meaningful data structure (such as a dictionary or list in Python, or an associative array in PHP).
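To make the token categories listed above concrete, here is a minimal sketch that labels a raw token with a type name. The function `tokenType()` and the type names (`OBJECT_START`, `STRING`, and so on) are assumptions made for illustration; they are not part of JSON or of any PHP library.

```php
<?php

// Hypothetical helper: map a raw JSON token (a string) to a descriptive type name.
// The type names are illustrative only.
function tokenType(string $token): string {
    $punctuation = [
        '{' => 'OBJECT_START',
        '}' => 'OBJECT_END',
        '[' => 'ARRAY_START',
        ']' => 'ARRAY_END',
        ':' => 'COLON',
        ',' => 'COMMA',
    ];
    if (isset($punctuation[$token])) {
        return $punctuation[$token];
    }
    if ($token !== '' && $token[0] === '"') {
        return 'STRING';
    }
    if ($token === 'true' || $token === 'false') {
        return 'BOOLEAN';
    }
    if ($token === 'null') {
        return 'NULL';
    }
    // Integer part, optional fraction, optional exponent.
    if (preg_match('/^-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?\z/', $token)) {
        return 'NUMBER';
    }
    return 'UNKNOWN';
}

// Print each sample token next to its type.
foreach (['{', '"key"', ':', '123', 'true', 'null', '}'] as $token) {
    echo str_pad($token, 6), ' => ', tokenType($token), PHP_EOL;
}
```

Attaching a type to each token is a common design choice because the parser can then branch on the type instead of re-inspecting the raw text.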

Lexing Example

For example, consider the following JSON:


{
    "key": "value",
    "number": 123,
    "boolean": true,
    "nullValue": null,
    "object": {"nestedKey": "nestedValue"},
    "array": [1, 2, 3]
}

A lexer would produce the following sequence of tokens:

{
"key"
:
"value"
,
"number"
:
123
,
"boolean"
:
true
,
"nullValue"
:
null
,
"object"
:
{
"nestedKey"
:
"nestedValue"
}
,
"array"
:
[
1
,
2
,
3
]
}
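
For comparison, it can help to see the structure that a full lexer-plus-parser pipeline eventually builds from these tokens. The sketch below uses PHP's built-in `json_decode()` rather than a hand-written parser, so it only illustrates the end result, not how the tokens are consumed.

```php
<?php

$json = '{"key": "value", "number": 123, "boolean": true, "nullValue": null, '
      . '"object": {"nestedKey": "nestedValue"}, "array": [1, 2, 3]}';

// Passing true as the second argument returns JSON objects as associative arrays.
$data = json_decode($json, true);

var_dump($data['key']);                 // the string value
var_dump($data['number']);              // an integer
var_dump($data['boolean']);             // a boolean
var_dump($data['nullValue']);           // null
var_dump($data['object']['nestedKey']); // a value from the nested object
var_dump($data['array']);               // a list of three integers
```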

Lexer Implementation in PHP

Here is an implementation of a lexer in PHP:

<?php

/**
 * Split a JSON string into a flat array of tokens.
 * Returns null on an unterminated string or an unrecognized character.
 */
function lex($input) {
    $tokens = [];
    $length = strlen($input);
    $i = 0;
    while ($i < $length) {
        $char = $input[$i];
        if (in_array($char, ['{', '}', '[', ']', ':', ','], true)) {
            // Structural characters are single-character tokens.
            $tokens[] = $char;
            $i++;
        } elseif ($char === '"') {
            // A string ends at the next unescaped quote, so \" inside it is skipped.
            // \G anchors the match at offset $i, avoiding a substr() copy per token.
            if (!preg_match('/\G"(?:[^"\\\\]|\\\\.)*"/s', $input, $match, 0, $i)) {
                return null; // Error: unterminated string
            }
            $tokens[] = $match[0];
            $i += strlen($match[0]);
        } elseif (ctype_space($char)) {
            // Whitespace between tokens carries no meaning.
            $i++;
        } elseif (preg_match('/\G(?:true|false|null)\b/', $input, $match, 0, $i)) {
            // Literals; \b rejects longer words such as "nullable".
            $tokens[] = $match[0];
            $i += strlen($match[0]);
        } elseif (preg_match('/\G-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?/', $input, $match, 0, $i)) {
            // Numbers: optional minus sign, integer part, optional fraction and exponent.
            $tokens[] = $match[0];
            $i += strlen($match[0]);
        } else {
            return null; // Error: unrecognized character
        }
    }
    return $tokens;
}

// Example usage:
$input = '{"key": "value", "number": 123, "boolean": true, "nullValue": null, "object": {"nestedKey": "nestedValue"}, "array": [1, 2, 3]}';
$tokens = lex($input);
print_r($tokens);
?>

Conclusion

The lex function scans the JSON string from left to right and returns an array of tokens, or null if it encounters an unterminated string or an unrecognized character. These tokens can then be passed to a parser, which builds the corresponding data structure.
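
As a rough illustration of that last step, here is a minimal recursive-descent parser sketch that turns a flat token list into PHP values. The function names `parseValue()`, `parseObject()`, and `parseArray()` are hypothetical, the tokens are assumed to be plain strings like the ones listed earlier, and string tokens are decoded with the built-in `json_decode()` as a shortcut for handling escape sequences.

```php
<?php

// Hypothetical parser sketch. $pos is passed by reference and advanced
// as tokens are consumed.
function parseValue(array $tokens, int &$pos) {
    if (!isset($tokens[$pos])) {
        throw new RuntimeException('Unexpected end of input');
    }
    $token = $tokens[$pos];
    $pos++;
    if ($token === '{') {
        return parseObject($tokens, $pos);
    }
    if ($token === '[') {
        return parseArray($tokens, $pos);
    }
    if ($token === 'true') {
        return true;
    }
    if ($token === 'false') {
        return false;
    }
    if ($token === 'null') {
        return null;
    }
    if ($token !== '' && $token[0] === '"') {
        // Shortcut: let json_decode resolve escapes inside the string token.
        return json_decode($token);
    }
    if (is_numeric($token)) {
        return $token + 0; // int or float, depending on the literal
    }
    throw new RuntimeException("Unexpected token: $token");
}

function parseObject(array $tokens, int &$pos): array {
    $object = [];
    if (($tokens[$pos] ?? null) === '}') { // empty object
        $pos++;
        return $object;
    }
    while (true) {
        $key = parseValue($tokens, $pos);
        $colon = $tokens[$pos] ?? null;
        $pos++;
        if (!is_string($key) || $colon !== ':') {
            throw new RuntimeException('Expected a string key followed by ":"');
        }
        $object[$key] = parseValue($tokens, $pos);
        $next = $tokens[$pos] ?? null;
        $pos++;
        if ($next === '}') {
            return $object;
        }
        if ($next !== ',') {
            throw new RuntimeException('Expected "," or "}" in object');
        }
    }
}

function parseArray(array $tokens, int &$pos): array {
    $items = [];
    if (($tokens[$pos] ?? null) === ']') { // empty array
        $pos++;
        return $items;
    }
    while (true) {
        $items[] = parseValue($tokens, $pos);
        $next = $tokens[$pos] ?? null;
        $pos++;
        if ($next === ']') {
            return $items;
        }
        if ($next !== ',') {
            throw new RuntimeException('Expected "," or "]" in array');
        }
    }
}

// Example usage with a hand-written token list:
$tokens = ['{', '"key"', ':', '"value"', ',', '"array"', ':', '[', '1', ',', '2', ',', '3', ']', '}'];
$pos = 0;
$result = parseValue($tokens, $pos);
print_r($result);
```

This sketch throws exceptions on malformed input and does not check for leftover tokens after the top-level value; a production parser would need stricter validation.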
