Introduction
In the previous article, we discussed how HTTP requests are sent and how responses are received. Additionally, we parsed the text context from the response body.
In this part, we will discuss how HTML is parsed and how CSS is computed.
Base on the diagram above, the blue part represents what we had done so far.
Refactoring file structure
To improve the management of our files, we will break down the code into JavaScript files based on the purpose of each.
We are going to take out the part that is responsible for parsing into a different file called parser.js
, the rest of the code would be our client.js
.
Adding in HTML Parser
- We received a
Response
object at the end of our last article, and we are going to use the body content from it - Pass the body content to a
parser
and run it'sparseHTML()
method to parse the content. - A real browser passes the context to the parser in a stream (meaning that data is passed chunk by chunk and parsing is also chunk by chunk).
- Instead, we will pass the entire content to the parser in our mini-browser. In this way, it can be better understood.
File:
client.js
// client.js
// 1. require parser.js
const parser = require('./parser.js');
// ...
// Copy all the code here,
// from want we have before
// ...
let response = await request.send();
// 2. Get the response body from the request and pass it to the HMTL parser.
// -> At the end we should have a DOM object returning.
let dom = parser.parseHTML(response.body);
File:
parser.js
/**
* Parser
* @filename parser.js
* @author TriDiamond
* @version v1.0.0
*/
module.exports.parseHTML = function (html) {
// Here we just print out the html content for now.
console.log(html);
};
Implementing HTML parser
We are going to use the Finite State Machine (FSM) to implement our HTML parser.
- Using the FSM to analyze the HTML context
- In the HTML standards, there are states rules we can follow
- In our mini-browser we are only going choose one part of it to implement
There is a very complete state design made in the HTML standards, these states are exactly what our FSM need to use. You can jump to the "Tokenization" part of the document to look at all the state.
Looking at this standard may cause you to feel confused and lost. That's perfectly normal. The browser engineers are the target audience of this standard.
Once we had implemented our own mini-browser, we should be able to read and understand these standards, and you will notice what we implement is very similar to what is stated in the standards.
HTML standards have 80 states, we are here to understand the browser working flow, so we will not implement all of them. Instead, we will implement only a portion of it, just enough for us to understand how it works!
Initializing the FSM
let's begin by initializing our parseHTML
FSM, we will start adding code from what we have above.
/**
* Parser
* @filename parser.js
* @author TriDiamond
* @version v1.0.0
*/
const EOF = Symbol('EOF'); // EOF: end of file
function data(char) {}
/**
* HTTP Parser
* @param {string} html HTML context
*/
module.exports.parseHTML = function (html) {
let state = data;
// Iterate the HTML text content by
// each character of the string
for (let char of html) {
// Pass each string to the state machie
state = state(char);
}
// When it reach the EOF string, means
// it's the end of the content.
state = state(EOF);
};
We used a little trick in the above example. There is always an end to every file. We need a string character to represent the end of the file in our FSM. Upon reaching this string character, we will know that we have reached the end of this HTML content.
In this case, the
EOF
character serves as a pointer to the end of a file, which is a meaningless character. TheEOF
character string is created by using aSymbol
.
Parsing HTML tags
HTML has three types of tags:
- Opening tag
- Closing tag
- Self-closing tag
We will ignore the attributes
for now, just focus on parsing the tag itself first.
File:
parser.js
/**
* Parser
* @filename parser.js
* @author TriDiamond
* @version v1.0.0
*/
const EOF = Symbol('EOF'); // EOF: end of file
// STATE: Start reading HTML content
// --------------------------------
// 1. If `<` is found - means start of a tag
// 2. If `EOF` is found - means end of HTML content
// 3. Other characters - continue searching
function data(char) {
if (char === '<') {
// Start of a tag
return tagOpen;
} else if (char === EOF) {
// End of HTML content
// Exit out of the FSM
return;
} else {
// Continue searching
return data;
}
}
// STATE: Start of a tag
// ----------------------------------
// 1. If `/` is found - means it's a self-closing tag
// 2. If a-Z is found - means it's the tag name
// 3. Other characters - continue searching
function tagOpen(char) {
if (char === '/') {
// self-closing tag
return endTagOpen;
} else if (char.match(/^[a-zA-Z]$/)) {
// tag name
return tagName(char);
} else {
// continue searching
return;
}
}
// STATE: End of a tag
// --------------------------------
// 1. If a-Z is found - means it's still tag name
// 2. If `>` is found - means syntax error
// 3. If `EOF` is found - means syntax error
function endTagOpen(char) {
if (char.match(/^[a-zA-Z]$/)) {
return tagName(char);
} else if (char === '>') {
// syntax error —— Tag is not closed
} else if (char === EOF) {
// syntax error —— End tag is invalid
}
}
// STATE: Tag name
// --------------------------------
// 1. If `\t`(Tab), `\n`(Space), `\f`(Stop) or space
// are found - means attributes property detected
// 2. If `/` is found - means self-closing tag
// 3. If a-Z character found - means still is tag name
// 4. If `>` is found - means start of end tag
// 5. Other characters - continue searching
// for tag name
function tagName(char) {
if (c.match(/^[\t\n\f ]$/)) {
return beforeAttributeName;
} else if (char === '/') {
return selfClosingStartTag;
} else if (c.match(/^[a-zA-Z]$/)) {
return tagName;
} else if (char === '>') {
return data;
} else {
return tagName;
}
}
// STATE: Tag attributes and properties
// --------------------------------
// 1. If `/` is found - means sel-closing tag
// 2. If a-Z is found - means attribute name
// 3. If `>` is found - means tag ending
// 4. If `=` is found - means attribute value start
// 5. Other cases - means attribute value
function beforeAttributeName(char) {
if (char === '/') {
return selfClosingStartTag;
} else if (char.match(/^[\t\n\f ]$/)) {
return beforeAttributeName;
} else if (char === '>') {
return data;
} else if (char === '=') {
return beforeAttributeName;
} else {
return beforeAttributeName;
}
}
// STATE: Self-closing tag
// --------------------------------
// 1. If `>` found - means self-closing tag ends
// 2. if `EOF` found - syntax error
// 3. Other cases are also syntax error
function selfClosingStartTag(char) {
if (char === '>') {
return data;
} else if (char === 'EOF') {
} else {
}
}
/**
* HTTP Parser
* @param {string} html HTML context
*/
module.exports.parseHTML = function (html) {
let state = data;
for (let char of html) {
state = state(char);
}
state = state(EOF);
};
This isn't done yet! Hang in there buddy!, this part we only written the state changing flow. All the tag information isn't being saved.
Next we will look at how to create Element Tokens
using the states we have now.
Creating Element Tokens
Right now in our FSM all we have is a switching process of each state. We need to save the information somewhere for us to use to create our DOM object later on.
In a DOM object all the HTML information are saved in Element Tokens
, so we will also use that structure, at each state we will create a respective Element Token and fill in the tag's information.
Let's first look at how are we going to tackle this:
- First we need to define a
currentToken
variable to store our current Token (This token is used to store the start and end tag information) - Then create an
emit()
method to receive the token (It will generate the DOM tree at the end.)
Implementation Logic of Each Method
Start reading HTML content -
data()
- If
EOF
is found
- set token with
{type: 'EOF'}
- If text content is found
- set token with
{type: 'text', content: char}
Start of a tag -
tagOpen()
- If text content is found - means start of a tag
- set token with
{type: 'startTag', tagName: > ''}
- In the
tagName()
state method, we will finish filling up thetagName
valueEnd of a tag -
endTagOpen()
- If text content is found - means found end tag name emit a token with
{type: 'endTag', tagName: ''}
- Same as before, the
tagName
value will be filled intagName()
state method.Tag name -
tagName()
- Here contain the core logic of HTML parsing
- We will fill out the
tagName
value in thecurrentTag
in this method- When we found
>
that means it's the end of the tag, now we can emit thecurrentToken
and let theemit()
method to hook it into our DOM tree.Tag attributes and properties -
beforeAttributeName()
- If
>
is found - means tag ended, we can emit thecurrentToken
Self-closing tag -
selfClosingStartTag
- If
>
string is found - means closing of the self-closing tag- Here we need to set an extra property to our
currentToken
, give it aisSelfClosing: true
- Then we can emit the
currentToken
Now let's look at how we implement all these logic into our code.
File:
parser.js
/**
* Parser
* @filename parser.js
* @author TriDiamond
* @version v1.0.0
*/
let currentToken = null;
/**
* Emitting HTML token
* @param {*} token
*/
function emit(token) {
console.log(token);
}
const EOF = Symbol('EOF'); // EOF: end of file
// STATE: Start reading HTML content
// --------------------------------
// 1. If `<` is found - means start of a tag
// 2. If `EOF` is found - means end of HTML content
// 3. Other characters - continue searching
function data(char) {
if (char === '<') {
// Start of a tag
return tagOpen;
} else if (char === EOF) {
// End of HTML content
// Emit token
emit({
type: 'EOF',
});
return;
} else {
// Text content
emit({
type: 'text',
content: char,
});
return data;
}
}
// STATE: Start of a tag
// ----------------------------------
// 1. If `/` is found - means it's a self-closing tag
// 2. If a-Z is found - means it's the tag name
// 3. Other characters - continue searching
function tagOpen(char) {
if (char === '/') {
// self-closing tag
return endTagOpen;
} else if (char.match(/^[a-zA-Z]$/)) {
// tag name
currentToken = {
type: 'startTag',
tagName: '',
};
return tagName(char);
} else {
// continue searching
return;
}
}
// STATE: End of a tag
// --------------------------------
// 1. If a-Z is found - means it's still tag name
// 2. If `>` is found - means syntax error
// 3. If `EOF` is found - means syntax error
function endTagOpen(char) {
if (char.match(/^[a-zA-Z]$/)) {
currentToken = {
type: 'endTag',
tagName: '',
};
return tagName(char);
} else if (char === '>') {
// syntax error —— Tag is not closed
} else if (char === EOF) {
// syntax error —— End tag is invalid
}
}
// STATE: Tag name
// --------------------------------
// 1. If `\t`(Tab), `\n`(Space), `\f`(Stop) or space
// are found - means attributes property detected
// 2. If `/` is found - means self-closing tag
// 3. If a-Z character found - means still is tag name
// 4. If `>` is found - means start of end tag
// 5. Other characters - continue searching
// for tag name
function tagName(char) {
if (char.match(/^[\t\n\f ]$/)) {
return beforeAttributeName;
} else if (char === '/') {
return selfClosingStartTag;
} else if (char.match(/^[a-zA-Z]$/)) {
currentToken.tagName += char;
return tagName;
} else if (char === '>') {
emit(currentToken);
return data;
} else {
return tagName;
}
}
// STATE: Tag attributes and properties
// --------------------------------
// 1. If `/` is found - means sel-closing tag
// 2. If a-Z is found - means attribute name
// 3. If `>` is found - means tag ending
// 4. If `=` is found - means attribute value start
// 5. Other cases - means attribute value
function beforeAttributeName(char) {
if (char === '/') {
return selfClosingStartTag;
} else if (char.match(/^[\t\n\f ]$/)) {
return beforeAttributeName;
} else if (char === '>') {
emit(currentToken);
return data;
} else if (char === '=') {
return beforeAttributeName;
} else {
return beforeAttributeName;
}
}
// STATE: Self-closing tag
// --------------------------------
// 1. If `>` found - means self-closing tag ends
// 2. if `EOF` found - syntax error
// 3. Other cases are also syntax error
function selfClosingStartTag(char) {
if (char === '>') {
currentToken.isSelfClosing = true;
emit(currentToken);
return data;
} else if (char === 'EOF') {
} else {
}
}
/**
* HTTP Parser
* @param {string} html HTML context
*/
module.exports.parseHTML = function (html) {
let state = data;
for (let char of html) {
state = state(char);
}
state = state(EOF);
};
I know you can make it through! Here we have generated the Tokens for each of our HTML tags.
However, in this section we ignored the parsing of the HTML attributes, next let's add in the part where we parse the attributes.
Parsing HTML Attributes
First let's analyze the syntax of HTML attributes, there are 3 types of syntax:
- Single quote -
''
- Double quote -
""
- No quote
The parsing process is similar to tags, at the end of the parsing we will add these attributes onto our Token
and emit it out.
Implementation Logic of Attributes
- First we need to define a
currentAttributes
variable to store the attributes names and values. - After parsing all the attributes of a HTML tag, we will add them all into the
currentToken
object.
beforeAttributeName
- If
space
,new line
orreturn
character is found, we can continue to search for attribute content- If
/
or>
is found, we go straight to tag end state- if
=
orEOF
is found, this is a HTML syntax error- Other situations, means start of attribute
- create
currentAtrribute
object with{name: '', value: ''}
- return
attributeName
state methodattributeName
- If
SPACE
,NEW LINE
,RETURN
,/
,>
orEOF
string is found, means this attribute is ending, returnafterAttributeName
state method.- If
=
string is found, means attribute name is done, attribute value is coming up next.- If
\u0000
is found, throwParse error
- Any other situations means still reading the attribute name value, continue moving to the next character.
beforeAttributeValue
- If
SPACE
,NEW LINE
,RETURN
,/
,>
orEOF
string is found, we can continue reading for attribute value. ReturnbeforeAttributeValue
state method.- If
"
is found, means it's a double quoted attribute value, go todoubleQuotedAttributeValue
state method next.- If
'
is found, means it's a single quoted attribute value, go tosingleQuotedAttributeValue
state method next.- All other situations means
no quote attribute value
, to tounquotedAttributeValue(char)
state method.doubleQuotedAttributeValue
- Just keep waiting for the
"
string, when this string is found we know it's end of reading for the property. Then we can add this attribute name and value pair into ourcurrentAttribute
token.- If
\u0000
orEOF
is found, this is a syntax error. ThrowParse Error
.- All other situation means continue reading for the attribute value, return
doubleQuotedAttributeValue
state method.singleQuotedAttributeValue
- Just keep waiting for the
'
string, when this string is found we know it's end of reading for the property. Then we can add this attribute name and value pair into ourcurrentAttribute
token.- If
\u0000
orEOF
is found, this is a syntax error. ThrowParse Error
.- All other situation means continue reading for the attribute value, return
singleQuotedAttributeValue
state method.afterQuotedAttributeValue
- If
SPACE
,NEW LINE
,RETURN
string is found, means there are still more attributes to be read. Therefore we need to returnbeforeAttributeName
state method.- If
/
string is found mean we had hit the self-closing tag ending state. ReturnselfClosingStartTag
state method.- If
>
string is found, means tag is ending. At this point we can add all our formatted attribute array intocurrentToken
and emit the token.- If
EOF
is found, means it's a HTML syntax error. This is amissing-whitespace-between-attributes
error.unquotedAttributeValue
- If
SPACE
,NEW LINE
,RETURN
string is found, means it's the end of this attribute, add the current attribute to thecurrentToken
. But there may have more attributes coming after this, so we need to go tobeforeAttributeName
state method next.- If
/
string is found mean we had hit the self-closing tag ending state. ReturnselfClosingStartTag
state method.- If
>
string is found, means tag is ending. At this point we can add all our formatted attribute array intocurrentToken
and emit the token.- All other situation means continue reading for the attribute value, return
unquotedAttributeValue
state method.afterAttributeName
- If
SPACE
,NEW LINE
,RETURN
string is found, means still haven't found the ending character of our attribute name, continue to look for attribute name characters. ReturnafterAttributeName
state method.- If
/
string is found mean we had hit the self-closing tag ending state. ReturnselfClosingStartTag
state method.- If
=
string is found, means attribute name is done, attribute value is coming up next.- If
>
string is found, means tag is ending. At this point we can add all our formatted attribute array intocurrentToken
and emit the token.- If
EOF
is found, means HTML syntax error, HTML ended unexpectedly.- All other situation means attribute name started again, add the current attribute name and value key into the
currentAttributes
and continue reading the next attribute.
That's all for the logic, now let's look at how are we going to implement these logic into our previous code.
File:
parser.js
/**
* Parser
* @filename parser.js
* @author TriDiamond
* @version v1.0.0
*/
let currentToken = null;
let currentAttribute = null;
/**
* Emitting HTML token
* @param {*} token
*/
function emit(token) {
console.log(token);
}
const EOF = Symbol('EOF'); // EOF: end of file
// STATE: Start reading HTML content
// --------------------------------
// 1. If `<` is found - means start of a tag
// 2. If `EOF` is found - means end of HTML content
// 3. Other characters - continue searching
function data(char) {
if (char === '<') {
// Tag starting
return tagOpen;
} else if (char === EOF) {
// Context ended
emit({
type: 'EOF',
});
return;
} else {
// Text
emit({
type: 'text',
content: char,
});
return data;
}
}
// STATE: Start of a tag
// ----------------------------------
// 1. If `/` is found - means it's a self-closing tag
// 2. If a-Z is found - means it's the tag name
// 3. Other characters - continue searching
function tagOpen(char) {
if (char === '/') {
// 自关闭标签
return endTagOpen;
} else if (char.match(/^[a-zA-Z]$/)) {
// 标签名
currentToken = {
type: 'startTag',
tagName: '',
};
return tagName(char);
} else {
return;
}
}
// STATE: End of a tag
// --------------------------------
// 1. If a-Z is found - means it's still tag name
// 2. If `>` is found - means syntax error
// 3. If `EOF` is found - means syntax error
function endTagOpen(char) {
if (char.match(/^[a-zA-Z]$/)) {
currentToken = {
type: 'endTag',
tagName: '',
};
return tagName(char);
} else if (char === '>') {
// // syntax error —— Tag is not closed
} else if (char === EOF) {
// syntax error —— End tag is invalid
}
}
// STATE: Tag name
// --------------------------------
// 1. If `\t`(Tab), `\n`(Space), `\f`(Stop) or space
// are found - means attributes property detected
// 2. If `/` is found - means self-closing tag
// 3. If a-Z character found - means still is tag name
// 4. If `>` is found - means start of end tag
// 5. Other characters - continue searching
// for tag name
function tagName(char) {
if (char.match(/^[\t\n\f ]$/)) {
return beforeAttributeName;
} else if (char === '/') {
return selfClosingStartTag;
} else if (char.match(/^[a-zA-Z]$/)) {
currentToken.tagName += char;
return tagName;
} else if (char === '>') {
emit(currentToken);
return data;
} else {
return tagName;
}
}
// STATE: Tag attributes and properties
// --------------------------------
// 1. If `/` is found - means sel-closing tag
// 2. If a-Z is found - means attribute name
// 3. If `>` is found - means tag ending
// 4. If `=` is found - means attribute value start
// 5. Other cases - means attribute value
function beforeAttributeName(char) {
if (char.match(/^[\t\n\f ]$/)) {
return beforeAttributeName;
} else if (char === '/' || char === '>') {
return afterAttributeName(char);
} else if (char === '=' || char === EOF) {
throw new Error('Parse error');
} else {
currentAttribute = {
name: '',
value: '',
};
return attributeName(char);
}
}
// STATE: Attribute Name
function attributeName(char) {
if (char.match(/^[\t\n\f ]$/) || char === '/' || char === '>' || char === EOF) {
return afterAttributeName(char);
} else if (char === '=') {
return beforeAttributeValue;
} else if (char === '\u0000') {
throw new Error('Parse error');
} else {
currentAttribute.name += char;
return attributeName;
}
}
// STATE: Before Attribute Value
function beforeAttributeValue(char) {
if (char.match(/^[\t\n\f ]$/) || char === '/' || char === '>' || char === EOF) {
return beforeAttributeValue;
} else if (char === '"') {
return doubleQuotedAttributeValue;
} else if (char === "'") {
return singleQuotedAttributeValue;
} else if (char === '>') {
// return data;
} else {
return unquotedAttributeValue(char);
}
}
// STATE: Doube Quoted Attribute Value
function doubleQuotedAttributeValue(char) {
if (char === '"') {
currentToken[currentAttribute.name] = currentAttribute.value;
return afterQuotedAttributeValue;
} else if (char === '\u0000') {
throw new Error('Parse error');
} else if (char === EOF) {
throw new Error('Parse error');
} else {
currentAttribute.value += char;
return doubleQuotedAttributeValue;
}
}
// STATE: Single QUoted Attribute Value
function singleQuotedAttributeValue(char) {
if (char === "'") {
currentToken[currentAttribute.name] = currentAttribute.value;
return afterQuotedAttributeValue;
} else if (char === '\u0000') {
throw new Error('Parse error');
} else if (char === EOF) {
throw new Error('Parse error');
} else {
currentAttribute.value += char;
return singleQuotedAttributeValue;
}
}
// STATE: After QUoted Attribute Value
function afterQuotedAttributeValue(char) {
if (char.match(/^[\t\n\f ]$/)) {
return beforeAttributeName;
} else if (char === '/') {
return selfClosingStartTag;
} else if (char === '>') {
currentToken[currentAttribute.name] = currentAttribute.value;
emit(currentToken);
return data;
} else if (char === EOF) {
throw new Error('Parse error: eof-in-tag');
} else {
throw new Error('Parse error: missing-whitespace-between-attributes');
}
}
// STATE: Unquoted Attribute Value
function unquotedAttributeValue(char) {
if (char.match(/^[\t\n\f ]$/)) {
currentToken[currentAttribute.name] = currentAttribute.value;
return beforeAttributeName;
} else if (char === '/') {
currentToken[currentAttribute.name] = currentAttribute.value;
return selfClosingStartTag;
} else if (char === '>') {
currentToken[currentAttribute.name] = currentAttribute.value;
emit(currentToken);
return data;
} else if (char === '\u0000') {
throw new Error('Parse error');
} else if (char === '"' || char === "'" || char === '<' || char === '=' || char === '`') {
throw new Error('Parse error');
} else if (char === EOF) {
throw new Error('Parse error');
} else {
currentAttribute.value += char;
return unquotedAttributeValue;
}
}
// STATE: After Attribute Name
function afterAttributeName(char) {
if (char.match(/^[\t\n\f ]$/)) {
return afterAttributeName;
} else if (char === '/') {
return selfClosingStartTag;
} else if (char === '=') {
return beforeAttributeValue;
} else if (char === '>') {
currentToken[currentAttribute.name] = currentAttribute.value;
emit(currentToken);
return data;
} else if (char === EOF) {
throw new Error('Parse error');
} else {
currentToken[currentAttribute.name] = currentAttribute.value;
currentAttribute = {
name: '',
value: '',
};
return attributeName(char);
}
}
// STATE: Self-closing tag
// --------------------------------
// 1. If `>` found - means self-closing tag ends
// 2. if `EOF` found - syntax error
// 3. Other cases are also syntax error
function selfClosingStartTag(char) {
if (char === '>') {
currentToken.isSelfClosing = true;
emit(currentToken);
return data;
} else if (char === 'EOF') {
} else {
}
}
/**
* HTTP Parser
* @param {string} html HTML context
*/
module.exports.parseHTML = function (html) {
let state = data;
for (let char of html) {
state = state(char);
}
state = state(EOF);
};
Up to this point, we are finally done with parsing the HTML tag and its attribute values. Isn't it easy? 👻
Where do we go from here?! Before we wrap up this part of the code, there is one more thing to do. All of these tokens must be used to create the DOM tree object.
Building DOM Tree with Tokens
Compare to the complicated JavaScript syntax parsing, HTML syntax parsing is relatively easier already. Up until this point, we have all the HTML tag and attribute information saved inside tokens, but with these tokens laying around won't be enough for our browser to use to render our web pages.
Side note: A real browser will find certain HTML syntax errors and auto-fix them. Since we are building a mini-browser, not a complete and fully featured browser, we will not consider these features and functions.
If we going to make a complete mini-browser, we will need to use these tokens and create a DOM tree object. So, the question is "How are we going to use all these tokens to create our DOM tree?""
Let's flatten out our logic:
- To create a DOM tree, the basic trick is by using a
Stack Data Structure
- When we bump into a starting tag, we create the Stack and push it into the stack, when we bump into the ending tag, we pop everything out of the stack, at the end we will have a full HTML tag information in the right order.
- Self-closing tags will push in the stack and out of the stack when it closes, because there are not contents in between the opening and closing tags (well there is no closing tag to between with right?)
Still confused? Let's look at this:
<div>
<p>JavaScript</p>
<span> is fun!</span>
</div>
In a Stack Data Structure, it will look like this:
. <- Here is top of the stack
├── <div>
├── <p>
├── JavaScript
├── </p>
├── <span>
├── is fun!
├── </span>
└── </div>
For a self-closing tag:
<img src="https://example.com/img.png" />
. <- Here is top of the stack
└── <img src="https://example.com/img.png" />
From the look of it, it will go in the stack and come right back out right?
After this basic understanding of how HTML DOM is stacked and formed, let's go look at how are we going to implement this in our code.
Let's start simply by ignoring Text Node
s inside our HTML tags first.
// Default root node `document`
// All HTML start with the `document` node
let stack = [{ type: 'document', children: [] }];
// Emitting HTML token
function emit(token) {
if (token.type === 'text') return;
// Record the previous element - Top of Stack
let top = stack[stack.length - 1];
// If it's starting tag
if (token.type == 'startTag') {
let element = {
type: 'element',
children: [],
attributes: [],
};
element.tagName = token.tagName;
for (let prop in token) {
if (prop !== 'type' && prop != 'tagName') {
element.attributes.push({
name: prop,
value: token[prop],
});
}
}
// Find the matching closing tag
top.children.push(element);
element.parent = top;
if (!token.isSelfClosing) stack.push(element);
currentTextNode = null;
} else if (token.type == 'endTag') {
if (top.tagName !== token.tagName) {
throw new Error('Parse error: Tag start end not matched');
} else {
stack.pop();
}
currentTextNode = null;
}
}
That's it, now you will build a DOM tree looking like this:
.
├── `<div>`
│ ├── `<p>`
│ ├── `</p>`
│ ├── `<span>`
│ └── `</span>`
└── `</div>`
However we are still missing the Text Element inside the HTML tags, that's what we are going to do next.
Adding Text Node to DOM Tree
This is the last section of the HTML parsing, we need to add Text Node into our DOM tree object. Here are two things we need to note:
- Processing a Text Node is the same as Self-Closing Tag.
- Multiple Text Nodes need to be combined.
For this part let's talk less and let our code speak the truth.
File:
emit()
method insideparser.js
let currentToken = null;
let currentAttribute = null;
let currentTextNode = null;
// Default root node `document`
// All HTML start with the `document` node
let stack = [{ type: 'document', children: [] }];
// Emitting HTML token
function emit(token) {
// Record the previous element - Top of Stack
let top = stack[stack.length - 1];
// If it's starting tag
if (token.type == 'startTag') {
let element = {
type: 'element',
children: [],
attributes: [],
};
element.tagName = token.tagName;
for (let prop in token) {
if (prop !== 'type' && prop != 'tagName') {
element.attributes.push({
name: prop,
value: token[prop],
});
}
}
// Find the matching closing tag
top.children.push(element);
element.parent = top;
if (!token.isSelfClosing) stack.push(element);
currentTextNode = null;
} else if (token.type == 'endTag') {
if (top.tagName !== token.tagName) {
throw new Error('Parse error: Tag start end not matched');
} else {
stack.pop();
}
currentTextNode = null;
} else if (token.type === 'text') {
// Text Node processing
if (currentTextNode === null) {
currentTextNode = {
type: 'text',
content: '',
};
top.children.push(currentTextNode);
}
currentTextNode.content += token.content;
}
}
That's it! You made it!
That's all the content for HTML parsing, next article we will talk about how CSS computes the style properties and creating the CSSOM.
Recommended Open Source Projects
Hexo Theme Aurora