HTML Parsing

HTML Parsing

Browser Working Principle

·

24 min read

Introduction

In the previous article, we discussed how HTTP requests are sent and how responses are received. Additionally, we parsed the text context from the response body.

In this part, we will discuss how HTML is parsed and how CSS is computed.

Browser Working Flow

Base on the diagram above, the blue part represents what we had done so far.


Refactoring file structure

To improve the management of our files, we will break down the code into JavaScript files based on the purpose of each.

We are going to take out the part that is responsible for parsing into a different file called parser.js, the rest of the code would be our client.js.

Adding in HTML Parser

  • We received a Response object at the end of our last article, and we are going to use the body content from it
  • Pass the body content to a parser and run it's parseHTML() method to parse the content.
  • A real browser passes the context to the parser in a stream (meaning that data is passed chunk by chunk and parsing is also chunk by chunk).
  • Instead, we will pass the entire content to the parser in our mini-browser. In this way, it can be better understood.

File: client.js

// client.js

 // 1. require parser.js
const parser = require('./parser.js');

// ...
// Copy all the code here, 
// from want we have before
// ...

let response = await request.send();

// 2. Get the response body from the request and pass it to the HMTL parser.
// -> At the end we should have a DOM object returning.
let dom = parser.parseHTML(response.body);

File: parser.js

/**
 * Parser
 * @filename parser.js
 * @author TriDiamond
 * @version v1.0.0
 */

module.exports.parseHTML = function (html) {
  // Here we just print out the html content for now.
  console.log(html); 
};

Implementing HTML parser

We are going to use the Finite State Machine (FSM) to implement our HTML parser.

  • Using the FSM to analyze the HTML context
  • In the HTML standards, there are states rules we can follow
  • In our mini-browser we are only going choose one part of it to implement

There is a very complete state design made in the HTML standards, these states are exactly what our FSM need to use. You can jump to the "Tokenization" part of the document to look at all the state.

Looking at this standard may cause you to feel confused and lost. That's perfectly normal. The browser engineers are the target audience of this standard.

Once we had implemented our own mini-browser, we should be able to read and understand these standards, and you will notice what we implement is very similar to what is stated in the standards.

HTML standards have 80 states, we are here to understand the browser working flow, so we will not implement all of them. Instead, we will implement only a portion of it, just enough for us to understand how it works!


Initializing the FSM

let's begin by initializing our parseHTML FSM, we will start adding code from what we have above.

/**
   * Parser
   * @filename parser.js
   * @author TriDiamond
   * @version v1.0.0
 */

const EOF = Symbol('EOF'); // EOF: end of file

function data(char) {}

/**
   * HTTP Parser
   * @param {string} html HTML context
 */
module.exports.parseHTML = function (html) {
  let state = data;

  // Iterate the HTML text content by 
  // each character of the string
  for (let char of html) {

    // Pass each string to the state machie
    state = state(char);

  }

  // When it reach the EOF string, means
  // it's the end of the content.
  state = state(EOF);
};

We used a little trick in the above example. There is always an end to every file. We need a string character to represent the end of the file in our FSM. Upon reaching this string character, we will know that we have reached the end of this HTML content.

In this case, the EOF character serves as a pointer to the end of a file, which is a meaningless character. The EOF character string is created by using a Symbol.


Parsing HTML tags

HTML has three types of tags:

  • Opening tag
  • Closing tag
  • Self-closing tag

We will ignore the attributes for now, just focus on parsing the tag itself first.

File: parser.js

/**
   * Parser
   * @filename parser.js
   * @author TriDiamond
   * @version v1.0.0
 */

const EOF = Symbol('EOF'); // EOF: end of file


// STATE: Start reading HTML content
// --------------------------------
// 1. If `<` is found - means start of a tag
// 2. If `EOF` is found - means end of HTML content
// 3. Other characters - continue searching
function data(char) {
  if (char === '<') {
    // Start of a tag
    return tagOpen;
  } else if (char === EOF) {
    // End of HTML content
    // Exit out of the FSM
    return;
  } else {
    // Continue searching
    return data;
  }
}

// STATE: Start of a tag
// ----------------------------------
// 1. If `/` is found - means it's a self-closing tag
// 2. If a-Z is found - means it's the tag name
// 3. Other characters - continue searching
function tagOpen(char) {
  if (char === '/') {
    // self-closing tag
    return endTagOpen;
  } else if (char.match(/^[a-zA-Z]$/)) {
    // tag name
    return tagName(char);
  } else {
    // continue searching
    return;
  }
}

// STATE: End of a tag
// --------------------------------
// 1. If a-Z is found - means it's still tag name
// 2. If `>` is found - means syntax error
// 3. If `EOF` is found - means syntax error
function endTagOpen(char) {
  if (char.match(/^[a-zA-Z]$/)) {
    return tagName(char);
  } else if (char === '>') {
    // syntax error —— Tag is not closed
  } else if (char === EOF) {
    // syntax error —— End tag is invalid
  }
}

// STATE: Tag name
// --------------------------------
// 1. If `\t`(Tab), `\n`(Space), `\f`(Stop) or space
//    are found - means attributes property detected
// 2. If `/` is found - means self-closing tag
// 3. If a-Z character found - means still is tag name
// 4. If `>` is found - means start of end tag
// 5. Other characters - continue searching 
//    for tag name
function tagName(char) {
  if (c.match(/^[\t\n\f ]$/)) {
    return beforeAttributeName;
  } else if (char === '/') {
    return selfClosingStartTag;
  } else if (c.match(/^[a-zA-Z]$/)) {
    return tagName;
  } else if (char === '>') {
    return data;
  } else {
    return tagName;
  }
}

// STATE: Tag attributes and properties
// --------------------------------
// 1. If `/` is found - means sel-closing tag
// 2. If a-Z is found - means attribute name
// 3. If `>` is found - means tag ending
// 4. If `=` is found - means attribute value start
// 5. Other cases - means attribute value
function beforeAttributeName(char) {
  if (char === '/') {
    return selfClosingStartTag;
  } else if (char.match(/^[\t\n\f ]$/)) {
    return beforeAttributeName;
  } else if (char === '>') {
    return data;
  } else if (char === '=') {
    return beforeAttributeName;
  } else {
    return beforeAttributeName;
  }
}

// STATE: Self-closing tag
// --------------------------------
// 1. If `>` found - means self-closing tag ends
// 2. if `EOF` found - syntax error
// 3. Other cases are also syntax error
function selfClosingStartTag(char) {
  if (char === '>') {
    return data;
  } else if (char === 'EOF') {
  } else {
  }
}

/**
* HTTP Parser
* @param {string} html HTML context
*/
module.exports.parseHTML = function (html) {
  let state = data;
  for (let char of html) {
    state = state(char);
  }
  state = state(EOF);
};

Phew

This isn't done yet! Hang in there buddy!, this part we only written the state changing flow. All the tag information isn't being saved.

Next we will look at how to create Element Tokens using the states we have now.


Creating Element Tokens

Right now in our FSM all we have is a switching process of each state. We need to save the information somewhere for us to use to create our DOM object later on.

In a DOM object all the HTML information are saved in Element Tokens, so we will also use that structure, at each state we will create a respective Element Token and fill in the tag's information.

Let's first look at how are we going to tackle this:

  • First we need to define a currentToken variable to store our current Token (This token is used to store the start and end tag information)
  • Then create an emit() method to receive the token (It will generate the DOM tree at the end.)

Implementation Logic of Each Method

Start reading HTML content - data()

  • If EOF is found
    • set token with {type: 'EOF'}
  • If text content is found
    • set token with {type: 'text', content: char}

Start of a tag - tagOpen()

  • If text content is found - means start of a tag
    • set token with {type: 'startTag', tagName: > ''}
  • In the tagName() state method, we will finish filling up the tagName value

End of a tag - endTagOpen()

  • If text content is found - means found end tag name emit a token with {type: 'endTag', tagName: ''}
  • Same as before, the tagName value will be filled in tagName() state method.

Tag name - tagName()

  • Here contain the core logic of HTML parsing
  • We will fill out the tagName value in the currentTag in this method
  • When we found > that means it's the end of the tag, now we can emit the currentToken and let the emit() method to hook it into our DOM tree.

Tag attributes and properties - beforeAttributeName()

  • If > is found - means tag ended, we can emit the currentToken

Self-closing tag - selfClosingStartTag

  • If > string is found - means closing of the self-closing tag
  • Here we need to set an extra property to our currentToken, give it a isSelfClosing: true
  • Then we can emit the currentToken

Now let's look at how we implement all these logic into our code.

File: parser.js

/**
   * Parser
   * @filename parser.js
   * @author TriDiamond
   * @version v1.0.0
 */

let currentToken = null;

/**
 * Emitting HTML token
 * @param {*} token
 */
function emit(token) {
  console.log(token);
}

const EOF = Symbol('EOF'); // EOF: end of file

// STATE: Start reading HTML content
// --------------------------------
// 1. If `<` is found - means start of a tag
// 2. If `EOF` is found - means end of HTML content
// 3. Other characters - continue searching
function data(char) {
  if (char === '<') {
    // Start of a tag
    return tagOpen;
  } else if (char === EOF) {
    // End of HTML content
    // Emit token
    emit({
      type: 'EOF',
    });
    return;
  } else {
    // Text content
    emit({
      type: 'text',
      content: char,
    });
    return data;
  }
}

// STATE: Start of a tag
// ----------------------------------
// 1. If `/` is found - means it's a self-closing tag
// 2. If a-Z is found - means it's the tag name
// 3. Other characters - continue searching
function tagOpen(char) {
  if (char === '/') {
    // self-closing tag
    return endTagOpen;
  } else if (char.match(/^[a-zA-Z]$/)) {
    // tag name
    currentToken = {
      type: 'startTag',
      tagName: '',
    };
    return tagName(char);
  } else {
    // continue searching
    return;
  }
}

// STATE: End of a tag
// --------------------------------
// 1. If a-Z is found - means it's still tag name
// 2. If `>` is found - means syntax error
// 3. If `EOF` is found - means syntax error
function endTagOpen(char) {
  if (char.match(/^[a-zA-Z]$/)) {
    currentToken = {
      type: 'endTag',
      tagName: '',
    };
    return tagName(char);
  } else if (char === '>') {
    // syntax error —— Tag is not closed
  } else if (char === EOF) {
    // syntax error —— End tag is invalid
  }
}

// STATE: Tag name
// --------------------------------
// 1. If `\t`(Tab), `\n`(Space), `\f`(Stop) or space
//    are found - means attributes property detected
// 2. If `/` is found - means self-closing tag
// 3. If a-Z character found - means still is tag name
// 4. If `>` is found - means start of end tag
// 5. Other characters - continue searching 
//    for tag name
function tagName(char) {
  if (char.match(/^[\t\n\f ]$/)) {
    return beforeAttributeName;
  } else if (char === '/') {
    return selfClosingStartTag;
  } else if (char.match(/^[a-zA-Z]$/)) {
    currentToken.tagName += char;
    return tagName;
  } else if (char === '>') {
    emit(currentToken);
    return data;
  } else {
    return tagName;
  }
}

// STATE: Tag attributes and properties
// --------------------------------
// 1. If `/` is found - means sel-closing tag
// 2. If a-Z is found - means attribute name
// 3. If `>` is found - means tag ending
// 4. If `=` is found - means attribute value start
// 5. Other cases - means attribute value
function beforeAttributeName(char) {
  if (char === '/') {
    return selfClosingStartTag;
  } else if (char.match(/^[\t\n\f ]$/)) {
    return beforeAttributeName;
  } else if (char === '>') {
    emit(currentToken);
    return data;
  } else if (char === '=') {
    return beforeAttributeName;
  } else {
    return beforeAttributeName;
  }
}

// STATE: Self-closing tag
// --------------------------------
// 1. If `>` found - means self-closing tag ends
// 2. if `EOF` found - syntax error
// 3. Other cases are also syntax error
function selfClosingStartTag(char) {
  if (char === '>') {
    currentToken.isSelfClosing = true;
    emit(currentToken);
    return data;
  } else if (char === 'EOF') {
  } else {
  }
}

/**
* HTTP Parser
* @param {string} html HTML context
*/
module.exports.parseHTML = function (html) {
  let state = data;
  for (let char of html) {
    state = state(char);
  }
  state = state(EOF);
};

I know you can make it through! Here we have generated the Tokens for each of our HTML tags.

However, in this section we ignored the parsing of the HTML attributes, next let's add in the part where we parse the attributes.


Parsing HTML Attributes

First let's analyze the syntax of HTML attributes, there are 3 types of syntax:

  • Single quote - ''
  • Double quote - ""
  • No quote

The parsing process is similar to tags, at the end of the parsing we will add these attributes onto our Token and emit it out.

Implementation Logic of Attributes

  • First we need to define a currentAttributes variable to store the attributes names and values.
  • After parsing all the attributes of a HTML tag, we will add them all into the currentToken object.

beforeAttributeName

  • If space, new line or return character is found, we can continue to search for attribute content
  • If / or > is found, we go straight to tag end state
  • if = or EOF is found, this is a HTML syntax error
  • Other situations, means start of attribute
    • create currentAtrribute object with {name: '', value: ''}
    • return attributeName state method

attributeName

  • If SPACE, NEW LINE, RETURN, /, > or EOF string is found, means this attribute is ending, return afterAttributeName state method.
  • If = string is found, means attribute name is done, attribute value is coming up next.
  • If \u0000 is found, throw Parse error
  • Any other situations means still reading the attribute name value, continue moving to the next character.

beforeAttributeValue

  • If SPACE, NEW LINE, RETURN, /, > or EOF string is found, we can continue reading for attribute value. Return beforeAttributeValue state method.
  • If " is found, means it's a double quoted attribute value, go to doubleQuotedAttributeValue state method next.
  • If ' is found, means it's a single quoted attribute value, go to singleQuotedAttributeValue state method next.
  • All other situations means no quote attribute value, to to unquotedAttributeValue(char) state method.

doubleQuotedAttributeValue

  • Just keep waiting for the " string, when this string is found we know it's end of reading for the property. Then we can add this attribute name and value pair into our currentAttribute token.
  • If \u0000 or EOF is found, this is a syntax error. Throw Parse Error.
  • All other situation means continue reading for the attribute value, return doubleQuotedAttributeValue state method.

singleQuotedAttributeValue

  • Just keep waiting for the ' string, when this string is found we know it's end of reading for the property. Then we can add this attribute name and value pair into our currentAttribute token.
  • If \u0000 or EOF is found, this is a syntax error. Throw Parse Error.
  • All other situation means continue reading for the attribute value, return singleQuotedAttributeValue state method.

afterQuotedAttributeValue

  • If SPACE, NEW LINE, RETURN string is found, means there are still more attributes to be read. Therefore we need to return beforeAttributeName state method.
  • If / string is found mean we had hit the self-closing tag ending state. Return selfClosingStartTag state method.
  • If > string is found, means tag is ending. At this point we can add all our formatted attribute array into currentToken and emit the token.
  • If EOF is found, means it's a HTML syntax error. This is a missing-whitespace-between-attributes error.

unquotedAttributeValue

  • If SPACE, NEW LINE, RETURN string is found, means it's the end of this attribute, add the current attribute to the currentToken. But there may have more attributes coming after this, so we need to go to beforeAttributeName state method next.
  • If / string is found mean we had hit the self-closing tag ending state. Return selfClosingStartTag state method.
  • If > string is found, means tag is ending. At this point we can add all our formatted attribute array into currentToken and emit the token.
  • All other situation means continue reading for the attribute value, return unquotedAttributeValue state method.

afterAttributeName

  • If SPACE, NEW LINE, RETURN string is found, means still haven't found the ending character of our attribute name, continue to look for attribute name characters. Return afterAttributeName state method.
  • If / string is found mean we had hit the self-closing tag ending state. Return selfClosingStartTag state method.
  • If = string is found, means attribute name is done, attribute value is coming up next.
  • If > string is found, means tag is ending. At this point we can add all our formatted attribute array into currentToken and emit the token.
  • If EOF is found, means HTML syntax error, HTML ended unexpectedly.
  • All other situation means attribute name started again, add the current attribute name and value key into the currentAttributes and continue reading the next attribute.

That's all for the logic, now let's look at how are we going to implement these logic into our previous code.

File: parser.js

/**
   * Parser
   * @filename parser.js
   * @author TriDiamond
   * @version v1.0.0
 */

let currentToken = null;
let currentAttribute = null;

/**
 * Emitting HTML token
 * @param {*} token
 */
function emit(token) {
  console.log(token);
}

const EOF = Symbol('EOF'); // EOF: end of file

// STATE: Start reading HTML content
// --------------------------------
// 1. If `<` is found - means start of a tag
// 2. If `EOF` is found - means end of HTML content
// 3. Other characters - continue searching
function data(char) {
  if (char === '<') {
    // Tag starting
    return tagOpen;
  } else if (char === EOF) {
    // Context ended
    emit({
      type: 'EOF',
    });
    return;
  } else {
    // Text
    emit({
      type: 'text',
      content: char,
    });
    return data;
  }
}

// STATE: Start of a tag
// ----------------------------------
// 1. If `/` is found - means it's a self-closing tag
// 2. If a-Z is found - means it's the tag name
// 3. Other characters - continue searching
function tagOpen(char) {
  if (char === '/') {
    // 自关闭标签
    return endTagOpen;
  } else if (char.match(/^[a-zA-Z]$/)) {
    // 标签名
    currentToken = {
      type: 'startTag',
      tagName: '',
    };
    return tagName(char);
  } else {
    return;
  }
}

// STATE: End of a tag
// --------------------------------
// 1. If a-Z is found - means it's still tag name
// 2. If `>` is found - means syntax error
// 3. If `EOF` is found - means syntax error
function endTagOpen(char) {
  if (char.match(/^[a-zA-Z]$/)) {
    currentToken = {
      type: 'endTag',
      tagName: '',
    };
    return tagName(char);
  } else if (char === '>') {
    // // syntax error —— Tag is not closed
  } else if (char === EOF) {
    // syntax error —— End tag is invalid
  }
}

// STATE: Tag name
// --------------------------------
// 1. If `\t`(Tab), `\n`(Space), `\f`(Stop) or space
//    are found - means attributes property detected
// 2. If `/` is found - means self-closing tag
// 3. If a-Z character found - means still is tag name
// 4. If `>` is found - means start of end tag
// 5. Other characters - continue searching 
//    for tag name
function tagName(char) {
  if (char.match(/^[\t\n\f ]$/)) {
    return beforeAttributeName;
  } else if (char === '/') {
    return selfClosingStartTag;
  } else if (char.match(/^[a-zA-Z]$/)) {
    currentToken.tagName += char;
    return tagName;
  } else if (char === '>') {
    emit(currentToken);
    return data;
  } else {
    return tagName;
  }
}

// STATE: Tag attributes and properties
// --------------------------------
// 1. If `/` is found - means sel-closing tag
// 2. If a-Z is found - means attribute name
// 3. If `>` is found - means tag ending
// 4. If `=` is found - means attribute value start
// 5. Other cases - means attribute value
function beforeAttributeName(char) {
  if (char.match(/^[\t\n\f ]$/)) {
    return beforeAttributeName;
  } else if (char === '/' || char === '>') {
    return afterAttributeName(char);
  } else if (char === '=' || char === EOF) {
    throw new Error('Parse error');
  } else {
    currentAttribute = {
      name: '',
      value: '',
    };
    return attributeName(char);
  }
}

// STATE: Attribute Name
function attributeName(char) {
  if (char.match(/^[\t\n\f ]$/) || char === '/' || char === '>' || char === EOF) {
    return afterAttributeName(char);
  } else if (char === '=') {
    return beforeAttributeValue;
  } else if (char === '\u0000') {
    throw new Error('Parse error');
  } else {
    currentAttribute.name += char;
    return attributeName;
  }
}

// STATE: Before Attribute Value
function beforeAttributeValue(char) {
  if (char.match(/^[\t\n\f ]$/) || char === '/' || char === '>' || char === EOF) {
    return beforeAttributeValue;
  } else if (char === '"') {
    return doubleQuotedAttributeValue;
  } else if (char === "'") {
    return singleQuotedAttributeValue;
  } else if (char === '>') {
    // return data;
  } else {
    return unquotedAttributeValue(char);
  }
}

// STATE: Doube Quoted Attribute Value
function doubleQuotedAttributeValue(char) {
  if (char === '"') {
    currentToken[currentAttribute.name] = currentAttribute.value;
    return afterQuotedAttributeValue;
  } else if (char === '\u0000') {
    throw new Error('Parse error');
  } else if (char === EOF) {
    throw new Error('Parse error');
  } else {
    currentAttribute.value += char;
    return doubleQuotedAttributeValue;
  }
}

// STATE: Single QUoted Attribute Value
function singleQuotedAttributeValue(char) {
  if (char === "'") {
    currentToken[currentAttribute.name] = currentAttribute.value;
    return afterQuotedAttributeValue;
  } else if (char === '\u0000') {
    throw new Error('Parse error');
  } else if (char === EOF) {
    throw new Error('Parse error');
  } else {
    currentAttribute.value += char;
    return singleQuotedAttributeValue;
  }
}

// STATE: After QUoted Attribute Value
function afterQuotedAttributeValue(char) {
  if (char.match(/^[\t\n\f ]$/)) {
    return beforeAttributeName;
  } else if (char === '/') {
    return selfClosingStartTag;
  } else if (char === '>') {
    currentToken[currentAttribute.name] = currentAttribute.value;
    emit(currentToken);
    return data;
  } else if (char === EOF) {
    throw new Error('Parse error: eof-in-tag');
  } else {
    throw new Error('Parse error: missing-whitespace-between-attributes');
  }
}

// STATE: Unquoted Attribute Value
function unquotedAttributeValue(char) {
  if (char.match(/^[\t\n\f ]$/)) {
    currentToken[currentAttribute.name] = currentAttribute.value;
    return beforeAttributeName;
  } else if (char === '/') {
    currentToken[currentAttribute.name] = currentAttribute.value;
    return selfClosingStartTag;
  } else if (char === '>') {
    currentToken[currentAttribute.name] = currentAttribute.value;
    emit(currentToken);
    return data;
  } else if (char === '\u0000') {
    throw new Error('Parse error');
  } else if (char === '"' || char === "'" || char === '<' || char === '=' || char === '`') {
    throw new Error('Parse error');
  } else if (char === EOF) {
    throw new Error('Parse error');
  } else {
    currentAttribute.value += char;
    return unquotedAttributeValue;
  }
}

// STATE: After Attribute Name
function afterAttributeName(char) {
  if (char.match(/^[\t\n\f ]$/)) {
    return afterAttributeName;
  } else if (char === '/') {
    return selfClosingStartTag;
  } else if (char === '=') {
    return beforeAttributeValue;
  } else if (char === '>') {
    currentToken[currentAttribute.name] = currentAttribute.value;
    emit(currentToken);
    return data;
  } else if (char === EOF) {
    throw new Error('Parse error');
  } else {
    currentToken[currentAttribute.name] = currentAttribute.value;
    currentAttribute = {
      name: '',
      value: '',
    };
    return attributeName(char);
  }
}

// STATE: Self-closing tag
// --------------------------------
// 1. If `>` found - means self-closing tag ends
// 2. if `EOF` found - syntax error
// 3. Other cases are also syntax error
function selfClosingStartTag(char) {
  if (char === '>') {
    currentToken.isSelfClosing = true;
    emit(currentToken);
    return data;
  } else if (char === 'EOF') {
  } else {
  }
}

/**
* HTTP Parser
* @param {string} html HTML context
*/
module.exports.parseHTML = function (html) {
  let state = data;
  for (let char of html) {
    state = state(char);
  }
  state = state(EOF);
};

Make it

Up to this point, we are finally done with parsing the HTML tag and its attribute values. Isn't it easy? 👻

Where do we go from here?! Before we wrap up this part of the code, there is one more thing to do. All of these tokens must be used to create the DOM tree object.


Building DOM Tree with Tokens

Compare to the complicated JavaScript syntax parsing, HTML syntax parsing is relatively easier already. Up until this point, we have all the HTML tag and attribute information saved inside tokens, but with these tokens laying around won't be enough for our browser to use to render our web pages.

Side note: A real browser will find certain HTML syntax errors and auto-fix them. Since we are building a mini-browser, not a complete and fully featured browser, we will not consider these features and functions.

If we going to make a complete mini-browser, we will need to use these tokens and create a DOM tree object. So, the question is "How are we going to use all these tokens to create our DOM tree?""

Let's flatten out our logic:

  • To create a DOM tree, the basic trick is by using a Stack Data Structure
  • When we bump into a starting tag, we create the Stack and push it into the stack, when we bump into the ending tag, we pop everything out of the stack, at the end we will have a full HTML tag information in the right order.
  • Self-closing tags will push in the stack and out of the stack when it closes, because there are not contents in between the opening and closing tags (well there is no closing tag to between with right?)

Still confused? Let's look at this:

<div>
    <p>JavaScript</p>
    <span> is fun!</span>
</div>

In a Stack Data Structure, it will look like this:

. <- Here is top of the stack
├── <div>
├── <p>
├── JavaScript
├── </p>
├── <span>
├── is fun!
├── </span>
└── </div>

For a self-closing tag:

<img src="https://example.com/img.png" />
. <- Here is top of the stack
└── <img src="https://example.com/img.png" />

From the look of it, it will go in the stack and come right back out right?

After this basic understanding of how HTML DOM is stacked and formed, let's go look at how are we going to implement this in our code.

Let's start simply by ignoring Text Nodes inside our HTML tags first.

// Default root node `document`
// All HTML start with the `document` node
let stack = [{ type: 'document', children: [] }];

// Emitting HTML token
function emit(token) {
  if (token.type === 'text') return;

  // Record the previous element - Top of Stack
  let top = stack[stack.length - 1];

  // If it's starting tag
  if (token.type == 'startTag') {
    let element = {
      type: 'element',
      children: [],
      attributes: [],
    };

    element.tagName = token.tagName;

    for (let prop in token) {
      if (prop !== 'type' && prop != 'tagName') {
        element.attributes.push({
          name: prop,
          value: token[prop],
        });
      }
    }

    // Find the matching closing tag
    top.children.push(element);
    element.parent = top;

    if (!token.isSelfClosing) stack.push(element);

    currentTextNode = null;
  } else if (token.type == 'endTag') {
    if (top.tagName !== token.tagName) {
      throw new Error('Parse error: Tag start end not matched');
    } else {
      stack.pop();
    }

    currentTextNode = null;
  }
}

That's it, now you will build a DOM tree looking like this:

.
├── `<div>`
│   ├── `<p>`
│   ├── `</p>`
│   ├── `<span>`
│   └── `</span>`
└── `</div>`

However we are still missing the Text Element inside the HTML tags, that's what we are going to do next.


Adding Text Node to DOM Tree

This is the last section of the HTML parsing, we need to add Text Node into our DOM tree object. Here are two things we need to note:

  1. Processing a Text Node is the same as Self-Closing Tag.
  2. Multiple Text Nodes need to be combined.

For this part let's talk less and let our code speak the truth.

File: emit() method inside parser.js

let currentToken = null;
let currentAttribute = null;
let currentTextNode = null;

// Default root node `document`
// All HTML start with the `document` node
let stack = [{ type: 'document', children: [] }];

// Emitting HTML token
function emit(token) {
  // Record the previous element - Top of Stack
  let top = stack[stack.length - 1];

  // If it's starting tag
  if (token.type == 'startTag') {
    let element = {
      type: 'element',
      children: [],
      attributes: [],
    };

    element.tagName = token.tagName;

    for (let prop in token) {
      if (prop !== 'type' && prop != 'tagName') {
        element.attributes.push({
          name: prop,
          value: token[prop],
        });
      }
    }

    // Find the matching closing tag
    top.children.push(element);
    element.parent = top;

    if (!token.isSelfClosing) stack.push(element);

    currentTextNode = null;
  } else if (token.type == 'endTag') {
    if (top.tagName !== token.tagName) {
      throw new Error('Parse error: Tag start end not matched');
    } else {
      stack.pop();
    }

    currentTextNode = null;
  } else if (token.type === 'text') {
    // Text Node processing
    if (currentTextNode === null) {
      currentTextNode = {
        type: 'text',
        content: '',
      };
      top.children.push(currentTextNode);
    }

    currentTextNode.content += token.content;
  }
}

That's it! You made it!

You made it

That's all the content for HTML parsing, next article we will talk about how CSS computes the style properties and creating the CSSOM.


Recommended Open Source Projects

Hexo Theme Aurora

Usage Document


VSCode Aurora Future theme


Firefox Aurora Future

Theme page