Preface

Hot Module Replacement (HMR) is a major feature of webpack. When you modify and save code, webpack repackages it and sends the new module to the browser, which swaps the old module for the new one without a page refresh, so the application updates in place.

For example, when developing a web page, if you click a button and a pop-up window appears, but the title of the pop-up window is not aligned, you can modify the CSS style and save it. Without refreshing the browser, the title style changes. It feels like directly modifying the element style in Chrome’s developer tools.

Hot Module Replacement (HMR)

The Hot Module Replacement (HMR) function replaces, adds, or deletes modules during application runtime without reloading the entire page. This significantly speeds up development in the following ways:

  • Preserve application state that would otherwise be lost during a full page reload.

  • Only update the changed content to save valuable development time.

  • When CSS/JS changes occur in the source code, they are immediately updated in the browser, which is almost equivalent to directly changing the style in the browser devtools.

Why do we need HMR?

Before webpack's HMR existed, there were already many live reload tools and libraries, such as live-server. These libraries watch for file changes and tell the browser to refresh the page. So why do we still need HMR? The answer was actually hinted at above.

  • Live reload tools cannot preserve application state. When the page is refreshed, the previous state of the application is lost. In the example mentioned earlier, when you click a button to display a pop-up window, the pop-up disappears as soon as the browser refreshes, and you have to click the button again to restore the previous state. Webpack HMR, however, does not refresh the browser; it replaces the module at runtime, so application state is preserved and development efficiency improves.

  • In the old development workflow, we might have to run a command manually to package the code and then manually refresh the browser after packaging. All of this repetitive work can be automated by the HMR workflow, so more energy goes into the business logic instead of being wasted on repetition.

  • HMR is compatible with most front-end frameworks or libraries on the market, such as React Hot Loader, Vue-loader, which can listen to changes in React or Vue components and update the latest components to the browser in real-time. Elm Hot Loader supports the translation and packaging of Elm language code through webpack, and of course, it also implements HMR functionality.

HMR Working Principle Diagram

When I first learned about HMR, I thought it was very magical, and there were always some questions lingering in my mind.

  • Webpack can package different modules into bundle files or several chunk files, but when I develop with webpack HMR, I did not find the webpack packaged files in my dist directory. Where did they go?

  • By looking at the package.json file of webpack-dev-server, we know that it depends on the webpack-dev-middleware library. So what role does webpack-dev-middleware play in the HMR process?

  • During the use of HMR, I know that the browser communicates with webpack-dev-server through websocket, but I did not find new module code in the websocket message. How are the new modules sent to the browser? Why are the new modules not sent to the browser through websocket with the message?

  • After the browser gets the latest module code, how does HMR replace the old module with the new one? How to handle the dependency relationship between modules during the replacement process?

  • During the module hot replacement process, is there any fallback mechanism if the replacement module fails?

With these questions in mind, I decided to delve into the webpack source code and find the underlying secrets of HMR.


Figure 1: HMR workflow diagram

The above figure is a module hot update process diagram for application development using webpack with webpack-dev-server.

The red box at the bottom of the figure is the server, and the orange box above is the browser.

The green box is the area controlled by the webpack code. The blue box is the area controlled by the webpack-dev-server code. The magenta box is the file system, where file changes occur, and the cyan box is the application itself.

The figure shows a cycle from when we modify the code to when the module hot update is completed. The entire process of HMR is marked by Arabic numerals in dark green.

  • In the first step, in webpack’s watch mode, when a file in the file system is modified, webpack detects the file change, recompiles and packages the module according to the configuration file, and saves the packaged code in memory as a simple JavaScript object.

  • The second step is the interface interaction between webpack-dev-server and webpack. In this step, the main interaction is between the dev-server middleware webpack-dev-middleware and webpack. Webpack-dev-middleware calls webpack’s exposed API to monitor code changes and tells webpack to package the code into memory.

  • The third step is the monitoring of file changes by webpack-dev-server. This step is different from the first step, and it does not monitor code changes and repackage them. When we configure devServer.watchContentBase to true in the configuration file, the server will monitor changes in static files in these configured folders, and notify the browser to perform live reload of the corresponding application after the changes. Note that this is a different concept from HMR.

  • The fourth step is also the work of the webpack-dev-server code. In this step, the server establishes a websocket long connection between the browser and the server through sockjs (a dependency of webpack-dev-server), and informs the browser of the status information of various stages of webpack compilation and packaging, including the information of Server listening to static file changes in the third step. The browser performs different operations based on these socket messages. Of course, the most important information transmitted by the server is the hash value of the new module. The subsequent steps perform module hot replacement based on this hash value.

The webpack-dev-server/client side does not request the updated code or perform the hot replacement itself; instead, it hands these tasks back to webpack. The role of webpack/hot/dev-server is to decide whether to refresh the browser or perform a module hot update, based on the information passed to it by webpack-dev-server/client and the dev-server configuration. Of course, if the decision is simply to refresh the browser, none of the subsequent steps happen.

HotModuleReplacement.runtime is the hub of HMR on the client side. It receives the hash value of the new build passed on by the previous step and, through JsonpMainTemplate.runtime, sends an Ajax request to the server. The server returns a JSON manifest describing the chunks to be updated. With that update list, the runtime then requests the latest module code via jsonp. These are steps 7, 8, and 9 in the figure above.

The tenth step is the key step that determines the success or failure of HMR. In this step, the HotModulePlugin compares the old and new modules and decides whether to update the module. After deciding to update the module, it checks the dependency relationship between the modules and updates the dependency references between the modules while updating the modules.

The last step is to fall back to live reload when HMR fails, that is, to refresh the browser to obtain the latest packaged code.

Simple Example of Using HMR

In the previous section, an HMR workflow diagram briefly explained the process of module hot updates. You may still feel confused, and some of the English terms above may be unfamiliar (they are names of code repositories or file modules). Don't worry: in this section, I will use the simplest, purest example and the webpack and webpack-dev-server source code to analyze in detail the specific responsibilities of each library in the HMR process.

Here, I will use a simple Vue example for the demonstration. The repository is at github.com/ikkkp/webpack-vue-demo.

Before starting this example, let me briefly introduce the repository. Its key configuration file, webpack.base.config.js, is shown below:

const path = require('path');
const HtmlWebpackPlugin = require('html-webpack-plugin');
const {
    VueLoaderPlugin
} = require('vue-loader');
const webpack = require('webpack'); // import webpack
const AutoImport = require('unplugin-auto-import/webpack')
const Components = require('unplugin-vue-components/webpack')
const {
    ElementPlusResolver
} = require('unplugin-vue-components/resolvers')

/**
* @description 
* @version 1.0
* @author Huangzl
* @fileName webpack.base.config.js
* @date 2023/11/10 11:00:59
*/

module.exports = {
    entry: {
        main: './src/main',
        // single-page app: multiple entries are disabled in development mode
    },
    resolveLoader: {
        modules: [
            'node_modules',
            path.resolve(__dirname, './src/loader')
        ]
    },
    output: {
        filename: '[id].[fullhash].js', // use [fullhash] instead of [hash] (the newer webpack syntax)
        path: path.join(__dirname, 'dist'),
        publicPath: './'
    },
    module: {
        rules: [{
            test: /\.vue$/,
            loader: 'vue-loader'
        },
        {
            test: /\.css$/,
            use: [
                'style-loader',
                {
                    loader: 'css-loader',
                    options: {
                        importLoaders: 1
                    }
                },
                'postcss-loader'
            ]
        }, {
            test: /\.js$/,
            use: ['babel-loader', {
                loader: 'company-loader',
                options: {
                    sign: 'we-doctor@2021',
                },
            },],
            exclude: /node_modules/,
        },
        {
            test: /\.(ico|png|jpg|gif|svg|eot|woff|woff2|ttf)$/,
            loader: 'file-loader',
            options: {
                name: '[name].[ext]?[hash]'
            }
        },

        ]
    },

    plugins: [
        new HtmlWebpackPlugin({
            template: './public/index.html'
        }),
        new VueLoaderPlugin(),
        new webpack.DefinePlugin({
            BASE_URL: JSON.stringify('./') // define BASE_URL as './'
        }),
        AutoImport({
            resolvers: [ElementPlusResolver()],
        }),
        Components({
            resolvers: [ElementPlusResolver()],
        }),
    ],
    optimization: {
        splitChunks: {
            chunks: 'all', // split all chunks (both sync and async)
            maxSize: 20000000, // cap each chunk at roughly 20 MB
        },
    },
};

It is worth mentioning that HotModuleReplacementPlugin is not configured in the above configuration, because when we set devServer.hot to true and add the following script to package.json:

"start": "webpack-dev-server --hot --open"

After adding the --hot option, devServer tells webpack to apply the HotModuleReplacementPlugin automatically, without us having to add it manually.

The above is the content of webpack.base.config.js. We will modify the content of App.vue below:

- <div>hello</div> // change the hello string to hello world
+ <div>hello world</div>

Step 1: webpack watches the file system and packages it into memory

webpack-dev-middleware calls webpack's API to watch the file system. When a source file changes (App.vue in our example), webpack recompiles and repackages it, then saves the result to memory.

// webpack-dev-middleware/lib/Shared.js
if(!options.lazy) {
    var watching = compiler.watch(options.watchOptions, share.handleCompilerCallback);
    context.watching = watching;
}

You may wonder why webpack does not directly package files into the output.path directory. Where do the files go? It turns out that webpack packages the bundle.js file into memory. The reason for not generating files is that accessing code in memory is faster than accessing files in the file system, and it also reduces the overhead of writing code to files. All of this is thanks to memory-fs, a dependency of webpack-dev-middleware. Webpack-dev-middleware replaces the original outputFileSystem of webpack with a MemoryFileSystem instance, so the code is output to memory. The relevant source code of webpack-dev-middleware is as follows:

// webpack-dev-middleware/lib/Shared.js
var isMemoryFs = !compiler.compilers && compiler.outputFileSystem instanceof MemoryFileSystem;
if(isMemoryFs) {
    fs = compiler.outputFileSystem;
} else {
    fs = compiler.outputFileSystem = new MemoryFileSystem();
}

First, it checks whether the compiler's output file system is already an instance of MemoryFileSystem. If not, it replaces the compiler's outputFileSystem with a MemoryFileSystem instance. This way, the code of bundle.js is kept in memory as a simple JavaScript object. When the browser requests bundle.js, devServer reads it straight from memory and returns it.
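
To make this concrete, here is a heavily simplified sketch of how a middleware could serve the bundle straight out of the in-memory file system. This is not the actual webpack-dev-middleware source; app stands for an express-style server, compiler for the webpack compiler, and fs for the MemoryFileSystem instance set up above.

var path = require('path');

// a simplified sketch, not the real webpack-dev-middleware implementation
app.use(function (req, res, next) {
    var filename = path.join(compiler.outputPath, req.path);
    if (fs.existsSync(filename)) {          // fs is the MemoryFileSystem instance
        res.setHeader('Content-Type', 'application/javascript');
        res.end(fs.readFileSync(filename)); // serve the packaged code from memory, not from disk
    } else {
        next();
    }
});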

Step 2: devServer notifies the browser that the file has changed

In this stage, sockjs is the bridge between the server and the browser. When devServer starts, sockjs establishes a WebSocket long connection between the server and the browser to inform the browser of the various stages of webpack compilation and packaging. The key step is still webpack-dev-server calling the webpack API to listen for the compilation's done event. After the compilation completes, webpack-dev-server sends the hash value of the newly compiled build to the browser through the _sendStats method.

// webpack-dev-server/lib/Server.js
compiler.plugin('done', (stats) => {
  // stats.hash is the hash of the latest build
  this._sendStats(this.sockets, stats.toJson(clientStats));
  this._stats = stats;
});
...
Server.prototype._sendStats = function (sockets, stats, force) {
  if (!force && stats &&
  (!stats.errors || stats.errors.length === 0) && stats.assets &&
  stats.assets.every(asset => !asset.emitted)
  ) { return this.sockWrite(sockets, 'still-ok'); }
  // call sockWrite to send the hash to the browser over the websocket
  this.sockWrite(sockets, 'hash', stats.hash);
  if (stats.errors.length > 0) { this.sockWrite(sockets, 'errors', stats.errors); }
  else if (stats.warnings.length > 0) { this.sockWrite(sockets, 'warnings', stats.warnings); }
  else { this.sockWrite(sockets, 'ok'); }
};

Step 3: webpack-dev-server/client responds to server messages

You may wonder how the code in bundle.js receives websocket messages, since you did not add any code to receive them in your business code, nor add a new entry file in the entry property of webpack.config.js. It turns out that webpack-dev-server modifies the entry property of the webpack configuration and adds the webpack-dev-server/client code to it. This way, bundle.js ends up containing the code that receives websocket messages.
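
Conceptually, the entry that webpack actually receives in development looks roughly like the following. This is only a sketch: the exact client path and query string depend on the webpack-dev-server version, and the file itself is shown for illustration only.

// what the dev server effectively turns our entry into (illustrative)
module.exports = {
    entry: {
        main: [
            'webpack-dev-server/client?http://localhost:8080', // receives websocket messages from the server
            'webpack/hot/dev-server',                           // reacts to the webpackHotUpdate event
            './src/main'                                        // our original business entry
        ]
    }
};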

When webpack-dev-server/client receives a hash message, it temporarily stores the hash value. When it receives an ok message, it performs a reload operation on the application. The hash message is received before the ok message.


When the hash message arrives, webpack-dev-server/client stores the hash value in the currentHash variable. When it receives an ok message, it calls reloadApp. If module hot updating is configured, reloadApp uses webpack/hot/emitter to send the latest hash value to webpack and hands control over to the webpack client code. If module hot updating is not configured, it simply calls location.reload to refresh the page.
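
Condensed into a sketch, the client logic described above looks roughly like this (based on older webpack-dev-server versions; it is not the verbatim source):

// webpack-dev-server/client, heavily condensed
var hotEmitter = require('webpack/hot/emitter');

var currentHash = '';
var hot = true; // true when --hot / devServer.hot is enabled

var onSocketMsg = {
    hash: function (hash) {
        currentHash = hash; // temporarily store the hash of the latest compilation
    },
    ok: function () {
        reloadApp();        // the build is good: decide between hot update and live reload
    }
};

function reloadApp() {
    if (hot) {
        // hand the new hash to the HMR runtime and let the webpack client code take over
        hotEmitter.emit('webpackHotUpdate', currentHash);
    } else {
        // no HMR configured: fall back to a full page refresh
        window.location.reload();
    }
}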

Step 4: webpack receives the latest hash value, verifies it, and requests module code

In this step, three modules in webpack (three files; their English names correspond to the file paths) work together. First, webpack/hot/dev-server (referred to as dev-server below) listens for the webpackHotUpdate message sent by webpack-dev-server/client in step 3 and calls the check method in webpack/lib/HotModuleReplacement.runtime (the HMR runtime) to check for new updates. The check uses two methods from webpack/lib/JsonpMainTemplate.runtime (the jsonp runtime): hotDownloadManifest and hotDownloadUpdateChunk. The former sends an Ajax request asking the server whether there are updated files and, if so, returns the list of updates to the browser. The latter then requests the latest module code through jsonp and hands it to the HMR runtime, which processes the new code further and either hot-updates the modules or refreshes the page.


It is worth noting that both requests use the file name concatenated with the previous hash value. The hotDownloadManifest method returns the latest hash value, and the hotDownloadUpdateChunk method returns the code block corresponding to the latest hash value. Then, the new code block is returned to HMR runtime for module hot updating.
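
Expressed as a sketch, the two runtime helpers behave roughly as follows. The real jsonp runtime uses XMLHttpRequest and internal variables such as hotCurrentHash; the names and the use of fetch here are simplifications.

// simplified behaviour of the two runtime helpers (not the verbatim webpack source)
function hotDownloadManifest(previousHash) {
    // Ajax request for the update manifest, keyed by the hash of the previous build,
    // e.g. GET <publicPath><previousHash>.hot-update.json -> { "h": "<newHash>", "c": { "0": true } }
    return fetch(__webpack_require__.p + previousHash + '.hot-update.json')
        .then(function (response) { return response.json(); });
}

function hotDownloadUpdateChunk(chunkId, previousHash) {
    // JSONP: inject a script tag; the loaded file calls webpackHotUpdate(chunkId, newModules)
    var script = document.createElement('script');
    script.src = __webpack_require__.p + chunkId + '.' + previousHash + '.hot-update.js';
    document.head.appendChild(script);
}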

Step 5: HotModuleReplacement.runtime hot updates the module

This step is the key step of the entire module hot updating (HMR), and all module hot updates occur in the hotApply method of HMR runtime.

// webpack/lib/HotModuleReplacement.runtime
function hotApply() {
    // ...
    var idx;
    var queue = outdatedModules.slice();
    while(queue.length > 0) {
        moduleId = queue.pop();
        module = installedModules[moduleId];
        // ...
        // remove module from cache
        delete installedModules[moduleId];
        // when disposing there is no need to call dispose handler
        delete outdatedDependencies[moduleId];
        // remove "parents" references from all children
        for(j = 0; j < module.children.length; j++) {
            var child = installedModules[module.children[j]];
            if(!child) continue;
            idx = child.parents.indexOf(moduleId);
            if(idx >= 0) {
                child.parents.splice(idx, 1);
            }
        }
    }
    // ...
    // insert new code
    for(moduleId in appliedUpdate) {
        if(Object.prototype.hasOwnProperty.call(appliedUpdate, moduleId)) {
            modules[moduleId] = appliedUpdate[moduleId];
        }
    }
    // ...
}

From the hotApply method above, it can be seen that module hot replacement mainly consists of three stages. The first stage is to find outdatedModules and outdatedDependencies. I did not include this part of the code here, but if you are interested, you can read the source code yourself. The second stage is to delete expired modules and dependencies from the cache, as follows:

delete installedModules[moduleId];
delete outdatedDependencies[moduleId];

The third stage is to add the new module to the modules object. The next time the __webpack_require__ method (the require method rewritten by webpack) is called, the new module code will be obtained.
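
On the application side, a module takes part in hot updating through the module.hot API. A minimal sketch follows; the ./content module and its render function are hypothetical.

// index.js - a minimal module.hot.accept example
import { render } from './content';

render();

if (module.hot) {
    // when ./content changes, re-run the callback instead of reloading the whole page
    module.hot.accept('./content', function () {
        render();
    });
}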

For error handling during module hot updates, if an error occurs during the hot update process, the hot update will fall back to refreshing the browser. This part of the code is in the dev-server code, and the brief code is as follows:

module.hot.check(true).then(function(updatedModules) {
    if(!updatedModules) {
        return window.location.reload();
    }
    // ...
}).catch(function(err) {
    var status = module.hot.status();
    if(["abort", "fail"].indexOf(status) >= 0) {
        window.location.reload();
    }
});

dev-server first verifies if there are any updates, and if there are no code updates, it reloads the browser. If an abort or fail error occurs during the hotApply process, the browser is also reloaded.


What is Hadoop Yarn?

In the old Hadoop 1.0, the JobTracker in MapReduce was responsible for far too much: it handled both resource scheduling and the management of numerous TaskTrackers, which was naturally unreasonable. Therefore, during the upgrade from 1.0 to 2.0, Hadoop separated the resource-scheduling work from the JobTracker and turned it into an independent resource management framework, which helped make Hadoop a stable cornerstone of big data. This independent resource management framework is Yarn.

Before introducing Yarn in detail, let's briefly talk about its name. The full name of Yarn is "Yet Another Resource Negotiator", that is, "yet another resource scheduler". The naming is similar in spirit to "Have a Nice Inn". As a side note, the Java build tool Ant was named the same way: it is short for "Another Neat Tool", meaning "yet another tidy tool".

Since it is called a resource scheduler, its function is naturally responsible for resource management and scheduling. Next, let’s take a closer look at Yarn.

Yarn Architecture

Figure: Yarn architecture

① Client: The client is responsible for submitting jobs to the cluster.

② ResourceManager: The main process of the cluster, the arbitration center, is responsible for cluster resource management and task scheduling.

③ Scheduler: Resource arbitration module.

④ ApplicationManager: Selects, starts, and supervises the ApplicationMaster.

⑤ NodeManager: The cluster’s secondary process, which manages and monitors Containers and executes specific tasks.

⑥ Container: A collection of local resources, such as a Container with 4 CPUs and 8GB of memory.

⑦ ApplicationMaster: The task execution and supervision center.

Three Main Components

Looking at the top of the figure, we can intuitively see two main components, ResourceManager and NodeManager, but there is actually an ApplicationMaster that is not displayed in the figure. Let’s take a look at these three components separately.

ResourceManager

Let’s start with the ResourceManager in the center of the figure. From the name, we can know that this component is responsible for resource management, and there is only one ResourceManager in the entire system to be responsible for resource scheduling.

It also includes two main components: the Scheduler and the ApplicationManager.

The Scheduler: Essentially, the Scheduler is a strategy or algorithm. When a client submits a task, it allocates resources based on the required resources and the current state of the cluster. Note that it only allocates resources to the application and does not monitor the status of the application.

ApplicationManager: Similarly, you can roughly guess what it does from its name. The ApplicationManager is responsible for managing the applications submitted by the client. Didn’t we say that the Scheduler does not monitor the program submitted by the user? In fact, the monitoring of the application is done by the ApplicationManager.

ApplicationMaster

Every time a client submits an Application, a new ApplicationMaster is created. This ApplicationMaster applies to the ResourceManager for container resources, sends the program to be run to the container after obtaining the resources, and then performs distributed computing.

This may be a bit difficult to understand. Why send the program to the container instead of sending the data to the program? In the traditional view, the program stays put while data constantly flows in and out. When the data volume is massive, that no longer works: moving so much data is too expensive and takes too long. As the old saying goes, "if the mountain will not come to Muhammad, then Muhammad must go to the mountain." Since big data is hard to move, we ship the application program, which is easy to move, to the nodes where the data lives and compute there. That is the core idea of distributed computing over big data.

NodeManager

The NodeManager is a proxy for the ResourceManager on each machine, responsible for container management, monitoring their resource usage (CPU, memory, disk, and network, etc.), and providing these resource usage reports to the ResourceManager/Scheduler.

The main idea of Yarn is to split the two functions of resource management and task scheduling of MRv1 JobTracker into two independent processes:

Figure: Yarn master/slave structure

  • Yarn is still a master/slave structure.

  • The main process ResourceManager is the resource arbitration center of the entire cluster.

  • The secondary process NodeManager manages local resources.

  • ResourceManager and the subordinate node process NodeManager form the Hadoop 2.0 distributed data computing framework.

The Process of Submitting an Application to Yarn

Figure: the process of submitting an application to Yarn

This figure shows the process of submitting a program, and we will discuss the process of each step in detail below.

  • The client submits an application to Yarn, assuming it is a MapReduce job.

  • The ResourceManager communicates with the NodeManager to allocate the first container for the application and runs the ApplicationMaster corresponding to the application in this container.

  • After the ApplicationMaster is started, it splits the job (i.e., the application) into tasks that can run in one or more containers. Then it applies to the ResourceManager for containers to run the program and sends heartbeats to the ResourceManager regularly.

  • After obtaining the container, the ApplicationMaster communicates with the NodeManager corresponding to the container and distributes the job to the container in the NodeManager. The MapReduce that has been split will be distributed here, and the container may run Map tasks or Reduce tasks.

  • The task running in the container sends heartbeats to the ApplicationMaster to report its status. When the program is finished, the ApplicationMaster logs out and releases the container resources to the ResourceManager.
    The above is the general process of running a job; a small client-side submission sketch follows below.
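
Here is that sketch: a minimal use of the YarnClient Java API for step 1. The application name, the ApplicationMaster command, and the resource sizes are placeholders, and a real ApplicationMaster needs far more setup (localized resources, environment, security tokens, and so on).

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // ask the ResourceManager for a new application id
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
        context.setApplicationName("demo-app"); // placeholder name

        // describe the container that will run the ApplicationMaster
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("echo hello-from-am")); // placeholder command
        context.setAMContainerSpec(amContainer);
        context.setResource(Resource.newInstance(512, 1)); // 512 MB and 1 vcore for the AM

        // this is step 1 in the figure: submit the application to the ResourceManager
        ApplicationId appId = yarnClient.submitApplication(context);
        System.out.println("Submitted application " + appId);

        yarnClient.stop();
    }
}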


Typical Topology of Yarn Architecture

In addition to the ResourceManager and NodeManager, a Yarn deployment also includes two other entities: the WebAppProxyServer and the JobHistoryServer.

Figure: typical Yarn topology

JobHistoryServer: Manages completed Yarn tasks

  • In Hadoop 1.0, the logs and various statistics of historical tasks were managed by the JobTracker.
  • Yarn abstracts the function of managing historical tasks into an independent entity, JobHistoryServer.

WebAppProxyServer: Web page proxy during task execution

  • By using a proxy, not only the pressure on ResourceManager is further reduced, but also the Web attacks on Yarn can be reduced.
  • Responsible for supervising the entire MapReduce task execution process, collecting the task execution information from the Container, and displaying it on a Web interface.

Yarn Scheduling Strategy

Capacity Scheduling Algorithm
CapacityScheduler is a multi-user and multi-task scheduling strategy that divides tasks into queues and allocates resources in Container units.

Figure: capacity scheduling

Fair Scheduling Strategy
FairScheduler is a pluggable scheduling strategy that allows multiple Yarn tasks to use cluster resources fairly.

Figure: fair scheduling

HDFS Design Principles

Design Goals

  • Store very large files: "very large" here means hundreds of MB, GB, or even TB.

  • Adopt a stream-based data access method: HDFS is based on the assumption that the most effective data processing mode is to generate or copy a data set once and then do a lot of analysis work on it. Analysis work often reads most of the data in the data set, even if not all of it. Therefore, the time required to read the entire data set is more important than the delay in reading the first record.

  • Run on commodity hardware: Hadoop does not require particularly expensive, highly reliable machines; it can run on ordinary commodity machines (available from multiple vendors). Commodity hardware does not mean low-end hardware. In a cluster (especially a large one), the node failure rate is relatively high, and HDFS's goal is to keep the cluster running without noticeable interruption to users when nodes fail.

Application Types Not Suitable for HDFS

Some scenarios are not suitable for storing data in HDFS. Here are a few examples:

  1. Low-latency data access
    Applications that require latency in the millisecond range are not suitable for HDFS. HDFS is designed for high-throughput data transmission, so latency may be sacrificed. HBase is more suitable for low-latency data access.

  2. A large number of small files
    The metadata of files (such as directory structure, node list of file blocks, and block-node mapping) is stored in the memory of the NameNode. The number of files in the entire file system is limited by the memory size of the NameNode. As a rule of thumb, a file/directory/file block generally occupies 150 bytes of metadata memory space. If there are one million files, each file occupies one file block, which requires about 300M of memory. Therefore, the number of files in the billions is difficult to support on existing commercial machines.

  3. Multiple writers or arbitrary file modifications
    HDFS writes data in an append-only manner. It does not support modifying a file at arbitrary offsets, nor does it support multiple concurrent writers.

HDFS Positioning

To improve scalability, HDFS uses a master/slave architecture to build a distributed storage cluster, which makes it easy to add or remove slaves to the cluster.

HDFS is an important component of the Hadoop ecosystem. It is a distributed file system designed to store large amounts of data and provide high-throughput data access. HDFS is designed to store data on inexpensive hardware and provide high fault tolerance. It achieves this goal by distributing data to multiple nodes in the cluster. HDFS is positioned as a batch processing system suitable for offline processing of large-scale data.

The main features of HDFS include:

  • High fault tolerance: HDFS distributes data to multiple nodes, so even if a node fails, data can still be accessed through other nodes.
  • High throughput: HDFS is designed to support batch processing of large-scale data, so it provides high-throughput data access.
  • Suitable for large files: HDFS is suitable for storing large files because it divides files into multiple blocks for storage and distributes these blocks to multiple nodes.
  • Stream data access: HDFS supports stream data access, which means it can efficiently process large amounts of data streams.


HDFS Architecture

HDFS uses a master/slave architecture to build a distributed storage service, which improves the scalability of HDFS and simplifies the architecture design. HDFS stores files in blocks, optimizing storage granularity. The NameNode manages the storage space of all slave machines, while the DataNode is responsible for actual data storage and read/write operations.

Blocks

There is a concept of blocks in physical disks. The physical block of a disk is the smallest unit of disk operation for reading and writing, usually 512 bytes. The file system abstracts another layer of concepts on top of the physical block of the disk, and the file system block is an integer multiple of the physical disk block. Generally, it is several KB. The blocks in Hadoop are much larger than those in general single-machine file systems, with a default size of 128M. The file in HDFS is split into block-sized chunks for storage, and these chunks are scattered across multiple nodes. If the size of a file is smaller than the block size, the file will not occupy the entire block, only the actual size. For example, if a file is 1M in size, it will only occupy 1M of space in HDFS, not 128M.

Why are HDFS blocks so large?
To minimize the seek time and control the ratio of time spent locating and transmitting files. Assuming that the time required to locate a block is 10ms and the disk transmission speed is 100M/s. If the proportion of time spent locating a block to the transmission time is controlled to 1%, the block size needs to be about 100M. However, if the block is set too large, in MapReduce tasks, if the number of Map or Reduce tasks is less than the number of cluster machines, the job efficiency will be very low.

Benefits of block abstraction

  • The splitting of blocks allows a single file size to be larger than the capacity of the entire disk, and the blocks that make up the file can be distributed across the entire cluster. In theory, a single file can occupy the disk of all machines in the cluster.
  • Block abstraction also simplifies the storage system, without worrying about its permissions, owner, and other content (these contents are controlled at the file level).
  • Blocks are the unit of replication in fault tolerance and high availability mechanisms.

Namenode & Datanode

The entire HDFS cluster consists of a master-slave model of Namenode and Datanode. The Namenode stores the file system tree and metadata of all files and directories. The metadata is persisted in two forms:

  • Namespace image
  • Edit log

However, the persisted metadata does not include block locations, that is, which nodes in the cluster each file block is stored on. This information is reconstructed when the system restarts, from the block reports sent by the DataNodes. In HDFS, the Namenode can become a single point of failure for the cluster: when the Namenode is unavailable, the entire file system is unavailable. HDFS provides two ways to deal with this single point of failure:

  1. Backup persistent metadata
    Write the file system metadata to multiple file systems at the same time, such as writing metadata to both the local file system and NFS at the same time. These backup operations are synchronous and atomic.

  2. Secondary Namenode
    The Secondary node periodically merges the namespace image and edit log of the main Namenode to avoid the edit log being too large, and merges them by creating a checkpoint. It maintains a merged namespace image replica that can be used to recover data when the Namenode completely crashes. The following figure shows the management interface of the Secondary Namenode:

Figure: Secondary Namenode management interface


Classic HDFS Architecture

The NameNode is responsible for managing the metadata of the file system, while the DataNode is responsible for storing the actual data of the file blocks. This division of labor enables HDFS to efficiently store and manage large-scale data.

Figure: classic HDFS architecture

Specifically, when a client needs to read or write a file, it sends a request to the NameNode. The NameNode returns the metadata information of the file and the location information of the file blocks. The client communicates with the DataNode based on this information to read or write the actual data of the file blocks.

Therefore, the NameNode and DataNode play different roles in the HDFS architecture.

What is the difference in function?

HDFS is an abbreviation for Hadoop Distributed File System, an important component of the Hadoop ecosystem. The HDFS architecture includes one NameNode and multiple DataNodes. The NameNode is the master node of HDFS, responsible for managing the namespace of the file system, the metadata information of the file, and the location information of the file blocks. The DataNode is the slave node of HDFS, responsible for storing the actual data of the file blocks.



General Topology

There is only one NameNode node, and the SecondaryNameNode or BackupNode node is used to obtain NameNode metadata information in real time and back up metadata.

Figure: general topology

Commercial Topology

There are two NameNode nodes, and ZooKeeper is used to implement hot standby between NameNode nodes.

Figure: commercial topology

Command Line Interface

HDFS provides various interaction methods, such as Java API, HTTP, and shell command line. Command line interaction is mainly operated through hadoop fs. For example:

hadoop fs -copyFromLocal <localsrc> <dst>   # Copy files from the local file system to HDFS
hadoop fs -mkdir <path>                     # Create a directory
hadoop fs -ls <path>                        # List the files in a directory

In Hadoop, the permissions of files and directories are similar to the POSIX model, including three permissions: read, write, and execute.

  • Read permission (r): read a file, or list the contents of a directory.
  • Write permission (w): for a file, permission to write to it; for a directory, permission to create or delete files (or subdirectories) under it.
  • Execute permission (x): ignored for files in HDFS; for a directory, it is required to access the directory's contents.

Each file or directory has three attributes: owner, group, and mode:

  • Owner: the owner of the file.
  • Group: the permission group the file belongs to.
  • Mode: made up of the owner's permissions, the permissions of members of the file's group, and the permissions of everyone else.


Data Flow (Read and Write Process)

Read File

The rough process of reading a file is as follows:

Figure: HDFS read process

  1. The client passes a file Path to the FileSystem’s open method.

  2. The DistributedFileSystem uses RPC to ask the NameNode for the DataNode addresses of the first few blocks of the file. The NameNode decides which nodes to return based on the network topology (provided the node holds a replica of the block). If the client itself is a DataNode and holds a replica of the block locally, it reads directly from the local node.

  3. The client uses the FSDataInputStream object returned by the open method to read data (call the read method).

  4. The DFSInputStream (wrapped by the FSDataInputStream returned above) connects to the node that holds the first block and repeatedly calls the read method to read data.

  5. After the first block is read, find the best datanode for the next block and read the data. If necessary, DFSInputStream will contact the NameNode to obtain the node information of the next batch of Blocks (stored in memory, not persistent), and these addressing processes are invisible to the client.

  6. After the data is read, the client calls the close method to close the stream object.

During the data reading process, if communication with the DataNode fails, the DFSInputStream object will try to read data from the next best node and remember the failed node, and subsequent block reads will not connect to the node.

After reading a block, DFSInputStream performs checksum verification. If the block is corrupted, it tries to read the data from another node and reports the corrupted block to the NameNode.

The NameNode only directs the client to the DataNode that holds each block; the data itself does not flow through the NameNode. This lets HDFS serve a large number of concurrent client requests, while the NameNode spreads read traffic across the cluster as evenly as possible.

The location information of the Block is stored in the memory of the NameNode, so the corresponding location request is very efficient and will not become a bottleneck.
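
As a reference, the client side of this read path can be exercised with the Hadoop FileSystem Java API. The following is a minimal sketch; the NameNode URI and the file path are hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hypothetical NameNode address and file path
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // open() returns an FSDataInputStream backed by a DFSInputStream (steps 1-3 above)
        try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) { // reads pull blocks from the DataNodes
                System.out.println(line);
            }
        } // closing the stream corresponds to step 6
    }
}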

Write File

Figure: HDFS write process

Step breakdown

  1. The client calls the create method of DistributedFileSystem.

  2. DistributedFileSystem makes an RPC call to the Namenode to create a new file in the file system's namespace; at this point the file is not yet associated with any blocks. During this process, the Namenode performs a number of checks, such as whether a file with the same name already exists and whether the client has the required permissions. If the checks pass, it returns an FSDataOutputStream object; otherwise, an exception is thrown to the client.

  3. When the client writes data, DFSOutputStream splits it into packets and appends them to a data queue, which is consumed by the DataStreamer.

  4. The DataStreamer asks the Namenode to allocate new blocks and to pick suitable DataNodes to store the replicas. These nodes hold replicas of the same block and form a pipeline. The DataStreamer writes each packet to the first node in the pipeline; after storing the packet, that node forwards it to the next node, and so on down the pipeline.

  5. DFSOutputStream also maintains an ack queue, waiting for confirmation messages from datanodes. After all datanodes on the pipeline confirm, the packet is removed from the ack queue.

  6. After the data is written, the client closes the output stream, flushes all remaining packets into the pipeline, and waits for acknowledgements from the datanodes. Once everything is acknowledged, it informs the Namenode that the file is complete. At this point, the Namenode already knows all the blocks of the file (because the DataStreamer requested the block allocations), so it only needs to wait for the minimum replica count to be reached before returning success to the client. A minimal client-side sketch follows below.
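
Here is that sketch, using the Hadoop FileSystem Java API; the NameNode URI and the target path are hypothetical.

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hypothetical NameNode address and target path
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // create() triggers the NameNode RPC described in step 2
        try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            // writes are buffered into packets and pushed down the DataNode pipeline (steps 3-5)
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            out.hflush(); // make the written data visible to readers
        } // close() flushes the remaining packets and completes the file (step 6)
    }
}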

How does the Namenode determine which DataNode the replica is on?

The storage strategy of HDFS replicas is a trade-off between reliability, write bandwidth, and read bandwidth. The default strategy is as follows:

The first replica is placed on the machine where the client is located. If that machine is outside the cluster, a node is chosen at random (preferring nodes that are not too full or too busy).

The second replica is randomly placed on a rack different from the first replica.

The third replica is placed on the same rack as the second replica, but on a different node, and a random selection is made from the nodes that meet the conditions.

More replicas are randomly selected throughout the cluster, although too many replicas are avoided on the same rack as much as possible.

After the location of the replica is determined, when establishing the write pipeline, the network topology structure is considered. The following is a possible storage strategy:

Figure: a possible replica placement and write pipeline

This selection balances reliability, read and write performance well.

  • Reliability: Blocks are distributed on two racks.

  • Write bandwidth: The write pipeline process only needs to cross one switch.

  • Read bandwidth: You can choose one of the two racks to read from.

Internal Features of HDFS

Data Redundancy

  • HDFS stores each file as a series of data blocks, with a default block size of 64MB (configurable).

  • For fault tolerance, all data blocks of a file have replicas (the replication factor is configurable).

  • HDFS files are written once and strictly limited to only one writing user at any time.

Replica Placement

  • HDFS clusters usually run on multiple racks, and communication between machines on different racks requires switches.

  • HDFS uses a rack-aware strategy to improve data reliability, availability, and network bandwidth utilization.

  • Rack failures are much less common than node failures, and this strategy can prevent data loss when an entire rack fails, improving data reliability and availability while ensuring performance.

Replica Selection

  • HDFS tries to use the replica closest to the program to satisfy user requests, reducing total bandwidth consumption and read latency.

  • HDFS architecture supports data balancing strategies.

Heartbeat Detection

  • The NameNode periodically receives heartbeats and block reports from each DataNode in the cluster. Receiving a heartbeat indicates that the DataNode is working properly.

  • The NameNode marks DataNodes that have not sent a heartbeat recently as dead and does not send them any new I/O requests.

  • The NameNode continuously checks for data blocks that need to be replicated and replicates them when necessary.

Data Integrity Check

  • Various reasons may cause the data block obtained from the DataNode to be corrupted.

  • HDFS client software implements checksum verification of HDFS file content.

  • If the checksum of the data block obtained by the DataNode is different from that in the hidden file corresponding to the data block, the client judges that the data block is corrupted and obtains a replica of the data block from another DataNode.

Simple Consistency Model, Stream Data Access

  • HDFS applications generally access files in a write-once, read-many mode.

  • Once a file is created, written, and closed, it does not need to be changed again.

  • This simplifies data consistency issues and makes high-throughput data access possible. Applications running on HDFS mainly focus on stream reading and batch processing, emphasizing high-throughput data access.

Client Cache

  • The request for the client to create a file does not immediately reach the NameNode. The HDFS client first caches the data to a local temporary file, and the write operation of the program is transparently redirected to this temporary file.

  • When the accumulated data in this temporary file exceeds the size of a block (64MB), the client contacts the NameNode.

  • If the NameNode crashes before the file is closed, the file will be lost.

  • If client caching is not used, network speed and congestion will have a significant impact on output.

Definition of MapReduce

MapReduce is a programming framework for distributed computing programs. It is the core framework for developing “Hadoop-based data analysis applications”. Its core function is to integrate the user’s written business logic code and default components into a complete distributed computing program, which runs concurrently on a Hadoop cluster.

Reason for the Emergence of MapReduce

Why do we need MapReduce?

  • Massive data cannot be processed on a single machine due to hardware resource limitations.
  • Once the single-machine version of the program is extended to run on a cluster, it will greatly increase the complexity and development difficulty of the program.
  • With the introduction of the MapReduce framework, developers can focus most of their work on the development of business logic, while leaving the complexity of distributed computing to the framework to handle.

Consider a word count requirement in a scenario with massive data:

  • Single-machine version: limited memory, limited disk, limited computing power
  • Distributed: file distributed storage (HDFS), computing logic needs to be divided into at least two stages (one stage is independently concurrent, one stage is converged), how to distribute computing programs, how to allocate computing tasks (slicing), how to start the two-stage program? How to coordinate? Monitoring during the entire program running process? Fault tolerance? Retry?

It can be seen that when the program is extended from a single-machine version to a distributed version, a large amount of complex work will be introduced.

Relationship between MapReduce and Yarn

Yarn is a resource scheduling platform that is responsible for providing server computing resources for computing programs, which is equivalent to a distributed operating system platform. MapReduce and other computing programs are like application programs running on top of the operating system.

Important concepts of YARN:

  1. Yarn does not know the running mechanism of the program submitted by the user;
  2. Yarn only provides scheduling of computing resources (when the user program applies for resources from Yarn, Yarn is responsible for allocating resources);
  3. The supervisor role in Yarn is called ResourceManager;
  4. The role that specifically provides computing resources in Yarn is called NodeManager;
  5. In this way, Yarn is completely decoupled from the running user program, which means that various types of distributed computing programs (MapReduce is just one of them), such as MapReduce, storm programs, spark programs, tez, etc., can run on Yarn;
  6. Therefore, computing frameworks such as Spark and Storm can be integrated to run on Yarn, as long as they have resource request mechanisms that comply with Yarn specifications in their respective frameworks;
  7. Yarn becomes a universal resource scheduling platform. From then on, various computing clusters that previously existed in enterprises can be integrated on a physical cluster to improve resource utilization and facilitate data sharing.

MapReduce Working Principle

Strictly speaking, MapReduce is not an algorithm, but a computing idea. It consists of two stages: map and reduce.

MapReduce Process

To improve development efficiency, common functions in distributed programs can be encapsulated into frameworks, allowing developers to focus on business logic.

MapReduce is such a general framework for distributed programs, and its overall structure is as follows (there are three types of instance processes during distributed operation):

  • MRAppMaster: responsible for the process scheduling and status coordination of the entire program
  • MapTask: responsible for the entire data processing process of the map phase
  • ReduceTask: responsible for the entire data processing process of the reduce phase

MapReduce Mechanism

Figure: MapReduce mechanism

The process is described as follows:

  1. When an MR program starts, the MRAppMaster is started first. After the MRAppMaster starts, according to the description information of this job, it calculates the number of MapTask instances required and applies to the cluster to start the corresponding number of MapTask processes.

  2. After the MapTask process is started, data processing is performed according to the given data slice range. The main process is:

  • Use the inputformat specified by the customer to obtain the RecordReader to read the data and form input KV pairs;
  • Pass the input KV pairs to the customer-defined map() method for logical operation, and collect the KV pairs output by the map() method to the cache;
  • Partition the cached KV pairs by key, sort them, and continuously spill them to files on disk.

  3. After the MRAppMaster detects that all MapTask processes have finished, it starts the number of ReduceTask processes specified by the customer's parameters and tells each ReduceTask process which range of data (which partition) to handle.

  4. After the ReduceTask process starts, it fetches the relevant MapTask output files from the machines where the MapTasks ran, based on the locations reported by the MRAppMaster, and re-merges and sorts them locally. It then groups the KV pairs with the same key, calls the customer-defined reduce() method for the logical operation, collects the resulting KV pairs, and finally calls the customer-specified outputformat to write the results to external storage.

Let’s take an example.

Figure: word frequency counting with MapReduce
The above figure shows a word frequency counting task.

  1. Hadoop divides the input data into several slices and assigns each split to a map task for processing.

  2. After mapping, each word and its frequency in this task are obtained.

  3. Shuffle puts the same words together, sorts them, and divides them into several slices.

  4. According to these slices, reduce is performed.

  5. The result of the reduce task is counted and output to a file.

In MapReduce, two roles are required to complete these processes: JobTracker and TaskTracker.

Figure: JobTracker and TaskTracker

JobTracker is used to schedule and manage other TaskTrackers. JobTracker can run on any computer in the cluster. TaskTracker is responsible for executing tasks and must run on DataNode.

Here is a simple MapReduce implementation example:

It is used to count the number of occurrences of each word in the input file.

  1. Import necessary packages:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  2. Define the Mapper class:

    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // Split each line of text into words and send them to the Reducer
        String[] words = line.split("\\s+");
        for (String word : words) {
          context.write(new Text(word), new IntWritable(1));
        }
      }
    }

    The Mapper class is responsible for splitting the input text data into words and outputting a key-value pair (word, 1) for each word.

  3. Define the Reducer class:

    public static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
      protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        // Accumulate the number of occurrences of the same word
        for (IntWritable value : values) {
          sum += value.get();
        }
        // Output the word and its total number of occurrences
        context.write(key, new IntWritable(sum));
      }
    }

    The Reducer class receives key-value pairs from the Mapper, accumulates the values of the same key, and then outputs the word and its total number of occurrences.

  4. Main function (main method):

    public static void main(String[] args) throws InterruptedException, IOException, ClassNotFoundException {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "word count");
      job.setJarByClass(word.class); // the enclosing driver class is assumed to be named 'word'
    
      job.setMapperClass(MyMapper.class);
      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(IntWritable.class);
    
      job.setReducerClass(MyReduce.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
    
      // Set the input and output paths
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
      // Submit the job and wait for it to complete
      job.waitForCompletion(true);
    }

Figure: the Hadoop ecosystem

In the entire Hadoop architecture, the computing framework plays a crucial role, on the one hand, it can operate on the data in HDFS, on the other hand, it can be encapsulated to provide calls from upper-level components such as Hive and Pig.

Let’s briefly introduce some of the more important components.

HBase: originated from Google’s BigTable; it is a highly reliable, high-performance, column-oriented, and scalable distributed database.

Hive: is a data warehouse tool that can map structured data files to a database table, and quickly implement simple MapReduce statistics through SQL-like statements, without the need to develop dedicated MapReduce applications, which is very suitable for statistical analysis of data warehouses.

Pig: is a large-scale data analysis tool based on Hadoop. It provides a SQL-LIKE language called Pig Latin. The compiler of this language converts SQL-like data analysis requests into a series of optimized MapReduce operations.

ZooKeeper: originated from Google’s Chubby; it is mainly used to solve data management problems frequently encountered in distributed applications, simplifying the coordination and management of distributed applications.

Ambari: Hadoop management tool, which can monitor, deploy, and manage clusters quickly.

Sqoop: used to transfer data between Hadoop and traditional databases.

Mahout: an extensible machine learning and data mining library.

Advantages and Applications of Hadoop

Overall, Hadoop has the following advantages:

High reliability: This is in its genes, which come from Google. In its early days, Google could not afford high-end servers, so it became very good at squeezing value out of ordinary, cheap machines and deploying large systems on them. Although the individual hardware is unreliable, the system as a whole is very reliable.

High scalability: Hadoop distributes data and computing tasks across clusters of available computers, and these clusters can be easily expanded. In other words, the cluster can grow easily.

High efficiency: Hadoop can dynamically move data between nodes and ensure dynamic balance of each node, so the processing speed is very fast.

High fault tolerance: Hadoop can automatically save multiple copies of data and automatically redistribute failed tasks. This is also considered high reliability.

Low cost: Hadoop is open source and relies on community services, so the cost of use is relatively low.

Based on these advantages, Hadoop is suitable for applications in large data storage and large data analysis, suitable for running on clusters of several thousand to tens of thousands of servers, and supports PB-level storage capacity.

Hadoop is used in a wide range of applications, including search, log processing, recommendation systems, data analysis, video and image analysis, and data storage.

What is Hadoop?

Hadoop is a distributed system infrastructure developed by the Apache Foundation. It is a software framework that combines a storage system and a computing framework. It mainly solves the problem of storing and computing massive data and is the cornerstone of big data technology. Hadoop processes data in a reliable, efficient, and scalable way. Users can develop distributed programs on Hadoop without understanding the underlying details of the distributed system. Users can easily develop and run applications that process massive data on Hadoop.

What problems can Hadoop solve?

  • Massive data storage

    HDFS has high fault tolerance and is designed to be deployed on low-cost hardware. It provides high throughput for accessing data and is suitable for applications with large data sets. It consists of many machines running DataNodes and one machine running the NameNode (with another NameNode on standby). Each DataNode manages a portion of the data, and the NameNode is responsible for managing the information (metadata) of the entire HDFS cluster.

  • Resource management, scheduling, and allocation

    Apache Hadoop YARN (Yet Another Resource Negotiator) is a new Hadoop resource manager. It is a general resource management system and scheduling platform that provides unified resource management and scheduling for upper-layer applications. Its introduction has brought huge benefits to the cluster in terms of utilization, unified resource management, and data sharing.

The origin of Hadoop

(Figure: the origin of Hadoop)

The core architecture of Hadoop

The core of Hadoop is HDFS and MapReduce. HDFS provides storage for massive data, and MapReduce provides a computing framework for massive data.

HDFS

(Figure: HDFS architecture)

The entire HDFS has three important roles: NameNode, DataNode, and Client.

HDFS uses a typical master-slave architecture and communicates over TCP/IP.

  • NameNode: The master node of the distributed file system, responsible for managing the namespace of the file system, cluster configuration information, and storage block replication. The NameNode stores the metadata of the file system in memory, including file information, block information for each file, and information about each block in the DataNode.

  • DataNode: The slave node of the distributed file system, which is the basic unit of file storage. It stores blocks in the local file system and saves the metadata of the blocks. It also periodically sends information about all existing blocks to the NameNode.

  • Client: Splits files, accesses HDFS, interacts with the NameNode to obtain file location information, and interacts with the DataNode to read and write data.

There is also the concept of a block: a block is the basic read and write unit in HDFS. Files in HDFS are stored as blocks, which are replicated to multiple DataNodes. The block size (64 MB by default in older Hadoop versions, 128 MB in Hadoop 2.x and later) and the replication factor are determined by the client when the file is created.

MapReduce

MapReduce is a distributed computing model that divides large data sets (greater than 1TB) into many small data blocks, and then performs parallel processing on various nodes in the cluster, and finally aggregates the results. The MapReduce calculation process can be divided into two stages: the Map stage and the Reduce stage.

  • Map stage: The input data is divided into several small data blocks, and then multiple Map tasks process them in parallel. Each Map task outputs the processing result as several key-value pairs.

  • Reduce stage: The output results of the Map stage are grouped according to the keys in the key-value pairs, and then multiple Reduce tasks process them in parallel. Each Reduce task outputs the processing result as several key-value pairs.

Summary

Hadoop is a distributed system infrastructure that mainly solves the problem of storing and computing massive data. Its core is HDFS and MapReduce, where HDFS provides storage for massive data, and MapReduce provides a computing framework for massive data. In addition, Hadoop has another important component, YARN, which is a general resource management system and scheduling platform that provides unified resource management and scheduling for upper-layer applications.

Definition of Cloud Computing

Cloud computing is a type of service related to information technology, software, and the internet that provides on-demand, dynamically scalable, and inexpensive computing services through a network.

Cloud computing is a pay-per-use model that provides available, convenient, and on-demand network access to a shared pool of configurable computing resources (including networks, servers, storage, application software, and services) that can be rapidly provisioned.

History of Cloud Computing

In March 2006, Amazon launched the Elastic Compute Cloud (EC2) service.

On August 9, 2006, Google CEO Eric Schmidt first proposed the concept of “cloud computing” at the Search Engine Strategies conference (SES San Jose 2006). Google’s “cloud computing” originated from the “Google 101” project by Google engineer Christophe Bisciglia.

In October 2007, Google and IBM began promoting cloud computing on American university campuses.

On February 1, 2008, IBM (NYSE: IBM) announced the establishment of the world’s first cloud computing center for Chinese software companies in the Wuxi Taihu New City Science and Education Industrial Park.

On July 29, 2008, Yahoo, HP, and Intel announced a joint research project to launch a cloud computing research test bed to promote cloud computing.

On August 3, 2008, the US Patent and Trademark Office website showed that Dell was applying for the “cloud computing” trademark to strengthen its control over the term that could reshape future technology architecture.

In March 2010, Novell and the Cloud Security Alliance (CSA) jointly announced a vendor-neutral plan called the “Trusted Cloud Initiative.”

In July 2010, the US National Aeronautics and Space Administration and supporting vendors such as Rackspace, AMD, Intel, and Dell announced the “OpenStack” open source code plan. Microsoft announced its support for the integration of OpenStack and Windows Server 2008 R2 in October 2010, while Ubuntu added OpenStack to version 11.04.

In February 2011, Cisco Systems officially joined OpenStack and focused on developing OpenStack’s network services.

Technical Background of Cloud Computing

Cloud computing is the commercial implementation of concepts in computer science such as parallel computing, distributed computing, and grid computing.

Cloud computing is the result of the mixed evolution and improvement of technologies such as virtualization, utility computing, Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).

Introduction

In this section, we will discuss how to handle events in Vue, including how to describe events in virtual nodes, how to add events to DOM elements, and how to update events. Let’s start by addressing the first question, which is how to describe events in virtual nodes. Events can be considered as special attributes, so we can agree that any attribute starting with the string “on” in the vnode.props object should be treated as an event. For example:

const vnode = {
  type: 'p',
  props: {
    // Describe events using onXxx
    onClick: () => {
      alert('clicked');
    }
  },
  children: 'text'
};

Once we have resolved how events are described in virtual nodes, let’s see how to add events to DOM elements. This is very simple, just call the addEventListener function in the patchProps method to bind the event, as shown in the following code:

function patchProps(el, key, prevValue, nextValue) {
  // Match attributes starting with on as events
  if (/^on/.test(key)) {
    // Get the corresponding event name based on the attribute name, e.g., onClick ---> click
    const name = key.slice(2).toLowerCase();
    
    // Remove the previously bound event handler
    prevValue && el.removeEventListener(name, prevValue);
    // Bind the new event handler
    el.addEventListener(name, nextValue);
  } else if (key === 'class') {
    // Omitted code (handling class attribute logic)
  } else if (shouldSetAsProps(el, key, nextValue)) {
    // Omitted code (handling other attribute logic)
  } else {
    // Omitted code (handling other attribute logic)
  }
}

In fact, the event update mechanism can be further optimized to avoid multiple calls to removeEventListener and addEventListener.

function patchProps(el, key, prevValue, nextValue) {
  if (/^on/.test(key)) {
    const name = key.slice(2).toLowerCase();
    // el._vei ("vue event invokers") caches one invoker per event name
    const invokers = el._vei || (el._vei = {});
    let invoker = invokers[name];

    if (nextValue) {
      if (!invoker) {
        // No invoker yet: create a fake invoker, cache it, and bind it once
        invoker = invokers[name] = (e) => {
          // The actual handler lives on invoker.value and is called indirectly
          invoker.value(e);
        };
        // Assign the actual event handler to the value property of the invoker
        invoker.value = nextValue;
        // Bind the invoker function as the event handler
        el.addEventListener(name, invoker);
      } else {
        // Updating an event only swaps invoker.value; no listener churn
        invoker.value = nextValue;
      }
    } else if (invoker) {
      // The new handler does not exist but an invoker is bound: remove it
      el.removeEventListener(name, invoker);
      invokers[name] = undefined;
    }
  } else if (key === 'class') {
    // Omitted code (handling class attribute logic)
  } else if (shouldSetAsProps(el, key, nextValue)) {
    // Omitted code (handling other attribute logic)
  } else {
    // Omitted code (handling other attribute logic)
  }
}

Looking at the code above, event binding works in two steps. First, read the invoker for the given event name from the el._vei cache object. If it does not exist, create a fake invoker function, cache it in el._vei under that event name, assign the actual event handler to invoker.value, and bind the fake invoker to the element with addEventListener. When the event is triggered, the fake invoker runs and indirectly calls the actual handler through invoker.value(e).

When updating an event, the invoker is already cached in el._vei, so we only need to assign the new event handler to invoker.value.

This way, updating an event avoids calls to removeEventListener and addEventListener, which improves performance. Note that el._vei must be an object keyed by event name rather than a single cached invoker; otherwise, an element that binds several different events at the same time would have its handlers override one another. Consider the following vnode:

const vnode = {
  type: 'p',
  props: {
    // Describe events using onXxx
    onClick: () => {
      alert('clicked');
    },
    onContextmenu: () => {
      alert('contextmenu');
    }
  },
  children: 'text'
};

// Assume renderer is your renderer object
renderer.render(vnode, document.querySelector('#app'));

When the renderer renders the vnode above, it first binds the click event and then the contextmenu event. If el._vei cached only a single invoker, the contextmenu handler bound later would override the click handler. Because el._vei is an object whose keys are event names and whose values are the corresponding invokers, the two handlers are cached independently and no override occurs.
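
As a rough illustration (the calls below are simplified; in the real renderer they happen while iterating vnode.props during mount), mounting this vnode caches two independent invokers on el._vei:

// One patchProps call per event prop during mount
patchProps(el, 'onClick', null, vnode.props.onClick);
patchProps(el, 'onContextmenu', null, vnode.props.onContextmenu);

// el._vei now holds one invoker per event name:
// { click: [Function: invoker], contextmenu: [Function: invoker] }
// so neither handler overrides the other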

To summarize, this patchProps implementation is responsible for handling attribute updates on DOM elements, including the logic for binding and unbinding events, and it uses the el._vei object to cache event handler functions.

Introduction

Vue.js templates are powerful and can meet most of our application needs. However, in certain scenarios, such as creating dynamic components based on input or slot values, the render function can be a more flexible solution.

Developers familiar with the React ecosystem might already be acquainted with render functions, commonly used in JSX to construct React components. While Vue render functions can also be written in JSX, this discussion focuses on using plain JavaScript. This approach simplifies understanding the fundamental concepts of the Vue component system.

Every Vue component includes a render function. Most of the time, this function is created by the Vue compiler. When a template is specified for a component, the Vue compiler processes the template’s content, ultimately generating a render function. This render function produces a virtual DOM node, which Vue renders in the browser DOM.

This brings us to the concept of the virtual DOM. But what exactly is the virtual DOM?

The virtual Document Object Model (or “virtual DOM”) enables Vue to render components in memory before updating the browser. This approach enhances speed and avoids the high cost associated with re-rendering the real DOM. Since each real DOM node object carries numerous properties and methods, building the representation in memory first avoids the overhead of creating DOM nodes in the browser until they are actually needed.

When Vue updates the browser DOM, it compares the updated virtual DOM with the previous virtual DOM. Only the modified parts of the virtual DOM are used to update the actual DOM, reducing the number of element changes and enhancing performance. The render function returns virtual DOM nodes, often referred to as VNodes in the Vue ecosystem. These objects enable Vue to write these nodes into the browser DOM. They contain all the necessary information Vue needs.
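
To make this concrete, here is a minimal sketch of a hand-written render function (the heading component, its level prop, and the title class are made up for illustration); it returns a VNode built with Vue 3’s h helper instead of relying on a compiled template:

import { h } from 'vue';

// A hypothetical heading component: the tag name is computed from the `level` prop
export default {
  props: ['level'],
  render() {
    // h(type, props, children) returns a VNode that Vue later renders to the real DOM
    return h('h' + this.level, { class: 'title' }, this.$slots.default?.());
  }
};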


Mounting Child Nodes and Element Attributes

When vnode.children is a string, it sets the element’s text content. An element can have multiple child elements besides text nodes. To describe an element’s child nodes, vnode.children needs to be defined as an array:

const vnode = {
  type: 'div',
  children: [
    {
      type: 'p',
      children: 'hello'
    }
  ]
};

In the above code, we describe “a div tag with a child node, which is a p tag.” As seen, vnode.children is an array, and each element of the array is an independent virtual node object. This creates a tree-like structure, or a virtual DOM tree.

To render child nodes, we need to modify the mountElement function, as shown in the following code:

function mountElement(vnode, container) {
  const el = createElement(vnode.type);
  if (typeof vnode.children === 'string') {
    setElementText(el, vnode.children);
  } else if (Array.isArray(vnode.children)) {
    // If `children` is an array, iterate through each child node and call the `patch` function to mount them
    vnode.children.forEach(child => {
      patch(null, child, el);
    });
  }
  insert(el, container);
}

In this code, we have added a new conditional branch. We use the Array.isArray function to check if vnode.children is an array. If it is, we loop through each child node and call the patch function to mount the virtual nodes in the array. During mounting of child nodes, we need to pay attention to two points:

  1. The first argument passed to the patch function is null. Since this is the mounting phase and there is no old vnode, we only need to pass null. This way, when the patch function is executed, it will recursively call the mountElement function to mount the child nodes.

  2. The third argument passed to the patch function is the mounting point. Since the child elements being mounted are child nodes of the div element, the div element created earlier serves as the mounting point to ensure that these child nodes are mounted in the correct position.

After mounting the child nodes, let’s look at how to describe the attributes of an element using vnode and how to render these attributes. We know that HTML elements have various attributes, some of which are common, such as id and class, while others are specific to certain elements, such as the action attribute for form elements. In this discussion, we will focus on the basic attribute handling.

To describe the attributes of an element, we need to define a new field in the virtual DOM called vnode.props, as shown in the following code:

const vnode = {
  type: 'div',
  props: {
    id: 'foo'
  },
  children: [
    {
      type: 'p',
      children: 'hello'
    }
  ]
};

vnode.props is an object where the keys represent the attribute names of the element, and the values represent the corresponding attribute values. This way, we can iterate through the props object and render these attributes onto the element, as shown in the following code:

function mountElement(vnode, container) {
  const el = createElement(vnode.type);
  // Skip children handling for now

  // Only handle `vnode.props` if it exists
  if (vnode.props) {
    // Iterate through `vnode.props` object
    for (const key in vnode.props) {
      // Use `setAttribute` to set attributes on the element
      el.setAttribute(key, vnode.props[key]);
    }
  }

  insert(el, container);
}

In this code snippet, we first check if vnode.props exists. If it does, we iterate through the vnode.props object and use the setAttribute function to set attributes on the element. This approach ensures that the attributes are rendered onto the element during the mounting process.

When dealing with attributes, it’s essential to understand the distinction between HTML Attributes and DOM Properties. HTML Attributes are the attributes defined directly in the HTML tag; for example, <input id="my-input" type="text" value="foo" /> has the attributes id="my-input", type="text", and value="foo". When the browser parses this HTML code, it creates a corresponding DOM element object, which we can access using JavaScript code:

const el = document.querySelector('#my-input');

Now, let’s talk about DOM Properties. Many HTML Attributes have corresponding DOM Properties on the DOM element object, such as id="my-input" corresponding to el.id, type="text" corresponding to el.type, and value="foo" corresponding to el.value. However, the names of DOM Properties don’t always exactly match HTML Attributes:

<div class="foo"></div>

In this case, class="foo" corresponds to the DOM Property el.className. Additionally, not all HTML Attributes have corresponding DOM Properties:

<div aria-valuenow="75"></div>

Attributes with the aria-* prefix do not have corresponding DOM Properties.

Similarly, not all DOM Properties have corresponding HTML Attributes. For example, you can use el.textContent to set the element’s text content, but there is no equivalent HTML Attribute for this operation.

The values of HTML Attributes and DOM Properties are related. For example, consider the following HTML snippet:

<div id="foo"></div>

This snippet defines a div element with an id attribute. The corresponding DOM Property is el.id, and its value is the string 'foo'. We consider this situation as a direct mapping, where the HTML Attribute and DOM Property have the same name (id in this case).

However, not all HTML Attributes and DOM Properties have a direct mapping relationship. For example:

<input value="foo" />

Here, the input element has a value attribute set to 'foo'. If the user does not modify the input field, accessing el.value would return the string 'foo'. If the user changes the input value to 'bar', accessing el.value would return 'bar'. But if you run the following code:

console.log(el.getAttribute('value')); // Still 'foo'
console.log(el.value); // 'bar'

You’ll notice that modifying the input value does not affect the return value of el.getAttribute('value'). This behavior indicates the meaning behind HTML Attributes. Essentially, HTML Attributes are used to set the initial value of corresponding DOM Properties. Once the value changes, the DOM Properties always store the current value, while getAttribute retrieves the initial value.

However, you can still access the initial value using el.defaultValue, as shown below:

el.getAttribute('value'); // Still 'foo'
el.value; // 'bar'
el.defaultValue; // 'foo'

This example illustrates that an HTML Attribute can be associated with multiple DOM Properties. In this case, value="foo" is related to both el.value and el.defaultValue.

Although HTML Attributes are considered as setting the initial values of corresponding DOM Properties, some values are restricted. It’s as if the browser internally checks for default value validity. If the value provided through HTML Attributes is invalid, the browser uses a built-in valid value for the corresponding DOM Properties. For example:

<input type="foo" />

We know that specifying the string 'foo' for the type attribute of the <input/> tag is invalid. Therefore, the browser corrects this invalid value. When you try to read el.type, you actually get the corrected value, which is 'text', not 'foo':

console.log(el.type); // 'text'

From the analysis above, we can see that the relationship between HTML Attributes and DOM Properties is complex. However, the core principle to remember is this: HTML Attributes are used to set the initial values of corresponding DOM Properties.

How to Properly Set Element Attributes

In the previous discussion, we explored the relationship between HTML Attributes and DOM Properties. In regular HTML files, the browser automatically parses HTML Attributes and sets the corresponding DOM Properties. However, when a template is compiled by Vue.js, the framework itself has to decide how to set these attributes on the rendered elements.

Firstly, let’s consider a disabled button as an example in plain HTML:

<button disabled>Button</button>

The browser automatically disables this button and sets its corresponding DOM Property el.disabled to true. However, if the same code appears in a Vue.js template, the behavior would be different.

In Vue.js, the template is compiled into a virtual node (vnode), and attribute values end up in vnode.props instead of being interpreted by the browser. For the plain <button disabled> template above, the compiled value of props.disabled is an empty string, and setAttribute handles that case correctly. The problem appears when the user explicitly binds a false value: if the renderer blindly uses setAttribute, the button still ends up disabled. For example, consider the following template:

<button :disabled="false">Button</button>

The corresponding virtual node is:

const button = {
  type: 'button',
  props: {
    disabled: false
  }
};

If the renderer uses the setAttribute function, the value false is converted to a string, which is equivalent to:

el.setAttribute('disabled', 'false');

However, disabled is a boolean attribute: the browser does not care about its specific value, only about whether the attribute exists on the element. So the button still becomes disabled, which is the opposite of what the template author intended. Therefore, renderers should not always use the setAttribute function to set attributes from the vnode.props object.

To solve this issue, a better approach is to prioritize setting the element’s DOM Properties. However, if the value is an empty string, manually correct it to true. Here is an implementation example:

function mountElement(vnode, container) {
  const el = createElement(vnode.type);

  if (vnode.props) {
    for (const key in vnode.props) {
      if (key in el) {
        const type = typeof el[key];
        const value = vnode.props[key];
        if (type === 'boolean' && value === '') {
          el[key] = true;
        } else {
          el[key] = value;
        }
      } else {
        el.setAttribute(key, vnode.props[key]);
      }
    }
  }

  insert(el, container);
}

In this code, we first check if the property exists on the DOM element. If it does, we determine the type of the property and the value from vnode.props. If the property is of boolean type and the value is an empty string, we correct it to true. If the property does not exist on the DOM element, we use the setAttribute function to set the attribute.

However, there are still issues with this implementation. Some DOM Properties are read-only, such as el.form. To address this problem, we can create a helper function, shouldSetAsProps, to determine whether an attribute should be set as DOM Properties. If the property is read-only or requires special handling, we should use the setAttribute function to set the attribute.
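
As a rough sketch, such a helper could look like the following; it only special-cases the read-only el.form property on <input> elements mentioned above, while a production renderer would handle more exceptions:

function shouldSetAsProps(el, key, value) {
  // <input form="..."> : el.form is read-only, so fall back to setAttribute
  if (key === 'form' && el.tagName === 'INPUT') return false;
  // Otherwise, prefer DOM Properties whenever the key exists on the element
  return key in el;
}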

Finally, to make the attribute setting operation platform-agnostic, we can extract the attribute-related operations into the renderer options. Here is the updated code:

const renderer = createRenderer({
  createElement(tag) {
    return document.createElement(tag);
  },
  setElementText(el, text) {
    el.textContent = text;
  },
  insert(el, parent, anchor = null) {
    parent.insertBefore(el, anchor);
  },
  patchProps(el, key, prevValue, nextValue) {
    if (shouldSetAsProps(el, key, nextValue)) {
      const type = typeof el[key];
      if (type === 'boolean' && nextValue === '') {
        el[key] = true;
      } else {
        el[key] = nextValue;
      }
    } else {
      el.setAttribute(key, nextValue);
    }
  }
});

In the mountElement function, we only need to call the patchProps function and pass the appropriate parameters. This way, we’ve extracted the attribute-related rendering logic from the core renderer, making it more maintainable and flexible.
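
For completeness, here is a sketch of what mountElement might look like once it delegates attribute handling to the patchProps option (assuming createElement, setElementText, insert, and patchProps are destructured from the renderer options, and patch is defined in the same scope):

function mountElement(vnode, container) {
  const el = createElement(vnode.type);

  // Mount children: a string becomes text content, an array is patched recursively
  if (typeof vnode.children === 'string') {
    setElementText(el, vnode.children);
  } else if (Array.isArray(vnode.children)) {
    vnode.children.forEach(child => patch(null, child, el));
  }

  // Delegate every prop to the platform-specific patchProps option
  if (vnode.props) {
    for (const key in vnode.props) {
      patchProps(el, key, null, vnode.props[key]);
    }
  }

  insert(el, container);
}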

Please note that the shouldSetAsProps function should be implemented according to your specific requirements and the DOM properties you want to handle differently.

Preface

In Vue.js, many functionalities rely on renderers to be implemented, such as Transition components, Teleport components, Suspense components, as well as template refs and custom directives.

Moreover, the renderer is the core of the framework’s performance, as its implementation directly affects the framework’s performance. Vue.js 3’s renderer not only includes the traditional Diff algorithm but also introduces a fast path update method, leveraging the information provided by the compiler, significantly improving update performance.

In Vue.js, the renderer is responsible for executing rendering tasks. On the browser platform, it renders the virtual DOM into real DOM elements. The renderer can render not only real DOM elements but also plays a key role in the framework’s cross-platform capabilities. When designing a renderer, its customizable capabilities need to be considered.

Basic Concepts and Meanings of Renderer

Before implementing a basic renderer, we need to understand a few fundamental concepts:


Renderer

The renderer is responsible for rendering virtual DOM (or virtual nodes) into real elements on a specific platform. On the browser platform, the renderer renders virtual DOM into real DOM elements.

Virtual DOM (vnode)

The virtual DOM (also known as virtual nodes, abbreviated as vnode) is a tree-like structure, similar to real DOM, consisting of various nodes. The renderer’s task is to render the virtual DOM into real DOM elements.

Mounting

Mounting refers to rendering the virtual DOM into real DOM elements and adding them to a specified mounting point. In Vue.js, the mounted lifecycle hook of a component is triggered when the mounting is completed, and it can access the real DOM element at this point.

Container

The container specifies the mounting position’s DOM element. The renderer renders the virtual DOM into real DOM elements and adds them to the specified container. In the renderer’s render function, a container parameter is usually passed in, indicating which DOM element the virtual DOM is mounted to.

Creation and Usage of Renderer

The renderer is usually created using the createRenderer function, which returns an object containing rendering and hydration functions. The hydration function is used in server-side rendering (SSR) to hydrate virtual DOM into existing real DOM elements. Here’s an example of creating and using a renderer:

function createRenderer() {
  function render(vnode, container) {
    // Render logic
  }

  function hydrate(vnode, container) {
    // Hydration logic
  }

  return {
    render,
    hydrate
  };
}

const { render, hydrate } = createRenderer();
// Initial rendering
render(vnode, document.querySelector('#app'));
// Server-side rendering
hydrate(vnode, document.querySelector('#app'));

In the above code, the createRenderer function creates a renderer object that contains the render and hydrate functions. The render function is used to render the virtual DOM into real DOM elements, while the hydrate function is used to hydrate the virtual DOM into existing real DOM elements.

Now that we have a basic understanding of the renderer, let’s dive deeper step by step.

The implementation of the renderer can be represented by the following function, where domString is the HTML string to be rendered, and container is the DOM element to mount to:

function renderer(domString, container) {
  container.innerHTML = domString;
}

Example usage of the renderer:

renderer('<h1>Hello</h1>', document.getElementById('app'));

In the above code, <h1>Hello</h1> is inserted into the DOM element with the id app. The renderer can not only render static strings but also dynamic HTML content:

let count = 1;
renderer(`<h1>${count}</h1>`, document.getElementById('app'));

If count is a reactive data, the reactivity system can automate the entire rendering process. First, define a reactive data count and then call the renderer function inside the side effect function to render:

const count = ref(1);

effect(() => {
  renderer(`<h1>${count.value}</h1>`, document.getElementById('app'));
});

count.value++;

In the above code, count is a ref reactive data. When modifying the value of count.value, the side effect function will be re-executed, triggering re-rendering. The final content rendered to the page is <h1>2</h1>.

Here, the reactive API provided by Vue 3’s @vue/reactivity package is used. It can be included in the HTML file using the <script> tag:

<script src="https://unpkg.com/@vue/reactivity@3.0.5/dist/reactivity.global.js"></script>

With the basic shape of the render function in mind, let’s analyze its execution flow in detail (a sketch of its implementation appears at the end of this discussion). Suppose we call the renderer.render function three times consecutively:

const renderer = createRenderer();

// Initial rendering
renderer.render(vnode1, document.querySelector('#app'));
// Second rendering
renderer.render(vnode2, document.querySelector('#app'));
// Third rendering
renderer.render(null, document.querySelector('#app'));

During the initial rendering, the renderer renders vnode1 into real DOM and stores vnode1 in the container element’s container.vnode property as the old vnode.

During the second rendering, the old vnode exists (container.vnode has a value), and the renderer takes vnode2 as the new vnode, passing both the new and old vnodes to the patch function to perform patching.

During the third rendering, the new vnode’s value is null, indicating that no content should be rendered. However, at this point the container already holds the content described by vnode2, so the renderer needs to clear the container. In the sketch below, container.innerHTML = '' is used to clear the container; note that clearing the container this way is not best practice and is used here only for demonstration purposes.

Regarding the patch function, it serves as the core entry point of the renderer. It takes three parameters: the old vnode n1, the new vnode n2, and the container container. During the initial rendering, the old vnode n1 is undefined, indicating a mounting action. The patch function not only serves for patching but can also handle mounting actions.
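
Putting this flow together, a minimal sketch of such a render function might look like the following (it stores the previous vnode on container.vnode and clears with innerHTML purely for demonstration, as noted above; a real renderer would unmount properly):

function render(vnode, container) {
  if (vnode) {
    // New vnode exists: patch against the old vnode (mount when there is no old vnode)
    patch(container.vnode, vnode, container);
  } else if (container.vnode) {
    // New vnode is null but old content exists: clear the container
    container.innerHTML = '';
  }
  // Remember the current vnode as the old vnode for the next render
  container.vnode = vnode;
}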

Custom Renderer

The implementation of a custom renderer involves abstracting the core rendering logic, making it independent of specific platform APIs. The following example demonstrates the implementation of a custom renderer using configuration options to achieve platform-independent rendering:

// Create a renderer function, accepting options as parameters
function createRenderer(options) {
  // Retrieve DOM manipulation APIs from options
  const { createElement, insert, setElementText } = options;

  // Define the function to mount elements
  function mountElement(vnode, container) {
    // Call createElement function to create an element
    const el = createElement(vnode.type);
    // If children are a string, use setElementText to set text content
    if (typeof vnode.children === 'string') {
      setElementText(el, vnode.children);
    }
    // Call insert function to insert the element into the container
    insert(el, container);
  }

  // Define the function to patch elements
  function patch(n1, n2, container) {
    // Implement patching logic, this part is omitted in the example
  }

  // Define the render function, accepting virtual nodes and a container as parameters
  function render(vnode, container) {
    // If the old virtual node exists, execute patching logic; otherwise, execute mounting logic
    if (container.vnode) {
      patch(container.vnode, vnode, container);
    } else {
      mountElement(vnode, container);
    }
    // Store the current virtual node in the container's vnode property
    container.vnode = vnode;
  }

  // Return the render function
  return render;
}

// Create configuration options for the custom renderer
const customRendererOptions = {
  // Function for creating elements
  createElement(tag) {
    console.log(`Creating element ${tag}`);
    // In a real application, you can return a custom object to simulate a DOM element
    return { type: tag };
  },
  // Function for setting an element's text content
  setElementText(el, text) {
    console.log(`Setting text content of ${JSON.stringify(el)}: ${text}`);
    // In a real application, set the object's text content
    el.textContent = text;
  },
  // Function for inserting an element under a given parent
  insert(el, parent, anchor = null) {
    console.log(`Adding ${JSON.stringify(el)} to ${JSON.stringify(parent)}`);
    // In a real application, insert el into parent
    parent.children = el;
  },
};

// Create a render function using the custom renderer's configuration options
const customRenderer = createRenderer(customRendererOptions);

// Create a virtual node describing <h1>hello</h1>
const vnode = {
  type: 'h1',
  children: 'hello',
};

// Use an object to simulate a mounting point
const container = { type: 'root' };

// Render the virtual node to the mounting point using the custom renderer
customRenderer(vnode, container);
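
Because every operation only logs and mutates plain objects, running this sketch in any JavaScript environment (no browser required) should print output along these lines:

Creating element h1
Setting text content of {"type":"h1"}: hello
Adding {"type":"h1","textContent":"hello"} to {"type":"root"}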

In the above code, we create a custom renderer using the createRenderer function, which takes configuration options as parameters. The configuration options include functions for creating elements, setting text content, and inserting elements into a parent. The customRenderer function takes a virtual node and a container as parameters, and it can handle both mounting and patching logic based on the existence of the old virtual node (container.vnode).

This custom renderer allows platform-independent rendering by abstracting the core logic and making it adaptable to different platforms through configuration options.

Please note that the code above demonstrates the concept of a custom renderer and focuses on its implementation logic. In a real-world scenario, you might need additional error handling, optimizations, and proper DOM manipulations based on the specific platform requirements.

Introduction

Before delving into JIT, it’s essential to have a basic understanding of the compilation process.

In compiler theory, translating source code into machine instructions generally involves several crucial steps: lexical analysis, syntax analysis (building an AST), semantic analysis, intermediate code generation and optimization, and finally target code generation.

JIT (Just-In-Time) Compiler

JIT Overview

JIT stands for Just-In-Time compiler. Through JIT technology, it’s possible to accelerate the execution speed of Java programs. But how is this achieved?

Java is an interpreted language (or, more precisely, a semi-compiled, semi-interpreted language). The javac compiler compiles the source code into platform-independent Java bytecode files (.class). These bytecode files are then interpreted and executed by the Java Virtual Machine (JVM), which ensures platform independence. However, interpreting bytecode means translating it into the corresponding machine instructions at runtime, which is inevitably slower than executing native machine code directly.

To enhance execution speed, JIT technology is introduced. When the JVM identifies a method or code block that is executed frequently, it marks it as “hot spot code.” The JIT compiler compiles these hot spots into native machine code for the current platform, optimizes it, and caches the compiled machine code for future use.


Hot Spot Compilation

When the JVM executes code, it doesn’t immediately start compiling it. There are two main reasons for this:

Firstly, if a piece of code is expected to be executed only once in the future, compiling it immediately is essentially a waste of resources, because interpreting the bytecode directly is much faster than first compiling it into machine code and then executing it once.

However, if a piece of code, such as a method or a loop body, is executed many times, compiling it becomes worthwhile. The JVM keeps track of which methods are called frequently so that compilation effort is spent where it pays off. The HotSpot VM uses JIT compilation to compile such high-frequency bytecode directly into machine instructions (with the method as the compilation unit); once compiled, the machine code is executed directly on subsequent calls, providing a performance boost.

The second reason involves optimization. As a method or loop is executed more frequently, the JVM gains a better understanding of the code structure. Therefore, the JVM can make corresponding optimizations during the compilation process.

How JavaScript is Compiled - How JIT (Just-In-Time) Compiler Works

In general, there are two ways to translate programs into machine-executable instructions: using a Compiler or an Interpreter.

Interpreter

An interpreter translates and executes code line by line as it encounters it.

Pros:

  • Quick to get code running; there is no upfront compilation delay.

Cons:

  • The same code may be translated over and over again, especially inside loops.


Compiler

A compiler translates the code in advance and generates an executable program.

Pros:

  • No need for repeated translation; the compiler can optimize the code during compilation.

Cons:

  • Requires upfront compilation before the program can run.


JIT

When JavaScript first emerged, it was a typical interpreted language, resulting in slow execution speeds. Later, browsers introduced JIT compilers, significantly improving JavaScript’s execution speed.

Principle: browsers added a new component to the JavaScript engine, known as a monitor (or profiler). The monitor observes the running code, recording how many times each piece of code runs and which variable types it is used with.

Now, why does this approach speed up the execution?

Let’s consider a function for illustration:

function arraySum(arr) {
  var sum = 0;
  for (var i = 0; i < arr.length; i++) {
    sum += arr[i];
  }
  return sum;
}

1st Step - Interpreter

Initially, the code is executed using an interpreter. When a line of code is executed several times, it is marked as “Warm,” and if executed frequently, it is labeled as “Hot.”

2nd Step - Baseline Compiler

Warm-labeled code is passed to the Baseline Compiler, which compiles and stores it. The compiled code is indexed based on line numbers and variable types (why variable types are important will be explained shortly).

When the index matches, the corresponding compiled code is directly executed without recompilation, eliminating the need to recompile already compiled code.

3rd Step - Optimizing Compiler

Hot-labeled code is sent to the Optimizing Compiler, where further optimizations are applied. How are these optimizations performed? This is the key: due to JavaScript’s dynamic typing, a single line of code can have multiple possible compilations, exposing the drawback of dynamic typing.

For instance:

  • sum is Int, arr is Array, i is Int; the + operation is simple addition, corresponding to one compilation result.
  • sum is string, arr is Array, i is Int; the + operation is string concatenation, requiring the conversion of i to a string type.

As illustrated in the diagram below, such a simple line of code has 16 possible compilation results.

(Figure: the 16 possible compiled versions of sum += arr[i], depending on the operand types)

The Baseline Compiler handles this complexity, and thus, the compiled code needs to be indexed using both line numbers and variable types. Different variable types lead to different compilation results.

If the code is “Warm,” the JIT’s job ends here. Each subsequent execution involves type checks and uses the corresponding compiled result.

However, when the code becomes “Hot,” more optimizations are performed. Here, optimization means JIT makes a specific assumption. For example, assuming sum and i are both Integers and arr is an Array, only one compilation result is needed.

In practice, type checks are performed before execution. If the assumptions are incorrect, the execution is “deoptimized,” reverting to the interpreter or baseline compiler versions. This process is called “deoptimization.”

As evident, the speed of execution relies on the accuracy of these assumptions. If the assumption success rate is high, the code executes faster. Conversely, low success rates lead to slower execution than without any optimization (due to the optimize => deoptimize process).
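
As a rough illustration (real engine heuristics differ, and the call pattern below is purely conceptual), consider the arraySum function from the earlier example:

// Calls with consistent element types: the optimizing compiler can assume Ints
arraySum([1, 2, 3]);
arraySum([4, 5, 6]);

// This call breaks the "elements are Ints" assumption, so the engine bails out
// of the optimized machine code (deoptimization) and falls back to the
// baseline-compiled or interpreted version
arraySum(['a', 'b', 'c']);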

Conclusion

In summary, this is what JIT does at runtime. It monitors running code, identifies hot code paths for optimization, making JavaScript run faster. This significantly improves the performance of most JavaScript applications.

However, JavaScript performance remains somewhat unpredictable. To make code faster, the JIT adds overhead at runtime, including:

  • Optimization and deoptimization work
  • Memory for the monitor’s bookkeeping and for the recovery information needed when bailing out
  • Memory for storing the baseline and optimized versions of a function

There’s room for improvement, notably in eliminating overhead to make performance more predictable.