Skip to main content

SQLite Frontend Internal Architecture


In a computer program, the database is the heart of handling data. It provides an efficient way of storing and retrieving data fast. Learning Database internal architecture in code level is not that so easy since Database has complex architecture. Looking into a complex database is not that easy because of the complexity. But, SQLite is a nice little compact database that used in millions of devices and operating systems. Interesting fact about the SQLite database is that, code is highly readable. All most all of the codes are well commented and architecture is highly layered. This post is to look at the high-level architecture of the SQLite database.
As I early mentioned, SQLite has a highly layered architecture and that can be separated into seven layers. We will go through each of the layers one by one.
1_t5ygkzu6a28fO41ynZ8kgQ.jpeg

Tokenizer

First Layer is called Tokenizer. Purpose of the tokenizer is to create tokens for a given SQL query. Tokenizer scans SQL query from left to right and generates a list of tokens. Following are some of the types of tokens used in SQL tokenizer.
  1. Operator tokens Operator token used to tokenized arithmetic operators in SQL queries. Here are some of the example set of Operator tokens.
  2. "-" as token MINUS
  3. "(" as token LP
  4. "+" as token PLUS
  5. Keyword tokens Keyword token used to tokenized the keywords used in SQL queries. ADD, SELECT, ROW, INSERT are examples for Keyword tokens.
  6. Whitespace tokens This token used to separate other tokens. Here comments are also identified as white space tokens

Parser

Paser used to generate a parse tree by reading a stream of tokens. SQLite uses Lemon parser to parse SQL queries. Lemon parser allows SQLite C compiler to generate relevant C code from given sets of language rules. Language rules are defined by using Context Free Grammer(CFG). Example parser tree for the following SQL statement as follows.
SELECT title FROM StartsIn, MovieStar WHERE starName="Jack" AND birthday like "%1990"slide_6.jpg
Here below an example, CFG rules to generate a program to evaluate a simple mathematical expression.
   1  program ::= expr(A).   { std::cout << "Result=" << A << std::endl; }
   2
   3  expr(A) ::= expr(B) MINUS  expr(C).   { A = B - C; }
   4  expr(A) ::= expr(B) PLUS  expr(C).   { A = B + C; }
   5  expr(A) ::= expr(B) TIMES  expr(C).   { A = B * C; }
   6  expr(A) ::= expr(B) DIVIDE expr(C).  {

   7           if(C != 0){
   8             A = B / C;
   9            }else{
   10             std::cout << "divide by zero" << std::endl;
   11             }
   12  }  /* end of DIVIDE */

  13  expr(A) ::= INTEGER(B). { A = B; }
Here MINUS, PLUS, TIMES and DIVIDE are the operator token generated by the tokenizer. In this code from line number 3 to 6 defined the CFG for mathematical operation. Lemon parser takes the precedence of these to operate as same as the given order. Think the given expression to evaluate is "3+5*6/3". Tokenized output will be
3 PLUSE 5 TIMES 6 DIVIDE 3
According to the precedence, Lemon parser evaluate line #6 first and evaluate 6/3. Then it evaluates TIMES operation in line #5 and the result would be 10. Finally, it evaluates line #4 and generates the final result as 13. This result will be printed in line #1 which has the least precedence.
All of the SQL rules are defined in the same kind of structure as I described in the previous example.

Code Generator and Virtual Machine

While SQL query parsing, SQLite generate a sequence of instruction that needs to runs in order to generate the result. Each of these instructions has some operation to perform and inputs. It same as calling to a function with arguments. These generated codes are running on top of the Virtual machine. The virtual machine read each of this instruction one by one and evaluate it. To understand what are the available operations and how Virtual Machine works, we need to have a good understanding of Btree.

Btree

Btree is a data structure used to store data as a tree based on its value. The simplest form of the BTree is the Binary tree. Databases use a B-tree data structure to store indexes to improve the performance of the database. Data records are stored in a B+tree structure. If no indexing use, only B+tree used to store the data. A cursor is a special pointer which used to point a record( or row) which given with page id and offset. You can travel through all data records by using the cursor. To find the data in given Btree which used indexing, SQLite start move cursor to the root node first element and perform a B-tree search. If there are no Btree indexing available, SQLite searches the matching record in the B+tree by using linear search. You can find out more details about Btree in this article.
To search for data on a B+tree, the sequence of operation that generated by Code Generator as follows. You can check what was the sequence of instruction that needs to run for a particular SQL query by adding "explain" keyword at the beginning of the query.
explain select * from foo where value = "value";
0 |ColumnCount   |1 |0 |
1 |ColumnName    |0 |0 |value
2 |Open          |0 |3 |foo     // Open cursor and point it to the third page which is root page for foo table(p3 is not important )
3 |VerifyCookie  |46|0 |       // Make sure schema not changed
4 |Rewind        |0 |11|       // Go to first entry
5 |Column        |0 |0 |       // read data and push it on stack
6 |Column        |0 |0 |
7 |Ne            |1 |10|       //Pop the top two elements from the stack.  If they are not equal, then jump to instruction P2.  Otherwise, continue to the next instruction.
8 |Column        |0 |0 |
9 |Callback      |1 |0 |       // Pop P1 values off the stack and form them into an array

10|Next          |0 |5 |       // Move cursor to next record, if data exit goto P2 else go to next line
11|Close         |0 |0 |       // Close the cursor

Pager

Pages are the smallest unit of a transaction on the file system. Btree and B+tree stored on the page. When the database needs to read data from a file, it requests it as a page. Once the page loaded into the database engine it can store page if it accesses frequently on its cache. Pages are numbered and start from one. The first page is called the root page. Size of a page is constant.

VFS (Virtual File System)

File access on Unix and Windows are different from each other. VFS provides common API to access files without considering the type of the operating system it runs on. This API includes functions to open, read, write and close a file. Here follows are some of the API used in VFS to read, write data into a file.

Conclusion

In this post, I tried to explain the SQLite database implementation at a high level. If you are interested to know more about the internal architecture of SQLite, I like to share my knowledge with you. So, comment down if there is any clarification/ improvement that needs to add. Also here my backend implementation that used to understand the SQLite architecture. In this repository, I tried to separate each of the layers in SQLite and write some test for each of the layers. You can find out SQLite 2.5 source code also that used to implement this project over here.

Comments

  1. Having the scored in India discharges that fabricate from captains likewise arranges it in the crisiss of lecturers who drive altogether inquiry American idioms or cultural materials. business online advertising

    ReplyDelete

Post a Comment

Popular posts from this blog

Database Internel Architecture: SQLite

Introduction A database is an essential part of building a software system which used to store and read data efficiently. Here, We are going to discuss some architectural details of database implementation by using an early version of SQLite. SQLite is a small database application which used in millions of software and devices. SQLite invented by D.Richard Hipp in August 2000. SQLite is a high performance, lightweight relational database. If you are willing to learn internal of a database in coding level, then SQLite is the best open source database available out there with highly readable source code with lots of documentation. Reading later versions of SQLite become a little harder since it contains lots of new features. In order to understand the basic implementation of database internals, You should have good knowledge about data structures, some knowledge about Theory of Computing and how an operating system works. Here we are looking into the SQLite 2.5.0 version. Here

Weird Programming Languages

There are thousands of programming languages are invented and only about hundred of programming languages are commonly used to build software. Among this thousands of programming languages, there are some weird type of programming languages can be also found. These programming languages are seems to be called weird, since their programming syntax and the way it represent its code. In this blog we will look into some of these language syntax. Legit Have you ever wonder, when you come to a Github project that print hello world program, but you cannot see any codes or any content. Check this link  https://github.com/blinry/legit-hello  and you will see nothing in this repository. But trust me, there is hidden code in this project. If you see the  commit  section, you can reveal the magic. Yeah, you are right. Its storing hello world code in inside the git commit history. If you clone this project and run the following command, then you can see the hidden code in this project. g

Basic Concepts of the Kubernetes

Handling large software which has multiple services is a tedious, time-consuming task for DevOps engineer. Microservices comes into the rescue DevOps engineers from all these complicated deployment processes. Simply, each microservice in the system has it own responsibility to handle one specific task. The container can be used to deploy each of these micro-tasks as a unit of service. If you are not that familiar with Containers, read this article to get to know about Docker, Which is the most popular and widely used container technology to deploy microservices. As I described early, we can use single container to deploy a single service and container contain all required configurations and dependencies. Single service always faces a common problem of a single point of failure. In order to avoid single point failure, we need to set up another service such that if one service is getting down, next available service takes that load and continue to provide the service. Another requi