Skip to content

project2 : Disk Based B Tree

Sonny edited this page Feb 25, 2022 · 1 revision

Analyze the given in-memory B+ tree

1. Possible call path of the insert/delete operation

Insert

If instruction is i in main(), it calls insert() in bpt.cc to insert input in db_file.

        case 'i':
            scanf("%d", &input);
            root = insert(root, input, input);

insert() performs various operations according to the situation of the tree.

  1. There is already a key to insert in tree.
if (find(root, key, false) != NULL)
        return root;

Because the current implementation ignores duplicates, it just return root node.

  1. The tree does not exist yet.
    if (root == NULL) 
        return start_new_tree(key, pointer);

Then, it makes new tree by using the given key.

  1. The tree already exists and leaf node has room for key and pointer.
    if (leaf->num_keys < order - 1) {
        leaf = insert_into_leaf(leaf, key, pointer);
        return root;
    }

Find the appropriate leaf node into which the given key will be inserted, and insert the key and pointer in appropriate position if there is space in the leaf node.

  1. The tree already exists but leaf node doesn't have any room for inserted key and pointer.
    return insert_into_leaf_after_splitting(root, leaf, key, pointer);

Then, firstly split the node and insert the key and pointer. Let's take a closer look at this insert_into_leaf_after_splitting().

node * insert_into_leaf_after_splitting(node * root, node * leaf, int key, record * pointer)

It creates a temporary array for keys and pointers to arrange existing keys and newly inserted keys in ascending order, and align pointers in that order.

    temp_keys = (int *)malloc( order * sizeof(int) );
    if (temp_keys == NULL) {
        perror("Temporary keys array.");
        exit(EXIT_FAILURE);
    }

    temp_pointers = (void **)malloc( order * sizeof(void *) );
    if (temp_pointers == NULL) {
        perror("Temporary pointers array.");
        exit(EXIT_FAILURE);
    }

    insertion_index = 0;
    while (insertion_index < order - 1 && leaf->keys[insertion_index] < key)
        insertion_index++;

    for (i = 0, j = 0; i < leaf->num_keys; i++, j++) {
        if (j == insertion_index) j++;
        temp_keys[j] = leaf->keys[i];
        temp_pointers[j] = leaf->pointers[i];
    }

    temp_keys[insertion_index] = key;
    temp_pointers[insertion_index] = pointer;

Then divide the temporarily created arrays in half, and the leaf_node has the left half and the new_leaf node has the right half.

    split = cut(order - 1);

    for (i = 0; i < split; i++) {
        leaf->pointers[i] = temp_pointers[i];
        leaf->keys[i] = temp_keys[i];
        leaf->num_keys++;
    }

    for (i = split, j = 0; i < order; i++, j++) {
        new_leaf->pointers[j] = temp_pointers[i];
        new_leaf->keys[j] = temp_keys[i];
        new_leaf->num_keys++;
    }

The temporarily dynamic allocation arrays free. And connection between leaf nodes should be updated.

    free(temp_pointers);
    free(temp_keys);

    new_leaf->pointers[order - 1] = leaf->pointers[order - 1];
    leaf->pointers[order - 1] = new_leaf;

After that, establish the parent relationship of the new leaf node and insert the first key value of the new leaf node into the parent node.

    new_leaf->parent = leaf->parent;
    new_key = new_leaf->keys[0];

    return insert_into_parent(root, leaf, new_key, new_leaf);

Let's take a closer look at this insert_into_parent().

node * insert_into_parent(node * root, node * left, int key, node * right)

If left node is root ( parent node is NULL ), it calls insert_into_new_root() which creates a new root for two subtrees and inserts the appropriate key into the new root.

    node * parent;
    parent = left->parent;

    if (parent == NULL)
        return insert_into_new_root(left, key, right);

Otherwise, it finds the parent's pointer to the left node.
If there is a space in the parent node where the key is to be inserted, the key is inserted by calling insert_into_node().

    left_index = get_left_index(parent, left);
    if (parent->num_keys < order - 1)
        return insert_into_node(root, parent, left_index, key, right);

Unless there is a space in parent node, split the node and insert key.

    return insert_into_node_after_splitting(root, parent, left_index, key, right);

Then, Let's take a closer look at this insert_into_node_after_splitting().

node * insert_into_node_after_splitting(node * root, node * old_node, int left_index, 
        int key, node * right)

It creates a temporary set of keys and pointers to hold everything in order, including the new key and pointer, inserted in their correct places. Then create a new node and copy half of the keys and pointers to the old node and the other half to the new.

    split = cut(order);
    new_node = make_node();
    old_node->num_keys = 0;
    for (i = 0; i < split - 1; i++) {
        old_node->pointers[i] = temp_pointers[i];
        old_node->keys[i] = temp_keys[i];
        old_node->num_keys++;
    }
    old_node->pointers[i] = temp_pointers[i];
    k_prime = temp_keys[split - 1];
    for (++i, j = 0; i < order; i++, j++) {
        new_node->pointers[j] = temp_pointers[i];
        new_node->keys[j] = temp_keys[i];
        new_node->num_keys++;
    }
    new_node->pointers[j] = temp_pointers[i];
    free(temp_pointers);
    free(temp_keys);
    new_node->parent = old_node->parent;
    for (i = 0; i <= new_node->num_keys; i++) {
        child = (node *)new_node->pointers[i];
        child->parent = new_node;
    }

Then, it inserts a new key into the parent of the two nodes resulting from the split, with the old node to the left and new to the right.

    return insert_into_parent(root, old_node, k_prime, new_node);

Delete

If instruction is d in main(), it calls db_delete() in bpt.cc to delete input in db_file.

        case 'd':
            scanf("%d", &input);
            root = db_delete(root, input);

The db_delete() finds key_record and key_leaf by using input key.
If key_record and key_leaf are not NULL, it calls delete_entry()

    key_record = find(root, key, false);
    key_leaf = find_leaf(root, key, false);
    if (key_record != NULL && key_leaf != NULL) {
        root = delete_entry(root, key_leaf, key, key_record);
        free(key_record);
    }

The delete_entry() firstly removes key and pointer in node by using remove_entry_from_node().

    n = remove_entry_from_node(n, key, (node *)pointer);

Then, the delete operation has four main types of call paths.

  1. The node where the key was deleted is a root node.
    if (n == root) 
        return adjust_root(root);

If then, it calls adjust_root().
adjust_root() firstly checks if the root node is empty.
If root is not empty, there is nothing to do. So it just return root.
Otherwise, check if root node is leaf_node. If root node is not leaf_node and has a child, promote the first child as the new root.
But if root node is empty and leaf_node, it means the whole tree is empty. so it returns NULL.

  1. After deleting key, the node complies with the occupancy invariant. ( node stays at or above minimum )
    if (n->num_keys >= min_keys)
        return root;

If then, it just returns root.

  1. In the case where the occupancy invariant cannot follow ( node falls below minimum ), but merge operation is possible.
    if (neighbor->num_keys + n->num_keys < capacity)
        return coalesce_nodes(root, n, neighbor, neighbor_index, k_prime);

If the sum of the number of keys in the neighbor node and the number of keys in the current node is less than the capacity, it calls coalesce_nodes(). Because delete_entry() occurs in coalesce_nodes() to modify parent_node, delete_entry() currently being described may be executed recursively.

node * coalesce_nodes(node * root, node * n, node * neighbor, int neighbor_index, int k_prime) {
    ...
    root = delete_entry(root, n->parent, k_prime, n);
    ...
}
  1. In the case where the occupancy invariant cannot follow ( node falls below minimum ), and merge is impossible.
    else
        return redistribute_nodes(root, n, neighbor, neighbor_index, k_prime_index, k_prime)

If then, it calls redistribute_nodes() to follow occupancy invariant.

2. Detail flow of the structure modification

Split

If he node with the inserted key has the number of key more than (or equal) order, the node should be divided.
There are two types of splitting.

1. insert_into_node_after_splitting

If an internal node has more keys than the occupancy invariant, the internal node should be divided.

node * insert_into_node_after_splitting(node * root, node * old_node, int left_index, int key, node * right)

First place a set of keys and pointers to hold everything in order, including the new key and pointer.

    for (i = 0, j = 0; i < old_node->num_keys + 1; i++, j++) {
        if (j == left_index + 1) j++;
        temp_pointers[j] = (node *)old_node->pointers[i];
    }

    for (i = 0, j = 0; i < old_node->num_keys; i++, j++) {
        if (j == left_index) j++;
        temp_keys[j] = old_node->keys[i];
    }

    temp_pointers[left_index + 1] = right;
    temp_keys[left_index] = key;

Then create new node and copy half the keys and pointers to the old and half to the new.

    split = cut(order);
    new_node = make_node();
    old_node->num_keys = 0;
    for (i = 0; i < split - 1; i++) {
        old_node->pointers[i] = temp_pointers[i];
        old_node->keys[i] = temp_keys[i];
        old_node->num_keys++;
    }
    old_node->pointers[i] = temp_pointers[i];
    k_prime = temp_keys[split - 1];
    for (++i, j = 0; i < order; i++, j++) {
        new_node->pointers[j] = temp_pointers[i];
        new_node->keys[j] = temp_keys[i];
        new_node->num_keys++;
    }
    new_node->pointers[j] = temp_pointers[i];

Updates the value of the parent nodes of the child nodes of the existing nodes.

    for (i = 0; i <= new_node->num_keys; i++) {
        child = (node *)new_node->pointers[i];
        child->parent = new_node;
    }

Insert a new key into the parent of the two nodes resulting from the split, with the old node to the left and the new to the right.

    return insert_into_parent(root, old_node, k_prime, new_node);

2. insert_into_leaf_after_splitting

If a leaf node has more keys than the occupancy invariant, the leaf node should be divided.

node * insert_into_leaf_after_splitting(node * root, node * leaf, int key, record * pointer)

Before splitting, place all keys and pointers in order, including the keys inserted in the array temporarily.

    insertion_index = 0;
    while (insertion_index < order - 1 && leaf->keys[insertion_index] < key)
        insertion_index++;

    for (i = 0, j = 0; i < leaf->num_keys; i++, j++) {
        if (j == insertion_index) j++;
        temp_keys[j] = leaf->keys[i];
        temp_pointers[j] = leaf->pointers[i];
    }

    temp_keys[insertion_index] = key;
    temp_pointers[insertion_index] = pointer;

And divide the keys and pointers temporarily arranged in the existing leaf node and the new leaf node into half.

    split = cut(order - 1);

    for (i = 0; i < split; i++) {
        leaf->pointers[i] = temp_pointers[i];
        leaf->keys[i] = temp_keys[i];
        leaf->num_keys++;
    }

    for (i = split, j = 0; i < order; i++, j++) {
        new_leaf->pointers[j] = temp_pointers[i];
        new_leaf->keys[j] = temp_keys[i];
        new_leaf->num_keys++;
    }

Since these are leaf nodes, modify the pointer pointing to the next leaf node to suit its location. The new leaf node is located on the right side of the existing leaf node.

    new_leaf->pointers[order - 1] = leaf->pointers[order - 1];
    leaf->pointers[order - 1] = new_leaf;

Determine the parent node of the new leaf node and insert the initial key value of the new leaf node into the parent node.

    new_leaf->parent = leaf->parent;
    new_key = new_leaf->keys[0];

    return insert_into_parent(root, leaf, new_key, new_leaf);

Merge

If the node with the deleted key after deleting the entry is not the root node, but the node fails to keep the occupancy invariant and the sum of the number of keys of the node and the number of keys of the neighbor node is less than the capacity, merge operation occurs.

Coalesces a node that has become too small after deleting with a left neighboring node that can accept the additional entries without exceeding the maximum.

Swap neighbor with node if node is on the extreme left and neighbor is to its right.

    if (neighbor_index == -1) {
        tmp = n;
        n = neighbor;
        neighbor = tmp;
    }

The merge operation varies slightly depending on whether the node to merge is a leaf node or not.

  • Case : not leaf node
    Append k_prime and the following pointer.
    Append all pointers and keys from the neighbor.
    if (!n->is_leaf) {

        /* Append k_prime.
         */

        neighbor->keys[neighbor_insertion_index] = k_prime;
        neighbor->num_keys++;


        n_end = n->num_keys;

        for (i = neighbor_insertion_index + 1, j = 0; j < n_end; i++, j++) {
            neighbor->keys[i] = n->keys[j];
            neighbor->pointers[i] = n->pointers[j];
            neighbor->num_keys++;
            n->num_keys--;
        }

        /* The number of pointers is always
         * one more than the number of keys.
         */

        neighbor->pointers[i] = n->pointers[j];

        /* All children must now point up to the same parent.
         */

        for (i = 0; i < neighbor->num_keys + 1; i++) {
            tmp = (node *)neighbor->pointers[i];
            tmp->parent = neighbor;
        }
    }
  • Case : leaf node
    Append the keys and pointers of n to the neighbor.
    Set the neighbor's last pointer to point to what had been n's right neighbor.
    else {
        for (i = neighbor_insertion_index, j = 0; j < n->num_keys; i++, j++) {
            neighbor->keys[i] = n->keys[j];
            neighbor->pointers[i] = n->pointers[j];
            neighbor->num_keys++;
        }
        neighbor->pointers[order - 1] = n->pointers[order - 1];
    }

Finally, modify the parent node and clean up the node n.

    root = delete_entry(root, n->parent, k_prime, n);
    free(n->keys);
    free(n->pointers);
    free(n); 

3.(Naïve) designs or required changes for building on-disk b+ tree

Unlike the in-memory b+ tree, the disk-based b+ tree uses page, not node.
For example, in order to utilize the root, you must access the header page, find the root page number, call the disk manager, and go through file read. Likewise, the page number is used instead of the pointer used in the in-memory b+ tree, so file I/O is required to access the page to find out and modify the content.
There are various pages such as the header page as well as the leaf page internal page. And each contains different contents, so each structure is made. And memory access is needed to deal with the variable length of the record contained in the leaf page.
And when you split or merge/redistribute the leaf node, you should proceed based on the size of the free page, not the number of keys on the page.

Analyze my on-disk B+ tree

0. Overview

Information on the internal page was fixed in size, similar to the in-memory b+ tree, so it was easily accessible using the Internal_page structure including the array. However, information on the leaf page was accessed using the Slot structure and memcpy() because the size of the value was variable. And for offset calculation and convenience of memory access, a char-type array (for example, delivery) with a unit offset size of 1 byte was created and used.

1. db_find()

A function of finding a record containing an input key.

Find the (possible) leaf page containing the key through db_find_leaf() and return -1 if not found.

	pagenum_t leaf_pagenum = db_find_leaf(table_id, key); 
	if ( leaf_pagenum == -1 ) return -1;

Once found, read the page from disk and search each slot in the leaf node to get information about the key and value.

	file_read_page(table_id, leaf_pagenum, (page_t*)leaf_page);

	char delivery[PAGE_SIZE] = {};
	memcpy(delivery, leaf_page, PAGE_SIZE);

	for ( i = 0; i < leaf_page->numOfkeys; i++ )
	{
		memcpy( &slot, delivery+(SLOT_STARTING_POINT+SLOT_SIZE*i), SLOT_SIZE);
		if ( slot.key == key ) break;
	}

	if ( i == leaf_page->numOfkeys ) return -1;

	memcpy(ret_val, delivery+slot.offset, slot.size);
	memcpy(val_size,delivery+(SLOT_STARTING_POINT+SLOT_SIZE*i)+sizeof(uint64_t), sizeof(uint16_t));

2. db_insert()

The current design does not allow duplicate keys, so if there is a page containing the entered key, return -1.

	if (db_find(table_id, key, ret_val, size) == 0) 
		return -1;

If there is no key in the tree, first check the presence or absence of the root using the header page, and if not, call db_start_new_tree().

	// No such key in page.
	file_read_page(table_id, 0, (page_t*)header_page);;
	pagenum_t root_pagenum = header_page->root_pagenum;

	
	// Case : the tree does not exist yet.
	if (root_pagenum == 0 )
		return db_start_new_tree(table_id, key, value, val_size);

If the tree exists, find the leaf page where the entered record will enter, and if the page has enough space, call db_insert_into_leaf() if not db_insert_into_leaf_splitting().

	pagenum_t leaf_pagenum = db_find_leaf(table_id, key);
	
	Leaf_page* leaf_page = (Leaf_page*)malloc(sizeof(Leaf_page));
	file_read_page(table_id, leaf_pagenum, (page_t*)leaf_page);

	if ( leaf_page->free_space_amount >= SLOT_SIZE + val_size )
	{
		free(leaf_page);
		return db_insert_into_leaf(table_id, leaf_pagenum, key, value, val_size);
	}

	free(leaf_page);
	return db_insert_into_leaf_after_splitting(table_id, leaf_pagenum, key, value, val_size);

After the leaf split, insert the first key value of the new page into the parent page. If there is no parent page, create a new root page, insert the key in the appropriate position so that the key is ascending, and if there is no space to put the key, call db_insert_into_node_after_splitting() to split again.

3. db_delete()

First, check if the key is in the tree through db_find(), call db_delete_entry() if the key is in the tree, and return -1 if not, and terminate the function.

int db_delete (int64_t table_id, int64_t key)
{
	void* dummy = (void*)malloc(sizeof(uint16_t));
	int flag = db_find(table_id, key, (char*)dummy, (uint16_t*)dummy);
	if ( flag == 0 )  
	{
		int key_leaf_pagenum = db_find_leaf(table_id, key);
		free(dummy);
		return db_delete_entry(table_id, key_leaf_pagenum, key);
	}
	free(dummy);
	return -1;
}

In db_delete_entry, first call db_remove_entry_from_node() to delete entry. And if the page where entry was deleted was root, call db_adjust_root() to update the root page.

If it was a leaf page, not a root page, proceed with merge or redistribution if the free space amount exceeds 2500 after entry deletion. If you have enough free space on your neighbor's page to merge the current page with the neighbor's page, call merge_leaf_page(), otherwise call redistribute_leaf_page().

If the page where entry was deleted was an internal page, not a root page or a leaf page, compare the minimum number of keys with the number of keys on the current page and proceed with merge or redistribution if it is less than that. Call merge_internal_pages() only if it is less than capacity by adding the number of keys in the current internal page and the number of keys in the neighboring page, otherwise call redistribute_internal_pages().

Also, after merge, you have to go through modifying the parent page through db_delete_entry().