postgresql kernel source code analysis btree indexes add, delete and check the code basic principles of flow analysis, the reason for the index expansion is here!

Time:2023-9-14

B-Tree index code flow analysis

Column content

open source contribution

Personal homepageMy Home Page
Managing the communityOpen source databases
Motto: “The sky is healthy, the gentleman is self-improvement; the earth is Kun, the gentleman is virtuous.

summarize

The most commonly used index in postgresql is btree, which supports range and equal value queries.

This article mainly introduces the btree code entry, interface definition, mainly involved in index query, insertion, deletion, and data cleaning operations.

preamble

Indexes are used to find data in the actual data table faster, then the index key values are very small and a large amount of indexed data can be read from disk at once.
But some indexes store actual data in the value, which is one-to-one correspondence with the data, that is dense indexes, while there are some indexes that do not store actual data, but store the maximum and minimum values in the range, this type of index is called sparse indexes;

For dense indexes, such as primary keys, it is straightforward to get the corresponding data position or the corresponding column, and the btree algorithm supports this type of index;
Sparse indexes, on the other hand, after checking the index, you need to traverse the data table again, or the secondary index to hit the target data.

code entry

Structures for indexing operations are defined in postgresql for the purpose of code decoupling, with the base member being a uniform set of operations and identification options;
For the definition of btree is as follows, you can find the name of the operation interface of the btree index here, in the actual practical just call the members of the structure, that is, the function pointer.

/*
 * Btree handler function: return IndexAmRoutine with access method parameters
 * and callbacks.
 */
Datum
bthandler(PG_FUNCTION_ARGS)
{
	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);

	amroutine->amstrategies = BTMaxStrategyNumber;
	amroutine->amsupport = BTNProcs;
	amroutine->amoptsprocnum = BTOPTIONS_PROC;
	amroutine->amcanorder = true;
	amroutine->amcanorderbyop = false;
	amroutine->amcanbackward = true;
	amroutine->amcanunique = true;
	amroutine->amcanmulticol = true;
	amroutine->amoptionalkey = true;
	amroutine->amsearcharray = true;
	amroutine->amsearchnulls = true;
	amroutine->amstorage = false;
	amroutine->amclusterable = true;
	amroutine->ampredlocks = true;
	amroutine->amcanparallel = true;
	amroutine->amcaninclude = true;
	amroutine->amusemaintenanceworkmem = false;
	amroutine->amsummarizing = false;
	amroutine->amparallelvacuumoptions =
		VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;
	amroutine->amkeytype = InvalidOid;

	amroutine->ambuild = btbuild;
	amroutine->ambuildempty = btbuildempty;
	amroutine->aminsert = btinsert;
	amroutine->ambulkdelete = btbulkdelete;
	amroutine->amvacuumcleanup = btvacuumcleanup;
	amroutine->amcanreturn = btcanreturn;
	amroutine->amcostestimate = btcostestimate;
	amroutine->amoptions = btoptions;
	amroutine->amproperty = btproperty;
	amroutine->ambuildphasename = btbuildphasename;
	amroutine->amvalidate = btvalidate;
	amroutine->amadjustmembers = btadjustmembers;
	amroutine->ambeginscan = btbeginscan;
	amroutine->amrescan = btrescan;
	amroutine->amgettuple = btgettuple;
	amroutine->amgetbitmap = btgetbitmap;
	amroutine->amendscan = btendscan;
	amroutine->ammarkpos = btmarkpos;
	amroutine->amrestrpos = btrestrpos;
	amroutine->amestimateparallelscan = btestimateparallelscan;
	amroutine->aminitparallelscan = btinitparallelscan;
	amroutine->amparallelrescan = btparallelrescan;

	PG_RETURN_POINTER(amroutine);
}

We first look at the basic operations of indexing, querying btgettuple, inserting btinsert and deleting.

index lookup

Call stack for indexed queries

  • ExecIndexScan

There will be nodes in the execution plan for indexed queries, such as ExecIndexScan, which initiates an indexed query to find the tuple of the data table through the index.

  • -> IndexNext

Returns the tuple of the table, with a secondary lookup if it is sparsely indexed.

  • -> index_getnext_slot

Returns the tuple of the datasheet, here it will use the tid found by the index to look up in the datasheet and check the visibility, if it’s not visible then continue to look up the next one;

  • -> index_getnext_tid

Returns the tid of the record in the index key.

  • ->btgettuple

Lookup in index, hit the index item corresponding to the key by traversing and comparing it.

The basic process of finding indexed data

Index lookups are roughly divided into two steps:

  • Finding the starting point, i.e. finding the key value
  • Scan from the start and return eligible index entries

code analysis

The query entry function for the index is btgettuple, and here is its implementation;

bool
btgettuple(IndexScanDesc scan, ScanDirection dir)
{
	BTScanOpaque so = (BTScanOpaque) scan->opaque;
	bool		res;

	/* btree indexes are never lossy */
	scan->xs_recheck = false;

	/*
	 * If we have any array keys, initialize them during first call for a
	 * scan.  We can't do this in btrescan because we don't know the scan
	 * direction at that time.
	 */
	if (so->numArrayKeys && !BTScanPosIsValid(so->currPos))
	{
		/* punt if we have any unsatisfiable array keys */
		if (so->numArrayKeys < 0)
			return false;

		_bt_start_array_keys(scan, dir);
	}

	/* This loop handles advancing to the next array elements, if any */
	do
	{
		/*
		 * If we've already initialized this scan, we can just advance it in
		 * the appropriate direction.  If we haven't done so yet, we call
		 * _bt_first() to get the first item in the scan.
		 */
		if (!BTScanPosIsValid(so->currPos))
			res = _bt_first(scan, dir);		
		else
		{
			/*
			 * Check to see if we should kill the previously-fetched tuple.
			 */
			if (scan->kill_prior_tuple)
			{
				/*
				 * Yes, remember it for later. (We'll deal with all such
				 * tuples at once right before leaving the index page.)  The
				 * test for numKilled overrun is not just paranoia: if the
				 * caller reverses direction in the indexscan then the same
				 * item might get entered multiple times. It's not worth
				 * trying to optimize that, so we don't detect it, but instead
				 * just forget any excess entries.
				 */
				if (so->killedItems == NULL)
					so->killedItems = (int *)
						palloc(MaxTIDsPerBTreePage * sizeof(int));
				if (so->numKilled < MaxTIDsPerBTreePage)
					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
			}

			/*
			 * Now continue the scan.
			 */
			res = _bt_next(scan, dir);
		}

		/* If we have a tuple, return it ... */
		if (res)
			break;
		/* ... otherwise see if we have more array keys to deal with */
	} while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));

	return res;
}
  • Initialize the lookup point; from the code, after entering the loop, first BTScanPosIsValid(so->currPos) determines whether currPos is valid, that is, whether the lookup point has been initialized; if not initialized, then call _bt_first to initialize;
  • Scanning for index entries. After initializing the lookup point, _bt_next is called to get one index entry of data, and returns when a valid index is found;

Index Insertion

Index insertion into the call stack

From insert, the call path is as follows

  • ExecInsert

Execution entry function for SQL insert statement

  • -> ExecInsertIndexTuples

If there is an index built in the current table, call this function to insert the index after the table data tuple is inserted, there may be more than one index, the cycle of each index to call the lower level function to insert;

  • index_insert

The public calling interface for index insertion, which actually calls the insertion definition interface for the corresponding index;

  • btinsert

The entry function for the operation of btree index insertion; in this function, an index tuple is first assembled, and then the subordinate function is called to perform the insertion;

  • _bt_doinsert

Performing an insertion of an index item goes through a series of process steps such as finding the location, checking for uniqueness, and inserting;

The basic process of index insertion

The broad process of index insertion has the following main components:

  • Finds the location where the indexed item was inserted, because the btree is an ordered tree, so it is important to find the location of the insertion first, to keep the order; this will be similar to an index query, where the lookup key is initialized and the query point is found first;
  • The uniqueness constraint check is not performed if the attribute columns in the index are all NULL;
  • The insertion session of the index is done by calling _bt_insertonpg, where there will be a lookup for free space, possible index splitting, etc;

code analysis

The entry function for index insertion is btinsert, and the actual execution is _bt_doinsert, so here’s a look at the executed code flow;

bool
_bt_doinsert(Relation rel, IndexTuple itup,
			 IndexUniqueCheck checkUnique, bool indexUnchanged,
			 Relation heapRel)
{
	bool		is_unique = false;
	BTInsertStateData insertstate;
	BTScanInsert itup_key;
	BTStack		stack;
	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);

	/* we need an insertion scan key to do our search, so build one */
	itup_key = _bt_mkscankey(rel, itup);

	if (checkingunique)
	{
		if (!itup_key->anynullkeys)
		{
			/* No (heapkeyspace) scantid until uniqueness established */
			itup_key->scantid = NULL;
		}
		else
		{
			checkingunique = false;
			/* Tuple is unique in the sense that core code cares about */
			Assert(checkUnique != UNIQUE_CHECK_EXISTING);
			is_unique = true;
		}
	}

	insertstate.itup = itup;
	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
	insertstate.itup_key = itup_key;
	insertstate.bounds_valid = false;
	insertstate.buf = InvalidBuffer;
	insertstate.postingoff = 0;

search:
	stack = _bt_search_insert(rel, heapRel, &insertstate);

	if (checkingunique)
	{
		TransactionId xwait;
		uint32		speculativeToken;

		xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
								 &is_unique, &speculativeToken);

		if (unlikely(TransactionIdIsValid(xwait)))
		{
			/* Have to wait for the other guy ... */
			_bt_relbuf(rel, insertstate.buf);
			insertstate.buf = InvalidBuffer;

			if (speculativeToken)
				SpeculativeInsertionWait(xwait, speculativeToken);
			else
				XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);

			/* start over... */
			if (stack)
				_bt_freestack(stack);
			goto search;
		}

		/* Uniqueness is established -- restore heap tid as scantid */
		if (itup_key->heapkeyspace)
			itup_key->scantid = &itup->t_tid;
	}

	if (checkUnique != UNIQUE_CHECK_EXISTING)
	{
		OffsetNumber newitemoff;

		CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));

		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
									   indexUnchanged, stack, heapRel);
		_bt_insertonpg(rel, heapRel, itup_key, insertstate.buf, InvalidBuffer,
					   stack, itup, insertstate.itemsz, newitemoff,
					   insertstate.postingoff, false);
	}
	else
	{
		/* just release the buffer */
		_bt_relbuf(rel, insertstate.buf);
	}

	/* be tidy */
	if (stack)
		_bt_freestack(stack);
	pfree(itup_key);

	return is_unique;
}

The code flow is as follows.

  • Initialization work: Initializes the lookup key;
  • Find the insertion location; call _bt_search_insert to query to a leaf node page with enough free space;
  • Check the uniqueness constraint; check the uniqueness constraint, if there is a conflict transaction, then wait for the execution of the conflict transaction is completed, and then re-query the location, and then check the uniqueness constraint; and then the result of the judgment checkUnique ! = UNIQUE_CHECK_EXISTING, if the violation of the return then the end of the insertion; otherwise, the implementation of the insertion action;
  • Index insertion; first determine the insertion location, then call _bt_insertonpg;

Index Deletion

The update of an index is the deletion and insertion operations, here we look at the outline process of index deletion.
For the deletion of a tuple of a data table, the data is not really deleted, so the corresponding index entries will not be deleted, so when to delete the index entries?

Delete Index Basic Process

When doing vacuum or prune paga, the last data tuple is left on each page for HOT chains, because HOT chains within the same page correspond to only one index entry, and the last one is left to remove the index entry.
When vacuum indexing is performed, the corresponding index item is found through this dead tuple, and the index item is deleted first, then the dead tuple is deleted.
Often said that the performance of the index has decreased, in fact, is caused by index expansion, that is, the deadtuple becomes more, resulting in more index entries to be deleted, the query efficiency is greatly reduced, but also will bring the index IO increase.

code analysis

  • vac_bulkdel_one_index

Calls the generic index processing interface;

  • ->index_bulk_delete

Here the generic index processing interface, which calls the processing interface of the corresponding index, in this case the btree index processing is called;

  • ->btbulkdelete

The btree counterpart of the bulk delete interface; to avoid the impact of exits, the exit callback function is registered at the start to handle the aftermath before unsharing memory; btvacuumscan is then called to clean up all pages for index deletion.

wind up

Thank you very much for your support, don’t forget to leave your valuable comments while browsing, if you think it’s worth encouraging, please like it and favorite it, I will try harder!

Author’s email: [email protected]
Feel free to point out any errors or omissions and learn from each other.

Note: May not be reproduced without consent!

Recommended Today

uniapp and applet set tabBar and show and hide tabBar

(1) Set the tabBar: uni.setTabberItem({}); wx.setTabberItem({}); indexnumberisWhich item of the tabBar, counting from the left, is indexed from 0.textstringnoButton text on tabiconPathstringnoImage PathselectedIconPathstringnoImage path when selectedpagePathstringnoPage absolute pathvisiblebooleannotab Whether to display uni.setTabBarItem({ index: 0, text: ‘text’, iconPath: ‘/path/to/iconPath’, selectedIconPath: ‘/path/to/selectedIconPath’, pagePath: ‘pages/home/home’ }) wx.setTabBarItem({ index: 0, text: ‘text’, iconPath: ‘/path/to/iconPath’, selectedIconPath: ‘/path/to/selectedIconPath’, pagePath: ‘pages/home/home’ }) […]