InnoDB 的索引页结构

type

Post

status

Published

date

Feb 16, 2023

slug

summary

Page Header

The FIL header and trailer: This is typical and common to all page types. One aspect somewhat unique to INDEX pages is that the “previous page” and “next page” pointers in the FIL header point to the previous and next pages at the same level of the index, and in ascending order based on the index’s key. This forms a doubly-linked list of all pages at each level.

Macro	bytes	Desc	ㅤ
FIL_PAGE_SPACE_OR_CHKSUM	4	在MySQL4.0之前存储space id，之后的版本用于存储checksum	ㅤ
FIL_PAGE_OFFSET	4	当前页的page no	ㅤ
FIL_PAGE_PREV	4	通常用于维护btree同一level的双向链表，指向链表的前一个page，没有的话则值为FIL_NULL	ㅤ
FIL_PAGE_NEXT	4	记录链表的下一个Page的Page No	ㅤ
FIL_PAGE_LSN	8	最近一次修改该page的LSN	ㅤ
FIL_PAGE_TYPE	2	Page类型	ㅤ
FIL_PAGE_FILE_FLUSH_LSN	8	只用于系统表空间的第一个Page，记录在正常shutdown时安全checkpoint到的点，对于用户表空间，这个字段通常是空闲的，但在5.7里，FIL_PAGE_COMPRESSED类型的数据页则另有用途。下一小节单独介绍	ㅤ
FIL_PAGE_SPACE_ID	4	存储page所在的space id	ㅤ

FSEG Header

The FSEG header: As described in Page management in InnoDB space files, the index root page’s FSEG header contains pointers to the file segments used by this index. All other index pages’ FSEG headers are unused and zero-filled.

Index Header

Macro	bytes	Desc
PAGE_N_DIR_SLOTS	2	Page directory中的slot个数（见下文关于Page directory的描述）
PAGE_HEAP_TOP	2	指向数据页中的空闲空间的起始地址
PAGE_N_HEAP	2	表示目前存放的 Record 数量
PAGE_FREE	2	指向标记删除的记录链表的第一个记录
PAGE_GARBAGE	2	被删除的记录链表上占用的总的字节数，属于可回收的垃圾碎片空间
PAGE_LAST_INSERT	2	指向最近一次插入的记录偏移量，主要用于优化顺序插入操作
PAGE_DIRECTION	2	用于指示当前记录的插入顺序以及是否正在进行顺序插入，每次插入时，PAGE_LAST_INSERT会和当前记录进行比较，以确认插入方向，据此进行插入优化
PAGE_N_DIRECTION	2	当前以相同方向的顺序插入记录个数
PAGE_N_RECS	2	Page上有效的未被标记删除的用户记录个数
PAGE_MAX_TRX_ID	8	最近一次修改该page记录的事务ID，主要用于辅助判断二级索引记录的可见性。
PAGE_LEVEL	2	该Page所在的btree level，根节点的level最大，叶子节点的level为0
PAGE_INDEX_ID	8	该Page归属的索引ID

仅在Btree的root Page中被设置，其他Page都是未使用的。

Macro	bytes	Desc
PAGE_BTR_SEG_LEAF	10(FSEG_HEADER_SIZE)	leaf segment在inode page中的位置
PAGE_BTR_SEG_TOP	10(FSEG_HEADER_SIZE)	non-leaf segment在inode page中的位置

10个字节的inode信息包括：

Macro	bytes	Desc
FSEG_HDR_SPACE	4	描述该segment的inode page所在的space id （目前的实现来看，感觉有点多余…）
FSEG_HDR_PAGE_NO	4	描述该segment的inode page的page no
FSEG_HDR_OFFSET	2	inode page内的页内偏移量

系统记录

System records: InnoDB has two system records in each page called infimum and supremum. These records are stored in a fixed location in the page so that they can always be found directly based on byte offset in the page.

用户记录

User records: The actual data. Every record has a variable-width header and the actual column data itself. The header contains a “next record” pointer which stores the offset to the next record within the page in ascending order, forming a singly-linked list.

bytes	Desc
变长列长度数组	如果列的最大长度为255字节，使用1byte；否则，0xxxxxxx (one byte, length=0..127), or 1exxxxxxxxxxxxxx (two bytes, length=128..16383, extern storage flag)
SQL-NULL flag	标示值为NULL的列的bitmap，每个位标示一个列，bitmap的长度取决于索引上可为NULL的列的个数(dict_index_t::n_nullable)，这两个数组的解析可以参阅函数`rec_init_offsets`
下面5个字节（REC_N_NEW_EXTRA_BYTES）描述记录的额外信息	….
REC_NEW_INFO_BITS (4 bits)	目前只使用了两个bit，一个用于表示该记录是否被标记删除(`REC_INFO_DELETED_FLAG`)，另一个bit(REC_INFO_MIN_REC_FLAG)如果被设置，表示这个记录是当前level最左边的page的第一个用户记录
REC_NEW_N_OWNED (4 bits)	当该值为非0时，表示当前记录占用page directory里一个slot，并和前一个slot之间存在这么多个记录
REC_NEW_HEAP_NO (13 bits)	该记录的heap no
REC_NEW_STATUS (3 bits)	记录的类型，包括四种：`REC_STATUS_ORDINARY`(叶子节点记录)， `REC_STATUS_NODE_PTR`（非叶子节点记录），`REC_STATUS_INFIMUM`(infimum系统记录)以及`REC_STATUS_SUPREMUM`(supremum系统记录)
REC_NEXT (2bytes)	指向按照键值排序的page内下一条记录数据起点，这里存储的是和当前记录的相对位置偏移量（函数`rec_set_next_offs_new`）

Page directory

The page directory: The page directory grows downwards from the “top” of the page starting at the FIL trailer and contains pointers to some of the records in the page (every 4th to 8th record).

Index的页面内部records是用单链表链接起来，单链表遍历会很耗时，可以使用Page Directory构建一颗B+Tree，Page directory是一个存放records偏移量的有序数组，每4到8个record就保存一个偏移量，这就可以使用二分查找。