
🚀 Think about Primary Key & Indexes before anything else 🐘

A recent tweet by Nikolay Samokhvalov highlighted the importance of understanding the database structure:

Andy Pavlo also made the point that people suggest many radical changes without knowing the right way to index:

In this post, I'll use this example to show that building the right indexes is not that hard. And in any case, this work has to be done before scaling out to a distributed database. The problem was encountered on PostgreSQL. I'll run my demo on YugabyteDB to show that the concepts are the same in a distributed database. And so are the SQL statements.

James Long's approach was the right one: asking the community for help, and providing all the necessary information, including the execution plan and the index definitions: https://gist.github.com/jlongster/4b31299dcb622aa7e29b59d889db2b2c#file-gistfile1-txt

With this information, the problem is easy to reproduce:

yugabyte=# \c yugabyte yugabyte

psql (15devel, server 11.2-YB-2.9.1.0-b0)
You are now connected to database "yugabyte" as user "yugabyte".

yugabyte=# create table messages_binary (
            "timestamp" text,
            "group_id" uuid,
            "other_column" int,
            primary key("timestamp","group_id")
           );
CREATE TABLE

yugabyte=# EXPLAIN SELECT * FROM messages_binary
           WHERE group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e'
           AND timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e';
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on messages_binary  (cost=0.00..105.00 rows=1000 width=52)
   Filter: (("timestamp" > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e'::text) AND (group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e'::uuid))
(2 rows)

This is a Seq Scan, a full table scan. Not what we want, because it reads all the rows when we need only one group_id. What we want is a range scan.

Let's insert a few rows (3 groups with 3 timestamps in each):

yugabyte=# create extension pgcrypto;
CREATE EXTENSION

yugabyte=# insert into messages_binary
           with groups as (
            select gen_random_uuid() group_id from generate_series(1,3)
           )
           select
            to_char(now()+(generate_series(1,3)*interval'1 second')
             ,'yyyy-mm-ddThh24:mi:ss.000Z-')
             ||substr(gen_random_uuid()::text,25) "timestamp"
            ,group_id, 42 as "value"
           from groups;
INSERT 0 9

yugabyte=# select * from messages_binary;

               timestamp               |               group_id               | other_column
---------------------------------------+--------------------------------------+--------------
 2021-11-07T20:00:23.000Z-c533a5e5623e | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:24.000Z-b879daca6cb7 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
 2021-11-07T20:00:23.000Z-ca98dd4de397 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
 2021-11-07T20:00:22.000Z-c440295c4500 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:24.000Z-631b45e66aba | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:22.000Z-ad01842bb691 | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:24.000Z-90342717a0c8 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:22.000Z-933f552d0159 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
 2021-11-07T20:00:23.000Z-1dcde16fc472 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
(9 rows)

In YugabyteDB, the rows are sharded and stored in the primary index itself. In PostgreSQL, the rows are appended to a heap table, with an additional index built on the primary key. In both cases, range-scan access depends on the primary key, defined here as ("timestamp","group_id"). And we can see that the rows I'm interested in, those with group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e', are scattered within this sequential-scan result.

Let's get an idea of the primary key order with a SELECT ... ORDER BY on the same columns:

yugabyte=# select * from messages_binary
            order by "timestamp","group_id";

               timestamp               |               group_id               | other_column
---------------------------------------+--------------------------------------+--------------
 2021-11-07T20:00:22.000Z-933f552d0159 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
 2021-11-07T20:00:22.000Z-ad01842bb691 | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:22.000Z-c440295c4500 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:23.000Z-1dcde16fc472 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:23.000Z-c533a5e5623e | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:23.000Z-ca98dd4de397 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
 2021-11-07T20:00:24.000Z-631b45e66aba | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:24.000Z-90342717a0c8 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:24.000Z-b879daca6cb7 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
(9 rows)

Now you should understand how inefficient the query is with the predicate WHERE group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e' AND timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e'. We start at the first row, because it verifies timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e', but then we have to scan all the rows and filter them afterward. There is no structure that would let us find the interesting rows within a small range that could be read alone. That is what a sequential scan is.

We need a structure like the following, ordered first on "group_id":

yugabyte=# select * from messages_binary
            order by "group_id","timestamp";

               timestamp               |               group_id               | other_column
---------------------------------------+--------------------------------------+--------------
 2021-11-07T20:00:22.000Z-c440295c4500 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:23.000Z-1dcde16fc472 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:24.000Z-90342717a0c8 | 9c3d61e1-6d3f-4b95-9e08-46f485d10b75 |           42
 2021-11-07T20:00:22.000Z-ad01842bb691 | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:23.000Z-c533a5e5623e | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:24.000Z-631b45e66aba | e7e46753-2e99-4ee4-b77f-17136b01790e |           42
 2021-11-07T20:00:22.000Z-933f552d0159 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
 2021-11-07T20:00:23.000Z-ca98dd4de397 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
 2021-11-07T20:00:24.000Z-b879daca6cb7 | f27ac68f-2a10-46f0-a8fe-77b99c0c5a66 |           42
(9 rows)

With this structure (look at the row order; I didn't change the column order), the database engine can:

  • seek to the first group_id='e7e46753-2e99-4ee4-b77f-17136b01790e'
  • within it, seek to the first timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e'
  • read the following rows sequentially
  • and stop at the last row with group_id='e7e46753-2e99-4ee4-b77f-17136b01790e'.
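These seek-and-read steps can be sketched as a toy model in Python (this is an illustration, not the actual engine, and the keys are shortened versions of the real values): rows sorted on (group_id, timestamp) allow a binary-search seek followed by a sequential read that stops at the first row of another group.

```python
import bisect

# Toy model of an index ordered on ("group_id", "timestamp").
# Real engines use B-trees or LSM trees, but the access pattern is the same.
rows = sorted([
    ("9c3d61e1", "20:00:22"), ("9c3d61e1", "20:00:23"), ("9c3d61e1", "20:00:24"),
    ("e7e46753", "20:00:22"), ("e7e46753", "20:00:23"), ("e7e46753", "20:00:24"),
    ("f27ac68f", "20:00:22"), ("f27ac68f", "20:00:23"), ("f27ac68f", "20:00:24"),
])

def range_scan(rows, group_id, ts_after):
    # Seek: binary search to just past (group_id, ts_after),
    # which matches the strict ">" of the SQL predicate
    start = bisect.bisect_right(rows, (group_id, ts_after))
    result = []
    # Sequential read: stop at the last row of the group
    for key in rows[start:]:
        if key[0] != group_id:
            break
        result.append(key)
    return result

print(range_scan(rows, "e7e46753", "20:00:22"))
# → [('e7e46753', '20:00:23'), ('e7e46753', '20:00:24')]
```

Only the two matching rows are visited after the seek; the other groups are never read, which is exactly what the sequential scan above cannot do.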

How do we get this structure? It is easy to define with another index:

yugabyte=# create index messages_binary_key2 
            on messages_binary ("group_id","timestamp");
CREATE INDEX

And here is the execution plan:

yugabyte=# EXPLAIN SELECT * FROM messages_binary
           WHERE group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e'
           AND timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e';
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using messages_binary_key2 on messages_binary  (cost=0.00..5.25 rows=10 width=52)
   Index Cond: ((group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e'::uuid) AND ("timestamp" > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e'::text))
(2 rows)

This is efficient. It can be even more efficient if we add all the selected columns to the index with the INCLUDE clause, such as:

yugabyte=# drop index messages_binary_key2;
DROP INDEX

yugabyte=# create index messages_binary_key2 
           on messages_binary ("group_id","timestamp")
           include ("other_column");
CREATE INDEX

yugabyte=# EXPLAIN SELECT * FROM messages_binary
           WHERE group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e'
           AND timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e';
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using messages_binary_key2 on messages_binary  (cost=0.00..5.15 rows=10 width=52)
   Index Cond: ((group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e'::uuid) AND ("timestamp" > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e'::text))
(2 rows)
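What INCLUDE changes can be pictured with another toy model (the names `heap`, `plain_index`, and `covering_index` are illustrative, not engine internals): without the included column, the index only gives the row's location and the engine must then visit the table; with it, the index entry already carries everything the SELECT needs.

```python
# Toy model of an Index Only Scan: the covering index maps the key
# (group_id, timestamp) directly to the included column, so the base
# table (the "heap" in PostgreSQL) is never touched.
heap = {0: ("e7e46753", "20:00:23", 42)}           # row location -> full row

plain_index = {("e7e46753", "20:00:23"): 0}        # key -> row location
covering_index = {("e7e46753", "20:00:23"): 42}    # key -> included column

key = ("e7e46753", "20:00:23")

# Index Scan: one lookup in the index, then one more in the table
value_via_heap = heap[plain_index[key]][2]

# Index Only Scan: a single lookup, no table access
value_direct = covering_index[key]

print(value_via_heap, value_direct)  # 42 42
```

The saving is the second lookup per row, which matters most in PostgreSQL where the table access is a random heap read.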

I've detailed this Index Only Scan technique in:
https://blog.yugabyte.com/how-a-distributed-sql-database-boosts-secondary-index-queries-with-index-only-scan/

Going further in the analysis, the index on ("timestamp","group_id") is probably useless, because we will rarely query on a timestamp alone without a group_id.
So it is better to define the table as:

yugabyte=# drop table messages_binary;
DROP TABLE

yugabyte=# create table messages_binary (
            "timestamp" text,
            "group_id" uuid,
            "other_column" int,
            primary key("group_id","timestamp")
            );
CREATE TABLE

I insert more rows and look at the execution plan:

yugabyte=# insert into messages_binary
           with groups as (
            select gen_random_uuid() group_id from generate_series(1,1e3)
           )
           select
            to_char(now()+(generate_series(1,1e4)*interval'1 second')
             ,'yyyy-mm-ddThh24:mi:ss.000Z-')
             ||substr(gen_random_uuid()::text,25) "timestamp"
            ,group_id, 42 as "value"
           from groups;
INSERT 0 10000000

yugabyte=# analyze messages_binary;
ANALYZE

yugabyte=# EXPLAIN (analyze)
           SELECT * FROM messages_binary
           WHERE group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e'
           AND timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e';
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using messages_binary_pkey on messages_binary  (cost=0.00..1214.95 rows=10530 width=52) (actual time=10.588..100.838 rows=10000 loops=1)
   Index Cond: ((group_id = 'e7e46753-2e99-4ee4-b77f-17136b01790e'::uuid) AND ("timestamp" > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e'::text))
 Planning Time: 0.067 ms
 Execution Time: 101.711 ms
(4 rows)

100 milliseconds to retrieve 10,000 rows from a 10-million-row table in my small lab (8 vCPUs), and I can tell you that it does not depend on the size of the table. This is another benefit of understanding the access patterns: you know how it scales. This index scan is, approximately, O(1) with the table size. In reality it is "log n", but the branch-to-leaf traversal in the B-tree (or the seek in the LSM tree's SST files) is negligible for a large range scan. And it is O(n) with the result set size.
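A rough back-of-the-envelope check of the "negligible seek" claim (binary search as a pessimistic stand-in; a real B-tree with a fan-out of, say, 200 keys per page would descend only ~4 levels for 10 million rows):

```python
import math

table_rows = 10_000_000   # rows in the table
result_rows = 10_000      # rows returned by the range scan

# Upper bound on seek steps if each step halved the search space
seek_steps = math.ceil(math.log2(table_rows))
print(seek_steps)                  # 24
print(result_rows // seek_steps)   # 416: the sequential read dominates
```

So even with the worst-case seek estimate, reading the result is hundreds of times more work than finding its start, which is why the scan behaves as O(n) in the result size, not the table size.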

The update above was added after a discussion with Nikolay Samokhvalov (postgres.ai). We also compared the YugabyteDB results with PostgreSQL. My query generated the rows by inserting them in "group id" order. This makes no difference in YugabyteDB, where rows are organized by primary key, but in a PostgreSQL heap table, performance depends on the index/table correlation. Here is a PostgreSQL example closer to real life, ordered by timestamp:

postgres=# truncate table messages_binary;
TRUNCATE TABLE
postgres=# insert into messages_binary
           with groups as (
                       select gen_random_uuid() group_id from generate_series(1,1e3)
                      ), timestamps as (
                       select to_char(now()+(generate_series(1,1e4)*interval'1 second')
                        ,'yyyy-mm-ddThh24:mi:ss.000Z-')
                        ||substr(gen_random_uuid()::text,25) "timestamp"
                      )
                      select
                       "timestamp",group_id, 42 as "value"
                      from timestamps,groups;
INSERT 0 10000000

postgres=# select * from messages_binary limit 1;
               timestamp               |               group_id               | other_column
---------------------------------------+--------------------------------------+--------------
 2021-11-08T08:34:14.000Z-6bb27dbe2723 | 91ee7381-eb92-48cd-bb82-9ed939dc3a13 |           42
(1 row)

postgres=# EXPLAIN (analyze,buffers)
           SELECT * FROM messages_binary
           WHERE group_id = '91ee7381-eb92-48cd-bb82-9ed939dc3a13'
           AND timestamp > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e';
                                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on messages_binary  (cost=778.61..30792.78 rows=9956 width=58) (actual time=3.021..43.141 rows=10000 loops=1)
   Recheck Cond: ((group_id = '91ee7381-eb92-48cd-bb82-9ed939dc3a13'::uuid) AND ("timestamp" > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e'::text))
   Heap Blocks: exact=10000
   Buffers: shared hit=2006 read=8167 written=2249
   ->  Bitmap Index Scan on messages_binary_pkey  (cost=0.00..776.12 rows=9956 width=0) (actual time=1.897..1.897 rows=10000 loops=1)
         Index Cond: ((group_id = '91ee7381-eb92-48cd-bb82-9ed939dc3a13'::uuid) AND ("timestamp" > '1970-01-01T00:00:00.000Z-0000-ae26b84edae7349e'::text))
         Buffers: shared hit=11 read=162
 Planning Time: 0.086 ms
 Execution Time: 43.832 ms

Storage latency and the memory available for buffers will change the final response time. The automatic compaction of LSM trees in YugabyteDB, or a manual pg_repack in PostgreSQL, can help. But overall we are in the same ballpark. It is when the response time is several orders of magnitude higher than expected that the data access design needs to be reviewed. Beyond that, each database engine may have additional optimizations.

This is the best you can get for this query without creating any additional index, just by defining the right order of columns in the primary key. In PostgreSQL there are still random reads from the table, but at least everything has been filtered by the primary key index range scan first. In YugabyteDB, all rows are retrieved with sequential reads:

  • Sharding is done on the primary key, so that reads go to a single tablet.
  • The SST files have bloom filters, so that many of them can be skipped.
  • They index their data blocks, so that only the required ones are read.
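The bloom-filter idea behind the second bullet can be sketched as follows (a minimal hand-rolled filter, not YugabyteDB's implementation): each SST file keeps a small bit array summarizing the keys it contains, and a negative answer lets the engine skip the file without reading it.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions from seeded SHA-256 digests
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "maybe present"
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# One filter per SST file: a negative answer skips the whole file read
sst_filter = BloomFilter()
sst_filter.add("e7e46753-2e99-4ee4-b77f-17136b01790e")
print(sst_filter.might_contain("e7e46753-2e99-4ee4-b77f-17136b01790e"))  # True
```

The key property is the absence of false negatives: a key that was stored always tests positive, so skipping files on a negative answer is always safe.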

Alvaro Hernández also questioned whether the query returns the right number of rows:

One more thing: James Long's execution plan showed (group_id = '983d5259-97ff-49e3-8829-101ab8dead92'::text) in the index condition. Storing a UUID as TEXT is not efficient, so I used the uuid data type here:

yugabyte=# select pg_column_size(
           'e7e46753-2e99-4ee4-b77f-17136b01790e'::uuid);

 pg_column_size
----------------
             16
(1 row)

yugabyte=# select pg_column_size(
           'e7e46753-2e99-4ee4-b77f-17136b01790e'::text);

 pg_column_size
----------------
             40
(1 row)
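The 16-versus-40 difference is easy to reconstruct: a uuid is 16 raw bytes, while the text form is 36 characters plus PostgreSQL's 4-byte varlena header:

```python
import uuid

u = "e7e46753-2e99-4ee4-b77f-17136b01790e"
print(len(uuid.UUID(u).bytes))   # 16: native uuid storage
print(len(u) + 4)                # 40: 36 text characters + 4-byte header
```

Besides the storage, uuid comparisons are plain 16-byte comparisons, which also makes the index entries smaller and the key comparisons cheaper.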

In conclusion: there is no shortcut. Before claiming that your database "cannot scale", you need to understand your access patterns: the expected size of the result, and the structure that accesses it efficiently. And do as James Long did: read the execution plan and ask the community 👍

Source: https://dev.to/yugabyte/think-about-primary-key-indexes-before-anything-else-o5m