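How to Use Milvus to Store and Query Vector Embeddings

In today's data-driven world, managing and searching large datasets matters more and more. Milvus, an open-source vector database designed for AI applications, is a powerful tool for this challenge. In this blog post, we'll walk through a practical Python implementation of Milvus and show how to combine it with text embeddings to build an efficient search system.

All of the code in this post can be found in the companion GitHub repository.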

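Milvus: A Vector Database

Milvus is designed to provide scalable, reliable, and fast search over vector data. It is especially well suited to applications such as image and video recognition, natural language processing, and recommendation systems, where data can be represented as high-dimensional vectors.

Setting Up Milvus

Before diving into the code, make sure Milvus is installed and running. The first step in our Python script is to establish a connection to the Milvus server: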

from pymilvus import connections

def connect_to_milvus():
    try:
        connections.connect("default", host="localhost", port="19530")
        print("Connected to Milvus.")
    except Exception as e:
        print(f"Failed to connect to Milvus: {e}")
        raise

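This function attempts to connect to a Milvus server running on the local machine on port 19530. The error handling matters here: it reports any problem that comes up while connecting, then re-raises the exception so the script does not carry on without a connection.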

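Creating a Collection in Milvus

In Milvus, a collection is like a table in a traditional database: it is where the data is stored. Each collection can contain multiple fields, similar to the columns of a table. In our example, we create a collection with three fields: a primary key (pk), the source text (source), and the embedding vector (embeddings):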

from pymilvus import FieldSchema, CollectionSchema, DataType, Collection

def create_collection(name, fields, description):
    schema = CollectionSchema(fields, description)
    collection = Collection(name, schema, consistency_level="Strong")
    return collection

# Define fields for our collection
fields = [
    FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),
    FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=768)
]

collection = create_collection("hello_milvus", fields, "Collection for demo purposes")

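In this snippet, the embedding dimension is 768, which is the output size of the embedding model we use (covered in the next section). This value has to match the output dimension of whichever embedding model you use.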
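A mismatch between the schema's dim and the model's output would typically only surface once an insert is attempted, so it can help to check vector lengths up front. The following is a minimal sketch in plain Python; check_dimensions and EMBEDDING_DIM are hypothetical names used for illustration and are not part of pymilvus or the companion repository.

```python
# Hypothetical constant: keep it equal to dim= in the "embeddings" FieldSchema.
EMBEDDING_DIM = 768


def check_dimensions(embeddings, expected_dim=EMBEDDING_DIM):
    """Raise ValueError if any vector's length differs from expected_dim."""
    for i, vector in enumerate(embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"embedding {i} has {len(vector)} dimensions, expected {expected_dim}"
            )
    return embeddings


# Placeholder vectors stand in for real model output here.
check_dimensions([[0.0] * EMBEDDING_DIM, [0.1] * EMBEDDING_DIM])
print("all embeddings match the schema dimension")
```

Calling this on the list of embeddings just before inserting gives a clearer error message than a failed insert would.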

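Generating Text Embeddings in Python

Before inserting data into the Milvus collection, we need to generate text embeddings. This means using a pretrained model from the transformers library to convert text into numeric vectors. In our code, we use the thenlper/gte-base model for this, wrapped in embedding_util.py. For our application, the process is abstracted behind that module, which is dedicated to generating vector embeddings.

For more detail on how our custom embedding_util.py module creates vector embeddings, see my blog post on how to store and query vector embeddings with Weaviate.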

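Generating and Inserting Data

To generate embeddings from text, we use the pretrained model from the transformers library mentioned earlier. It converts text into numeric vectors that can be stored in our Milvus collection: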

from embedding_util import generate_embeddings

documents = [...]
embeddings = [generate_embeddings(doc) for doc in documents]
entities = [
    [str(i) for i in range(len(documents))],
    [str(doc) for doc in documents],
    embeddings
]

insert_result = insert_data(collection, entities)

The insert_data function inserts our data into the Milvus collection and then flushes it to make sure the data is persisted.
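The body of insert_data is not shown in this post. A minimal sketch consistent with the description above might look like the following, assuming the collection argument is a pymilvus Collection. The sketch relies only on that object's insert and flush methods and its num_entities property, and the actual helper in the companion repository may differ.

```python
def insert_data(collection, entities):
    """Insert column-ordered entities, then flush so the data is persisted.

    `entities` is a list of columns in schema order: [pks, sources, embeddings].
    """
    insert_result = collection.insert(entities)
    # flush() seals the inserted data so it is written to storage.
    collection.flush()
    print(
        f"Inserted {len(entities[0])} entities; "
        f"collection now holds {collection.num_entities}."
    )
    return insert_result
```

Returning the insert result matters for the cleanup step later, which reads the primary keys from it.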

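Creating an Index for Efficient Search

Milvus uses indexes to speed up search. Here, we create an IVF_FLAT index on the embeddings field, using the L2 metric and nlist set to 128: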

def create_index(collection, field_name, index_type, metric_type, params):
    index = {"index_type": index_type, "metric_type": metric_type, "params": params}
    collection.create_index(field_name, index)

create_index(collection, "embeddings", "IVF_FLAT", "L2", {"nlist": 128})

Performing a Vector Search

With the data indexed, we can now search by vector similarity:

def search_and_query(collection, search_vectors, search_field, search_params):
    collection.load()
    result = collection.search(search_vectors, search_field, search_params, limit=3, output_fields=["source"])
    print_search_results(result, "Vector search results:")

query = "Give me some content about the ocean"
query_vector = generate_embeddings(query)
search_and_query(collection, [query_vector], "embeddings", {"metric_type": "L2", "params": {"nprobe": 10}})

In this search, we look for the 3 documents most similar to the query "Give me some content about the ocean".
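Conceptually, a search with the L2 metric ranks the stored vectors by their distance to the query vector and keeps the closest few. The sketch below shows that idea as a brute-force scan in plain Python. It is an illustration only, not how Milvus implements search, and the names squared_l2 and brute_force_search as well as the tiny 2-dimensional vectors are made up for the example.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance; it ranks vectors the same way as plain L2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query_vector, stored, limit=3):
    """Return the `limit` nearest (pk, distance) pairs, closest first.

    `stored` maps a primary key to its vector.
    """
    scored = ((pk, squared_l2(query_vector, vec)) for pk, vec in stored.items())
    return heapq.nsmallest(limit, scored, key=lambda pair: pair[1])


# Toy data: four 2-dimensional vectors instead of 768-dimensional embeddings.
stored = {"0": [0.0, 0.0], "1": [1.0, 0.0], "2": [0.0, 3.0], "3": [2.0, 2.0]}
for pk, distance in brute_force_search([0.9, 0.1], stored, limit=2):
    print(pk, distance)
```

An index such as IVF_FLAT exists to avoid scanning every vector like this: it groups vectors into nlist clusters and, at query time, only searches the nprobe clusters closest to the query.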

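If the application runs successfully, you should see vector search results like the following, ordered by L2 distance, the metric used for both the index and the search (the smaller the distance, the more semantically similar):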

Vector search results:
Hit: id: 6, distance: 0.39819106459617615, entity: {'source': 'The sunset paints the sky with shades of orange, pink, and purple, reflecting on the calm sea.'}, source field: The sunset paints the sky with shades of orange, pink, and purple, reflecting on the calm sea.
Hit: id: 4, distance: 0.4780573844909668, entity: {'source': 'The ancient tree, with its gnarled branches and deep roots, whispers secrets of the past.'}, source field: The ancient tree, with its gnarled branches and deep roots, whispers secrets of the past.
Hit: id: 0, distance: 0.4835127890110016, entity: {'source': 'A group of vibrant parrots chatter loudly, sharing stories of their tropical adventures.'}, source field: A group of vibrant parrots chatter loudly, sharing stories of their tropical adventures.
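The print_search_results helper called in search_and_query is not shown in this post. A minimal sketch that would produce lines shaped like the output above might look like this. It assumes the search result can be iterated as one list of hits per query vector, and that each hit prints as its id, distance, and entity and exposes entity.get, as pymilvus hits do. The helper in the companion repository may differ.

```python
def print_search_results(results, message):
    """Print a heading, then one line per hit, for every query vector."""
    print(message)
    for hits in results:  # one entry per query vector
        for hit in hits:  # hits arrive ordered nearest first
            print(f"Hit: {hit}, source field: {hit.entity.get('source')}")
```

Since we pass a single query vector, the outer loop runs once and the inner loop prints the 3 hits requested with limit=3.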

Cleaning Up

Once you're done, it's good practice to clean up by deleting entities and dropping the collection:

delete_entities(collection, f'pk in ["{insert_result.primary_keys[0]}", "{insert_result.primary_keys[1]}"]')
drop_collection("hello_milvus")
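The delete_entities and drop_collection helpers are not shown in this post either. The sketch below is one way they could be written, assuming a pymilvus Collection (whose delete method takes a boolean expression) and pymilvus's utility.drop_collection. build_pk_filter is a hypothetical extra helper that builds the same kind of "pk in [...]" expression as the f-string above, for any number of keys.

```python
import json


def build_pk_filter(primary_keys, field="pk"):
    """Build a filter such as: pk in ["0", "1"] for a VARCHAR primary key."""
    # json.dumps double-quotes each key and escapes embedded quotes.
    return f"{field} in {json.dumps([str(key) for key in primary_keys])}"


def delete_entities(collection, expr):
    """Delete every entity matching the boolean expression."""
    return collection.delete(expr)


def drop_collection(name):
    """Remove the whole collection, including its data and index."""
    # Imported lazily so the helpers above work without pymilvus installed.
    from pymilvus import utility

    utility.drop_collection(name)


print(build_pk_filter(["0", "1"]))
```

With this in place, the delete call could be written as delete_entities(collection, build_pk_filter(insert_result.primary_keys[:2])).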

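Conclusion

Milvus offers a powerful and flexible way to work with vector data. By combining it with natural language processing techniques, we can build sophisticated search and recommendation systems. The Python script shown here is only a simple example, but the range of potential applications is broad.

Whether you are working with large image databases, complex recommendation systems, or advanced NLP tasks, Milvus can be a valuable part of your AI toolkit.

Source: https://dev.to/stephenc222/how-to-use-milvus-to-store-and-query-vector-embeddings-5hhl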
