Data Fingerprinting in JavaScript

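I want to talk about how to use content-based addressing (aka data fingerprinting) as a general approach to making your applications faster and more secure, with some practical JavaScript examples.

Let me start off by saying that content-based addressing is awesome. 👀

It's an extremely powerful tool for building more performant, scalable, and secure services. 💪

It's related to immutability, decentralization, data integrity, and a bunch of other buzzwords...

But it's also very practical and generally underappreciated, so I wanted to write a practical primer showing how it's used with some real-world JavaScript code.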

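What are you talking about?

You can think of content-based addressing as fingerprinting for data.

Just like a fingerprint allows you to:

Identify a person based on their fingerprint
Treat the fingerprint as a unique ID for that person
Tell whether two people are the same based on their fingerprints
Quickly detect whether a person is in a database given only their fingerprint

Just replace "person" with "data" in the description above and you'll have the gist of what content-based addressing does.

In other words, content-based addressing allows you to uniquely and efficiently refer to data based on its actual content, as opposed to something external like an ID or URL.

Database-generated IDs, random GUIDs, and URLs are all useful in their own right, but they're not as powerful as data fingerprinting.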

Shut up and show me some code

Let's see what this looks like with some code that I've actually used:

const pick = require('lodash.pick')
const stableStringify = require('fast-json-stable-stringify')

// myData is any JavaScript object; hash is defined further below
const data = pick(myData, ['keyFoo', 'keyBar'])
const fingerprint = hash(stableStringify(data))

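This snippet glosses over the hash function (which we'll get to below), but it represents the core algorithm very cleanly.

It creates a content-based hash, fingerprint, of any JavaScript object myData that uniquely represents that object based on the keys we care about: ['keyFoo', 'keyBar'].

In short, this type of fingerprinting lets you tell very efficiently whether two JavaScript objects are the same.

If two content-based IDs are the same, then the data in those objects (for the keys we picked) is the same.

No deep comparisons necessary. No Redux. Just pure, immutable goodness.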
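To make the "no deep comparisons" point concrete, here's a minimal sketch that de-duplicates a list of objects by fingerprint. It runs on Node's built-in crypto module only. The fingerprintOf and dedupe helpers are hypothetical names of my own, and the serialization is a simplified one that only handles flat objects, so treat it as an illustration rather than a replacement for lodash.pick plus fast-json-stable-stringify.

```javascript
const crypto = require('crypto')

// Simplified fingerprint for flat objects only: serialize the chosen keys in a
// fixed (sorted) order, then hash. Nested values would need a real stable
// stringifier such as fast-json-stable-stringify.
const fingerprintOf = (obj, keys) => {
  const pairs = [...keys].sort().map((key) => [key, obj[key] ?? null])
  return crypto.createHash('sha256').update(JSON.stringify(pairs)).digest('hex')
}

// Keep the first object seen for each fingerprint.
const dedupe = (items, keys) => {
  const seen = new Set()
  return items.filter((item) => {
    const fingerprint = fingerprintOf(item, keys)
    if (seen.has(fingerprint)) return false
    seen.add(fingerprint)
    return true
  })
}

const items = [
  { keyFoo: 1, keyBar: 'a', note: 'first copy' },
  { keyBar: 'a', keyFoo: 1, note: 'same content, different key order' },
  { keyFoo: 2, keyBar: 'a', note: 'different content' }
]

const unique = dedupe(items, ['keyFoo', 'keyBar'])
console.log(unique.map((item) => item.note)) // [ 'first copy', 'different content' ]
```

Each object is hashed once and every later comparison is a plain string lookup in a Set, which is where the efficiency comes from.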

So how does this work?

Let's take another look at our JavaScript code:

const pick = require('lodash.pick')
const stableStringify = require('fast-json-stable-stringify')

// Keep only the keys that define uniqueness
const data = pick(myData, ['keyFoo', 'keyBar'])
// Serialize deterministically, then hash
const fingerprint = hash(stableStringify(data))

First, we accept any JavaScript object myData as input. This could be a model from your database or an object containing Redux-like application state, for example.

Second, we clean our data to ensure that we're only considering parts of the data we actually care about via lodash.pick. This step is optional but usually you'll want to clean your data like this before proceeding. I've found in practice that most of the time there will be parts of your data that aren't actually representative of the uniqueness of your model (we'll refer to this extra stuff as metadata 😉).

As an example, let's say I want to create unique IDs for all of the rows in a SQL table. Most SQL implementations will add metadata to your table like the date an entry was created or modified, and it's unlikely we'd want this metadata to affect our notion of uniqueness. In other words, if two rows were inserted into the table at different times but have the exact same values according to our application's business logic, then we want to treat them as having the same fingerprint so we filter out this extra metadata.
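Here's a rough sketch of that SQL example. The column names (id, created_at, updated_at, email, plan) and the rowFingerprint helper are made up for illustration, and I'm also treating the auto-generated id as metadata, which is an assumption about your schema. Instead of picking the keys to keep, this variant drops a known list of metadata columns before hashing.

```javascript
const crypto = require('crypto')

// Hypothetical metadata columns that should not affect uniqueness.
const METADATA_COLUMNS = new Set(['id', 'created_at', 'updated_at'])

// Drop the metadata columns, sort the rest by column name so the
// serialization is order-independent, then hash. Assumes flat rows.
const rowFingerprint = (row) => {
  const content = Object.keys(row)
    .filter((column) => !METADATA_COLUMNS.has(column))
    .sort()
    .map((column) => [column, row[column]])
  return crypto.createHash('sha256').update(JSON.stringify(content)).digest('hex')
}

const first = { id: 1, email: 'a@example.com', plan: 'pro', created_at: '2020-01-01T00:00:00Z' }
const second = { id: 2, email: 'a@example.com', plan: 'pro', created_at: '2020-03-15T12:30:00Z' }
const third = { id: 3, email: 'a@example.com', plan: 'free', created_at: '2020-03-15T12:30:00Z' }

console.log(rowFingerprint(first) === rowFingerprint(second)) // true
console.log(rowFingerprint(first) === rowFingerprint(third)) // false
```

Whether you keep a list of columns to include or a list to exclude is a design choice. An include list is safer when new metadata columns might be added later.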

Third, we simplify our cleaned data into a stable, efficient representation that we can store and use for quick comparisons. Most of the time this step involves some sort of cryptographic hash to normalize the way we refer to our content in a unique, concise manner.

In the code above, we want to make sure that our hashing is stable, which is made easy for us by the fast-json-stable-stringify package.

This awesome package recursively makes sure that no matter how our JavaScript object was constructed or what order its keys may be in, it will always output the same string representation for any two objects that have deep equality.
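To see why the stable part matters, here's a small sketch comparing plain JSON.stringify with a key-sorting version. The sortKeysDeep and sortedStringify helpers are simplified stand-ins I wrote for this example, not the actual implementation of fast-json-stable-stringify, and they only cover plain JSON-like data.

```javascript
// Recursively rebuild objects with their keys in sorted order.
const sortKeysDeep = (value) => {
  if (Array.isArray(value)) return value.map(sortKeysDeep)
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.keys(value).sort().map((key) => [key, sortKeysDeep(value[key])])
    )
  }
  return value
}

// Simplified stand-in for a stable stringifier.
const sortedStringify = (value) => JSON.stringify(sortKeysDeep(value))

// Deeply equal objects, built with different key orders.
const one = { a: 1, b: { c: 2, d: 3 } }
const two = { b: { d: 3, c: 2 }, a: 1 }

console.log(JSON.stringify(one) === JSON.stringify(two)) // false
console.log(sortedStringify(one) === sortedStringify(two)) // true
```

Plain JSON.stringify follows property insertion order, so two deeply equal objects can serialize differently and would end up with different fingerprints.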

There are some details this explanation is glossing over, but that's the beauty of the npm ecosystem: we don't have to understand all the bits and pieces to take advantage of their abstractions.

Let's hash this thing out

Up until now, we've glossed over the hashing aspect of things, so let's see what this looks like in code:

const hasha = require('hasha')

// SHA-256 digest of the input; hasha encodes the result as hex by default
const hash = (input) => hasha(input, { algorithm: 'sha256' })

Note that there are lots of different ways you could define your hash function. This example uses the very common SHA-256 hash function and outputs a 64-character hex encoding of the result.

Here is an example output fingerprint: 2d3ea73f0faacebbb4a437ff758c84c8ef7fd6cce45c07bee1ff59deae3f67f5

Here is an alternative hash implementation that uses the Node.js crypto package directly:

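const crypto = require('crypto')

const hash = (d) => {
  // Use Buffers as-is; convert anything else to a string first
  const buffer = Buffer.isBuffer(d) ? d : Buffer.from(d.toString())
  // SHA-256 digest, hex-encoded (64 characters)
  return crypto.createHash('sha256').update(buffer).digest('hex')
}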

Both of these hash implementations are equivalent for our purposes.

The most important thing to keep in mind here is that we want to use a cryptographic hash function to output a compact, unique fingerprint that changes if our input data changes and remains the same if our input data remains the same.
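As a quick sanity check of those properties, here's a small sketch that reuses the crypto-based hash from above. The input strings are arbitrary examples.

```javascript
const crypto = require('crypto')

const hash = (d) => {
  const buffer = Buffer.isBuffer(d) ? d : Buffer.from(d.toString())
  return crypto.createHash('sha256').update(buffer).digest('hex')
}

// Same input, same fingerprint (strings and equivalent Buffers agree).
console.log(hash('hello') === hash('hello')) // true
console.log(hash('hello') === hash(Buffer.from('hello'))) // true

// A one-character change produces a different fingerprint.
console.log(hash('hello') === hash('hello!')) // false

// The fingerprint stays compact no matter how large the input is.
console.log(hash('hello').length) // 64
console.log(hash('a'.repeat(100000)).length) // 64
```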

So where should I go from here?

Once you start thinking about how data can be uniquely defined by its content, the applications are really endless.

Here are a few use cases where I've personally found this approach useful:

Generating unique identifiers for immutable deployments of serverless functions at Saasify. I know ZEIT uses a very similar approach to optimize their lambda deployments and package dependencies.
Generating unique identifiers for videos based on the database schema we used to generate them at Automagical. If two videos have the same fingerprint, they should have the same content. One note here is that it's often useful to add a version number to your data before hashing since changes in our video renderer resulted in changes to the output videos.
Caching Stripe plans and coupons that have the same parameters across different projects and accounts.
Caching client-side models and HTTP metadata in a React webapp.
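The caching and versioning ideas in this list can be sketched as a tiny content-addressed cache. Everything here is hypothetical: render stands in for any expensive operation, RENDERER_VERSION is a made-up version tag, and the parameters (title, durationSeconds) are invented for the example. It is not the code used at Automagical or Saasify.

```javascript
const crypto = require('crypto')

// Hypothetical version tag; bump it whenever the renderer's output changes.
const RENDERER_VERSION = 2

const cache = new Map()
let renderCount = 0

// Hypothetical stand-in for an expensive operation such as rendering a video.
const render = (params) => {
  renderCount += 1
  return `rendered:${params.title}:${params.durationSeconds}`
}

// Include the version in the hashed content so that a renderer change
// produces new fingerprints and stale cache entries are never reused.
const fingerprintOf = (params, version) => {
  const canonical = JSON.stringify([version, params.title, params.durationSeconds])
  return crypto.createHash('sha256').update(canonical).digest('hex')
}

const renderCached = (params, version = RENDERER_VERSION) => {
  const key = fingerprintOf(params, version)
  if (!cache.has(key)) cache.set(key, render(params))
  return cache.get(key)
}

renderCached({ title: 'Intro', durationSeconds: 30 })
renderCached({ durationSeconds: 30, title: 'Intro' }) // same content: cache hit
renderCached({ title: 'Intro', durationSeconds: 30 }, 3) // new version: cache miss

console.log(renderCount) // 2
console.log(cache.size) // 2
```

Because the cache key is derived from the content itself, there is no separate invalidation step: changed content or a changed version simply maps to a different key.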

We've really only started to scratch the surface of what you can do with content-based addressing. Hopefully, I've shown how simply this mindset shift can be applied in JavaScript and touched a bit on the benefits this approach brings to the table.

If you enjoy this stuff, I would recommend checking out:

The power of content-based addressing - An awesome intro to the topic with a focus on content identifiers (CIDs) as they're used in IPFS.
Multihashes - Self-describing hashes. 💪
Merkle trees - A recursive data structure built on top of content-based hashes.
Rabin fingerprinting - A polynomial-based fingerprinting scheme, commonly used in the Rabin-Karp string searching algorithm.
IPFS - InterPlanetary File System.
libp2p - Modular building blocks for decentralized applications.
Saasify - An easier way for devs to earn passive income... Oh wait, that's my company and it's not really related to content-based addressing but cut me some slack haha 😂

Thanks! 🙏

Source: https://dev.to/transitivebullshit/data-fingerprinting-in-javascript-ojm
