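How to Scrape Glassdoor Reviews
Introduction
This article shows how to scrape company reviews from Glassdoor using Page2API.
Glassdoor.com is an American website where current and former employees can anonymously review companies.
Disclaimer:
We strongly recommend scraping information from Glassdoor for personal use only.
For example, suppose you are looking for a new job and want to quickly analyze the reviews of the companies you are interested in.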
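Prerequisites
To start scraping Glassdoor reviews, we will need the following:
- A Page2API account
- A company that we are interested in. In this case, that company is... Glassdoor. (The site hosts user reviews of Glassdoor itself as well.)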
How to scrape Glassdoor reviews
First, open glassdoor.com and type "Glassdoor reviews" into the search box.
This changes the URL in the browser's address bar to something like:
https://www.glassdoor.com/Reviews/Glassdoor-Reviews-E100431.htm
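We will use this URL as the first parameter needed to start the scraping process.
The page you see should look similar to the image below:
If you inspect the page's HTML, you will find that a single review looks like this:
From the Glassdoor reviews page, we will scrape the following attributes of each review:
- Title
- Author info
- Rating
- Pros
- Cons
- Helpful
Now, let's define a selector for each attribute.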
/* Parent: */
div.gdReview
/* Title */
a.reviewLink
/* Author Info */
.authorInfo
/* Rating */
span.ratingNumber
/* Pros */
span[data-test=pros]
/* Cons */
span[data-test=cons]
/* Helpful */
div.common__EiReviewDetailsStyle__socialHelpfulcontainer
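The selectors above can be collected in one place and turned into the "selector >> text" form that the request payload uses further down. The following is a minimal Python sketch of that idea; the names PARENT_SELECTOR, FIELD_SELECTORS and build_parse_schema are made up for illustration and are not part of Page2API.

```python
import json

# Selectors copied from the list above. The constant and function names
# are hypothetical helpers, not part of any Page2API client library.
PARENT_SELECTOR = "div.gdReview"

FIELD_SELECTORS = {
    "title": "a.reviewLink",
    "author_info": ".authorInfo",
    "rating": "span.ratingNumber",
    "pros": "span[data-test=pros]",
    "cons": "span[data-test=cons]",
    "helpful": "div.common__EiReviewDetailsStyle__socialHelpfulcontainer",
}


def build_parse_schema(parent, fields):
    """Return a parse schema that extracts the text of every field per review."""
    review = {"_parent": parent}
    for name, selector in fields.items():
        # ">> text" asks for the element's text content, as in the payload.
        review[name] = f"{selector} >> text"
    return {"reviews": [review]}


print(json.dumps(build_parse_schema(PARENT_SELECTOR, FIELD_SELECTORS), indent=2))
```

Keeping the selectors in a single dictionary means that a change in Glassdoor's markup only has to be fixed in one spot.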
Next, let's handle pagination.
To go to the next page, we must click the "Next" button, if it is present on the page:
document.querySelector(".nextButton").click()
Scraping continues as long as the "Next" button is present on the page and stops once it disappears.
The scraper's stop condition is the following JavaScript snippet:
document.querySelector(".nextButton") === null
// To avoid timeouts, however, we will scrape a fixed number of pages (see the payload below).
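To make the interaction between the two stopping rules concrete, here is a small Python simulation of the loop logic: scraping ends either when a page has no "Next" button or when the fixed page cap is reached, whichever comes first. It does not touch a browser; the function name and the list of booleans are hypothetical inputs chosen only for illustration.

```python
def pages_scraped(has_next_button, max_pages):
    """Count how many pages a scraping loop would process.

    has_next_button: one boolean per page, True if that page shows a
        "Next" button (hypothetical input standing in for the real DOM check).
    max_pages: the fixed cap on iterations used to avoid timeouts.
    """
    scraped = 0
    for has_next in has_next_button:
        scraped += 1                  # parse the current page
        if scraped >= max_pages:      # fixed iteration cap reached
            break
        if not has_next:              # stop condition: no "Next" button
            break
    return scraped


# Three pages exist, but the cap of 2 stops the loop early.
print(pages_scraped([True, True, False], max_pages=2))
# With a generous cap, the missing "Next" button ends the loop.
print(pages_scraped([True, True, False], max_pages=10))
```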
Now it's time to build the request that will scrape the Glassdoor reviews.
The payload of our scraping request will be:
{
"url": "https://www.glassdoor.com/Reviews/Glassdoor-Reviews-E100431.htm",
"real_browser": true,
"merge_loops": true,
"premium_proxy": "us",
"scenario": [
{
"loop": [
{ "wait_for": "div.gdReview" },
{ "execute": "parse" },
{ "execute_js": "document.querySelector(\".nextButton\").click()" }
],
"iterations": 2
}
],
"parse": {
"reviews": [
{
"_parent": "div.gdReview",
"title": "a.reviewLink >> text",
"author_info": ".authorInfo >> text",
"rating": "span.ratingNumber >> text",
"pros": "span[data-test=pros] >> text",
"cons": "span[data-test=cons] >> text",
"helpful": "div.common__EiReviewDetailsStyle__socialHelpfulcontainer >> text"
}
]
}
}
Set your api_key as an environment variable:
export API_KEY=YOUR_PAGE2API_KEY
Run the scraping request with cURL:
curl -v -XPOST -H "Content-type: application/json" -d '{
"api_key": "'"$API_KEY"'",
"url": "https://www.glassdoor.com/Reviews/Glassdoor-Reviews-E100431.htm",
"real_browser": true,
"merge_loops": true,
"premium_proxy": "us",
"scenario": [
{
"loop": [
{ "wait_for": "div.gdReview" },
{ "execute": "parse" },
{ "execute_js": "document.querySelector(\".nextButton\").click()" }
],
"iterations": 2
}
],
"parse": {
"reviews": [
{
"_parent": "div.gdReview",
"title": "a.reviewLink >> text",
"author_info": ".authorInfo >> text",
"rating": "span.ratingNumber >> text",
"pros": "span[data-test=pros] >> text",
"cons": "span[data-test=cons] >> text",
"helpful": "div.common__EiReviewDetailsStyle__socialHelpfulcontainer >> text"
}
]
}
}' 'https://www.page2api.com/api/v1/scrape' | python -mjson.tool
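For readers who prefer a script over the command line, the same request can be sketched in Python using only the standard library. The endpoint, headers and payload fields are taken from the cURL call above; the function names build_payload, build_request and scrape are illustrative, and the sketch assumes the API answers with a JSON body, as the cURL example suggests.

```python
import json
import urllib.request

# Endpoint taken from the cURL command above.
API_URL = "https://www.page2api.com/api/v1/scrape"


def build_payload(api_key, iterations=2):
    """Assemble the same payload that the cURL command sends."""
    return {
        "api_key": api_key,
        "url": "https://www.glassdoor.com/Reviews/Glassdoor-Reviews-E100431.htm",
        "real_browser": True,
        "merge_loops": True,
        "premium_proxy": "us",
        "scenario": [
            {
                "loop": [
                    {"wait_for": "div.gdReview"},
                    {"execute": "parse"},
                    {"execute_js": 'document.querySelector(".nextButton").click()'},
                ],
                "iterations": iterations,
            }
        ],
        "parse": {
            "reviews": [
                {
                    "_parent": "div.gdReview",
                    "title": "a.reviewLink >> text",
                    "author_info": ".authorInfo >> text",
                    "rating": "span.ratingNumber >> text",
                    "pros": "span[data-test=pros] >> text",
                    "cons": "span[data-test=cons] >> text",
                    "helpful": "div.common__EiReviewDetailsStyle__socialHelpfulcontainer >> text",
                }
            ]
        },
    }


def build_request(payload):
    """Wrap the payload in a JSON POST request without sending it."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(api_key):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(build_payload(api_key))) as response:
        return json.load(response)
```

Calling scrape(os.environ["API_KEY"]) after importing os would read the key exported earlier and perform the actual request. Building the request separately from sending it makes the payload easy to inspect before any credits are spent.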
The result:
{
"result": {
"reviews": [
{
"title": "Glassdoor Walks the Walk",
"author_info": "Jan 7, 2022 - Senior Manager",
"rating": "5.0",
"pros": "Glassdoor creates a positive environment for employees to learn and grow. ...",
"cons": "At any organization, there is always room for improvement. ...",
"helpful": "1 person found this review helpful"
},
{
"title": "Great Company To Work For",
"author_info": "Jan 5, 2022 - Customer Success Manager",
"rating": "4.0",
"pros": "I absolutely love working at Glassdoor. ...",
"cons": "While we do have more of an extensive career growth plan, ...",
"helpful": "2 people found this review helpful"
}, ...
]
}, ...
}
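Every field in the result comes back as a string, so a small post-processing step helps before any analysis. The Python sketch below converts the rating to a number, pulls the count out of the "helpful" text, and splits the author info into a date and a role. It assumes the field formats shown in the sample result above (for example "Jan 7, 2022 - Senior Manager"); the function name normalize_review is hypothetical.

```python
import re


def normalize_review(review):
    """Convert the string fields of one scraped review into typed values.

    Assumes the formats seen in the sample result: a rating like "5.0",
    helpful text starting with a number, and author info shaped as
    "<date> - <role>".
    """
    helpful_text = review.get("helpful") or ""
    match = re.match(r"(\d+)", helpful_text)

    author_info = review.get("author_info") or ""
    date, _, role = author_info.partition(" - ")

    return {
        **review,
        "rating": float(review["rating"]),
        "helpful": int(match.group(1)) if match else 0,
        "date": date,
        "role": role,
    }


sample = {
    "title": "Great Company To Work For",
    "author_info": "Jan 5, 2022 - Customer Success Manager",
    "rating": "4.0",
    "pros": "I absolutely love working at Glassdoor. ...",
    "cons": "While we do have more of an extensive career growth plan, ...",
    "helpful": "2 people found this review helpful",
}
print(normalize_review(sample))
```

A review without any "helpful" text is treated as having zero helpful votes, which is a choice made for this sketch rather than something the API guarantees.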
Conclusion
That's it!
We have just scraped reviews from Glassdoor, and with the right scraping tool the task turned out to be both simple and fun.
Original article:
page2api.com/blog/how-to-scrape-glassdoor-reviews/
Source: https://dev.to/nrotaru/how-to-scrape-glassdoor-reviews-362m


