What Is Crawler4JJS? Exploring the Features and Uses of This Java Crawler Framework
- 2025-01-20
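Crawler4jJS is an open-source crawler framework written in Java for fetching web page data. It offers a simple API, supports multithreading and custom crawl rules, and suits crawl jobs of any size.
crawler4jjs is designed for building multithreaded web crawlers quickly: its simple interface lets a developer set up a working crawler in a few minutes. This article covers crawler4jjs's features, installation, basic usage, and frequently asked questions.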
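I. Introduction to crawler4jjs
crawler4jjs is a simple open-source web crawler framework developed by Yasser Ganjisaffar. It supports multithreaded crawling and multi-level (deep) data collection, and it has a built-in URL filtering mechanism. Developers can use a parsing tool such as Jsoup to extract structured fields from the fetched pages. The source code can be downloaded from GitHub.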
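II. Main features of crawler4jjs
Feature | Description |
Multithreading | crawler4jjs runs several crawler threads at once, so multiple pages are fetched in parallel and overall throughput improves. |
Deep data collection | Follows links across multiple levels of a site, up to a configurable depth, to reach data beyond the seed pages. |
URL filtering | A built-in URL filtering mechanism lets you skip URLs you do not need before they are fetched. |
Data parsing | Page content can be parsed with tools such as Jsoup to extract structured data. |
Extensibility | crawler4jjs exposes a rich API that makes it straightforward to extend and customize. |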
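To make the "deep data collection" and "URL filtering" rows concrete, here is a minimal sketch of a depth-limited, breadth-first crawl over an in-memory link graph. It uses only the JDK and is not crawler4jjs's internal implementation; the class name DepthLimitedCrawlSketch, the crawl method, and the map-based link graph are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration; not part of the crawler4jjs API.
final class DepthLimitedCrawlSketch {

    /** A URL paired with its distance (in links) from the seed. */
    private static final class Entry {
        final String url;
        final int depth;

        Entry(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    /**
     * Walks the link graph breadth-first from the seed, visits each URL at most
     * once, and never expands links of pages that sit at maxDepth.
     * Returns the URLs in visit order.
     */
    static List<String> crawl(Map<String, List<String>> links, String seed, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Queue<Entry> frontier = new ArrayDeque<>();
        frontier.add(new Entry(seed, 0));
        seen.add(seed);

        while (!frontier.isEmpty()) {
            Entry current = frontier.remove();
            visited.add(current.url);
            if (current.depth == maxDepth) {
                continue; // depth limit reached: do not follow this page's links
            }
            List<String> outgoing = links.getOrDefault(current.url, Collections.<String>emptyList());
            for (String next : outgoing) {
                if (seen.add(next)) { // add() returns false for URLs already queued
                    frontier.add(new Entry(next, current.depth + 1));
                }
            }
        }
        return visited;
    }
}
```

With a seed "a" that links to "b" and "c", and "b" linking on to "d", a depth limit of 1 visits a, b, c and stops before d. The seen set is what prevents a page that links back to the seed from being queued twice.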
III. Installing and configuring crawler4jjs
1. Maven dependency
If you build your project with Maven, add the following dependency to pom.xml:
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.2</version>
</dependency>
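2. Manual download
If you are not using Maven, download the crawler4jjs JAR from the releases page or from Maven Central. Note that crawler4jjs has several external dependencies; the releases page also offers a bundled JAR that includes all of them (for example crawler4j-X.Y-with-dependencies.jar). Download it and add it to your classpath.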
IV. Basic usage of crawler4jjs
1. Creating the crawler class
To use crawler4jjs, create a crawler class that extends WebCrawler. Here is a simple example:
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

import java.io.FileWriter;
import java.io.IOException;
import java.util.Locale;
import java.util.regex.Pattern;

public class MyCrawler extends WebCrawler {

    // URLs ending in a static-resource or binary file extension are skipped.
    private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g|ico"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase(Locale.ROOT);
        // Visit only non-binary URLs that belong to the target site.
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            int docId = page.getWebURL().getDocid();
            // One text file per page, named after the document id.
            // The data/ directory must exist before the crawl starts.
            try (FileWriter writer = new FileWriter("data/" + docId + ".txt")) {
                writer.append("News id: " + docId).append("\n");
                writer.append("Title: " + htmlParseData.getTitle()).append("\n");
                writer.append("Link: " + url).append("\n");
                writer.append("Content: " + text).append("\n");
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
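In this example, MyCrawler extends WebCrawler and overrides the shouldVisit and visit methods. shouldVisit decides which URLs the crawler should fetch; visit processes each page after it has been fetched.
2. Running the crawler
To run the crawler above, pass the crawler class to a CrawlController: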
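The filtering decision in shouldVisit can be tried without a running crawl. The sketch below reproduces the same two checks (extension filter, then site prefix) on plain strings using only the JDK. The class name UrlFilterSketch and the shortened extension list are assumptions for illustration; it is not part of the library.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical stand-alone version of the shouldVisit logic, for experimenting.
final class UrlFilterSketch {

    // Shortened extension list; the dot is escaped as "\\." inside a Java string.
    private static final Pattern BINARY_EXTENSIONS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz|pdf))$");

    // Only URLs under this prefix are considered part of the target site.
    private static final String SITE_PREFIX = "http://www.example.com/";

    static boolean shouldVisit(String rawUrl) {
        // Lower-case first so that ".PNG" and ".png" are treated the same.
        String href = rawUrl.toLowerCase(Locale.ROOT);
        return !BINARY_EXTENSIONS.matcher(href).matches()
                && href.startsWith(SITE_PREFIX);
    }
}
```

For instance, UrlFilterSketch.shouldVisit("http://www.example.com/news/1.html") returns true, while "http://www.example.com/logo.PNG" and "http://other.example.org/index.html" both return false. Because the pattern is anchored to the end of the URL, a resource with a query string such as "http://www.example.com/app.js?v=2" is not filtered out; strip the query first if that matters for your crawl.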
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Main {

    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl"; // intermediate crawl data is stored here
        int numberOfCrawlers = 7;                   // number of concurrent crawler threads

        // Crawl configuration
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setMaxDepthOfCrawling(5);  // follow links at most 5 levels from the seed
        config.setPolitenessDelay(100);   // wait 100 ms between requests
        config.setUserAgentString("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36");
        config.setIncludeHttpsPages(true);
        config.setFollowRedirects(true);

        // The controller needs a page fetcher and a robots.txt handler.
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Seed URL: matches the prefix that MyCrawler.shouldVisit accepts.
        controller.addSeed("http://www.example.com/");

        // Blocks until the crawl finishes.
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}
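The numberOfCrawlers setting controls how many crawler threads work in parallel. As a rough mental model only, the sketch below shows several worker threads draining one shared queue of URLs with plain JDK concurrency classes. WorkerPoolSketch and processAll are hypothetical names, the "processing" step is a stand-in for fetching and parsing, and crawler4jjs's own scheduling may differ.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of N workers sharing one URL queue.
final class WorkerPoolSketch {

    /** Processes every URL exactly once using the given number of worker threads. */
    static Set<String> processAll(Iterable<String> urls, int numberOfWorkers) {
        Queue<String> frontier = new ConcurrentLinkedQueue<>();
        for (String url : urls) {
            frontier.add(url);
        }
        Set<String> processed = ConcurrentHashMap.newKeySet();

        ExecutorService pool = Executors.newFixedThreadPool(numberOfWorkers);
        for (int i = 0; i < numberOfWorkers; i++) {
            pool.submit(() -> {
                String url;
                // poll() hands each queued URL to exactly one worker.
                while ((url = frontier.poll()) != null) {
                    processed.add(url); // stand-in for fetching and parsing the page
                }
            });
        }

        pool.shutdown(); // accept no new tasks; let the workers drain the queue
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                pool.shutdownNow();
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return processed;
    }
}
```

Adding workers helps only while there are URLs waiting in the queue; against a single site, the politeness delay rather than the thread count usually limits throughput.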
Article link: http://www.xixizhuji.com/fuzhu/89837.html