Go Crawler Documentation #
About Go Crawler #
A web crawling framework implemented in Go. It is simple to write with, delivers strong performance, comes with a wide range of practical middleware, supports various parsing and storage methods, and can be deployed in a distributed setup.
Run #
```shell
git clone git@github.com:lizongying/go-crawler-example.git my-crawler
cd my-crawler
go run cmd/multi_spider/*.go -c example.yml -n test1 -m once
```
Features #
- Simple to write, yet powerful in performance.
- Comes with a variety of practical built-in middleware for easier development.
- Supports multiple parsing methods for simpler page parsing.
- Supports multiple storage methods for more flexible data storage.
- Provides numerous configuration options for richer customization.
- Allows components to be customized, providing more freedom for feature extension.
- Includes a built-in mock server for convenient debugging and development (a standard-library sketch of the idea follows this list).
- Supports distributed deployment.
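The framework's mock server has its own API; purely as a rough, framework-agnostic sketch of the idea, the example below spins up a throwaway HTTP server with the standard library's httptest package and requests a canned page from it. The route and response body are invented for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

func main() {
	// A tiny mock server: it serves a fixed HTML page so a spider can be
	// exercised without hitting a real site. The route and body are
	// placeholders for this sketch.
	mux := http.NewServeMux()
	mux.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		fmt.Fprint(w, `<html><body><a href="/next">next</a></body></html>`)
	})

	srv := httptest.NewServer(mux)
	defer srv.Close()

	// A plain HTTP client stands in for the spider here; during development
	// a spider would simply be pointed at srv.URL instead of a live site.
	resp, err := http.Get(srv.URL + "/ok")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.StatusCode, string(body))
}
```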
Support Summary #
- Parsing supports CSS, XPath, Regex, and JSON (a framework-agnostic parsing sketch follows this list).
- Output supports JSON, CSV, MongoDB, MySQL, SQLite, and Kafka.
- Supports decoding the Chinese character encodings GB2312, GB18030, GBK, and Big5.
- Supports gzip, deflate, and brotli decompression (see the decoding and decompression sketch after this list).
- Supports distributed processing.
- Supports Redis and Kafka as message queues.
- Supports automatic handling of cookies and redirects.
- Supports HTTP basic authentication (BaseAuth).
- Supports request retry (see the HTTP client sketch after this list).
- Supports request filtering.
- Supports image file downloads.
- Supports image processing.
- Supports object storage.
- Supports SSL fingerprint modification.
- Supports HTTP/2.
- Supports random request headers.
- Supports browser simulation.
- Supports browser AJAX requests.
- Supports a mock server.
- Supports a priority queue.
- Supports scheduled tasks, recurring tasks, and one-time tasks.
- Supports parsing based on field labels.
- Supports DNS Cache.
- Supports MITM (man-in-the-middle).
- Supports error logging.
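As a rough, framework-agnostic illustration of the parsing styles listed above, the sketch below pulls values out of an HTML fragment with a regular expression and out of a JSON payload with encoding/json, using only the standard library; CSS and XPath selection would normally go through an HTML parser and are not shown. The sample markup, payload, and field names are invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

func main() {
	// Regex extraction from an HTML fragment (sample markup made up for this sketch).
	html := `<li class="title">Hello</li><li class="title">World</li>`
	re := regexp.MustCompile(`<li class="title">(.*?)</li>`)
	for _, m := range re.FindAllStringSubmatch(html, -1) {
		fmt.Println("title:", m[1])
	}

	// JSON extraction (sample payload made up for this sketch).
	payload := []byte(`{"items":[{"name":"a","price":1.5},{"name":"b","price":2.0}]}`)
	var data struct {
		Items []struct {
			Name  string  `json:"name"`
			Price float64 `json:"price"`
		} `json:"items"`
	}
	if err := json.Unmarshal(payload, &data); err != nil {
		panic(err)
	}
	for _, it := range data.Items {
		fmt.Printf("name=%s price=%.2f\n", it.Name, it.Price)
	}
}
```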
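The character-set and decompression handling is built into the framework; purely to illustrate the underlying technique, here is a minimal sketch that round-trips GBK text through golang.org/x/text and a gzip body through compress/gzip. GB2312, GB18030, and Big5 follow the same pattern, deflate is covered by compress/flate, and brotli typically needs a third-party package such as github.com/andybalholm/brotli.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"

	"golang.org/x/text/encoding/simplifiedchinese"
)

func main() {
	// Character-set decoding (GBK shown; the other encodings work the same
	// way via the x/text encoding packages). The sample text is encoded to
	// GBK first so the example is self-contained.
	original := "你好，世界"
	gbkBytes, err := simplifiedchinese.GBK.NewEncoder().Bytes([]byte(original))
	if err != nil {
		panic(err)
	}
	decoded, err := simplifiedchinese.GBK.NewDecoder().Bytes(gbkBytes)
	if err != nil {
		panic(err)
	}
	fmt.Println("decoded:", string(decoded))

	// Decompression (gzip shown). A compressed body is produced in memory
	// here so the example does not depend on a real HTTP response.
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte("compressed response body"))
	zw.Close()

	zr, err := gzip.NewReader(&buf)
	if err != nil {
		panic(err)
	}
	defer zr.Close()
	body, _ := io.ReadAll(zr)
	fmt.Println("decompressed:", string(body))
}
```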
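Cookie handling, redirects, BaseAuth, retries, and random request headers are handled by the framework itself; the sketch below only shows what those behaviours look like at the plain net/http level. The target URL, credentials, and user-agent pool are placeholders.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"net/http/cookiejar"
	"time"
)

// userAgents is a small, made-up pool used to illustrate random request headers.
var userAgents = []string{
	"Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
}

func main() {
	// A cookie jar makes the client keep cookies across requests; the default
	// client already follows redirects (up to 10 hops).
	jar, err := cookiejar.New(nil)
	if err != nil {
		panic(err)
	}
	client := &http.Client{Jar: jar, Timeout: 15 * time.Second}

	// Hypothetical target URL and credentials for this sketch.
	req, err := http.NewRequest(http.MethodGet, "https://example.com/", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])
	req.SetBasicAuth("user", "pass")

	// A naive retry loop: try the request a few times with a growing pause.
	var resp *http.Response
	for attempt := 1; attempt <= 3; attempt++ {
		resp, err = client.Do(req)
		if err == nil && resp.StatusCode < 500 {
			break // success, or a non-retryable client error
		}
		if resp != nil {
			resp.Body.Close()
			resp = nil
		}
		time.Sleep(time.Duration(attempt) * time.Second)
	}
	if err != nil || resp == nil {
		panic(fmt.Sprintf("request failed after retries: %v", err))
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```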