GitHub - brianmadden/krawler: A web crawling framework written in Kotlin

About

Krawler is a web crawling framework written in Kotlin. It is heavily inspired by crawler4j by Yasser Ganjisaffar. The project is still very new, and those looking for a mature, well-tested crawler framework should probably still use crawler4j. For those who can tolerate a bit of turbulence, Krawler should serve as a replacement for crawler4j with minimal modifications to existing applications.

Some neat features and benefits of Krawler include:

Kotlin project!
Krawler differentiates between a "check" and a "visit". A check verifies the status code of a resource by issuing an HTTP HEAD request rather than a GET request. Each policy (visit or check) can have its own logic: implement shouldVisit and visit for pages fetched with GET, and shouldCheck and check for resources probed with HEAD.
Krawler's politeness delay is per-host rather than global. Individual servers aren't overwhelmed, yet crawls that visit many hosts in parallel are not effectively serialized by the politeness delay.
Krawler uses Jsoup to parse HTML while harvesting links, making it more tolerant of malformed or poorly written websites and thus less likely to error out during a crawl. The original HTML of the page remains available for validation and checking.
Krawler collects full anchor tags including all attributes and anchor text.
Krawler currently has no proxy support, but it is on the roadmap. :(
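
The per-host politeness delay described above can be illustrated with a small standalone sketch. This is not Krawler's implementation; the PerHostPoliteness class, its reserve method, and the example host names are hypothetical and exist only to show why a per-host delay does not serialize a crawl that spans many hosts.

```kotlin
import java.util.concurrent.ConcurrentHashMap

/**
 * Hypothetical helper (not part of Krawler's API): tracks, for each host,
 * the earliest time at which the next request may start.
 */
class PerHostPoliteness(private val delayMs: Long) {
    // host -> earliest timestamp (ms) at which the next request may start
    private val nextAllowed = ConcurrentHashMap<String, Long>()

    /**
     * Reserves the next request slot for [host] and returns how long the
     * caller should wait (in ms) before sending, given the current time.
     */
    fun reserve(host: String, nowMs: Long): Long {
        var wait = 0L
        // compute() updates the entry for this host atomically
        nextAllowed.compute(host) { _, earliest ->
            val start = maxOf(nowMs, earliest ?: nowMs)
            wait = start - nowMs
            start + delayMs
        }
        return wait
    }
}

fun main() {
    val politeness = PerHostPoliteness(delayMs = 1000)
    val now = 0L
    println(politeness.reserve("a.example", now)) // first request to a.example: 0
    println(politeness.reserve("a.example", now)) // same host again: 1000
    println(politeness.reserve("b.example", now)) // different host: 0
}
```

With a single global delay, the request to b.example would also have had to wait; keeping one timestamp per host lets requests to unrelated hosts proceed immediately.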

Add Dependency

Krawler is published through jitpack.io at https://jitpack.io/#brianmadden/krawler/. Add jitpack.io as a repository and Krawler as a dependency to use it in your project:

Using Gradle

repositories {
    jcenter()
    maven { url "https://jitpack.io" }
}

dependencies {
    // The 'compile' configuration was removed in Gradle 7; use 'implementation'
    implementation 'com.github.brianmadden:krawler:0.4.4'
}
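
For projects that use the Gradle Kotlin DSL, the same setup might look like the sketch below in build.gradle.kts. It assumes that Krawler's transitive dependencies can be resolved from Maven Central, which the original instructions do not state; add other repositories if resolution fails.

```kotlin
// build.gradle.kts (sketch; the repository choice besides JitPack is an assumption)
repositories {
    mavenCentral()
    maven { url = uri("https://jitpack.io") }
}

dependencies {
    implementation("com.github.brianmadden:krawler:0.4.4")
}
```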

Using Maven

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

<dependency>
    <groupId>com.github.brianmadden</groupId>
    <artifactId>krawler</artifactId>
    <version>0.4.4</version>
</dependency>

Usage

Using the Krawler framework is fairly simple. At a minimum, two methods must be overridden: shouldVisit determines which pages the crawler visits, and visit determines what happens once a page has been visited. Overriding these two methods is sufficient to create your own crawler; additional methods can be overridden to provide more robust behavior.
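
The division of labor between shouldVisit and visit can be sketched without the library. The MiniCrawler base class below is a hypothetical stand-in that walks an in-memory link graph; its name, constructor, and method signatures are assumptions made for illustration and are not Krawler's actual API.

```kotlin
// Hypothetical stand-in for a crawler base class; these names and signatures
// are illustrative assumptions, not Krawler's actual API.
abstract class MiniCrawler(private val web: Map<String, List<String>>) {
    /** Decides whether a discovered URL should be fetched at all. */
    abstract fun shouldVisit(url: String): Boolean

    /** Called once for each page that passed shouldVisit. */
    abstract fun visit(url: String, links: List<String>)

    /** Breadth-first crawl over the in-memory link graph, starting at [seed]. */
    fun start(seed: String) {
        val seen = mutableSetOf(seed)
        val queue = ArrayDeque(listOf(seed))
        while (queue.isNotEmpty()) {
            val url = queue.removeFirst()
            if (!shouldVisit(url)) continue
            val links = web[url] ?: emptyList()
            visit(url, links)
            for (link in links) {
                // Enqueue each newly discovered link exactly once
                if (seen.add(link)) queue.addLast(link)
            }
        }
    }
}

class ExampleCrawler(web: Map<String, List<String>>) : MiniCrawler(web) {
    val visited = mutableListOf<String>()

    // Policy: stay on one host and skip stylesheets.
    override fun shouldVisit(url: String): Boolean =
        url.startsWith("http://example.org/") && !url.endsWith(".css")

    // Action: record the page; a real crawler would process its content here.
    override fun visit(url: String, links: List<String>) {
        visited += url
    }
}

fun main() {
    val web = mapOf(
        "http://example.org/" to listOf(
            "http://example.org/a",
            "http://example.org/style.css",
            "http://other.example/"
        ),
        "http://example.org/a" to listOf("http://example.org/")
    )
    val crawler = ExampleCrawler(web)
    crawler.start("http://example.org/")
    println(crawler.visited) // [http://example.org/, http://example.org/a]
}
```

Keeping the filter (shouldVisit) separate from the handler (visit) means the crawl policy can change without touching the page-processing code.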
