ScalaHub

Goose - Article Extractor

Goose is an article extractor written in Scala. Point it at a URL and it will extract the plain text of the article, along with the main image, embedded movies, and any meta information it finds (tags, publish date). Awesome stuff, right?

Here’s an example of how to use it (it’s this simple!):

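import com.gravity.goose.{Configuration, Goose}

val url = "http://example.com/some-article" // placeholder: any article URL
val goose = new Goose(new Configuration)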
val article = goose.extractContent(url)

println(article.cleanedArticleText)
println(article.title)
println(article.tags)
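
Since extraction fetches a page over the network, the call can fail or come back with no usable text. The following is a minimal sketch of one way to guard against that, not something Goose provides: `SafeExtraction`, `ArticleSummary` and `safeExtract` are hypothetical names invented for this illustration, and the extractor is passed in as a plain function so the helper does not depend on Goose itself.

```scala
import scala.util.{Failure, Success, Try}

object SafeExtraction {
  // Hypothetical summary type for this sketch; not part of Goose.
  final case class ArticleSummary(title: String, text: String)

  // `extract` stands in for a call such as goose.extractContent(url),
  // mapped to ArticleSummary by the caller.
  def safeExtract(url: String)(extract: String => ArticleSummary): Either[String, ArticleSummary] =
    Try(extract(url)) match {
      // Extraction succeeded and produced some text
      case Success(summary) if summary.text.trim.nonEmpty => Right(summary)
      // Extraction succeeded but the text is empty or whitespace only
      case Success(_) => Left(s"No article text found at $url")
      // The extractor threw a non-fatal exception (network error, bad URL, ...)
      case Failure(e) => Left(s"Extraction failed for $url: ${e.getMessage}")
    }
}
```

With Goose, the function argument could wrap the calls from the example above, for instance `u => { val a = goose.extractContent(u); ArticleSummary(a.title, a.cleanedArticleText) }`, and the caller then pattern matches on `Left` or `Right`.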

Check out the live demo page, where you can try parsing a URL of your choice.

Goose was created by Jim Plush of Gravity.com. A Maven/SBT dependency is available on Maven Central for quick use in your project.