Data generation tool that builds public datasets from news sites for the SadedeGel extraction-based summarizer.

GlobalMaksimum/sadedegel-scraper

SadedeGel Scraper

This web scraper was developed to meet the data requirements of the SadedeGel library. It scrapes articles from news websites and stores them as .txt files. It was developed as part of Açık Kaynak Hackathon Programı 2020.

💬 Where to ask questions

The SadedeGel project is maintained by @globalmaksimum AI team members @dafajon, @askarbozcan, @mccakir and @husnusensoy.

Type | Platforms
🚨 Bug Reports | GitHub Issue Tracker
🎁 Feature Requests | GitHub Issue Tracker

How it works

  • Gets the author URLs of the given news website
  • Gets the article URLs of each author
  • Scrapes the text of each article and writes it to a .txt file
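
The three steps above can be sketched as a small pipeline. This is a minimal illustration and not the project's actual code: ScrapePipelineSketch, run, and its function parameters are hypothetical names, and the fetching and writing steps are passed in as functions so the sketch runs without network access.

```scala
object ScrapePipelineSketch {
  /** Runs the three steps and returns the number of articles written.
    * All parameters are hypothetical stand-ins for the real scraping logic.
    */
  def run(
      authorUrls: () => List[String],      // step 1: author URLs of the site
      articleUrls: String => List[String], // step 2: article URLs of one author
      articleText: String => String,       // step 3a: text of one article
      write: (String, String) => Unit      // step 3b: persist (url, text), e.g. to a .txt file
  ): Int = {
    // Visit every author, collect their article URLs, and drop duplicates.
    val articles = authorUrls().flatMap(articleUrls).distinct
    // Scrape and persist each article in turn.
    articles.foreach(url => write(url, articleText(url)))
    articles.size
  }
}
```

Passing the steps as functions keeps the sketch independent of any particular HTML parser; a real implementation would plug in its own page fetching and file writing.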

Install Scraper

You need sbt to build the project.

$ git clone https://github.com/GlobalMaksimum/sadedegel-scraper.git 
$ cd sadedegel-scraper
$ sbt assembly  

The assembled jar is written to ./target/scala-[version]/.

Example Run

$ nohup java -jar sadedegel-scraper-assembly-0.3.jar "hurriyet" > hurriyet.out &

Check the hurriyet-[dd-MM-yyyy] directory for the .txt files.
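
The [dd-MM-yyyy] part of the directory name is a day-month-year date pattern. As a rough illustration of how such a name can be built with java.time, here is a minimal sketch; OutputDirSketch and dirName are hypothetical names, and how the scraper itself computes the name, or which date it uses, is an assumption.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object OutputDirSketch {
  // Same pattern as in the directory name: day-month-year, zero-padded.
  private val dayFormat = DateTimeFormatter.ofPattern("dd-MM-yyyy")

  /** Builds a directory name of the form <source>-<dd-MM-yyyy>. */
  def dirName(source: String, date: LocalDate): String =
    s"$source-${date.format(dayFormat)}"
}
```

For example, OutputDirSketch.dirName("hurriyet", LocalDate.of(2020, 8, 15)) yields hurriyet-15-08-2020.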

For Developers

You can add support for additional news sources by extending the NewsWebsite trait.

Example:

import com.sadedegel.ScraperUtils
import com.sadedegel.ScraperUtils.getArticles

class HurriyetScraper extends NewsWebsite {
  val domain = "https://www.hurriyet.com.tr"
  val authorsUrl = "https://www.hurriyet.com.tr/yazarlar/tum-yazarlar/#hurriyetcomtr"

  // Author pages to scrape; hard-coded to a single author in this example.
  override def getAuthorUrls(): List[String] =
    List("https://www.hurriyet.com.tr/yazarlar/ilber-ortayli/")

  // Collects the article links of each author and hands them to writeArticlesToFile.
  override def getArticlesOfAuthors(authorUrls: List[String], domain: String): Unit = {
    getArticles(authorUrls, domain, ".highlighted-box.mb20", writeArticlesToFile, "?p=", "")
  }

  // Writes one article to a .txt file; the list holds CSS selectors for the article body.
  override def writeArticlesToFile(articleUrl: String): Unit = {
    ScraperUtils.writeToFile(articleUrl, List(".article-content.news-text", ".rhd-all-article-detail"),
      "hurriyet")
  }
}
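
The members a new scraper has to provide can be inferred from the overrides in the example above. The sketch below is a hypothetical reconstruction of that shape, not the project's actual NewsWebsite definition: NewsWebsiteSketch, InMemorySite, scrapeAll, and the news.example URLs are illustrative names, and the stub records results in memory instead of fetching pages.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical shape, inferred only from the overrides in the Hurriyet example.
trait NewsWebsiteSketch {
  val domain: String
  val authorsUrl: String
  def getAuthorUrls(): List[String]
  def getArticlesOfAuthors(authorUrls: List[String], domain: String): Unit
  def writeArticlesToFile(articleUrl: String): Unit

  // Illustrative driver (an assumption): chain author discovery into article scraping.
  def scrapeAll(): Unit = getArticlesOfAuthors(getAuthorUrls(), domain)
}

// Stub source that records "written" article URLs instead of touching the network or disk.
class InMemorySite extends NewsWebsiteSketch {
  val domain = "https://news.example"
  val authorsUrl = "https://news.example/authors"
  val written: ListBuffer[String] = ListBuffer.empty

  override def getAuthorUrls(): List[String] =
    List(s"$domain/authors/a", s"$domain/authors/b")

  // Pretend every author has exactly one article.
  override def getArticlesOfAuthors(authorUrls: List[String], domain: String): Unit =
    authorUrls.foreach(author => writeArticlesToFile(s"$author/article-1"))

  override def writeArticlesToFile(articleUrl: String): Unit = {
    written += articleUrl
    ()
  }
}
```

A stub like this can be handy for checking the control flow of a new source before pointing it at a live site.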
