Skip to content

This Java program extracts links from a list of URLs, saves the extracted links to files, and removes the processed URLs from the input list.

License

Notifications You must be signed in to change notification settings

akumathedyn123/java-url-collector

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 

Repository files navigation

ExtractLinks - A Java Web Crawler

This Java program extracts links from a list of URLs, saves the extracted links to files, and removes the processed URLs from the input list.

How to Clone

  1. Open your terminal or command prompt.
  2. Navigate to the directory where you want to clone the repository.
  3. Use the following command to clone the repository:
git clone https://github.com/akumathedyn123/java-url-collector.git

How to Run

Prerequisites:

  • Java 11 or above installed on your system.
  • Jsoup library (https://jsoup.org/download) added to your project. You can download the Jsoup jar file and place it in the same directory as your project or use a dependency management tool like Maven or Gradle.

Steps:

  1. Open a terminal or command prompt and navigate to the project directory.
  2. Compile the Java source code:
javac main.java

This will create an executable file named main.class.

  1. Run the program with the following command:
java main [arguments]

Arguments:

  • -i|--input <filename>: Path to the text file containing the list of URLs (default: urls.txt).
  • -o|--output <directory>: Directory path to save the extracted link files (default: /path/to/text).

Example Usage:

java main -i my_urls.txt -o extracted_links

This will process the URLs in my_urls.txt and save the extracted links to files in the extracted_links directory.

How it Works

  1. The program reads a list of URLs from a text file.
  2. It iterates through each URL in the list.
  3. For each URL, it downloads the HTML content using the downloadHtml function.
  4. It then parses the downloaded HTML with Jsoup to extract links from anchor tags (<a>) with href attributes using the extractLinks function.
  5. The extracted links are saved to a file named after the processed URL using the saveLinksToFile function.
  6. Finally, the processed URL is removed from the input file using the deleteUrl function.

License

This project is licensed under the MIT License (see LICENSE file for details).

Contribution

We welcome contributions to this project. Feel free to fork the repository and submit pull requests with your improvements.

About

This Java program extracts links from a list of URLs, saves the extracted links to files, and removes the processed URLs from the input list.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages