One common way to work with data is to retrieve it from a source: we write queries against a database, and the database returns the data we are looking for.
- Not all data is stored in databases. Data is often stored in documents and files, and some of those, such as HTML or XML files, are structured.
- We can use that structure to focus on specific parts of an HTML document. For example, to retrieve the content of a paragraph, we can use grep commands in bash to match the text inside the paragraph tags.
- Web scraping is a method of extracting data from websites.
- It parses the HTML document of a web page to collect the information it contains.
- Use cases:
  - Price comparison services.
  - Extracting product data from an e-commerce website.
- Pros:
  - Data is publicly accessible and free.
- Cons:
  - Legal and ethical considerations.
  - Subject to changes of website structures.
- Challenges:
  - Handling dynamic websites.
  - Handling anti-scraping mechanisms.
  - Maintaining scrapers.
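
To make the idea of parsing an HTML document concrete, here is a minimal sketch that collects the text of every paragraph using only Python's standard library. The sample HTML and the names `ParagraphExtractor` and `extract_paragraphs` are made up for illustration; a real scraper would first download the page and would more likely use a dedicated parsing library.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []      # finished paragraph strings
        self._in_paragraph = False
        self._buffer = []         # text fragments of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self.paragraphs.append("".join(self._buffer).strip())
            self._in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph.
        if self._in_paragraph:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the text of all paragraphs found in the given HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical product page, hard-coded so the example runs offline.
SAMPLE_HTML = """
<html><body>
  <h1>Laptop X</h1>
  <p>Price: <b>499</b> EUR</p>
  <p>In stock</p>
</body></html>
"""

print(extract_paragraphs(SAMPLE_HTML))  # ['Price: 499 EUR', 'In stock']
```

The sketch also shows why scrapers need maintenance: if the site moves the price out of a paragraph tag, the extractor silently stops finding it.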
- An API provides a bridge between your software and an external system. It exposes an interface that outside users interact with through specific endpoints to send or retrieve data.
- APIs help applications connect with each other.
- APIs provide structured data in a standardized format (JSON, XML).
- Use cases:
  - Integrating with third-party services.
  - Accessing real-time data, such as sports scores or weather data from a weather API.
- Pros:
  - Controlled access to data.
  - Less likely to violate terms of service.
  - Structured and consistent data.
- Cons:
  - Limited to the data provided by the API.
  - May require authentication and API keys.
  - An API key may cost money.
- Challenges:
  - Finding suitable APIs.
  - Understanding API documentation.
  - Handling rate limits.
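
As an illustration of calling an endpoint with an API key and handling rate limits, the following sketch retries a request when the server answers with HTTP 429 (Too Many Requests). The function name `fetch_json`, the bearer-token header, and the retry policy are assumptions made for this example; a real API documents its own authentication scheme and limits.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_json(url, api_key, max_retries=3,
               opener=urllib.request.urlopen, sleep=time.sleep):
    """GET a JSON document, retrying when the server answers 429 (rate limited)."""
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            # Assumption: the API expects a bearer token. Many APIs use a
            # different header or a query parameter instead.
            "Authorization": f"Bearer {api_key}",
        },
    )
    for attempt in range(max_retries + 1):
        try:
            with opener(request) as response:
                return json.loads(response.read().decode("utf-8"))
        except urllib.error.HTTPError as error:
            # Give up on any other error, or once the retries are used up.
            if error.code != 429 or attempt == max_retries:
                raise
            # Use Retry-After when it is a number of seconds,
            # otherwise back off exponentially (1 s, 2 s, 4 s, ...).
            retry_after = error.headers.get("Retry-After") if error.headers else None
            if retry_after and retry_after.isdigit():
                delay = float(retry_after)
            else:
                delay = 2.0 ** attempt
            sleep(delay)
```

A call would look like `fetch_json("https://api.example.com/weather?city=Dublin", "MY_KEY")`, where both the URL and the key are placeholders. The `opener` and `sleep` parameters exist only so the function can be exercised without network access.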
Web Scraping:
- When APIs are not available or do not provide the required data (e.g. Ryanair).
- When you need data from websites without exposing your identity or API usage.
APIs:
- When you need real-time or up-to-date data from a reliable source.
- When you want to ensure compliance with data usage policies and avoid potential legal issues.
Implement rate limiting and caching to reduce the load on servers.
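
One possible way to combine the two on the client side is sketched below: a wrapper that enforces a minimum delay between real requests and reuses recent responses instead of fetching them again. The class name `PoliteFetcher` and the default values are illustrative assumptions, not a standard recipe.

```python
import time


class PoliteFetcher:
    """Wrap a fetch function with a minimum delay between calls and a time-limited cache."""

    def __init__(self, fetch, min_interval=1.0, ttl=300.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch                # function that takes a URL and returns its content
        self._min_interval = min_interval  # seconds to leave between real requests
        self._ttl = ttl                    # seconds a cached response stays valid
        self._clock = clock
        self._sleep = sleep
        self._cache = {}                   # url -> (time stored, value)
        self._last_call = None             # time of the most recent real request

    def get(self, url):
        now = self._clock()
        cached = self._cache.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]               # fresh cache hit: no request is sent

        # Rate limiting: wait until min_interval has passed since the last request.
        if self._last_call is not None:
            wait = self._min_interval - (now - self._last_call)
            if wait > 0:
                self._sleep(wait)

        value = self._fetch(url)
        stamp = self._clock()
        self._last_call = stamp
        self._cache[url] = (stamp, value)
        return value
```

Any function that takes a URL and returns its content can be passed as `fetch`, so the same wrapper works for both a scraper and an API client.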