
Improve (& document) scraper testing workflow #37

Open
jamesturk opened this issue Jul 19, 2023 · 2 comments
@jamesturk (Owner)

I want to think through this a bit & welcome feedback from anyone who'd like better ways to test scrapers written using spatula.

The problem this is attempting to solve: when writing scrapers, you want the ability to test against a cached page, and you also want the ability to update that cached copy easily. This feels like it falls well within spatula's domain, and spatula could offer a solution that works for common cases.

I've considered a few approaches & am currently leaning towards the following:

Idea: Provide a helper to turn a Page into a TestablePage

Sources are responsible for fetching themselves in Source.get_response; by replacing a page's sources with special caching versions, an existing Page can be tested against a cached response.

def test_example_page():
    # this would replace all of a page's sources with a new TestCacheURL;
    # other parameters would stay the same
    page = make_testable_page(ExamplePage(...))
    assert page.process_page() == [1, 2, 3]

TestCacheURL would do the following:

- check a configurable location (spatula_testdata.sqlite3) for a cached copy of the response and, if present, return it as-is
- if a URL isn't present in the cache, raise an error unless a special environment variable (SPATULA_TEST_UPDATE_SOURCES) is set
- to make this easier to use, the CLI interface could add commands to check the status of the cache, clear entries, etc.
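
A rough sketch of how such a source could work, assuming sources keep their current get_response(scraper) contract; CachedResponse, the sqlite schema, and the fetch-with-requests fallback are illustrative, not an existing spatula API:

import os
import sqlite3
from dataclasses import dataclass

import requests


@dataclass
class CachedResponse:
    # just enough of the requests.Response surface for page postprocessing
    content: bytes


class TestCacheURL:
    def __init__(self, url, cache_path="spatula_testdata.sqlite3"):
        self.url = url
        self.db = sqlite3.connect(cache_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, content BLOB)"
        )

    def get_response(self, scraper):
        row = self.db.execute(
            "SELECT content FROM cache WHERE url = ?", (self.url,)
        ).fetchone()
        if row:
            # cache hit: return the stored response as-is
            return CachedResponse(row[0])
        if os.environ.get("SPATULA_TEST_UPDATE_SOURCES"):
            # cache miss, updates allowed: fetch live & store for next run
            content = requests.get(self.url).content
            self.db.execute("INSERT INTO cache VALUES (?, ?)", (self.url, content))
            self.db.commit()
            return CachedResponse(content)
        raise LookupError(
            f"{self.url} not cached; set SPATULA_TEST_UPDATE_SOURCES=1 to fetch it"
        )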

This would be pretty simple for 80% of cases; it might get complicated for pages that yield back other pages, since presumably you'd want their sources replaced too.
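
In the simple case, the helper itself could be tiny (a sketch, assuming the page's source is reachable as page.source with a .url attribute, as it is for URL sources today):

def make_testable_page(page):
    # swap the live source for the caching one; everything else about
    # the page stays untouched
    page.source = TestCacheURL(page.source.url)
    return page

Pages yielded from process_page would slip through a helper like this, which is exactly the complication above.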

I'd also considered just having a global flag that alters how URL sources work (SPATULA_TEST_MODE), but I'm not sure I like that approach yet.
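
For contrast, a sketch of the global-flag version, where the check moves into the stock URL source itself (scraper being the scrapelib.Scraper that spatula hands to get_response; the cache lookup reuses the TestCacheURL sketch above):

import os


class URL:
    def __init__(self, url):
        self.url = url

    def get_response(self, scraper):
        if os.environ.get("SPATULA_TEST_MODE"):
            # in test mode, every URL source silently reads from the cache
            return TestCacheURL(self.url).get_response(scraper)
        return scraper.request(method="GET", url=self.url)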

@jefftriplett

It'd be nice to be able to pass a string (response or byte-string or whatever you think is best) into our Page class via ExamplePage(response="...") where "..." is the contents of what we'd like it to parse. If we need to wrap the string with a Response object or something, that's fine too. Then we can test the scraper against it.

I'm happy to expand this more if it'd be helpful.
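
For illustration, the kind of API being asked for might look like this (hypothetical: spatula pages currently take source=, not response=, and LinkPage is a made-up example):

from spatula import HtmlPage


class LinkPage(HtmlPage):
    def process_page(self):
        return self.root.xpath("//a/text()")


# hypothetical constructor argument, not part of spatula today
page = LinkPage(response="<html><a href='/test'>link</a></html>")
assert page.process_page() == ["link"]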

@jefftriplett

This looks handy:

import json
from dataclasses import dataclass

from spatula import HtmlPage, URL

SOURCE = "https://example.com"  # placeholder URL for the example


@dataclass
class Response:
    content: bytes

    @property
    def text(self):
        # decode so .text returns a str, mirroring requests.Response.text
        return self.content.decode()

    def json(self):
        return json.loads(self.content)


def test_html_page():
    class ConcreteHtmlPage(HtmlPage):
        def process_page(self):
            pass

    p = ConcreteHtmlPage(source=URL(SOURCE))
    p.response = Response(b"<html><a href='/test'>link</a></html>")
    p.postprocess_response()

I'll try again to test a few of my scrapers this week. I think the main pain point is having a way to test selectors for a given page type and quickly see what broke.
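
Building on the Response stub above, a quick selector test could look like this (a sketch: LinkCheckPage is a made-up page, and it assumes HtmlPage.postprocess_response populates the parsed lxml tree on p.root, as it does today):

def test_link_selector():
    class LinkCheckPage(HtmlPage):
        def process_page(self):
            pass

    p = LinkCheckPage(source=URL(SOURCE))
    p.response = Response(b"<html><a href='/test'>link</a></html>")
    p.postprocess_response()
    # assert directly against the parsed tree; a broken selector fails here
    assert p.root.xpath("//a/text()") == ["link"]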
