
Improve (& document) scraper testing workflow #37

Open
jamesturk opened this issue Jul 19, 2023 · 2 comments
@jamesturk (Owner)

I want to think through this a bit & welcome feedback from anyone who'd like better ways to test scrapers written using spatula.

The problem this is attempting to solve: when writing scrapers, you want the ability to test against a cached page, and you also want the ability to update that cached copy easily. This feels like it falls well within spatula's domain, and spatula could offer a solution that works for common cases.

I've considered a few approaches & am currently leaning towards the following:

Idea: Provide a helper to turn a Page into a TestablePage

Sources are responsible for fetching themselves in Source.get_response; by replacing a page's sources with special caching versions, an existing Page can be tested against a cached response.

def test_example_page():
    # this would replace all of a page's sources with a new TestCacheURL;
    # other parameters would stay the same
    page = make_testable_page(ExamplePage(...))
    assert page.process_page() == [1, 2, 3]

TestCacheURL would do the following:

- check a configurable location (spatula_testdata.sqlite3) for a cached copy of the response and, if present, return it as-is
- if a URL isn't present in the cache, raise an error unless a special environment variable (SPATULA_TEST_UPDATE_SOURCES) is set
- to make this easier to use, the CLI interface could add commands to check the status of the cache, clear entries, etc.
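
A rough sketch of how such a source could work, assuming sources keep their current get_response(scraper) contract; CachedResponse, the sqlite schema, and the fetch-with-requests fallback are illustrative, not an existing spatula API:

import os
import sqlite3
from dataclasses import dataclass

import requests


@dataclass
class CachedResponse:
    # just enough of the requests.Response surface for page postprocessing
    content: bytes


class TestCacheURL:
    def __init__(self, url, cache_path="spatula_testdata.sqlite3"):
        self.url = url
        self.db = sqlite3.connect(cache_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, content BLOB)"
        )

    def get_response(self, scraper):
        row = self.db.execute(
            "SELECT content FROM cache WHERE url = ?", (self.url,)
        ).fetchone()
        if row:
            # cache hit: return the stored response as-is
            return CachedResponse(row[0])
        if os.environ.get("SPATULA_TEST_UPDATE_SOURCES"):
            # cache miss, updates allowed: fetch live & store for next run
            content = requests.get(self.url).content
            self.db.execute("INSERT INTO cache VALUES (?, ?)", (self.url, content))
            self.db.commit()
            return CachedResponse(content)
        raise LookupError(
            f"{self.url} not cached; set SPATULA_TEST_UPDATE_SOURCES=1 to fetch it"
        )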

This would be pretty simple for 80% of cases; it might get complicated for pages that yield back other pages, since presumably you'd want their sources replaced too.
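
In the simple case, the helper itself could be tiny (a sketch, assuming the page's source is reachable as page.source with a .url attribute, as it is for URL sources today):

def make_testable_page(page):
    # swap the live source for the caching one; everything else about
    # the page stays untouched
    page.source = TestCacheURL(page.source.url)
    return page

Pages yielded from process_page would slip through a helper like this, which is exactly the complication above.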

I'd also considered just having a global flag that alters how URL sources work (SPATULA_TEST_MODE), but I'm not sure I like that approach yet.
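
For contrast, a sketch of the global-flag version, where the check moves into the stock URL source itself (scraper being the scrapelib.Scraper that spatula hands to get_response; the cache lookup reuses the TestCacheURL sketch above):

import os


class URL:
    def __init__(self, url):
        self.url = url

    def get_response(self, scraper):
        if os.environ.get("SPATULA_TEST_MODE"):
            # in test mode, every URL source silently reads from the cache
            return TestCacheURL(self.url).get_response(scraper)
        return scraper.request(method="GET", url=self.url)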

@jefftriplett

It'd be nice to be able to pass a string (response or byte-string or whatever you think is best) into our Page class via ExamplePage(response="...") where "..." is the contents of what we'd like it to parse. If we need to wrap the string with a Response object or something, that's fine too. Then we can test the scraper against it.

I'm happy to expand this more if it'd be helpful.
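
For illustration, the kind of API being asked for might look like this (hypothetical: spatula pages currently take source=, not response=, and LinkPage is a made-up example):

from spatula import HtmlPage


class LinkPage(HtmlPage):
    def process_page(self):
        return self.root.xpath("//a/text()")


# hypothetical constructor argument, not part of spatula today
page = LinkPage(response="<html><a href='/test'>link</a></html>")
assert page.process_page() == ["link"]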

@jefftriplett

This looks handy:

import json
from dataclasses import dataclass

from spatula import HtmlPage, URL

SOURCE = "https://example.com"  # placeholder URL for the example


@dataclass
class Response:
    content: bytes

    @property
    def text(self):
        # decode so .text returns a str, mirroring requests.Response.text
        return self.content.decode()

    def json(self):
        return json.loads(self.content)


def test_html_page():
    class ConcreteHtmlPage(HtmlPage):
        def process_page(self):
            pass

    p = ConcreteHtmlPage(source=URL(SOURCE))
    p.response = Response(b"<html><a href='/test'>link</a></html>")
    p.postprocess_response()

I'll try again to test a few of my scrapers this week. I think the main pain point is having a way to test selectors for a given page type and quickly see what broke.
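
Building on the Response stub above, a quick selector test could look like this (a sketch: LinkCheckPage is a made-up page, and it assumes HtmlPage.postprocess_response populates the parsed lxml tree on p.root, as it does today):

def test_link_selector():
    class LinkCheckPage(HtmlPage):
        def process_page(self):
            pass

    p = LinkCheckPage(source=URL(SOURCE))
    p.response = Response(b"<html><a href='/test'>link</a></html>")
    p.postprocess_response()
    # assert directly against the parsed tree; a broken selector fails here
    assert p.root.xpath("//a/text()") == ["link"]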
