Review research on the state-of-the-art for self-hosted HTML document processing #54
-
Awesome post, thanks for putting all that info together. I stumbled upon some other papers, but none are a better match or more useful than the ones you listed:
One more interesting dataset:
Did you find the source for the `WebC-T5-large` model?
-
No, not yet, but at least the weights are in https://console.cloud.google.com/storage/browser/gresearch/webllm. Not sure how to load those yet either.
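In case it's useful, that console URL should correspond to `gs://gresearch/webllm`, so something like this should at least list and download the checkpoint files (a sketch assuming the bucket allows anonymous reads; it doesn't solve the "how to load them" part):

```python
# Sketch: list and download the checkpoint files from the public bucket.
# Assumes the console URL above maps to gs://gresearch/webllm and that the
# bucket allows unauthenticated reads.
from pathlib import Path

from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client.create_anonymous_client()
out_dir = Path("webllm_weights")
out_dir.mkdir(exist_ok=True)

for blob in client.list_blobs("gresearch", prefix="webllm/"):
    print(blob.name, blob.size)
    blob.download_to_filename(str(out_dir / Path(blob.name).name))
```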
-
The weights are the ones for WebD, which is for web automation, not WebC, which is for semantic classification.
-
Ah, too bad. Looks like they're still probably working on it: https://twitter.com/AleksandraFaust/status/1579932636893442048. Perhaps @sandraorion would be willing to weigh in (on open-sourcing the code from Understanding HTML with Large Language Models (https://arxiv.org/abs/2210.03945), that is)?
-
+Izzeddin Gur, can you answer?
-
Another exciting paper in this direction by @izzeddingur: A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. I don't see any links to source code or weights though, sadly.
-
I'm curious about this dataset, but there were no links in the paper. Would love to know when it becomes open-sourced.
-
My intent with this discussion is to track the moment when any model/library/system for information extraction over web documents becomes capable of fully automating scraping tasks AND can be self-hosted on a reasonable budget. Let's say that "reasonable budget" means being able to process 1,000 HTML docs averaging about 2.2MB in size (the mean web page size in 2021, according to the HTTP Archive) for $1 USD.
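To put a rough number on that, here's some back-of-the-envelope math (a sketch only; it assumes the 1 token ≈ 4 characters rule of thumb used later in this post and that the raw HTML is fed to a model as-is):

```python
# Back-of-the-envelope: what does "$1 for 1,000 average-sized pages" imply
# about the token price a model/system would need to hit?
# Assumes 1 token ~= 4 characters and that the full raw HTML is processed.
pages = 1000
chars_per_page = 2.2e6          # ~2.2MB mean page size (HTTP Archive, 2021)
chars_per_token = 4

tokens_per_page = chars_per_page / chars_per_token      # ~550k tokens
total_tokens = pages * tokens_per_page                  # ~550M tokens
budget_usd = 1.0

price_per_million_tokens = budget_usd / (total_tokens / 1e6)
print(f"~{tokens_per_page / 1e3:.0f}k tokens/page, "
      f"needs <= ${price_per_million_tokens:.4f} per 1M tokens")
# -> roughly $0.0018 per 1M tokens if raw HTML is used, which is part of why
#    paring pages down (see the Context Length section below) matters so much.
```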
I think it's also worth adding some context on what "scraping" requires and where that sits in the spectrum of related research problems. Roughly speaking, I would put it on a spectrum like this:
`product-row` -- you would have to look at the page as a human, or understand that the text in some of the HTML elements has something to do with t-shirts, before knowing what logic to program into the scraper (and it would likely be very brittle and difficult to maintain; see the sketch below).
My interest is primarily in case 2 above, since I know case 3 is still likely to be a ways off, and I think that case 2 best aligns with what @jamesturk intended to be in scope for scrapeghost (see #28 (comment)).
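To make the brittleness point concrete, here's a minimal sketch of the kind of hand-written scraper I mean; the `product-row` class and the other selectors are hypothetical, purely for illustration:

```python
# Minimal hand-written scraper sketch (hypothetical page structure).
# You only know to target ".product-row" and its children after inspecting
# the page yourself; any markup change by the site silently breaks this.
import requests
from bs4 import BeautifulSoup

def scrape_products(url: str) -> list[dict]:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for row in soup.select(".product-row"):        # hypothetical class name
        name = row.select_one(".product-name")     # hypothetical selectors
        price = row.select_one(".price")
        products.append({
            "name": name.get_text(strip=True) if name else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return products
```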
Research
That said, here are a few legacy papers/models/projects mostly in the direction of "automated parsing", some of which were mentioned in #26:
And here are some projects on general-purpose LLMs that are far more likely to be immediately useful in "automated scraping" tasks:
The `WebC-T5-large` model, with only 800M parameters, is quite competitive with the best models like `WebC-PaLM-62B`.
tl;dr: `WebC-T5-large` (800M params) is probably the most promising model I've seen so far for self-hosted scraping. I haven't tried it out yet, but I'll share some results if/when I do!
Datasets
https://console.cloud.google.com/storage/browser/gresearch/webllm
Context Length
It's also worth noting that I think truly automating scraping tasks depends on support for a large context window, in terms of max tokens. I'm sure there is a whole zoo of strategies for reducing HTML content in prompts via ARIA trees, prompt compression, or just writing some parsing code yourself to pare an HTML page down to something manageable. Personally, I'm holding out for when it's possible to avoid writing/maintaining any code at all for scraping.
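For reference, even a crude version of the "write some parsing code yourself" option cuts the prompt size a lot. A minimal sketch (just the obvious baseline, not what any of the papers above do):

```python
# Crude HTML "compression" baseline: drop everything an extraction prompt
# usually doesn't need before sending the page to a model.
from bs4 import BeautifulSoup
from bs4.element import Comment

def pare_down_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove non-content elements wholesale.
    for tag in soup(["script", "style", "noscript", "svg", "iframe"]):
        tag.decompose()
    # Remove HTML comments.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # Strip most attributes, keeping a few that carry semantics.
    keep = {"href", "src", "alt", "title", "aria-label", "role"}
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep}
    return str(soup)
```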
I thought it would at least be helpful to do some back-of-the-envelope math on this to get a sense of what size context window is necessary. The distribution of HTML character counts in web pages looks like this according to ClueWeb22:
Assuming 1 token ≈ 4 characters [1], this would suggest that a 117k-character page requires a ~29,250-token (117k / 4) context.
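That 4-characters-per-token rule of thumb is easy to check against real pages with a tokenizer; a quick sketch using tiktoken (the ratio for raw HTML may differ from prose, so treat 4 as an assumption):

```python
# Check the chars-per-token ratio on an actual page instead of assuming 4.
import requests
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer for gpt-3.5/gpt-4 models

html = requests.get("https://example.com").text  # any page you care about
n_chars = len(html)
# disallowed_special=() avoids errors if the page happens to contain
# special-token strings.
n_tokens = len(enc.encode(html, disallowed_special=()))
print(f"{n_chars} chars, {n_tokens} tokens, {n_chars / n_tokens:.2f} chars/token")
```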
And then given that a `gpt-4-32k` model already exists and gpt-3.5-turbo now supports a 16k token context, presumably it is just a matter of time before smaller, economical, OSS models can match both the capabilities of those models and the 32k context support. That should be enough for most web pages if you take that ClueWeb22 data at face value and note that for a heavy-tailed distribution like this, the typical (median) page will be a good deal smaller than the mean. You can also see in the distribution that there will still be a long tail of massive pages to deal with, but I would say that's much easier to address than layering more extensive compression/extraction heuristics on top of much smaller context windows.
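And for that long tail, the simplest fallback I can think of is windowing the token stream into overlapping chunks that fit the context and merging the extraction results afterwards; a rough sketch (the 32k limit and 500-token overlap are arbitrary choices):

```python
# Split an over-long token sequence into overlapping windows that each fit a
# fixed context size, so the long tail of huge pages can still be processed.
import tiktoken

def chunk_tokens(text: str, max_tokens: int = 32_000, overlap: int = 500) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text, disallowed_special=())
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(enc.decode(window))
        if start + max_tokens >= len(tokens):
            break
    return chunks
```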