Review research on the state-of-the-art for self-hosted HTML document processing #54
-
Awesome post, thanks for putting all that info together. I stumbled upon some other papers, but none are a better match or more useful than the ones you listed:
One more interesting dataset:
Did you find the source for the `WebC-T5-large` model?
-
No, not yet, but at least the weights are in https://console.cloud.google.com/storage/browser/gresearch/webllm. Not sure how to load those yet either.
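In case it's useful, that console URL should correspond to `gs://gresearch/webllm`, so something like this should at least list and download the checkpoint files (a sketch assuming the bucket allows anonymous reads; it doesn't solve the "how to load them" part):

```python
# Sketch: list and download the checkpoint files from the public bucket.
# Assumes the console URL above maps to gs://gresearch/webllm and that the
# bucket allows unauthenticated reads.
from pathlib import Path

from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client.create_anonymous_client()
out_dir = Path("webllm_weights")
out_dir.mkdir(exist_ok=True)

for blob in client.list_blobs("gresearch", prefix="webllm/"):
    print(blob.name, blob.size)
    blob.download_to_filename(str(out_dir / Path(blob.name).name))
```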
-
The weights are the ones for WebD, which is for web automation, not WebC, which is for semantic classification.
-
Ah, too bad. Looks like they're still probably working on it: https://twitter.com/AleksandraFaust/status/1579932636893442048. Perhaps @sandraorion would be willing to weigh in (on open-sourcing the code from Understanding HTML with Large Language Models (https://arxiv.org/abs/2210.03945), that is)?
-
+Izzeddin Gur, can you answer?
-
Another exciting paper in this direction by @izzeddingur: A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. I don't see any links to source code or weights though, sadly.
-
I'm curious about this dataset, but there were no links in the paper. Would love to know when it becomes open-sourced.
-
My intent with this discussion is to track the moment when any model/library/system for information extraction over web documents becomes capable of fully automating scraping tasks AND can be self-hosted on a reasonable budget. Let's say that "reasonable budget" means being able to process 1,000 HTML docs averaging about 2.2MB in size (the mean web page size in 2021, according to the HTTP Archive) for $1 USD.
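To put a rough number on that, here's some back-of-the-envelope math (a sketch only; it assumes the 1 token ≈ 4 characters rule of thumb used later in this post and that the raw HTML is fed to a model as-is):

```python
# Back-of-the-envelope: what does "$1 for 1,000 average-sized pages" imply
# about the token price a model/system would need to hit?
# Assumes 1 token ~= 4 characters and that the full raw HTML is processed.
pages = 1000
chars_per_page = 2.2e6          # ~2.2MB mean page size (HTTP Archive, 2021)
chars_per_token = 4

tokens_per_page = chars_per_page / chars_per_token      # ~550k tokens
total_tokens = pages * tokens_per_page                  # ~550M tokens
budget_usd = 1.0

price_per_million_tokens = budget_usd / (total_tokens / 1e6)
print(f"~{tokens_per_page / 1e3:.0f}k tokens/page, "
      f"needs <= ${price_per_million_tokens:.4f} per 1M tokens")
# -> roughly $0.0018 per 1M tokens if raw HTML is used, which is part of why
#    paring pages down (see the Context Length section below) matters so much.
```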
I think it's also worth adding some context on what "scraping" requires and where that sits in the spectrum of related research problems. Roughly speaking, I would put it on a spectrum like this:
`product-row` -- you would have to look at the page as a human, or understand that the text in some of the HTML elements has something to do with t-shirts, before knowing what logic to program into the scraper (and it would likely be very brittle and difficult to maintain; see the sketch below).
My interest is primarily in case 2 above, since I know case 3 is still likely to be a ways off, and I think that case 2 best aligns with what @jamesturk intended to be in scope for scrapeghost (see #28 (comment)).
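To make the brittleness point concrete, here's a minimal sketch of the kind of hand-written scraper I mean; the `product-row` class and the other selectors are hypothetical, purely for illustration:

```python
# Minimal hand-written scraper sketch (hypothetical page structure).
# You only know to target ".product-row" and its children after inspecting
# the page yourself; any markup change by the site silently breaks this.
import requests
from bs4 import BeautifulSoup

def scrape_products(url: str) -> list[dict]:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for row in soup.select(".product-row"):        # hypothetical class name
        name = row.select_one(".product-name")     # hypothetical selectors
        price = row.select_one(".price")
        products.append({
            "name": name.get_text(strip=True) if name else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return products
```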
Research
That said, here are a few legacy papers/models/projects mostly in the direction of "automated parsing", some of which were mentioned in #26:
And here are some projects on general-purpose LLMs that are far more likely to be immediately useful in "automated scraping" tasks:
The `WebC-T5-large` model, with only 800M parameters, is quite competitive with the best models like `WebC-PaLM-62B`.
tl;dr: `WebC-T5-large` (800M params) is probably the most promising model I've seen so far for self-hosted scraping. I haven't tried it out yet, but I'll share some results if/when I do!
Datasets
https://console.cloud.google.com/storage/browser/gresearch/webllm
Context Length
It's also worth noting that I think truly automating scraping tasks depends on support for a large context window, in terms of max tokens. I'm sure there is a whole zoo of strategies for reducing HTML content in prompts via ARIA trees, prompt compression, or just writing some parsing code yourself to pare an HTML page down to something manageable. Personally, I'm holding out for when it's possible to avoid writing/maintaining any code at all for scraping.
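For reference, even a crude version of the "write some parsing code yourself" option cuts the prompt size a lot. A minimal sketch (just the obvious baseline, not what any of the papers above do):

```python
# Crude HTML "compression" baseline: drop everything an extraction prompt
# usually doesn't need before sending the page to a model.
from bs4 import BeautifulSoup
from bs4.element import Comment

def pare_down_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove non-content elements wholesale.
    for tag in soup(["script", "style", "noscript", "svg", "iframe"]):
        tag.decompose()
    # Remove HTML comments.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # Strip most attributes, keeping a few that carry semantics.
    keep = {"href", "src", "alt", "title", "aria-label", "role"}
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep}
    return str(soup)
```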
I thought it would at least be helpful to do some back-of-the-envelope math on this to get a sense of what size context window is necessary. The distribution of HTML character counts in web pages looks like this according to ClueWeb22:
Assuming 1 token ≈ 4 characters [1], this would suggest that a 117k-character page requires a ~29,250-token (117k / 4) context.
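That 4-characters-per-token rule of thumb is easy to check against real pages with a tokenizer; a quick sketch using tiktoken (the ratio for raw HTML may differ from prose, so treat 4 as an assumption):

```python
# Check the chars-per-token ratio on an actual page instead of assuming 4.
import requests
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer for gpt-3.5/gpt-4 models

html = requests.get("https://example.com").text  # any page you care about
n_chars = len(html)
# disallowed_special=() avoids errors if the page happens to contain
# special-token strings.
n_tokens = len(enc.encode(html, disallowed_special=()))
print(f"{n_chars} chars, {n_tokens} tokens, {n_chars / n_tokens:.2f} chars/token")
```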
And then given that a `gpt-4-32k` model already exists and gpt-3.5-turbo now supports a 16k token context, presumably it is just a matter of time before smaller, economical, OSS models can match both the capabilities of those models and the 32k context support. That should be enough for most web pages if you take that ClueWeb22 data at face value and note that for a heavy-tailed distribution like this, the typical (median) page will be a good deal smaller than the mean. You can also see in the distribution that there will still be a long tail of massive pages to deal with, but I would say that's much easier to address than layering more extensive compression/extraction heuristics on top of much smaller context windows.
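And for that long tail, the simplest fallback I can think of is windowing the token stream into overlapping chunks that fit the context and merging the extraction results afterwards; a rough sketch (the 32k limit and 500-token overlap are arbitrary choices):

```python
# Split an over-long token sequence into overlapping windows that each fit a
# fixed context size, so the long tail of huge pages can still be processed.
import tiktoken

def chunk_tokens(text: str, max_tokens: int = 32_000, overlap: int = 500) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text, disallowed_special=())
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(enc.decode(window))
        if start + max_tokens >= len(tokens):
            break
    return chunks
```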