bdrad/keppel

book-scrape

Motivated by Textbooks Are All You Need.

todos

  • Generate requirements.txt
    • Use dependency manager -- setting up detectron2 is a pain
  • Implement logging
    • Log what is printed now (text box collisions, dehyphenation failures, etc.)
    • Save to an actual file, so log events can be compared after code updates
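
The logging todo above could be sketched roughly as follows, using only Python's standard `logging` module. The function name `get_scrape_logger`, the logger name, and the log format are assumptions made for illustration, not part of the existing code. The file format deliberately omits timestamps so that two log files from different code versions can be diffed line by line.

```python
import logging


def get_scrape_logger(log_path: str, name: str = "book_scrape") -> logging.Logger:
    """Return a logger that writes to both a file and the console.

    Hypothetical helper: the name and format are illustrative assumptions.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)

    # Avoid attaching duplicate handlers if the function is called twice.
    if logger.handlers:
        return logger

    # File output has no timestamp, so logs from two runs diff cleanly.
    file_handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    file_handler.setFormatter(logging.Formatter("%(levelname)s\t%(message)s"))
    logger.addHandler(file_handler)

    # Console output keeps a timestamp for interactive use.
    console_handler = logging.StreamHandler()
    console_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(console_handler)

    return logger
```

A call such as `log.warning("text box collision on page %d", page_number)` would then replace the corresponding `print`, and the resulting files from before and after a code change can be compared with any diff tool.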

Scraping

  • Parse math equations, e.g. page 288 of Chest - Webb - Fundamentals of Body CT (4e)
  • Parse tables? This may be difficult, and perhaps irrelevant for the purpose of training LLMs

Post-processing

  • Remove references to figures, tables, and other information that is not scraped
  • De-hyphenate words (TEXT_DEHYPHENATE flag doesn't work?)
    • A greedy approach may not be sufficient, e.g. 'x-ray' is a common word that must keep its hyphen
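
One possible non-greedy approach to the de-hyphenation item above is a vocabulary check: join the two halves of a line-break hyphenation only when the merged word is known, and keep the hyphen for allow-listed terms such as 'x-ray'. This is a minimal sketch that is independent of the TEXT_DEHYPHENATE flag. The function name `dehyphenate`, the `KEEP_HYPHEN` allow-list, and the `vocabulary` argument are all assumptions for illustration; a real vocabulary would have to come from a word list or from the corpus itself.

```python
import re

# Hypothetical allow-list of terms that are legitimately hyphenated.
KEEP_HYPHEN = {"x-ray", "follow-up", "t1-weighted"}

# A word fragment, a hyphen at the end of a line, then the next fragment.
_LINE_BREAK_HYPHEN = re.compile(r"(\w+)-\s*\n\s*(\w+)")


def dehyphenate(text, vocabulary, keep_hyphen=KEEP_HYPHEN):
    """Join words split across line breaks only when the result is a known word."""

    def join(match):
        left, right = match.group(1), match.group(2)
        hyphenated = f"{left}-{right}"
        merged = left + right
        if hyphenated.lower() in keep_hyphen:
            return hyphenated  # e.g. 'x-ray' keeps its hyphen
        if merged.lower() in vocabulary:
            return merged  # e.g. 'consoli-' + 'dation' -> 'consolidation'
        return hyphenated  # unknown word: conservatively keep the hyphen

    return _LINE_BREAK_HYPHEN.sub(join, text)
```

In every case the line break itself is removed. Unknown words keep their hyphen, so they remain visible in the output (and could be logged) rather than being silently merged into a non-word.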

books (to) support

  • Chest - Elicker - HRCT of the Lungs 2e
  • Chest - Felson - Principles of Chest Roentgenology (4e)
  • Cardiac Imaging Requisites 4e
  • Chest - Webb - Fundamentals of Body CT (4e)
  • Emergency Radiology Requisites 2e
  • Fundamentals of Body CT 4e
  • General - Brant _ Helms - Fundamentals of Diagnostic Radiology (4e)
  • General - Mandell - Core Radiology (1e)
  • General - Weissleder - Primer of Diagnostic Imaging (5e)
  • Vascular and Interventional Radiology Requisites 2e

About

Scraping texts for LLM training
