Motivated by Textbooks Are All You Need.
- Generate `requirements.txt`
- Use a dependency manager, since setting up detectron2 is a pain
- Implement logging
- Log what is currently printed (text box collisions, de-hyphenation failures, etc.)
- Save logs to an actual file so that log events can be compared after code updates
- Parse math equations, e.g. page 288 of Chest - Webb - Fundamentals of Body CT (4e)
- Parse tables? This may be difficult, and perhaps irrelevant for the purpose of training LLMs
- Remove references to figures, tables, and other information that is not scraped
- De-hyphenate words (`TEXT_DEHYPHENATE` flag doesn't work?)
- A greedy approach may not be sufficient, e.g. 'x-ray' is common and should keep its hyphen
- Chest - Elicker - HRCT of the Lungs 2e
- Chest - Felson - Principles of Chest Roentgenology (4e)
- Cardiac Imaging Requisites 4e
- Chest - Elicker - HRCT of the Lungs 2e
- Chest - Felson - Principles of Chest Roentgenology (4e)
- Chest - Webb - Fundamentals of Body CT (4e)
- Emergency Radiology Requisites 2e
- Fundamentals of Body CT 4e
- General - Brant _ Helms - Fundamentals of Diagnostic Radiology (4e)
- General - Mandell - Core Radiology (1e)
- General - Weissleder - Primer of Diagnostic Imaging (5e)
- Vascular and Interventional Radiology Requisites 2e
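
For the de-hyphenation item above, one possible non-greedy approach is to check each line-break hyphenation against word lists before joining. The sketch below is only an illustration and is independent of the `TEXT_DEHYPHENATE` flag: the names `KNOWN_WORDS`, `KEEP_HYPHEN`, and `dehyphenate` are hypothetical, and the tiny word sets are placeholders for a real dictionary.

```python
import re

# Hypothetical placeholder word lists; a real run would load a full dictionary.
KNOWN_WORDS = {"radiology", "tomography", "attenuation"}
KEEP_HYPHEN = {"x-ray", "high-resolution"}

# A word fragment ending in "-" at a line end, followed by its continuation
# at the start of the next line.
_LINE_BREAK_HYPHEN = re.compile(r"([A-Za-z]+)-\n([A-Za-z]+)")


def dehyphenate(text, known_words=KNOWN_WORDS, keep_hyphen=KEEP_HYPHEN):
    """Join words split across line breaks, keeping genuine hyphens."""

    def join(match):
        head, tail = match.group(1), match.group(2)
        hyphenated = f"{head}-{tail}"
        merged = head + tail
        # Words such as 'x-ray' are legitimately hyphenated: keep the hyphen.
        if hyphenated.lower() in keep_hyphen:
            return hyphenated
        # The merged form is a known word: the hyphen was only a line break.
        if merged.lower() in known_words:
            return merged
        # Unknown either way: keep the hyphen so no information is lost.
        return hyphenated

    # In every case the line break inside the word is removed.
    return _LINE_BREAK_HYPHEN.sub(join, text)
```

With these placeholder lists, `dehyphenate("chest radi-\nology and x-\nray imaging")` joins "radiology" but leaves "x-ray" hyphenated. Cases that fall through to the last branch would be natural candidates for the log file mentioned in the logging item.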