This repository aims to build a comprehensive Kurdish-language text dataset compiled from a variety of source materials. The resulting dataset is intended to support a wide range of studies on the Kurdish language.
The `data` folder is the main container for dataset-related files. It is organized into three subfolders:
- `data_files`: Contains the source files (e.g., PDFs) to be added to the dataset.
- `raw`: Serves as a backup folder. It includes:
  - `raw_kurmanji.txt`: Stores the unprocessed text extracted by the `PDFTextExtractor` class from `pdf_to_text.py`. This file is used for manual review and correction before processing.
- `processed`: The final output folder, containing:
  - `kurmanji.txt`: The plain text file of the final dataset.
  - `kurmanji.json`: The JSON representation of the final dataset, automatically populated with the following fields:
    - `file_name`: The source file name.
    - `char_count`: The character count of the text.
    - `word_count`: The word count of the text.
    - `text`: The actual text content.
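As an illustration of the four JSON fields listed above, here is a minimal sketch of how one entry could be assembled. The `build_record` helper is hypothetical and not part of this repository, and the counting rules (every character counted, words split on whitespace) are assumptions; the actual script may count differently.

```python
def build_record(file_name: str, text: str) -> dict:
    """Build one dataset entry with the four documented fields (hypothetical helper)."""
    return {
        "file_name": file_name,
        # Assumption: every character counts, including spaces and newlines.
        "char_count": len(text),
        # Assumption: words are whitespace-separated tokens.
        "word_count": len(text.split()),
        "text": text,
    }


record = build_record("file_name.pdf", "Roj baş, tu çawa yî?")
print(record)
```

Keeping the counts next to the text makes it possible to compute corpus statistics without re-reading every entry.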
The `scripts` folder contains:
- `pdf_to_text.py`: A script to convert PDF files to text.
`main.py` is the primary Python script. It lets you add data to the dataset without calling the intermediate processing scripts directly. Below is an example usage:
```python
from scripts.pdf_to_text import PDFTextExtractor

# Converts the PDF file to text and transfers it to data/raw/raw_kurmanji.txt for manual review.
pdftextextractor = PDFTextExtractor("data/data_files/file_name.pdf")

# After manual review, the final data is transferred to kurmanji.txt and kurmanji.json.
pdftextextractor.append_raw_to_processed_data()
```
- Add the desired file to the `data/data_files` directory.
- Use the `PDFTextExtractor` class to extract raw text with `pdftextextractor = PDFTextExtractor("data/data_files/file_name.pdf")`. The extracted text will be saved in `data/raw/raw_kurmanji.txt` for manual review.
- Manually review and correct the contents of `data/raw/raw_kurmanji.txt`.
- Transfer the reviewed data to the processed files with `pdftextextractor.append_raw_to_processed_data()`. The reviewed data will be appended to `data/processed/kurmanji.txt` and `data/processed/kurmanji.json`.
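To make the final step more concrete, the sketch below shows one way reviewed text could be appended to a plain-text file and a JSON file. It is not the repository's implementation of `append_raw_to_processed_data()`: the `append_record` function is hypothetical, and it assumes that `kurmanji.json` stores a single JSON array of records. The demo writes to a temporary directory so no real dataset file is touched.

```python
import json
import tempfile
from pathlib import Path


def append_record(record: dict, txt_path: Path, json_path: Path) -> None:
    """Append one reviewed record to the text and JSON outputs (hypothetical helper)."""
    # Append the reviewed text to the plain-text file, one entry per line block.
    with txt_path.open("a", encoding="utf-8") as f:
        f.write(record["text"].rstrip("\n") + "\n")

    # Assumption: the JSON file holds one array of records.
    if json_path.exists() and json_path.stat().st_size > 0:
        records = json.loads(json_path.read_text(encoding="utf-8"))
    else:
        records = []
    records.append(record)
    # ensure_ascii=False keeps Kurdish letters readable in the file.
    json_path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )


# Demo in a temporary directory with a made-up record.
with tempfile.TemporaryDirectory() as tmp:
    txt_path = Path(tmp) / "kurmanji.txt"
    json_path = Path(tmp) / "kurmanji.json"
    record = {"file_name": "file_name.pdf", "char_count": 7, "word_count": 2, "text": "Roj baş"}
    append_record(record, txt_path, json_path)
    append_record(record, txt_path, json_path)
    stored = json.loads(json_path.read_text(encoding="utf-8"))
    lines = txt_path.read_text(encoding="utf-8").splitlines()
```

Rewriting the whole JSON array on each append is simple but grows slower as the dataset grows; a line-delimited format would avoid that, at the cost of changing the file layout.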
The `PDFTextExtractor` class supports two optional parameters: `start_page` and `end_page`. These parameters let you specify the range of pages from which text is extracted. By default, both parameters are `None`, meaning the entire PDF is processed. To process specific pages, specify the range as shown below:

```python
pdftextextractor = PDFTextExtractor("data/data_files/file_name.pdf", start_page=1, end_page=5)
```
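The following sketch illustrates how `None` defaults for a page range can fall back to the whole document. The `select_pages` function is hypothetical, and it assumes 1-based page numbers with an inclusive `end_page`; the README does not state which convention `PDFTextExtractor` uses, so check `pdf_to_text.py` before relying on this.

```python
from typing import List, Optional, Sequence


def select_pages(
    pages: Sequence[str],
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
) -> List[str]:
    """Return the requested page range (hypothetical helper).

    Assumption: pages are numbered from 1 and end_page is inclusive.
    """
    first = 0 if start_page is None else start_page - 1  # None means start at the first page
    last = len(pages) if end_page is None else end_page  # None means run to the last page
    return list(pages[first:last])


# Stand-in for the text of a ten-page PDF.
pages = [f"page {n}" for n in range(1, 11)]
whole = select_pages(pages)                            # both None: every page
subset = select_pages(pages, start_page=1, end_page=5)  # first five pages under the stated assumption
```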