Convert markdown and its elements (tables, lists, code, etc.) into structured, easily processable data formats like lists and hierarchical dictionaries (or JSON), with support for parsing back to markdown.
- Detect, extract and convert markdown building blocks into Python data structures
- Provide two formats for parsed markdown:
    - List format: each building block as a separate dictionary in a list
    - Dictionary format: nested structure using headers as keys
- Convert parsed markdown to JSON
- Parse markdown data back to a markdown-formatted string
- Add options to control which data gets parsed back to markdown
- Extract specific building blocks (e.g., only tables or lists)
- Support for task lists (checkboxes)
- Enhanced code block handling with language detection
- Comprehensive blockquote support with nesting
- Consistent handling of definition lists
- Provide comprehensive documentation
- Add more test coverage (215 test cases)
- Publish on PyPI
- Align with edge cases of the CommonMark specification
pip install markdown-to-data
from markdown_to_data import Markdown
markdown = """
---
title: Example text
author: John Doe
---
# Main Header
- [ ] Pending task
    - [x] Completed subtask
- [x] Completed task
## Table Example
| Column 1 | Column 2 |
|----------|----------|
| Cell 1 | Cell 2 |
´´´python
def hello():
    print("Hello World!")
´´´
"""
md = Markdown(markdown)
# Get parsed markdown as list
print(md.md_list)
# Each building block is a separate dictionary in the list
# Get parsed markdown as nested dictionary
print(md.md_dict)
# Headers are used as keys for nesting content
# Get information about markdown elements
print(md.md_elements)
[
    {'metadata': {'title': 'Example text', 'author': 'John Doe'}},
    {'header': {'level': 1, 'content': 'Main Header'}},
    {
        'list': {
            'type': 'ul',
            'items': [
                {
                    'content': 'Pending task',
                    'items': [
                        {
                            'content': 'Completed subtask',
                            'items': [],
                            'task': 'checked'
                        }
                    ],
                    'task': 'unchecked'
                },
                {'content': 'Completed task', 'items': [], 'task': 'checked'}
            ]
        }
    },
    {'header': {'level': 2, 'content': 'Table Example'}},
    {'table': {'Column 1': ['Cell 1'], 'Column 2': ['Cell 2']}},
    {
        'code': {
            'language': 'python',
            'content': 'def hello():\n    print("Hello World!")'
        }
    }
]
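Because the list format is a plain list of single-key dictionaries, specific building blocks can be picked out with ordinary Python. The sketch below is not part of the library: `blocks_of_type` is a hypothetical helper, and the sample data is an abbreviated copy of the output above.

```python
# Hypothetical helper (not part of markdown-to-data): select building blocks
# by type from data shaped like `md.md_list`.
md_list = [
    {'metadata': {'title': 'Example text', 'author': 'John Doe'}},
    {'header': {'level': 1, 'content': 'Main Header'}},
    {'header': {'level': 2, 'content': 'Table Example'}},
    {'table': {'Column 1': ['Cell 1'], 'Column 2': ['Cell 2']}},
]


def blocks_of_type(blocks, kind):
    """Return the payload of every block whose key equals `kind`."""
    return [block[kind] for block in blocks if kind in block]


tables = blocks_of_type(md_list, 'table')
header_titles = [h['content'] for h in blocks_of_type(md_list, 'header')]

print(tables)
print(header_titles)
```

The same pattern works for any block type that appears as a top-level key, such as `'list'` or `'code'`.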
{
    'metadata': {'title': 'Example text', 'author': 'John Doe'},
    'Main Header': {
        'list_1': {
            'type': 'ul',
            'items': [
                {
                    'content': 'Pending task',
                    'items': [
                        {
                            'content': 'Completed subtask',
                            'items': [],
                            'task': 'checked'
                        }
                    ],
                    'task': 'unchecked'
                },
                {'content': 'Completed task', 'items': [], 'task': 'checked'}
            ]
        },
        'Table Example': {
            'table_1': {'Column 1': ['Cell 1'], 'Column 2': ['Cell 2']},
            'code_1': {
                'language': 'python',
                'content': 'def hello():\n    print("Hello World!")'
            }
        }
    }
}
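Since the dictionary format nests content under header titles, a section can be reached by following the header path. This is a minimal sketch under that assumption; `get_by_path` is a hypothetical helper, not a library function, and the data is an abbreviated copy of the output above.

```python
# Hypothetical helper (not part of markdown-to-data): walk a header path
# through data shaped like `md.md_dict`.
md_dict = {
    'metadata': {'title': 'Example text', 'author': 'John Doe'},
    'Main Header': {
        'Table Example': {
            'table_1': {'Column 1': ['Cell 1'], 'Column 2': ['Cell 2']},
            'code_1': {
                'language': 'python',
                'content': 'def hello():\n    print("Hello World!")',
            },
        },
    },
}


def get_by_path(data, *keys, default=None):
    """Follow `keys` through nested dictionaries; return `default` if any key is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


table = get_by_path(md_dict, 'Main Header', 'Table Example', 'table_1')
missing = get_by_path(md_dict, 'Main Header', 'No Such Section')

print(table)
print(missing)
```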
{
    'metadata': {'count': 1, 'positions': [0], 'variants': set()},
    'header': {'count': 2, 'positions': [1, 3], 'variants': set()},
    'list': {'count': 1, 'positions': [2], 'variants': {'ul'}},
    'table': {'count': 1, 'positions': [4], 'variants': set()},
    'code': {'count': 1, 'positions': [5], 'variants': {'python'}}
}
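The `variants` values above are Python sets, which the standard `json` module cannot serialize directly. If this overview needs to be written out as JSON, one option is a `default` fallback that converts sets to sorted lists. This is a sketch using only the standard library; `encode_sets` is a hypothetical helper name.

```python
import json

# Copy of the element overview shown above.
md_elements = {
    'metadata': {'count': 1, 'positions': [0], 'variants': set()},
    'header': {'count': 2, 'positions': [1, 3], 'variants': set()},
    'list': {'count': 1, 'positions': [2], 'variants': {'ul'}},
    'table': {'count': 1, 'positions': [4], 'variants': set()},
    'code': {'count': 1, 'positions': [5], 'variants': {'python'}},
}


def encode_sets(value):
    """Fallback for json.dumps: turn sets into sorted lists."""
    if isinstance(value, set):
        return sorted(value)
    raise TypeError(f'Cannot serialize {type(value).__name__}')


as_json = json.dumps(md_elements, default=encode_sets, indent=2)
round_trip = json.loads(as_json)

print(as_json)
```

After the round trip, `variants` comes back as a list rather than a set, so convert it again if set semantics matter.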
The `Markdown` class provides a method to parse markdown data back to markdown-formatted strings.
The `to_md` method comes with options to customize the output:
from markdown_to_data import Markdown
markdown = """
---
title: Example
---
# Main Header
- [x] Task 1
    - [ ] Subtask
- [ ] Task 2
## Code Example
´´´python
print("Hello")
´´´
"""
md = Markdown(markdown)
Example 1: Include specific elements
print(md.to_md(
    include=['header', 'list'],  # Include all headers and lists
    spacer=1  # One empty line between elements
))
Output:
# Main Header
- [x] Task 1
    - [ ] Subtask
- [ ] Task 2
Example 2: Include by position and exclude specific types
print(md.to_md(
    include=[0, 1, 2],  # Include first three elements
    exclude=['code'],  # But exclude any code blocks
    spacer=2  # Two empty lines between elements
))
Output:
---
title: Example
---
# Main Header
- [x] Task 1
    - [ ] Subtask
- [ ] Task 2
The `to_md_parser` function can be used directly to convert markdown data structures to markdown text:
from markdown_to_data import to_md_parser
data = [
    {
        'metadata': {
            'title': 'Document'
        }
    },
    {
        'header': {
            'level': 1,
            'content': 'Title'
        }
    },
    {
        'list': {
            'type': 'ul',
            'items': [
                {
                    'content': 'Task 1',
                    'items': [],
                    'task': 'checked'
                }
            ]
        }
    }
]
print(to_md_parser(data=data, spacer=1))
Output:
---
title: Document
---
# Title
- [x] Task 1
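To make the mapping between the list structure and markdown text concrete, here is a minimal sketch of how nested task items could be turned into lines. It is only an illustration: `render_items` is a hypothetical helper, the four-space indent is an assumption, and it handles unordered lists only. `to_md_parser` remains the way to do this with the library.

```python
# Illustrative only: a hand-written renderer for unordered (task) list items.
# `render_items` is a hypothetical helper, not the library's implementation.
TASK_MARKERS = {'checked': '[x] ', 'unchecked': '[ ] '}


def render_items(items, depth=0, indent='    '):
    """Render nested list items as markdown lines, one line per item."""
    lines = []
    for item in items:
        marker = TASK_MARKERS.get(item.get('task'), '')  # no marker for plain items
        lines.append(f"{indent * depth}- {marker}{item['content']}")
        lines.extend(render_items(item['items'], depth + 1, indent))
    return lines


items = [
    {
        'content': 'Task 1',
        'items': [{'content': 'Subtask', 'items': [], 'task': 'unchecked'}],
        'task': 'checked',
    },
    {'content': 'Task 2', 'items': [], 'task': 'unchecked'},
]

rendered = '\n'.join(render_items(items))
print(rendered)
```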
metadata = '''
---
title: Document
author: John Doe
tags: markdown, documentation
---
'''
md = Markdown(metadata)
print(md.md_list)
Output:
[
    {
        'metadata': {
            'title': 'Document',
            'author': 'John Doe',
            'tags': ['markdown', 'documentation']
        }
    }
]
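In the output above, the comma-separated `tags` value becomes a list while single values stay strings. The following standard-library sketch reproduces that convention for simple `key: value` lines. It is an assumption about the observable behavior, not the library's parser, and `parse_front_matter` is a hypothetical name.

```python
# Hypothetical sketch: parse simple `key: value` front matter and split
# comma-separated values into lists. Not the library's own parser.
def parse_front_matter(text):
    lines = [line.strip() for line in text.strip().splitlines()]
    # Front matter must open with '---' and have a closing '---'.
    if len(lines) < 2 or lines[0] != '---' or '---' not in lines[1:]:
        return {}
    end = lines.index('---', 1)
    result = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(':')
        if not sep:
            continue  # skip lines without a key/value separator
        value = value.strip()
        parts = [part.strip() for part in value.split(',')]
        result[key.strip()] = parts if len(parts) > 1 else value
    return result


sample = '''
---
title: Document
author: John Doe
tags: markdown, documentation
---
'''

parsed = parse_front_matter(sample)
print(parsed)
```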
headers = '''
# Main Title
## Section
### Subsection
'''
md = Markdown(headers)
print(md.md_list)
Output:
[
    {
        'header': {
            'level': 1,
            'content': 'Main Title'
        }
    },
    {
        'header': {
            'level': 2,
            'content': 'Section'
        }
    },
    {
        'header': {
            'level': 3,
            'content': 'Subsection'
        }
    }
]
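The `level` field makes it straightforward to derive an outline from the parsed headers. A small sketch, where `outline` is a hypothetical helper and the two-space indent per level is an arbitrary choice:

```python
# Hypothetical helper: build an indented outline from header blocks.
blocks = [
    {'header': {'level': 1, 'content': 'Main Title'}},
    {'header': {'level': 2, 'content': 'Section'}},
    {'header': {'level': 3, 'content': 'Subsection'}},
]


def outline(blocks, indent='  '):
    """Return one outline line per header, indented by header level."""
    return [
        indent * (block['header']['level'] - 1) + block['header']['content']
        for block in blocks
        if 'header' in block
    ]


toc = outline(blocks)
print('\n'.join(toc))
```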
lists = '''
- Regular item
    - Nested item
- [x] Completed task
    - [ ] Pending subtask
1. Ordered item
    1. Nested ordered
'''
md = Markdown(lists)
print(md.md_list)
Output:
[
    {
        'list': {
            'type': 'ul',
            'items': [
                {
                    'content': 'Regular item',
                    'items': [
                        {'content': 'Nested item', 'items': [], 'task': None}
                    ],
                    'task': None
                },
                {
                    'content': 'Completed task',
                    'items': [
                        {
                            'content': 'Pending subtask',
                            'items': [],
                            'task': 'unchecked'
                        }
                    ],
                    'task': 'checked'
                }
            ]
        }
    },
    {
        'list': {
            'type': 'ol',
            'items': [
                {
                    'content': 'Ordered item',
                    'items': [
                        {'content': 'Nested ordered', 'items': [], 'task': None}
                    ],
                    'task': None
                }
            ]
        }
    }
]
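Each item carries its own `items` and `task` fields, so task progress can be summarized with a short recursive walk. This is a sketch over the structure shown above; `count_tasks` is a hypothetical helper.

```python
from collections import Counter

# Items of the unordered list from the output above (compact form).
items = [
    {
        'content': 'Regular item',
        'items': [{'content': 'Nested item', 'items': [], 'task': None}],
        'task': None,
    },
    {
        'content': 'Completed task',
        'items': [{'content': 'Pending subtask', 'items': [], 'task': 'unchecked'}],
        'task': 'checked',
    },
]


def count_tasks(items):
    """Count 'checked' and 'unchecked' items at every nesting level."""
    counts = Counter()
    for item in items:
        if item['task'] is not None:  # plain list items have task=None
            counts[item['task']] += 1
        counts.update(count_tasks(item['items']))
    return counts


task_counts = count_tasks(items)
print(dict(task_counts))
```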
tables = '''
| Header 1 | Header 2 |
|----------|----------|
| Value 1 | Value 2 |
| Value 3 | Value 4 |
'''
md = Markdown(tables)
print(md.md_list)
Output:
[
    {
        'table': {
            'Header 1': ['Value 1', 'Value 3'],
            'Header 2': ['Value 2', 'Value 4']
        }
    }
]
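Tables are stored column-wise, with each header mapped to its list of values. When row-wise records are more convenient, the columns can be zipped back together. A minimal sketch, where `table_to_rows` is a hypothetical helper:

```python
# Hypothetical helper: convert the column-wise table format into row records.
table = {
    'Header 1': ['Value 1', 'Value 3'],
    'Header 2': ['Value 2', 'Value 4'],
}


def table_to_rows(table):
    """Return one dict per table row, keyed by column header."""
    columns = list(table)
    return [dict(zip(columns, values)) for values in zip(*table.values())]


rows = table_to_rows(table)
print(rows)
```

Row records in this shape can be passed to tools such as `csv.DictWriter`.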
code = '''
´´´python
def example():
    return "Hello"
´´´
´´´javascript
console.log("Hello");
´´´
'''
md = Markdown(code)
print(md.md_list)
Output:
[
    {
        'code': {
            'language': 'python',
            'content': 'def example():\n    return "Hello"'
        }
    },
    {
        'code': {
            'language': 'javascript',
            'content': 'console.log("Hello");'
        }
    }
]
blockquotes = '''
> Simple quote
> Multiple lines

> Nested quote
>> Inner quote
> Back to outer
'''
md = Markdown(blockquotes)
print(md.md_list)
Output:
[
    {
        'blockquote': [
            {'content': 'Simple quote', 'items': []},
            {'content': 'Multiple lines', 'items': []}
        ]
    },
    {
        'blockquote': [
            {
                'content': 'Nested quote',
                'items': [
                    {'content': 'Inner quote', 'items': []}
                ]
            },
            {'content': 'Back to outer', 'items': []}
        ]
    }
]
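Nesting depth in a blockquote is expressed through the `items` field, so each level corresponds to one more `>` character. The sketch below walks the second blockquote from the output above and rebuilds its lines. `render_quote` is a hypothetical helper and only illustrates the structure.

```python
# Hypothetical helper: rebuild blockquote lines from the nested structure.
blockquote = [
    {
        'content': 'Nested quote',
        'items': [{'content': 'Inner quote', 'items': []}],
    },
    {'content': 'Back to outer', 'items': []},
]


def render_quote(entries, depth=1):
    """Render entries with one '>' per nesting level."""
    lines = []
    for entry in entries:
        lines.append(f"{'>' * depth} {entry['content']}")
        lines.extend(render_quote(entry['items'], depth + 1))
    return lines


quote_lines = render_quote(blockquote)
print('\n'.join(quote_lines))
```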
def_lists = '''
Term
: Definition 1
: Definition 2
'''
md = Markdown(def_lists)
print(md.md_list)
Output:
[
    {
        'def_list': {
            'term': 'Term',
            'list': ['Definition 1', 'Definition 2']
        }
    }
]
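Several parsed definition lists can be collected into a single glossary keyed by term. A short sketch over the structure shown above, where `build_glossary` is a hypothetical helper:

```python
# Hypothetical helper: merge definition-list blocks into a term -> definitions map.
blocks = [
    {'def_list': {'term': 'Term', 'list': ['Definition 1', 'Definition 2']}},
]


def build_glossary(blocks):
    """Map each term to its definitions, ignoring blocks of other types."""
    return {
        block['def_list']['term']: block['def_list']['list']
        for block in blocks
        if 'def_list' in block
    }


glossary = build_glossary(blocks)
print(glossary)
```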
- Some extended markdown flavors might not be supported
- Inline formatting (bold, italic, links) is currently not parsed
- Table alignment specifications are not preserved
Contributions are welcome! Please feel free to submit a Pull Request or open an issue.