Skip to content

Python bindings for NodeTrie, a prefix tree data structure library

License

Notifications You must be signed in to change notification settings

NodeTrie/NodeTrie_Py

Repository files navigation

NodeTrie

A Trie data structure library.

Latest Version CI status

Installation

pip install nodetrie

Motivation, design goals

NodeTrie is a Python extension to a native C library written for this purpose.

It came about from a lack of viable alternatives for Python. While other trie library implementations exist, they suffer from severe limitations such as

  • Read only structures, no insertions
  • High memory use for large trees
  • Lack of searching, particularly file mask or wild card style searching
  • Slow inserts

Existing implementations on PyPi fall into these broad categories, including Marissa-Trie (read only) and datrie (slow inserts, very high memory use for large trees).

NodeTrie's C library is designed to minimize memory use as much as possible and still allow arbitrary length trees that can be searched.

Each node has a name associated with it as its data, along with children list and number of children.

Features and design notes

  • NodeTrie is an n-ary tree, meaning any one node can have any number of children
  • Node children arrays are dynamically resized as needed on insertion on a per node basis. No fixed minimum nor maximum size
  • Node names can be of arbitrary length, available memory allowing
  • Node names from Node.name are always unicode in either Python 2/3
  • Any python string type may be used on insertion
  • Node names are implicitly decoded from unicode on insertion, if needed, with nodetrie.ENCODING (utf-8) default encoding which can be overridden
  • New Python Node objects are created from the underlying C pointers every time Node.children is called. There is overhead on the Python interpreter to create these objects. It is safe and better performing to keep and re-use children references instead, see examples below

Limitations

  • Deletions are not implemented
  • The C library implementation uses pointer arrays for children to reduce search space complexity and character pointers for names to allow for arbitrary name lengths. This may lead to memory fragmentation
  • Node objects in python are read only. It is not possible to override the name of an existing Node object nor modify its attributes
  • Character encodings that allow for null characters such as UCS-2 should not be used

Example Usage

from nodetrie import Node

# This is the root of the tree, keep a reference to it.
# Deleting or letting the root node go out of scope will de-allocate
# the entire tree
node = Node()

# Insert a linked tree so that a->b->c->d where -> means 'has child node'
node.insert_split_path(['a', 'b', 'c', 'd'])
node.children[0].name == 'a'

# Sub-trees can be referred to by child nodes
a_node = node.children[0]
a_node.name == 'a'
a_node.children[0].name == 'b'
a_node.is_leaf() == False

# Insertions create only new nodes
# Insert linked tree so that a->b->c->dd
node.insert_split_path(['a', 'b', 'c', 'dd'])

# Only one 'a' node
node.children_size == 1

# Existing references to nodes will have correct children
# after insertion without recreating the node object.
# Here, a_node is an existing object prior to more nodes
# being added to its sub-tree. After insertion, a's sub-tree contains newly
# inserted nodes as expected

# 'c' node is first child of 'b' which is first child of 'a'
# 'c' node has two children, 'd' and 'dd'
c_node = a_node.children[0].children[0]
c_node.children_size == 2
c_node.is_leaf() == False

# 'd' and 'dd' are both leaf nodes
leaf_nodes = [c for c in c_node.children if c.is_leaf()]
len(leaf_nodes) == 2

Note

De-allocation

Tree is de-allocated when and only when root node goes out of scope or is deleted. Letting sub-tree objects go out of scope or explicitly deleting them will not de-allocate that sub-tree.

Note

Sub-tree insertions

Insertions on non-root nodes work as expected. However, Node.insert does not check if a node is already present, unlike Node.insert_split_path

Searching

NodeTrie supports exact name as well as file mask matching tree search.

from __future__ import print_function
from nodetrie import Node

node = Node()
for paths in [['a', 'b', 'c1', 'd1'], ['a', 'b', 'c1', 'd2'],
              ['a', 'b', 'c2', 'd1'], ['a', 'b', 'c2', 'd2']]:
    node.insert_split_path(paths)
for path, _node in node.search(node, ['a', 'b', '*', '*'], []):
    print(path, _node)

Output

[u'a', u'b', u'c1', u'd1'] Node: 'd1'
[u'a', u'b', u'c1', u'd2'] Node: 'd2'
[u'a', u'b', u'c2', u'd1'] Node: 'd1'
[u'a', u'b', u'c2', u'd2'] Node: 'd2'

Separator joined node names for a matched sub-tree are returned by the query function.

for match in node.query('a.b.*.*'):
    print(match)

for match in node.query('a|b|*|*', separator='|'):
   print(match)

Output

(u'a.b.c1.d1', Node: 'd1')
(u'a.b.c1.d2', Node: 'd2')
(u'a.b.c2.d1', Node: 'd1')
(u'a.b.c2.d2', Node: 'd2')

(u'a|b|c1|d1', Node: 'd1')
(u'a|b|c1|d2', Node: 'd2')
(u'a|b|c2|d1', Node: 'd1')
(u'a|b|c2|d2', Node: 'd2')

Contributions are most welcome.