The TomlReader class was the main goal of creating all the other small classes, whose interfaces are described below. All parser processing is done in C++ through PHP-CPP, for an expected order-of-magnitude speed gain.
## Requirements

Linux version only.

- C++17: `std::string_view` from the STL gets used a little, and it is marked C++17 "experimental".
- PHP >= 7.1
- PHP-CPP 2.0.0 for compilation, and its shared library installed.
- libpcre2-8: the latest version of PCRE2 as a shared library.
The Makefile is very simplistic. The configuration folder for PHP is set in the Makefile as INI_DIR = /etc/php/conf.d
On Debian stretch it might need to be /etc/php/7.0/mods-available, plus a symbolic link set in /etc/php/7.0/fpm/conf.d and also in /etc/php/7.0/cli/conf.d
If the Makefile configuration is correct then the steps ought to be:-
make
sudo make install
use Pun\TomlReader;
$reader = new TomlReader();
$doc_model = $reader->parse($myinput);
$array = $doc_model->toArray();
OR
use Pun\TomlReader;
$doc_model = TomlReader::parseFile($file_path);
$array = $doc_model->toArray();
There are also two other static methods.
// Something about this build
TomlReader::getUseVersion() : string;
// Something about TOML version support
TomlReader::getTomlVersion() : string;
The $doc_model returned by the TOML parser is a tree of "Pun\KeyTable" and "Pun\ValueList" objects.
The KeyTable class also supports the \ArrayAccess interface, even though the ArrayAccess methods are not directly available. Its accessible methods for getting and setting values use string keys. Internal keys are strings only, so a PHP numeric key is converted to a string, whereas PHP ArrayAccess and array syntax convert numeric string keys to integers.
Currently internal storage is a C++ std::map. The Traversable interface will iterate keys in alphabetical order.
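"Alphabetical" here most likely means the byte-wise ordering that std::map gives to std::string keys. The following is a minimal sketch of that behaviour, assuming the key type is std::string with the default comparator; `iterationOrder` is a hypothetical helper and not part of the extension.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: insert keys into a std::map keyed by std::string
// and return them in the map's iteration order.
std::vector<std::string> iterationOrder(const std::vector<std::string>& keys) {
    std::map<std::string, int> table;   // the int payload is only a placeholder
    int n = 0;
    for (const auto& k : keys) {
        table[k] = n++;
    }
    std::vector<std::string> order;
    for (const auto& kv : table) {
        order.push_back(kv.first);      // keys come out sorted byte by byte
    }
    return order;
}
```

Under that assumption, numeric keys stored as strings sort as text, so "10" is visited before "2", and upper case ASCII letters are visited before lower case ones.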
// Set a string key, and PHP value type
setKV(string $key, $value);
// Get stored value using string key
getV(string $key) : mixed;
// remove the key and value
unsetK(string $key);
// check if key exists
hasK(string $key);
// Clear out all keys and values
clear();
// Recursive copy everything to Php Array, without KeyTable or ValueList objects.
toArray() : array;
// Set a general use "Tag" property
setTag($tag);
// Get the "Tag" property
getTag() : mixed;
The ValueList class checks that everything set inside it is of the same type and, for objects, of the same class name. It is implemented as a C++ std::vector of PHP Value. Everything added must match the kind of value set by the first pushBack call. The Traversable and Countable interfaces are implemented. The ArrayAccess interface is not implemented.
// Add a value to the stack
pushBack($value);
// Get the last value
back() : mixed;
// Remove the end (back) value
popBack();
// Get the number of values
size() : int;
// Clear all values, size back to zero.
clear();
// Set value at non-negative index, range checked
setV(int $index, $value);
// Get value at non-negative index, range checked
getV(int $index) : mixed;
// Set arbitrary tag value
setTag($tag);
// Get previously set tag value
getTag() : mixed;
// Return recursive contents as a PHP array, without KeyTable or ValueList objects.
toArray() : array;
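The same-type rule of ValueList can be modelled in a few lines. This is only a sketch under stated assumptions: `SameTypeList` and the `Value` variant are hypothetical stand-ins (the extension stores PHP-CPP values and also compares class names for objects), and the behaviour after `clear()` is a choice made for the sketch.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Stand-in for a PHP value; the real extension stores PHP-CPP values.
using Value = std::variant<long, double, bool, std::string>;

// Hypothetical model of the rule: the first pushBack fixes the element type.
class SameTypeList {
public:
    void pushBack(const Value& v) {
        if (!items_.empty() && items_.front().index() != v.index()) {
            throw std::invalid_argument("value type differs from first element");
        }
        items_.push_back(v);
    }
    const Value& getV(std::size_t index) const {
        return items_.at(index);        // at() performs the range check
    }
    std::size_t size() const { return items_.size(); }
    // In this sketch an empty list accepts any type again.
    void clear() { items_.clear(); }
private:
    std::vector<Value> items_;
};
```

Comparing against the first element keeps the check constant time and needs no separate "element type" field.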
The parser handles the UTF-8 examples. It checks for and handles a file Byte Order Mark (BOM).
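BOM detection comes down to comparing the first few bytes of the input against the known marks. A minimal sketch follows; `BomInfo`, `detectBom` and the returned names are assumptions for illustration, and the strings the extension actually reports may differ.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical result type: a BOM name plus its length in bytes.
struct BomInfo {
    std::string name;      // empty when no BOM was recognised
    std::size_t length;    // bytes to skip before the first real character
};

BomInfo detectBom(std::string_view data) {
    using namespace std::string_view_literals;
    auto startsWith = [&](std::string_view prefix) {
        return data.substr(0, prefix.size()) == prefix;
    };
    if (startsWith("\xEF\xBB\xBF"sv)) return {"UTF-8", 3};
    // UTF-32LE must be tested before UTF-16LE: both begin with FF FE.
    if (startsWith("\xFF\xFE\x00\x00"sv)) return {"UTF-32LE", 4};
    if (startsWith("\x00\x00\xFE\xFF"sv)) return {"UTF-32BE", 4};
    if (startsWith("\xFF\xFE"sv)) return {"UTF-16LE", 2};
    if (startsWith("\xFE\xFF"sv)) return {"UTF-16BE", 2};
    return {"", 0};
}
```

For UTF-8 input the returned length is the offset of the first real character, 3 with a BOM and 0 without.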
This PHP extension, tentatively named "Pun8", with namespace "Pun", is developed using the PHP-CPP toolkit.
The motivation arises from frustrations experienced in engineering the TOML parser projects, toml and toml-zephir, which centred around the interface limitations of preg_match and the regrettable need to use substr all the time.
Pun8 is compiled with a shared link directly to the latest version of the libpcre2-8 library.
Of course utf-16 versions of this could be done but utf-8 was first priority.
Pun8 is currently being developed on a Linux system, using the PHP-CPP source code. This project is not likely to be modified, or even tried, on PHP versions earlier than 7.0.
PHP-CPP does have a compatible "legacy version" for PHP 5.3+.
The reason for going to so much development trouble is to create a TOML parser PHP extension, using an existing design coded in PHP, which happens to make a lot of use of PCRE2. The current PHP interface for PCRE2 got in the way of trying various innovations.
The toml-PHP and toml-zephir versions have already been created, and they set the execution performance to be beaten.
Various classes have been used to try out lower level functions in the TomlReader. Pun8, IdRex8, Re8map, Token8 and Token8Stream are useful in trying out PHP versions of the TOML parser.
The interface methods are close to the minimum needed for a TOML parser, because of the time investment required.
UStr8 is a simpler version of the Pun8 class, which is now deprecated, as most of the match and parse functions are in Token8Stream. UStr8 holds a reference to a shared string buffer, and has a substring view of that buffer. Calling share creates an object clone, allowing multiple views of the shared buffer. UStr8 is iterable: "foreach" yields the start byte offset and the Unicode string segment, each segment being a single-byte or multibyte UTF-8 character. The string can also be stepped through using nextChar.
class Pun8 {
// method declarations (implemented in C++)
// Start with a utf-8 string to traverse
__construct(string $input);
// Reset with a new string, or just reset if no argument
setString(string $input);
// Get next character as one or more byte characters, advance the offset
// optional reference to unicode code number
nextChar([int& code]) : string;
// Get next character as one or more byte characters, no change in offset
// optional reference to unicode code number
peekChar([int& code]) : string;
// The offset properties, a byte position from 0
getBegin() : int;
getEnd() : int;
setRange(int $offset, int $endOffset);
setEnd(int $endOffset);
// Return the real buffer size
size() : int;
// The end of range of string marker, is independent of the internal buffer length,
// and can be "moved" to end of a match operation range, from 0 to actual buffer size.
getRangeEnd() : int;
setRangeEnd(int $pos);
// Some primitive encoding detection and management is available.
// Return a string describing the BOM if it exists, such as after assigning from file_get_contents()
getBOMId() : string;
/* Detect BOM or not, convert to UTF8 if necessary.
* Throw exception if unknown or unconvertible.
* If UTF8 BOM exists, return offset to first real character (maybe 3, else 0)
* No end of line character management.
*/
ensureUTF8() : int;
}
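The nextChar and peekChar pair amounts to decoding one UTF-8 sequence at a byte offset. The sketch below shows the idea under stated assumptions: `Utf8Cursor` is a hypothetical class, it works on a non-owning view, and it does not reject overlong encodings or surrogate code points as a production decoder should.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string_view>

// Hypothetical cursor over a UTF-8 buffer, loosely modelled on nextChar/peekChar.
class Utf8Cursor {
public:
    explicit Utf8Cursor(std::string_view input) : input_(input) {}

    // Return the character at the offset without advancing; empty view at the end.
    std::string_view peekChar(std::uint32_t* code = nullptr) const {
        if (offset_ >= input_.size()) return {};
        const unsigned char lead = static_cast<unsigned char>(input_[offset_]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (lead < 0x80) { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (offset_ + len > input_.size()) {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        for (std::size_t i = 1; i < len; ++i) {
            const unsigned char c = static_cast<unsigned char>(input_[offset_ + i]);
            if ((c & 0xC0) != 0x80) throw std::runtime_error("invalid continuation byte");
            cp = (cp << 6) | (c & 0x3F);   // append six payload bits
        }
        if (code) *code = cp;
        return input_.substr(offset_, len);
    }

    // Same as peekChar, then advance the byte offset past the character.
    std::string_view nextChar(std::uint32_t* code = nullptr) {
        std::string_view ch = peekChar(code);
        offset_ += ch.size();
        return ch;
    }

    std::size_t offset() const { return offset_; }
private:
    std::string_view input_;
    std::size_t offset_ = 0;
};
```

Returning a view of the original bytes avoids copying, which fits the shared-buffer design described for these classes.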
This holds a single shareable IdRex8 internal object
// The Id and the Compiled Regular expression are stuck together
__construct(int $id, string $regex);
setIdString(int $id, string $regex);
// Match results, or false, optional start offset
match(string $target, int $offset = 0) : mixed;
// Return some properties
getString() : string;
getId() : int;
isCompiled() : bool;
This object has more map management functions than the Pun8 target string interface. It is convenient to set the shared Pun8 internal map from one of these.
// Add an Id and its PCRE2 to the map
setIdRex(int $keyId, string $regex) : int;
// Map key Id exists
hasIdRex(int $keyId) : bool;
// return number of keys unset, 0 or 1
unsetIdRex(int $keyId) : int;
// Create a new IdRex8 object to hold a shared PCRE2 from the Map.
getIdRex(int $keyId) : IdRex8;
// Add any shared PCRE2, by keyID, from $source not already in this map, return count of new shares
// Share individual PCRE2 by specifying Id list
// If no Id list, try to share all
addMapIds(Re8map $source, array $idList);
// Get key Ids as PHP array list of integer
getIds() : array;
// Number of keys in map
count() : int;
Internally a vector of std::string stores the PCRE2 matches. This class is a wrapper to hold PCRE2 capture results created in method calls above.
// Return the indexed capture string
getCap(int $keyId) : string;
// Return how many captures.
count() : int;
Token8 has read-only properties. Pass one to some functions of Token8Stream, to be filled with summary match values.
// value of match PCRE2 or the next character from front of string
getValue() : string;
// match token id from map, or single character lookup table.
getId() : int;
// source line number, starting from 1
getLine() : int;
// character found in singles lookup table or is EOL or is EOS.
isSingle() : bool;
This is composed with an internal Pun8 member, plus a singles map, line number counter, and id values to use for EOL, EOS (end of stream/string), and a token for any un-mapped character, not in singles map.
Adding this as a C++ extension cut TOML parser execution time by around 50%.
// Set some meaningful ids, usually taken from a kind-of enum list.
setEOSId(int $id);
setEOLId(int $id);
setUnknownId(int $id);
// string to parse
setInput(string $input);
// array of string => integer
setSingles(array $smap);
// map of id => sharable PCRE2
setRe8map(Re8map $map);
// array of ordered ids, integer key subset of map.
setExpSet(array $idList);
getExpSet() : array;
// fetch the current line from offset, to before newline.
beforeEOL() : string;
// Read only property, for debugging purposes
getOffset() : int;
// Get properties from last offset advance.
getToken() : Token8;
// these correspond to Token8 properties
getValue() : string;
getId() : int;
getLine() : int;
// Return a Token8 object, but first have to pass one as argument
// All the values come back in the Token8, no change in offset.
// This does no PCRE2 matching, just takes the front of the string.
peekToken(Token8 $token) : Token8;
// Give the same token back to advance internal offset.
acceptToken(Token8 $token);
/**
* Try in this order - any newline characters off front.
* First match PCRE2 in the internal IdList, or
* first character off front, with lookup in singles table.
* or return unknownId. Value retrieved with getToken(),
* or getValue(), getLine(), getId();
*/
moveNextId() : int;
// Advance the offset past the match of the argument, or return false if it failed to match
moveRegex(IdRex8 $regex) : bool;
// Advance the offset past the match of the expression stored under the integer key, or return false if it failed to match
moveRegId(int $mapId) : bool;
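The lookup order documented for moveNextId can be sketched as follows. This is not the extension's code: `MiniTokenStream` is hypothetical, std::regex stands in for the shared PCRE2 objects, the singles table is limited to single-byte characters, and only "\n" and "\r\n" are treated as newlines.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token stream; std::regex replaces PCRE2 for the sketch.
class MiniTokenStream {
public:
    int eolId = 0, eosId = 0, unknownId = 0;
    std::map<char, int> singles;                      // single-byte characters only
    std::vector<std::pair<int, std::regex>> expSet;   // tried in this order

    void setInput(std::string input) {
        input_ = std::move(input);
        offset_ = 0;
        line_ = 1;
    }
    const std::string& getValue() const { return value_; }
    int getLine() const { return line_; }

    int moveNextId() {
        if (offset_ >= input_.size()) { value_.clear(); return eosId; }
        const char* first = input_.data() + offset_;
        const char* last = input_.data() + input_.size();
        // 1. Newline characters off the front.
        std::size_t nl = 0;
        if (*first == '\n') nl = 1;
        else if (input_.compare(offset_, 2, "\r\n") == 0) nl = 2;
        if (nl > 0) { ++line_; return take(nl, eolId); }
        // 2. First expression in the ordered set that matches at the offset.
        for (const auto& [id, re] : expSet) {
            std::cmatch m;
            const bool found = std::regex_search(
                first, last, m, re, std::regex_constants::match_continuous);
            if (found && m.length(0) > 0) {
                return take(static_cast<std::size_t>(m.length(0)), id);
            }
        }
        // 3. Single character lookup, otherwise the unknown id.
        auto it = singles.find(*first);
        return take(1, it != singles.end() ? it->second : unknownId);
    }
private:
    int take(std::size_t len, int id) {
        value_ = input_.substr(offset_, len);   // remember the token text
        offset_ += len;                         // advance past it
        return id;
    }
    std::string input_, value_;
    std::size_t offset_ = 0;
    int line_ = 1;
};
```

Because the expressions are tried in list order, putting the most specific patterns first in the list passed to setExpSet decides which id wins when several could match.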
This interface hasn't been used much yet, because it is just a wrapper around the enum value. The Pun extension and the TOML tree use a different list of enum values than PHP. For instance, true and false share the same Pun type enum, whereas in PHP they are different types. There are values for integer, string, float and array (the PHP kind).
PHP objects share one enum value, except that KeyTable, ValueList and PHP's DateTime are each assigned an individual type enum, as they are used by the TOML parser.
/* DateTime has a type enum, and so does KeyTable and ValueList,
* all other PHP objects are just an "Object". The type enums are used by Toml document tree.
*/
// Set the enum from example Php value.
fromValue($any);
// return the type as a number
type();
// return the type as a string
name();
// return 1 if argument matches the type, 0 if it doesn't
isMatch($any);
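A small model of the type enum idea, where both booleans map to one value. Everything here is an assumption made for illustration: the `PunType` names, the `Value` variant standing in for a PHP value, and the free functions; the real enum values and the object, DateTime, KeyTable and ValueList cases are left out.

```cpp
#include <string>
#include <variant>

// Stand-in for a PHP value; the extension itself inspects PHP-CPP values.
using Value = std::variant<bool, long, double, std::string>;

// Hypothetical enum; the real numeric values are not given in the text.
enum class PunType { Bool, Integer, Float, String };

// Models fromValue: derive the type enum from an example value.
PunType typeOf(const Value& v) {
    switch (v.index()) {
        case 0: return PunType::Bool;      // true and false share one enum value
        case 1: return PunType::Integer;
        case 2: return PunType::Float;
        default: return PunType::String;
    }
}

// Models name: the type as a string.
const char* nameOf(PunType t) {
    switch (t) {
        case PunType::Bool: return "boolean";
        case PunType::Integer: return "integer";
        case PunType::Float: return "float";
        case PunType::String: return "string";
    }
    return "unknown";
}

// Models isMatch: does the value have the stored type?
bool isMatch(PunType stored, const Value& v) {
    return typeOf(v) == stored;
}
```

A ValueList-style same-type check then reduces to calling `isMatch` with the type of the first element.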