[Proof of concept] Common file formats used for Data Science exported from HXL (The Humanitarian Exchange Language)
- HXL-Data-Science-file-formats
- 1. The main focus
- 2. Reasons behind
- HXLated datasets to test
- Additional Guides
In addition to this GitHub repository, also check the EticaAI-Data_HXL-Data-Science-file-formats Google Drive folder.
This project either uses explicit HXL +attributes (easy to implement, but more verbose) or makes inferences on well-known HXLated datasets used in humanitarian contexts. To make this work, the main reference is not the software implementation, but the reference tables.
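For a rough illustration of the difference, compare the two hashtag rows below. The first relies only on inference from well-known HXL core hashtags; the second adds an explicit attribute as an extra hint. The `+vt_number` attribute is a made-up placeholder for this example; the actual attribute vocabulary is defined in this project's reference tables, not here.

```
# Inference only (well-known HXL core hashtags, no extra hints):
Province name,People affected
#adm1+name,#affected

# Explicit extra attribute as a hint (+vt_number is a hypothetical placeholder):
Province name,People affected
#adm1+name,#affected+vt_number
```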
- Extra content: urn-data-specification/ (warning: it's complicated)
While finding good URN conventions for the typical datasets used in a humanitarian context is more complex than the ISO URN or even the LEX URN (the latter already used in Brazil), one goal of the urnresolver is to accept that most data shared are VERY sensitive and private, and this actually is the challenge. So, in addition to converting some well-known public datasets related to HXL, we are already designing it to eventually be used as an abstraction by scripts and tools that would otherwise need access to the real datasets.
By using URNs, in the worst case we are creating documentation and scripts whose references a new user would need to replace with the real ones for their use case. In the ideal case, though, scripts can be exchanged as-is: when an issue happens in a new region, the people who prepare the data can do so and then also publish to a private URN listing so others can reuse it.
Note that even when the URN resolver does have links to resources (and not just to a contact page), the download links for the real data could still require authentication case by case. Also, if you are in contact with several peers, the same URN is likely to resolve to more than one usable option, especially for datasets that are not already a COD but are often needed.
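A minimal sketch of the concept (not the actual urnresolver implementation): a local, possibly private, listing maps a public URN to whatever endpoint your organization can actually use, so scripts only ever mention the URN. The URN and URL below are the ones used in the urnresolver examples later in this README; the listing format is a hypothetical simplification.

```python
# Conceptual sketch only: resolve a urn:data:... identifier against a local
# listing. The real urnresolver has its own listing format and rules; this
# only shows why scripts can reference URNs instead of hard-coded URLs.
LOCAL_URN_LISTING = {
    # public URN -> one or more resolvable sources (access may require auth)
    "urn:data:xz:hxl:standard:core:hashtag": [
        "https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/pub?gid=319251406&single=true&output=csv",
    ],
}

def resolve(urn):
    """Return the first known source for a URN, or raise if unknown."""
    sources = LOCAL_URN_LISTING.get(urn)
    if not sources:
        raise LookupError(f"No local source for {urn}; ask a peer for a listing")
    return sources[0]

print(resolve("urn:data:xz:hxl:standard:core:hashtag"))
```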
Deeper integration with CKAN instances and/or awareness of encrypted data is still not implemented in the current version (v0.7.3).
Since the main goal of URNs is also to help with auditing and sharing of scripts, and even with referencing the "best acceptable use" of exchanged data (with special focus on private/sensitive data), the URN:DATA identifiers themselves are meant NOT to be a secret and could be published in official documents. The local implementations (that is, how to resolve/redirect these URNs to real data), however, need to take into account that the "perfect optimization" (think "secure from misuse" vs "protect privacy from legitimate use") is often contradictory.
TODO: add more context
Note: while this project, in addition to CLI tools that convert URNs into something usable ("the implementation"), also drafts the logic of how to construct potentially useful URNs reusable at an international level (i.e. what may look like a drafted standard, think ISO, or a Best Current Practice, think IETF), please do not take EticaAI/HXL-Data-Science-file-formats... as endorsed by any organization.
Also, authors from @EticaAI / @HXL-CPLP (both past and future ones who cooperate directly with this project) explicitly release both the software and the drafted 'how to implement' under public domain-like licenses. Under ideal circumstances, the data global namespace (the ZZ on urn:data:ZZ:example) may have more specific rules.
See ontologia/
"In computer science and information science, an ontology encompasses a representation, formal naming and definition of the categories, properties and relations between the concepts, data and entities that substantiate one, many, or all domains of discourse. More simply, an ontology is a way of showing the properties of a subject area and how they are related, by defining a set of concepts and categories that represent the subject." -- [Wikipedia: Ontology (information science)](https://en.wikipedia.org/wiki/Ontology_(information_science)
The contents of ontologia/ contain both some selected datasets and (while not yet 100% converted) the main parts that the command line tools and libraries released by this repository use.
When feasible, even if it makes the initial implementation harder or a bit less efficient than using dedicated "advanced" strategies with state-of-the-art tools, the internal parts of hxlm.core that deal with ontology will be stored in this folder.
This strategy is likely to make it easier for non-developers to update internals, like individuals interested in adding new languages or proposing corrections.
For production usage, these files are available via:
- Installable with the Python PyPI package hdp-toolchain
- The GitHub repository https://github.com/EticaAI/HXL-Data-Science-file-formats
- Public "CDN": GitHub hosted + CloudFlare cached endpoint at https://hdp.etica.ai/ontologia/
- See folder bin/
- See discussions at
- See the (not so documented) tests: tests/manual-tests.sh
- Source code: bin/hxl2example
The hxl2example is an example Python script with generic functionality that allows you to create your own custom functions. Feel free to add your name, edit the license, etc.
What it does: hxl2example accepts one HXLated dataset and saves it as .CSV.
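The sketch below shows the general shape of such a converter; it is not the bin/hxl2example source itself, and it assumes libhxl-python's hxl.data(), Dataset.headers, Dataset.display_tags and Row.values (check the libhxl documentation if the API differs in your version).

```python
# Minimal sketch (assumed libhxl-python API, not the shipped script):
# read an HXLated dataset and save it as a plain CSV file.
import csv
import sys

import hxl

def hxl_to_csv(source_url, output_path):
    source = hxl.data(source_url)             # parse the HXLated input
    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(source.headers)       # original text headers
        writer.writerow(source.display_tags)  # the HXL hashtag row
        for row in source:
            writer.writerow(row.values)

if __name__ == "__main__":
    hxl_to_csv(sys.argv[1], sys.argv[2])
```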
Quick examples
### Basic examples
# This will output a local file to stdout (tip: you can disable local files)
hxl2example tests/files/iris_hxlated-csv.csv
# This will save to a local file
hxl2example tests/files/iris_hxlated-csv.csv my-local-file.example
# Since we use libhxl-python, remote HXLated URLs work too!
hxl2example https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/edit#gid=319251406
### Advanced usage (if you need to share work with others)
## Quick ad-hoc web proxy, local usage
# @see https://github.com/hugapi/hug
hug -f bin/hxl2example
# http://localhost:8000/ will show a JSON documentation of the hug endpoints. TL;DR:
# http://localhost:8000/hxl2example.csv?source_url=http://example.com/remote-file.csv
## Expose local web proxy to others
# @see https://ngrok.com/
ngrok http 8000
- Main issue: #2
- Orange File Specification: https://orange-data-mining-library.readthedocs.io/en/latest/reference/data.io.html
- Source code: bin/hxl2tab
What it does: hxl2tab uses an already HXLated dataset and then, based on the #hashtag+attributes, generates an Orange Data Mining .tab file with extra hints.
hxl2tab v2.0 has some usable functionality to generate the file from a web interface instead of the CLI. It uses hug 🐨 🤗.
If you want to quickly expose it outside localhost, try ngrok.
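As a rough illustration of the idea (not the shipped hxl2tab behaviour: the attribute-to-type mapping actually used comes from this project's reference tables, and the rules below are hypothetical placeholders), Orange's three-row .tab header carries feature names, variable types, and flags, and HXL attributes can be used to guess them:

```python
# Illustrative sketch only: map HXL hashtags + attributes to the three
# Orange .tab header rows (names, types, flags).
def orange_type_for(hxl_tag):
    """Guess an Orange variable type from an HXL hashtag with attributes."""
    if "+num" in hxl_tag:
        return "continuous"
    if "+code" in hxl_tag or "+id" in hxl_tag:
        return "discrete"
    return "string"

def orange_headers(hxl_tags):
    names = [tag.lstrip("#").replace("+", "_") for tag in hxl_tags]
    types = [orange_type_for(tag) for tag in hxl_tags]
    flags = ["" for _ in hxl_tags]  # e.g. "class" or "meta" when known
    return names, types, flags

# Example: ["#adm1+code", "#affected+num"] ->
#   names: ["adm1_code", "affected_num"]
#   types: ["discrete", "continuous"]
```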
Installation
This tool can be installed by copying bin/hxl2tab to a place on your executable path and installing its dependencies manually. The automated way, as part of the Python PyPI package hdp-toolchain with the extra dependencies, is:
python3 -m pip install hdp-toolchain[hxl2tab]
# python3 -m pip install hdp-toolchain[full]
- Main issue: #6
- Source code: bin/hxlquickmeta
What it does: hxlquickmeta outputs information about a local or remote dataset. If the file is already HXLated, it will print even more information.
v1.1.0 added support for giving an overview by default, similar to what users of Python Pandas would expect.
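For comparison, a rough pandas equivalent of such an overview (this is not how hxlquickmeta is implemented): skip the HXL hashtag row, which is the second line of an HXLated CSV, and describe the remaining columns.

```python
# Rough pandas equivalent of a dataset overview on an HXLated CSV.
import pandas as pd

# skiprows=[1] skips the HXL hashtag row; the text headers stay as columns.
df = pd.read_csv("tests/files/iris_hxlated-csv.csv", skiprows=[1])
print(df.describe(include="all"))
print(df.dtypes)
```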
Installation
This tool can be installed by copying bin/hxlquickmeta to a place on your executable path and installing its dependencies manually. The automated way, as part of the Python PyPI package hdp-toolchain with the extra dependencies, is:
python3 -m pip install hdp-toolchain[hxlquickmeta]
# python3 -m pip install hdp-toolchain[full]
Quick examples
#### Inline result for a hashtag and an (optional) value
hxlquickmeta --hxlquickmeta-hashtag="#adm2+code" --hxlquickmeta-value="BR3106200"
# > get_hashtag_info
# >> hashtag: #adm2+code
# >>> HXLMeta._parse_heading: #adm2+code
# >>> HXLMeta.is_hashtag_base_valid: None
# >>> libhxl_is_token None
# >> value: BR3106200
# >>> libhxl_is_empty False
# >>> libhxl_is_date False
# >>> libhxl_is_number False
# >>> libhxl_is_string True
# >>> libhxl_is_token None
# >>> libhxl_is_truthy False
# >>> libhxl_typeof string
#### Output information for a file and (if any) its HXLated information
# Local file
hxlquickmeta tests/files/iris_hxlated-csv.csv
# Remote file
hxlquickmeta https://docs.google.com/spreadsheets/u/1/d/1l7POf1WPfzgJb-ks4JM86akFSvaZOhAUWqafSJsm3Y4/edit#gid=634938833
- Main issue: #6
- Source code: bin/hxlquickimport
What it does: hxlquickimport is similar to hxltag (a CLI tool installed with libhxl); by default it mostly just slugifies whatever was in the old headers and adds it as an HXL attribute. Please consider using the HXL-Proxy for serious usage. This quick script is more for internal testing.
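A minimal sketch of that idea (not the bin/hxlquickimport source; the base hashtag and the slug rules below are hypothetical placeholders for illustration):

```python
# Illustrative sketch only: turn a plain text header into a slug and append
# it as an HXL attribute on a base hashtag (#meta is a made-up default here).
import re

def slugify(header):
    slug = header.strip().lower()
    slug = re.sub(r"[^a-z0-9]+", "_", slug).strip("_")
    return slug

def quick_hxl_tag(header, base_hashtag="#meta"):
    return f"{base_hashtag}+{slugify(header)}"

# Example: "Admin 1 P-code" -> "#meta+admin_1_p_code"
print(quick_hxl_tag("Admin 1 P-code"))
```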
Installation
This tool can be installed by copying bin/hxlquickimport to a place on your executable path and installing its dependencies manually. The automated way, as part of the Python PyPI package hdp-toolchain with the extra dependencies, is:
python3 -m pip install hdp-toolchain[hxlquickimport]
# python3 -m pip install hdp-toolchain[full]
Installation
The automated way to install it is via the Python PyPI package hdp-toolchain. The urnresolver is installed by default.
python3 -m pip install hdp-toolchain
- Main issue: #13
- Source code: hxlm/core/bin/urnresolver.py
The urnresolver is a proof of concept of an URN resolver (see Uniform Resource Name (URN) on Wikipedia).
Examples (note: early working draft!)
# Basic usage: based on local and (to be implemented) remote listing pages,
# it translates one readable URN into one or more datasets
urnresolver urn:data:xz:hxl:standard:core:hashtag
# https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/pub?gid=319251406&single=true&output=csv
# Now, a more practical example: using it to feed other commands:
hxlselect "$(urnresolver urn:data:xz:hxl:standard:core:hashtag)" --query '#valid_vocab=+v_pcode'
# Hashtag,Hashtag one-liner,Hashtag long description,Release status,Data type restriction,First release,Default taxonomy,Category,Sample HXL,Sample description
# #valid_tag,#description+short+en,#description+long+en,#status,#valid_datatype,#meta+release,#valid_vocab+default,#meta+category,#meta+example+hxl,#meta+example+description+en
# #adm1,Level 1 subnational area,Top-level subnational administrative area (e.g. a governorate in Syria).,Released,,1.0,+v_pcode,1.1. Places,#adm1 +code,administrative level 1 P-code
# #adm2,Level 2 subnational area,Second-level subnational administrative area (e.g. a subdivision in Bangladesh).,Released,,1.0,+v_pcode,1.1. Places,#adm2 +name,administrative level 2 name
# #adm3,Level 3 subnational area,Third-level subnational administrative area (e.g. a subdistrict in Afghanistan).,Released,,1.0,+v_pcode,1.1. Places,#adm3 +code,administrative level 3 P-code
# #adm4,Level 4 subnational area,Fourth-level subnational administrative area (e.g. a barangay in the Philippines).,Released,,1.0,+v_pcode,1.1. Places,#adm4 +name,administrative level 4 name
# #adm5,Level 5 subnational area,Fifth-level subnational administrative area (e.g. a ward of a city).,Released,,1.0,+v_pcode,1.1. Places,#adm5 +code,administrative level 5 name
hxlselect "$(urnresolver urn:data:xz:hxlcplp:fod:lang)" --query '#vocab+id+v_iso6393_3letter=por'
# Id,Part2B,Part2T,Part1,Scope,Language_Type,Ref_Name,Comment
# #vocab+id+v_iso6393_3letter,#vocab+code+v_iso3692_3letter+z_bibliographic,#vocab+code+v_3692_3letter+z_terminology,#vocab+code+v_6391,#status,#vocab+type,#vocab+name,#description+comment+i_en
# por,por,por,pt,I,L,Portuguese,
- [Big Picture] The main GitHub issue:
- https://en.wikipedia.org/wiki/Non-English-based_programming_languages#International_programming_languages
- Note: most of the logic that matters for HDP is likely to be in Knowledge Graphs (YAML files that are expanded in memory).
- See hxlm/ontologia/
- In particular ontologia/core.vkg.yml
Installation
The automated way to install it is via the Python PyPI package hdp-toolchain. All the relevant parts, including the bare minimal ontologia, are part of the default installation.
python3 -m pip install hdp-toolchain
- GitHub Gist
- Google Colab (Jupyter Notebook)
- File
- Folder
HXL-CPLP-Publico/Datasets/EticaAI-Data/EticaAI-Data_HXL-Data-Science-file-formats/HDP-playbooks
Dedicated documentation at https://hdp.etica.ai/hxltm
The Humanitarian Exchange Language Trānslātiōnem Memoriam (abbreviation: "HXLTM") is a valid HXLated tabular format by HXL-CPLP to store community-contributed translations and glossaries.
The hxltmcli is an (initial reference) public domain Python CLI tool that allows reuse by others interested in exporting HXLTM files to the common formats used by professional translators. Software developers interested in promoting use cases of HXL are encouraged to either collaborate on hxltmcli or create other tools.
HXL is already used in production, especially in humanitarian areas (see The Humanitarian Data Exchange). With a one-line change it is possible to make most of the already used spreadsheet-like data machine readable, without disturbing end users as other alternatives do. One notable implementation (data visualization) powered by HXL is HXLDash (see this HXLDash example video).
The idea of this project is to work out strategies so that already HXLated datasets can be used directly in open source desktop tools like Orange Data Mining and WEKA "The workbench for machine learning", with the minimum extra explanation on how to convert already existing HXL datasets AND with existing tools that solve the known issues likely to be found.
NOTE: it is already possible to use HXLated CSVs in these tools! For whoever is learning HXL or using it in production for humanitarian purposes, the HXL-proxy (https://proxy.hxlstandard.org/) with "Strip text headers" can serve live-updated CSV-like files. Other usages can still rely on the HXL CLI tools or on running unocha/hxl-proxy with Docker on your own machine or on a private or public server.
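For example, the "one-line change" mentioned above is typically just inserting a row of HXL hashtags below the original text headers (the hashtags here are standard HXL core hashtags; the column names and values are invented for illustration):

```
# Before (plain spreadsheet export):
Province name,Province code,People affected
Example Province,XX01,1500

# After (one extra HXL hashtag row; data rows are unchanged):
Province name,Province code,People affected
#adm1+name,#adm1+code,#affected
Example Province,XX01,1500
```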
One way to implement this is to create minimally usable conversion tools that are able to export already HXLated datasets, with additional hints, to the file formats used by default by these applications.
In practice this goes beyond just file conversion (like XLSX to CSV), since it includes both "variable type" AND "intent to use (in data mining)". This is why this project also has the taxonomy/vocabulary reference table (and this actually is more important than the implementation itself!). Without some extra step, HXLated datasets work as an average CSV (good, but just not great).
But yes, some of these tools, especially Weka (at least compared to Orange), are stricter about the tabular format they accept, and this can be infuriating EVEN for someone who actually knows how to debug these issues! This problem, at least, is more automatable.
Note: one practical reason to use HXLated files as the base, instead of plain CSV or XLSX (beyond obviously being available in the humanitarian context), is that the grammar of HXL +attributes is flexible enough to export to several different formats, with freedom to choose other aspects of the tagging.
- Software implementation for file formats not typically used by easy-to-use desktop applications is a non-goal
- Yet, as part of the HXL +attributes conversion tables, some of these proposed implementations may already be drafted. These reference tables are released under public domain licenses.
- Note that humans who already use these formats are likely to have the skill to manually convert from CSVs (so they could convert from HXL)
- The software implementation (at least at the start) will not optimize for speed or low local disk usage
- but it should be able to convert large datasets with reasonably low memory usage
- The software implementations assume an already HXLated input dataset, to keep things simple
- Note that it is possible to quickly convert already well formatted CSVs to HXL by changing the header line (first line of the CSV).
- While it is technically possible to import back (reconstruct the original HXLated file) from exported files, being 100% compatible is a non-goal
- This applies in particular to .arff exports: the default export may need to clean known issues with exported strings.
- Generic search query: https://data.humdata.org/search?vocab_Topics=hxl
- HXL data on HDX
The Humanitarian Data Exchange ("HDX") contains public datasets and part of them already is HXLated and ready to test.
PROTIP: on https://proxy.hxlstandard.org/data/source, Option 2: choose from the cloud also has an "HDX" icon that can be used. This can be helpful if you are just browsing several datasets.
- tests/files
- tests/manual-tests.sh
- Google Drive Folder: https://drive.google.com/drive/u/1/folders/1qyTPaDgm7Ca-62blkdQjUox47WWKRwD3
Both the Google Drive folder and this repository have some test files. The not-so-documented manual tests may also give a quick idea of how it works.
Note: these additional guides are not part of the main focus of this project
NOTE: Often people who work with HXL simply use the HXL-proxy, including to convert from non-HXLated sources.
Here is a quick overview of different command line tools that are worth at least mentioning, especially if you are dealing with raw formats that are not yet HXLated.
90% of the time, 1,000,000 rows is likely to be enough, even if you are dealing with data science projects. This means there is no need to use command line tools or more complex solutions, like importing into a database or paying for enterprise solutions.
This guide is for when you need to go over these limits without changing your tools too much.
EticaAI has dedicated the work to the public domain by waiving all of its rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.