A mapping file is a series of specifications that declare which index
and measure dimensions a nanocube index should have, based on the
columns of the input file (e.g. .csv, .psv).
index_dimension(NAME, INPUT, INDEX_SPEC)
NAME is the name of the nanocube dimension.
INPUT identifies the .csv column names that the INDEX_SPEC rule
uses as input to generate the values for the nanocube
index dimension.
input() # no input columns needed
input('lat','lon') # two input values coming
# from columns 'lat' and
# 'lon'
INDEX_SPEC identifies the (possibly parameterized) encoding of the
index dimension (flat tree, binary tree, quad-tree, k-ary tree),
its resolution (number of levels), and how to map input
values into 'bins' in this dimension.
categorical(B,L)
categorical(B,L,S)
flat tree with B bits of resolution (e.g. 8) in L
levels. Expects one input column; every distinct
value that appears in the .csv input column is mapped
to a unique number with L digits in {0,1,...,2^(B)-1}.
Numbers are generated automatically, in the order in
which values first appear in the .csv file.
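As a rough illustration of the numbering rule above, the following
Python sketch assigns numbers by order of first appearance and splits
each number into L digits. The function names are hypothetical, and two
details are assumptions not stated here: that numbering starts at 0 and
that the most significant digit comes first.

```python
def assign_ids(column_values):
    """Map each distinct value to a number, in order of first appearance."""
    ids = {}
    for value in column_values:
        if value not in ids:
            ids[value] = len(ids)  # assumption: numbering starts at 0
    return ids


def categorical_path(value_id, bits, levels):
    """Split value_id into `levels` digits, each in {0, ..., 2**bits - 1}."""
    base = 1 << bits
    if not 0 <= value_id < base ** levels:
        raise ValueError("id does not fit in the given resolution")
    digits = []
    for _ in range(levels):
        digits.append(value_id % base)
        value_id //= base
    return digits[::-1]  # assumption: most significant digit first


# Hypothetical column content, in the style of categorical(8,1).
ids = assign_ids(["cash", "credit", "cash", "dispute", "credit"])
paths = {value: categorical_path(i, bits=8, levels=1) for value, i in ids.items()}
```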
S is a string encoding an alias table. In its simplest
form (A), we just name nodes on the deepest level of the
hierarchy. In its more elaborate form (B) we allow
naming intermediate nodes.
(A)
in_lbl_1 <nl>
in_lbl_2 <nl>
...
in_lbl_n <nl>
or
in_lbl_1 <tab> out_lbl_1 <nl>
in_lbl_2 <nl>
...
in_lbl_n <tab> out_lbl_n <nl>
or
in_lbl_1_1 <tab> ... <tab> in_lbl_1_k <tab> out_lbl_1 <nl>
in_lbl_2 <nl>
...
in_lbl_n <tab> out_lbl_n <nl>
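One possible reading of form (A), sketched in Python: on each line the
last tab-separated field is the output label and every preceding field
is an input label, while a line with a single field keeps its own text
as the label. The function name and the sample labels are hypothetical,
and the sketch does not handle the @hierarchy form (B).

```python
def parse_alias_table(text):
    """Return a dict mapping each input label to its output label."""
    aliases = {}
    for line in text.split("\n"):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) == 1:
            # a lone label is used as its own alias (assumption)
            aliases[fields[0]] = fields[0]
        else:
            # every field but the last is an input label
            *inputs, output = fields
            for label in inputs:
                aliases[label] = output
    return aliases


# Hypothetical alias table: two inputs share one output label.
table = "CSH\tCASH\tCash\nCRD\tCredit\nDispute\n"
aliases = parse_alias_table(table)
```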
(B)
@hierarchy
alias_root
<tab> alias_level_1
<tab> <tab> alias_level_2
<tab> <tab> <tab> input_text (== alias_level_3)
<tab> <tab> <tab> input_text_1 <tab> input_text_2 <tab> alias_leaf_level
<tab> <tab> <tab> input_text (== alias_level_3)
<tab> alias_level_1
<tab> <tab> alias_level_2
<tab> <tab> <tab> input_text (== alias_level_3)
<tab> <tab> <tab> input_text (== alias_level_3)
or
@hierarchy
<tab> alias_level_1
<tab> <tab> <tab> input_text (== alias_level_3)
<tab> <tab> <tab> input_text_1 <tab> input_text_2 <tab> alias_leaf_level
<tab> <tab> <tab> input_text (== alias_level_3)
<tab> alias_level_1
<tab> <tab> <tab> input_text (== alias_level_3)
<tab> <tab> <tab> input_text (== alias_level_3)
latlon(L)
creates a quad-tree with L levels using the Mercator
projection. Expects two input columns with floating-point
numbers for latitude and longitude.
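To make the quad-tree mapping more concrete, here is a Python sketch
that projects a latitude/longitude pair with the common Web Mercator
formula and reads off one quadrant per level. This is an assumption
about the general technique, not the nanocube implementation: the
orientation of the y axis and the numbering of the four children may
differ, and the function names are hypothetical.

```python
import math


def mercator_cell(lat, lon, levels):
    """Return integer (x, y) cell coordinates on a 2**levels grid."""
    n = 1 << levels
    x = (lon + 180.0) / 360.0
    # Web Mercator, normalized so that y = 0 is north and y = 1 is south
    y = (1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0
    xi = min(n - 1, max(0, int(x * n)))
    yi = min(n - 1, max(0, int(y * n)))  # clamps latitudes beyond ~85 degrees
    return xi, yi


def quadtree_path(xi, yi, levels):
    """Return one child index in {0, 1, 2, 3} per level, root first."""
    path = []
    for level in range(levels - 1, -1, -1):
        xbit = (xi >> level) & 1
        ybit = (yi >> level) & 1
        path.append((ybit << 1) | xbit)  # assumed child numbering
    return path


# Hypothetical pickup point in Manhattan, in the style of latlon(25).
xi, yi = mercator_cell(40.75, -73.99, 25)
path = quadtree_path(xi, yi, 25)
```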
time(L,BASE,WIDTH_SECS,OFFSET_SECS)
creates a binary-tree with L levels. Expects one input
column with a string that is convertible to a timestamp.
Some accepted formats are:
'2000-01-01T00:00:00-06:00.125'
'2000-01-01T00:00:00-06:00'
'2000-01-01T00:00-06:00'
'2000-01-01T00:00'
'2000-01-01T00'
'2000-01-01'
It uses the timestamp BASE (also in one of the formats
above) as the alignment point for temporal bins. The
bins have a width of WIDTH_SECS seconds. OFFSET_SECS is
an offset in seconds applied to the input timestamps,
for instance when the data comes in local time and we
want to correct it to UTC.
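The binning described above can be sketched in Python as follows. The
function name is hypothetical, and the sign convention for the offset
(added to the input timestamp before binning) is an assumption. The
sketch only accepts timestamps that Python's datetime.fromisoformat can
parse, and both arguments must either carry a UTC offset or both omit
it.

```python
from datetime import datetime


def time_bin(timestamp, base, width_secs, offset_secs=0):
    """Return the index of the bin of width width_secs that holds timestamp."""
    t = datetime.fromisoformat(timestamp)
    b = datetime.fromisoformat(base)
    # assumption: the offset is added to the input before binning
    elapsed = (t - b).total_seconds() + offset_secs
    return int(elapsed // width_secs)


# Hourly bins aligned at 2009-01-01 midnight, UTC-05:00.
base = "2009-01-01T00:00:00-05:00"
bin_index = time_bin("2009-01-01T02:30:00-05:00", base, 3600)
```

Timestamps earlier than BASE yield negative indices in this sketch; how
nanocube treats them is not covered here.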
unixtime(L,BASE,WIDTH_SECS,OFFSET_SECS)
Analogous to time, but instead of expecting a date
and time string, it expects a string with the number
of seconds since the Unix epoch (1970-01-01 UTC).
ip(L)
creates a quad-tree with L levels, mapping IPv4 addresses
(e.g. 123.122.122.98) to a corresponding entry using the
Hilbert space-filling curve convention.
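For readers unfamiliar with Hilbert curve layouts, the Python sketch
below treats an IPv4 address as a 32-bit distance along a Hilbert curve
and converts it to a cell on a 65536 x 65536 grid with the textbook
distance-to-coordinates algorithm. Whether nanocube uses this exact
orientation is an assumption, and the function names are hypothetical.

```python
import ipaddress


def hilbert_d2xy(order, d):
    """Convert distance d along a Hilbert curve to (x, y) on a 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # rotate the quadrant so consecutive cells stay adjacent
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def ip_to_cell(address):
    """Map a dotted IPv4 string to a cell on a 2**16 x 2**16 grid."""
    d = int(ipaddress.IPv4Address(address))  # 32-bit integer value
    return hilbert_d2xy(16, d)


cell = ip_to_cell("123.122.122.98")
```

Under this layout, addresses that share a long prefix land in nearby
cells, so a subnet corresponds to a compact region of the grid.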
measure_dimension(NAME, INPUT, MEASURE_SPEC)
NAME and INPUT are the same as in the index_dimension.
MEASURE_SPEC
either a primitive scalar type, such as a signed integer,
unsigned integer, or floating-point number with 32 or 64
bits, or a predefined function that converts the input
columns into a measure value
u32, u64
f32, f64
row_bitset()
file(F)
reads content of file F as a string. Can be used together
with categorical index dimensions to point to alias mapping
descriptions (see categorical).
Here is an example of a mapping file based on the New York City taxi
datasets (http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml).
Line comments start with #.
# I1. quadtree index dimension for the pickup location
index_dimension('pickup_location',input('pickup_latitude','pickup_longitude'),latlon(25));
# I2. quadtree index dimension for the dropoff location
index_dimension('dropoff_location',input('dropoff_latitude','dropoff_longitude'),latlon(25));
# I3. binary tree for the time dimension: hourly bins
index_dimension('pickup_time', input('tpep_pickup_datetime'), time(17,'2009-01-01T00:00:00-05:00',3600,5*60));
# I4. weekday of pickup
index_dimension('weekday', input('tpep_pickup_datetime'), weekday());
# I5. hour of pickup
index_dimension('hour', input('tpep_pickup_datetime'), hour());
# M1. measure dimension for counting (u32 means a non-negative integer
# with max value 2^32 - 1). If no input .csv column is used in a measure
# dimension, it functions as a count (i.e. one for each input record)
measure_dimension('count',input(),u32);
# M2. duration
measure_dimension('duration',input('tpep_pickup_datetime','tpep_dropoff_datetime'),duration(f32,60));
# M3. duration squared (for stddev computations)
measure_dimension('duration2',input('tpep_pickup_datetime','tpep_dropoff_datetime'),duration2(f32,60));
# M4. Map fare_amount into a 32-bit floating point value
measure_dimension('fare',input('fare_amount'),f32);
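The comment on M3 mentions stddev computations. A Python sketch of how
the three aggregates 'count', 'duration' and 'duration2' could be
combined after a query is given below, assuming they hold n, the sum of
the durations, and the sum of the squared durations. The function name
and sample values are hypothetical.

```python
import math


def stddev_from_sums(count, total, total_sq):
    """Population standard deviation from n, sum(x) and sum(x*x)."""
    if count == 0:
        raise ValueError("no records")
    mean = total / count
    variance = total_sq / count - mean * mean  # E[x^2] - E[x]^2
    return math.sqrt(max(variance, 0.0))  # guard against rounding below zero


# Hypothetical trip durations standing in for aggregated query results.
durations = [10.0, 12.0, 14.0]
count = len(durations)
total = sum(durations)
total_sq = sum(d * d for d in durations)
sigma = stddev_from_sums(count, total, total_sq)
```

Since all three quantities are plain sums, they can be accumulated per
bin and combined across bins before the final division.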