
LIP 13 - Gathering columns into a 3D tensor (with padding if necessary)

LIP         13
Title       Gathering columns into a 3D tensor (with padding if necessary)
Author      A. Ranganath
Status      Draft
Type        Standard
Discussion  Issue #44
PR          #45
Created     March 6, 2018

Introduction

This LIP proposes a custom tf-op for gathering columns from a 2D tensor into a 3D tensor. If the number of columns to be gathered is not the same for every 2D slice of the resulting 3D tensor, the remaining columns are filled with a padding value.

Technical Background

Multi-op nodes like Sums, Products and PermProds operate on 3D tensors, performing reduction operations during value and path computations. These 3D tensors are initially gathered as a single wide 2D tensor and then reshaped to 3D, where the first dimension corresponds to the batch size, the second to the number of ops modeled, and the third to the input size.
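
As an illustration, here is a minimal TF 1.x sketch of this gather-and-reshape pattern, assuming a hypothetical node that models two ops, each with input size 3 (the column indices and sizes are purely illustrative):

```python
import tensorflow as tf

# Hypothetical node modeling 2 ops, each with input size 3 (indices are illustrative).
values = tf.placeholder(tf.float32, shape=[None, 10])  # [batch, num_inputs]
col_indices = [0, 2, 4, 1, 3, 5]                       # 2 ops x 3 columns each
wide = tf.gather(values, col_indices, axis=1)          # wide 2D tensor: [batch, 6]
per_op = tf.reshape(wide, [-1, 2, 3])                  # [batch, num_ops, input_size]
node_values = tf.reduce_sum(per_op, axis=-1)           # one reduction per modeled op
```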

An assumption in these nodes is that all ops modeled within have the same (homogeneous) input size. With newer multi-op nodes like SumsLayer and ProductsLayer, this assumption is dropped, since each op modeled can have a different (heterogeneous) input size.

To model ops with heterogeneous input sizes, it would be necessary to insert column vectors of zeros or ones (for sums or products, respectively) into the wide 2D tensor before it is reshaped into a 3D tensor.
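
For illustration, a minimal sketch of this baseline approach, assuming a hypothetical products-like node whose two ops have input sizes 2 and 3 (all sizes and indices here are illustrative):

```python
import tensorflow as tf

# Hypothetical products-like node modeling 2 ops with input sizes 2 and 3.
# A column of ones is inserted after the first op's columns so that both
# 2D slices of the reshaped tensor have the maximum input size of 3.
values = tf.placeholder(tf.float32, shape=[None, 5])   # [batch, 2 + 3] gathered inputs
ones_col = tf.ones([tf.shape(values)[0], 1], dtype=values.dtype)
padded = tf.concat([values[:, :2], ones_col, values[:, 2:]], axis=1)  # [batch, 6]
per_op = tf.reshape(padded, [-1, 2, 3])                # [batch, num_ops, max_input_size]
node_values = tf.reduce_prod(per_op, axis=-1)          # padding of 1 leaves products unchanged
```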

Instead, a more efficient approach would be to develop a custom tf-op for gathering columns from a 2D tensor into a 3D tensor, wherein slices with fewer columns to gather are padded with a padding value (e.g., 0, 1, -inf, etc.) defined and set as an attribute of the op during graph construction.

Proposal

Create a custom tf-op, with both CPU and GPU OpKernels, for gathering columns from a 2D tensor into a 3D tensor. The params parameter would accept either a 1D or a 2D tensor, while indices would be a nested list of indices, wherein each inner list is the set of column indices to gather for the corresponding 2D slice of the resulting 3D tensor.

The lengths of the inner lists of the indices parameter can be either homogeneous or heterogeneous. If homogeneous, the OpKernels simply gather values from the params tensor into an output of shape (batch x len(indices) x len(indices[0])). If heterogeneous, the OpKernels first initialize the output tensor with the pad-elem value (an attribute of the op) and then gather values from the params tensor into an output of shape (batch x len(indices) x max(len(ind) for ind in indices)).
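
The sketch below illustrates the intended gathering and padding semantics in NumPy; it is a reference illustration only, not the proposed OpKernel implementation, and the function name, signature and values are placeholders:

```python
import numpy as np

def gather_cols_3d_reference(params, indices, pad_elem=0.0):
    """Reference sketch: gather columns of a (1D or 2D) `params` array into a
    3D array, padding shorter slices with `pad_elem`."""
    params = np.atleast_2d(params)                     # a 1D input becomes a single row
    max_len = max(len(ind) for ind in indices)
    # Initialize the output with the padding value, then fill the gathered columns.
    out = np.full((params.shape[0], len(indices), max_len), pad_elem, dtype=params.dtype)
    for row, ind in enumerate(indices):
        out[:, row, :len(ind)] = params[:, ind]
    return out

# Heterogeneous inner lists: the second slice is padded up to length 3.
params = np.arange(8.0).reshape(2, 4)                  # batch of 2, 4 columns
result = gather_cols_3d_reference(params, [[0, 2, 3], [1]], pad_elem=1.0)
print(result.shape)                                    # (2, 2, 3)
```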

Performance comparison

The following performance metrics compare three alternatives: (a) the existing custom gather_cols op (custom_gather), (b) the proposed gather_cols_3d op (custom_gather_3d), and (c) the built-in TF gather op (tf_gather). Test cases include a 'Non-padded' case (i.e., homogeneous column sizes) and a 'Padded' case (heterogeneous column sizes); the 'size' column reports the size of the constructed TF graph.

-----------------------
Non-padded
-----------------------
CPU             op    dt:  size  setup_time  first_run_time  rest_run_time    correct
     custom_gather int32:    69       89.60           79.66          70.00       True
     custom_gather int64:    69      153.92           75.09          72.80       True
  custom_gather_3d int32:    49       14.86           21.49          19.91       True
  custom_gather_3d int64:    49       14.87           20.13          41.93       True
         tf_gather int32:    79       96.24           73.18          59.88       True
         tf_gather int64:    79      167.78           72.18          61.73       True
GPU             op    dt:  size  setup_time  first_run_time  rest_run_time    correct
     custom_gather int32:    69      161.59          284.48           1.50       True
     custom_gather int64:    69      101.24            8.45           1.31       True
  custom_gather_3d int32:    49       34.92           10.44           1.35       True
  custom_gather_3d int64:    49       16.96            6.68           1.34       True
         tf_gather int32:    79      139.51            8.92           1.37       True
         tf_gather int64:    79      103.47            8.14           1.31       True

-----------------------
Padded
-----------------------
CPU             op    dt:  size  setup_time  first_run_time  rest_run_time    correct
     custom_gather int32:  2019      738.01          172.90          12.31       True
     custom_gather int64:  2019      708.18          163.64          12.52       True
  custom_gather_3d int32:    49       22.29           41.60          34.66       True
  custom_gather_3d int64:    49       29.13           41.32          34.96       True
         tf_gather int32:  2519     1132.94          198.98          12.37       True
         tf_gather int64:  2519     1013.32          204.24          12.32       True
GPU             op    dt:  size  setup_time  first_run_time  rest_run_time    correct
     custom_gather int32:  2019      778.28          161.96          15.83       True
     custom_gather int64:  2019      922.06          179.78          14.24       True
  custom_gather_3d int32:    49       37.36            9.33           1.59       True
  custom_gather_3d int64:    49       23.63            7.43           1.65       True
         tf_gather int32:  2519     1016.83          205.57          14.25       True
         tf_gather int64:  2519     1034.35          213.37          14.26       True

For the 'Non-padded' case, the proposed op yields a slightly smaller graph, with performance comparable to the other two alternatives. For the 'Padded' case, there is a significant improvement in both graph size and performance.

Decision
