Hints for Performance Testing

Always use feeding as the mechanism for providing all inputs. Using constants enables some pre-computing mechanism which results in ops being run on CPU, even if they are supposed to be run on a GPU.
Use tf.identity(tf.placeholder()) to provide inputs. This way, we can expect that the input is copied to GPU only once to the identity op, and then served to other ops.
Instead of stacking ops, make them dependent on each other using control flow ops. This simulates op stacking, while permitting the use of arbitrary inputs. For instance:
```
      ops = op_fun(params_op, indices)
      for _ in range(self.num_ops - 1):
          ops = op_fun(tf.tuple([params_op, ops])[0], indices)
```
Here, the tuple adds a dependency on the output of the previous op, but still feeds the proper input to the next op.
Always test performance on inputs of different dtypes. The performance my vary greatly between dtypes.
Test on several machines, performance varies between GPU/CPU configurations.

Provide feedback