feature generation

Feature Generation

Libraries: boto3, pandas, sklearn

Deployment package (aws-build-deployment-package): pandas, sklearn

In a machine learning job, raw input data generally needs pre-processing before it can be used as features for training. The featurization workload uses the Amazon Fine Food Reviews text dataset and transforms each review into a TF-IDF vector. To run the workload in parallel on a FaaS environment with different RAM configurations, we partition the input dataset into files of various sizes. Because a global TF-IDF vector has to be calculated from the partitioned input datasets, the workload needs multiple function invocations, both for parallel processing and for aggregation.
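The split into parallel processing and aggregation can be sketched as a small map/reduce over partitions. This is a minimal illustration and not the repository's code: it uses plain Python instead of sklearn, the function names (`extract_partition`, `reduce_partitions`, `tfidf`) are hypothetical, and the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1` without vector normalization.

```python
import math
import re
from collections import Counter

# Simple lowercase word tokenizer (an assumption; the real workload may differ).
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def extract_partition(reviews):
    """Map step: count documents and per-term document frequency in one partition."""
    doc_freq = Counter()
    for review in reviews:
        # A term counts once per review, however often it occurs in it.
        doc_freq.update(set(tokenize(review)))
    return {"num_docs": len(reviews), "doc_freq": doc_freq}


def reduce_partitions(partials):
    """Reduce step: merge the partial counts and compute a global smoothed IDF."""
    total_docs = sum(p["num_docs"] for p in partials)
    merged = Counter()
    for p in partials:
        merged.update(p["doc_freq"])
    return {
        term: math.log((1 + total_docs) / (1 + df)) + 1
        for term, df in merged.items()
    }


def tfidf(review, idf):
    """Weight the raw term counts of one review by the global IDF."""
    counts = Counter(tokenize(review))
    return {term: n * idf[term] for term, n in counts.items() if term in idf}


partitions = [["Good coffee", "bad coffee"], ["good tea"]]
global_idf = reduce_partitions([extract_partition(p) for p in partitions])
print(tfidf("good coffee", global_idf))
```

Because only document counts and document frequencies cross the partition boundary, the merged IDF is the same as the one computed over the unpartitioned dataset, which is why the aggregation step can run after all extractors have finished.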

Lambda

Orchestrator (code): Invokes the function multiple times for parallel processing.

Feature Extractor (code): Data preprocessing; extracts words from sentences.

Feature Reducer (code): Generates the global TF-IDF vector.

Get-job-status (code): Checks the number of S3 objects in the bucket.
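The orchestrator and status-check roles described above might look roughly like the following sketch. It is not the repository's implementation: the payload field `key`, the function name passed as `extractor_name`, the output bucket, and the `RUNNING` status value are all assumptions made for illustration. The boto3 clients are passed in as arguments so the logic can be exercised without AWS credentials.

```python
import json


def orchestrate(lambda_client, bucket, partition_keys, extractor_name):
    """Fan out: invoke the extractor asynchronously, once per partition file."""
    for key in partition_keys:
        lambda_client.invoke(
            FunctionName=extractor_name,  # hypothetical function name
            InvocationType="Event",       # asynchronous invocation
            Payload=json.dumps({"bucket": bucket, "key": key}).encode("utf-8"),
        )
    # The number of launched jobs is what the status check compares against.
    return len(partition_keys)


def count_objects(s3_client, bucket, prefix=""):
    """Count the objects under a prefix, following pagination tokens."""
    total = 0
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        total += page.get("KeyCount", 0)
        if not page.get("IsTruncated"):
            return total
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


def get_job_status(s3_client, output_bucket, expected, prefix=""):
    """Report SUCCEEDED once every extractor has written its output object."""
    finished = count_objects(s3_client, output_bucket, prefix)
    # "RUNNING" is an assumed value; any status other than SUCCEEDED
    # sends the state machine back to its wait state.
    return {"status": "SUCCEEDED" if finished >= expected else "RUNNING"}
```

With real clients this would be called as `orchestrate(boto3.client("lambda"), ...)` and `get_job_status(boto3.client("s3"), ...)`.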

Step Functions

Step Functions state machine definition:

{
  "StartAt": "Orchestrator",
  "States": {
    "Orchestrator": {
      "Type": "Task",
      "Resource": "[ORCHESTRATOR-FUNCTION-ARN]",
      "ResultPath": "$.num_of_file",
      "Next": "Wait X Seconds"
    },
    "Wait X Seconds": {
      "Type": "Wait",
      "Seconds": 12,
      "Next": "Get Job Status"
    },
    "Get Job Status": {
      "Type": "Task",
      "Resource": "[GET-JOB-STATUS-FUNCTION-ARN]",
      "Next": "Job Complete?",
      "InputPath": "$.num_of_file",
      "ResultPath": "$"
    },
    "Job Complete?": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "FAILED",
          "Next": "Wait X Seconds"
        },
        {
          "Variable": "$.status",
          "StringEquals": "SUCCEEDED",
          "Next": "Feature Reducer"
        }
      ],
      "Default": "Wait X Seconds"
    },
    "Feature Reducer": {
      "Type": "Task",
      "Resource": "[FEATURE-REDUCER-FUNCTION-ARN]",
      "End": true
    }
  }
}
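The control flow of the state machine above (start the jobs, wait, poll the status, then reduce) can be dry-run locally with a small loop. This is only a sketch of the flow: the three callables stand in for the Lambda functions, and `max_polls` is an added safety limit that the state machine definition itself does not have.

```python
import time


def run_workflow(orchestrate, get_status, reduce, wait_seconds=12,
                 max_polls=50, sleep=time.sleep):
    """Mimic the state machine: orchestrate, then poll until SUCCEEDED, then reduce."""
    num_of_file = orchestrate()           # orchestrator task
    for _ in range(max_polls):
        sleep(wait_seconds)               # "Wait X Seconds"
        status = get_status(num_of_file)  # "Get Job Status"
        if status["status"] == "SUCCEEDED":
            return reduce()               # "Feature Reducer"
        # "FAILED" and every other status loop back to the wait state,
        # matching the Choice rules and the Default branch.
    raise TimeoutError("job did not finish within max_polls polls")
```

Passing `sleep=lambda s: None` lets the loop run instantly in a test.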

Workload Input: Text

Workload Output: Text

Lambda payload (test event) example:

[DATASET-BUCKET] is the bucket that stores the amazon-fine-food-reviews dataset. It needs one or more partition files (reviews10mb.csv, reviews20mb.csv, reviews50mb.csv, reviews100mb.csv), or the original dataset from https://snap.stanford.edu/data/web-FineFoods.html

{
    "bucket": "[DATASET-BUCKET]"
}
