confused about usage of `parquet_decode` and `batched` #2930

fearfate · 2024-10-11T08:27:37Z

I want to read a Parquet file, apply a mapping to its rows, and then encode the result back to Parquet.

input:
  label: projects
  file:
    paths:
      - source/*.parquet
    scanner:
      to_the_end: {}
    auto_replay_nacks: true
pipeline:
  processors:
  - parquet_decode: {}
  - for_each:
    - mapping: |
        #!blobl
        root = this
        root.ID = "project:%v/%v".format(root.Type.lowercase(), root.Name)
        root.SnapshotAt = root.SnapshotAt.ts_unix_nano()
        if root.exists("OSSFuzz") && root.OSSFuzz != null {
          if root.OSSFuzz.exists("Date") && root.OSSFuzz.Date != null {
            root.OSSFuzz.Date = root.OSSFuzz.Date.ts_unix_nano()
          }
        }
        meta SnapshotAt = root.SnapshotAt
  - parquet_encode:
      schema:
      - name: SnapshotAt
        type: INT64
      - name: Type
        type: UTF8
      - name: Name
        type: UTF8
      - name: OpenIssuesCount
        optional: true
        type: INT64
      - name: StarsCount
        optional: true
        type: INT64
      - name: ForksCount
        optional: true
        type: INT64
      - name: Licenses
        repeated: true
        type: UTF8
      - name: Description
        optional: true
        type: UTF8
      - name: Homepage
        optional: true
        type: UTF8
      - name: OSSFuzz
        optional: true
        fields:
          - name: LineCount
            optional: true
            type: INT64
          - name: LineCoverCount
            optional: true
            type: INT64
          - name: Date
            optional: true
            type: INT64
          - name: ConfigURL
            type: UTF8
      default_compression: zstd

output:
  file:
    path: target/${! timestamp_unix_nano() }.parquet # No default (required)
    codec: all-bytes

This config breaks the batch up into individual messages and encodes each message into its own file.

If I try using `batched`, how can I use the original filename rather than a random one (`${! timestamp_unix_nano() }`)?


mihaitodor · 2024-10-11T10:59:43Z

Hey @fearfate 👋

> how can I use the original filename rather than a random one (`${! timestamp_unix_nano() }`)?

The `file` input adds several metadata fields to each message, including `path`: https://docs.redpanda.com/redpanda-connect/components/inputs/file/#metadata. Based on that, you can use `${! @path.filepath_split().index(-1) }` to get the original file name.
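For illustration, here is a minimal sketch of how that interpolation could replace the timestamp in the output from the config above. It assumes the `path` metadata set by the `file` input is still present on the message once it reaches the output, which is worth verifying in your pipeline. The `target/` directory and `all-bytes` codec are copied from the original config.

```yaml
output:
  file:
    # @path is the metadata field set by the file input, e.g. source/foo.parquet.
    # filepath_split() separates the directory from the file name, and index(-1)
    # picks the last element (the file name), giving target/foo.parquet.
    # Assumption: the path metadata survives parquet_decode and parquet_encode.
    path: target/${! @path.filepath_split().index(-1) }
    codec: all-bytes
```

Since the source files already end in `.parquet`, the extension does not need to be appended again.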

PS: Moving to a discussion as per #2026.

redpanda-data locked and limited conversation to collaborators Oct 11, 2024

mihaitodor converted this issue into discussion #2931 Oct 11, 2024
