How do you use a column that was added in a select call, in a where/order by call? #408

Spinnernicholas · 2024-12-15T04:57:37Z

Example: Identifying duplicate images by hash.

import pixeltable as pxt
import imagehash

# Create user defined function
@pxt.udf
def calc_hash(image: pxt.Image) -> str:
    return str(imagehash.average_hash(image))

# Create table for storing images
t = pxt.create_table('raw_images',{
     'image': pxt.Image,
     'filename': pxt.String
})

# Add Calculated Column to compute hash
t.add_computed_column(hash=calc_hash(t.image))

# Insert several records ...

# Group and count duplicate images by hash
df = t.group_by(t.hash).select(t.hash, count=pxt.functions.count(1))

# How do I do this?
# df.where(count > 1)
# It would be cool if we could do df.where(df.count > 1), but it would also be
# nice to do the select and where in a single method chain.

There seems to be a lack of documentation and examples on using group_by in general, but I was able to figure things out until I tried to do this.
I couldn't locate a ColumnRef (i.e., "t.hash") in the DataFrame.
I tried to create a view with the aggregate column, but it appears that group_by DataFrames cannot be used to create views.

Thank you!

Answered by mkornacker

mkornacker (Maintainer) · 2024-12-15T05:07:47Z

You are encountering a gap in the query functionality. Essentially, what you want is the equivalent of the SQL HAVING clause or, more generally, the ability to query a DataFrame like a table:
df = t.group_by(t.hash).select(t.hash, count=pxt.functions.count(1))
result = df.where(df.count > 1)

At the moment, Pixeltable doesn't support that. However, we are aware that this is useful and are planning on adding that functionality in the not-too-distant future.
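
For readers less familiar with HAVING, here is a minimal sketch of the semantics being asked for, written against Python's built-in sqlite3 module rather than Pixeltable. The table layout, file names, and hash values are made up for illustration; only the grouping-then-filtering behavior is the point.

```python
import sqlite3

# Hypothetical data: (filename, hash) pairs standing in for the 'raw_images' table.
rows = [
    ("a.png", "ffe0"),
    ("b.png", "ffe0"),
    ("c.png", "0f1c"),
    ("d.png", "ffe0"),
    ("e.png", "0f1c"),
    ("f.png", "1234"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_images (filename TEXT, hash TEXT)")
conn.executemany("INSERT INTO raw_images VALUES (?, ?)", rows)

# HAVING filters on the aggregate after grouping; a plain WHERE runs
# before grouping and therefore cannot see COUNT(*).
duplicates = conn.execute(
    """
    SELECT hash, COUNT(*) AS n
    FROM raw_images
    GROUP BY hash
    HAVING COUNT(*) > 1
    ORDER BY n DESC
    """
).fetchall()
conn.close()

print(duplicates)
```

Each returned tuple is a hash together with the number of images sharing it; hashes that occur only once are dropped by the HAVING condition.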

Can you tell us a little more about your use case?

4 replies

Spinnernicholas Dec 15, 2024
Author

My example is my use case. I imagine there are other common cases as well.

I have about 70k images on a specific topic that I am using as a master dataset for fine-tuning. I have a lot of duplicates because I prioritized gathering a large number of images quickly. I have been experimenting with using LLaVA models to generate metadata (tags, captions, etc.) about the images. The plan is to split the large dataset into small subsets for fine-tuning. My thinking is that I can hopefully get better fine-tuning results, with much smaller datasets and less training time, by curating multiple small datasets, each with a specific purpose.

Pixeltable in this case would serve as the ever-growing and evolving pool of training data.

Spinnernicholas Dec 15, 2024
Author

I am assuming I can do the group_by/select -> collect -> to pandas -> load into a new table.

I'll try it tomorrow.
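
A minimal sketch of the client-side filtering step in that plan, using only the Python standard library. It assumes the collected rows have already been converted to plain dicts with `hash` and `filename` keys; that row shape, the sample values, and the `find_duplicates` helper are hypothetical and not part of Pixeltable's API.

```python
from collections import defaultdict

# Assumed shape of the collected rows: one dict per image.
rows = [
    {"filename": "a.png", "hash": "ffe0"},
    {"filename": "b.png", "hash": "ffe0"},
    {"filename": "c.png", "hash": "0f1c"},
    {"filename": "d.png", "hash": "ffe0"},
    {"filename": "e.png", "hash": "0f1c"},
    {"filename": "f.png", "hash": "1234"},
]


def find_duplicates(rows):
    """Group filenames by hash and keep only hashes seen more than once."""
    by_hash = defaultdict(list)
    for row in rows:
        by_hash[row["hash"]].append(row["filename"])
    # Equivalent of "count > 1" applied after the grouping step.
    return {h: names for h, names in by_hash.items() if len(names) > 1}


duplicate_groups = find_duplicates(rows)
for image_hash, filenames in duplicate_groups.items():
    print(image_hash, len(filenames), filenames)
```

Keeping the filenames per group, rather than only the count, makes it straightforward to decide which copy of each duplicate to retain.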

mkornacker Dec 15, 2024
Maintainer

You don't even need to go through pandas:
df = t.group_by(t.hash).select(t.hash, count=...) # no collect()
count_t = pxt.create_table('counts', df)
res = count_t.where(count_t.count > 1).collect()

Spinnernicholas Dec 15, 2024
Author

Easy! Thanks for your help!
