Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve filenames for downloaded assets #1232

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

Lymia
Copy link

@Lymia Lymia commented May 13, 2024

This changes the filenames output by DiscordChatExporter to be more meaningful (in the case of emojis and avatars) and more resilient against collisions.

The hash is 12 characters of base32 now, rather than 5 characters of hexadecimal. This allows for nearly 5 million downloads with the same name before there is even a 0.001% of a single collision. This should actually be enough, even for problematic filenames like those associated with Youtube thumbnails.

Emojis and Discord attachments are instead guaranteed to not collide as they instead contain the unique snowflake of the attachment or emoji. The 19 digit id is significantly longer than the old 5 digit ids - however, reencoding it in base32 would only save 6 characters, so better to use the more recognizable numeric form IMO.

Closes #1231

@Tyrrrz Tyrrrz changed the title Improve filenames for downloaded assets. Improve filenames for downloaded assets May 13, 2024
@Tyrrrz Tyrrrz added bug and removed enhancement labels May 13, 2024
@Lymia
Copy link
Author

Lymia commented Jul 27, 2024

Any update on this? I personally do use this patch to backup a few servers that trigger this bug.

@Tyrrrz
Copy link
Owner

Tyrrrz commented Jul 27, 2024

Can't really change the naming algorithm as it will break "skip downloaded assets" for all existing exports.

@Lymia
Copy link
Author

Lymia commented Jul 28, 2024

That is, unfortunately, the point. "Skip Downloaded Assets" as it currently works has a severe risk of incorrectly skipping non-duplicated assets, and there's no good way to fix that without invalidating past downloads. Do you think it'd work to put this behind a command line flag, instead of changing the default behavior?

@Tyrrrz
Copy link
Owner

Tyrrrz commented Jul 29, 2024

I think your concern is statistically valid, but there yet haven't been any bug reports in regard to asset filename collision. If/when that happens, then it makes sense to introduce the breaking change to fix such an issue. I want to make sure we're fixing real problems when we add new ones (by breaking compatibility) 🙂

Maybe down the line there will be other related breaking changes, then it will make sense to group them together. Right now I don't see it happening any time soon.

@ofifoto
Copy link

ofifoto commented Jul 29, 2024

haven't been any bug reports in regard to asset filename collision

would there be an error? or would the file just silently be skipped? in a bigger discord that would be very easy to miss if it was just silently skipped

@Tyrrrz
Copy link
Owner

Tyrrrz commented Jul 29, 2024

would there be an error? or would the file just silently be skipped? in a bigger discord that would be very easy to miss if it was just silently skipped

There wouldn't be an error now, but that can be changed.

@ofifoto
Copy link

ofifoto commented Jul 29, 2024

There wouldn't be an error now, but that can be changed.

I think that's a nice compromise :)

@Twi-Hard
Copy link

Twi-Hard commented Jul 30, 2024

I have a single channel with 8774 conflicts.
This command counts how many times each filename occurred:

❯ cat 98609319519453184.json | jq -r '.messages[].attachments[].url' | sort | uniq -c | sort -nr > discord-98609319519453184-dupes.txt

This command removes the lines with only 1 instance of a filename then sums up all of the file counts for files with conflicts:

❯ cat discord-98609319519453184-dupes.txt | awk '{print $1}' | grep -vE '^1$' | paste -sd+ - | bc
17453

Then count the number of filenames that occur more than once:

❯ cat discord-98609319519453184-dupes.txt | awk '{print $1}' | grep -vE '^1$' | wc -l
8679

17453-8679 = 8774

I manually checked some of the messages that had duplicate filenames in discord and they were indeed unique images in discord but only one image was saved locally.
I've checked a different channel too and the same thing happened.
I can't remember if it's the default or not, but I store each channels images in its own folder

❯ ls -1U | head -10
670866322619498511-media
670866322619498511.json
704107851421057114-media
704107851421057114.json
704226060178292846-media
704226060178292846.json
862221868621758502-media
862221868621758502.json
862732083050971146-media
862732083050971146.json

It would be really great if you could implement an option to make each name unique. People that want to stick with the old method don't need to do anything if it's done this way.

@Tyrrrz
Copy link
Owner

Tyrrrz commented Jul 30, 2024

@Twi-Hard please add this info to the linked issue (#1231). If you can, please also provide a few different URLs that resolve to the same hash/filename with the current algorithm.

@Tyrrrz
Copy link
Owner

Tyrrrz commented Nov 6, 2024

Just as an update, this fix will be included in 3.x as part of #1271. I just can't promise when that will be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

--reuse-media creates an unacceptably high chance of a collision
4 participants