Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wp_get_attachment_metadata() strips what it thinks are html tags in Exif metadata #14

Open
rheaplex opened this issue Jul 27, 2016 · 11 comments
Labels
🚧 status: blocked Blocked & therefore, not ready for work 🏷 status: label work required Needs proper labelling before it can be worked on 🧹 status: ticket work required Needs more details before it can be worked on
Milestone

Comments

@rheaplex
Copy link
Contributor

rheaplex commented Jul 27, 2016

If we have a jpeg with a Copyright field like:

<no> tags, tags are <stripped>

then when we upload the file to WordPress and fetch the Exif metadata using:

wp_get_attachment_metadata($att_id[0], true);

then the string we get for Copyright is:

 tags, tags are 

I assume this is due to WordPress taking the sensible precaution of stripping HTML tags from outside input, but it does mean that the format we are using for license URLs falls foul of this.

I've chased this down the call stack a way and I can't find anywhere to change it. I'd rather not have to use php's exif parsing, although I've just tested that and it doesn't have the same problem.

Investigating further, but if anyone knows of a quick fix for this please let me know.

@rheaplex
Copy link
Contributor Author

rheaplex commented Jul 27, 2016

in wp-admin/includes/image.php :

wp_read_image_metadata()

calls:

wp_kses_post_deep()

which strips tags.

So we have to use the php exif parsing. I'll look at hooking this in for the Media editor, pulling the values for the fields if they are not otherwise populated.

@mattl does this indicate a more general problem with the Exif tag format we are proposing? I don't believe so, but worth considering. Also maybe we should consider adding source and CC+ if we haven't already.

@rheaplex
Copy link
Contributor Author

Making progress with the php Exif parsing, just trying to make it efficient for the code and logical for the user.

@rheaplex
Copy link
Contributor Author

Code now extracts license and attribution url when you view the media. Looking to see if I can hook this in to the image upload process, but if not this will be Good Enough, I think.

@rheaplex
Copy link
Contributor Author

Metadata now extracted on image upload.

This won't get metadata for existing images if the plugin is installed and we have (e.g.) 20,000 images with Exif already in the system.

@mattl we can run the extract code when you view the image in the Media editor, or is this something we might want to give the user the option of running manually from the settings for the plugin (a button [Scan Existing Images for License Metadata And Apply It] ) if that's possible?

@mattl
Copy link
Contributor

mattl commented Jul 29, 2016

Won't existing images have been previously stripped by WordPress?

@rheaplex
Copy link
Contributor Author

I don't believe so. The strings are stripped after reading from the file, rather than the file itself being sanitised.

@mattl
Copy link
Contributor

mattl commented Jul 29, 2016

screenshot from 2016-07-29 15-56-15

Maybe something like this? We could pull all the existing images from the CC website as a test, but also @ericsteuer has good insight into how this works on a big site liked Wired.com who probably have a few hundred thousand images.

@rheaplex
Copy link
Contributor Author

I had in mind more a global "Extract CC License metadata where present but
don't overwrite anything" option.

We could also add a button to the media manager to do this for individual
images.

So the former would support hundreds of thousands, the latter just a few if
you only want to use a few.

On Fri, Jul 29, 2016 at 1:57 PM, Matt Lee notifications@github.com wrote:

[image: screenshot from 2016-07-29 15-56-15]
https://cloud.githubusercontent.com/assets/33296/17263143/0538b4a8-55a5-11e6-9471-e62fa5f2e11b.png

Maybe something like this? We could pull all the existing images from the
CC website as a test, but also @ericsteuer https://github.com/ericsteuer
has good insight into how this works on a big site liked Wired.com who
probably have a few hundred thousand images.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#14 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AABU8ocliyFl3zU0pVeCqom5cyrKPavHks5qamkxgaJpZM4JWuJI
.

@mattl
Copy link
Contributor

mattl commented Jul 29, 2016

The worry I have there is that we'd wind up adding extra captions to existing images all over the place.

@rheaplex
Copy link
Contributor Author

Sure. It's the sort of thing where the user will want the plugin to do the right thing, for a value of "the right thing" that will differ from case to case. And they'll really want an Undo button.

So if this is too difficult to do usefully we shouldn't make something that will just frustrate people. :-)

@BjornW
Copy link
Contributor

BjornW commented Jul 30, 2016

Why not use the 'regenerate thumbnails approach' in which you have a plugin run once for all existing images? This could be a seperate add-on plugin which can be removed after it has run, since it's likely to be run only once.

@rheaplex rheaplex self-assigned this Nov 11, 2017
@rheaplex rheaplex added this to the 2.1 Release milestone Nov 11, 2017
@rheaplex rheaplex removed their assignment Sep 7, 2018
@ahmadbilaldev ahmadbilaldev modified the milestones: 2.1 Release, Later Jul 25, 2019
@TimidRobot TimidRobot modified the milestones: Later, Someday Nov 8, 2019
@TimidRobot TimidRobot modified the milestones: Someday, Backlog Nov 20, 2020
@cc-open-source-bot cc-open-source-bot added the 🏷 status: label work required Needs proper labelling before it can be worked on label Dec 2, 2020
GwynethLlewelyn added a commit to GwynethLlewelyn/wp-plugin-creativecommons that referenced this issue Aug 31, 2021
Sadly, after all my enthusiasm, I've noticed that issue creativecommons#14 had changed to `exif_read_data()` because  `wp_read_image_metadata()` was removing all tags from the image metadata... even non-HTML tags. So, project abandoned!
@TimidRobot TimidRobot moved this to Triage in possumbilities Sep 10, 2024
@possumbilities possumbilities added 🚧 status: blocked Blocked & therefore, not ready for work and removed 🏷 status: label work required Needs proper labelling before it can be worked on labels Oct 1, 2024
@possumbilities possumbilities moved this from Triage to Backlog in possumbilities Oct 1, 2024
@possumbilities possumbilities added the 🧹 status: ticket work required Needs more details before it can be worked on label Oct 1, 2024
@cc-open-source-bot cc-open-source-bot added the 🏷 status: label work required Needs proper labelling before it can be worked on label Oct 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🚧 status: blocked Blocked & therefore, not ready for work 🏷 status: label work required Needs proper labelling before it can be worked on 🧹 status: ticket work required Needs more details before it can be worked on
Projects
Status: Backlog
Development

No branches or pull requests

7 participants