26 Jun 2020 — notes dev

Accessing tags made in Shotwell with Python 3


A quick and dirty hack to read images' tags made by Shotwell, in python.

As I was reorganising my gallery, I wanted to optimize the classification process for the images. Instead of using directories, subdirectories and python black magic to generate a file containing the list of files with categories and tags, as I did previously, I decided to use a proper solution, that is, dedicated software. The plan was to use Shotwell to tag the images and then simply parse the tags in the metadata with python to generate the pages. I activated the "Write tags, titles and other metadata to photo files" option in Shotwell, and started tagging.

(The final snippet is at the bottom of the page.)

But when I tried the first snippet of code I found to access exif data in python...

>>> import PIL.Image
>>> img = PIL.Image.open('1_000010e.JPG')
>>> exif_data = img._getexif()
>>>
>>> exif_data
{296: 2, 34665: 220, 271: 'FUJI PHOTO FILM CO., LTD.', 272: 'SP-3000', 305: 'Shotwell 0.30.1', 274: 1, 306: '2019:05:17 15:29:52', 531: 1, 282: (72, 1), 283: (72, 1), 36864: b'0210', 37121: b'\x01\x02\x03\x00', 40960: b'0100', 36867: '    :  :     :  :  ', 36868: '2019:05:16 17:10:49', 40961: 1, 40962: 1703, 40963: 1168, 40965: 494, 41728: b'\x03', 41729: b'\x01', 37500: b'FUJIFILM\x0c\x00\x00\x00\x05\x00\x00\x00\x07\x00\x04\x00\x00\x000130\x00\x80\x02\x00\x06\x00\x00\x00N\x00\x00\x00\x02\x80\x04\x00\x01\x00\x00\x00\xff\xff\xff\xff \x80\x03\x00\x01\x00\x00\x00\x01\x00\x00\x00!\x80\x03\x00\x01\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00135_C\x00'}

No tag to be seen. Same thing with the good ol' "file" command:

nemecle@yggdrasil:~/Pictures/$ file 1_000010e.JPG
1_000010e.JPG: JPEG image data, JFIF standard 1.01, resolution (DPI), density 72x72, segment length 16, Exif Standard: [TIFF image data, little-endian, direntries=10, manufacturer=FUJI PHOTO FILM CO., LTD., model=SP-3000, orientation=upper-left, xresolution=168, yresolution=176, resolutionunit=2, software=Shotwell 0.30.1, da

Or even the "exif" python library (a third-party package, not part of the standard library):

>>> from exif import Image
>>> with open('1_000010e.JPG', 'rb') as image_file:
...     my_image = Image(image_file)
...
>>> my_image.has_exif
True
>>> dir(my_image)
['_exif_ifd_pointer', '_interoperability_ifd_Pointer', '_segments', 'color_space', 'components_configuration', 'compression', 'datetime', 'datetime_digitized', 'datetime_original', 'delete', 'delete_all', 'exif_version', 'file_source', 'flashpix_version', 'get', 'get_file', 'get_thumbnail', 'has_exif', 'jpeg_interchange_format', 'jpeg_interchange_format_length', 'make', 'maker_note', 'model', 'orientation', 'pixel_x_dimension', 'pixel_y_dimension', 'resolution_unit', 'scene_type', 'software', 'x_resolution', 'y_and_c_positioning', 'y_resolution']

After some research, it seems that commonly available exif commands and libraries are unable to read user-made data. Well.

Frustrated but still brave, I dove head first into the bytes, a bad habit I picked up debugging shitty non-standard services, but one which came in handy in this case. I fired up a simple xxd in vim (":%!xxd" in normal mode) to ease the search. We can see some metadata at the beginning of the file, and the "JFIF" magic string indicating that the file is a jpeg:

00000000: ffd8 ffe0 0010 4a46 4946 0001 0101 0048  ......JFIF.....H
00000010: 0048 0000 ffe1 1790 4578 6966 0000 4949  .H......Exif..II
00000020: 2a00 0800 0000 0a00 0f01 0200 1a00 0000  *...............
00000030: 8600 0000 1001 0200 0800 0000 a000 0000  ................
00000040: 1201 0300 0100 0000 0100 0000 1a01 0500  ................
00000050: 0100 0000 a800 0000 1b01 0500 0100 0000  ................
00000060: b000 0000 2801 0300 0100 0000 0200 0000  ....(...........
00000070: 3101 0200 1000 0000 b800 0000 3201 0200  1...........2...
00000080: 1400 0000 c800 0000 1302 0300 0100 0000  ................
00000090: 0100 0000 6987 0400 0100 0000 dc00 0000  ....i...........
000000a0: 0c02 0000 4655 4a49 2050 484f 544f 2046  ....FUJI PHOTO F
000000b0: 494c 4d20 434f 2e2c 204c 5444 2e00 5350  ILM CO., LTD..SP
000000c0: 2d33 3030 3000 4800 0000 0100 0000 4800  -3000.H.......H.
000000d0: 0000 0100 0000 5368 6f74 7765 6c6c 2030  ......Shotwell 0
000000e0: 2e33 302e 3100 3230 3139 3a30 353a 3137  .30.1.2019:05:17

Knowing the keywords, I just searched for "home":

000017a0: 28a0 0fff d900 ffe1 0a20 6874 7470 3a2f  (........ http:/
000017b0: 2f6e 732e 6164 6f62 652e 636f 6d2f 7861  /ns.adobe.com/xa
000017c0: 702f 312e 302f 003c 3f78 7061 636b 6574  p/1.0/.<?xpacket
000017d0: 2062 6567 696e 3d22 efbb bf22 2069 643d   begin="..." id=
000017e0: 2257 354d 304d 7043 6568 6948 7a72 6553  "W5M0MpCehiHzreS
000017f0: 7a4e 5463 7a6b 6339 6422 3f3e 203c 783a  zNTczkc9d"?> <x:
00001800: 786d 706d 6574 6120 786d 6c6e 733a 783d  xmpmeta xmlns:x=
00001810: 2261 646f 6265 3a6e 733a 6d65 7461 2f22  "adobe:ns:meta/"
00001820: 2078 3a78 6d70 746b 3d22 584d 5020 436f   x:xmptk="XMP Co
00001830: 7265 2034 2e34 2e30 2d45 7869 7632 223e  re 4.4.0-Exiv2">
00001840: 203c 7264 663a 5244 4620 786d 6c6e 733a   <rdf:RDF xmlns:
00001850: 7264 663d 2268 7474 703a 2f2f 7777 772e  rdf="http://www.
00001860: 7733 2e6f 7267 2f31 3939 392f 3032 2f32  w3.org/1999/02/2
00001870: 322d 7264 662d 7379 6e74 6178 2d6e 7323  2-rdf-syntax-ns#
00001880: 223e 203c 7264 663a 4465 7363 7269 7074  "> <rdf:Descript
00001890: 696f 6e20 7264 663a 6162 6f75 743d 2222  ion rdf:about=""
000018a0: 2078 6d6c 6e73 3a64 633d 2268 7474 703a   xmlns:dc="http:
000018b0: 2f2f 7075 726c 2e6f 7267 2f64 632f 656c  //purl.org/dc/el
000018c0: 656d 656e 7473 2f31 2e31 2f22 2078 6d6c  ements/1.1/" xml
000018d0: 6e73 3a78 6d70 3d22 6874 7470 3a2f 2f6e  ns:xmp="http://n
000018e0: 732e 6164 6f62 652e 636f 6d2f 7861 702f  s.adobe.com/xap/
000018f0: 312e 302f 2220 786d 703a 4c61 6265 6c3d  1.0/" xmp:Label=
00001900: 2270 686f 746f 6772 6170 6879 223e 203c  "photography"> <
00001910: 6463 3a73 7562 6a65 6374 3e20 3c72 6466  dc:subject> <rdf
00001920: 3a42 6167 3e20 3c72 6466 3a6c 693e 616e  :Bag> <rdf:li>an
00001930: 616c 6f67 3c2f 7264 663a 6c69 3e20 3c72  alog</rdf:li> <r
00001940: 6466 3a6c 693e 686f 6d65 3c2f 7264 663a  df:li>home</rdf:
00001950: 6c69 3e20 3c72 6466 3a6c 693e 7068 6f74  li> <rdf:li>phot
00001960: 6f67 7261 7068 793c 2f72 6466 3a6c 693e  ography</rdf:li>
00001970: 203c 2f72 6466 3a42 6167 3e20 3c2f 6463   </rdf:Bag> </dc
00001980: 3a73 7562 6a65 6374 3e20 3c2f 7264 663a  :subject> </rdf:
00001990: 4465 7363 7269 7074 696f 6e3e 203c 2f72  Description> </r
000019a0: 6466 3a52 4446 3e20 3c2f 783a 786d 706d  df:RDF> </x:xmpm
000019b0: 6574 613e 2020 2020 2020 2020 2020 2020  eta>

Bingo... I guess? I had no idea of what this was. After some cleaning I ended up with:

http://ns.adobe.com/xap/1.0/.<?xpacket begin="..." id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 4.4.0-Exiv2">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description
      rdf:about=""
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:xmp="http://ns.adobe.com/xap/1.0/"
      xmp:Label="photography">
      <dc:subject>
        <rdf:Bag>
          <rdf:li>analog</rdf:li>
          <rdf:li>home</rdf:li>
          <rdf:li>photography</rdf:li>
        </rdf:Bag>
      </dc:subject>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
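
For the record, the manual search above can be approximated in a few lines of Python. This is only a minimal sketch: it assumes the packet is stored as plain, uncompressed UTF-8 text in one piece (as in the dump above), and the name `find_xmp_packet` is mine, not something from a library.

```python
from typing import Optional


def find_xmp_packet(raw: bytes) -> Optional[str]:
    """Return the first <x:xmpmeta>...</x:xmpmeta> block found in raw bytes.

    Assumption: the XMP packet is plain UTF-8 text stored in a single
    piece. Returns None when no complete packet is found.
    """
    start = raw.find(b"<x:xmpmeta")
    if start == -1:
        return None
    end_tag = b"</x:xmpmeta>"
    end = raw.find(end_tag, start)
    if end == -1:
        return None
    return raw[start:end + len(end_tag)].decode("utf-8", errors="replace")
```

To try it, read a tagged image with `open(path, "rb").read()` and pass the bytes to the function. It knows nothing about the file format, so it is a debugging aid rather than a proper parser.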

So, we are working with "RDF" elements, which apparently stands for "Resource Description Framework". I searched "rdf:li python", and finally found someone with a close enough issue to be useful but apparently too different to have shown up earlier:

from PIL import Image

with Image.open("1_000010e.JPG") as im:
    for segment, content in im.applist:
        # each APPn payload starts with a null-terminated identifier;
        # partition() does not raise if the null byte is missing
        marker, _, body = content.partition(b'\x00')
        if segment == 'APP1' and marker == b'http://ns.adobe.com/xap/1.0/':
            data = body.decode('utf-8')
            print(data)

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?> <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 4.4.0-Exiv2"> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xmp="http://ns.adobe.com/xap/1.0/" xmp:Label="photography"> <dc:subject> <rdf:Bag> <rdf:li>analog</rdf:li> <rdf:li>home</rdf:li> <rdf:li>photography</rdf:li> </rdf:Bag> </dc:subject> </rdf:Description> </rdf:RDF> </x:xmpmeta>
...
(a lot of whitespaces)
...
<?xpacket end="w"?>

Finally!

Then things got dirty because I don't actually care about most of the data, only the specific last-level "rdf:li" elements, so a quick and dirty regex did the job:

>>> import re
>>> re.findall(r"(?<=<rdf:li>).*?(?=</rdf:li>)", data)
['analog', 'home', 'photography']
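
A less dirty alternative to the regex is to parse the packet as XML. The sketch below uses `xml.etree.ElementTree` from the standard library and the two namespace URIs visible in the packet above; the function name `tags_from_xmp` is hypothetical, and I am assuming the packet is well-formed XML once the `<?xpacket ...?>` wrapper is cut off.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace URIs copied from the XMP packet, in ElementTree's {uri}tag form.
DC = "{http://purl.org/dc/elements/1.1/}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"


def tags_from_xmp(xmp: str) -> List[str]:
    """Return the text of every rdf:li element nested under dc:subject."""
    # Keep only the <x:xmpmeta> element; the <?xpacket ...?> wrapper and
    # the whitespace padding are not needed for parsing.
    end_tag = "</x:xmpmeta>"
    start = xmp.find("<x:xmpmeta")
    end = xmp.find(end_tag)
    if start == -1 or end == -1:
        return []
    root = ET.fromstring(xmp[start:end + len(end_tag)])
    tags = []
    for subject in root.iter(DC + "subject"):
        for li in subject.iter(RDF + "li"):
            if li.text:
                tags.append(li.text.strip())
    return tags
```

Restricting the search to `dc:subject` avoids picking up `rdf:li` items that belong to other properties, which the bare regex would also match.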

But when I ran the script, I quickly realised that it would not work on .png images, because the resulting PIL object didn't have the "applist" attribute containing the data. I loaded a jpeg and a png:

im = Image.open("Archives/IMG_1670.JPG")
im1 = Image.open("Miscellaneous digital drawings/windmill Mawi.png") #png not working

And enumerated the available attributes to compare:

object_methods = [method_name for method_name in dir(im)
                  if callable(getattr(im, method_name))]

object_methods1 = [method_name for method_name in dir(im1)
                  if callable(getattr(im1, method_name))]

And unsurprisingly:

import inspect

for k, v in inspect.getmembers(im, lambda a: not inspect.isroutine(a)):
    print(str(k))

# "applist" is listed

for k, v in inspect.getmembers(im1, lambda a: not inspect.isroutine(a)):
    print(str(k))

# "applist" is not listed

I cycled through some of the existing attributes to see their content, and finally:

[...unrelated stuff...]

          \n                           \n<?xpacket end="w"?>', 'dpi': (72, 72), 'Comment': 'Created by Nemecle'}
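
To see where that data actually lives in a png, here is a standard-library sketch that walks the chunks of the file and returns the text chunk whose keyword is "XML:com.adobe.xmp" (the same key used later with PIL's `info` dictionary). The assumptions are mine: the packet sits in an uncompressed iTXt chunk, CRCs are not verified, and `png_xmp` is a made-up name.

```python
import struct
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_xmp(raw: bytes) -> Optional[str]:
    """Return the text of the iTXt chunk keyed "XML:com.adobe.xmp", if any."""
    if not raw.startswith(PNG_SIGNATURE):
        return None
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(raw):
        # Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", raw[pos:pos + 8])
        data = raw[pos + 8:pos + 8 + length]
        pos += 12 + length
        if ctype != b"iTXt":
            continue
        keyword, _, rest = data.partition(b"\x00")
        if keyword != b"XML:com.adobe.xmp" or len(rest) < 2:
            continue
        if rest[0] != 0:
            # Compressed text is not handled in this sketch.
            continue
        # Skip the compression flag and method, then the language tag and
        # the translated keyword (both null-terminated).
        _, _, rest = rest[2:].partition(b"\x00")
        _, _, text = rest.partition(b"\x00")
        return text.decode("utf-8", errors="replace")
    return None
```

In practice PIL already does this work and hands over the result, so this is only useful for poking at a file by hand.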

After merging the two solutions, the final snippet (for jpeg and png, at least, and not considering the lack of tags) looks like this:

import re
import sys

from PIL import Image


def read_tags(filepath):
    """
    read the shotwell tags from the metadata
    (require the "Write tags, titles and other metadata to photo files" option)

    """

    data = ""
    tags = []

    try:
        with Image.open(filepath) as im:
            # compare strings with ==, not "is" (identity is not guaranteed)
            if im.format == "PNG":
                data = str(im.info["XML:com.adobe.xmp"])

            elif im.format == "JPEG":
                for segment, content in im.applist:
                    # partition() does not raise if the null byte is missing
                    marker, _, body = content.partition(b'\x00')
                    if segment == 'APP1' and marker == b'http://ns.adobe.com/xap/1.0/':
                        data = body.decode('utf-8')
    except Exception as e:
        print("Error while reading tags on %s: %s " % (filepath, str(e)))
        sys.exit(1)


    try:
        pattern = re.compile(r"(?<=<dc:subject>).*?(?=</dc:subject>)", re.DOTALL)

        tag_data = pattern.search(data)

    except Exception as e:
        print("Error while extracting tag data on %s: %s" % (filepath, str(e)))
        sys.exit(1)


    try:
        pattern = re.compile(r"(?<=<rdf:li>).*?(?=</rdf:li>)", re.DOTALL)

        # tag_data is None when the image has no tags; the resulting
        # AttributeError is caught below
        tags = pattern.findall(tag_data.group(0))

    except Exception as e:
        print("Error while parsing tags on %s: %s" % (filepath, str(e)))
        sys.exit(1)

    return tags

And voilà.
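
As noted, the snippet above does not deal with images that have no tags. One possible way to handle that case, sketched here with the same two regexes, is to return an empty list instead of failing. `extract_tags` is a hypothetical helper that works on the XMP string only, so it can be tried without any image at hand.

```python
import re
from typing import List

SUBJECT_RE = re.compile(r"<dc:subject>(.*?)</dc:subject>", re.DOTALL)
ITEM_RE = re.compile(r"<rdf:li>(.*?)</rdf:li>", re.DOTALL)


def extract_tags(xmp: str) -> List[str]:
    """Return the tags found in an XMP string, or [] when there are none."""
    subject = SUBJECT_RE.search(xmp)
    if subject is None:
        # No dc:subject block: the image was never tagged.
        return []
    return [tag.strip() for tag in ITEM_RE.findall(subject.group(1))]
```

Whether an untagged image should be an error or an empty list depends on how the gallery is generated; for mine, an empty list would probably be enough.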