Photo of a cute tiny robot with glowing eyes. Image by Jochen van Wylick on Unsplash.

Using OpenAI’s GPT-4 Vision API in Web Development

In the rapidly evolving landscape of web development, integrating cutting-edge technologies is crucial for staying ahead. OpenAI’s GPT-4 Vision API emerges as a powerful tool that is changing how content creators approach accessible visual content.

I only just started playing with this yesterday at the end of my workday, but within an hour I had output that was more or less presentable. This is clearly a powerful direction for the technology, one with many implications for content creation, or perhaps now… generation.

Getting up and running with a simple Python script was straightforward using OpenAI’s example script. Below I’ll share, broadly, my thoughts on the experience.

It all starts with data

The initial phase of my work involved extracting a targeted set of images from existing content using SQL queries. Through a series of tedious joins, a bit ugly to be sure, I refined the target data into a clean, structured CSV, ready to be parsed by Python and transmitted to the API endpoint. I’ve realized that passing quality information to the API is essential to getting decent output. It’s not magic, it’s computers (lol); you can’t just pipe some text at it with no context and expect a useful “easy button” to pop out. Even in this age of robots, easy buttons still must be developed and architected.
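As a rough sketch of that handoff, here’s how the cleaned CSV might be loaded on the Python side. The images.csv filename and image_url column name are illustrative, not my actual schema:

```python
import csv

# Load the cleaned CSV produced by the SQL export.
# "images.csv" and "image_url" are illustrative names, not the real schema.
with open("images.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

urls = [row["image_url"] for row in rows]
print(f"Loaded {len(urls)} image URLs")
```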

Can we automate it?

Well, sure; in about 100 lines or so.

I’m not sure I’d call myself a programmer or engineer at the time of writing this. That is to say, “I know what I don’t know.” Or, put another way, the more you know, the more you know what you don’t know. However, I do have a bit of programming knowledge and can at least put something functional together, sometimes, with a little help from my friends.

Copilot in VS Code helped me expand considerably on OpenAI’s example script in Python. Using the derived CSV, I fetched the resource at each URL with the requests library, encoded the data to Base64, and passed it to OpenAI’s GPT-4 Vision API, roughly as sketched below. Midjourney has a similar describe feature, but to my knowledge it isn’t accessible over an API.
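Broadly, the core of the script looked something like this. It follows the shape of OpenAI’s published vision example; the gpt-4-vision-preview model name comes from that example, while details like the prompt wording, the timeout, and the hard-coded MIME type are assumptions for illustration:

```python
import base64

import requests
from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment.
client = OpenAI()

def describe_image(url: str) -> str:
    # Fetch the image bytes from the URL and Base64-encode them.
    image_bytes = requests.get(url, timeout=30).content
    b64 = base64.b64encode(image_bytes).decode("utf-8")

    # Send the encoded image to the GPT-4 Vision endpoint.
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Write alt text for this image."},
                    {
                        "type": "image_url",
                        # "image/jpeg" is an assumption; real code should
                        # derive the MIME type from the response headers.
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
        max_tokens=200,
    )
    return response.choices[0].message.content
```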

In my prompt I initially asked for just “alt text for the image,” which resulted in some pretty verbose output. After a little push and pull with the prompt, the output was refined into what I wanted.
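To give a flavor of that push and pull, the contrast looked something like this; the exact wording below is illustrative, not my final prompt:

```python
# First attempt: asking only for alt text produced verbose, essay-like output.
prompt_v1 = "Alt text for the image."

# After refinement: constrain length, tone, and phrasing.
prompt_v2 = (
    "Write concise alt text for this image in one sentence, "
    "under 125 characters. Describe only what is visible, and "
    "don't begin with phrases like 'An image of'."
)
```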

Now imagine

While my use case was describing a specific “type of image” and is less generic in nature, there is a ton of power in being able to iterate over a list of URLs and output a description of each. Provided you don’t hit the rate limits, or account for them in the script, this is a pretty awesome way to get started with alt text.
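A sketch of that loop, with a simple sleep-and-retry to account for rate limits, might look like the following. The backoff numbers are arbitrary, the alt_text.csv output name is hypothetical, and urls and describe_image() come from the sketches above:

```python
import csv
import time

from openai import RateLimitError

# urls and describe_image() are defined in the earlier sketches.
results = []
for url in urls:
    for attempt in range(5):
        try:
            results.append({"image_url": url, "alt_text": describe_image(url)})
            break
        except RateLimitError:
            # Arbitrary exponential backoff before retrying.
            time.sleep(2 ** attempt)

# Write the generated alt text out for human review.
with open("alt_text.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["image_url", "alt_text"])
    writer.writeheader()
    writer.writerows(results)
```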

From here, a human might review the results more closely, or gradually optimize the alt text for these images themselves, to yield even higher-quality output.

Good information yielded good output — human vs. machine, battle of quality

My use case here involved writing image alt text, usually a pretty tedious task that can easily take days or weeks of work depending on the amount of content. Without a doubt, human output from real, considered work would be objectively better and less “mechanical.” Though I haven’t tested this myself, human creativity is a force to be reckoned with, and computers just aren’t there yet. So if you’re looking for the highest-quality output, stick with a human for now.

But I still wanted to investigate what AI could do in this area of image description. For properties with a ton of content, the tradeoff of human labor versus an API call is a no-brainer: you can get a ton of “work,” or, perhaps more accurately, output, in a few minutes. While the quality gap between AI writing and human writing is obvious to anyone taking the time to read, the output is enough to get you started, and sometimes… even passable.

In this case, the image descriptions returned were pretty solid. To be able to run a script against a CSV of information and derive this result was a powerful thing. Within an hour I had a script, albeit a messy one, put together and outputting passable results.

The Verdict: How should you approach using AI for writing?

At least in my view, AI is a tool, a fabulous, wonderful tool. However, that tool is still powered by a computer, and it’s wise not to be fooled by the flashy output of a function: computers are unambiguously idiots. They must be told what to do, and told well. “AI” is only an abstraction over how we as human users provide instructions to the bots. An amazing abstraction, to be sure.

I used AI to help me write this. Could you tell, maybe? It doesn’t much matter, so long as the idea is transmitted the way I, the author, want it transmitted. I think what our ability to build useful tools like this scripted alt-text generator means is not “automation” or “easy buttons.” I believe that to be the wrong view. Instead, these tools should be augmentation for human output; AI should be enhancing humans.

During an interview in 1981, Ted Koppel, adopting a pretty skeptical stance, asked a 26-year-old Steve Jobs about the perils associated with computer usage and the hypothetical scenario of computers gaining dominion over humans. Jobs responded by characterizing the personal computer as the “bicycle of the 21st century,” citing a study that assessed the efficiency of locomotion in various species.

This is what I believe AI is for humanity, at least at this moment: better tires, or a slightly better bicycle, maybe one you don’t have to pedal as much.

What AI does is alleviate the “blank page.” Generating tons of alt descriptions for images is only useful if they are, at least generally, reviewed by a human, and are useful and accessible for the reader.