The term Artificial Intelligence (AI) often conjures up images of the sentient, omnipotent machines portrayed in science fiction, such as those seen in movies like The Matrix. In contrast, real-world AI systems perform specialized, non-fantastical tasks within well-defined domains. This toolkit focuses on tools powered by Generative AI, a subfield of AI that excels at detecting patterns and generating outputs based on large-scale datasets. Specifically, we explore its applications in image generation using text-to-image (T2I) algorithms.
Throughout this toolkit, we use the term 'AI' in reference to technologies engineered to synthesize images. Occasionally, we refer to these outputs as 'artworks.' In so doing, we align with Rolling's (2013) definition: "A work of art is like a theory. A theory is a set of interrelated constructs represented in a distinguishable manner or form, the major function of which is to describe, explain, and/or interpret the variables and variability of a phenomenon or experience within the world."
WHAT IS IMAGE GENERATION?
Generative AI is fueled by human-generated data. Platforms like Flickr, YouTube, and Instagram have become robust data resources for researchers across industry and academia. That data might include:
- A selfie you may have posted on a social media platform.
- An image on your website.
- A story on a news site.
- A book you may have authored.
Although large corporations generate substantial profits from this data, the original creators and rightsholders are almost never compensated, which has in turn elicited a slew of copyright lawsuits.
The development process begins with the assembly of massive web-scale datasets, each comprising billions of images and their corresponding textual descriptions. AI researchers use automated web crawling techniques to amass these datasets.
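As a rough illustration of this collection step, the sketch below pairs each image on a single web page with its alt text. The URL and function name are hypothetical, and real crawls operate across billions of pages with extensive filtering and deduplication; this is a minimal sketch of the pairing idea only.

```python
# Minimal sketch: harvest (image URL, caption) pairs from one page
# by pairing each <img> tag's source with its "alt" description.
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def collect_image_text_pairs(page_url: str) -> list[tuple[str, str]]:
    """Return (image_url, caption) pairs found on one web page."""
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for img in soup.find_all("img"):
        src, alt = img.get("src"), img.get("alt")
        if src and alt:  # keep only images that carry a textual description
            pairs.append((src, alt))
    return pairs

# Hypothetical usage:
# pairs = collect_image_text_pairs("https://example.com/gallery")
```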
The collection phase is followed by a computationally demanding “training” process, which can span several months. During this phase, researchers apply advanced statistical methods to teach the AI model to identify and replicate the semantic relationships between words and their visual representations.
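To give a flavor of what teaching these semantic relationships can mean in practice, here is a heavily simplified toy example in the spirit of contrastive objectives such as CLIP: it nudges vector representations of captions and their matching images toward each other and pushes mismatched pairs apart. The tiny encoders and random data are stand-ins, not any real production system.

```python
# Toy contrastive alignment of caption and image representations.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in encoders: real systems use large transformer networks.
text_encoder = nn.Linear(32, 64)    # maps 32-dim caption features to 64-dim
image_encoder = nn.Linear(128, 64)  # maps 128-dim image features to 64-dim
optimizer = torch.optim.Adam(
    list(text_encoder.parameters()) + list(image_encoder.parameters()), lr=1e-3
)

# Fake batch of 8 caption/image feature pairs (random stand-ins for data).
captions = torch.randn(8, 32)
images = torch.randn(8, 128)

for step in range(100):
    t = F.normalize(text_encoder(captions), dim=-1)  # unit-length text vectors
    v = F.normalize(image_encoder(images), dim=-1)   # unit-length image vectors
    logits = t @ v.T / 0.07                          # pairwise similarity scores
    labels = torch.arange(8)                         # caption i matches image i
    # Pull matching caption/image pairs together, push mismatches apart.
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```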
Post-training, text-to-image models may undergo further refinement to better align with specific performance goals or ethical standards. This phase, often referred to as "fine-tuning," typically involves additional training on a narrower dataset. This step is crucial for addressing potential biases in the model and improving its ability to handle sensitive or complex scenarios responsibly. Once fine-tuned, the model can synthesize images that not only mirror characteristics observed in the dataset but also plausibly assemble disparate visual concepts in novel ways.
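As a rough sketch of the fine-tuning idea, the toy example below continues training an already-trained model on a small, curated dataset with a reduced learning rate, so the model adjusts rather than forgets. Every name, tensor, and objective here is an illustrative stand-in under those assumptions, not a real fine-tuning pipeline.

```python
# Toy sketch of fine-tuning: brief additional training on narrow data.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
# Pretend `model` has already been through large-scale pretraining here.

# A narrow, curated dataset: e.g., examples chosen to correct a bias or
# to handle a sensitive category more carefully (random stand-ins here).
curated_inputs = torch.randn(16, 32)
curated_targets = torch.randn(16, 64)

# A much smaller learning rate than in pretraining is typical.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

for epoch in range(10):
    preds = model(curated_inputs)
    loss = F.mse_loss(preds, curated_targets)  # toy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```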