Finetuning - AI for Art Educators

IS FINE-TUNING NECESSARY?

Before diving into the technical aspects, let's begin by analyzing your text prompts with these critical questions (alternatively, use our critique worksheet):

Are visual elements rooted in oral traditions described in the text prompt? If yes, are these elements unvisualized in the generated image?
Are visual elements based on undigitized perspectives and histories described in the text prompt? If yes, are these elements unvisualized in the generated image?
Are visual elements from low-resource domains described in the text prompt? If yes, are these elements unvisualized in the generated image?
Are spurious correlations visualized in the generated image?
Are toxic and ethically questionable attributes visualized in the generated image?

Despite their advanced capabilities, text-to-image models often face a well-documented challenge: misalignment between text prompts and generated images. Misalignment makes models unreliable and prone to hallucinating content in place of the expected visual representations. Research has shown that misalignment occurs because text-to-image models are statistical tools that replicate patterns observed in their vast, uncurated training datasets. These datasets are far from neutral; they are embedded with assumptions and biases shaped by institutional frameworks, resource distributions, and historical patterns. They overrepresent the views, values, and modes of communication of dominant voices while misrepresenting or underrepresenting minoritized perspectives. Datasets are therefore a partial representation of the world, and text-to-image algorithms trained on such corpora reflect this partiality, which leads to inconsistent performance across different sociodemographic groups. It's important to note that the dimensions along which misalignment occurs can also be rooted in culture-specific or localized social hierarchies.

In scenarios where generated images fail to align with your artistic vision, you may consider abandoning image generation in favor of alternative visual processing methods. However, if you decide to proceed with image generation, we recommend fine-tuning the foundation model for improved performance.

Fine-tuning is the process of adapting a pre-trained text-to-image model so that it can generate specialized images, using a relatively small amount of relevant, in-domain data. When you provide the model with supplementary samples, it learns additional parameters that help encode concepts and content relevant to your artistic goals.
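To make the idea of "learning additional parameters" concrete, here is a deliberately tiny sketch in Python. It is not a text-to-image model: a single number stands in for the pre-trained weights, and the names `BASE_WEIGHT`, `IN_DOMAIN_SAMPLES`, and `fine_tune` are hypothetical, invented only for this illustration. The sketch assumes a setup in which the pre-trained weight stays frozen and only a small additive correction is learned from a handful of in-domain samples. Real fine-tuning methods vary in which parameters they update.

```python
# Toy illustration only: a one-weight "model" (prediction = weight * x)
# stands in for a large pre-trained network. All names and numbers are made up.

def mean_squared_error(weight, samples):
    """Average squared gap between the model's predictions and the targets."""
    return sum((weight * x - y) ** 2 for x, y in samples) / len(samples)


def fine_tune(base_weight, samples, learning_rate=0.05, steps=200):
    """Learn an additive correction `delta` while `base_weight` stays frozen."""
    delta = 0.0  # the additional parameter learned during fine-tuning
    for _ in range(steps):
        # Gradient of the mean squared error with respect to delta.
        grad = sum(
            2 * ((base_weight + delta) * x - y) * x for x, y in samples
        ) / len(samples)
        delta -= learning_rate * grad  # plain gradient descent step
    return delta


BASE_WEIGHT = 2.0  # hypothetical "pre-trained" weight, never modified
IN_DOMAIN_SAMPLES = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # small made-up dataset

delta = fine_tune(BASE_WEIGHT, IN_DOMAIN_SAMPLES)
error_before = mean_squared_error(BASE_WEIGHT, IN_DOMAIN_SAMPLES)
error_after = mean_squared_error(BASE_WEIGHT + delta, IN_DOMAIN_SAMPLES)
print(f"error before fine-tuning: {error_before:.4f}")
print(f"error after fine-tuning:  {error_after:.4f}")
```

The pre-trained weight is left untouched, and three samples are enough to shrink the error on the in-domain data. Fine-tuning an actual text-to-image model follows the same logic at a far larger scale, with images and captions in place of number pairs.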