Text-to-image AI models like Stable Diffusion, Midjourney, and DALL·E have redefined what’s possible in creative generation. Yet while they can produce breathtaking visuals from simple prompts, their general-purpose training often falls short when users need specific styles, faces, concepts, or languages.
The solution? Fine-tuning. By adapting pretrained models to domain-specific data, we unlock personalization, improve prompt alignment, and inject cultural or stylistic nuance. For example, we can fine-tune a model to learn the cultural nuances and facial features of the Arab region, delivering image generation that feels more natural.
In this blog, we explore the main techniques for fine-tuning image generation models, compare the fine-tuning methods, and walk through the key steps for building your own tuned model.
Pretrained diffusion models like Stable Diffusion are trained on billions of image-text pairs, mostly scraped from the internet. While powerful, they can be:
- Generic, defaulting to the most common visual interpretation of a prompt
- Biased toward the cultures and aesthetics that dominate their training data
- Weak at rendering specific faces, styles, or concepts they rarely saw during training
Fine-tuning lets us steer the model more precisely, enabling it to generate:
- Specific styles, faces, or concepts
- Culturally accurate people, attire, and settings
- Imagery aligned with a brand, region, or visual identity
Full Fine-Tuning
Retraining the entire model (UNet, VAE, and text encoder) on new data.
Pros:
- Maximum flexibility: every component adapts to the new domain
- Typically the highest quality when a large dataset and ample compute are available
Cons:
- Very expensive in GPU memory and training time
- Prone to overfitting and catastrophic forgetting on small datasets
- Every variant is a full-size checkpoint, making storage and distribution costly
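As a concrete (if toy) picture of what full fine-tuning means, the sketch below runs one gradient step on a stand-in model and shows that every parameter in every component gets updated. All names and numbers are illustrative, not a real diffusion model:

```python
# Toy illustration of full fine-tuning: every weight in every
# component (UNet, VAE, text encoder) is trainable and updated.
# Pure Python stand-in; the weights and loss are illustrative only.

model = {
    "unet":         [0.5, -0.2, 0.8],
    "vae":          [0.1, 0.4],
    "text_encoder": [-0.3, 0.7, 0.2, 0.9],
}

def train_step(model, lr=0.1):
    """One gradient-descent step on a dummy loss sum(w**2)."""
    updated = 0
    for component, weights in model.items():
        for i, w in enumerate(weights):
            grad = 2 * w              # d/dw of w**2
            weights[i] = w - lr * grad
            updated += 1
    return updated

total_params = sum(len(w) for w in model.values())
updated_params = train_step(model)
# Every single parameter was touched: updated_params == total_params.
```

This is exactly what makes the method expensive at diffusion-model scale: there is no frozen part, so memory and storage costs grow with the full parameter count.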
LoRA (Low-Rank Adaptation)
A lightweight method that inserts trainable adapter modules into the attention layers of the model (e.g., the UNet or text encoder).
How it works:
- The pretrained weights are frozen
- Small low-rank matrices are injected alongside the attention weights, and their product forms the learned update
- Only these adapter weights are trained and saved
Pros:
- Trains a small fraction of the parameters, so it fits on consumer GPUs
- Adapter files are tiny compared to full checkpoints, and easy to share or swap
- The base model stays untouched, so many adapters can reuse one base
Cons:
- May underperform full fine-tuning on large domain shifts
- Quality is sensitive to the choice of rank and target layers
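The low-rank idea can be sketched in a few lines of plain Python (illustrative only; real fine-tuning uses libraries such as peft with diffusers). A frozen weight matrix W is augmented with a trainable update B @ A scaled by alpha / r, and because B starts at zero, the adapted layer initially behaves exactly like the base layer:

```python
# Minimal LoRA sketch in plain Python (illustrative; not diffusers' API).
# The frozen weight W gets a trainable low-rank update B @ A.

d_out, d_in, r, alpha = 512, 512, 4, 8

W = [[0.001 * (i - j) for j in range(d_in)] for i in range(d_out)]  # frozen
A = [[0.1] * d_in for _ in range(r)]     # trainable, shape (r, d_in)
B = [[0.0] * r for _ in range(d_out)]    # trainable, zero-init (d_out, r)

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def effective_weight():
    """W_eff = W + (alpha / r) * (B @ A)."""
    BA = matmul(B, A)
    s = alpha / r
    return [[W[i][j] + s * BA[i][j] for j in range(d_in)]
            for i in range(d_out)]

# B is zero-initialized, so the adapted layer starts out identical
# to the frozen base layer: W_eff == W.
assert effective_weight() == W

# Only A and B are trained: a small fraction of the full layer.
full_params = d_out * d_in           # 262,144
lora_params = r * (d_in + d_out)     # 4,096
```

The parameter count is the whole point: for this one 512×512 layer, the adapter is roughly 1.5% of the size of the layer it adapts, and the ratio improves further as layers get wider.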
QLoRA takes the efficiency of LoRA even further by enabling training on quantized models, typically 4-bit versions, without compromising much on performance.
How it works:
- The base model is quantized to 4-bit precision (commonly NF4) and kept frozen
- LoRA adapters are attached in higher precision (e.g., 16-bit) and trained on top
- Quantized weights are dequantized on the fly during the forward pass
Why it matters for image generation:
- Large diffusion models can be fine-tuned on a single consumer GPU
- The memory savings leave room for larger batches or higher resolutions
Pros:
- Dramatically lower VRAM requirements than LoRA on a full-precision base
- Quality close to standard LoRA in most settings
Cons:
- Quantization introduces a small loss of precision
- On-the-fly dequantization adds training overhead
- Requires library support for quantized training (e.g., bitsandbytes)
Optional Enhancements:
- Double quantization, which also compresses the quantization constants
- Paged optimizers to absorb memory spikes during training
- Gradient checkpointing to trade compute for additional memory savings
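Conceptually, QLoRA stores the frozen base weights as low-bit integers plus a per-block scale, and trains only a full-precision adapter on top. The sketch below uses simple absmax 4-bit quantization to illustrate the idea; production implementations use NF4 via libraries such as bitsandbytes:

```python
# QLoRA sketch in plain Python (illustrative -- real implementations use
# NF4 quantization from bitsandbytes). Frozen base weights are stored as
# 4-bit integers plus a scale; only the full-precision adapter trains.

def quantize_4bit(weights):
    """Absmax quantization to signed 4-bit ints in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

base = [0.82, -0.44, 0.13, -0.91, 0.05, 0.67]   # frozen base weights
q, scale = quantize_4bit(base)
recovered = dequantize(q, scale)

# Every quantized value fits in 4 bits...
assert all(-8 <= x <= 7 for x in q)
# ...and reconstruction error stays within half a quantization step.
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(base, recovered))

# During the forward pass, the dequantized base is combined with the
# trainable adapter, which stays in full precision (zero-init here):
adapter = [0.0] * len(base)
effective = [w + a for w, a in zip(recovered, adapter)]
```

Storing each weight as a 4-bit code rather than a 16- or 32-bit float is where the memory savings come from; the small reconstruction error is the precision cost the Cons list above refers to.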
A common challenge with popular online image generation models is that they are rarely trained on Arabic data.
Their datasets are overwhelmingly skewed toward Western or East Asian cultures, which means they often fail to capture the details of Arab clothing, people, environments, or architecture. This cultural gap can lead to inaccurate or generic outputs when generating content related to the Arab world.
Here are some concrete examples of the biases and stereotypes present in such models:
Prompt: Arab scientist in a lab
Generated images:


As seen in the example above, the prompt “Arab scientist in a lab” returns images of a man wearing a keffiyeh, a traditional Gulf garment. While culturally significant, it is not typical attire for a laboratory setting. This highlights how general models often associate the term “Arab” with a limited and stereotypical visual representation.
Prompt: Portrait of a professional Emirati woman
Generated images:
In this example, the prompt “professional Emirati woman” generates an image of a woman wearing a keffiyeh and an abaya. While the abaya is culturally appropriate, the keffiyeh is traditionally male attire. Moreover, the setting lacks any indicators of a professional environment. Instead of portraying a woman in a workplace or business attire, the model defaults to vague or mismatched cultural symbols, revealing a limited understanding of both gender roles and professional contexts in the region.
To address this, MadeBy partnered with MiddleFrame to create the first image generation model designed for the Arab world.
The Goal: To improve how generative models depict Arab people, architecture, attire, and settings, areas where mainstream models often show gaps or inaccuracies.
What We Did:
Check out Culturally Aware Image Generation on the MadeBy platform.
Here are some samples of the results on the same prompts:
Prompt: A woman in modest traditional clothing leaning against the sunlit stone wall of an old house, with carved wooden windows and climbing jasmine vines
Prompt: A traditional Arabic wedding scene outdoors with draped fabrics, colorful garments, and guests clapping to drum music
Prompt: Egyptian man in Moez street wearing traditional clothing
Customization is the Future
As generative AI continues to evolve, the ability to customize and align models to specific identities, cultures, and aesthetics is becoming essential, not optional. Fine-tuning isn’t just about making better art. It’s about making relevant, inclusive, and context-aware content.
Whether you’re trying to localize imagery, capture a new visual language, or run a powerful model on limited hardware, techniques like LoRA and QLoRA make fine-tuning accessible, efficient, and scalable.