diffusion-cpp

Native C++ addon for text-to-image generation using qvac-ext-stable-diffusion.cpp, built for the Bare Runtime. Supports Stable Diffusion 2.x / XL / 3 and FLUX.2 [klein].

diffusion-cpp

Supported platforms

Platform	Architecture	Status	GPU Backend
macOS	arm64	✅ Tier 1	Metal
macOS	x64	✅ Tier 1	Metal
Linux	arm64, x64	✅ Tier 1	Vulkan
Android	arm64	✅ Tier 1	Vulkan, OpenCL
iOS	arm64	✅ Tier 1	Metal
Windows	x64	✅ Tier 1	Vulkan

Dependencies:

qvac-ext-stable-diffusion.cpp
ggml
Bare Runtime ≥ 1.24.0
CMake ≥ 3.25 and a C++20-capable compiler

Building from Source

See build.md for prerequisites, platform-specific setup, cross-compilation, and troubleshooting.

Quick start:

npm install -g bare bare-make
npm install
npm run build

Downloading Model Files

A download script is provided that fetches all required files for FLUX.2 [klein] 4B:

./scripts/download-model.sh

This downloads three files into the models/ directory:

File	Size	Description
`flux-2-klein-4b-Q8_0.gguf`	~4.0 GB	FLUX.2 [klein] 4B diffusion model (Q8_0 quantised)
`Qwen3-4B-Q4_K_M.gguf`	~2.5 GB	Qwen3 4B text encoder (Q4_K_M quantised)
`flux2-vae.safetensors`	~321 MB	VAE decoder

Note: Downloads can be resumed if interrupted — the script uses curl -C - for resumable transfers.

Why these specific files?

FLUX.2 [klein] uses a split model layout. Three separate components are required:

Diffusion model (flux-2-klein-4b-Q8_0.gguf) — the main image transformer. This GGUF has no SD metadata KV pairs so it must be loaded via diffusion_model_path internally, not model_path.
Text encoder (Qwen3-4B-Q4_K_M.gguf) — Qwen3 4B in standard GGML Q4_K_M format.
VAE (flux2-vae.safetensors) — standard safetensors format, compatible as-is.

Disk and RAM requirements

Component	Disk	RAM at runtime
Diffusion model (Q8_0)	4.0 GB	~4.1 GB
Text encoder (Q4_K_M)	2.5 GB	~4.3 GB
VAE	321 MB	~95 MB
Total	~6.8 GB	~8.5 GB

A machine with 16 GB of unified memory (e.g. MacBook Air M-series) can run this model.

Running the Example

Two runnable examples are provided.

Load / unload only

Verifies the model loads and releases cleanly without running inference:

npm run example

Expected output:

FLUX.2 [klein] 4B — load/unload example
========================================
Model loaded in 12.0s
Model is ready. (No inference in this example.)
Done — all resources released.

Source: examples/load-model.js

Text-to-image generation

Generates a 512 × 512 PNG with a 20-step FLUX.2 run, saves it to output/:

npm run generate

Expected output:

FLUX.2 [klein] 4B — text-to-image inference
============================================
Loaded in 15.2s

Starting generation...
  [████████████████████] 20/20 steps

Generated in 610.0s
Got 1 image(s)
Saved → .../output/output_seed42_0.png

Source: examples/generate-image.js

Performance note: On an M1 MacBook Air (16 GB) with Metal enabled, loading takes ~15 s and 20 steps at 512 × 512 take ~10 minutes. Reduce STEPS to 4 for quick tests — FLUX.2's distilled model is designed for low step counts.

Other Examples

Quickstart – Minimal text-to-image generation with SD2.1.
Generate Image (SD2.1) – Text-to-image with an SD2.1 all-in-one GGUF model.
Generate Image (SD3) – Text-to-image with SD3 Medium (safetensors, diffusion + CLIP encoders).
Generate Image (SDXL) – Text-to-image with an SDXL base all-in-one GGUF model.
Post-generation ESRGAN Upscale – Text-to-image with SD2.1 followed by one or two ESRGAN upscale passes.
Standalone ESRGAN Upscale – Upscale an existing PNG/JPEG without loading a diffusion model (named export EsrganUpscaler).
Runtime Stats – Run SD2.1 inference and report runtime statistics.
img2img FLUX2 – Transform an image with FLUX2-klein (Q8_0, in-context conditioning).
img2img FLUX2 F16 – Transform an image with FLUX2-klein (F16 full precision).
img2img SD3 – Transform an image with SD3 Medium (SDEdit, flow-matching).

Usage

1. Import the Model Class

const ImgStableDiffusion = require('@qvac/diffusion-cpp')

2. Create the `args` object

const path = require('bare-path')

const MODELS_DIR = path.resolve(__dirname, './models')
const args = {
  logger: console,
  files: {
    model: path.join(MODELS_DIR, 'flux-2-klein-4b-Q8_0.gguf'),
    llm:   path.join(MODELS_DIR, 'Qwen3-4B-Q4_K_M.gguf'),   // Qwen3 text encoder for FLUX.2 [klein]
    vae:   path.join(MODELS_DIR, 'flux2-vae.safetensors')
  },
  config: { threads: 8 },
  opts: { stats: true }
}

Property	Required	Description
`files`	✅	Object of absolute paths to model files (see below)
`files.model`	✅	Absolute path to diffusion model file (all-in-one for SD2.x; diffusion-only GGUF for FLUX.2)
`files.clipL`	—	Absolute path to separate CLIP-L text encoder (SD3)
`files.clipG`	—	Absolute path to separate CLIP-G text encoder (SDXL / SD3)
`files.t5Xxl`	—	Absolute path to separate T5-XXL text encoder (SD3)
`files.llm`	—	Absolute path to Qwen3 LLM text encoder (FLUX.2 [klein])
`files.vae`	—	Absolute path to separate VAE file
`files.esrgan`	—	Absolute path to ESRGAN upscaler model for post-generation upscale
`config`	—	Native backend configuration object (see next section)
`logger`	—	Logger instance for JS wrapper logs (e.g. `console`)
`opts`	—	Additional options (e.g. `{ stats: true }`)

Native C++ logs are process-global. Configure native log routing once with require('@qvac/diffusion-cpp/addonLogging').setLogger(...).

3. Configure the native backend (`args.config`)

config is a field on the args object built in step 2 — there is no separate constructor argument. The native backend reads it during load().

args.config = {
  threads: 8  // CPU threads for tensor operations (Metal handles GPU automatically)
}

Config values are coerced to strings internally. Generation parameters (prompt, steps, seed, etc.) are JSON-serialized with their native types preserved.

Parameter	Type	Default	Description
`threads`	number	auto	Number of CPU threads for model loading and CPU ops
`type`	`'f32'` \| `'f16'` \| `'q4_0'` \| `'q8_0'` \| …	auto	Override weight quantisation type
`rng`	`'cpu'` \| `'cuda'` \| `'std_default'`	`'cuda'`	RNG backend (`'cuda'` = philox RNG — not GPU-specific despite the name; recommended)
`clip_on_cpu`	`true` \| `false`	`false`	Force CLIP encoder to run on CPU
`vae_on_cpu`	`true` \| `false`	`false`	Force VAE to run on CPU
`flash_attn`	`true` \| `false`	`false`	Enable flash attention for all components (text encoder, diffusion model, VAE)
`diffusion_fa`	`true` \| `false`	`true`	Enable flash attention for the diffusion model only. FLUX2 requires this to avoid materialising the full Q·Kᵀ matrix in VRAM. Safe for all model families — falls back to standard attention on backends that don't support it. Opt out with `false`.
`upscaler_tile_size`	number	`128`	ESRGAN upscaler tile size

4. Create a Model Instance

const model = new ImgStableDiffusion(args)

The constructor takes a single object containing files, config, logger, and opts. It stores configuration only — no memory is allocated yet.

5. Load the Model

await model.load()

This creates the native sd_ctx_t and loads all weights into memory. It can take 10–30 seconds depending on disk speed and model size. All model files must be passed as absolute paths via the files object.

6. Run Inference

Text-to-image (`model.run`)

The primary API. Returns a QvacResponse that streams step-progress ticks and the final PNG:

const images = []

const response = await model.run({
  prompt: 'a majestic red fox in a snowy forest, golden light, photorealistic',
  steps: 20,
  width: 512,
  height: 512,
  guidance: 3.5,   // distilled guidance scale — FLUX.2 specific
  seed: 42
})

await response
  .onUpdate(data => {
    if (data instanceof Uint8Array) {
      images.push(data)  // PNG-encoded output image
    } else if (typeof data === 'string') {
      try {
        const tick = JSON.parse(data)
        if ('step' in tick) process.stdout.write(`\rStep ${tick.step}/${tick.total}`)
      } catch (_) {}
    }
  })
  .await()

require('bare-fs').writeFileSync('output.png', images[0])

Generation parameters:

Parameter	Type	Default	Description
`prompt`	string	—	Text prompt
`negative_prompt`	string	`''`	Things to avoid in the output
`width`	number	`512` (FLUX img2img: `1024`)	Output width in pixels (multiple of 8)
`height`	number	`512` (FLUX img2img: `1024`)	Output height in pixels (multiple of 8)
`steps`	number	`20`	Number of diffusion steps
`guidance`	number	`3.5`	Distilled guidance scale (FLUX.2)
`cfg_scale`	number	`7.0`	Classifier-free guidance scale (SD2.x / SDXL / SD3)
`sampling_method`	string	auto	Sampler name; auto-selects `euler` for FLUX.2, `euler_a` for SD2.x
`scheduler`	string	auto	Scheduler; auto-selected per model family
`seed`	number	`-1`	Random seed (-1 for random)
`batch_count`	number	`1`	Number of images to generate
`vae_tiling`	boolean	`false`	Enable VAE tiling (required for large images on 16 GB)
`cache_preset`	string	—	Step-caching preset: `slow`, `medium`, `fast`, `ultra`
`upscale`	boolean \| `{ repeats?: number }`	`false`	Post-generation ESRGAN upscale. Requires `files.esrgan`; `repeats` defaults to `1`

Sampler note: Do not set sampling_method: 'euler_a' for FLUX.2 models — it will produce random noise. Leave the field unset to let the library auto-select euler for flow-matching models.

Image-to-image (`init_image`)

Pass init_image (a Uint8Array of PNG or JPEG bytes) to transform an existing image with a text prompt. For SDEdit models (SD2.x / SDXL / SD3) width and height default to the input image's pixel dimensions (rounded up to the next multiple of 8). For FLUX.2 models the output size is independent of the reference image — omit width/height to get the default 1024×1024 output, or supply them explicitly.

The addon automatically selects the correct img2img strategy based on the model's prediction type:

Model family	Prediction type	Strategy	How it works
FLUX.2	`flux2_flow`	In-context conditioning (`ref_images`)	Input image is VAE-encoded into separate latent tokens; the transformer attends to them via joint attention with distinct RoPE positions. The target starts from pure noise, so the model preserves features while generating a fully new image.
SD2.x / SDXL / SD3	All others	SDEdit (`init_image`)	Input image is noised according to `strength` (0.0–1.0), then denoised with the text prompt. Lower strength preserves more of the original; higher strength allows more creative freedom.

FLUX.2 example (in-context conditioning):

const fs = require('bare-fs')

const inputImage = fs.readFileSync('assets/von-neumann.jpg')

const response = await model.run({
  prompt: 'a modern tech CEO version of this person, professional headshot',
  init_image: inputImage,
  cfg_scale: 1.0,
  steps: 20,
  guidance: 9.0,
  seed: 42
})

SD3 example (SDEdit):

const inputImage = fs.readFileSync('headshot.jpeg')

const response = await model.run({
  prompt: 'anime portrait, same pose, studio ghibli style, soft cel shading',
  negative_prompt: 'photorealistic, blurry, low quality',
  init_image: inputImage,
  cfg_scale: 4.5,
  steps: 30,
  strength: 0.75,
  sampling_method: 'euler',
  seed: 42
})

SDEdit img2img limitations:

Black-and-white input images produce weaker results because the model must hallucinate all color information. Consider colorizing the image before feeding it in.

Low-resolution images (below ~512×512) give the model less detail to preserve identity. Upscaling beforehand helps.

High strength values (≥ 0.7) allow the model to deviate significantly from the input, including changing facial features, gender, or ethnicity. Use strength 0.35–0.55 for identity-preserving edits.

Style prompts like "anime" or "studio ghibli" carry training-data biases that can alter the subject's appearance. Anchor the prompt with terms like "same person, same face" and use the negative prompt to block unwanted changes.

Non-multiple-of-8 images are automatically aligned (nearest-neighbor resize to the next multiple of 8) before processing. For best quality, provide images with dimensions that are already multiples of 8.

The bundled test image (assets/von-neumann.jpg) is a 1956 portrait of John von Neumann sourced from the U.S. Department of Energy (Public Domain). See the Credits section for details.

Multi-reference fusion (`init_images`) — FLUX.2 only

FLUX.2-klein only. Pass init_images (an array of Uint8Array PNG/JPEG buffers) to blend multiple reference images into a single output via in-context conditioning. All references share one RoPE coordinate space (the library default, increase_ref_index: false), so their visual features blend via attention — this is the "fusion" behavior.

This differs from single-image init_image in three ways:

Parameter: init_images (array) instead of init_image (single buffer)
Target: Generated from pure noise (not a noisy version of a single input), so the model creates a new composition attending to all references
Text encoder behavior: FLUX2-klein's Qwen3 does not receive vision tokens for the references. The @imageN tags in the prompt are purely prose labels for the model — the actual visual fusion is learned via attention in the DiT. Use them to anchor the prompt semantically (e.g. "use @image1 and @image2 as the two scientists").

Setup:

const fs = require('bare-fs')

const refImage1 = fs.readFileSync('assets/von-neumann.jpg')
const refImage2 = fs.readFileSync('assets/claude-shannon.jpg')

const response = await model.run({
  prompt: 'two scientists in @image1 and @image2 shaking hands in a lab, use @image1 and @image2 as the two scientists, black studio background, colorized.',
  init_images: [refImage1, refImage2],
  width: 624,
  height: 624,
  sample_method: 'euler',
  cfg_scale: 1.0,
  guidance: 3.5,
  steps: 10,
  seed: 42
})

@imageN tag conventions:

Optional but recommended: Tags like @image1, @image2, … in your prompt help anchor the semantic meaning of each reference
Not vision tokens: Qwen3 on FLUX2-klein sees these as plain text; they don't bypass the text-only constraint
Naming: Use consistent, meaningful labels (e.g. @person1, @background, @style_ref if semantically clearer for your use case)
Example prompts:
- "blend the faces of @image1 and @image2 into one person" — fusion
- "use the style of @image1 and the subject of @image2 together" — style blending
- "@image1 in the setting of @image2" — composition blending

Parameters (specific to init_images):

Parameter	Type	Default	Description
`init_images`	`Uint8Array[]`	—	Array of PNG/JPEG reference image buffers (mutually exclusive with `init_image`)
`increase_ref_index`	boolean	`false`	If `false` (default), all refs share one RoPE coordinate slot → visual fusion via attention. If `true`, each ref gets its own RoPE index → typically makes one ref dominate (not recommended for FLUX2-klein)
`auto_resize_ref_image`	boolean	`true`	Auto-resize all reference images to match `width`/`height` before VAE encoding. Disable only if you've pre-resized the buffers

Tips for best results:

Similar aspect ratios: References with differing aspect ratios may not blend as smoothly. Pre-resize to the target aspect ratio if possible
Image quality matters: Low-quality or heavily compressed references produce weaker fusion. Use PNG or high-quality JPEG
Prompt anchoring: Use the @imageN tags in your prompt to help the model understand the intent (even though it's text-only)
Guidance & steps: Lower guidance (cfg_scale: 1.0) and moderate steps (8–20) work well for fusion; too much guidance can collapse the blending to one dominant reference
Identity preservation: For portrait fusion, add phrases like "blend the features of @image1 and @image2" or "same pose, fused appearance"

Example walkthrough:

See examples/generate-fusion.js for a complete working example that fuses two scientists (von Neumann + Shannon) into a handshake scene.

7. Release Resources

await model.unload()

unload() calls free_sd_ctx which releases all GPU and CPU memory. The JS object can be safely garbage collected afterwards.

Standalone ESRGAN Upscaler

The package also exports EsrganUpscaler for upscaling existing PNG/JPEG images without loading a diffusion model. Useful for post-processing pre-existing assets (screenshots, photos, third-party generated images).

const { EsrganUpscaler } = require('@qvac/diffusion-cpp')
const { setLogger, releaseLogger } = require('@qvac/diffusion-cpp/addonLogging')
const fs = require('bare-fs')

setLogger((priority, message) => console.log(`[C++] ${message}`))

const upscaler = new EsrganUpscaler({
  files: {
    esrgan: '/absolute/path/to/RealESRGAN_x4plus_anime_6B.pth'
  },
  config: {
    upscaler_tile_size: 128
  },
  logger: console
})

await upscaler.load()

const inputBytes = fs.readFileSync('/path/to/source.png')
const response = await upscaler.upscale(inputBytes, { repeats: 1 })

const images = []
await response
  .onUpdate(data => {
    if (data instanceof Uint8Array) images.push(data)
  })
  .await()

await upscaler.unload()
releaseLogger()

Constructor args

Property	Required	Description
`files.esrgan`	✅	Absolute path to ESRGAN upscaler model (`.pth`)
`config.upscaler_tile_size`	—	Tile size used during inference (default `128`)
`config.upscaler_threads`	—	CPU threads for the upscaler (`-1` = auto)
`config.upscaler_direct`	—	Use direct convolution (default `false`)
`config.upscaler_offload_params_to_cpu`	—	Keep weights on CPU and offload during compute (default `false`)
`logger`	—	Logger instance for JS wrapper logs (e.g. `console`). Native C++ logs are configured separately via `addonLogging.setLogger()`

upscale(imageBytes, options?)

Option	Type	Default	Description
`repeats`	number	`1`	Number of ESRGAN passes. Each pass multiplies output dimensions by the model's scale factor (typically 4×), so `repeats: 2` produces a 16× upscale

Cancellation works the same as ImgStableDiffusion: call upscaler.cancel() to interrupt an in-flight upscale (honored between repeat passes).

For a complete runnable example, see examples/standalone-esrgan-upscale.js.

Model File Reference

FLUX.2 [klein] 4B (recommended for 16 GB machines)

Role	File	Source
Diffusion model	`flux-2-klein-4b-Q8_0.gguf`	leejet/FLUX.2-klein-4B-GGUF
Text encoder	`Qwen3-4B-Q4_K_M.gguf`	unsloth/Qwen3-4B-GGUF
VAE	`flux2-vae.safetensors`	black-forest-labs/FLUX.2-klein-4B

Stable Diffusion 2.x / SDXL / SD3

Pass an all-in-one checkpoint absolute path as files.model. No separate encoders needed.

FLUX.2 Implementation Notes

This section documents non-obvious issues encountered integrating FLUX.2 [klein] into the addon and how each was resolved. These serve as a reference if the underlying qvac-ext-stable-diffusion.cpp version is upgraded.

1. Metal GPU backend not activated (macOS)

Symptom: Generation ran entirely on CPU at 700%+ CPU usage; 20 steps at 512 × 512 never completed.

Root cause: The vcpkg port passed -DGGML_METAL=ON to CMake, which compiled the ggml Metal library (libggml-metal.a). However, qvac-ext-stable-diffusion.cpp internally guards ggml_backend_metal_init() behind its own SD_USE_METAL preprocessor define, which is only set when -DSD_METAL=ON is passed — a separate flag from GGML_METAL.

Fix: Changed the stable-diffusion-cpp registry port from:

-DGGML_METAL=${SD_GGML_METAL}

to:

-DSD_METAL=${SD_GGML_METAL}

-DSD_METAL=ON causes qvac-ext-stable-diffusion.cpp's own CMakeLists.txt to set GGML_METAL=ON and emit -DSD_USE_METAL, which activates ggml_backend_metal_init() at runtime.

Verification: After the fix, CPU usage dropped from ~700% to ~0.5% during generation, confirming the GPU is handling the compute.

2. Noise output instead of image — wrong prediction type default

Symptom: Generation completed all 20 steps and produced a PNG, but the image was pure coloured noise (TV static).

Root cause: SdCtxConfig::prediction defaulted to EPS_PRED (the epsilon-prediction denoiser). When SdModel::load() passed this to sd_ctx_params_t.prediction, it overrode qvac-ext-stable-diffusion.cpp's auto-detection, forcing the wrong denoiser on a FLUX.2 flow-matching model. The correct sentinel value for auto-detection is PREDICTION_COUNT.

Fix: Changed the default in addon/src/handlers/SdCtxHandlers.hpp:

// Before
prediction_t prediction = EPS_PRED;

// After
prediction_t prediction = PREDICTION_COUNT;  // auto-detect from GGUF metadata

3. Noise output — wrong flow_shift default

Symptom: Same noise output as above (compounded with fix 2).

Root cause: SdCtxConfig::flowShift defaulted to 0.0f. For FLUX.2, qvac-ext-stable-diffusion.cpp expects INFINITY as the sentinel meaning "use the model's embedded flow-shift value". A value of 0.0f disabled flow-shifting entirely, breaking the entire noise schedule.

Fix:

// Before
float flowShift = 0.0f;

// After
float flowShift = std::numeric_limits<float>::infinity();  // use model's embedded value

4. Wrong sampler default bypassing auto-detection

Symptom: Even with fixes 1–3, the wrong sampler could be selected if passed explicitly.

Root cause: SdGenConfig::sampleMethod defaulted to EULER_A_SAMPLE_METHOD. The generate_image() function in qvac-ext-stable-diffusion.cpp only runs its auto-detection (sd_get_default_sample_method()) when sample_method == SAMPLE_METHOD_COUNT. Since we always passed EULER_A explicitly, FLUX.2 (a DiT flow-matching model that needs EULER) got the ancestral euler sampler instead, producing garbage.

Fix: Changed the default in addon/src/handlers/SdGenHandlers.hpp:

// Before
sample_method_t sampleMethod = EULER_A_SAMPLE_METHOD;
scheduler_t     scheduler    = DISCRETE_SCHEDULER;

// After
sample_method_t sampleMethod = SAMPLE_METHOD_COUNT;  // auto (euler for FLUX, euler_a for SD2.x)
scheduler_t     scheduler    = SCHEDULER_COUNT;      // auto

With these sentinel values, qvac-ext-stable-diffusion.cpp selects euler for DiT/FLUX models and euler_a for SD2.x automatically.

5. Wrong RNG default

Symptom: Minor correctness difference vs reference CLI output.

Root cause: SdCtxConfig defaulted to rngType = CPU_RNG (Mersenne Twister). sd_ctx_params_init() in qvac-ext-stable-diffusion.cpp sets CUDA_RNG (the philox RNG — named CUDA_RNG for historical reasons but not GPU-specific). The philox RNG is the expected default across all platforms.

Fix:

// Before
rng_type_t rngType        = CPU_RNG;
rng_type_t samplerRngType = CPU_RNG;

// After
rng_type_t rngType        = CUDA_RNG;       // philox RNG — matches sd_ctx_params_init default
rng_type_t samplerRngType = RNG_TYPE_COUNT; // auto

Summary of default alignment

The underlying pattern across all these fixes is the same: our C++ config structs had concrete default values that overrode qvac-ext-stable-diffusion.cpp's own sentinel-based auto-detection. The correct approach is to use the same sentinel values that sd_ctx_params_init() and sd_sample_params_init() set, and only pass concrete values when the caller explicitly requests them.

Field	Wrong default	Correct default	Effect of wrong value
`prediction`	`EPS_PRED`	`PREDICTION_COUNT`	Forces epsilon denoiser on FLUX.2 → noise
`flow_shift`	`0.0f`	`INFINITY`	Disables flow-shifting → broken noise schedule
`sample_method`	`EULER_A_SAMPLE_METHOD`	`SAMPLE_METHOD_COUNT`	Wrong sampler for flow-matching models → noise
`scheduler`	`DISCRETE_SCHEDULER`	`SCHEDULER_COUNT`	Wrong schedule for FLUX.2
`rng_type`	`CPU_RNG`	`CUDA_RNG`	Different noise seed generation vs reference
`ggml_metal` cmake flag	`-DGGML_METAL=ON`	`-DSD_METAL=ON`	Metal library compiled but never initialised

Credits

Test Images

assets/von-neumann.jpg — John von Neumann (1956). Source: U.S. Department of Energy, File ID: HD.3F.191. This image is in the Public Domain as a work of the U.S. Federal Government.

assets/claude-shannon.jpg — Claude Shannon. Source: Bell Labs / Wikimedia Commons. Licensed under Creative Commons Attribution-ShareAlike (CC BY-SA). Attribution must be preserved; any redistribution of this image or a derivative must be released under a compatible CC BY-SA license.

License

Apache-2.0 — see LICENSE for details.

@qvac/diffusion-cpp

0.12.0

@GitHub Actions

diffusion-cpp

Table of Contents

Supported platforms

Building from Source

Downloading Model Files

Why these specific files?

Disk and RAM requirements

Running the Example

Load / unload only

Text-to-image generation

Other Examples

Usage

1. Import the Model Class

2. Create the `args` object

3. Configure the native backend (`args.config`)

4. Create a Model Instance

5. Load the Model

6. Run Inference

Text-to-image (`model.run`)

Image-to-image (`init_image`)

Multi-reference fusion (`init_images`) — FLUX.2 only

7. Release Resources

Standalone ESRGAN Upscaler

Model File Reference

FLUX.2 [klein] 4B (recommended for 16 GB machines)

Stable Diffusion 2.x / SDXL / SD3

FLUX.2 Implementation Notes

1. Metal GPU backend not activated (macOS)

2. Noise output instead of image — wrong prediction type default

3. Noise output — wrong flow_shift default

4. Wrong sampler default bypassing auto-detection

5. Wrong RNG default

Summary of default alignment

Credits

Test Images

License

@qvac/diffusion-cpp

0.12.0

@GitHub Actions

diffusion-cpp

Table of Contents

Supported platforms

Building from Source

Downloading Model Files

Why these specific files?

Disk and RAM requirements

Running the Example

Load / unload only

Text-to-image generation

Other Examples

Usage

1. Import the Model Class

2. Create the args object

3. Configure the native backend (args.config)

4. Create a Model Instance

5. Load the Model

6. Run Inference

Text-to-image (model.run)

Image-to-image (init_image)

Multi-reference fusion (init_images) — FLUX.2 only

7. Release Resources

Standalone ESRGAN Upscaler

Model File Reference

FLUX.2 [klein] 4B (recommended for 16 GB machines)

Stable Diffusion 2.x / SDXL / SD3

FLUX.2 Implementation Notes

1. Metal GPU backend not activated (macOS)

2. Noise output instead of image — wrong prediction type default

3. Noise output — wrong flow_shift default

4. Wrong sampler default bypassing auto-detection

5. Wrong RNG default

Summary of default alignment

Credits

Test Images

License

2. Create the `args` object

3. Configure the native backend (`args.config`)

Text-to-image (`model.run`)

Image-to-image (`init_image`)

Multi-reference fusion (`init_images`) — FLUX.2 only