Performance – Michael Kazakov's quiet corner

December 29, 2025

Drawing a Hosek-Wilkie sky on CPU, fast

While working on my toy software rasterizer, at some point I decided to try rendering a skybox using cube maps. Loading and drawing a pre-existing environment cube map as 12 triangles proved to be easy and boring. Next, I looked into generating the skybox programmatically on the fly during each frame. The first attempt used the Preetham daylight model. It worked, but I couldn’t tune it well enough to produce good-looking results for a dynamic sky with the Sun moving in real time from dawn till dusk. This paper explores the issues well: “A Critical Review of the Preetham Skylight Model”. The next attempt used the Hosek-Wilkie sky model (paper, presentation), which produced much more convincing results.

This model allows sampling sky radiance to build an image like this:

When combined with a visualization of a moving Sun and rendered for five cube-map faces each frame, it results in a lively sky background like this:

This video shows the Skybox example from the NIH2 software renderer, which runs at ~150 FPS at 720p on an Apple M1 CPU.

This blogpost summarizes the experience of running the Hosek-Wilkie sky model on the CPU and iteratively optimizing the implementation enough for semi real-time use cases. It also provides a distilled version of the source code detached from the software renderer. The code is written in Rust, but this doesn’t matter much, as the logic behind the optimizations is equally applicable to any compiled language.

For simplicity, the implementation builds only a single cube-map face (negative Z). Since the sky model is defined only for view directions with Y>=0, only the top half of this face is filled. Extending the logic to other faces is pretty straightforward, but makes the code hairier, so they were omitted.

The optimizations focus on single-threaded performance, as building multiple faces can be trivially parallelized across threads. A resolution of 1024×512 was chosen for benchmarking each iteration of the code; all measurements were done on the base Apple M1 CPU.

V0 – Initial implementation

The sky model has two kinds of inputs:

- Per entire sky dome (during model initialization):
  - Turbidity [1..10] – measure of aerosol content in the air
  - Ground albedo [0..1 x 3] – fraction of sunlight reflected by the ground
  - Solar elevation [0°..90°] – how high the Sun is
- Per view direction (during sampling):
  - Theta θ [0°..90°] – view angle from the zenith
  - Gamma γ [0°..180°] – angle between the view direction and the Sun

Conceptually, building a single cube-map face consists of the following steps:

- Initialize the sky model (can be shared across faces)
- For each pixel on the face:
  - Compute θ and γ
  - Sample the model with (θ, γ) three times (once per RGB channel)
  - Tone-map, convert to sRGB, and write out as U8 x 3

Sky model initialization is performed once per frame and is very cheap. It mainly consists of evaluating equation (11) a few times on the table data and lerping between the results.

θ and γ are computed from pixel coordinates (x, y) as follows:

let u: f32 = 2.0 * (x as f32 + 0.5) / (width as f32) - 1.0; // [-1, 1]         
let v: f32 = ((height - 1 - y) as f32 + 0.5) / (height as f32); // [0, 1]
let dir: Vec3 = Vec3::new(u, v, -1.0).normalize(); // ZNeg => Z=-1
let theta: f32 = dir.y.acos(); // view angle from zenith
let cos_gamma: f32 = dir.dot(sun_direction);
let gamma: f32 = cos_gamma.acos(); // angle between view direction and sun
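
The snippet above relies on the renderer's `Vec3` type. As a rough self-contained sketch of the same mapping, the function below uses plain arrays instead; the name `view_angles` and the `[f32; 3]` representation are assumptions made for illustration, not part of the original code. It also clamps the cosines before `acos`, since rounding can push a dot product of two unit vectors slightly outside [-1, 1].

```rust
/// Maps a texel of the negative-Z cube-map face to (theta, gamma) in radians.
/// `sun_dir` is assumed to be a unit vector in the same coordinate system.
pub fn view_angles(
    x: usize,
    y: usize,
    width: usize,
    height: usize,
    sun_dir: [f32; 3],
) -> (f32, f32) {
    // Texel centers: u spans [-1, 1] left to right, v spans [0, 1] bottom to top.
    let u = 2.0 * (x as f32 + 0.5) / (width as f32) - 1.0;
    let v = ((height - 1 - y) as f32 + 0.5) / (height as f32);
    // Direction through the texel on the Z = -1 plane, normalized.
    let inv_len = 1.0 / (u * u + v * v + 1.0).sqrt();
    let dir = [u * inv_len, v * inv_len, -inv_len];
    // theta: angle between the view direction and the zenith (+Y).
    let theta = dir[1].clamp(-1.0, 1.0).acos();
    // gamma: angle between the view direction and the Sun.
    let cos_gamma = dir[0] * sun_dir[0] + dir[1] * sun_dir[1] + dir[2] * sun_dir[2];
    let gamma = cos_gamma.clamp(-1.0, 1.0).acos();
    (theta, gamma)
}
```

For a 1024×512 face, the top row of texels ends up roughly 45° away from the zenith and the bottom row sits just above the horizon, which matches the "top half of the face" layout described earlier.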

Sampling the model is done by implementing the equations (8, 9) directly:

pub fn f(&self, theta: f32, gamma: f32) -> (f32, f32, f32) {
  let chi = |g: f32, a: f32| -> f32 {
    let num: f32 = 1.0 + a.cos().powi(2);
    let denom: f32 = (1.0 + g.powi(2) - 2.0 * g * a.cos()).powf(3.0 / 2.0);
    num / denom
  };
  let eval = |p: [f32; 9], theta: f32, gamma: f32| -> f32 {
    let a: f32 = p[0];
    // ...
    let i: f32 = p[8];
    let term1: f32 = 1.0 + a * (b / (theta.cos() + 0.01)).exp();
    let term2: f32 = c + d * (e * gamma).exp() + f * gamma.cos().powi(2) +
      g * chi(i, gamma) + h * theta.cos().sqrt();
    term1 * term2
  };
  let f0: f32 = eval(self.distribution[0], theta, gamma);
  let f1: f32 = eval(self.distribution[1], theta, gamma);
  let f2: f32 = eval(self.distribution[2], theta, gamma);
  (f0 * self.radiance[0], f1 * self.radiance[1], f2 * self.radiance[2])
}

For tone mapping I used Reinhard since it’s simple and robust. Gamma correction and clamping are applied before writing out a pixel:

  // ...
  let f: (f32, f32, f32) = sky.f(theta, gamma); // sample the radiance
  let c: (f32, f32, f32) = linear_to_rgb(f); // convert to display sRGB space
  // Write out as u8 RGB
  let idx: usize = y * width + x; // pixel index
  pixels[idx * 3 + 0] = (c.0 * 255.0).clamp(0.0, 255.0) as u8;
  pixels[idx * 3 + 1] = (c.1 * 255.0).clamp(0.0, 255.0) as u8;
  pixels[idx * 3 + 2] = (c.2 * 255.0).clamp(0.0, 255.0) as u8;
  // ...
        
fn to_srgb(c: (f32, f32, f32)) -> (f32, f32, f32) {
  let encode = |x: f32| {
    if x <= 0.0031308 {
      12.92 * x
    } else {
      1.055 * x.powf(1.0 / 2.4) - 0.055
    }
  };
  (encode(c.0), encode(c.1), encode(c.2))
}

fn tonemap_reinhard(rgb: (f32, f32, f32), exposure: f32, white: f32) ->
  (f32, f32, f32) {
  let r: f32 = rgb.0 * exposure;
  let g: f32 = rgb.1 * exposure;
  let b: f32 = rgb.2 * exposure;
  let y: f32 = 0.2126 * r + 0.7152 * g + 0.0722 * b;
  let s: f32 = (1.0 + y / (white * white)) / (1.0 + y);
  (r * s, g * s, b * s)
}

fn linear_to_rgb(c: (f32, f32, f32)) -> (f32, f32, f32) {
  let exposure: f32 = 0.5;
  let white_point: f32 = 14.0;
  let exposed: (f32, f32, f32) = tonemap_reinhard(c, exposure, white_point);
  let display: (f32, f32, f32) = to_srgb(exposed);
  display
}
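
To make the role of the white point more tangible, here is a small luminance-only sketch of the same Reinhard curve. The helper name `reinhard_luma` is hypothetical and exists only for this illustration. Algebraically, a luminance equal to the white point maps to exactly 1.0, while much darker values are compressed only mildly.

```rust
/// Extended Reinhard curve applied to a single luminance value:
/// y * (1 + y / white^2) / (1 + y).
pub fn reinhard_luma(y: f32, white: f32) -> f32 {
    y * (1.0 + y / (white * white)) / (1.0 + y)
}
```

With `white = 14.0`, `reinhard_luma(14.0, 14.0)` evaluates to 14 · (1 + 1/14) / 15 = 1, so anything at or above the white point saturates after clamping.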

Result: this version works, but it takes ~46ms to build a single face. Too slow…

V1 – Less wasteful computation

The amount of work the poor old scalar CPU has to perform per pixel is no joke. Worse still, compilers (including rustc) are constrained by strict IEEE floating-point semantics, which prevents many otherwise valid optimizations. However, it is possible to manually simplify parts of the formulas and hoist redundant computations:

- cos(θ) and cos(γ) are already available, so there’s no need to recalculate them inside the function body
- v.powf(3.0 / 2.0) is mathematically equivalent to v * v.sqrt(), which is much cheaper
- v.powi(2) is equivalent to v * v, in case the compiler fails to expand it
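
As an illustration of these rewrites, the sketch below puts the original form of the chi term next to a simplified one. The function names `chi_reference` and `chi_fast` are made up for this comparison; the actual V1 code is not shown in the post, so treat this as one possible way to apply the three points above.

```rust
/// Form used in V0: recomputes cos(gamma) twice and relies on powi/powf.
pub fn chi_reference(g: f32, gamma: f32) -> f32 {
    let num = 1.0 + gamma.cos().powi(2);
    let denom = (1.0 + g.powi(2) - 2.0 * g * gamma.cos()).powf(3.0 / 2.0);
    num / denom
}

/// Simplified form: takes the precomputed cos(gamma), expands the squares by hand
/// and replaces powf(1.5) with v * sqrt(v).
pub fn chi_fast(g: f32, cos_gamma: f32) -> f32 {
    let num = 1.0 + cos_gamma * cos_gamma;
    let v = 1.0 + g * g - 2.0 * g * cos_gamma;
    num / (v * v.sqrt())
}
```

The two versions are not guaranteed to be bit-identical, since `powf` and `sqrt` round differently, but they should agree to within a few units in the last place.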

Result: these simple transformations reduce the compute time down to ~28ms.

V2 – Per-pixel SIMD

Staring at this snippet for long enough:

    let f0: f32 = eval(self.distribution[0]);
    let f1: f32 = eval(self.distribution[1]);
    let f2: f32 = eval(self.distribution[2]);

… eventually raises the question – since we’re doing the same computation three times, just with different input data, why not do all three in parallel?

Of course, there are no SIMD registers with three lanes, but nothing prevents us from using a fourth throwaway lane for free. Effectively, this means running the formulas (8, 9) for RGBX, where the distribution and radiance values for the fourth lane are zeroed out.
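
One possible way to prepare the data for this RGBX layout is to transpose the three per-channel coefficient sets into nine rows of four lanes. The helper below is a sketch under that assumption; the name `pack_rgbx` and the exact array shapes are illustrative and are not taken from the original code.

```rust
/// Transposes three per-channel coefficient sets into nine 4-lane rows (R, G, B, X).
/// The fourth lane is padding and stays zero.
pub fn pack_rgbx(distribution: &[[f32; 9]; 3]) -> [[f32; 4]; 9] {
    let mut packed = [[0.0f32; 4]; 9];
    for (channel, coeffs) in distribution.iter().enumerate() {
        for (k, &value) in coeffs.iter().enumerate() {
            // Row k holds coefficient k for all channels, one channel per lane.
            packed[k][channel] = value;
        }
    }
    packed
}
```

Each row can then be loaded into a single 4-wide register. Whatever the padding lane computes is simply discarded when the result is unpacked.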

A direct translation of the sampling function to 4-way SIMD looks like this:

pub fn f(&self, theta: f32, gamma: f32, theta_cos: f32, gamma_cos: f32) -> (f32, f32, f32) {
  let a: F32x4 = F32x4::load(self.distribution4[0]);
  // ...
  let i: F32x4 = F32x4::load(self.distribution4[8]);
  let one: F32x4 = F32x4::splat(1.0);
  let two: F32x4 = F32x4::splat(2.0);
  let zero_zero_one: F32x4 = F32x4::splat(0.01);
  let gamma: F32x4 = F32x4::splat(gamma);
  let theta_cos: F32x4 = F32x4::splat(theta_cos);
  let gamma_cos: F32x4 = F32x4::splat(gamma_cos);
  let radiance: F32x4 = F32x4::load(self.radiance4);
  let term1: F32x4 = (b / (theta_cos + zero_zero_one)).exp() * a + one;
  let chi_num: F32x4 = one + gamma_cos * gamma_cos;
  let chi_denom: F32x4 = one + i * (i - gamma_cos * two);
  let chi: F32x4 = chi_num / (chi_denom * chi_denom.sqrt());
  let term2: F32x4 = c + d * (e * gamma).exp() + f * gamma_cos * gamma_cos +
    g * chi + h * theta_cos.sqrt();
  let c: F32x4 = (term1 * term2) * radiance;
  let c4: [f32; 4] = c.store();
  (c4[0], c4[1], c4[2])
}

Here I use a custom F32x4 SIMD type that provides trig and pow/exp/log operations, but really any decent SIMD library would do. I’m mostly using ARM64, but also want the code to run on AMD64, so a 4-wide type is the common denominator.

Result: this version takes the previous ~28ms down to ~26ms. Meh, rather underwhelming. It clearly demonstrates a well-known truth – effective SIMD requires rethinking both data layout and function interfaces. Otherwise, the ceremony of setting up SIMD computation and getting the results back nullifies any gains from the parallel compute.

V3 – Per-row SIMD

The next step was to go full SIMD: drop the per-pixel approach entirely and instead perform the computation per row, separately for the R, G, and B channels. With width=1024, this means first writing out (cos(θ), γ, cos(γ)) for the entire row (1024 x 3 x 4 B = 12 KB), and then calculating the formulas (8, 9) from the paper and writing them into three output arrays (1024 x 3 x 4 B = 12 KB). Since the scratchpad arrays take about 24 KB in total, the CPU should rarely touch memory outside the L1 cache:

// Per-row scratch space
let mut theta_cos_row: Vec<f32> = vec![0.0; width];
let mut gamma_cos_row: Vec<f32> = vec![0.0; width];
let mut gamma_row: Vec<f32> = vec![0.0; width];
let mut r_row: Vec<f32> = vec![0.0; width];
let mut g_row: Vec<f32> = vec![0.0; width];
let mut b_row: Vec<f32> = vec![0.0; width];

//...

// Calculate radiance per each channel, entire row at a time
sky.f_simd_r(&gamma_row, &theta_cos_row, &gamma_cos_row, &mut r_row);
sky.f_simd_g(&gamma_row, &theta_cos_row, &gamma_cos_row, &mut g_row);
sky.f_simd_b(&gamma_row, &theta_cos_row, &gamma_cos_row, &mut b_row);

The function interface changes accordingly – it now accepts spans of input arrays and a span of an output array. The internal logic remains the same, but it operates on chunks of 4 input values, passing them through the transformation and writing out 4 results per iteration:

fn f_simd_channel<const CHANNEL: usize>(
  &self,
  gamma: &[f32],
  theta_cos: &[f32],
  gamma_cos: &[f32],
  output: &mut [f32]) {
  let len: usize = output.len(); // all four slices are assumed to have the same length
  let gamma: *const f32 = gamma.as_ptr();
  let theta_cos: *const f32 = theta_cos.as_ptr();
  let gamma_cos: *const f32 = gamma_cos.as_ptr();
  let output: *mut f32 = output.as_mut_ptr();
  let a: F32x4 = F32x4::splat(self.distribution[CHANNEL][0]);
  // ...
  let i: F32x4 = F32x4::splat(self.distribution[CHANNEL][8]);
  let one: F32x4 = F32x4::splat(1.0);
  let two: F32x4 = F32x4::splat(2.0);
  let zero_zero_one: F32x4 = F32x4::splat(0.01);
  let radiance: F32x4 = F32x4::splat(self.radiance[CHANNEL]);
  for idx in (0..=len - 4).step_by(4) {
    let gamma: F32x4 = F32x4::load(unsafe {*(gamma.add(idx) as *const [f32; 4])});
    let theta_cos: F32x4 = F32x4::load(unsafe {*(theta_cos.add(idx) as *const [f32; 4])});
    let gamma_cos: F32x4 = F32x4::load(unsafe {*(gamma_cos.add(idx) as *const [f32; 4])});
    let term1: F32x4 = (b / (theta_cos + zero_zero_one)).exp() * a + one;
    let chi_num: F32x4 = one + gamma_cos * gamma_cos;
    let chi_denom: F32x4 = one + i * (i - gamma_cos * two);
    let chi: F32x4 = chi_num / (chi_denom * chi_denom.sqrt());
    let term2: F32x4 = c + d * (e * gamma).exp() + f * gamma_cos * gamma_cos + g * chi + h * theta_cos.sqrt();
    let c: F32x4 = (term1 * term2) * radiance;
    c.store_to(unsafe { &mut *(output.add(idx) as *mut [f32; 4]) });
  }
}

Result: this data-layout transformation cuts the runtime from ~26ms to ~13ms. In other words, it is twice as fast while conceptually doing more work in the process, since inputs and outputs are first written to scratch arrays instead of being consumed immediately one by one.
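
For readers who prefer to avoid raw pointers, the same "four values per iteration" structure can be sketched with `chunks_exact`, which also handles rows whose length is not a multiple of 4. The example below only evaluates the chi term, and it uses plain scalar code inside each chunk instead of the custom `F32x4` type, so it illustrates the loop shape rather than the actual implementation; the names `chi_scalar` and `chi_row` are hypothetical. Whether the compiler vectorizes such a loop on its own is not guaranteed.

```rust
/// Scalar chi term with cos(gamma) precomputed, same algebra as in the SIMD listing.
pub fn chi_scalar(i: f32, gamma_cos: f32) -> f32 {
    let num = 1.0 + gamma_cos * gamma_cos;
    let denom = 1.0 + i * (i - 2.0 * gamma_cos);
    num / (denom * denom.sqrt())
}

/// Fills `output` with chi values for a whole row, four inputs per iteration.
pub fn chi_row(i: f32, gamma_cos: &[f32], output: &mut [f32]) {
    assert_eq!(gamma_cos.len(), output.len());
    let mut src_chunks = gamma_cos.chunks_exact(4);
    let mut dst_chunks = output.chunks_exact_mut(4);
    // Main body: complete groups of 4 values.
    for (src, dst) in (&mut src_chunks).zip(&mut dst_chunks) {
        for (s, d) in src.iter().zip(dst.iter_mut()) {
            *d = chi_scalar(i, *s);
        }
    }
    // Tail: up to 3 leftover values when the row length is not a multiple of 4.
    for (s, d) in src_chunks.remainder().iter().zip(dst_chunks.into_remainder()) {
        *d = chi_scalar(i, *s);
    }
}
```

With a row width of 1024 the tail loop never runs, which is why the listing above can get away without one.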

V4 – FMA and raw pointer arithmetic

Now that the implementation is expressed in SIMD intrinsics, it becomes possible to save a few more instructions by using fused multiply-add (FMA) where applicable. The assembly listing also showed some unnecessary checks around loop iteration and pointer arithmetic; this can be fixed by re-writing the code to be as primitive as possible:

fn f_simd_channel<const CHANNEL: usize>(
  &self,
  gamma: &[f32],
  theta_cos: &[f32],
  gamma_cos: &[f32],
  output: &mut [f32]) {
  let mut gamma_ptr: *const f32 = gamma.as_ptr();
  let mut theta_cos_ptr: *const f32 = theta_cos.as_ptr();
  let mut gamma_cos_ptr: *const f32 = gamma_cos.as_ptr();
  let mut output_ptr: *mut f32 = output.as_mut_ptr();
  let a: F32x4 = F32x4::splat(self.distribution[CHANNEL][0]);
  // ...
  let i: F32x4 = F32x4::splat(self.distribution[CHANNEL][8]);
  let one: F32x4 = F32x4::splat(1.0);
  let minus_two: F32x4 = F32x4::splat(-2.0);
  let zero_zero_one: F32x4 = F32x4::splat(0.01);
  let radiance: F32x4 = F32x4::splat(self.radiance[CHANNEL]);
  let steps: usize = output.len() / 4; // all four slices are assumed to have the same length
  for _idx in 0..steps {
    let gamma: F32x4 = F32x4::load(unsafe { *(gamma_ptr as *const [f32; 4]) });
    let theta_cos: F32x4 = F32x4::load(unsafe { *(theta_cos_ptr as *const [f32; 4]) });
    let gamma_cos: F32x4 = F32x4::load(unsafe { *(gamma_cos_ptr as *const [f32; 4]) });
    let term1: F32x4 = (b / (theta_cos + zero_zero_one)).exp().fma(a, one);
    let chi_num: F32x4 = gamma_cos.fma(gamma_cos, one);
    let chi_denom: F32x4 = gamma_cos.fma(minus_two, i).fma(i, one);
    let chi: F32x4 = chi_num / (chi_denom * chi_denom.sqrt());
    let term2: F32x4 = theta_cos
        .sqrt()
        .fma(h, (f * gamma_cos).fma(gamma_cos, g.fma(chi, (e * gamma).exp().fma(d, c))));
    let channel_radiance: F32x4 = (term1 * term2) * radiance;
    channel_radiance.store_to(unsafe { &mut *(output_ptr as *mut [f32; 4]) });
    gamma_ptr = unsafe { gamma_ptr.add(4) };
    theta_cos_ptr = unsafe { theta_cos_ptr.add(4) };
    gamma_cos_ptr = unsafe { gamma_cos_ptr.add(4) };
    output_ptr = unsafe { output_ptr.add(4) };
  }
}

With these changes, the core of the implementation turns into a large loop with one conditional jump. The assembly output becomes a wall of branchless 4-wide vector operations, beautiful!

Result: these tweaks shave off another ~1ms from the runtime, bringing it down to ~12ms.
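
The nested `fma` chain in `term2` can be hard to read at first glance. The scalar sketch below unrolls it step by step using the standard `f32::mul_add`, where `x.mul_add(a, b)` computes `x * a + b`. The struct `Term2Coeffs` and both function names are invented for this illustration, and the coefficients are passed explicitly so that no assumption is made about their order in the model tables.

```rust
/// Coefficients of the second factor of the formula (hypothetical helper type).
pub struct Term2Coeffs {
    pub c: f32,
    pub d: f32,
    pub e: f32,
    pub f: f32,
    pub g: f32,
    pub h: f32,
}

/// term2 written as plain sums and products.
pub fn term2_naive(k: &Term2Coeffs, gamma: f32, cos_gamma: f32, cos_theta: f32, chi: f32) -> f32 {
    k.c + k.d * (k.e * gamma).exp()
        + k.f * cos_gamma * cos_gamma
        + k.g * chi
        + k.h * cos_theta.sqrt()
}

/// The same expression accumulated from the innermost term outward with mul_add.
pub fn term2_fma(k: &Term2Coeffs, gamma: f32, cos_gamma: f32, cos_theta: f32, chi: f32) -> f32 {
    let acc = (k.e * gamma).exp().mul_add(k.d, k.c); // d * exp(e * gamma) + c
    let acc = k.g.mul_add(chi, acc); // g * chi + acc
    let acc = (k.f * cos_gamma).mul_add(cos_gamma, acc); // f * cos^2(gamma) + acc
    cos_theta.sqrt().mul_add(k.h, acc) // h * sqrt(cos(theta)) + acc
}
```

The fused version rounds once per step instead of twice, so the results can differ from the naive form in the last bits. On targets without a hardware FMA instruction, `mul_add` may be slower than a separate multiply and add.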

V5 – Computing theta and gamma with SIMD

At this point there is no more low-hanging fruit in the Hosek-Wilkie sky sampling itself; the scaffolding around it became the bottleneck instead. The calculation of theta and gamma, while not super expensive, can still be accelerated by deriving 4 triplets at a time. Instead of recalculating (u, v) from (x, y) per pixel, the initial direction vector and the dX/dY stepping constants are computed once. Stepping the direction component forward by 4 texels then takes a single add instruction:

for x in (0..width).step_by(4) {
  // normalize the components of the direction vector
  let recip_len_sqrt: F32x4 = vec_x_4.fma(vec_x_4, vec_y2_z2_4).rsqrt();
  let normalized_vec_x_4: F32x4 = vec_x_4 * recip_len_sqrt;
  let normalized_vec_y_4: F32x4 = vec_y_4 * recip_len_sqrt;
  let normalized_vec_z_4: F32x4 = vec_z_4 * recip_len_sqrt;
  // cos(theta) - cos(angle between the zenith and the view direction)
  let theta_cos_4: F32x4 = normalized_vec_y_4;
  // gamma_cos = dot(dir, sun_dir).clamp(-1.0, 1.0);
  let gamma_cos_4: F32x4 = (normalized_vec_x_4 * sun_dir_x_4
    + normalized_vec_y_4 * sun_dir_y_4
    + normalized_vec_z_4 * sun_dir_z_4)
      .min(F32x4::splat(1.0))
      .max(F32x4::splat(-1.0));
  // gamma - angle between the view direction and the Sun
  let gamma_4: F32x4 = gamma_cos_4.acos();
  theta_cos_4.store_to(unsafe {&mut *(theta_cos_row.as_mut_ptr().add(x) as *mut [f32; 4])});
  gamma_cos_4.store_to(unsafe {&mut *(gamma_cos_row.as_mut_ptr().add(x) as *mut [f32; 4])});
  gamma_4.store_to(unsafe {&mut *(gamma_row.as_mut_ptr().add(x) as *mut [f32; 4])});
  vec_x_4 += dir_dx_x_4; // step the direction vector forward by 4 texels
}

Result: calculating four (cos(θ), γ, cos(γ)) triplets at a time removes another ~1ms, now at ~11ms. Likely this part can be optimized a bit further, but it already feels like a territory of diminishing returns.
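
The listing above omits how `vec_x_4` and `dir_dx_x_4` are initialized. A plausible setup, shown here with a plain `[f32; 4]` standing in for `F32x4`, starts the four lanes at the first four texel centers and advances all of them by four texel widths per iteration. The function name `row_u_stepped` is hypothetical, and the sketch assumes the row width is a multiple of 4.

```rust
/// Returns the u coordinate of every texel center in a row, four lanes at a time,
/// by stepping the lanes forward instead of re-deriving u from x.
pub fn row_u_stepped(width: usize) -> Vec<f32> {
    assert!(width % 4 == 0, "this sketch assumes a width divisible by 4");
    let du = 2.0 / width as f32; // distance between neighbouring texel centers
    let u0 = 0.5 * du - 1.0; // center of the first texel
    let mut lanes = [u0, u0 + du, u0 + 2.0 * du, u0 + 3.0 * du];
    let step = 4.0 * du; // one iteration advances by 4 texels
    let mut out = Vec::with_capacity(width);
    for _ in 0..width / 4 {
        out.extend_from_slice(&lanes);
        for lane in lanes.iter_mut() {
            *lane += step;
        }
    }
    out
}
```

For power-of-two widths such as 1024, every intermediate value is exactly representable in `f32`, so the stepped coordinates match the direct formula. For other widths a small rounding error may accumulate along the row.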

V6 – Tone mapping and gamma correction in SIMD

The only scalar part left is processing the HDR output from the Hosek-Wilkie sky model. This stage also needs to be re-written to take 3 per-component arrays and produce a single SDR RGB array. The operation is somewhat awkward – the input data are laid out per component (RRRR… GGGG… BBBB…), while the output must be interleaved per pixel (rgbrgbrgbrgb…). Still, with the help of a few useful ARM64 instructions and LLVM’s excellent backend optimizer, the result is surprisingly good.

Another important ingredient in making this fast is CHEATING:

- For skybox rendering, the sRGB transfer function can be simplified to a power function powf(1.0/2.2) without visually noticeable difference
- powf(1.0/2.2) is close enough to powf(1.0/2.0), which is just sqrt(), i.e. a single CPU instruction, voila!
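
To get a feeling for how much this shortcut deviates from the exact transfer function, the sketch below samples both curves over [0, 1] and reports the largest absolute difference. The function names are made up for this check, and the piecewise encoder is the same one used in V0.

```rust
/// Exact piecewise sRGB encoding, as in the V0 listing.
pub fn srgb_encode(x: f32) -> f32 {
    if x <= 0.0031308 {
        12.92 * x
    } else {
        1.055 * x.powf(1.0 / 2.4) - 0.055
    }
}

/// Largest absolute difference between the exact encoding and sqrt(x),
/// sampled at `samples + 1` evenly spaced points in [0, 1].
pub fn max_abs_error_vs_sqrt(samples: usize) -> f32 {
    let mut max_err = 0.0f32;
    for k in 0..=samples {
        let x = k as f32 / samples as f32;
        let err = (srgb_encode(x) - x.sqrt()).abs();
        max_err = max_err.max(err);
    }
    max_err
}
```

The difference stays within a few hundredths of the output range, which is in line with the claim that the substitution is hard to notice on a smooth sky gradient.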

It is also convenient to inject a bit of noise into this post-processing stage to reduce visible banding (before and after debanding):

The combined and SIMDified version of post-processing looks like this:

pub fn map(&self, r: &[f32], g: &[f32], b: &[f32],
  texels24: &mut [u8], y: usize) {
  let mut r_ptr: *const f32 = r.as_ptr();
  let mut g_ptr: *const f32 = g.as_ptr();
  let mut b_ptr: *const f32 = b.as_ptr();
  let mut output_ptr: *mut u8 = texels24.as_mut_ptr();
  let steps: usize = r.len() / 4;
  let zero: F32x4 = F32x4::splat(0.0);
  let one: F32x4 = F32x4::splat(1.0);
  let to_255: F32x4 = F32x4::splat(255.0);
  let exposure: F32x4 = self.exposure;
  let luma_weights_r: F32x4 = self.luma_weights_r;
  let luma_weights_g: F32x4 = self.luma_weights_g;
  let luma_weights_b: F32x4 = self.luma_weights_b;
  let inv_white_point2: F32x4 = self.inv_white_point2;
  let noise_r: F32x4 = F32x4::load(NOISE_TABLE[(y + 0) % 16]);
  let noise_g: F32x4 = F32x4::load(NOISE_TABLE[(y + 1) % 16]);
  let noise_b: F32x4 = F32x4::load(NOISE_TABLE[(y + 2) % 16]);
  for _idx in 0..steps {
    // Load inputs in sRGB primaries with a linear gamma ramp
    let r: F32x4 = F32x4::load(unsafe { *(r_ptr as *const [f32; 4]) });
    let g: F32x4 = F32x4::load(unsafe { *(g_ptr as *const [f32; 4]) });
    let b: F32x4 = F32x4::load(unsafe { *(b_ptr as *const [f32; 4]) });
    // Expose
    let re: F32x4 = r * exposure;
    let ge: F32x4 = g * exposure;
    let be: F32x4 = b * exposure;
    // Calculate luminance: dot(rgb, luma_weights)
    let luma: F32x4 = re * luma_weights_r + ge * luma_weights_g + be * luma_weights_b;
    // Calculate white scale: (1 + luma / white_point^2) / (1 + luma)
    let scale: F32x4 = luma.fma(inv_white_point2, one) / (one + luma);
    // Map to SDR
    let rt: F32x4 = re * scale;
    let gt: F32x4 = ge * scale;
    let bt: F32x4 = be * scale;
    // Gamma-correction: v = v^(1.0/2.0)
    let rc: F32x4 = rt.sqrt();
    let gc: F32x4 = gt.sqrt();
    let bc: F32x4 = bt.sqrt();
    // Apply some noise for dithering
    let r_final: F32x4 = rc + noise_r;
    let g_final: F32x4 = gc + noise_g;
    let b_final: F32x4 = bc + noise_b;
    // Clamp the values to [0.0, 1.0] and convert to [0.0, 255.0]
    let r_out: F32x4 = r_final.min(one).max(zero) * to_255;
    let g_out: F32x4 = g_final.min(one).max(zero) * to_255;
    let b_out: F32x4 = b_final.min(one).max(zero) * to_255;
    // Convert to integers [0, 255]
    let r_u32: [u32; 4] = r_out.to_u32().store();
    let g_u32: [u32; 4] = g_out.to_u32().store();
    let b_u32: [u32; 4] = b_out.to_u32().store();
    // Store the output texels
    unsafe {
      *output_ptr.add(0) = r_u32[0] as u8;
      *output_ptr.add(1) = g_u32[0] as u8;
      *output_ptr.add(2) = b_u32[0] as u8;
      *output_ptr.add(3) = r_u32[1] as u8;
      *output_ptr.add(4) = g_u32[1] as u8;
      *output_ptr.add(5) = b_u32[1] as u8;
      *output_ptr.add(6) = r_u32[2] as u8;
      *output_ptr.add(7) = g_u32[2] as u8;
      *output_ptr.add(8) = b_u32[2] as u8;
      *output_ptr.add(9) = r_u32[3] as u8;
      *output_ptr.add(10) = g_u32[3] as u8;
      *output_ptr.add(11) = b_u32[3] as u8;
    };
    // Advance the input/output pointers
    r_ptr = unsafe { r_ptr.add(4) };
    g_ptr = unsafe { g_ptr.add(4) };
    b_ptr = unsafe { b_ptr.add(4) };
    output_ptr = unsafe { output_ptr.add(12) };
  }
}

Again, the main loop becomes a branchless wall of 4-wide vector operations.

Result: this change pushes the time down to ~4ms. At this point, the Hosek-Wilkie skybox can be regenerated comfortably every frame, allowing smooth real-time Sun movement.
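
For reference, here is a scalar sketch of what a single SIMD lane computes in the loop above. It is written in C++ rather than Rust purely for illustration, and several details are assumptions that are not visible in the snippet: the Rec. 709 luma weights, the 255.0 scale factor, truncation when converting to an integer, a single noise value shared by all three channels, and the names Rgb8 and tone_map_pixel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct Rgb8 {
    std::uint8_t r, g, b;
};

// Scalar model of one lane of the vectorized post-processing loop.
// Assumed constants (not shown in the snippet above): Rec. 709 luma weights
// and a 255.0 scale factor for the final conversion.
inline Rgb8 tone_map_pixel(float r, float g, float b,
                           float exposure, float white_point, float noise = 0.0f)
{
    assert(white_point > 0.0f);
    // Expose
    const float re = r * exposure;
    const float ge = g * exposure;
    const float be = b * exposure;
    // Luminance: dot(rgb, luma_weights)
    const float luma = re * 0.2126f + ge * 0.7152f + be * 0.0722f;
    // White scale: (1 + luma / white_point^2) / (1 + luma)
    const float scale = (1.0f + luma / (white_point * white_point)) / (1.0f + luma);
    // Map to SDR, gamma-correct with v^(1/2), dither, clamp and quantize
    const auto finish = [scale, noise](float v) -> std::uint8_t {
        const float corrected = std::sqrt(v * scale) + noise;
        const float clamped = std::max(std::min(corrected, 1.0f), 0.0f);
        return static_cast<std::uint8_t>(clamped * 255.0f); // truncates toward zero
    };
    return Rgb8{finish(re), finish(ge), finish(be)};
}
```

A scalar version like this is handy as a ground truth when checking the vectorized loop: both should agree on every texel as long as the same constants and noise values are fed in.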

Summary

Overall optimization progress from version to version is summarized in the table below:

Version	Time
Initial version	~46ms
Less wasteful computation	~28ms
Per-pixel SIMD	~26ms
Per-row SIMD	~13ms
FMA and raw ptr arithmetic	~12ms
(cos(θ), γ, cos(γ)) in SIMD	~11ms
Post-processing in SIMD	~4ms

A back-of-the-envelope calculation (~4ms spread over the 1024×512 = 524,288 pixels of the benchmark image) gives an average wall time of ~8ns to produce each pixel. Considering how much math is being squeezed into these 8 nanoseconds, it’s almost miraculous what modern CPU cores can sustain. Even though my base M1 MacBook is now five years old, it feels like well-written software can run astronomically fast on these machines.

Of course, it’s also tempting to ask: why not just throw a dozen teraflops of GPU compute at this embarrassingly parallel problem and call it a day? Yep, in any practical setting that should likely be the default approach. But for recreational and educational programming, where’s the fun and challenge in that?

November 1, 2017 (updated May 8, 2018)

Cryptocurrency mining on iOS devices

XMR-STAK-CPU running on iPad

Disclaimer

This post should not be treated as advice to use iOS devices as cryptocurrency mining machines. That can destroy the battery, fry the CPU/SoC, ruin the system’s responsiveness, etc. This is purely academic research driven by sheer curiosity.

Reasons

Since I got my hands on the latest iPad, I was eager to write something to check the horsepower of that machine. Thanks to the recent bubble in cryptocurrency prices, this ridiculous idea appeared. Of course, there’s no sense in trying to mine bitcoins or similar currencies, since CPUs can’t compete with specialized solutions like ASICs in mining those. On the other hand, cryptocurrencies based on CryptoNote, like Monero (ticker XMR), have memory-bound properties which make them hard to crack on tiny dumb devices. That brings at least some amount of sense into solving these crypto puzzles on CPUs. I chose the XMR-STAK-CPU mining software, which is available as source code, to try to run on iOS, first in a simulator and then on a real device.
As part of this porting experiment, I aimed to keep the original source code untouched and to use the files right out of the repository. Oddly enough, the endeavor was successful and within a few days I got a complete solution. The challenges of porting and the outcome are described below.

Challenges

SSE vs. NEON
The source code of xmr-stak-cpu contains tons of SIMD instructions. Fortunately, there’re no inline assembler instructions and all calls are made through _mm_XXX intrinsics. That means it’s possible to mimic these calls with C-style functions and macros. The same applies to the data type definitions.
Thanks to the SSE2NEON project, the lion’s share of the work was already done and I basically only needed to fiddle properly with the source code. A trick with a precompiled header was used to do it: when the source was built for a real iOS device, SSE2 was mimicked with NEON and the original includes (<x86intrin.h>, <intrin.h>, <immintrin.h>) were suppressed by defining their include guards in advance. Nothing was substituted for iOS Simulator builds, since the simulator runs on an x86 machine and there are no NEON instructions there.

But of course, that could not go absolutely smoothly. Several x86 intrinsics were missing in SSE2NEON: _mm_prefetch, _mm_set_epi64x, _mm_cvtsi128_si64, _mm_aesenc_si128 and _mm_aeskeygenassist_si128.

_mm_set_epi64x and _mm_cvtsi128_si64 are trivial to implement on NEON with 1:1 mapping to SSE.

_mm_prefetch is a bit trickier, since Intel and ARM take different approaches to controlling the prefetch instruction and there’s no 1:1 mapping between them. I ended up with the __builtin_prefetch(p) intrinsic to mimic _mm_prefetch, which is only a rough approximation.
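
A minimal sketch of what such a shim might look like, assuming a GCC or Clang toolchain (which provide __builtin_prefetch). The name shim_mm_prefetch and the helper sum_with_prefetch are hypothetical and exist only for this illustration. Dropping the SSE locality hint entirely is exactly the "rough approximation" part: the builtin is called with its defaults (prefetch for reading, high temporal locality).

```cpp
#include <cstddef>

// Hypothetical stand-in for _mm_prefetch(p, hint): the SSE hint is ignored
// and the compiler builtin is used with its default arguments.
static inline void shim_mm_prefetch(const void *p, int /*sse_hint*/) noexcept
{
    __builtin_prefetch(p);
}

// Toy consumer: prefetches a few elements ahead while summing an array.
// A prefetch is only a hint, so the result is the same with or without it.
inline long sum_with_prefetch(const int *data, std::size_t n)
{
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            shim_mm_prefetch(data + i + 16, 0);
        sum += data[i];
    }
    return sum;
}
```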

The most interesting instructions were the cryptographic _mm_aesenc_si128 and _mm_aeskeygenassist_si128. Intel and ARM have a different idea of how to split the AES encryption into a set of commands. Here’s a good visualization of the issue:

It takes a set of instructions to mimic _mm_aesenc_si128 on ARM. The trick is to eliminate the AddRoundKey stage of vaeseq_u8() by providing a key of zeros and to add the actual key at the end by manually doing an XOR operation. This yields 3 instructions instead of one on SSE, but the semantics remain the same. Here’s the code:

static inline __attribute__((always_inline))
__m128i _mm_aesenc_si128( __m128i v, __m128i rkey )
{
    const __attribute__((aligned(16))) __m128i zero = {0};
    return veorq_u8( vaesmcq_u8( vaeseq_u8(v, zero) ), rkey );
}

AFAIK there’s no support for encryption key expansion in NEON, so _mm_aeskeygenassist_si128 had to be implemented manually. I used the software implementation from xmr-stak-cpu’s soft_aes.c and packed it to fake a single instruction call:

static inline __attribute__((always_inline))
__m128i _mm_aeskeygenassist_si128(__m128i key, const int rcon)
{
    static const uint8_t sbox[256] = {
    0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5, 0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76,
    0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0, 0xad, 0xd4, 0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0,
    0xb7, 0xfd, 0x93, 0x26, 0x36, 0x3f, 0xf7, 0xcc, 0x34, 0xa5, 0xe5, 0xf1, 0x71, 0xd8, 0x31, 0x15,
    0x04, 0xc7, 0x23, 0xc3, 0x18, 0x96, 0x05, 0x9a, 0x07, 0x12, 0x80, 0xe2, 0xeb, 0x27, 0xb2, 0x75,
    0x09, 0x83, 0x2c, 0x1a, 0x1b, 0x6e, 0x5a, 0xa0, 0x52, 0x3b, 0xd6, 0xb3, 0x29, 0xe3, 0x2f, 0x84,
    0x53, 0xd1, 0x00, 0xed, 0x20, 0xfc, 0xb1, 0x5b, 0x6a, 0xcb, 0xbe, 0x39, 0x4a, 0x4c, 0x58, 0xcf,
    0xd0, 0xef, 0xaa, 0xfb, 0x43, 0x4d, 0x33, 0x85, 0x45, 0xf9, 0x02, 0x7f, 0x50, 0x3c, 0x9f, 0xa8,
    0x51, 0xa3, 0x40, 0x8f, 0x92, 0x9d, 0x38, 0xf5, 0xbc, 0xb6, 0xda, 0x21, 0x10, 0xff, 0xf3, 0xd2,
    0xcd, 0x0c, 0x13, 0xec, 0x5f, 0x97, 0x44, 0x17, 0xc4, 0xa7, 0x7e, 0x3d, 0x64, 0x5d, 0x19, 0x73,
    0x60, 0x81, 0x4f, 0xdc, 0x22, 0x2a, 0x90, 0x88, 0x46, 0xee, 0xb8, 0x14, 0xde, 0x5e, 0x0b, 0xdb,
    0xe0, 0x32, 0x3a, 0x0a, 0x49, 0x06, 0x24, 0x5c, 0xc2, 0xd3, 0xac, 0x62, 0x91, 0x95, 0xe4, 0x79,
    0xe7, 0xc8, 0x37, 0x6d, 0x8d, 0xd5, 0x4e, 0xa9, 0x6c, 0x56, 0xf4, 0xea, 0x65, 0x7a, 0xae, 0x08,
    0xba, 0x78, 0x25, 0x2e, 0x1c, 0xa6, 0xb4, 0xc6, 0xe8, 0xdd, 0x74, 0x1f, 0x4b, 0xbd, 0x8b, 0x8a,
    0x70, 0x3e, 0xb5, 0x66, 0x48, 0x03, 0xf6, 0x0e, 0x61, 0x35, 0x57, 0xb9, 0x86, 0xc1, 0x1d, 0x9e,
    0xe1, 0xf8, 0x98, 0x11, 0x69, 0xd9, 0x8e, 0x94, 0x9b, 0x1e, 0x87, 0xe9, 0xce, 0x55, 0x28, 0xdf,
    0x8c, 0xa1, 0x89, 0x0d, 0xbf, 0xe6, 0x42, 0x68, 0x41, 0x99, 0x2d, 0x0f, 0xb0, 0x54, 0xbb, 0x16};
    uint32_t X1 = _mm_cvtsi128_si32(_mm_shuffle_epi32(key, 0x55));
    uint32_t X3 = _mm_cvtsi128_si32(_mm_shuffle_epi32(key, 0xFF));
    for( int i = 0; i < 4; ++i ) {
        ((uint8_t*)&X1)[i] = sbox[ ((uint8_t*)&X1)[i] ];
        ((uint8_t*)&X3)[i] = sbox[ ((uint8_t*)&X3)[i] ];
    }
    return _mm_set_epi32(((X3 >> 8) | (X3 << 24)) ^ rcon, X3, ((X1 >> 8) | (X1 << 24)) ^ rcon, X1);
}

cpuid
xmr-stak-cpu uses the cpuid instruction to determine whether SSE and AES instructions are supported on the CPU. The problem was that the <cpuid.h> shipped with Xcode doesn’t have an include guard, so it’s not possible to suppress its inclusion as was done with <x86intrin.h>. Instead, <cpuid.h> had to be faked entirely by fiddling with the header search paths. Here’s the fake header that makes xmr-stak-cpu believe that the ARM chip supports everything:

#pragma once
#include <stdint.h> // uint32_t / int32_t used by the ARM stub below
#include "TargetConditionals.h"
#if TARGET_OS_SIMULATOR
#define __cpuid_count(__level, __count, __eax, __ebx, __ecx, __edx) \
    __asm(" xchgq %%rbx,%q1\n" \
          " cpuid\n" \
          " xchgq %%rbx,%q1" \
        : "=a"(__eax), "=r" (__ebx), "=c"(__ecx), "=d"(__edx) \
        : "0"(__level), "2"(__count))
#else
static inline __attribute__((always_inline))
void __cpuid_count(uint32_t __level, int32_t __count,
                   int32_t &__eax, int32_t &__ebx, int32_t &__ecx, int32_t &__edx)
{
    __eax = __ebx = __ecx = __edx = -1;
}
#endif

stdout capture
xmr-stak-cpu is console-based software and I wanted to keep it that way, regardless of what Apple thinks about stdout in iOS. A simple dup2 syscall does the job: stdout can be redirected into a pipe, while the other end of that pipe is connected to some UI control like UITextView. Here’s the snippet:

let pipe = Pipe()
var fileHandle: FileHandle!
var source: DispatchSourceRead!

func setupStdout() {
    fileHandle = pipe.fileHandleForReading
    fflush(stdout)
    dup2(pipe.fileHandleForWriting.fileDescriptor, fileno(stdout))
    setvbuf(stdout, nil, _IONBF, 0)
    source = DispatchSource.makeReadSource(fileDescriptor: fileHandle.fileDescriptor,
                                           queue: DispatchQueue.global())
    source.setEventHandler {
        self.readStdout()
    };
    source.resume()
}

func readStdout() {
    let buffer = malloc(4096)!
    let read_ret = read(fileHandle.fileDescriptor, buffer, 4096)
    if read_ret > 0 {
        let data = UnsafeBufferPointer(start: buffer.assumingMemoryBound(to: UInt8.self),
                                       count: read_ret)
        if let str = String(bytes: data, encoding: String.Encoding.utf8) {
            DispatchQueue.main.async {
                self.acceptLog(str: str)
            }
        }
    }
    free(buffer)
}
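
The same pipe-plus-dup2 mechanism can be tried outside of iOS. Below is a minimal POSIX sketch in C++, assuming a synchronous use case: unlike the Swift snippet above, nothing drains the pipe while the callback runs, so it only suits output smaller than the pipe buffer. The helper name capture_stdout is hypothetical.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <unistd.h>

// Runs `fn` with stdout redirected into a pipe and returns what it printed.
// Hypothetical helper for illustration; stdout is restored before returning.
inline std::string capture_stdout(void (*fn)())
{
    int fds[2];
    if (pipe(fds) != 0)
        return std::string();
    std::fflush(stdout);
    const int saved = dup(STDOUT_FILENO);      // remember the real stdout
    if (saved < 0) {
        close(fds[0]);
        close(fds[1]);
        return std::string();
    }
    dup2(fds[1], STDOUT_FILENO);               // stdout now writes into the pipe
    close(fds[1]);
    fn();
    std::fflush(stdout);                       // push buffered text into the pipe
    dup2(saved, STDOUT_FILENO);                // restore; closes the last write end
    close(saved);
    std::string out;
    char buf[256];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0)  // read until EOF
        out.append(buf, static_cast<std::size_t>(n));
    close(fds[0]);
    return out;
}
```

In a long-running application the read side has to be serviced concurrently, which is what the DispatchSource in the Swift version takes care of.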

Unlimited execution in background
That’s what Apple doesn’t like at all and tries to prevent at any cost. Of course, that makes sense from the perspective of battery life, but when a device is connected to a power source these restrictions look ridiculous. After all, that’s my device and I want it to be able to perform any computations, no matter how time-consuming and complex they are. There’s no universal solution for this problem, but at least one particular combination worked for me on iOS 11:
– Creating a background task upon switching to background mode via UIApplication.shared.beginBackgroundTask and then creating the next task in the expiration handler.
– Playing an empty sound file in an infinite loop at the same time. I used this solution as a starting point and made a few performance-wise tweaks afterwards.
This hack lets the application run indefinitely and prevents it from being put to sleep and from having its network connections closed. During my tests, it was absolutely fine to leave the miner app working for 12+ hours, and that didn’t lead to any terminations, suspensions or dropped connections.

Results

I benchmarked the performance on three Macs from 2012 and two iOS devices. To be fair, all of these Macs have “notebook-level” hardware and it wouldn’t be correct to make assumptions about “desktop-level” Intel CPUs based on the gathered data. The tests were run with the low_power_mode=false and no_prefetch=true flags, for at least 15 minutes.
The results were surprising: despite the use of an almost brute-force method of instruction translation and the lack of any hardware-specific optimizations for Apple CPUs, the iPad 2017 showed pretty solid performance. The A9 shows the same hashrate as the Core i5-3427U, which itself cost $225 when it was introduced in 2012 (the A9 was introduced in 2015) and has a TDP of 17W (the A9 has about 4W). This graph also clearly shows the memory-bound limitations of CryptoNote.

The source code and build instructions are available in this repository.

May 27, 2017

CoreFoundation and memory allocators – why bother?

Any seasoned C++ programmer knows that object allocation does cost CPU cycles, and may cost lots of them. The language itself provides various object allocation types. Such a mess might surprise folks who use other, more user-friendly languages, especially languages with garbage collection. But that’s only the beginning. Any C++ Jedi knows about custom allocation strategies, such as memory pools, buddy allocation and grow-only allocators, can write a general-purpose memory allocator (probably quite a crappy one) and so forth.
Does it help? Sometimes a custom allocator allows tuning up an application’s performance by exploiting specific knowledge about the properties of the system.
Does it mean that it might be a good idea to write your own malloc() implementation? Absolutely not. It’s a good challenge for educational purposes, but it will almost never bring any performance benefits.

So what about Cocoa in this aspect?

At the Foundation level, Objective-C once had some options to customize the allocation process via NSZone, but they were discarded upon the transition to ARC. Swift, on the other hand, AFAIK doesn’t even pretend to provide any allocation options.

At the CoreFoundation level, many APIs accept a pointer to a memory allocator (CFAllocatorRef) as the first parameter. kCFAllocatorDefault or NULL is passed to use the default allocator, i.e. CFAllocatorGetDefault(), i.e. kCFAllocatorSystemDefault in most cases. CoreFoundation also provides a set of APIs to manipulate the allocation process:
– CFAllocatorCreate
– CFAllocatorAllocate
– CFAllocatorReallocate
– CFAllocatorDeallocate
– CFAllocatorGetPreferredSizeForSize
– CFAllocatorGetContext
The overall mechanics around CFAllocatorRef are quite well documented and, even better, it’s always possible to take a look at the source code of CoreFoundation. So, it’s absolutely OK to use a custom memory allocator at the CoreFoundation level.

“What for?” might be a reasonable question here. Introducing any additional low-level components also implies some maintenance burden in the future, so there should be some heavy pros to justify bothering with custom memory allocation. Traditionally, the Achilles’ heel of general-purpose memory allocators is dealing with many allocations and subsequent deallocations of small amounts of memory. There are plenty of optimization techniques developed for such tasks, so why not check them on Cocoa?

Suppose we want to spend as little time on memory allocation as possible. Nothing is faster than allocating memory on the stack, obviously. But there are some issues with stack-based allocation:
• The stack is not limitless. A typical program, which does nothing tricky, is very unlikely to hit the stack limit, but that’s no reason to carelessly use alloca() everywhere: it will strike back eventually.
• Deallocating stack-based memory in an arbitrary order is painful and requires some time to manage. In a perfect world, however, it would be great to have O(1) time complexity for both allocation and deallocation.
• All allocated objects must be freed before the allocator goes out of scope, otherwise an access to the “leaked” object will lead to undefined behavior.
To mitigate these issues, a compromise strategy exists:
• Use stack memory when possible, fall back to a general-purpose memory allocator otherwise.
• Do advance the stack pointer on allocations, don’t move it back upon deallocations.
In that case, allocations will be blazingly fast most of the time, while it’s still possible to process requests for big memory chunks. As for the third issue, it falls onto the developer, since the memory allocator can only help with some diagnostics. It’s incredibly easy to write such a memory allocator; the main steps are described below.
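
Stripped of everything CoreFoundation-specific, the compromise strategy can be sketched as a small grow-only arena with a heap fallback. This is an illustration under assumptions rather than the CFStackAllocator described below: the class name BumpArena, the 1 KB capacity and the rounding of every request up to the maximum fundamental alignment are all choices made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical grow-only arena: serves requests from an inline buffer while
// space lasts, then falls back to malloc(). Freed arena space is never reused.
class BumpArena {
public:
    // Diagnostic only: everything must be returned before the arena dies.
    ~BumpArena() { assert(live_objects() == 0); }

    void *allocate(std::size_t size) {
        if (size == 0)
            size = 1;                       // keep every returned pointer distinct
        if (size <= kCapacity - used_) {
            void *p = buffer_ + used_;
            used_ += (size + kAlign - 1) / kAlign * kAlign;  // advance, keep alignment
            ++arena_objects_;
            return p;
        }
        ++heap_objects_;
        return std::malloc(size);           // too big for the arena
    }

    void deallocate(void *p) {
        if (owns(p)) {
            --arena_objects_;               // O(1): the space is simply abandoned
        } else {
            std::free(p);
            --heap_objects_;
        }
    }

    bool owns(const void *p) const {
        const auto addr = reinterpret_cast<std::uintptr_t>(p);
        const auto begin = reinterpret_cast<std::uintptr_t>(buffer_);
        return addr >= begin && addr < begin + kCapacity;
    }

    int live_objects() const { return arena_objects_ + heap_objects_; }

private:
    static constexpr std::size_t kAlign = alignof(std::max_align_t);
    static constexpr std::size_t kCapacity = 1024;
    static_assert(kCapacity % kAlign == 0, "capacity must be a multiple of the alignment");

    alignas(std::max_align_t) char buffer_[kCapacity];
    std::size_t used_ = 0;
    int arena_objects_ = 0;
    int heap_objects_ = 0;
};
```

Both operations are O(1) on the fast path, and the two counters make a leaked object detectable when the arena is destroyed, which is about as much help as an allocator of this kind can offer.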

The stack-based allocator is conceptually a classic C++ RAII object. It’s assumed that the client source code will be compiled as C++ or as Objective-C++. The only public method, apart from the constructor and the destructor, provides a CFAllocatorRef pointer to pass into CoreFoundation APIs. The internal state of the allocator consists of the stack itself, a counter of the free space left in it, two allocation counters for diagnostic purposes and the CFAllocatorRef pointer.

struct CFStackAllocator
{
  CFStackAllocator() noexcept;
  ~CFStackAllocator() noexcept;
  inline CFAllocatorRef Alloc() const noexcept { return m_Alloc; }
private:
... 
  static const int m_Size = 4096 - 16;
  char m_Buffer[m_Size];
  int m_Left;
  short m_StackObjects;
  short m_HeapObjects;
  const CFAllocatorRef m_Alloc;
};

To initialize the object, the constructor fills counters with defaults and creates a CFAllocatorRef frontend. Only two callbacks are required to build a working CFAllocatorRef: CFAllocatorAllocateCallBack and CFAllocatorDeallocateCallBack.

CFStackAllocator::CFStackAllocator() noexcept:
  m_Left(m_Size),
  m_StackObjects(0),
  m_HeapObjects(0),
  m_Alloc(Construct())
{}

CFAllocatorRef CFStackAllocator::Construct() noexcept
{
  CFAllocatorContext context = {
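    0,         // version
    this,      // info: handed back to every callback as _info
    nullptr,   // retain
    nullptr,   // release
    nullptr,   // copyDescription
    DoAlloc,   // allocate
    nullptr,   // reallocate
    DoDealloc, // deallocate
    nullptr    // preferredSize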
  };
  return CFAllocatorCreate(kCFAllocatorUseContext, &context);
}

To allocate a memory block, it’s enough to check whether the requested block fits into the remaining part of the stack buffer. If it does, the allocation process itself consists only of updating the free space counter. Otherwise, the allocation falls back to the generic malloc().

void *CFStackAllocator::DoAlloc
  (CFIndex _alloc_size, CFOptionFlags _hint, void *_info)
{
  auto me = (CFStackAllocator *)_info;
  if( _alloc_size <= me->m_Left ) {
    void *v = me->m_Buffer + m_Size - me->m_Left;
    me->m_Left -= _alloc_size;
    me->m_StackObjects++;
    return v;
  }
  else {
    me->m_HeapObjects++;
    return malloc(_alloc_size);
  }
}

To deallocate a previously allocated memory block, it’s enough to check whether that allocation was dispatched to malloc() and to call free() accordingly.

void CFStackAllocator::DoDealloc(void *_ptr, void *_info)
{
  auto me = (CFStackAllocator *)_info;
  if( _ptr < me->m_Buffer || _ptr >= me->m_Buffer + m_Size ) {
    free(_ptr);
    me->m_HeapObjects--;
  }
  else {
    me->m_StackObjects--;
  }
}

To measure the performance difference between a default Objective-C allocator, a default CoreFoundation allocator and the CFStackAllocator, the following task was executed:
Given N UTF-8 strings, calculate hash values of derived strings which are lowercase and normalized.

An Objective-C variant of the computation:

unsigned long Hash_NSString( const vector<string> &_data )
{
  unsigned long hash = 0;
  @autoreleasepool {
    for( const auto &s: _data ) {
      const auto nsstring = [[NSString alloc] initWithBytes:s.data()
                                                     length:s.length()
                                                   encoding:NSUTF8StringEncoding];
      hash += nsstring.lowercaseString.decomposedStringWithCanonicalMapping.hash;
    }
  }
  return hash;
}

A CoreFoundation counterpart of this task:

unsigned long Hash_CFString( const vector<string> &_data )
{
  unsigned long hash = 0;
  const auto locale = CFLocaleCopyCurrent();
  for( const auto &s: _data ) {
    const auto cfstring = CFStringCreateWithBytes(0,
                                                  (UInt8*)s.data(),
                                                  s.length(),
                                                  kCFStringEncodingUTF8,
                                                  false);
    const auto cfmstring = CFStringCreateMutableCopy(0, 0, cfstring);
    CFStringLowercase(cfmstring, locale);
    CFStringNormalize(cfmstring, kCFStringNormalizationFormD);
    hash += CFHash(cfmstring);
    CFRelease(cfmstring);
    CFRelease(cfstring);
  }
  CFRelease(locale);
  return hash;
}

A CoreFoundation counterpart using a stack-based memory allocation:

unsigned long Hash_CFString_SA( const vector<string> &_data )
{
  unsigned long hash = 0;
  const auto locale = CFLocaleCopyCurrent();
  for( const auto &s: _data ) {
    CFStackAllocator alloc;
    const auto cfstring = CFStringCreateWithBytes(alloc.Alloc(),
                                                  (UInt8*)s.data(),
                                                  s.length(),
                                                  kCFStringEncodingUTF8,
                                                  false);
    const auto cfmstring = CFStringCreateMutableCopy(alloc.Alloc(), 0, cfstring);
    CFStringLowercase(cfmstring, locale);
    CFStringNormalize(cfmstring, kCFStringNormalizationFormD);
    hash += CFHash(cfmstring);
    CFRelease(cfmstring);
    CFRelease(cfstring);
  }
  CFRelease(locale);
  return hash;
}

And here are the results. These functions were called with the same data set consisting of 1,000,000 randomly generated strings with varying lengths.

Across the tested range of data sets, the CoreFoundation+CFStackAllocator variant is 20%-50% faster than the pure Objective-C implementation and 7%-20% faster than the pure CoreFoundation implementation. It’s easy to observe that the Δ between timings is almost constant and represents the difference between the times spent in management tasks. To be precise, the time spent in management tasks in the CoreFoundation+CFStackAllocator variant is ~800ms less than in the Objective-C variant and ~270ms less than in the pure CoreFoundation variant. Divided by the number of strings (1,000,000), this Δ is ~800ns and ~270ns per string respectively.
The stack-based memory allocation is a micro-optimization, of course, but it might be very useful in a time-critical execution path. The complete source code of the CFStackAllocator and of the benchmarking is available in this repository: https://github.com/mikekazakov/CFStackAllocator.