Arm v7 instruction set provides a fast instruction for inverse reciprocal square root calculation `vrsqrte_f32`

for two simultaneous approximations and `vrsqrteq_f32`

for four approximations. (The scalar variant `vrsqrtes_f32`

is only available on Arm64 v8.2).

Then the result can be simply calculated by `x * vrsqrte_f32(x);`

, which has better than 0.33% relative accuracy over the whole range of positive values x. See https://www.mdpi.com/2079-3197/9/2/21/pdf

ARM NEON instruction FRSQRTE gives 8.25 correct bits of the result.

At `x==0`

vrsqrtes_f32(x) == Inf, so x*vrsqrtes_f32(x) would be NaN.

If the value of x==0 is unavoidable, the optimal two instruction sequence needs a bit more adjustment:

```
float sqrtest(float a) {
// need to "transfer" or "convert" the scalar input
// to a vector of two
// - optimally we would not need an instruction for that
// but we would just let the processor calculate the instruction
// for all the lanes in the register
float32x2_t a2 = vdup_n_f32(a);
// next we create a mask that is all ones for the legal
// domain of 1/sqrt(x)
auto is_legal = vreinterpret_f32_u32(vcgt_f32(a2, vdup_n_f32(0.0f)));
// calculate two reciprocal estimates in parallel
float32x2_t a2est = vrsqrte_f32(a2);
// we need to mask the result, so that effectively
// all non-legal values of a2est are zeroed
a2est = vand_u32(is_legal, a2est);
// x * 1/sqrt(x) == sqrt(x)
a2 = vmul_f32(a2, a2est);
// finally we get only the zero lane of the result
// discarding the other half
return vget_lane_f32(a2, 0);
}
```

Surely this method will have almost twice the throughput with

```
void sqrtest2(float &a, float &b) {
float32x2_t a2 = vset_lane_f32(b, vdup_n_f32(a), 1);
float32x2_t is_legal = vreinterpret_f32_u32(vcgt_f32(a2, vdup_n_f32(0.0f)));
float32x2_t a2est = vrsqrte_f32(a2);
a2est = vand_u32(is_legal, a2est);
a2 = vmul_f32(a2, a2est);
a = vget_lane_f32(a2,0);
b = vget_lane_f32(a2,1);
}
```

And even better, if you can work directly with `float32x2_t`

or `float32x4_t`

inputs and outputs.

```
float32x2_t sqrtest2(float32x2_t a2) {
float32x2_t is_legal = vreinterpret_f32_u32(vcgt_f32(a2, vdup_n_f32(0.0f)));
float32x2_t a2est = vrsqrte_f32(a2);
a2est = vand_u32(is_legal, a2est);
return vmul_f32(a2, a2est);
}
```

This implementation gives `sqrtest2(1) == 0.998`

and `sqrtest2(400) == 19.97`

(tested on MacBook M1 with arm64). Being branchless and LUT free, this has likely a constant execution time, assuming that all the instructions execute in constant number of cycles.