Numeric Derivatives

By the definition from calculus, the derivative of a function $f(x)$ can be written as:

$$Df(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}$$

Forward Differences

Computers cannot evaluate limits, but we can approximate the derivative by substituting a small finite value for $h$:

$$Df(x) \approx \frac{f(x+h) - f(x)}{h}$$

The formula above is the simplest and most basic form of numeric differentiation; it is known as the Forward Difference formula. So how do we construct a numerically differentiated version of Rat43Analytic (Rat43) in Ceres Solver? It takes two steps:

  • Define a functor that, given the parameter values, computes the residual for a given $(x, y)$.

  • Construct a CostFunction by wrapping an instance of that functor in a NumericDiffCostFunction.

The implementation is shown below. The only thing the user needs to do is ensure that the residual is computed correctly and efficiently.

#include <cmath>  // exp, pow

struct Rat43CostFunctor {
  Rat43CostFunctor(const double x, const double y) : x_(x), y_(y) {}

  bool operator()(const double* parameters, double* residuals) const {
    const double b1 = parameters[0];
    const double b2 = parameters[1];
    const double b3 = parameters[2];
    const double b4 = parameters[3];
    residuals[0] = b1 * pow(1.0 + exp(b2 -  b3 * x_), -1.0 / b4) - y_;
    return true;
  }

  const double x_;
  const double y_;
};

CostFunction* cost_function =
  new NumericDiffCostFunction<Rat43CostFunctor, FORWARD, 1, 4>(
    new Rat43CostFunctor(x, y));
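For context, here is a minimal sketch of how such a cost function might be used in a problem. The data arrays x_data and y_data, the count kNumObservations, and the initial values of b are illustrative assumptions, not part of the original example:

// Hypothetical usage sketch; x_data, y_data, kNumObservations, and the
// initial values of b are assumptions for illustration.
double b[4] = {700.0, 5.0, 0.75, 1.3};  // initial guess for b1..b4
ceres::Problem problem;
for (int i = 0; i < kNumObservations; ++i) {
  problem.AddResidualBlock(
      new ceres::NumericDiffCostFunction<Rat43CostFunctor, ceres::FORWARD, 1, 4>(
          new Rat43CostFunctor(x_data[i], y_data[i])),
      nullptr,  // plain squared loss
      b);
}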

Before going any further, it is instructive to estimate the error in the Forward Difference formula. We do this by considering the Taylor expansion of $f$ near $x$:

$$
\begin{aligned}
f(x+h) &= f(x) + h\,Df(x) + \frac{h^2}{2!} D^2f(x) + \frac{h^3}{3!} D^3f(x) + \cdots \\
Df(x) &= \frac{f(x+h) - f(x)}{h} - \left[\frac{h}{2!} D^2f(x) + \frac{h^2}{3!} D^3f(x) + \cdots\right] \\
Df(x) &= \frac{f(x+h) - f(x)}{h} + O(h)
\end{aligned}
$$

NumericDiffCostFunction implements a generic algorithm to numerically differentiate a given functor. While the actual implementation of NumericDiffCostFunction is complex, the net result is a CostFunction that roughly looks like the following:

#include <algorithm>  // std::copy
#include <cmath>      // std::abs
#include <memory>     // std::unique_ptr

class Rat43NumericDiffForward : public SizedCostFunction<1, 4> {
 public:
  explicit Rat43NumericDiffForward(const Rat43CostFunctor* functor)
      : functor_(functor) {}
  virtual ~Rat43NumericDiffForward() {}
  virtual bool Evaluate(double const* const* parameters,
                        double* residuals,
                        double** jacobians) const {
    // Evaluate the residual at the current parameter values.
    (*functor_)(parameters[0], residuals);
    if (!jacobians) return true;
    double* jacobian = jacobians[0];
    if (!jacobian) return true;

    const double f = residuals[0];
    double parameters_plus_h[4];
    for (int i = 0; i < 4; ++i) {
      std::copy(parameters[0], parameters[0] + 4, parameters_plus_h);
      // Step size relative to the magnitude of the i-th parameter.
      const double kRelativeStepSize = 1e-6;
      const double h = std::abs(parameters[0][i]) * kRelativeStepSize;
      parameters_plus_h[i] += h;
      double f_plus;
      (*functor_)(parameters_plus_h, &f_plus);
      // Forward difference approximation of the i-th partial derivative.
      jacobian[i] = (f_plus - f) / h;
    }
    return true;
  }

 private:
  std::unique_ptr<const Rat43CostFunctor> functor_;
};

Note the choice of step size $h$ in the code above. Rather than using an absolute step size that is the same for every parameter, we use a relative step size, kRelativeStepSize = $10^{-6}$. This gives better derivative estimates than an absolute step size, but it only works when a parameter value is not close to zero. For this reason, the actual implementation of NumericDiffCostFunction uses a more sophisticated step-size selection logic, which switches to a fixed step size when the parameter is near zero.
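A rough sketch of such a hybrid rule is shown below; the constants and the fallback value are illustrative assumptions, not Ceres' actual choices:

#include <algorithm>  // std::max
#include <cmath>      // std::abs

// Illustrative sketch, not the actual Ceres logic: take a step relative
// to the parameter's magnitude, but never smaller than a fixed floor so
// that parameters at or near zero still get a usable step.
double StepSize(double x) {
  const double kRelativeStepSize = 1e-6;
  const double kMinAbsoluteStepSize = 1e-9;  // illustrative floor
  return std::max(std::abs(x) * kRelativeStepSize, kMinAbsoluteStepSize);
}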

Central Differences

The $O(h)$ error in the Forward Difference formula is acceptable but not great. A better method is the Central Difference formula:

$$Df(x) \approx \frac{f(x+h) - f(x-h)}{2h}$$

If the value of $f(x)$ is already known, the Forward Difference formula only costs one additional evaluation, $f(x+h)$, but the Central Difference formula costs two, $f(x+h)$ and $f(x-h)$, doubling the price. Is the extra evaluation worth it? To answer that question, we again compute the approximation error, this time of the Central Difference formula:

$$
\begin{aligned}
f(x+h) &= f(x) + h\,Df(x) + \frac{h^2}{2!} D^2f(x) + \frac{h^3}{3!} D^3f(x) + \frac{h^4}{4!} D^4f(x) + \cdots \\
f(x-h) &= f(x) - h\,Df(x) + \frac{h^2}{2!} D^2f(x) - \frac{h^3}{3!} D^3f(x) + \frac{h^4}{4!} D^4f(x) - \cdots \\
Df(x) &= \frac{f(x+h) - f(x-h)}{2h} + \frac{h^2}{3!} D^3f(x) + \frac{h^4}{5!} D^5f(x) + \cdots \\
Df(x) &= \frac{f(x+h) - f(x-h)}{2h} + O(h^2)
\end{aligned}
$$

The error in the Central Difference formula is $O(h^2)$: it decreases quadratically, whereas the error in the Forward Difference formula decreases only linearly. Switching to Central Differences in Ceres is a simple matter of changing a template argument:

CostFunction* cost_function =
  new NumericDiffCostFunction<Rat43CostFunctor, CENTRAL, 1, 4>(
    new Rat43CostFunctor(x, y));

But what do these error differences mean in practice? To see this, consider the problem of evaluating the derivative of the following univariate function at $x = 1.0$:

$$f(x) = \frac{e^x}{\sin x - x^2}$$

It is easy to compute that the derivative at $x = 1.0$ is $Df(1.0) = 140.73773557129658$. Using this as a reference value, we can compute the relative errors of the Forward Difference and Central Difference estimates as a function of the step size and plot them.

Reading the graph above from right to left, we observe the following:

  1. First, both methods start out with a large error, which decreases as long as the truncation error from cutting off the Taylor series dominates; but once $h$ gets small enough, the error rises sharply again as floating-point round-off starts to dominate. So we cannot keep shrinking $h$ to obtain ever better estimates of $Df$; the finite precision of floating-point arithmetic gets in the way.

  2. Forward Differences is not the best method for evaluating derivatives: as the step size decreases, Central Differences converges more quickly to a more accurate estimate. So unless a single evaluation of $f(x)$ is prohibitively expensive, prefer Central Differences.

  3. Neither method works well when $h$ is poorly chosen. These observations can be reproduced with the short program shown below.
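The function and the reference value $Df(1.0)$ in the sketch below are from the text; the sweep of step sizes is an arbitrary choice for illustration:

#include <cmath>
#include <cstdio>

double f(double x) { return std::exp(x) / (std::sin(x) - x * x); }

int main() {
  const double kReference = 140.73773557129658;  // Df(1.0), from the text
  for (int e = 2; e <= 12; ++e) {
    const double h = std::pow(10.0, -e);
    const double forward = (f(1.0 + h) - f(1.0)) / h;
    const double central = (f(1.0 + h) - f(1.0 - h)) / (2.0 * h);
    std::printf("h = 1e-%02d  forward rel. error = %.3e  central rel. error = %.3e\n",
                e,
                std::abs((forward - kReference) / kReference),
                std::abs((central - kReference) / kReference));
  }
  return 0;
}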

Ridders’ Method

So, can we get better estimates of $Df$ without requiring $h$ to be so small that floating-point round-off becomes a problem? One possible approach is to find a method whose error decreases faster than $O(h^2)$. This can be done by applying Richardson Extrapolation to the problem of differentiation, a technique also known as Ridders' Method [Ridders].

Let us recall the error in the Central Difference formula:

$$
\begin{aligned}
Df(x) &= \frac{f(x+h) - f(x-h)}{2h} + \frac{h^2}{3!} D^3f(x) + \frac{h^4}{5!} D^5f(x) + \cdots \\
&= \frac{f(x+h) - f(x-h)}{2h} + K_2 h^2 + K_4 h^4 + \cdots
\end{aligned}
$$

The key thing to note here is that the coefficients $K_2, K_4, \ldots$ are independent of $h$; they depend only on $x$. Now define:

$$A(1, m) = \frac{f\left(x + h/2^{m-1}\right) - f\left(x - h/2^{m-1}\right)}{2h/2^{m-1}}.$$

Then observe that

$$Df(x) = A(1, 1) + K_2 h^2 + K_4 h^4 + \cdots$$

and also

$$Df(x) = A(1, 2) + K_2 (h/2)^2 + K_4 (h/2)^4 + \cdots$$

Here we have halved the step size to obtain a second Central Difference estimate of $Df(x)$. Multiplying the second estimate by 4, subtracting the first, and dividing by 3 cancels the $K_2$ term, giving:

$$Df(x) = \frac{4 A(1, 2) - A(1, 1)}{4 - 1} + O(h^4)$$

which is an approximation of $Df(x)$ whose error decreases as $O(h^4)$. We can repeat this process to obtain ever more accurate estimates as follows:

$$
A(n, m) = \begin{cases}
\dfrac{f\left(x + h/2^{m-1}\right) - f\left(x - h/2^{m-1}\right)}{2h/2^{m-1}} & n = 1 \\[1em]
\dfrac{4^{n-1} A(n-1, m+1) - A(n-1, m)}{4^{n-1} - 1} & n > 1
\end{cases}
$$

It is straightforward to show that the approximation error in $A(n, 1)$ is $O(h^{2n})$. To see how $A(n, 1)$ can be computed in practice, it is helpful to structure the computation as the following tableau:

$$
\begin{array}{ccccc}
A(1,1) & A(1,2) & A(1,3) & A(1,4) & \cdots \\
       & A(2,1) & A(2,2) & A(2,3) & \cdots \\
       &        & A(3,1) & A(3,2) & \cdots \\
       &        &        & A(4,1) & \cdots
\end{array}
$$

So, to compute $A(n, 1)$ for increasing values of $n$, we move from left to right, computing one column at a time. Assuming that the primary cost is the evaluation of the function $f(x)$, the cost of computing each new column of the above tableau is two function evaluations, since only the first entry $A(1, n)$ requires evaluating the central difference formula (with step size $2^{1-n}h$); every other entry in the column is obtained by extrapolating from entries already computed.

Applying this method to $f(x) = \frac{e^x}{\sin x - x^2}$, starting with a fairly large step size $h = 0.01$, we get:

$$
\begin{array}{ccccc}
141.678097131 & 140.971663667 & 140.796145400 & 140.752333523 & 140.741384778 \\
 & 140.736185846 & 140.737639311 & 140.737729564 & 140.737735196 \\
 & & 140.737736209 & 140.737735581 & 140.737735571 \\
 & & & 140.737735571 & 140.737735571 \\
 & & & & 140.737735571
\end{array}
$$

Compared to the correct value $Df(1.0) = 140.73773557129658$, $A(5, 1)$ has a relative error of $10^{-13}$. For comparison, the relative error of the central difference formula with the same step size ($0.01/2^4 = 0.000625$) is $10^{-5}$.

The above tableau is the basis of Ridders’ method for numeric differentiation. The full implementation is an adaptive scheme that tracks its own estimation error and stops automatically when the desired precision is reached. Of course it is more expensive than the forward and central difference formulae, but is also significantly more robust and accurate.
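For concreteness, here is a sketch that reproduces the tableau above. It uses the same $f(x)$ and $h = 0.01$; the fixed tableau depth is a simplification of the adaptive scheme:

#include <cmath>
#include <cstdio>

double f(double x) { return std::exp(x) / (std::sin(x) - x * x); }

int main() {
  const double x = 1.0;
  const double h = 0.01;
  const int n = 5;  // tableau depth; fixed here, adaptive in the real method
  double A[n][n];
  // First row: central differences with step sizes h, h/2, h/4, ...
  for (int m = 0; m < n; ++m) {
    const double hm = h / std::pow(2.0, m);
    A[0][m] = (f(x + hm) - f(x - hm)) / (2.0 * hm);
  }
  // Remaining rows: Richardson extrapolation of the row above.
  for (int i = 1; i < n; ++i) {
    const double p = std::pow(4.0, i);
    for (int m = 0; m + i < n; ++m) {
      A[i][m] = (p * A[i - 1][m + 1] - A[i - 1][m]) / (p - 1.0);
    }
  }
  std::printf("A(5,1) = %.11f\n", A[n - 1][0]);  // ~140.73773557130
  return 0;
}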

Using Ridders' Method instead of forward or central differences in Ceres is again a simple matter of changing a template argument to NumericDiffCostFunction as follows:

CostFunction* cost_function =
  new NumericDiffCostFunction<Rat43CostFunctor, RIDDERS, 1, 4>(
    new Rat43CostFunctor(x, y));

The following graph shows the relative error of the three methods as a function of the absolute step size. For Ridders' method we assume that the step size for evaluating $A(n, 1)$ is $2^{1-n}h$.

Using the 10 function evaluations needed to compute $A(5, 1)$, we are able to approximate $Df(1.0)$ about 1000 times better than the best central differences estimate. To put these numbers in perspective, machine epsilon for double-precision arithmetic is $\approx 2.22 \times 10^{-16}$.

Going back to Rat43, let us also look at the runtime cost of the various methods for computing numeric derivatives.

| CostFunction            | Time (ns) |
| ----------------------- | --------- |
| Rat43Analytic           | 255       |
| Rat43AnalyticOptimized  | 92        |
| Rat43NumericDiffForward | 262       |
| Rat43NumericDiffCentral | 517       |
| Rat43NumericDiffRidders | 3760      |

As expected, Central Differences is about twice as expensive as Forward Differences and the remarkable accuracy improvements of Ridders’ method cost an order of magnitude more runtime.

Recommendations

Numeric differentiation should be used when you cannot compute the derivatives either analytically or using automatic differentiation. This is usually the case when you are calling an external library or function whose analytic form you do not know, or, even if you do, when you are not in a position to rewrite it in the manner required to use Automatic Derivatives.

When using numeric differentiation, use at least Central Differences, and if execution time is not a concern or the objective function is such that determining a good static relative step size is hard, Ridders’ method is recommended.
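If finer control is needed, Ceres provides a NumericDiffOptions struct whose fields control the step sizes and the Ridders' extrapolation. A sketch, assuming a recent Ceres version (verify the field names and constructor signature against the version you use):

// Tuning numeric differentiation via NumericDiffOptions; the values shown
// here are the documented defaults, repeated for illustration.
ceres::NumericDiffOptions options;
options.relative_step_size = 1e-6;                  // forward/central step
options.ridders_relative_initial_step_size = 1e-2;  // initial h for Ridders

CostFunction* cost_function =
  new ceres::NumericDiffCostFunction<Rat43CostFunctor, ceres::RIDDERS, 1, 4>(
    new Rat43CostFunctor(x, y), ceres::TAKE_OWNERSHIP, 1, options);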
