A while ago,
Michael Strömberg presented a
Java & C# Performance Comparison on his blog.
As a basis for comparison, he included the original C++ reference
implementation of the algorithm, a pulse-coupled neural network.
The reference program was slower than Michael's Java code even on the
slowest of the Java environments.
Since I believe that a reasonably sane implementation in C++ has a good chance of outperforming even a well-tuned Java implementation, I decided to take a look at the reference implementation and see if it could be made to run faster. It certainly could.
The first step was, naturally, to get the old code to
compile and run at all, and then add
-Wall to the
compiler flags and fix all warnings. This mainly meant changing
#include <iostream.h> to
#include <iostream> and adding
using namespace std; in a few files.
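As a minimal sketch of that modernization (an illustrative example, not the project's actual code): after switching to the standard headers, the library names live in namespace std and need qualification or a using-declaration.

```cpp
// Old, pre-standard C++ header (what the code used to have):
//   #include <iostream.h>
// Standard replacements; everything now lives in namespace std:
#include <iostream>
#include <sstream>
#include <string>

// A tiny function showing the std:: qualification the switch requires.
std::string greet(const std::string& who) {
    std::ostringstream out;
    out << "Hello, " << who;
    return out.str();
}
```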
Project-specific code or the standard library
A common problem in old C++ code is the use of home-grown, often rather simplistic, data containers such as vectors or strings. There is an excuse for this: the C++ language and library were not standardized until 1998, and compilers lagged for quite a while after that.
This project had its own string class (used only for constructing a
file name). I removed the 150 lines of string
class and changed the six lines that used it to use a standard
std::string instead.
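A hypothetical version of that file-name construction (the real format is not shown here), using the standard library instead of a home-grown string class:

```cpp
#include <sstream>
#include <string>

// Builds a numbered output file name, e.g. "out_3.dat".
// The stem and extension are made-up placeholders.
std::string make_filename(const std::string& stem, int frame) {
    std::ostringstream name;
    name << stem << '_' << frame << ".dat";
    return name.str();
}
```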
The main offender, however, was a matrix implemented as an
array of arrays of
double. The matrix class
had an index operator that simply returned the inner array, used
like this:
int x, y;
for(x = 0; x < width; ++x)
    for(y = 0; y < width; ++y)
        matrix[x][y] *= factor;
This meant that the matrix class gave free access to its inner workings to its users. The first thing to do if I wanted to change anything in the matrix was to create access operators that take two arguments, to replace returning the inner array:
for(int x = 0; x < width; ++x)
    for(int y = 0; y < width; ++y)
        matrix(x, y) *= factor;
At first I implemented these with a double index just like
before, but I soon changed the data to a flat
std::vector<double>, indexed with
x + width * y. Then, for the methods that just operated on
each element (which accounted for a lot of the running time), there was
no need for separate
indexes at all; just iterate over all the elements:
for(iterator i = begin(); i != end(); ++i) *i *= factor;
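A minimal sketch of the reworked matrix along these lines (the class and member names are hypothetical; the original class had more methods):

```cpp
#include <cstddef>
#include <vector>

class Matrix {
public:
    typedef std::vector<double>::iterator iterator;

    Matrix(std::size_t width, std::size_t height)
        : width_(width), data_(width * height, 0.0) {}

    // Two-argument access operator, replacing matrix[x][y]
    double& operator()(std::size_t x, std::size_t y) {
        return data_[x + width_ * y];
    }

    // Element-wise operations need no x/y indexes at all
    iterator begin() { return data_.begin(); }
    iterator end()   { return data_.end(); }

    void scale(double factor) {
        for (iterator i = begin(); i != end(); ++i)
            *i *= factor;
    }

private:
    std::size_t width_;
    std::vector<double> data_;
};
```

Storing the elements contiguously in one std::vector also tends to be friendlier to the cache than an array of separately allocated rows.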
By value or by reference
Another common problem is passing parameters by value instead of by reference. Looking at the function declarations, the difference is minimal:
void foo(Arg a); // By value
void foo(const Arg &a); // By reference
The ampersand in the latter declaration means that the
function takes a reference to an
Arg, rather than a copy of an
Arg, as its argument. If Arg is
a large object, copying it might take quite a while.
Of course, when the method gets a reference to the
original object rather than a copy, any change the method makes
will be to the original object. To avoid unintended changes, we
can declare the argument
const. Putting lots of
"const" in the program also facilitates the mechanical
optimization the compiler does for you.
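A small illustration of the two styles (Arg is a stand-in for some large object; the names are made up for the example):

```cpp
#include <cstddef>
#include <string>

struct Arg {                 // stand-in for some large object
    std::string payload;
};

// By value: the whole Arg (and its string) is copied on every call.
std::size_t size_by_value(Arg a) { return a.payload.size(); }

// By const reference: no copy, and the compiler rejects any
// accidental modification of the caller's object.
std::size_t size_by_ref(const Arg& a) { return a.payload.size(); }
```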
Some classes, notably the stream
family, even make the copy constructor (and the assignment
operator) private, to prohibit copying. Although
this is done to ensure proper semantics rather than as a mere
optimization, the two often go hand in hand.
Where to tune
Given that the program was several hundred lines of code – even after simplifying it with STL – it was not obvious where to optimize. Since I took pride in only spending a few hours on the optimization, I wanted to put my effort where it would matter the most (in real-world projects, often several orders of magnitude larger, having to pay the developers per hour gives a similar motivation).
Luckily, there is a tool for this: just compile the program
with the -pg flag, run the program, and then run
gprof, and you get a listing of which methods
the program spent the most time in, and how many times they were
called. Then you just have to figure out either how to make those
methods run faster, or how to call them less often.