Michael Kazakov's quiet corner – Software Development, C++, Tech Rant

December 27, 2019December 27, 2019

Bare web experiment

Introduction
I have a complicated relationship with website-related technologies. Previously I occasionally had to hack around with various web software, so I got some understanding of how different components work and interact with each other, but generally that’s something I’d prefer to not touch at all. Fiddling with bits & bytes and staring at assembly listings is much closed to me. Unfortunately sometimes one has to deal with it, for instance when developing a software product. In my case, it was Nimble Commander – this app needed at least some presence on the Internet. App Store won’t even accept an application without URLs of general product info, a support page, and a privacy policy page. A combination of NIH syndrome and a finance model of the project left outsourcing this task out of the question, so I had to pick my poison.

The intention was to find a simple WYSIWYG design system that would allow to slam together some content, export it as plain HTML files and then upload them to a server. For such a simple website nothing else was required. However, as it turned out in 2013, this niche was already dead and everything moved into CMS-style systems with backend programming languages, databases, and other cruft. One semi-obvious choice from Apple was its software for simple web design called iWeb. It was already outdated, but seemed to be decent enough for my needs. And so it worked for some time until its age became clear and I had to find something else. After some research, perhaps carried out incompetently, the verdict was to “just use WordPress and don’t reinvent the wheel”. The idea was appealing – a WYSIWYG editor plus a range of various visual themes (both free and paid) with an ability to plug additional components on top. And, when really needed, to fine-tune elements via a custom CSS as a cherry on the cake. Sounds awesome, right?

Not exactly. As it turned out, the built-in flexibility wasn’t that high and the design involved quite a lot of CSS hammering. WordPress was slow as hell and needed echelons of minifiers and caches. The HTML files spat out of the system was an incomprehensible pile of <div>s which made a web browser choke. And, on top of all, at some point, I discovered that the theme referenced some external resources via hardcoded “http://..” paths somewhere deep in its guts, which prevented me from making the website accessible via https. Not that encryption bothered me at all – who really cares about the fact that some person visited a website of a dual-pane file manager for Mac? But in 2018 web-browsers ignited a crusade against non-encrypted internet traffic and Chrome, for instance, started to happily print “not secure” next to the domain name. It was clear that somehow the website had to be moved to https.

There were basically two ways to get there – either hack the existing WordPress instance even further or deploy something fresh. The first option was doable, but was also disgustingly boring and would leave me with even more hacks to maintain. The latter was more time-consuming but at least would provide some new knowledge and entertain a bit.

It should be mentioned that IMHO the current state of the software industry could be fairly described as “dumpster fire” and manifests like this do resonate with me. Having started programming with 8-bit Z80, I’m often quite baffled with how inefficient a certain piece of software must be to perform so slowly on modern hardware. Looking critically at this website, it also appeared to be an over-engineered monstrosity for the tasks it performed. Simplified enough, the website looked like this: MySQL ↔ WordPress(PHP framework, 3rd pty theme, custom CSS hacks) ↔ Minifiers ↔ Caches ↔ Apache ↔ Internet. Since the site contained only ~5 webpages that weren’t changed often, I wondered – why not ditch this cruft altogether and write these pages manually in raw HTML+CSS? Then maybe I could serve 5 static html files without talking to a freaking DATABASE. Surely this area is far away from my usual system programming, but it shouldn’t be that hard. So I decided to conduct an experiment – to rewrite everything from scratch and to measure both how much effort it requires and how much faster the website could be if implemented this way.

Experience
What firstly strikes out is how smooth the web development is. When dealing with C++ it sometimes takes literally hours to just compile the project before you can run anything. Being able to hit Cmd+S in one window and Cmd+R in another to get feedback without a delay does feel like a bliss. When there’s almost no experience in the field (like my case), such frictionless makes the learning process very comfortable.

The combination of HTML5 with CSS3 feels incredibly powerful and intuitive in most cases. There were many instances when I couldn’t believe it – “is that SO easy?”. Perhaps the only case which caused hiccups and had several iterations was the image “gallery”. The problem was to make it lazy loading and somewhat responsive while keeping everything as simple as possible. In the end, the gallery was implemented in ~20 lines of JS code and that’s the only JS code these pages have now.

One especially amusing detail was that nowadays it’s often being recommended to combine images into large “sprite sheets“. Haha, with a gamedev background this does immediately ring a bell, though we tended to call them “texture atlases” instead. Good old “trading latency for bandwidth”. Well, that was a no-brainer – easy to implement and provides measurable speed improvement. With this trick, the front page requires only 4 resources to display – an HTML document, a CSS stylesheet, and two images.

It’s hard to tell exactly how much time this rewrite took. The repository with the source files contains 45 commits over three weeks. Assuming the average time spent per day was about 1 hour, this gives a total of ~21 hours or around three full working days. That sounds about right.

In the end, I’ve got something very close to the original design on WordPress, but an order of magnitude simpler and faster at the same time. The result is ~1300LoC in HTML and ~600LoC in CSS.

Data data data
The graphs below show the performance characteristics when accessing the website from London, i.e. doing round trips across Atlantic (the server is located in the NY area). In short: the front page now loads ~4X faster, downloads ~7X fewer bytes and performs 12X fewer requests. To be precise though, it’s kind of comparing apples to oranges, since the original WordPress served unencrypted HTTP, whiles the rewritten site uses HTTPS, which is more complex and requires handshaking, i.e the difference is actually even higher.

March 25, 2019March 25, 2019

Compilation time: Boost vs Std

This is a small experiment to compare compilation times of some Boost facilities against their standardised counterparts. The goal was to assess the time penalty which comes with vast compilers and platforms support.
Each measured file contains a minimal code snippet to employ some facility. The difference between Boost and Std versions boils down to including another header and picking the right namespace. Both preprocessing/parsing and instantiation contribute to the timing, so technically speaking it’s not exactly only compilation time. Each source file was compiled a hundred times to minimise measurement errors.
Boost version is 1.69.
Std implementation is libc++ coming with Xcode10.1.
Both were compiled using clang-1000.11.45.5 using C++17 mode and -O2 optimisation level.
The snippets themselves and the driver script are available here.

February 12, 2019February 12, 2019

Measuring templates bloat

Recently I’ve been investigating slow compilation of source files which used one particular library. The library was written in-house and has its roots in pre-C++11 era, including abundant usage of Boost. The library itself provides a sophisticated mechanism of reflection and operates with type-erased objects. Obviously, it heavily relies on the C++ type system and has a lot of template code.
Boost became my primary suspect almost immediately – it’s notorious for compiler torture and I personally try to avoid it wherever possible. So the most prominent red flags like Boost.MPL were almost completely removed and other pieces were converted to their C++11 standardised counterparts. Results, however, weren’t inspiring – compilation time moved a bit, but only marginally. The bottleneck was somewhere else.

Looking at MSVC’s time report (“/d2cgsummary”) didn’t provide anything meaningful – it basically stated that each file contained dozens of functions with “Anomalistic Compile Times”™️. No details why though.
GCC’s time report (“-ftime-report”), on the other hand, was much more helpful. It clearly showed that the lion’s share of time was spent on “phase opt and generate”, which, to my understanding, is actual instructions generation. That was somewhat surprising given the fact that the majority of these source files weren’t large nor performed any rocket surgery.

It should be mentioned that almost all reflection code in that library was written in templates, which makes sense. And, apparently, the compiler was spending time generating instructions for these methods per each instantiation type over and over again for each translation unit, to be later simply thrown away by the linker. It’s hard to estimate the number of instantiation types in the final product itself, but 50-100 can serve as ballpark estimation. So I decided to make an experiment and tried offloading some portions of templated code into a private non-templated “base” class. It immediately became evident that removing even tiny pieces of code, like the formatting of an exception message, results in a reduction of overall object files size (*.obj) by literally megabytes.

In this post I, roughly model the situation with synthetic code generation. Imagine the following pattern (let’s call it Pattern 1):

struct Base {
    virtual ~Base() = default;
    virtual void Method(int v) = 0;
};

template <typename T>
struct Impl : Base {
    void Method( int v ) override {
        if( v == 0 ) // some error checking
            throw std::logic_error
            ( "you're so unlucky with Method() for \'"s + typeid(T).name() + "\'!" );
        // some useful stuff
    }
};

struct Type{};

Base *Spawn_Type() { return new Impl<Type>; }

It’s quite easy to generate such code for a given number of class methods and instantiation types. Each additional method adds an entry in a virtual methods table and a templated implementation in Impl<T> by analogy with Method(). Each additional instantiation type introduces a new type and a new spawning function by analogy with Type / Spawn_Type().
And, for comparison, below is a slightly altered version (Pattern 2). ImplBase provides non-templated functionality and Impl<T> does just the same but redirects the exception composing and throwing to the ImplBase class. Performance hit introduced by additional function call can be neglected in 99.9% of cases.

// [...]
struct ImplBase {
    static void ThrowLogicErrorAtMethod( const std::type_info& typeid_t );
};

template <typename T>
struct Impl : Base, private ImplBase {
    void Method( int v ) override {
        if( v == 0 ) // some error checking
            ThrowLogicErrorAtMethod( typeid(T) );
        // some useful stuff
    }
};
// [...]

This repo contains generators and measurement scripts for both patterns. Scripts execute these generators for each of combinations of [1..20] methods by [1..20] instantiations and measure compilation of produced source code. The measurements shown below were made with “Apple LLVM version 10.0.0 (clang-1000.11.45.5)” on i7-3520M with “-std=c++17 -O2 -c” flags.

These are the compilation times and the object file sizes for the first pattern. Both compilation time and file size scale roughly proportionally to both number of methods and types. In the worst case scenario (20 methods x 20 types) it takes almost 3 seconds to compile the code which does basically nothing apart from error reporting. If there would be any actual code instead of “// some useful stuff” the graph will look much scarier.

The graphs below show the scaling of the second pattern. The worst-case scenario takes 0.75s to compile instead of 2.73s with the first pattern. The object file is 4 times smaller in that case.

Of course, both patterns generated a completely synthetic code which is far too simple to look at concrete absolute numbers. Adding any reasonable logic into these methods would radically shift the results. But I guess it’s safe to assume that delta between these two patterns will not go anywhere – a compiler will still have to generate these instructions regardless of other complexity. So it should be fair to look at delta numbers:

These delta numbers show something interesting. For instance, for the case of 10 methods and 10 instantiation types (which doesn’t seem too extreme), the difference is about half a second of compilation time per file. Or, to rephrase, there is a choice of two approaches:
a) Pattern I: clearer code – easier to maintain, but it costs 0.4 seconds of wait time per compilation per file;
b) Pattern II: a bit more obfuscated code – harder to maintain, but doesn’t introduce additional cost in terms of compilation time.
This choice, as usual in engineering, doesn’t provide a “right” option – it’s always a tradeoff. Often times, however, such choices are being done unconsciously just because something is considered to be a “default” way by the C++ community.

The “zero-cost abstractions” are sometimes being presented as the main C++ feature, but there are many hidden costs – graphs above show just one aspect of such penalties. The recent debate on Modern C++ vs. GameDev touched this problem and the ascetic approach of “C with classes” definitely has many valid points. At least such code compiles fast.

June 7, 2018

IO2D demo: Maps

Introduction
This blog post describes another IO2D demo I wrote as a showcase of the library’s capabilities. The demo is a simple yet working GIS renderer. The OpenStreetMap service is used as a raw data provider, allowing for the visualization of any reasonably sized rectangular region. The demo supports querying OSM servers directly or loading existing data files. The entire source code of the sample is less than 800 lines of code, of which 250 lines deal with the rendering itself and another 360 lines handle the data model.

OpenStreetMap API
OpenStreetMap has an API which lets you download map data specified by an arbitrary coordinate bounding box. This interface has a number of limitations related to data transfer. For instance, the API might not fetch more than 50K nodes in some cases. Also, the interface may provide an incomplete geometry, which happens when a complex region is only partially covered by the bounding box. The latter is especially apparent with water regions like rivers, lakes and coasts. These limitations are however quite tolerable for sample code.
The API is accessible via the following HTTP GET request: /api/0.6/map?bbox=MinLong,MinLatt,MaxLong,MaxLatt. For example, these are coordinates for Rapperswil:

wget https://api.openstreetmap.org/api/0.6/map?bbox=8.81598,47.22277,8.83,47.23

The returned data will contain a raw OpenStreetMap XML file with nodes, ways and relations between them.

External libraries
Obviously (no sarcasm implied), C++ has no standard networking capabilities, so some external facility is required to download map data. Boost.Beast was chosen to talk with OSM servers in the sample code. Once a file is received, that XML has to be parsed. PugiXML was employed to deal with it.

Data representation
This demo uses a very simple interpretation of OpenStreetMap data. Instead of trying to handle myriads of different tags, it grabs objects of several types and ignores everything else. The Model class transforms the input XML file into a set of linear containers which hold all information required to render the map. The OSM format uses 64-bit integers to uniquely identify entities and to maintain connections, which assumes storing objects in some kind of a hash map. The Model class transforms these unordered identifiers into raw array indices to reduce the impact on the memory subsystem and to enforce consistency.
The transformed map data is accessible via several POD types. A Node object represents some point of interest and carries just a pair of coordinates. A Way object represents a collection of Nodes. A Road and a Railway point at some Way to describe an underlying geometry. A Road also has its enumeration type, like Motorway or Footway, to visually distinguish between different types of roads. A Multipolygon represents a set of outer and inner polygons, which basically means two sets of Way objects. Building, Leisure, Landuse and Water are different types of Multipolygon objects. Landuse also has type information, like Commercial, Construction, Industrial etc. The overall logic model looks like this:

Coordinates transformations
OpenStreetMap works with latitudes and longitudes, so these coordinates must be projected into the convenient Cartesian coordinate system. A simple Pseudo-Mercator metric projection is used to transform input coordinates:

auto pi = 3.14159265358979323846264338327950288;
auto deg_to_rad = 2. * pi / 360.;
auto earth_radius = 6378137.;
auto lat2ym = [&](double lat) { return log(tan(lat * deg_to_rad / 2 + pi/4)) / 2 * earth_radius; };
auto lon2xm = [&](double lon) { return lon * deg_to_rad / 2 * earth_radius; };

It is also worth noting that a precision of 32-bit float values is not enough, so 64-bit double values are used for initial storage and projection. Once Cartesian coordinates are calculated, they are translated and scaled into the range of [0..1].

Polygons composition
OSM lets polygons to be defined as a composition of multiple non-closed Ways. The idea behind this is a sharing of Ways data between several adjacent areas to remove the necessity to declare the same border twice. Such an approach leads to an intermediate step of composing polygons out of pieces. To complicate matters, OSM does not mandate a strict order of Ways declaration and only requires that a closed polygon should be composable out of a given set. This even includes a possible interpretation of Way’s nodes in the reversed order: ABC + EDC + AFE = ABCDEF. The goal of this step is to get a set of closed Ways, so this data can be fed to a graphics API later. The sample code implements the polygons composition in a pretty blunt brute-force manner. This implementation works well enough on real data, but in theory, its performance may significantly degrade due to the high algorithmic complexity.

Rendering
Once the data is parsed and transformed, the Render class can start drawing the map. The drawing process is sequential and follows this order: landuse regions, leisure regions, water regions, railways, highways and buildings.
Each object has to be represented as a path before it can be drawn. Two methods do that: PathFromWay and PathFromMP. The difference between them is that PathFromWay deals with non-closed ways while PathFromMP composes a path from a collection of closed Ways. Straight lines are used to connect nodes along a Way:

io2d::interpreted_path Render::PathFromWay(const Model::Way &way) const { 
  if( way.nodes.empty() )
    return {};

  const auto nodes = m_Model.Nodes().data(); 

  auto pb = io2d::path_builder{};
  pb.matrix(m_Matrix);
  pb.new_figure( ToPoint2D(nodes[way.nodes.front()]) );
  for( auto it = ++way.nodes.begin(); it != std::end(way.nodes); ++it )
    pb.line( ToPoint2D(nodes[*it]) ); 
  return io2d::interpreted_path{pb};
}

Each region type has its visual properties like fill color, outline color, stroke width and dashes pattern. These properties are defined once during construction of a Render object and most of the times are used as-is. The exception is road/railroad width, which is defined in meters and has to be scaled into pixel width according to a map scale and a window size.
This render code utilizes only solid color brushes, however nothing stops us from using image brushes instead. The main issue with them is that such images need to be drawn by someone and IMHO the programmer art should be avoided like the plague.
Some regions might have holes inside, which is specified via separation of outer and inner polygons. The demo combines such polygons into a single path which is drawn under io2d::fill_rule::winding rule.
The drawing itself is pretty straightforward, for example, these 7 lines of code display the buildings on the map:

void Render::DrawBuildings(io2d::output_surface &surface) const {
  for( auto &building: m_Model.Buildings() ) {
    auto path = PathFromMP(building);
    surface.fill(m_BuildingFillBrush, path);
    surface.stroke(m_BuildingOutlineBrush, path, std::nullopt, m_BuildingOutlineStrokeProps);
  }
}

Examples

Central Park:
./maps -b -73.9866,40.7635,-73.9613,40.7775

Acropolis of Athens:
./maps -b 23.7125,37.9647,23.7332,37.9765

Vatican:
./maps -b 12.44609,41.897,12.46575,41.907

Performance statistics
This demo renders the entire graphics set from scratch every frame. This, of course, is not how such software usually behaves, but for the sake of simplicity, the choice was not to introduce any caching. So how does the Reference Implementation cope with this task? For testing purposes, I used the Core Graphics backend running on macOS 10.13. The source code was compiled in Xcode9.3 in Release configuration. The hardware underneath is an old 2012 MacMini with a 2,3GHz Core i7 processor. The maps were rendered at the resolution of 1920 x 1080.

Dataset	Central Park	Acropolis of Athens	Vatican
Nodes	36,909	51,126	27,614
Ways	4,636	6,105	3,410
Roads	1,082	989	1,060
Railroads	41	42	44
Buildings	2,329	4,336	889
Leisures	44	77	101
Waters	13	0	31
Landuses	23	66	66
FPS	11	9	14

Conclusion
So, it takes 90ms to display the Central Park dataset, which consists of ~37K points in ~3,5K paths. Not a terrible result for a software rendering engine, which shows that the library is clearly capable of handling a casual graphics output. Of course, a hardware-accelerated backend like Direct2D would perform much faster, but it’s not here yet.

The sample’s source code is available here: https://github.com/mikebmcl/P0267_RefImpl/tree/master/P0267_RefImpl/Samples/Maps.