
Monday, 30 June 2014

Brandscreen: Debian All the Things! Puppet All the Things! Teamcity All the Things!

Brandscreen has a high level of automation for deployment and configuration. Graphing and monitoring are also critical to the deployment process.

Debian

All code releases at Brandscreen are packaged as deb packages, as are a number of reference data files. Configuration is nearly all done through Puppet. Any cron jobs simply call a wrapper script.

A number of third party applications, including pip packages, were packaged as debs.

The deb packages are all published to a repository. The declared dependencies made it easy to upgrade and downgrade packages, so we could release very frequently with confidence that, if there were any problems, the software and configuration could be rolled back easily.

Puppet

Puppet is used for all configuration that is not handled by a deb package. Most routine tasks are also performed using Puppet; very little is done by hand over ssh.

Graphing and Monitoring

Brandscreen uses Teamcity to build and deploy releases. We generally deployed to a canary first, then checked the graphs in Graphite and GDash. If there were any issues the release would then be rolled back.

Some of the items which were graphed in GDash are:

  • discrepancy with partner
  • CPU usage
  • disk throughput
  • disk space used
  • network throughput
  • memory usage
  • partner bid requests
  • partner bid responses
  • partner impressions
  • clicks
  • number of log files
  • pricing
  • response times
  • spending
  • offline requests
  • many many many other things that would give away proprietary information :)
I really can't overstate how cool GDash is. And Graphite itself is very useful. Here is an example from the GDash website:

Many other items, including custom logging, are available through Graphite itself. The data was collected through collectd.
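Custom metrics reach Graphite through carbon's plaintext protocol: one "path value timestamp" line per metric, sent to TCP port 2003 by default. A minimal sketch (the host and metric path here are hypothetical, not Brandscreen's real ones):

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """Format one metric in carbon's plaintext protocol: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %f %d\n" % (path, value, timestamp)

def send_metric(host, path, value, port=2003):
    """Push a single metric to a carbon plaintext listener (default port 2003)."""
    line = graphite_line(path, value)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# Example (hypothetical metric name and host):
# send_metric("graphite.internal", "adserver.partner1.bid_responses", 1234)
```

In practice collectd or a statsd-style aggregator sends these lines for you; the protocol is simple enough that ad-hoc scripts can report custom numbers too.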

For other systems logs were processed through logstash and presented using Kibana.

Nagios and PagerDuty are used for monitoring and alerting. To present the graphs on big screens, New Relic and Geckoboard are used.

I will go into more detail on graphing and alerts in a later article.

Teamcity

Teamcity is used to build, package and test each check-in. In addition, deployment is done through Teamcity.

Monday, 23 June 2014

Brandscreen: Hive vs. Redshift

Redshift is an instance of ParAccel's 'Big Data' database offered through AWS. It is similar to Netezza and is similarly based on Postgres.

Redshift is relational and ACID compliant, however the types available are limited, there are no indices and no stored procedures.

That doesn't sound great until you start working with data. Redshift will typically be significantly faster working on atomic data than a traditional Oracle or SQL Server installation dealing with summarised, indexed data.

Redshift is also significantly faster and lower latency than using Hive. In addition the AWS pricing makes Redshift more economical for frequent jobs than Hive.

Monday, 16 June 2014

Brandscreen Technology: Hive

Hive is a technology that sits on top of Hadoop and allows text files to be queried using SQL statements.

org.apache.hadoop.hive.contrib.serde2.RegexSerDe in hive-contrib may be used to process irregular text files (e.g. webserver logs).

Hive can directly use CSV files and can directly use s3, which is particularly useful when staying inside the AWS environment.

Key steps to running a query are:

  1. Files in s3 should be laid out as <root>/<top key>/<partition key1=value1>/<partition key2=value2>...
    1. e.g. s3://test/logs/dt=2014-06-16/hr=15
  2. Start up the cluster on Amazon EC2, then log in using SSH to hadoop@<master node name> with the pem key you set up for Amazon
  3. 'Add' any external modules (such as hive-contrib) if required
  4. CREATE EXTERNAL TABLE, specifying:
    1. PARTITIONED BY, matching the partition keys from step 1
      1. e.g. PARTITIONED BY (dt string, hr int)
    2. ROW FORMAT as:
      1. DELIMITED FIELDS
      2. SERDE (with regex)
  5. ALTER TABLE xxx ADD IF NOT EXISTS PARTITION (key1=partition1...);
    1. e.g. ALTER TABLE abc ADD IF NOT EXISTS PARTITION (dt='2014-06-16', hr=15);
    2. A separate ALTER TABLE statement must be provided for each partition required
  6. Query all the things! Just the same as a normal SQL query
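The DDL from steps 4 and 5 can be sketched as HiveQL generated from Python. The bucket layout comes from the example in step 1; the table name and columns are hypothetical:

```python
def create_logs_table_ddl(bucket="s3://test/logs"):
    """CREATE EXTERNAL TABLE matching the s3 layout .../dt=YYYY-MM-DD/hr=HH (step 4).
    Columns here are invented for illustration."""
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS logs (\n"
        "  ip STRING, url STRING, status INT\n"
        ") PARTITIONED BY (dt STRING, hr INT)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "LOCATION '%s';" % bucket
    )

def add_partition_ddl(dt, hr):
    """One ALTER TABLE statement is needed per partition (step 5)."""
    return ("ALTER TABLE logs ADD IF NOT EXISTS "
            "PARTITION (dt='%s', hr=%d);" % (dt, hr))

print(create_logs_table_ddl())
print(add_partition_ddl("2014-06-16", 15))
```

Generating the ALTER TABLE statements in a loop over dates and hours is the usual workaround for having to register each partition individually.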
The advantage of Hive is that you don't need to import your log files into a database in order to query them. The disadvantage is that although Hive has good throughput, it has very poor latency. Some tools such as Shark may be able to mitigate this problem.

Friday, 30 May 2014

A Quick Tour of Brandscreen Technology

This article is more like a shopping list than an article. Its main purpose is to act as a reminder for when I have time to write a longer article. Items which are starred are ones I have had some interaction with; I am not claiming to have set up the systems I have starred.

Packaging and Deployment

  • Debian all the things*
  • Debian repository
  • Teamcity (build and deploy)
  • MSBuild
  • Puppet
  • Vagrant*

Monitoring

  • Nagios
  • Graphite*
  • Sentry
  • Collectd*

Data

  • AWS
    • Hive (on Hadoop)*
    • Redshift*
    • S3*
  • SQL Server
  • MongoDB*
  • Formats:
    • JSON*
    • Protobuf / BSON*
    • CSV*
  • Cassandra

Software Engineering

  • Confluence*
  • JIRA*
  • Rhodecode*
  • Mercurial*
  • Google Mail*
  • Hipchat / Skype / Google Hangouts*

Technologies

  • C++*
    • gtest*
    • protobuf*
    • Boost*
    • Poco*
  • Python*
    • Django*
  • .NET

Wednesday, 4 December 2013

Python: The Need For Speed

Python is a great language; I use it practically every day. But Python has a dirty little secret: it is hideously slow.

I have first-hand experience of Python simply being too slow. In a real world project I found that the best Python implementation ran in 4 seconds while an unoptimised Java version ran in 128ms: a 30-fold difference.

That comparison is a little unfair - Java is a compiled, statically typed language after all. Unfortunately, compared to JavaScript on the V8 engine, Python's speed still sucks. The median result was that Python programmes took 12 times longer to run, with a maximum of 50 times slower.

Now some other Pythonistas have mounted the following arguments:

  1. Most of the time it doesn't matter
  2. It is fast enough for scientific calculations using scipy and numpy
  3. Computation intensive tasks should be re-written as a C plugin
  4. Use pypy
The first argument is usually correct. Does it really matter if a programme runs in 1s instead of 1ms? No. But there is a minority of cases where performance does matter, and it is in those cases that Python is not acceptable.

Points 2 and 3 are really the same. SciPy and NumPy are written in C, not native Python. The answer to "how do you write a fast Python programme?" seems to be "write it in C". This sounds like the old joke about the tourist asking the way to Dublin and being told "If I were you I wouldn't start from here". If I have to write a module in C for a performance critical app, why not just write the entire app in a faster language such as Java or C++?
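The "fast Python is really C" point is visible even in the builtins: sum() is implemented in C and handily beats an equivalent pure-Python loop. A small benchmark sketch (absolute timings will vary by machine, so none are claimed):

```python
import timeit

def python_sum(values):
    """Pure-Python accumulation: every iteration executes interpreter bytecode."""
    total = 0
    for v in values:
        total += v
    return total

values = list(range(10000))
assert python_sum(values) == sum(values)  # same answer, different speed

loop_time = timeit.timeit(lambda: python_sum(values), number=100)
c_time = timeit.timeit(lambda: sum(values), number=100)
print("pure Python loop: %.4fs, builtin sum (C): %.4fs" % (loop_time, c_time))
```

The gap between the two lines is the cost of staying in the interpreter, which is exactly what numpy-style libraries avoid by dropping to C.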

Point 4 is cited by a number of Pythonistas - but most C modules are incompatible with pypy, and pypy currently has poor Python 3 support. So we are told to write C modules in responses 2 and 3, but that then paints us into a corner when it comes to using pypy.

There is another issue with having multiple interpreters - the Python community has limited resources, which are spread between CPython, pypy, Jython and IronPython.

Python needs to merge CPython and pypy - providing a JIT reference implementation and preventing fragmentation in the Python community. This will significantly reduce the number of cases where a C module needs to be written.

I love Python and it has only once been too slow for my needs. But Van Rossum's statement that "It is usually much more effective to take that one piece and replace that one function or module with a little bit of code you wrote in C or C++..." is a cop-out. Python can do better - pypy is already doing better. 

Python should adopt pypy as the default implementation, so Python can achieve good speeds without having to call out to C.

Friday, 31 May 2013

Qt4 and Boost On Windows


Instructions

The Qt4 binary on Windows conflicts with the BoostPro binaries. The key problem is the -Zc:wchar_t compiler option.

I used Microsoft Visual C++ 2010 Express

Download the source code


In "mkspecs\win32-msvc2010\qmake.conf" change:
QMAKE_CFLAGS            = -nologo -Zm200 -Zc:wchar_t-
to
QMAKE_CFLAGS            = -nologo -Zm200 -Zc:wchar_t

Then run configure.exe, followed by nmake release. Done! Then add the includes and libraries to your own application project.


Monday, 22 April 2013

What I Have Learned About Unit Testing

I am just an ordinary programmer trying to do his job. I do not sacrifice to the altar of Test Driven Development or consider myself an uber-programmer. I am just trying to avoid being called at 3am.

I took over the software itself from another programmer. His coding was quite nice, with only one or two minor oddities which I am happy to overlook. However, the unit tests that existed did not amount to much and were not documented (which is fair enough).

The software has had about 20 minutes downtime in 10 years so the quality expectations were very high. I assumed I would make mistakes and so took multiple steps to try to reduce the risk. This article concentrates on the testing step.

Use the Tools You Get For Free

I used to write an Apache module in C++, called the clickserver, which at its peak handled about $800 million a year in advertising clicks.

One of the most amazing tools for C++ is valgrind, specifically memcheck. It is vital that C++ software be run through valgrind memcheck with representative data. Yes, valgrind does vomit up thousands of references to the STL string library, which is extremely annoying. In my experience, the most important thing to pay attention to is uninitialised variables: they will almost certainly cause logic errors in your code. Memory leaks are serious too, but not as serious as logic errors.

Play It Again Sam

One of the most successful strategies I used when developing the clickserver was to replay real world interactions from real visitors. This made it easy to find any regression issues with the new software.

I would frequently catch one or two major bugs: both regression and in the new features.

The previous programmer wrote a script to replay the log, but the difference tool just used diff, which ignored the subtleties of the log format: it would always find differences (e.g. timestamps). I wrote a more sophisticated tool which would diff the log files and allow the elimination of false positives.

There were a few challenges with the replay scenario: environment and false positives.

The output of the clickserver (redirect location, log entry and outgoing cookies) was determined by the input URL, incoming cookies, cached data from the database and configuration. All four inputs must be identical to the original.

The input URL and incoming cookies can be reconstructed from the Apache log, with care to adjust timestamps that were contained in the inputs.

Fortunately the database caches were in file form and were archived. Files were used so the clickserver could keep working if the database went offline, and they could easily be rolled back to an archive if things were messed up badly enough. Fortunately that never happened.

The false positives were another challenge. The issue is that bugs are fixed and new features are added in the new software. This will mean that there are differences that make sense. Initially I added an option to my log difference tool to ignore differences in various fields.

The problem is that I only want to ignore various fields if it is definitely a bug fix or new feature. To achieve that I added the rhino engine (ScriptEngine) that is built into Java to allow a more nuanced elimination of false positives. Performance was a real challenge using ScriptEngine. I eventually split the contents of the ignore script into two parts: one function returned a list of fields to ignore (for example completely new fields or those that changed every time) and one did a one-by-one analysis on each result that was a positive match. I did not design the script system to be able to override false negatives.
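The two-part ignore-script idea can be sketched in Python (the original tooling was Java with Rhino scripting; the field names and predicate here are hypothetical):

```python
def diff_records(old, new, ignored_fields, is_expected_change=None):
    """Compare two parsed log records field by field.

    ignored_fields: names always skipped (e.g. timestamps or brand-new fields) -
    the first part of the ignore script.
    is_expected_change: optional predicate examining each remaining difference,
    so a bug fix or new feature can be whitelisted - the second part.
    Returns the differences that survive both filters, i.e. likely regressions.
    """
    differences = []
    for field in set(old) | set(new):
        if field in ignored_fields:
            continue
        old_val, new_val = old.get(field), new.get(field)
        if old_val == new_val:
            continue
        if is_expected_change and is_expected_change(field, old_val, new_val):
            continue  # an expected difference, not a regression
        differences.append((field, old_val, new_val))
    return differences

# Usage: the timestamp always differs, so it is ignored; the price change is flagged.
print(diff_records({"ts": 1, "price": 2}, {"ts": 9, "price": 3}, {"ts"}))
```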

My Eyes Are Glazing Over...

As anyone who has done any form of in-depth testing can attest, having too much information to check can cause one's brain to go into neutral - particularly when looking for one error among a thousand or a hundred thousand entries.

Having a closed-loop unit test is critical in that regard. An automated unit test should require no human intervention. It sounds obvious, but many fall into the trap of creating a bunch of stimuli that a human then has to go back and check.

A human should only be involved in filtering out false positives and debugging problems. Of course nothing can be perfect.

Closed Loop Automated Testing

In addition to the log difference utility I also wrote a suite of unit tests. Although they simply used the interfaces presented by the software (HTTP and log files) and did not test each class directly - which might be considered "functional testing" - they used a low-level, white-box approach which should qualify as "unit testing". I used my knowledge of the internals of the software to create tests which exercised the difficult parts of the software.

The tests were written in pyunit (unittest2). The tests were built up from a number of sources:
  1. the (undocumented) unit tests that had been present before
  2. use cases for visitors. Including what the visitor should do and what the visitor can do
  3. protocol tests: both following our protocols, following our protocols in edge cases and breaking our protocols
  4. bugs (and their fixes)
  5. new features
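A minimal pyunit-style sketch drawing on those sources, assuming a hypothetical clickserver log format and parser (the real tests drove the HTTP and log-file interfaces):

```python
import unittest

def parse_click_log_line(line):
    """Parse a hypothetical clickserver log line: timestamp|visitor_id|redirect_url."""
    ts, visitor, url = line.rstrip("\n").split("|")
    return {"timestamp": ts, "visitor": visitor, "url": url}

class ClickLogTests(unittest.TestCase):
    def test_well_formed_line(self):
        # source 2: what a normal visitor should produce
        rec = parse_click_log_line("2013-04-22T03:00:00|abc123|http://example.com/\n")
        self.assertEqual(rec["visitor"], "abc123")
        self.assertEqual(rec["url"], "http://example.com/")

    def test_broken_protocol(self):
        # source 3: deliberately breaking the protocol must fail loudly
        with self.assertRaises(ValueError):
            parse_click_log_line("garbage-with-no-delimiters")

# run with: python -m unittest <this module>
```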
When I initially started the test and documentation project I made a huge spreadsheet of test ideas from all these sources. I did the testing and documentation in unison because I kept finding that a number of "bugs" were not bugs at all - they were compromises that were not documented anywhere, or plain misunderstandings.

Furthermore writing documentation shook out a number of inconsistencies and misunderstandings. I will go into further detail later.

I wrote a number of support classes, including a python implementation of tail. The end result ran in a few seconds, which was important because I wanted to be able to run it frequently. When the tests were slower I ran them less frequently.
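A sketch of what such a tail support class might look like (this is my assumed approach, reading blocks backwards from the end of the file so large logs need not be read in full):

```python
import os

def tail(path, n=10, block_size=4096):
    """Return the last n lines of a file, reading blocks from the end."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        data = b""
        # Keep prepending blocks until we have more than n newlines (or hit the start).
        while end > 0 and data.count(b"\n") <= n:
            start = max(0, end - block_size)
            f.seek(start)
            data = f.read(end - start) + data
            end = start
        lines = data.splitlines()
        return [l.decode("utf-8", "replace") for l in lines[-n:]]
```

Keeping helpers like this fast mattered: the whole suite ran in a few seconds, which is what made running it after every change practical.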

Testing And Documentation?

Fortunately my former employer was forward thinking enough to allow me to do a testing project. I also included documentation in this for two good reasons. Firstly it was something else that needed to be done.

More importantly documentation and testing are intertwined. What is the correct behaviour? What if there are conflicts between what makes sense as "correct" behaviour in two different instances? I once heard that generally each software feature is trivial, it is the interactions between those features that generates complexity.

Furthermore, in my experience good documentation improves a design. Having requirements, design, technical documentation or user documentation that piles condition upon condition is usually a code smell for bad design. It requires users, programmers and application support to hold more information in their minds.

Some people argue that the code is the documentation. If that is the case then the code by definition cannot have any bugs because it is being tested against itself. Furthermore less technical readers cannot read code. Even technical documentation may be read by application support, testers or sysadmins.

Documentation is required to describe a "contract" that the software will adhere to. The protocols and business requirements also form a "contract" that must be included. These contracts are then used in testing. These contracts are not enough to completely describe the software. I am not an advocate of design by contract (I will write about this in another post) but I am happy to pinch an idea here and there.

So documentation gives you something to test against.

The Future: Statistical Methods

Just before I left my last work I was investigating using statistical methods for monitoring new software which was deployed.

I started using Student's t-test but found that our data was too noisy to get a decent separation.

We knew anecdotally that our data was affected by a number of factors: local time of day, day of the week, day of the month, day of the year, client expenditure and many others.

Thus a multivariate analysis would have been likely to be helpful but I didn't have a chance to try it.
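As a sketch of the t-test approach: Welch's t statistic can be computed with the standard library alone (in practice scipy.stats.ttest_ind would also give the p-value; the response-time samples below are invented):

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples with unequal variances."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))

# Hypothetical response times (ms) before and after a canary deploy.
before = [102, 98, 105, 110, 97, 101, 99, 104]
after = [108, 112, 109, 115, 111, 107, 113, 110]
print("t = %.2f" % welch_t(before, after))
```

With noisy production data the group means overlap heavily, which is why a plain t-test gave poor separation and a multivariate model (controlling for time of day, day of week, spend and so on) looked more promising.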

Conclusion

Using these steps, along with other quality assurance measures, helped keep the clickserver reliable with only a small number of bugs making it into production.

Sunday, 21 April 2013

I Will Fix Your Computer If...

Most professions have a concept of doing pro-bono work. Programming is an awesome job where we get to sit in the office all day. So we should give something back to the community. However we need some ground rules to make sure our time is not drained on a single job.

Computer

  1. Licence:
    1. Does it have a valid Windows licence?
    2. Will you buy a valid Windows licence?
    3. Can I install Ubuntu?
    4. Sorry can't help!
  2. Is it a hardware problem?
    1. A disk drive, RAM or RTC battery?
    2. Are you willing to donate it to be refurbished and given to someone?
    3. Sorry, can't help!
  3. Reformatting - the Windows disc
    1. I will take a backup
    2. Do you have the OS disks?
    3. Do you have a recovery partition?
    4. Have you not deleted the recovery partition?
    5. Will you buy a Windows disc?
    6. Sorry can't help!
  4. Re-installation - other software
    1. Do you have a valid license and disc / legit download?
    2. Will you buy it?
    3. Sorry can't help! (Warez is a PITA)
  5. Crapware / Warez
    1. Is this the second time?
    2. Can I restrict administrator rights?
    3. Can I install Ubuntu?
    4. Sorry can't help!

 Network

  1. Are basic settings correct?
  2. Is it a wireless problem?
    1. Is the modem / router wireless N?
    2. Replace the modem / router
  3. Does replacing the modem help?

Saturday, 2 June 2012

C++ vs Java vs Python

Whatever your job it is important to use the right tool, so the job can be done quickly and you can go home on time. For a software engineer no tool is more important than programming languages.

Which programming language to use frequently descends into something akin to a religious war. I will try to steer clear of this mode of argument. A programming language is simply a tool.

To declare any bias upfront - I am primarily a C++ programmer and I frequently also write Python. I have dabbled in Java but have not done a great deal with it.


Syntax Complexity


In terms of syntax complexity, I hate to say this as a C++ programmer, but C++ is far more complex than Python or Java. C++ has a grammar which is much more context sensitive than Python or Java. Part of the complexity of C++ is that it is really four programming languages in one: core C++, template metaprogramming, C and preprocessor.

I find Java slightly less complex than C++. There is no preprocessor, templates are very limited and obviously there is no C legacy.

In my experience Python has the simplest syntax. The amount of syntax I have to memorise and recall in Python is much smaller than Java or C++.

This is all personal opinion and impressions. You may believe me because I am a C++ programmer, yet I am saying that Python is best and Java is good. Can we have a more objective measure though?

What about looking at each language's full grammar in EBNF-like format? Python has a short specification of about 116 lines. In comparison a Java specification (see section 18, via stackoverflow) is about 545 lines long. Finally C++ comes in at 987 lines.

Although it is difficult to objectively measure a language syntax's complexity, these measures give some idea.

Libraries

Libraries can truly make or break a language. Libraries can paper over problems with the language itself and prevent reinventing the wheel.

Python has built-in object oriented support for dates, compression, retrieving web pages, sending and receiving email, manipulating paths, csv files, command line options, INI style configuration files, threading and more. A large number of third party libraries are available using easy_install.

Java has built-in support for dates, threading and path manipulation. Apache Commons provides many of the features that are built into Python: compression, a CSV reader, command line options and INI-style configuration files. Apache HttpClient provides a way to retrieve web pages. Although there is no central repository, there are an enormous number of open source and commercial libraries available for Java.

C++ and Boost cover very few of these common use cases. Boost supports dates, path manipulation, command line options, INI-style configuration files and threading. It is also possible to write a CSV reader in C++ using boost::tokenizer. Compression, retrieving web pages and sending & receiving emails are not supported.

C++11 moves some of the features from Boost into the language itself and there are third party libraries which fill some of these voids. C++ also integrates easily with C libraries, such as libcurl for retrieving web pages.

The advantage of using C libraries is that there are thousands of C libraries. The disadvantage of using C libraries is that you are back using procedural programming for that specific library, unless you wrap the libraries in an object oriented layer yourself. In addition the C libraries have no knowledge of exceptions, so may not clean up correctly if there is an exception.

Databases

Databases are critical for almost any back-end business application.

Java has by far the best native support. JDBC is a standardised interface and is widely supported by database vendors. A JDBC to ODBC bridge is included with the JDK. Others can be easily installed by adding the jar to the classpath.

Python is a bit more mixed. It has a standardised interface (the DB-API), but it generally relies on third parties to write drivers, often wrapping a vendor's C client library. Most vendors have not provided Python drivers, and even an ODBC driver is not built in. A number of drivers are available, most through easy_install, but there are competing implementations of some drivers and it is not always easy for a newbie to determine which one is still being actively developed. For example there are six different ODBC drivers, two different MySQL drivers and nine different PostgreSQL drivers.
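Python's standardised interface is the DB-API (PEP 249). The sqlite3 module ships with the standard library and illustrates the connect/cursor/execute/fetch surface that every conforming driver exposes:

```python
import sqlite3

# Every PEP 249 driver exposes this same surface, so swapping sqlite3 for,
# say, a PostgreSQL driver changes little beyond the connect() call.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE clicks (visitor TEXT, url TEXT)")
cur.execute("INSERT INTO clicks VALUES (?, ?)", ("abc123", "http://example.com/"))
conn.commit()
cur.execute("SELECT url FROM clicks WHERE visitor = ?", ("abc123",))
print(cur.fetchone()[0])  # http://example.com/
conn.close()
```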

C++ support for databases is also a bit mixed. There is no support for databases in C++ or Boost. OTL supports accessing Oracle, ODBC and DB2-CLI from C++. Databases other than Oracle or DB2-CLI are only supported via ODBC.

Functional Programming

Wikipedia states that "In computer science, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data"

Python supports various aspects of functional programming through the map() and reduce() functions, and the itertools library is included in later versions. Iterators, list comprehensions and generator expressions are part of the language syntax and are faster than map() or reduce() in most cases.

Lambda functions are included in the language syntax. They can be useful for "glue" logic: for example extracting the nth element from a list or a property from an object. However, lambda functions are slow (like all Python functions), limited, and often not as clear as a separate function.
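A quick illustration of these constructs, including a lambda used as glue to extract the nth element (the data is invented):

```python
from functools import reduce  # reduce moved out of the builtins in Python 3

prices = [("partner_a", 1.20), ("partner_b", 0.85), ("partner_c", 2.10)]

# lambda as "glue": extract the first element of each tuple
names = list(map(lambda pair: pair[0], prices))

# the equivalent list comprehension, usually faster and clearer
names_lc = [name for name, _ in prices]

# reduce() with a lambda accumulating the second element
total = reduce(lambda acc, pair: acc + pair[1], prices, 0.0)

print(names_lc, round(total, 2))
```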

The C++ STL includes template functions similar to Python's map(), such as transform(). It is possible to pass an object with an operator(), so the function can be executed very efficiently as an inline function.

Boost supports lambda function programming using some complex template programming. When Boost.Lambda works it is very useful but when it stops working it can be difficult to debug.

C++11 now supports lambda functions in the language syntax.

I have not had any experience with functional programming in Java, but I should note that Hadoop, one of the first MapReduce frameworks, was written in Java.



Consistency

Python is really rather good with consistency. Iterators are widely used and all iterators are consistent. There is one preferred way to do most things.

C++ is fairly consistent but there are some oddities. A number of those oddities are due to C++'s C heritage, particularly with respect to syntax. Others are simply design deficiencies: for example in the STL the fstream classes take C-style character strings, rather than C++ std::strings (this has been fixed in C++11).

Java is quite consistent with syntax and the libraries are quite consistent. However, iterators are somewhat inconsistent: for example the CharacterIterator interface is very different to the Iterator interface.

Memory and Speed

In my experience C++ is the most frugal with memory. Python is not too far behind considering it is a dynamic language. Java is probably the least frugal with memory but if the program is written properly there is usually no problem.

Java and C++ aren't too far apart in terms of speed, although C++ seems a bit faster for some tasks. In comparison, Python can be painfully slow for various types of processing. In the example linked, a highly optimised Python version took 4.55s but a naive Java version took 130ms. Projects such as pypy may improve the situation.

All three languages have sufficient performance, in terms of memory and speed, that performance would not be a top priority for most applications. Other issues such as library support, existing codebase, database support, familiarity and developer time are higher priorities.

Conclusion

I am happy to use all three languages. I use C++ and Python most frequently. I haven't seen as many of the dark corners of Python and I am most familiar with C++.

Database access is really easy in Java, most of the code bases I work with are already in C++, and Python is great for hacking things together quickly while still being able to read them later.

Wednesday, 14 March 2012

Going Native 2012

I recently viewed some amazing videos from the Going Native 2012 conference. I would thoroughly recommend the videos for C++ developers on all platforms.

There is substantial coverage of the new C++11 features and clang. Even though this conference was sponsored by Microsoft the majority of the material is common to all platforms.

Specific videos I would recommend are:

If you have time I would also recommend the end of day round up: Interactive Panel: Ask Us Anything!

Update: the question in Interactive Panel: Ask Us Anything! at 1:16 regarding tooling is very enlightening, particularly Carruth's (clang) answer at 1:20. It looks like Java-like analysis tools are coming through.