Monday, 30 June 2014

Brandscreen: Debian All the Things! Puppet All the Things! Teamcity All the Things!

Brandscreen has a high level of automation for deployment and configuration. Graphing and monitoring are also critical for deployment.

Debian

All code releases at Brandscreen are packaged as deb packages, as are a number of reference files. Configuration is nearly all done through Puppet. Any cron jobs simply call a wrapper.

A number of third party applications, including pip packages, were packaged as debs.

The deb packages are all published to a repository. Package dependencies made it easy to upgrade and downgrade, so we could release very frequently, confident that if there were any problems the software and configuration could be rolled back easily.

Puppet

Puppet is used for all configuration that is not handled by a deb package. Most operational tasks are also performed using Puppet; very little is done over ssh.

Graphing and Monitoring

Brandscreen uses Teamcity to build and deploy releases. We generally deployed to a canary first, then checked the graphs in Graphite and GDash. If there were any issues, the release would be rolled back.

Some of the items which were graphed in GDash are:

  • discrepancy with partner
  • CPU usage
  • disk throughput
  • disk space used
  • network throughput
  • memory usage
  • partner bid requests
  • partner bid responses
  • partner impressions
  • clicks
  • number of log files
  • pricing
  • response times
  • spending
  • offline requests
  • many many many other things that would give away proprietary information :)

I really can't overstate how cool GDash is. And Graphite itself is, er, useful. Here is an example from the GDash website:

Many other items, including custom logging, are available through Graphite itself. The data was collected through collectd.
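
As a rough illustration of how a custom metric can reach Graphite, here is a minimal sketch of Carbon's plaintext format (one "path value timestamp" line per data point, conventionally sent to TCP port 2003). The metric name, host and helper names are hypothetical; this is not Brandscreen's actual collection code, which went through collectd.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point as 'path value timestamp', newline-terminated."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, int(timestamp))


def send_metrics(lines, host="localhost", port=2003):
    """Send pre-formatted lines to a Carbon plaintext listener.

    The host is an assumption; 2003 is the usual default plaintext port.
    """
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall("".join(lines).encode("ascii"))
    finally:
        sock.close()


if __name__ == "__main__":
    # Hypothetical metric name; print the line instead of sending it.
    print(graphite_line("example.bidder.response_ms", 42, 1403481600), end="")
```

Keeping the formatting separate from the socket code makes it easy to check the output without a running Carbon instance.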

For other systems, logs were processed through Logstash and presented using Kibana.

Nagios and PagerDuty are used for monitoring. To present the graphs on big screens, New Relic and Geckoboard are used.

I will go into more detail on graphing and alerts in a later article.

Teamcity

Teamcity is used to build, package and test each check-in. Deployment is also done through Teamcity.

Monday, 23 June 2014

Brandscreen: Hive vs. Redshift

Redshift is an instance of ParAccel's 'Big Data' database offered through AWS. It is similar to Netezza and is similarly based on Postgres.

Redshift is relational and ACID compliant; however, the available types are limited, and there are no indices or stored procedures.

That doesn't sound great until you start working with data. Redshift will typically be significantly faster working on atomic data than a traditional Oracle or SQL Server installation dealing with summarised, indexed data.

Redshift is also significantly faster than Hive and has lower latency. In addition, the AWS pricing makes Redshift more economical than Hive for frequent jobs.

Monday, 16 June 2014

Brandscreen Technology: Hive

Hive is a technology that sits on top of Hadoop and allows text files to be queried using SQL statements.

org.apache.hadoop.hive.contrib.serde2.RegexSerDe in hive-contrib may be used to process irregular text files (e.g. webserver logs).
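
The idea behind a regex SerDe is that each capture group in the pattern becomes one column. The sketch below shows that mapping in Python for an Apache-style access log line. The pattern, column names and sample line are hypothetical, and Python's regex syntax is only an approximation of the Java syntax Hive uses, so treat it as a way to prototype a pattern rather than as the SerDe itself.

```python
import re

# Hypothetical pattern for an Apache-style access log; one capture group per column.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)$'
)
COLUMNS = ["host", "identity", "user", "time", "request", "status", "size"]


def parse_line(line):
    """Return a dict of column name -> string, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


if __name__ == "__main__":
    sample = ('10.0.0.1 - - [16/Jun/2014:15:04:05 +0000] '
              '"GET /index.html HTTP/1.1" 200 512')
    print(parse_line(sample))
```

Every value comes back as a string, which mirrors the usual practice of declaring regex-parsed columns as strings and casting them in the query.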

Hive can read CSV files directly and can read them straight from S3, which is particularly useful when staying inside the AWS environment.

Key steps to running a query are:

  1. Files in S3 should be laid out as <root>/<top key>/<partition key1=value1>/<partition key2=value2>...
    1. e.g. s3://test/logs/dt=2014-06-16/hr=15
  2. Start up a Hadoop cluster on Amazon (e.g. Elastic MapReduce), then log in over SSH as hadoop@<master node name> using the pem key you set up for Amazon
  3. ADD JAR any external modules (such as hive-contrib) if required
  4. CREATE EXTERNAL TABLE, specifying:
    1. PARTITIONED BY with partitioning matching partition 1... in step 1
      1. e.g. PARTITIONED BY (dt string, hr int)
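    2. ROW FORMAT as either:
      1. DELIMITED FIELDS TERMINATED BY (for delimited text such as CSV), or
      2. SERDE (with regex, for irregular text)
    3. LOCATION pointing at <root>/<top key> from step 1
      1. e.g. LOCATION 's3://test/logs/'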
  5. ALTER TABLE xxx ADD IF NOT EXISTS PARTITION (key1=value1, ...);
    1. e.g. ALTER TABLE abc ADD IF NOT EXISTS PARTITION (dt='2014-06-16', hr=15);
    2. A separate ALTER TABLE statement must be provided for each partition required
  6. Query all the things! Just the same as a normal SQL query
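
Because each partition needs its own ALTER TABLE statement, it can help to generate the statements rather than type them. The following is a minimal sketch that prints the HiveQL for the steps above, assuming the dt/hr layout from step 1 and comma-delimited files. The table name, column names and function names are hypothetical, and the hour is written without zero padding to match the hr=15 example.

```python
from datetime import date, timedelta


def partition_path(root, dt, hr):
    """S3 prefix for one partition, following the key=value layout from step 1."""
    return "%s/dt=%s/hr=%d" % (root.rstrip("/"), dt.isoformat(), hr)


def create_table_sql(table, root):
    """CREATE EXTERNAL TABLE over comma-delimited text; the columns are hypothetical."""
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS %s (\n"
        "  request_id string,\n"
        "  price double\n"
        ")\n"
        "PARTITIONED BY (dt string, hr int)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "LOCATION '%s/';" % (table, root.rstrip("/"))
    )


def add_partition_sql(table, dt, hr):
    """One ALTER TABLE per partition; the string key is quoted, the int key is not."""
    return "ALTER TABLE %s ADD IF NOT EXISTS PARTITION (dt='%s', hr=%d);" % (
        table, dt.isoformat(), hr)


def statements(table, root, start, days, hours=range(24)):
    """Yield the CREATE statement, then one ADD PARTITION per day and hour."""
    yield create_table_sql(table, root)
    for offset in range(days):
        dt = start + timedelta(days=offset)
        for hr in hours:
            yield add_partition_sql(table, dt, hr)


if __name__ == "__main__":
    for sql in statements("abc", "s3://test/logs", date(2014, 6, 16), 1, hours=[15, 16]):
        print(sql)
    # Where the files for the first partition are expected to live:
    print(partition_path("s3://test/logs", date(2014, 6, 16), 15))
```

The output can be saved to a file and run with hive -f, or pasted into the Hive shell.
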

The advantage of Hive is that you don't need to import your log files into a database in order to query them. The disadvantage is that, although it has good throughput, it has very poor latency. Some tools, such as Shark, may be able to mitigate this problem.