14 Feb

rewriting the trading algorithm

I have the script now running at a realistic 1.14% per day return, taking into account the average number of times I’ve had to pay a trading fee, the average offset between the price a trade actually happens at and the price the algorithm wanted it to happen at, and other unpredictable things.

The script, as it is, is very hard to train in any way other than an almost brute-force fashion. Because most of the variables (length of the EMA long tail, number of bars to measure for ATR, etc) are integers, I can’t hone in on the right values using a momentum-based trainer, so I need to do it using long, laborious loops.

There is one improvement that I could make, which would probably double the return or more, without drastically changing the script. At the moment, there is a lot of “down time”, where the script has some cash sitting in the wallet, and is waiting for a good opportunity to jump on a deal. If the script were to consider other currencies at the same time, then it would be able to use that down time to buy into those currencies.

On second thought, I think the returns would be much higher, because when the script currently makes a BUY trade, it’s usually sold again within about 30 minutes. That means that it could potentially make 48 BUY trades in a day, each with an average of maybe a .5% return. That’s about a 27% return ((1 + 0.5/100)^48 ≈ 1.27).

That would be nice, but it’s impossible with the GDAX market, because the GDAX market only caters to four cryptocurrencies at the moment. Also, while the money aspect is nice, I’m actually doing this more for the puzzle-solving aspect. I get a real thrill out of adding in the real-life toils and troubles (fees, unpredictable trades, etc) and coming up with solutions that return interest despite those. So, I’m not going to refine the hell out of the script. Instead, I’m going to start working on another version.

Before I started on the current version, I had made a naive attempt at market prediction using neural networks. While some of the results looked very realistic, none of them stood up to scrutiny. I failed. But, I also learned a lot.

I’m going to make another neural-net-based attempt, using a different approach.

The idea I have is that instead of trying to predict the upcoming market values, I’ll simply try to predict whether I should buy or sell. This reduces the complexity a lot, because instead of trying to predict an arbitrary number, I’m only outputting a 1 (buy), 0 (nothing), or -1 (sell). This can then be checked by running the market from day 1 of the test data using each iteration of the network, and then adjusting the weights of the network based on whether the end result was higher or lower than the last time it was run.
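
To make that concrete, here’s a rough sketch in PHP (just an illustration of the keep-it-if-it-improved idea, not the actual trading script; runBacktest() below is only a stand-in for replaying the market from day 1 of the test data):

<?php
// Illustration only: hill-climb the network weights by replaying the market
// and keeping a change only when the end result improves.
// runBacktest() is a placeholder fitness function so this sketch runs on its own;
// the real version would simulate the trades and return the final balance.
function runBacktest($weights) {
  $sum=0;
  foreach ($weights as $w) {
    $sum-=($w-0.3)*($w-0.3); // stand-in: best result when every weight is 0.3
  }
  return $sum;
}
function perturbWeights($weights, $step) {
  foreach ($weights as $k=>$w) {
    $weights[$k]=$w+(mt_rand()/mt_getrandmax()-0.5)*$step; // small random nudge
  }
  return $weights;
}
$weights=array_fill(0, 10, 0.0);
$best=runBacktest($weights);
for ($i=0; $i<1000; $i++) {
  $candidate=perturbWeights($weights, 0.05);
  $result=runBacktest($candidate);
  if ($result>$best) { // end result higher than the last run: keep the new weights
    $weights=$candidate;
    $best=$result;
  }
}
echo 'best result: '.$best."\n";
?>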

I noticed that the best tunings from my current script only made a very few trades, on the very few days where the market basically dropped like a stone and then rebounded. But I can’t assume those will happen very often, so I’d like to see a few trades happen every day. With the neural network, I can increase the number of trades simply by adjusting the output function that decides whether a result is -1, 0, or 1 (this is usually done by converting the inputs into a value between -1 and 1, then rounding to an integer based on whether the value is above 0.7 or below -0.7; that threshold can be adjusted easily to produce more 1/-1 hits).
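
As a sketch of that output stage (again just an illustration in PHP; the 0.7 threshold becomes a parameter you can lower to get more buy/sell signals):

<?php
// Squash the raw network output into the -1..1 range, then round it to a
// trade signal based on a threshold. Lowering $threshold produces more 1/-1 hits.
function tradeSignal($rawOutput, $threshold=0.7) {
  $squashed=tanh($rawOutput); // -1..1
  if ($squashed>=$threshold) {
    return 1;  // buy
  }
  if ($squashed<=-$threshold) {
    return -1; // sell
  }
  return 0;    // do nothing
}
?>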

The current approach also involves a lot of math to measure ATR (average true range), running EMA (exponential moving average), etc. With the neural network approach, I will just need to squish the numbers down to form a pattern between -1 and 1 (where 0 is the current market price) and run directly against those numbers.

Because neural networks are a lot more “touchy-feely” than the current EMA/stop-gain/stop-loss approach I’m using, I will be able to “hone in” on good values – something I cannot do at the moment because of the step-like nature of integers.

I won’t be using any off-the-shelf networks, as I can’t imagine how to write a good trainer for them. Instead, I’ll write my own, using a cascade-correlation approach.

Cascade-correlation is an approach to neural networks which allows a network to “grow” and gradually learn features of the input data. You train a network layer against every single input, until the output stops improving. You then “freeze” that network layer so it is not adjusted anymore. You then create a second layer which trains against every single input and the previously trained network. You continue adding layers until there is no more noticeable improvement.

The point of this is that the first training layer will notice a feature in the data that produces a good result, and will train itself to recognise that feature very well. The second layer will then be able to ignore that feature (because it’s already being checked for), and find another one that improves the results. It’s like how you decide what animal is in a picture – does it have 6 legs (level one), is it red (level two), does it have a stinger (level three) – it’s a scorpion! Instead of trying to learn all the features of a successful sale at once, the algorithm picks up a new one at each level.

Around Christmas, I was playing with the FANN version of cascade correlation, and I think it’s very limited. It appears to create its new levels based on all inputs, but only the last feature-detection layer. Using the above example, this would make it difficult to recognise a black scorpion, as it would not be red. I believe that ideally, each feature layer should be treated as a separate new input, letting the end output make decisions based on multiple parallel features, not a single linear feature decision.

08 May

Scaling FieldMotion

I wrote a quick overview of how I approach the scaling-up of FieldMotion, over at the FieldMotion blog.

Basically, we started off with a huge monolithic block of code and data, then looked carefully at it to see how we could tear it apart into logical chunks.

The simplest to start with was data, so we separated out the database into a MySQL master/slave cluster, and a MongoDB replicated shard cluster.

Afterwards, we looked at the actual services that were performed by the server and started separating those out so they were completely independent from the main block.

At the moment, FieldMotion runs across more than 60 servers, and can tolerate sudden catastrophes on pretty much any part of it. We actually had an incident recently where an entire datacentre went offline for about 8 hours, but no-one noticed because we make sure each of our replicas is in a different datacentre.

05 May

yet another update

Time flies. I keep on planning to do things, and then failing to do them because there isn’t enough time, in between working 12 hours a day and trying not to fall asleep as soon as I get home.

I finished the basics of my next book, Live Forever, which I put up in website form so I can figure out through statistics which pages (a lot of them!) need work. Tonight, I’m working on the Cancer chapter, so I haven’t put that in there yet.

Over the weekend, I hope to get a start on a new project, which will help to design 100% nutrition diets based on common supermarket produce. There are known recommended daily allowances (RDAs) for all nutrients, but when you make your dinner, you don’t calculate an optimal meal because it’s just not practical or easy. The new project is designed to get around that by offering meal plans that are affordable and personalisable (you will be able to put your preferences into it). We’ll see if that gets off the ground!

In CoderDojo, some of my students (I really mentor them, more than teach, but what do you call someone you mentor? Mentoree?) are working on some interesting projects for this year’s Coder Dojo conference and next year’s Young Scientist. Two examples: programmable magnetic levitation, and a laser harp.

In work, we’ve moved beyond the frantic development stage that all companies go through, and are now in stabilisation mode, making sure the system is bulletproof and can scale well beyond current needs. I still find it interesting, even though the work I’m doing at the moment is not flashy and user-visible. Today, for example, I was writing a logging system to make sure that even though users access our mobile servers in a “round robin” method at the moment and the logs of their visits are therefore scattered among the servers, I can still aggregate them on the other end into something that can be searched easily. Not flashy, but quietly satisfying.

11 May

quick method to clone a MySQL database

let’s say you have a MySQL database on db1.db and you want to clone it to db2.db

the “official” way to do this is to run a “mysqldump” on db1.db and then import the resulting .sql file into the db2.db server.

There are problems with this approach:

  • mysqldump locks the source database, making it inaccessible while the dump is happening.
  • mysqldump creates files which may be many times the size of the source database’s binary files, potentially exhausting the space on your source server before it’s even done.
  • the resulting file then needs to be imported into the target server, which could take hours depending on the size.

I needed to clone some databases in a hurry that are about 20G in size. The method I used ended up taking less than half an hour to complete, and the source database (db1.db) only had to be down for less than a minute, instead of the potential /hours/ in the mysqldump method.

  1. use rsync on db2.db to copy the data directories from db1.db to db2.db:
    cd /var/lib/ && rsync root@db1.db:/var/lib/mysql ./ -rva --progress --delete
  2. use rsync on db2.db to copy binary logs from db1.db to db2.db:
    cd /var/log/ && rsync root@db1.db:/var/log/mysql ./ -rva --progress --delete
  3. repeat 1&2 (the first time around would take some time. the second time around will be quick)
  4. on db1.db, stop the database
    service mysqld stop
  5. on db2.db, repeat 1&2 one last time
  6. on db1.db, start the database again, and start the slave service if you need to
    service mysqld start
  7. on db2.db, remove auto.cnf and any innodb log files
    cd /var/lib/mysql/ && rm -f auto.cnf ib_logfile*
  8. start the database, and start the slave if needed
    service mysqld start

With the above method, your source database will be down for only a minute or so (steps 4-6).

The reason that 1&2 are repeated 3 times:

  1. clone the db1.db database from scratch. this will take a while
  2. because it took so long to run #1, there are probably a lot of changes. repeat to get those changes
  3. when you stop db1.db, some files will get final changes as the server flushes everything to disk. grab those after db1.db has been stopped

The InnoDB log files copied over from db1.db (removed in step 7) could cause the new server to attempt to “fix” some tables it might think are broken. Because we did a clean shutdown in step 4, that recovery is not needed, so delete the log files (they will be recreated automatically when the server starts).

If you are doing the clone because you want to create a new slave database, then the database needs a new internal ID that it will send to the master. By deleting auto.cnf, you force the MySQL server to create a new unique ID.

24 Jun

kbarcode part 1 – JavaScript

Short story:

Github repository for kbarcode – the JavaScript part of the solution. You can use it on its own, without needing Cordova at all.

Demo of kbarcode finding a barcode and then the barcode parameters being printed out onto the image that the barcode was found in.

Long story:

There are already a few barcode readers for Cordova. The most popular one is the official Phonegap barcode plugin, which is based on the amazingly comprehensive ZXing library of algorithms.

At FieldMotion, we were using the official plugin, but it had a few shortcomings that meant we had to look for a better solution:

  • When looking for a barcode, the plugin opens up an external camera application. This means that your own application stops, the external app is started, and when you find your barcode, your app is started up again. This process is very jarring, and noticeably slow.
  • You have absolutely no say over the look of the barcode scanner.
  • If you want multiple barcodes, you are out of luck – you’re just going to have to go through the selection process manually for every one of them.

What we wanted was:

  • A small camera view to appear when we press to select a barcode.
  • To be able to style this ourselves in whatever way we want.
  • To optionally keep the scanner open after it has found a barcode, so that it can keep scanning for others if need be.
  • It must feel natural and fast.

So, we went looking.

The nearest thing to a solution that we found was a combination of two plugins – Moonware’s CameraPlus plugin, which allows the camera to be opened in the background and its photos returned to a JavaScript callback for you to handle however you wish, and Eddie Larsson’s JOB (JavaScript-only Barcode Reader).

In combination, these appear to be perfect – we could get images via CameraPlus, display them in a popup UI that could be used by the user to center the barcode, and then use JOB to scan the image and retrieve the code.

Unfortunately, this method is SLOW.

I identified two main reasons for this:

  1. Streaming images to JavaScript via a Java bridge is very slow, because the images need to be encoded in Base64 (increasing their size), and the images also need to be in high resolution in order to give the barcode reader the best chance it can get.
  2. The method that Eddie’s algorithm uses is to find the barcode in the image, no matter where it is, which involves reading the entire image. In JavaScript. Brilliant, but slow.

After some wracking of the brain, I came up with this solution:

  • Tweak the CameraPlus plugin so it returns just a small image to be displayed, and also a 1px high gray-scale strip from the center of a higher-resolution image (in byte array format).
  • Write a barcode decoding algorithm that will find a barcode in a 1D array of gray-scale values, instead of a 2D image.

This worked wonderfully. We now have a very usable barcode reader that is not very laggy, and finds the barcodes incredibly quickly. We’re also only interested in the EAN-13 encoding, so we don’t need to check for other encodings.

The reason we chose to use a 1D strip instead of the entire image is that if your UI displays a marker where you want the user to put the barcode, they are psychologically inclined to do so, so you really only need to consider that single central strip, and can safely ignore the rest of the image.

It’s a Worker, so it runs in a separate thread to the rest of your code. No need to include it in your HTML file – just correct the reference to the file in the code example below.

Example usage:

var kbarcode=new Worker('kbarcode.js');
kbarcode.addEventListener('message', function(result) {
  if (result.data && result.data.value) { // the worker's result arrives in the event's data property
    alert(result.data.value);
  }
});
kbarcode.postMessage({
  'cmd': 'decode',
  'img': [43,43,42,42,42,42,42,42,42,42,43,43,43,43,43,43,43,43,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,41,41,41,41,41,41,41,41,41,41,41,41,41,41,42,41,41,41,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,41,41,41,41,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,39,39,39,39,39,39,39,40,39,39,39,39,38,38,38,38,38,39,39,39,40,40,40,39,39,39,39,39,39,39,39,40,40,40,40,40,40,40,40,40,39,39,39,39,39,39,39,39,39,40,40,40,40,40,40,40,40,40,40,40,40,40,40,39,40,40,40,40,39,39,39,39,39,39,38,38,38,38,38,38,38,39,39,39,39,39,39,39,39,39,39,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,39,39,39,39,39,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,41,41,41,41,41,42,42,42,42,42,43,43,43,43,43,42,42,42,42,42,42,43,43,43,43,43,44,44,44,44,44,45,45,45,45,45,46,46,46,46,46,47,47,47,47,48,48,48,48,49,49,49,49,50,50,50,51,51,51,52,52,52,53,53,53,53,54,54,54,55,55,56,56,57,57,58,58,59,59,59,60,61,61,62,63,63,64,65,65,65,66,66,67,68,69,70,71,71,73,74,79,115,166,191,200,202,204,204,204,205,206,206,207,207,207,207,207,207,207,207,207,207,207,207,207,207,207,207,207,207,207,207,207,207,207,207,207,207,206,206,206,206,206,206,206,205,205,205,205,203,198,184,145,93,97,143,178,181,156,105,90,114,162,175,152,105,83,81,80,79,80,80,81,88,121,168,176,141,99,86,85,85,85,98,159,189,196,198,200,200,200,198,194,175,123,95,113,162,189,196,196,195,186,146,98,96,135,174,177,147,100,91,116,171,189,195,195,191,173,127,85,81,80,79,79,78,78,84,107,163,174,157,107,83,79,79,79,79,78,78,78,78,78,85,116,167,175,155,108,88,104,156,180,171,124,95,100,147,186,195,199,201,201,201,201,202,201,199,193,172,120,91,104,150,186,196,198,199,200,199,197,191,168,101,84,82,82,82,98,149,178,175,132,94,90,123,169,179,155,101,89,108,154,174,170,121,88,89,135,170,171,137,90,85,120,165,176,156,111,85,80,80,80,80,81,82,91,131,179,191,193,193,188,167,115,90,102,155,174,168,126,89,82,82,82,82,82,83,84,106,161,187,193,194,193,182,142,96,95,155,183,193,193,194,190,172,117,95,115,163,186,195,198,198,198,197,196,189,159,113,91,104,161,176,171,128,96,97,138,181,192,196,198,198,199,199,199,199,196,193,182,142,99,92,130,175,189,192,192,189,173,127,87,81,81,80,79,79,80,81,110,155,171,159,118,87,81,80,79,79,79,80,85,112,161,173,158,116,94,99,141,176,187,189,189,181,154,113,93,99,152,173,168,131,97,96,131,174,191,196,198,198,198,199,200,201,201,201,201,201,201,200,199,199,199,198,196,193,181,150,101,91,124,159,169,155,116,89,82,82,82,89,133,173,187,189,189,186,167,124,96,97,131,166,167,142,103,84,79,78,77,77,77,77,85,123,161,170,150,105,89,104,146,169,164,133,92,89,126,166,182,186,187,185,171,125,93,80,78,78,77,78,78,80,107,149,166,158,120,91,94,136,169,180,183,183,181,165,125,94,81,80,79,82,134,169,182,184,184,179,163,121,88,97,138,161,159,128,93,88,129,164,179,186,187,188,188,187,180,163,124,91,79,77,77,81,128,157,159,133,97,85,105,145,157,143,106,87,97,146,174,181,183,184,184,184,182,174,132,96,82,77,76,78,106,154,158,135,99,86,105,150,172,182,185,186,186,187,188,189,189,189,189,189,189,189,189,189,189,189,189,186,185,184,184,183,183,179,167,136,100,79,70,68,66,65,64,64,62,61,60,60,60,60,59,59,59,59,58,58,58,58,57,57,57,57,56,56,55,55,54,54,53,53,53,52,52,51,50,50,49,48,47,46,45,45,44,43,43,42,41,41,40,39,39,38,38,38,38,38,38,37,37,37,37,37,37,37,37,36,36,36,37,37,37,38,38,38,38,39,39,38,37,37,37,36,35,34,33,33,32,32,32,32,32,31,31,31,31,31,31,31,31,31,31,31,32,32,32,31,31,31,31,31,30,30,29,29,29,29,29,28,28,28,28,28,28,28,28,28,
28,28,28,28,29,29,29,29,29,29,29,29,28,28,28,28,28,28,28,28,28,28,28,28,28,28,29,29,29,29,28,28,28]
});

In the next post in this series, I’ll upload the Cordova plugin we developed to use with this.

06 Nov

Distributed File Storage, using PHP and MongoDB

Scenario:

  • Alice creates an entry on Server1, and uploads an image to it.
  • Bob views that entry on Server2, but can’t view the image because the server doesn’t have it.

There are a number of solutions to this.

  1. after each upload, push the new file out to all servers so they also have a copy
  2. mount an external file system, networked to all servers
  3. create a caching distributed file system centered around an external database

The first solution, ensuring that every uploaded file is simultaneously uploaded to all servers, is wrong for an obvious reason: hard-drive space. Imagine you have 20 servers and the file is likely only to ever be read on 3 of them (maybe they’re location-based?) – by uploading to all servers, you waste space, increasing storage costs and also slowing down the servers as they are busy doing work that they really don’t need to be doing.

The second solution is better – an external mounted solution such as NFS, S3QL, or Samba can store your files on file servers that are backed up and replicated, and are simultaneously available to all your web servers. But these solutions come at a huge speed cost – every file check involves network access, lock checking, POSIX compliance and other ugliness. Also, network file systems of this sort are very sensitive to network outages, however temporary they are.

The solution we will build in this article is to create an external file system that

  1. supports local caching of files for speed
  2. has immediate availability of files across all servers
  3. is “shardable”, so files only exist on servers where they are actually needed

Storage

To store the files, you need an external storage solution. For reasons that we will see later, the solution I use is MongoDB and its GridFS solution.

MongoDB is a NoSQL database that stores information in binary JSON (BSON) documents. It is extremely scalable, and shards nicely as well, allowing us to concentrate more on our application and less on database maintenance.

To store the files, we will upload them into the MongoDB network, where they will be stored as “chunks”. Retrieving and storing the files is a simple matter, as we’ll see.

Saving Files

Up until now, all your files were recorded on the system using direct access – using file_put_contents(), for example.

We need to find all instances of these calls and route them through a new function called mdbFileSet (MongoDB File Set) that will record the file as requested, but will also upload it to the database.

In most cases, this is a simple matter – if the user-files directory is $_SERVER['DOCUMENT_ROOT'].'/userfiles/', then a call such as file_put_contents($_SERVER['DOCUMENT_ROOT'].'/userfiles/'.$filename, $filecontent) will be replaced with mdbFileSet($filename, $filecontent). This is obviously more readable, and we are abstracting the user-files location as well, making it flexible.
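
In code form, the change looks like this (the variable names are just for illustration):

<?php
// before: write the uploaded file straight to the local user-files directory
file_put_contents($_SERVER['DOCUMENT_ROOT'].'/userfiles/'.$filename, $filecontent);

// after: route it through mdbFileSet(), which caches it locally and also uploads it to GridFS
mdbFileSet($filename, $filecontent);
?>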

The actual mdbFileSet() function works like this

  1. parameters are $fname and $file, which contain the filename (including the directories, delimited by ‘/’), and the file content as a string.
  2. check GridFS to see if the file already exists. If it does:
    1. delete the existing file (see Deleting Files in this article)
  3. copy the uploaded file to the local user-files location (to act as a cache)
  4. upload the file using GridFS

Code for the mdbFileSet function:

function mdbFileSet($fname, $file) {
  if (strpos($fname, '..')!==false) { // hack attempt
    return false;
  }
  global $MDBVARS;
  if (strpos($fname, '/')!==false) {
    @mkdir($MDBVARS['cache'].preg_replace('/[^\/]*$/', '', $fname), 0755, true);
  }
  file_put_contents($MDBVARS['cache'].$fname, $file);
  $conn=new Mongo($MDBVARS['dbhost']);
  $db=$conn->{$MDBVARS['dbname']};                
  $db->authenticate($MDBVARS['username'], $MDBVARS['password']);
  $grid=$db->getGridFS();                    
  $existing=$grid->findOne($fname);               
  if (!is_null($existing)) {
    $grid->delete($existing->file['_id']);
  }
  $grid->storeBytes($file, array('filename'=>$fname), array('safe'=>true));
  $conn->close();
}

You will need to set the $MDBVARS global array before running the function. I keep mine in the server’s config.php.

Example:

$MDBVARS=array(
  'cache'=>$_SERVER['DOCUMENT_ROOT'].'/userfiles/',
  'username'=>'username',
  'password'=>'password',
  'dbname'=>'filesdb',
  'dbhost'=>'mdb1.yourmongodbserver.com'
);

Replace the values in the above code with your own values.

You can test this easily. Create a test.php file with the following code:

<?php
require_once 'php/basics.php'; // link to file containing common functions
mdbFileSet('test/file.php', file_get_contents(__FILE__));
?>

The above code will upload a copy of the test.php file you just created, and will store a copy in your cache as well. After loading the file in your browser, you can test this by looking in your cache on the server:

[root@cp3 server]# ls userfiles/test/file.php -l
-rw-r--r-- 1 apache apache 142 Nov  6 10:36 userfiles/test/file.php

And also by logging into the MongoDB server and searching for the file:

> db.fs.files.find({filename:'test/file.php'})
{ "_id" : ObjectId("545b4f1560b99367688b456b"), "filename" : "test/file.php", "uploadDate" : ISODate("2014-11-06T10:36:05.121Z"), "length" : NumberLong(142), "chunkSize" : NumberLong(261120), "md5" : "00397d7306c53cda5ea9446d7bd62594" }

Before going any further, you should go through your code now and edit all your user-file-writing functions so they use the mdbFileSet() function. Everything should still work as before, but now, there will be a copy of each file saved in the MongoDB database as well.

Reading Files

Okay, so let’s say all your work so far in this article has been done on Server1. You now switch over to Server2 and want to open a record that includes an image uploaded to Server1. The image is obviously not on Server2, so how do we transparently download it to Server2 such that the end-user never needs to know?

For this, we will write a function called mdbFileGet (MongoDB File Get), which will retrieve it from the MongoDB server if it is not already cached locally. How it works:

  1. there is one parameter, $fname, which is the filename including the directories.
  2. if the file already exists in the local server’s cache, then return that file’s contents.
  3. otherwise, download the file from GridFS, store a copy in the local cache, and return the file’s contents.

There is an issue to do with the cache, which I’ll explain in a moment, but in the meantime, here is the code for the function:

function mdbFileGet($fname) {
  if (strpos($fname, '..')!==false) { // hack attempt
    return false;
  } 
  global $MDBVARS;
  if (file_exists($MDBVARS['cache'].$fname)) {
    return file_get_contents($MDBVARS['cache'].$fname);
  }
  $conn=new Mongo($MDBVARS['dbhost']);         
  $db=$conn->{$MDBVARS['dbname']};                
  $db->authenticate($MDBVARS['username'], $MDBVARS['password']);        
  $fdata=$db->fs->files->findOne(array('filename'=>$fname));
  if (is_null($fdata)) { // file doesn't exist
    $conn->close();
    return false;
  }
  $grid=$db->getGridFS();                    
  $file=$grid->findOne(array('filename'=>$fname));
  if (strpos($fname, '/')!==false) {
    @mkdir($MDBVARS['cache'].preg_replace('/[^\/]*$/', '', $fname), 0755, true);
  } 
  $bytes=$file->getBytes();
  file_put_contents($MDBVARS['cache'].$fname, $bytes);
  touch($MDBVARS['cache'].$fname, $file->file['uploadDate']->sec); // set the cached copy's mtime to the GridFS upload time
  $conn->close();
  return $bytes;
}

For an example of this in use, let’s consider an image, /userfiles/1/image.jpg, that was uploaded to Server1. It’s obviously not yet on Server2, so how do we view it there?

When loading the file up (let’s say http://server2.yourcomp.any/userfiles/1/image.jpg), the server looks directly for the image, and doesn’t find it. We need to route the request through a script that makes sure the file is there before sending it back.

To do that in this case, we can use mod_rewrite so that calls to /userfiles/[whatever] are routed to something like /php/file-get.php, which handles the work.

Edit your .htaccess file, and add in something like this:

RewriteEngine on
RewriteRule ^userfiles/.*$ /php/file-get.php [QSA,L]

Now create the file php/file-get.php:

<?php
require_once 'basics.php'; // load common functions and config.php
$fname=preg_replace('/^\/userfiles\/|\?.*/', '', $_SERVER['REQUEST_URI']);
if (strpos($fname, '..')!==false) { // hack attempt
  exit;
}
$ext=strtolower(preg_replace('/.*\./', '', $fname));
switch ($ext) {
  case 'png':
    header('Content-type: image/png');
  break;
  case 'jpg': case 'jpeg':
    header('Content-type: image/jpeg');
  break;
  case 'gif':
    header('Content-type: image/gif');
  break;
  default:
    header('Content-type: ');
}
echo mdbFileGet($fname);
?>

You can see that most of the file’s code is actually just figuring out the mime-type to show. The downloading and showing of the file is done right at the last line.

You can now transparently upload files on one server and view them on another!

In fact, once the file is uploaded, you can remove it completely from all servers, and then when you next need it, just load it up through mdbFileGet() as normal and it will download again.

The caching issue that I mentioned earlier has to do with cache invalidation. Let’s say we upload image.jpg and it is distributed to a number of servers. After a few hours, we might upload a replacement image – how do we tell the servers that the old image is invalid and it should be downloaded again?

We will start solving that in the next section.

Deleting Files

Deleting files is not as obvious as it sounds. On a one-server system, it’s simply a matter of using unlink() to remove the file, and there’s no more to be said about that.

However, in a multi-server system, we have three steps:

  1. delete the local cached file
  2. delete the database-stored file
  3. find all servers that have a copy of the file and delete the file from those servers.

#1 and #2 can be solved immediately in a very simple function:

function mdbFileRemove($fname) {
  if (strpos($fname, '..')!==false) { // hack attempt
    return false;
  }
  global $MDBVARS;
  @unlink($MDBVARS['cache'].$fname);
  $conn=new Mongo($MDBVARS['dbhost']);
  $db=$conn->{$MDBVARS['dbname']};
  $db->authenticate($MDBVARS['username'], $MDBVARS['password']);
  $grid=$db->getGridFS();
  $existing=$grid->findOne($fname);
  if (!is_null($existing)) {
    $grid->delete($existing->file['_id']);
  }
  $conn->close();
}

The above will delete a file from the local server and from the MongoDB database, but will not clear the file from other server caches.

To delete from the other machines, we need to set up a deletion queue, which we’ll do later in the File Delete Queues section.

Creating File Delete Queues

To delete files from all servers, we need to send a message to those servers to tell them to delete their local copies of the file.

Sending a message to every single server in your network is a waste of resources, as most of the servers may not actually have a copy of the file you are trying to delete.

So, we need to adapt the mdbFileSet and mdbFileGet functions so they add a record to the database telling it exactly what servers have copies of the files. This will then allow us to target just those servers and to know that we’re not wasting time.

Edit the mdbFileSet function and change this line:

$grid->storeBytes($file, array('filename'=>$fname), array('safe'=>true));

to this:

$grid->storeBytes(
  $file,
  array('filename'=>$fname, 'servers'=>array($_SERVER['HTTP_HOST'])),
  array('safe'=>true)
);

As a test, I uploaded an image called 3184/user-photos/3184.jpg, then checked my MongoDB instance:

> db.fs.files.find({filename:'3184/user-photos/3184.jpg'})
{ "_id" : ObjectId("545b68a160b993b86b8b4567"), "filename" : "3184/user-photos/3184.jpg", "servers" : [ "cp3.myserver.com" ], "uploadDate" : ISODate("2014-11-06T12:25:05.344Z"), "length" : NumberLong(37182), "chunkSize" : NumberLong(261120), "md5" : "9def8b14cb1611097e755692d04dcbdd" }

Note the servers field in that record. As part of the file upload, we are initialising an array which states which servers have a copy of that file.

An important thing to note as well, is that in GridFS, the file is recorded in a set of chunks which are standard MongoDB documents, and the metadata of the file is recorded in another normal document. What we look at with db.fs.files.find is the metadata, not the file chunks. It would be uneconomical to store metadata within the same document(s) as the file chunks, as checking something as simple as its creation date, or the list of servers that have it, would then involve downloading the entire file.

Next, we need to adapt the mdbFileGet() function. Change the following:

touch($MDBVARS['cache'].$fname, $file->file['uploadDate']->sec);
$conn->close();

to this:

touch($MDBVARS['cache'].$fname, $file->file['uploadDate']->sec);
$db->fs->files->update(
  array('filename'=>$fname),
  array('$push'=>array('servers'=>$_SERVER['HTTP_HOST']))
);
$conn->close();

In this, we inline-update the servers array that we created in mdbFileSet(). There is no need to download, change, and re-upload the record; in fact, doing it that way would create a race condition, as some other server may be doing the same thing at the same time. It is safer to have the MongoDB server handle the update of the document directly.

If you then open the image on another server and check the file again on the MongoDB server, you’ll see something like this:

> db.fs.files.find({filename:'3184/user-photos/3184.jpg'})
{ "_id" : ObjectId("545b68a160b993b86b8b4567"), "filename" : "3184/user-photos/3184.jpg", "servers" : [ "cp3.myserver.com", "cp4.myserver.com" ], "uploadDate" : ISODate("2014-11-06T12:25:05.344Z"), "length" : NumberLong(37182), "chunkSize" : NumberLong(261120), "md5" : "9def8b14cb1611097e755692d04dcbdd" }

Note that the servers array has an extra entry in it, but nothing else was touched. Exactly what we want.

Next, we need to adapt the mdbFileRemove function, so it builds the queue of files to delete (and what servers to delete them from).

To do that, change the following:

  if (!is_null($existing)) {
    $grid->delete($existing->file['_id']);
  }

to this:

  if (!is_null($existing)) {
    if (isset($existing->file['servers'])) {
      $servers=array_unique($existing->file['servers']);
      $idx=array_search($_SERVER['HTTP_HOST'], $servers);
      if ($idx!==false) {
        unset($servers[$idx]);
      }
      $list=array_values(array_map(function($server) use ($fname) {
        return array(
          'filename'=>$fname,
          'server'=>$server
        );
      }, $servers));
      $ret=$db->command(array('insert'=>'deletes', 'documents'=>$list));
    }
    $grid->delete($existing->file['_id']);
  }

This code inserts an entry into a db.deletes collection on the MongoDB server for every server that has a cached copy of the file. Of course, it removes a reference to the local server before doing so, as we can handle that immediately.

After doing an update of the image on cp3.myserver.com, I then checked the MongoDB deletes collection:

> db.deletes.find()
{ "_id" : ObjectId("545b7f6165f402bccee49573"), "filename" : "3184/user-photos/3184.jpg", "server" : "cp4.myserver.com" }

This means we can now work on the next part; writing a deletion daemon.

Running a File Deletion Queue

We now have a list of the cached files and the servers that have them. But how do we tell those servers to delete those cached files?

A way to do this is to write a cron job that runs every minute and checks the MongoDB deletes collection to see if there are any cached files that need to be deleted, then call those servers and tell them to delete the files.

This script will need to run directly on the MongoDB server, so install PHP on that server. In particular, you will need the command-line version of PHP. On CentOS 7, it is installed like this:

[root@mdb1 ~]# yum install php-cli php-devel php-pear gcc openssl-devel
[root@mdb1 ~]# pecl install mongo
[root@mdb1 ~]# echo "extension=mongo.so" >> /etc/php.ini

On the MongoDB server, create a user called mongo (useradd mongo), and create a file called /home/mongo/checkCaches.php:

<?php
$MDBVARS=array(
        'username'=>'username',
        'password'=>'password',
        'dbname'=>'filesdb',
        'dbhost'=>'mdb1.yourmongodbserver.com',
        'apikey'=>'805f73958de1653e073e6a8c674bb1e8'
);
$conn=new Mongo($MDBVARS['dbhost']);
$db=$conn->{$MDBVARS['dbname']};
$db->authenticate($MDBVARS['username'], $MDBVARS['password']);
$fdata=$db->deletes->find();
while ($d=$fdata->getNext()) {
        $url='http://'.$d['server'].'/php/cacheClear.php';
        $ch=curl_init($url);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $fname=$d['filename'];
        $now=''.microtime(true);
        $md5=md5($fname.$now.$MDBVARS['apikey']);
        curl_setopt($ch, CURLOPT_POSTFIELDS, array(
                'filename'=>$fname,
                'time'=>$now,
                'md5'=>$md5
        ));
        $ret=curl_exec($ch);
        if ($ret=='ok') {
                $db->deletes->remove(array('_id'=>new MongoId($d['_id'])));

        }
        curl_close($ch);
}
?>

The $MDBVARS array is almost the same as those on the application servers. We add a new item, though, apikey, which helps us provide some authentication without needing usernames and passwords. By running the filename, the time, and the apikey through an MD5 function, we create a value that can only reasonably be reproduced by another MD5 function that knows the same details. So, we send the filename, time and MD5 result through to the target server, and if the target server can reproduce the MD5 result by MD5ing the filename, time, and its own copy of the apikey, then that’s enough proof that the call is valid.

Make sure to add the apikey entry to all your servers’ $MDBVARS arrays.

On the target server, then, we create the /php/cacheClear.php file:

<?php
require_once 'basics.php';
$fname=$_REQUEST['filename'];
$md5=md5($fname.$_REQUEST['time'].$MDBVARS['apikey']);
if ($md5!=$_REQUEST['md5']) {
  echo 'incorrect API key';
  exit;
}
if (strpos($fname, '..')!==false) { // check for hacks
  exit;
}
@unlink($MDBVARS['cache'].$fname);
echo 'ok';
?>

As usual, there is a potential flaw to consider. The checkCaches.php file on the MongoDB server goes through every delete entry in the database, but what if this takes more than a minute to finish?

If it takes more than a minute to finish, and the script is being called once a minute, then eventually, the server will have multiple copies of the script running against overlapping lists of files, and it will crash.

The solution to this is simply to add a timeout to the script, so it runs for 55 seconds (say) and then stops.

In checkCaches.php on the MongoDB server, change the following:

while ($d=$fdata->getNext()) {

to this:

$start=time();
while ($d=$fdata->getNext()) {
  if (time()-$start>55) { // stop after 55 seconds; the next cron run will pick up where this left off
    exit;
  }

Now it will simply stop after 55 seconds, and continue when it is called again.

Edit cron for the mongo user (su mongo -c “crontab -e”) and add this line then save the file:

* * * * * php /home/mongo/checkCaches.php >/dev/null 2>/dev/null

That’s it! You now have a working distributed filesystem.

22 Sep

Keeping a PHP session alive

I get asked this a lot. When you log into a session-based system and walk away for half an hour, you’ll frequently come back to find you are no longer logged in. How do you keep the session alive when you are not active on the site?

The solution is to regularly “poll” the server with JavaScript and have the server change the session so its expiry is reset.

Each page of the site should have the following JavaScript code. Place it in your shared library if you have one.

window.setInterval(function() {
  var el=document.createElement('img');
  el.src='/sessionRenew.php?rand='+Math.random();
  el.style.opacity=.01;
  el.style.width='1px';
  el.style.height='1px';
  el.onload=function() {
    document.body.removeChild(el);
  }
  document.body.appendChild(el);
}, 60000);

This creates an image that is 1px*1px and mostly invisible; as soon as it is appended to the page, the browser requests it from the server (the random parameter ensures the request isn’t served from the browser’s cache). Once the data has loaded from the server, the image is deleted again. No particular JavaScript library is needed.

Now, on the server, create the following file as sessionRenew.php:

<?php
session_start();
$_SESSION['rand']=rand();
echo $_SESSION['rand'];
?>

This opens (or creates) the active session, records a change to it (the ‘rand’ value), and echoes it out. The reason you force a change is that some session systems, such as memcache, don’t update the session expiry on simple reads, so you need to make an actual change.

23 Jun

Gardenbot 2014

Every year, I start a new Gardenbot project, and it rarely gets any further than a wish list. This year is different. I have a fully-functional robot that is battery-powered and can be controlled remotely via WiFi.

[photo: IMAG0838]

Getting this far has not been easy, so I’ll write up what I can remember so you can do the same (or so I can do the same again next year after I forget!)

The biggest problem was the computer itself. I’m using a Raspberry Pi, but powering it was tricky.

The Pi takes a 5V input, but I couldn’t find any ready-made 5V batteries, and didn’t want to use battery packs as I wanted to easily recharge individual batteries.

In the past, my experience with using batteries in series with each other was that one battery would discharge fully way before any others, leaving an apparently dead pack. To solve that, I’m using li-ion batteries scavenged from phones; each with at least 2800mAh in them. I link them in parallel, and “boost” the voltage using some regulators.

There are currently two voltage boosters in the system. The first one powers the Pi, and the second powers the USB hub. You can’t power the USB hub directly off the Pi as the Pi uses 700mA, and there’s not enough left over to power anything useful. So, for anything external, such as the WiFi and the camera on the robot, you need to use a powered hub.

To save space, I stuck the voltage boosters for the USB hub and the Pi inside the Pi case, as you can see in this photo. They’re the rectangular circuits with the large capacitors on them.

[photo: IMAG0839]

The capacitors are there to help stop fluctuations in power supply as various bits and pieces are turned on. There’s nothing quite as annoying as turning on a motor only to find that you have lost WiFi because of it and now have no way to turn off the motor.

The robot chassis is a hand-built case made from two perspex sides, a wooden base, and a wooden front. I didn’t measure anything – it was all done by trial/error.

The tracks are from Tamiya (example store). The box comes with enough for a larger base, but I didn’t need it all.

The claw at the front doesn’t work perfectly yet. The one I currently have is one I bought a few years back. It never seems to work properly for me. I think I need either a stronger servo, or just replace the claw completely.

The servo cable has three wires – ground, power in, and signal. The ground and power in can be plugged directly into the batteries. The signal, I hooked to GPIO 1 on the Pi (using this wiring guide for the main GPIO connector), which is then controlled using pulse width modulation (PWM) through the pin.

The motors for the treads are scavenged from the legs of a Robosapien bot I got for Christmas a few years back. These are standard DC motors, probably for up to 5V, but I’m running them off 3V and happy with them.

To control the motors, I was initially planning to create my own motor controller using some PNP and NPN transistors, but found a motor controller circuit from an old Cybot that handily does exactly what I need.

[photo: IMAG0840]

The camera is a standard web-cam, with the cables shortened.

Turning the machine on is done by simply connecting the little red cable between the battery-side and the “other stuff” side of the breadboard as you can see in the image above.

To charge the battery, I simply hook in a Li-ion charger directly into the left of the board (below). The charging circuit will happily charge multiple Li-ion batteries.

[photo: IMAG0841]

Hardware-wise, I’m almost happy. I want to replace the claw soon, but apart from that, I’m ready to work on software.

I already have code written for controlling the motors, which I’ll upload into Github over the next few days. I’m looking into SLAM now for creating maps via the camera system. I might have to write the solution myself, though, as the code I’ve found so far is written in academicese and I don’t understand it.

Funny, that, as I’m certain I can write the bloody code, but can’t understand the words that the academics use!

20 Mar

Salting Passwords

The simplest way to store a password in a database is as a plain string

insert into users set email="kae@kvsites.ie", password="password";
["kae@kvsites.ie", "password"]

But, if someone hacks into the server, or you have a malicious admin, then those passwords can be stolen. This is a big security risk as passwords tend to be re-used by people for other purposes, such as PayPal, etc.

So, the next stage is to hash the password, using something like MD5:

insert into users set email="kae@kvsites.ie", password=md5("password");
["kae@kvsites.ie", "5f4dcc3b5aa765d61d8327deb882cf99"]

That /looks/ secure, but there are huge databases on the Internet with MD5 translations of all words, so it is trivial to hack these.
https://www.google.ie/search?q=5f4dcc3b5aa765d61d8327deb882cf99

The next stage is to “salt” the password by adding a prefix to it before hashing. For example, let’s use “123ghjzxc” as the salt key.

insert into users set email="kae@kvsites.ie", password=md5(concat("123ghjzxc", "password"));
["kae@kvsites.ie", "9f400bac0b5a9b3d66c9c98aae09fab5"]

This is much more secure now. A search for the MD5 hash will not return any results at all (well, this page… but you know what I mean).
https://www.google.ie/search?q=9f400bac0b5a9b3d66c9c98aae09fab5

Another method is to hash the password before prefixing it with the salt, then hashing again. This may be a bit more secure again.

insert into users set email="kae@kvsites.ie", password=md5(concat("123ghjzxc", md5("password")));
["kae@kvsites.ie", "d1dddda63a6dde54fb1740dffe3faa27"];

As an extra step, do all the MD5ing outside the database, so the password is not sent over the wire to the database.
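
To illustrate that last point, here is a minimal sketch (my example, assuming a PDO connection in $db, and that $email and $password hold the values submitted by a signup form):

<?php
// Compute the salted hash in PHP so the plaintext password never travels
// to the database server. $db is assumed to be a PDO connection to MySQL.
$salt='123ghjzxc';
$hash=md5($salt.md5($password));
$stmt=$db->prepare('insert into users set email=:email, password=:password');
$stmt->execute(array(':email'=>$email, ':password'=>$hash));
?>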

29 Jan

CSRF

CSRF (cross-site request forgery) attacks are hacks where a user on one system is tricked into doing something on that system while browsing another system.

Example

Let’s say you are logged into http://yoursite.example.com/ as an admin, and you can easily delete an object by clicking a link that sends a request to http://yoursite.example.com/a/delete.php?object=1.

You take a break and go read some websites.

Now, let’s say that one of those websites has had a little piece of code attached

<img src="http://yoursite.example.com/a/delete.php?object=1" style="display:none"/>

Other readers of the same site will not notice anything – they’re not logged into your site, and so have no delete rights. But, you are!

This vulnerability is called “CSRF” because the hack happens on a different website than your own, taking advantage of the fact that you are logged in, to delete stuff (or move money, etc).

Solution

On the server, you should create a CSRF token, send it to the client, and make sure that all actions that are requested include that token.

To set the token, just create a random string and save it to your session:

<?php
session_start(); // make sure the session is available before using $_SESSION
if (!isset($_SESSION['csrf'])) {
  $_SESSION['csrf']=md5(mt_rand().time());
}

Then, whenever an action is performed, make sure that the request includes that token before the action is performed.

<?php
$headers=apache_request_headers();
if (!((isset($_REQUEST['_csrf']) && $_REQUEST['_csrf']==$_SESSION['csrf'])
  || (isset($headers['X-CSRF']) && $headers['X-CSRF']==$_SESSION['csrf']))
) {
  header('Content-type: text/json');
  echo json_encode(array(
    'error'=>'CSRF violation'
  ));
  exit;
}

Note that my code above allows two ways to send the CSRF – as a request variable (GET/PUT/POST), or as a header.

For HTML forms, make sure that each form includes the CSRF token:

<input type="hidden" name="_csrf" value="<?=$_SESSION['csrf'];?>"/>

And finally, for AJAX, make sure that the token is included by default. Personally I use jQuery, so this does that:

  $.ajaxSetup({
    'beforeSend': function(xhr) {
      xhr.setRequestHeader('X-CSRF', window.csrf);
    }
  });

(make sure that window.csrf is set as inline javascript in the page)
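
For example, if your pages are built in PHP, a one-liner in the page template will do (adjust to however you build your pages):

<script>window.csrf='<?=$_SESSION['csrf'];?>';</script>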

Conclusion

Now what happens is that each time a request is made to the server, the CSRF token that’s sent is checked against the session’s CSRF token, and if they don’t match (or no token is sent), then the action is ignored.

It is not possible for any website to guess your CSRF token (we set it to a random MD5), so you are safe.