Distributed File Storage, using PHP and MongoDB

Scenario:

  • Alice creates an entry on Server1, and uploads an image to it.
  • Bob views that entry on Server2, but can’t view the image because the server doesn’t have it.

There are a number of solutions to this.

  1. after each upload, push the new file out to all servers so they also have a copy
  2. mount an external file system, networked to all servers
  3. create a caching distributed file system centered around an external database

The first solution, ensuring that every uploaded file is simultaneously uploaded to all servers, is wrong for an obvious reason: hard-drive space. Imagine you have 20 servers and the file is likely only to ever be read on 3 of them (maybe they’re location-based?) – by uploading to all servers, you waste space, increasing storage costs and also slowing down the servers as they are busy doing work that they really don’t need to be doing.

The second solution is better – an external mounted solution such as NFS, S3QL, or Samba can store your files on file servers that are backed up and replicated, and are simultaneously available to all your web servers. But these solutions come at a huge speed cost – every file check involves network access, lock checking, POSIX compliance and other ugliness. Also, network file systems of this sort are very sensitive to network outages, however temporary they are.

The solution we will build in this article is to create an external file system that

  1. supports local caching of files for speed
  2. has immediate availability of files across all servers
  3. is “shardable”, so files only exist on servers where they are actually needed

Storage

To store the files, you need an external storage solution. For reasons that we will see later, the one I use is MongoDB with its GridFS file store.

MongoDB is a NoSQL database that stores documents in a binary JSON format (BSON). It scales well and shards nicely, allowing us to concentrate more on our application and less on database maintenance.

To store the files, we will upload them into the MongoDB network, where they will be stored as “chunks”. Retrieving and storing the files is a simple matter, as we’ll see.

Saving Files

Up until now, all your files were recorded on the system using direct access – using file_put_contents(), for example.

We need to find all instances of these calls and route them through a new function called mdbFileSet (MongoDB File Set) that will record the file as requested, but will also upload it to the database.

In most cases, this is a simple matter: if the user-files directory is $_SERVER['DOCUMENT_ROOT'].'/userfiles/', then a call such as file_put_contents($_SERVER['DOCUMENT_ROOT'].'/userfiles/'.$filename, $filecontent) will be replaced with mdbFileSet($filename, $filecontent). This is obviously more readable, and we are abstracting the user-files location as well, making it flexible.

The actual mdbFileSet() function works like this

  1. parameters are $fname and $file, which contain the filename (including the directories, delimited by ‘/’), and the file content as a string.
  2. check GridFS to see if the file already exists. If it does:
    1. delete the existing file (see Deleting Files in this article)
  3. copy the uploaded file to the local user-files location (to act as a cache)
  4. upload the file using GridFS

Code for the mdbFileSet function:

function mdbFileSet($fname, $file) {
  if (strpos($fname, '..')!==false) { // hack attempt
    return false;
  }
  global $MDBVARS;
  if (strpos($fname, '/')!==false) {
    @mkdir($MDBVARS['cache'].preg_replace('/[^\/]*$/', '', $fname), 0755, true);
  }
  file_put_contents($MDBVARS['cache'].$fname, $file);
  $conn=new Mongo($MDBVARS['dbhost']);
  $db=$conn->{$MDBVARS['dbname']};                
  $db->authenticate($MDBVARS['username'], $MDBVARS['password']);
  $grid=$db->getGridFS();                    
  $existing=$grid->findOne($fname);               
  if (!is_null($existing)) {
    $grid->delete($existing->file['_id']);
  }
  $grid->storeBytes($file, array('filename'=>$fname), array('safe'=>true));
  $conn->close();
}

You will need to set the $MDBVARS global array before running the function. I keep mine in the server’s config.php.

Example:

$MDBVARS=array(
  'cache'=>$_SERVER['DOCUMENT_ROOT'].'/userfiles/',
  'username'=>'username',
  'password'=>'password',
  'dbname'=>'filesdb',
  'dbhost'=>'mdb1.yourmongodbserver.com'
);

Replace the values in the above code with your own values.

You can test this easily. Create a test.php file with the following code:

<?php
require_once 'php/basics.php'; // link to file containing common functions
mdbFileSet('test/file.php', file_get_contents(__FILE__));
?>

The above code will upload a copy of the test.php file you just created, and will store a copy in your cache as well. After loading the file in your browser, you can test this by looking in your cache on the server:

[root@cp3 server]# ls userfiles/test/file.php -l
-rw-r--r-- 1 apache apache 142 Nov  6 10:36 userfiles/test/file.php

And also by logging into the MongoDB server and searching for the file:

> db.fs.files.find({filename:'test/file.php'})
{ "_id" : ObjectId("545b4f1560b99367688b456b"), "filename" : "test/file.php", "uploadDate" : ISODate("2014-11-06T10:36:05.121Z"), "length" : NumberLong(142), "chunkSize" : NumberLong(261120), "md5" : "00397d7306c53cda5ea9446d7bd62594" }

Before going any further, you should go through your code now and edit all your user-file-writing functions so they use the mdbFileSet() function. Everything should still work as before, but now, there will be a copy of each file saved in the MongoDB database as well.
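
The '..' test at the top of mdbFileSet() is a coarse guard against directory traversal. As a minimal sketch of a stricter check, the helper below rejects empty names, absolute paths, NUL bytes, backslashes, and any path segment that is empty, '.' or '..'. The name mdbFileNameIsSafe() is hypothetical and the exact rules are my assumption, so adjust them to suit your own naming scheme.

```php
<?php
// Hypothetical helper, not part of the original code in this article.
// Returns true only for relative, '/'-delimited names with no
// empty, '.' or '..' segments.
function mdbFileNameIsSafe($fname) {
  if (!is_string($fname) || $fname==='') {
    return false;
  }
  if (strpos($fname, "\0")!==false || strpos($fname, '\\')!==false) {
    return false; // NUL bytes and backslashes are never expected
  }
  if ($fname[0]==='/') {
    return false; // absolute paths would escape the cache directory
  }
  foreach (explode('/', $fname) as $segment) {
    if ($segment==='' || $segment==='.' || $segment==='..') {
      return false;
    }
  }
  return true;
}

var_dump(mdbFileNameIsSafe('test/file.php')); // bool(true)
var_dump(mdbFileNameIsSafe('../etc/passwd')); // bool(false)
var_dump(mdbFileNameIsSafe('a//b.jpg'));      // bool(false)
```

Unlike the strpos() test, this accepts a name such as report..final.pdf, where the two dots are part of a filename and not a path segment.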

Reading Files

Okay, so let’s say all your work so far in this article has been done on Server1. You now switch over to Server2 and want to open a record that includes an image uploaded to Server1. The image is obviously not on Server2, so how do we transparently download it to Server2 such that the end-user never needs to know?

For this, we will write a function called mdbFileGet (MongoDB File Get), which will retrieve it from the MongoDB server if it is not already cached locally. How it works:

  1. there is one parameter, $fname, which is the filename including the directories.
  2. if the file already exists in the local server’s cache, then return that file’s contents.
  3. otherwise, download the file from GridFS, store a copy in the local cache, and return the file’s contents.

There is an issue to do with the cache, which I’ll explain in a moment, but in the meantime, here is the code for the function:

function mdbFileGet($fname) {
  if (strpos($fname, '..')!==false) { // hack attempt
    return false;
  } 
  global $MDBVARS;
  if (file_exists($MDBVARS['cache'].$fname)) {
    return file_get_contents($MDBVARS['cache'].$fname);
  }
  $conn=new Mongo($MDBVARS['dbhost']);         
  $db=$conn->{$MDBVARS['dbname']};                
  $db->authenticate($MDBVARS['username'], $MDBVARS['password']);        
  $fdata=$db->fs->files->findOne(array('filename'=>$fname));
  if (is_null($fdata)) { // file doesn't exist
    $conn->close();
    return false;
  }
  $grid=$db->getGridFS();                    
  $file=$grid->findOne(array('filename'=>$fname));
  if (strpos($fname, '/')!==false) {
    @mkdir($MDBVARS['cache'].preg_replace('/[^\/]*$/', '', $fname), 0755, true);
  } 
  $bytes=$file->getBytes();
  file_put_contents($MDBVARS['cache'].$fname, $bytes);
  touch($MDBVARS['cache'].$fname, $file->file['uploadDate']->sec); // touch() expects a Unix timestamp
  $conn->close();
  return $bytes;
}
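
The read path in mdbFileGet() is a read-through cache: serve the local copy if there is one, otherwise fetch the file, store it locally, and return it. The sketch below isolates that logic from MongoDB so it can be tried on its own. cacheReadThrough() and the $fetch callback are hypothetical names; in the real function, the callback's job is done by the GridFS lookup.

```php
<?php
// Hypothetical illustration of the read-through pattern in mdbFileGet().
// $fetch stands in for the GridFS lookup: it returns the file contents
// as a string, or false when the file does not exist remotely.
function cacheReadThrough($cacheDir, $fname, $fetch) {
  $path=$cacheDir.$fname;
  if (file_exists($path)) {
    return file_get_contents($path); // cache hit: no remote call
  }
  $bytes=call_user_func($fetch, $fname);
  if ($bytes===false) {
    return false; // not in the remote store either
  }
  $dir=dirname($path);
  if (!is_dir($dir)) {
    mkdir($dir, 0755, true);
  }
  file_put_contents($path, $bytes); // populate the cache for next time
  return $bytes;
}

// Usage: the second call is served from disk, so $calls stays at 1.
$cacheDir=sys_get_temp_dir().'/mdbcache-'.uniqid().'/';
$calls=0;
$fetch=function($fname) use (&$calls) {
  $calls++;
  return 'contents of '.$fname;
};
$first=cacheReadThrough($cacheDir, '1/image.jpg', $fetch);
$second=cacheReadThrough($cacheDir, '1/image.jpg', $fetch);
```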

For an example of this in use, let’s consider an image, /userfiles/1/image.jpg that was uploaded to Server1. It’s obviously not yet on Server2, so how do we view it there?

When loading the file up (let’s say http://server2.yourcomp.any/userfiles/1/image.jpg), the server looks directly for the image, and doesn’t find it. We need to route the request through a script that makes sure the file is there before sending it back.

To do that in this case, we can use mod_rewrite so that calls to /userfiles/[whatever] are routed to something like /php/file-get.php, which handles the work.

Edit your .htaccess file, and add in something like this:

RewriteEngine on
RewriteRule ^userfiles/.*$ /php/file-get.php [QSA,L]

Now create the file php/file-get.php:

<?php
require_once 'basics.php'; // load common functions and config.php
$fname=preg_replace('/^\/userfiles\/|\?.*/', '', $_SERVER['REQUEST_URI']);
if (strpos($fname, '..')!==false) { // hack attempt
  exit;
}
$ext=strtolower(preg_replace('/.*\./', '', $fname));
switch ($ext) {
  case 'png':
    header('Content-type: image/png');
  break;
  case 'jpg': case 'jpeg':
    header('Content-type: image/jpeg');
  break;
  case 'gif':
    header('Content-type: image/gif');
  break;
  default:
    header('Content-type: application/octet-stream');
}
echo mdbFileGet($fname);
?>

You can see that most of the file’s code is actually just figuring out the mime-type to show. The downloading and showing of the file is done right at the last line.
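
The extension check can also be written as a lookup table, which makes it easier to add types later. This is only a sketch: mimeTypeForFile() is a hypothetical name, and falling back to application/octet-stream for unknown extensions is my assumption.

```php
<?php
// Hypothetical helper: map a filename's extension to a Content-type value.
function mimeTypeForFile($fname) {
  static $types=array(
    'png'=>'image/png',
    'jpg'=>'image/jpeg',
    'jpeg'=>'image/jpeg',
    'gif'=>'image/gif'
  );
  $ext=strtolower(pathinfo($fname, PATHINFO_EXTENSION));
  // unknown or missing extensions fall back to a generic binary type
  return isset($types[$ext]) ? $types[$ext] : 'application/octet-stream';
}

echo mimeTypeForFile('1/image.JPG'), "\n"; // image/jpeg
echo mimeTypeForFile('notes.txt'), "\n";   // application/octet-stream
```

In file-get.php, the whole switch statement could then be replaced by a single header('Content-type: '.mimeTypeForFile($fname)) call.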

You can now transparently upload files on one server and view them on another!

In fact, once the file is uploaded, you can remove it completely from all servers, and then when you next need it, just load it up through mdbFileGet() as normal and it will download again.

The caching issue that I mentioned earlier has to do with cache invalidation. Let’s say we upload image.jpg and it is distributed to a number of servers. After a few hours, we might upload a replacement image – how do we tell the servers that the old image is invalid and it should be downloaded again?

We will start solving that in the next section.

Deleting Files

Deleting files is not as obvious as it sounds. On a one-server system, it’s simply a matter of using unlink() to remove the file, and there’s no more to be said about that.

However, in a multi-server system, we have three steps:

  1. delete the local cached file
  2. delete the database-stored file
  3. find all servers that have a copy of the file and delete the file from those servers.

#1 and #2 can be solved immediately in a very simple function:

function mdbFileRemove($fname) {
  if (strpos($fname, '..')!==false) { // hack attempt
    return false;
  }
  global $MDBVARS;
  @unlink($MDBVARS['cache'].$fname);
  $conn=new Mongo($MDBVARS['dbhost']);
  $db=$conn->{$MDBVARS['dbname']};
  $db->authenticate($MDBVARS['username'], $MDBVARS['password']);
  $grid=$db->getGridFS();
  $existing=$grid->findOne($fname);
  if (!is_null($existing)) {
    $grid->delete($existing->file['_id']);
  }
  $conn->close();
}

The above will delete a file from the local server and from the MongoDB database, but will not clear the file from other server caches.

To delete from the other machines, we need to set up a deletion queue, which we’ll do in the next section, Creating File Delete Queues.

Creating File Delete Queues

To delete files from all servers, we need to send a message to those servers to tell them to delete their local copies of the file.

Sending a message to every single server in your network is a waste of resources, as most of the servers may not actually have a copy of the file you are trying to delete.

So, we need to adapt the mdbFileSet and mdbFileGet functions so they add a record to the database telling it exactly what servers have copies of the files. This will then allow us to target just those servers and to know that we’re not wasting time.

Edit the mdbFileSet function and change this line:

$grid->storeBytes($file, array('filename'=>$fname), array('safe'=>true));

to this:

$grid->storeBytes(
  $file,
  array('filename'=>$fname, 'servers'=>array($_SERVER['HTTP_HOST'])),
  array('safe'=>true)
);

As a test, I uploaded an image called 3184/user-photos/3184.jpg, then checked my MongoDB instance:

> db.fs.files.find({filename:'3184/user-photos/3184.jpg'})
{ "_id" : ObjectId("545b68a160b993b86b8b4567"), "filename" : "3184/user-photos/3184.jpg", "servers" : [ "cp3.myserver.com" ], "uploadDate" : ISODate("2014-11-06T12:25:05.344Z"), "length" : NumberLong(37182), "chunkSize" : NumberLong(261120), "md5" : "9def8b14cb1611097e755692d04dcbdd" }

Note the servers field. As part of the file upload, we are initialising an array which states what servers have a copy of that file.

An important thing to note as well, is that in GridFS, the file is recorded in a set of chunks which are standard MongoDB documents, and the metadata of the file is recorded in another normal document. What we look at with db.fs.files.find is the metadata, not the file chunks. It would be uneconomical to store metadata within the same document(s) as the file chunks, as checking something as simple as its creation date, or the list of servers that have it, would then involve downloading the entire file.

Next, we need to adapt the mdbFileGet() function. Change the following:

touch($MDBVARS['cache'].$fname, $file->file['uploadDate']->sec);
$conn->close();

to this:

touch($MDBVARS['cache'].$fname, $file->file['uploadDate']->sec);
$db->fs->files->update(
  array('filename'=>$fname),
  array('$push'=>array('servers'=>$_SERVER['HTTP_HOST']))
);
$conn->close();

In this, we update the servers array that we created in mdbFileSet() in place, on the MongoDB server. There is no need to download, change, and re-upload the record. In fact, doing it that way would introduce a race condition, as some other server might be doing the same thing at the same time and one of the two updates would be lost. It is safer to have the MongoDB server handle the update of the document directly.

If you then open the image on another server and check the file again on the MongoDB server, you’ll see something like this:

> db.fs.files.find({filename:'3184/user-photos/3184.jpg'})
{ "_id" : ObjectId("545b68a160b993b86b8b4567"), "filename" : "3184/user-photos/3184.jpg", "servers" : [ "cp3.myserver.com", "cp4.myserver.com" ], "uploadDate" : ISODate("2014-11-06T12:25:05.344Z"), "length" : NumberLong(37182), "chunkSize" : NumberLong(261120), "md5" : "9def8b14cb1611097e755692d04dcbdd" }

Note that the servers array has an extra entry in it, but nothing else was touched. Exactly what we want.

Next, we need to adapt the mdbFileRemove function, so it builds the queue of files to delete (and what servers to delete them from).

To do that, change the following:

  if (!is_null($existing)) {
    $grid->delete($existing->file['_id']);
  }

to this:

  if (!is_null($existing)) {
    if (isset($existing->file['servers'])) {
      $servers=array_unique($existing->file['servers']);
      $idx=array_search($_SERVER['HTTP_HOST'], $servers);
      if ($idx!==false) {
        unset($servers[$idx]);
      }
      $list=array_values(array_map(function($server) use ($fname) {
        return array(
          'filename'=>$fname,
          'server'=>$server
        );
      }, $servers));
      if (count($list)) { // nothing to queue if only this server had a copy
        $db->command(array('insert'=>'deletes', 'documents'=>$list));
      }
    }
    $grid->delete($existing->file['_id']);
  }

This code inserts an entry into a db.deletes collection on the MongoDB server for every server that has a cached copy of the file. Of course, it removes a reference to the local server before doing so, as we can handle that immediately.
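
The list-building step can be pulled out into a small function of its own, which makes it easy to check without a database. The sketch below is an equivalent written with a plain loop; mdbBuildDeleteList() is a hypothetical name, and the server names are the example ones used in this article.

```php
<?php
// Hypothetical helper mirroring the queue-building step in mdbFileRemove().
// Returns one array('filename'=>..., 'server'=>...) entry per remote server.
function mdbBuildDeleteList($fname, array $servers, $localHost) {
  $list=array();
  foreach (array_unique($servers) as $server) {
    if ($server===$localHost) {
      continue; // the local copy has already been unlinked directly
    }
    $list[]=array('filename'=>$fname, 'server'=>$server);
  }
  return $list;
}

$list=mdbBuildDeleteList(
  '3184/user-photos/3184.jpg',
  array('cp3.myserver.com', 'cp4.myserver.com', 'cp4.myserver.com'),
  'cp3.myserver.com'
);
print_r($list); // a single entry, for cp4.myserver.com
```

If the local server is the only one holding a copy, the function returns an empty array and there is nothing to insert into the deletes collection.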

After doing an update of the image on cp3.myserver.com, I then checked the MongoDB deletes collection:

> db.deletes.find()
{ "_id" : ObjectId("545b7f6165f402bccee49573"), "filename" : "3184/user-photos/3184.jpg", "server" : "cp4.myserver.com" }

This means we can now work on the next part; writing a deletion daemon.

Running a File Deletion Queue

We now have a list of the cached files and the servers that have them. But how do we tell those servers to delete those cached files?

A way to do this is to write a cron job that runs every minute, checks the MongoDB deletes collection to see if there are any cached files that need to be deleted, then calls those servers and tells them to delete the files.

This script will need to run directly on the MongoDB server, so install PHP on that server. In particular, you will need the command-line version of PHP. On CentOS 7, it is installed like this:

[root@mdb1 ~]# yum install php-cli php-devel php-pear gcc
[root@mdb1 ~]# pecl install mongo
[root@mdb1 ~]# echo "extension=mongo.so" >> /etc/php.ini

On the MongoDB server, create a user called mongo (useradd mongo), and create a file called /home/mongo/checkCaches.php:

<?php
$MDBVARS=array(
        'username'=>'username',
        'password'=>'password',
        'dbname'=>'filesdb',
        'dbhost'=>'mdb1.yourmongodbserver.com',
        'apikey'=>'805f73958de1653e073e6a8c674bb1e8'
);
$conn=new Mongo($MDBVARS['dbhost']);
$db=$conn->{$MDBVARS['dbname']};
$db->authenticate($MDBVARS['username'], $MDBVARS['password']);
$fdata=$db->deletes->find();
while ($d=$fdata->getNext()) {
        $url='http://'.$d['server'].'/php/cacheClear.php';
        $ch=curl_init($url);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $fname=$d['filename'];
        $now=''.microtime(true);
        $md5=md5($fname.$now.$MDBVARS['apikey']);
        curl_setopt($ch, CURLOPT_POSTFIELDS, array(
                'filename'=>$fname,
                'time'=>$now,
                'md5'=>$md5
        ));
        $ret=curl_exec($ch);
        if ($ret=='ok') {
                $db->deletes->remove(array('_id'=>$d['_id']));
        }
        curl_close($ch);
}
?>

The $MDBVARS array is almost the same as those on the application servers. We add a new item, though, apikey, which helps us provide some authentication without needing usernames and passwords. By running the filename, the time, and the apikey through an MD5 function, we create a value that can only reasonably be reproduced by another MD5 function that knows the same details. So, we send the filename, time and MD5 result through to the target server, and if the target server can reproduce the MD5 result by MD5ing the filename, time, and its own copy of the apikey, then that’s enough proof that the call is valid.

Make sure to add the apikey entry to all your servers’ $MDBVARS arrays.
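
A variation worth considering is PHP's built-in hash_hmac() with SHA-256, compared using hash_equals() (available from PHP 5.6), plus a freshness check on the timestamp so that a captured request cannot be replayed indefinitely. The sketch below is not the md5() scheme used in this article. The function names and the 300-second window are my assumptions.

```php
<?php
// Hypothetical helpers; a swapped-in alternative to the md5() signature.
function cacheClearSign($fname, $time, $apikey) {
  // "\n" separates the fields so that 'ab'+'c' and 'a'+'bc' sign differently
  return hash_hmac('sha256', $fname."\n".$time, $apikey);
}
function cacheClearVerify($fname, $time, $sig, $apikey, $now, $maxAge=300) {
  if (!is_numeric($time) || abs($now-(float)$time)>$maxAge) {
    return false; // too old (or too far in the future) to accept
  }
  // hash_equals() compares in constant time
  return hash_equals(cacheClearSign($fname, $time, $apikey), (string)$sig);
}

$apikey='805f73958de1653e073e6a8c674bb1e8'; // the example key from this article
$time=(string)time();
$sig=cacheClearSign('1/image.jpg', $time, $apikey);
var_dump(cacheClearVerify('1/image.jpg', $time, $sig, $apikey, time())); // bool(true)
var_dump(cacheClearVerify('2/other.jpg', $time, $sig, $apikey, time())); // bool(false)
```

The sender would call cacheClearSign() where checkCaches.php currently calls md5(), and the receiver would call cacheClearVerify() in cacheClear.php.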

On the target server, then, we create the /php/cacheClear.php file:

<?php
require_once 'basics.php';
$fname=$_REQUEST['filename'];
$md5=md5($fname.$_REQUEST['time'].$MDBVARS['apikey']);
if ($md5!=$_REQUEST['md5']) {
  echo 'incorrect API key';
  exit;
}
if (strpos($fname, '..')!==false) { // check for hacks
  exit;
}
@unlink($MDBVARS['cache'].$fname);
echo 'ok';
?>

As usual, there is a potential flaw to consider. The checkCaches.php file on the MongoDB server goes through every delete entry in the database, but what if this takes more than a minute to finish?

If it takes more than a minute to finish, and the script is being called once a minute, then the runs will pile up: the server will have multiple copies of the script working through overlapping lists of files, duplicating requests and eventually overloading the machine.

The solution to this is simply to add a timeout to the script, so it runs for 55 seconds (say) and then stops.

In checkCaches.php on the MongoDB server, change the following:

while ($d=$fdata->getNext()) {

to this:

$now2=time();
while (($d=$fdata->getNext()) && time()-55<$now2) { // parentheses needed: && binds tighter than =

Now it will simply stop after second 55, and continue when it is called again.
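
The same time-boxing idea can be written as a small reusable loop. In this sketch the clock is passed in as a callback so the behaviour can be checked without waiting 55 seconds. runWithBudget() and its parameters are hypothetical names, not part of checkCaches.php.

```php
<?php
// Hypothetical sketch of a time-boxed queue run.
// $clock returns the current time in seconds; in production it would be 'time'.
function runWithBudget(array $queue, $handler, $budget, $clock) {
  $start=call_user_func($clock);
  $done=0;
  foreach ($queue as $item) {
    if (call_user_func($clock)-$start>=$budget) {
      break; // out of time; the next cron run picks up the rest
    }
    call_user_func($handler, $item);
    $done++;
  }
  return $done;
}

// A fake clock that advances 20 "seconds" every time it is read
$t=0;
$clock=function() use (&$t) { $t+=20; return $t; };
$handled=array();
$handler=function($item) use (&$handled) { $handled[]=$item; };
$done=runWithBudget(array('a', 'b', 'c', 'd'), $handler, 55, $clock);
// $done is 2: 'a' and 'b' are handled, then the 55-second budget runs out
```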

Edit cron for the mongo user (su mongo -c "crontab -e"), add this line, then save the file:

* * * * * php /home/mongo/checkCaches.php >/dev/null 2>/dev/null

That’s it! You now have a working distributed filesystem.

Keeping a PHP session alive

I get asked this a lot. When you log into a session-based system, and walk away for half an hour, frequently you’ll come back to find it is no longer logged in. How do you keep the session alive when you are not active on the site?

The solution is to regularly “poll” the server with JavaScript and have the server change the session so its expiry is reset.

Each page of the site should have the following JavaScript code. Place it in your shared library if you have one.

window.setInterval(function() {
  var el=document.createElement('img');
  el.src='/sessionRenew.php?rand='+Math.random();
  el.style.opacity=.01;
  el.style.width='1px';
  el.style.height='1px';
  el.onload=el.onerror=function() { // the response is not a real image, so onerror is what fires
    document.body.removeChild(el);
  };
  document.body.appendChild(el);
}, 60000);

This creates an image that is 1px*1px and mostly invisible, rather than hidden outright, so that the browser retrieves it from the server immediately. As soon as the response comes back from the server, the image is deleted. No particular JavaScript library is needed.

Now, on the server, create the following file as sessionRenew.php:

<?php
session_start();
$_SESSION['rand']=rand();
echo $_SESSION['rand'];
?>

This opens (or creates) the active session, records a change to it (the ‘rand’ value), and echoes that out. The reason you force a change is that some session systems, such as memcache, don’t update the session expiry for simple reads, so you need to make an actual change.

List of scientists that embrace creationism

I’ve been arguing on Facebook with some people that seem convinced that by showing a list of scientists that are also creationists, they somehow “prove” that there is a controversy over evolution.

The list can be seen here, and contains 214 entries at the moment.

Of that list, only 35 of them are biologists. The rest really don’t matter. Who cares if a food scientist doesn’t believe in evolution?

So, the list of remaining biologists:

Name | Field | Notes
Dr Kimberly Berrine | Microbiology & Immunology | Does not appear to exist.
Prof. Vladimir Betina | Microbiology, Biochemistry & Biology | Does not appear to have ever mentioned his opinion either way.
Dr Raymond G. Bohlin | Biology | Biased: leads a religious ministry.
Dr Andrew Bosanquet | Biology, Microbiology | No evidence he supports creationism. A number of published articles on evolution of the p53 protein [1] [2]
Dr Robert W. Carter | Marine Biology | Biased: works for Creation Ministries.
Prof. Chung-Il Cho | Biology Education | No evidence he is a creationist. Has written articles providing evidence for evolution. [1] [2]
Dr Ken Cumming | Biology | Biased: dean of the Institute for Creation Research.
Dr David A. DeWitt | Biology, Biochemistry, Neuroscience | Biased: Associate Director of the Center for Creation Studies at Liberty University.
Prof. Carl B. Fliermans | Professor of Biology | Biased: member of the technical board for the Institute for Creation Research, teaches biblical studies.
Prof. Robert H. Franks | Associate Professor of Biology | Does not appear to exist. The only references I can find to him are in this list, or here (which makes him about 80 and probably dead).
Dr Pierre Jerlström | Molecular Biology | Biased: staff scientist at Creation Ministries International, and editorial co-ordinator of Journal of Creation [1].
Dr Arthur Jones | Biology | Biased: works for the Christian Schools’ Trust. Taught religion at two UK schools.
Dr Dean Kenyon | Biology | Biased: fellow of the Discovery Institute [1]. No longer a professor. Witness for the losing side of two important evolution/creationism court cases. [2].
Prof. Gi-Tai Kim | Biology | Does not appear to have published an opinion for or against evolution [1].
Dr John W. Klotz | Biology | Has been dead for nearly twenty years.
Dr Leonid Korochkin | M.D., Genetics, Molecular Biology, Neurobiology | Is NOT a creationist. One of his articles [1] says he is an “adherent of the macromutation evolution”.
Dr Wolfgang Kuhn | Biology researcher and lecturer | This guy has been dead more than ten years.
Dr Heather Kuruvilla | Plant Physiology, Senior Professor of Biology, Cedarville University | Does not appear to have ever given an opinion on evolution or creationism.
Dr John G. Leslie | Biochemistry, molecular biology, medicine, biblical archaeology | Has no biology education [1]
Prof. Lane P. Lester | Biology, Genetics | Biased: on the board of directors of the Creation Research Society.
Dr Ian Macreadie | Molecular Biology and Microbiology | Appears to be professionally unbiased but blinkered. Reading through one of his articles [1], I can see signs of him simply discarding evolutionary ideas, instead of exploring them.
Dr John Marcus | Molecular Biology | Does not appear to exist [1]
Professor Douglas Oliver | Professor of Biology | Biased: associate director of Center for Creation Studies.
Prof. Chris D. Osborne | Assistant Professor of Biology | Biased: works for Logos Research Associates. Hasn’t published in over 20 years.
Dr Gary E. Parker | Biology, Cognate in Geology (Paleontology) | Biased: founder of Creation Adventures Museum.
Dr Terry Phipps | Professor of Biology, Cedarville University | Biased: Cedarville University is a biblical school, not an unbiased school.
Dr Jung-Goo Roe | Biology | Does not appear to exist outside this list [1]
Dr Ariel A. Roth | Biology | Has stated that “creation science” is not a science [1].
Dr Alicia (Lisa) Schaffner | Associate Professor of Biology, Cedarville University | Biased: teaches at a Christian school [1].
Dr Timothy G. Standish | Biology | Biased: works for the Geoscience Research Institute.
Dr Dennis Sullivan | Biology, surgery, chemistry, Professor of Biology, Cedarville University | Biased: Cedarville University is a biblical school.
Dr Larry Thaete | Molecular and Cellular Biology and Pathobiology | He’s a gynaecologist and obstetrician. Doesn’t study anything to do with evolution [1].
Dr Joachim Vetter | Biology | German. Dead. [1]
Dr Sung-Hee Yoon | Biology | Does not appear to exist [1]
Dr Henry Zuill | Biology | Retired professor. Does not appear to have ever written a peer-reviewed article.

Of the above 35,

  • 16 are biased – they get paid to promote a creationist viewpoint.
  • 4 don’t appear to exist. I could find no reference to these people outside this list.
  • 3 are dead.
  • the rest appear to have no opinion either way on evolution, or don’t have any professional link to evolution.

In short: this list is rubbish.

Gardenbot 2014

Every year, I start a new Gardenbot project, and it rarely gets any further than a wish list. This year is different. I have a fully-functional robot that is battery-powered and can be controlled remotely via WiFi.

[Image: IMAG0838]

Getting this far has not been easy, so I’ll write up what I can remember so you can do the same (or so I can do the same again next year after I forget!)

The biggest problem was the computer itself. I’m using a Raspberry Pi, but powering it was tricky.

The Pi takes a 5V input, but I couldn’t find any ready-made 5V batteries, and didn’t want to use battery packs as I wanted to easily recharge individual batteries.

In the past, my experience with using batteries in series with each other was that one battery would discharge fully way before any others, leaving an apparently dead pack. To solve that, I’m using Li-ion batteries scavenged from phones, each with a capacity of at least 2800mAh. I link them in parallel, and “boost” the voltage using some regulators.

There are currently two voltage boosters in the system. The first one powers the Pi, and the second powers the USB hub. You can’t power the USB hub directly off the Pi as the Pi uses 700mA, and there’s not enough left over to power anything useful. So, for anything external, such as the WiFi and the camera on the robot, you need to use a powered hub.

To save space, I stuck the voltage boosters for the USB hub and the Pi inside the Pi case, as you can see in this photo. They’re the rectangular circuits with the large capacitors on them.

[Image: IMAG0839]

The capacitors are there to help stop fluctuations in power supply as various bits and pieces are turned on. There’s nothing quite as annoying as turning on a motor only to find that you have lost WiFi because of it and now have no way to turn off the motor.

The robot chassis is a hand-built case made from two perspex sides, a wooden base, and a wooden front. I didn’t measure anything – it was all done by trial/error.

The tracks are from Tamiya (example store). The box comes with enough for a larger base, but I didn’t need it all.

The claw at the front doesn’t work perfectly yet. The one I currently have is one I bought a few years back. It never seems to work properly for me. I think I need either to fit a stronger servo, or to replace the claw completely.

The servo cable has three wires – ground, power in, and signal. The ground and power in can be plugged directly into the batteries. The signal, I hooked to GPIO 1 on the Pi (using this wiring guide for the main GPIO connector), which is then controlled using pulse width modulation (PWM) through the pin.

The motors for the treads are scavenged from the legs of a Robosapien bot I got for Christmas a few years back. These are standard DC motors, probably for up to 5V, but I’m running them off 3V and happy with them.

To control the motors, I was initially planning to create my own motor controller using some PNP and NPN transistors, but found a motor controller circuit from an old Cybot that handily does exactly what I need.

[Image: IMAG0840]

The camera is a standard web-cam, with the cables shortened.

Turning the machine on is done by simply connecting the little red cable between the battery-side and the “other stuff” side of the breadboard as you can see in the image above.

To charge the battery, I simply hook in a Li-ion charger directly into the left of the board (below). The charging circuit will happily charge multiple Li-ion batteries.

[Image: IMAG0841]

Hardware-wise, I’m almost happy. I want to replace the claw soon, but apart from that, I’m ready to work on software.

I already have code written for controlling the motors, which I’ll upload to GitHub over the next few days. I’m looking into SLAM now for creating maps via the camera system. I might have to write the solution myself, though, as the code I’ve found so far is written in academicese and I don’t understand it.

Funny, that, as I’m certain I can write the bloody code, but can’t understand the words that the academics use!

Salting Passwords

The simplest way to store a password in a database is as a plain string

insert into users set email="kae@kvsites.ie", password="password";
["kae@kvsites.ie", "password"]

But, if someone hacks into the server, or you have a malicious admin, then those passwords can be stolen. This is a big security risk as passwords tend to be re-used by people for other purposes, such as PayPal, etc.

So, the next stage is to hash the password using a function such as MD5:

insert into users set email="kae@kvsites.ie", password=md5("password");
["kae@kvsites.ie", "5f4dcc3b5aa765d61d8327deb882cf99"]

That /looks/ secure, but there are huge databases on the Internet of precomputed MD5 hashes for common words and passwords, so it is trivial to reverse these.
https://www.google.ie/search?q=5f4dcc3b5aa765d61d8327deb882cf99

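The next stage is to “salt” the password by adding a prefix to it before hashing. For example, let’s use “123ghjzxc” as the salt key. (Ideally, the salt is random and different for each user, so that two users with the same password do not end up with the same hash; a single fixed salt keeps the example short.)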

insert into users set email="kae@kvsites.ie", password=md5(concat("123ghjzxc", "password"));
["kae@kvsites.ie", "9f400bac0b5a9b3d66c9c98aae09fab5"]

This is much more secure now. A search for the MD5 hash will not return any results at all (well, this page… but you know what I mean).
https://www.google.ie/search?q=9f400bac0b5a9b3d66c9c98aae09fab5

Another method is to hash the password before prefixing it with the salt, then hashing again. This may be a little more secure again, though the gain is small, as MD5 is very fast to compute.

insert into users set email="kae@kvsites.ie", password=md5(concat("123ghjzxc", md5("password")));
["kae@kvsites.ie", "d1dddda63a6dde54fb1740dffe3faa27"]

As an extra step, do all the hashing in your application code rather than in the SQL query, so the plain-text password is never sent over the wire to the database.
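
For new code, PHP 5.5 and later ship password_hash() and password_verify(), which generate a random salt for every password and use a deliberately slow algorithm in place of MD5. This is a different technique from the MD5 examples in this article, shown here as a minimal sketch. Storing the result in the users table is left out.

```php
<?php
// password_hash() picks a random salt and embeds it in the returned string,
// so no separate salt column is needed.
$hash=password_hash('password', PASSWORD_DEFAULT);

// At login, compare the submitted password against the stored hash.
var_dump(password_verify('password', $hash)); // bool(true)
var_dump(password_verify('letmein', $hash));  // bool(false)

// Hashing the same password twice gives different strings,
// because the salt differs each time.
$again=password_hash('password', PASSWORD_DEFAULT);
var_dump($hash===$again); // bool(false)
```

Because the salt and algorithm are stored inside the hash string itself, the hashing has to happen in PHP and not in the SQL statement.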

CSRF

CSRF (cross-site request forgery) is an attack in which a user who is logged into one system is tricked into performing an action on that system while browsing another site.

Example

Let’s say you are logged into http://yoursite.example.com/ as an admin, and you can easily delete an object by clicking a link that sends a request to http://yoursite.example.com/a/delete.php?object=1.

You take a break and go read some websites.

Now, let’s say that one of those websites has had a little piece of code like this added to it:

<img src="http://yoursite.example.com/a/delete.php?object=1" style="display:none"/>

Other readers of the same site will not notice anything – they’re not logged into your site, and so have no delete rights. But, you are!

This vulnerability is called “CSRF” because the hack happens on a different website than your own, taking advantage of the fact that you are logged in, to delete stuff (or move money, etc).

Solution

On the server, you should create a CSRF token, send it to the client, and make sure that all actions that are requested include that token.

To set the token, just create a random string and save it to your session:

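<?php
if (!isset($_SESSION['csrf'])) {
  // random_bytes() (PHP 7+) is cryptographically secure, unlike mt_rand();
  // md5() just turns the bytes into a 32-character hex string
  $_SESSION['csrf']=md5(random_bytes(32));
}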

Then, whenever an action is performed, make sure that the request includes that token before the action is performed.

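<?php
// the token may arrive as a request variable or as an X-CSRF header
$sent=isset($_REQUEST['_csrf'])
  ?$_REQUEST['_csrf']
  :(isset($_SERVER['HTTP_X_CSRF'])?$_SERVER['HTTP_X_CSRF']:'');
// reject if either token is missing; hash_equals() compares in constant time
if (!is_string($sent) || empty($_SESSION['csrf'])
  || !hash_equals($_SESSION['csrf'], $sent)
) {
  header('Content-type: application/json');
  echo json_encode(array(
    'error'=>'CSRF violation'
  ));
  exit;
}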

Note that my code above allows two ways to send the CSRF token: as a request variable (GET or POST), or as an X-CSRF header.

For HTML forms, make sure that each form includes the CSRF token:

<input type="hidden" name="_csrf" value="<?=htmlspecialchars($_SESSION['csrf']);?>"/>

And finally, for AJAX, make sure that the token is included by default. Personally I use jQuery, so this does that:

  $.ajaxSetup({
    'beforeSend': function(xhr) {
      xhr.setRequestHeader('X-CSRF', window.csrf);
    }
  });

(make sure that window.csrf is set as inline javascript in the page)
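One possible way to set window.csrf inline is sketched below. The helper name csrf_script_tag is made up for illustration; json_encode() is used so that the token is emitted as a properly quoted JavaScript string.

```php
<?php
// Hypothetical helper: builds the inline <script> that exposes the token.
function csrf_script_tag($token) {
  // json_encode() quotes and escapes the value so it is a valid JS literal
  return '<script>window.csrf='.json_encode($token).';</script>';
}

// In a page template, the token would come from the session.
$token=isset($_SESSION['csrf'])?$_SESSION['csrf']:'';
echo csrf_script_tag($token);
```

This needs to be output before any AJAX requests are made, so that the beforeSend handler has a value to send.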

Conclusion

Now what happens is that each time a request is made to the server, the CSRF token that’s sent is checked against the session’s CSRF token, and if they don’t match (or no token is sent), then the action is ignored.

It is not possible for any website to guess your CSRF token (we set it to a random MD5), so you are safe.

idea for music recognition, conversion and composition using artificial neural networks

I had this idea while walking the kids to school. Starting from a simple network that can classify music styles as rock/metal/classical/folk/etc, I think that it would be possible to adapt the same algorithm to convert a music file from one style to another, and even write music from scratch in whatever style you want. And if I’m right, I think it would be very simple to write.

Recognition

This is the simplest task. To recognise the style of a music file, all you need is a feed-forward network with a few thousand inputs, at least one hidden layer, and one output for each style you want to recognise.

A standard data rate for recorded music is 160kbps. That means that every second, there are 10,240 separate wave heights (160*1024/16, assuming 16 bits per wave height) that need to be examined. Of course, you can recognise music using lower bps values, but let’s use the same setting for the whole process (160 will be wanted for later parts).

So, the input layer would need 10,240*n inputs, where n is the number of seconds you want the network to sample in order to determine the style. In some cases (metal/classical), you may get away with sampling just a single second, but for better results, you might want a larger value. I’ll be setting n to 300, so it samples the entire song in most cases. This makes it easier to be accurate about the result, but will also be useful in a later stage.

The output layer needs to have one node per tag you want to measure. For example, you might have an output that measures how “rock” a song is, and another that measures how “baroque” it is. You could use output nodes that return a simple Yes/No result, but there is a good reason to return a more linear certainty instead (which we’ll get to).

The hidden layer needs at least one neuron, obviously, but I don’t think there is any way to say exactly how many it needs, so it would be better to use a network model which grows automatically as it learns (I don’t know the technical term – I just build the things!).

After building the network, you need to train it. This is the easiest part – you just need a large database of music, and tags for every one of those tunes.

One handy idea: if you’re training a 5 second network (for example), then a 3 minute song has at least 36 completely separate training sets for you to sample (180/5), and many more overlapping ones – all you need to do is start linking to the inputs at second 0, .5, 1, 2, etc, and the network will see what it thinks (initially) is a completely different data set.
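The window arithmetic can be sketched like this. The function name window_starts and its parameters are made up for illustration; it only lists the start offsets (in seconds) at which a training window would begin, and does not touch any audio data.

```php
<?php
// Hypothetical helper: start offsets (in seconds) of every window of
// $windowSeconds that fits inside a song of $songSeconds.
function window_starts($songSeconds, $windowSeconds, $stepSeconds) {
  $starts=array();
  for ($s=0; $s+$windowSeconds<=$songSeconds; $s+=$stepSeconds) {
    $starts[]=$s;
  }
  return $starts;
}

echo count(window_starts(180, 5, 5)), "\n"; // 36 non-overlapping windows
echo count(window_starts(180, 5, 1)), "\n"; // 176 windows, one per second
```

A smaller step gives more (overlapping) training sets from the same song.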

After training this for a while, you should be able to run a few seconds of a song through the network and have fairly accurate results of how “funk” or “jazz” a song is.

Conversion

After figuring out the above, I started thinking of alternative uses for the idea, and one surprising idea took hold.

Let’s say that you have a “folk” song played on guitar and violin. How would you go about making it “metal”? You could start by fuzzing the violin and distorting the guitar, and maybe adding some drums in.

I think it would be possible to write a program which lets you convert a song from one style to another literally at the click of a button.

Remember I mentioned that the output neurons should say how metal/classical/etc a song is, not just that it is or is not.

If the network is written with enough precision, then adjusting one or more of the input values should give a different value in the outputs.

As an example, let’s say you have a folk tune that you want to convert to neo-punk. Adjusting the inputs such that the sounds are more distorted (clipping high values, for example), or faster (shifting later inputs to the left, maybe) might change the tune’s “neo-punk” output from 0.00024 to 0.00025.

If you repeat this over and over (automatically, obviously), discarding changes that reduce the output and repeating changes that increase the output, until the “neo-punk” output reaches an acceptable threshold such as .9, then you have just created an automatic way to convert a tune from one style to another.
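The loop described above is a simple hill climb, and a toy version might look like the following. Everything here is a stand-in: toy_score is a made-up function playing the role of the trained network’s “neo-punk” output, and the “tune” is just eight numbers between 0 and 1. Only the keep-or-discard logic is the point.

```php
<?php
// Made-up stand-in for the network's output: the mean of the samples.
function toy_score(array $samples) {
  return array_sum($samples)/count($samples);
}

// Repeatedly nudge one input; keep the change only if the score improves.
function hill_climb(array $samples, callable $score, $threshold, $maxSteps) {
  $best=$score($samples);
  for ($i=0; $i<$maxSteps && $best<$threshold; ++$i) {
    $candidate=$samples;
    $k=mt_rand(0, count($samples)-1);
    // change one sample by up to +/-0.1, staying inside [0, 1]
    $delta=mt_rand(-100, 100)/1000;
    $candidate[$k]=max(0, min(1, $candidate[$k]+$delta));
    $s=$score($candidate);
    if ($s>$best) { // discard changes that do not increase the output
      $samples=$candidate;
      $best=$s;
    }
  }
  return array($samples, $best);
}

mt_srand(1); // fixed seed so that runs are repeatable
$start=array_fill(0, 8, 0.1);
list($tune, $final)=hill_climb($start, 'toy_score', 0.9, 100000);
printf("score went from %.3f to %.3f\n", toy_score($start), $final);
```

With a real network, each call to the score function would be a full forward pass, so the number of steps needed would matter a lot more than it does here.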

I think this has a lot of applications. For example, let’s say you want to convert a piano tune to guitar? You train your network to recognise what piano and guitar tunes sound like, and then simply convert as above!

Composition

This may be the simplest of the lot.

After creating the above programs, try inputting a sound sample of pure static into the conversion program, and tell it to convert the static to piano. I think it would come up with some interesting tunes. Maybe not completely accurate tunes, but they would be interesting.

I think the network would automatically learn rules about harmony and rhythm, but don’t think it would learn about structure. For example, you could train a network to recognise a 3/4 rhythm, but I don’t know if you could write something that recognises a fugue.

Simple geo-ip based links

simple geoip based links, for when you need to link to different files depending on the client’s country. requires PHP, jQuery.

in the head of the document, have this:

<script>window.geoip_data=<?php echo file_get_contents('http://freegeoip.net/json/'.$_SERVER['REMOTE_ADDR']); ?>;</script>
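The lookup can fail (network down, service unavailable), in which case file_get_contents() returns false and the page would have nothing valid to hand to the JavaScript. A small guard, sketched here with a made-up helper name (geoip_js_literal), falls back to an empty country code so that the default links are used:

```php
<?php
// Hypothetical helper: returns JSON that is always safe to hand to the
// page's JavaScript, even when the lookup failed or returned garbage.
function geoip_js_literal($raw) {
  $data=is_string($raw)?json_decode($raw, true):null;
  if (!is_array($data) || !isset($data['country_code'])) {
    $data=array('country_code'=>''); // no country: keep the default links
  }
  return json_encode($data);
}

echo geoip_js_literal('{"country_code":"IE"}'), "\n"; // {"country_code":"IE"}
echo geoip_js_literal(false), "\n";                   // {"country_code":""}
```

The result of the file_get_contents() call would be passed through this function before being echoed into the script tag.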

for the HTML links, write the default link target into the HTML, with alternatives written in for each country. here’s an example for the UK (whose ISO country code is GB) and Ireland:

<a href="/link.html" class="geo" data-link-GB="/link-uk.html" data-link-IE="ie.html">click here</a>

now in the JavaScript, process all .geo links:

$(function() {
  // turn e.g. "IE" into "Ie", so that 'link'+country matches jQuery's
  // camel-cased name for the data-link-ie attribute
  var code=geoip_data.country_code||'';
  var country=code.substring(0,1)+code.substring(1,2).toLowerCase();
  $('a.geo').each(function() {
    var $this=$(this), dataName='link'+country;
    if ($this.data(dataName)) {
      $this.attr('href', $this.data(dataName));
    }
  });
});

Done!

unwatermarking images

I’ve started a website where I intend to sell thousands of products from a number of distributors through drop-shipping (the products go directly to the customer).

For reasons that I don’t understand, the distributors have watermarked their images, and don’t provide unwatermarked versions unless you’re an already well-established customer of theirs.

For the purpose of this demo, a watermark is a constant-colour “stamp” which is given opacity and pasted into the original image.

As I intend to be a good customer, I figured it would be okay for me to simply “unwatermark” the images.

There are a number of instructions online which show how to /fake/ an unwatermarking – by basically smudging the area where the watermark is.

However, as most watermarking appears to follow a single method, it is actually possible to simply reverse the process and remove the watermark, after a little trial and error.

Let’s consider an example. Here is an image, a stamp, and the merge of the two:

(original is here)

  • demo1
  • demo2
  • demo3

To reverse this, you need to know what algorithm was used to create the watermark, and what the original watermark was.

Most people use a fairly simple method to watermark their images:

The stamp is one single colour, usually gray (#808080 in RGB) which will be visible on images which are both light and dark.

The stamp is then given an opacity (30% in my case above), and pasted directly over the original image.

The formula for any particular colour channel (R, G and B) on any pixel is: C3=(1-p)C1+pC2, where p is opacity (0 to 1), C1 is the colour value for the original image, C2 is the stamp’s colour value, and C3 is the resulting image’s colour value.

To reverse the watermarking, you need to convert the formula to see what it is in respect to C1: C1=(C3-pC2)/(1-p).

As most stamps will be using a middle gray (#808080), you just have to guess at the opacity. .3 is a good start.
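To check the two formulas against each other, here is a small round trip on a single colour channel. The function names blend and unblend and the pixel value 200 are made up for illustration; the stamp value and opacity are the ones from the text.

```php
<?php
// C3=(1-p)C1+pC2: blend the original channel ($c1) with the stamp ($c2)
function blend($c1, $c2, $p) {
  return (1-$p)*$c1+$p*$c2;
}
// C1=(C3-pC2)/(1-p): recover the original from the watermarked value ($c3)
function unblend($c3, $c2, $p) {
  return ($c3-$p*$c2)/(1-$p);
}

$p=.3;         // opacity
$stamp=0x80;   // one channel of #808080
$original=200; // made-up pixel value

$marked=blend($original, $stamp, $p);
$restored=unblend($marked, $stamp, $p);
printf("%.1f %.1f\n", $marked, $restored);         // 178.4 200.0
printf("%.1f\n", unblend($marked, $stamp, .4));    // 212.0
```

The last line shows why the opacity guess matters: guessing .4 when the real value was .3 overshoots the original value for a pixel that is brighter than the stamp.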

For some reason I’m not yet sure of, the code I came up with did unwatermark the image, but too much… the areas where the watermark had been ended up too bright. So I needed to add a darkening step, reducing the brightness of the result of the above calculation.

I’m not going to hold your hand if you can’t make this work, but here’s the code I ended up with (assumes the images are exactly 400×400 in size). The watermarked image should be ‘original.jpg’, and the stamp should be ‘stamp.png’ (with white where transparent pixels should be).

<?php
$p=.3; // opacity

$f1=imagecreatefrompng('stamp.png');
imagepalettetotruecolor($f1);
$f2=imagecreatetruecolor(400, 400);
$f3=imagecreatefromjpeg('original.jpg');
imagepalettetotruecolor($f3);

for ($x=0;$x<400; ++$x) {
  for ($y=0; $y<400; ++$y) {
    $rgb1=imagecolorat($f1, $x, $y);
    $rgb3=imagecolorat($f3, $x, $y);
    $r3 = ($rgb3 >> 16) & 0xFF;
    $g3 = ($rgb3 >> 8) & 0xFF;
    $b3 = $rgb3 & 0xFF;
    if ($rgb1==16777215) { // white. just copy
      $c=imagecolorallocate($f2, $r3, $g3, $b3);
      imagesetpixel($f2, $x, $y, $c);
      continue;
    }
    $r1 = ($rgb1 >> 16) & 0xFF;
    $g1 = ($rgb1 >> 8) & 0xFF;
    $b1 = $rgb1 & 0xFF;
    $r2=c($r1, $r3, $p);
    $g2=c($g1, $g3, $p);
    $b2=c($b1, $b3, $p);
    $c=imagecolorallocate($f2, $r2, $g2, $b2);
    imagesetpixel($f2, $x, $y, $c);
  }
}
imagejpeg($f2, 'unwatermarked.jpg');

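// reverse the watermark on one channel, then apply the darkening step
function c($c1, $c2, $p) {
  $c=c1($c1, $c2, $p);
  $c3=$c-(255-$c)*.2;
  return $c3<0?0:(int)$c3;
}
// C1=(C3-pC2)/(1-p), where $c2 is the stamp's value and $c3 is the
// watermarked image's value; the result is clamped to 0..255
function c1($c2, $c3, $p) {
  $c=($c3-$c2*$p)/(1-$p);
  return (int)max(0, min(255, $c));
}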

Quantum Immortality – Organ Transplants

Organ Transplants

The human body evolved to reproduce, and then last just a while longer to help to raise the young after that. The fact that we live so long these days raises problems, because our bodies are not “designed” to do so!

There is no good evolutionary reason for a body to last very long after the reproductive age has been reached, yet we try our best to stave off death a while, by replacing the aging and dying parts of our bodies with younger, healthier parts.

A short history

The first successful transplant was in 1905, and was the cornea of an eye.

Nothing much happened after that for almost fifty years, and then after a stuttering start, medical research started producing wonder after wonder.

If you count the number of new transplant types that have been completed in each decade, the curve is unmistakable – we are fast on our way to being able to transplant virtually anything at all from one person to another. [1]

Donor shortfall

While organ transplantation is becoming easier over time, there is a problem with supply.

In order for you to receive a new kidney (for example), someone else must donate one of theirs. This involves finding someone with a similar body chemistry to yours, so the organ isn’t rejected, and also hoping that the person is willing to donate the kidney.

If no live donor is available, then you need to hope that someone dies to provide you with their kidney. This is a tragic thing to hope for.

A kidney is a best-case scenario, as humans have two each, so the donor can survive without it.

But if you need a heart, then the donor will most likely be dead before you get it. And you’d better hope that the donor didn’t die of heart disease!

The problem that there are simply not enough donors for each needed organ is a huge one. [2]

There is also a problem that organs can only survive so long outside the body, so once a donor has provided their organs, the organs must be transplanted nearly immediately, or they will die.

The current way to transport organs to the transplanting hospital is by freezing them so that decay is minimal. But even this can cause cell damage as ice crystals form and break apart the cells.

Luckily, these problems are also being solved.

Only this week, as I write, there is news of a liver-preserving machine which you can hook a liver up to. This device will then keep the liver alive, by emulating a living body. In essence, the liver does not know that it is no longer in a body, and continues functioning. [3]

Now that this has been done for livers, it can be expected that similar news will be announced in the next few years for almost every other organ.

Artificial Organs

The shortfall problem, that there are simply not enough donors per required organ, can be fixed with artificial organs.

Organs are generally very difficult to replace, as they do quite a number of different things. But some of the simpler organs have already been successfully replicated.

An obvious example is the heart. The first successful artificial hearts (not a pacemaker, but an actual pump) were created in 1982. While their recipients lasted only 112 days and two years respectively after surgery, that’s still time that the patients would not have had without the hearts.

This artificial heart design was primitive by present-day standards, but encouraged further research.

Artificial hearts are usually used as “bridges”, to keep a patient alive while waiting for a donor to supply a “real” heart. But sometimes, the artificial heart’s help gives the patients’ own heart enough rest to heal itself, and a transplant is no longer needed. [4]

Almost every organ can be replaced, given enough time and research.

Ears can be replaced with cochlear implants.

Eyes can be replaced, but artificial eye resolution is still very low. There are many different threads of research ongoing in this area. [5]

Some of the more “bag-like” organs can be very successfully replaced right now with artificial versions.

Bladders, trachea, arteries; these can all be created from stem-cells and/or plastics.

Legs and arms deserve a full chapter. There is some amazing work being done in these areas.

The most difficult organs (in the body itself) to replace are the pancreas, liver, lungs, and kidneys. These perform specialised functions, and currently, artificial versions are not small enough to implant.

I expect to hear within a year or two of the first completely artificial kidney implant. There already is an implantable artificial kidney available, but it’s a lot larger than a natural kidney.

In the future, I expect that the only organ that you will not be able to replace will be the brain itself. Not because it can’t be moved, of course, but because the brain is your identity – there is no point replacing your brain with someone else’s.

However, having said that, if your entire body was failing, you could transplant your brain into a younger body.

There are a number of reasons you should not hope for this to happen, though.

For example, in order for you to have a brain transplant, there must be a younger body available for you to transplant into. But if a younger body is available and it is healthy, then it makes more ethical sense to offer its organs to save multiple people, instead of just you.

For you to get a whole new body all to yourself, you would need to provide it yourself, and I can’t think of any legal or even close to ethical way that you could do this!

If it turns out you need a whole new body, you’re probably better off looking into brain uploads instead, which will be discussed later in the book. Currently, brain uploading is not possible, but the technology should be ready soon; probably sooner than the first successful brain transplant.

Conclusion

In this chapter, you learned a short history of organ transplants.

There is currently a shortfall of available donor organs.

Artificial replacements are available for some organs, and others are on the way.

The only organ that will never be replaced fully is the brain, but we’ll talk more about that in a later chapter.