Category Archives: Uncategorized

Rebuilding MP3Unsigned #2 – Basic Site

Yesterday, I finished importing the data from the old database to the new. This is far from finished, because the export didn’t include some vital information as the field types, keys, or that ephemeral quality, what it all means.

The way I’m building the new version, it is in four separate parts:

  1. plain HTML “archive” version of site, perfect for search engines
  2. JavaScript overlay for progressive enhancement of site
  3. CSS themes, which you can switch between
  4. API, for the JavaScript and any external tools to talk to

Today I’m working on the first part, the plain HTML archive. Technically, I should really do the fourth part first so the HTML archive can build itself up using data provided from the API, but I want to get something visual today, so will fix this up yesterday (technical debt!).

On the front page, I want to show a list of all bands. Unfortunately, the old database is …not great… for this basic function. After reading the tables, I found that “bands” are a combination of the table “tblbands” and “subscribers”, such that the band is named after the subscriber’s username.

This is not great because it means subscribers can only be in one band at a time, and they can either use their own name, or the name of the band, but not both. I’ll fix that after phase one of the project (rebuilding the existing site) is done.

I wrote a small “router” that can rip apart the URL and use that to determine what page to display. In the new site, we will use a language code to determine what language to use for user interface strings. This code will always be the first item in the URL.

Long story short, here’s the absolute basics of the new site, in English, Irish, and Japanese.

This stage of the rebuild is not designed to be pleasant to look at. It’s designed to be very quick to load, so search engines can find the content easily.

The next stage is to create the API so that database access is done through only one part of the system, and all requests (from a client, from the base I just built, from third-party tools) go through that API system instead of directly to the database.

Getting to the current stage (visible band names, visible MP3 lists, playable MP3s) was important for morale reasons. Now we can really get building.

Rebuilding MP3Unsigned #1 – MSSQL to MySQL

Last week, I took on a project for my friend Phil Sneyd, who said he and Dave Kelly were on the verge of completely scrapping their long-term project MP3Unsigned, a website for unsigned musicians to upload their music, allowing the music to be charted monthly, commented on, etc.

I told him I’d have a look, and asked for some access details. I won’t get into the numbers, but when I saw the hosting costs, I did a quick check and found that moving everything to Digital Ocean with Volumes to handle the storage would cost 6 times less. A while later, I realised if I use Spaces instead of Volumes, it would be 22 times cheaper.

It was impossible to simply copy the code over, though. The website was written in an old form of ASP, the source-code of which was long gone, and the database was Microsoft SQL Server, which doesn’t play nice with anything other than Microsoft products.

So, I needed to extract the current data, and rebuild the website. This is a project that has data right back to 2003, so this is not going to be a single article rebuild πŸ˜‰

This first article is about converting the database from MSSQL to MySQL.

Exporting the Database

The hardest part in this, in my opinion, was in extracting the database data. I could not find an easy way to dump the data in a format that could be converted to MySQL (the DBMS I would use on the new one) and imported.

The old hosting included a copy of myLittleAdmin, which claims on their website to “Export data to XLS, XML, CSV“. I tried this.

The “CSV” export actually exports into a semi-colon delimited format (not commas), and it doesn’t quote strings, so when data itself contains semi-colons, you can’t trust the export.

The “XLS” export is not XLS. It exports an XML file which I assume can, after some manipulation, be converted into an XLSX file, but it’s definitely not XLS. I couldn’t use it.

The “XML” export at least did what it said it would! I finally was able to export the data. But I had to do it one table at a time, and the old system had 94 tables… It would be handy if myLittleAdmin had a single “export all as XML” function, but I can only work with what I have, so 94 exports it is…

There were a lot of tables that I considered irrelevant – tables created for specific competitions in 2006, or tables created for sponsors and then never used. I ignored those.

Converting to MySQL

The XML format that was created looked a little like this:

In that, we have the table name as the root, a “row” object containing row data, and then each field’s data is contained in an item named after it. Makes sense.

To convert that to MySQL, I used PHP’s built-in SimpleXML support to convert the XML file to a navigable structure, then started iterating.

It wasn’t quite as straightforward as that, but that was the gist.

To begin with, I assumed that all fields are text strings, so I set longtext as the field type. Except the first item in each row, which I assumed was an auto_increment integer (the unique primary ID of each piece of data), unless the item was obviously text, as was the case in some tables… And unless the table was the Chart data, because …who knows. When creating a table, a rule of thumb is that you should always enter a primary key. Always.

I found that when I ran the script, there were some crashes where the original export used some characters that SimpleXML complained about. For example, “”, “ ” (yes, there’s a character there…), “”, “”. After checking the context of these characters, I simply removed them. I had tried various methods including UTF8 encoding, MB transliteration. In the end, deleting them was easiest.

Once the table structure was recreated in a vaguely correct MySQL form, I could then import the XML files directly using MySQL’s “load into” method.

Here’s the PHP code I used to automate this (the dbQuery function just runs a query)

$files=new DirectoryIterator('.');
foreach ($files as $f) {
if ($f->isDot() || $f->getFilename()=='table_import.php') {
continue;
}
$fname=$f->getFilename();
$tname=strtolower(preg_replace('/mla_export_(TBL)?(.?)../', '\2', $fname));
echo $tname."\n";
dbQuery('drop table if exists '.$tname);
$xmlstr=file_get_contents($fname);
$xmlstr=str_replace(['', '', ' ', '', '', '', '', '', '', '', ''], ' ', $xmlstr);
$xml=simplexml_load_string($xmlstr, 'SimpleXMLElement', LIBXML_COMPACT | LIBXML_PARSEHUGE);
if (!$xml || !isset($xml->row) || !count($xml->row)) {
continue;
}
$i=0;
$fields=[];
foreach ($xml->row[0] as $k=>$v) {
if ($i==0 && preg_match('/^[0-9]+$/', $v) && $fname!='mla_export_TBLOfficialChart.xml') {
$fields[]=$k.' int auto_increment not null primary key';
}
else {
$fields[]=$k.' longtext';
}
$i++;
}
$sql='create table '.$tname.' ('.join(', ', $fields).') default charset=utf8 engine=innodb';
echo $sql."\n";
dbQuery($sql);
echo "importing data\n";
dbQuery('LOAD XML LOCAL INFILE "'.$f->getPathname().'" INTO TABLE '.$tname);
}

The import failed on a few of the XML files without error – simply saying “process killed”. I fixed this by increasing the RAM of the machine temporarily. There’s no need for a running production server that’s only serving simply things to have a load of RAM in it, but for the initial setup, it needs enough RAM to completely hold all of the imported data in memory, with plenty to spare while it manipulates it into a shape that can be recorded in MySQL.

The actual import took hours for the server to do with this project. At least 3 hours of non-stop crunching.

Next Steps

After the import, I have a load of data in very simple and inefficient format – inconsistent table names, inconsistent field names, and just primary keys and long-text formats.

I’m going to leave the names alone until after we’ve completed the initial rebuild, so that when I “refresh” the data just before we go live, I don’t have to do a lot of awkward remapping.

The long-texts, I’ll adjust as I rewrite the parts that read/update them.

Tomorrow, the plan is to create a test server that simply outputs the existing data in read-only format. This will act as a low-level site that will be “progressively enhanced” to what we want it to be.

Optimising a WordPress website #2 – MotoWitch

In the last post, I had improved Kojii’s website from a PageSpeed score of 6 to a score of 28.

I just checked, and it’s at 29 now. The PageSpeed site may give slightly different values depending on network traffic, etc, so I don’t think the site has gotten 1 point better just through waiting.

I hadn’t finished optimising all images, so I worked further on those, using the PageSpeed results as a guide.

I optimised a few of them, but they’re mostly okay – the savings are tiny compared to the effort, and some I had already optimised and Google was claiming it could get even more compression on them (but didn’t say how – Google: how?)

Images are now number 2 in the PageSpeed recommendations. The new #1 is Server Response Time

Server Response can be affected by a number of things, including network, hard-drive speeds, computer load. In this particular case, though, I think it is the number of database queries that the website has to run in order to generate its pages.

A plugin called “WP Super Cache” fixes this – whenever a page is generated, its resulting HTML is stored in a cache file. When the page is next requested, the cached version is delivered, instead of generating a new one. Donncha O Caoimh from the People’s Republic of Cork wrote this, and it may just be the smartest thing to come out of that dismal place!

After activating the plugin, go into the settings and enable it (it’s off, by default!)


Ok – let’s check PageSpeed again and see what that did. I open up a different browser (if you’re logged in, caching is disabled, so you need either a different browser, to log out, or to use anonymous mode) and load up the front page to prime it.

Hmm… no change in the score? It’s still at 27, but after the slight optimisations to the images, they’re at number 4 now instead of 2. But the Server Response is still at #1.

Note that the number is actually higher (1.75s now vs 1.66s before). Again, I think this is just a blip in the various interactions between server speed and network speed. To be sure, I waited a few minutes then ran PageSpeed again.

Ah, that’s better πŸ™‚ I think that my Firefox priming didn’t work. So, Google’s first retrieval of the page after enabling WP Super Cache did the priming for us, so the second retrieval was quicker.

Now, what was #1 (server response time) isn’t even in the list of issues:

Next, we tackle render-blocking resources. When you include CSS, JavaScript, or other resources in your website, the browser “blocks” further loading until those resources are downloaded. This is in case something later in the page relies on those resources.

For example, if you have a script that uses jQuery to do its stuff, then you need jQuery to load first. In this case, jQuery is a blocking resource.

Since the browser doesn’t know what relies on what, it treats every resource as a blocker.

To solve this, you can use the HTML “defer” parameter, which tells the browser that the script the parameter is in is safe to load at a later time when it’s not busy.

The WordPress plugin “Async JavaScript” can do this, giving you the option to specify which scripts are deferred and which are important to load immediately. It’s written by the same guy that wrote the plugin Autoptimize I talked about in the last blog post, and is designed to work with it.

After installing and activating, I enable it, but only for non-logged-in users. Since the vast majority of visitors to a website will be anonymous, these are the users you want to optimise for. Users that actually log in should probably get the full blast of scripts, etc, in case there’s something they use that might be affected by the optimising tricks:

And then click the Apply Defer button. The difference between “defer” and “async” is that “defer” will delay the loading of the scripts, and will then load them in the same order that they were requested in the web-page. Async, though, will load everything in the background and run the results as soon as they’re returned. Async loading can cause a “race condition”, where you may accidentally load something and run it before the scripts it relies on are loaded.

After all that, run PageSpeed again (twice, a minute apart). I see that we’re down to 36, and Eliminate Blocking Resources is still #1

Unfortunately, to do more on this, I think I would need to start removing things from the page, and I really don’t want to do that.

I don’t have root access to the machine, and it uses a theme I am unfamiliar with and don’t want to start hacking, so I think I’m stuck at this for now.

As a last test, I went into Autoptimise and enabled a few more things:

I was unable to verify that the image optimisation worked. I assume that what’s meant to happen is that images are loaded through a CDN that optimises them. But I didn’t see any sign of that (probably because the theme puts most images as CSS backgrounds?).

However, there was a minor improvement in score which I’ll take.

At the end of this blog post, the scores are now at 38 in Mobile, up from 28:

And at 82 in Desktop:

Optimising a WordPress website #1 – MotoWitch

I recently offered to optimise friends’ websites for free, in the interest of trying out a few methods of improving responsiveness and improving the Google PageSpeed score, which Google uses as one indicator in their PageRank system.

MotoWitch.com is run by Kojii Helnwein, a friend from back when we were teenagers (a long time ago now!) – it’s a website for female bikers to get together and share stories from all over the world.

Kojii asked me to go easy on her – she built the website herself and her skills lie more on the visual side than the technical guts. That’s okay – that’s what I’m here for!

So, the first thing I did was to check out what kind of score the website was getting already.

6%! To be honest, I’m very impressed that it’s this low, as the website works well on a subjective test on good broadband, but obviously it has a long way to go to be good on a phone, and the percentage of people using mobile for Internet access is always increasing.

6% score on Google’s PageSpeed tool

The “Opportunities” section of the PageSpeed results gives some hints as to what’s wrong.

Opportunities

Okay – it’s basically saying that the images are a really big problem. Clicking the first row of results will expand it to give some more details:

More Detail

The list of images actually goes on for another page, but we can already see the first issues to address.

Those images are HUGE. When multiplied by the number of images on the front page, it’s obvious that this will cause a problem to mobile users.

I see three immediate tasks:
1. apply some compression to those images – there really is no reason they should be so large if they’re on a website. As a test, I opened the first image in Gimp, and saved it as an 86% compression .jpg file. I could not see the difference visually, but the filesize was 6 times smaller. That’s a start.
2. add some “lazy loading” to the page so that images are only loaded when page scrolls bring the images into view. WordPress has built-in tools for this within its JetPack plugin.
3. add some coding to allow the page to load different versions of the images depending on what size the screen actually is – no need to load a 2000px wide image if you’re using a screen that’s 320px wide.

The JetPack plugin comes as standard with most WordPress sites. Clicking into it, the first edit is to enable WordPress’s Site Acelerator and Lazy Loading.

enable all of these

The “site accelerator” changes the site code so that some of the site’s images and static files (like JavaScript and CSS) come through WordPress’s global CDN servers (content delivery network), which reduces stress on your own server, helping speed up the websites in a number of subtle ways.

Lazy Loading means that images are only loaded when they are about to become visible on the page. JetPack’s Lazy Loading does not apply to all images, so we may need to manually edit some later.

With just this change, we increase the PageSpeed from 6 to 8, with the opportunities changing to this:

pagespeed after enabling lazy loading and site accelerator

The largest changes are that items 1 and 3 in the image above were 43.8s and 43.5s (estimated download times on mobile networks), and are now 7.05s and 3.78s. That’s 8 times faster.

The next step is to go through the list of large images, and optimise them. Here’s what PageSpeed now says are the worst offenders:

images with potential savings

The PageSpeed interface does not give an obvious link to the original image, but if you right-click the little thumbnails on the left and click “open in new tab”, that lets you then download the originals and optimise them.

The simplest method is to use an auto-optimising tool to handle it for you. I used imagecompressor.com’s tool. It does a good job by default, but sometimes you can get a much higher compression without compromising visible details by tweaking the compression slider.

After editing a lot of the images (I couldn’t find where some of them are used, so will address them later), I re-ran PageSpeed. It was still at 8%, but the opportunities list looked better again:

Item 3 there was 7.05s (originally 43.8s) and is now 1.95s.

The “Eliminate render-blocking resources” one is tricky to correct in WordPress sites. Almost every plugin you add to WordPress includes CSS in some way, and they rarely if ever include “defer” keywords to let the browser download them later.

I opened the list:


In total about 30 items are in the list. How to address this?

There are plugins that collect the CSS and JS files that WordPress emits, and combine them before they are outputed. An article I read recommended Fast Velocity Minify, however, I found that this mangled some of the CSS of the website. Autoptimize did a much better job here.

After installing, activating, and then going into the plugin and enabling all items but the CDN on the front page, I primed the front page by loading it in a browser, and then checked PageSpeed again:

getting there…

The list of opportunities still has blocking resources as the first priority, but it is reduced from 30 items to only 14, and they’re all on external domain names:

But I notice something – the c0.wp.com references are from CSS/JS files that were offloaded to CDN by the JetPack plugin. Let’s turn that off…

Then prime the website by loading the front page, and check PageSpeed. The score is increased again to 23, and the number of blocking files reduced from 14 to 8:

Now I notice that half of those are Google Fonts references.

Autoptimize has some advanced settings that let you handle Google Fonts in various ways. I chose to load the fonts asynchronously through webfont.js:

The PageSpeed score is increased yet again, this time to 28, and the number of render-blocking resources is down to just 4:

I think that’s about as much optimisation I can do easily here, without needing to manually edit the CSS files to reduce their size.

I’ll leave it at that for today and carry on tomorrow.

salepredict3: automated test results

Based on a suggestion by ChΓ© Lucero (LinkedIn), I wrote a test to see exactly how accurate this machine is.

I had 41 domains already entered into the engine and categorised as Sale or Fail, so the test was based on those.

For each of the domains, the test:

  1. changed the domain’s type from sale/fail to prospect
  2. retrained the neural net using the rest of the domains as its reference data
  3. calculated how much of a match the domain was to a sale using that neural net
  4. if the calculation indicated correctly a sale or a fail, then that counted as a correct test
  5. finally, clean up – reset the domain’s type back to sale/fail, ready for the next test

After 41 tests, it got 27 correct – an accuracy of 65.85%. That’s much more than chance (50%).

I’m going to get some more data now, but I expect it will only improve the value, not decrease it.

What does this mean for your own business?

Well, let’s say you have 100 companies you can potentially sell to, and you expect that 50% of them might end up being a waste of time, but you still need to spend about 2 hours on each in order to find that out.

Without using my engine, after 100 hours of selling, you will have made 25 sales. (100 hours is 50 companies. 50% success rate so 50/2 = 25).

With my engine, after 100 hours of selling, you will have made 33 sales, because it will have pre-ordered the companies and got it 66% correct, so in the first 50 companies, it will have correctly placed 66% of all successful sales.

web stats analysis program

Every day, I spend about an hour taking figures from Google Webmaster Search Analytics and adding them to a spreadsheet I manage that does some calculations and then highlights anything interesting in the data.

I got tired of the drudgery of it, so spent most of the day today building something to cut that down to just a few seconds. Basically, I just open up the new stats thing, upload a CSV file into it, and it does all the hard work itself.

And then I stuck the whole thing up on GitHub: here you go

what I’m up to these days

I’m going to start blogging again soon. I’m getting more and more into Arduino stuff, and want to be able to explain stuff properly to my students at the Monaghan Coder Dojo, so will be writing articles explaining what I’m teaching.

In work (FieldMotion, where we do field service management software), we’re doing some really interesting things with Zapier, which lets us easily integrate with more than 750 applications.

We’re also doing a new thing with the reports that we generate, where we can apply “skip logic” to the actual reports, resulting in personalised reports depending on the questions answered in the job forms.

I’ve had to take a break from the 3D Printer stuff, because the printer I had (a Makibox) broke down so often that I was spending more time fixing it than using it. I’m not quitting with 3D printing – just need to wait for a bit of cash to come in so I can afford a better printer. Once I have that new printer, the first thing I’m doing is making a second printer with it!

a more precise measurement of the number of cows murdered for the new UK fivers

image taken from https://www.theguardian.com/business/2016/dec/02/5-animal-fat-bank-note-british-vegetarians-being-stupid-says-inventor

There is a Vice article that claims to have an accurate measurement of how many cows were killed to make the new UK five-pound notes.

The say it’s half a cow.

Ignoring that you can’t really kill just half of a cow, let’s look at their maths.

How much do cows weigh? Between 1,100kg for a male (bull) and 720kg for a female. So, on average, a cow weighs 910kg.

Not true. If you want the average weight of a cow, you need to remember that 50% of male cows are murdered before they become adults, so the average needs to take that into account.

It looks like Vice got their weights from Google, which says 1100kg for a male, and 720kg for a female.

Given a 2:1 female:male ratio, the average is more like (720*2+1100)/3=850kg per cow.

The body fat content of an average cow is 25 percent. Therefore, the amount of fat in an average cow’s body is 227.5kg.

Vice appears to take the word of this question‘s answerer when it states 25%.

However, if you trust the word of the Canadian government, then it’s more like 15%. The Oklahoma state government says it’s less than 18% (including bone, skin), making the 15% sound about right.

So, 15% of 720Kg. Therefore, the amount of fat in an average cow’s body is 108kg, not 227.5kg. Less than half what Vice stated.

How many kilograms of this fat is contained in offcuts you could use to make tallow? About 40kg, according to a man at the James Elliott butcher in Islington.

Tallow is rendered from suet, which is made from the fat found around the loins and kidneys. So we can’t just use all the fat from the cow. According to the Oklahoma Department of Agriculture, about 4% of a cow’s weight is suet. That’s 28.8kg (720*.04), not 40kg.

How much tallow is used in one note, according to the Bank of England? “A trace”, which chemically means less than 100 parts per million, or 0.01 percent. A polymer consultant I called confirmed that the tallow present in a given polymer would be a fraction of a single percentage.

Again, where do they come up with these numbers? No references given.

First off, when some PR guy says “there’s just a trace”, they are not speaking as an analytical chemist. They are saying “stop asking. not enough per note to make a difference on a scale”.

They say that trace means less than 100ppm. But atmospheric CO2 is a trace gas, and that is 335ppm.

Here’s an actual definition of what “trace” means – 0.1%. That’s still a “fraction of a single percentage” as Vice said, but it’s ten times larger than the 0.01% they pulled out of nowhere.

New Β£5 notes weigh 0.7g, therefore there is roughly 0.00007 g of tallow present in one Β£5 note.

0.1% of 0.7g is 0.0007. Ten times larger than the figure Vice comes up with.

How many fivers are in circulation now, and therefore will be around by May of 2017, when all the old paper ones have been phased out? 329 million notes.

To work out how much tallow will be used in total in all of these fivers, we need to multiply 0.00007g by 329 million, which gives us 23,030g, or 23kg.

Again, multiply by ten. 329,000,000*0.0007g=230,300g, or 230kg.

And if you get about 40kg of tallow-worthy fat from the average cow, how many cows would you need to make every single Β£5 note in circulation?

Well, since it’s actually 28.8kg per cow… take the 230kg required, divide it by 28.8kg, and you get:

8 cows.

Not half a cow.

16 times larger.

You might say “yeah – but who cares? it’s a fucking cow!”.

Well yeah – people that say things like that are not going to be budged anyway.

Makibox 3D Printer

I had the option to get my birthday present about two months early. Jumped at the chance.

Makibox, a 3D printer company, is selling off its entire stock of printers (makiboxclearance.co.uk), so it was a chance to get something cheap that I can hack on.

The package only took a week or so deliver, which is much better service than I expected, based on some of the messages I’d seen online.

I bought the unheated version (here) in kit form.

It took a few hours to put the machine together. I didn’t try printing anything until the next day.

The printer works by raising and lowering a print bed (the Z axis), and moving a “hot end” around on top of that in X and Y. The hot end hangs from the centre of two crossbeams, one of which moves in X and the other in Y.

The first problem I encountered, was that when I went to print for the first time, the hot end immediately started carving a pretty pattern into the bed. The printer didn’t know where the bed was, so was lowering the hotend too far down.

This kept happening even after I used the “bed leveling wizard” in Cura, the first step of which is /supposed/ to define where the bed is. But, no matter how accurately I did the first step, it totally ignored that and reset automatically to a level where it thought the bed was a few millimetres lower than it actually was, making the hot end drive straight into the bed.

It took me a while to figure out the problem – that the bed depth was “hard-coded” into the printer’s hardware – before every print, it would raise the bed right up until the platform-raising piece on the X axis screw touched against the “end-stop” switch at the top.

The solution to that was to glue something to the top of the platform-raising piece so it would hit the switch sooner. In the end, I glued a scrabble piece and a sim card (I had them at hand) on. This artificially lowered the expected bed depth by about 2cm, which is much more than is needed for the hot-end that comes with the printer, but is perfect for the replacement hot-end I ordered next.

The original hot-end sucks. They even say it themselves – in their words, “the standard hotend in the makibox kit is not the greatest piece of engineering ever made by man, it does have a tendency to burn out”.

The first problem I encountered with that hot-end was that it has no way of cooling off. There is an aluminium wall on one side of the base-plate, which could hold a heat-sink, but the heat-sink would be a case of “too little, too late”, as the hot-end should really be cooled right above the heating element, not 3cm above it. The problem is that when the hot-end’s heat spreads upwards, the plastic being pushed into it melts too soon, and it ends up like trying to push goo through a small hole at the bottom of a can, using a piece of spaghetti.

I /was/ going to try solve this by wrapping some tubing around the hollow bolt above the heating element, and run water through it, but the hot-end just stopped working on me completely, so I decided to pay for a better solution.

This solution was the E3D V6 (Lite), which has a proper heatsink, and a fan.

The E3D V6 took a few days to arrive, and when it did, there was a few hours assembly needed. The hardest part was figuring out how to connect the Bowden tube to the Makibox’s extruder. I managed this in the end by taking an M6 (I think. maybe it was M5?) nut and screwing it directly onto the end of the Bowden tube, then the new tube would dock into the extruder just like the original.

The next problem is one I’m still working on solving. The hot-end is positioned by moving two beams. The hot-end hangs from where the beams cross each other. The problem is that the beams are moved by long screws on /one end/ of the beam. The part that connects to the screw tries to keep the beams perpendicular to the screw, but it’s like trying to lift a plank of wood by lifting just one end – difficult.

The solution for this, I think, is to run some strings around a series of wheels that guide the strings such that when one end of the Y axis moves (for example), the other end is pulled by the string to keep the beams perpendicular to each other.

So, the first prints I’m doing are holders for the wheels. The prints are really terrible, as the printer is obviously not yet in perfect working order, but after I finish fixing this problem, I can print them again in better quality πŸ™‚

unwatermarking images

I’ve started a website where I intend to sell thousands of products from a number of distributors through drop-shipping (the products go directly to the customer).

For reasons that I don’t understand, the distributors have watermarked their images, and don’t provide unwatermarked versions unless you’re an already well-established customer of theirs.

For the purpose of this demo, a watermark is a constant-colour “stamp” which is given opacity and pasted into the original image.

As I intend to be a good customer, I figured it would be okay for me to simply “unwatermark” the images.

There are a number of instructions online which show how to /fake/ an unwatermaking – by basically smudging the area where the watermark is.

However, as most watermarking appears to follow a single method, it is actually possible to simply reverse the process and remove the watermark, after a little trial and error.

Let’s consider an example. Here is an image, a stamp, and the merge of the two:

(original is here)

  • demo1
  • demo2
  • demo3

To reverse this, you need to know what algorithm was used to create the watermark, and what the original watermark was.

Most people use a fairly simple method to watermark their images:

The stamp is one single colour, usually gray (#808080 in RGB) which will be visible on images which are both light and dark.

The stamp is then given an opacity (30% in my case above), and pasted directly over the original image.

The formula for any particular colour channel (R, G and B) on any pixel is: C3=(1-p)C1+pC2, where p is opacity (0 to 1), C1 is the colour value for the original image, C2 is the stamp’s colour value, and C3 is the resulting image’s colour value.

To reverse the watermarking, you need to convert the formula to see what it is in respect to C1: C1=(C3-pC2)/(1-p).

As most stamps will be using a middle gray (#808080), you just have to guess at the opacity. .3 is a good start.

For some reason I’m not yet sure of, the code I came up with did unwatermark the image, but too much… the points where the watermark were, ended up being too bright. So I needed to add a darkening aspect, reducing the brightness of the result of the above calculation.

I’m not going to hold your hand if you can’t make this work, but here’s the code I ended up with (assumes the images are exactly 400×400 in size). The original should be ‘original.jpg’, and the stamp should be ‘stamp.png’ (with white where transparent pixels should be).

$p=.3; // opacity

$f1=imagecreatefrompng('stamp.png');
imagepalettetotruecolor($f1);
$f2=imagecreatetruecolor(400, 400);
$f3=imagecreatefromjpeg('original.jpg');
imagepalettetotruecolor($f3);

for ($x=0;$x<400; ++$x) {
  for ($y=0; $y<400; ++$y) {
    $rgb1=imagecolorat($f1, $x, $y);
    $rgb3=imagecolorat($f3, $x, $y);
    $r3 = ($rgb3 >> 16) & 0xFF;
    $g3 = ($rgb3 >> 8) & 0xFF;
    $b3 = $rgb3 & 0xFF;
    if ($rgb1==16777215) { // white. just copy
      $c=imagecolorallocate($f2, $r3, $g3, $b3);
      imagesetpixel($f2, $x, $y, $c);
      continue;
    }
    $r1 = ($rgb1 >> 16) & 0xFF;
    $g1 = ($rgb1 >> 8) & 0xFF;
    $b1 = $rgb1 & 0xFF;
    $r2=c($r1, $r3, $p);
    $g2=c($g1, $g3, $p);
    $b2=c($b1, $b3, $p);
    $c=imagecolorallocate($f2, $r2, $g2, $b2);
    imagesetpixel($f2, $x, $y, $c);
  }
}
imagejpeg($f2, 'unwatermarked.jpg');

function c($c1, $c2, $p) {
  $c=c1($c1, $c2, $p);
  $c3=$c-(255-$c)*.2;
  return $c3<0?0:(int)$c3; 
} 
function c1($c2, $c3, $p) {
  $c=($c3-$c2*$p)/(1-$p);
  return $c>255?255:(int)$c;
}