CaaS - Cats as a Service

Written by Stephen
January 10, 2020

Imagine if there were an API where you could get random cat pictures from the internet. This sounds like an amazing idea, but it doesn't exist! So I decided to build it myself.

The basic idea was as follows. First, I would train an AI to tell cats from not-cats. Then I would feed it random pictures off the internet, and it would discard the non-cat pictures. The cat pictures would be saved, to be served by the API.

Most of the AI work was already done for me. I found this project, which is designed to classify cat and dog pictures. Nothing really had to be done to the code: when training the model, I just gave it non-cat pictures rather than dog pictures. My computer is not optimized for AI training (I used my CPU), so I was not able to train the model as much as I would have liked.

I was also planning on distributing the classification over many computers. For this, I set up a Rails API that lets a client grab unclassified pictures and report back how it classified them. I wrote a Python script to interface with the API and use the previously trained model. It turned out that my computer could classify images much faster than it could download them, so I only ran the program on a single computer.
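A minimal sketch of what such a worker loop could look like. The post does not describe the API's endpoints or fields, so everything here is an assumption: the four callables (`claim_batch`, `download`, `classify`, `report`), the `id` and `url` keys, and the three status strings are hypothetical names chosen for illustration. In a real worker, `claim_batch` and `report` would wrap HTTP requests to the Rails API and `classify` would wrap the trained model.

```python
from typing import Callable, Dict, List, Optional


def run_worker(
    claim_batch: Callable[[], List[dict]],
    download: Callable[[str], Optional[bytes]],
    classify: Callable[[bytes], bool],
    report: Callable[[int, str], None],
) -> Dict[str, int]:
    """Claim batches until the server has none left, reporting one status per picture."""
    counts = {"cat": 0, "not_cat": 0, "failed": 0}
    while True:
        batch = claim_batch()  # hypothetical: returns [] when nothing is left
        if not batch:
            break
        for picture in batch:
            data = download(picture["url"])  # None stands for 404, connection error, etc.
            if data is None:
                status = "failed"
            elif classify(data):
                status = "cat"
            else:
                status = "not_cat"
            report(picture["id"], status)
            counts[status] += 1
    return counts
```

Because each worker only talks to the server through "claim" and "report", more machines could be added without the workers knowing about each other. Passing the callables in also makes the loop easy to try out with fake data before pointing it at a real server.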

The next issue was where to get pictures. For that, I used Common Crawl. Common Crawl is a non-profit that crawls the internet and builds up a massive database of web pages. The entire thing - nearly 50TB compressed - is available for download. For this project, I only needed the URLs, which came to only 200GB compressed. I wrote a script that would filter for URLs ending in .png, .jpg, and .gif. I ended up with a 957MB text file full of picture URLs, which I imported into a Postgres database. This is what the Rails API connected to.
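The filtering step can be sketched roughly as follows. This is not the original script, just one way to do it, assuming the input is one URL per line. Two choices here are assumptions that the post does not specify: the extension is checked against the URL's path (so a trailing query string is ignored), and the comparison is case-insensitive.

```python
from typing import Iterable, Iterator
from urllib.parse import urlsplit

IMAGE_EXTENSIONS = (".png", ".jpg", ".gif")


def image_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the URLs whose path ends in one of IMAGE_EXTENSIONS."""
    for line in lines:
        url = line.strip()
        if not url:
            continue
        try:
            path = urlsplit(url).path
        except ValueError:
            # urlsplit rejects some malformed URLs; skip them.
            continue
        if path.lower().endswith(IMAGE_EXTENSIONS):
            yield url
```

Since the function is a generator, it can be fed an open file object and will process a very large URL list one line at a time without loading it all into memory.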

Classification took several days. I only ended up classifying 53% of the 9.6 million URLs I had. Just over 3 million URLs didn't work (404, connection error, etc.), or about 59% of the classified URLs. I ended up with 6018 cat pictures - just 0.117% of classified URLs, or 0.29% of working URLs. I expected more, but I have a strong feeling there were a lot of false negatives. (I had to bias the AI towards "not a cat" because it was getting tripped up by many false positives.)
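The post does not say how the bias towards "not a cat" was implemented. One common way, shown here purely as an assumption, is to raise the probability threshold above which a picture counts as a cat. The scores below are invented for illustration and are not results from this project.

```python
from typing import Iterable, Tuple


def is_cat(cat_probability: float, threshold: float) -> bool:
    """Accept a picture as a cat only when the model is confident enough."""
    return cat_probability >= threshold


def count_errors(scored: Iterable[Tuple[float, bool]], threshold: float) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) for (probability, actually_cat) pairs."""
    scored = list(scored)
    false_positives = sum(1 for p, actual in scored if is_cat(p, threshold) and not actual)
    false_negatives = sum(1 for p, actual in scored if not is_cat(p, threshold) and actual)
    return false_positives, false_negatives


# Made-up (probability, actually_cat) pairs, for illustration only.
EXAMPLE_SCORES = [
    (0.95, True), (0.80, True), (0.60, True),
    (0.85, False), (0.55, False), (0.10, False),
]

for threshold in (0.5, 0.9):
    fp, fn = count_errors(EXAMPLE_SCORES, threshold)
    print(f"threshold={threshold}: {fp} false positives, {fn} false negatives")
```

On this toy data, moving the threshold from 0.5 to 0.9 removes both false positives but turns two real cats into false negatives, which is the trade-off described above.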

With the pictures classified, the next step was to write a web interface. I didn't really feel like making it good, since the project is basically just an elaborate joke. The main page is available here and features pagination. There is also a URL for getting a random cat picture.

The most common criticism I've received is how many pictures aren't cat pictures. Between the first 2 pages, I count 27 cat pictures, out of 50 pictures total, for a success rate of 54%. However, if we assume that 1% of pictures on the internet are cat pictures (the actual percentage is probably lower), then that is still roughly a 54-fold enrichment.
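The arithmetic behind that comparison is simple enough to write down. The 27 and 50 come from the count above, and the 1% base rate is the assumption stated in the text, not a measured value. The function name `enrichment` is just a label chosen for this sketch.

```python
def enrichment(hits: int, sample_size: int, base_rate: float) -> float:
    """How many times more common cats are in the sample than under the assumed base rate."""
    success_rate = hits / sample_size
    return success_rate / base_rate


success_rate = 27 / 50  # cat pictures on the first 2 pages
factor = enrichment(hits=27, sample_size=50, base_rate=0.01)  # 0.01 is an assumed base rate
print(f"success rate: {success_rate:.0%}, enrichment: about {factor:.0f}x")
```

A lower base rate would make the factor larger: with an assumed 0.5% of internet pictures being cats, the same 54% would be about a 108-fold enrichment.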

I want to emphasize again that the frontend is really, really bad. The random cat page sometimes returns a 500, because it tries to fetch the original image if it is not cached, and sometimes that request fails. Pagination can also be very slow for the same reason - images have to be fetched from external servers. If this were a more serious project, I would just cache the image when it is classified as a cat.
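A rough sketch of that "cache at classification time" idea, assuming a plain directory on disk as the cache. The function names, the directory layout, and the use of a SHA-256 hash of the URL as the file name are all choices made for this example and not how the site actually works.

```python
import hashlib
from pathlib import Path
from typing import Optional


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a stable file name inside the cache directory."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / digest


def store_if_cat(cache_dir: Path, url: str, image_bytes: bytes, is_cat: bool) -> Optional[Path]:
    """Save the already-downloaded bytes when the picture was classified as a cat."""
    if not is_cat:
        return None
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(cache_dir, url)
    path.write_bytes(image_bytes)
    return path


def load_cached(cache_dir: Path, url: str) -> Optional[bytes]:
    """Return the cached bytes, or None if this URL was never stored."""
    path = cache_path(cache_dir, url)
    return path.read_bytes() if path.exists() else None
```

The classifier already has the image bytes in hand when it makes its decision, so storing them at that point costs no extra download, and the web pages would then never need to contact the original server.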
