Imagine there were an API that served random cat pictures from the internet. This sounds like an amazing idea, but it doesn't exist! So I decided to build it myself.
The basic idea was as follows. First, I would train an AI to differentiate between cats and not-cats. Then I would feed it random pictures off the internet, and it would discard the non-cat pictures. The cat pictures would be saved, to be served by the API.
Most of the AI work was already done for me. I found this project, which is designed to classify cat and dog pictures. Nothing really had to be done to the code. When training the model, I just gave it non-cat pictures rather than dog pictures. My computer is not optimized for AI training (I used my CPU), so I was not able to train the model as much as I would have liked.
I was also planning on distributing the classification over many computers. For this, I set up a Rails API that let clients grab unclassified pictures and submit classifications for them. I wrote a Python script to interface with the API and use the previously trained model. It turned out that my computer could classify images much faster than it could download them, so I only ran the program on a single computer.
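As a rough illustration of what such a worker loop might look like, here is a minimal sketch. It is not the original script: the function names (`download`, `classify_urls`), the `is_cat` callback standing in for the trained model, and the three result labels are all assumptions made for the example, and the calls to the Rails API are left out entirely.

```python
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def download(url: str, timeout: float = 10.0) -> bytes:
    """Fetch the raw bytes behind a URL using the standard library."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def classify_urls(
    urls: Iterable[str],
    is_cat: Callable[[bytes], bool],
    fetch: Callable[[str], bytes] = download,
) -> Dict[str, str]:
    """Label each URL as "cat", "not_cat", or "failed".

    `is_cat` is a hypothetical stand-in for the trained model.
    """
    results: Dict[str, str] = {}
    for url in urls:
        try:
            image_bytes = fetch(url)
        except (OSError, ValueError):
            # urllib reports HTTP errors such as 404 and most connection
            # problems as OSError subclasses; malformed URLs raise ValueError.
            results[url] = "failed"
            continue
        results[url] = "cat" if is_cat(image_bytes) else "not_cat"
    return results
```

Passing `fetch` as a parameter keeps the slow part (downloading) separate from the fast part (classification), which is also what makes the loop easy to try out without network access.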
The next issue was where to get pictures. For that, I used Common Crawl. Common Crawl is a non-profit that crawls the internet and builds up a massive database of web pages. The entire thing - nearly 50TB compressed - is available for download. For this project, I only needed the URLs, which came to only 200GB compressed. I wrote a script that would filter for URLs ending in .png, .jpg, or .gif. I ended up with a 957MB text file full of picture URLs, which I imported into a Postgres database. This is what the Rails API connected to.
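The filtering step can be sketched in a few lines of Python. This is an illustration rather than the original script: the function names are made up, and it assumes the input is one URL per line. It also checks the URL's path rather than the raw string, so a URL like `x.png?size=2` still counts, which may differ from what the original filter did.

```python
from typing import Iterable, Iterator
from urllib.parse import urlsplit

# The three extensions mentioned above.
IMAGE_EXTENSIONS = (".png", ".jpg", ".gif")


def is_image_url(url: str) -> bool:
    """Return True if the URL's path ends in one of the image extensions."""
    try:
        path = urlsplit(url).path
    except ValueError:
        # urlsplit rejects some malformed URLs; treat those as non-images.
        return False
    return path.lower().endswith(IMAGE_EXTENSIONS)


def filter_image_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the image URLs from an iterable of lines (one URL per line)."""
    for line in lines:
        url = line.strip()
        if url and is_image_url(url):
            yield url
```

Because `filter_image_urls` is a generator, it can be pointed at an open file and will stream through it line by line instead of loading everything into memory, which matters for input of this size.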
Classification took several days. I only ended up classifying 53% of the 9.6 million URLs I had. Just over 3 million URLs didn't work (404, connection error, etc.), or about 59% of the classified URLs. I ended up with 6018 cat pictures - just 0.117% of classified URLs, or 0.29% of working URLs. I expected more, but I have a strong feeling there were a lot of false negatives. (I had to bias the AI towards "not a cat" because it was getting tripped up by many false positives.)
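For anyone who wants to check the arithmetic, the percentages can be roughly reproduced from the rounded figures above. The exact counts are not given here, so this sketch assumes exactly 53% of 9.6 million URLs were classified and exactly 3 million of them failed; the results therefore only match the quoted numbers approximately.

```python
total_urls = 9_600_000                   # URLs in the database (rounded)
classified = round(0.53 * total_urls)    # assumes exactly 53% were classified
failed = 3_000_000                       # "just over 3 million", rounded down
cats = 6018                              # pictures classified as cats

working = classified - failed            # URLs that actually returned an image

failed_share = failed / classified       # share of classified URLs that failed
cats_per_classified = cats / classified  # cat share of all classified URLs
cats_per_working = cats / working        # cat share of working URLs

print(f"failed: {failed_share:.1%}")
print(f"cats per classified URL: {cats_per_classified:.3%}")
print(f"cats per working URL: {cats_per_working:.2%}")
```

With these rounded inputs the shares come out at roughly 59%, 0.12%, and 0.29%, in line with the figures quoted above.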
With the pictures classified, the next step was to write a web interface. I didn't really feel like making it good, since the project is basically just an elaborate joke. The main page is available here and features pagination. There is also a URL for getting a random cat picture.
The most common criticism I've received is how many of the pictures aren't cat pictures. Across the first 2 pages, I count 27 cat pictures out of 50 pictures total, for a success rate of 54%. However, if we assume that 1% of pictures on the internet are cat pictures (the actual percentage is probably lower), then that is roughly a 54-fold enrichment.
I want to emphasize again that the frontend is really, really bad. The random cat page returns a 500 sometimes, because it will try to fetch the original image if not cached, and sometimes that request is unsuccessful. Pagination can also be very slow for the same reason - images have to be fetched off of external servers. If this were a more serious project, I would just cache the image when it is classified as a cat.
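Caching at classification time could look something like the sketch below. None of this exists in the project: the function names, the on-disk layout, and the choice of a SHA-256 hash of the URL as the file name are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Optional


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a stable file name inside the cache directory."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / digest


def cache_cat_image(cache_dir: Path, url: str, image_bytes: bytes) -> Path:
    """Store the already-downloaded bytes once the image is classified as a cat."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(cache_dir, url)
    path.write_bytes(image_bytes)
    return path


def load_cached_image(cache_dir: Path, url: str) -> Optional[bytes]:
    """Return the cached bytes, or None if this URL was never cached."""
    path = cache_path(cache_dir, url)
    return path.read_bytes() if path.exists() else None
```

Since the classifier already has the image bytes in hand when it decides "cat", writing them out at that point costs no extra download, and the frontend would never need to contact the original server.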