The idea for the application came from my use of the CLIP model in a cloud-based video pipeline. I had looked at a lot of image models for various purposes, and CLIP stood out as a truly astonishing mechanism for searching photos with natural language. Not only can it handle simple searches like “beach”, it can also handle more complicated ones like “kids building a sandcastle”. It can even make connections between abstract queries and photos, like “best friends”, that would have been impossible with any previous model. Though I had already extracted a photo pipeline and used CLIP to index all my photos in the cloud, I thought it would be even more compelling if a user didn’t have to upload their photos to anyone and could search them locally on their phone. That was the start of the journey to build the app.

PyTorch to CoreML

Having last built an iOS app right when the App Store launched in 2008, I knew I had a lot to learn about iOS app development. The first big task was to port the PyTorch-based CLIP model to CoreML. Under the covers, all neural network frameworks are very similar calculation machines that represent layers and operators in comparable ways, which makes it possible to take a trained model like CLIP and convert it for use in another framework. Fortunately, I didn’t have to do this entirely on my own: Apple has released a set of tools (coremltools) to help convert models from other frameworks into CoreML. The basic idea is that you load the model, add some tracing hooks to it, and then use it to make a prediction; the tooling records what happens and saves it out in a format that CoreML understands. That is the first step. With CLIP there were a few issues doing that. A bug converting CLIP had been reported three weeks earlier, and a workaround that was already being developed was released a day after I started working on the project. Another interesting thing that came up during this process is that CLIP is actually two entirely separate models: one converts an image into an embedding (a 512-dimensional vector describing the image) and another does the same thing with text. So you end up doing two conversions and generating two separate CoreML models.
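As a concrete sketch of where that leaves you on the app side (file names and class layout are my assumptions, not the app’s actual code): the conversion itself happens offline in Python, and what ships in the bundle are two compiled CoreML models that get loaded independently, one for images and one for text.

```swift
import CoreML

// Sketch only: the conversion runs offline with coremltools; the app just loads
// the two resulting compiled CoreML models. File names here are assumptions.
final class ClipEncoders {
    let imageEncoder: MLModel   // image -> 512-dimensional embedding
    let textEncoder: MLModel    // text  -> 512-dimensional embedding

    init() throws {
        let config = MLModelConfiguration()
        config.computeUnits = .all   // let CoreML pick CPU / GPU / Neural Engine

        let imageURL = Bundle.main.url(forResource: "ClipImageEncoder",
                                       withExtension: "mlmodelc")!
        let textURL  = Bundle.main.url(forResource: "ClipTextEncoder",
                                       withExtension: "mlmodelc")!
        imageEncoder = try MLModel(contentsOf: imageURL, configuration: config)
        textEncoder  = try MLModel(contentsOf: textURL, configuration: config)
    }
}
```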

Text Preprocessing

Converting the model is only one part of porting it from PyTorch. In addition to the neural network, you also need to port the code that prepares the text or image for processing. For text there is generally some kind of tokenization strategy that converts a prompt into a sequence of numbers the model can process. For images there are usually several steps of normalization, cropping, and scaling. I needed to port both of these preprocessing steps from Python to Swift so they could be integrated into the app. For tokenization there was a cute shortcut: a lot of the work the Python code does is always the same, so instead of porting it, I just dumped the result of that work to JSON and load it into Swift. That left a much smaller amount of dynamic code to port, and that part was done quickly.
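To make the shortcut concrete, here is a minimal sketch of loading that precomputed data. The file name and JSON fields (CLIP’s byte-pair-encoding vocabulary and merge rules) are my assumptions about the shape of the export, not the app’s actual schema.

```swift
import Foundation

// Hypothetical shape of the tokenizer data exported once from Python as JSON.
// The static tables never change, so only the small dynamic encoding step
// needs to be ported to Swift.
struct TokenizerData: Decodable {
    let vocab: [String: Int]   // token -> id
    let merges: [String]       // byte-pair merge rules, in priority order
}

func loadTokenizerData() throws -> TokenizerData {
    let url = Bundle.main.url(forResource: "clip_tokenizer", withExtension: "json")!
    let data = try Data(contentsOf: url)
    return try JSONDecoder().decode(TokenizerData.self, from: data)
}
```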

Image Preprocessing

On the image processing side, it was more complicated. CoreML has built-in support for a lot of the preprocessing normally needed; however, CLIP calls for a color normalization step that is incompatible with what CoreML provides. So instead of being able to pass an image directly to CoreML, I had to make the model take a tensor (an MLMultiArray in CoreML) and generate that tensor myself from the original image. This is pretty easy to do naively, but the naive approach is extremely expensive computationally. In the first version of RememberWhen it dominated the time taken to index photos, and the app could only process 2 photos per second. A lot of work went into finding a better way to do it. The processing pipeline is now multithreaded and highly optimized, and it can process more than 50 photos per second on most iOS 15 devices.
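For illustration, here is roughly what the naive version of that step looks like (a sketch, not the optimized pipeline), using CLIP’s published per-channel normalization constants:

```swift
import CoreML

// Naive sketch of the per-pixel normalization CLIP requires, producing the
// MLMultiArray the converted model expects. The shipping pipeline does the same
// math, but vectorized and spread across multiple threads.
let clipMean: [Float] = [0.48145466, 0.4578275, 0.40821073]
let clipStd:  [Float] = [0.26862954, 0.26130258, 0.27577711]

func clipTensor(fromRGBA pixels: [UInt8], width: Int, height: Int) throws -> MLMultiArray {
    // Shape (1, 3, H, W) to match the traced PyTorch model's input.
    let shape: [NSNumber] = [1, 3, NSNumber(value: height), NSNumber(value: width)]
    let array = try MLMultiArray(shape: shape, dataType: .float32)
    let out = array.dataPointer.bindMemory(to: Float.self, capacity: 3 * width * height)

    for y in 0..<height {
        for x in 0..<width {
            let src = (y * width + x) * 4                 // RGBA source layout
            for c in 0..<3 {
                let value = Float(pixels[src + c]) / 255.0
                out[c * width * height + y * width + x] = (value - clipMean[c]) / clipStd[c]
            }
        }
    }
    return array
}
```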

The Processing Pipeline

Now that we have a processing pipeline that can index photos, we need to feed the user's photos into it. That involved learning about PhotoKit and creating a strategy to pick up new photos as they are added to the user's library. We also needed to store the results, quickly retrieve them when the app is restarted, and continue indexing from there. Where we landed was storing everything in sqlite3. Initially I was storing the 512-dimensional vectors as JSON text in the database. That also proved to be extremely expensive, because parsing floating point numbers isn’t cheap. Fortunately, I was later able to convert the field to a raw memory copy of the float array and avoid any parsing at all, which lets a 100k-photo index load in 1-2 seconds. Keeping track of what has been indexed was relatively easy: the app looks at the min/max creation dates of the photos it has already indexed and then indexes photos that are newer or older than those dates, always starting with newer photos.
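The blob trick itself is tiny; below is a minimal sketch of the encode/decode pair (the actual table layout and the sqlite3_bind_blob plumbing are omitted):

```swift
import Foundation

// Sketch of the blob encoding that replaced the JSON text column: the 512 floats
// are stored as a raw memory copy, so loading is a memcpy instead of parsing text.
func encode(embedding: [Float]) -> Data {
    embedding.withUnsafeBufferPointer { Data(buffer: $0) }
}

func decode(blob: Data) -> [Float] {
    var floats = [Float](repeating: 0, count: blob.count / MemoryLayout<Float>.stride)
    _ = floats.withUnsafeMutableBytes { blob.copyBytes(to: $0) }
    return floats
}
```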

Searching Photos

Now that we are able to create an index of the photos, we move on to how to search them. The basic process is to convert the text into its own embedding using the converted text model, then compare that embedding with all of the photo embeddings and find the nearest neighbors. I initially leveraged various open-source libraries that do approximate nearest neighbor search. The biggest issue with them (HORA, faiss, granne, etc.) is that they are painful to update incrementally, as you need to regenerate the index they build with each additional photo. Regenerating that index after an update was taking many seconds, which was ultimately unacceptable to me. There is another strategy, though: brute-force embedding search. The idea is to literally compare the query embedding against every photo embedding and take the top-k matches as your search result. But would it be fast enough for this app? As it turns out, computers are fast. My multi-threaded, Rust-based, brute-force embedding search can search 100k photos in under 10ms. Another interesting part of this process was figuring out how to create an Xcode framework from a Rust project and run it on 6 different devices. It wasn’t that hard, but Mac Catalyst (creating a Mac app from an iOS app) was only supported in Rust nightly and there were some bugs. The maintainers over at safer_ffi, though, fixed the problems I ran into.
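Conceptually the brute-force search is just a dot product against every stored embedding followed by a top-k selection. The shipping implementation is the multi-threaded Rust library mentioned above; the sketch below shows the same idea in single-threaded Swift, assuming the embeddings are L2-normalized so the dot product equals cosine similarity.

```swift
// Conceptual sketch of the brute-force search; the shipping version is the
// multi-threaded Rust implementation described in the text.
func topMatches(query: [Float],
                photos: [(id: String, embedding: [Float])],
                k: Int) -> [String] {
    let scored = photos.map { photo -> (id: String, score: Float) in
        // Dot product == cosine similarity when embeddings are L2-normalized.
        let score = zip(query, photo.embedding).reduce(Float(0)) { $0 + $1.0 * $1.1 }
        return (photo.id, score)
    }
    // A full sort is fine for a sketch; a real top-k would use a bounded heap.
    return scored.sorted { $0.score > $1.score }.prefix(k).map { $0.id }
}
```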

Building the user interface

The last bit of building the app was learning SwiftUI. Thankfully, I had taken some time a couple of years ago to learn React, and SwiftUI is pretty similar. In the end I have probably deleted more UI code than remains in the project, as I have aggressively simplified how the application works based on feedback from testers. There is now basically a search box, search results, and a single photo view. One interesting thing added just recently is the ability to zoom and pan an image in the single photo view. Amazingly, this isn’t something SwiftUI supports out of the box, and I had to leverage some crazy code from the internet to add the feature.
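For a sense of how small the UI is now, here is a rough sketch of its shape; the PhotoResult/PhotoView types and the search call are stand-ins, not the app’s actual code.

```swift
import SwiftUI

// Stand-in types for illustration only.
struct PhotoResult: Identifiable {
    let id: String
    let thumbnail: Image
}

struct PhotoView: View {
    let result: PhotoResult
    var body: some View { result.thumbnail.resizable().scaledToFit() }
}

// Rough shape of the simplified UI: a search box, a results grid, and a
// tap-through to a single photo view.
struct SearchScreen: View {
    @State private var query = ""
    @State private var results: [PhotoResult] = []

    // Stand-in for the embedding search described earlier.
    func search(_ text: String) -> [PhotoResult] { [] }

    var body: some View {
        NavigationView {
            ScrollView {
                LazyVGrid(columns: [GridItem(.adaptive(minimum: 100))]) {
                    ForEach(results) { result in
                        NavigationLink(destination: PhotoView(result: result)) {
                            result.thumbnail.resizable().scaledToFill()
                        }
                    }
                }
            }
            .searchable(text: $query)
            .onSubmit(of: .search) { results = search(query) }
        }
    }
}
```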

Asynchronous Swift

Probably my favorite bit of Swift programming is the photo processing pipeline. Swift has about four different asynchronous programming paradigms, but ultimately none of them had the primitive I needed: a pipeline stage that distributes work over multiple cores while keeping the results in the same order the work was inserted. With nothing available out of the box, I ended up building something that looks like a LinkedBlockingQueue from Java, where I insert N asynchronous Tasks at a time and then pop them off the queue in order for the next sequential stage, where the results are written to the index. This was one of the keys to drastically increasing photo processing performance.
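Here is a sketch of that idea using an actor as the queue (the names are mine, not the app’s): up to `width` tasks run concurrently, but results come back strictly in submission order.

```swift
import Foundation

// Sketch of an ordered fan-out/fan-in stage: up to `width` tasks run at once,
// but results are handed to the next stage strictly in submission order,
// much like popping a Java LinkedBlockingQueue.
actor OrderedPipeline<Input: Sendable, Output: Sendable> {
    private var queue: [Task<Output, Error>] = []
    private let width: Int
    private let work: @Sendable (Input) async throws -> Output

    init(width: Int, work: @escaping @Sendable (Input) async throws -> Output) {
        self.width = width
        self.work = work
    }

    /// Enqueue work for `input`. If the queue is full, the oldest result is
    /// awaited and returned first, so outputs always arrive in insertion order.
    func submit(_ input: Input) async throws -> Output? {
        var finished: Output? = nil
        if queue.count >= width {
            finished = try await queue.removeFirst().value
        }
        let job = self.work
        queue.append(Task.detached { try await job(input) })
        return finished
    }

    /// After the last submit, drain the remaining results in order.
    func drain() async throws -> [Output] {
        var results: [Output] = []
        while !queue.isEmpty {
            results.append(try await queue.removeFirst().value)
        }
        return results
    }
}
```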

Wrapping it up

Some other cute things that came up during development: