How does Catana handle your data?


A couple of users have asked me how Catana handles data and why it requires “read-content” access to GitHub repositories. I want to answer this question publicly and spell out what Catana does and what it does not do.

Source of Truth

A common problem with TODOs coupled to an issue tracker is that their information diverges over time. The issue is closed, but the TODO remains in your code (or the other way around). The issue points to a TODO on line X in file A, but in reality it has moved to line Y in file B. And so on.

Catana addresses this problem by making your repository the source of truth.

For this to work, Catana requests “read-content” access on every repository where its GitHub application is installed.

Read-content access

The “read-content” permission allows a GitHub application to view or download the content of any file, or even to “git clone” the whole repository. For private repositories, you shouldn’t overlook this permission, and it is good practice to question why an application requires it.

First and foremost: Catana never runs “git clone” on your repository, nor does it store any of your source code on its servers.

Catana pulls its data solely from git diffs, each time a user pushes code. The git diff output itself is never written to disk; it is parsed on the fly. Catana then stores in its database only the metadata of any TODO it finds.

Added code and its corresponding git diff.

In the above example, Catana will create a corresponding TODO database record and store the following information:

  • Its assignee (Edouard).
  • The event with its argument. In this case a Date event and ‘2023-01-01’.
  • The title (“TODO Title.”).
  • The TODO filename location.
  • The line number of the TODO.

Nothing more, nothing less.
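
To make the idea concrete, here is a minimal sketch of how metadata could be pulled out of a diff held in memory. It is not Catana's actual parser: the comment syntax `TODO(<assignee>, <event>: <argument>): <title>`, the `TodoRecord` class, and the `extract_todos` function are all assumptions made up for illustration. The point is only that the diff is scanned line by line and nothing but the five fields listed above is kept.

```python
import re
from dataclasses import dataclass

# Hypothetical comment syntax, for illustration only (not Catana's real format):
#   TODO(<assignee>, <event>: <argument>): <title>
TODO_RE = re.compile(
    r"TODO\((?P<assignee>[^,)]+),\s*(?P<event>\w+):\s*(?P<argument>[^)]+)\):\s*(?P<title>.+)"
)
# Unified diff hunk header; only the start line of the new file is needed here.
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


@dataclass
class TodoRecord:
    assignee: str
    event: str
    argument: str
    title: str
    filename: str
    line: int


def extract_todos(diff_text):
    """Scan a git diff in memory and return metadata for TODOs on added lines."""
    records = []
    filename = None
    new_line = 0      # line number in the new version of the file
    in_hunk = False
    for raw in diff_text.splitlines():
        if raw.startswith("diff --git"):
            in_hunk = False                      # a new file section begins
        elif not in_hunk and raw.startswith("+++ "):
            path = raw[4:]
            filename = path[2:] if path.startswith("b/") else path
        elif (m := HUNK_RE.match(raw)):
            new_line = int(m.group(1))
            in_hunk = True
        elif in_hunk and raw.startswith("+"):
            found = TODO_RE.search(raw[1:])
            if found:
                records.append(TodoRecord(
                    assignee=found["assignee"].strip(),
                    event=found["event"],
                    argument=found["argument"].strip(),
                    title=found["title"].strip(),
                    filename=filename,
                    line=new_line,
                ))
            new_line += 1                        # added lines exist in the new file
        elif in_hunk and raw.startswith(" "):
            new_line += 1                        # context lines exist in the new file
        # removed lines ("-") are not part of the new file, so they are skipped
    return records


# A made-up diff, built line by line so the leading spaces stay visible.
SAMPLE_DIFF = "\n".join([
    "diff --git a/src/app.js b/src/app.js",
    "index 1111111..2222222 100644",
    "--- a/src/app.js",
    "+++ b/src/app.js",
    "@@ -10,3 +10,4 @@ function main() {",
    "   setup();",
    "+  // TODO(Edouard, Date: 2023-01-01): TODO Title.",
    "   run();",
    " }",
])

for todo in extract_todos(SAMPLE_DIFF):
    print(todo)
```

In this sketch the source lines themselves are discarded as soon as they have been matched; only the resulting record leaves the function.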

Without the “read-content” permission, Catana wouldn’t be able to get diff output from GitHub.

Synchronizing existing TODOs

As users push changes, add or delete lines of code, or rename files, existing TODOs in Catana’s database need to be updated. Catana proceeds the same way as above: it parses the git diff, calculates the line additions and deletions from it, and updates the location of any affected TODO accordingly.
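
The line arithmetic can be sketched as follows. Again, this is an illustration under assumptions rather than Catana's code: `relocate` is a hypothetical helper, and it expects the diff of a single file in unified format. It walks the hunks while keeping one cursor on the old file and one on the new file, so a stored line number can be mapped to its new position, or reported as deleted.

```python
import re

# Unified diff hunk header: @@ -old_start,old_count +new_start,new_count @@
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def relocate(old_line, file_diff):
    """Map a line number in the old file to the new file.

    Returns None when the diff deletes that line. Assumes file_diff
    covers one file and that its hunks appear in ascending order.
    """
    offset = 0                      # net shift caused by hunks above old_line
    old_cursor = new_cursor = None
    for raw in file_diff.splitlines():
        m = HUNK_RE.match(raw)
        if m:
            old_start, old_count = int(m.group(1)), int(m.group(2) or 1)
            new_start, new_count = int(m.group(3)), int(m.group(4) or 1)
            # With a count of 0, the start refers to the line *before* the change.
            old_cursor = old_start + (1 if old_count == 0 else 0)
            new_cursor = new_start + (1 if new_count == 0 else 0)
            if old_cursor > old_line:
                break               # this hunk and later ones sit below the line
            continue
        if old_cursor is None:
            continue                # file header lines before the first hunk
        if raw.startswith("-"):
            if old_cursor == old_line:
                return None         # the tracked line was deleted
            old_cursor += 1
        elif raw.startswith("+"):
            new_cursor += 1
        elif raw.startswith(" "):
            if old_cursor == old_line:
                return new_cursor   # unchanged line inside a hunk
            old_cursor += 1
            new_cursor += 1
        offset = new_cursor - old_cursor
    return old_line + offset


# A made-up diff: two lines added near the top, one line removed further down.
FILE_DIFF = "\n".join([
    "--- a/src/app.js",
    "+++ b/src/app.js",
    "@@ -1,4 +1,6 @@",
    " import a",
    "+import b",
    "+import c",
    " ",
    " function main() {",
    "   setup();",
    "@@ -20,3 +22,2 @@",
    " foo",
    "-bar",
    " baz",
])

print(relocate(11, FILE_DIFF))  # line between the two hunks
print(relocate(21, FILE_DIFF))  # line removed by the second hunk
```

Because only hunk headers and the leading `+`, `-`, or space of each line matter for this calculation, the content of the changed lines never needs to be kept.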

This approach makes Catana efficient: it takes less than a second to process each code push, even the ones that contain massive changes. More importantly, it ensures that your codebase is never stored anywhere, which reduces the risk of data leaks.

Tokens

When you grant access to a GitHub application, you implicitly authorize it to request access tokens. Catana uses the most secure option that GitHub offers. Each token requested is valid for eight hours only. After this period, Catana has to request a new token from GitHub. All tokens are encrypted with AES-256-GCM for their whole lifespan.

If you no longer wish to use Catana (:sad_panda:), uninstalling the application from GitHub will instantly revoke any existing tokens, blocking Catana from accessing resources in your repositories. Recorded TODOs linked to your repositories will not be destroyed automatically. The rationale is that TODO metadata is not considered sensitive, and keeping it allows you to recover your data if you decide to reinstall Catana (:happy_panda:).

Other permissions

Catana requests other permissions that are considered less sensitive, such as creating CI checks. In parallel to this blog post, I’m writing a comprehensive security manifesto covering each permission requested by Catana and what it is used for.

Stay tuned!
