🤔 Overview • 🪄 Demos • 🔧 Installation • 💻 Usage • 🧠 How it works
This is a CLI application that analyzes a source code file using an AI model. It then shows you parts that look suspicious to it.
It does not use rules or static analysis the way a linter tool would. Instead, the model generates its own code suggestions based on the surrounding context. Check out how it works.
NB: All processing is done on your hardware, and no data is transmitted to the Internet.
Example output:
Here's the output of running the application on its own source files (so meta).
- cli.py: source code → generated output
- render.py: source code → generated output
- sus.py: source code → generated output
There was this post on Hacker News, "AI found a bug in my code", which was pretty cool. I wanted to try it on my own code, so I went ahead and built my own implementation of the idea.
You can install sus via pip:
pip3 install suspicious
Or from the source:
git clone git@github.com:sturdy-dev/suspicious.git
cd suspicious
python -m pip install .
You can run the program like this:
sus /path/to/file.py
Note that the first time you run this, the application needs to download a model (~500 MB); see the model section for more info.
This will generate and open an .html file with the results.
- grey means the prediction is the same as the original
- light grey means the model had a different prediction but with super low confidence
- light red means things are looking a little sus
- red means there was a different prediction and confidence was higher
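As a rough sketch, the legend above boils down to a small decision function. The name `legend_color` and the `low` / `high` cut-offs are made up for illustration; the thresholds sus actually uses are not stated here.

```python
def legend_color(original: str, predicted: str, probability: float,
                 low: float = 0.5, high: float = 0.9) -> str:
    """Map one prediction to a legend color.

    `low` and `high` are hypothetical confidence cut-offs, not values taken from sus.
    """
    if predicted == original:
        return "grey"        # model agrees with what was written
    if probability < low:
        return "light grey"  # different prediction, very low confidence
    if probability < high:
        return "light red"   # different prediction, moderate confidence
    return "red"             # different prediction, high confidence


for args in [("foo", "foo", 0.99), ("foo", "bar", 0.10),
             ("foo", "bar", 0.70), ("foo", "bar", 0.95)]:
    print(args, "->", legend_color(*args))
```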
Unclear. You run sus on a file and skim over the red stuff; maybe it spots something you missed. Ping me on Twitter if you catch something cool with it.
In a nutshell, it feeds a tokenized representation of your source text into a Transformer model and asks the model to predict one token at a time using Masked Language Modelling.
For a general overview of Transformer models, check out The Illustrated Transformer article by Jay Alammar, which helped me understand the core ideas.
sus uses a model called UniXcoder, which has been trained on the CodeSearchNet dataset. To do the MLM (masked language modelling), we add an lm_head layer.
When sus processes your code, it first tokenizes the text, where a token could be a special character, a programming language keyword, an English word, or part of a word.
Before feeding the sequence of token ids to the model, one or more tokens are replaced with a special <mask> token. After feeding the input through the network, we extract just the value at the masked location. This masking is done in a loop for each token to generate individual predictions.
Since this process is impractically slow, instead of masking one token at a time, sus masks 10% of the tokens, making sure that the masked locations are spread out (so that there is sufficient context around each prediction site).
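One simple way to keep masked locations spread out is a fixed stride: pass k masks positions k, k + 10, k + 20, and so on, so each pass hides about 10% of the tokens and every token gets masked exactly once across ten passes. This is an assumption about the scheduling, not necessarily how sus picks its positions; `mask_passes` and `apply_mask` are hypothetical helpers.

```python
from __future__ import annotations


def mask_passes(num_tokens: int, stride: int = 10) -> list[list[int]]:
    """Return one list of masked positions per pass.

    Pass k masks positions k, k + stride, k + 2 * stride, ... so neighbouring
    masks are always `stride` tokens apart.
    """
    return [list(range(offset, num_tokens, stride))
            for offset in range(min(stride, num_tokens))]


def apply_mask(tokens: list[str], positions: list[int],
               mask: str = "<mask>") -> list[str]:
    """Return a copy of `tokens` with the given positions replaced by `mask`."""
    masked = list(tokens)
    for pos in positions:
        masked[pos] = mask
    return masked


tokens = "def add ( a , b ) : return a + b".split()
for positions in mask_passes(len(tokens), stride=4):
    print(" ".join(apply_mask(tokens, positions)))
```

With a stride of 10 the model runs ten forward passes per batch instead of one per token.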
The output of this entire process is a list of structs that contain the original and predicted values for each token. Example:
{
  "idx": 0, // position in sequence
  "original": "foo", // as originally written in the source file
  "predicted": "bar", // what the model predicted
  "cosine_similarity": 0.23, // how different the prediction is from the original in the vector space
  "probability": 0.92, // how confident the model is in its prediction
}
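For reference, cosine similarity between two vectors is their dot product divided by the product of their lengths, so values near 1 mean the vectors point the same way and values near 0 mean they are unrelated. The sketch below uses toy vectors; which embeddings sus compares for the original and predicted tokens is not stated here, so treat the inputs as placeholders.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between vectors `a` and `b` (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings", purely illustrative.
original_vec = [0.9, 0.1, 0.3]
predicted_vec = [0.2, 0.8, 0.1]
print(round(cosine_similarity(original_vec, predicted_vec), 2))
```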
This is then fed into an HTML template to be rendered for the user. Easy-peasy.
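A minimal sketch of that rendering step, using only the standard library rather than a real template engine. The `render` function, the `sus` / `ok` class names, and the 0.5 threshold are all assumptions for illustration, not what render.py does.

```python
import html


def render(records: list, threshold: float = 0.5) -> str:
    """Turn prediction records into an HTML snippet, one <span> per token."""
    spans = []
    for rec in records:
        differs = rec["predicted"] != rec["original"]
        # Hypothetical rule: flag a token when the model disagrees confidently.
        css = "sus" if differs and rec["probability"] >= threshold else "ok"
        title = html.escape(f'predicted: {rec["predicted"]}')
        text = html.escape(rec["original"])  # escape so code like a<b shows up literally
        spans.append(f'<span class="{css}" title="{title}">{text}</span>')
    return "<pre>" + "".join(spans) + "</pre>"


records = [
    {"idx": 0, "original": "a<b", "predicted": "a<b", "probability": 0.90},
    {"idx": 1, "original": "foo", "predicted": "bar", "probability": 0.92},
]
print(render(records))
```

Escaping matters here because source code is full of `<`, `>` and `&`, which would otherwise be interpreted as markup.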
sus uses the decoder of UniXcoder, specifically the unixcoder-base-nine checkpoint. What's cool is that it's only 500 MB and ~120M parameters, which means it's quick to download and fast enough to run locally.
Larger models produce higher quality outputs, but you need to run the inference on a server.
You can try sus on any source file, but you can expect the best results with the following languages:
- java
- ruby
- python
- php
- javascript
- go
- c
- c++
- c#
- Accuracy: sus is meant to be executed locally (aka not sending code to a server), which puts some constraints on the AI model size. Larger models will produce higher quality results, but they can be tens of GB in size and, without a beefy GPU, could take a long time to generate the output. Because of this, sus uses a modestly sized model.
- Large files: The model also puts constraints on the input size (analyzed file size). sus works around this by batching the input, but as a result, batches are not aware of the 'context' / code that is in other batches. Files are split into batches of 2500 characters, which is super crude and is meant to correspond to ~1024 tokens.
- Masking is done on a per-token basis. It could be interesting to first generate a syntax tree from the code and then mask an entire node instead.
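The character-based batching described above can be sketched in a few lines. Whether sus cuts at exact character offsets or nudges the cut to a line boundary is not stated, so this version (with the hypothetical name `split_into_batches`) cuts at exact offsets.

```python
from __future__ import annotations


def split_into_batches(source: str, size: int = 2500) -> list[str]:
    """Split `source` into consecutive chunks of at most `size` characters."""
    return [source[i:i + size] for i in range(0, len(source), size)]


batches = split_into_batches("x = 1\n" * 1000)  # 6000 characters
print([len(b) for b in batches])
```

A cut can land in the middle of a token or statement, which is one more reason predictions near batch edges deserve extra scepticism.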
sus is distributed under AGPL-3.0-only. For Apache-2.0 exceptions, contact kiril@codeball.ai