Sky-Feeder

Feed Generators are services that provide custom algorithms to users through the AT Protocol.

Powered by The AT Protocol SDK for Python

This project is a forked version of the original ATProto Feed Generator. It extends that functionality by:

Adding a web portal for creating and manipulating feeds,
Establishing a "feed manifest" syntax for expressing the rule set for any type of algorithmic recipe for a feed,
Creating algorithm "operators" that generate boolean results for different types of comparisons to be made against the skeet stream

A public-facing site that anyone can use to create their own feeds is currently available at the Hellofeed demo site. To learn more about feed generators please review this documentation.

Key Features

Rule Chaining: Supports chaining multiple rules using logical operators like AND and OR to build complex conditions for feed filtration.
Regex Evaluation: Utilize regular expressions to match patterns within the data feed content.
Transformer Similarities: Integrate transformer models for semantic text similarity analysis.
Social Graph Filtering: Select/Reject skeets based on follower/following graph properties.
Attribute Matching: Select/Reject skeets based on direct comparison on properties of the skeet stream.
ML Probability Assessments: Calculate probabilities based on feature evaluations using ML models to classify and filter data.
Modular Design: Facilitates the addition of new models and rule types with minimal changes to existing code.

Getting Started

We've set up this server with Postgres to store and query data. Feel free to switch this out for whichever database you prefer.

Prerequisites

Python 3.7+
Optionally, create a virtual environment.
Can run as Dockerized set up

Installation

Install dependencies:

pip install -r requirements.txt

Copy .env.example as .env. Fill in the variables.

Implementing Logic

Next, you will need to do two things:

JetStream ingest in server/data_filter.py.
Skeet Filtering Logic in server/algos.
- Use the provided algorithm_manifest.json.example file as the configuration blueprint for deploying the feed algorithms.
- The algorithm_manifest.json.example contains definitions of models, their configurations, and how they interlink through rules to filter data.
Management Layer in server/app.py, server/database.py and files referenced from there on.

Algorithm Manifest

`algorithm_manifest.json`

The algorithm_manifest.json file serves as the configuration blueprint for deploying the feed algorithms. In our implementation, we store algorithm manifests against UserAlgorithm objects, but these can be applied any other way you'd prefer. It contains the definition of models, their respective configurations, and how they interlink through rules to filter data. To learn more about the full set of available operators please review the manifest documentation.

Structure

filter: Defines the condition set used for evaluating each feed item. The conditions use operations like regex_matches, text_similarity, and model_probability.
models: Lists the ML models used along with their feature modules. Note that, for any model, you must provide the correct definition for a model as well as its feature modules in the correct order. We'll make that unnecessary later... at some point. Each model declaration includes:
- model_name: Unique identifier for the ML model.
- training_file: Source data used for model training.
- feature_modules: Features used to generate the necessary input vector for ML predictions.
author: Provides credentials to authenticate the model-building process and social graph traversals.

Example

{
  "filter": {
    "and": [
      {
        "regex_matches": [
          {"var": "text"},
          "\\bimportant\\b"
        ]
      },
      {
        "regex_negation_matches": [
          {"var": "text"},
          "\\bunwanted_term\\b"
        ]
      },
      {
        "social_graph": [
          "devingaffney.com",
          "is_in",
          "follows",
        ]
      },
      {
        "text_similarity": [
          {"var": "text"},
          {
            "model_name": "all-MiniLM-L6-v2",
            "anchor_text": "This is an important update"
          },
          ">=",
          0.3
        ]
      },
      {
        "model_probability": [
          {"model_name": "toxicity_model"},
          ">=",
          0.9
        ]
      }
    ]
  },
  "models": [
    {
      "model_name": "toxicity_model",
      "training_file": "prototype_labeled_dataset.json",
      "feature_modules": [
        {"type": "time_features"},
        {"type": "vectorizer", "model_name": "all-MiniLM-L6-v2"},
        {"type": "post_metadata"}
      ]
    }
  ],
  "author": {
    "username": "user",
    "password": "pass"
  }
}

Running the Server

Run the development FastAPI server:

uvicorn $APP_MODULE --host $HOST --port $PORT

Note Duplication of data stream instances in debug mode is fine. Read the warning below.

Warning In production, you should use a production WSGI server instead.

Warning If you want to run the server with multiple workers, you should run the Data Stream (Firehose) separately.

Bluesky-Facing Endpoints

/.well-known/did.json
/xrpc/app.bsky.feed.describeFeedGenerator
/xrpc/app.bsky.feed.getFeedSkeleton

Major Enhancements

Logic Evaluation: Integrated a Logic Evaluator in Python that applies JSON-like conditions using registered operations such as regex matching, text similarity, and model probability calculation.
Algorithmic Operator Implementations in Python: Implemented key classes like AttributeParser, ProbabilityParser, RegexParser, SocialParser, and TransformerParser which handle respective tasks and provide operations for evaluation.
Feature Generation: The FeatureGenerator class captures various features such as vectorized text using transformer models, metadata, and temporal features, which enrich the ML model inputs.
Compatibility with AT Protocol: The generator interfaces seamlessly with the AT Protocol for publishing and managing feed generators.
Enhanced Model Management: Support for XGBoost and transformer models, including features for model training, evaluation, and inference.

Name		Name	Last commit message	Last commit date
Latest commit History 108 Commits
.github		.github
alembic		alembic
probability_models		probability_models
server		server
templates		templates
.env.example		.env.example
.flaskenv		.flaskenv
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
MANIFEST_DOC.md		MANIFEST_DOC.md
README.md		README.md
alembic.ini.example		alembic.ini.example
algo_dependencies.json		algo_dependencies.json
algo_manifest.json		algo_manifest.json
algo_manifest.json.example		algo_manifest.json.example
algo_matcher.py		algo_matcher.py
blah.py		blah.py
bluesky_user_dids.json		bluesky_user_dids.json
dataset.json		dataset.json
docker-compose.yml		docker-compose.yml
entrypoint.sh		entrypoint.sh
prototype_labeled_dataset.json		prototype_labeled_dataset.json
requirements.txt		requirements.txt
streamer.py		streamer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sky-Feeder

Key Features

Getting Started

Prerequisites

Installation

Implementing Logic

Algorithm Manifest

`algorithm_manifest.json`

Structure

Example

Running the Server

Bluesky-Facing Endpoints

Major Enhancements

About

Releases

Sponsor this project

Packages

Languages

License

DGaffney/sky-feeder

Folders and files

Latest commit

History

Repository files navigation

Sky-Feeder

Key Features

Getting Started

Prerequisites

Installation

Implementing Logic

Algorithm Manifest

algorithm_manifest.json

Structure

Example

Running the Server

Bluesky-Facing Endpoints

Major Enhancements

About

Resources

License

Stars

Watchers

Forks

Releases

Sponsor this project

Packages 0

Languages

`algorithm_manifest.json`

Packages