Skip to content

Latest commit

 

History

History
446 lines (337 loc) · 17.4 KB

01-migration-from-analysis-runner.adoc

File metadata and controls

446 lines (337 loc) · 17.4 KB

Migration from analysis runner

Status

In production.

Introduction

This document records analysis and thoughts while migrating from cljdoc’s analysis-runner module which used a fork of codox to this dedicated cljdoc-analyzer project.

Decisions

We will:

  • focus on the cljdoc use case of analyzing from the jar as opposed to the codox use case of analyzing from the source repository root. This means we do not need both of codox`s :root-path and :source-paths options. We’ll use :root-path and drop :source-paths.

  • move to cljdoc-analyzer namespace. This will allow us to move ahead unfettered by the past. It will mean minor adjustments to cljdoc. We have confirmed that database BLOB metadata uses is unaffected by our changes.

  • allow ^:no-doc elements to be returned for ad hoc usage. At least initially, cljdoc will continue to receive the analysis filtered as it is currently.

  • change codox :language option to :languages and accept one of:

    • vector of languages to analyze which can be one or both of: "clj" "cljs"

    • :auto-detect

  • process languages in a single pass

  • in the spirit of dropping features that have no known current use, drop codox:

    • :metadata option - we have no need to provide default metadata

    • :writer option - folks can do what they want with the returned data

    • :exclude-vars option - we’ll hardcode the codox default - we can bring this back if ever there proves to be a need

    • :exception-handler option - we’ll keep the inner workings in metagetta and start with a fast failing error handler

  • we’ll leave in codox wildcard support for namespaces, code is simple and the feature seems useful.

  • project specific overrides for languages (aka platforms), namespaces and deps will be moved from cljdoc to a local cljdoc-analyzer config file.

Other minor breaking changes:

  • cljdoc will invoke cljdoc-analyzer.cljdoc-main

    • :repos is now :extra-repos

We considered, but will not at this time:

  • support failing slow. Codox currently stops processing on the first problem encountered. We’ll look at trying to find multiple errors in a single run.

Potential users

In addition to cljdoc, who else might find value in cljdoc-analyzer?

  • clj-kondo is a static source analyzer. It has special coding to to understand the potemkin import-vars API, but it does not know about other load time metadata manipulations. The output of this tool might be useful for clj-kondo to fill in any gaps.

  • Codox could potentially make use of this library, but at this time, the original author does not see a benefit (which is totally fine). So we’ll not need to invest in maintaining codox feature compatibility.

  • ad hoc use. I (lread) am interested in using cljdoc-analyzer to compare API signatures between rewrite-clj, rewrite-cljs and rewrite-cljc to detect any unintended breakage and to document changes. (aside: API comparison is also of interest for a future feature of cljdoc).

Analysis of current behavior

Codox was designed to allow authors to generate documentation for their Clojure/ClojureScript lein and boot projects. It operates on the repository sources of a project and generates html.

Cljdoc does not follow the common codox use case. Cljdoc uses codox to retrieve API metadata only. It works on published artifacts (jars) instead of repository sources (note that cljdoc does make use of the source repository for documentation contained in articles and resolving API source files, but codox does not come into play for this work).

Working at the published jar level instead of repository sources level means cljdoc:

  • does not care whether a project uses leiningen, boot or deps tools, it simply refers to the source code contained in the jar, and the pom.xml.

  • takes on responsibility of resolving dependencies from pom.xml rather than relying on lein, boot or tools.deps.alpha.

  • can assume the classpath for the source code in the jar is always at the exploded jar root.

The fundamental inputs for retrieval of metadata for both worlds are the same:

  • classpath of sources and dependencies (although for normal codox use the dependencies are resolved by lein or boot)

  • codox options

Because cljdoc works on unknown projects, it goes through some special steps to avoid potential problems with analysis. And because code is evaluated while getting metadata, cljdoc takes care to isolate this work and minimize dependencies by launching a separate process.

external integration points

The current cljdoc analysis runner-ng.main is launched, as far a I can see, only by: . cljdoc/analysis/service.clj

tools support

Codox contains specific tool support for lein and boot.

Cljdoc does not make use of this support.

cljdoc jar processing steps

In a nutshell cljdoc analysis runner:

  1. unzips the published jar to a work directory

  2. removes problematic directories and files

  3. copies over cljdoc wrapper source (which calls codox)

  4. resolves classpath from pom (and includes extra deps as needed)

  5. overrides languages and namespaces for problematic libraries

  6. launches the cljdoc wrapper (which calls codox) for each found language with a resolved classpath

  7. wraps codox language results into map for cljdoc consumption

  8. saves results to an edn file to share back with cljdoc

A goal of these steps is to limit dependencies of the wrapper to the minimum required to fetch the actual metadata. The less dependencies our actual analysis phase has, the less chance we have for project library collisions and confusions.

current cljdoc codox inputs

cljdoc uses all options internally; none are exposed to project authors. The following table lists current option usages and muses about what we might minimally and potentially support moving forward. I’ve put a star beside the options we settled on for the initial release.

option key codox usage cljdoc usage mimimally potentially

:language

return metadata for :clojure or :clojurescript

intelligently determines languages from source and calls codox once for each, with custom overrides for problematic projects

continue to support, rename to :clj and :cljs

⭐ allow to request an array of languages to parse, or :auto-detect

:root-path

the github project root, used to calculate relative :source-paths

sets to current dir (ie. had no use for this)

⭐ if we are only supporting exploded jars, we could keep this and turf :source-paths

if we want to remain general purpose, this concept still has use

:source-paths

the list of paths to search for source. When working from source and not a jar, this makes sense

a single path, the root of exploded jar

⭐ if we are only supporting exploded jars, we could keep :root-path and turf this

continue to support

:namespaces

a list of namespaces to include, includes support for regex.

used by cljdoc to limit to specific namespaces for problematic projects, otherwise parse all. Does not use regex.

continue to support without regex

⭐ continue to support with regex

:exception-handler

behavior to execute on exception

ditto

⭐ turf eternal option, hardcode to fail fast

continue to support for general usage, perhaps extend to allow to fail slow (continue after failure in ns)

:metadata

a way to provide default metadata where it is missing

unused

⭐ turf it

continue to support for general usage

:writer

a clever way to support different outputs, codox defaults to writing out html

cljdoc uses 'clojure.core/identity to write out edn

⭐ turf it, and hard code to return map only

continue to support, but default to spitting out edn (and nothing included to spit out anything else)

:exclude-vars

clj and cljs sometimes return data we are not interested in and this offers a way to exclude it, by default excludes record constructor functions returned by clj

cljdoc hardcodes to default

⭐ turf it and hard code to current default

continue to support, I wonder if any codox uses this…​

Turfing does not necessarily mean deleting all associated source, it can mean simply removing as an option, when that makes more sense.

current outputs

Codox currently treats clj and cljs as separate analysis passes. The returned analysis for a pass is a list of namespaces each with a list of public vars. Codox skips namespaces and public elements tagged with ^:no-doc metadata.

  • codox analysis for a language is a list of maps of:

    • :name namespace name

    • :doc namespace doc string

    • :publics namespace publics which is a list of maps of:

      • :name public element name

      • :type one of: :macro :multimethod :protocol :var

      • :doc doc string

      • :file file relative to :source-paths

      • :path file relative to :root-path returned as File object. Ignored by cljdoc; theoretically effectively the same as :file for analysis of an exploded jar

      • :line line number

      • :arglists list of vectors of arglists, omitted for def record and protocol elements

      • :members only applicable when :type is :protocol, list of maps of:

        • :arglists list of vectors of arglists

        • :name name of protocol method

        • :type can this be only :var?

special metadata tags when present are included in publics:

  • :added version an element was added

  • :deprecated version an element was deprecated

  • :dynamic for dynamic defs

cljdoc then takes this output and massages it to a map of:

  • :group-id project group-id

  • :artifact-id project artifact-id

  • :version project version

  • :codox codox analysis for languages which can consist of a map with none, one or both of:

    • :clj the above codox analysis for clojure with :path removed

    • :cljs the above codox analysis for for clojurescript with :path removed

  • :pom-str slurp of pom.xml

This is serialized for later ingestion to a sqlite database by cljdoc. I do see some small tweaks by cljdoc here. Before serialization, it makes regexes in argslists serializable. After deserialization it sanitizes macros (which does not really sanitize, it asserts no duplicate publics). An important observation is that while some map values get their own columns in the db, the map is saved as a nippy blob in the database, so preserving the map structure will be important at the individual var (aka public above) and namespace level.

I was curious how source links for api docs were resolved to correct scm urls. This happens at render time. The list of all scm files is also saved to the database as part of the separate git analysis. This list is compared against the :file above for a best match. This work is similar to what codox does when populating :path

cljdoc blobs

Neutral observation: although some fields are stored outside of blobs in their own columns, on retrieval database row, the data is taken primarily from the blob. This is not unusual for NoSQL type designs.

table column blob content compatibility concern?

versions

meta

info on scm, files and docs keys from map:

  • :jar

  • :scm - version control info including list of all files

  • :doc - cljdoc edn analysis result hydrated including file content

  • :config - cljdoc edn analysis result in original format

nope we are good. no api information

namespaces

meta

info on namespace:

  • :doc - doc string

  • :name - namespace name

  • :platform - "clj" or "cljs"

yes, this comes from codox analysis, at save time :publics are removed and :platform is added.

vars

meta

info on public var

  • :name

  • :file

  • :type

  • :line

  • :members

  • :arglists

  • :doc

  • :namespace

  • :platform

yes, this comes from codox analysis, at save time :namespace and :platform are added.

curiosities

Questions we do not necessarily need to answer:

  • is protocol :members → :type always :var?

Musings on new behavior

In short, I think cljdoc-analyzer should steal responsibilities from the current cljdoc analysis runner and, at least initially, focus on the cljdoc use case of operating on jars (rather than source repos).

choice #1 - cljdoc scope

  1. Do nothing. Abort. Keep using codox as is.

  2. Streamline cljdoc-analyzer. Remove all unnecessary code form cljdoc-analyzer. Similar to 1 but with an easier to reason about and maintain cljdoc-analyzer (mostly already complete).

  3. cljdoc-analyzer operates on jar. It takes on many of the responsibilities of current cljdoc analysis runner.

    1. input is jar and options.

    2. output is metadata.

    3. handle all cljdoc allowances (extra deps, extra repos, etc) through config.

Chosen path: option #3. It makes cljdoc-analyzer potentially also interesting as an ad hoc tool.

choice #2 - codox scope

The next choice to make is whether or not cljdoc-analyzer should support source repo dirs and current codox options. This usage likely plays out by adding cljdoc-analyzer as a dev dependency to your project.

Chosen path: we chose not to entertain this at this time but may pursue at some later date if there is interest.

Explorations

metagetta as a subproject

How well is metagetta as a subproject supported by Clojure tooling?

Metagetta as a subproject works when referenced by cljdoc-analyzer via: * :local/root * :git/url (after moving metagetta under modules dir)

Not so lucky when cljdoc-analyzer is packaged in a jar as a source project:

  1. It seems that tools.deps.alpha expect deps to resolve down to the :file protocol. A file in a jar does not use the :file protocol.

  2. Ironicaly, cljdoc-analyzer cannot analyze itself as it tries to parse metagetta source.

I like having metagetta as an internal subproject within cljdoc-analyzer but if this won’t fly for technical reasons, I suppose it could be split out into its own project.

For now, we’ll solve issue above by jarring up metagetta and include it in cljdoc-analyzer.jar. When we detect we are running from a jar we’ll copy the jar out to our temp work dir and reference it via local:root.

Migration plan

Testing should include running a reasonable sample of projects through current cljdoc analysis runner and comparing results with the cljdoc-analyzer. I think this should give us the confidence we need.

Test scripts and raw results are available for review.

Interpreted Test results

General differences

Differences I automatically adjusted for during diff:

  • :codox is now :analysis

  • analysis now consistently sorted by :name

  • empty :members no longer included

  • empty :doc no longer included

  • :members now consistently and always omit :file and :line

Differences I compensated for via manual inspection:

  • defrecord vars are now included

  • when two files share the same namespace (for example, .clj and cljs) all publics from both namespaces are now included

  • dynamically imported (import-vars) cljs publics now show correctly

  • :file was sometimes fully qualified rather than relative to jar-root

Regressions found and fixed

  • internal project overrides now applied when project name is not fully qualified, ex manifold instead of manifold/manifold

Interesting observations

  • we have special support for serializing and deserializing regexes. Note though that regexes that look logically equal do not evaluate to logically equal.

    user=> (= #"hello" #"hello")
    false

Specific tests run

project version aspect of interest test results

amazonica

0.3.146

  • dynamically created API

  • clj

  • no differences after making compensations

bidi

2.1.3

  • part of current integration tests

  • cljc

  • now includes defrecords

iced-nrepl

0.2.5

  • part of current integration tests

  • shows as API import failure on current cljdoc prod

  • no differences after making compensations

io.aviso/pretty

0.1.29

  • part of current integration tests

  • included in project-overrides for :deps

  • clj

  • no differences after making compensations

licaltown/hx

0.5.2

  • part of current integration tests

  • added by Martin with a fix for failing import

  • cljs

  • oh right, old failed, nothing to compare against!

lread/rewrite-cljs-playground

1.0.0-alpha

  • uses internal import-vars on both clj and cljs

  • cljs publics dynamically added by import-vars now import correctly

manifold

0.1.8

  • part of current integration tests

  • included in project-overides for :namespaces and :languages

  • clj

  • no differences after making compensations

metosin/compojure-api

2.0.0-alpha27

  • part of current integration tests

  • shows as API import failure on current cljdoc prod

  • now includes defrecords

metosin/muuntaja

0.6.3

  • part of current integration tests

  • clj

  • now includes defrecords

metosin/reitit

0.3.9

  • uses include-namespaces-from-dependencies feature

  • cljs clj cljc

  • no differences after making compensations

orchestra

2018.11.07-1

  • part of current integration tests

  • clj cljc cljs

  • now includes all publics for orchestra-cljs.spec.test which spans cljs cand cljc files

  • also noticed that we now properly relavitize orchestra.core/defn-spec :file (was formerly fully qualified)

semantic-csv

0.2.1-alpha1

  • has a regex in arglist (tests special serialization)

  • noticed some fully qualified :file elements that are now properly relativized

  • interesting note: the regex is showing as different although they are logically equivalent