Migration from analysis runner

Table of Contents

Status
Introduction
Decisions
Potential users
Analysis of current behavior
Musings on new behavior
- choice #1 - cljdoc scope
- choice #2 - codox scope
Explorations
- metagetta as a subproject
Migration plan
- Interpreted Test results
  - General differences
  - Specific tests run

Status

In production.

Introduction

This document records analysis and thoughts while migrating from cljdoc’s analysis-runner module which used a fork of codox to this dedicated cljdoc-analyzer project.

Decisions

We will:

focus on the cljdoc use case of analyzing from the jar as opposed to the codox use case of analyzing from the source repository root. This means we do not need both of codox`s :root-path and :source-paths options. We’ll use :root-path and drop :source-paths.
move to cljdoc-analyzer namespace. This will allow us to move ahead unfettered by the past. It will mean minor adjustments to cljdoc. We have confirmed that database BLOB metadata uses is unaffected by our changes.
allow ^:no-doc elements to be returned for ad hoc usage. At least initially, cljdoc will continue to receive the analysis filtered as it is currently.
change codox :language option to :languages and accept one of:
- vector of languages to analyze which can be one or both of: "clj" "cljs"
- :auto-detect
process languages in a single pass
in the spirit of dropping features that have no known current use, drop codox:
- :metadata option - we have no need to provide default metadata
- :writer option - folks can do what they want with the returned data
- :exclude-vars option - we’ll hardcode the codox default - we can bring this back if ever there proves to be a need
- :exception-handler option - we’ll keep the inner workings in metagetta and start with a fast failing error handler
we’ll leave in codox wildcard support for namespaces, code is simple and the feature seems useful.
project specific overrides for languages (aka platforms), namespaces and deps will be moved from cljdoc to a local cljdoc-analyzer config file.

Other minor breaking changes:

cljdoc will invoke cljdoc-analyzer.cljdoc-main
- :repos is now :extra-repos

We considered, but will not at this time:

support failing slow. Codox currently stops processing on the first problem encountered. We’ll look at trying to find multiple errors in a single run.

Potential users

In addition to cljdoc, who else might find value in cljdoc-analyzer?

clj-kondo is a static source analyzer. It has special coding to to understand the potemkin import-vars API, but it does not know about other load time metadata manipulations. The output of this tool might be useful for clj-kondo to fill in any gaps.
Codox could potentially make use of this library, but at this time, the original author does not see a benefit (which is totally fine). So we’ll not need to invest in maintaining codox feature compatibility.
ad hoc use. I (lread) am interested in using cljdoc-analyzer to compare API signatures between rewrite-clj, rewrite-cljs and rewrite-cljc to detect any unintended breakage and to document changes. (aside: API comparison is also of interest for a future feature of cljdoc).

Analysis of current behavior

Codox was designed to allow authors to generate documentation for their Clojure/ClojureScript lein and boot projects. It operates on the repository sources of a project and generates html.

Cljdoc does not follow the common codox use case. Cljdoc uses codox to retrieve API metadata only. It works on published artifacts (jars) instead of repository sources (note that cljdoc does make use of the source repository for documentation contained in articles and resolving API source files, but codox does not come into play for this work).

Working at the published jar level instead of repository sources level means cljdoc:

does not care whether a project uses leiningen, boot or deps tools, it simply refers to the source code contained in the jar, and the pom.xml.
takes on responsibility of resolving dependencies from pom.xml rather than relying on lein, boot or tools.deps.alpha.
can assume the classpath for the source code in the jar is always at the exploded jar root.

The fundamental inputs for retrieval of metadata for both worlds are the same:

classpath of sources and dependencies (although for normal codox use the dependencies are resolved by lein or boot)
codox options

Because cljdoc works on unknown projects, it goes through some special steps to avoid potential problems with analysis. And because code is evaluated while getting metadata, cljdoc takes care to isolate this work and minimize dependencies by launching a separate process.

external integration points

The current cljdoc analysis runner-ng.main is launched, as far a I can see, only by: . cljdoc/analysis/service.clj

tools support

Codox contains specific tool support for lein and boot.

Cljdoc does not make use of this support.

cljdoc jar processing steps

In a nutshell cljdoc analysis runner:

unzips the published jar to a work directory
removes problematic directories and files
copies over cljdoc wrapper source (which calls codox)
resolves classpath from pom (and includes extra deps as needed)
overrides languages and namespaces for problematic libraries
launches the cljdoc wrapper (which calls codox) for each found language with a resolved classpath
wraps codox language results into map for cljdoc consumption
saves results to an edn file to share back with cljdoc

A goal of these steps is to limit dependencies of the wrapper to the minimum required to fetch the actual metadata. The less dependencies our actual analysis phase has, the less chance we have for project library collisions and confusions.

current cljdoc codox inputs

cljdoc uses all options internally; none are exposed to project authors. The following table lists current option usages and muses about what we might minimally and potentially support moving forward. I’ve put a star beside the options we settled on for the initial release.

option key	codox usage	cljdoc usage	mimimally	potentially
`:language`	return metadata for `:clojure` or `:clojurescript`	intelligently determines languages from source and calls codox once for each, with custom overrides for problematic projects	continue to support, rename to `:clj` and `:cljs`	⭐ allow to request an array of languages to parse, or `:auto-detect`
`:root-path`	the github project root, used to calculate relative :source-paths	sets to current dir (ie. had no use for this)	⭐ if we are only supporting exploded jars, we could keep this and turf `:source-paths`	if we want to remain general purpose, this concept still has use
`:source-paths`	the list of paths to search for source. When working from source and not a jar, this makes sense	a single path, the root of exploded jar	⭐ if we are only supporting exploded jars, we could keep `:root-path` and turf this	continue to support
`:namespaces`	a list of namespaces to include, includes support for regex.	used by cljdoc to limit to specific namespaces for problematic projects, otherwise parse all. Does not use regex.	continue to support without regex	⭐ continue to support with regex
`:exception-handler`	behavior to execute on exception	ditto	⭐ turf eternal option, hardcode to fail fast	continue to support for general usage, perhaps extend to allow to fail slow (continue after failure in ns)
`:metadata`	a way to provide default metadata where it is missing	unused	⭐ turf it	continue to support for general usage
`:writer`	a clever way to support different outputs, codox defaults to writing out html	cljdoc uses 'clojure.core/identity to write out edn	⭐ turf it, and hard code to return map only	continue to support, but default to spitting out edn (and nothing included to spit out anything else)
`:exclude-vars`	clj and cljs sometimes return data we are not interested in and this offers a way to exclude it, by default excludes record constructor functions returned by clj	cljdoc hardcodes to default	⭐ turf it and hard code to current default	continue to support, I wonder if any codox uses this…

Turfing does not necessarily mean deleting all associated source, it can mean simply removing as an option, when that makes more sense.

current outputs

Codox currently treats clj and cljs as separate analysis passes. The returned analysis for a pass is a list of namespaces each with a list of public vars. Codox skips namespaces and public elements tagged with ^:no-doc metadata.

codox analysis for a language is a list of maps of:
- :name namespace name
- :doc namespace doc string
- :publics namespace publics which is a list of maps of:
  - :name public element name
  - :type one of: :macro :multimethod :protocol :var
  - :doc doc string
  - :file file relative to :source-paths
  - :path file relative to :root-path returned as File object. Ignored by cljdoc; theoretically effectively the same as :file for analysis of an exploded jar
  - :line line number
  - :arglists list of vectors of arglists, omitted for def record and protocol elements
  - :members only applicable when :type is :protocol, list of maps of:
    
    :arglists list of vectors of arglists
    
    :name name of protocol method
    
    :type can this be only :var?

special metadata tags when present are included in publics:

:added version an element was added
:deprecated version an element was deprecated
:dynamic for dynamic defs

cljdoc then takes this output and massages it to a map of:

:group-id project group-id
:artifact-id project artifact-id
:version project version
:codox codox analysis for languages which can consist of a map with none, one or both of:
- :clj the above codox analysis for clojure with :path removed
- :cljs the above codox analysis for for clojurescript with :path removed
:pom-str slurp of pom.xml

This is serialized for later ingestion to a sqlite database by cljdoc. I do see some small tweaks by cljdoc here. Before serialization, it makes regexes in argslists serializable. After deserialization it sanitizes macros (which does not really sanitize, it asserts no duplicate publics). An important observation is that while some map values get their own columns in the db, the map is saved as a nippy blob in the database, so preserving the map structure will be important at the individual var (aka public above) and namespace level.

I was curious how source links for api docs were resolved to correct scm urls. This happens at render time. The list of all scm files is also saved to the database as part of the separate git analysis. This list is compared against the :file above for a best match. This work is similar to what codox does when populating :path

cljdoc blobs

Neutral observation: although some fields are stored outside of blobs in their own columns, on retrieval database row, the data is taken primarily from the blob. This is not unusual for NoSQL type designs.

table	column	blob content	compatibility concern?
`versions`	`meta`	info on scm, files and docs keys from map: `:jar` `:scm` - version control info including list of all files `:doc` - cljdoc edn analysis result hydrated including file content `:config` - cljdoc edn analysis result in original format	nope we are good. no api information
`namespaces`	`meta`	info on namespace: `:doc` - doc string `:name` - namespace name `:platform` - `"clj"` or `"cljs"`	yes, this comes from codox analysis, at save time `:publics` are removed and `:platform` is added.
`vars`	`meta`	info on public var `:name` `:file` `:type` `:line` `:members` `:arglists` `:doc` `:namespace` `:platform`	yes, this comes from codox analysis, at save time `:namespace` and `:platform` are added.

curiosities

Questions we do not necessarily need to answer:

is protocol :members → :type always :var?

Musings on new behavior

In short, I think cljdoc-analyzer should steal responsibilities from the current cljdoc analysis runner and, at least initially, focus on the cljdoc use case of operating on jars (rather than source repos).

choice #1 - cljdoc scope

Do nothing. Abort. Keep using codox as is.
Streamline cljdoc-analyzer. Remove all unnecessary code form cljdoc-analyzer. Similar to 1 but with an easier to reason about and maintain cljdoc-analyzer (mostly already complete).
cljdoc-analyzer operates on jar. It takes on many of the responsibilities of current cljdoc analysis runner.
1. input is jar and options.
2. output is metadata.
3. handle all cljdoc allowances (extra deps, extra repos, etc) through config.

Chosen path: option #3. It makes cljdoc-analyzer potentially also interesting as an ad hoc tool.

choice #2 - codox scope

The next choice to make is whether or not cljdoc-analyzer should support source repo dirs and current codox options. This usage likely plays out by adding cljdoc-analyzer as a dev dependency to your project.

Chosen path: we chose not to entertain this at this time but may pursue at some later date if there is interest.

Explorations

metagetta as a subproject

How well is metagetta as a subproject supported by Clojure tooling?

Metagetta as a subproject works when referenced by cljdoc-analyzer via: * :local/root * :git/url (after moving metagetta under modules dir)

Not so lucky when cljdoc-analyzer is packaged in a jar as a source project:

It seems that tools.deps.alpha expect deps to resolve down to the :file protocol. A file in a jar does not use the :file protocol.
Ironicaly, cljdoc-analyzer cannot analyze itself as it tries to parse metagetta source.

I like having metagetta as an internal subproject within cljdoc-analyzer but if this won’t fly for technical reasons, I suppose it could be split out into its own project.

For now, we’ll solve issue above by jarring up metagetta and include it in cljdoc-analyzer.jar. When we detect we are running from a jar we’ll copy the jar out to our temp work dir and reference it via local:root.

Migration plan

Testing should include running a reasonable sample of projects through current cljdoc analysis runner and comparing results with the cljdoc-analyzer. I think this should give us the confidence we need.

Test scripts and raw results are available for review.

Interpreted Test results

General differences

Differences I automatically adjusted for during diff:

:codox is now :analysis
analysis now consistently sorted by :name
empty :members no longer included
empty :doc no longer included
:members now consistently and always omit :file and :line

Differences I compensated for via manual inspection:

defrecord vars are now included
when two files share the same namespace (for example, .clj and cljs) all publics from both namespaces are now included
dynamically imported (import-vars) cljs publics now show correctly
:file was sometimes fully qualified rather than relative to jar-root

Regressions found and fixed

internal project overrides now applied when project name is not fully qualified, ex manifold instead of manifold/manifold

Interesting observations

we have special support for serializing and deserializing regexes. Note though that regexes that look logically equal do not evaluate to logically equal.
```
user=> (= #"hello" #"hello")
false
```

Specific tests run

project	version	aspect of interest	test results
amazonica	0.3.146	dynamically created API clj	no differences after making compensations
bidi	2.1.3	part of current integration tests cljc	now includes defrecords
iced-nrepl	0.2.5	part of current integration tests shows as API import failure on current cljdoc prod	no differences after making compensations
io.aviso/pretty	0.1.29	part of current integration tests included in project-overrides for :deps clj	no differences after making compensations
licaltown/hx	0.5.2	part of current integration tests added by Martin with a fix for failing import cljs	oh right, old failed, nothing to compare against!
lread/rewrite-cljs-playground	1.0.0-alpha	uses internal import-vars on both clj and cljs	cljs publics dynamically added by import-vars now import correctly
manifold	0.1.8	part of current integration tests included in project-overides for :namespaces and :languages clj	no differences after making compensations
metosin/compojure-api	2.0.0-alpha27	part of current integration tests shows as API import failure on current cljdoc prod	now includes defrecords
metosin/muuntaja	0.6.3	part of current integration tests clj	now includes defrecords
metosin/reitit	0.3.9	uses `include-namespaces-from-dependencies` feature cljs clj cljc	no differences after making compensations
orchestra	2018.11.07-1	part of current integration tests clj cljc cljs	now includes all publics for `orchestra-cljs.spec.test` which spans cljs cand cljc files also noticed that we now properly relavitize `orchestra.core/defn-spec` `:file` (was formerly fully qualified)
semantic-csv	0.2.1-alpha1	has a regex in arglist (tests special serialization)	noticed some fully qualified `:file` elements that are now properly relativized interesting note: the regex is showing as different although they are logically equivalent

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

01-migration-from-analysis-runner.adoc

01-migration-from-analysis-runner.adoc

Migration from analysis runner

Status

Introduction

Decisions

Potential users

Analysis of current behavior

external integration points

tools support

cljdoc jar processing steps

current cljdoc codox inputs

current outputs

cljdoc blobs

curiosities

Musings on new behavior

choice #1 - cljdoc scope

choice #2 - codox scope

Explorations

metagetta as a subproject

Migration plan

Interpreted Test results

General differences

Specific tests run

Files

01-migration-from-analysis-runner.adoc

Latest commit

History

01-migration-from-analysis-runner.adoc

File metadata and controls

Migration from analysis runner

Status

Introduction

Decisions

Potential users

Analysis of current behavior

external integration points

tools support

cljdoc jar processing steps

current cljdoc codox inputs

current outputs

cljdoc blobs

curiosities

Musings on new behavior

choice #1 - cljdoc scope

choice #2 - codox scope

Explorations

metagetta as a subproject

Migration plan

Interpreted Test results

General differences

Specific tests run