This document records analysis and thoughts while migrating from cljdoc’s analysis-runner module which used a fork of codox to this dedicated cljdoc-analyzer project.
We will:
- focus on the cljdoc use case of analyzing from the jar, as opposed to the codox use case of analyzing from the source repository root. This means we do not need both of codox’s `:root-path` and `:source-paths` options. We’ll use `:root-path` and drop `:source-paths`.
- move to the `cljdoc-analyzer` namespace. This will allow us to move ahead unfettered by the past. It will mean minor adjustments to cljdoc. We have confirmed that database BLOB metadata usage is unaffected by our changes.
- allow `^:no-doc` elements to be returned for ad hoc usage. At least initially, cljdoc will continue to receive the analysis filtered as it is currently.
- change the codox `:language` option to `:languages` and accept one of:
  - a vector of languages to analyze, which can be one or both of `"clj"` and `"cljs"`
  - `:auto-detect`
- process languages in a single pass
- in the spirit of dropping features that have no known current use, drop codox:
  - `:metadata` option - we have no need to provide default metadata
  - `:writer` option - folks can do what they want with the returned data
  - `:exclude-vars` option - we’ll hardcode the codox default - we can bring this back if ever there proves to be a need
  - `:exception-handler` option - we’ll keep the inner workings in metagetta and start with a fast failing error handler
- we’ll leave in codox wildcard support for namespaces; the code is simple and the feature seems useful.
- project specific overrides for languages (aka platforms), namespaces and deps will be moved from cljdoc to a local cljdoc-analyzer config file.
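To make the accepted `:languages` values concrete, here is a minimal sketch of a validity check. The function name `valid-languages?` and the var `supported-languages` are hypothetical and are not taken from the cljdoc-analyzer source; rejecting an empty vector is also an assumption.

```clojure
;; Hypothetical helper, not part of cljdoc-analyzer:
;; checks the two shapes the :languages option is described as accepting.
(def supported-languages #{"clj" "cljs"})

(defn valid-languages?
  "Returns true when `languages` is :auto-detect or a non-empty vector
  containing only \"clj\" and/or \"cljs\"."
  [languages]
  (or (= :auto-detect languages)
      (boolean (and (vector? languages)
                    (seq languages) ; assumption: an empty vector is rejected
                    (every? supported-languages languages)))))
```

For example, `(valid-languages? ["clj" "cljs"])` and `(valid-languages? :auto-detect)` are both accepted, while `(valid-languages? ["java"])` is not.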
Other minor breaking changes:
- cljdoc will invoke `cljdoc-analyzer.cljdoc-main`
- `:repos` is now `:extra-repos`
We considered, but will not at this time:
- support failing slow. Codox currently stops processing on the first problem encountered. We’ll look at trying to find multiple errors in a single run.
In addition to cljdoc, who else might find value in cljdoc-analyzer?
- clj-kondo is a static source analyzer. It has special coding to understand the potemkin import-vars API, but it does not know about other load time metadata manipulations. The output of this tool might be useful for clj-kondo to fill in any gaps.
- Codox could potentially make use of this library, but at this time, the original author does not see a benefit (which is totally fine). So we’ll not need to invest in maintaining codox feature compatibility.
- ad hoc use. I (lread) am interested in using cljdoc-analyzer to compare API signatures between rewrite-clj, rewrite-cljs and rewrite-cljc to detect any unintended breakage and to document changes. (aside: API comparison is also of interest for a future feature of cljdoc).
Codox was designed to allow authors to generate documentation for their Clojure/ClojureScript lein and boot projects. It operates on the repository sources of a project and generates html.
Cljdoc does not follow the common codox use case. Cljdoc uses codox to retrieve API metadata only. It works on published artifacts (jars) instead of repository sources (note that cljdoc does make use of the source repository for documentation contained in articles and resolving API source files, but codox does not come into play for this work).
Working at the published jar level instead of repository sources level means cljdoc:
- does not care whether a project uses leiningen, boot or deps tools, it simply refers to the source code contained in the jar, and the pom.xml.
- takes on responsibility of resolving dependencies from pom.xml rather than relying on lein, boot or tools.deps.alpha.
- can assume the classpath for the source code in the jar is always at the exploded jar root.
The fundamental inputs for retrieval of metadata for both worlds are the same:
- classpath of sources and dependencies (although for normal codox use the dependencies are resolved by lein or boot)
- codox options
Because cljdoc works on unknown projects, it goes through some special steps to avoid potential problems with analysis. And because code is evaluated while getting metadata, cljdoc takes care to isolate this work and minimize dependencies by launching a separate process.
The current cljdoc analysis runner-ng.main is launched, as far as I can see, only by `cljdoc/analysis/service.clj`.
Codox contains specific tool support for lein and boot.
Cljdoc does not make use of this support.
In a nutshell cljdoc analysis runner:
- unzips the published jar to a work directory
- removes problematic directories and files
- copies over cljdoc wrapper source (which calls codox)
- resolves classpath from pom (and includes extra deps as needed)
- overrides languages and namespaces for problematic libraries
- launches the cljdoc wrapper (which calls codox) for each found language with a resolved classpath
- wraps codox language results into map for cljdoc consumption
- saves results to an edn file to share back with cljdoc
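The "for each found language" step implies some form of language detection over the exploded jar. A rough sketch of how that could look is below. The function name `detect-languages` is hypothetical, and treating `.cljc` files as both Clojure and ClojureScript is an assumption, not a description of what the cljdoc runner actually does.

```clojure
(require '[clojure.string :as string])

;; Hypothetical sketch: derive the languages to analyze from the relative
;; file names found in an exploded jar.
(defn detect-languages
  [file-names]
  (let [has-ext? (fn [ext] (some #(string/ends-with? % ext) file-names))]
    (cond-> #{}
      ;; assumption: .cljc sources count for both languages
      (or (has-ext? ".clj") (has-ext? ".cljc"))  (conj "clj")
      (or (has-ext? ".cljs") (has-ext? ".cljc")) (conj "cljs"))))
```

A jar containing only `.cljs` files would yield `#{"cljs"}`, which the runner could then still override for problematic libraries.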
A goal of these steps is to limit dependencies of the wrapper to the minimum required to fetch the actual metadata. The fewer dependencies our actual analysis phase has, the less chance we have for project library collisions and confusions.
cljdoc uses all options internally; none are exposed to project authors. The following table lists current option usages and muses about what we might minimally and potentially support moving forward. I’ve put a star beside the options we settled on for the initial release.
| option key | codox usage | cljdoc usage | minimally | potentially |
|---|---|---|---|---|
| | return metadata for | intelligently determines languages from source and calls codox once for each, with custom overrides for problematic projects | continue to support, rename to | ⭐ allow to request an array of languages to parse, or |
| | the github project root, used to calculate relative :source-paths | sets to current dir (ie. had no use for this) | ⭐ if we are only supporting exploded jars, we could keep this and turf | if we want to remain general purpose, this concept still has use |
| | the list of paths to search for source. When working from source and not a jar, this makes sense | a single path, the root of exploded jar | ⭐ if we are only supporting exploded jars, we could keep | continue to support |
| | a list of namespaces to include, includes support for regex. | used by cljdoc to limit to specific namespaces for problematic projects, otherwise parse all. Does not use regex. | continue to support without regex | ⭐ continue to support with regex |
| | behavior to execute on exception | ditto | ⭐ turf external option, hardcode to fail fast | continue to support for general usage, perhaps extend to allow to fail slow (continue after failure in ns) |
| | a way to provide default metadata where it is missing | unused | ⭐ turf it | continue to support for general usage |
| | a clever way to support different outputs, codox defaults to writing out html | cljdoc uses 'clojure.core/identity to write out edn | ⭐ turf it, and hard code to return map only | continue to support, but default to spitting out edn (and nothing included to spit out anything else) |
| | clj and cljs sometimes return data we are not interested in and this offers a way to exclude it, by default excludes record constructor functions returned by clj | cljdoc hardcodes to default | ⭐ turf it and hard code to current default | continue to support, I wonder if any codox uses this… |
Turfing does not necessarily mean deleting all associated source, it can mean simply removing as an option, when that makes more sense.
Codox currently treats clj and cljs as separate analysis passes. The returned analysis for a pass is a list of
namespaces each with a list of public vars. Codox skips namespaces and public elements tagged with ^:no-doc
metadata.
- codox analysis for a language is a list of maps of:
  - `:name` namespace name
  - `:doc` namespace doc string
  - `:publics` namespace publics which is a list of maps of:
    - `:name` public element name
    - `:type` one of: `:macro`, `:multimethod`, `:protocol`, `:var`
    - `:doc` doc string
    - `:file` file relative to `:source-paths`
    - `:path` file relative to `:root-path` returned as File object. Ignored by cljdoc; theoretically effectively the same as `:file` for analysis of an exploded jar
    - `:line` line number
    - `:arglists` list of vectors of arglists, omitted for `def`, `record` and `protocol` elements
    - `:members` only applicable when `:type` is `:protocol`, list of maps of:
      - `:arglists` list of vectors of arglists
      - `:name` name of protocol method
      - `:type` can this be only `:var`?
  - special metadata tags when present are included in publics:
    - `:added` version an element was added
    - `:deprecated` version an element was deprecated
    - `:dynamic` for dynamic defs
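As an illustration of the shape described above, here is a made-up analysis value for a single namespace. All names, docstrings, files and line numbers are invented for the example, and the use of symbols for `:name` values is an assumption; only the keys come from the description.

```clojure
;; Invented sample data illustrating the described structure.
(def example-analysis
  [{:name 'my.lib.core
    :doc "Core API."
    :publics [{:name 'greet
               :type :var
               :doc "Returns a greeting."
               :file "my/lib/core.clj"
               :line 10
               :arglists '([name] [greeting name])
               :added "0.1.0"}          ; special metadata tag, when present
              {:name 'Greeter
               :type :protocol
               :doc "Things that can greet."
               :file "my/lib/core.clj"
               :line 20
               ;; :arglists omitted for protocols; methods are in :members
               :members [{:name 'greet*
                          :type :var
                          :arglists '([this name])}]}]}])
```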
cljdoc then takes this output and massages it to a map of:
- `:group-id` project group-id
- `:artifact-id` project artifact-id
- `:version` project version
- `:codox` codox analysis for languages which can consist of a map with none, one or both of:
  - `:clj` the above codox analysis for clojure with `:path` removed
  - `:cljs` the above codox analysis for clojurescript with `:path` removed
- `:pom-str` slurp of pom.xml
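The massaging step can be sketched as below. The function names `strip-paths` and `cljdoc-result` are hypothetical and the code is not copied from cljdoc; it only shows one way to remove `:path` from every public and assemble the described map.

```clojure
;; Hypothetical sketch of wrapping per-language codox output for cljdoc.
(defn strip-paths
  "Removes :path from every public in a list of namespace maps."
  [namespaces]
  (mapv (fn [ns-map]
          (update ns-map :publics
                  (fn [publics] (mapv #(dissoc % :path) publics))))
        namespaces))

(defn cljdoc-result
  "`project` holds :group-id, :artifact-id, :version and :pom-str;
  `analysis-by-language` is a map with none, one or both of :clj and :cljs."
  [{:keys [group-id artifact-id version pom-str]} analysis-by-language]
  {:group-id    group-id
   :artifact-id artifact-id
   :version     version
   :codox       (reduce-kv (fn [m lang namespaces]
                             (assoc m lang (strip-paths namespaces)))
                           {}
                           analysis-by-language)
   :pom-str     pom-str})
```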
This is serialized for later ingestion to a sqlite database by cljdoc. I do see some small tweaks by cljdoc here. Before serialization, it makes regexes in arglists serializable. After deserialization it sanitizes macros (which does not really sanitize, it asserts no duplicate publics). An important observation is that while some map values get their own columns in the db, the map is saved as a nippy blob in the database, so preserving the map structure will be important at the individual var (aka public above) and namespace level.
I was curious how source links for api docs were resolved to correct scm urls. This happens at render time. The list of all scm files is also saved to the database as part of the separate git analysis. This list is compared against the `:file` above for a best match. This work is similar to what codox does when populating `:path`.
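One plausible reading of "best match" is a path-suffix comparison. The sketch below assumes that the best candidate is the shortest scm path ending with the jar-relative `:file`; both that rule and the name `best-scm-match` are assumptions for illustration, not cljdoc’s actual logic.

```clojure
(require '[clojure.string :as string])

;; Hypothetical sketch: pick the scm file that best matches a jar-relative :file.
(defn best-scm-match
  [file scm-files]
  (->> scm-files
       ;; match whole path segments only, so "core.clj" does not match "my_core.clj"
       (filter #(or (= % file) (string/ends-with? % (str "/" file))))
       ;; assumption: prefer the shortest matching path
       (sort-by count)
       first))
```

With `"my/lib/core.clj"` as the `:file`, a repository listing containing `"src/my/lib/core.clj"` would resolve to that path, and a file with no match resolves to `nil`.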
Neutral observation: although some fields are stored outside of blobs in their own columns, on retrieval of a database row, the data is taken primarily from the blob. This is not unusual for NoSQL type designs.
| table | column | blob content | compatibility concern? |
|---|---|---|---|
| | | info on scm, files and docs keys from map: | nope we are good. no api information |
| | | info on namespace: | yes, this comes from codox analysis, at save time |
| | | info on public var | yes, this comes from codox analysis, at save time |
In short, I think cljdoc-analyzer should steal responsibilities from the current cljdoc analysis runner and, at least initially, focus on the cljdoc use case of operating on jars (rather than source repos).
1. Do nothing. Abort. Keep using codox as is.
2. Streamline cljdoc-analyzer. Remove all unnecessary code from cljdoc-analyzer. Similar to 1 but with an easier to reason about and maintain cljdoc-analyzer (mostly already complete).
3. cljdoc-analyzer operates on jar. It takes on many of the responsibilities of current cljdoc analysis runner.
   - input is jar and options.
   - output is metadata.
   - handle all cljdoc allowances (extra deps, extra repos, etc) through config.
Chosen path: option #3. It makes cljdoc-analyzer potentially also interesting as an ad hoc tool.
The next choice to make is whether or not cljdoc-analyzer should support source repo dirs and current codox options. This usage likely plays out by adding cljdoc-analyzer as a dev dependency to your project.
Chosen path: we chose not to entertain this at this time but may pursue at some later date if there is interest.
How well is metagetta as a subproject supported by Clojure tooling?
Metagetta as a subproject works when referenced by cljdoc-analyzer via:
* `:local/root`
* `:git/url` (after moving metagetta under modules dir)
Not so lucky when cljdoc-analyzer is packaged in a jar as a source project:
- It seems that tools.deps.alpha expects deps to resolve down to the :file protocol. A file in a jar does not use the :file protocol.
- Ironically, cljdoc-analyzer cannot analyze itself as it tries to parse metagetta source.
I like having metagetta as an internal subproject within cljdoc-analyzer but if this won’t fly for technical reasons, I suppose it could be split out into its own project.
For now, we’ll solve the issue above by jarring up metagetta and including it in cljdoc-analyzer.jar. When we detect we are running from a jar we’ll copy the jar out to our temp work dir and reference it via `:local/root`.
Testing should include running a reasonable sample of projects through current cljdoc analysis runner and comparing results with the cljdoc-analyzer. I think this should give us the confidence we need.
Test scripts and raw results are available for review.
Differences I automatically adjusted for during diff:
- `:codox` is now `:analysis`
- analysis now consistently sorted by `:name`
- empty `:members` no longer included
- empty `:doc` no longer included
- `:members` now consistently and always omit `:file` and `:line`
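The automatic adjustments listed above could be applied with a normalization pass like the following before diffing old and new results. This is a sketch only: the function names are hypothetical and the actual test scripts may do this differently.

```clojure
;; Hypothetical normalization applied to both old and new results before diffing.
(defn- drop-empty
  "Removes key `k` from `m` when its value is nil or empty."
  [m k]
  (if (empty? (get m k)) (dissoc m k) m))

(defn- normalize-public [public]
  (-> public
      ;; members never carry :file or :line, and are sorted by :name
      (update :members #(sort-by :name (map (fn [m] (dissoc m :file :line)) %)))
      (drop-empty :members)
      (drop-empty :doc)))

(defn- normalize-namespace [ns-map]
  (-> ns-map
      (update :publics #(sort-by :name (map normalize-public %)))
      (drop-empty :doc)))

(defn normalize-for-diff
  "Renames :codox to :analysis and normalizes every language's namespaces."
  [result]
  (let [analysis (or (:analysis result) (:codox result))]
    (-> result
        (dissoc :codox)
        (assoc :analysis
               (reduce-kv (fn [m lang namespaces]
                            (assoc m lang (sort-by :name (map normalize-namespace namespaces))))
                          {}
                          analysis)))))
```

Running both the old and the new output through `normalize-for-diff` leaves only the differences that need manual inspection.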
Differences I compensated for via manual inspection:
- defrecord vars are now included
- when two files share the same namespace (for example, .clj and .cljs) all publics from both namespaces are now included
- dynamically imported (import-vars) cljs publics now show correctly
- `:file` was sometimes fully qualified rather than relative to jar-root
Regressions found and fixed
- internal project overrides now applied when project name is not fully qualified, ex `manifold` instead of `manifold/manifold`
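A small sketch of the kind of normalization that avoids this regression: expand an unqualified project name to its `group/artifact` form before looking up overrides. The name `qualified-project` is hypothetical; the expansion follows the common convention that an unqualified name has an identical group-id and artifact-id.

```clojure
(require '[clojure.string :as string])

;; Hypothetical helper: "manifold" and "manifold/manifold" should hit the same override.
(defn qualified-project
  [project]
  (if (string/includes? project "/")
    project
    (str project "/" project)))
```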
Interesting observations
- we have special support for serializing and deserializing regexes. Note though that regexes that look logically equal do not evaluate to logically equal.

      user=> (= #"hello" #"hello")
      false
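When comparing analysis results that contain regexes, one workaround is to compare their source strings and flags instead of the `java.util.regex.Pattern` objects themselves. The helper name `regex=` is made up for this sketch.

```clojure
;; Hypothetical helper: Pattern does not override equals, so compare
;; the pattern source and flags instead.
(defn regex=
  [a b]
  (and (instance? java.util.regex.Pattern a)
       (instance? java.util.regex.Pattern b)
       (= (str a) (str b))
       (= (.flags ^java.util.regex.Pattern a)
          (.flags ^java.util.regex.Pattern b))))
```

With this, `(regex= #"hello" #"hello")` is true even though `(= #"hello" #"hello")` is false.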
| project | version | aspect of interest | test results |
|---|---|---|---|
| amazonica | 0.3.146 | | |
| bidi | 2.1.3 | | |
| iced-nrepl | 0.2.5 | | |
| io.aviso/pretty | 0.1.29 | | |
| lilactown/hx | 0.5.2 | | |
| lread/rewrite-cljs-playground | 1.0.0-alpha | | |
| manifold | 0.1.8 | | |
| metosin/compojure-api | 2.0.0-alpha27 | | |
| metosin/muuntaja | 0.6.3 | | |
| metosin/reitit | 0.3.9 | | |
| orchestra | 2018.11.07-1 | | |
| semantic-csv | 0.2.1-alpha1 | | |