merge: Documentation updates for Extensions

This merge brings in a set of changes to the documentation after the introduction of Extensions. Introducing external tool extensions changes the architecture such that it is now extensible with respect to filesystem analysis. The changes include:

- A document explaining how to create a new tool extension
- Changes to the README listing supported tool extensions
- Some links to CONTRIBUTING reflecting this area of opportunity
- Some changes to the glossary, architecture and code navigation documentation reflecting this addition

Signed-off-by: Nisha K <nishak@vmware.com>
tern-tools · Oct 31, 2019 · c9a0c83
2 parents a041a8d + 151222e
commit c9a0c83
Showing 15 changed files with 137 additions and 20 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -15,10 +15,11 @@ You can contribute in the following ways:
 - Contribute changes to documentation by [submitting pull requests](#submit-pr) to it.
 
 **Contribute Code**
-- [Resolve Issues](https://github.com/vmware/tern/issues).
-- Improve the robustness of the project by
-  - [Adding to the Command Library](docs/adding-to-command-library.md).
-  - [Adding a Custom Template](docs/creating-custom-templates.md).
+- [Resolve Issues](https://github.com/vmware/tern/issues)
+- Improve the robustness of the project by:
+  - [Adding to the Command Library](docs/adding-to-command-library.md)
+  - [Adding a Custom Report Format](docs/creating-custom-templates.md)
+  - [Adding an Extension](docs/creating-tool-extensions.md)
 
 ## Am I Qualified to Contribute?
 

diff --git a/README.md b/README.md
@@ -14,6 +14,7 @@ Tern is a software package inspection tool for containers. It's written in Pytho
   - [Glossary of Terms](/docs/glossary.md)
   - [Architecture](/docs/architecture.md)
   - [Navigating the Code](/docs/navigating-the-code.md)
+  - [Data Model](/docs/data-model.md)
 - [Getting Started](#getting-started)
   - [Getting Started with Docker](#getting-started-with-docker)
   - [Getting Started with Vagrant](#getting-started-with-vagrant)
@@ -27,10 +28,16 @@ Tern is a software package inspection tool for containers. It's written in Pytho
   - [JSON Format](#report-json)
   - [YAML Format](#report-yaml)
   - [SPDX tag-value Format](#report-spdxtagvalue)
+- [Extensions](#extensions)
+  - [Scancode](#scancode)
+  - [cve-bin-tool](#cve-bin-tool)
 - [Running tests](#running-tests)
 - [Project Status](#project-status)
-- [Documentation](#documentation)
-- [Contributing](#contributing)
+- [Contributing](/CONTRIBUTING.md)
+  - [Code of Conduct](/CODE_OF_CONDUCT.md)
+  - [Creating Report Formats](/docs/creating-custom-templates.md)
+  - [Creating Tool Extensions](/docs/creating-tool-extensions.md)
+  - [Adding to the Command Library](/docs/adding-to-command-library.md)
 
 # What is Tern?<a name="what-is-tern">
 Tern is an inspection tool to find the metadata of the packages installed in a container image. The overall operation looks like this:
@@ -202,6 +209,21 @@ $ tern -l report -f yaml -i golang:1.12-alpine -o output.yaml
 ```
 $ tern -l report -f spdxtagvalue -i golang:1.12-alpine -o spdx.txt
 ```
+
+# Extensions<a name="extensions">
+Tern does not have its own file-level license scanner. To fill this gap, Tern allows you to extend container image analysis with an external file analysis CLI tool or Python3 module.
+
+## Scancode<a name="scancode">
+[scancode-toolkit](https://github.com/nexB/scancode-toolkit) is a license analysis tool that "detects licenses, copyrights, package manifests and direct dependencies and more both in source code and binary files". To use it to analyze container images, run:
+```
+$ tern -l report -x scancode -i golang:1.12-alpine
+```
+
+## cve-bin-tool<a name="cve-bin-tool">
+[cve-bin-tool](https://github.com/intel/cve-bin-tool) is a command line tool which "scans for a number of common, vulnerable components (openssl, libpng, libxml2, expat and a few others) to let you know if your system includes common libraries with known vulnerabilities". Vulnerability scanning tools can also be extended to work on containers using Tern, although support for certain metadata pertaining to CVEs may not be available yet. As a result, you will not see any of the results in the generated reports. To try it out, run:
+```
+$ tern -l report -x cve_bin_tool -i golang:1.12-alpine
+```
 
 # Running tests<a name="running-tests">
 WARNING: The `test_util_*` tests are not up to date. We are working on it :). From the Tern repository root directory run:
@@ -225,10 +247,10 @@ Somewhere along the line of development, we accidentally rewrote git history on
 * [v0.2.0](docs/releases/v0_2_0.md)
 * [v0.1.0](docs/releases/v0_1_0.md)
 
-## Documentation<a name="documentation"/>
+## Documentation
 Architecture, function blocks, code descriptions and the project roadmap are located in the docs folder. Contributions to the documentation are welcome! See the [contributing guide](/CONTRIBUTING.md) to find out how to submit changes.
 
-## Get Involved<a name="contributing"/>
+## Get Involved
 
 Do you have questions about Tern? Do you think it can do better? Would you like to make it better? You can get involved by giving your feedback and contributing to the code, documentation and conversation!
 

diff --git a/docs/adding-to-command-library.md b/docs/adding-to-command-library.md
@@ -303,3 +303,5 @@ should always be 1 as you are querying for only one package name.
 
 As always, don't hesitate to ask questions by filing an issue with 'Question:'
 as the prefix of the subject, on the Slack channel or on the mailing list.
+
+[Back to the README](../README.md)
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -3,12 +3,14 @@
 You may want to look at the [glossary](./glossary.md) to understand the terms being used.
 
 ## Overall Approach
-The general approach to finding package metadata given some files is to perform static analysis on the files. Tern's approach is more brute force - using the same tools that were used to install a package to retrieve the status of the package. This involves mounting the overlay filesystem and running commands against it at the most basic level. The results of the shell commands are then collated into a nice report showing which container layers brought in what packages. For compliance purposes (as Tern is a compliance tool), shell scripts to retrieve compliance information are used. Tern is also built to be a tool for guidance and hence will point out any missing or unknown information. For example, if it cannot find information about licenses, it will inform you of this and nudge you to either add information about it or a shell script to retrieve it.
+Tern uses a "container aware" approach to analyze container images. Tern tries to find the tool or method used to install software components in the container and will use equivalent debugging methods to find the status of said software component. This involves mounting the overlay filesystem and running commands against it at the most basic level. The results of the shell commands are then collated into a nice report showing which container layers brought in what packages. Tern is also built to be a tool for guidance and hence will point out any missing or unknown information. For example, if it cannot find information about licenses, it will inform you of this or nudge you to either add information about it or a shell script to retrieve it.
+
+Tern can also use license file scanners to scan container images. This is not the "container aware" approach that the native analyzer uses. Instead Tern will just run the scanners on the filesystems in the container image. This approach works for current container distribution methods as all the layers are "pushed" to the container registry's repository belonging to the originator. As a result, the originator is obligated to provide sources for software components included in all layers. This is not the case for security scanning as the files at the lower layers may be patched in the later layers, thus invalidating the results from the lower layers. However, this doesn't stop you from using a security scanner on the container image.
 
 ## Guiding Principles
 1. OSS Compliance First: Tern is a compliance tool at heart. Hence whatever it does is meant to help the user meet their open source compliance obligations, or, at the minimum, make them aware of the licenses governing the open source software they use.
-2. Be Transparent: Tern will report on everything it is doing in order to not mislead on how it got its information
-3. Inform not Ignore: Tern will report on all exceptions it has encountered during the course of its execution
+2. Be Transparent: Tern will report on everything it is doing in order to not mislead on how it got its information. This allows Engineers to judge for themselves how much confidence to place in the results.
+3. Inform not Ignore: Tern will report on all exceptions it has encountered during the course of its execution. This, again, allows Engineers to assess the efficacy of the tool, the results and the container image itself.
 
 ## Layout
 
@@ -19,7 +21,7 @@ Here is a general architecture diagram:
 There are 4 components that operate together:
 
 ### The Cache
-This is the database where filesystem identifiers can be queried against to retrieve package information. This is useful as many containers are based on other container images. If Tern had come across the same filesystem in another container, it can retrieve the package information without spinning up a container. Tern looks for filesystems here before doing any analysis. This is Tern's own data store which can be curated and culled over time. The reason that Tern keeps its own data store is because the filesystem artifacts that make up a container image are not necessarily how other compliance databases store license information.
+This is the database where filesystem identifiers can be queried against to retrieve package information. This is useful as many containers are based on other container images. If Tern had come across the same filesystem in another container, it can retrieve the package information without spinning up a container. Tern looks for filesystems here before doing any analysis. This is Tern's own data store which can be curated and culled over time. The reason that Tern keeps its own data store is because the filesystem artifacts that make up a container image are not necessarily how other compliance databases store license information. The filesystems also follow their own method of identifying themselves. A container build is not reproducible, so often, even when the content of the filesystem has not changed, the container's checksum has, and that makes it difficult to identify the contents of a container image.
 
 ### The Command Library
 This is a database of shell commands that may be used to create a container's layer filesystem. There are two types of shell commands - one for system wide package managers and one for custom shell commands or install scripts. The library is split in this way to account for situations where whole root filesystems are imported in order to create a new container.
@@ -28,15 +30,17 @@ For example, in [this Dockerfile sample](../samples/debian_vim/Dockerfile), the
 
 Some acknowledgement should be made here that this is not the only way to create a container. However, it seems to be the most prevalent way right now. In the future, the community may move away from the Copy on Write filesystems and instead use one filesystem. If this were the case, Tern's job might be a bit easier, but this is just speculation.
 
-### The Core
-Currently, the code combines the functions of report formatting and image analysis and requires refactoring. At a conceptual level, the Core is what unbundles a container image, reads the accompanying config file, untars all the filesystems, mounts them (or not, depending on what kind of analysis the Command Library says to do), and sets up a chroot to run scripts (or not, again, something that it will find out from the Command Library). The general functionality of the core looks like this:
+When Tern uses an external file scanner, it bypasses the command library altogether and instead relies on the external tool's results. The approach allows Engineers to compare results from different tools available to them as they have always done, but on container images.
+
+### The Analyzer
+At a conceptual level, the Analyzer is what unbundles a container image, reads the accompanying config file, untars all the filesystems, mounts them (or not, depending on what kind of analysis and which analyzer is used), and sets up a chroot to run scripts (or not, again, depending on the analysis type). Tern has an analyzer dedicated to the type of image being analyzed. Currently, it can analyze only images created by Docker. The inspection part can be done using Tern's native analyzer or an external tool. The general flow of the native analyzer looks like this:
 
 <img src="./img/tern_flow.png" alt="Tern process flow" width="331" height="563" />
 
-The core will collate the metadata it can get in Image objects which encapsulate data for each layer and each package found. It also encapsulates notes while execution takes place, so the report is transparent about what worked and what didn't.
+The analyzer will collate the metadata it can get in Image objects which encapsulate data for each layer and each package found. It also encapsulates notes while execution takes place, so the report is transparent about what worked and what didn't.
 
 ### The Formatter
-Tern's main purpose is to produce reports, either as an aid for understanding a container image or as a manifest for something else to consume. The default is the verbose report explaining where in the container the list of packages came from and what commands created them. The type of reports supported can be found in the project README. Also, take a look at the [template creation process](./creating-custom-templates.md) to get a better understanding of how Tern supports multiple formats.
+Tern's main purpose is to produce reports, either as an aid for understanding a container image or as a manifest for something else to consume. The default is the verbose report explaining where in the container the list of packages came from and what commands created them. The type of reports supported can be found in the project README. Also, take a look at the [custom report format creation process](./creating-custom-templates.md) to get a better understanding of how Tern supports multiple formats.
 
 ## Objects
 Tern uses Object Oriented Programming concepts to encapsulate the data that it will be referencing during the course of execution. They can be found in the `tern/classes` directory. The general format is that an object of type Image contains a list of type ImageLayer and each ImageLayer contains a list of type Package. On top of that each of those objects contains an object of type Origins. The Origins object contains a list of type NoticeOrigin which contains a list of type Notice.
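
A note on the "Overall Approach" text added to `docs/architecture.md` above: the caveat about security scanning can be illustrated with a toy model. The sketch below is not Tern code. It assumes each layer can be modelled as a plain mapping of file path to a version string, and the paths, version strings and function names are invented for illustration. It also ignores details such as deleted files.

```python
# Toy model: each layer maps a file path to the version of the file it ships.
# All names and values here are hypothetical, chosen only for illustration.

def merged_view(layers):
    """Return the final filesystem, where later layers override earlier ones."""
    final = {}
    for layer in layers:
        final.update(layer)
    return final


def scan(files, vulnerable):
    """Return the sorted paths whose version is in the vulnerable set."""
    return sorted(path for path, version in files.items() if version in vulnerable)


layers = [
    {"/usr/lib/libssl.so": "openssl-1.0.1"},  # base layer
    {"/usr/lib/libssl.so": "openssl-1.1.1"},  # later layer patches the same file
]
vulnerable = {"openssl-1.0.1"}

# Scanning layer by layer flags the file in the base layer ...
per_layer_hits = [scan(layer, vulnerable) for layer in layers]
# ... but the filesystem a running container sees no longer contains it.
final_hits = scan(merged_view(layers), vulnerable)
```

Here `per_layer_hits` reports the base layer's file while `final_hits` is empty, which is why a per-layer result is meaningful for license obligations (every layer is distributed) but can be misleading for vulnerability findings.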

diff --git a/docs/creating-custom-templates.md b/docs/creating-custom-templates.md
@@ -110,3 +110,5 @@ Continuing with the example from step #2, if you wanted to enable your `custom`
      yaml = tern.formats.yaml.generator:YAML
 +    custom = tern.formats.custom.generator:Custom
 ```
+
+[Back to the README](../README.md)
diff --git a/docs/creating-tool-extensions.md b/docs/creating-tool-extensions.md
@@ -0,0 +1,46 @@
+# Creating a Tool Extension
+You can use Tern with another file or filesystem analysis tool to analyze container images. You can find examples of such tools in the `tern/extensions` folder. Currently two external tools are supported:
+* [scancode-toolkit](https://github.com/nexB/scancode-toolkit): A license scanning tool that finds licenses in source code and binaries. Although support for formatting is not in place at the moment, it is something that will be completed in subsequent releases.
+* [cve-bin-tool](https://github.com/intel/cve-bin-tool): A security vulnerability scanning tool that finds common vulnerabilities. Note that although you can use a security scanner with Tern, there isn't any support for reporting the results beyond printing them to console. This may change as the industry demand for security information in Software Bill of Materials seems to be on the rise.
+
+If you would like to make a tool extension, here are some general steps to follow:
+
+## 1. Familiarize yourself with Tern's Data Model
+
+The classes for the objects that are used to keep discovered metadata are in the `tern/classes` folder. Check out the [data model document](./data-model.md) for a general layout of the classes. Refer to the individual files for a list of properties. These store the supported metadata. If you do not see the metadata you are interested in, please submit a proposal issue to add this property to the appropriate class. This should be a reasonably trivial change with minimal effect on backwards compatibility.
+
+## 2. Create a plugin
+
+To create a plugin for the tool, create a folder under `tern/extensions` with the plugin name. Create an empty `__init__.py` file here and create a file called `executor.py`. This file will contain a class which is derived from the abstract base class `Executor`, defined in `executor.py` under `tern/extensions`. The `Executor` class requires you to implement the method `execute` which takes an object of type `Image` or any of its derived classes (for example `DockerImage`). You can use this method to call a library function or run a CLI command to collect the required information. Once done, you can set the properties of the `Image` object and the objects within it (see the data model as a reference). See the `.py` files in `tern/classes` for a list of properties you can set.
+
+You can refer to the existing plugins as a guide for implementing the `execute` method of your executor class. There are helper methods in `tern/analyze/passthrough.py` which you can make use of, or write your own implementation if you need to.
+
+## 3. Test your plugin
+
+To test your plugin, add the plugin to `setup.cfg` under `tern.extensions`. For example, let's say you have created a plugin called "custom" to run a custom script. Your plugin's `executor.py` should live in `tern/extensions/custom`. You will add the plugin as follows:
+
+```
+tern.extensions =
+    cve_bin_tool = tern.extensions.cve_bin_tool.executor:CveBinTool
+    scancode = tern.extensions.scancode.executor:Scancode
+    custom = tern.extensions.custom.executor:Custom
+```
+
+To test out your plugin run:
+
+```
+$ pip install -e .[dev]
+$ tern -l report -x custom -i <image:tag>
+```
+
+To test out the different formats for your plugin run:
+
+```
+$ tern -l report -x custom -f <format> -i <image:tag>
+```
+
+where `<format>` is one of Tern's supported formats. To see what formats are supported, run `tern report --help`.
+
+If you need a custom report format, please refer to the document on [creating custom report formats](./creating-custom-templates.md).
+
+[Back to the README](../README.md)
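
Step 2 of `docs/creating-tool-extensions.md` above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Tern's actual code: the `Executor` class here is a stand-in for the abstract base class the document describes, `FakeImage` and `FakeLayer` are hypothetical substitutes for Tern's `Image` and `ImageLayer`, and the `rootfs_path` and `notices` attributes and the `scan_fn` parameter are assumptions made for the example.

```python
from abc import ABCMeta, abstractmethod


class Executor(metaclass=ABCMeta):
    """Stand-in for the abstract base class described in the document."""

    @abstractmethod
    def execute(self, image_obj):
        """Collect metadata for the given image object."""


class FakeLayer:
    """Hypothetical substitute for an ImageLayer; attributes are assumed."""

    def __init__(self, rootfs_path):
        self.rootfs_path = rootfs_path
        self.notices = []


class FakeImage:
    """Hypothetical substitute for an Image holding a list of layers."""

    def __init__(self, layers):
        self.layers = layers


class Custom(Executor):
    """Example plugin: runs a caller-supplied function on every layer."""

    def __init__(self, scan_fn):
        # scan_fn stands in for "a library function or a CLI command".
        self.scan_fn = scan_fn

    def execute(self, image_obj):
        for layer in image_obj.layers:
            # Record the tool's result on the layer object.
            layer.notices.append(self.scan_fn(layer.rootfs_path))


# Usage: scan two made-up layer directories with a trivial function.
image = FakeImage([FakeLayer("/tmp/layer0"), FakeLayer("/tmp/layer1")])
Custom(lambda path: "scanned " + path).execute(image)
```

In a real plugin the `execute` body would set properties on Tern's own objects as listed in `tern/classes`; the existing `scancode` and `cve_bin_tool` plugins remain the authoritative reference for that.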
diff --git a/docs/data-model.md b/docs/data-model.md
@@ -0,0 +1,11 @@
+# Tern's Data Model
+
+Tern stores metadata about the image and messages during operation in objects described here. The overall data model looks like this:
+
+![Tern data model](./img/tern_data_model.png)
+
+The main class is `Image` and its derived classes. This class contains a list of type `ImageLayer`. `ImageLayer` contains a list of type `Package`. `Image`, `ImageLayer` and `Package` contain a property called `origins` which is an object of type `Origins`. This class is used to record notes while Tern operates on an image such as what tools were used to retrieve the metadata or if the filesystem is of unknown content. `Origins` contains a list of type `NoticeOrigin` which contains a string and a list of type `Notice`. The `Notice` objects are where messages get recorded. You can easily record a message in the `origins` property of the `Image`, `ImageLayer` and `Package` types of objects by using the `add_notice_to_origins` method which just adds a `Notice` object to the `NoticeOrigin` object containing the origin string you give it ("origin_str" is basically a string indicating where in the image or analysis stage an event that you want recorded occurred).
+
+You will also see a class called `Template`. This is an abstract base class used to make custom formats. To learn more see the [documentation on creating custom formats](./creating-custom-templates.md).
+
+[Back to the README](../README.md)
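
The containment relationships described in `docs/data-model.md` above can be sketched with plain dataclasses. This is a simplified illustration under stated assumptions, not the classes in `tern/classes`: the field names (`message`, `notices`, `layers`, `packages`, `name`), the method signature, and the choice to create a new `NoticeOrigin` when the origin string is not yet present are all guesses made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Notice:
    message: str  # assumed field name


@dataclass
class NoticeOrigin:
    origin_str: str
    notices: List[Notice] = field(default_factory=list)


@dataclass
class Origins:
    origins: List[NoticeOrigin] = field(default_factory=list)

    def add_notice_to_origins(self, origin_str, notice):
        """Add the notice to the NoticeOrigin with this origin string."""
        for notice_origin in self.origins:
            if notice_origin.origin_str == origin_str:
                notice_origin.notices.append(notice)
                return
        # Assumption: an unseen origin string gets a new NoticeOrigin.
        self.origins.append(NoticeOrigin(origin_str, [notice]))


@dataclass
class Package:
    name: str
    origins: Origins = field(default_factory=Origins)


@dataclass
class ImageLayer:
    packages: List[Package] = field(default_factory=list)
    origins: Origins = field(default_factory=Origins)


@dataclass
class Image:
    layers: List[ImageLayer] = field(default_factory=list)
    origins: Origins = field(default_factory=Origins)


# Usage: record two messages against the same origin string on a layer.
layer = ImageLayer(packages=[Package("vim")])
image = Image(layers=[layer])
layer.origins.add_notice_to_origins("layer 1", Notice("filesystem of unknown content"))
layer.origins.add_notice_to_origins("layer 1", Notice("no license found"))
```

Both notices end up under a single `NoticeOrigin` for `"layer 1"`, while the image-level and package-level `origins` stay empty, which mirrors how the document describes notes being attached at the level where the event happened.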
diff --git a/docs/faq.md b/docs/faq.md
@@ -10,3 +10,5 @@ Static analysis is a reasonable approach to find software components and there a
 
 ## Why Python?
 Python is well suited for easy string formatting which is most of the work that Tern does.
+
+[Back to the README](../README.md)