- What is ddev-spidergram?
- Installation
- Basic usage
- How to access ArangoDB
- Backup and restore ArangoDB
- Behind the scenes
- TODO
- Contributing
This repository provides an addon to use @Spidergram within DDEV.
Spidergram is a customizable toolkit for crawling and analyzing complicated web properties. While it can be used to crawl any website, we (the folks at Autogram) designed it specifically for "ten websites in a trench coat" scenarios where a web property encompasses multiple CMSs, multiple domains, and multiple design systems, maintained by multiple teams.
- Create a new directory and move into it. For simplicity, this README uses the name `spidercrawl` throughout, but you can choose any other name.
mkdir spidercrawl && cd spidercrawl
- Initialize your DDEV project. By using the defaults the project name will be equal to the directory name.
ddev config --auto
If you are running DDEV on macOS or Windows (without WSL2), it is highly recommended to enable Mutagen with the following additional configuration step.
ddev config --mutagen-enabled=true
On Linux, Windows with WSL2, and Gitpod that step is not necessary.
- Download the `ddev-spidergram` addon.
ddev get rpkoller/ddev-spidergram
- Start DDEV and wait a few minutes until the DDEV and ArangoDB images, Spidergram, as well as Playwright are downloaded and installed.
ddev start
- Run an initial status check to verify that everything is set up correctly.
ddev spidergram status
The resulting output should look like this:
$> ddev spidergram status
SPIDERGRAM CONFIG
Config file: /var/www/html/spidergram.config.yaml
ARANGODB
Status: online
URL: https://spidercrawl.ddev.site:8529
Database: db
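For scripted setups, the status check can be turned into a pass/fail test. The following is a minimal sketch, assuming the output format shown above (a `Status: online` line in the ARANGODB block). The function name `arango_is_online` is hypothetical and not part of the addon; if the real output contains color codes or different spacing, the pattern would need adjusting.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of ddev-spidergram.
# Reads `ddev spidergram status` output on stdin and succeeds only if
# a "Status: online" line is present (format assumed from the sample output).
arango_is_online() {
  grep -Eq '^[[:space:]]*Status:[[:space:]]*online[[:space:]]*$'
}

# Demo using the sample output from this README instead of a live project.
sample_status='SPIDERGRAM CONFIG
Config file: /var/www/html/spidergram.config.yaml
ARANGODB
Status: online
URL: https://spidercrawl.ddev.site:8529
Database: db'

if printf '%s\n' "$sample_status" | arango_is_online; then
  echo "ArangoDB is online"
fi
```

In a running project the same check would presumably be `ddev spidergram status | arango_is_online`.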
- Crawl and analyze your first site.
ddev spidergram go https://ddev.com
- For more details see the Spidergram documentation. All configuration changes are made in the `spidergram.config.yaml` file.
- The ArangoDB web interface can be reached in the browser via the URL shown by `ddev spidergram status`. The port `:8529` is appended to the project's URL (`https://spidercrawl.ddev.site:8529`).
- Click `_SYSTEM` in the upper right corner of the screen. In the select form on the next screen, click `_system` again, then choose the option `db` and confirm.
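The URL pattern described above (project URL plus port 8529) can be derived in a script. This is a small sketch; `arango_url` is a hypothetical helper name, and the default TLD `ddev.site` is assumed from the example URL in this README.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the ArangoDB web interface URL for a DDEV project.
# Assumes the TLD "ddev.site" unless a second argument is given,
# and the port 8529 described above.
arango_url() {
  local project="$1"
  local tld="${2:-ddev.site}"
  printf 'https://%s.%s:8529\n' "$project" "$tld"
}

# Prints the URL for the example project used in this README.
arango_url spidercrawl
```

Since the container does not require authentication, a request such as `curl -fsS "$(arango_url spidercrawl)/_api/version"` should return ArangoDB's version information, provided the project is running and your system trusts the DDEV certificate.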
- To back up your ArangoDB database and delete your project:
ddev arangodump
ddev delete spidergram --omit-snapshot
The database is saved in `.ddev/arangodb-backup`. After the successful dump, `ddev delete spidergram --omit-snapshot` deletes the project's containers, images and volumes. The project files as well as the DDEV config files in `.ddev`, including the ArangoDB database dump, remain untouched. That saves disk space and enables you to re-add the project at a later point as described in the next step.
- To restore your project:
ddev config
ddev start
ddev arangorestore
That way you re-register the existing project in DDEV, start it up, and restore the database you previously used in ArangoDB.
- `arangodump` is currently only suited for a final backup before you delete your project. It cannot keep one or more backups during daily usage, because each run of `arangodump` overwrites the previous dump! A more flexible and convenient solution is planned for the future.
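Until such a solution exists, one possible workaround is to copy the dump directory aside before dumping again. The sketch below is not part of the addon: the function name `archive_dump` and the archive location `.ddev/arangodb-backup-archive` are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical workaround, not part of ddev-spidergram: copy the current dump
# to a timestamped directory so the next `ddev arangodump` cannot overwrite it.
# The archive location (.ddev/arangodb-backup-archive) is an assumption.
archive_dump() {
  local src="${1:-.ddev/arangodb-backup}"
  local archive_root="${2:-.ddev/arangodb-backup-archive}"
  local dest
  dest="$archive_root/$(date +%Y%m%d-%H%M%S)"

  if [ ! -d "$src" ]; then
    echo "No dump directory found at $src" >&2
    return 1
  fi

  mkdir -p "$dest" || return 1
  # Copy the directory contents, including dotfiles such as .gitignore.
  cp -R "$src"/. "$dest"/ || return 1
  # Print the archive path so callers can record it.
  echo "$dest"
}
```

Run from the project root, `archive_dump && ddev arangodump` would preserve the previous dump before writing a new one. To restore an older dump, copy its contents back to `.ddev/arangodb-backup` before running `ddev arangorestore`.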
- Adds a docker-compose file (`docker-compose.arangodb.yaml`) for ArangoDB. The Spidergram database and password are set to `db` to be in line with DDEV's standards. The only difference is that the default username was left at `root` since it wasn't changeable in ArangoDB. The ArangoDB container was set to not require any authentication, which is in line with the Spidergram docker-compose file.
- Adds a Dockerfile (`Dockerfile.spidergram`) to the `web-build` folder. It runs `npm install --global spidergram`, `npx playwright install`, and `npx playwright install-deps` when the addon is installed.
- Adds a `spidergram` web command. For example, you only have to type `ddev spidergram status` instead of `ddev exec spidergram status`.
- Adds a `spidergram.config.yaml` to the project root. The YAML file with that exact file name is mandatory for Spidergram to run.
- The `config.ddev-spidergram.yaml` file ensures that Node.js is set to version 18. A `post-start` hook also keeps the URL set in `spidergram.config.yaml` in line with the overall project settings. The project name, based on `$DDEV_PROJECT`, and the TLD, based on `$DDEV_TLD`, are replaced by a regex statement on every start. That way, if the project name or the TLD changes at a later point, Spidergram keeps working.
- Adds an `arangodump` web command. The database dump is written to a fixed destination, `.ddev/arangodb-backup/`. Currently `arangodump` is intended to be used to back up the database before a project is removed from DDEV.
- Adds an `arangorestore` web command. Make sure that your folder with the database backup is available at `.ddev/arangodb-backup/` within your project folder before you run `ddev config && ddev start`.
- The `.ddev/arangodb-backup/` directory is created with the `-p` option in a `post_install_action`, and a `.gitignore` file is added to the directory, excluding everything within.
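To illustrate the URL rewrite performed on every start, here is a rough sketch of how such a regex replacement could look. It is illustrative only: the addon's actual hook and regex may differ, the function name `rewrite_arango_url` is hypothetical, and the sample config content (an `arango:` block with a `url:` key) is made up for the demo.

```shell
#!/usr/bin/env bash
# Illustrative only: the addon's actual hook and regex may differ.
# Rewrites the host part of any "https://<host>:8529" URL in a config file
# to "<project>.<tld>". A temp file is used instead of `sed -i` so the
# sketch behaves the same with GNU and BSD sed.
rewrite_arango_url() {
  local file="$1" project="$2" tld="$3"
  local tmp
  tmp="$(mktemp)" || return 1
  sed -E "s#https://[^:/[:space:]]+:8529#https://${project}.${tld}:8529#" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  local status=$?
  rm -f "$tmp"
  return "$status"
}

# Demo on a throwaway file with made-up content (not the real config layout).
cfg="$(mktemp)"
printf 'arango:\n  url: https://oldname.ddev.site:8529\n' > "$cfg"
rewrite_arango_url "$cfg" "${DDEV_PROJECT:-spidercrawl}" "${DDEV_TLD:-ddev.site}"
cat "$cfg"
rm -f "$cfg"
```

Matching on the port rather than on the old project name means the replacement does not need to know the previous name, which is one way to keep the file correct after a rename.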
- Figure out the best approach to upgrade Spidergram and its dependencies for an already existing Spidergram DDEV instance and update the README accordingly (I have to wait for the next Spidergram release to be able to test that).
- Expand the number of available settings in the `spidergram.config.yaml`. At the moment I am only using the default values from an old template found at https://github.com/autogram-is/create-spidergram/tree/main/templates.
- Further expand the interaction capabilities ddev-spidergram provides for ArangoDB.
Any feedback in regard to bugs and potential improvements is welcome.
Contributed and maintained by @rpkoller based on the original ddev-addon-template