fix typos in README.md

biocommons · Feb 20, 2024 · 7a56649
1 parent 344d9d6
Showing 1 changed file with 86 additions and 76 deletions.
diff --git a/README.md b/README.md
@@ -1,28 +1,32 @@
 # biocommons.seqrepo
 
-SeqRepo is a Python package for storing and reading a local collection of biological sequences. The
-repository is non-redundant, compressed, and journalled, making it efficient to store and transfer
-multiple snapshots.
+SeqRepo is a Python package for storing and reading a local collection of
+biological sequences. The repository is non-redundant, compressed, and
+journalled, making it efficient to store and transfer multiple snapshots.
 
 ## Introduction
 
-Specific, named biological sequences provide the reference and coordinate sysstem for communicating
-variation and consequential phenotypic changes. Several databases of sequences exist, with
-significant overlap, all using distinct names. Furthermore, these systems are often difficult to
-install locally.
-
-SeqRepo provides an efficient, non-redundant and indexed storage system for biological sequences.
-Clients refer to sequences and metadata using familiar identifiers, such as NM_000551.3 or GRCh38:1,
-or any of several hash-based identifiers. The interface supports fast slicing of arbitrary regions
-of large sequences.
-
-A "fully-qualified" identifier includes a namespace to disambiguate accessions from different
-origins or sequence sets (e.g., "1" in GRCh37 and GRCh38). If the namespace is provided, seqrepo
-uses it as-is; if the namespace is not provided and the unqualified identifier refers to a unique
-sequence, it is returned; otherwise, the use of ambiguous identifiers raise an error.
-
-SeqRepo favors namespaces from [identifiers.org](https://identifiers.org) whenever available.
-Examples include [refseq](<https://registry.identifiers.org/registry/refseq>) and
+Specific, named biological sequences provide the reference and coordinate
+system for communicating variation and consequential phenotypic changes.
+Several databases of sequences exist, with significant overlap, all using
+distinct names. Furthermore, these systems are often difficult to install
+locally.
+
+SeqRepo provides an efficient, non-redundant and indexed storage system for
+biological sequences. Clients refer to sequences and metadata using familiar
+identifiers, such as NM_000551.3 or GRCh38:1, or any of several hash-based
+identifiers. The interface supports fast slicing of arbitrary regions of large
+sequences.
+
+A "fully-qualified" identifier includes a namespace to disambiguate accessions
+from different origins or sequence sets (e.g., "1" in GRCh37 and GRCh38). If the
+namespace is provided, seqrepo uses it as-is; if the namespace is not provided
+and the unqualified identifier refers to a unique sequence, it is returned;
+otherwise, the use of ambiguous identifiers raise an error.
+
+SeqRepo favors namespaces from [identifiers.org](https://identifiers.org)
+whenever available. Examples include
+[refseq](<https://registry.identifiers.org/registry/refseq>) and
 [ensembl](<https://registry.identifiers.org/registry/ensembl>).
 
 [seqrepo-rest-service](https://github.com/biocommons/seqrepo-rest-service) provides a REST interface
@@ -39,82 +43,82 @@ Released under the Apache License, 2.0.
 
 ## Citation
 
-Hart RK, Prlić A (2020). **SeqRepo: A system for managing local collections of biological
-sequences.** PLoS ONE 15(12): e0239883. <https://doi.org/10.1371/journal.pone.0239883>
+Hart RK, Prlić A (2020). **SeqRepo: A system for managing local collections of
+biological sequences.** PLoS ONE 15(12): e0239883.
+<https://doi.org/10.1371/journal.pone.0239883>
 
 ## Features
 
--   Timestamped, read-only snapshots.
--   Space-efficient storage of sequences within a single snapshot and across snapshots.
--   Bandwidth-efficient transfer incremental updates.
--   Fast fetching of sequence slices on chromosome-scale sequences.
--   Precomputed digests that may be used as sequence aliases.
--   Mappings of external aliases (i.e., accessions or identifiers like NM_013305.4) to sequences.
+- Timestamped, read-only snapshots.
+- Space-efficient storage of sequences within a single snapshot and across snapshots.
+- Bandwidth-efficient transfer incremental updates.
+- Fast fetching of sequence slices on chromosome-scale sequences.
+- Precomputed digests that may be used as sequence aliases.
+- Mappings of external aliases (i.e., accessions or identifiers like
+  `NM_013305.4`) to sequences.
 
 ## Deployments Scenarios
 
--   Local read-only archive, mirrored from public site, accessed via Python API (see [Mirroring
-    documentation](docs/mirror.rst))
--   Local read-write archive, maintained with command line utility
-    and/or API (see [Command Line Interface
-    documentation](docs/cli.rst)).
--   Docker data-only container that may be linked to application container.
--   SeqRepo and refget REST API for local or remote access (see
+- Local read-only archive, mirrored from public site, accessed via Python API
+  (see [Mirroring documentation](docs/mirror.rst))
+- Local read-write archive, maintained with command line utility and/or API (see
+  [Command Line Interface documentation](docs/cli.rst)).
+- Docker data-only container that may be linked to application container.
+- SeqRepo and refget REST API for local or remote access (see
     [seqrepo-rest-service](https://github.com/biocommons/seqrepo-rest-service))
 
 ## Technical Quick Peek
 
-Within a single snapshot, sequences are stored *non-redundantly* and *compressed* in an add-only
-journalled filesystem structure. A truncated SHA-512 hash is used to assess uniquness and as an
-internal id. (The digest is truncated for space efficiency.)
+Within a single snapshot, sequences are stored *non-redundantly* and
+*compressed* in an add-only journalled filesystem structure. A truncated SHA-512
+hash is used to assess uniquness and as an internal id. (The digest is truncated
+for space efficiency.)
 
 Sequences are compressed using the Block GZipped Format
-([BGZF](https://samtools.github.io/hts-specs/SAMv1.pdf))), which enables pysam to provide fast
-random access to compressed sequences. (Variable compression typically makes random access
-impossible.)
+([BGZF](https://samtools.github.io/hts-specs/SAMv1.pdf))), which enables pysam
+to provide fast random access to compressed sequences. (Variable compression
+typically makes random access impossible.)
 
-Sequence files are immutable, thereby enabling the use of hardlinks across snapshots and eliminating
-redundant transfers (e.g., with rsync).
+Sequence files are immutable, thereby enabling the use of hardlinks across
+snapshots and eliminating redundant transfers (e.g., with `rsync`).
 
-Each sequence id is associated with a namespaced alias in a sqlite database. Such as
-`<seguid,rvvuhY0FxFLNwf10FXFIrSQ7AvQ>`, `<NCBI,NP_004009.1>`, `<gi,5032303>`,
-`<ensembl-75ENSP00000354464>`, `<ensembl-85,ENSP00000354464.4>`. The sqlite database is mutable
-across releases.
+Each sequence id is associated with a namespaced alias in a sqlite database.
+Such as `<seguid,rvvuhY0FxFLNwf10FXFIrSQ7AvQ>`, `<NCBI,NP_004009.1>`,
+`<gi,5032303>`, `<ensembl-75ENSP00000354464>`, `<ensembl-85,ENSP00000354464.4>`.
+The sqlite database is mutable across releases.
 
-For calibration, recent releases that include 3 human genome assemblies (including patches), and
-full RefSeq sets (NM, NR, NP, NT, XM, and XP) consumes approximately 8GB. The minimum marginal size
-for additional snapshots is approximately 2GB (for the sqlite database, which is not hardlinked).
+For calibration, recent releases that include 3 human genome assemblies
+(including patches), and full RefSeq sets (NM, NR, NP, NT, XM, and XP) consumes
+approximately 8GB. The minimum marginal size for additional snapshots is
+approximately 2GB (for the sqlite database, which is not hardlinked).
 
 For more information, see [docs/design.rst](docs/design.rst).
 
 ## Requirements
 
-Reading a sequence repository requires several Python packages, all of which are available from
-pypi. Installation should be as simple as [pip install biocommons.seqrepo]{.title-ref}.
+Reading a sequence repository requires several Python packages, all of which are
+available from pypi. Installation should be as simple as `pip install
+biocommons.seqrepo`.
 
 *Writing* sequence files also requires `bgzip`, which provided in the
-[htslib](https://github.com/samtools/htslib) repo. Ubuntu users should install the `tabix` package
-with `sudo apt install tabix`.
-
-Development and deployments are on Ubuntu. Other systems may work but are not tested. Patches to get
-other systems working would be welcomed.
+[htslib](https://github.com/samtools/htslib) repo. Ubuntu users should install
+the `tabix` package with `sudo apt install tabix`.
 
-**Mac Developers** If you get "xcrun: error: invalid active developer path", you need to install
-XCode. See this [StackOverflow
-answer](https://apple.stackexchange.com/questions/254380/why-am-i-getting-an-invalid-active-developer-path-when-attempting-to-use-git-a).
+Development and deployments are on Ubuntu. Other systems may work but are not
+tested. Patches to get other systems working would be welcomed.
 
 ## Quick Start
 
-### OSX
+### OS X
 
     $ brew install python libpq
 
-### On Ubuntu 16.04
+### Ubuntu
 
     $ sudo apt install -y python3-dev gcc zlib1g-dev tabix
 
 ### All platforms
- 
+
     $ python -m venv venv
     $ source venv/bin/activate
     $ pip install seqrepo
@@ -155,21 +159,25 @@ See [Installation](docs/installation.rst) and
 
 ## Environment Variables
 
-SEQREPO_LRU_CACHE_MAXSIZE sets the lru_cache maxsize for the sqlite
-query response caching. It defaults to 1 million but can also be set to
-"none" to be unlimited.
+SEQREPO_LRU_CACHE_MAXSIZE sets the lru_cache maxsize for the sqlite query
+response caching. It defaults to 1 million but can also be set to "none" to be
+unlimited.
 
-SEQREPO_FD_CACHE_MAXSIZE sets the lru_cache size for file handler caching during FASTA sequence retrievals. 
-It defaults to 0 to disable any caching, but can be set to a specific value or "none" to be unlimited. Using 
-a moderate value (>10) will greatly increase performance of sequence retrieval.
+SEQREPO_FD_CACHE_MAXSIZE sets the lru_cache size for file handler caching during
+FASTA sequence retrievals. It defaults to 0 to disable any caching, but can be
+set to a specific value or "none" to be unlimited. Using a moderate value (>10)
+will greatly increase performance of sequence retrieval.
 
 ## Developing
 
-### OSX
+### Developing on OS X
 
     brew install python libpq bash
 
-### Ubuntu
+If you get "xcrun: error: invalid active developer path", you need to install
+XCode. See this [StackOverflow answer](https://apple.stackexchange.com/questions/254380/why-am-i-getting-an-invalid-active-developer-path-when-attempting-to-use-git-a).
+
+### Developing on Ubuntu
 
     sudo apt install -y python3-dev gcc zlib1g-dev tabix
 
@@ -181,11 +189,13 @@ Here's how to get started developing:
 
 ## Building a docker image
 
-Docker images are available at https://hub.docker.com/r/biocommons/seqrepo.  Tags correspond to the
-version of data, not the version of seqrepo, because the intent is to make it easy to depend on a
-local version of seqrepo *files*.  Each docker image is an installation of seqrepo that downloads
-the corresponding version of seqrepo data.  When used in conjunction with docker volumes for
-persistence, this provides an easy way to incorporate seqrepo data into a docker stack.
+Docker images are available at https://hub.docker.com/r/biocommons/seqrepo.
+Tags correspond to the version of data, not the version of seqrepo, because the
+intent is to make it easy to depend on a local version of seqrepo *files*.  Each
+docker image is an installation of seqrepo that downloads the corresponding
+version of seqrepo data.  When used in conjunction with docker volumes for
+persistence, this provides an easy way to incorporate seqrepo data into a docker
+stack.
 
 ### Building