Docs update (#1408)

* Fix footer contrast
* Fix broken links
* Remove a few unneeded examples
* Point python API example to the whole folder
* Convert schema bullets to tables
microsoft · Nov 15, 2024 · 425dbc6
1 parent ec9cdcc
commit 425dbc6

Showing 37 changed files with 94 additions and 1,098 deletions.
diff --git a/docs/config/template.md b/docs/config/template.md
@@ -3,7 +3,7 @@
 The following template can be used and stored as a `.env` in the the directory where you're are pointing
 the `--root` parameter on your Indexing Pipeline execution.
 
-For details about how to run the Indexing Pipeline, refer to the [Index CLI](../index/cli.md) documentation.
+For details about how to run the Indexing Pipeline, refer to the [Index CLI](../cli.md) documentation.
 
 ## .env File Template
 

diff --git a/docs/developing.md b/docs/developing.md
@@ -81,5 +81,5 @@ Make sure you have python3.10-dev installed or more generally `python<version>-d
 
 ### LLM call constantly exceeds TPM, RPM or time limits
 
-`GRAPHRAG_LLM_THREAD_COUNT` and `GRAPHRAG_EMBEDDING_THREAD_COUNT` are both set to 50 by default. You can modify this values
+`GRAPHRAG_LLM_THREAD_COUNT` and `GRAPHRAG_EMBEDDING_THREAD_COUNT` are both set to 50 by default. You can modify these values
 to reduce concurrency. Please refer to the [Configuration Documents](config/overview.md)
diff --git a/docs/get_started.md b/docs/get_started.md
@@ -85,7 +85,7 @@ deployment_name: <azure_model_deployment_name>
 
 - For more details about configuring GraphRAG, see the [configuration documentation](config/overview.md).
 - To learn more about Initialization, refer to the [Initialization documentation](config/init.md).
-- For more details about using the CLI, refer to the [CLI documentation](query/cli.md).
+- For more details about using the CLI, refer to the [CLI documentation](cli.md).
 
 ## Running the Indexing pipeline
 

diff --git a/docs/index/outputs.md b/docs/index/outputs.md
@@ -4,86 +4,113 @@ The default pipeline produces a series of output tables that align with the [con
 
 ## Shared fields
 All tables have two identifier fields:
-- id: str - Generated UUID, assuring global uniqueness
-- human_readable_id: int - This is an incremented short ID created per-run. For example, we use this short ID with generated summaries that print citations so they are easy to cross-reference visually.
+
+| name              | type | description |
+| ----------------- | ---- | ----------- |
+| id                | str  | Generated UUID, assuring global uniqueness |
+| human_readable_id | int  | This is an incremented short ID created per-run. For example, we use this short ID with generated summaries that print citations so they are easy to cross-reference visually. |
 
 ## create_final_communities
 This is a list of the final communities generated by Leiden. Communities are strictly hierarchical, subdividing into children as the cluster affinity is narrowed.
-- community: int - Leiden-generated cluster ID for the community. Note that these increment with depth, so they are unique through all levels of the community hierarchy. For this table, human_readable_id is a copy of the community ID rather than a plain increment.
-- level: int - Depth of the community in the hierarchy.
-- title: str - Friendly name of the community.
-- entity_ids - List of entities that are members of the community.
-- relationship_ids - List of relationships that are wholly within the community (source and target are both in the community).
-- text_unit_ids - List of text units represented within the community.
-- period - Date of ingest, used for incremental update merges.
-- size  - Size of the community (entity count), used for incremental update merges.
+
+| name             | type  | description |
+| ---------------- | ----- | ----------- |
+| community        | int   | Leiden-generated cluster ID for the community. Note that these increment with depth, so they are unique through all levels of the community hierarchy. For this table, human_readable_id is a copy of the community ID rather than a plain increment. |
+| level            | int   | Depth of the community in the hierarchy. |
+| title            | str   | Friendly name of the community. |
+| entity_ids       | str[] | List of entities that are members of the community. |
+| relationship_ids | str[] | List of relationships that are wholly within the community (source and target are both in the community). |
+| text_unit_ids    | str[] | List of text units represented within the community. |
+| period           | str   | Date of ingest, used for incremental update merges. ISO8601 |
+| size             | int   | Size of the community (entity count), used for incremental update merges. |
 
 ## create_final_community_reports
 This is the list of summarized reports for each community.
-- community: int - Short ID of the community this report applies to.
-- level: int - Level of the community this report applies to.
-- title: str - LM-generated title for the report.
-- summary: str - LM-generated summary of the report.
-- full_content: str - LM-generated full report.
-- rank: float - LM-derived relevance ranking of the report based on member entity salience
-- rank_explanation - LM-derived explanation of the rank.
-- findings: dict - LM-derived list of the top 5-10 insights from the community. Contains `summary` and `explanation` values.
-- full_content_json - Full JSON output as returned by the LM. Most fields are extracted into columns, but this JSON is sent for query summarization so we leave it to allow for prompt tuning to add fields/content by end users.
-- period - Date of ingest, used for incremental update merges.
-- size  - Size of the community (entity count), used for incremental update merges.
+
+| name              | type  | description |
+| ----------------- | ----- | ----------- |
+| community         | int   | Short ID of the community this report applies to. |
+| level             | int   | Level of the community this report applies to. |
+| title             | str   | LM-generated title for the report. |
+| summary           | str   | LM-generated summary of the report. |
+| full_content      | str   | LM-generated full report. |
+| rank              | float | LM-derived relevance ranking of the report based on member entity salience
+| rank_explanation  | str   | LM-derived explanation of the rank. |
+| findings          | dict  | LM-derived list of the top 5-10 insights from the community. Contains `summary` and `explanation` values. |
+| full_content_json | json  | Full JSON output as returned by the LM. Most fields are extracted into columns, but this JSON is sent for query summarization so we leave it to allow for prompt tuning to add fields/content by end users. |
+| period            | str   | Date of ingest, used for incremental update merges. ISO8601 |
+| size              | int   | Size of the community (entity count), used for incremental update merges. |
 
 ## create_final_covariates
 (Optional) If claim extraction is turned on, this is a list of the extracted covariates. Note that claims are typically oriented around identifying malicious behavior such as fraud, so they are not useful for all datasets.
-- covariate_type: str - This is always "claim" with our default covariates.
-- type: str - Nature of the claim type.
-- description: str - LM-generated description of the behavior.
-- subject_id: str - Name of the source entity (that is performing the claimed behavior).
-- object_id: str - Name of the target entity (that the claimed behavior is performed on).
-- status: str [TRUE, FALSE, SUSPECTED] - LM-derived assessment of the correctness of the claim.
-- start_date: str (ISO8601) - LM-derived start of the claimed activity.
-- end_date: str (ISO8601) - LM-derived end of the claimed activity.
-- source_text: str - Short string of text containing the claimed behavior.
-- text_unit_id: str - ID of the text unit the claim text was extracted from.
+
+| name           | type | description |
+| -------------- | ---- | ----------- |
+| covariate_type | str  | This is always "claim" with our default covariates. |
+| type           | str  | Nature of the claim type. |
+| description    | str  | LM-generated description of the behavior. |
+| subject_id     | str  | Name of the source entity (that is performing the claimed behavior). |
+| object_id      | str  | Name of the target entity (that the claimed behavior is performed on). |
+| status         | str  | LM-derived assessment of the correctness of the claim. One of [TRUE, FALSE, SUSPECTED] |
+| start_date     | str  | LM-derived start of the claimed activity. ISO8601 |
+| end_date       | str  | LM-derived end of the claimed activity. ISO8601 |
+| source_text    | str  | Short string of text containing the claimed behavior. |
+| text_unit_id   | str  | ID of the text unit the claim text was extracted from. |
 
 ## create_final_documents
 List of document content after import.
-- title: str - Filename, unless otherwise configured during CSV import.
-- text: str - Full text of the document.
-- text_unit_ids: str[] - List of text units (chunks) that were parsed from the document.
-- attributes: dict (optional) - If specified during CSV import, this is a dict of attributes for the document.
 
-# create_final_entities
+| name          | type  | description |
+| ------------- | ----- | ----------- |
+| title         | str   | Filename, unless otherwise configured during CSV import. |
+| text          | str   | Full text of the document. |
+| text_unit_ids | str[] | List of text units (chunks) that were parsed from the document. |
+| attributes    | dict  | (optional) If specified during CSV import, this is a dict of attributes for the document. |
+
+## create_final_entities
 List of all entities found in the data by the LM.
-- title: str - Name of the entity.
-- type: str - Type of the entity. By default this will be "organization", "person", "geo", or "event" unless configured differently or auto-tuning is used.
-- description: str - Textual description of the entity. Entities may be found in many text units, so this is an LM-derived summary of all descriptions.
-- text_unit_ids: str[] - List of the text units containing the entity.
 
-# create_final_nodes
+| name          | type  | description |
+| ------------- | ----- | ----------- |
+| title         | str   | Name of the entity. |
+| type          | str   | Type of the entity. By default this will be "organization", "person", "geo", or "event" unless configured differently or auto-tuning is used. |
+| description   | str   | Textual description of the entity. Entities may be found in many text units, so this is an LM-derived summary of all descriptions. |
+| text_unit_ids | str[] | List of the text units containing the entity. |
+
+## create_final_nodes
 This is graph-related information for the entities. It contains only information relevant to the graph such as community. There is an entry for each entity at every community level it is found within, so you may see "duplicate" entities.
 
 Note that the ID fields match those in create_final_entities and can be used for joining if additional information about a node is required.
-- title: str - Name of the referenced entity. Duplicated from create_final_entities for convenient cross-referencing.
-- community: int - Leiden community the node is found within. Entities are not always assigned a community (they may not be close enough to any), so they may have a ID of -1.
-- level: int - Level of the community the entity is in.
-- degree: int - Node degree (connectedness) in the graph.
-- x: float - X position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0.
-- y: float - Y position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0.
+
+| name      | type  | description |
+| --------- | ----- | ----------- |
+| title     | str   | Name of the referenced entity. Duplicated from create_final_entities for convenient cross-referencing. |
+| community | int   | Leiden community the node is found within. Entities are not always assigned a community (they may not be close enough to any), so they may have a ID of -1. |
+| level     | int   | Level of the community the entity is in. |
+| degree    | int   | Node degree (connectedness) in the graph. |
+| x         | float | X position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0. |
+| y         | float | Y position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0. |
 
 ## create_final_relationships
 List of all entity-to-entity relationships found in the data by the LM. This is also the _edge list_ for the graph.
-- source: str - Name of the source entity.
-- target: str - Name of the target entity.
-- description: str - LM-derived description of the relationship. Also see note for entity descriptions.
-- weight: float - Weight of the edge in the graph. This is summed from an LM-derived "strength" measure for each relationship instance.
-- combined_degree: int - Sum of source and target node degrees.
-- text_unit_ids: str[] - List of text units the relationship was found within.
+
+| name            | type  | description |
+| --------------- | ----- | ----------- |
+| source          | str   | Name of the source entity. |
+| target          | str   | Name of the target entity. |
+| description     | str   | LM-derived description of the relationship. Also see note for entity descriptions. |
+| weight          | float | Weight of the edge in the graph. This is summed from an LM-derived "strength" measure for each relationship instance. |
+| combined_degree | int   | Sum of source and target node degrees. |
+| text_unit_ids   | str[] | List of text units the relationship was found within. |
 
 ## create_final_text_units
 List of all text chunks parsed from the input documents.
-- text: str - Raw full text of the chunk.
-- n_tokens: int - Number of tokens in the chunk. This should normally match the `chunk_size` config parameter, except for the last chunk which is often shorter.
-- document_ids: str[] - List of document IDs the chunk came from. This is normally only 1 due to our default groupby, but for very short text documents (e.g., microblogs) it can be configured so text units span multiple documents.
-- entity_ids: str[] - List of entities found in the text unit.
-- relationships_ids: str[] - List of relationships found in the text unit.
-- covariate_ids: str[] - Optional list of covariates found in the text unit.
+
+| name              | type  | description |
+| ----------------- | ----- | ----------- |
+| text              | str   | Raw full text of the chunk. |
+| n_tokens          | int   | Number of tokens in the chunk. This should normally match the `chunk_size` config parameter, except for the last chunk which is often shorter. |
+| document_ids      | str[] | List of document IDs the chunk came from. This is normally only 1 due to our default groupby, but for very short text documents (e.g., microblogs) it can be configured so text units span multiple documents. |
+| entity_ids        | str[] | List of entities found in the text unit. |
+| relationships_ids | str[] | List of relationships found in the text unit. |
+| covariate_ids     | str[] | Optional list of covariates found in the text unit. |
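
Note on the schema tables in `docs/index/outputs.md` above: a minimal sketch of how the documented columns relate, using only the Python standard library. The sample rows, the helper names `join_nodes_to_entities` and `add_combined_degree`, and the way degree is counted here (one per edge row) are assumptions for illustration. They are not taken from the GraphRAG codebase, and the pipeline may compute these values differently.

```python
from collections import Counter

# Hypothetical sample rows; the real tables carry more columns than shown here.
entities = [
    {"id": "e1", "title": "ACME", "type": "organization"},
    {"id": "e2", "title": "Jane Doe", "type": "person"},
]
# One node row per entity per community level, so "duplicate" entities are expected.
nodes = [
    {"id": "e1", "title": "ACME", "community": 0, "level": 0},
    {"id": "e1", "title": "ACME", "community": 3, "level": 1},
    {"id": "e2", "title": "Jane Doe", "community": -1, "level": 0},
]
relationships = [
    {"source": "ACME", "target": "Jane Doe", "weight": 2.0},
    {"source": "ACME", "target": "Globex", "weight": 1.0},
]


def join_nodes_to_entities(nodes, entities):
    """Attach each node's entity type by matching on the shared id field."""
    entity_by_id = {entity["id"]: entity for entity in entities}
    return [
        {**node, "type": entity_by_id[node["id"]]["type"]}
        for node in nodes
        if node["id"] in entity_by_id
    ]


def add_combined_degree(relationships):
    """Add combined_degree = degree(source) + degree(target) to each edge row."""
    degree = Counter()
    for rel in relationships:
        degree[rel["source"]] += 1
        degree[rel["target"]] += 1
    return [
        {**rel, "combined_degree": degree[rel["source"]] + degree[rel["target"]]}
        for rel in relationships
    ]


joined = join_nodes_to_entities(nodes, entities)
edges = add_combined_degree(relationships)
```

Here `joined` keeps one row per node entry, including the entity with community `-1`, and each row in `edges` gains a `combined_degree` derived from the edge list alone.
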
diff --git a/docs/index/overview.md b/docs/index/overview.md
@@ -39,35 +39,7 @@ yarn run:index --config your_pipeline.yml # custom config mode
 
 ### Python API
 
-```python
-from graphrag.index import run_pipeline
-from graphrag.index.config import PipelineWorkflowReference
-
-workflows: list[PipelineWorkflowReference] = [
-    PipelineWorkflowReference(
-        steps=[
-            {
-                # built-in verb
-                "verb": "derive",  # https://github.com/microsoft/datashaper/blob/main/python/datashaper/datashaper/verbs/derive.py
-                "args": {
-                    "column1": "col1",  # from above
-                    "column2": "col2",  # from above
-                    "to": "col_multiplied",  # new column name
-                    "operator": "*",  # multiply the two columns
-                },
-                # Since we're trying to act on the default input, we don't need explicitly to specify an input
-            }
-        ]
-    ),
-]
-
-dataset = pd.DataFrame([{"col1": 2, "col2": 4}, {"col1": 5, "col2": 10}])
-outputs = []
-async for output in await run_pipeline(dataset=dataset, workflows=workflows):
-    outputs.append(output)
-pipeline_result = outputs[-1]
-print(pipeline_result)
-```
+Please see the [examples folder](https://github.com/microsoft/graphrag/blob/main/examples/README.md) for a handful of functional pipelines illustrating how to create and run via a custom settings.yml or through custom python scripts.
 
 ## Further Reading
 

diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css
@@ -3,13 +3,17 @@
     --md-code-hl-color: #3772d9;
     --md-code-hl-comment-color: #6b6b6b;
     --md-code-hl-operator-color: #6b6b6b;
+    --md-footer-fg-color--light: #ffffff;
+    --md-footer-fg-color--lighter: #ffffff;
 }
 
 [data-md-color-scheme="slate"] {
     --md-primary-fg-color: #364499;
     --md-code-hl-color: #246be5;
     --md-code-hl-constant-color: #9a89ed;
     --md-code-hl-number-color: #f16e5f;
+    --md-footer-fg-color--light: #ffffff;
+    --md-footer-fg-color--lighter: #ffffff;
 }
 
 .md-tabs__item--active {

diff --git a/examples/custom_set_of_available_verbs/__init__.py b/examples/custom_set_of_available_verbs/__init__.py
diff --git a/examples/custom_set_of_available_verbs/custom_verb_definitions.py b/examples/custom_set_of_available_verbs/custom_verb_definitions.py
diff --git a/examples/custom_set_of_available_verbs/pipeline.yml b/examples/custom_set_of_available_verbs/pipeline.yml