Add output headers (#80)
* output - add header row to CSV files
* documentation: make clear output is tab delimited
mtmail authored May 5, 2024
1 parent c9c3b36 commit f798ce5
Showing 2 changed files with 29 additions and 26 deletions.
44 changes: 22 additions & 22 deletions README.md
@@ -51,8 +51,8 @@ retries (wikidata API being unreliable) was added.

## Output data

-`wikimedia_importance.csv.gz` contains about 17 million rows. Number of lines grew 2% between 2022 and 2023. The file
-is sorted.
+`wikimedia_importance.csv.gz` contains about 17 million rows. Number of lines grew 2% between 2022 and 2023.
+The file tab delimited, not quoted, is sorted and contains a header row.

| Column | Type |
| ----------- | ---------------- |
@@ -89,36 +89,36 @@ Examples of `wikimedia_importance.csv.gz` rows:
* Wikipedia contains redirects, so a single wikidata object can have multiple titles even though. Each title has the same importance score. Redirects to non-existing articles are removed.

```
-en,a,Brandenburg_Gate,0.5531125195487524,Q82425
-en,r,Berlin's_Gate,0.5531125195487524,Q82425
-en,r,Brandenberg_Gate,0.5531125195487524,Q82425
-en,r,Brandenburger_gate,0.5531125195487524,Q82425
-en,r,Brandenburger_Gate,0.5531125195487524,Q82425
-en,r,Brandenburger_Tor,0.5531125195487524,Q82425
-en,r,Brandenburg_gate,0.5531125195487524,Q82425
-en,r,BRANDENBURG_GATE,0.5531125195487524,Q82425
-en,r,Brandenburg_Gates,0.5531125195487524,Q82425
-en,r,Brandenburg_Tor,0.5531125195487524,Q82425
+en a Brandenburg_Gate 0.5531125195487524 Q82425
+en r Berlin's_Gate 0.5531125195487524 Q82425
+en r Brandenberg_Gate 0.5531125195487524 Q82425
+en r Brandenburger_gate 0.5531125195487524 Q82425
+en r Brandenburger_Gate 0.5531125195487524 Q82425
+en r Brandenburger_Tor 0.5531125195487524 Q82425
+en r Brandenburg_gate 0.5531125195487524 Q82425
+en r BRANDENBURG_GATE 0.5531125195487524 Q82425
+en r Brandenburg_Gates 0.5531125195487524 Q82425
+en r Brandenburg_Tor 0.5531125195487524 Q82425
```

* Wikipedia titles contain underscores instead of space, e.g. [Alford,_Massachusetts](https://en.wikipedia.org/wiki/Alford,_Massachusetts)

```
-en,a,"Alford,_Massachusetts",0.36590368314334637,Q2431901
-en,r,"Alford,_ma",0.36590368314334637,Q2431901
-en,r,"Alford,_MA",0.36590368314334637,Q2431901
-en,r,"Alford,_Mass",0.36590368314334637,Q2431901
+en a "Alford _Massachusetts" 0.36590368314334637 Q2431901
+en r "Alford _ma" 0.36590368314334637 Q2431901
+en r "Alford _MA" 0.36590368314334637 Q2431901
+en r "Alford _Mass" 0.36590368314334637 Q2431901
```

* The highest score article is the [United States](https://en.wikipedia.org/wiki/United_States)

```
-pl,a,Stany_Zjednoczone,1,Q30
-en,a,United_States,1,Q30
-ru,a,Соединённые_Штаты_Америки,1,Q30
-hu,a,Amerikai_Egyesült_Államok,1,Q30
-it,a,Stati_Uniti_d'America,1,Q30
-de,a,Vereinigte_Staaten,1,Q30
+pl a Stany_Zjednoczone 1 Q30
+en a United_States 1 Q30
+ru a Соединённые_Штаты_Америки 1 Q30
+hu a Amerikai_Egyesült_Államok 1 Q30
+it a Stati_Uniti_d'America 1 Q30
+de a Vereinigte_Staaten 1 Q30
...
```
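
For readers consuming the file after this change, the following is a minimal sketch of skipping the new header row and filtering on a column. It builds a tiny sample file rather than touching the real output. The header column names (`language`, `type`, `title`, `importance`, `wikidata_id`) are assumptions for illustration, since the column table is collapsed in this view, and reading `a` as "article" and `r` as "redirect" is inferred from the example rows.

```shell
#!/bin/sh
# Sketch only: the header names below are assumed, not taken from the README.
set -eu
SAMPLE=$(mktemp)

# Build a small tab-delimited, gzip-compressed sample with a header row.
{
  printf 'language\ttype\ttitle\timportance\twikidata_id\n'
  printf 'en\ta\tBrandenburg_Gate\t0.5531125195487524\tQ82425\n'
  printf 'en\tr\tBrandenburger_Tor\t0.5531125195487524\tQ82425\n'
} | gzip -9 > "$SAMPLE"

# NR > 1 skips the header row; $2 == "a" keeps rows whose second
# column is "a" (assumed to mean article rather than redirect).
ARTICLES=$(gzip -dc < "$SAMPLE" | awk -F '\t' 'NR > 1 && $2 == "a" { print $3 }')
printf '%s\n' "$ARTICLES"

rm -f "$SAMPLE"
```

Consumers that previously read every line as data would need the equivalent of the `NR > 1` guard, otherwise the header row is parsed as a record.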

11 changes: 7 additions & 4 deletions steps/output.sh
@@ -126,10 +126,13 @@ for TABLE in wikipedia_article wikipedia_redirect wikimedia_importance
do
echo "* $TABLE.csv.gz"

-echo "COPY $TABLE TO STDOUT" | \
-    psqlcmd | \
-    sort | \
-    pigz -9 > "$OUTPUT_PATH/$TABLE.csv.gz"
+{
+    echo "COPY (SELECT * FROM $TABLE LIMIT 0) TO STDOUT WITH DELIMITER E'\t' CSV HEADER" | \
+        psqlcmd
+    echo "COPY $TABLE TO STDOUT" | \
+        psqlcmd | \
+        sort
+} | pigz -9 > "$OUTPUT_PATH/$TABLE.csv.gz"

# default is 600
chmod 644 "$OUTPUT_PATH/$TABLE.csv.gz"
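
The brace group in this hunk appears to exist so that the header line bypasses `sort` while still ending up in the same compressed stream as the data. A minimal sketch of that pattern, runnable without a database: `print_header` and `print_rows` are hypothetical stand-ins for the two `psqlcmd` calls, and `gzip` is used in place of `pigz` only so the sketch runs where `pigz` is not installed.

```shell
#!/bin/sh
# Sketch of the "header first, sorted body after" pattern.
set -eu
OUT=$(mktemp)

# Hypothetical stand-ins for the two psqlcmd invocations.
print_header() { printf 'title\timportance\n'; }
print_rows() { printf 'b\t0.2\na\t0.9\n'; }

{
  print_header        # written first, never passes through sort
  print_rows | sort   # only the data rows are sorted
} | gzip -9 > "$OUT"

RESULT=$(gzip -dc < "$OUT")
printf '%s\n' "$RESULT"

rm -f "$OUT"
```

Had the header been piped through `sort` together with the rows, its position would depend on the collation order of the column names relative to the data, so it would not reliably be the first line.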