fix(WEB-8245): Clean data - remove ports without country/location & convert name to valid strings (#49)

* Use raw files when parsing code lists & clean name

* Switch mix project package.files to default

Reference: https://hex.pm/docs/publish#adding-metadata-to-code-classinlinemixexscode

* Update CHANGELOG.md

* Fix: Interchange status & function columns

* Remove unnecessary opts from CSVParser
dgigafox committed Jun 25, 2024
1 parent c05e638 commit dd7cb59
Showing 9 changed files with 116,489 additions and 116,439 deletions.
17 changes: 14 additions & 3 deletions CHANGELOG.md
@@ -1,7 +1,18 @@
# Changelog for Ports

## [Unreleased] (2024-06-25)

- Use UNLOCODE raw data instead of manually compiled csv to prevent unnecessary encoding
- Remove ports without country or location
- Convert `port.name` to valid string
- Remove mix project package.files to use default project directory instead

## v0.1.2 (2024-06-21)

- Update code list source to v2023-2

## v0.1.1 (2021-01-20)

* Change csv parser from [csv](https://github.com/beatrichartz/csv) to [nimble_csv](https://github.com/dashbitco/nimble_csv)
* Include files in `priv/data` to package
* Load lists during compilation time for faster response
- Change csv parser from [csv](https://github.com/beatrichartz/csv) to [nimble_csv](https://github.com/dashbitco/nimble_csv)
- Include files in `priv/data` to package
- Load lists during compilation time for faster response
3 changes: 2 additions & 1 deletion README.md
@@ -18,4 +18,5 @@ end
## Updating

1. Download the UNLOCODE csv from [UNECE](https://unece.org/trade/cefact/UNLOCODE-Download)
2. If there are multiple csv files, compile them into one and save it to `priv/data/code-list.csv`
2. Save the downloaded files to folder `priv/data`
3. Update module attribute `@code_list_sources` in `Ports.Loader` with the downloaded file names
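Because the loader now reads every file named in `@code_list_sources` as one continuous stream, the multi-part download no longer has to be merged by hand. A minimal sketch of that concatenation, assuming made-up file names and contents in a temporary directory, and using a naive comma split in place of the real NimbleCSV parser:

```elixir
# Hypothetical sample files; the real loader reads from priv/data instead.
dir = Path.join(System.tmp_dir!(), "ports_concat_sketch")
File.mkdir_p!(dir)

parts = [
  {"part1.csv", "NL,RTM,Rotterdam\n"},
  {"part2.csv", "SG,SIN,Singapore\nUS,LAX,Los Angeles\n"}
]

for {name, body} <- parts, do: File.write!(Path.join(dir, name), body)

rows =
  parts
  |> Enum.map(fn {name, _body} -> dir |> Path.join(name) |> File.stream!() end)
  # Stream.concat/1 lazily chains the per-file line streams in list order.
  |> Stream.concat()
  |> Stream.map(&String.trim_trailing(&1, "\n"))
  # Naive split for illustration only; it does not handle quoted fields.
  |> Enum.map(&String.split(&1, ","))

IO.inspect(rows)
# [["NL", "RTM", "Rotterdam"], ["SG", "SIN", "Singapore"], ["US", "LAX", "Los Angeles"]]

File.rm_rf!(dir)
```

The order of the file names in the list decides the order of the rows, so the parts should be listed in sequence.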
41 changes: 35 additions & 6 deletions lib/loader.ex
@@ -11,8 +11,20 @@ defmodule Ports.Loader do

NimbleCSV.define(CSVParser, separator: ",", escape: "\"")

for path <- Path.wildcard("priv/data/*.csv") do
@external_resource path
end

defguard is_empty(data) when data in [nil, ""]

@code_list_sources [
"2023-2 UNLOCODE CodeListPart1.csv",
"2023-2 UNLOCODE CodeListPart2.csv",
"2023-2 UNLOCODE CodeListPart3.csv"
]

def load_code_list do
"code-list.csv"
@code_list_sources
|> csv_decode()
|> Stream.map(fn [
_change,
@@ -21,8 +33,8 @@ defmodule Ports.Loader do
name,
_name_wo_diacritics,
subdivision,
status,
function,
status,
date,
iata,
coordinates,
@@ -40,9 +52,18 @@ defmodule Ports.Loader do
coordinates: coordinates
}
end)
|> Stream.reject(fn
%{country: country, location: location} when is_empty(country) or is_empty(location) -> true
_ -> false
end)
|> Stream.map(fn port -> Map.update!(port, :name, &to_valid_string/1) end)
|> Enum.to_list()
end

defp to_valid_string(string) do
string |> :binary.bin_to_list() |> :unicode.characters_to_binary()
end

def load_countries do
"country-codes.csv"
|> csv_decode()
@@ -105,10 +126,18 @@ defmodule Ports.Loader do
|> Enum.to_list()
end

defp csv_decode(file_name) do
[:code.priv_dir(:ports), "data", file_name]
|> Path.join()
|> File.stream!()
defp csv_decode(file_names) do
file_names = List.wrap(file_names)

streams =
Enum.map(file_names, fn file_name ->
[:code.priv_dir(:ports), "data", file_name]
|> Path.join()
|> File.stream!()
end)

streams
|> Stream.concat()
|> CSVParser.parse_stream(skip_headers: false)
end
end
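Review note on `to_valid_string/1`: the two calls read each byte as one code point and then encode those code points as UTF-8, which amounts to a Latin-1 to UTF-8 conversion. A minimal sketch, assuming the raw files are Latin-1 encoded (the commit does not state the encoding); the module name and the sample bytes are hypothetical:

```elixir
defmodule Latin1Sketch do
  # Same two calls as the commit's to_valid_string/1: every byte becomes one
  # code point (0..255), and the code point list is then encoded as UTF-8.
  def to_valid_string(binary) when is_binary(binary) do
    binary |> :binary.bin_to_list() |> :unicode.characters_to_binary()
  end
end

# "Malmö" as Latin-1 bytes: 0xF6 is "ö" in Latin-1 (hypothetical sample).
latin1 = <<"Malm", 0xF6>>
utf8 = Latin1Sketch.to_valid_string(latin1)

IO.inspect(String.valid?(latin1))   # false
IO.inspect(String.valid?(utf8))     # true
IO.inspect(utf8 == "Malmö")         # true

# Equivalent conversion with the input encoding stated explicitly.
IO.inspect(:unicode.characters_to_binary(latin1, :latin1) == utf8)   # true

# Caveat: input that is already UTF-8 gets encoded a second time.
IO.inspect(Latin1Sketch.to_valid_string("ö") == "ö")   # false
```

The last line is the trade-off to keep in mind: the conversion is only safe as long as none of the source files is already UTF-8.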
12 changes: 0 additions & 12 deletions mix.exs
@@ -49,18 +49,6 @@ defmodule Ports.MixProject do

defp package do
[
files: [
"lib",
"mix.exs",
"README*",
"CHANGELOG*",
"LICENSE*",
"priv/data/code-list.csv",
"priv/data/country-codes.csv",
"priv/data/function-classifiers.csv",
"priv/data/status-indicators.csv",
"priv/data/subdivision-codes.csv"
],
maintainers: ["Martide"],
licenses: ["Apache-2.0"],
links: %{"Github" => @source_url}
54,732 changes: 54,732 additions & 0 deletions priv/data/2023-2 UNLOCODE CodeListPart1.csv

Large diffs are not rendered by default.

27,704 changes: 27,704 additions & 0 deletions priv/data/2023-2 UNLOCODE CodeListPart2.csv

Large diffs are not rendered by default.

33,983 changes: 33,983 additions & 0 deletions priv/data/2023-2 UNLOCODE CodeListPart3.csv

Large diffs are not rendered by default.

116,416 changes: 0 additions & 116,416 deletions priv/data/code-list.csv

This file was deleted.

20 changes: 19 additions & 1 deletion test/ports_test.exs
@@ -5,7 +5,25 @@ defmodule PortsTest do
describe "ports" do
test "all/0 get all ports" do
ports = Ports.all()
assert Enum.count(ports) == 116_415
assert Enum.count(ports) == 116_074
end

test "all/0 returns ports with country and location" do
assert Enum.all?(Ports.all(), fn port ->
port.country not in ["", nil] and port.location not in ["", nil]
end)
end

test "all/0 returns ports having name with proper characters" do
assert Enum.all?(Ports.all(), fn port ->
!String.contains?(port.name, "�")
end)
end

test "all/0 returns ports having name with valid string" do
assert Enum.all?(Ports.all(), fn port ->
String.valid?(port.name)
end)
end
end
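Review note on the "country and location" test: it mirrors the `is_empty/1` guard used in the loader's `Stream.reject/2` step. A minimal sketch of that filter on its own, assuming a hypothetical module, a hypothetical `incomplete?/1` helper (the commit inlines the clause in an anonymous function), and made-up sample ports:

```elixir
defmodule EmptyGuardSketch do
  # Same guard as in the commit: true only for nil or the empty string.
  defguard is_empty(data) when data in [nil, ""]

  # Hypothetical helper standing in for the anonymous function in the loader.
  def incomplete?(%{country: c, location: l}) when is_empty(c) or is_empty(l), do: true
  def incomplete?(_port), do: false
end

# Made-up sample data, not taken from the code list.
ports = [
  %{country: "NL", location: "RTM", name: "Rotterdam"},
  %{country: "NL", location: "", name: "No location"},
  %{country: nil, location: "XYZ", name: "No country"}
]

kept = Enum.reject(ports, &EmptyGuardSketch.incomplete?/1)
IO.inspect(Enum.map(kept, & &1.name))
# ["Rotterdam"]
```

Note that the guard treats a whitespace-only value such as `" "` as non-empty, so such rows would pass both the filter and the test.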

