Update plots for Nov/Dec 2023 crawl (CC-MAIN-2023-50)
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
jnioche committed Dec 15, 2023
1 parent 6fe61aa commit 85d3ec2
Showing 38 changed files with 5,236 additions and 4,786 deletions.
2 changes: 1 addition & 1 deletion _config.yml
@@ -1,6 +1,6 @@
title: Statistics of Common Crawl Monthly Archives
description: Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
-latest_crawl: CC-MAIN-2023-40
+latest_crawl: CC-MAIN-2023-50

show_navigation: True
navlist:
94 changes: 47 additions & 47 deletions plots/charsets-top-100.html
@@ -2,9 +2,9 @@
<thead>
<tr style="text-align: right;">
<th>crawl</th>
<th>CC-MAIN-2023-14</th>
<th>CC-MAIN-2023-23</th>
<th>CC-MAIN-2023-40</th>
<th>CC-MAIN-2023-50</th>
</tr>
<tr>
<th>charset</th>
@@ -18,19 +18,19 @@
<th>&lt;other&gt;</th>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0001</td>
</tr>
<tr>
<th>&lt;unknown&gt;</th>
<td>1.8888</td>
<td>1.8120</td>
<td>1.7751</td>
<td>1.9997</td>
</tr>
<tr>
<th>Big5</th>
<td>0.0690</td>
<td>0.0686</td>
<td>0.0622</td>
<td>0.0610</td>
</tr>
<tr>
<th>Big5-HKSCS</th>
@@ -40,51 +40,51 @@
</tr>
<tr>
<th>EUC-JP</th>
<td>0.1089</td>
<td>0.1109</td>
<td>0.1089</td>
<td>0.1110</td>
</tr>
<tr>
<th>EUC-KR</th>
<td>0.0819</td>
<td>0.0859</td>
<td>0.0832</td>
<td>0.0957</td>
</tr>
<tr>
<th>GB18030</th>
<td>0.0190</td>
<td>0.0172</td>
<td>0.0166</td>
<td>0.0204</td>
</tr>
<tr>
<th>GB2312</th>
<td>0.4133</td>
<td>0.2703</td>
<td>0.2485</td>
<td>0.3646</td>
</tr>
<tr>
<th>GBK</th>
<td>0.1191</td>
<td>0.1106</td>
<td>0.0975</td>
<td>0.1331</td>
</tr>
<tr>
<th>IBM420</th>
<td>0.0060</td>
<td>0.0059</td>
<td>0.0060</td>
<td>0.0055</td>
</tr>
<tr>
<th>IBM424</th>
<td>0.0019</td>
<td>0.0017</td>
<td>0.0023</td>
<td>0.0034</td>
</tr>
<tr>
<th>IBM500</th>
<td>0.0008</td>
<td>0.0013</td>
<td>0.0007</td>
<td>0.0008</td>
</tr>
<tr>
<th>IBM855</th>
@@ -95,61 +95,61 @@
<tr>
<th>IBM866</th>
<td>0.0003</td>
<td>0.0003</td>
<td>0.0002</td>
<td>0.0002</td>
</tr>
<tr>
<th>ISO-2022-JP</th>
<td>0.0008</td>
<td>0.0009</td>
<td>0.0008</td>
<td>0.0011</td>
</tr>
<tr>
<th>ISO-8859-1</th>
<td>2.3740</td>
<td>2.3840</td>
<td>2.2454</td>
<td>2.2951</td>
</tr>
<tr>
<th>ISO-8859-13</th>
<td>0.0000</td>
<td>0.0001</td>
<td>0.0001</td>
<td>0.0000</td>
</tr>
<tr>
<th>ISO-8859-15</th>
<td>0.0622</td>
<td>0.0600</td>
<td>0.0584</td>
<td>0.0553</td>
</tr>
<tr>
<th>ISO-8859-16</th>
<td>0.0002</td>
<td>0.0001</td>
<td>0.0002</td>
<td>0.0002</td>
</tr>
<tr>
<th>ISO-8859-2</th>
<td>0.1312</td>
<td>0.1320</td>
<td>0.1236</td>
<td>0.1236</td>
</tr>
<tr>
<th>ISO-8859-3</th>
<td>0.0003</td>
<td>0.0005</td>
<td>0.0005</td>
<td>0.0005</td>
</tr>
<tr>
<th>ISO-8859-4</th>
<td>0.0008</td>
<td>0.0009</td>
<td>0.0011</td>
<td>0.0008</td>
</tr>
<tr>
<th>ISO-8859-5</th>
<td>0.0032</td>
<td>0.0032</td>
<td>0.0028</td>
<td>0.0028</td>
</tr>
<tr>
@@ -161,134 +161,134 @@
<tr>
<th>ISO-8859-7</th>
<td>0.0101</td>
<td>0.0101</td>
<td>0.0086</td>
<td>0.0084</td>
</tr>
<tr>
<th>ISO-8859-8</th>
<td>0.0004</td>
<td>0.0003</td>
<td>0.0005</td>
<td>0.0007</td>
</tr>
<tr>
<th>ISO-8859-9</th>
<td>0.0222</td>
<td>0.0240</td>
<td>0.0220</td>
<td>0.0264</td>
</tr>
<tr>
<th>KOI8-R</th>
<td>0.0071</td>
<td>0.0063</td>
<td>0.0060</td>
<td>0.0064</td>
</tr>
<tr>
<th>KOI8-U</th>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0001</td>
</tr>
<tr>
<th>Shift_JIS</th>
<td>0.1809</td>
<td>0.1764</td>
<td>0.1604</td>
<td>0.1953</td>
</tr>
<tr>
<th>TIS-620</th>
<td>0.0072</td>
<td>0.0072</td>
<td>0.0074</td>
<td>0.0062</td>
</tr>
<tr>
<th>US-ASCII</th>
<td>0.0291</td>
<td>0.0333</td>
<td>0.0272</td>
<td>0.0323</td>
</tr>
<tr>
<th>UTF-16</th>
<td>0.0030</td>
<td>0.0032</td>
<td>0.0034</td>
<td>0.0034</td>
</tr>
<tr>
<th>UTF-16BE</th>
<td>0.0005</td>
<td>0.0001</td>
<td>0.0008</td>
<td>0.0005</td>
</tr>
<tr>
<th>UTF-16LE</th>
<td>0.0014</td>
<td>0.0014</td>
<td>0.0014</td>
<td>0.0019</td>
</tr>
<tr>
<th>UTF-32</th>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0001</td>
<td>0.0001</td>
</tr>
<tr>
<th>UTF-32LE</th>
<td>0.0005</td>
<td>0.0008</td>
<td>0.0006</td>
<td>0.0006</td>
</tr>
<tr>
<th>UTF-8</th>
<td>93.4933</td>
<td>93.7245</td>
<td>94.0352</td>
<td>93.5115</td>
</tr>
<tr>
<th>windows-1250</th>
<td>0.0824</td>
<td>0.0829</td>
<td>0.0822</td>
<td>0.0758</td>
</tr>
<tr>
<th>windows-1251</th>
<td>0.5866</td>
<td>0.5696</td>
<td>0.5314</td>
<td>0.5618</td>
</tr>
<tr>
<th>windows-1252</th>
<td>0.1951</td>
<td>0.1926</td>
<td>0.1830</td>
<td>0.2031</td>
</tr>
<tr>
<th>windows-1253</th>
<td>0.0033</td>
<td>0.0032</td>
<td>0.0031</td>
<td>0.0029</td>
</tr>
<tr>
<th>windows-1254</th>
<td>0.0109</td>
<td>0.0102</td>
<td>0.0102</td>
<td>0.0123</td>
</tr>
<tr>
<th>windows-1255</th>
<td>0.0038</td>
<td>0.0037</td>
<td>0.0041</td>
<td>0.0069</td>
</tr>
<tr>
<th>windows-1256</th>
<td>0.0582</td>
<td>0.0603</td>
<td>0.0552</td>
<td>0.0478</td>
</tr>
<tr>
<th>windows-1257</th>
<td>0.0104</td>
<td>0.0113</td>
<td>0.0111</td>
<td>0.0096</td>
</tr>
<tr>
<th>windows-1258</th>
@@ -305,19 +305,19 @@
<tr>
<th>x-IBM949</th>
<td>0.0001</td>
<td>0.0001</td>
<td>0.0000</td>
<td>0.0000</td>
</tr>
<tr>
<th>x-windows-874</th>
<td>0.0105</td>
<td>0.0107</td>
<td>0.0108</td>
<td>0.0102</td>
</tr>
<tr>
<th>x-windows-949</th>
<td>0.0000</td>
<td>0.0000</td>
<td>0.0001</td>
<td>0.0001</td>
</tr>
</tbody>