CXuesong.MW.MwParserFromScratch
A .NET library for parsing wikitext into an AST. The repository is still under development, but it can already handle most wikitext syntax.
FuGet Gallery: See library classes and API documentation.
This package is now on NuGet. You can install it using one of the following commands:
# Package Management Console
Install-Package CXuesong.MW.MwParserFromScratch -Pre
# .NET CLI
dotnet add package CXuesong.MW.MwParserFromScratch -v 3.0.0-int.6
After adding a reference to this library, import the namespaces:
using MwParserFromScratch;
using MwParserFromScratch.Nodes;
Then pass the text to the parser:
var parser = new WikitextParser();
var text = "Paragraph.\n* Item1\n* Item2\n";
var ast = parser.Parse(text);
Now ast contains the Wikitext instance, the root of the AST.
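As a small illustration of walking the tree, the sketch below lists the names of all templates in a snippet. It only uses calls that appear elsewhere in this README (EnumDescendants, Template.Name, ToPlainText); the class name TemplateLister and the sample wikitext are made up for the example.

```csharp
using System;
using System.Linq;
using MwParserFromScratch;
using MwParserFromScratch.Nodes;

// Hypothetical demo class; not part of the library.
class TemplateLister
{
    static void Main()
    {
        var parser = new WikitextParser();
        // Sample input invented for this sketch.
        var ast = parser.Parse("{{Cleanup}}\nSome ''text''.\n{{Reflist}}\n");
        // Walk every node under the root and keep only the templates.
        foreach (var template in ast.EnumDescendants().OfType<Template>())
        {
            // ToPlainText() gives the user-friendly name; ToString() would give the raw wikitext.
            Console.WriteLine(template.Name.ToPlainText());
        }
    }
}
```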
You can also take a look at ConsoleTestApplication1, which contains some demos. SimpleDemo illustrates how to search and replace in the AST.
static void SimpleDemo()
{
    // Fills the missing template parameters.
    var parser = new WikitextParser();
    var templateNames = new [] {"Expand section", "Cleanup"};
    var text = @"==Hello==<!--comment-->
{{Expand section|
date=2010-10-05
}}
{{Cleanup}}
This is a nice '''paragraph'''.
==References==
{{Reflist}}
";
    var ast = parser.Parse(text);
    // Convert the code snippets to nodes
    var dateName = parser.Parse("date");
    var dateValue = parser.Parse(DateTime.Now.ToString("yyyy-MM-dd"));
    Console.WriteLine("Issues:");
    // Search and set
    foreach (var t in ast.EnumDescendants().OfType<Template>()
        .Where(t => templateNames.Contains(MwParserUtility.NormalizeTemplateArgumentName(t.Name))))
    {
        // Get the argument by name.
        var date = t.Arguments["date"];
        if (date != null)
        {
            // To print the wikitext instead of user-friendly text, use ToString()
            Console.WriteLine("{0} ({1})", t.Name.ToPlainText(), date.Value.ToPlainText());
        }
        // Update/Add the argument
        t.Arguments.SetValue(dateName, dateValue);
    }
    Console.WriteLine();
    Console.WriteLine("Wikitext:");
    Console.WriteLine(ast.ToString());
}
The console output is as follows:
Issues:
Expand section (2010-10-05)
Wikitext:
==Hello==<!--comment-->
{{Expand section|
date=2017-02-26}}
{{Cleanup|date=2017-02-26}}
This is a nice '''paragraph'''.
==References==
{{Reflist}}
ParseAndPrint prints a rough outline of the parsed tree. Here's an example session:
Please input the wikitext to parse, use EOF (Ctrl+Z) to accept:
==Hello==
* ''Item1''
* [[Item2]]
---------
<span style="background:red;">test</span>
^Z
Parsed AST
Wikitext [==Hello==\r\n* ''Item1]
.Paragraph [==Hello==\r]
..PlainText [==Hello==\r]
.ListItem [* ''Item1''\r]
..PlainText [ ]
..FormatSwitch ['']
..PlainText [Item1]
..FormatSwitch ['']
..PlainText [\r]
.ListItem [* [[Item2]]\r]
..PlainText [ ]
..WikiLink [[[Item2]]]
...Run [Item2]
....PlainText [Item2]
..PlainText [\r]
.ListItem [---------\r]
..PlainText [\r]
.Paragraph [<span style="backgro]
..HtmlTag [<span style="backgro]
...TagAttribute [ style="background:r]
....Run [style]
.....PlainText [style]
....Wikitext [background:red;]
.....Paragraph [background:red;]
......PlainText [background:red;]
...Wikitext [test]
....Paragraph [test]
.....PlainText [test]
..PlainText [\r\n]
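A tree dump like the one above can be produced with a short recursive walk. The following is only a sketch, not the demo's actual code: it assumes that nodes expose an EnumChildren() method (not shown elsewhere in this README), and the names AstPrinter, PrintAst and Escape are hypothetical. It truncates each node's wikitext to 20 characters, which matches the truncated snippets in the session above.

```csharp
using System;
using MwParserFromScratch;
using MwParserFromScratch.Nodes;

// Hypothetical helper class; not part of the library.
static class AstPrinter
{
    // Make line breaks visible so each node stays on one output line.
    static string Escape(string s) => s.Replace("\r", "\\r").Replace("\n", "\\n");

    public static void PrintAst(Node node, int level)
    {
        var indent = new string('.', level);
        var wikitext = node.ToString();
        // Show at most the first 20 characters of the node's wikitext.
        var snippet = wikitext.Substring(0, Math.Min(20, wikitext.Length));
        Console.WriteLine("{0}{1} [{2}]", indent, node.GetType().Name, Escape(snippet));
        // Assumption: EnumChildren() yields the direct child nodes.
        foreach (var child in node.EnumChildren())
            PrintAst(child, level + 1);
    }

    static void Main()
    {
        var parser = new WikitextParser();
        PrintAst(parser.Parse("==Hello==\n* ''Item1''\n"), 0);
    }
}
```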
You can use the MediaWiki API to acquire the wikitext. For .NET programmers, I've made a client, WikiClientLibrary, which lives alongside this repository. Other MediaWiki API clients are listed at API:Client code.
ConsoleTestApplication1 also contains a simple demo that fetches and parses a page without depending on WikiClientLibrary:
/// <summary>
/// Fetches a page from en Wikipedia, and parses it.
/// </summary>
private static Wikitext FetchAndParse(string title)
{
    if (title == null) throw new ArgumentNullException(nameof(title));
    const string EndPointUrl = "https://en.wikipedia.org/w/api.php";
    var client = new HttpClient();
    // Ask for the content of the latest revision of the page.
    var requestContent = new Dictionary<string, string>
    {
        {"format", "json"},
        {"action", "query"},
        {"prop", "revisions"},
        {"rvlimit", "1"},
        {"rvprop", "content"},
        {"titles", title}
    };
    // Blocks on the async calls to keep the demo synchronous.
    var response = client.PostAsync(EndPointUrl, new FormUrlEncodedContent(requestContent)).Result;
    var root = JObject.Parse(response.Content.ReadAsStringAsync().Result);
    // The wikitext is stored under the "*" key of the first revision.
    var content = (string) root["query"]["pages"].Children<JProperty>().First().Value["revisions"][0]["*"];
    var parser = new WikitextParser();
    return parser.Parse(content);
}
You may need the Newtonsoft.Json NuGet package to parse JSON.
- For now it does not support table syntax, but I'll work on this.
- Text inside parser tags (rather than normal HTML tags) will not be parsed and will be preserved in ParserTag.Content. For certain parser tags (e.g. <ref>), you can parse the Content again to get the AST.
- It may handle some pathological cases differently from the MediaWiki parser, e.g. {{{{{arg}} (see Issue #1).
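Re-parsing the content of a parser tag, as mentioned in the second note above, might look like the sketch below. It is written under two assumptions that this README does not confirm: that <ref> is recognized as a parser tag with the default parser settings, and that ParserTag exposes the tag name as a string property Name. The class name RefContentDemo and the sample wikitext are made up for the example.

```csharp
using System;
using System.Linq;
using MwParserFromScratch;
using MwParserFromScratch.Nodes;

// Hypothetical demo class; not part of the library.
class RefContentDemo
{
    static void Main()
    {
        var parser = new WikitextParser();
        // Sample input invented for this sketch.
        var ast = parser.Parse("A fact.<ref>See [[Source page]] for details.</ref>\n");
        foreach (var tag in ast.EnumDescendants().OfType<ParserTag>())
        {
            // Assumption: Name holds the tag name as a string.
            if (!string.Equals(tag.Name, "ref", StringComparison.OrdinalIgnoreCase)) continue;
            // Self-closing tags have no content to parse.
            if (string.IsNullOrEmpty(tag.Content)) continue;
            // Content is preserved as raw text, so parse it a second time.
            var inner = parser.Parse(tag.Content);
            foreach (var link in inner.EnumDescendants().OfType<WikiLink>())
                Console.WriteLine(link.ToString());
        }
    }
}
```

The second parse produces an independent tree, so changes made to it are not reflected in the outer AST unless the updated wikitext is written back to ParserTag.Content.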