docs: Updated README with examples and change log.
defnull committed Sep 20, 2024
1 parent 3c1515b commit be5de45
Showing 1 changed file with 112 additions and 19 deletions: README.rst
Parser for multipart/form-data
==============================

This module provides multiple parsers for RFC-7578 ``multipart/form-data``, both
low-level for framework authors and high-level for WSGI application developers:

* ``PushMultipartParser``: A low-level incremental `SansIO <https://sans-io.readthedocs.io/>`_
(non-blocking) parser suitable for asyncio and other time or memory constrained
environments.
* ``MultipartParser``: A streaming parser emitting memory- and disk-buffered
``MultipartPart`` instances.
* ``parse_form_data``: A helper function to parse both ``multipart/form-data``
and ``application/x-www-form-urlencoded`` form submissions from a
`WSGI <https://peps.python.org/pep-3333/>`_ environment.

Installation
------------

``pip install multipart``
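
A quick way to check the installation is to import the module and print its
version (a minimal sketch; the ``__version__`` attribute is an assumption here
and may not be present in every release):

.. code-block:: python

    import multipart
    print(multipart.__version__)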

Features
--------

* Pure python single file module with no dependencies.
* 100% test coverage. Tested with inputs as seen from actual browsers and HTTP clients.
* Parses multiple GB/s on modern hardware (quick tests, no proper benchmark).
* Quickly rejects malicious or broken inputs and emits useful error messages.
* Enforces configurable memory and disk resource limits to prevent DoS attacks.

Limitations
-----------

This parser implements ``multipart/form-data`` as it is used by actual modern
browsers and HTTP clients, which means:

* Just ``multipart/form-data``, not suitable for email parsing
* No ``multipart/mixed`` support (RFC 2388, deprecated in RFC 7578)
* No ``encoded-word`` encoding (RFC 2047, no one uses that)
* No ``base64`` or ``quoted-printable`` transfer encoding (not used)
* No ``name=_charset_`` support (discouraged in RFC 7578)

Usage and examples
------------------

For WSGI application developers we strongly suggest using the ``parse_form_data``
helper function. It accepts a WSGI ``environ`` dictionary and parses both types
of form submission (``multipart/form-data`` and ``application/x-www-form-urlencoded``)
based on the actual content type of the request. You'll get two ``MultiDict``
instances in return, one for text fields and the other for file uploads:

.. code-block:: python

    from multipart import parse_form_data

    def wsgi(environ, start_response):
        if environ["REQUEST_METHOD"] == "POST":
            forms, files = parse_form_data(environ)
            title = forms["title"]    # string
            upload = files["upload"]  # MultipartPart
            upload.save_as(...)
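
Malformed input or exceeded resource limits surface as exceptions rather than
silently truncated data. The following is a minimal sketch of defensive usage,
assuming that ``MultipartError`` is the exception type exported by the module
and that ``parse_form_data`` still accepts the ``strict`` keyword known from
earlier releases:

.. code-block:: python

    from multipart import parse_form_data, MultipartError

    def wsgi(environ, start_response):
        try:
            # strict=True rejects questionable input instead of silently
            # ignoring it; resource limits can be tuned via additional
            # keyword arguments (exact names may differ between releases).
            forms, files = parse_form_data(environ, strict=True)
        except MultipartError as err:
            start_response("400 Bad Request", [("Content-Type", "text/plain")])
            return [str(err).encode("utf8")]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [f"Got {len(forms)} fields and {len(files)} files".encode("utf8")]
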
The ``parse_form_data`` helper function internally uses ``MultipartParser``, a
streaming parser that reads from a ``multipart/form-data`` encoded binary data
stream and emits ``MultipartPart`` instances as soon as a part is fully parsed.
This is most useful if you want to consume the individual parts as soon as they
arrive, instead of waiting for the entire request to be parsed:

.. code-block:: python

    from multipart import parse_options_header, MultipartParser

    def wsgi(environ, start_response):
        assert environ["REQUEST_METHOD"] == "POST"
        ctype, copts = parse_options_header(environ.get("CONTENT_TYPE", ""))
        boundary = copts.get("boundary")
        charset = copts.get("charset", "utf8")
        assert ctype == "multipart/form-data"
        parser = MultipartParser(environ["wsgi.input"], boundary, charset)

        for part in parser:
            if part.filename:
                print(f"{part.name}: File upload ({part.size} bytes)")
                part.save_as(...)
            elif part.size < 1024:
                print(f"{part.name}: Text field ({part.value!r})")
            else:
                print(f"{part.name}: Text field, but too big to print :/")
The ``MultipartParser`` handles IO and file buffering for you, but does so using
blocking APIs. If you need absolute control over the parsing process and want to
avoid blocking IO at all costs, then have a look at ``PushMultipartParser``, the
low-level non-blocking incremental ``multipart/form-data`` parser that powers all
the other parsers in this library:

.. code-block:: python

    import asyncio
    from multipart import PushMultipartParser

    async def process_multipart(reader: asyncio.StreamReader, boundary: str):
        with PushMultipartParser(boundary) as parser:
            while not parser.closed:
                # Feed arbitrarily sized chunks to the parser and handle the
                # events it emits: a list of headers at the start of a segment,
                # bytearray chunks of segment data, and None at the segment end.
                chunk = await reader.read(1024*46)
                for event in parser.parse(chunk):
                    if isinstance(event, list):
                        print("== Start of segment")
                        for header, value in event:
                            print(f"{header}: {value}")
                    elif isinstance(event, bytearray):
                        print(f"[{len(event)} bytes of data]")
                    elif event is None:
                        print("== End of segment")

Changelog
---------

* **next**

  * A completely new, fast, non-blocking ``PushMultipartParser`` parser, which
    now serves as the basis for all other parsers.
  * Default charset for ``MultipartParser`` headers and text fields changed to
    ``utf8``.
  * Default disk and memory limits for ``MultipartParser`` increased, and
    multiple other limits added for finer control.
  * Undocumented APIs deprecated or removed, some of which were not strictly
    private. This includes some parameters for ``MultipartParser`` and some
    ``MultipartPart`` methods that should not be used by anyone but the parser
    itself.

* **0.2.5 (18.06.2024)**

  * Don't test semicolon separators in urlencoded data (#33)