PyPPMd Python module

PPM, Prediction by partial matching, is a wellknown compression technique based on context modeling and prediction. PPM models use a set of previous symbols in the uncompressed symbol stream to predict the next symbol in the stream.

PPMd is an implementation of PPMII by Dmitry Shkarin.

The pyppmd package uses core C files from p7zip. The library has a bare function and no metadata/header handling functions. This means you should know compression parameters and input/output data sizes.

This library implements PPMd Variant H, and PPMd Variant I Version 2.

Getting started

Install

The pyppmd is written by Python and C language bound with both CFFI and CPython C/C++ API, and can be downloaded from PyPI(aka. Python Package Index) using standard ‘pip’ command as like follows;

pip install pyppmd

When installing on CPython, it downloads a wheel with CPython C/C++ extension. When installing on PyPY, it downloads a wheel with CFFI extension. There are binaries for CPython 3.6, 3.7, 3.8, 3.9 on Windows(32bit, 64bit), macOS and Linux(amd64, aarch64), and PyPy7.3(python 3.7) on macOS, Windows(64bit), and Linux(32bit, aarch64).

Application programming interface

Exception

exception PpmdError

This exception is raised when an error occurs.

Simple compression/decompression

This section contains:

compress(bytes_or_str: Union[bytes, bytearray, memoryview, str], max_order: int, mem_size: int, variant: str)

Compress bytes_or_str, return the compressed data.

Parameters
  • bytes_or_str (bytes-like object or str) – Data to be compressed. When it is type of str, encoded with “UTF-8” encoding before compress.

  • max_order (int) – maximum order of PPMd algorithm

  • mem_size (int) – memory size used for building PPMd model

  • variant (str) – PPMd variant name, only accept “H” or “I”

Returns

Compressed data

Return type

bytes

compressed_data = compress(data)
decompress_str(data: Union[bytes, memoryview], max_order: int, mem_size: int, encoding: str, variant: str)

Decompress data, return the decompressed text.

When encoding specified, return the decoded data as str type by specified encoding. Otherwise it returns data decoding by default “UTF-8”.

Parameters
  • data (bytes-like object) – Data to be decompressed.

  • max_order (int) – maximum order of PPMd algorithm

  • mem_size (int) – memory size used for building PPMd model

  • encoding (str) – Encoding name to use when decoding raw decompressed data

  • variant (str) – PPMd variant name, only accept “H” or “I”

Returns

Decompressed text

Return type

str

Raises

PpmdError – If decompression fails.

decompressed_text = decompress_str(data)
decompress(data: Union[bytes, memoryview], max_order: int, mem_size: int, variant: str)

Decompress data, return the decompressed data.

Parameters
  • data (bytes-like object) – Data to be decompressed

  • max_order (int) – maximum order of PPMd algorithm

  • mem_size (int) – memory size used for building PPMd model

  • variant (str) – PPMd variant name, only accept “H” or “I”

Returns

Decompressed data

Return type

bytes

Raises

PpmdError – If decompression fails.

decompressed_data = decompress(data)

Streaming compression

class PpmdCompressor

A streaming compressor. It’s thread-safe at method level.

__init__(self, max_order: int, mem_size: int, variant: str, restore_method: int)

Initialize a PpmdCompressor object. restore_method param is affected only when variant is “I”.

Parameters
  • max_order (int) – maximum order of PPMd algorithm

  • mem_size (int) – memory size used for building PPMd model

  • variant (str) – PPMd variant name, only accept “H” or “I”

  • restore_method (int) – PPMD8_RESTORE_METHOD_RESTART(0) or PPMD8_RESTORE_METHOD_CUTOFF(1)

compress(self, data)

Provide data to the compressor object.

Parameters

data (bytes-like object) – Data to be compressed.

Returns

A chunk of compressed data if possible, or b'' otherwise.

Return type

bytes

flush(self)

Flush any remaining data in internal buffer.

The compressor object can not be used after this method is called.

Returns

Flushed data.

Return type

bytes

c = PpmdCompressor()

dat1 = c.compress(b'123456')
dat2 = c.compress(b'abcdef')
dat3 = c.flush()

Streaming decompression

class PpmdDecompressor

A streaming decompressor. Thread-safe at method level. A restore_method param is affected only when variant is “I”.

__init__(self, max_order: int, mem_size: int, variant: str, restore_method: int)

Initialize a PpmdDecompressor object.

Parameters
  • max_order (int) – maximum order of PPMd algorithm

  • mem_size (int) – memory size used for building PPMd model

  • variant (str) – PPMd variant name, only accept “H” or “I”

  • restore_method (int) – PPMD8_RESTORE_METHOD_RESTART(0) or PPMD8_RESTORE_METHOD_CUTOFF(1)

decompress(self, data, max_length=-1)

Decompress data, returning decompressed data as a bytes object.

Parameters
  • data (bytes-like object) – Data to be decompressed.

  • max_length (int) – Maximum size of returned data. When it’s negative, the output size is unlimited. When it’s non-negative, returns at most max_length bytes of decompressed data. If this limit is reached and further output can (or may) be produced, the needs_input attribute will be set to False. In this case, the next call to this method may provide data as b'' to obtain more of the output.

needs_input

If the max_length output limit in decompress() method has been reached, and the decompressor has (or may has) unconsumed input data, it will be set to False. In this case, pass b'' to decompress() method may output further data.

If ignore this attribute when there is unconsumed input data, there will be a little performance loss because of extra memory copy. This flag can be True even all input data are consumed, when decompressor can be able to accept more data in some case.

eof

True means the end of the first frame has been reached. If decompress data after that, an EOFError exception will be raised. This flag can be False even all input data are consumed, when decompressor can be able to accept more data in some case.

unused_data

A bytes object. When PpmdDecompressor object stops after end mark, unused input data after the end mark. Otherwise this will be b''.

d1 = PpmdDecompressor()

decompressed_dat = d1.decompress(dat1)
decompressed_dat += d1.decompress(dat2)
decompressed_dat += d1.decompress(dat3)

Ppmd8 Objects

Ppmd8Encoder and Ppmd8Decoder classes are intend to use general purpose text compression.

class Ppmd8Encoder

Encoder for PPMd Variant I version 2.

__init__(max_order: int, mem_size: int, restore_method: int)

The max_order parameter is between 2 to 64. mem_size is a memory size in bytes which the encoder use. restore_method should be either PPMD8_RESTORE_METHOD_RESTART or PPMD8_RESTORE_METHOD_CUTOFF.

Ppmd8Encoder.encode(data: Union[bytes, bytearray, memoryview])

compress data, returning a bytes object containing copressed data. This data should be concatenated to the output produced by any preceding calls to the encode(). Some input may be kept in internal buffer for later processing.

Ppmd8Encoder.flush(endmark: boolean)

All pending input is processed, and bytes object containing the remaining compressed output is returned. After calling flush(), the encode() method cannot be called again; the only realistic action is to delete the object. flush() method releases some resource the object used.

When endmark is true (default), flush write endmark(-1) to end of archive, otherwise do not write anything and just flush.

class Ppmd8Decoder

Decoder for PPMd Variant I version 2.

__init__(max_order: int, mem_size: int, restore_method)

The max_order parameter is between 2 to 64. mem_size is a memory size in bytes which the encoder use.

These parameters should as same as one when encode the data.

Ppmd8Decoder.decode(data: Union[bytes, bytearray, memoryview], length: int)

decode the given data and returns decoded data. When length is -1, maximum output data may be returned.

If decoder got the end mark, decode() method automatically flush all data and close some resource. When reached to end mark, Ppmd8Decoder.eof member become True.

When Ppmd8Decoder.needs_input is True, all input data is exhausted and need more input data to generate output. Otherwise, there are some data in internal buffer and reusable.

The decoder may return data which size is smaller than specified length, that is because size of input data is not enough to decode.

Ppmd7 Objects

Ppmd7Encoder and Ppmd7Decoder classes are designed to use as internal class for py7zr, python 7-zip compression/decompression library. Ppmd7Encoder and Ppmd7Decoder use a modified version of PPMd var.H that use the range coder from 7z.

class Ppmd7Encoder

Encoder for PPMd Variant H.

__init__(max_order: int, mem_size: int)

The max_order parameter is between 2 to 64. mem_size is a memory size in bytes which the encoder can use.

Ppmd7Encoder.encode(data: Union[bytes, bytearray, memoryview])

Compress data, returning a bytes object containing compressed data for at least part of the data in data. This data should be concatenated to the output produced by any preceding calls to the encode() method. Some input may be kept in internal buffers for later processing.

Ppmd7Encoder.flush(endmark: boolean)

All pending input is processed, and bytes object containing the remaining compressed output is returned. After calling flush(), the encode() method cannot be called again; the only realistic action is to delete the object. When endmark is true, flush write endmark(-1) to end of archive, otherwise do not write (default).

class Ppmd7Decoder

Decoder for PPMd Variant H.

__init__(max_order: int, mem_size: int)

The max_order parameter is between 2 to 64. mem_size is a memory size in bytes which the encoder can use.

Ppmd7Decoder.decode(data: Union[bytes, bytearray, memoryview], length: int)

returns decoded data that sizes is length.

decoder may return data which size is smaller than specified length, that is because size of input data is not enough to decode.

Ppmd7Decoder.flush(length: int)

All pending input is processed, and a bytes object containing the remaining uncompressed output of specified length is returned. After calling flush(), the decode() method cannot be called again; the only realistic action is to delete the object.

Contributor guide

Development environment

If you’re reading this, you’re probably interested in contributing to pyppmd. Thank you very much! The purpose of this guide is to get you to the point where you can make improvements to the PyPPMd and share them with the rest of the team.

Setup Python and C compiler

The PyPPMd is written in the Python and C languages bound with both CFFI, C Foreign Function Interface, and CPython C/C++ API. CFFI is used for PyPy3 and CPython API is used for CPython.

Python installation for various platforms with various ways. You need to install Python environment which support pip command. Venv/Virtualenv is recommended for development.

We have a test suite with python 3.8 and pypy3. If you want to run all the test with these versions and variant on your local, you should install these versions. You can run test with CI environment on Github actions.

Get Early Feedback

If you are contributing, do not feel the need to sit on your contribution until it is perfectly polished and complete. It helps everyone involved for you to seek feedback as early as you possibly can. Submitting an early, unfinished version of your contribution for feedback in no way prejudices your chances of getting that contribution accepted, and can save you from putting a lot of work into a contribution that is not suitable for the project.

Code Contribution

Steps submitting code

When contributing code, you’ll want to follow this checklist:

  1. Fork the repository on GitHub.

  2. Run the tox tests to confirm they all pass on your system. If they don’t, you’ll need to investigate why they fail. If you’re unable to diagnose this yourself, raise it as a bug report.

  3. Write tests that demonstrate your bug or feature. Ensure that they fail.

  4. Make your change.

  5. Run the entire test suite again using tox, confirming that all tests pass including the ones you just added.

  6. Send a GitHub Pull Request to the main repository’s master branch. GitHub Pull Requests are the expected method of code collaboration on this project.

Code review

Contribution will not be merged until they have been code reviewed. There are limited reviewer in the team, reviews from other contributors are also welcome. You should implemented a review feedback unless you strongly object to it.

Code style

The pyppmd uses the PEP8/Black code style. In addition to the standard PEP8, we have an extended guidelines.

  • line length should not exceed 125 characters.

  • Black format prettier is enforced.

  • It also use MyPy static type check enforcement.

Test cases

There is three types of tests and we measures coverages;

  • Unit tests for encode and decode, single data, and multiple data.

  • Integration test with CSV file which size is larger than buffer size.

  • Hypothesis fuzzing test.

All tests should be passed before merging.

C bindings development

Debuggng bindings has always been itchy task for developers. Even proprietary modern IDEs, such as PyCharm Professional/CLion, does not provide a cross debugging feature. Ref: https://youtrack.jetbrains.com/issue/CPP-5797

PyPpmd project source has a hacky way to do it.

CMake

The project has a CMakeLists.txt file to run cross-debugging, CMake is a cross-platform builder meta tool for C/C++ projects. You can run PyPpmd project as a C project using the file.

Jetbrains CLion

CLion is an IDE tool for C/C++ development that support CMake for build configuration. You can use CLion for PyPpmd development.

Dependency

  • Python 3.8.x

  • python development files (for example python3.8-dev package)

  • venv

  • GCC or CLang C/C++ compiler

  • CMake 3.19 or later

When you want to change target python variation and version, please edit CMakeLists.txt#L8-L9

set(PY_VERSION 3.8)
set(Python_FIND_IMPLEMENTATIONS PyPy)

Manual build and run

TL;DR

mkdir cmake-build
cd cmake-build
cmake ..
make pytest_runner
gdb ./pytest_runner ../tests

pytest_runner is a generated program that help you run pytest under C/C++ debugger. You may want to run it on IDE environment.

You can also run pytest with tox

tox -e py38

Library build

cd cmake-build
make pyppmd

CMake targets and files

THere are several targets you can build.

pytest_runner:

A C++ program that launch python and pytest. The source code is generated by CMake configuration onto cmake build directory (cmake-build in above example).

generate_ext:

A virtual target to produce C extension for CPython.

pyppmd:

compile C files into static library file. Just convenient target for compilation.

venv.stamp:

interim target to produce virtualenv environment for pytest_runner

Authors

pyppmd is written and maintained by Hiroshi Miura <miurahr@linux.com> Some part of pyppmd code are borrowed from pyzstd written by Ma Lin.

Contributors, listed alphabetically, are:

  • Sarthak Pati

    macOS compilation

  • T Yamada

    bug on race conditions on PPMd Variant I

Security Policy

Supported Versions

Only recent version of pyppmd are currently being supported with security updates.

Version

Status

0.18.x

Development

< 0.18

not supported

Reporting a Vulnerability

Please disclose security vulnerabilities privately at miurahr@linux.com

Third-party software notices and information

This project incorporates components derivered from the projects listed below. The original copyright notices and the licenses under which we received such components are set forth below.

  1. lib/ppmd/* derived from 7-zip/p7Zip 16.00

  2. lib/buffer/blockoutput.c derived from pyzstd v0.15.0

The other codes are original of PyPPMd project licensed under LGPLv2.1 or later.

7-zip/p7zip

C code under lib folder which originating from 7-zip are released under LGPL, and each sources are noted as follows.

  • 2017-04-03 : Igor Pavlov : Public domain

  • PPMd var.H (2001): Dmitry Shkarin : Public domain

  • PPMd var.I (2002): Dmitry Shkarin : Public domain

  • Carryless rangecoder (1999): Dmitry Subbotin : Public domain

7-zip, Copyright (C) 1999-2017, Igor Pavlov.

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You can receive a copy of the GNU Lesser General Public License from http://www.gnu.org/

pyzstd

A part of C extension code is a derived work of pyzstd which is licensed under BSD 3-Clause license.

BSD 3-Clause License

Copyright (c) 2016 Tino Reichard Copyright (c) 2020-2021, Ma Lin, All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

PyPPMd ChangeLog

All notable changes to this project will be documented in this file.

Unreleased

v1.0.0

Changed

  • Fix publish script to make sdist and upload it.

  • Move CI on Azure pipelines

  • Migrate forge site to CodeBerg.org

  • Drop release-note and stale actions

v0.18.3

Added

  • Release wheel for python 3.11 beta

Fixed

  • CI: update setuptools before test run (#115)

  • CI: fix error on tox test on aarch64.

Changed

v0.18.2

Fixed

  • Publish wheel package for python 3.10 on macos.

  • pyproject.toml: add “version” as dynamic (#100)

Changed

  • Update security policy to support version to be 0.18.x

  • Move old changelog to Chanlog.old.rst

v0.18.1

Fixed

  • Installation error with recent pip version (#94, #95) * Add metadata in pyproject.toml

  • PPMd8: check double flush(#96)

v0.18.0

Fixed

  • test: Fix fuzzer error with silent null byte (#89)

  • test: 32bit test memory parameter too large(#90)

  • PPMd7: avoid access violation on dealloc when failed in allocation (#91)

  • PPMd7: decoder.eof and decoder.needs_input return proper value(#92)

Security

  • PPMd7,PPMd8: fix struct definition by include process.h in windows This may cause crash on 32bit version of python on windows(#86)

Changed

  • PPMd7: decompressor use threading(#85)

Added

  • doc: Explanation of Extra silent null byte in README

v0.17.4

Fixed

  • ppmd7: allow multiple decode without additional input data (#84)

  • ppmd8: test: Fix fuzzer test program (#82)

Changed

v0.17.3

Fixed

  • Build on MingW/MSYS2(#68,#69)

Added

  • Test on Python 3.10.0, PyPY-3.6 and PyPy-3.7 (#71)

Changed

  • CI: use pypa/ciwheelbuild(#70)

  • CI: add dependabot(#70)

  • Bump versions - CI: pypa/ciwheelbuild@2.2.2 - CI: run-on-arch@2.1.1 - CI: actions/stale@4

  • CI: exclude pypy on windows

  • CI: exclude cp310-macos because python 3.10 for macos is superceded

  • CI: publish musllinux wheel

  • CI: improve cibuildwheel performance

v0.17.1

Added

  • Wheels for python 3.10

v0.17.0

Added

  • unified API for variation H and I

  • ppmd7, ppmd8: flag to control endmark(-1) addtions. defaults: ppmd7 without endmark, ppmd8: with endmark.

Changed

  • Unified API to use Variant H, and Varant I version 2 from simple API. User can provide variant argument to the constractor. (#59)

  • Allocate PPMD7Decompressor buffer variables from heap(#52)

  • Replace pthread wrapper library to the verison of one made by Lockless. Inc. (#67)

  • Refactoring internal variable namees, move thread shared variable into ThreadControl structure.

Fixed

  • More robust PPMd8Decompressor by taking thread control variables and buffers from heap, and remove global variables.(#54)

  • PPMD8Decoder: Deadlock on Windows(#67 and more)

Deprecated

Removed

  • End-mark (0x01 0x00) mode(#62)

Security

v0.16.1

Added

  • CI: add macOS as test matrix(#51)

Fixed

  • Fix osX bulid error(#49,#50)

v0.16.0

Added

  • PPMd8: support endmark option(#39)

  • PPMd8: support restore_method option(#24, @cielavenir)

  • Add pthread wrapper for macOS and Windows(#33)

Changed

  • PPMd8: decompressor use threading(#24,#33)

Fixed

  • PPMd8: Decompressor become wrong status when memory_size is smaller than file size(#24,#25,#28,#33,#45,#46)

  • PPMd8: Decompressor allocate buffers by PyMem_Malloc() (#42)

  • CMake: support CFFI extension generation(#30)

  • CMake: support debug flag for extension development(#27)

  • CMake: support pytest_runner on windows

  • CI: run tox test on pull_request

v0.15.2

Added

  • Add development note using cmake

Fixed

  • Make CMake build script working

Security

  • Hardening for multiplexing overflow(scan#1)

v0.15.1

Added

  • Badge for conda-forge package(#19)

Changed

  • Test document with tox

Fixed

  • Fix setup.py: pyppmd.egg-info/SOURCES.txt not including full path

  • Fix source package not include .git* files(#20)

  • Fix compiler warning by cast.

v0.15.0

  • Now development status is Beta.

Added

  • Introduce PpmdCompressor and PpmdDecompressor class for stream compression.

  • Introduce decompress_str() one-shot utility to return str object.

Changed

  • decompress() always return bytes object.

Deprecated

  • PPMd8: drop length mode for decompression and always use end mark mode.

  • PPMd8: drop flush() method for decompression.

v0.14.0

Added

  • Introduce compress() and decompress() one-shot utility - compress() accept bytes-like object or string. When string, encode it to UTF-8 first. - decompress() has an argument encoding, if specified, it returns string.

  • C: CFFI: Introduce End-Mark mode for PPMd8

Changed

  • C: Limit initial output buffer size as same as specified length.

  • C: Allow python thread when decode/encode loop running.

v0.13.0

Added

  • Benchmark test to show performance

Changed

  • Change folder structures in source.

  • Release resources on flush()

Fixed

  • Fix input buffer overrun(#8)

v0.12.1

Fixed

  • Fix dist of typing stubs

v0.12.0

Added

  • add PPMd varietion I (PPMd8) - Ppmd8Encoder, Ppmd8Decoder class

  • MyPy typing stubs

Changed

  • switch to LGPLv2.1+ License

  • Introduce flush() method for decode class.

Fixed

  • Fix build error on Windows.

v0.11.1

Fixed

  • Fix Packaging configuration

v0.11.0

Fixed

  • Better error handling for memory management.

Changed

  • Skip hypothesis tests on windows

  • Limit hypothesis tests parameter under available memory.

v0.10.0

  • First Alpha

Indices and tables