agjharms said 2 months ago:

As a data scientist, data engineer, or software developer, you sometimes don't need the most pristine, accurate, privacy-sensitive dataset. You just want some carefree data that roughly resembles reality and that you can play around with. For this purpose I developed the Python package `simago`.

While data at the individual level is often subject to many restrictions, aggregated data can be found online. `simago` transforms this aggregated data into probability distributions, which are then used to randomly generate a dataset of individuals in the population. The generated data can be used, for example, to develop or benchmark a machine learning model, simulation, or database setup. Because it does not contain privacy-sensitive information, it can be shipped alongside your model, simulation, or setup so that users can recreate the benchmarks or follow along with a tutorial.
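
To make the idea concrete, here is a minimal sketch of the general technique (aggregated counts, then a probability distribution, then random draws) using only the Python standard library. This is not the `simago` API: the function names, the `AGE_GROUP_COUNTS` table, and its numbers are all made up for illustration.

```python
import random

# Hypothetical aggregated counts per age group (illustrative numbers,
# not real statistics and not taken from simago).
AGE_GROUP_COUNTS = {"0-17": 200, "18-64": 600, "65+": 200}


def counts_to_probabilities(counts):
    """Normalise aggregated counts into a discrete probability distribution."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def generate_population(counts, size, seed=None):
    """Draw `size` synthetic individuals from the distribution implied by `counts`."""
    rng = random.Random(seed)  # seeded generator so runs can be reproduced
    probabilities = counts_to_probabilities(counts)
    categories = list(probabilities)
    weights = [probabilities[c] for c in categories]
    drawn = rng.choices(categories, weights=weights, k=size)
    # One record per synthetic person; no real individual is represented.
    return [{"person_id": i, "age_group": group} for i, group in enumerate(drawn)]


population = generate_population(AGE_GROUP_COUNTS, size=1000, seed=42)
print(population[:3])
```

Passing a fixed `seed` is what would let readers of a tutorial or benchmark regenerate exactly the same synthetic dataset.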

The package can be installed from PyPI with `pip`. The documentation on ReadTheDocs (https://simago.readthedocs.io/en/latest/) explains how the package works by walking through an example.

Any feedback on the project, the code or the documentation is very welcome!