When you start learning data science, often your biggest worry is not the algorithms or techniques but getting access to raw data. While there are many high-quality, real-life datasets available on the web for trying out cool machine learning techniques, I’ve found that the same is not true when it comes to learning SQL.
For data science, having a basic familiarity with SQL is almost as important as knowing how to write code in Python or R. But it’s far easier to find toy datasets on Kaggle, specifically designed or curated for machine learning tasks, than it is to access a large enough database with realistic records (such as name, age, credit card, Social Security number, address, birthday, etc.) to practice SQL on.
Wouldn’t it be great to have a simple tool or library to generate a large database with multiple tables filled with data of your own choice?
Beyond beginners in data science, even seasoned software testers may find it useful to have a simple tool that, with a few lines of code, generates arbitrarily large datasets with random (fake) yet meaningful entries.
For this reason, I am glad to introduce a lightweight Python library called pydbgen. In this article, I’ll briefly share some information about the package, and you can learn much more by reading the docs.
What is pydbgen?
Pydbgen is a lightweight, pure-Python library that generates random yet useful entries (e.g., name, address, credit card number, date, time, company name, job title, license plate number, etc.) and saves them in a pandas DataFrame object, as an SQLite table in a database file, or in a Microsoft Excel file.
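To make the idea concrete, here is a minimal sketch of the general technique: generate random but plausible-looking entries and store them as an SQLite table. It uses only Python's standard library and is not pydbgen's API or implementation. The sample pools and the helper names `fake_person` and `build_people_table` are hypothetical, chosen purely for illustration.

```python
import random
import sqlite3

# Hypothetical sample pools; a real generator would draw from far larger lists.
FIRST_NAMES = ["Alice", "Bob", "Carol", "David", "Eve"]
LAST_NAMES = ["Smith", "Jones", "Garcia", "Chen", "Patel"]
CITIES = ["Austin", "Boston", "Chicago", "Denver", "Seattle"]


def fake_person(rng):
    """Return one (name, city, phone) tuple of random, plausible-looking values."""
    name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    city = rng.choice(CITIES)
    phone = f"555-{rng.randint(0, 999):03d}-{rng.randint(0, 9999):04d}"
    return name, city, phone


def build_people_table(conn, num_rows, seed=None):
    """Create a `people` table on `conn` and fill it with `num_rows` fake rows."""
    rng = random.Random(seed)  # a seed makes the fake data reproducible
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people (name TEXT, city TEXT, phone TEXT)"
    )
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?)",
        (fake_person(rng) for _ in range(num_rows)),
    )
    conn.commit()


# An in-memory database keeps the example free of side effects.
conn = sqlite3.connect(":memory:")
build_people_table(conn, num_rows=100, seed=42)
row_count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
```

Replacing `":memory:"` with a file path would persist the table to disk, where it can be queried with ordinary SQL for practice.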