Introduction
In the modern era of data-driven applications, handling and processing large-scale datasets have become critical for software development, testing, and data analysis. Whether you’re a software developer testing an application’s scalability or a data scientist building machine learning models, realistic datasets play a vital role in ensuring success.
However, acquiring large, meaningful datasets is not always straightforward. Public datasets may lack the specific attributes needed, and creating data manually is both time-consuming and impractical.
To address this challenge, I embarked on a project to generate large-scale, movie-related data programmatically using Python and SQLite. This project simulates a complete movie database, including actors, directors, genres, sales data, and more, making it ideal for testing and experimentation.
In this article, I’ll guide you through the process of designing, implementing, and utilizing this data generation tool, providing insights into the challenges faced and solutions devised along the way.
Getting Started
Before diving into the implementation, let’s review the prerequisites and an overview of the project structure.
Prerequisites
To follow along or replicate the project, ensure you have the following tools and libraries installed:
1. Python (Version 3.8 or higher)
2. SQLite3 (Comes pre-installed with Python)
3. Python Libraries:
• itertools
• datetime
• json
• numpy
• sqlite3
- random
Project Structure
Here’s how the project directory is organized:
movie-data-generator/
│
├── data/
│ ├── names/
│ │ ├── first-names.txt
│ │ ├── last-names.txt
│ ├── countries/
│ │ ├── countries.txt
│ ├── genres/
│ │ ├── genres.json
│ ├── movies/
│ │ ├── movie_list.txt
│ ├── words/
│ ├── wiki-100k.txt
├── db.sqlite3
├── generator.py
Data Files
• names/first-names.txt & names/last-names.txt: Contain lists of first and last names for actors and directors.
• countries/countries.txt: Contains country names.
• genres/genres.json: Stores movie genres in JSON format.
• movies/movie_list.txt: Contains movie titles.
- words/wiki-100k.txt: Provides a list of words for generating movie descriptions.
SQLite Database
The project uses db.sqlite3 as the database to store generated data, including tables for:
• Actors
• Directors
• Movies
• Genres
• Countries
• Sales
With the basics in place, let’s move to the Implementation section, where we’ll explore the Python script step by step.
Installation
To get started with the movie data generation project, follow these step-by-step instructions:
1. Clone the Repository
Begin by cloning the project repository from GitHub:
git clone https://github.com/AlirezaMz10/movie-data-generator.git cd movie-data-generator
2. Set Up the Environment
Ensure you have Python 3.8 or higher installed. To isolate dependencies, create a virtual environment:
python -m venv env source env/bin/activate # On macOS/Linux env\Scripts\activate # On Windows
3. Install Required Libraries
Install the required Python libraries using pip:
pip install numpy sqlite3
4. Prepare the Dataset
Ensure the data directory contains the necessary files for names, countries, genres, movies, and words. You can either:
• Use the provided dataset in the repository.
- Replace the files with your custom datasets while maintaining the same structure.
5. Set Up the SQLite Database
Initialize the SQLite database by running the script. This step creates the database schema and begins populating the data:
python data_generator.py
Here’s the models.py for your Django project:
from django.db import models # Create your models here. class Country(models.Model): name = models.CharField(max_length=255) class Meta: db_table = "countries" class Actor(models.Model): firstname = models.CharField(max_length=255) lastname = models.CharField(max_length=255) born_year = models.IntegerField() class Meta: db_table = "actors" class Director(models.Model): firstname = models.CharField(max_length=255) lastname = models.CharField(max_length=255) born_year = models.IntegerField() class Meta: db_table = "directors" class Genre(models.Model): title = models.CharField(max_length=255) class Meta: db_table = "genres" class Movie(models.Model): title = models.CharField(max_length=255) description = models.CharField(max_length=255) year = models.IntegerField() rating = models.IntegerField() actors = models.ManyToManyField(Actor) genres = models.ManyToManyField(Genre) directors = models.ManyToManyField(Director) class Meta: db_table = "movies" class Sale(models.Model): amount = models.IntegerField() country = models.ForeignKey(Country, on_delete=models.DO_NOTHING) movie = models.ForeignKey(Movie, on_delete=models.DO_NOTHING) class Meta: db_table = "sales"
Explanation:
• Country: Represents a country with a name.
• Actor: Contains the first name, last name, and birth year of an actor.
• Director: Contains similar fields for directors as the Actor model.
• Genre: Stores movie genres.
• Movie: Represents a movie with attributes like title, description, year, rating, and relationships to actors, directors, and genres.
- Sale: Stores movie sales information, including the amount, country, and movie.
Lets move on main code.
1. Reading Names and Families:
with open("data/names/first-names.txt", "r") as f1: names_file = f1.read() names = names_file.split() with open("data/names/first-names.txt", "r") as f2: families_file = f2.read() families = families_file.split()
• What it does: The script opens and reads the first-names.txt file twice — once for first names (names) and once for family names (families).
- Why it does this: You need both first and family names to create combinations of actor and director names.
2. Creating Combinations of Names:
names_families_combo = list(itertools.product(names, families))
• What it does: It generates all combinations of first and family names using itertools.product(), which creates a Cartesian product of names and families.
• Why it does this: To ensure you can generate every possible combination of names for the actors and directors.
3. Reading Countries and Years:
with open("data/countries/countries.txt") as f3: countries_file = f3.read() countries = countries_file.split() year = datetime.datetime.today().year years = [year - i for i in range(35)]
• What it does: The script reads the list of countries from countries.txt. It also generates a list of years for the last 35 years.
• Why it does this: To populate the countries table and provide a range of years to randomly assign to actors and directors.
4. Reading Genres and Ratings:
with open("data/genres/genres.json") as f4: genres = json.loads(f4.read()) ratings = list(np.arange(1, 5.5, .5))
• What it does: It loads the genres from a JSON file (genres.json) and creates a list of ratings from 1 to 5 (with half-point increments).
• Why it does this: Genres and ratings are needed for movie data.
5. Reading Movies:
with open("data/movies/movie_list.txt") as f5: movies_file = f5.read() movies_choices = movies_file.split()[:1000] movies_combo = list(itertools.combinations(movies_choices, 2)) movies = [movie[0] + " " + movie[1] for movie in movies_combo]
• What it does: It reads the list of movie titles from movie_list.txt, takes the first 1000 titles, and generates all possible combinations of pairs of movies using itertools.combinations(). It then combines these pairs into a list of movie names.
• Why it does this: You need a set of movie titles to insert into the movies table. Pairing titles ensures unique combinations.
6. Reading Words for Descriptions:
words_choices = [] with open("data/words/wiki-100k.txt", encoding="utf-8") as f6: words_file = f6.read().split("\n") for word in words_file: if "#!comment:" not in word: words_choices.append(word)
• What it does: Reads a list of words from wiki-100k.txt and filters out comments (lines that start with #!comment:).
• Why it does this: These words will be used to generate random descriptions for the movies.
7. Generating Descriptions:
descriptions = [] words_1 = list(itertools.combinations(words_choices[:100], 4)) descriptions_1 = [word[0] + " " + word[1] + " " + word[2] + " " + word[3] for word in words_1] words_2 = list(itertools.combinations(words_choices[100:200], 4)) descriptions_2 = [word[0] + " " + word[1] + " " + word[2] + " " + word[3] for word in words_2] words_3 = list(itertools.combinations(words_choices[200:300], 4)) descriptions_3 = [word[0] + " " + word[1] + " " + word[2] + " " + word[3] for word in words_3] descriptions.extend(descriptions_1) descriptions.extend(descriptions_2) descriptions.extend(descriptions_3)
• What it does: It creates random descriptions by picking four words from different parts of the words_choices list and combining them into descriptions. It generates three different sets of descriptions (descriptions_1, descriptions_2, and descriptions_3) and combines them into one list.
• Why it does this: Random descriptions are needed for each movie, and this process ensures a variety of descriptions.
8. Database Setup:
cnn = sqlite3.connect("db.sqlite3") cur = cnn.cursor()
• What it does: The code connects to an SQLite database (db.sqlite3) and creates a cursor object (cur) to execute SQL queries.
• Why it does this: This is necessary to interact with the database and perform insertions.
9. Inserting Data into the Database:
Inserting Actors and Directors:
for i in range(10000): actor_name_index = random.randint(0, len(names)-1) actor_name = names[actor_name_index] actor_family_index = random.randint(0, len(families)-1) actor_family = families[actor_family_index] actor_born_index = random.randint(0, len(years)-1) actor_born = years[actor_born_index] a = f"INSERT INTO actors (`firstname`, `lastname`, `born_year`) VALUES ('{actor_name.replace("'" , "")}', '{actor_family.replace("'" , "")}', '{actor_born}')" print(a) cur.execute(a) cnn.commit()
• What it does: This loop randomly selects an actor’s first name, last name, and birth year from the lists, then inserts this data into the actors table. The same process is repeated for directors.
• Why it does this: You need to populate the database with actors and directors data.
Inserting Countries and Genres:
for i in range(len(countries)): country = countries[i] a = f"INSERT INTO countries (`name`) VALUES ('{country.replace("'" , "")}')" print(a) cur.execute(a) cnn.commit() for i in range(len(genres)): genre = genres[i] a = f"INSERT INTO genres (`title`) VALUES ('{genre.replace("'" , "")}')" print(a) cur.execute(a) cnn.commit()
• What it does: This loop inserts country names into the countries table and genre titles into the genres table.
• Why it does this: These are necessary tables that will later be linked to movies.
Inserting Movies:
for i in range(len(movies)): movie = movies[i] description_index = random.randint(0, len(descriptions)-1) description = descriptions[description_index] year = random.randint(0, len(years)-1) rating = random.randint(0, len(ratings)-1) a = f"INSERT INTO movies (`title`, `description`, `year`, `rating`) VALUES ('{movie.replace("'" , "")}', '{description.replace("'" , "")}', '{year}', '{rating}')" print(a) cur.execute(a) cnn.commit()
• What it does: This loop inserts movie titles, descriptions, years, and ratings into the movies table. Each movie is randomly assigned a description, year, and rating.
• Why it does this: This is required to populate the movies table with realistic data.
Linking Movies with Actors, Directors, Genres, Countries:
for i in range(1, len(movies)): actrs = set(random.choices(list(range(1, 10000)), k=random.randint(5,15))) for actr in actrs: a = f"INSERT INTO movies_actors (`movie_id`, `actor_id`) VALUES ('{i}', '{actr}')" print(a) cur.execute(a) cnn.commit() for i in range(1, len(movies)): drcts = set(random.choices(list(range(1, 10000)), k=random.randint(2,5))) for drct in drcts: a = f"INSERT INTO movies_directors (`movie_id`, `director_id`) VALUES ('{i}', '{drct}')" print(a) cur.execute(a) cnn.commit() for i in range(1, len(movies)): gnrs = set(random.choices(list(range(1, len(genres))), k=random.randint(3,5))) for gnr in gnrs: a = f"INSERT INTO movies_genres (`movie_id`, `genre_id`) VALUES ('{i}', '{gnr}')" print(a) cur.execute(a) cnn.commit() for i in range(1, len(movies
and this is full code
import itertools import datetime import json import numpy as np import sqlite3 import random with open("data/names/first-names.txt", "r") as f1: names_file = f1.read() names = names_file.split() with open("data/names/first-names.txt", "r") as f2: families_file = f2.read() families = families_file.split() # names and families names_families_combo = list(itertools.product(names, families)) with open("data/countries/countries.txt") as f3: countries_file = f3.read() # countries countries = countries_file.split() year = datetime.datetime.today().year # years years = [year - i for i in range(35)] with open("data/genres/genres.json") as f4: # genres genres = json.loads(f4.read()) # ratings ratings = list(np.arange(1, 5.5, .5)) with open("data/movies/movie_list.txt") as f5: movies_file = f5.read() movies_choices = movies_file.split()[:1000] movies_combo = list(itertools.combinations(movies_choices, 2)) # movies movies = [movie[0] + " " + movie[1] for movie in movies_combo] words_choices = [] with open("data/words/wiki-100k.txt", encoding="utf-8") as f6: words_file = f6.read().split("\n") for word in words_file: if "#!comment:" not in word: words_choices.append(word) descriptions = [] words_1 = list(itertools.combinations(words_choices[:100], 4)) descriptions_1 = [word[0] + " " + word[1] + " " + word[2] + " " + word[3] for word in words_1] words_2 = list(itertools.combinations(words_choices[100:200], 4)) descriptions_2 = [word[0] + " " + word[1] + " " + word[2] + " " + word[3] for word in words_2] words_3 = list(itertools.combinations(words_choices[200:300], 4)) descriptions_3 = [word[0] + " " + word[1] + " " + word[2] + " " + word[3] for word in words_3] descriptions.extend(descriptions_1) descriptions.extend(descriptions_2) descriptions.extend(descriptions_3) cnn = sqlite3.connect("db.sqlite3") cur = cnn.cursor() for i in range(10000): actor_name_index = random.randint(0, len(names)-1) actor_name = names[actor_name_index] actor_family_index = random.randint(0, len(families)-1) actor_family = families[actor_family_index] actor_born_index = random.randint(0, len(years)-1) actor_born = years[actor_born_index] a = f"INSERT INTO actors (`firstname`, `lastname`, `born_year`) VALUES ('{actor_name.replace("'" , "")}', '{actor_family.replace("'" , "")}', '{actor_born}')" print(a) cur.execute(a) cnn.commit() director_name_index = random.randint(0, len(names)-1) director_name = names[director_name_index] director_family_index = random.randint(0, len(families)-1) director_family = families[director_family_index] director_born_index = random.randint(0, len(years)-1) director_born = years[director_born_index] b = f"INSERT INTO directors (`firstname`, `lastname`, `born_year`) VALUES ('{director_name.replace("'" , "")}', '{director_family.replace("'" , "")}', '{director_born}')" print(b) cur.execute(b) cnn.commit() for i in range(len(countries)): country = countries[i] a = f"INSERT INTO countries (`name`) VALUES ('{country.replace("'" , "")}')" print(a) cur.execute(a) cnn.commit() for i in range(len(genres)): genre = genres[i] a = f"INSERT INTO genres (`title`) VALUES ('{genre.replace("'" , "")}')" print(a) cur.execute(a) cnn.commit() for i in range(len(movies)): movie = movies[i] description_index = random.randint(0, len(descriptions)-1) description = descriptions[description_index] year = random.randint(0, len(years)-1) rating = random.randint(0, len(ratings)-1) a = f"INSERT INTO movies (`title`, `description`, `year`, `rating`) VALUES ('{movie.replace("'" , "")}', '{description.replace("'" , "")}', '{year}', '{rating}')" print(a) cur.execute(a) cnn.commit() for i in range(1, len(movies)): actrs = set( random.choices(list(range(1, 10000)), k=random.randint(5,15))) for actr in actrs: a = f"INSERT INTO movies_actors (`movie_id`, `actor_id`) VALUES ('{i}', '{actr}')" print(a) cur.execute(a) cnn.commit() for i in range(1, len(movies)): drcts = set( random.choices(list(range(1, 10000)), k=random.randint(2,5))) for drct in drcts: a = f"INSERT INTO movies_directors (`movie_id`, `director_id`) VALUES ('{i}', '{drct}')" print(a) cur.execute(a) cnn.commit() for i in range(1, len(movies)): gnrs = set( random.choices(list(range(1, len(genres))), k=random.randint(3,5))) for gnr in gnrs: a = f"INSERT INTO movies_genres (`movie_id`, `genre_id`) VALUES ('{i}', '{gnr}')" print(a) cur.execute(a) cnn.commit() for i in range(1, len(movies)): cntrs = set( random.choices(list(range(1, len(countries))), k=random.randint(5, 10))) for cntr in cntrs: amount = random.randint(100000000, 1000000000) a = f"INSERT INTO sales (`amount`, `country_id`, `movie_id`) VALUES ({amount}, '{cntr}', '{i}')" print(a) cur.execute(a) cnn.commit()
Here’s a summary of the code:
1. Data Reading and Preparation:
• The code reads data from various files (names, countries, genres, movies, etc.).
• It creates combinations of first names and family names for actors and directors.
• It generates random years (last 35 years) and loads genre information from a JSON file.
• Movie titles are paired into combinations, and random descriptions are generated using words from a text file.
2. Database Setup:
• The script connects to an SQLite database (db.sqlite3) to store data in several tables, including actors, directors, movies, countries, genres, etc.
3. Data Insertion:
• The script inserts random data into various tables (actors, directors, movies, countries, genres, etc.).
• It assigns random actors, directors, genres, and countries to movies.
4. Linking Data:
• Movies are linked to actors, directors, genres, and countries using many-to-many relationships (e.g., movies_actors, movies_directors, movies_genres, and sales tables).
• For each movie, random actors, directors, genres, and countries are chosen and inserted into the corresponding tables.
Purpose:
The script is designed to populate a database with large sets of randomized data related to movies, including actors, directors, genres, and countries, to simulate a real-world movie database.
You can visit my GitHub repository to explore the full project and find the code, along with additional details.
Happy Coding.