A comprehensive analysis of the Amazon Prime Video catalog, detailing the journey from raw data to actionable insights. This project involves data cleaning, exploratory data analysis (EDA) using Python, and the creation of a fully interactive dashboard in Power BI.
This repository contains the complete workflow for analyzing the Amazon Prime Video dataset. The primary goal is to understand the composition of the content library, identify key trends, and present these findings in a dynamic and visually appealing dashboard.
- Project Overview
- Dashboard Preview
- Data Source
- Methodology
- Key Insights & Findings
- Future Work
- Tools & Libraries
- Repository Structure
- How to Reproduce
- Contact
This project provides a deep dive into the content available on Amazon Prime Video. By analyzing a dataset of nearly 10,000 titles, we can answer critical questions about the platform's content strategy, such as:
- What is the distribution between movies and TV shows?
- Which countries are the primary content producers?
- What are the most popular and frequently listed genres?
- How has the volume of content added to the platform evolved over the years?
The analysis is conducted in a Jupyter Notebook, and the final results are consolidated into an intuitive Power BI dashboard for easy exploration.
The interactive dashboard provides a holistic view of the Prime Video library. Users can filter by genre, country, and release year to dynamically explore the data.
The analysis is based on the amazon_prime_titles.csv dataset, which is publicly available on Kaggle.
- Dataset Link: Amazon Prime Video Titles on Kaggle
Data Dictionary:

| Column | Description | Data Type |
|---|---|---|
| `show_id` | Unique identifier for each title. | object |
| `type` | The type of content ('Movie' or 'TV Show'). | object |
| `title` | The name of the title. | object |
| `director` | The director(s) of the movie. | object |
| `cast` | The main actors in the title. | object |
| `country` | The country or countries of production. | object |
| `date_added` | The date the title was added to Prime Video. | object |
| `release_year` | The year the title was originally released. | int64 |
| `rating` | The content rating (e.g., 13+, 18+, ALL). | object |
| `duration` | Duration (minutes for movies, seasons for TV shows). | object |
| `listed_in` | The genre(s) the title is listed under. | object |
| `description` | A brief summary of the title. | object |
The project follows a structured approach, moving from raw data to a polished, interactive dashboard.
The initial dataset contained missing values and potential duplicates. The following steps were performed in the prime_video_analysis.ipynb notebook to ensure data quality:
- Handling Null Values:
  - Missing values in the `director` and `cast` columns were imputed with the string `'Unknown'`.
  - Missing `country` values were filled with the mode (the most frequently occurring country, which was the United States).
- Removing Duplicates: Any exact duplicate rows were identified and removed from the dataset.
- Data Export: The cleaned dataset was exported to `prime_video_cleaned.csv` for use in Power BI. A separate file, `prime_video_country_map.csv`, was also created for the map visualization.
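The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `clean_titles` and the tiny sample frame are hypothetical, and only the column names come from the data dictionary.

```python
import pandas as pd


def clean_titles(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to a titles DataFrame."""
    cleaned = df.copy()

    # Replace missing director/cast entries with a placeholder string.
    cleaned[["director", "cast"]] = cleaned[["director", "cast"]].fillna("Unknown")

    # Fill missing countries with the most frequent non-null value (the mode).
    most_common_country = cleaned["country"].mode().iloc[0]
    cleaned["country"] = cleaned["country"].fillna(most_common_country)

    # Drop rows that are exact duplicates across every column.
    return cleaned.drop_duplicates().reset_index(drop=True)


# Tiny made-up sample standing in for amazon_prime_titles.csv.
sample = pd.DataFrame(
    {
        "title": ["A", "B", "B", "C"],
        "director": ["Jane Doe", None, None, "John Roe"],
        "cast": [None, "X, Y", "X, Y", None],
        "country": ["United States", "United States", "United States", None],
    }
)
cleaned_sample = clean_titles(sample)
print(cleaned_sample)
```

On the real data, the same function could be applied to `pd.read_csv("amazon_prime_titles.csv")` and the result written out with `to_csv("prime_video_cleaned.csv", index=False)`.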
With a clean dataset, a thorough EDA was conducted to uncover patterns and trends. Key areas of analysis included:
- Content Type Distribution: A bar chart was created to visualize the count of Movies vs. TV Shows.
- Top 10 Countries: The top 10 countries by the number of titles were identified and plotted.
- Release Year Trends: A line chart was used to show the trend of titles released per year.
- Content Rating Analysis: The distribution of the top 10 content ratings was analyzed.
- Genre Analysis: The `listed_in` column was parsed to count the occurrences of each genre, and the top 10 were visualized.
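As an illustration of the genre-counting step, here is a small sketch using `collections.Counter`. It assumes that `listed_in` holds comma-separated genre names; the helper `count_genres` and the sample values are hypothetical and are not taken from the notebook.

```python
from collections import Counter


def count_genres(listed_in_values):
    """Count how often each genre appears across comma-separated genre strings."""
    counts = Counter()
    for value in listed_in_values:
        # Skip missing entries (None, NaN, or anything that is not a string).
        if not isinstance(value, str):
            continue
        # Split on commas and trim whitespace around each genre name.
        counts.update(g.strip() for g in value.split(",") if g.strip())
    return counts


# Made-up rows in the style of the listed_in column.
sample_listed_in = ["Drama, Comedy", "Comedy", "Drama, Suspense", None, "Drama"]
genre_counts = count_genres(sample_listed_in)
top_genres = genre_counts.most_common(2)
print(top_genres)
```

With a DataFrame, the same helper could be called as `count_genres(df["listed_in"])`, and `most_common(10)` would give the top 10 genres for plotting.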
The insights from the EDA were brought to life in an interactive Power BI dashboard. The dashboard features:
- Key Performance Indicators (KPIs): Cards displaying total titles, number of movies, and number of TV shows.
- Donut Chart: Showing the percentage split between movies and TV shows.
- Bar Charts: For the top 10 genres and top 10 countries.
- Area Chart: Illustrating the number of titles by release year.
- Map Visualization: A geographic representation of title distribution by country.
- Slicers: Interactive filters for Genre, Country, and Release Year.
The analysis and dashboard revealed several key insights into the Amazon Prime Video library:
- Movies Dominate the Catalog: The platform has a significantly larger collection of movies compared to TV shows.
- US-Centric Content Library: The United States is the largest single contributor of content to the platform.
- A Surge in Recent Content: There has been a dramatic increase in the number of titles released in the last two decades.
- Drama and Comedy Lead the Way: These are the two most common genres found in the Prime Video library.
This project provides a solid foundation for further analysis. Potential future work could include:
- Sentiment Analysis: Performing sentiment analysis on the `description` column to gauge the overall tone of the content.
- Time-Series Analysis: Analyzing the `date_added` column to identify trends in content acquisition over time.
- Network Analysis: Creating a network graph of directors and cast members to find key collaborators.
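As a starting point for the time-series idea, the sketch below counts titles by the year in which they were added. It is not part of the current analysis. It assumes `date_added` strings look like `"March 30, 2021"`, which should be checked against the actual data; the function name and sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime


def titles_added_per_year(date_added_values, date_format="%B %d, %Y"):
    """Count titles per year added, skipping missing or unparseable dates."""
    counts = Counter()
    for value in date_added_values:
        # Ignore missing entries (None, NaN, blank strings).
        if not isinstance(value, str) or not value.strip():
            continue
        try:
            added = datetime.strptime(value.strip(), date_format)
        except ValueError:
            # The assumed format did not match; skip rather than guess.
            continue
        counts[added.year] += 1
    # Return the counts ordered by year for easy plotting.
    return dict(sorted(counts.items()))


# Made-up values in the assumed date_added format.
sample_dates = ["March 30, 2021", "April 1, 2021", None, "not a date", "June 15, 2020"]
per_year = titles_added_per_year(sample_dates)
print(per_year)
```

If the real column uses a different format, only the `date_format` argument needs to change.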
This project utilizes the following tools and Python libraries:
- Python for data analysis and manipulation.
- Pandas: For data cleaning, manipulation, and aggregation.
- Matplotlib & Seaborn: For creating static visualizations within the Jupyter Notebook.
- Collections (Counter): For efficient counting of genres.
- Jupyter Notebook: As the primary environment for coding and analysis.
- Power BI: For creating the final interactive and shareable dashboard.
├── amazon_prime_titles.csv # Raw, original dataset
├── prime_video_analysis.ipynb # Jupyter Notebook with the full Python analysis
├── prime_video_cleaned.csv # Cleaned data, ready for Power BI
├── prime_video_country_map.csv # Data prepared for map visualizations
├── Amazon Prime Video Analysis Dashboard.jpg # Screenshot of the final dashboard
└── README.md # This documentation file
To replicate this analysis on your local machine, follow these steps:
- Clone the repository:

      git clone https://github.com/your-username/your-repo-name.git
      cd your-repo-name

- Install the required libraries:

      pip install pandas matplotlib seaborn jupyter

- Run the Jupyter Notebook: Launch the `prime_video_analysis.ipynb` notebook to execute the data cleaning and EDA. This will also generate the cleaned CSV files.

      jupyter notebook prime_video_analysis.ipynb

- Explore the Dashboard: Open Power BI and use `prime_video_cleaned.csv` as the data source to build the dashboard or create your own custom visualizations.
Email: [email protected]