A comprehensive collection of Python scripts and examples focused on data engineering fundamentals, including data importing, manipulation, API interactions, and Python programming concepts.
This repository contains hands-on examples and exercises covering essential data engineering topics in Python. It includes practical implementations of data importing techniques, working with various file formats, API interactions, and intermediate Python programming concepts.
data-engineer/
├── cleaning-data-in-python/ # Data cleaning techniques
├── data-manipulation-with-pandas/ # Pandas operations
├── importing-data-in-python/ # Data importing fundamentals
├── intermediate-importing-data/ # Advanced data importing
├── intermediate-python/ # Python programming concepts
├── introduction-api/ # API fundamentals
├── introduction-to-python/ # Python basics
├── joining-data-with-pandas/ # Data merging and joining
├── python-data-science-toolbox/ # Data science utilities
├── streamlining-data-ingestion-with-pandas/ # Efficient data ingestion
└── requirements.txt # Project dependencies
- Python 3.x
- NumPy - Numerical computing
- Pandas - Data manipulation and analysis
- Matplotlib - Data visualization
- PyYAML - YAML file parsing
- Pillow - Image processing
- Tweepy - Twitter API integration
- Clone the repository:
git clone https://github.com/petsereypanha/data-engineer.git
cd data-engineer
- Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On macOS/Linux
# or
venv\Scripts\activate # On Windows
- Install required packages:
pip install -r requirements.txt
Navigate to any module directory and run the Python scripts:
# Example: Run data importing scripts
python importing-data-in-python/importing-data.py
# Example: Run API interaction scripts
python intermediate-importing-data/interacting-with-APIs.py
# Example: Run Python function examples
python intermediate-python/function.py
Data cleaning (cleaning-data-in-python/):
- Handling missing data
- Data type conversions
- Removing duplicates
- Data validation techniques
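
The cleaning steps listed above can be sketched on a small in-memory table. This is a minimal illustration, not code from the repository: the column names and sample values are hypothetical, and filling with the median is just one possible strategy for missing data.

```python
import pandas as pd

# Hypothetical sample data: one duplicate row, one missing value,
# and numeric values stored as strings.
raw = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Ben", "Cy"],
        "age": ["34", "29", "29", None],
    }
)

# Removing duplicates: drop exact duplicate rows, then renumber the index.
clean = raw.drop_duplicates().reset_index(drop=True)

# Data type conversion: strings to numbers; unparseable values become NaN.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# Handling missing data: fill the gap with the median of the known ages.
clean["age"] = clean["age"].fillna(clean["age"].median())

# Validation: no missing values remain and every age is in a plausible range.
is_valid = bool(clean["age"].notna().all() and clean["age"].between(0, 120).all())
print(clean)
print("valid:", is_valid)
```

Whether to fill, drop, or flag missing values depends on the dataset; the median is used here only because it is robust to outliers.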
Data manipulation with pandas (data-manipulation-with-pandas/):
- DataFrame operations
- Filtering and sorting
- Aggregations and grouping
- Data transformation techniques
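
As a rough sketch of the operations above, the following example derives a column, filters and sorts rows, and aggregates by group. The `sales` table and its column names are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical sales data.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south", "east"],
        "units": [10, 4, 7, 12, 5],
        "price": [2.5, 3.0, 2.5, 3.0, 4.0],
    }
)

# Transformation: derive a new column from existing ones.
sales["revenue"] = sales["units"] * sales["price"]

# Filtering and sorting: keep rows with at least 5 units, largest revenue first.
top = sales[sales["units"] >= 5].sort_values("revenue", ascending=False)

# Grouping and aggregation: total units and mean revenue per region.
summary = sales.groupby("region").agg(
    total_units=("units", "sum"),
    mean_revenue=("revenue", "mean"),
)
print(top)
print(summary)
```

Named aggregation (`total_units=("units", "sum")`) keeps the output column names explicit, which tends to be easier to read than renaming columns afterwards.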
Importing data (importing-data-in-python/):
- Reading flat files (CSV, TXT)
- Working with Excel files
- Loading pickle files
- Importing SAS and Stata files
- Working with HDF5 files
- Loading MATLAB files
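
Two of the formats above, flat files and pickle, can be demonstrated without any files on disk. The sketch below uses in-memory stand-ins; the CSV contents are hypothetical, and the other formats (Excel, SAS, Stata, HDF5, MATLAB) are not shown because they need real files and extra libraries.

```python
import io
import pickle

import pandas as pd

# In-memory stand-in for a CSV file on disk (hypothetical contents).
csv_text = "id,city,temp_c\n1,Oslo,4.5\n2,Lima,19.0\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Round-trip a Python object through pickle, using bytes instead of a file.
payload = pickle.dumps({"rows": len(df), "columns": list(df.columns)})
restored = pickle.loads(payload)
print(df)
print(restored)
```

Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.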
Intermediate data importing (intermediate-importing-data/):
- Importing data from the Internet
- API interactions and authentication
- Working with Twitter API
- HTTP requests and responses
- JSON data parsing
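
The JSON parsing step above might look like the following. To keep the example runnable offline, no request is actually sent: the endpoint `https://api.example.com/search`, its parameters, and the response body are all hypothetical placeholders.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, for illustration only.
BASE_URL = "https://api.example.com/search"
params = {"q": "data engineering", "limit": 2}

# Build the full request URL with properly encoded query parameters.
url = f"{BASE_URL}?{urlencode(params)}"

# Stand-in for the text of an HTTP response body.
response_text = (
    '{"count": 2, "items": [{"id": 1, "title": "ETL"}, {"id": 2, "title": "ELT"}]}'
)

# Parse the JSON text into Python dicts and lists, then pull out one field.
data = json.loads(response_text)
titles = [item["title"] for item in data["items"]]
print(url)
print(titles)
```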
Intermediate Python (intermediate-python/):
- Function definition and usage
- Default arguments and keyword arguments
- Docstrings and documentation
- *args and **kwargs
- Lambda functions
- Error handling
- Python ecosystem tools
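
The function concepts above fit into a few short definitions. The function names here (`total`, `make_record`, `safe_divide`) are hypothetical and chosen only for illustration.

```python
def total(*args, start=0):
    """Return the sum of all positional arguments, added to start."""
    return start + sum(args)


def make_record(name, **kwargs):
    """Build a dict from a required name plus any keyword arguments."""
    return {"name": name, **kwargs}


def safe_divide(a, b, default=None):
    """Divide a by b, returning default instead of raising on division by zero."""
    try:
        return a / b
    except ZeroDivisionError:
        return default


# A lambda is an anonymous single-expression function.
square = lambda x: x * x
squares = list(map(square, [1, 2, 3]))

print(total(1, 2, 3), total(1, 2, start=10))
print(make_record("etl-job", owner="data-team"))
print(safe_divide(6, 3), safe_divide(1, 0))
print(squares)
```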
Python basics (introduction-to-python/):
- Basic data types
- Control flow (if/else statements)
- Loops (for, while)
- Python fundamentals
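
A small sketch of the basics above: an if/else chain, a `for` loop, and a `while` loop. The `classify` helper is a hypothetical example, not a function from the repository.

```python
def classify(n):
    """Label a number as negative, zero, or positive."""
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"


# for loop over a list containing an int, a float, and zero.
values = [3, -1.5, 0]
labels = []
for v in values:
    labels.append(classify(v))

# while loop: count the halvings needed for 100 to drop below 1.
n = 100
halvings = 0
while n >= 1:
    n /= 2
    halvings += 1

print(labels)
print(halvings)
```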
API fundamentals (introduction-api/):
- Making HTTP requests
- RESTful API concepts
- Authentication methods
- Handling API responses
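
The request and response handling above can be sketched with the standard library alone. This example only constructs a request and never sends it. The URL, the token value, and the helper names `build_request` and `handle_response` are assumptions for illustration; bearer tokens are one common authentication method among several, and the repository's scripts may use a different HTTP client.

```python
import json
from urllib.request import Request


def build_request(url, token):
    """Create a GET request carrying a bearer token (the request is not sent)."""
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    return Request(url, headers=headers)


def handle_response(status, body):
    """Return parsed JSON for 2xx responses; raise for anything else."""
    if 200 <= status < 300:
        return json.loads(body)
    raise RuntimeError(f"request failed with status {status}")


# Hypothetical endpoint and placeholder token.
req = build_request("https://api.example.com/v1/items", token="YOUR_TOKEN")
print(req.get_method(), req.full_url)

# Stand-in for a successful response from the server.
print(handle_response(200, '{"ok": true}'))
```

Passing the request to `urllib.request.urlopen` would actually send it; that step is left out so the example has no network dependency.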
Joining data with pandas (joining-data-with-pandas/):
- Inner joins
- Outer joins
- Left and right joins
- Merging DataFrames
- Concatenating data
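
The join types above differ only in which keys they keep. The following sketch uses two hypothetical tables, `customers` and `orders`, that share a `cust_id` column.

```python
import pandas as pd

# Hypothetical tables: customer 1 has no order, order 4 has no customer.
customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"cust_id": [2, 3, 4], "amount": [50, 20, 70]})

# Inner join: only keys present in both tables.
inner = customers.merge(orders, on="cust_id", how="inner")

# Left join: every customer, with NaN where no order matches.
left = customers.merge(orders, on="cust_id", how="left")

# Outer join: every key from either table.
outer = customers.merge(orders, on="cust_id", how="outer")

# Concatenation: stack tables with the same columns on top of each other.
stacked = pd.concat([customers, customers], ignore_index=True)

print(len(inner), len(left), len(outer), len(stacked))
```

A right join is the mirror image of the left join: `how="right"` keeps every row of `orders` instead.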
Python data science toolbox (python-data-science-toolbox/):
- Advanced function techniques
- Iterators and generators
- List comprehensions
- Data science utilities
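
Iterators, generators, and list comprehensions from the list above can be shown in a few lines. The `batched` generator is a hypothetical helper written for this sketch.

```python
def batched(items, size):
    """Yield consecutive lists of at most size items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# An iterator hands out one item per call to next().
it = iter(["a", "b"])
first = next(it)

# A generator produces its values lazily; list() consumes them all.
batches = list(batched([1, 2, 3, 4, 5], 2))

# A list comprehension with a filter condition.
even_squares = [n * n for n in range(10) if n % 2 == 0]

print(first)
print(batches)
print(even_squares)
```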
Streamlining data ingestion with pandas (streamlining-data-ingestion-with-pandas/):
- Efficient data loading
- Optimizing data types
- Chunking large files
- Performance best practices
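
One way to combine the ideas above is to read only the needed columns, give them compact dtypes, and process the file in chunks so it never has to fit in memory at once. The sketch below uses a small generated CSV as a stand-in for a large file; the column names and the chunk size of 4 are arbitrary assumptions.

```python
import io

import pandas as pd

# Stand-in for a large CSV file: 10 generated rows (hypothetical columns).
csv_text = "id,category,value\n" + "".join(
    f"{i},{'a' if i % 2 else 'b'},{i * 1.5}\n" for i in range(10)
)

total = 0.0
n_chunks = 0
for chunk in pd.read_csv(
    io.StringIO(csv_text),
    usecols=["category", "value"],  # skip columns that are not needed
    dtype={"category": "category", "value": "float32"},  # compact dtypes
    chunksize=4,  # yield DataFrames of at most 4 rows each
):
    # Aggregate each chunk, keeping only the running result in memory.
    total += float(chunk["value"].sum())
    n_chunks += 1

print(n_chunks, total)
```

For a real file, a chunk size in the tens or hundreds of thousands of rows is more typical; the right value depends on row width and available memory.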
Contributions are welcome! Feel free to:
- Fork the repository
- Create a new branch (git checkout -b feature/improvement)
- Make your changes
- Commit your changes (git commit -am 'Add new feature')
- Push to the branch (git push origin feature/improvement)
- Create a Pull Request
This project was created for educational purposes.
Panha Setserey
- GitHub: @petsereypanha
⭐ If you find this repository helpful, please consider giving it a star!