Data Wrangling with Python
Jacqueline Kazil and Katharine Jarmul
Beijing Boston Farnham Sebastopol Tokyo
Table of Contents
Preface xi
1. Introduction to Python 1
Why Python 4
Getting Started with Python 5
Which Python Version 6
Setting Up Python on Your Machine 7
Test Driving Python 11
Install pip 14
Install a Code Editor 15
Optional: Install IPython 16
Summary 16
2. Python Basics 17
Basic Data Types 18
Strings 18
Integers and Floats 19
Data Containers 23
Variables 23
Lists 25
Dictionaries 27
What Can the Various Data Types Do? 28
String Methods: Things Strings Can Do 30
Numerical Methods: Things Numbers Can Do 31
List Methods: Things Lists Can Do 32
Dictionary Methods: Things Dictionaries Can Do 33
Helpful Tools: type, dir, and help 34
type 34
dir 35
help 37
Putting It All Together 38
What Does It All Mean? 38
Summary 40
3. Data Meant to Be Read by Machines 43
CSV Data 44
How to Import CSV Data 46
Saving the Code to a File; Running from Command Line 49
JSON Data 52
How to Import JSON Data 53
XML Data 55
How to Import XML Data 57
Summary 70
4. Working with Excel Files 73
Installing Python Packages 73
Parsing Excel Files 75
Getting Started with Parsing 75
Summary 89
5. PDFs and Problem Solving in Python 91
Avoid Using PDFs! 91
Programmatic Approaches to PDF Parsing 92
Opening and Reading Using slate 94
Converting PDF to Text 96
Parsing PDFs Using pdfminer 97
Learning How to Solve Problems 115
Exercise: Use Table Extraction, Try a Different Library 116
Exercise: Clean the Data Manually 121
Exercise: Try Another Tool 121
Uncommon File Types 124
Summary 124
6. Acquiring and Storing Data 127
Not All Data Is Created Equal 128
Fact Checking 128
Readability, Cleanliness, and Longevity 129
Where to Find Data 130
Using a Telephone 130
US Government Data 132
Government and Civic Open Data Worldwide 133
Organization and Non-Government Organization (NGO) Data 135
Education and University Data 135
Medical and Scientific Data 136
Crowdsourced Data and APIs 136
Case Studies: Example Data Investigation 137
Ebola Crisis 138
Train Safety 138
Football Salaries 139
Child Labor 139
Storing Your Data: When, Why, and How? 140
Databases: A Brief Introduction 141
Relational Databases: MySQL and PostgreSQL 142
Non-Relational Databases: NoSQL 144
Setting Up Your Local Database with Python 145
When to Use a Simple File 147
Cloud-Storage and Python 147
Local Storage and Python 148
Alternative Data Storage 148
Summary 148
7. Data Cleanup: Investigation, Matching, and Formatting 151
Why Clean Data? 151
Data Cleanup Basics 152
Identifying Values for Data Cleanup 153
Formatting Data 164
Finding Outliers and Bad Data 169
Finding Duplicates 175
Fuzzy Matching 179
RegEx Matching 183
What to Do with Duplicate Records 188
Summary 189
8. Data Cleanup: Standardizing and Scripting 193
Normalizing and Standardizing Your Data 193
Saving Your Data 194
Determining What Data Cleanup Is Right for Your Project 197
Scripting Your Cleanup 198
Testing with New Data 214
Summary 216