This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.
Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.
● Obtain data from websites, APIs, databases, and spreadsheets
● Perform scrub operations on plain text, CSV, HTML/XML, and JSON
● Explore data, compute descriptive statistics, and create visualizations
● Manage your data science workflow using Drake
● Create reusable tools from one-liners and existing Python or R code
● Parallelize and distribute data-intensive pipelines using GNU Parallel
● Model data with dimensionality reduction, clustering, regression, and classification algorithms
Chapter 1 Introduction
Overview
Data Science Is OSEMN
Intermezzo Chapters
What Is the Command Line?
Why Data Science at the Command Line?
A Real-World Use Case
Further Reading
Chapter 2 Getting Started
Overview
Setting Up Your Data Science Toolbox
Essential Concepts and Tools
Further Reading
Chapter 3 Obtaining Data
Overview
Copying Local Files to the Data Science Toolbox
Decompressing Files
Converting Microsoft Excel Spreadsheets
Querying Relational Databases
Downloading from the Internet
Calling Web APIs
Further Reading
Chapter 4 Creating Reusable Command-Line Tools
Overview
Converting One-Liners into Shell Scripts
Creating Command-Line Tools with Python and R
Further Reading
Chapter 5 Scrubbing Data
Overview
Common Scrub Operations for Plain Text
Working with CSV
Working with HTML/XML and JSON
Common Scrub Operations for CSV
Further Reading
Chapter 6 Managing Your Data Workflow
Overview
Introducing Drake
Installing Drake
Obtain Top Ebooks from Project Gutenberg
Every Workflow Starts with a Single Step
Well, That Depends
Rebuilding Specific Targets
Discussion
Further Reading
Chapter 7 Exploring Data
Overview
Inspecting Data and Its Properties
Computing Descriptive Statistics
Creating Visualizations
Further Reading
Chapter 8 Parallel Pipelines
Overview
Serial Processing
Parallel Processing
Distributed Processing
Discussion
Further Reading
Chapter 9 Modeling Data
Overview
More Wine, Please!
Dimensionality Reduction with Tapkee
Clustering with Weka
Regression with SciKit-Learn Laboratory
Classification with BigML
Further Reading
Chapter 10 Conclusion
Let’s Recap
Three Pieces of Advice
Where to Go from Here?
Getting in Touch