This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data.
To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.
Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.
● Obtain data from websites, APIs, databases, and spreadsheets
● Perform scrub operations on plain text, CSV, HTML/XML, and JSON
● Explore data, compute descriptive statistics, and create visualizations
● Manage your data science workflow using Drake
● Create reusable tools from one-liners and existing Python or R code
● Parallelize and distribute data-intensive pipelines using GNU Parallel
● Model data with dimensionality reduction, clustering, regression, and classification algorithms
Chapter 1 Introduction
Overview
Data Science Is OSEMN
Intermezzo Chapters
What Is the Command Line?
Why Data Science at the Command Line?
A Real-World Use Case
Further Reading
Chapter 2 Getting Started
Overview
Setting Up Your Data Science Toolbox
Essential Concepts and Tools
Further Reading
Chapter 3 Obtaining Data
Overview
Copying Local Files to the Data Science Toolbox
Decompressing Files
Converting Microsoft Excel Spreadsheets
Querying Relational Databases
Downloading from the Internet
Calling Web APIs
Further Reading
Chapter 4 Creating Reusable Command-Line Tools
Overview
Converting One-Liners into Shell Scripts
Creating Command-Line Tools with Python and R
Further Reading
Chapter 5 Scrubbing Data
Overview
Common Scrub Operations for Plain Text
Working with CSV
Working with HTML/XML and JSON
Common Scrub Operations for CSV
Further Reading
Chapter 6 Managing Your Data Workflow
Overview
Introducing Drake
Installing Drake
Obtain Top Ebooks from Project Gutenberg
Every Workflow Starts with a Single Step
Well, That Depends
Rebuilding Specific Targets
Discussion
Further Reading
Chapter 7 Exploring Data
Overview
Inspecting Data and Its Properties
Computing Descriptive Statistics
Creating Visualizations
Further Reading
Chapter 8 Parallel Pipelines
Overview
Serial Processing
Parallel Processing
Distributed Processing
Discussion
Further Reading
Chapter 9 Modeling Data
Overview
More Wine, Please!
Dimensionality Reduction with Tapkee
Clustering with Weka
Regression with SciKit-Learn Laboratory
Classification with BigML
Further Reading
Chapter 10 Conclusion
Let’s Recap
Three Pieces of Advice
Where to Go from Here?
Getting in Touch