This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.
Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.
● Obtain data from websites, APIs, databases, and spreadsheets
● Perform scrub operations on plain text, CSV, HTML/XML, and JSON
● Explore data, compute descriptive statistics, and create visualizations
● Manage your data science workflow using Drake
● Create reusable tools from one-liners and existing Python or R code
● Parallelize and distribute data-intensive pipelines using GNU Parallel
● Model data with dimensionality reduction, clustering, regression, and classification algorithms
Chapter 1 Introduction
Overview
Data Science Is OSEMN
Intermezzo Chapters
What Is the Command Line?
Why Data Science at the Command Line?
A Real-World Use Case
Further Reading
Chapter 2 Getting Started
Overview
Setting Up Your Data Science Toolbox
Essential Concepts and Tools
Further Reading
Chapter 3 Obtaining Data
Overview
Copying Local Files to the Data Science Toolbox
Decompressing Files
Converting Microsoft Excel Spreadsheets
Querying Relational Databases
Downloading from the Internet
Calling Web APIs
Further Reading
Chapter 4 Creating Reusable Command-Line Tools
Overview
Converting One-Liners into Shell Scripts
Creating Command-Line Tools with Python and R
Further Reading
Chapter 5 Scrubbing Data
Overview
Common Scrub Operations for Plain Text
Working with CSV
Working with HTML/XML and JSON
Common Scrub Operations for CSV
Further Reading
Chapter 6 Managing Your Data Workflow
Overview
Introducing Drake
Installing Drake
Obtain Top Ebooks from Project Gutenberg
Every Workflow Starts with a Single Step
Well, That Depends
Rebuilding Specific Targets
Discussion
Further Reading
Chapter 7 Exploring Data
Overview
Inspecting Data and Its Properties
Computing Descriptive Statistics
Creating Visualizations
Further Reading
Chapter 8 Parallel Pipelines
Overview
Serial Processing
Parallel Processing
Distributed Processing
Discussion
Further Reading
Chapter 9 Modeling Data
Overview
More Wine, Please!
Dimensionality Reduction with Tapkee
Clustering with Weka
Regression with SciKit-Learn Laboratory
Classification with BigML
Further Reading
Chapter 10 Conclusion
Let’s Recap
Three Pieces of Advice
Where to Go from Here?
Getting in Touch