This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.
Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.
● Obtain data from websites, APIs, databases, and spreadsheets
● Perform scrub operations on plain text, CSV, HTML/XML, and JSON
● Explore data, compute descriptive statistics, and create visualizations
● Manage your data science workflow using Drake
● Create reusable tools from one-liners and existing Python or R code
● Parallelize and distribute data-intensive pipelines using GNU Parallel
● Model data with dimensionality reduction, clustering, regression, and classification algorithms
Chapter 1 Introduction
Overview
Data Science Is OSEMN
Intermezzo Chapters
What Is the Command Line?
Why Data Science at the Command Line?
A Real-World Use Case
Further Reading
Chapter 2 Getting Started
Overview
Setting Up Your Data Science Toolbox
Essential Concepts and Tools
Further Reading
Chapter 3 Obtaining Data
Overview
Copying Local Files to the Data Science Toolbox
Decompressing Files
Converting Microsoft Excel Spreadsheets
Querying Relational Databases
Downloading from the Internet
Calling Web APIs
Further Reading
Chapter 4 Creating Reusable Command-Line Tools
Overview
Converting One-Liners into Shell Scripts
Creating Command-Line Tools with Python and R
Further Reading
Chapter 5 Scrubbing Data
Overview
Common Scrub Operations for Plain Text
Working with CSV
Working with HTML/XML and JSON
Common Scrub Operations for CSV
Further Reading
Chapter 6 Managing Your Data Workflow
Overview
Introducing Drake
Installing Drake
Obtain Top Ebooks from Project Gutenberg
Every Workflow Starts with a Single Step
Well, That Depends
Rebuilding Specific Targets
Discussion
Further Reading
Chapter 7 Exploring Data
Overview
Inspecting Data and Its Properties
Computing Descriptive Statistics
Creating Visualizations
Further Reading
Chapter 8 Parallel Pipelines
Overview
Serial Processing
Parallel Processing
Distributed Processing
Discussion
Further Reading
Chapter 9 Modeling Data
Overview
More Wine, Please!
Dimensionality Reduction with Tapkee
Clustering with Weka
Regression with SciKit-Learn Laboratory
Classification with BigML
Further Reading
Chapter 10 Conclusion
Let’s Recap
Three Pieces of Advice
Where to Go from Here?
Getting in Touch