This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.
Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.
● Obtain data from websites, APIs, databases, and spreadsheets
● Perform scrub operations on plain text, CSV, HTML/XML, and JSON
● Explore data, compute descriptive statistics, and create visualizations
● Manage your data science workflow using Drake
● Create reusable tools from one-liners and existing Python or R code
● Parallelize and distribute data-intensive pipelines using GNU Parallel
● Model data with dimensionality reduction, clustering, regression, and classification algorithms
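As a rough illustration of what "creating a reusable tool from existing Python code" can look like, here is a minimal sketch of a filter that reads text from standard input and prints the most frequent words. It is not taken from the book: the function name `top_words`, the file name `top-words.py`, and the tab-separated output format are all assumptions made for this example.

```python
#!/usr/bin/env python3
"""Hypothetical filter: print the most frequent words read from standard input."""
import re
import sys
from collections import Counter


def top_words(lines, n=10):
    """Return the n most common lowercase words found in an iterable of lines."""
    counts = Counter()
    for line in lines:
        # Treat runs of letters and apostrophes as words; everything else separates them.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts.most_common(n)


def main(argv=None):
    """Read lines from stdin and write 'count<TAB>word' rows to stdout."""
    argv = sys.argv[1:] if argv is None else argv
    n = int(argv[0]) if argv else 10  # optional first argument: how many words to show
    for word, count in top_words(sys.stdin, n):
        print(f"{count}\t{word}")


# Only act as a filter when input is piped in, so the script never waits on a terminal.
if __name__ == "__main__" and not sys.stdin.isatty():
    main()
```

Saved as an executable file, such a script could sit in a pipeline like any other tool, for example `cat book.txt | ./top-words.py 5`. Because it reads standard input and writes plain text to standard output, it can be combined with other small tools instead of growing into a monolithic program.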
Chapter 1 Introduction
Overview
Data Science Is OSEMN
Intermezzo Chapters
What Is the Command Line?
Why Data Science at the Command Line?
A Real-World Use Case
Further Reading
Chapter 2 Getting Started
Overview
Setting Up Your Data Science Toolbox
Essential Concepts and Tools
Further Reading
Chapter 3 Obtaining Data
Overview
Copying Local Files to the Data Science Toolbox
Decompressing Files
Converting Microsoft Excel Spreadsheets
Querying Relational Databases
Downloading from the Internet
Calling Web APIs
Further Reading
Chapter 4 Creating Reusable Command-Line Tools
Overview
Converting One-Liners into Shell Scripts
Creating Command-Line Tools with Python and R
Further Reading
Chapter 5 Scrubbing Data
Overview
Common Scrub Operations for Plain Text
Working with CSV
Working with HTML/XML and JSON
Common Scrub Operations for CSV
Further Reading
Chapter 6 Managing Your Data Workflow
Overview
Introducing Drake
Installing Drake
Obtain Top Ebooks from Project Gutenberg
Every Workflow Starts with a Single Step
Well, That Depends
Rebuilding Specific Targets
Discussion
Further Reading
Chapter 7 Exploring Data
Overview
Inspecting Data and Its Properties
Computing Descriptive Statistics
Creating Visualizations
Further Reading
Chapter 8 Parallel Pipelines
Overview
Serial Processing
Parallel Processing
Distributed Processing
Discussion
Further Reading
Chapter 9 Modeling Data
Overview
More Wine, Please!
Dimensionality Reduction with Tapkee
Clustering with Weka
Regression with SciKit-Learn Laboratory
Classification with BigML
Further Reading
Chapter 10 Conclusion
Let’s Recap
Three Pieces of Advice
Where to Go from Here?
Getting in Touch