This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.
Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.
● Obtain data from websites, APIs, databases, and spreadsheets
● Perform scrub operations on plain text, CSV, HTML/XML, and JSON
● Explore data, compute descriptive statistics, and create visualizations
● Manage your data science workflow using Drake
● Create reusable tools from one-liners and existing Python or R code
● Parallelize and distribute data-intensive pipelines using GNU Parallel
● Model data with dimensionality reduction, clustering, regression, and classification algorithms
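To give a feel for what "combining small, yet powerful, command-line tools" can look like, here is a minimal sketch that is not taken from the book. It assumes a tiny, made-up CSV of city temperatures (the `sample_data` function and its values are hypothetical) and uses only standard Unix tools (`tail`, `cut`, `sort`, `uniq`, `awk`) to count the records per city.

```shell
#!/usr/bin/env bash

# Hypothetical sample data (an assumption for illustration only):
# a header row followed by "city,temp" records.
sample_data() {
  printf '%s\n' \
    'city,temp' \
    'Amsterdam,11' \
    'Boston,9' \
    'Amsterdam,13' \
    'Cairo,28' \
    'Boston,7'
}

# Count records per city by chaining small tools:
#   tail  drops the header row
#   cut   keeps the first comma-separated column
#   sort  groups identical cities together so uniq can count them
#   awk   rewrites "count city" as "city,count"
#   sort  orders by count (descending), then by city name
count_by_city() {
  tail -n +2 \
    | cut -d, -f1 \
    | sort \
    | uniq -c \
    | awk '{print $2 "," $1}' \
    | sort -t, -k2,2nr -k1,1
}

sample_data | count_by_city
```

Each stage does one small job and passes plain text to the next, so stages can be swapped or extended (for example, replacing `sample_data` with `cat` on a real file) without touching the rest of the pipeline.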
Chapter 1 Introduction
Overview
Data Science Is OSEMN
Intermezzo Chapters
What Is the Command Line?
Why Data Science at the Command Line?
A Real-World Use Case
Further Reading
Chapter 2 Getting Started
Overview
Setting Up Your Data Science Toolbox
Essential Concepts and Tools
Further Reading
Chapter 3 Obtaining Data
Overview
Copying Local Files to the Data Science Toolbox
Decompressing Files
Converting Microsoft Excel Spreadsheets
Querying Relational Databases
Downloading from the Internet
Calling Web APIs
Further Reading
Chapter 4 Creating Reusable Command-Line Tools
Overview
Converting One-Liners into Shell Scripts
Creating Command-Line Tools with Python and R
Further Reading
Chapter 5 Scrubbing Data
Overview
Common Scrub Operations for Plain Text
Working with CSV
Working with HTML/XML and JSON
Common Scrub Operations for CSV
Further Reading
Chapter 6 Managing Your Data Workflow
Overview
Introducing Drake
Installing Drake
Obtain Top Ebooks from Project Gutenberg
Every Workflow Starts with a Single Step
Well, That Depends
Rebuilding Specific Targets
Discussion
Further Reading
Chapter 7 Exploring Data
Overview
Inspecting Data and Its Properties
Computing Descriptive Statistics
Creating Visualizations
Further Reading
Chapter 8 Parallel Pipelines
Overview
Serial Processing
Parallel Processing
Distributed Processing
Discussion
Further Reading
Chapter 9 Modeling Data
Overview
More Wine, Please!
Dimensionality Reduction with Tapkee
Clustering with Weka
Regression with SciKit-Learn Laboratory
Classification with BigML
Further Reading
Chapter 10 Conclusion
Let’s Recap
Three Pieces of Advice
Where to Go from Here?
Getting in Touch