This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.
Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.
● Obtain data from websites, APIs, databases, and spreadsheets
● Perform scrub operations on plain text, CSV, HTML/XML, and JSON
● Explore data, compute descriptive statistics, and create visualizations
● Manage your data science workflow using Drake
● Create reusable tools from one-liners and existing Python or R code
● Parallelize and distribute data-intensive pipelines using GNU Parallel
● Model data with dimensionality reduction, clustering, regression, and classification algorithms
Chapter 1 Introduction
Overview
Data Science Is OSEMN
Intermezzo Chapters
What Is the Command Line?
Why Data Science at the Command Line?
A Real-World Use Case
Further Reading
Chapter 2 Getting Started
Overview
Setting Up Your Data Science Toolbox
Essential Concepts and Tools
Further Reading
Chapter 3 Obtaining Data
Overview
Copying Local Files to the Data Science Toolbox
Decompressing Files
Converting Microsoft Excel Spreadsheets
Querying Relational Databases
Downloading from the Internet
Calling Web APIs
Further Reading
Chapter 4 Creating Reusable Command-Line Tools
Overview
Converting One-Liners into Shell Scripts
Creating Command-Line Tools with Python and R
Further Reading
Chapter 5 Scrubbing Data
Overview
Common Scrub Operations for Plain Text
Working with CSV
Working with HTML/XML and JSON
Common Scrub Operations for CSV
Further Reading
Chapter 6 Managing Your Data Workflow
Overview
Introducing Drake
Installing Drake
Obtain Top Ebooks from Project Gutenberg
Every Workflow Starts with a Single Step
Well, That Depends
Rebuilding Specific Targets
Discussion
Further Reading
Chapter 7 Exploring Data
Overview
Inspecting Data and Its Properties
Computing Descriptive Statistics
Creating Visualizations
Further Reading
Chapter 8 Parallel Pipelines
Overview
Serial Processing
Parallel Processing
Distributed Processing
Discussion
Further Reading
Chapter 9 Modeling Data
Overview
More Wine, Please!
Dimensionality Reduction with Tapkee
Clustering with Weka
Regression with SciKit-Learn Laboratory
Classification with BigML
Further Reading
Chapter 10 Conclusion
Let’s Recap
Three Pieces of Advice
Where to Go from Here?
Getting in Touch