November 12, 2010
Work/file-formats related, 2
November 5, 2010
Readings digest. October 2010
Pro Git
I have been using Git for a long time, since before there were any books on it. Not long ago, though, I found this book and decided to read it, looking for interesting topics that I might not be familiar with. The book has its own site with an HTML version available to read (there is also an epub version), together with a blog that has some interesting notes. The source code for the book, together with the examples, is available on GitHub.
The book begins with a short overview of Git: why it was created, its history, etc. There is also a short section on installing Git and its initial setup.
The second chapter starts with simple Git usage patterns, such as adding and committing changes, viewing the history of changes, undoing changes, etc. It also covers the necessary information on remote repositories. The third chapter is entirely dedicated to branching and related questions, starting with creating branches and switching between them, and finishing with a pretty good description of the rebase command. A separate section describes how to use branches to implement different development workflows.
The 4th chapter is dedicated to setting up a remote Git repository: reviewing the existing access protocols, providing public access to a repository, etc. There are also sections on tools that can make life easier, such as Gitosis, Gitolite, etc. The 5th chapter describes how to use Git for collaborative work, starting with a review of different ways to organize it (a shared repository, a separate integrator role, etc.), and continuing with a description of how a project's maintainer and contributors can use Git to do their work.
The 6th chapter describes some "non-standard" Git usage patterns: use of external modules (submodules), subtree merging, rewriting the history of changes, etc. The 7th chapter provides a very good description of Git customization, including the creation of hooks. The 8th chapter is dedicated to using Git with SVN and to migrating to Git from other version control systems.
The last chapter of the book provides information about Git's internals: how objects and other information are stored, how this information is transferred using different protocols, etc.
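To give a flavor of the object storage part, here is a minimal Python sketch (mine, not taken from the book) of how Git derives the ID of a blob object in a SHA-1 repository: it hashes a short header followed by the file content. The function name git_blob_id is only for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git gives to a blob with this content."""
    # Git hashes the header "blob <size>\0" followed by the raw bytes.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Should match the output of: echo 'hello' | git hash-object --stdin
    print(git_blob_id(b"hello\n"))
```

On disk, Git stores the same header plus content, compressed with zlib, in a file under .git/objects whose path is derived from this ID.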
Conclusion: this is a very good introduction to Git, and I can recommend it to everyone who wants to start working with Git. Developers who are already using Git can also find some useful information, for example, the descriptions of different external tools.
Data-Intensive Text Processing with MapReduce
In my free time (and sometimes at work) I work on tasks that require processing a lot of data. Hadoop was selected as the platform, and programs are usually written with clojure-hadoop and some other tools.
I already wrote about the "Hadoop: The Definitive Guide" book (by the way, the second edition was released not long ago, updated with descriptions of recent Hadoop versions). In contrast to that book, Data-Intensive Text Processing with MapReduce doesn't describe how to program with Hadoop or some other Map/Reduce system, but describes how to design algorithms for Map/Reduce. The book also describes how to implement some well-known algorithms, such as building an inverted text index, using this parallel programming model. The book has its own site, where you can find additional information and download a beta version of the book. Besides this, the Cloud9 library was created for experimenting with the algorithms, and you can use it in your own work.
The book begins (chapter 1) with a short description of the basic concepts of Map/Reduce, how tasks are executed in a typical framework, and how work is divided between mappers and reducers. Besides this, there is a short description of Hadoop's main subsystems, so the reader will understand how tasks are deployed, how data is stored, etc.
The second chapter reviews different approaches to designing algorithms for Map/Reduce, starting with naive implementations and then moving on to optimizations (for example, the use of additional combiners or of in-mapper caching), sorting, joins, etc.
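To illustrate the difference between a naive mapper and in-mapper caching, here is a small Python sketch of word counting. It only simulates the idea in a single process and is not the book's code or a Hadoop API; the names naive_mapper, caching_mapper and reducer are my own.

```python
from collections import Counter, defaultdict


def naive_mapper(line):
    """Emit one (word, 1) pair for every token in the line."""
    return [(word, 1) for word in line.split()]


def caching_mapper(lines):
    """Aggregate counts in memory and emit one pair per distinct word."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return list(counts.items())


def reducer(pairs):
    """Sum the partial counts for every word."""
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return dict(totals)


lines = ["a b a", "b a c"]
naive_pairs = [pair for line in lines for pair in naive_mapper(line)]
cached_pairs = caching_mapper(lines)

if __name__ == "__main__":
    print(len(naive_pairs), "pairs emitted by the naive mapper")
    print(len(cached_pairs), "pairs emitted with in-mapper caching")
    print(reducer(cached_pairs))
```

Both variants give the reducer the same totals, but the second one sends fewer intermediate pairs over the network, which is the point of the optimization.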
Chapter 3 explains how to build an inverted index (the structure used in full-text search) using Map/Reduce. A naive implementation is presented first, and the discussion then continues with optimizations, for example, how to compress the index.
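A naive version of this can be sketched in a few lines of Python. This is a single-process simulation under my own assumptions (whitespace tokenization, integer document IDs, hypothetical function names), not the book's implementation. The to_gaps helper shows one common first step of index compression: storing differences between sorted document IDs instead of the IDs themselves.

```python
from collections import Counter, defaultdict


def map_document(doc_id, text):
    """Emit (term, (doc_id, term_frequency)) for every distinct term."""
    frequencies = Counter(text.lower().split())
    return [(term, (doc_id, n)) for term, n in frequencies.items()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_postings(term, postings):
    """Build the postings list for a term, sorted by document ID."""
    return term, sorted(postings)


def build_index(docs):
    pairs = [p for doc_id, text in docs.items()
             for p in map_document(doc_id, text)]
    return dict(reduce_postings(t, ps) for t, ps in shuffle(pairs).items())


def to_gaps(doc_ids):
    """Replace sorted document IDs with the differences between neighbors."""
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]


if __name__ == "__main__":
    index = build_index({1: "git is fast", 2: "hadoop is big is slow"})
    print(index["is"])
    print(to_gaps([3, 7, 12]))
```

Small gaps need fewer bits than full document IDs when a variable-length encoding is applied afterwards.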
The 4th chapter describes how graph algorithms can be implemented with Map/Reduce frameworks. The first example is the calculation of shortest paths. The authors describe parallel breadth-first search and discuss its differences from the classic Dijkstra's algorithm, along with other issues. The second example discusses an implementation of PageRank. The last section discusses known issues with parallel algorithms on graphs and provides links to additional literature.
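The idea of parallel breadth-first search can be sketched as follows. This is my own simplified Python simulation, assuming unit edge weights and a graph in which every node appears as a key of the adjacency dictionary; all names are hypothetical. Each iteration is one map/reduce pass: mappers propose distances to neighbors, reducers keep the minimum, and the driver repeats until nothing changes.

```python
from collections import defaultdict

INF = float("inf")


def bfs_map(node, dist, neighbors):
    # Pass the node itself along so the reducer can rebuild the graph.
    yield node, ("node", dist, neighbors)
    # A reached node proposes distance + 1 to each of its neighbors.
    if dist != INF:
        for neighbor in neighbors:
            yield neighbor, ("dist", dist + 1)


def bfs_reduce(node, values):
    # Keep the smallest distance seen and recover the adjacency list.
    best, neighbors = INF, []
    for value in values:
        if value[0] == "node":
            neighbors = value[2]
        best = min(best, value[1])
    return node, best, neighbors


def bfs_iteration(graph):
    grouped = defaultdict(list)
    for node, (dist, neighbors) in graph.items():
        for key, value in bfs_map(node, dist, neighbors):
            grouped[key].append(value)
    result = {}
    for key, values in grouped.items():
        node, dist, neighbors = bfs_reduce(key, values)
        result[node] = (dist, neighbors)
    return result


def shortest_paths(adjacency, source):
    graph = {n: (0 if n == source else INF, nb)
             for n, nb in adjacency.items()}
    while True:
        new_graph = bfs_iteration(graph)
        if new_graph == graph:  # no distance changed: we are done
            return {n: d for n, (d, _) in graph.items()}
        graph = new_graph
```

Unlike Dijkstra's algorithm, which expands one node at a time from a priority queue, every iteration here processes all nodes, so the number of passes depends on how far the farthest reachable node is from the source.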
The 5th chapter is dedicated to using Map/Reduce for machine learning tasks. The first section describes expectation maximization algorithms, and the second describes hidden Markov models as a class of tasks to which expectation maximization algorithms are applicable. Then an implementation of expectation maximization algorithms using Map/Reduce is shown.
In the last chapter the authors discuss different issues with development based on Map/Reduce, together with alternative computing paradigms.
Conclusion: if you program for Hadoop or another Map/Reduce system, then you must read this book, because it provides enough information on how to organize data processing properly. Besides this, the book has a pretty big bibliography that can be used as a source of new information on related topics.
November 4, 2010
Work/file-formats related...
The meeting was 2 days long, and MS Office developers gave talks about the internal details of MS Office and how they relate to the data stored in files. Besides this, it was a good chance to ask some questions directly.
Because this meeting was held for the first time, there were not many external developers (about 15 people from different companies), but there were many people from Microsoft, so many questions were answered almost immediately, and we all had a chance to show broken/problematic files, show snippets of our code, etc.
Channel9 should publish videos of the presentations in the near future, and I'll write about this separately. Besides this, a separate blog was created, where notifications about future events will be published. And I hope that in the future such events will be held in Europe...