GlusterFS HOWTO [1108]

So, I am catching up a bit on the technical documentation. A week spent playing Skyrim, combined with various other bits and pieces, made this a little difficult.

On the bright side, there are a few new things that have been worked on so hopefully plenty of things to cover soon.

We manage a number of servers all over the place, and all of them need to be backed up. We also have a number of desktops, all with mirrored disks, that are backed up as well.

I like things to be nicely efficient, and it's annoying when one server or desktop runs out of space while another two (or ten) have plenty. We grew to dislike NFS, particularly because of its single point of failure, and there were few other options.

We had tried GlusterFS a few years ago (I think it was at version 1.3 or so) and there were various issues, particularly around small files, and configuration was an absolute nightmare.

With high hopes that version 3.2 was exactly what we were looking for, we set up three basic machines for testing.


Saving your workspace window configuration in Linux [1102]

I am usually working on a good half a dozen things at any given time, which means I usually have a good ten or twenty windows open. My Chromium currently has 134 tabs, and that is after I cleaned up and closed all the tabs I no longer need.

Luckily, working in Linux means that I can spread each stream of work into the various workspaces.

Now GNOME 3 makes things a little more complicated with its dynamic workspaces, but I’m learning to use it to my advantage.

However, with Ubuntu 11.10 Oneiric Ocelot and GNOME 3, I seem to be running into an issue regularly… If I leave my computer for a while, it doesn’t unlock correctly. The screen remains black, I can’t move the mouse to my second screen, and the unlock screen doesn’t show up.

Thinking about it, it seems like there might be two screen savers being started but I shall investigate that tomorrow. I have the same issue at both work and home so it is more likely to be related to Ubuntu + GNOME 3 or something about the way I set things up.

I usually resolve this by logging into the console, and here is a neat trick for killing all of your processes in one fell swoop.

$ kill -9 -1

Another thing I have been doing a bit more of recently is gaming, which involves rebooting into Windows.

Both of the above leave me with a restarted workspace. Starting up the applications pops them all into the same workspace. Chrome is especially a nightmare: I might have 135 open tabs, but they are in about six windows spread across four workspaces.

It is annoying to have to distribute these things out each time.


Linux bulk search and replace

Doing a bulk search and replace across a set of files is surprisingly easy. sed is the key. It has a flag, -i, that will modify the files passed to it in place.

$ sed -e 's/TextToFind/Replacement/' -i file1 file2 file3

Tie this power with either grep -l [Thanks to Steve for pointing out a mistake in the following, now corrected]

$ grep -l TextToFind * |xargs sed -e 's/TextToFind/Replacement/' -i

or find

$ find . -type f -exec sed -e 's/TextToFind/Replacement/' -i {} \;
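File names containing spaces can trip up the plain grep -l | xargs pipeline, because xargs splits its input on whitespace. What follows is only a sketch of a more defensive variant, assuming GNU grep, xargs and sed (the -Z, -0, -r and -i options are GNU extensions). The scratch directory and file names are made up for illustration.

```shell
# Work in a scratch directory so nothing real is touched.
demo=$(mktemp -d)
printf 'TextToFind here\n' > "$demo/plain.txt"
printf 'TextToFind twice: TextToFind\n' > "$demo/name with spaces.txt"
printf 'nothing to change\n' > "$demo/untouched.txt"

# grep: -r recurses, -l lists matching files only, -Z ends each name with NUL.
# xargs: -0 reads NUL-separated names, -r skips sed when nothing matched.
# The trailing g replaces every occurrence on a line, not just the first.
grep -rlZ -- 'TextToFind' "$demo" | xargs -0 -r sed -i -e 's/TextToFind/Replacement/g'

cat "$demo/name with spaces.txt"
```

Only files that actually contain the text are handed to sed, so the timestamps of everything else stay as they were.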

If there are multiple changes you want to make, just put them all into a file and pass it in via the -f flag.

file: replacements.patterns

s/TextToFind1/Replacement1/
s/TextToFind2/Replacement2/
s/TextToFind3/Replacement3/

and the command, using find to iterate through all files in the current directory and subdirectories.

$ find . -type f -exec sed -f replacements.patterns -i {} \;
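One thing to watch for: if the patterns file lives inside the tree being searched, sed will happily rewrite it too. The sketch below is one possible way around that, assuming GNU sed and find. It skips the patterns file by name and keeps a .bak copy of every file it edits so the result can be reviewed with diff. The scratch directory and sample file are hypothetical.

```shell
# Scratch directory with a patterns file and one sample file to edit.
demo=$(mktemp -d)
cat > "$demo/replacements.patterns" <<'EOF'
s/TextToFind1/Replacement1/
s/TextToFind2/Replacement2/
EOF
printf 'TextToFind1 and TextToFind2\n' > "$demo/sample.txt"

# ! -name keeps the patterns file itself from being rewritten.
# -i.bak edits in place but saves the original as <file>.bak.
find "$demo" -type f ! -name 'replacements.patterns' \
    -exec sed -f "$demo/replacements.patterns" -i.bak {} \;

# Review the change; diff exits non-zero when files differ, hence || true.
diff "$demo/sample.txt.bak" "$demo/sample.txt" || true
```

Once the diff looks right, the .bak files can be removed with another find.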

et voila – hope it helps.

Synergy with Linux Server & Mac Client

I borrowed a Mac to try and play with iPhone development. I already have a Linux box (running Ubuntu 9.10). Anyone who has used two computers simultaneously knows how annoying it is to have two keyboards and mice plugged in. I originally anticipated just using X11 forwarding. However, it is an iMac with a big beautiful screen. It would be an absolute waste not to use it.


Vista Guest, Linux Host, VirtualBox, Host Networking – Bridge

One would think that it would be straightforward, work off the bat, or at least have some reasonable documentation. Unfortunately, no!

I needed host networking to be able to access network resources (Samba shares etc.) which does not work if the guest OS is on NAT 😦

Solving it was easy though… I assume Vista is installed as a guest with the guest additions and that your user account is a part of the vboxusers group.

On the Linux host, first install bridge-utils. I run Ubuntu, so it was as easy as:

$ sudo aptitude install bridge-utils

Next, you need to set up the bridge; again, easy on Ubuntu:

Add the following section to /etc/network/interfaces:

auto br0
iface br0 inet dhcp
bridge_ports eth1

Add the interfaces to VirtualBox

$ sudo VBoxAddIF vbox0 'shri' br0

Within the VirtualBox guest settings, choose Host Networking and, for the interface, choose br0.

Bring the interface up:

$ sudo ifup br0

and start your guest os… et voila, it just works…

Making Twitter Faster

From my perspective, Twitter has a really, really interesting technical problem to solve: how to store and retrieve a large amount of data really, really quickly.

I am making some assumptions based on how I see Twitter working. I have little information about how it is architected apart from some posts that suggest that it is running Ruby on Rails with MySQL?

Twitter is in the rare category where a very large amount of data is being added. There should be no updates (except to user information, and there should be relatively little of that). There is no need for transactionality. If I guess right, it should be a large amount of inserts and selects.

While a relational database is probably the only viable choice for the time being, I think that Twitter can scale and perform better if all the extra bits of a relational database system were removed.

I love challenges like this. Technical ones are easier 😉

If I didn’t have a lifetime job, I would prototype this in a bit more depth. Garry pointed me in the direction of Hadoop. Having had a quick look at it, it can take care of the infrastructure, clustering and massive horizontal scaling requirements.

Now for the data layer on top. How to store and retrieve the data. HBase is probably a good option but doing it manually should be fairly straightforward too.

From my limited understanding of Twitter, there are two key pieces of functionality: the timelines and search.

The timelines can be solved by storing each tweet as a file within a directory structure. My tweets would go into

/w/o/r/d/s/o/n/s/a/n/d/<tweet-filename>

The filename would be <username>-<timestamp>

For the public timeline, you just have a similar folder structure, but keyed on the timestamp. For example, the timestamp 1236158897 would go into the following structure as a symlink:

/1/2/3/6/1/5/8/8/9/7/<username>
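The two timeline layouts above can be tried out on any Linux box. This is a rough sketch rather than a proposal for how Twitter should do it: the shard_path helper, the users and public subdirectories, and the temporary root are all assumptions made for illustration.

```shell
# Turn a string into one directory level per character: abc -> /a/b/c
shard_path() {
    printf '%s' "$1" | sed 's|.|/&|g'
}

root=$(mktemp -d)      # hypothetical stand-in for the real storage root
user=wordsonsand
ts=1236158897

# Per-user timeline: .../w/o/r/d/s/o/n/s/a/n/d/<username>-<timestamp>
user_dir="$root/users$(shard_path "$user")"
mkdir -p "$user_dir"
printf 'hello world\n' > "$user_dir/$user-$ts"

# Public timeline: .../1/2/3/6/1/5/8/8/9/7/<username>, a symlink to the tweet
time_dir="$root/public$(shard_path "$ts")"
mkdir -p "$time_dir"
ln -s "$user_dir/$user-$ts" "$time_dir/$user"

# Reading through the symlink returns the original tweet text.
cat "$time_dir/$user"
```

Keeping the two trees under separate top-level directories avoids a user whose name is all digits colliding with a timestamp path.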

For search, pick up each word in the tweet and pop the tweet as a symlink into that folder. You could have a folder per word or follow the structure above.

/t/w/i/t/t/e/r/<username>-<timestamp> OR

twitter/<username>-<timestamp>
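As a rough illustration of the folder-per-word variant, the sketch below splits a tweet into lowercase words and drops a symlink into one directory per word. The tweets and words directory names, the temporary root and the sample tweet text are all made up, and real tokenising would need far more care than this.

```shell
root=$(mktemp -d)      # hypothetical stand-in for the real storage root
tweet="$root/tweets/wordsonsand-1236158897"
mkdir -p "$root/tweets" "$root/words"
printf 'Making Twitter faster with symlinks\n' > "$tweet"

# Lowercase the text, split on anything that is not a letter or digit,
# and drop duplicates so each word gets exactly one link.
tr '[:upper:]' '[:lower:]' < "$tweet" | tr -cs '[:alnum:]' '\n' | sort -u |
while read -r word; do
    [ -n "$word" ] || continue
    mkdir -p "$root/words/$word"
    ln -s "$tweet" "$root/words/$word/$(basename "$tweet")"
done

# "Searching" for a word is then just a directory listing.
ls "$root/words/twitter"
```

A multi-word search would amount to intersecting the listings of two or more word directories.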

You would then have an application running on top with a distributed cache and an API to make access to the data easier than direct file access. Running on Linux, the kernel will take care of a large part of the caching and buffering automatically, as long as there is enough RAM on the box.

This can in theory be done without Hadoop in between, by separating the directory structures across multiple servers, but that can have complications of its own, especially with adding and removing boxes for scalability.

You are also likely to run into issues with the number of files / sub-directories limits but they can be solved by ‘archiving’ – multiple options for that too…

Thinking about this problem brought me back to the good old days of working on the search mechanism within megabus.com. We needed the site to deal with a large number of searches on limited hardware when the project was still classified as a pilot.

With some hard work and experimentation, we were able to reduce the search time to a tenth of the original time.

I’ll admit that I don’t know the details or the intricacies of the requirements that twitter has. I have probably over-simplified the problem but it was still fun to think about. If you can think of problems with this – let me know; I wanna turn them into opportunities 😉

X11 Remote Applications Responsiveness

As a developer, I use Eclipse a lot… We have a powerful server off which Eclipse is run, which allows us to keep the desktops at a much lower spec. In general, this works well for us.

However, recently, I have been niggled by the amount of time it takes to switch perspectives in Eclipse. It takes a good 4 seconds to switch between perspectives. There is also a noticeable lag when performing some operations.

To resolve this, I spent a lot of time looking at the linux real-time and low-latency patches. I had expected that running X11 applications remotely would not cause a bottleneck over a gigabit link. Turns out that I was wrong.

To test this, I ran a vnc server on the application server and found that switching perspectives on there was super fast.

To resolve this, the first thing to do was to remove any latency that ssh adds to the X-to-X communication.

We use gdm, so I had to enable TCP on there first. Do this using the following config line in /etc/gdm/gdm.conf

DisallowTCP=false

Restart gdm.

On the remote host, export DISPLAY:

export DISPLAY=<yourhost>:0

and run your application.

I found the application to be a lot more responsive after this. I didn’t have to worry about X auth since we have NFS-mounted home directories. If you don’t, check this mini howto
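If the application cannot open the display, it is worth checking whether the X server is really listening on TCP. By convention, display :N listens on TCP port 6000 + N. The two helper functions below are hypothetical names of my own, and the reachability check assumes a netcat that supports -z and -w; treat it as a sketch rather than a definitive test.

```shell
# Print the TCP port for a DISPLAY value such as "myhost:0" or "myhost:1.0".
x11_tcp_port() {
    display_num=${1##*:}            # strip everything up to the last colon
    display_num=${display_num%%.*}  # drop an optional ".screen" suffix
    echo $((6000 + display_num))
}

# Succeeds if something accepts TCP connections on that display's port.
# Assumes nc supports -z (connect only) and -w (timeout in seconds).
x11_tcp_reachable() {
    host=${1%%:*}
    nc -z -w 2 "$host" "$(x11_tcp_port "$1")"
}

# Pure calculation, no network access: shows the port for display 0.
x11_tcp_port "myhost:0"
```

On the remote host, running x11_tcp_reachable "$DISPLAY" after the export should succeed once gdm has been restarted with TCP enabled.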