Reducing the Data

I’ve spent much of the past two weeks messing about with different ways to reduce over 200,000 bubbles (now almost 220,000) into a sensible catalogue. This gets very messy, so I will try to explain what I’ve been up to in stages. This process is called data reduction, and for a crowd-sourced citizen science project like the MWP it can get complicated. I thought it might interest some of you to see where we currently are in the process of turning your clicks into results.

The key part of the data reduction problem is that we have a very large set of data – the massive number of bubbles that have been drawn – and need to decide which among them are ‘similar’ to each other. We need to keep some flexibility in our definition of similarity because, right now, I’m not sure exactly what ‘similar’ should mean.

Essentially, bubbles are ‘similar’ when two people draw a similarly sized bubble in a similar location. This sounds remarkably easy to say but was hard to do well in code. Comparing 200,000 bubbles with each other naively means around 20 billion pairwise comparisons, which is obviously computationally intensive.
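To make the idea of ‘similar’ concrete, here is a minimal sketch of a pairwise test. It is not the actual MWP code: the `Bubble` fields, the flat-sky distance and both tolerance values are assumptions chosen purely for illustration, and the tolerances are left as parameters because the definition is still flexible.

```python
import math
from dataclasses import dataclass


@dataclass
class Bubble:
    lon: float   # centre longitude in degrees (hypothetical field name)
    lat: float   # centre latitude in degrees (hypothetical field name)
    size: float  # maximum diameter in degrees (hypothetical field name)


def is_similar(a: Bubble, b: Bubble, pos_tol: float = 0.5, size_tol: float = 0.5) -> bool:
    """Return True when two bubbles roughly agree in position and size.

    pos_tol  -- allowed centre separation, as a fraction of the larger diameter (assumed)
    size_tol -- allowed difference in diameter, as a fraction of the larger diameter (assumed)
    """
    larger = max(a.size, b.size)
    if larger <= 0:
        return False
    # Flat-sky approximation: fine for small separations, an assumption here.
    separation = math.hypot(a.lon - b.lon, a.lat - b.lat)
    close = separation <= pos_tol * larger
    same_scale = abs(a.size - b.size) <= size_tol * larger
    return close and same_scale
```

Scaling both tolerances by the larger diameter means a big bubble is allowed more slack than a small one, which seems reasonable for hand-drawn shapes.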

Screen shot 2011-02-22 at 10.23.07

In the end I decided that since the size of bubbles was a consideration, I would move across the galaxy looking at ever-decreasing size scales. To do this I split the galaxy into 2×2 degree boxes and take each box in turn. In each box I check whether there are bubbles of the order of the size of the box (meaning they have a maximum diameter between half a box and a whole box). If there are bubbles on that scale, I run a clustering algorithm and pick out groups of these bubbles with central positions clustered to within one quarter of the box size. If a cluster is found, those bubbles are saved and removed from the whole list. I then divide the box into four and repeat until no bubbles are found.

Screen shot 2011-02-22 at 10.22.42

This method means that when a box contains no bubbles, we need not continue down in size scale, but when it does contain bubbles we always split and inspect the four child boxes. In this way we move through the galaxy, in ever-decreasing boxes, but in a fairly efficient manner.
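The box-by-box descent described above can be sketched roughly as follows. This is an illustration rather than the real reduction code: bubbles are assumed to be `(lon, lat, diameter)` tuples, the greedy `cluster_centres` helper stands in for whatever clustering algorithm is actually used, and the two-drawing minimum and the `min_box` cut-off are assumptions added so the sketch terminates.

```python
import math


def cluster_centres(bubbles, radius):
    """Greedy stand-in for a clustering step: a bubble joins the first
    group whose seed centre lies within `radius` of its own centre."""
    groups = []
    for b in bubbles:
        for g in groups:
            seed = g[0]
            if math.hypot(b[0] - seed[0], b[1] - seed[1]) <= radius:
                g.append(b)
                break
        else:
            groups.append([b])
    # Assumption: a cluster needs at least two drawings.
    return [g for g in groups if len(g) >= 2]


def reduce_box(x0, y0, box, remaining, found, min_box=0.01):
    """Search one box, then recurse into its four quarters.

    remaining -- working list of (lon, lat, diameter) tuples, edited in place
    found     -- list that collects the clusters picked out so far
    """
    inside = [b for b in remaining
              if x0 <= b[0] < x0 + box and y0 <= b[1] < y0 + box]
    if not inside or box < min_box:
        return  # empty box: no need to continue down in size scale
    # Bubbles "of the order of the box": diameter between half a box and a whole box.
    on_scale = [b for b in inside if box / 2 < b[2] <= box]
    for group in cluster_centres(on_scale, box / 4):
        found.append(group)
        for b in group:
            remaining.remove(b)  # clustered bubbles leave the working list
    half = box / 2
    for dx in (0, half):
        for dy in (0, half):
            reduce_box(x0 + dx, y0 + dy, half, remaining, found, min_box)
```

Calling `reduce_box(0, 0, 2.0, remaining, found)` processes a single 2×2 degree box; the empty-box check at the top is what keeps the search from descending into regions with nothing in them.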

We also have to perform the same analysis with an offset grid. This is exactly the same procedure, but it makes sure we catch bubbles that fall on the borders of boxes.
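One simple way to picture the two grids is as two sets of box corners, one shifted relative to the other. In the sketch below the function name, its parameters and the choice of a half-box shift are all assumptions; the size of the offset is not stated here.

```python
def grid_origins(lon_min, lon_max, lat_min, lat_max, box=2.0, offset=0.0):
    """Yield the lower-left corner of every box tiling the region.

    The grid starts `offset` degrees below and to the left of the region,
    so a non-zero offset moves all of the box borders.
    """
    lon = lon_min - offset
    while lon < lon_max:
        lat = lat_min - offset
        while lat < lat_max:
            yield (lon, lat)
            lat += box
        lon += box


# Regular grid, then a grid shifted by half a box (the shift size is assumed).
regular = list(grid_origins(0, 4, -1, 1))
shifted = list(grid_origins(0, 4, -1, 1, offset=1.0))
```

A bubble centred on a border of the regular grid sits in the middle of a box in the shifted grid, so it gets a fair chance to be clustered with its neighbours.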

Once we have passed across the galaxy on all size scales, we need to make sure we’ve cleaned up the duplicates created by the offset grid. We do this by considering our newly created list of ‘clean’ bubbles and running through them in order of size. When we find bubbles of a similar size and location they are combined, according to the number of users that drew that bubble. This can be done more easily now that there are far fewer bubbles (in my tests we have dropped to around 5% of the initial number by this stage).
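As a rough sketch of this clean-up pass, the snippet below walks through the cleaned bubbles from largest to smallest and folds each one into an existing entry if it is similar. The dictionary keys, the `tol` value and the user-weighted averaging are assumptions about what ‘combined according to the number of users’ could mean, not a description of the real code.

```python
import math


def similar(a, b, tol=0.5):
    """Assumed similarity test: centres and sizes agree to within a
    fraction `tol` of the larger diameter."""
    larger = max(a["size"], b["size"])
    near = math.hypot(a["lon"] - b["lon"], a["lat"] - b["lat"]) <= tol * larger
    return near and abs(a["size"] - b["size"]) <= tol * larger


def merge_pair(a, b):
    """Combine two bubbles, weighting each by the number of users who drew it."""
    total = a["users"] + b["users"]
    merged = {"users": total}
    for key in ("lon", "lat", "size"):
        merged[key] = (a[key] * a["users"] + b[key] * b["users"]) / total
    return merged


def merge_duplicates(bubbles, tol=0.5):
    """Single pass over the cleaned list, in order of decreasing size."""
    merged = []
    for b in sorted(bubbles, key=lambda x: x["size"], reverse=True):
        for i, m in enumerate(merged):
            if similar(m, b, tol):
                merged[i] = merge_pair(m, b)
                break
        else:
            merged.append(b)
    return merged
```

Because the cleaned list is so much shorter than the raw one, the nested loop here is affordable in a way it would not be on the full set of drawings.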

Results

My initial run only looked at bubbles in the longitude range 0-30 degrees. Below are three versions of one image from the MWP set (one of my favourites, as lots of people see it differently). First you can see the image as it is shown to MWP users. Below that you see, overlaid in blue, the original bubbles as drawn by the users. The third image shows the same thing, but this time displaying the ‘cleaned’ results. In the original set the bubbles all have the same opacity, so that when they pile up you can see the similarities. The cleaned set gives the bubbles opacities according to their scores (more opaque bubbles mean more users drew them).

GLM011680081mosaicI24M1

mwp_test_all_bubbles

mwp_test_clean_bubbles
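For anyone curious how a score might be turned into an opacity for a plot like the cleaned one, here is one possible mapping. The linear scaling, the `floor` value and the function name are assumptions; the exact scaling used for the image is not specified.

```python
def opacity(score, max_score, floor=0.1):
    """Map a bubble's score to an alpha value between `floor` and 1.

    Higher scores (more users agreeing) give more opaque bubbles; the
    floor keeps low-scoring bubbles faintly visible.
    """
    if max_score <= 0:
        return floor
    frac = min(max(score / max_score, 0.0), 1.0)
    return floor + (1.0 - floor) * frac
```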

It should be noted that the cleaned image does not yet display arcs, but rather always shows an entire ellipse. This is because I am not yet including the bubble cut-outs (which you can make out in the middle image) in the data reduction. These will be included at a later time.

You can see that I’m still getting some duplication at the end of the process – I may need to sweep across the final catalogue repeatedly, looking for similar bubbles, until it converges and all bubbles are ‘unique’. I have been experimenting with this with mixed results but will continue my efforts.
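Sweeping until convergence could be structured like the sketch below: keep applying a merge pass until a pass no longer shrinks the catalogue. Both function names and the `max_passes` safety limit are assumptions, and `toy_pass` is only a one-dimensional stand-in for a real bubble-merging pass.

```python
def reduce_until_stable(catalogue, merge_pass, max_passes=20):
    """Apply `merge_pass` repeatedly until the catalogue stops shrinking.

    Returns the final catalogue and the number of passes that were run.
    """
    for n in range(1, max_passes + 1):
        reduced = merge_pass(catalogue)
        if len(reduced) == len(catalogue):
            return reduced, n  # nothing merged this time: converged
        catalogue = reduced
    return catalogue, max_passes


def toy_pass(values, tol=1.0):
    """Toy merge pass on plain numbers: average the first pair closer than `tol`."""
    values = sorted(values)
    for i in range(len(values) - 1):
        if values[i + 1] - values[i] < tol:
            return values[:i] + [(values[i] + values[i + 1]) / 2] + values[i + 2:]
    return values


result, passes = reduce_until_stable([0.0, 0.5, 0.9, 5.0], toy_pass)
```

The limit on the number of passes is a guard against a merge rule that never settles, which may be one source of the mixed results.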

If you’re still reading, I look forward to reading your comments. As I continue to make adjustments and progress with this reduction, I shall blog the results again. Many members of the science team are also having a go at this problem and so the final result may be quite different in the end as we improve things. I hope that this is an interesting insight into some of what goes on behind the scenes of the MWP.

Talk Updates

Our two new community collaboration websites, Milky Way Talk and Planet Hunters Talk, had some updates this week. We thought it was worth going over them in this blog post. We’ve had a lot of feedback about Talk and are working to implement the most-requested features.

The biggest difference you’ll see when logging into Talk is that your discussions are now easier to manage and track. A new, large box on the main page shows all the new and updated discussions since your last login. You can refine these using the two drop-down boxes at the top of this section. You can choose to show discussions from the last 24 hours, the last week, or since any date using a pop-up calendar. You can also choose to see only the discussions that you are a part of, which should help you keep track of your conversations.

In addition to these changes, you’ll also find a lot more metadata around the discussions, telling you who last posted, how many people are taking part, and who started the discussion, where relevant. Users within these discussions are now highlighted if they are part of the development team or the science team. This is something a lot of you asked for.

Talk Screenshot

The other item that has been changed with this Talk update is pagination. There are now easy-to-use buttons on the discussions, collections and objects on the front page. These mean that you can browse back through time and see more than just the most recent items. As Talk has grown more popular, this feature has become more necessary.

Another change to the front page is that we now show the most-recent items by default, and not the trending items. You can still see the trending items by clicking the link at the top. Users told us they preferred to see recent activity initially so we made the change. Similarly, the ‘trending keywords’ list now appears on the front page at all times.

Finally, page titles are now meaningful. This means that if you bookmark or share a link, you’ll remember why. Collections are named and objects will be titled using their Zooniverse ID (e.g. AMW….). Several of you have also noted our lack of a favicon (the little icon next to the URL in your browser bar). This is coming shortly as well.

There are more changes planned for Talk, but these significant updates to the front page were worth noting on the blog. For example, we plan to start integrating social media links into the Talk sites, along with more updates as time goes by. Talk continues to evolve and we welcome feedback at team@milkywayproject.org.

Examples of Interesting Objects

GLM_01270-0013_mosaic_I24M1

Feedback from everyone about the Milky Way Project has been overwhelmingly positive. You all seem to love the images and the interface. One thing that is always requested, though, is more tutorial examples of the things we’d like you to flag as areas of interest: green knots, dark nebulae, star clusters etc.

We decided it was best to use Talk, the Milky Way Project’s discussion/collections site, to show off examples of the objects you might spot as you draw all over the galaxy. We’ve built collections of green knots, dark nebulae, small bubbles, star clusters, galaxies and fuzzy red objects. The great thing about using Talk to do this is that we can easily add more in as we – or rather you – find them.

All the new example collections were built using the classifications you have made so far. We used your first 100,000 classifications to create lists of the objects most regularly flagged in each category. Hopefully you will find these useful in learning how to spot some of the amazing things that are out there in the Milky Way (and sometimes, beyond)!

A side effect of creating these collections was that I found the image above along the way. It appears to contain green knots, small bubbles, dark nebulae, red fuzzies and a small star cluster. If anyone can see a galaxy in there, it’s a full house! You can, of course, also discuss this image on Talk.

If you have comments or suggestions for the Milky Way Project, you can email us on team@milkywayproject.org.

Your Favourite Images

When you’re drawing bubbles, star clusters and everything else all over the Milky Way, you have the option to click a little ‘star’ button to mark an image as a favourite. These are then visible in the ‘My Galaxy’ portion of the site. Primarily this is done to let you keep hold of the images that you like the most. A side effect though is that we can see which images are collectively seen as the best by the Milky Way Project community.

Below you can see the 10 most-favourited images from the Milky Way Project. I’ll let the images speak for themselves. You can click on any of them to jump into Milky Way Talk where you can learn more about them or make a comment. These images also exist as a collection in Talk, where you can comment on and discuss them as a group.




Project Update

MWP-Poster-Small

We are presenting a poster about the Milky Way Project at the 217th Meeting of the American Astronomical Society. This gives us a great opportunity to outline the current status of the project. You can download the poster as either a PDF (2.5 MB) or a big JPEG (14 MB).

During the first four weeks of the project, 10,000 volunteers drew more than 385,000 bubbles, galaxies, clusters and other objects using the site. Volunteers measure the location, diameter, eccentricity and thickness of bubbles, as well as marking any gaps in the bubble’s structure. For other objects, just the location and approximate angular size are recorded.

The public’s individual drawings of objects, such as bubbles, are combined and grouped to produce ‘clean’ catalogues. When the project is complete, both the original and cleaned catalogues will be made public.  At present there are over 100,000 individual bubble drawings, which reduce down to about 60,000 when cleaned. If we consider only those instances where more than 3 individuals agreed that a bubble was present, we have found approximately 5,000 bubbles.
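The agreement cut mentioned above amounts to a simple filter on the cleaned catalogue. In this sketch the `users` key and the function name are hypothetical; the default threshold follows the ‘more than 3 individuals’ rule.

```python
def agreed_bubbles(catalogue, more_than=3):
    """Keep only the bubbles drawn by more than `more_than` individuals."""
    return [b for b in catalogue if b["users"] > more_than]


# Made-up entries, purely to show the call.
sample = [{"id": "a", "users": 2}, {"id": "b", "users": 4}, {"id": "c", "users": 3}]
kept = agreed_bubbles(sample)
```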

Similarly, after cleaning the data, we have found over 1,000 infrared dark clouds, 596 compact bubbles, 65 star clusters and 5 galaxies.

I’m glad to say that since printing the poster the numbers have already changed – this is because the site continues to have over a thousand images processed each day. We’re now at nearly 115,000 bubbles drawn and 91,000 images served. Check out the main site for the latest figures.

Site Goes Live

example

Today we have launched The Milky Way Project. The Milky Way Project aims to sort and measure our galaxy. Initially we’re asking you to help us find and draw bubbles in beautiful infrared data from the Spitzer Space Telescope. Understanding the cold, dusty material that we see in these images helps scientists to learn how stars form and how our galaxy changes and evolves with time.

As well as drawing out bubbles in our galaxy, we’re also asking you to mark other objects such as star clusters, galaxies and ghostly red ‘fuzzy’ objects. We’re asking you to help us map star formation in our galaxy! Take a look at our tutorial page for the complete run down, with examples.

Interface

Also launching today is the Zooniverse’s new collaboration and community tool: Talk. Milky Way Talk resides at http://talk.milkywayproject.org and there you can find, collect and comment on the objects you see in the Milky Way Project. Every time you classify an image in the Milky Way Project you will be prompted to ‘discuss’ that image via Talk. Talk lets you collect objects together and shares those collections with everybody else. Talk is a brand new feature, developed in-house at Zooniverse HQ. It continues to evolve and change as you use it and we hope that through the Milky Way Project, we can make Talk even better.

Collection in Talk

Don’t forget, you can find us on twitter @milkywayproj and we hope to see you soon on Milky Way Talk!

The Milky Way Project

After we added a third entry a couple of days ago, it rapidly ran ahead of the pack on the last day of voting. We had more votes on the final day than in all the time leading up to the decision. So now we have a name: The Milky Way Project. Stellar Zoo was a close second, and both beat Milky Way Zoo by some way.

236088main_milkyway516

Over the next few days, this blog will change from ‘Project IX’ to ‘The Milky Way Project’. We have a new twitter feed @milkywayproj and eventually the URL for this blog will also change. I’ll give plenty of warning about that.

The following two weeks involve a big code push here in Oxford, to try and create a beta site for you to try out. There will be more updates soon with a sample of the first science interface on its way…

[Image credit: NASA/JPL-Caltech]