We have just concluded our enhanced Introduction to Data Science workshop, which included several workflows for spectroscopy analysis. The Spectroscopy add-on is intended for the analysis of spectral data, and it is just as fun as our other add-ons (if not more!).
We will prove it with a simple classification workflow. First, install the Spectroscopy add-on from the Options – Add-ons menu in Orange. Restart Orange for the add-on to appear. Great, you are ready for some spectral analysis!
Use the Datasets widget and load the Collagen spectroscopy data. This data contains cells measured with FTIR and annotated with the major chemical compound at the imaged part of the cell. A quick glance in a Data Table will give us an idea of what the data looks like. Seems like a fairly standard spectral data set.
Collagen data set from the Datasets widget.
Now we want to determine whether we can classify cells by type based on their spectral profiles. First, connect Datasets to Test & Score. We will use 10-fold cross-validation to score the performance of our model. Next, we will add Logistic Regression to model the data. One final thing. Spectral data often needs some preprocessing. Let us perform a simple preprocessing step by applying the Cut (keep) filter and retaining only the wavenumbers between 1500 and 1800. When we connect it to Test & Score, we need to keep in mind to connect the Preprocessor output of Preprocess Spectra.
Preprocessor that keeps the part of the spectra cut between 1500 and 1800. No data is shown here, since we are using only the preprocessing procedure as the input for Test & Score.
Let us see how well our model performs. Not bad. A 0.99 AUC score. Seems like it is almost perfect. But is it really so?
10-fold cross-validation on spectral data. Our AUC and CA scores are quite impressive.
Confusion Matrix gives us a more detailed picture. Our model fails almost exclusively on the DNA cell type. Interesting.
Confusion Matrix shows DNA is most often misclassified. By selecting the misclassified instances in the matrix, we can inspect why Logistic Regression couldn't model these spectra.
We will select the misclassified DNA cells and feed them to Spectra to inspect what went wrong. Instead of coloring by type, we will color by the prediction from Logistic Regression. Can you find out why these spectra were classified incorrectly?
Misclassified DNA spectra colored by the prediction made by Logistic Regression.
This is one of the simplest examples with spectral data. It is basically the same procedure as with standard data – data is fed as data, the learner (LR) as a learner, and the preprocessor as a preprocessor directly to Test & Score to avoid overfitting. Play around with the Spectroscopy add-on and let us know what you think!
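For readers who want to see the same pipeline in code: the sketch below is only a rough scikit-learn equivalent on synthetic stand-in spectra, not the add-on's implementation. The wavenumber axis, sample counts, and labels are all made up for illustration; the Cut (keep) step becomes a boolean mask over that axis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
wavenumbers = np.linspace(800, 3000, 500)    # hypothetical FTIR axis
X = rng.random((120, wavenumbers.size))      # stand-in for absorbance spectra
y = rng.integers(0, 3, size=120)             # stand-in cell-type labels

# the Cut (keep) preprocessing step: retain wavenumbers between 1500 and 1800
keep = (wavenumbers >= 1500) & (wavenumbers <= 1800)
X_cut = X[:, keep]

# 10-fold cross-validation of Logistic Regression, scored by AUC
scores = cross_val_score(LogisticRegression(max_iter=1000), X_cut, y,
                         cv=10, scoring="roc_auc_ovr")
print("mean AUC: %.2f" % scores.mean())
```

Note that a fixed wavenumber mask cannot leak information between folds; a preprocessor that learns from the data would have to go inside each fold, which is exactly why Orange asks you to feed the preprocessor to Test & Score rather than preprocessed data.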
HHMI | Janelia is one of the prettiest research campuses I have ever visited. Located in Ashburn, VA, about 20 minutes from Washington Dulles airport, it is conveniently located yet, in a way, secluded from the buzz of the capital. We adored the guest house with a view of the lake, the tasty Janelia-style breakfast (hash browns with two eggs and sausage, plus a bagel with cream cheese) in the on-campus pub, the beautifully designed interiors that foster collaborations and interactions, and the late-evening discussions in the in-house pub.
All this thanks to the invitation of Andrew Lemire, a manager of a shared high-throughput genomics resource, and Dr. Vilas Menon, a mathematician specializing in quantitative genomics. With Andy and Vilas, we have been collaborating in the past few months on trying to devise a simple and intuitive tool for the analysis of single-cell gene expression data. Single-cell high-throughput technology is one of the latest approaches that allow us to see what is happening within a single cell, and it does that by simultaneously scanning through potentially thousands of cells. That generates explosions of data, and obviously, we have been trying to fit Orange to the single-cell data analysis task.
Namely, in the past half a year, we have been perfecting an add-on for Orange with components for single-cell analysis. This endeavor became so vital that we have even designed a fresh installation of Orange, called scOrange. With everything still in the prototype stage, we had enough courage to present the tool at Janelia, first through a seminar, and the next day within a five-hour lecture that I gave together with Martin Strazar, a PhD student and bioinformatics expert from my lab. Many labs at Janelia are embarking on single-cell technology, and judging by the crowd that gathered at both events, it looks like everyone was there.
Orange, or rather, scOrange, worked as expected, and the hands-on workshop went smoothly, despite testing the software on some rather large data sets. Our Orange add-on for single-cell analytics is still at an early stage of development, but it already has some advanced features like biomarker discovery and tools for the characterization of cell clusters that may help in revealing hidden relations between genes and phenotypes. Thanks to Andy and Vilas, Janelia proved an excellent proving ground for scOrange, and we are looking forward to our next hands-on single-cell analytics workshop in Houston.
A lot of you have been interested in enabling the SQL widget in Orange, especially regarding the installation of the psycopg backend that makes the widget actually work. This post will be slightly more technical, but I will try to keep it to a minimum. Scroll to the bottom for installation instructions.
Why won't Orange recognize psycopg?
The main issue for some people was that despite having installed the psycopg module from their console, the SQL widget still didn't work. This is because Orange uses a separate virtual environment, and most of you installed psycopg in the default (system) Python environment. For psycopg to be recognized in Orange, it needs to be installed in the same virtual environment, which is normally located in C:\Users\<usr>\Anaconda3\envs\orange3 (on Windows). For the installation to work, you'd have to run it with the proper pip, namely:
C:\Users\<usr>\Anaconda3\envs\orange3\Scripts\pip.exe install psycopg2
But there is a much easier way to do it. Head over to psycopg's page on PyPI and download the latest wheel for your platform. The Python version has to be cp34 or higher (the latest Orange from Anaconda comes with Python 3.6, so look for cp36).
Then open the add-on dialog in Orange (Options → Add-ons) and drag and drop the downloaded wheel into the add-on list. At the bottom, you will see psycopg2 with a tick next to it.
Click OK to run the installation. Then restart Orange and connect to your database with the SQL widget. If you have any questions, drop them in the comment section!
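If you are unsure which environment a given interpreter is, a couple of lines of plain Python (run from the env's own python.exe, or from Orange's Python Script widget) will tell you where it lives and whether psycopg2 is importable there. This is a generic check, not an Orange-specific API:

```python
import importlib.util
import sys

def psycopg2_available():
    """True if psycopg2 can be imported by this interpreter."""
    return importlib.util.find_spec("psycopg2") is not None

# for Orange's env on Windows this should point inside ...\Anaconda3\envs\orange3
print("interpreter:", sys.executable)
print("psycopg2 available:", psycopg2_available())
```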
This week, Primoz and I flew to the south of Italy to hold a workshop on Image Analytics through Data Mining at the AIUCD 2018 conference. The workshop was intended to familiarize digital humanities researchers with the options that visual programming environments offer for image analysis.
In about 5 hours we discussed image embedding, clustering, finding the closest neighbors, and classification of images. While it is often a challenge to explain complex concepts in such a short time, it is much easier when working with Orange.
One of the workflows we learned at the workshop was the one for finding the most similar image in a set of images. This is better explained with an example.
We had 15 paintings by different authors. Two of them were painted by Claude Monet, the famous French impressionist painter. Our task was, given a reference image of a Monet, to find his other painting in the collection.
A collection of images. It includes two Monet paintings.
First, we loaded our data set with Import Images. Then we sent our images to Image Embedding. We selected the Painters embedder, since it was specifically trained to recognize the authors of paintings.
We used the Painters embedder here.
Once we have described our paintings with vectors (embeddings), we can compare them by similarity. To find the second Monet in the data set, we have to compute the similarity of the paintings and find the one most similar to our reference painting.
Let us connect Image Embedding to Neighbors from the Prototypes add-on. The Neighbors widget is specifically intended to find a number of closest neighbors given a reference data point.
We will need to adjust the widget a bit. First, we will need cosine distance, since we will be comparing images by their content, not the magnitude of the features. Next, we will untick Exclude references, in order to receive the reference image in the output. We do this just for visualization purposes. Finally, we set the number of neighbors to 2. Again, this is just for a nicer visualization, since we know there are only two Monet paintings in the data set.
Neighbors was set up to provide a nice visualization. Hence we unticked Exclude references and set Neighbors to 2.
Then we need to give Neighbors a reference image, for which we want to retrieve the neighbors. We do this by adding a Data Table to Image Embedding, selecting one of Monet's paintings in the spreadsheet, and then connecting the Data Table to Neighbors. The widget will automatically consider the second input as a reference.
Monet.jpg is our reference painting. We select it in the Data Table.
Now, all we need to do is visualize the output. Connect Image Viewer to Neighbors and open it.
Voila! The widget has indeed found the second Monet painting. So useful when you have thousands of images in your archive!
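What Neighbors does with cosine distance fits in a few lines of NumPy. The toy 3-dimensional "embeddings" below are made up for illustration (real Image Embedding vectors have thousands of dimensions), and this is a stand-in sketch, not the widget's code:

```python
import numpy as np

def closest_by_cosine(embeddings, ref_index, n=2):
    """Return indices of the n embeddings closest to the reference,
    measured by cosine distance (1 - cosine similarity)."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - E @ E[ref_index]          # cosine distance to the reference
    order = np.argsort(dist)
    return [int(i) for i in order if i != ref_index][:n]

# toy 4-image collection; row 1 is deliberately close to row 0
emb = np.array([[1.0, 0.0, 0.0],
                [0.9, 0.1, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
print(closest_by_cosine(emb, ref_index=0, n=1))  # → [1]
```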
Scatter plots are great! But sometimes, we need to plot more than two variables to truly understand the data. How can we achieve this, knowing humans can only grasp up to three dimensions? With an optimization of a linear projection, of course!
Orange recently re-introduced FreeViz, an interactive visualization for plotting multiple variables on a 2D plane.
Let's load the zoo.tab data with the File widget and connect FreeViz to it. The zoo data has 16 features describing animals of different types – mammals, amphibians, insects, and so on. We would like to use FreeViz to show us the informative features and create a visualization that separates well between the animal types.
FreeViz with the initial, un-optimized plot.
We start with an un-optimized projection, where data points are scattered around the feature axes. Once we click Optimize, we can observe the optimization process in real time and at the end see the optimized projection.
FreeViz with the optimized projection.
This projection is much more informative. Mammals are nicely grouped together within a pink cluster that is characterized by the hair, milk, and toothed features. Conversely, birds are characterized by eggs, feathers, and airborne, while fish are aquatic. The results are as expected, which means the optimization indeed found informative features for each class value.
FreeViz with the Display class density option.
Since we are working with categorical class values, we can tick Display class density to color the plot by the majority class values. We can also move the anchors around to see how the data points change in relation to a selected anchor.
Finally, as in most Orange visualizations, we can select a subset of data points and explore them further. For example, let us observe which amphibians are characterized as aquatic in a Data Table. A newt, a toad, and two types of frogs, one venomous and one not.
Data exploration is always much easier with smart visualizations!
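Under the hood, FreeViz is a linear projection: each feature gets a 2D anchor, a data point lands at the sum of its (normalized) feature values times the anchor positions, and the optimization then moves the anchors to separate the classes. Here is a minimal sketch of just the projection step, with made-up evenly spaced anchors; it is not Orange's implementation:

```python
import numpy as np

def freeviz_project(X, anchors):
    """Project X (n_samples, n_features) to 2D FreeViz-style:
    each point is the sum of its feature values times anchor positions."""
    # scale each feature to [0, 1] so no single feature dominates
    X = (X - X.min(axis=0)) / (np.ptp(X, axis=0) + 1e-12)
    return X @ anchors                      # shape (n_samples, 2)

# 3 features with anchors spread evenly on the unit circle
angles = np.linspace(0.0, 2.0 * np.pi, 3, endpoint=False)
anchors = np.column_stack([np.cos(angles), np.sin(angles)])
X = np.random.default_rng(0).random((5, 3))
P = freeviz_project(X, anchors)
print(P)                                    # 5 points on the 2D plane
```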
We all know that sometimes many is better than few. Therefore we are happy to introduce the Stack widget. It is available in the Prototypes add-on for now.
Stacking enables you to combine several trained models into one meta model and use it in Test & Score just like any other model. This comes in handy with complex problems, where one classifier might fail, but many could come up with something that works. Let's see an example.
We start with something as complex as this. We used Paint Data to create a complex data set, where the classes somewhat overlap. This is naturally an artificial example, but you can try the same on your own real-life data.
We used 4 classes and painted a complex, 2-dimensional data set.
Then we add several kNN models with different parameters, say 5, 10 and 15 neighbors. We connect them to Test & Score and use cross-validation to evaluate their performance. Not bad, but can we do even better?
Scores without stacking, using only 3 different kNN classifiers.
Let us try stacking. We will connect all three classifiers to the Stacking widget and use Logistic Regression as the aggregate, a method that aggregates the three models into a single meta model. Then we connect the stacked model to Test & Score and see whether our scores have improved.
Scores with stacking. Stack reports improved performance.
And indeed they have. It might not be anything dramatic, but in real life, say in a medical setting, even small improvements count. Now go and try the procedure on your own data. In Orange, this requires only a couple of minutes.
Final workflow with channel names. Notice that Logistic Regression is used as the Aggregate, not a Learner.
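The same experiment can be approximated outside Orange, for example with scikit-learn's StackingClassifier. The snippet below is a comparable setup on synthetic data, not a reproduction of our painted data set: three kNN learners with 5, 10 and 15 neighbors, aggregated by Logistic Regression.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic 4-class data set with overlapping classes
X, y = make_classification(n_samples=400, n_classes=4, n_informative=6,
                           random_state=0)

base = [("knn%d" % k, KNeighborsClassifier(n_neighbors=k))
        for k in (5, 10, 15)]
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000))

# cross-validated classification accuracy of the stacked meta model
ca = cross_val_score(stack, X, y, cv=5).mean()
print("stacked CA: %.2f" % ca)
```

Comparing this score against each base kNN's own `cross_val_score` reproduces the before/after comparison from the workflow.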
The Orange3 Network add-on contains a convenient Network Explorer widget for network visualization. Orange uses an iterative force-directed method (a variation of the Fruchterman-Reingold algorithm) to lay out the nodes on the 2D plane.
The aim of force-directed methods is to draw connected nodes close to each other, as if the edges that connect the nodes were acting as springs. We also don't want all the nodes crowded in a single point, but would rather have them spaced evenly. This is achieved by simulating a repulsive force, which decreases with the distance between nodes.
There are two types of forces acting on each node:
- the attractive force towards connected adjacent nodes,
- the repulsive force that is directed away from all other nodes.
We could say that such network visualization as a whole is rather repulsive. Let's take for example the lastfm.net network that comes with Orange's network add-on, which has around 1,000 nodes and 4,000 edges. In every iteration, we have to consider 4,000 attractive forces and 1,000,000 repulsive forces, one for each of the 1,000 × 1,000 node pairs. It takes about 100 iterations to get a good network layout. That's a lot of repulsion, and you'll have to wait a while before you get the final layout.
Fortunately, we found a simple trick to speed things up. When computing the repulsive force acting on some node, we only consider a 10% sample of the other nodes to obtain an estimate. We multiply the result by 10 and hope it's not off by too much. By choosing a different sample in every iteration, we also avoid favoring some set of nodes.
The left layout is obtained without sampling, while the right one uses a 10% sample. The results are pretty similar, but the sampling method is 10 times faster!
Now that the computation is fast enough, it is time to also speed up the drawing. But that's a task for 2018.
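The sampling trick is easy to sketch in NumPy. The following is a simplified Fruchterman-Reingold loop, not the add-on's actual code; the constants (ideal spring length, cooling schedule) are arbitrary choices for the sketch, and the repulsive term is estimated from a random sample of nodes and scaled back up:

```python
import numpy as np

def layout(edges, n_nodes, iters=100, sample=0.1, seed=0):
    """Force-directed layout; repulsion is estimated from a random
    `sample` fraction of nodes each iteration and scaled by 1/sample."""
    rng = np.random.default_rng(seed)
    pos = rng.random((n_nodes, 2))
    k = 1.0 / np.sqrt(n_nodes)            # ideal spring length
    m = max(1, int(sample * n_nodes))     # nodes sampled for repulsion
    for t in range(iters, 0, -1):
        # repulsion k^2/d per pair, estimated on a fresh sample, scaled up
        idx = rng.choice(n_nodes, size=m, replace=False)
        delta = pos[:, None, :] - pos[None, idx, :]
        d = np.linalg.norm(delta, axis=2) + 1e-9
        disp = (n_nodes / m) * ((k * k / d**2)[:, :, None] * delta).sum(axis=1)
        # attraction d^2/k along each edge
        for u, v in edges:
            dv = pos[u] - pos[v]
            f = (np.linalg.norm(dv) / k) * dv
            disp[u] -= f
            disp[v] += f
        # cap each step by a decreasing temperature (cooling)
        temp = 0.1 * t / iters
        ln = np.linalg.norm(disp, axis=1) + 1e-9
        pos += disp / ln[:, None] * np.minimum(ln, temp)[:, None]
    return pos

pos = layout([(0, 1), (1, 2), (2, 0), (2, 3)], n_nodes=4)
print(pos.shape)  # (4, 2)
```

With sampling, the per-iteration repulsion cost drops from O(n²) to O(n · m), which is where the roughly 10× speed-up comes from at a 10% sample.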
On Monday we finished the second part of the workshop for the Statistical Office of the Republic of Slovenia. The crowd was tough – these guys knew their numbers and asked many challenging questions. And we loved it!
One thing we discussed was how to properly test your model. OK, we know never to test on the same data you've built your model with, but even training and testing on separate data is sometimes not enough. Say I've tested Naive Bayes, Logistic Regression and Tree. Sure, I can select the one that gives the best performance, but we could potentially (over)fit our model, too.
To account for this, we would normally split the data into 3 parts:
- training data for building a model
- validation data for choosing which parameters and which model to use
- test data for estimating the accuracy of the model
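In scikit-learn terms the three roles look like this (using a bundled data set as a stand-in for heart-disease.tab): the models compete via cross-validation on the training part only, and the held-out test set is touched exactly once at the end.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in data set
# keep 15% aside as the final test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15,
                                          random_state=0)

models = {"naive bayes": GaussianNB(),
          "logistic regression": LogisticRegression(max_iter=5000),
          "tree": DecisionTreeClassifier(random_state=0)}

# model selection by cross-validation on the training part only
cv = {name: cross_val_score(m, X_tr, y_tr, cv=10).mean()
      for name, m in models.items()}
best = max(cv, key=cv.get)

# one final, honest estimate on the untouched test set
acc = models[best].fit(X_tr, y_tr).score(X_te, y_te)
print("%s, test accuracy: %.2f" % (best, acc))
```

Here cross-validation plays the validation role, so the 85% is split into training and validation folds internally rather than as a third fixed chunk.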
Let us try this in Orange. Load the heart-disease.tab data set from the Browse documentation data sets option in the File widget. We have 303 patients diagnosed with blood vessel narrowing (1) or diagnosed as healthy (0).
Now, we will split the data into two parts, 85% of the data for training and 15% for testing. We will send the first 85% onward to build a model.
We sampled a fixed proportion of the data and went with 85%, which is 258 out of 303 patients.
We will use Naive Bayes, Logistic Regression and Tree, but you can try other models, too. This is also the place and time to try different parameters. Now we will send the models to Test & Score. We used cross-validation and discovered that Logistic Regression scores the highest AUC. Say this is the model and the parameters we want to go with.
Now it is time to bring in our test data (the remaining 15%) for testing. Connect Data Sampler to Test & Score once again and set the connection Remaining Data – Test Data.
Test & Score will warn us that we have test data present, but unused. Select the Test on test data option and observe the results. These are now the proper scores for our models.
Seems like LogReg still performs well. Such a procedure would normally be useful when testing a lot of models with different parameters (say 100+), which you would not normally do in Orange. But it's good to know how to do the scoring properly. Now we're off to report on the results in Nature…
We've been having a blast with the recent Orange workshops. While Blaz was getting tanned in India, Anze and I went to charming Liverpool to hold a session for business school professors on how to teach business with Orange.
Obviously, when we say teach business, we mean how to do data mining for business, say predict churn or employee attrition, segment customers, find which items to recommend in an online store, and track brand sentiment with text analysis.
For this purpose, we have made some updates to our Associate add-on and added a new data set to the Datasets widget, which can be used for customer segmentation and for discovering which item groups are frequently bought together. Like this:
We load the Online Retail data set.
Since we have transactions in rows and items in columns, we have to transpose the data table in order to compute the distances between items (rows). We could also simply ask the Distances widget to compute the distances between columns instead of rows. Then we send the transposed data table to Distances and compute the cosine distance between items (cosine distance will only tell us which items are purchased together, disregarding the number of items purchased).
Finally, we observe the discovered clusters in Hierarchical Clustering. Seems like mugs and decorative signs are frequently bought together. Why so? Select the group in Hierarchical Clustering and observe the cluster in a Data Table. Consider this an exercise in data exploration.
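The transpose-then-cluster step can be sketched with SciPy on a made-up basket matrix (not the Online Retail data): transpose so items become rows, compute cosine distances between them, then cluster hierarchically.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# stand-in basket matrix: 50 transactions in rows, 6 items in columns
baskets = rng.integers(0, 3, size=(50, 6)).astype(float)

# transpose so rows are items, then cosine distances between items
item_dist = pdist(baskets.T, metric="cosine")
Z = linkage(item_dist, method="average")

# cut the dendrogram into two item clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Because cosine distance ignores vector length, two items bought in very different quantities but by the same customers still end up close together, which is the behavior we wanted.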
The second workshop was our standard Introduction to Data Mining for the Ministry of Public Affairs.
This group, similar to the one from India, was a pack of curious individuals who asked many interesting questions and were not shy to challenge us. How does a Tree know which attribute to split by? Is Tree better than Naive Bayes? Or is perhaps Logistic Regression better? How do we know which model works best? And finally, what is the mean of sauerkraut and beans? It has to be jota!
Workshops are always fun when you have a curious set of individuals who demand answers!
We have just finished the hands-on course on data science at one of the most famous Indian educational institutions, the Indian Statistical Institute. The one-week course was held at the invitation of the Institute's director, Prof. Dr. Sanghamitra Bandyopadhyay, and financially supported by the funding of India's Global Initiative of Academic Networks.
The Indian Statistical Institute lies in the heart of old Kolkata. A peaceful oasis of a picturesque campus with mango orchards and water-lily lakes, it was founded by Prof. Prasanta Chandra Mahalanobis, one of the giants of statistics. Today, the Institute researches statistics and computational approaches to data analysis and runs a grad school, where a rather small number of students are hand-picked from tens of thousands of applicants.
The course was hands-on. The number of participants was limited to forty, a limitation posed by the number of computers in the Institute's largest computer lab. Half of the students came from the Institute's grad school, and the other half from other universities around Kolkata or even other schools around India, including a few participants from another famous institution, the Indian Institutes of Technology. While the lectures included some writing on the whiteboard to explain machine learning, the majority of the course was about exploring example data sets, building workflows for data analysis, and using Orange on practical cases.
The course was not one of the easiest for the lecturer (Blaz Zupan). About five full hours each day for five days in a row, extremely motivated students with questions filling all of the coffee breaks, the need for a deeper dive into some of the methods after questions in the classroom, and much need for improvisation to adapt our standard data science course to possibly the brightest pack of data science students we have seen so far. We covered almost the full spectrum of data science topics: from data visualization to supervised learning (classification and regression, regularization), model exploration and estimation of quality. Plus computation of distances, unsupervised learning, outlier detection, data projection, and methods for parameter estimation. We applied these to data from health care, business (which proposal on Kickstarter will succeed?), and images. Again, just as in our other data science courses, the use of Orange's educational widgets, such as Paint Data, Interactive k-Means, and Polynomial Regression, helped us build an intuitive understanding of the machine learning techniques.
The course was beautifully organized by Prof. Dr. Saurabh Das with the help of Prof. Dr. Shubhra Sankar Ray, and we would like to thank them for their dedication and excellent organizational abilities. And of course, many thanks to the participating students: for an educator, it is always a great pleasure to lecture and work with highly motivated and curious colleagues who made our journey to Kolkata fruitful and fun.