stareightytwo
Action through Analytics


New Year's Data Science Resolution Update

It's been two weeks since our first New Year's Data Science Resolution event, and the three teams are well on their way to improving the world through data science. Team Avivo, which is working to improve a chemical dependency treatment program, has gathered data about the program, including admission and discharge counts, and has also incorporated demographic data. Team Thunder Lizards, which is working to improve fundraising at the Science Museum of Minnesota, is still working on understanding the data dictionary and the exact questions that would help the museum. Tonight's first presentation was from Team Real Estate, which is building predictive models of Twin Cities home prices in collaboration with a local real estate agent.

John Hogue, who is a Lead Data Scientist at General Mills, presented on behalf of Team Real Estate. He showed his general process for getting started on a data science project using real MLS data in Python with Pandas. His presentation included:

  • Exploratory analysis: using the pandas-profiling package to get a simple overview of the data to find potential problems like null values and collinearity

  • Data cleaning using pandas commands

  • Feature engineering:

    • transforming features to be normally distributed

    • splitting categories into one-hot columns

    • binning values to eliminate outliers

  • Supplementing data using open APIs and HTML scraping
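For readers who missed the talk, here is a minimal sketch of what a few of these steps can look like in pandas. It is not John's notebook: the toy listings, the column names (`price`, `sqft`, `style`), and the bin edges are all invented for illustration, and the log transform is just one common way to pull a skewed feature closer to a normal shape.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings; every value and column name here is made up.
listings = pd.DataFrame({
    "price": [150_000, 220_000, 310_000, 2_500_000, 275_000, np.nan],
    "sqft": [900, 1400, 2100, 6000, 1800, 1600],
    "style": ["condo", "house", "house", "house", "townhouse", "condo"],
})

# Exploratory check: count missing values in each column.
null_counts = listings.isna().sum()

# Data cleaning: drop listings that have no price.
clean = listings.dropna(subset=["price"]).copy()

# Feature engineering 1: log-transform the right-skewed price.
# (An assumed choice of transform; others such as Box-Cox are possible.)
clean["log_price"] = np.log1p(clean["price"])

# Feature engineering 2: split the category into one-hot columns.
clean = pd.get_dummies(clean, columns=["style"], prefix="style")

# Feature engineering 3: bin square footage so that very large homes
# share a single top bucket instead of standing out as extreme values.
clean["sqft_bin"] = pd.cut(
    clean["sqft"],
    bins=[0, 1000, 2000, np.inf],
    labels=["small", "medium", "large"],
)

print(null_counts)
print(clean)
```

The open-ended top bin is the part that handles outliers here: the 6,000 square foot home lands in the same "large" bucket as the 2,100 square foot one.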

John’s full presentation is viewable as a Jupyter notebook here.

Next, Abhishek Roy, a data science consultant from Slalom, presented on behalf of Team Avivo. Abhishek used many of the same techniques, but in R rather than Python. We’ll hear more from Team Avivo next time.

As always, please join our Slack group to participate in the project. (Email dfeldman.mn@gmail.com for an invitation to the Slack group.)

Visualization of the Pearson correlations between real estate variables
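A correlation matrix like the one pictured can be computed with a single pandas call. The sketch below uses a small made-up table (the columns `price`, `sqft`, and `age_years` are assumptions, not the real MLS fields) and leaves out the plotting step.

```python
import pandas as pd

# Hypothetical toy data; not taken from the MLS data set.
homes = pd.DataFrame({
    "price": [150_000, 220_000, 310_000, 275_000, 400_000],
    "sqft": [900, 1400, 2100, 1800, 2600],
    "age_years": [60, 35, 12, 20, 5],
})

# Pairwise Pearson correlation between all numeric columns.
corr = homes.corr(method="pearson")

print(corr.round(2))
```

Each cell of `corr` is the Pearson coefficient for one pair of columns, so the diagonal is always 1 and the matrix is symmetric.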


Daniel Feldman