I wrote this post a month ago and decided to publish it on Medium. It covers the dangers of column nullability in Spark DataFrames, focusing on DataFrames produced by Apache Parquet read and write operations. As always, I hope it is enlightening, and if you see any errors, I would love to discuss them with you.
Over a year ago, a few friends and fellow engineers of mine decided to form a Kaggle competition team, with the goal of using the challenges to learn how to develop systems with Apache Spark. Most of us knew that it would be a valuable, modern framework to learn, and I was one of the few at the time with previous experience in Spark. So I thought it would be a good idea to create a tutorial for the group on how to use the latest and greatest version (1.6 at the time) to solve challenging ML problems.
I had some free time over the holiday weekend, and I thought it would be a good time to show those of us who speak English just how incorrect the rule “I before E except after C” is. The following Python notebook shows my findings.
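As a rough illustration of the kind of check involved, here is a minimal sketch that tallies how often words follow or break the rule. It is not the notebook's code: the function names `classify` and `tally` and the tiny sample word list are hypothetical, and a real analysis would run over a full dictionary.

```python
import re

def classify(word):
    """Count 'ie'/'ei' pairs in a word that follow or break the rule.

    Rule as assumed here: 'ie' is expected unless the pair comes
    directly after 'c', in which case 'ei' is expected.
    """
    word = word.lower()
    follows = breaks = 0
    for match in re.finditer(r"ie|ei", word):
        pair = match.group()
        after_c = match.start() > 0 and word[match.start() - 1] == "c"
        expected = "ei" if after_c else "ie"
        if pair == expected:
            follows += 1
        else:
            breaks += 1
    return follows, breaks

def tally(words):
    """Sum rule-following and rule-breaking pairs over an iterable of words."""
    total_follows = total_breaks = 0
    for word in words:
        follows, breaks = classify(word)
        total_follows += follows
        total_breaks += breaks
    return total_follows, total_breaks

# Hypothetical sample; swap in a real word list for a meaningful result.
sample = ["believe", "receive", "science", "weird", "friend", "ceiling"]
print(tally(sample))
```

Words such as "science" and "weird" count against the rule here, while "believe" and "receive" count in its favor.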
Even though my primary interest is in large concurrent systems and resource management/data storage, over the last year I’ve found a lot of my free time going to DevOps tasks. There always seems to be a more effective and efficient process that we as engineers can follow, one that makes developing large data platforms, like those my team at Cerner builds, a secondary task to creating innovative products. However, we are currently in a maturity phase, having to scale a platform that was architected quickly so we could commit to clients ASAP (and don’t get me wrong, our senior engineers and architects are high caliber). Maintainability, code clarity, and development process were naturally secondary to having a production-ready analytics platform that ingests petabytes of health data per day.
Maven is a build automation tool that defines how a Java project should be built and the dependencies on which the project relies. If you would like to learn about Maven, I highly recommend their tutorial, “What is Maven?”. With that said, this is not an introductory Maven tutorial. I’m more interested in the organization of project dependencies in large enterprise Java projects.
This is an article I wrote for Caliber Contracting’s internal development so that students and GitHub newbies have a clear set of instructions to follow. Of course, these are principles that I follow personally, not strict rules. Merging vs. rebasing is a legitimate debate, and these are just the guidelines I follow.
I understand that you need some preliminary information to follow the problem statement. For the past 8 months I have been working on Theia, the project for my Computer Science Capstone course. My team is developing an EOG (electrooculography) eye tracker with object identification. What this means is that electrodes are attached to the area around the eyes, and we use them to track the general direction in which you are looking. This information gets passed through all sorts of libraries for math, signal acquisition, machine learning, and signal filtering, and pops out as a coordinate in a defined space. At this point, that coordinate is correlated with a video stream from an augmented reality headset. With the image and coordinate information combined, we can further define the area and objects you are focusing on. This is the current problem space. I have been working on the image processing, EOG signal acquisition, and general module integrations with ICs. The following is my engineering update, copied exactly from our project’s webpage. Enjoy.
This weekend KU’s ACM Chapter will be heading to the University of Illinois to be a part of HackIllinois, our Chapter’s first off-campus hackathon. I was talking with my bud Omar (Ph.D. student and Senior Design T.A.), and we just started work on an algorithmic trading platform that implements some elements of his thesis work. I am going to work on feature engineering and developing the model over the weekend so we can do a preliminary test run at the end of the event. I must keep things on the hush-hush until we know what is worth open-sourcing and what needs to stay undisclosed.