I have been given a project in my Python class that reads in a file. Each record in the file has 32 attributes, which are used to determine whether a lump is a benign or a malignant tumor. In my trainClassifier function, I have to find each attribute's total for both the malignant and the benign records, and then find the averages of the two. I am having trouble getting started on this function: I can't figure out how to index the list to choose whether a record is malignant or benign. The attribute names are on the first line of the file, and I have to find the averages of the first 10 attributes. However, I also have to use the last attribute, which determines whether the record is benign or malignant.

2) Train a simple classifier.
A classifier is a model of the problem such that when we’re given a new record we can
compare the new record to the model in order to predict the class of the new record. We
use the training set to build up this model. Our model is very simple.

For all malignant records, for each attribute, we calculate the average value of each
attribute. For all benign records, for each attribute, we calculate the average value of
each attribute.

To create the model, we then calculate the midpoint of these averages for each attribute.
Then to classify new records, if the majority (5 or more) of the new record’s attributes are
above their respective midpoints, then the new record is predicted to be malignant,
otherwise (4 or less), benign.
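
The classification rule described above can be sketched as a simple vote count. This is only a minimal sketch, not the assignment's reference solution: the names `ATTRIBUTES` and `classify` are hypothetical, and it assumes the model is a dict mapping each of the ten attribute names to its midpoint, with records stored as dicts keyed the same way.

```python
# Hypothetical list of the ten numeric attributes used for voting.
ATTRIBUTES = ["radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave", "symmetry", "fractal"]

def classify(record, midpoints):
    """Predict "M" (malignant) or "B" (benign) for one record.

    Assumes `record` and `midpoints` are dicts keyed by attribute name.
    """
    # Count how many attributes lie above their respective midpoints.
    votes = sum(1 for name in ATTRIBUTES if record[name] > midpoints[name])
    # A majority (5 or more of 10) predicts malignant, otherwise benign.
    return "M" if votes >= 5 else "B"
```

An attribute that is exactly equal to its midpoint does not count as a vote here, since the description says "above".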

There are many different methods in the areas of Artificial Intelligence and Machine
Learning that have been used by computer programmers to make predictions. Most of
these methods rely heavily on statistics-based methods that use computers to crunch a lot
of numbers. We’re more interested in developing our programming skills than delving
deep into statistics so we are going to use a very simple method to make predictions.
That is to say, our classifier is probably not statistically sound but it serves as a good
programming exercise as well as a good introduction to the problem of predicting classes.

Furthermore, in the real world, we commonly face lots of issues that crop up with
missing data, noisy data, or other problems. We don’t face any of these issues in this
assignment. It is safe to assume that all of the data is there and correct.

I guess I don't know how to use the first function to get the averages in the second. I need to determine whether each record is benign or malignant from the 32nd attribute of the actual file. The file looks like this:

radius       length .....      class
1.2242          .45               M
.24252           .34              B
.242556         .353            M

I don't know how to grab the information needed from the 32nd attribute (class) to add up all the attributes. This is confusing, I know. I'm sorry, but if anyone could help me, that'd be great. Ask if you need something explained better. (I'm sure you might.)

# Tasks
# 1 - Create a training set
# 2 - Train a 'dumb' rule-based classifier
# 3 - Create a test set
# 4 - Apply rule-based classifier to test set
# 5 - Report accuracy of classifier

attributeList = ["ID", "radius", "texture", "perimeter", "area", "smoothness",
                 "compactness", "concavity", "concave", "symmetry", "fractal",
                 "class"]

#####################
# 1. Create a training set
# - Read in file
# - Create a dictionary for each line
# - Add this dictionary to a list
#
# makeTrainingSet
# parameters: 
#     - filename: name of the data file containing the training data records
#
# returns: trainingSet: a list of training records (each record is a dict,
#                       that contains attribute values for that record.)
##########################################################
def makeTrainingSet(filename):

    trainingSet = []
    # Read in file
    # Read in file (the with-block closes it when we are done)
    with open(filename, 'r') as dataFile:
        for line in dataFile:
            if '#' in line:  # skip comment lines
                continue
            linelist = line.strip('\n').split(',')
            # Create a dictionary for the line
            # ( assigns each attribute of the record (each item in the linelist)
            #   to an element of the dictionary, using the constant keys )
            record = {}
            for i in range(len(attributeList)):
                if attributeList[i] == "class":  # class label is a character, not a float
                    # the class label is the 32nd field of the line (index 31)
                    record[attributeList[i]] = linelist[31].strip()
                else:
                    record[attributeList[i]] = float(linelist[i])
            # Add the dictionary to a list
            trainingSet.append(record)

    return trainingSet

##########################################################
# 2. Train 'Dumb' Classifier
# trainClassifier
# parameters:
#     - trainingSet: a list of training records (each record is a dict,
#                     that contains attribute values for that record.)
#
# returns: a dictionary of midpoints between the averages of each attribute's
#           values for benign and malignant tumors
###############################################################################
def trainClassifier(trainingSet):

    # A. initialize dictionaries for sums of attribute values
    #    and initialize record counts
    # B. process each record in the training set
    #    calculating sums and counts as we go
    # C. calculate averages
    # D. calculate midpoints for our classifier

    return classifier
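
One possible way to fill in steps A to D is sketched below. It is a hedged sketch rather than the expected solution. It assumes each record is a dict as built by makeTrainingSet, that the "class" value is either "M" or "B", and that the training set contains at least one record of each class (otherwise the division fails). The name `NUMERIC_ATTRIBUTES` is hypothetical. The key idea is that `record["class"]` selects which set of running sums to update.

```python
# Hypothetical constant: the ten attributes to average ("ID" and "class" excluded).
NUMERIC_ATTRIBUTES = ["radius", "texture", "perimeter", "area", "smoothness",
                      "compactness", "concavity", "concave", "symmetry", "fractal"]

def trainClassifier(trainingSet):
    # A. one dict of running sums per class label, plus a record count per label
    sums = {"M": dict.fromkeys(NUMERIC_ATTRIBUTES, 0.0),
            "B": dict.fromkeys(NUMERIC_ATTRIBUTES, 0.0)}
    counts = {"M": 0, "B": 0}

    # B. the record's class label picks which sums and count to update
    for record in trainingSet:
        label = record["class"]
        counts[label] += 1
        for name in NUMERIC_ATTRIBUTES:
            sums[label][name] += record[name]

    # C and D. per-class averages, then the midpoint between them
    classifier = {}
    for name in NUMERIC_ATTRIBUTES:
        malignantAvg = sums["M"][name] / counts["M"]
        benignAvg = sums["B"][name] / counts["B"]
        classifier[name] = (malignantAvg + benignAvg) / 2.0
    return classifier
```

With this layout there is no need to index into the list to separate the two classes first: a single pass over the training set is enough, because the label stored in each record decides where its values are added.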


Wrap code in code tags:
[code=python] # Code here

[/code]
Also, read the forum rules about homework, and about asking questions in general.
