We are able to capture the whole packet as hex values. From those hex values we now want to dissect the layer 3 protocol, the application-level protocol, and also determine whether a URL is present in the packet. How best to achieve this in C?

I'm assuming that "whole packet as hex" means that you get an Ethernet frame as a sequence of hex values.
In order to get to higher layer headers you have to parse the lower layer headers first. While doing this keep in mind that there are variable width headers when you consider options and that, in some cases, you may not have whole byte values and will need to examine bit values.

Beyond the mechanics of peeling the outer headers off, you can search for a URL quite easily by converting something like http:// or www to hex and searching for that string in your input; this doesn't actually require you to peel headers. However, if you want to read application layer data you will have to peel headers and know the application and its format to do so.
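As a rough illustration of the search idea above, here is a minimal sketch. It assumes the capture gives you the raw bytes and a length (in which case no conversion to hex text is needed); find_bytes is a hypothetical helper name, not part of any capture library.

```c
#include <stddef.h>
#include <string.h>

/* Return a pointer to the first occurrence of needle inside buf, or NULL.
 * The search is length-bounded, so zero bytes inside the packet are fine
 * (unlike strstr, which would stop at the first 0x00). */
static const unsigned char *find_bytes(const unsigned char *buf, size_t buf_len,
                                       const char *needle)
{
    size_t n = strlen(needle);

    if (n == 0 || n > buf_len)
        return NULL;

    for (size_t i = 0; i + n <= buf_len; i++) {
        if (memcmp(buf + i, needle, n) == 0)
            return buf + i;
    }
    return NULL;
}
```

A call such as find_bytes(p, packet_len, "http://") only tells you that those bytes occur somewhere in the frame; it says nothing about which protocol they belong to.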

I'd suggest that if you can get at the packets before they are converted to hex or if you can somehow convert them to a format understood by tools that process network packets (tcpdump, Wireshark) you can leverage the existing work done in this space. No use re-inventing the wheel if there is no need.

Dear L7Sqr,
Yes, I am capturing the whole packet, and it is definitely an Ethernet frame, as that is the de facto standard now, I guess? I have looked into Wireshark and tried to work out how it dissects packets, but I am not too good with it yet. Below is a snippet of my code. *p represents the whole packet array, and asprintf prints it into readable characters, which gives me hex2Value. What I need help with now is how to determine whether the http or www characters exist in it.



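char *hex2Value = "";
while (*p) {   /* note: this loop stops at the first 0x00 byte in the packet */
    //printf("\n\nstr: %s\n", p);
    /* append the current byte (*p, not the pointer p) as two hex digits */
    asprintf(&hex2Value, "%s,%02x", hex2Value, *p);
    p++;
}
printf("\nTotal hex value is %s", hex2Value);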

The standard way to map binary data to a known format is to cast the data to a struct. For example, if you have an unsigned char array starting from the Ethernet header you can do something like:

#include <stdio.h>
#include <net/ethernet.h>   /* struct ether_header */
#include <netinet/ip.h>     /* struct ip */
#include <arpa/inet.h>      /* ntohs */

// ...
const unsigned char *payload = /* Assigned to the Ethernet data */;
const struct ether_header *eth = (const struct ether_header *)payload;

/* Now you can access the fields directly */
printf ("EtherType: %u\n", ntohs (eth->ether_type));

/* You can get at the next layer by doing something like ... */
const struct ip *iph = (const struct ip *)(eth + 1);

// and so on...

You continue this, adhering to the format of the lower headers as you go (some are variable width) and eventually you get to your data. Not all frames will carry your type of data so you will need to look at things like ether_type and other related information to ensure you are grabbing the correct packets.
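To make the variable-width point concrete, here is a minimal sketch that walks an untagged Ethernet II / IPv4 / TCP frame by byte offsets instead of struct casts (which sidesteps alignment concerns). The function name tcp_payload and the FRAME_* constants are hypothetical, and the sketch deliberately ignores VLAN tags, IPv6 and IP fragmentation.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_ETH_HDR_LEN 14      /* untagged Ethernet II header */
#define FRAME_ETHERTYPE_IPV4 0x0800
#define FRAME_IPPROTO_TCP 6

/* Return a pointer to the TCP payload and store its length in *out_len.
 * Returns NULL if the frame is not IPv4/TCP or is too short. */
static const unsigned char *tcp_payload(const unsigned char *frame, size_t len,
                                        size_t *out_len)
{
    if (len < FRAME_ETH_HDR_LEN + 20)
        return NULL;

    /* EtherType is a big-endian 16-bit value at offset 12 */
    unsigned ether_type = ((unsigned)frame[12] << 8) | frame[13];
    if (ether_type != FRAME_ETHERTYPE_IPV4)
        return NULL;

    const unsigned char *ip = frame + FRAME_ETH_HDR_LEN;
    if ((ip[0] >> 4) != 4)
        return NULL;
    size_t ip_hdr_len = (size_t)(ip[0] & 0x0F) * 4;  /* IHL counts 32-bit words */
    if (ip_hdr_len < 20 || ip[9] != FRAME_IPPROTO_TCP)
        return NULL;

    size_t ip_total = ((size_t)ip[2] << 8) | ip[3];  /* IP header + data */
    if (ip_total < ip_hdr_len + 20 || FRAME_ETH_HDR_LEN + ip_total > len)
        return NULL;

    const unsigned char *tcp = ip + ip_hdr_len;
    size_t tcp_hdr_len = (size_t)(tcp[12] >> 4) * 4; /* data offset, in words */
    if (tcp_hdr_len < 20 || ip_hdr_len + tcp_hdr_len > ip_total)
        return NULL;

    *out_len = ip_total - ip_hdr_len - tcp_hdr_len;
    return tcp + tcp_hdr_len;
}
```

Both header lengths are read from the packet itself, which is what "adhering to the format of the lower headers" amounts to in practice.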

Dear L7Sqr,
Here are my function's parameters: void dummyProcesssPacket(const struct pfring_pkthdr *h, const u_char *p, const u_char *user_bytes) {. As you can see, the header already has its own structure. My problem now is dissecting the payload to determine the URL. What would you suggest? Any ideas, please?


I'm not familiar with pfring_pkthdr, so there are two possibilities here:

  • The pfring_pkthdr structure has a field that represents the bytes of the packet, or
  • One of the other two parameters represents that data (what do these parameters represent?)

In either case my example above is relevant. You simply need to determine where the beginning of the frame data is. With the context you've provided so far that is hard for me to determine.

How are you registering to receive packets? Is this a Windows or Linux environment? Are you using something like libpcap? These are all details that would help us help you.

Dear L7Sqr,
The pfring_pkthdr provides most of the header information, such as source and destination MAC, IP, time to live, etc. The full packet is represented by const u_char *p; if you run a loop over it you can get every hexadecimal value. pf_ring is actually quite similar to libpcap; the difference is the efficiency of the packet capture. The whole app runs purely on Linux.


If p represents the data of the packet then you can simply do the casting I described above on that. You can make a change such as

const unsigned char *payload = p;

and the remainder of my example should work. Again, you will have to parse each header to ensure you are dealing with the packets you want.

NOTE: It might be possible to place a filter on the mechanism you are using to retrieve the packets. libpcap provides something like this. If you could use that you would not have to worry too much about verifying packets but you would still need to manage variable length headers.

Dear L7Sqr,
OK, I am getting the concept of the solution you provided earlier. But if I refer to the Ethernet frame structure at https://en.wikipedia.org/wiki/Ethernet_frame, I don't follow (const struct ip*)(eth + 1). When you say next layer, which layer do you mean, and does each + 1 bring me to the next layer? What does the 1 represent; is it 1 byte?


The + 1 is pointer arithmetic; it advances the pointer by the size of the pointed-to type. If your pointer is of type char * then the increment is sizeof (char) bytes. If it is of type struct ether_header * then the increment is sizeof (struct ether_header) bytes. My example could just as easily have been written as

const struct ip *iph = (const struct ip*)(payload + sizeof (struct ether_header));

The problem with that approach is that you constantly have to compute each offset in bytes from a byte-sized pointer yourself, and any mistake in that arithmetic lands you in the wrong place.
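The equivalence of the two forms can be checked with a small sketch. struct eth_hdr below is a hypothetical stand-in for struct ether_header (two 6-byte addresses and a 16-bit type), used only so that the example does not depend on system headers; the buffer passed in is assumed to be suitably aligned for that struct.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct ether_header */
struct eth_hdr {
    uint8_t  dst[6];
    uint8_t  src[6];
    uint16_t type;
};

/* Byte offset reached by "+ 1" on a struct pointer */
static size_t offset_via_struct_ptr(const unsigned char *payload)
{
    const struct eth_hdr *eth = (const struct eth_hdr *)payload;
    const unsigned char *next = (const unsigned char *)(eth + 1);
    return (size_t)(next - payload);
}

/* Byte offset reached by adding sizeof to a byte pointer */
static size_t offset_via_byte_ptr(const unsigned char *payload)
{
    const unsigned char *next = payload + sizeof(struct eth_hdr);
    return (size_t)(next - payload);
}
```

Both functions return sizeof (struct eth_hdr), so the + 1 moves past one whole header rather than one byte.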

Dear L7Sqr,
So, as you suggest, I should keep adding + 1, but how do I find the end? And once I have found it, I should store each character into a string, right? Should I do something like while(*p) { const unsigned char *payload = p++;? Is that the right way to do it?


The 'end' is a relative consideration. If there exists a TCP header the end will be different than it would be with a UDP header. If all you have is an array of bytes then you need to implement a proper parser for your data. I've given you the start of such a parser above. It is not sufficient to just iterate over the bytes as raw data.

An alternative, if you want to shortcut the parsing process, is to scan the byte array for known values ('http' or 'www', for example). This is limited in a number of ways and, if you want the application protocol, is not a valid approach.

What is your goal? What are you looking to do - specifically? Without knowing that it is hard to suggest a proper direction for you to take.

Dear L7Sqr,
OK, now you have seen what we are able to capture. From the header we can get the source and destination IP and MAC addresses. Our challenge now is the payload. Our main targets are to capture the type of protocol, the URL if one exists, and to look for certain signatures in the packet. So all of this is only possible by running through each payload byte, right? And how can that be optimised?


If you want to know each of the protocols that comprise a particular payload then you have no alternative but to implement (or reuse) a parser. There is no string scanning technique that allows you to do what you want. Once you get to a protocol you are interested in you can scan that payload for particular fields, but not until you've determined what you are looking at.

The optimization is in the mapping of the headers to the binary stream (I've explained this above).

Dear L7Sqr,
Is there an available parser you can recommend for me to start with? I don't follow what you mean when you say that only once I have identified the protocol can I look for a particular field such as the URL. And is your optimisation mainly about the Ethernet header, i.e. the IP and MAC address conversion?


Both tcpdump and Wireshark are open source and available for download. They are the canonical examples of full-featured parsers.

What I mean is that an Ethernet frame in and of itself is not sufficient to tell you that there is a URL encoded in the payload. What if it is a TCP ack? What if it is UDP? What if it is an ICMP message? There are still other protocols that have nothing to do with URLs. My point is that, in order to collect information based on the protocols and contents of particular application streams, you need to be able to identify those protocols first. You cannot do that by simply iterating through the bytes; you need context.
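A minimal sketch of that identification step, assuming an untagged Ethernet II frame: it reads the EtherType and then the IPv4 protocol field (1 = ICMP, 6 = TCP, 17 = UDP). The enum and function names are hypothetical.

```c
#include <stddef.h>

enum l4_kind { L4_NOT_IPV4, L4_ICMP, L4_TCP, L4_UDP, L4_OTHER };

/* Classify a frame by EtherType (offset 12) and IPv4 protocol (IP offset 9). */
static enum l4_kind classify_frame(const unsigned char *frame, size_t len)
{
    if (len < 14 + 20)            /* Ethernet header + minimal IPv4 header */
        return L4_NOT_IPV4;

    unsigned ether_type = ((unsigned)frame[12] << 8) | frame[13];
    if (ether_type != 0x0800)     /* e.g. ARP (0x0806) or IPv6 (0x86DD) */
        return L4_NOT_IPV4;

    switch (frame[14 + 9]) {      /* IPv4 protocol field */
    case 1:  return L4_ICMP;
    case 6:  return L4_TCP;
    case 17: return L4_UDP;
    default: return L4_OTHER;
    }
}
```

Only a frame classified as L4_TCP is even a candidate for carrying an HTTP request, and a bare TCP ack will still have an empty payload.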

I have downloaded tcpdump and am looking through its source code. So to decide what the content is, I first have to determine the higher-level protocol, is that it?

Yes. The basic structure of network traffic is the following:

+-----------------------+
|   Outer Layer n       | <- You have a pointer to this
|      : Field 1        |
+-----------------------+
|   Outer Layer n-1     |
|      : Field 1        |
|      : Field 2        |
+-----------------------+
|         ...           |
+-----------------------+
|   Outer Layer 1       |
|      : Field 1        |
+-----------------------+
|      Target Layer     | <- This is what you want to look at
|      : URL            |
|      : Other field    |
+-----------------------+
|   Inner Layer 1       |
|      : Field 1        |
+-----------------------+
|         ...           |
+-----------------------+

Where that structure is unique for each different Target Layer. So, basically what you need to do to verify that you are dealing with a Target Layer is to examine Outer Layer n then Outer Layer n-1 and so on until you get to Target Layer.

Complicating matters is that, by design, knowing something about Layer n gives you no information regarding Layer n-2. You might get limited information about Layer n-1 but, again, this is particular to the stack you are dealing with.

This is why I claim that a parser is needed if you want to properly look for a URL field in a particular Target Layer.
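Once the outer layers have been peeled and the Target Layer is known to be a plain-text HTTP request, pulling out the URL could look like the sketch below. It assumes the whole request head sits in one TCP segment, matches only GET and POST, and matches the Host header case-sensitively (real HTTP header names are case-insensitive); http_request_url and find_in are hypothetical names.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Length-bounded search for a string inside raw bytes. */
static const char *find_in(const char *buf, size_t len, const char *needle)
{
    size_t n = strlen(needle);
    for (size_t i = 0; n > 0 && i + n <= len; i++)
        if (memcmp(buf + i, needle, n) == 0)
            return buf + i;
    return NULL;
}

/* Build "http://<host><request-target>" in out. Returns 1 on success, else 0. */
static int http_request_url(const unsigned char *payload, size_t len,
                            char *out, size_t out_size)
{
    const char *p = (const char *)payload;
    const char *end = p + len;

    /* Request line: METHOD SP request-target SP HTTP-version CRLF */
    if (len < 4 || (memcmp(p, "GET ", 4) != 0 &&
                    (len < 5 || memcmp(p, "POST ", 5) != 0)))
        return 0;

    const char *target = (const char *)memchr(p, ' ', len) + 1;
    const char *target_end = memchr(target, ' ', (size_t)(end - target));
    if (target_end == NULL)
        return 0;

    const char *host = find_in(p, len, "\r\nHost: ");
    if (host == NULL)
        return 0;
    host += 8;                                   /* skip "\r\nHost: " */
    const char *host_end = find_in(host, (size_t)(end - host), "\r\n");
    if (host_end == NULL)
        return 0;

    int n = snprintf(out, out_size, "http://%.*s%.*s",
                     (int)(host_end - host), host,
                     (int)(target_end - target), target);
    return n > 0 && (size_t)n < out_size;
}
```

This only works on unencrypted HTTP; for HTTPS the request line and headers are not visible in the payload.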

Dear L7Sqr,

I appreciate your diagram and in-depth help. Just to add, I have been going through the tcpdump source code and I find a number of files beginning with print-**; I guess these each print the content of a different type of packet? Should I be looking into print-tcp?

You should look at how print-ether.c does its work. First, it takes the raw input and casts it to a const struct ether_header *. From there it examines the appropriate fields and, depending on the type, you end up in ethertype_print(), which may lead you to ip_print() (from within print-ip.c). In that function you can find the path to tcp_print() if the type is IPPROTO_TCP. There are further paths from that point which deal with telnet and other specialized data. If you felt comfortable extending that code, the if/else tree at the end of that function would be a perfect place to put a print_http() or equivalent function depending on the packet information.

Something to keep in mind is that the tcpdump code is comprehensive. It covers most packet types you would ever expect to see. If you roll your own parser instead of extending tcpdump you probably won't need all of that. It gives you the resources, however, to get what you do need and get started.

Dear L7Sqr,
Thank you for the guidance; it is only fair that I now dig in and try to understand the tcpdump code. If I get stuck I will get back to you. In my case, is it advisable to merge all that code into our function?

If you are thinking of using tcpdump as a starting position I would simply extend what is there. Trying to rip out the individual pieces will likely be more trouble (and time) than it is worth.

Dear L7Sqr,
The problem is that we are already capturing packets with pf_ring, and it has proved to be superior in terms of packet capture capability. So we just want to use tcpdump's parsing capability. What is your advice, then?

Well, look at the callback provided to the pcap looping function (the name is escaping me currently). You may be able to call that function directly by stubbing out the parameters that are necessary and using the data you get from pf_ring.

If that proves to be too heavyweight for your process (i.e. it can not keep up with pf_ring packet processing) then - and only then - start to look at trimming the code to paths relevant to your needs.


Dear L7Sqr,
Yes, I am looking into its callback function. I guess I will have to run a few tests to know whether it has an impact, right? Below is a snippet of the code. The function that currently dissects the packet header is dummyProcesssPacket, so I am going to expand it to do the detailed dissection and see how it affects the overall capture.


    if (num_threads <= 1) {
        if(bind_core >= 0)
          bind2core(bind_core);

        pfring_loop(pd, dummyProcesssPacket, (u_char*)NULL, wait_for_packet);
        //packet_consumer_thread(0);
      } else {
        pthread_t my_thread;
        long i;

        for(i=0; i<num_threads; i++)
          pthread_create(&my_thread, NULL, packet_consumer_thread, (void*)i);

        for(i=0; i<num_threads; i++)
          pthread_join(my_thread, NULL);
      } 

Dear L7Sqr, I have been reading through the code path as you suggested, and it looks quite complicated. I have started from print-ether.c. Where I am stuck is how they determine the ether_type that they pass to the function ethertype_print. I notice they use the function tok2str, but what input do I need to pass in?

I don't have access to the source code at the moment (won't for a few days, either). I will respond when I do.

Also, I hadn't noticed, but your use of threads in your above example is probably incorrect. I'm pretty sure that you need a single pthread_t variable for each thread you create. I'm not sure calling pthread_create with the same variable multiple times is safe practice.

Dear L7Sqr, the above code is actually from pf_ring itself, not from me. Back to my problem with ether_type: I have solved it by looking up the values at http://en.wikipedia.org/wiki/EtherType, so my first hurdle is cleared. Say I now have TCP, so I should go into print-ip.c. I am looking at that code now; I hope you can give me some hints when you have access, and meanwhile I will do my own study.

"The above code is actually from pf_ring itself, not from me"

That does not make it correct.
If you want to join on a particular thread (as you are in your code) you need to provide an identifier (the pthread_t variable). Each time you overwrite that identifier (as happens in pthread_create) you lose the opportunity to call pthread_join. Checking the return value of pthread_join would show you this. As you have it, you only join on the last thread created which is not guaranteed to be the last thread to finish executing. Perhaps this is not an issue for your case but you should at least be aware of the situation.
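One way to keep an identifier per thread is an array of pthread_t, sketched below. run_workers, mark_worker, hits and MAX_WORKERS are hypothetical names for illustration only and are not part of pf_ring.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 16

static int hits[MAX_WORKERS];

/* Example worker: records that the thread with this index ran. */
static void *mark_worker(void *arg)
{
    hits[(long)arg] = 1;
    return NULL;
}

/* Start up to n threads running fn, each with its own pthread_t slot,
 * then join every thread that was started. Returns the number joined. */
static int run_workers(int n, void *(*fn)(void *))
{
    pthread_t tids[MAX_WORKERS];
    int started = 0, joined = 0;

    if (n > MAX_WORKERS)
        n = MAX_WORKERS;

    for (long i = 0; i < n; i++) {
        if (pthread_create(&tids[started], NULL, fn, (void *)i) != 0)
            break;                 /* stop on the first failure */
        started++;
    }

    for (int i = 0; i < started; i++) {
        if (pthread_join(tids[i], NULL) == 0)
            joined++;
    }
    return joined;
}
```

Because each thread has its own slot in tids, every pthread_join call refers to a distinct thread, and its return value can be checked individually.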

As for your question about printing, where are you in your search? Is there a particular call or macro you do not understand? It sounds from your response that you figured out the issue with tok2str; is that the case?
